Decona_plus

Decona_plus contains 3 modified pipelines based on Decona (Oosterbroek et al., preprint; https://github.com/Saskia-Oosterbroek/decona) and 2 standalone executable files for classifying ONT amplicon sequence data. Which pipeline or executable to use depends on the research question you are investigating.

At a glance

Overview of the Decona pipeline. An in-depth description can be found here.

How it works

eDNA classification can be summarized in five major steps:

1) Quality control and filtering: when we identify organisms using eDNA sequences, we want to make sure the DNA sequences we use are of high quality. Quality control and filtering ensure that the DNA we will classify is not damaged or misread. We also remove any sequences that are too long or too short relative to the expected length of the piece of DNA we amplified.

2) Clustering the DNA: once we have removed the low-quality and wrongly sized sequences, we can cluster the remaining sequences based on percent similarity. This means an algorithm searches through all of the DNA sequences and groups together the sequences that are, for example, >95% similar to each other.

3) Generating a consensus sequence: after we have clustered the similar DNA sequences together, we can summarize each cluster into one representative sequence called a consensus sequence. Because random sequencing errors rarely occur at the same position in most reads, this process removes most of the lingering errors and yields a higher-quality DNA sequence than any single read.

4) Sequence identification: we then take each consensus sequence and compare it to a database of known sequences using the Basic Local Alignment Search Tool (BLAST) algorithm to find which database entry is the closest match to our DNA sequence. Think of the song identification tool “SHAZAM”. By playing some of the song, SHAZAM will tell you the song name and artist. However, instead of playing a song, we are using a DNA sequence to find the organism it came from.

5) Calculating relative abundance: since each consensus sequence was generated from a cluster of DNA sequences, once we have an organism identification for the consensus sequence, we can assume that all of the DNA in the cluster also belongs to that organism. We can then back-calculate how many DNA sequences belong to each organism in the sequencing run. This gives us an idea of which genera are present in the samples we sequenced and how much of the DNA in the sequencing run each one accounts for.
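
To make the steps above concrete, here is a minimal Python sketch of steps 1, 2, 3 and 5 on toy data. It is an illustration only and not how Decona_plus is implemented: every function name is hypothetical, reads are assumed to be equal in length and already aligned, the quality score is a plain arithmetic mean of Phred+33 values, and the clustering is a simple greedy scheme. Real pipelines rely on dedicated tools for filtering, clustering, and polishing, and step 4 (BLAST) is left out because it requires an external program and a reference database.

```python
from collections import Counter


def mean_phred(quality: str) -> float:
    """Arithmetic mean Phred score of a FASTQ quality string (Phred+33).

    Simplification: some tools average error probabilities instead.
    """
    return sum(ord(ch) - 33 for ch in quality) / len(quality)


def passes_filter(seq: str, quality: str, min_len: int, max_len: int,
                  min_q: float) -> bool:
    """Step 1: keep reads of the expected length and sufficient quality."""
    return min_len <= len(seq) <= max_len and mean_phred(quality) >= min_q


def identity(a: str, b: str) -> float:
    """Fraction of matching positions (assumes pre-aligned reads)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(seqs: list[str], threshold: float) -> list[list[str]]:
    """Step 2: each read joins the first cluster whose first member
    (the centroid) it matches at or above the threshold; otherwise it
    starts a new cluster."""
    clusters: list[list[str]] = []
    for seq in seqs:
        for cluster in clusters:
            if identity(seq, cluster[0]) >= threshold:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters


def consensus(cluster: list[str]) -> str:
    """Step 3: majority vote at each position (assumes equal-length reads)."""
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*cluster))


def relative_abundance(reads_per_taxon: dict[str, int]) -> dict[str, float]:
    """Step 5: share of reads assigned to each taxon."""
    total = sum(reads_per_taxon.values())
    return {taxon: count / total for taxon, count in reads_per_taxon.items()}
```

In use, the reads that pass `passes_filter` would be handed to `greedy_cluster`, each cluster reduced with `consensus`, the consensus sequences identified with BLAST, and the cluster sizes summed per identified genus before calling `relative_abundance`. For example, a hypothetical run with 3 reads assigned to one genus and 1 read to another would give shares of 0.75 and 0.25.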