Now that we have thoroughly cleaned our data, we are finally ready to assign a taxonomy to our sequences.
We will do this using a Bayesian classifier (via the Classify.seqs tool) and a mothur-formatted training set provided by the Schloss lab based on the RDP (Ribosomal Database Project, Cole et al. 2013) reference taxonomy.
The most commonly used reference taxonomies are:
The choice of taxonomic classifier and reference taxonomy can impact downstream results. The figure from Liu et al. 2008 given below shows the taxonomic composition determined when using different classifiers and reference taxonomies, for different primer sets (16S regions).
Figure: Compositions at the phylum level for each of the three datasets: (a) Guerrero Negro mat, (b) Human gut and (c) Mouse gut, using a range of different methods (separate subpanels within each group). The x-axis of each graph shows region sequenced. The y-axis shows abundance as a fraction of the total number of sequences in the community. The legend shows colors for phyla (consistent across graphs).
Which reference taxonomy is best for your experiments depends on a number of factors such as the type of sample and variable region sequenced.
Another discussion about how these different databases compare was described by Balvočiūtė and Huson 2017.
Despite all we have done to improve data quality, there may still be more to do: there may be 18S rRNA gene fragments or 16S rRNA from Archaea, chloroplasts, and mitochondria that have survived all the cleaning steps up to this point. We are generally not interested in these sequences and want to remove them from our dataset.
Step 1: Taxonomic Classification and Removal of undesired sequences
This step involves using Classify.seqs with the fasta file and count table generated by Remove.seqs and the reference and taxonomy data that we loaded into this project during the data preparation step. We will then use Remove.lineage we will remove any non-bacterial sequences.
Chloroplast-Mitochondria-unknown-Archaea-Eukaryota
Question:
How many unique (representative) sequences were removed in this step?
How many sequences in total?
20 representative sequences were removed. The fasta file output from Remove.seqs had 2246 sequences while the fasta output from Remove.lineages contained 2226 sequences.
162 total sequences were removed. If you run summary.seqs with the count table, you will see that we now have 2261 unique sequences representing a total of 113,883 total sequences (down from 114,045 before). This means 162 of our sequences were in represented by these 20 representative sequences.