We are now ready to align our sequences to the reference. We will use SILVA, the high-quality ribosomal RNA databases, as our reference. This is an important step to improve the clustering of your operational taxonomic unit (OTUs). An OTU is an operational definition used to classify groups of closely related individuals. Here, we need to align our sequences to a reference alignment using align.seqs. Our sequences aren’t just random strings of nucleotides; they are part of a known gene (such as 16S). After alignment is complete, we will get the alignment statistics by running Summary.seqs, and perform additional data cleaning using Filter.seqs.
Have a look at the alignment output:
After the alignment is done, have a look at the summary output (log file):
The Start and End columns tell us that the majority of reads aligned between positions 1968 and 11550, which is what we expect to find given the reference file we used. However, some reads align to very different positions, which could indicate insertions or deletions at the terminal ends of the alignments or other complicating factors.
Also notice the Polymer column in the output table. This indicates the average homopolymer length. Since we know that our reference database does not contain any homopolymer stretches longer than 8 bases, any reads containing such long stretches are likely the result of PCR errors and we would be wise to remove them.
Next we will clean our data further by removing poorly aligned sequences and any sequences with long homopolymer stretches.
Question: How many sequences were removed in this step?
128 sequences were removed. This is the number of lines in the bad.accnos output generated from Screen.seqs.
Click the View data (eye icon) next to the Filter.seqs ... : filtered fasta result in the History panel. Your resulting alignment (filtered fasta output) should look something like this (don't click Show all or Save):
These are all our representative reads again, now with additional alignment information.
Click on the View data icon next to Filter.seqs ... : logfile, and when the results are loaded in the center panel, scroll to the very bottom of the page. You should see the following additional information:
From this log file we see that while our initial alignment was 13425 positions wide, after filtering the overhangs (trump parameter) and removing positions that had a gap in every aligned read (vertical parameter), we have trimmed our alignment down to a length of 362.
Because any filtering step we perform might lead to sequences no longer being unique, we deduplicate our data by re-running the Unique.seqs tool:
Question: How many duplicate sequences did our filter step produce?
The number of unique sequences was reduced from 15,799 to 15,796. You can see the number of sequences from the fasta files generated by Filter.seqs and Unique.seqs
The next step in cleaning our data, is to merge near-identical sequences together. Sequences that only differ by around 1 in every 100 bases are likely to represent sequencing errors, not true biological variation. Because our contigs are ~250 bp long, we will set the threshold to 2 mismatches.
Question: How many unique sequences are we left with after this clustering of highly similar sequences?
4,942: This is the number of lines in the fasta output
We have now thoroughly cleaned our data and removed as much sequencing error as we can. Next, we will look at a class of sequencing artefacts known as chimeras.
Chimeras occur when the polymerase falls off one sequence and then reattaches to another during a single round of PCR. This results in a sequence that looks unique but is, in fact, just half of one and half of another e.g. not real. Needless to say, we do not want such sequencing artefacts confounding our results. We’ll do this chimera removal using the VSEARCH algorithm Rognes et al. 2016 that is called within mothur, using the Chimera.vsearch tool.
(slide credit: http://slideplayer.com/slide/4559004/ )
Chimera.vsearch will split the data by sample and check for chimeras. The recommended way of doing this is to use the abundant sequences as our reference.
Question: How many sequences were flagged as chimeric? what is the percentage? (Hint: summary.seqs)
Looking at the chimera.vsearch accnos output, we see that 3,377 representative sequences were flagged as chimeric. If we run summary.seqs on the resulting fasta file and count table, we see that we went from 124,531 sequences down to 114,045 total sequences in this step, for a reduction of 10,486 total sequences, or 8.4%. This is a reasonable number of sequences to be flagged as chimeric.