For more information on the topic of quality control, please see our training materials here.
Before starting any analysis, it is always a good idea to assess the quality of your input data and improve it where possible by trimming and filtering reads. The mothur tool suite contains several tools to assist with this task. We will begin by 1) merging our reads into contigs, followed by 2) filtering and trimming reads based on quality score and several other metrics, and 3) optimizing the files for computation.
In this experiment, paired-end sequencing of the ~253 bp V4 region of the 16S rRNA gene was performed. The sequencing was done from either end of each fragment. Because the reads are about 250 bp in length, this results in a significant overlap between the forward and reverse reads in each pair. We will combine these pairs of reads into contigs.
The Make.contigs tool from the mothur tool suite creates the contigs, using the paired collection as input. Make.contigs will look at each pair, take the reverse complement of the reverse read, and then determine the overlap between the two sequences. Where an overlapping base call differs between the two reads, the quality scores are used to determine the consensus base call. A new quality score is derived by combining the two original quality scores of the reads at each overlapping position.
Every time you see a tool symbol in the instructions, like the one below, you will need to locate that tool in Galaxy. To search for a tool, type the tool name in the search tools text box under the Tools menu on the left. If the tool exists in Galaxy, you will see it listed under the Show Sections button. Click on the tool name and the tool configuration page will open in the center panel of Galaxy.
In this step, we will use Make.contigs to merge the forward and reverse reads into contigs.
o Click the Execute button to run the tool after choosing all the required options.
o You will see that the jobs are either waiting (gray with an hourglass), running (orange), or done (green) in the History panel.
Once the jobs are done, click on the Make.contigs on .... others: group file output in the History panel to preview the data in that file.
Make.contigs combined the forward and reverse reads for each sample, and also combined the resulting contigs from all samples into a single file. So we have gone from a paired collection of 19x2 FASTQ files to a single FASTA file. In order to retain the information about which reads originated from which samples, the tool also output a group file. View that file now; the preview should look something like this:
M00967_43_000000000-A3JHG_1_1101_10011_3881 F3D0
M00967_43_000000000-A3JHG_1_1101_10050_15564 F3D0
M00967_43_000000000-A3JHG_1_1101_10051_26098 F3D0
[..]
Here the first column contains the read name, and the second column contains the sample name.
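The group file is simple enough to parse yourself. The following short Python sketch (illustrative only; the helper name read_groups is ours, not mothur's) reads such lines into a read-to-sample mapping and tallies reads per sample:

```python
# Sketch of parsing a mothur group file: each line holds a read name
# and the sample it came from, separated by whitespace.
from collections import Counter

def read_groups(lines):
    groups = {}
    for line in lines:
        read_name, sample = line.split()
        groups[read_name] = sample
    return groups

lines = [
    "M00967_43_000000000-A3JHG_1_1101_10011_3881 F3D0",
    "M00967_43_000000000-A3JHG_1_1101_10050_15564 F3D0",
]
groups = read_groups(lines)
print(Counter(groups.values()))  # number of reads observed per sample
```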
In this step, we want to improve the quality of our data. But first, let's get a feel for our dataset by computing summary statistics for the trim.contigs.fasta file generated by Make.contigs. Then we will run Screen.seqs to filter reads based on quality and length.
Click on the Visualize this data icon after clicking on the Summary.seq....:logfile output; you can then select an Editor to view this text file. Once the log file is open, scroll to the bottom of the page to see the summary statistics.
In this dataset:
Almost all of the reads are between 248 and 253 bases long.
2.5% or more of our reads had ambiguous base calls (Ambigs column).
The longest read in the dataset is 502 bases.
There are 147,581 sequences.
Our region of interest, the V4 region of the 16S gene, is only around 250 bases long. Any reads significantly longer than this expected value likely did not assemble well in the Make.contigs step. Furthermore, we see that 2.5% of our reads had between 6 and 248 ambiguous base calls (Ambigs column). In the next steps we will clean up our data by removing these problematic reads.
We do this data cleaning using the Screen.seqs tool, which removes
sequences with ambiguous bases (maxambig) and
contigs longer than a given threshold (maxlength).
Leave the rest of the fields at their default values and click Execute.
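The filtering that Screen.seqs performs can be sketched in a few lines of Python. This is a simplified illustration, not mothur's implementation; the maxlength value of 275 is just an example threshold for the ~253 bp V4 region, and the function name screen_seqs is ours:

```python
# Illustrative sketch of Screen.seqs-style filtering: drop sequences
# with more ambiguous bases than maxambig, or longer than maxlength.

def screen_seqs(seqs, maxambig=0, maxlength=275):
    good, bad = {}, []
    for name, seq in seqs.items():
        if seq.count("N") > maxambig or len(seq) > maxlength:
            bad.append(name)     # these names would go to bad.accnos
        else:
            good[name] = seq     # these sequences would go to good.fasta
    return good, bad

seqs = {
    "read1": "ACGT" * 60,        # 240 bp, no ambiguous bases -> kept
    "read2": "ACGTN" * 50,       # contains 'N' calls -> removed
    "read3": "ACGT" * 120,       # 480 bp, far too long -> removed
}
good, bad = screen_seqs(seqs)
print(sorted(good), sorted(bad))
```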
Answer: Look at the log file generated by Summary.seqs for good.fasta; you should see that good.fasta has 124,739 sequences. Before data cleaning, trim.contigs.fasta had 147,581 sequences. That means 147,581 - 124,739 = 22,842 sequences were removed in the Screen.seqs step. You can also find this number by looking at the number of lines in the bad.accnos file generated by Screen.seqs.
Microbiome samples typically contain large numbers of the same organism, and therefore we expect to find many identical sequences in our data. In order to speed up computation, we first determine the unique reads, and then record how many times each of these distinct reads was observed in the original dataset. We do this using the Unique.seqs tool.
Remove duplicate sequences
Answer: 15,920 unique sequences. The number of duplicate sequences can be determined by comparing the number of sequences in the fasta (or names) output to the number of sequences in the fasta file before this step. good.fasta has 124,739 sequences; therefore, 124,739 - 15,920 = 108,819 duplicate sequences were collapsed.
Here we see that this step has greatly reduced the size of our sequence file; not only will this speed up further computational steps, it will also greatly reduce the amount of disk space (and your Galaxy quota) needed to store all the intermediate files generated during this analysis. The Unique.seqs tool created two files: one is a FASTA file containing only the unique sequences, and the second is a so-called names file. This names file consists of two columns: the first contains the name of each unique (representative) sequence, and the second contains a comma-separated list of all sequence names identical to the representative, including the representative itself. Here is a toy example of a names file:

read1	read1,read5,read11
read2	read2
read3	read3,read8
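The deduplication itself is conceptually simple. The sketch below (our own toy code, not mothur's) collapses identical sequences into a representative FASTA plus the two-column structure of a names file:

```python
# Toy sketch of what Unique.seqs does: keep one representative per
# distinct sequence, and record which read names share that sequence.

def unique_seqs(seqs):
    by_sequence = {}                 # sequence -> list of read names
    for name, seq in seqs.items():
        by_sequence.setdefault(seq, []).append(name)
    # First-seen read name becomes the representative for its sequence.
    unique_fasta = {names[0]: seq for seq, names in by_sequence.items()}
    # Names-file structure: representative -> all reads with that sequence.
    names_file = {names[0]: names for names in by_sequence.values()}
    return unique_fasta, names_file

seqs = {"seq1": "ACGT", "seq2": "ACGT", "seq3": "GGCC"}
fasta, names = unique_seqs(seqs)
print(fasta)   # -> {'seq1': 'ACGT', 'seq3': 'GGCC'}
print(names)   # -> {'seq1': ['seq1', 'seq2'], 'seq3': ['seq3']}
```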
To recap, we now have the following files:
a FASTA file containing every distinct sequence in our dataset (the representative sequences)
a names file containing the list of duplicate sequences
a group file containing information about the samples each read originated from
To further reduce file sizes and streamline analysis, we can use the Count.seqs tool to combine the group file and the names file into a single count table.
Generate count table - to combine the group file and the names file into a single count table
Have a look at the count_table output from the Count.seqs tool in the preview window in the History panel; it summarizes the number of times each unique sequence was observed across each of the samples.
The first column contains the read names of the representative sequences, and the subsequent columns contain the number of duplicates of this sequence observed in each sample.
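The combination that Count.seqs performs can be sketched as follows. This is an illustrative Python toy, not mothur's code: for each representative in the names file, it looks up the sample of every duplicate read in the group file and tallies per-sample counts:

```python
# Sketch of building a count table from a names file and a group file:
# for each representative sequence, count how many of its duplicate
# reads originated from each sample.
from collections import Counter

def count_table(names, groups):
    table = {}
    for representative, duplicates in names.items():
        per_sample = Counter(groups[name] for name in duplicates)
        table[representative] = dict(per_sample)
    return table

# Toy inputs mirroring the names-file and group-file structures above.
names = {"seq1": ["seq1", "seq2", "seq3"], "seq4": ["seq4"]}
groups = {"seq1": "F3D0", "seq2": "F3D0", "seq3": "F3D1", "seq4": "F3D1"}
print(count_table(names, groups))
# -> {'seq1': {'F3D0': 2, 'F3D1': 1}, 'seq4': {'F3D1': 1}}
```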