OTU Clustering

In this tutorial we will continue with an OTU-based approach. For the phylotype and phylogenetic approaches, please refer to the mothur wiki page.

In 16S metagenomics approaches, OTUs are clusters of similar sequence variants of the 16S rDNA marker gene. Each cluster is intended to represent a taxonomic unit, such as a bacterial species or genus, depending on the sequence similarity threshold. Typically, OTU clusters are defined by a 97% identity threshold of the 16S gene sequence variants at the species level. An identity of 98% or 99% is suggested for strain separation.
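To make the threshold concrete, here is a minimal Python sketch of how a 97% identity threshold corresponds to a 0.03 distance cutoff between two aligned sequences. The function names `distance` and `within_cutoff` are hypothetical, the gap handling is a simplifying assumption, and this is not how mothur itself computes distances or builds OTUs. It only illustrates the arithmetic.

```python
def distance(seq_a: str, seq_b: str) -> float:
    """Fraction of mismatched positions between two aligned sequences.

    Assumption: columns where both sequences have a gap ('-') are ignored.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if not (a == "-" and b == "-")]
    if not pairs:
        raise ValueError("no comparable positions")
    mismatches = sum(1 for a, b in pairs if a != b)
    return mismatches / len(pairs)


def within_cutoff(seq_a: str, seq_b: str, cutoff: float = 0.03) -> bool:
    """True if the two sequences are at least (1 - cutoff) identical."""
    return distance(seq_a, seq_b) <= cutoff


# Toy 100-base sequences (made up for illustration)
reference = "ACGT" * 25
close_variant = "TTT" + reference[3:]      # 3 mismatches out of 100
distant_variant = "TTTTT" + reference[5:]  # 4 mismatches out of 100

print(within_cutoff(reference, close_variant))
print(within_cutoff(reference, distant_variant))
```

With the default cutoff of 0.03, the first pair would fall inside the same 97% threshold and the second would not.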

(Image credit: Danzeisen et al. 2013, 10.7717/peerj.237)

Cluster sequences into OTUs

We will use the Cluster.split tool to cluster the sequences into OTUs. With this approach, the sequences are first split into bins and then clustered within each bin. Taxonomic information is used to guide this process. The Schloss lab has published results showing that if you split at the level of Order or Family and cluster to a 0.03 cutoff, you get clustering that is just as good as with the “traditional” approach. In addition, this approach is less computationally expensive and can be parallelized, which is especially advantageous when you have large datasets.

We’ll now use the Cluster.split tool, with taxlevel set to 4, requesting that the sequences be split into bins at the Order level before clustering.
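The split-then-cluster idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not mothur's implementation: the taxonomy strings are assumed to be semicolon-separated lineages, the function names are hypothetical, and a simple greedy seed-based clustering stands in for mothur's actual clustering algorithm. All sequences and lineages below are toy data.

```python
from collections import defaultdict


def distance(seq_a, seq_b):
    """Fraction of mismatched positions between two equal-length aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)


def split_by_taxon(taxonomy, taxlevel):
    """Bin sequence names by their lineage truncated to the first `taxlevel` ranks."""
    bins = defaultdict(list)
    for name, lineage in taxonomy.items():
        key = ";".join(lineage.split(";")[:taxlevel])
        bins[key].append(name)
    return bins


def cluster_bin(names, seqs, cutoff):
    """Greedy stand-in: join the first OTU whose seed sequence is within the cutoff."""
    otus = []
    for name in names:
        for otu in otus:
            if distance(seqs[name], seqs[otu[0]]) <= cutoff:
                otu.append(name)
                break
        else:
            otus.append([name])  # no OTU was close enough, so start a new one
    return otus


def cluster_split(taxonomy, seqs, taxlevel=4, cutoff=0.03):
    """Cluster each taxonomic bin independently and concatenate the OTUs."""
    otus = []
    for names in split_by_taxon(taxonomy, taxlevel).values():
        otus.extend(cluster_bin(names, seqs, cutoff))
    return otus


base = "ACGT" * 25
seqs = {
    "s1": base,
    "s2": "TT" + base[2:],        # 2 mismatches vs s1
    "s3": "T" * 10 + base[10:],   # 8 mismatches vs s1
    "s4": "TGCA" * 25,
}
taxonomy = {
    "s1": "Bacteria;Firmicutes;Clostridia;Clostridiales;Lachnospiraceae;",
    "s2": "Bacteria;Firmicutes;Clostridia;Clostridiales;Lachnospiraceae;",
    "s3": "Bacteria;Firmicutes;Clostridia;Clostridiales;Ruminococcaceae;",
    "s4": "Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Porphyromonadaceae;",
}

otus = cluster_split(taxonomy, seqs, taxlevel=4, cutoff=0.03)
print(otus)
```

Because each bin is clustered on its own, the bins could be handed to separate processes, which is where the parallelization benefit comes from. Distances are only ever computed within a bin, never across the whole dataset.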

Question: Which samples contained sequences belonging to an OTU classified as Staphylococcus? (Hint: look at the tax.summary file output by Classify.otu)

Samples F3D141, F3D142, F3D144, F3D145, and F3D2. This answer can be found by examining the tax.summary output and finding the columns with nonzero values in the Staphylococcus row.
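If you prefer to check this programmatically rather than by eye, the lookup can be sketched in Python. The sketch assumes the tax.summary file is tab-delimited with the leading columns taxlevel, rankID, taxon, daughterlevels, and total, followed by one column per sample. The table contents, sample names, and the function name `samples_with_taxon` are made up for illustration and are not the tutorial's data.

```python
import csv
import io

# Toy table with invented counts; a real tax.summary would be read from disk
TOY_SUMMARY = (
    "taxlevel\trankID\ttaxon\tdaughterlevels\ttotal\tsampleA\tsampleB\tsampleC\n"
    "6\t0.1.1\tStaphylococcus\t0\t5\t0\t3\t2\n"
    "6\t0.1.2\tLactobacillus\t0\t9\t4\t5\t0\n"
)

# Columns that describe the taxon rather than a sample (assumed layout)
META_COLUMNS = {"taxlevel", "rankID", "taxon", "daughterlevels", "total"}


def samples_with_taxon(handle, taxon):
    """Return the sorted sample columns with a nonzero count for `taxon`."""
    hits = set()
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["taxon"] != taxon:
            continue
        for column, value in row.items():
            if column not in META_COLUMNS and int(value) > 0:
                hits.add(column)
    return sorted(hits)


print(samples_with_taxon(io.StringIO(TOY_SUMMARY), "Staphylococcus"))
```

To run it on a downloaded file, you would pass an open file handle in place of the `io.StringIO` wrapper.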

Complete Assignment 1

Once Classify.otu finishes, follow the guidelines in Canvas and complete Assignment 1 for the Galaxy Project. Assignment 1 includes the Pivot Table and Stacked Bar Graph.

Before we continue, let’s remind ourselves what we set out to do. Our original question was about the stability of the microbiome and whether we could observe any change in community structure between the early and late samples.

Because some of our samples may contain more sequences than others, it is generally a good idea to normalize the dataset by subsampling.

Subsampling

Question: How many sequences did the smallest sample consist of?

From the group.count file generated by Count.groups, the smallest sample is F3D143, which consists of 2389 sequences. This is a reasonable number, so we will now subsample all the other samples down to this level.
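Picking the subsampling depth amounts to taking the minimum over the per-sample counts. A minimal Python sketch follows, assuming the count file holds one tab-separated "sample name, count" pair per line. The function name `smallest_group` and the sample names and counts in the example are hypothetical.

```python
def smallest_group(lines):
    """Return (name, count) of the sample with the fewest sequences."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, count = line.split("\t")
        counts[name] = int(count)
    if not counts:
        raise ValueError("no samples found")
    name = min(counts, key=counts.get)
    return name, counts[name]


# Invented example input
toy_lines = ["sampleA\t5000", "sampleB\t2389", "sampleC\t7100"]
name, depth = smallest_group(toy_lines)
print(name, depth)
```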

Question: What would you expect the result of count.groups on this new shared output collection to be?

All groups (samples) should now have 2389 sequences. Run count.groups again on the shared output collection produced by the sub.sample tool to confirm that this is indeed what happened.

Note: since subsampling is a stochastic process, your results from any tools using this subsampled data will deviate from the ones presented here.
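The stochastic nature of subsampling can be seen in a small Python sketch that draws a fixed number of sequences without replacement from one sample's OTU counts. This is an illustration only: the function name `subsample_counts`, the `seed` parameter, and the toy counts are assumptions, and mothur's sub.sample may differ in its details.

```python
import random
from collections import Counter


def subsample_counts(otu_counts, depth, seed=None):
    """Draw `depth` sequences without replacement and return the new per-OTU counts."""
    total = sum(otu_counts.values())
    if total < depth:
        raise ValueError("sample has fewer sequences than the requested depth")
    rng = random.Random(seed)
    # Expand the counts into one entry per sequence, labelled by its OTU
    pool = [otu for otu, n in otu_counts.items() for _ in range(n)]
    return Counter(rng.sample(pool, depth))


# Invented counts for a single sample
toy_sample = {"Otu001": 3000, "Otu002": 1500, "Otu003": 500}

first = subsample_counts(toy_sample, depth=2389)
second = subsample_counts(toy_sample, depth=2389)
print(sum(first.values()), sum(second.values()))
print(first == second)  # usually False, because each run draws different sequences
```

Every run returns exactly the requested depth, but the per-OTU counts typically vary from run to run. Passing a fixed `seed` in this sketch makes a run repeatable.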

