Teresita M. Porter - OTUs versus ESVs

14 - OTUs versus ESVs

So we've all been using operational taxonomic units (OTUs) since the 2000s, but now everyone is talking about exact sequence variants (ESVs).

Initially, OTUs were a convenient way to:

reduce data set size for downstream analyses
approximate 'species'
absorb sequence errors

At that time, programs and algorithms for analyzing high throughput sequencing data were not well developed (back then, 'high throughput' meant 96 or 384 well plates for Sanger sequencing). Analyzing multiple plates of data was a big deal. Using OTUs to reduce data set size was a convenient way to make the data more tractable. Anyone remember EstimateS?

At that time, folks were also trying to emulate 'species' units for use with ecological indices. A 95-99% sequence similarity cutoff (depending on the marker and taxonomic group) seemed to work for this purpose. These cutoffs didn't work for many groups, however, leading many a grad student to wring their hands while trying to choose the best cutoff for their group (myself included).
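To make the cutoff problem concrete, here is a minimal sketch of greedy OTU clustering on toy data. It is not how any particular program does it: the function names `identity` and `cluster_otus` and the three reads are made up for illustration, and identity is computed by simple position-by-position comparison of equal-length reads rather than by a real alignment.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this sketch only compares equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_otus(reads, cutoff):
    """Greedy clustering: a read joins the first OTU whose seed it matches
    at >= cutoff identity, otherwise it seeds a new OTU."""
    otus = []  # each OTU is a list of reads; otus[i][0] is its seed
    for read in reads:
        for otu in otus:
            if identity(read, otu[0]) >= cutoff:
                otu.append(read)
                break
        else:
            # No existing seed was similar enough, so start a new OTU.
            otus.append([read])
    return otus


# Three made-up 20 bp reads: b differs from a at 1 position, c at 2 positions.
a = "ACGTACGTACGTACGTACGT"
b = "TCGTACGTACGTACGTACGT"
c = "TCGTACGTACGTACGTACGA"

for cutoff in (0.85, 0.95, 0.99):
    print(cutoff, len(cluster_otus([a, b, c], cutoff)))
```

The same three reads collapse into one, two, or three OTUs depending on the cutoff, which is exactly why picking 'the' cutoff for a group caused so much hand-wringing.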

We knew from the extensive 16S microbial literature that mixed-template PCR amplification from eDNA produced errors such as chimeric sequences and heteroduplexes. We also knew that different sequencing technologies produced different kinds of error profiles (varied length homopolymer runs or indels). At the time, clustering similar reads together into OTUs seemed to be a good way to absorb some of these erroneous sequences.

So in the past couple of decades, software and algorithms have improved to the point where big data sets can be easily managed on high performance computing systems and manipulated using Python to create data frames suitable for different types of analyses, and ecological community analyses are easily handled using R packages like vegan. Data set reduction is no longer a driving need for OTUs.

Most groups have also come to terms with the fact that you cannot expect OTUs based on marker sequences to correspond with Linnaean taxonomic groups. Apples and oranges, right?

So what about the problem of sequence errors? Read output has increased over the years: Illumina MiSeq is the current popular choice for metabarcode studies, Illumina HiSeq and NovaSeq produce gobs more data, and single strand sequencing methods are also coming along. Each platform comes with its own read output vs. read quality vs. read length profile. No single platform optimizes all three parameters.

There are many pipelines (MOTHUR, QIIME, USEARCH) that handle sequence errors by screening for putative chimeric sequences and by allowing the user to exclude rare sequences, which have been shown to be particularly prone to sequence errors. But what about errors produced by the sequencing platform? There is a new generation of denoisers that attempt to address this: DADA2, USEARCH10-unoise3, DEBLUR. Here is a nice paper that reviews these methods: "Denoising the denoisers: An independent evaluation of microbiome sequence error-correction methods". Each of these methods also happens to produce ESVs, which are simply OTUs defined by 100% sequence similarity, also called uniques.
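The idea of 'uniques' plus a rare-sequence filter can be sketched in a few lines of Python. To be clear, this is only dereplication with an abundance cutoff, not denoising: DADA2, unoise3 and DEBLUR use error models that this sketch does not attempt to reproduce. The function name `dereplicate`, the `min_count` parameter and its default of 2, and the toy reads are all assumptions made up for illustration.

```python
from collections import Counter


def dereplicate(reads, min_count=2):
    """Collapse identical reads into uniques and drop the rare ones.

    Returns a dict mapping each retained sequence to its read count.
    min_count is an illustrative threshold, not a recommendation.
    """
    counts = Counter(reads)
    return {seq: n for seq, n in counts.items() if n >= min_count}


# Made-up reads: two abundant variants and one singleton.
reads = ["ACGT"] * 5 + ["ACGA"] * 3 + ["TCGT"]

uniques = dereplicate(reads)
for seq, n in uniques.items():
    print(seq, n)
```

Here the singleton is dropped and the two abundant variants are kept as separate units, even though they differ at only one position.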

What are the advantages of ESVs over OTUs?

Increased resolution (no lumping of sequence variants belonging to different species into OTUs)
Reproducibility (the order of the input sequences matters when creating OTUs, but not when creating ESVs)
Easier comparisons with other data sets (no need to re-generate OTUs when new reads are added)
More intuitive interpretation (a sequence variant is what it is, whereas an OTU can be a diffuse cloud of similar sequences)
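The reproducibility point can be shown with a toy example. This is a sketch under simplifying assumptions: `greedy_otus` is a made-up, bare-bones greedy clusterer (first matching seed wins, identity by position-wise comparison of equal-length reads), and the three reads are invented so that the middle one is within the cutoff of both of the others.

```python
from collections import Counter


def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(reads, cutoff=0.95):
    """Each read joins the first OTU whose seed it matches at >= cutoff,
    otherwise it becomes the seed of a new OTU."""
    otus = []
    for read in reads:
        for otu in otus:
            if identity(read, otu[0]) >= cutoff:
                otu.append(read)
                break
        else:
            otus.append([read])
    return otus


# b is 1 mismatch from both a and c; a and c are 2 mismatches apart.
a = "ACGTACGTACGTACGTACGT"
b = "TCGTACGTACGTACGTACGT"
c = "TCGTACGTACGTACGTACGA"

run1 = greedy_otus([a, b, c])  # a seeds first, so c cannot join it
run2 = greedy_otus([b, a, c])  # b seeds first and absorbs both a and c

print(len(run1), len(run2))

# Exact variants are just the distinct sequences, whatever the input order.
print(Counter([a, b, c]) == Counter([b, a, c]))
```

With the same reads and the same cutoff, the first ordering gives two OTUs and the second gives one, while the set of exact variants is identical either way. That is also why ESVs from different data sets can be compared directly by sequence.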

Here are some papers that discuss the topic further:

Exact sequence variants should replace operational taxonomic units in marker-gene data analysis

Ecological patterns are robust to use of exact sequence variants versus operational taxonomic units

Updating the 97% identity threshold for 16S ribosomal RNA OTUs

Denoising the denoisers: An independent evaluation of microbiome sequence error-correction methods

Scaling up: A guide to high-throughput genomic approaches for biodiversity analysis

Oligotyping: differentiating between closely related microbial taxa using 16S rRNA gene data

And here are a couple of good blog posts:

Lumping versus splitting - is it time for microbial ecologists to abandon OTUs?

Metabarcoding for every body, every habitat, every time