Introduction

Background

DADA is an algorithm for removing errors from PCR-amplified metagenome data.

Suppose an investigator wants to sample the genetic diversity in a bacterial or viral population (e.g. the three bacteria in the figure above). Rather than performing shotgun sequencing, she decides to amplify a particular genomic region in the sample, pool the amplicons, and sequence them. However, PCR introduces errors (the red letters in the figure), and because high-throughput methods derive each read from an individual molecule, these errors show up in the final data.

This is where DADA comes in. From the noisy sequence data (the right side of the figure), it infers the genotypes of the organisms in the sample along with the error rates of the PCR/sequencing process. It does this by alternately updating an estimate of the genotypes and an estimate of the error parameters until the two converge to a mutually consistent set. The details of this process are explained in our manuscript, but see below for a schematic.
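To make the alternating update concrete, here is a toy Python sketch. It is not DADA's actual model or code: the single per-base error rate, the Hamming-distance assignment, the Poisson abundance test, and the names `infer_genotypes`, `poisson_tail`, and `alpha` are all simplifying assumptions made for illustration. It only shows the general shape of the loop, in which the genotype set and the error estimate are re-derived from each other until neither changes.

```python
import math
from collections import Counter

def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def poisson_tail(n, lam):
    """P(X >= n) for X ~ Poisson(lam), computed as 1 - P(X <= n - 1)."""
    term = math.exp(-lam)  # P(X = 0)
    cdf = 0.0
    for k in range(n):
        cdf += term
        term *= lam / (k + 1)
    return max(0.0, 1.0 - cdf)

def infer_genotypes(reads, init_error=0.01, alpha=1e-6, max_iter=50):
    """Toy alternating inference (an assumption, not DADA's real model)."""
    counts = Counter(reads)
    length = len(reads[0])
    genotypes = [counts.most_common(1)[0][0]]  # start from the most abundant read
    error = init_error
    for _ in range(max_iter):
        # Step 1: given the genotypes, assign each unique sequence to the nearest one.
        nearest = {s: min(genotypes, key=lambda g: hamming(s, g)) for s in counts}
        cluster_size = Counter()
        for s, n in counts.items():
            cluster_size[nearest[s]] += n
        # Step 2: given the assignment, re-estimate one per-base error rate.
        mismatches = sum(hamming(s, nearest[s]) * n for s, n in counts.items())
        new_error = mismatches / (len(reads) * length)

        # Step 3: find sequences too abundant to be errors of their genotype.
        def p_value(s):
            d = hamming(s, nearest[s])
            expected = (cluster_size[nearest[s]]
                        * (new_error / 3) ** d
                        * (1 - new_error) ** (length - d))
            return poisson_tail(counts[s], expected)

        candidates = [s for s in counts
                      if s not in genotypes and p_value(s) < alpha]
        if not candidates and abs(new_error - error) < 1e-9:
            break  # genotypes and error rate are mutually consistent
        error = new_error
        if candidates:
            # Promote the most abundant unexplained sequence to a genotype.
            genotypes.append(max(candidates, key=lambda s: counts[s]))
    return sorted(genotypes), error

reads = ["ACGTACGT"] * 60 + ["ACCTAGGA"] * 30 + ["ACGTACGA"] * 2 + ["TCCTAGGA"]
genotypes, error = infer_genotypes(reads)
print(genotypes, error)
```

On this made-up input the sketch keeps the two abundant sequences as genotypes and treats the three rare reads as errors, settling on a per-base error rate of about 0.004. Note how the first pass overestimates the error rate (every read is compared against a single genotype) and the estimate only drops once the second genotype has been recognized, which is why the two quantities have to be updated together.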