how to use

Prerequisites

    1. MATLAB. DADA is currently implemented as a Matlab script that must be executed from within the MATLAB environment.
    2. MATLAB Bioinformatics Toolbox. A few basic bioinformatics tools are used from this toolbox by DADA.
    3. PERL. We include several PERL scripts to get your data from .fasta to the .uniques format accepted by DADA

Downloading DADA

You can download the DADA project by clicking here for a zip archive or here for a tarball. DADA is also available via our github repository. Once unpacked, the archive will contain the main dada.m MATLAB script, several auxiliary MATLAB and PERL scripts, and a README (containing the same information you see here) and a License.

Setting up DADA

Before running DADA it is necessary to compile the C programs align2dom_homo.c and align2dom.c. This is easy to do within MATLAB. Open MATLAB, go to the directory where the files are located, and type into the command window:

mex align2dom.c

and

mex align2dom_homo.c

This will create executable files with a suffix dependent on your local architecture. Place these files in the same directory as dada.m. It is recommended that you place these compiled C programs and dada.m in a folder together and add this to your MATLAB path. This way you may call them while working in any directory where your data may happen to be.

Upstream processing

DADA accepts data in a format that we call .uniques, an efficient way to store and manipulate deep sequence data. Here is an example of a .uniques file.

435 GGCGGCGATATCCTGC

384 GGCTGCGATATCCTGC

286 GGCGGCGATATCATGC

199 GGCGGCGATAGCCTGC

...

Each line is an integer and a string separated by a tab. The string is a nucleotide sequence and the integer is the number of times that this sequence was observed in the data set. We provide some simple PERL scripts to easily get your data into .uniques format from .fasta.

fasta2seqs.pl

takes a .fasta file, strips away the headers and newlines, and creates a .seqs file, which consists of one complete sequence on each line. For example,

GGCGGCGATATCCTGC

GGCTGCGATATCCTGC

GGCGGCGATATCCTGC

GGCGGCGATAGCCTGC

fasta2seqs.pl can be called on a folder of .fasta files or a single .fasta file. For example:

perl fasta2seqs.pl ~/myfastas

or

perl fasta2seqs.pl ./myfasta.fasta

A new folder with _seqs appended to its name will be created with each .fasta turned into a .seqs

lengthTrim.pl

takes a folder of .seqs files, trims sequences down to a given length, and creates a folder of new .seqs files. Sequences shorter than the provided length are removed. You can call it with

perl lengthTrim.pl ~/myfastas_seqs 100

This creates the folder ~/myfastas_seqs100, which will contain .seqs files of 100bp sequences.

clusterUniques.pl

takes in a folder of .seqs files (or single .seqs file) and outputs a folder (or file) in the .unique format described above. This folder will have _uniques appended to its name.

Running DADA

Now that you have converted your data into a folder of .uniques files and have compiled the two C programs, you are ready to run DADA. In its simplest form, DADA may be called in the command window with

dada('inFolder')

where inFolder is the path for a folder of .uniques files. DADA can also be with options via

dada('inFolder','opt1',op1val,'opt2',op2val...)

These options are:

  • omegaA, the abundance p-value significance threshold, set to .01 by default
  • omegaR, the read p-value significance threshold, set to .01 by default
    • G, the gap penalty, set to -4 by default
  • GH, a homopolymer gap penalty, set to -1 by default
    • context, whether or not to use context-dependent error probabilities during clustering, set to "false" by default. Use values "true" and "false" to turn context-dependence on or off
  • tol, a tolerance on the norm of the difference between T matrices on consecutive rounds for DADA to quit (see manuscript), set to 1e-6 by default.
  • err, optional initial set of error probabilities for the first round of clustering (instead of estimating these by assuming each file contains only a single genotype) -- may save time if clustering the same or similar data sets many times and can be useful if trying to determine what clustering a particular set of error probabilities would imply.
  • noUpdate, whether to not update error probabilities, set to "false" by default. It can be useful to set this to true in combination with 'err' if you want to see the effect of clustering the data with a particular set of error probabilities.

The output of this run is placed in a new folder within the working directory. The name of this folder will be the date and time at which DADA was invoked follow by any options that were used. Within this folder will be subfolders named 0, 1, 2, ..., each containing the results of the clustering after each round of the updating of error probabilities. If you would just like a .fasta of the genotypes inferred by DADA, you skip ahead to the description of bin2fasta.m below. However, if you want to explore the clustering, here is a description of the DADA output: within each subfolder will be one .mat file for each input .uniques file given to DADA. When loaded within MATLAB, these .mat files contain several variables:

  • bin, a struct array containing detailed information about the clustering.
  • reads, a vector of the read abundances associated with each inferred genotype
  • reals, a cell array containing the inferred genotypes
  • D, a pairwise distrance matrix between these genotypes.

The subfolders may also contain (as long as noUpdate was not set to "true") the file ERR.mat, which contains the context-independent probabilities, ERRa, and context-dependent probilities, ERR. In the language of the methods of our paper, ERR(L,i,j,R) = T^{L,R}_{ij} and ERRa(i,j) = T_{ij}.

DOWNSTREAM of DADA

bin2fasta.m

If you wish to work with .fasta files of the inferred genotypes after calling DADA, you can call the included MATLAB script bin2fasta.m. This takes as its input the location of a folder of .mat files produced by DADA and creates .fasta files of the inferred genotypes, placing the abundance of each sequence in its header. If you want all genotypes output to a single file called 'all.fasta', you call it as

bin2fasta('inFolder')

you can also output a single .uniques file for each file, which you would achieve by calling

bin2fasta('inFolder','fasta',false)

omegaA_rejoin.m

takes a folder of .mat files produced by dada and returns a vector of approximate omegaA values that would be required for each cluster to be reabsorbed into some other cluster. Making a histogram of these is a good way to see whether the results are likely to be sensitive to the choice of the omegaA parameter, and if more conservative values ought to be used. These p-values are determined for the scenario that the reads of the reads of the error-free family of a given cluster were all errors away from some other cluster (for which all associated reads are included).

discrimPlot.m

shows a discrimination plot of the type shown in the DADA paper. Its arguments are:

  1. omegaA
  2. omegaR
  3. 4x4 matrix of context-independent error probabilities
  4. the number of reads in a cluster
  5. a 1x4 vector of the base composition of the sample genotype for a cluster
  6. the number of families in this cluster

For example,

discrimPlot(.01,.01,ERRa,5000,[50 50 50 50],1000)

where ERRa came from the ERR.mat file of the final round of clustering for your data set. Because of the way the effective Hamming distance is defined (see the DADA paper), everything above the curved line or to the right of the vertical line would be rejected as an error by DADA at the given significance levels if the x-axis is interpreted as a true Hamming distance, so this plot will show you the minimum resolution DADA was able to achieve at the provided significance level. If you want finer resolution than this, you can run the algorithm again with one or the other significance level changed, but you may have to accept additional false positives in the process.

A call for open-sourcing?

DADA itself is, for now, implemented in MATLAB. If you feel strongly, for philosophical or practical reasons, that it should be implemented in an open source language like C, let us know (and what your reasons are)!

Software License

DADA is licensed under the GPL v3; you may edit the code and share your edited code, but you may not sell the code and must maintain the license if you do distribute it.