To run PUMAA and the associated downstream tools, you will need several files that are the standard output of one of several raw 16S sequencing and data processing platforms, along with metadata about the samples. For this version of the manual (v1.2), the example data sets were produced with the Anacapa toolkit[1], which uses the CRUX filtered 16S database to assign taxonomy[2]. All example data are from the 70% cutoff for the Bayesian Lowest Common Ancestor (BLCA) method of taxonomic classification[3]. See the 16S Metagenomic Sequence Data Processing slides on the left for an overview.
1. Taxonomy Table(s). This is a table that lists all of the Operational Taxonomic Unit (OTU) or Amplicon Sequence Variant (ASV) numbers, followed by how many of each sequence are in each sample. Importantly, each of the OTU/ASV numbers has been assigned a taxonomy (Phylum; Class; Order; Family; Genus; Species) by cross-referencing the 16S sequencing data with databases of classified 16S sequences.
From Anacapa: Open “16S_ASV_raw_taxonomy_70_Skirball-example.txt” as an Excel spreadsheet to see an example.
2. FASTA File(s) with Representative Sequences. Each OTU or ASV represents a specific sequence corresponding to an amplified 16S gene, and these can be found in one or more FASTA files.
See “forward16S_Skirball-example.fasta.txt” for an example of sequences that correspond to the example ASVs.
3. Metadata Table. The metadata, or the data about the data, may be the most important component of any project. This is how you ascribe biological or categorical information to the samples. The metadata is what will be used to make comparisons and formulate hypotheses about any microbiome samples. The more metadata you collect, the finer the comparisons you can make. Metadata can be categorical (e.g. sample type or location) or continuous (e.g. % soil moisture content), but not all analysis and visualization tools can use both kinds of data. For the tools used in this pipeline, we recommend that continuous data be binned into categories (e.g. Low, Med., High). It is very important that the sample names in the OTU/ASV taxonomy file match the sample names in the metadata table exactly. The PUMAA program will verify that the sample data and metadata match.
Open “Metadata_anacapa_skirball-example.txt” as an Excel spreadsheet to see the example metadata corresponding to the samples in the above examples.
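The name-matching and binning advice in the Metadata Table description can be illustrated with a minimal Python sketch. It is not part of PUMAA and does not reproduce PUMAA's own verification step. The column names, sample names, and moisture cutoffs below are hypothetical and chosen only for illustration.

```python
# Hypothetical header row of a taxonomy table: an ID column, one column per
# sample, and a taxonomy column. Real column names depend on your platform.
taxonomy_header = ["seq_number", "SiteA_1", "SiteA_2", "SiteB_1", "taxonomy"]
non_sample_columns = {"seq_number", "taxonomy"}

# Hypothetical metadata: sample name -> (sample type, % soil moisture).
# Note the deliberate mismatch: "siteB_1" differs from "SiteB_1" by case.
metadata = {
    "SiteA_1": ("soil", 8.5),
    "SiteA_2": ("soil", 22.0),
    "siteB_1": ("leaf", 41.3),
}


def check_sample_names(header, metadata_names, skip):
    """Return (names only in the taxonomy table, names only in the metadata)."""
    table_samples = {col for col in header if col not in skip}
    metadata_samples = set(metadata_names)
    only_in_table = sorted(table_samples - metadata_samples)
    only_in_metadata = sorted(metadata_samples - table_samples)
    return only_in_table, only_in_metadata


def bin_value(value, low_cutoff, high_cutoff):
    """Turn a continuous value into one of three categories."""
    if value < low_cutoff:
        return "Low"
    if value < high_cutoff:
        return "Med."
    return "High"


missing_from_metadata, missing_from_table = check_sample_names(
    taxonomy_header, metadata.keys(), non_sample_columns
)

# Cutoffs of 15% and 30% are arbitrary illustration values.
bins = {name: bin_value(moisture, 15.0, 30.0)
        for name, (_, moisture) in metadata.items()}

print("In taxonomy table but not metadata:", missing_from_metadata)
print("In metadata but not taxonomy table:", missing_from_table)
print("Binned moisture:", bins)
```

Because the comparison is exact, a difference in capitalization, spacing, or punctuation is reported as a mismatch, which is the kind of error worth catching before files are uploaded.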
All PUMAA wrapper file conversion steps can be performed once by the instructional staff, and the resultant output files can be provided to students for data analysis and visualization.
Once PUMAA is installed and running (see PUMA GitHub for installation instructions), it will guide you through uploading the files required for your input platform. You will be asked to enter the desired rarefaction depth and number of iterations. Briefly, rarefaction is a method to randomly subsample each community to a common depth[4]. This can be necessary if the samples being analyzed have vastly different numbers of total sequences, which could affect subsequent diversity calculations. A common practice is to rarefy to the minimum number of sequences in any one sample. To find this number, sum all of the sequences for each sample in the taxonomy table you will be using for analysis and take the smallest sum. The rarefaction iterations refer to how many times this subsampling will be performed. If more than one iteration is performed, all of the subsamples are added together to get a better estimate of the total diversity.
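The depth calculation and repeated subsampling described above can be sketched in a few lines of Python. This is an illustration only, not PUMAA's implementation: the sample names and counts are made up, and drawing reads without replacement is an assumption, since the manual does not state how PUMAA draws its subsamples.

```python
import random

# Hypothetical counts: sample name -> number of reads for each of four ASVs.
counts_by_sample = {
    "SiteA_1": [120, 30, 0, 50],
    "SiteA_2": [10, 60, 25, 5],
    "SiteB_1": [400, 0, 100, 0],
}


def min_depth(counts):
    """Smallest per-sample total, a common choice of rarefaction depth."""
    return min(sum(sample_counts) for sample_counts in counts.values())


def rarefy(sample_counts, depth, rng):
    """Randomly subsample one sample to `depth` reads (without replacement)."""
    if depth > sum(sample_counts):
        raise ValueError("depth exceeds the number of reads in this sample")
    # Expand counts into one entry per read, labeled by ASV index.
    pool = [asv for asv, n in enumerate(sample_counts) for _ in range(n)]
    subsample = [0] * len(sample_counts)
    for asv in rng.sample(pool, depth):
        subsample[asv] += 1
    return subsample


def rarefy_iterations(sample_counts, depth, iterations, seed=0):
    """Repeat the subsampling and add the subsamples together."""
    rng = random.Random(seed)
    total = [0] * len(sample_counts)
    for _ in range(iterations):
        subsample = rarefy(sample_counts, depth, rng)
        total = [t + s for t, s in zip(total, subsample)]
    return total


depth = min_depth(counts_by_sample)  # per-sample totals here are 200, 100, 500
rng = random.Random(42)
rarefied = {name: rarefy(sample_counts, depth, rng)
            for name, sample_counts in counts_by_sample.items()}
summed = rarefy_iterations(counts_by_sample["SiteB_1"], depth, iterations=5, seed=42)

print("Rarefaction depth:", depth)
print("One iteration:", rarefied)
print("Five iterations summed for SiteB_1:", summed)
```

In this toy data the smallest sample has 100 reads, so every sample is reduced to 100 reads per iteration, and the sample that already has exactly 100 reads is left unchanged.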
[1] Curd, E. (2019). The Anacapa Toolkit. Python. https://github.com/limey-bean/Anacapa
[2] Curd, E. (2018). CRUX: Creating Reference libraries Using eXisting tools. https://github.com/limey-bean/CRUX_Creating-Reference-libraries-Using-eXisting-tools
[3] Gao, X., Lin, H., Revanna, K., and Dong, Q. (2017). A Bayesian taxonomic classification method for 16S rRNA gene sequences with improved species-level accuracy. BMC Bioinformatics 18, 247.
[4] Cárcer, D.A. de, Denman, S.E., McSweeney, C., and Morrison, M. (2011). Evaluation of Subsampling-Based Normalization Strategies for Tagged High-Throughput Sequencing Data Sets from Gut Microbiomes. Appl. Environ. Microbiol. 77, 8795–8798.