QIIME workflow

Author: Hu Huang (huan0764@umn.edu)
Last update: 09/12/2014

Updates (09/12/2014)
 - Added a one-step, once-for-all "Frustration-free QIIME Configuration" instruction (credit: Gabe Al Ghalith)

Updates (07/04/2014)
 - Fixed the bugs in the parallel commands
 - Updated the tips for OTU picking.

(For details on processing demultiplexed sequences (e.g. seqs from UMGC), please check Section 7 and use the corresponding command files to process.)

(It will be helpful to make a log-file for each data set being processed and to record every command that is used in each step.)

QIIME (Quantitative Insights Into Microbial Ecology, pronounced "chime") analysis workflow at a glance. (Figure source: Ref [1])

1.     Preprocessing

1)   Convert FASTQ files to FASTA and QUAL files (optional)

Different commands need different data formats. Here are the commands to convert a FASTQ file to FASTA (or FNA) and QUAL files, and to do the reverse conversion.

convert_fastaqual_fastq.py -c fastq_to_fastaqual -f seqs.fastq -o fastaqual/   (convert a fastq file to a fasta file and a qual file)

convert_fastaqual_fastq.py -f seqs.fasta -q seqs.qual -o fastqfiles/   (convert a fasta file and a qual file to a fastq file)

2)  Truncate fasta files at the specified base position (optional)

First, check the quality distribution over base pairs:

 quality_scores_plot.py -q seqs.qual -o quality_histogram/

This command generates two statistics files that show where base quality deteriorates along the reads.

  Example files:


# Suggested nucleotide truncation position (None if quality score average did not drop below the score minimum threshold): 100
# Average quality score bins
# Standard deviation bins
# Total bases per nucleotide position bins


Using this guideline, determine the last base position that you want to keep, i.e., discard all bases after that position.

truncate_fasta_qual_files.py -f seqs.fna -q seqs.qual -b 100 -o filtered100/

This command keeps the first 100 bps of each sequence in seqs.fna, and truncates the corresponding qual file seqs.qual to match.
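If you prefer to read the suggested cutoff programmatically, it can be pulled from the stats file that quality_scores_plot.py writes. The filename quality_bins.txt and the recreated line below are assumptions based on the excerpt above; check your output directory for the actual file name.

```shell
# Recreate an example stats line, then extract the suggested truncation
# position (the file name and exact wording are assumptions).
printf '# Suggested nucleotide truncation position (None if quality score average did not drop below the score minimum threshold): 100\n' > quality_bins.txt
pos=$(sed -n 's/^# Suggested nucleotide truncation position[^:]*: //p' quality_bins.txt)
echo "$pos"   # 100
```

The extracted value can then be passed to truncate_fasta_qual_files.py via -b.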

3) Demultiplexing, removing primers and quality filtering

About mapfiles: high-throughput data from NextGen sequencing facilities normally need to be demultiplexed according to barcode sequences, and the sample ID corresponding to each barcode is recorded in a mapping file. QIIME has a strict requirement for its mapfile format: if your mapfile does not follow the rules, QIIME cannot perform the downstream analysis. Basically, the column names must appear in the first row, starting with a "#", e.g.

#SampleID BarcodeSequence LinkerPrimerSequence Treatment ReversePrimer Description

Treatment and ReversePrimer are optional, and you can add other fields if needed, BUT the last column must ALWAYS be Description. It is recommended to check your mapfiles before proceeding. The following command checks a mapfile and produces an .html report listing any errors or warnings detected:

validate_mapping_file.py -m mapfile.txt -o mapping_output
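A minimal mapfile can be assembled on the command line before validating. The sample IDs, barcodes, and primer below are made-up placeholders, not values from any real run:

```shell
# Write a minimal tab-delimited mapfile (all values are hypothetical examples)
printf '#SampleID\tBarcodeSequence\tLinkerPrimerSequence\tDescription\n' >  example_map.txt
printf 'Sample1\tAACGTACGTA\tCATGCTGCCTCCCGTAGGAGT\tExample_sample_1\n'  >> example_map.txt
printf 'Sample2\tTTGCATCGAT\tCATGCTGCCTCCCGTAGGAGT\tExample_sample_2\n'  >> example_map.txt
# Then check it:  validate_mapping_file.py -m example_map.txt -o mapping_output
```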

Filtering (demultiplexing, primer truncation, and quality filtering) can be done with one command.

split_libraries.py -m mapfile.txt -f seqs.fasta -b 10 -l 50 -o slout/

(-b 10 is a barcode type parameter, change it accordingly. This command is mainly designed for 454 sequence data.)

For quality filtering, a qual file is required (-q seqs.qual). In addition, several parameters need to be configured for quality control: -s, -a, -w.

For sequence files in .fastq format, use split_libraries_fastq.py instead; this is the normal choice for Illumina MiSeq data. Note the following differences from the previous command:

i) the barcode sequences must be stored in a separate .fastq file; this is normally a standard output of certain types of sequencing machines;

ii) this command does NOT remove the primer from a sequence if it has one. This could be considered a bug, and it is expected to be fixed in a future version of QIIME. In some standard protocols, though, the sequence files do not contain primers (they are trimmed off by the machine automatically). Double-check your sequence files before deciding which command to use;

iii) consider using the following parameters for quality control: -r, -q.

split_libraries_fastq.py -i seqs.fastq -b seqs_barcodes.fastq --barcode_type 10 -o slout_r3_q20/ -m mapfile.txt -r 3 -q 20

(For more details on these two commands, please refer to http://qiime.org/scripts/split_libraries_fastq.html and http://qiime.org/scripts/split_libraries.html)

NOTE: this can be done through a PBS job. In our experiment, a sequence file with 7,185,950 sequences of length 161 (including a 10-bp barcode) took less than 1 hour (35-52 min) when requesting 16 nodes and 16 GB of memory. For split_libraries.py, however, if the primer starts with 4 or more "N"s, it requires drastically more resources and time (the same number of sequences took 1 hour 55 min on 64 nodes). ** ppn = 8 on Itasca
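A PBS script for this step might look like the sketch below. The resource numbers and the module name are assumptions taken from the note above; adjust them to your queue and data:

```shell
# Write a sketch PBS submission script for split_libraries.py
# (resources and "module load qiime" are illustrative assumptions)
cat > split_job.pbs <<'EOF'
#!/bin/bash
#PBS -l nodes=1:ppn=8,mem=16gb,walltime=01:00:00
#PBS -N split_libraries
cd "$PBS_O_WORKDIR"
module load qiime
split_libraries.py -m mapfile.txt -f seqs.fasta -q seqs.qual -b 10 -o slout/
EOF
# Submit with:  qsub split_job.pbs
```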

4) Other useful commands

Sequence statistics: count the number of sequences, and mean/standard deviation of sequence lengths in the sequence file.

count_seqs.py -i seqs.fna

92651  : seqs.fna (Sequence lengths (mean +/- std): 151.0000 +/- 0.0000)

92651  : Total

Reverse and complement: if the reads are from the reverse strand of paired-end sequencing, they need to be reverse-complemented in order to match the reference database:

adjust_seq_orientation.py -i seqs.fna

By default, the new sequences are stored in seqs_rc.fna in the same directory; this can be changed with the -o parameter. split_libraries_fastq.py has a --rev_comp option with exactly the same function.
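The operation itself is just a reversal plus a base complement, shown here on a toy sequence with standard shell tools:

```shell
# Reverse-complement one sequence (the same per-read transformation that
# adjust_seq_orientation.py applies): reverse the string, then swap bases.
seq=ATGC
echo "$seq" | rev | tr 'ACGTacgt' 'TGCAtgca'   # prints GCAT
```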

2.    OTU Picking

To use parallel QIIME commands, you MUST first set up parallel QIIME as described in "Frustration-free QIIME Configuration" (Section 1.6). This is only required once.

It is recommended to use a parallel pipeline to pick OTUs, since it may take up to several hours to finish, depending on the number of sequences and the configuration of the computing machine. One of the parallel OTU-picking commands is shown below (another option is parallel_pick_otus_uclust_ref.py):

parallel_pick_otus_usearch61_ref.py -i seqs.fna -r /panfs/roc/groups/8/knightsd/public/gg_13_8_otus/rep_set/97_otus.fasta -o usearch_ref_otu/ -O 8 -X pickOTU

Note that "seqs.fna" is the file of input sequences. Option -X assigns the job name that appears on the job list, and -O determines the number of parallel jobs.
This parallel pipeline includes the following steps:
(1) The input sequence file is split into 8 subfiles (as indicated by -O 8 in the above command);
(2) One PBS job is automatically created, which includes the pick_otus.py command with appropriate parameters and other file operations;
(3) The 8 subfiles are processed in parallel using 8 cores on one node;
(4) poller.py combines the 8 partial results into one final set of result files: _otus.txt, _otus.log, _failures.txt.

NOTE: the updated command files provided in "Metagenomics at MSI" have fixed the following bugs:
       -- All the cores in a node are now fully utilized in parallel.
          (Originally, only one of the 8 cores in a node was used, so the "parallel" commands were not truly parallel. This also wasted SUs.)
       -- Fixed the problem with poller.py: it no longer uses an independent PBS job, which saves a large amount of resources (SUs), and there is no need to run this command manually, as in the original version.

Tips for the -O parameter: the updated commands request 1 node and 8 cores, which now run fully in parallel; thus 8 should be sufficient for a normal-sized sequence file (~2 GB). The default walltime is set to 30 minutes. In our experiment with multiple ~1.5 GB sequence files (subsampled from the Global Gut dataset), each file normally took ~5 minutes to finish, though the time also depends on how easily hits are found in the GreenGenes database. The value can be larger than 8, but the job may then need more memory and walltime: it will request more than one node, while Itasca assigns only one node per job (according to MSI), so that single node must handle the work intended for multiple nodes.

3.     Making OTU biom table

1) The output of parallel_pick_otus is an OTU table in .txt format, but the downstream QIIME (or PICRUSt) analysis needs the OTU table in BIOM format. The conversion can be done with make_otu_table.py (you may need the reference OTU-taxonomy table):

make_otu_table.py -i seqs_otus.txt -t /panfs/roc/groups/8/knightsd/public/gg_13_8_otus/taxonomy/97_otu_taxonomy.txt -o seqs_otus.biom

2) The following command will add metadata to the OTU table, which is necessary for the downstream taxonomic analysis:

biom add-metadata -i seqs_otus.biom -o biom-taxa.biom --observation-metadata-fp /panfs/roc/groups/8/knightsd/public/gg_13_8_otus/taxonomy/97_otu_taxonomy.txt --observation-header "OTU_ID,taxonomy" --sc-separated taxonomy

3) To avoid biased analysis due to large differences in sequencing depth among samples, the OTU table usually needs to be rarefied before performing diversity analysis. For details about rarefaction, please refer to [1]. The sequencing depth can be determined from the sequence statistics:

biom summarize-table -i seqs_otus.biom -o biom_summary.txt

4) Convert BIOM format table to a human-readable .txt format tab-delimited table:

biom convert -i seqs_otus.biom -o otu_table.txt -b --header-key taxonomy

4. Summarize Taxa in Communities via Charts (pie, area, bar)

The first step in roughly identifying samples that differ from the others is to visualize the OTUs in each sample at multiple taxonomic levels. QIIME provides 7 taxonomic levels of summaries (L2~L6 by default), defined as follows:

L1: Kingdom level
      e.g. k__Bacteria
L2: Phylum level
      e.g. k__Bacteria;p__Firmicutes
L3: Class level
      e.g. k__Bacteria;p__Firmicutes;c__Bacilli
L4: Order level
      e.g. k__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales
L5: Family level
      e.g. k__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Lactobacillaceae
L6: Genus level
      e.g. k__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Lactobacillaceae;g__Lactobacillus
L7: Species level
      e.g. k__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Lactobacillaceae;g__Lactobacillus;s__

There is a workflow script, summarize_taxa_through_plots.py, that visualizes the relative abundances of OTUs by sample (default) or by category (option -c). This pipeline includes summarize_otu_by_cat.py, summarize_taxa.py, and plot_taxa_summary.py.

summarize_taxa_through_plots.py -i seqs_otus.biom -o taxa_summary -m mapfile.txt

In the result folder, by default there will be one log file (.txt), L2~L6 OTU tables (.biom and .txt for each level), and a taxa_summary_plots folder containing area and bar charts. By default, the plots are grouped by sample, but they can be grouped by any category in the mapfile (column name) by passing the category name with option -c. Multiple categories can be specified by separating them with a comma (,) and no space. For example:

summarize_taxa_through_plots.py -i seqs_otus.biom -o taxa_summary -m mapfile.txt -c SAMPTYPE,Treatment

The output can be customized by passing a parameter file, which specifies parameter values for the different commands in the pipeline. The parameter file can be created with the echo command or a text editor. Each line specifies one parameter of one command, in the following format:

command_name:parameter_name parameter_values

Here, parameter_name is the long option name (the one with --) in the corresponding command's help document. For example, plot_taxa_summary.py -h lists the available options.

Option -c CHART_TYPE also has a long name, --chart_type=CHART_TYPE, so chart_type is the parameter_name. If we also want a pie chart in addition to the area and bar charts, put the following line in the parameter file and save it as a .txt file, e.g. summarize_param.txt:
plot_taxa_summary:chart_type pie,area,bar

or using echo command line:

echo "plot_taxa_summary:chart_type pie,area,bar" > summarize_param.txt

And then pass this parameter file to the pipeline script:

summarize_taxa_through_plots.py -i seqs_otus.biom -o taxa_summary -m mapfile.txt -p summarize_param.txt -c SAMPTYPE

This parameter-file rule applies to all other QIIME pipeline scripts that wrap multiple commands.


i) Sometimes the summary plots are not quite what we want: the groups may not be in the desired order, or there may be so many bacterial taxa that the plot is cluttered. This can be fixed using the .biom and .txt files generated in the result folder. Open the .txt file of the desired level (e.g. otu_table_L2.txt), preferably in Excel, edit the columns and rows as needed, optionally collapse (sum up) the low-abundance taxa into a new row, and save it as a tab-delimited text file. Then use plot_taxa_summary.py to regenerate the charts. Multiple levels can be plotted in the same .html file by separating the files with a comma (,). For example,

plot_taxa_summary.py -i new_otu_table_L2.txt,new_otu_table_L3.txt -o new_L2L3_chart -c pie,area,bar
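The collapsing step can also be scripted. The sketch below sums taxa whose total count falls under a threshold into one "Other" row; the toy table and the threshold of 5 are made up for illustration:

```shell
# Build a toy tab-delimited L2 table, then collapse low-abundance rows.
printf 'Taxon\tS1\tS2\np__A\t40\t50\np__B\t2\t1\np__C\t1\t2\n' > toy_L2.txt
awk -F'\t' '
  NR==1 { n=NF; print; next }                    # keep header, remember width
  { tot=0; for(i=2;i<=n;i++) tot+=$i
    if (tot>=5) print                            # abundant taxa pass through
    else for(i=2;i<=n;i++) other[i]+=$i }        # accumulate the rest
  END { line="Other"
        for(i=2;i<=n;i++) line=line "\t" other[i]
        print line }' toy_L2.txt > collapsed_L2.txt
```

The resulting collapsed_L2.txt can then be fed to plot_taxa_summary.py like any other taxa table.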

ii) To use some of the charts, for example in a paper manuscript, we can import the original PDF files into the document. At the top left of each chart there are two PDF links: View Figure (.PDF) and View Legend (.PDF). Clicking them opens two web pages, and the address bar of each page shows the original directory of the PDF file. Normally they are located under taxa_summary/charts/.

5.    Diversity analysis

1. Alpha-diversity Analysis

"Alpha-diversity is defined as the diversity of organisms in one sample or environment." [1] (within-sample diversity, answering the questions "what is there?" and "how much is there?"). It is recommended to use phylogenetic distance/diversity (PD_whole_tree) over OTU counts, but the choice of metric depends on the specific question, such as estimating community richness (the observed number of OTUs, observed_species, or the Chao1 estimator of the total number that would be observed with infinite sampling) or estimating a combination of richness and evenness (Shannon entropy).

alpha_rarefaction.py is a workflow of scripts that perform alpha-diversity analysis and visualization. It includes multiple_rarefactions.py, alpha_diversity.py, collate_alpha.py and make_rarefaction_plots.py. For alpha_diversity.py, we need to specify the distance metrics, and this can be done through the parameter-file option.

echo "alpha_diversity:metrics shannon,PD_whole_tree,chao1,observed_species" > alpha_params.txt
alpha_rarefaction.py -i seqs_otus.biom -m mapfile.txt -o alpha_div/ -p alpha_params.txt -t /panfs/roc/groups/8/knightsd/public/gg_13_8_otus/trees/97_otus.tree


i) Similarly, the parameter file, alpha_params.txt, can be created with the echo command or a text editor. Each line specifies one parameter of one command. alpha_diversity.py -h lists the available options.

Here, the long option is --metrics=METRICS, which means parameter_name is metrics. So we can specify this parameter in the param file as alpha_diversity:metrics shannon,chao1

ii) This pipeline may take a long time, depending on the file size. For large datasets, it is recommended to use a PBS job requesting more computation resources (e.g., 64 nodes with 2 hours of walltime).

iii) If there is an error, open the log file in the result folder and check which command of the pipeline failed. After fixing the issue, restart from that specific command (as shown in the log file) and run the remaining commands of the pipeline one by one.

2. Beta-diversity Analysis

"Beta-diversity is the difference in diversities across samples or environments." [1] (between-sample diversity, answering the question of "how similar or different are samples?") QIIME implements two variants of UniFrac-based beta-diversity analysis: weighted UniFrac and unweighted UniFrac. 
  • Weighted UniFrac
    • Weighted by the difference in probability mass of OTUs from each community for each branch in the phylogenetic tree
    • Detect community differences that arise from differences in relative abundance of taxa
  • Unweighted UniFrac
    • the absence/presence of the OTUs from each community
    • Detect community differences that arise from difference in presence of certain taxa
    • might be better in analysis of clinical or environmental variables, for example
Similarly, there is a workflow command for beta-diversity analysis and visualization: beta_diversity_through_plots.py. This pipeline includes single_rarefaction.py, beta_diversity.py, principal_coordinates.py and make_emperor.py. The final visualization is performed by Emperor in a web browser, so the result files should be downloaded to a local directory in order to view the PCoA clusters.

beta_diversity_through_plots.py -i seqs_otus.biom -m mapfile.txt -o beta_div/ -t /panfs/roc/groups/8/knightsd/public/gg_13_8_otus/trees/97_otus.tree -e 200

The option -e sets the sequencing depth, which can be determined from the OTU table summary in Section 3, 3).

NOTES: as with alpha_rarefaction.py, if there is an error, check the log file in the result folder, fix the problem in the corresponding command, and rerun from that command onward, one by one. This pipeline takes a relatively short time to finish.

6.    Metagenome prediction through PICRUSt -- PICRUSt is now installed on MSI!

module load picrust

-- Normalize the OTU table (the first step in PICRUSt):

normalize_by_copy_number.py -i biom-taxa-sc-rare.biom -o normalized_otus.biom

-- Predict the metagenome from the normalized OTU table (this produces the metagenome_prediction.biom file used below):

predict_metagenomes.py -i normalized_otus.biom -o metagenome_prediction.biom

-- Partition metagenome functional contributions according to function, OTU, and sample, for a given OTU table.

metagenome_contributions.py -i metagenome_prediction.biom -l K01442,K01550 -o metagenome_contrib.tab

-- Collapse table data to a specific level in a hierarchy

categorize_by_function.py -i metagenome_prediction.biom -c "KEGG_Pathways" -l 3 -o metagenome_at_level3.biom  

Example: First create a file, qiime_params.txt, which includes the following parameters:

summarize_taxa:md_identifier    "KEGG_Pathways"

summarize_taxa:absolute_abundance   True

summarize_taxa:level    3
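The same three-line qiime_params.txt can be written in one step with a heredoc (contents match the lines above):

```shell
# Write qiime_params.txt in one step
cat > qiime_params.txt <<'EOF'
summarize_taxa:md_identifier    "KEGG_Pathways"
summarize_taxa:absolute_abundance   True
summarize_taxa:level    3
EOF
```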

     Then use the command below:

summarize_taxa_through_plots.py -i metagenome_at_level3.biom -p qiime_params.txt -o metagenome_plots_level3

7. Preprocessing Demultiplexed Sequence Data 

Sequences from UMGC are normally demultiplexed (each sequence file, or R1/R2 pair, corresponds to one individual sample), which differs slightly from the standard QIIME workflow. Here we use a few tricks, implemented in two Perl pipeline scripts, to preprocess this kind of demultiplexed sequences. The workflow is as follows:

1) Convert all .fastq files to .fna and .qual formats;
2) Check the qualities of reads for all (or parts of) samples, and then determine and truncate low quality sequences if needed;
3) Create mapfiles for individual samples;
4) Use the individual mapfiles to perform quality control and to remove the primers;
5) Reverse and complement the sequences from negative strand (R2); 
6) Combine all the processed sequences into one single file and save it.
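For step 6), QIIME expects each header in the combined file to begin with the sample ID (e.g. >Sample1_0), which split_libraries.py normally adds via the per-sample mapfiles. A minimal relabeling sketch for one file (the file and sample names below are made up):

```shell
# Make a tiny per-sample FASTA, then prefix each header with the sample ID.
printf '>read1\nACGT\n>read2\nTTGG\n' > sampleA_raw.fna
sample=Sample1
i=0
while IFS= read -r line; do
  case $line in
    '>'*) printf '>%s_%d %s\n' "$sample" "$i" "${line#>}"; i=$((i+1)) ;;
    *)    printf '%s\n' "$line" ;;
  esac
done < sampleA_raw.fna > sampleA_seqs.fna
head -n 1 sampleA_seqs.fna   # prints >Sample1_0 read1
```

Relabeled per-sample files can then be concatenated with cat into one combined file.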

Step 1) could be done by step1_convert_all_fastq.pl, for example:

perl step1_convert_all_fastq.pl

The quality scores of reads from UMGC are normally high enough (Q>30) that the sequences do not need to be truncated. However, the reverse-strand (R2) reads usually have lower quality, so check more of the R2 sequence files in Step 2).

In Step 3), mapfiles are created for the individual samples using a MATLAB function, hh_make_mapfiles.m; to run it, load the MATLAB module first (module load matlab). The individual mapfiles are used to remove the primers from each sequence file. Normally two mapfiles are needed: a forward mapfile for the R1 files and a reverse mapfile for the R2 files. The .m file provided here is written for the V1V3 and V4V6 regions, and the mapfiles should be named accordingly.


Steps 4) ~ 6) are performed by step2_trim_primer_merge.pl, using the command line

perl step2_trim_primer_merge.pl

The command file provided here is written for the V1V3 and V4V6 regions, and at the end it generates two final preprocessed files: all_seqs_V1V3.fna and all_seqs_V4V6.fna. Using these two files, we can perform OTU picking and the downstream analysis described in Sections 2~6.

i) To run the commands correctly, all three command files should be in the same directory as the .fastq sequence files.
ii) step2_trim_primer_merge.pl and hh_make_mapfiles.m should be modified according to your specific mapfiles and situation.
iii) Add module load matlab to your .bashrc file if you will use MATLAB functions in the future.
iv) This is a temporary method to preprocess demultiplexed sequence files for these specific regions. A more general-purpose function will be developed shortly.
v) It generates three types of files: *_R1.fna, *_R2.fna, and all_seqs_*.fna. The first two are the preprocessed sequences from R1 and R2, respectively, whereas the last is a simple concatenation of the two (no paired-end joining). Since these are paired-end reads, using all_seqs_*.fna may cause the same fragment to receive different OTU assignments (one for R1 and a different one for R2). This will be fixed in a later version by joining the paired-end reads first. For now, we can use single-end reads (R1 only) to pick OTUs.

For detailed workflow and explanation for the results, please refer to references [1, 2].

[1] Navas-Molina, José A., et al. "Advancing Our Understanding of the Human Microbiome Using QIIME." Microbial Metagenomics, Metatranscriptomics, and Metaproteomics (2013): 371.
