Folders: Microbiome on my desktop, and melissa_Lmelis in /labs/evolution/projects on the cluster
Files:
1). gomp011_S1_L001_R1_001.fastq : Read1 from Illumina
2). gomp011_S1_L001_R2_001.fastq : Read2 from Illumina
3). gomp011_S1_L001_I1_001.fastq : Index file for barcodes from Illumina
4). mapping_file.csv : Metadata mapping file with barcodes and sample IDs
Step 1:
Validate the mapping file:
validate_mapping_file.py -m ./mapping_file.csv -o ./
This generates 3 files:
mapping_file.csv_corrected.txt
mapping_file.csv.html
mapping_file.csv.log
No errors were produced.
Step 2:
Use split_libraries_fastq.py on each read separately.
split_libraries_fastq.py -i gomp011_S1_L001_R2_001.fastq -b gomp011_S1_L001_I1_001.fastq -o fout_R2/ -m mapping_file.csv --rev_comp_mapping_barcodes
Options:
-i input fastq file (here Read 2)
-b index file with barcodes
-o fout_R2, the folder in which output files are created
-m mapping file with sample IDs and barcodes
--rev_comp_mapping_barcodes reverse-complement the barcodes in the mapping file to match the index file
Output files created:
histogram.txt
seqs.fna
split_library_log.txt
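Step 3 below reads from ./fout_R1/seqs.fna, so Read 1 was demultiplexed the same way; the command was not recorded here, but it was presumably the analogue of the Read 2 command above:
split_libraries_fastq.py -i gomp011_S1_L001_R1_001.fastq -b gomp011_S1_L001_I1_001.fastq -o fout_R1/ -m mapping_file.csv --rev_comp_mapping_barcodes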
Step 3:
pick_closed_reference_otus.py -i ./fout_R1/seqs.fna -o ./picked_closed_OTUs_R1
pick_rep_set.py -i ./uclust_ref_picked_otus/seqs_otus.txt -f ../fout_R1/seqs.fna -o ./rep_set1.fna
assign_taxonomy.py -i ./picked_closed_OTUs_R1/rep_set1.fna
align_seqs.py -i ./picked_closed_OTUs_R1/rep_set1.fna -o ./pynast_aligned_defaults/
filter_alignment.py -i ./pynast_aligned_defaults/rep_set1_aligned.fasta -o ./pynast_aligned_seqs/
make_phylogeny.py -i ./pynast_aligned_seqs/rep_set1_aligned_pfiltered.fasta -o ./rep_phylo.tre
biom summarize-table -i ./picked_closed_OTUs_R1/otu_table.biom
make_otu_network.py -i ./picked_closed_OTUs_R1/otu_table.biom -m ./mapping_file.csv_corrected.txt -o ./otu_network/
summarize_taxa_through_plots.py -i ./picked_closed_OTUs_R1/otu_table.biom -o ./taxa_summary -m ./mapping_file.csv_corrected.txt
beta_diversity.py -i ./picked_closed_OTUs_R1/otu_table.biom -o beta_div
single_rarefaction.py -i ./picked_closed_OTUs_R1/otu_table.biom -o otu_table_even100.biom -d 100
beta_diversity.py -i ./picked_closed_OTUs_R1/otu_table.biom -o beta_div -m weighted_unifrac -t rep_phylo.tre
WORK ON 07/01/2016
Started repeating the above steps, this time using both reads of the Illumina run together. The mapping file from Step 1 above (mapping_file.csv) was used.
Step 1:
Use split_libraries_fastq.py on both reads together. The following command-line options were used for this script:
split_libraries_fastq.py -i gomp011_S1_L001_R1_001.fastq,gomp011_S1_L001_R2_001.fastq -o fout_BothReads/ -m mapping_file.csv,mapping_file.csv --rev_comp_mapping_barcodes -b gomp011_S1_L001_I1_001.fastq,gomp011_S1_L001_I1_001.fastq
The following are the options provided for this first pass:
-i input fastq Read 1 and Read 2 files (comma-separated)
-b index file with barcodes (given twice, once per read, also comma-separated)
-o fout_BothReads, the folder in which output files are created
-m mapping file with sample IDs and barcodes (given once per read, comma-separated)
--rev_comp_mapping_barcodes reverse-complement the barcodes in the mapping file to match the index file
Step 2:
Picking OTUs used the following command line:
pick_closed_reference_otus.py -i seqs.fna -o ./picked_closed_OTUs
I used the closed-reference option because we are clustering reads against a reference sequence collection. In this process, any reads that do not hit a sequence in the reference collection are excluded from downstream analysis. QIIME specifies that this option MUST be used when comparing non-overlapping amplicons, such as the V2 and V4 regions of the 16S rRNA gene, and the reference sequences must span both of the regions being sequenced.
Copied from the QIIME website:
Pros:
Speed. Closed-reference OTU picking is fully parallelizable, so is useful for extremely large data sets.
Better trees and taxonomy. Because all OTUs are already defined in your reference sequence collection, you may already have a tree and a taxonomy that you trust for those OTUs. You have the option of using those, or building a tree and taxonomy from your sequence data.
Cons:
Inability to detect novel diversity with respect to your reference sequence collection. Because reads that don’t hit the reference sequence collection are discarded, your analyses only focus on the diversity that you “already know about”. Also, depending on how well-characterized the environment that you’re working in is, you may end up throwing away a small fraction of your reads (e.g., discarding 1-10% of the reads is common for 16S-based human microbiome studies, where databases like Greengenes cover most of the organisms that are typically present) or a large fraction of your reads (e.g., discarding 50-80% of the reads has been observed for “unusual” environments like the Guerrero Negro microbial mats).
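For a later pass, pick_closed_reference_otus.py can also be pointed at an explicit reference set and taxonomy via -r and -t. A sketch against the Greengenes 13_8 97% files (the paths are assumptions and depend on the local install):
pick_closed_reference_otus.py -i seqs.fna -r gg_13_8_otus/rep_set/97_otus.fasta -t gg_13_8_otus/taxonomy/97_otu_taxonomy.txt -o ./picked_closed_OTUs_custom_ref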
Step 3:
pick_rep_set.py -i ./picked_closed_OTUs/uclust_ref_picked_otus/seqs_otus.txt -f seqs.fna -o ./rep_set1.fna --result_fp ./rep_Set
The following options were used on the command line:
-i input file is the seqs_otus.txt file from the previous step
-f the fasta file; here we used seqs.fna, which contains all of the sequences whose identifiers are listed in the OTU map
--result_fp this is the same option as -o (the output filepath); used to save the representative sequences file for the next step
output file is rep_set1.fna
Step 4:
assign_taxonomy.py -i ./seqs_rep_set.fasta --read_1_seqs_fp ../gomp011_S1_L001_R1_001.fastq --read_2_seqs_fp ../gomp011_S1_L001_R2_001.fastq -r ./rep_set1.fna
The default method, the uclust taxonomy assigner, was used to assign taxonomy. I did not give the -t, --id_to_taxonomy_fp option because I did not know how to prepare this file (see the sketch after the options below). This was also the FIRST PASS WITH DEFAULT UCLUST SEARCH.
The following options were used on the command line:
-i input file is seqs_rep_set.fasta from the previous step
--read_1_seqs_fp and --read_2_seqs_fp provide the two reads used in the analysis
-r is the reference sequences file; here it is the rep_set1.fna from the previous step
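For a future pass, the -t/--id_to_taxonomy_fp file is just a two-column, tab-delimited table of reference sequence IDs and taxonomy strings; Greengenes ships one alongside its reference sequences. A sketch of a uclust run against it (the Greengenes paths are assumptions about the local install):
assign_taxonomy.py -i ./seqs_rep_set.fasta -r gg_13_8_otus/rep_set/97_otus.fasta -t gg_13_8_otus/taxonomy/97_otu_taxonomy.txt -o ./uclust_assigned_taxonomy/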
Step 5:
align_seqs.py -i rep_set1.fna -o ./pynast_aligned_defaults/
I used the default arguments here. The alignment is done with PyNAST, which is a Python implementation of the NAST sequence alignment algorithm.
NOTE: The NAST algorithm aligns each provided sequence (the “candidate” sequence) to the best-matching sequence in a pre-aligned database of sequences (the “template” sequence). Candidate sequences are not permitted to introduce new gap characters into the template database, so the algorithm introduces local mis-alignments to preserve the existing template sequence. The quality thresholds are the minimum requirements for matching between a candidate sequence and a template sequence. The set of matching template sequences will be searched for a match that meets these requirements, with preference given to the sequence length. By default, the minimum sequence length is 150 and the minimum percent id is 75%.
The following arguments were used:
-i input file is the rep_set1.fna from Step 3, which is a file of unaligned sequences.
-o output is saved in the folder pynast_aligned_defaults, where three files are created: rep_set1_aligned.fasta, rep_set1_failures.fasta, rep_set1_log.txt
-t option was left at the default, so PyNAST used the bundled qiime_default_reference template alignment (gg_13_8_otus/rep_set_aligned/85_otus.pynast.fasta)
-a, --pairwise_alignment_method: default, which is uclust
-e, --min_length: default, i.e. 75% of the median input sequence length
-p, --min_percent_id: default, 0.75
-d, --blast_db: default, created on-the-fly from the template alignment
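For the record, the same run with those defaults spelled out explicitly would look like this (a sketch; when -t is omitted PyNAST falls back to the bundled template alignment noted above):
align_seqs.py -i rep_set1.fna -o ./pynast_aligned_defaults/ -m pynast -a uclust -p 0.75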
Step 6:
filter_alignment.py -i ./pynast_aligned_defaults/rep_set1_aligned.fasta -o ./filtered_alignments/
NOTE: This script will remove positions which are gaps in every sequence (common for PyNAST, as typical sequences cover only 200-400 bases, and they are being aligned against the full 16S gene). FILTERING ALIGNMENTS WHICH WERE BUILT WITH PYNAST AGAINST THE GREENGENES CORE SET ALIGNMENT SHOULD BE CONSIDERED AN ESSENTIAL STEP.
The command line options used are:
-i is the input file containing the alignment created by the previous step, i.e. rep_set1_aligned.fasta
-o output directory
Used default options here as well. I have not changed any of the following:
--lane_mask_fp: not given, because I do not have this file (see the alternative sketch after these options)
--allowed_gap_frac: Gap filter threshold, filters positions which are gaps in > allowed_gap_frac of the sequences [default: 0.999999]
--remove_outliers: Remove seqs very dissimilar to the alignment consensus (see --threshold). [default: False]
--threshold
With -r, remove seqs whose dissimilarity to the consensus sequence is approximately > x standard deviations above the mean of the sequences [default: 3.0]
--entropy_threshold
Percent threshold for removing base positions with the highest entropy, expressed as a fraction between 0 and 1. For example, if 0.10 were specified, the top 10% most entropic base positions would be filtered. If this value is used, any lane mask supplied will be ignored. Entropy filtering occurs after gap filtering. [default: None]
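Since no lane mask file was available, the entropy filter described above would be one way to approximate the same positional filtering in a later pass; a sketch (0.10 is just the example cutoff from the help text):
filter_alignment.py -i ./pynast_aligned_defaults/rep_set1_aligned.fasta -o ./filtered_alignments_entropy/ --entropy_threshold 0.10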
Step 7:
biom summarize-table -i ./picked_closed_OTUs/otu_table.biom
Summary Results:
Num samples: 69
Num observations: 4150
Total count: 9274061
Table density (fraction of non-zero values): 0.077
Counts/sample summary:
Min: 1.0
Max: 591739.0
Median: 77444.000
Mean: 134406.681
Std. dev.: 155501.530
Sample Metadata Categories: None provided
Observation Metadata Categories: taxonomy
Counts/sample detail:
F4.HWR.L.15: 1.0
L20.BST.L: 1.0
L1hwr: 2.0
F5.BST.L.15: 43.0
L11.HWR.L: 849.0
L22.HWR.L: 1490.0
L28.BST.M: 1733.0
F40.HWR.L.15: 2413.0
F17.BST.M.15: 2710.0
W1.HWR.WL.M: 3982.0
F23.BST.M.15: 4483.0
L8.BST.M: 5338.0
A2.bst: 5709.0
F44.BST.M.20: 6759.0
F11.HWR.M.15: 7070.0
F18.BST.M.15: 7365.0
F10.BST.M.15: 10062.0
F42.HWR.L.20: 12283.0
F77.HWR.L.15: 13144.0
F72.BST.L.20: 14925.0
F43.HWR.M.20: 18149.0
F21.BST.M.15: 18688.0
F51.HWR.L.15: 19619.0
F20.HWR.L.15: 22472.0
L25.BST.M: 36825.0
F76.HWR.M.15: 37408.0
F54.HWR.L.15: 39915.0
F78.HWR.L.15: 40002.0
L2.hwr: 40536.0
F3.HWR.L.15: 43852.0
L34.HWR.L: 44457.0
F50.HWR.L.20: 49870.0
L17.HWR.L: 63537.0
F8.BST.M.15: 75957.0
F95.HWR.L.20: 77444.0
F47.HWR.M.15: 81100.0
F2.BST.L.15: 85556.0
L1.HWR.M: 86264.0
F60.HWR.M.15: 86961.0
L33.BST.M: 88005.0
F87.HWR.M.15: 92114.0
F45.HWR.M.15: 96573.0
A2.hwr: 111034.0
L4.HWR.L: 122391.0
A4.hwr: 130900.0
L19.HWR.M: 136355.0
P9.HWR.L1.L: 152627.0
F25.BST.M.15: 155420.0
F86.HWR.L.15: 184581.0
F24.BST.M.15: 193020.0
A1.HWR: 204180.0
F68.BST.M.20: 211407.0
P8.HWR.A4.M: 224239.0
F83.BST.M.25: 245923.0
F49.HWR.M.15: 249023.0
F14.BST.M.20: 275515.0
F88.BST.M.25: 276321.0
P2.BST.A2.M: 293556.0
F80.HWR.M.25: 316468.0
P4.BST.A4.M: 336582.0
P3.BST.A3.M: 356839.0
F82.BST.M.25: 379655.0
F81.BST.M.25: 391573.0
F30.HWR.M.20: 417380.0
P5.HWR.A1.M: 428788.0
F28.BST.M.15: 438613.0
F90.BST.M.25: 540004.0
F7.HWR.M.15: 564262.0
F29.HWR.L.20: 591739.0
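For the record, the same summary can be written to a file instead of the terminal with biom's -o option (the output filename is arbitrary):
biom summarize-table -i ./picked_closed_OTUs/otu_table.biom -o ./otu_table_summary.txt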
Step 8:
summarize_taxa_through_plots.py -i ./picked_closed_OTUs/otu_table.biom -o taxa_summary -m ../mapping_file.csv
With this command I summarized the taxonomic composition from the OTU table. The level-3 summary is saved in the folder taxa_summary as otu_table_L3.txt; this is the file I will use for further analysis in R.
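If only one taxonomic level is wanted, summarize_taxa_through_plots.py takes a QIIME parameters file via -p; a sketch restricting the summary to level 3 (assuming the summarize_taxa:level parameter key):
echo "summarize_taxa:level 3" > summarize_params.txt
summarize_taxa_through_plots.py -i ./picked_closed_OTUs/otu_table.biom -o taxa_summary_L3 -m ../mapping_file.csv -p summarize_params.txt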
ANALYSIS WITH SUBSAMPLING FOR OPEN-REFERENCE OTU CLUSTERING
filter_otus_from_otu_table.py -i otu_table_mc2_w_tax_no_pynast_failures.biom -o otu_table_no_singletons.biom -n 2
multiple_rarefactions.py -i ./otu_table_no_singletons.biom -m 200 -x 500 -s 10 -n 2 -o ./rarefied_otu_tables500/
Did this for maximum depths of 1000, 1500, and 2000 as well.
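For example, the 1000-depth run would be the same command with only -x and the output folder changed (a sketch of what was presumably run):
multiple_rarefactions.py -i ./otu_table_no_singletons.biom -m 200 -x 1000 -s 10 -n 2 -o ./rarefied_otu_tables1000/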
single_rarefaction.py -i otu_table_no_singletons.biom -o otu_table_even500.biom -d 500
summarize_taxa.py -i ./otu_table_even500.biom -o ./single500 -L 4 -m ../../mapping_file.csv