9 – Getting taxonomic lineage information out of MEGAN.

The free to use community edition of MEGAN6 is a tool that can be used to parse through BLAST output, setting varying stringency criteria to help sort out good from not-so-good taxonomic assignments and is a nice alternative to the widely used top BLAST hit taxonomic assignment approach.

The top BLAST hit taxonomic assignment approach can help to put names on unknown OTUs accurately so long as the appropriate reference sequences are present in the database. When this is not the case, a false positive ‘assignment’ is made. Sequence identity and query coverage may not be able to screen out these false positives. MEGAN, however, can help make more conservative taxonomic assignments in a consistent and objective manner.

Top BLAST hit versus MEGAN taxonomic assignments:

Instead of using percent identity and query coverage to select a good top hit, MEGAN uses a bitscore cutoff value. The bitscore value takes into consideration the overall quality of the HSP (high-scoring segment pair) alignment such as gaps (insertions/deletions), matches, and mismatches. The rule of thumb I use is that a good bitscore value should be around 2x the length of the query sequence (give or take). The idea here is to use a bitscore cutoff value large enough to exclude short but nearly identical matches that represent only partial matches of the query to a reference sequence.

Planning your BLAST and MEGAN analysis of many samples:

Because I’m currently working with the Illumina sequencing platform, the reads have long complicated id names. Because I have used USEARCH to cluster these reads, the OTUs have OTU size information appended to the end of these already long ids. Because I often work with dozens of BLAST output files from different samples, I often add sample name (and amplicon name) identifiers to the start of each OTU id. This point is important because the community edition of MEGAN6 no longer allows command line access and the program does not allow many sample files to be processed in an automated manner. To make life easier, I concatenate all the BLAST files from multiple samples into a single file for importing. I import one giant file and sort samples out later (usually Excel is sufficient). I have had to increase the amount of memory available to MEGAN6. This can be done easily by editing the MEGAN.vmoptions file present in the installation directory.

Useful parameters to adjust when importing BLAST files into MEGAN:

Perhaps the most important parameter to set (and reset during the data exploration phase) is the minimum bitscore cutoff value. This can be set with LCA parameter [minscore].

The real value of MEGAN, though, is how the software automates the summary of sets of top hits with equally good (or nearly equally good) bitscores. The default is to analyze sets of top hits within 10% of the best bitscore. Set with LCA parameter [toppercent].

One of the quirks with MEGAN, is that is assumes it is parsing individual BLAST reports for each individual read. If you’re like me, running a BLAST analysis for a million reads takes too long even if the jobs are split and run across a cluster. Instead, I run BLAST analyses on clustered OTUs (singeltons and doubletons already excluded). When you do this, it is important to turn off the [Min Support Percent] and [Min Support] LCA parameters by setting their values to zero. The default behavior in MEGAN is that reads need to represent more than 0.1% of all reads to be counted as present. If you are processing are filtered OTUs, not reads, then this parameter may not be needed.

Another nice MEGAN feature is that the program can ignore top hits that are insufficiently identified such as ‘Uncultured fungus’ or ‘Environmental sequence’. Although this type of information can be useful, sometimes all you really want to do is compare your sequences against database sequences that have names. Though there are ways of getting around this problem during the BLAST step, this can also be addressed after BLAST in MEGAN from the Edit-> Preferences-> Taxon disabling menu.

What MEGAN seems to lack, is a way to get taxonomic lineage information (not just the taxonids) out of the program in a simple format that can be used to summarize taxonomic information at various ranks. This can’t really be done without having some scripting skills. This is how I do it.

Getting taxonomic lineage information for MEGAN taxonomic assignments:

I import my text formatted OTU BLAST (BLAST –outfmt 0) results into MEGAN6. These are the LCA parameters I like to use: min score = [a value ~ 2x the average query length], max expected = 0.01, min percent identity = 0, top percent = 1, min support percent = 0 (off), min support = 0, min-complexity filter = 0.3. I adjust the displayed tree to show the taxonomic rank I’m interested in, usually, ‘species’ since I can still summarize taxonomy to more inclusive taxonomic ranks later if needed. I select ‘All nodes’ and export a tab delimited CSV file. For this example, I called this file cat-ex.txt.

From the command-line, I extract a list of just the GenBank taxonomy id’s from the MEGAN file:

$ cat cat-ex.txt | awk ‘{print $2}’ > cat_taxid.txt

Then I use a Perl script to grab taxonomic information from the NCBI taxonomy database. This script requires the BioPerl module Bio::LITE::Taxonomy::NCBI. Updated versions of the names.dmp, nodes.dmp, and merged.dmp files from the NCBI also need to be available (taxdmp.tar.gz). These paths are hard coded and should be updated in the script.

$ perl getTaxonomicLineage.plx cat_taxid.txt

The output file is called taxid.parsed.

At this point I usually edit the original cat-ex.txt file to put interesting fields of information into their own columns. I use regular expressions in gvim to do this but there are other ways to do this as well. I like to separate the sample name and OTU size fields into their own tab delimited columns. This leaves column1 = sampleID, column2 = OTUid, column3 = OTUsize, column4 = taxonid. This step can be omitted if not needed.

Next step is to concatenate the edited cat-ex.txt file (that contains the taxonids from MEGAN) with the taxid.parsed file (that contains taxonomic lineage information from GenBank). This can be done on the command line.

$ pr –m –t –J cat-ex.txt taxid.parsed | gawk ‘{print $1”|”$2"|"$3"|"$4"|"$5"|"$6"|"$7"|"$8"|"$9"|"$10"|"$11"|"$12"|"$13"|"$14}' > cat_merged.txt

The blue and red color coding shows where each column of data comes from and where it ends up in the outfile cat_merged.txt. If you only had two columns of data as in the original cat-ex.txt file, then field numbers only need to go up to $12 (instead of $14).

The cat_merged.txt file can then be imported into Excel as a tab delimited file, adding the extra “|” as a separator. This data can be turned into a table and filters can be used to easily sort various columns (sample name, OTUsize, taxonomic ranks) for further processing.

Take home message:

This method should help to reduce the number of false positives that would otherwise arise from using the simple top BLAST hit approach. This is especially true if taxonomic assignments are summarized to more inclusive taxonomic ranks.