daio

DAIO

Overview

DAIO (Domain-architecture Aware Inference of Orthologs) is a computational approach for the analysis of genomes by combining phylogenetic and protein domain-architecture information. With this, is it possible to perform systematic phylogenetic and protein domain architecture-based studies, encompassing the entire proteomes of a group of species to define Strict Ortholog Groups (SOGs). Besides assessing the taxonomic distribution for each protein, this allows to computationally infer gene duplications and performs a protein domain architecture analysis for every protein family.

Custom programs needed to run DAIO

  • surfacing

  • tap.rb

    • "mse -- multi_sequence_extractor"

    • "x_phylo.sh" (shell script)

  • GSDI

    • phylo_pl.pl

    • "phylogenies_decorator"

    • Optional: RIO

All these programs are part of forester, and can be cloned or downloaded from GitHub.

Besides these programs, DAIO analysis also makes use of standard programs used in genomics and phylogenetics: HMMER, MAFFT, CD-HIT, TREE-PUZZLE, FastME, optional (for RIO analysis): MrBayes

Pipline

Example bashrc entries are here.

Input: Entire genomes/proteomes as protein/amino acid Fasta files (one Fasta file per species, in cases of viruses, pool subspecies and and strains into - quasi - species, using fasta_split).

Example input data (9 human Herpesvirus proteomes) can downloaded from here

An entire working directory (including results) can be downloaded here

1. Hmmscan Pfam analysis

Use hmmscan from Sean Eddy's HMMER package to analysis entire genomes/proteomes (in Fasta format) against Pfam domains. Need to download Pfam database for this step as well as install the HMMER package.

Command line example:

% hmmscan --domtblout Human_herpesvirus_1_hmmscan -E 10 path/to/Pfam-A.hmm Human_herpesvirus_1.fasta

Example results of this step (for 9 human Herpesvirus proteomes) can downloaded from here

2. Domain architecture analysis

Use surfacing Java program from forester package to analyse hmmscan output.

As mentioned above, an entire example working directory and results (in "hv_outdir") can downloaded from here

It is best to have an alias to the surfacing Java program in your .bashrc, like so:

alias surfacing='java -Xmx40968m -cp path/to/forester.jar org.forester.application.surfacing'

Example command line (in "SURFACING_WORKDIR" directory if using the example data):

% surfacing -species_tree=herpesvirus_tree.xml -no_eo -ie=1e-6 -mrel=0.4 -mo=5 -dufs -genomes=hmmscan_genomes_loc.txt -write_DA_maps -obtain_DA_names_from_db -out_dir=hv_outdir -o=hv

In this step, downloading names for domain architectures from Uniprot ("obtain_DA_names_from_db" option) is very slow.

Thus, in subsequent runs, it is possible to use the generated domain architecture to protein name map ("hv_DA_NAME_MAP.txt" in "hv_outdir" in the example) to avoid having to download domain architecture names again, like so:

% surfacing -species_tree=herpesvirus_tree.xml -no_eo -ie=1e-6 -mrel=0.4 -mo=5 -dufs -genomes=hmmscan_genomes_loc.txt -write_DA_maps -input_DA_name_map=hv_DA_NAME_MAP.txt -out_dir=hv_outdir2 -o=hv2

Arguments and options for surfacing:

  • -species_tree=<species tree in phyloXML format>: This tree establishes the evolutionary relationships of the genomes being analyzed. The taxonomy codes in the external nodes (e.g. "<code>HHV1</code>" need to match the entries in genomes location file set with option "-genomes=" (e.g. "../HMMSCAN_GENOMES/Human_herpesvirus_1_hmmscan HHV1"). In addition, the scientific names of internal nodes (e.g. "<scientific_name>Roseolovirus</scientific_name>") are used for creating names for domain architectures

  • -p2g=<file>: Pfam to GO mapping file, default is "pfam2go.txt"

  • -obo=<file>: GO terms file (OBO format), default is "go.obo"

  • -no_eo: To ignore engulfed lower confidence domains

  • -ie=<decimal>: max (inclusive) iE-value

  • -mrel=<decimal>: minimal relative domain length (compared to Pfam model length) to be included in analysis (recommended value: 0.4)

  • -mo=<integer>: maximal allowed domain overlap (recommended value: 5)

  • -dufs: to not ignore DUFs (domains with unknown function)

  • -genomes=<file>: this file lists the hmmscan tabular output files to use as input and the corresponding taxonomy codes

  • -write_DA_maps: to output domain architecture map file

  • -obtain_DA_names_from_db: to obtain names for domain architectures from Uniprot (slow)

  • -input_DA_name_map=<file>: to read names for domain architectures from file

  • -out_dir=<dir>: name for the output directory

  • -o=<base name>: the base name for the output files

Results

For DAIO purposes, the most important output file is the DA_SPECIES_IDS_MAP file (called hv_DA_SPECIES_IDS_MAP.txt in the example). The rows of this file are:

  1. Domain architecture using Pfam domain names, examples: "7tm_1", "Cyclin_N--K-cyclin_vir_C"

  2. Name for this domain architecture, with suffix indicating taxonomic distribution, examples: "G-protein coupled receptor homolog_herpesviridae", "glycoprotein L_BETAHERPESVIRINAE"

  3. Taxonomic code, example: "HCMV"

  4. Sequence identifiers, examples: "AHJ82261,AHJ82380,AHJ82381"

3. Domain architecture sequence extraction

This step is to obtain the actual molecular sequences corresponding to a domain architecture (DA).

In essence, this step is to go from line

COWPX Cowpox_virus|ATB55174.1|CPXV_CheHurley_DK_2012|NA|Cheetah|NA|CPXV132_protein zf-FCS--Pox_TAA1 /49-150/

in file "zf-FCS--Pox_TAA1_DA.prot"

to:

>Cowpox_virus|ATB55174.1|CPXV_CheHurley_DK_2012|NA|Cheetah|NA|CPXV132_protein [49-150] [zf-FCS--Pox_TAA1_DA] [COWPX] DKCWFCNQDLVFKPISIETFKGGEVGYFCSKICRDSLASMVKSHVALREEPKISLLPLVF

YEDKEKVINTINLLRDKDGVYGSCYFKENSQIIDISLRSLL

In the surfacing output directory, cd into the "_DAS" directory and execute the following command:

% mse prot . FL_seqs DA_seqs ../../genome_locations.txt

4. Optional: CD-HIT

Use CD-HIT to remove near identical sequences: x_cdhit.sh

In the surfacing output directory, cd into the "_DAS" directory and execute the following command:

% sh x_cdhit.sh DA_seqs DA_seqs_095

5. Phylogenetic and phylogenomic analysis of domain architecture sequences

Use "x_phylo.sh" shell script (GitHub) to perform phylogenetic analysis.

"x_phylo.sh" has some dependencies. Paths to these programs need to be set in the following variables:

    • TAP: taxonomy processor (written in Ruby) tap.rb

    • MSA_PRO: multiple sequence alignment processor (written in Ruby) msa_pro.rb

    • MAFFT: famous multiple sequence inference program MAFFT

    • PHYLO_PL: phylogenetic inference pipeline (written in Perl 5): phylo_pl.pl

    • PHYLOGENIES_DECORATOR: phylogenies decorator phylogenies_decorator.rb

    • GSDI: GSDI

Example use:

In the surfacing output directory, cd into the "_DAS" directory and execute the following command:

% sh x_phylo.sh DA_seqs_095 ../../species_tree.xml DAS_095_trees

Optional: RIO (GitHub) analysis

Last updated: 2019-06-26