# load modules
module load hmmer3/3.0 raxml/7.5.4 emboss/6.4.0 amphora2/2.0

#copy example data

cp -r /gpfs/runtime/opt/amphora2/2.0/AMPHORA2-master/TestData .

1. Marker identification
Use to identify bacterial and/or archaeal marker sequences. Given a sequence file, this program will identify markers from the input sequences and generate a protein fasta file for each marker gene in your working directory. For example, rpoB.pep, rpsJ.pep. When DNA input sequences are used, this program first identifies ORFs longer than 100 bp in all six reading frames, then scans the translated peptide sequences for the phylogenetic markers. sequence-file

        -DNA: input sequences are DNA. Default: no.
        -Evalue: HMMER evalue cutoff. Default: 1e-3 
        -Bacteria: input sequences are Bacterial sequences
        -Archaea: input sequences are Archaeal sequences
        -ReferenceDirectory: the file directory that contain the reference alignments, hmms and masks. Default: /home/foo/AMPHORA2/Marker
        -Help: print help       

1a. Identify phylogenetic markers from the E. coli proteome.  -Bacteria TestData/ecoli.pep

1b. Identify phylogenetic markers from the E. coli genome assembly
        -Bacteria -DNA TestData/ecoli.fasta

If AMPHORA2 has been installed correctly, at the end of the run for example 1a or 1b, you should see 31 marker protein sequences (*.pep) in your working directory.

1c. If you want to identify phylogenetic markers from metagenomic sequence reads (e.g., 454 reads) of a mixed bacterial and archaeal population, -DNA metagenomic.fasta

However, if you know your input sequences only contain bacterial or archaeal sequences, then use the -Bacterial or -Archaeal option. This makes the AMPHORA2 run faster and the results will be more accurate. 

2. Marker sequence alignment and trimming
This program will align, mask and trim the marker protein sequences. Output will be aligned/trimmed sequences. For example, rpoB.aln, rpsJ.aln and their corr
esponding alignment masks. The alignment masks can be used to weigh the alignment columns with the RAxML's -a option (for untrimmed alignment only).

        -Trim:  trim the alignment using masks embedded with the marker database. Default: no
        -Cutoff: the Zorro masking confidence cutoff value (0 - 1.0; default: 0.4);
        -ReferenceDirectory: the file directory that contain the reference alignments, hmms and masks. Default: /home/foo/AMPHORA2/Marker
        -Directory: the file directory where sequences to be aligned are located. Default: current directory
        -OutputFormat: output alignment format. Default: phylip. Other supported formats include: fasta, stockholm, selex, clustal
        -WithReference: keep the reference sequences in the alignment. Default: no
        -Help:  print help 

Example: -WithReference -OutputFormat phylip

If AMPHORA2 has been installed correctly, at the end of the run, you should see an alignment file (*.aln) and a mask file (*.mask) for each of the 31 ecoli m
arker proteins in your working directory.

It is important to know that in order to run the script properly, the needs to be run using  '-WithReference -OutputFormat
 phylip' options.

3. Phylotyping
Use to assign phylotypes for each identified marker sequences. This program will assign each identified marker sequence a phylotype using pars
imony method or the evolutionary placement algorithm of RAxML. The marker sequences need to be aligned first with the reference sequences using MarkerAlignTr (see above). The alignments should be in the phylip format. 

        -Method: use 'maximum likelihood' (ml) or 'maximum parsimony' (mp) for phylotyping. Default: ml
        -CPUs: turn on the multiple thread option and specify the number of CPUs/cores to use. Important: Make sure raxmlHPC-PTHREADs is installed. If the nu
mber specified here is larger than the number of cores that are free and available, it will actually slow down the script.
        -Help: print help;  

assign phylotypes using the maximum likelihood method -CPUs 6 > phylotype.result

Again, if AMPHORA2 is installed correctly, you should see something like this as the output:

Query   Marker  Superkingdom    Phylum  Class   Order   Family  Genus   Species
NP_417776-NC_000913     rplB    Bacteria(0.96)  Proteobacteria(0.96)    Gammaproteobacteria(0.96)       Enterobacteriales(0.96) Enterobacteriaceae(0.96)     
   Escherichia(0.50)       Escherichia coli(0.48)
NP_417097-NC_000913     rplS    Bacteria(0.96)  Proteobacteria(0.96)    Gammaproteobacteria(0.96)       Enterobacteriales(0.96) Enterobacteriaceae(0.96)     
   Escherichia(0.80)       Escherichia coli(0.80)
NP_414711-NC_000913     rpsB    Bacteria(0.96)  Proteobacteria(0.96)    Gammaproteobacteria(0.96)       Enterobacteriales(0.96) Enterobacteriaceae(0.96)     
   Escherichia(0.80)       Escherichia coli(0.78)

The phylotyping results are tab-delimited. The numbers within the parentheses are the confidence scores of the assignment. 

If you see the following error message when you run, you can delete the 'name2id' file in the folder AMPHORA2_home/Taxonomy/ and run the scrip
t again.

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: No such file or directory AMPHORA2_home/Taxonomy/names2id
STACK: Error::throw
STACK: Bio::Root::Root::throw /lib/site_perl/5.16.3/Bio/Root/
STACK: Bio::DB::Taxonomy::flatfile::_db_connect /lib/site_perl/5.16.3/Bio/DB/Taxonomy/
STACK: Bio::DB::Taxonomy::flatfile::new /lib/site_perl/5.16.3/Bio/DB/Taxonomy/
STACK: Bio::DB::Taxonomy::new /lib/site_perl/5.16.3/Bio/DB/
STACK: AMPHORA2_home/Scripts/