#generate the empty MAKER control files
maker -CTL
##editing repeatmasker output file for MAKER
#convert the .out file to gff3 format
/uufs/chpc.utah.edu/common/home/u6007910/bin/RepeatMasker/util/rmOutToGFF3.pl ./melissa_ref.fasta.out > melissa_ref.fasta.out.gff3
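#keep only complex repeats: drop satellites, simple repeats (")n") and low-complexity ("-rich") entries
grep -v -e "Satellite" -e ")n" -e "-rich" melissa_ref.fasta.out.gff3 > melissa_ref.complex.gff3
#append a unique ID attribute to every feature of the filtered file so MAKER can read it
cat melissa_ref.complex.gff3 | perl -ane '$id; if(!/^\#/){@F = split(/\t/, $_); chomp $F[-1];$id++; $F[-1] .= "\;ID=$id"; $_ = join("\t", @F)."\n"} print $_' > melissa_ref.reformat.gff3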
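A quick sanity check on the reformatted file can be sketched as below. This is only a sketch: the function name gff_missing_ids is made up for these notes, and it assumes the file is tab-delimited GFF3 with the attributes in column 9.

```shell
# Count feature lines (non-comment, 9 columns) and how many of them
# lack an ID= attribute in column 9. gff_missing_ids is a hypothetical helper.
gff_missing_ids() {
    awk -F'\t' '
        !/^#/ && NF >= 9 { n++; if ($9 !~ /(^|;)ID=/) missing++ }
        END { printf "%d features, %d without ID\n", n, missing }
    ' "$1"
}

# Only runs when the file from the step above is present.
if [ -f melissa_ref.reformat.gff3 ]; then
    gff_missing_ids melissa_ref.reformat.gff3
fi
```

If the second number is not 0, the perl step did not touch every feature line.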
### round 2 #####################
#before this step I copied the log file to melissa_mod.log, corrected the directory entries for scaffolds 1631 and 1638, and changed their status to FINISHED
fasta_merge -d melissa_mod.log
gff3_merge -d melissa_mod.log
#SNAP training set: keep gene models with AED <= 0.25 (-x) and protein length >= 50 aa (-l)
maker2zff -x 0.25 -l 50 ../../melissa_round1.maker.output/melissa_mod.log.all.gff
#fathom
fathom genome.ann genome.dna -gene-stats > gene-stats.log
fathom genome.ann genome.dna -validate > validate.log 2>&1
fathom genome.ann genome.dna -categorize 1000 > categorize.log #creates alt, err, old, uni and wrn files
fathom uni.ann uni.dna -export 1000 -plus > uni-plus.log #creates export files
# create the training parameters
mkdir params
cd params
forge ../export.ann ../export.dna > ../forge.log 2>&1
cd ..
hmm-assembler.pl melissa params > melissa_r1_length50_aed0.25.hmm
#### training augustus
I used the Augustus training output from the BUSCO insecta run and copied its files from the BUSCO folder to the maker/augustus folder:
cp ../../busco/run_insecta/augustus_output/retraining_parameters/* ./
#rename all the files, replacing the BUSCO run name with the species name
for file in * ; do mv -- "$file" "${file//BUSCO_insecta_1988554456/lycaeides_melissa}"; done
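The rename can also be written as a reusable function with a follow-up check, sketched below. Both function names are made up for these notes. The check is there because some of the retraining files may still mention the old BUSCO run name inside their contents; whether they do in this data set is an assumption to verify, not something recorded here.

```shell
# Rename files in DIR whose name contains OLD, replacing the first
# occurrence of OLD with NEW. rename_prefix is a hypothetical helper.
rename_prefix() {
    local old="$1" new="$2" dir="${3:-.}"
    local path base
    for path in "$dir"/*"$old"*; do
        [ -e "$path" ] || continue            # glob matched nothing
        base="${path##*/}"
        mv -- "$path" "$dir/${base%%"$old"*}$new${base#*"$old"}"
    done
}

# List files in DIR whose contents still mention OLD.
files_mentioning() {
    grep -l -- "$1" "${2:-.}"/* 2>/dev/null || true
}

# Usage (not run automatically because it renames files):
#   rename_prefix BUSCO_insecta_1988554456 lycaeides_melissa .
#   files_mentioning BUSCO_insecta_1988554456 .
```

Any file listed by files_mentioning would need the old name replaced inside it as well.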
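#extract the round 1 EST, protein and repeat alignments from the MAKER GFF (no sequences) into separate files
awk '{ if ($2 == "est2genome") print $0}' melissa_round1.maker.output/melissa_mod.log.noseq.gff > melissa_r1_maker_est2genome.gff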
awk '{ if ($2 == "protein2genome") print $0}' melissa_round1.maker.output/melissa_mod.log.noseq.gff > melissa_r1_maker_protein2genome.gff
awk '{ if ($2 ~ "repeat") print $0}' melissa_round1.maker.output/melissa_mod.log.noseq.gff > melissa_r1_maker_repeats.gff
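The three awk commands above differ only in the source name, so they can be folded into one function. A sketch, assuming tab-delimited GFF with the source in column 2; gff_by_source is a name made up for these notes.

```shell
# Print GFF lines whose source (column 2) equals SRC, or, with a third
# argument of "match", contains SRC. gff_by_source is a hypothetical helper.
gff_by_source() {
    local src="$1" gff="$2" mode="${3:-exact}"
    awk -F'\t' -v src="$src" -v mode="$mode" '
        (mode == "exact" && $2 == src) || (mode == "match" && index($2, src) > 0)
    ' "$gff"
}

# Only runs when the round 1 GFF is present; writes the same output names as above.
gff=melissa_round1.maker.output/melissa_mod.log.noseq.gff
if [ -f "$gff" ]; then
    for src in est2genome protein2genome; do
        gff_by_source "$src" "$gff" > "melissa_r1_maker_${src}.gff"
    done
    gff_by_source repeat "$gff" match > melissa_r1_maker_repeats.gff
fi
```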
##Running augustus
I asked Anita to create a lycaeides_melissa folder in the config/species folder of the Augustus installation and to copy the retraining files into it: /uufs/chpc.utah.edu/sys/installdir/augustus/3.3/config/species/lycaeides_melissa
Then make the following change in the maker_opts_round2.ctl file: set augustus_species to lycaeides_melissa
In the maker_exe.ctl file: add the path to the augustus executable: /uufs/chpc.utah.edu/sys/installdir/augustus/3.3/src/augustus
In the bash script submitted to the cluster, add the config path: export AUGUSTUS_CONFIG_PATH="/uufs/chpc.utah.edu/sys/installdir/augustus/3.3/config"
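Before submitting the round 2 job it is worth confirming that Augustus can find the new species folder. A sketch of such a check; check_augustus_species is a name made up for these notes, and it only looks for the species parameters file under the config path.

```shell
# Check that CONFIG/species/SPECIES/SPECIES_parameters.cfg exists.
# CONFIG defaults to $AUGUSTUS_CONFIG_PATH. Hypothetical helper.
check_augustus_species() {
    local species="$1" config="${2:-${AUGUSTUS_CONFIG_PATH:-}}"
    local cfg="$config/species/$species/${species}_parameters.cfg"
    if [ -f "$cfg" ]; then
        echo "found $cfg"
    else
        echo "missing $cfg" >&2
        return 1
    fi
}

# Usage, after the export line from the submission script:
#   check_augustus_species lycaeides_melissa
```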
#######################################################
POST MAKER 2nd RUN
#generate an id mapping file using maker_map_ids
maker_map_ids --prefix melissa_ melissa_round2.all.gff > melissa_round2.all.map
(This creates a two-column tab-delimited file with the original id in column 1 and the new
id in column 2. --prefix is where you give your registered genome prefix. The optional
--justify value sets the number of digits in the number following the prefix; make sure
that you allow adequate places for the number of genes in the annotation set, e.g., if
you have 10,000 genes, --justify should be set to at least 5. The command above does not
pass --justify, so the tool's default width applies.)
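The smallest usable --justify value is simply the number of digits in the gene count. A sketch, assuming genes are the lines with "gene" in column 3 of the MAKER GFF; both function names are made up for these notes.

```shell
# Count lines with "gene" in column 3 of a tab-delimited GFF. Hypothetical helper.
count_genes() {
    awk -F'\t' '$3 == "gene" { n++ } END { print n + 0 }' "$1"
}

# Number of digits needed to write N, i.e. the minimum --justify value.
min_justify() {
    local n="$1"
    echo "${#n}"
}

# Only runs when the round 2 GFF is present.
if [ -f melissa_round2.all.gff ]; then
    genes=$(count_genes melissa_round2.all.gff)
    echo "$genes genes: --justify must be at least $(min_justify "$genes")"
fi
```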
#use the map file to change ids in the gff3 and fasta files (both tools edit the given file in place, hence the copy of the gff)
cp melissa_round2.all.gff melissa_round2.all.ids.gff
map_gff_ids melissa_round2.all.map melissa_round2.all.ids.gff
map_fasta_ids melissa_round2.all.map melissa_round2.all.maker.proteins.fasta
map_fasta_ids melissa_round2.all.map melissa_round2.all.maker.transcripts.fasta
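To confirm that the ids were actually rewritten, the FASTA headers can be checked against the prefix. A sketch; count_unmapped_headers is a name made up for these notes, and it assumes every renamed header starts with the prefix given to --prefix.

```shell
# Count FASTA headers that do not start with ">PREFIX". Hypothetical helper.
count_unmapped_headers() {
    local prefix="$1" fasta="$2"
    awk -v p=">$prefix" '/^>/ && index($0, p) != 1 { n++ } END { print n + 0 }' "$fasta"
}

# Only runs on the files from the step above when they are present.
for f in melissa_round2.all.maker.proteins.fasta melissa_round2.all.maker.transcripts.fasta; do
    if [ -f "$f" ]; then
        echo "$f: $(count_unmapped_headers melissa_ "$f") headers without the melissa_ prefix"
    fi
done
```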
##assigning putative gene function using maker and NCBI BLAST+
mkdir uniprot
#download the uniprot file from the website (http://www.uniprot.org) to the desktop, then copy it to the cluster
scp ./Downloads/uniprot_sprot.fasta.gz u6007910@kingspeak.chpc.utah.edu:/uufs/chpc.utah.edu/common/home/gompert-group1/data/lycaeides/dovetail_melissa_genome/Annotation/maker/melissa_round2.maker.output/uniprot
#on the cluster, decompress it so makeblastdb can read uniprot/uniprot_sprot.fasta
gunzip uniprot/uniprot_sprot.fasta.gz
1. Index the UniProt/Swiss-Prot multi-FASTA file using makeblastdb:
makeblastdb -in uniprot/uniprot_sprot.fasta -input_type fasta -dbtype prot
(creates 3 files: uniprot_sprot.fasta.phr, uniprot_sprot.fasta.pin and uniprot_sprot.fasta.psq)
2. BLAST the MAKER-generated protein FASTA file against UniProt/Swiss-Prot with BLASTP.
blastp -db uniprot/uniprot_sprot.fasta -query melissa_round2.all.maker.proteins.fasta -out maker2uni.blastp -evalue .000001 -outfmt 6 -num_alignments 1 -seg yes -soft_masking true -lcase_masking
Used the following bash script for this:
#!/bin/bash
#SBATCH -n 12
#SBATCH -N 1
#SBATCH -t 300:00:00
#SBATCH -p usubio-kp
#SBATCH -A usubio-kp
#SBATCH -J maker
module load maker
cd /uufs/chpc.utah.edu/common/home/gompert-group1/data/lycaeides/dovetail_melissa_genome/Annotation/maker/melissa_round2.maker.output
blastp -db uniprot/uniprot_sprot.fasta -query melissa_round2.all.maker.proteins.fasta -out maker2uni.blastp -evalue .000001 -outfmt 6 -num_alignments 1 -seg yes -soft_masking true -lcase_masking
The key parts of this BLAST command line are the tabular output format (-outfmt 6) and -num_alignments 1, which keeps alignments for only one database sequence per query. The output of this BLAST search is written to maker2uni.blastp.
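A short summary of the tabular BLAST output helps to see how many proteins got a Swiss-Prot hit. A sketch, assuming the standard 12-column -outfmt 6 layout with the query id in column 1; blast6_summary is a name made up for these notes. A query can occupy more than one line when its best subject has several HSPs, so lines and queries are counted separately.

```shell
# Report the number of lines (HSPs) and of distinct query ids (column 1)
# in a tabular BLAST file. blast6_summary is a hypothetical helper.
blast6_summary() {
    awk -F'\t' '
        { hsp++; if (!seen[$1]++) q++ }
        END { printf "%d HSP lines, %d queries with a hit\n", hsp, q }
    ' "$1"
}

# Only runs when the BLAST output is present.
if [ -f maker2uni.blastp ]; then
    blast6_summary maker2uni.blastp
fi
```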
3. Add protein homology data to the MAKER GFF3 and FASTA files
maker_functional_gff uniprot/uniprot_sprot.fasta maker2uni.blastp melissa_round2.all.ids.gff > melissa_functional_blast.gff
maker_functional_fasta uniprot/uniprot_sprot.fasta maker2uni.blastp melissa_round2.all.maker.proteins.fasta > melissa_proteins_functional_blast.fasta
############ Annotating using Interproscan ############################
1. Download and extract InterProScan (https://github.com/ebi-pf-team/interproscan/wiki/HowToDownload). Also look at the how-to-install link on that page for the installation requirements.
mkdir interproscan
cd interproscan
wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.32-71.0/interproscan-5.32-71.0-64-bit.tar.gz
wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.32-71.0/interproscan-5.32-71.0-64-bit.tar.gz.md5
# Recommended checksum to confirm the download was successful:
md5sum -c interproscan-5.32-71.0-64-bit.tar.gz.md5
# Must return *interproscan-5.32-71.0-64-bit.tar.gz: OK*
# If not - try downloading the file again as it may be a corrupted copy.
Extract the tarball:
tar -pxvzf interproscan-5.32-71.0-*-bit.tar.gz
# where:
# p = preserve the file permissions
# x = extract files from an archive
# v = verbosely list the files processed
# z = filter the archive through gzip
# f = use archive file
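The checksum and extraction steps can be chained so that nothing is unpacked from a corrupted download. A sketch, assuming the .md5 file sits next to the archive and carries the archive's name plus ".md5"; verify_and_extract is a name made up for these notes.

```shell
# Verify ARCHIVE against ARCHIVE.md5 and extract it only if the check passes.
# Runs inside the archive's directory because md5sum -c resolves the file
# name listed in the .md5 file relative to the current directory.
verify_and_extract() {
    local archive="$1"
    (
        cd "$(dirname "$archive")" &&
        md5sum -c "$(basename "$archive").md5" &&
        tar -pxzf "$(basename "$archive")"
    )
}

# Usage:
#   verify_and_extract interproscan-5.32-71.0-64-bit.tar.gz
```

The same call should work for any archive that ships with a matching .md5 file.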
2. Installing Panther Models
InterProScan 5 includes the Panther member database analysis.
Before Installing Panther Data
First ensure you have extracted the distribution of InterProScan 5
The path where this is extracted will be referred to below as [InterProScan5 home].
Download the Panther model data:
Download the latest Panther data files from the FTP site into the [InterProScan5 home]/data/ directory:
cd [InterProScan5 home]/data/
wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/data/panther-data-12.0.tar.gz
wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/data/panther-data-12.0.tar.gz.md5
md5sum -c panther-data-12.0.tar.gz.md5
# This must return *panther-data-12.0.tar.gz: OK*
# If not - try downloading the file again as it may be a corrupted copy.
Extract the Panther data files to the required location:
tar -pxvzf panther-data-12.0.tar.gz
3. Running interproscan
Testing if interproscan is running
module load python3
cd interproscan-5.32-71.0
./interproscan.sh -i test_proteins.fasta -f tsv
./interproscan.sh -i test_proteins.fasta -f tsv -dp
The first test should create an output file with the default file name test_proteins.fasta.tsv, and the second should create test_proteins.fasta_1.tsv (since the default filename already exists).
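To see which member databases produced hits in a test run, the TSV output can be tallied by analysis name. A sketch, assuming the InterProScan TSV layout in which column 4 holds the analysis (for example Pfam or PANTHER); ipr_hits_by_analysis is a name made up for these notes.

```shell
# Print "count<TAB>analysis" for column 4 of an InterProScan TSV file,
# most frequent analysis first. ipr_hits_by_analysis is a hypothetical helper.
ipr_hits_by_analysis() {
    awk -F'\t' '{ c[$4]++ } END { for (a in c) print c[a] "\t" a }' "$1" |
        sort -k1,1nr -k2,2
}

# Only runs when the test output is present.
if [ -f test_proteins.fasta.tsv ]; then
    ipr_hits_by_analysis test_proteins.fasta.tsv
fi
```

If an installed database such as PANTHER is absent from the list, its data files may not be in place.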