maker

maker -CTL

##editing repeatmasker output file for MAKER

#convert the .out file to gff3 format

/uufs/chpc.utah.edu/common/home/u6007910/bin/RepeatMasker/util/rmOutToGFF3.pl ./melissa_ref.fasta.out > melissa_ref.fasta.out.gff3

grep -v -e "Satellite" -e ")n" -e "-rich" melissa_ref.fasta.out.gff3 > melissa_ref.complex.gff3

cat melissa_ref.complex.gff3 | perl -ane 'if(!/^\#/){@F = split(/\t/, $_); chomp $F[-1];$id++; $F[-1] .= "\;ID=$id"; $_ = join("\t", @F)."\n"} print $_' > melissa_ref.reformat.gff3
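A quick sanity check of the filter step can be sketched on a toy file. The four GFF3 records below are hypothetical (made up for illustration, not taken from melissa_ref.fasta.out.gff3); the sketch only assumes that the repeat name appears somewhere on each line, as it does in rmOutToGFF3.pl output.

```shell
# Toy GFF3 with hypothetical records; on real data use melissa_ref.fasta.out.gff3 instead
tmpdir=$(mktemp -d)
{
  printf '##gff-version 3\n'
  printf 'scaf1\tRepeatMasker\tdispersed_repeat\t10\t90\t250\t+\t.\tTarget "Motif:Helitron-1" 1 80\n'
  printf 'scaf1\tRepeatMasker\tdispersed_repeat\t100\t130\t20\t+\t.\tTarget "Motif:(CA)n" 1 30\n'
  printf 'scaf1\tRepeatMasker\tdispersed_repeat\t200\t220\t15\t+\t.\tTarget "Motif:A-rich" 1 20\n'
  printf 'scaf2\tRepeatMasker\tdispersed_repeat\t5\t400\t900\t-\t.\tTarget "Motif:Satellite-toy" 1 395\n'
} > "$tmpdir/toy.gff3"

# Same filter as in the notes: drop satellites, simple repeats "(XX)n" and low-complexity "X-rich"
grep -v -e "Satellite" -e ")n" -e "-rich" "$tmpdir/toy.gff3" > "$tmpdir/toy.complex.gff3"

# Count non-comment records before and after filtering
total=$(grep -vc '^#' "$tmpdir/toy.gff3")
kept=$(grep -vc '^#' "$tmpdir/toy.complex.gff3")
echo "kept $kept of $total repeat records"
```

Only the Helitron record survives here, which is the intended behaviour: MAKER soft-masks simple repeats itself, so only complex repeats are passed in.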

### round 2 #####################

#before this, I copied the log file to melissa_mod.log, changed the directory entries for scaffolds 1631 and 1638, and changed their status to finished

fasta_merge -d melissa_mod.log

gff3_merge -d melissa_mod.log

#trying with -x 0.25

maker2zff -x 0.25 -l 50 ../../melissa_round1.maker.output/melissa_mod.log.all.gff
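To get a feel for what the -x 0.25 cutoff does, the sketch below counts mRNA records whose _AED attribute (written by MAKER in column 9) is at most 0.25. The toy records are hypothetical. The sketch ignores the -l 50 length filter and any other criteria maker2zff applies, so on real data it only gives a rough upper bound.

```shell
# Toy MAKER-style GFF3 with hypothetical records; on real data point awk at the merged .all.gff
tmpdir=$(mktemp -d)
{
  printf 'scaf1\tmaker\tgene\t100\t900\t.\t+\t.\tID=toy-gene-1\n'
  printf 'scaf1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=toy-mRNA-1;Parent=toy-gene-1;_AED=0.10;_eAED=0.10\n'
  printf 'scaf1\tmaker\tmRNA\t2000\t2900\t.\t-\t.\tID=toy-mRNA-2;Parent=toy-gene-2;_AED=0.25;_eAED=0.30\n'
  printf 'scaf2\tmaker\tmRNA\t50\t700\t.\t+\t.\tID=toy-mRNA-3;Parent=toy-gene-3;_AED=0.40;_eAED=0.40\n'
} > "$tmpdir/toy.gff"

max_aed=0.25
# Pull the numeric value after "_AED=" from column 9 and compare it to the cutoff
passing=$(awk -F'\t' -v max="$max_aed" '
  $3 == "mRNA" && match($9, /_AED=[0-9.]+/) {
    aed = substr($9, RSTART + 5, RLENGTH - 5)
    if (aed + 0 <= max + 0) n++
  }
  END { print n + 0 }' "$tmpdir/toy.gff")
echo "$passing mRNAs with AED <= $max_aed"
```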

#fathom

fathom genome.ann genome.dna -gene-stats > gene-stats.log

fathom genome.ann genome.dna -validate > validate.log 2>&1

fathom genome.ann genome.dna -categorize 1000 > categorize.log #creates alt, err, old, uni and wrn files

fathom uni.ann uni.dna -export 1000 -plus > uni-plus.log #creates export files

# create the training parameters

mkdir params

cd params

forge ../export.ann ../export.dna > ../forge.log 2>&1

cd ..

hmm-assembler.pl melissa params > melissa_r1_length50_aed0.25.hmm

#### training augustus

I used the augustus training output from the BUSCO insecta run. I copied the files from the BUSCO folder to the maker/augustus folder:

cp ../../busco/run_insecta/augustus_output/retraining_parameters/* ./

#change name of all the files

for file in * ; do mv "$file" "${file//BUSCO_insecta_1988554456/lycaeides_melissa}"; done

awk '{ if ($2 == "est2genome") print $0}' melissa_round1.maker.output/melissa_mod.log.noseq.gff > melissa_r1_maker_est2genome.gff

awk '{ if ($2 == "protein2genome") print $0}' melissa_round1.maker.output/melissa_mod.log.noseq.gff > melissa_r1_maker_protein2genome.gff

awk '{ if ($2 ~ "repeat") print $0}' melissa_round1.maker.output/melissa_mod.log.noseq.gff > melissa_r1_maker_repeats.gff
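Before splitting the GFF by source, it can help to see which values actually occur in column 2. A minimal sketch on hypothetical toy records; on real data, replace the toy file with melissa_round1.maker.output/melissa_mod.log.noseq.gff.

```shell
# Toy GFF3 with hypothetical records
tmpdir=$(mktemp -d)
{
  printf '##gff-version 3\n'
  printf 'scaf1\test2genome\texpressed_sequence_match\t10\t500\t.\t+\t.\tID=toy1\n'
  printf 'scaf1\test2genome\tmatch_part\t10\t200\t.\t+\t.\tID=toy2;Parent=toy1\n'
  printf 'scaf1\tprotein2genome\tprotein_match\t600\t900\t.\t-\t.\tID=toy3\n'
  printf 'scaf2\trepeatmasker\tmatch\t5\t80\t.\t+\t.\tID=toy4\n'
} > "$tmpdir/toy.noseq.gff"

# Tally records per source (column 2), skipping comment lines
counts=$(awk -F'\t' '!/^#/ && NF >= 9 { n[$2]++ } END { for (s in n) print s "\t" n[s] }' \
  "$tmpdir/toy.noseq.gff" | sort)
echo "$counts"
```

Any source matching "repeat" (for example repeatmasker) ends up in the repeats file produced by the third awk command above.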

##Running augustus

Asked Anita to create a lycaeides_melissa folder in the config/species folder for augustus and then asked her to copy the retraining files to that folder: /uufs/chpc.utah.edu/sys/installdir/augustus/3.3/config/species/lycaeides_melissa

Then make the following changes in the maker_opts_round2.ctl file: change augustus_species to lycaeides_melissa

In maker_exe.ctl file: add the path to the augustus executable: /uufs/chpc.utah.edu/sys/installdir/augustus/3.3/src/augustus

In the bash script to submit to the cluster, add the config path: export AUGUSTUS_CONFIG_PATH="/uufs/chpc.utah.edu/sys/installdir/augustus/3.3/config"
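A small pre-flight check for the submission script can catch a missing species folder before MAKER starts. This is only a sketch: the function name check_augustus_species is hypothetical, and it assumes the species folder sits at $AUGUSTUS_CONFIG_PATH/species/<name> as described above.

```shell
# Return 0 if the species directory exists under the given AUGUSTUS config path
# (check_augustus_species is a hypothetical helper, not part of augustus or maker)
check_augustus_species() {
  local config_path=$1 species=$2
  [ -d "$config_path/species/$species" ]
}

# Usage with the values from these notes
config_path="${AUGUSTUS_CONFIG_PATH:-/uufs/chpc.utah.edu/sys/installdir/augustus/3.3/config}"
if check_augustus_species "$config_path" lycaeides_melissa; then
  echo "species directory found"
else
  echo "species directory missing under $config_path/species" >&2
fi
```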

#######################################################

POST MAKER 2nd RUN

#generate an id mapping file using maker_map_ids

maker_map_ids --prefix melissa_ melissa_round2.all.gff > melissa_round2.all.map

(This creates a two-column tab-delimited file with the original id in column 1 and the new id in column 2. The --prefix is where you give your registered genome prefix. The value following --justify sets the number of digits in the number that follows the prefix; allow enough places for the number of genes in the annotation set, e.g., with 10,000 genes --justify should be at least 5. The command above does not pass --justify, so the maker_map_ids default applies.)
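The --justify rule amounts to counting the digits in the number of genes. A minimal sketch; min_justify is a hypothetical helper and the gene count is an example value, not the real count for this annotation.

```shell
# Minimum --justify width = number of digits in the gene count
# (min_justify is a hypothetical helper, not a MAKER tool)
min_justify() {
  local n_genes=$1
  echo "${#n_genes}"
}

# Example value; on real data something like:
#   awk -F'\t' '$3 == "gene" { n++ } END { print n + 0 }' melissa_round2.all.gff
n_genes=10000
echo "genes: $n_genes, minimum --justify: $(min_justify "$n_genes")"
```

A wider value only adds leading zeros, so rounding up is harmless.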

#use map file to change ids in gff3 and fasta file

cp melissa_round2.all.gff melissa_round2.all.ids.gff

map_gff_ids melissa_round2.all.map melissa_round2.all.ids.gff

map_fasta_ids melissa_round2.all.map melissa_round2.all.maker.proteins.fasta

map_fasta_ids melissa_round2.all.map melissa_round2.all.maker.transcripts.fasta

##assigning putative gene function using maker and NCBI BLAST+

mkdir uniprot

#download the uniprot file from the website (http://www.uniprot.org) to the desktop. Then copy it to the cluster.

scp ./Downloads/uniprot_sprot.fasta.gz u6007910@kingspeak.chpc.utah.edu:/uufs/chpc.utah.edu/common/home/gompert-group1/data/lycaeides/dovetail_melissa_genome/Annotation/maker/melissa_round2.maker.output/uniprot
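The copied file is gzip-compressed, while the makeblastdb step reads uniprot/uniprot_sprot.fasta, so a decompression step is presumably needed in between. That step is not recorded in these notes; the sketch below is an assumption about what it would look like, and decompress_if_present is a hypothetical helper.

```shell
# Decompress a .gz file if it is present (assumed step, not in the original notes)
decompress_if_present() {
  local gz=$1
  if [ -f "$gz" ]; then
    gunzip -f "$gz"   # replaces file.gz with file
  fi
}

# Run from the melissa_round2.maker.output directory on the cluster
decompress_if_present uniprot/uniprot_sprot.fasta.gz
```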

1. Index the UniProt/Swiss-Prot multi-FASTA file using makeblastdb:

makeblastdb -in uniprot/uniprot_sprot.fasta -input_type fasta -dbtype prot

(creates 3 files uniprot_sprot.fasta.phr uniprot_sprot.fasta.pin uniprot_sprot.fasta.psq)

2. BLAST the MAKER-generated protein FASTA file to UniProt/SwissProt with BLASTP.

blastp -db uniprot/uniprot_sprot.fasta -query melissa_round2.all.maker.proteins.fasta -out maker2uni.blastp -evalue .000001 -outfmt 6 -num_alignments 1 -seg yes -soft_masking true -lcase_masking

Used the following bash script for this:

#!/bin/bash

#SBATCH -n 12

#SBATCH -N 1

#SBATCH -t 300:00:00

#SBATCH -p usubio-kp

#SBATCH -A usubio-kp

#SBATCH -J maker

module load maker

cd /uufs/chpc.utah.edu/common/home/gompert-group1/data/lycaeides/dovetail_melissa_genome/Annotation/maker/melissa_round2.maker.output

blastp -db uniprot/uniprot_sprot.fasta -query melissa_round2.all.maker.proteins.fasta -out maker2uni.blastp -evalue .000001 -outfmt 6 -num_alignments 1 -seg yes -soft_masking true -lcase_masking

The key parts of this BLAST command line are the tabular output format (-outfmt 6) and -num_alignments 1, which limits the report to one database sequence per query protein. The output is written to maker2uni.blastp.
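With -outfmt 6 and no custom column list, each row has the 12 default columns: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore. A minimal sketch for counting how many query proteins got a hit; the three toy rows are hypothetical, and on real data the input would be maker2uni.blastp.

```shell
# Toy tabular BLAST output with hypothetical rows (two HSPs for protA, one for protB)
tmpdir=$(mktemp -d)
{
  printf 'protA\tsp|P00001|TOY1\t45.2\t200\t100\t3\t1\t200\t5\t204\t1e-50\t180\n'
  printf 'protA\tsp|P00001|TOY1\t38.0\t80\t45\t2\t210\t290\t220\t299\t2e-10\t60.1\n'
  printf 'protB\tsp|P00002|TOY2\t71.9\t150\t40\t1\t3\t152\t1\t150\t3e-70\t240\n'
} > "$tmpdir/toy.blastp"

# A query can appear on several rows (one per HSP), so count distinct query ids
n_queries=$(cut -f1 "$tmpdir/toy.blastp" | sort -u | wc -l | tr -d ' ')
echo "$n_queries query proteins with a hit"
```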

3. Add protein homology data to the MAKER GFF3 and FASTA files

maker_functional_gff uniprot/uniprot_sprot.fasta maker2uni.blastp melissa_round2.all.ids.gff > melissa_functional_blast.gff

maker_functional_fasta uniprot/uniprot_sprot.fasta maker2uni.blastp melissa_round2.all.maker.proteins.fasta > melissa_proteins_functional_blast.fasta

############ Annotating using Interproscan ############################

1. Download and extract interproscan (https://github.com/ebi-pf-team/interproscan/wiki/HowToDownload). Also look at the how-to-install link on this page to see the installation requirements.

mkdir interproscan

cd interproscan

wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.32-71.0/interproscan-5.32-71.0-64-bit.tar.gz

wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.32-71.0/interproscan-5.32-71.0-64-bit.tar.gz.md5

# Recommended checksum to confirm the download was successful:

md5sum -c interproscan-5.32-71.0-64-bit.tar.gz.md5

# Must return *interproscan-5.32-71.0-64-bit.tar.gz: OK*

# If not - try downloading the file again as it may be a corrupted copy.

Extract the tarball:

tar -pxvzf interproscan-5.32-71.0-*-bit.tar.gz

# where:

# p = preserve the file permissions

# x = extract files from an archive

# v = verbosely list the files processed

# z = filter the archive through gzip

# f = use archive file

2. Installing Panther Models

InterProScan 5 includes the Panther member database analysis.

Before Installing Panther Data

First ensure you have extracted the distribution of InterProScan 5

The path where this is extracted will be referred to below as [InterProScan5 home].

Download the Panther model data:

Download the latest Panther data files from the FTP site into the [InterProScan5 home]/data/ directory:

cd [InterProScan5 home]/data/

wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/data/panther-data-12.0.tar.gz

wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/data/panther-data-12.0.tar.gz.md5

md5sum -c panther-data-12.0.tar.gz.md5

# This must return *panther-data-12.0.tar.gz: OK*

# If not - try downloading the file again as it may be a corrupted copy.

Extract the Panther data files to the required location:

tar -pxvzf panther-data-12.0.tar.gz

3. Running interproscan

Testing if interproscan is running

module load python3

cd interproscan-5.32-71.0

./interproscan.sh -i test_proteins.fasta -f tsv

./interproscan.sh -i test_proteins.fasta -f tsv -dp

The first test should create an output file with the default file name test_proteins.fasta.tsv, and the second would then create test_proteins.fasta_1.tsv (since the default filename already exists).