We encountered two common issues with the tomato and potato genome assemblies that had to be addressed before we could use STAR.
Issue 1) The chromosome names in the assembly and annotation file didn't match.
Issue 2) The annotation files were .gff format, but STAR needs .gtf format.
Here we present the solutions we used. You do not need to repeat these steps for the class, but we wanted to provide these scripts in case you encounter the issues in your own research.
zcat original-genome.fasta.gz | grep ">"
Output:
>1
>2
>3
>4
>5
>6
>7
>8
>9
>10
>11
>12
>0
zcat original-genome.gff3.gz | head
Output:
chr00 maker gene 17008 19024 . + . ID=Sopim_TS265_00G000001;Name=maker-chr00-pred_gff_AUGUSTUS-gene-18.1305;score=0.12
chr00 maker mRNA 17008 19024 . + . ID=Sopim_TS265_00T000001.1;Parent=Sopim_TS265_00G000001;ID=Sopim_TS265_00T000001.1;_AED=0.49;_eAED=1.00;_QI=0|0|0|1|1|1|7|0|451;score=0.12
chr00 maker exon 17008 17263 . + . Parent=Sopim_TS265_00T000001.1
chr00 maker exon 17367 17494 . + . Parent=Sopim_TS265_00T000001.1
chr00 maker exon 17654 17858 . + . Parent=Sopim_TS265_00T000001.1
chr00 maker exon 17966 18114 . + . Parent=Sopim_TS265_00T000001.1
chr00 maker exon 18232 18449 . + . Parent=Sopim_TS265_00T000001.1
chr00 maker exon 18524 18838 . + . Parent=Sopim_TS265_00T000001.1
chr00 maker exon 18940 19024 . + . Parent=Sopim_TS265_00T000001.1
chr00 maker CDS 17008 17263 . + 0 Parent=Sopim_TS265_00T000001.1
Result: The chromosome names do not match. The assembly uses bare numbers (1, 2, ..., 12, 0), while the annotation uses names such as chr00.
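1) First make a 'new_chr_names.txt' file with one name per line, listed in the same order as the sequences appear in the assembly. It should look like this: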
chr01
chr02
chr03
chr04
chr05
chr06
chr07
chr08
chr09
chr10
chr11
chr12
chr00
You can use a grep command to grab the chromosome names and save them in a text file. Then you can manually edit that file.
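The grep-and-edit step can also be sketched on the command line. This is a minimal illustration, not the exact commands we ran: it assumes the assembly headers are plain numbers (as in the output above), and `toy-genome.fasta.gz` and `old_chr_names.txt` are hypothetical file names used only for the demonstration.

```shell
# Toy stand-in for the real assembly (hypothetical contents, for illustration only)
printf '>1\nACGT\n>2\nGGCC\n>12\nTTAA\n>0\nCCGG\n' | gzip > toy-genome.fasta.gz

# Grab the header lines and strip the leading ">" (gzip -dc is equivalent to zcat)
gzip -dc toy-genome.fasta.gz | grep '^>' | sed 's/^>//' > old_chr_names.txt

# Assumption: every old name is a bare number, so it can be zero-padded
# (1 -> chr01, 0 -> chr00). The original order is preserved.
awk '{ printf "chr%02d\n", $1 }' old_chr_names.txt > new_chr_names.txt

cat new_chr_names.txt
```

For the toy input, new_chr_names.txt ends up holding chr01, chr02, chr12 and chr00, in that order. If your headers carry extra text after the name, it is probably safer to edit the file by hand as described above.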
2) Then run this script
#!/bin/tcsh
#BSUB -J rename_chr #job name
#BSUB -W 4:0 #time for job to complete
#BSUB -o rename_chr.%J.o #output file
#BSUB -e rename_chr.%J.e #error file
# Unzip the assembly file
gunzip original-genome.fasta.gz
# This awk command replaces the chromosome names in the assembly, in order, with the names in new_chr_names.txt
awk 'NR == FNR { o[n++] = $0; next } /^>/ && i < n { $0 = ">" o[i++] } 1' new_chr_names.txt original-genome.fasta > new-genome.fasta
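Because the awk command replaces headers purely by position, it may be worth checking the result against the annotation before moving on. The sketch below runs the same awk command on toy files and then compares the sequence names in the renamed FASTA with the first column of the GFF. All `toy*` files, `fasta_names.txt` and `gff_names.txt` are hypothetical names for illustration.

```shell
# Toy inputs (hypothetical contents, for illustration only)
printf 'chr01\nchr02\nchr00\n' > toy_new_names.txt
printf '>1\nACGT\n>2\nGGCC\n>0\nTTAA\n' > toy.fasta
printf 'chr00\tmaker\tgene\t1\t4\t.\t+\t.\tID=g1\n'  > toy.gff3
printf 'chr01\tmaker\tgene\t1\t4\t.\t+\t.\tID=g2\n' >> toy.gff3
printf 'chr02\tmaker\tgene\t1\t4\t.\t+\t.\tID=g3\n' >> toy.gff3

# Same awk command as in the script: the first file fills the array o[],
# then each ">" line in the second file is replaced by the next stored name
awk 'NR == FNR { o[n++] = $0; next } /^>/ && i < n { $0 = ">" o[i++] } 1' toy_new_names.txt toy.fasta > toy.renamed.fasta

# Sequence names in the renamed assembly
grep '^>' toy.renamed.fasta | sed 's/^>//' | sort > fasta_names.txt

# Sequence names in the annotation (column 1, skipping comment lines)
grep -v '^#' toy.gff3 | cut -f1 | sort -u > gff_names.txt

# comm -3 prints names found in only one of the two files;
# no output means the two sets of names agree
comm -3 fasta_names.txt gff_names.txt
```

Any name printed by the last command appears in only one of the two files and would need to be fixed before building the STAR index.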
We used AGAT software to convert the given gff annotation files to the gtf format.
#!/bin/tcsh
#BSUB -J agat #job name
#BSUB -n 20 #number of cores
#BSUB -W 5:0 #time for job to complete
#BSUB -o agat_%J.out #output file
#BSUB -e agat_%J.err #error file
/usr/local/usrapps/bitcpt/agat/bin/agat_convert_sp_gff2gtf.pl --gff original-genome.gff3 -o original-genome.agat.gtf
#!/bin/tcsh
#BSUB -J agat_Gm #job name
#BSUB -n 20 #number of cores
#BSUB -W 5:0 #time for job to complete
#BSUB -o agat_Gm_%J.out #output file
#BSUB -e agat_Gm_%J.err #error file
/usr/local/usrapps/bitcpt/agat/bin/agat_convert_sp_gff2gtf.pl --gff ../../referenceGenomes/Glycine_max_Lee_v2/glyma.Lee.gnm2.ann1.1FNT.gene_models_main.gff3 -o ../../referenceGenomes/Glycine_max_Lee_v2/glyma.Lee.gnm2.ann1.1FNT.gene_models_main.AGAT.gtf
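After the conversion, a quick sanity check of the .gtf file can save a failed STAR run. By default STAR builds its splice junction database from `exon` features and groups them by the `transcript_id` attribute, so the sketch below counts exon lines and flags any that lack that attribute. The two GTF lines are hand-written toy data, not real AGAT output, and `toy.gtf` is a hypothetical file name.

```shell
# Toy GTF (hypothetical lines for illustration, not real AGAT output)
printf 'chr00\tmaker\texon\t17008\t17263\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n'  > toy.gtf
printf 'chr00\tmaker\texon\t17367\t17494\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n' >> toy.gtf

# Count exon lines, and how many of them have no transcript_id attribute
summary=$(awk -F'\t' '
    $3 == "exon" {
        total++
        if ($9 !~ /transcript_id "[^"]+"/) missing++
    }
    END { printf "exon lines: %d, missing transcript_id: %d\n", total, missing }
' toy.gtf)

echo "$summary"
```

On a real file, replace toy.gtf with the AGAT output. A nonzero "missing" count suggests the conversion did not produce the attributes STAR expects.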