One of the input files that Salmon requires is a transcriptome file, which is not the same as the full genome assembly. The transcriptome file is built from the genome assembly and its annotation file, and it contains the spliced (cDNA) sequence of each annotated transcript, whether or not that gene is expressed in your samples. For the TAIR10 Arabidopsis genome assembly, a transcriptome file is available to download. For most other genome assemblies it is not, and you will have to make it yourself with gffread. Here is the code to make a transcriptome file.
#!/bin/tcsh
#BSUB -J Gm_make-transcriptome #job name
#BSUB -n 20 #number of cores (job slots) requested
#BSUB -W 2:0 #wall-clock time limit (hours:minutes)
#BSUB -R "rusage[mem=0.2]" #memory reservation (the unit is set by the cluster's LSF configuration)
#BSUB -o Gm_make-transcriptome.%J.out #output file
#BSUB -e Gm_make-transcriptome.%J.err #error file
set gffread=/usr/local/usrapps/bitcpt/gffread/bin/gffread
set dir=/share/bitcpt/S23/referenceGenomes/Glycine_max_Lee_v2
set gen=glyma.Lee.gnm2
# The genome assembly file: glyma.Lee.gnm2.K7BV.genome_main.fna
# The annotation file: glyma.Lee.gnm2.ann1.1FNT.gene_models_main.AGAT.gtf
# Command example for gffread:
# gffread -w output.fa -g genome.fa genome.gtf
${gffread} -w ${dir}/${gen}_transcriptome.fasta -g ${dir}/${gen}.K7BV.genome_main.fna ${dir}/${gen}.ann1.1FNT.gene_models_main.AGAT.gtf
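Once a job like the one above finishes, a quick sanity check is to compare the number of sequences in the new FASTA file with the number of transcript records in the GTF. The sketch below is written for sh/bash rather than tcsh, and it uses small toy files with hypothetical names in place of the real genome files. The two counts are expected to be close, although gffread can discard records it cannot parse; an empty or much smaller FASTA suggests a problem with the annotation.

```shell
#!/bin/sh
# Toy stand-ins for the real files (names and contents are hypothetical).
tmp=$(mktemp -d)
printf '>tx1\nATGGCC\n>tx2\nATGTTTAAA\n' > "$tmp/toy_transcriptome.fasta"
printf 'chr1\tsrc\ttranscript\t1\t6\t.\t+\t.\tgene_id "g1"; transcript_id "tx1";\n' > "$tmp/toy.gtf"
printf 'chr1\tsrc\texon\t1\t6\t.\t+\t.\tgene_id "g1"; transcript_id "tx1";\n' >> "$tmp/toy.gtf"
printf 'chr1\tsrc\ttranscript\t10\t18\t.\t+\t.\tgene_id "g2"; transcript_id "tx2";\n' >> "$tmp/toy.gtf"
printf 'chr1\tsrc\texon\t10\t18\t.\t+\t.\tgene_id "g2"; transcript_id "tx2";\n' >> "$tmp/toy.gtf"

# Every FASTA record starts with ">", so this counts sequences.
n_fasta=$(grep -c '^>' "$tmp/toy_transcriptome.fasta")

# Count GTF lines whose third (feature) column is "transcript".
n_gtf=$(awk -F'\t' '$3 == "transcript" { n++ } END { print n + 0 }' "$tmp/toy.gtf")

echo "FASTA sequences: $n_fasta; GTF transcript records: $n_gtf"
```

On the real data, the two toy paths would be replaced by the `_transcriptome.fasta` file written by gffread and the GTF that was passed to it.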
Input files for the Glycine max v4.0 script that follows:
../../referenceGenomes/Portfolios/Glycine_max_THEreference/GCF_000004515.6_Glycine_max_v4.0_genomic.fna
../../referenceGenomes/Portfolios/Glycine_max_THEreference/genomic.gtf
#!/bin/tcsh
#BSUB -J Gm4.0_make-transcriptome #job name
#BSUB -n 20 #number of cores (job slots) requested
#BSUB -W 2:0 #wall-clock time limit (hours:minutes)
#BSUB -R "rusage[mem=0.2]" #memory reservation (the unit is set by the cluster's LSF configuration)
#BSUB -o Gm4.0_make-transcriptome.%J.out #output file
#BSUB -e Gm4.0_make-transcriptome.%J.err #error file
set gffread=/usr/local/usrapps/bitcpt/gffread/bin/gffread
set dir=/share/bitcpt/S23/referenceGenomes/Portfolios/Glycine_max_THEreference
set gen=GCF_000004515.6_Glycine_max_v4.0
# GCF_000004515.6_Glycine_max_v4.0_genomic.fna
# genomic.gtf
# Command example for gffread:
# gffread -w output.fa -g genome.fa genome.gtf
${gffread} -w ${dir}/${gen}_transcriptome.fasta -g ${dir}/${gen}_genomic.fna ${dir}/genomic.gtf
Input files for the Glycine max Lee v1 script that follows:
../../referenceGenomes/Portfolios/Glycine_max_Lee_v1/glyma.Lee.gnm1.BXNC.genome_main.fna
../../referenceGenomes/Portfolios/Glycine_max_Lee_v1/glyma.Lee.gnm1.ann1.6NZV.gene_models_main.AGAT.gtf
#!/bin/tcsh
#BSUB -J Glycine_max_Lee_v1_make-transcriptome #job name
#BSUB -n 20 #number of cores (job slots) requested
#BSUB -W 2:0 #wall-clock time limit (hours:minutes)
#BSUB -R "rusage[mem=0.2]" #memory reservation (the unit is set by the cluster's LSF configuration)
#BSUB -o Glycine_max_Lee_v1_make-transcriptome.%J.out #output file
#BSUB -e Glycine_max_Lee_v1_make-transcriptome.%J.err #error file
set gffread=/usr/local/usrapps/bitcpt/gffread/bin/gffread
set dir=/share/bitcpt/S23/referenceGenomes/Portfolios/Glycine_max_Lee_v1
set gen=glyma.Lee.gnm1
# glyma.Lee.gnm1.BXNC.genome_main.fna
# glyma.Lee.gnm1.ann1.6NZV.gene_models_main.AGAT.gtf
# Command example for gffread:
# gffread -w output.fa -g genome.fa genome.gtf
${gffread} -w ${dir}/${gen}_transcriptome.fasta -g ${dir}/${gen}.BXNC.genome_main.fna ${dir}/${gen}.ann1.6NZV.gene_models_main.AGAT.gtf
#!/bin/tcsh
#BSUB -J Glycine_max_Wm82-ISU01_v2_make-transcriptome #job name
#BSUB -n 20 #number of cores (job slots) requested
#BSUB -W 2:0 #wall-clock time limit (hours:minutes)
#BSUB -R "rusage[mem=0.2]" #memory reservation (the unit is set by the cluster's LSF configuration)
#BSUB -o Glycine_max_Wm82-ISU01_v2_make-transcriptome.%J.out #output file
#BSUB -e Glycine_max_Wm82-ISU01_v2_make-transcriptome.%J.err #error file
set gffread=/usr/local/usrapps/bitcpt/gffread/bin/gffread
set dir=/share/bitcpt/S23/referenceGenomes/Portfolios/Glycine_max_Wm82-ISU01_v2
set gen=glyma.Wm82_ISU01.gnm2
# glyma.Wm82_ISU01.gnm2.JFPQ.genome_main.fna
# glyma.Wm82_ISU01.gnm2.ann1.FGFB.gene_models_main.AGAT.gtf
# Command example for gffread:
# gffread -w output.fa -g genome.fa genome.gtf
${gffread} -w ${dir}/${gen}_transcriptome.fasta -g ${dir}/${gen}.JFPQ.genome_main.fna ${dir}/${gen}.ann1.FGFB.gene_models_main.AGAT.gtf
#!/bin/tcsh
#BSUB -J NEW_Glycine_soja_W05_v2_make-transcriptome #job name
#BSUB -n 20 #number of cores (job slots) requested
#BSUB -W 2:0 #wall-clock time limit (hours:minutes)
#BSUB -R "rusage[mem=0.2]" #memory reservation (the unit is set by the cluster's LSF configuration)
#BSUB -o NEW_Glycine_soja_W05_v2_make-transcriptome.%J.out #output file
#BSUB -e NEW_Glycine_soja_W05_v2_make-transcriptome.%J.err #error file
set gffread=/usr/local/usrapps/bitcpt/gffread/bin/gffread
set dir=/share/bitcpt/S23/referenceGenomes/Portfolios/NEW_Glycine_soja_W05
set gen=Glycine_soja_W05_v2
# GCF_004193775.1_ASM419377v2_genomic.fna
# genomic.AGAT.gtf
# Command example for gffread:
# gffread -w output.fa -g genome.fa genome.gtf
${gffread} -w ${dir}/${gen}_transcriptome.fasta -g ${dir}/GCF_004193775.1_ASM419377v2_genomic.fna ${dir}/genomic.AGAT.gtf
When generating the transcriptome file, gffread can report errors like the following:
Error: discarding overlapping duplicate gene feature (96743-220521) with ID=gene-GlmaCp045
Error parsing attribute gene_id ('"' required for GTF) at line:
NC_038247.2 BestRefSeq%2CGnomon gene 2542898 2549819 . - . gene_id "gene-ACX1;1"; Dbxref "GeneID:547625"; ID "gene-ACX1;1"; Name "ACX1;1"; description "acyl-CoA oxidase"; gbkey "Gene"; gene "ACX1;1"; gene_biotype "protein_coding";
The attribute column of the GTF file is semicolon (;) delimited. Because there is a semicolon inside the gene name ("ACX1;1"), the parser ends the attribute there, before the closing quote, and the line is now improperly delimited.
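Before changing anything, it can help to see which lines are affected. The sketch below (sh/bash, not tcsh) lists lines in which a quoted attribute value contains a semicolon. It assumes that values never contain escaped double quotes, and the toy GTF and its file name are hypothetical.

```shell
#!/bin/sh
tmp=$(mktemp -d)
# Toy GTF (hypothetical): the first record has semicolons inside quoted values,
# the second record does not.
printf 'chr1\tsrc\tgene\t1\t100\t.\t-\t.\tgene_id "gene-ACX1;1"; gene "ACX1;1";\n' > "$tmp/toy.gtf"
printf 'chr1\tsrc\tgene\t200\t300\t.\t+\t.\tgene_id "gene-ABC2"; gene "ABC2";\n' >> "$tmp/toy.gtf"

# Split each line on double quotes: the even-numbered fields are the quoted values.
# Print the line number and the line when any quoted value contains a semicolon.
awk -F'"' '{
  for (i = 2; i <= NF; i += 2)
    if ($i ~ /;/) { print NR ": " $0; next }
}' "$tmp/toy.gtf" > "$tmp/bad_lines.txt"

# Number of affected lines.
n_bad=$(awk 'END { print NR }' "$tmp/bad_lines.txt")
cat "$tmp/bad_lines.txt"
echo "affected lines: $n_bad"
```

Checking only the quoted values avoids flagging the semicolons that legitimately separate one attribute from the next.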
We need to find and replace every such instance. The general form of a sed substitution is:
sed 's/word1/word2/g' input.file > output.file
where s/ means substitute
and /g means globally (without /g, only the first instance on each line would be replaced).
To change the ;1" at the end of each affected gene name to .1":
sed 's/;1"/.1"/g' input.file > output.file
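A minimal way to try the substitution and confirm that it worked is sketched below (sh/bash, not tcsh). The toy line is shortened from the error message above and the file names are hypothetical. The result is written to a new file so that the original annotation is left untouched.

```shell
#!/bin/sh
tmp=$(mktemp -d)
# Toy annotation line (hypothetical, shortened from the record in the error message).
printf 'chr1\tsrc\tgene\t1\t100\t.\t-\t.\tgene_id "gene-ACX1;1"; Name "ACX1;1";\n' > "$tmp/toy.gtf"

# Apply the substitution and write the result to a new file.
sed 's/;1"/.1"/g' "$tmp/toy.gtf" > "$tmp/toy.fixed.gtf"

# Count the lines that contain the problem pattern before and after the fix.
# grep -c exits non-zero when the count is 0, hence the "|| true".
before=$(grep -c ';1"' "$tmp/toy.gtf" || true)
after=$(grep -c ';1"' "$tmp/toy.fixed.gtf" || true)
echo "lines with ;1\" before: $before, after: $after"

# Show the repaired attribute column (column 9 of the tab-delimited GTF).
fixed_attrs=$(cut -f9 "$tmp/toy.fixed.gtf")
echo "$fixed_attrs"
```

This pattern only catches names that end in ;1. If the annotation also contains names ending in other numbers, something like `sed -E 's/;([0-9]+)"/.\1"/g'` may be needed instead; that is an assumption about the data, so check the affected lines first. The corrected GTF would then be given to gffread in place of the original.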