Common issues

One of the input files that Salmon requires is a transcriptome file, which is not the same as the full genome assembly. The transcriptome file is built from the genome assembly and its annotation file, and it contains only the cDNA sequences of the annotated transcripts. For the TAIR10 Arabidopsis genome assembly, a transcriptome file is available to download. For most other genome assemblies it is not, so you will have to make it yourself. The scripts below use gffread to make a transcriptome file.

Script to create the Soybean (Lee v2) transcriptome file

#!/bin/tcsh

#BSUB -J Gm_make-transcriptome #job name

#BSUB -n 20 #number of threads

#BSUB -W 2:0 #time for job to complete

#BSUB -R "rusage[mem=0.2]" #to request a node with 20MB of memory

#BSUB -o Gm_make-transcriptome.%J.out #output file

#BSUB -e Gm_make-transcriptome.%J.err #error file

set gffread=/usr/local/usrapps/bitcpt/gffread/bin/gffread

set dir=/share/bitcpt/S23/referenceGenomes/Glycine_max_Lee_v2

set gen=glyma.Lee.gnm2

# The genome assembly file: glyma.Lee.gnm2.K7BV.genome_main.fna

# The annotation file: glyma.Lee.gnm2.ann1.1FNT.gene_models_main.AGAT.gtf

# Command example for gffread:

# gffread -w output.fa -g genome.fa genome.gtf

${gffread} -w ${dir}/${gen}_transcriptome.fasta -g ${dir}/${gen}.K7BV.genome_main.fna ${dir}/${gen}.ann1.1FNT.gene_models_main.AGAT.gtf
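Once the job finishes, a quick sanity check is to count the header lines in the output, since each FASTA header marks one transcript. The sketch below is written for bash on the command line rather than as a tcsh batch script. It builds a tiny stand-in file (demo_transcriptome.fasta, a hypothetical name) so that it runs anywhere; in practice, point grep at the real ${gen}_transcriptome.fasta instead.

```shell
# Build a tiny stand-in transcriptome so the example is self-contained.
# (demo_transcriptome.fasta is a hypothetical file name, not part of the course data.)
printf '>transcript1\nATGGCC\n>transcript2\nATGTTT\nGGA\n' > demo_transcriptome.fasta

# Each FASTA record starts with a ">" header line, so counting those
# lines gives the number of transcripts in the file.
n_transcripts=$(grep -c '^>' demo_transcriptome.fasta)
echo "Transcripts: ${n_transcripts}"
```

A count of zero, or an empty file, usually means the job failed, so check the .err file from the job.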

Make transcriptome for THEreference (Glycine_max_v4.0)

../../referenceGenomes/Portfolios/Glycine_max_THEreference/GCF_000004515.6_Glycine_max_v4.0_genomic.fna

../../referenceGenomes/Portfolios/Glycine_max_THEreference/genomic.gtf

#!/bin/tcsh

#BSUB -J Gm4.0_make-transcriptome #job name

#BSUB -n 20 #number of threads

#BSUB -W 2:0 #time for job to complete

#BSUB -R "rusage[mem=0.2]" #to request a node with 20MB of memory

#BSUB -o Gm4.0_make-transcriptome.%J.out #output file

#BSUB -e Gm4.0_make-transcriptome.%J.err #error file

set gffread=/usr/local/usrapps/bitcpt/gffread/bin/gffread

set dir=/share/bitcpt/S23/referenceGenomes/Portfolios/Glycine_max_THEreference

set gen=GCF_000004515.6_Glycine_max_v4.0

# GCF_000004515.6_Glycine_max_v4.0_genomic.fna

# genomic.gtf

# Command example for gffread:

# gffread -w output.fa -g genome.fa genome.gtf

${gffread} -w ${dir}/${gen}_transcriptome.fasta -g ${dir}/${gen}_genomic.fna ${dir}/genomic.gtf

Make transcriptome for Glycine_max_Lee_v1

../../referenceGenomes/Portfolios/Glycine_max_Lee_v1/glyma.Lee.gnm1.BXNC.genome_main.fna

../../referenceGenomes/Portfolios/Glycine_max_Lee_v1/glyma.Lee.gnm1.ann1.6NZV.gene_models_main.AGAT.gtf

#!/bin/tcsh

#BSUB -J Glycine_max_Lee_v1_make-transcriptome #job name

#BSUB -n 20 #number of threads

#BSUB -W 2:0 #time for job to complete

#BSUB -R "rusage[mem=0.2]" #to request a node with 20MB of memory

#BSUB -o Glycine_max_Lee_v1_make-transcriptome.%J.out #output file

#BSUB -e Glycine_max_Lee_v1_make-transcriptome.%J.err #error file

set gffread=/usr/local/usrapps/bitcpt/gffread/bin/gffread

set dir=/share/bitcpt/S23/referenceGenomes/Portfolios/Glycine_max_Lee_v1

set gen=glyma.Lee.gnm1

# glyma.Lee.gnm1.BXNC.genome_main.fna

# glyma.Lee.gnm1.ann1.6NZV.gene_models_main.AGAT.gtf

# Command example for gffread:

# gffread -w output.fa -g genome.fa genome.gtf

${gffread} -w ${dir}/${gen}_transcriptome.fasta -g ${dir}/${gen}.BXNC.genome_main.fna ${dir}/${gen}.ann1.6NZV.gene_models_main.AGAT.gtf

Make transcriptome for Glycine_max_Wm82-ISU01_v2

#!/bin/tcsh

#BSUB -J Glycine_max_Wm82-ISU01_v2_make-transcriptome #job name

#BSUB -n 20 #number of threads

#BSUB -W 2:0 #time for job to complete

#BSUB -R "rusage[mem=0.2]" #to request a node with 20MB of memory

#BSUB -o Glycine_max_Wm82-ISU01_v2_make-transcriptome.%J.out #output file

#BSUB -e Glycine_max_Wm82-ISU01_v2_make-transcriptome.%J.err #error file

set gffread=/usr/local/usrapps/bitcpt/gffread/bin/gffread

set dir=/share/bitcpt/S23/referenceGenomes/Portfolios/Glycine_max_Wm82-ISU01_v2

set gen=glyma.Wm82_ISU01.gnm2

# glyma.Wm82_ISU01.gnm2.JFPQ.genome_main.fna

# glyma.Wm82_ISU01.gnm2.ann1.FGFB.gene_models_main.AGAT.gtf

# Command example for gffread:

# gffread -w output.fa -g genome.fa genome.gtf

${gffread} -w ${dir}/${gen}_transcriptome.fasta -g ${dir}/${gen}.JFPQ.genome_main.fna ${dir}/${gen}.ann1.FGFB.gene_models_main.AGAT.gtf

Make transcriptome for NEW_Glycine_soja_W05

#!/bin/tcsh

#BSUB -J NEW_Glycine_soja_W05_v2_make-transcriptome #job name

#BSUB -n 20 #number of threads

#BSUB -W 2:0 #time for job to complete

#BSUB -R "rusage[mem=0.2]" #to request a node with 20MB of memory

#BSUB -o NEW_Glycine_soja_W05_v2_make-transcriptome.%J.out #output file

#BSUB -e NEW_Glycine_soja_W05_v2_make-transcriptome.%J.err #error file

set gffread=/usr/local/usrapps/bitcpt/gffread/bin/gffread

set dir=/share/bitcpt/S23/referenceGenomes/Portfolios/NEW_Glycine_soja_W05

set gen=Glycine_soja_W05_v2

# GCF_004193775.1_ASM419377v2_genomic.fna

# genomic.AGAT.gtf

# Command example for gffread:

# gffread -w output.fa -g genome.fa genome.gtf

${gffread} -w ${dir}/${gen}_transcriptome.fasta -g ${dir}/GCF_004193775.1_ASM419377v2_genomic.fna ${dir}/genomic.AGAT.gtf

Special character in gtf file

When generating the transcriptome file, gffread can stop with errors like these:

Error: discarding overlapping duplicate gene feature (96743-220521) with ID=gene-GlmaCp045

Error parsing attribute gene_id ('"' required for GTF) at line:

NC_038247.2 BestRefSeq%2CGnomon gene 2542898 2549819 . - . gene_id "gene-ACX1;1"; Dbxref "GeneID:547625"; ID "gene-ACX1;1"; Name "ACX1;1"; description "acyl-CoA oxidase"; gbkey "Gene"; gene "ACX1;1"; gene_biotype "protein_coding";

The attributes column of the GTF file is semicolon (;) delimited. Because there is a semicolon inside the gene name (ACX1;1), the line gets split in the middle of a quoted value, and the GTF file is no longer properly delimited!
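Before fixing anything, it helps to see how many lines are affected. The bash sketch below assumes that well-formed attributes are separated by a semicolon followed by a space (or end the line), so any semicolon followed directly by another character is suspect. The file demo.gtf and its two records are made up for illustration, trimmed from the error shown above; substitute your own GTF file.

```shell
# Stand-in GTF: one well-formed record and one whose gene name contains ";".
# (demo.gtf is a hypothetical file; the records are shortened for readability.)
printf 'chr1\tsrc\tgene\t1\t100\t.\t+\t.\tgene_id "gene-ABC1"; gene "ABC1";\n' > demo.gtf
printf 'chr1\tsrc\tgene\t200\t300\t.\t-\t.\tgene_id "gene-ACX1;1"; gene "ACX1;1";\n' >> demo.gtf

# A normal separator is ";" followed by a space or the end of the line.
# A ";" followed directly by a non-space character sits inside a value.
# -n prints line numbers so the offending records are easy to find.
grep -n ';[^[:space:]]' demo.gtf

# Count the affected lines.
bad_lines=$(grep -c ';[^[:space:]]' demo.gtf)
echo "Lines with an embedded semicolon: ${bad_lines}"
```

If your GTF file separates attributes with a semicolon and no space, this pattern flags every line, so check a few matches by eye first.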

We need to find and replace these instances. The sed command can do this, and its general form is:

sed 's/word1/word2/g' input.file > output.file

Where s/ means substitute

and /g means globally (without /g, only the first instance on each line would be replaced)
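To see the effect of /g, the bash sketch below runs the same substitution with and without it on a single made-up line that contains the search text twice. The variable names are only for illustration.

```shell
# One line with two occurrences of the text to replace.
line='word1 and word1 again'

# Without /g, only the first match on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/word1/word2/')

# With /g, every match on the line is replaced.
all_matches=$(printf '%s\n' "$line" | sed 's/word1/word2/g')

echo "$first_only"    # word2 and word1 again
echo "$all_matches"   # word2 and word2 again
```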

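For the gene names in the error above, this replaces the semicolon before the final 1 with a period:

sed 's/;1"/.1"/g' input.file > output.file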
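As a quick check that the substitution does what is intended, the bash sketch below applies it to a shortened copy of the problem record and then counts how many lines still contain the ;1" pattern. The file names demo_bad.gtf and demo_fixed.gtf are hypothetical.

```shell
# Stand-in for the problem record (shortened; demo_bad.gtf is a hypothetical name).
printf 'chr1\tsrc\tgene\t200\t300\t.\t-\t.\tgene_id "gene-ACX1;1"; gene "ACX1;1";\n' > demo_bad.gtf

# Same substitution as above: turn every ;1" into .1"
sed 's/;1"/.1"/g' demo_bad.gtf > demo_fixed.gtf

# Show the repaired record.
cat demo_fixed.gtf

# Count lines that still contain ;1" (grep exits non-zero when the count is 0,
# hence the "|| true"). A count of 0 means every instance was replaced.
remaining=$(grep -c ';1"' demo_fixed.gtf || true)
echo "Lines still containing the pattern: ${remaining}"
```

Note that this pattern only covers names ending in ;1. A name with a different suffix after the semicolon would need its own substitution.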
