** You do not need to get or make the transcriptome files for this class. We already did it for all the reference genomes we'll be using. **
This info is here so you know what we did, in case you need to do it in the future.
Salmon needs a transcriptome file. The transcriptome file is not your genome file, and it's not your annotation file. It's a fasta file containing only the transcript (cDNA) sequences of the genes annotated in a given genome assembly.
For Arabidopsis, a community-generated file already exists, so we will use it. We found it here:
http://ftp.ensemblgenomes.org/pub/plants/release-53/fasta/arabidopsis_thaliana/cdna/
It's the file named: Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz
To put it on the HPC, we used this command
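The exact command is not reproduced in these notes. As a rough sketch only, assuming wget is available on the HPC (an assumption, not something stated here), a download built from the URL and file name given above might look like the following. It is written in bash, and the echo makes it a dry run that only prints the command.

```shell
# Dry-run sketch in bash; assumes wget is installed (not confirmed by these notes).
base=http://ftp.ensemblgenomes.org/pub/plants/release-53/fasta/arabidopsis_thaliana/cdna
file=Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz
url="${base}/${file}"

# Print the download command; remove "echo" to actually fetch the file.
echo wget "${url}"
```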
Arabidopsis is a special case that already has a transcriptome file made. In almost all other situations, we will have to make the transcriptome file ourselves. It's easy as long as we have the genome assembly (fasta) and the gene annotations (gtf). Here's how we did it for the soybean genome:
#!/bin/tcsh
#BSUB -J Gm_make-transcriptome #job name
#BSUB -n 20 #number of threads
#BSUB -W 2:0 #time for job to complete
#BSUB -R "rusage[mem=0.2]" #to request a node with 0.2 GB of memory
#BSUB -o Gm_make-transcriptome.%J.out #output file
#BSUB -e Gm_make-transcriptome.%J.err #error file
# We're using the gffread software to make the transcriptome file.
set gffread=/usr/local/usrapps/bitcpt/gffread/bin/gffread
# The input files:
# 1) The genome assembly: glyma.Lee.gnm2.K7BV.genome_main.fna
# 2) The annotation: glyma.Lee.gnm2.ann1.1FNT.gene_models_main.AGAT.gtf
# Set the directory of the genome assembly:
set dir=/share/bitcpt/S23/referenceGenomes/Glycine_max_Lee_v2
# Set the name of the assembly:
set gen=glyma.Lee.gnm2
# Command example for gffread:
# gffread -w output.fa -g genome.fa genome.gtf
# Run it!
${gffread} -w ${dir}/${gen}_transcriptome.fasta -g ${dir}/${gen}.K7BV.genome_main.fna ${dir}/${gen}.ann1.1FNT.gene_models_main.AGAT.gtf
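A quick sanity check after gffread finishes is to count the sequences in the new fasta file, since every record starts with a ">" header line. The sketch below is bash rather than the tcsh used in the job script, and count_fasta_records is a hypothetical helper name, not part of gffread.

```shell
# Count FASTA records (lines beginning with ">"); handles plain or gzipped files.
# count_fasta_records is a hypothetical helper, not a gffread feature.
count_fasta_records() {
    case "$1" in
        *.gz) gzip -dc "$1" | grep -c '^>' ;;
        *)    grep -c '^>' "$1" ;;
    esac
}

# Example call using the output path from the script above; runs only if the file exists.
fa=/share/bitcpt/S23/referenceGenomes/Glycine_max_Lee_v2/glyma.Lee.gnm2_transcriptome.fasta
if [ -f "$fa" ]; then
    count_fasta_records "$fa"
fi
```

The number printed should match the number of transcripts in the annotation; a count of 0 usually means the run failed or the wrong file was checked.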
#!/bin/tcsh
#BSUB -J salmonquant_At #job name
#BSUB -n 12 #number of threads
#BSUB -W 5:0 #time for job to complete
#BSUB -R "rusage[mem=0.2]" #to request a node with 0.2 GB of memory
#BSUB -o salmonquant_At.%J.out #output file
#BSUB -e salmonquant_At.%J.err #error file
#to quantify aligned reads using salmon in alignment-based mode
#set threads under 12 on Hazel
#working directory path is /share/bitcpt/S23/unityID/At
#input of aligned reads path is /share/bitcpt/S23/unityID/At/AlignedToTranscriptome
#salmon quantification output will go into the salmon_align_quant subdirectory in the working directory
##########################
# Set the variables
##########################
set SALMON=/usr/local/usrapps/bitcpt/salmon/bin/salmon
set cdna=/share/bitcpt/S23/referenceGenomes/Arabidopsis_thaliana/tair10/Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz
set IN=AlignedToTranscriptome
##########################
# At-Leaf 1
##########################
set s=Col-0_Leaf_Rep1
${SALMON} quant -l A -a ${IN}/${s}_Aligned.toTranscriptome.out.bam --targets ${cdna} -o salmon_align_quant/${s}.quant
##########################
# At-Leaf 2
##########################
set s=Col-0_Leaf_Rep2
${SALMON} quant -l A -a ${IN}/${s}_Aligned.toTranscriptome.out.bam --targets ${cdna} -o salmon_align_quant/${s}.quant
##########################
# At SAM 1
##########################
set s=Col-0_SAM_rep1_L002
${SALMON} quant -l A -a ${IN}/${s}_Aligned.toTranscriptome.out.bam --targets ${cdna} -o salmon_align_quant/${s}.quant
##########################
# At SAM 2
##########################
set s=Col-0_SAM_rep2_L002
${SALMON} quant -l A -a ${IN}/${s}_Aligned.toTranscriptome.out.bam --targets ${cdna} -o salmon_align_quant/${s}.quant
##########################
# At SAM 3
##########################
set s=Col-0_SAM_rep3_L002
${SALMON} quant -l A -a ${IN}/${s}_Aligned.toTranscriptome.out.bam --targets ${cdna} -o salmon_align_quant/${s}.quant
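The five blocks above differ only in the sample name, so the same work could be written as a loop. The following is a dry-run sketch in bash (the job script itself is tcsh, where the equivalent construct is foreach ... end); it prints each command instead of running it. The helper name salmon_cmd is hypothetical, and the variable values are copied from the script above.

```shell
SALMON=/usr/local/usrapps/bitcpt/salmon/bin/salmon
cdna=/share/bitcpt/S23/referenceGenomes/Arabidopsis_thaliana/tair10/Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz
IN=AlignedToTranscriptome

# Build the salmon command line for one sample name (hypothetical helper).
salmon_cmd() {
    echo "${SALMON} quant -l A -a ${IN}/${1}_Aligned.toTranscriptome.out.bam --targets ${cdna} -o salmon_align_quant/${1}.quant"
}

# Dry run: print one command per sample.
for s in Col-0_Leaf_Rep1 Col-0_Leaf_Rep2 Col-0_SAM_rep1_L002 Col-0_SAM_rep2_L002 Col-0_SAM_rep3_L002; do
    salmon_cmd "$s"
done
```

Printing the commands first makes it easy to spot a mistyped sample name before any compute time is spent.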
##########################
# At-Leaf 1
##########################
${SALMON} quant \
    -l A \
    -a ${IN}/${s}_Aligned.toTranscriptome.out.bam \
    --targets ${cdna} \
    -o salmon_align_quant/${s}.quant
1) Call the software (salmon) and then specify which procedure within the software we want to run (quant).
quant is the quantification procedure
salmon quant
2) Specify library type
-l = library flag
followed by library type. We provided A to tell Salmon to infer the library type automatically (A=automatic)
-l A
3) Specify the alignment file
-a = alignment file input flag
followed by the STAR alignment file that we generated
-a ${IN}/${s}_Aligned.toTranscriptome.out.bam
We set the IN and s variables earlier in the code with
set IN=AlignedToTranscriptome
set s=Col-0_Leaf_Rep1
So when we put it all together, it is the same as
-a AlignedToTranscriptome/Col-0_Leaf_Rep1_Aligned.toTranscriptome.out.bam
Because IN is a relative path, Salmon looks for this file inside the working directory (/share/bitcpt/S23/unityID/At).
4) Specify the transcriptome targets file
--targets = transcriptome targets input file flag
Salmon uses the genome-wide transcriptome file, instead of a genome assembly, as its reference. The instructors have already found and downloaded these files for the class. When working with your own organism, you will need to find this file (for example, on Ensembl) and download it to your HPC, or make it yourself from the genome assembly and annotation.
We found the transcriptome file for Arabidopsis here: http://ftp.ensemblgenomes.org/pub/plants/release-53/fasta/arabidopsis_thaliana/cdna/
--targets ${cdna}
We set the cdna variable earlier in the code with:
set cdna=/share/bitcpt/S23/referenceGenomes/Arabidopsis_thaliana/tair10/Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz
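5) Specify what to name the output and where Salmon should save it
-o = output flag
followed by the path to the output directory (Salmon creates a directory with this name, not a single file): path/to/write/the/output/name.quant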
-o salmon_align_quant/${s}.quant
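When Salmon finishes, the output directory named by -o contains a tab-separated file called quant.sf, with one row per transcript and the columns Name, Length, EffectiveLength, TPM, and NumReads. As a small sketch (bash rather than tcsh, with a hypothetical helper name top_tpm), the most highly expressed transcripts could be listed like this:

```shell
# List the N transcripts with the highest TPM from a salmon quant.sf file.
# quant.sf columns (tab-separated): Name, Length, EffectiveLength, TPM, NumReads.
# top_tpm is a hypothetical helper, not a salmon command.
top_tpm() {
    # $1 = path to quant.sf, $2 = number of rows to show (default 10)
    tail -n +2 "$1" \
        | LC_ALL=C sort -t "$(printf '\t')" -k4,4gr \
        | head -n "${2:-10}" \
        | cut -f1,4
}

# Example call for the first leaf sample; runs only if the file exists.
q=salmon_align_quant/Col-0_Leaf_Rep1.quant/quant.sf
if [ -f "$q" ]; then
    top_tpm "$q" 5
fi
```

The helper skips the header line, sorts on the fourth column (TPM) in descending numeric order, and prints the transcript name next to its TPM.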
1) https://combine-lab.github.io/salmon/getting_started/
2) https://salmon.readthedocs.io/en/latest/salmon.html
3) https://www.nature.com/articles/nmeth.4197
(access through NCSU libraries for free)