Quantification & Normalization

Method 1: Find a community based transcriptome file in a database

Method 2: Make the cDNA file for your genome (this will be the most common method)

Script for Quantifying aligned reads using Salmon

Get or make your transcriptome file

Note: You do not need to get or make the transcriptome files for this class. We already did it for all the referenceGenomes we'll be using.

This information is here so you know what we did, in case you need to do it yourself in the future.

The why

Salmon needs a transcriptome file. The transcriptome file is not your genome file, and it is not your annotation file. It's a FASTA file containing only the transcript (cDNA) sequences of the genes in a given genome assembly.
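To make the file format concrete, here is a minimal sketch of what a transcriptome FASTA looks like and how to count its records. It is written in POSIX sh (run it with sh or bash, not tcsh), and the file name, transcript IDs, and sequences are all made up for illustration.

```shell
# Build a tiny, made-up transcriptome FASTA in a temporary directory.
# (Hypothetical IDs and sequences; a real file has thousands of records.)
workdir=$(mktemp -d)
cat > "$workdir/demo_transcriptome.fa" <<'EOF'
>geneA.1
ATGGCGTTTAGC
>geneA.2
ATGGCGAGC
>geneB.1
ATGCCCTGA
EOF

# Every record starts with a '>' header line, so counting those
# lines counts the transcripts in the file.
n_records=$(grep -c '^>' "$workdir/demo_transcriptome.fa")
echo "transcripts: $n_records"

# List the transcript IDs: the header text up to the first space,
# with the leading '>' removed.
ids=$(grep '^>' "$workdir/demo_transcriptome.fa" | sed 's/^>//' | cut -d' ' -f1)
echo "$ids"
```

Note that one gene can contribute several records (geneA.1 and geneA.2 here), because the file holds one sequence per transcript rather than one per gene.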

Method 1: Find a community based transcriptome file in a database

For Arabidopsis, a community-generated file already exists, so we will use it. We found it here:

http://ftp.ensemblgenomes.org/pub/plants/release-53/fasta/arabidopsis_thaliana/cdna/
It's the file named: Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz

To put it on the HPC, we used this command:

wget http://ftp.ensemblgenomes.org/pub/plants/release-53/fasta/arabidopsis_thaliana/cdna/Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz
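After a download like this, it can be worth checking that the gzipped file arrived intact and contains FASTA records. The sketch below is one way to do that, assuming POSIX sh (sh or bash, not tcsh). To stay self-contained it builds a tiny stand-in file named demo.cdna.fa.gz (a made-up name with made-up contents) instead of using the real Arabidopsis download.

```shell
# Create a small gzipped FASTA as a stand-in for the downloaded file.
# (demo.cdna.fa.gz and its two records are hypothetical.)
workdir=$(mktemp -d)
printf '>tx1 cdna\nATGC\n>tx2 cdna\nGGCC\n' | gzip > "$workdir/demo.cdna.fa.gz"

# gzip -t tests the archive and exits non-zero if it is truncated
# or corrupted, which is a common result of an interrupted download.
if gzip -t "$workdir/demo.cdna.fa.gz"; then
    gz_ok=yes
else
    gz_ok=no
fi
echo "archive intact: $gz_ok"

# Decompress to standard output only (nothing is written to disk)
# and count the '>' header lines to get the number of transcripts.
n_tx=$(gzip -dc "$workdir/demo.cdna.fa.gz" | grep -c '^>')
echo "transcripts: $n_tx"
```

For the real file, the same two commands would be pointed at Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz in place of the demo file.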

Method 2: Make the cDNA file for your genome (this will be the most common method)

Arabidopsis is a special case that already has a transcriptome file made. In almost all other situations, we will have to make the transcriptome file ourselves. It's easy as long as we have the genome assembly (FASTA) and the gene annotations (GTF). Here's how we did it for the soybean genome:

#!/bin/tcsh

#BSUB -J Gm_make-transcriptome #job name

#BSUB -n 20 #number of threads

#BSUB -W 2:0 #time for job to complete

#BSUB -R "rusage[mem=0.2]" #to request a node with 0.2 GB of memory

#BSUB -o Gm_make-transcriptome.%J.out #output file

#BSUB -e Gm_make-transcriptome.%J.err #error file

# We're using the gffread software to make the transcriptome file.

set gffread=/usr/local/usrapps/bitcpt/gffread/bin/gffread

# The input files:

# 1) The genome assembly: glyma.Lee.gnm2.K7BV.genome_main.fna

# 2) The annotation: glyma.Lee.gnm2.ann1.1FNT.gene_models_main.AGAT.gtf

# Set the directory of the genome assembly:

set dir=/share/bitcpt/S23/referenceGenomes/Glycine_max_Lee_v2

# Set the name of the assembly:

set gen=glyma.Lee.gnm2

# Command example for gffread:

# gffread -w output.fa -g genome.fa genome.gtf

# Run it!

${gffread} -w ${dir}/${gen}_transcriptome.fasta -g ${dir}/${gen}.K7BV.genome_main.fna ${dir}/${gen}.ann1.1FNT.gene_models_main.AGAT.gtf
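A quick sanity check after a run like this is to compare the number of transcripts in the annotation with the number of records in the new FASTA, since gffread -w writes one spliced sequence per transcript. The sketch below assumes POSIX sh (sh or bash, not tcsh) and uses tiny made-up stand-ins (demo.gtf and demo_transcriptome.fasta) rather than the soybean files. It also assumes the GTF has explicit "transcript" feature lines; if yours does not, you would count distinct transcript_id values instead.

```shell
workdir=$(mktemp -d)

# A tiny, hypothetical GTF: 9 tab-separated columns, feature type in column 3.
gtf="$workdir/demo.gtf"
printf 'chr1\tdemo\ttranscript\t1\t100\t.\t+\t.\tgene_id "g1"; transcript_id "g1.t1";\n' > "$gtf"
printf 'chr1\tdemo\texon\t1\t100\t.\t+\t.\tgene_id "g1"; transcript_id "g1.t1";\n' >> "$gtf"
printf 'chr1\tdemo\ttranscript\t200\t300\t.\t-\t.\tgene_id "g2"; transcript_id "g2.t1";\n' >> "$gtf"
printf 'chr1\tdemo\texon\t200\t300\t.\t-\t.\tgene_id "g2"; transcript_id "g2.t1";\n' >> "$gtf"

# A matching, hypothetical transcriptome FASTA (stand-in for gffread output).
fa="$workdir/demo_transcriptome.fasta"
printf '>g1.t1\nATGGCC\n>g2.t1\nATGTTT\n' > "$fa"

# Count "transcript" features in the GTF (n+0 prints 0 if there are none).
n_gtf=$(awk -F'\t' '$3 == "transcript" { n++ } END { print n + 0 }' "$gtf")

# Count FASTA records by their '>' header lines.
n_fa=$(grep -c '^>' "$fa")

if [ "$n_gtf" -eq "$n_fa" ]; then
    status=match
else
    status=mismatch
fi
echo "GTF transcripts: $n_gtf, FASTA records: $n_fa ($status)"
```

A mismatch does not always mean the run failed, but it is a good prompt to look at the job's .err file before using the transcriptome with Salmon.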

Script for Quantifying aligned reads using Salmon

#!/bin/tcsh

#BSUB -J salmonquant_At #job name

#BSUB -n 12 #number of threads

#BSUB -W 5:0 #time for job to complete

#BSUB -R "rusage[mem=0.2]" #to request a node with 0.2 GB of memory

#BSUB -o salmonquant_At.%J.out #output file

#BSUB -e salmonquant_At.%J.err #error file

#to quantify aligned reads using salmon in alignment-based mode

#set threads under 12 on Hazel

#working directory path is /share/bitcpt/S23/unityID/At

#input of aligned reads path is /share/bitcpt/S23/unityID/At/AlignedToTranscriptome

#quantification output will go into the salmon_align_quant subdirectory of the working directory

##########################

# Set the variables

##########################

set SALMON=/usr/local/usrapps/bitcpt/salmon/bin/salmon

set cdna=/share/bitcpt/S23/referenceGenomes/Arabidopsis_thaliana/tair10/Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz

set IN=AlignedToTranscriptome

##########################

# At-Leaf 1

##########################

set s=Col-0_Leaf_Rep1