We've checked the quality of our RNA-seq data with FastQC and trimmed it with Fastp. We're now ready to align the data to our reference genome using STAR (Spliced Transcripts Alignment to a Reference) software.
The STAR alignment consists of two steps:
1) Index the genome with STAR
2) Align data to indexed genome with STAR
The trimmed Arabidopsis reads are in this directory: /share/bitcpt/Fall2022/CleanData/Arabidopsis_thaliana
The trimmed Tomato reads are in this directory: /share/bitcpt/Fall2022/CleanData/Solanum_lycopersicum
Arabidopsis Genome Assembly directory: /share/bitcpt/Fall2022/Genomes/Arabidopsis_thaliana
Tomato Genome Assembly directory: /share/bitcpt/Fall2022/Genomes/Solanum_lycopersicum
#!/bin/tcsh
#BSUB -J starindices_At_GroupName #job name
#BSUB -n 10 #number of nodes
#BSUB -W 2:0 #time for job to complete
#BSUB -o starindices.out.%J #output file
#BSUB -e starindices.err.%J #error file
# For running star to generate genome index
# Run in working directory /share/bitcpt/Fall2022/UnityID/At
# Must run this in working directory with subdirectory named starindices/
module load conda
conda activate /usr/local/usrapps/bitcpt/star
set IN=/share/bitcpt/Fall2022/referenceGenomes/Arabidopsis_thaliana/tair10
STAR \
--runThreadN 10
--runMode genomeGenerate
--genomeSAindexNbases 12
--genomeDir starindices/
--genomeFastaFiles ${IN}/1.fas ${IN}/2.fas ${IN}/3.fas ${IN}/4.fas ${IN}/5.fas ${IN}Pt.fas ${IN}/Mt.fas
--sjdbGTFfile ${IN}/TAIR10.E41.protein_coding.gtf
--sjdbOverhang 36
Option Breakdown
STAR
this is the command to run star software
--runThreadN 10
set the number of threads, must match #BSUB -n value
--runMode genomeGenerate
STAR can do genome indexing and alignment. Here we're telling STAR that we want to index the genome.
--genomeSAindexNbases 12
default: 14
int: length (bases) of the SA pre-indexing string. Typically between 10 and 15. Longer strings will use much more memory, but allow faster searches. For small genomes, the parameter –genomeSAindexNbases must be scaled down to min(14, log2(GenomeLength)/2 - 1).
set to star manual's recommendation for Arabidopsis genome size
Wolframalpha link shows math where 119667750 is Arabidopsis genome size:
The theoretical genome length of Arabidposis is 135 Mb, so log2(135000000)/2-1 = ~12.5 , always best to round (down) to the nearest whole number.
The genome assembly length of Arabidposis is 119.7 Mb, so log2(119667750)/2-1 = 12.4 and we round down to 12
--genomeDir starindices/
Tell STAR where to write the output
--genomeFastaFiles ${IN}/1.fas ${IN}/2.fas ${IN}/3.fas ${IN}/4.fas ${IN}/5.fas ${IN}Pt.fas ${IN}/Mt.fas
Tell STAR where the genome assembly file(s) are
--sjdbGTFfile ${IN}/TAIR10.E41.protein_coding.gtf
Tell STAR where the annotation file is
--sjdbOverhang 36
Length of RNA-seq reads - 1. After cleaning, our Arabidopsis reads are 37 bp. You can find these numbers in the fastp outputs in CleanData/
From the manual: In most cases, the default value of 100 will work as well as the ideal value.
ls -lh starindices/
total 7.5G
-rw-r--r--. 1 eedelore bitcpt 116 Nov 20 20:12 chrLength.txt
-rw-r--r--. 1 eedelore bitcpt 194 Nov 20 20:12 chrNameLength.txt
-rw-r--r--. 1 eedelore bitcpt 78 Nov 20 20:12 chrName.txt
-rw-r--r--. 1 eedelore bitcpt 131 Nov 20 20:12 chrStart.txt
-rw-r--r--. 1 eedelore bitcpt 6.6M Nov 20 20:12 exonGeTrInfo.tab
-rw-r--r--. 1 eedelore bitcpt 2.6M Nov 20 20:12 exonInfo.tab
-rw-r--r--. 1 eedelore bitcpt 2.2M Nov 20 20:12 geneInfo.tab
-rw-r--r--. 1 eedelore bitcpt 822M Nov 20 20:32 Genome
-rw-r--r--. 1 eedelore bitcpt 682 Nov 20 20:32 genomeParameters.txt
-rw-r--r--. 1 eedelore bitcpt 7.9K Nov 20 20:33 Log.out
-rw-r--r--. 1 eedelore bitcpt 6.7G Nov 20 20:33 SA
-rw-r--r--. 1 eedelore bitcpt 5.9M Nov 20 20:33 SAindex
-rw-r--r--. 1 eedelore bitcpt 4.4M Nov 20 20:30 sjdbInfo.txt
-rw-r--r--. 1 eedelore bitcpt 4.9M Nov 20 20:12 sjdbList.fromGTF.out.tab
-rw-r--r--. 1 eedelore bitcpt 4.0M Nov 20 20:30 sjdbList.out.tab
-rw-r--r--. 1 eedelore bitcpt 2.6M Nov 20 20:12 transcriptInfo.tab
Here is a generalized STAR alignment code of Arabidposis. Find the full code in /share/bitcpt/Fall2022/scripts/At.star.align.sh
#!/bin/tcsh
#BSUB -J staralign_At #job name
#BSUB -n 12 #number of threads
#BSUB -W 10:0 #time for job to complete
#BSUB -o staralign_At_%J.out #output file
#BSUB -e staralign_At_%J.err #error file
cd /share/bitcpt/Fall2022/UnityID/At
​module load conda
conda activate /usr/local/usrapps/bitcpt/star
# SET VARIABLES
set IN=/share/bitcpt/Fall2022/CleanData/Arabidopsis_thaliana
set index=starindices
set out=AlignedToTranscriptome
################################
## Leaf Rep 1
################################
# RNA-seq data are in format Col-0_Leaf_Rep1_1.fp.fq.gz
set S=Col-0_Leaf_Rep1
set EN=fp.fq.gz
# Print the file name to make sure it is right
echo ${IN}/${S}_1.${EN}
STAR
--runThreadN 12
--runMode alignReads
--genomeDir starindices/
--outFileNamePrefix ${out}/${S}_
--readFilesIn ${IN}/${S}_1.${EN} ${IN}/${S}_2.${EN}
--readFilesCommand zcat
--outSAMtype BAM Unsorted
--twopassMode Basic
--quantMode TranscriptomeSAM
Option breakdown:
--runThreadN 20
set the number of threads, must match #BSUB -n value
--runMode alignReads
STAR software has several different functions. Previously, we used STAR to index the genome. Now we're telling STAR we want it to align reads.
--genomeDir ../starindices/
directory where STAR index files are located
--outFileNamePrefix AlignedToTranscriptome/At_SAM
directory/file-prefix for writing output
--readFilesIn ${IN}/At-Leaf1_L02_1.fp.fq.gz ${IN}/At-Leaf1_L02_2.fp.fq.gz
input files (PE)
--readFilesCommand zcat
zcat reads gzipped input files
--outSAMtype BAM Unsorted
output type set to an unsorted bam
--twopassMode Basic
have STAR align all reads (1st pass), add those discovered junction sites to the genome index and then align reads again (2nd pass). Improves the number of reads detected per splice site.
--quantMode TranscriptomeSAM
generates the Aligned.toTranscriptome.out.bam file, which we need for Quantification step
ls -lh AlignedToTranscriptome/
total 12G
-rw-r--r--. 1 eedelore bitcpt 6.9G Nov 2 22:55 At_SAM_Aligned.out.bam
-rw-r--r--. 1 eedelore bitcpt 4.7G Nov 2 22:55 At_SAM_Aligned.toTranscriptome.out.bam
-rw-r--r--. 1 eedelore bitcpt 2.0K Nov 2 22:55 At_SAM_Log.final.out
-rw-r--r--. 1 eedelore bitcpt 12K Nov 2 22:55 At_SAM_Log.out
-rw-r--r--. 1 eedelore bitcpt 3.8K Nov 2 22:55 At_SAM_Log.progress.out
-rw-r--r--. 1 eedelore bitcpt 5.0M Nov 2 22:55 At_SAM_SJ.out.tab
drwx--S---. 2 eedelore bitcpt 4.0K Nov 2 22:34 At_SAM_STARgenome
drwx--S---. 2 eedelore bitcpt 4.0K Nov 2 22:34 At_SAM_STARpass1
make sure to check --sjdbOverhang length matches the tomato data (find in the CleanData/ fastp html files)
Everyone on your research team will be using the M82_vTS3 reference genome and the same dataset for your Team analysis. For your individual analysis, you must index your individual project reference genome and align your team RNA-seq reads to that genome. You will have to adapt both an indexing and alignment script for this final portfolio task.