We've checked the quality of our RNA-seq data with FastQC and trimmed it with Fastp. We're now ready to align the data to our reference genome using STAR (Spliced Transcripts Alignment to a Reference) software.
The STAR alignment consists of two steps:
1) Index the genome with STAR
2) Align data to indexed genome with STAR
The cleaned Arabidopsis reads are in this directory: /share/bitcpt/S23/cleandata/Arabidopsis_thaliana
The cleaned Soy reads are in this directory: /share/bitcpt/S23/cleandata/Glycine_max
Arabidopsis Genome Assembly directory: /share/bitcpt/S23/referenceGenomes/Arabidopsis_thaliana/tair10
Soy Genome Assembly directory: /share/bitcpt/S23/referenceGenomes/Glycine_max_Lee_v2
#!/bin/tcsh
#BSUB -J starindices_At_GroupName #job name
#BSUB -n 10 #number of nodes
#BSUB -W 2:0 #time for job to complete
#BSUB -o starindices.out.%J #output file
#BSUB -e starindices.err.%J #error file
# For running star to generate genome index
# Run in working directory /share/bitcpt/S23/UnityID/At
# Must run this in working directory with subdirectory named starindices/
#Set variables for STAR software and At reference genome
set STAR=/usr/local/usrapps/bitcpt/star/bin/STAR
set IN=/share/bitcpt/S23/referenceGenomes/Arabidopsis_thaliana/tair10
#Command to run STAR software to index the reference genome for At with specified options
${STAR} \
--runThreadN 10
--runMode genomeGenerate
--genomeSAindexNbases 12
--genomeDir starindices/
--genomeFastaFiles ${IN}/1.fas ${IN}/2.fas ${IN}/3.fas ${IN}/4.fas ${IN}/5.fas ${IN}Pt.fas ${IN}/Mt.fas
--sjdbGTFfile ${IN}/TAIR10.E41.protein_coding.gtf
--sjdbOverhang 100
Set Variables
set STAR=/usr/local/usrapps/bitcpt/star/bin/STAR
Command set defines variable STAR to be (=) the path to the user maintained software
set IN=/share/bitcpt/S23/referenceGenomes/Arabidopsis_thaliana/tair10
Command set defines variable IN to be (=) the path to the TAIR10 reference genome
Command Options Breakdown
${STAR}
this is the command to run star software
It calls the preset variable 'STAR'
--runThreadN 10
set the number of threads, must match #BSUB -n value. Hazel maximum is 12!
--runMode genomeGenerate
STAR can do genome indexing and alignment. Here we're telling STAR that we want to index the genome.
--genomeSAindexNbases 12
default: 14
int: length (bases) of the SA pre-indexing string. Typically between 10 and 15. Longer strings will use much more memory, but allow faster searches. For small genomes, the parameter –genomeSAindexNbases must be scaled down to min(14, log2(GenomeLength)/2 - 1).
set to star manual's recommendation for Arabidopsis genome size
Wolframalpha link shows math where 119667750 is Arabidopsis genome size:
The theoretical genome length of Arabidposis is 135 Mb, so log2(135000000)/2-1 = ~12.5 , always best to round (down) to the nearest whole number.
The genome assembly length of Arabidposis is 119.7 Mb, so log2(119667750)/2-1 = 12.4 and we round down to 12
--genomeDir starindices/
Tell STAR where to write the output
--genomeFastaFiles ${IN}/1.fas ${IN}/2.fas ${IN}/3.fas ${IN}/4.fas ${IN}/5.fas ${IN}Pt.fas ${IN}/Mt.fas
Tell STAR where the genome assembly file(s) are
--sjdbGTFfile ${IN}/TAIR10.E41.protein_coding.gtf
Tell STAR where the annotation file is
--sjdbOverhang 100
Length of RNA-seq reads - 1. However, from the manual: In most cases, the default value of 100 will work as well as the ideal value.
ls -lh starindices/
total 7.5G
-rw-r--r--. 1 eedelore bitcpt 116 Nov 20 20:12 chrLength.txt
-rw-r--r--. 1 eedelore bitcpt 194 Nov 20 20:12 chrNameLength.txt
-rw-r--r--. 1 eedelore bitcpt 78 Nov 20 20:12 chrName.txt
-rw-r--r--. 1 eedelore bitcpt 131 Nov 20 20:12 chrStart.txt
-rw-r--r--. 1 eedelore bitcpt 6.6M Nov 20 20:12 exonGeTrInfo.tab
-rw-r--r--. 1 eedelore bitcpt 2.6M Nov 20 20:12 exonInfo.tab
-rw-r--r--. 1 eedelore bitcpt 2.2M Nov 20 20:12 geneInfo.tab
-rw-r--r--. 1 eedelore bitcpt 822M Nov 20 20:32 Genome
-rw-r--r--. 1 eedelore bitcpt 682 Nov 20 20:32 genomeParameters.txt
-rw-r--r--. 1 eedelore bitcpt 7.9K Nov 20 20:33 Log.out
-rw-r--r--. 1 eedelore bitcpt 6.7G Nov 20 20:33 SA
-rw-r--r--. 1 eedelore bitcpt 5.9M Nov 20 20:33 SAindex
-rw-r--r--. 1 eedelore bitcpt 4.4M Nov 20 20:30 sjdbInfo.txt
-rw-r--r--. 1 eedelore bitcpt 4.9M Nov 20 20:12 sjdbList.fromGTF.out.tab
-rw-r--r--. 1 eedelore bitcpt 4.0M Nov 20 20:30 sjdbList.out.tab
-rw-r--r--. 1 eedelore bitcpt 2.6M Nov 20 20:12 transcriptInfo.tab
Here is a generalized STAR alignment code of Arabidposis. Find the full code in /share/bitcpt/S23/scripts/At.star.align.sh
#!/bin/tcsh
#BSUB -J staralign_At #job name
#BSUB -n 12 #number of threads
#BSUB -W 10:0 #time for job to complete
#BSUB -R span[hosts=1] #to keep tasks on one node
#BSUB -R "rusage[mem=0.2]" #to request a node with 20MB of memory
#BSUB -o staralign_At_%J.out #output file
#BSUB -e staralign_At_%J.err #error file
#Script to align RNA-seq reads to indexed genome using STAR
#STAR cannot make use of HPC MPI, must have -R options to set 1 node & memory
#set threads under 12 on Hazel
#input of indexed genome path is /share/bitcpt/S23/UNITYID/At/starindices
#input of sequence reads path is /share/bitcpt/S23/cleandata/Arabidopsis_thaliana/
#output of aligned reads will go into AlignedToTranscriptome subdirectory in wor
king directory
# SET SCRIPT VARIABLES
set STAR=/usr/local/usrapps/bitcpt/star/bin/STAR
set IN=/share/bitcpt/S23/cleandata/Arabidopsis_thaliana
set index=starindices
set out=AlignedToTranscriptome
################################
## Leaf Rep 1
################################
# SET SAMPLE VARIABLES
# RNA-seq data are in format Col-0_Leaf_Rep1_1.fp.fq.gz
set S=Col-0_Leaf_Rep1
set EN=fp.fq.gz
# Print the file name to make sure it is right
echo ${IN}/${S}_1.${EN}
#Run STAR to At alignReads with all specified alignment options
${STAR}
--runThreadN 12
--runMode alignReads
--genomeDir starindices/
--outFileNamePrefix ${out}/${S}_
--readFilesIn ${IN}/${S}_1.${EN} ${IN}/${S}_2.${EN}
--readFilesCommand zcat
--outSAMtype BAM Unsorted
--twopassMode Basic
--quantMode TranscriptomeSAM
Set Script Variables
set STAR=/usr/local/usrapps/bitcpt/star/bin/STAR
Command set defines variable STAR to be (=) the path to the user maintained software
set IN=/share/bitcpt/S23/cleandata/Arabidopsis_thaliana
Command set defines variable IN to be (=) the path to the RNA sequence reads that have passed QC and are trimmed AKA "clean"
set index=starindices
Command set defines variable index to be (=) the starindices directory containing the successfully indexed reference genome
set out=AlignedToTranscriptome
Command set defines variable out to be (=) the AlignedToTranscriptome directory where we desire the output to be written
Command Option breakdown:
${STAR}
this is the command to run star software when defined as a ${VARIABLE}
It calls the preset variable 'STAR' from the Set Script Variables section
--runThreadN 20
set the number of threads, must match #BSUB -n value
--runMode alignReads
STAR software has several different functions. Previously, we used STAR to index the genome. Now we're telling STAR we want it to align reads.
--genomeDir /starindices/
directory where STAR index files are located
--outFileNamePrefix AlignedToTranscriptome/At_SAM
directory/file-prefix for writing output
--readFilesIn ${IN}/At-Leaf1_L02_1.fp.fq.gz ${IN}/At-Leaf1_L02_2.fp.fq.gz
input files (PE)
--readFilesCommand zcat
zcat reads gzipped input files
--outSAMtype BAM Unsorted
output type set to an unsorted bam
--twopassMode Basic
have STAR align all reads (1st pass), add those discovered junction sites to the genome index and then align reads again (2nd pass). Improves the number of reads detected per splice site.
--quantMode TranscriptomeSAM
generates the Aligned.toTranscriptome.out.bam file, which we need for Quantification step
ls -lh AlignedToTranscriptome/
total 12G
-rw-r--r--. 1 eedelore bitcpt 6.9G Nov 2 22:55 At_SAM_Aligned.out.bam
-rw-r--r--. 1 eedelore bitcpt 4.7G Nov 2 22:55 At_SAM_Aligned.toTranscriptome.out.bam
-rw-r--r--. 1 eedelore bitcpt 2.0K Nov 2 22:55 At_SAM_Log.final.out
-rw-r--r--. 1 eedelore bitcpt 12K Nov 2 22:55 At_SAM_Log.out
-rw-r--r--. 1 eedelore bitcpt 3.8K Nov 2 22:55 At_SAM_Log.progress.out
-rw-r--r--. 1 eedelore bitcpt 5.0M Nov 2 22:55 At_SAM_SJ.out.tab
drwx--S---. 2 eedelore bitcpt 4.0K Nov 2 22:34 At_SAM_STARgenome
drwx--S---. 2 eedelore bitcpt 4.0K Nov 2 22:34 At_SAM_STARpass1
make sure to check --sjdbOverhang length matches the soy data (find in the cleandata/ fastp html files)
Everyone on your research team will be using the Glycine_max_Lee_v2 reference genome and the same dataset for your Team analysis. For your individual analysis, you must index your individual project reference genome. You will also align ALL of the raw data samples of RNA-seq reads to your reference genome. You will have to adapt both an indexing and alignment script for this final portfolio task.