Alignment

Sequence alignment

Arabidopsis Genome Index Code

Script

Output:

Arabidopsis Alignment Code

Output:

Soy Index Code

FOR YOUR FINAL PORTFOLIO

Sequence alignment

We've checked the quality of our RNA-seq data with FastQC and trimmed it with Fastp. We're now ready to align the data to our reference genome using STAR (Spliced Transcripts Alignment to a Reference) software.

Find the STAR manual here.
Find the STAR publication here.

The STAR alignment consists of two steps:

1) Index the genome with STAR

2) Align data to indexed genome with STAR

The cleaned Arabidopsis reads are in this directory: /share/bitcpt/S23/cleandata/Arabidopsis_thaliana

The cleaned Soy reads are in this directory: /share/bitcpt/S23/cleandata/Glycine_max

Arabidopsis Genome Assembly directory: /share/bitcpt/S23/referenceGenomes/Arabidopsis_thaliana/tair10

Soy Genome Assembly directory: /share/bitcpt/S23/referenceGenomes/Glycine_max_Lee_v2

Arabidopsis Genome Index Code

Script

#!/bin/tcsh

#BSUB -J starindices_At_GroupName #job name

#BSUB -n 10 #number of nodes

#BSUB -W 2:0 #time for job to complete

#BSUB -o starindices.out.%J #output file

#BSUB -e starindices.err.%J #error file

# For running star to generate genome index

# Run in working directory /share/bitcpt/S23/UnityID/At

# Must run this in working directory with subdirectory named starindices/

#Set variables for STAR software and At reference genome

set STAR=/usr/local/usrapps/bitcpt/star/bin/STAR

set IN=/share/bitcpt/S23/referenceGenomes/Arabidopsis_thaliana/tair10

#Command to run STAR software to index the reference genome for At with specified options

${STAR} \

--runThreadN 10

--runMode genomeGenerate

--genomeSAindexNbases 12

--genomeDir starindices/

--genomeFastaFiles ${IN}/1.fas ${IN}/2.fas ${IN}/3.fas ${IN}/4.fas ${IN}/5.fas ${IN}Pt.fas ${IN}/Mt.fas

--sjdbGTFfile ${IN}/TAIR10.E41.protein_coding.gtf

--sjdbOverhang 100

Set Variables

set STAR=/usr/local/usrapps/bitcpt/star/bin/STAR

Command set defines variable STAR to be (=) the path to the user maintained software

set IN=/share/bitcpt/S23/referenceGenomes/Arabidopsis_thaliana/tair10

Command set defines variable IN to be (=) the path to the TAIR10 reference genome

Command Options Breakdown

${STAR}

this is the command to run star software

It calls the preset variable 'STAR'

--runThreadN 10

set the number of threads, must match #BSUB -n value. Hazel maximum is 12!

--runMode genomeGenerate

STAR can do genome indexing and alignment. Here we're telling STAR that we want to index the genome.

--genomeSAindexNbases 12

default: 14
int: length (bases) of the SA pre-indexing string. Typically between 10 and 15. Longer strings will use much more memory, but allow faster searches. For small genomes, the parameter –genomeSAindexNbases must be scaled down to min(14, log2(GenomeLength)/2 - 1).
set to star manual's recommendation for Arabidopsis genome size
Wolframalpha link shows math where 119667750 is Arabidopsis genome size:
- The theoretical genome length of Arabidposis is 135 Mb, so log2(135000000)/2-1 = ~12.5 , always best to round (down) to the nearest whole number.
- The genome assembly length of Arabidposis is 119.7 Mb, so log2(119667750)/2-1 = 12.4 and we round down to 12

--genomeDir starindices/

Tell STAR where to write the output

--genomeFastaFiles ${IN}/1.fas ${IN}/2.fas ${IN}/3.fas ${IN}/4.fas ${IN}/5.fas ${IN}Pt.fas ${IN}/Mt.fas

Tell STAR where the genome assembly file(s) are

--sjdbGTFfile ${IN}/TAIR10.E41.protein_coding.gtf

Tell STAR where the annotation file is

--sjdbOverhang 100

Length of RNA-seq reads - 1. However, from the manual: In most cases, the default value of 100 will work as well as the ideal value.

Output:

ls -lh starindices/

total 7.5G

-rw-r--r--. 1 eedelore bitcpt 116 Nov 20 20:12 chrLength.txt

-rw-r--r--. 1 eedelore bitcpt 194 Nov 20 20:12 chrNameLength.txt

-rw-r--r--. 1 eedelore bitcpt 78 Nov 20 20:12 chrName.txt

-rw-r--r--. 1 eedelore bitcpt 131 Nov 20 20:12 chrStart.txt

-rw-r--r--. 1 eedelore bitcpt 6.6M Nov 20 20:12 exonGeTrInfo.tab

-rw-r--r--. 1 eedelore bitcpt 2.6M Nov 20 20:12 exonInfo.tab

-rw-r--r--. 1 eedelore bitcpt 2.2M Nov 20 20:12 geneInfo.tab

-rw-r--r--. 1 eedelore bitcpt 822M Nov 20 20:32 Genome

-rw-r--r--. 1 eedelore bitcpt 682 Nov 20 20:32 genomeParameters.txt

-rw-r--r--. 1 eedelore bitcpt 7.9K Nov 20 20:33 Log.out

-rw-r--r--. 1 eedelore bitcpt 6.7G Nov 20 20:33 SA

-rw-r--r--. 1 eedelore bitcpt 5.9M Nov 20 20:33 SAindex

-rw-r--r--. 1 eedelore bitcpt 4.4M Nov 20 20:30 sjdbInfo.txt

-rw-r--r--. 1 eedelore bitcpt 4.9M Nov 20 20:12 sjdbList.fromGTF.out.tab

-rw-r--r--. 1 eedelore bitcpt 4.0M Nov 20 20:30 sjdbList.out.tab

-rw-r--r--. 1 eedelore bitcpt 2.6M Nov 20 20:12 transcriptInfo.tab

Arabidopsis Alignment Code

Here is a generalized STAR alignment code of Arabidposis. Find the full code in /share/bitcpt/S23/scripts/At.star.align.sh

#!/bin/tcsh

#BSUB -J staralign_At #job name

#BSUB -n 12 #number of threads

#BSUB -W 10:0 #time for job to complete

#BSUB -R span[hosts=1] #to keep tasks on one node

#BSUB -R "rusage[mem=0.2]" #to request a node with 20MB of memory

#BSUB -o staralign_At_%J.out #output file

#BSUB -e staralign_At_%J.err #error file

#Script to align RNA-seq reads to indexed genome using STAR

#STAR cannot make use of HPC MPI, must have -R options to set 1 node & memory

#set threads under 12 on Hazel

#input of indexed genome path is /share/bitcpt/S23/UNITYID/At/starindices

#input of sequence reads path is /share/bitcpt/S23/cleandata/Arabidopsis_thaliana/

#output of aligned reads will go into AlignedToTranscriptome subdirectory in wor

king directory

# SET SCRIPT VARIABLES

set STAR=/usr/local/usrapps/bitcpt/star/bin/STAR

set IN=/share/bitcpt/S23/cleandata/Arabidopsis_thaliana

set index=starindices

set out=AlignedToTranscriptome

################################

## Leaf Rep 1

################################

# SET SAMPLE VARIABLES

# RNA-seq data are in format Col-0_Leaf_Rep1_1.fp.fq.gz

set S=Col-0_Leaf_Rep1

set EN=fp.fq.gz

# Print the file name to make sure it is right

echo ${IN}/${S}_1.${EN}

#Run STAR to At alignReads with all specified alignment options

${STAR}

--runThreadN 12

--runMode alignReads

--genomeDir starindices/

--outFileNamePrefix ${out}/${S}_

--readFilesIn ${IN}/${S}_1.${EN} ${IN}/${S}_2.${EN}

--readFilesCommand zcat

--outSAMtype BAM Unsorted

--twopassMode Basic

--quantMode TranscriptomeSAM

Set Script Variables

set STAR=/usr/local/usrapps/bitcpt/star/bin/STAR

Command set defines variable STAR to be (=) the path to the user maintained software

set IN=/share/bitcpt/S23/cleandata/Arabidopsis_thaliana

Command set defines variable IN to be (=) the path to the RNA sequence reads that have passed QC and are trimmed AKA "clean"

set index=starindices

Command set defines variable index to be (=) the starindices directory containing the successfully indexed reference genome

set out=AlignedToTranscriptome

Command set defines variable out to be (=) the AlignedToTranscriptome directory where we desire the output to be written

Command Option breakdown:

${STAR}

this is the command to run star software when defined as a ${VARIABLE}

It calls the preset variable 'STAR' from the Set Script Variables section

--runThreadN 20

set the number of threads, must match #BSUB -n value

--runMode alignReads

STAR software has several different functions. Previously, we used STAR to index the genome. Now we're telling STAR we want it to align reads.

--genomeDir /starindices/

directory where STAR index files are located

--outFileNamePrefix AlignedToTranscriptome/At_SAM

directory/file-prefix for writing output

--readFilesIn ${IN}/At-Leaf1_L02_1.fp.fq.gz ${IN}/At-Leaf1_L02_2.fp.fq.gz

input files (PE)

--readFilesCommand zcat

zcat reads gzipped input files

--outSAMtype BAM Unsorted

output type set to an unsorted bam

--twopassMode Basic

have STAR align all reads (1st pass), add those discovered junction sites to the genome index and then align reads again (2nd pass). Improves the number of reads detected per splice site.

--quantMode TranscriptomeSAM

generates the Aligned.toTranscriptome.out.bam file, which we need for Quantification step

Output:

ls -lh AlignedToTranscriptome/

total 12G

-rw-r--r--. 1 eedelore bitcpt 6.9G Nov 2 22:55 At_SAM_Aligned.out.bam

-rw-r--r--. 1 eedelore bitcpt 4.7G Nov 2 22:55 At_SAM_Aligned.toTranscriptome.out.bam

-rw-r--r--. 1 eedelore bitcpt 2.0K Nov 2 22:55 At_SAM_Log.final.out

-rw-r--r--. 1 eedelore bitcpt 12K Nov 2 22:55 At_SAM_Log.out

-rw-r--r--. 1 eedelore bitcpt 3.8K Nov 2 22:55 At_SAM_Log.progress.out

-rw-r--r--. 1 eedelore bitcpt 5.0M Nov 2 22:55 At_SAM_SJ.out.tab

drwx--S---. 2 eedelore bitcpt 4.0K Nov 2 22:34 At_SAM_STARgenome

drwx--S---. 2 eedelore bitcpt 4.0K Nov 2 22:34 At_SAM_STARpass1

Soy Index Code

make sure to check --sjdbOverhang length matches the soy data (find in the cleandata/ fastp html files)

FOR YOUR FINAL PORTFOLIO

Everyone on your research team will be using the Glycine_max_Lee_v2 reference genome and the same dataset for your Team analysis. For your individual analysis, you must index your individual project reference genome. You will also align ALL of the raw data samples of RNA-seq reads to your reference genome. You will have to adapt both an indexing and alignment script for this final portfolio task.

Page updated

Report abuse