Alignment

Sequence alignment

Arabidopsis Genome Index Code

Script

Output:

Arabidopsis Alignment Code

Output:

Tomato Index Code

FOR YOUR FINAL PORTFOLIO

Sequence alignment

We've checked the quality of our RNA-seq data with FastQC and trimmed it with Fastp. We're now ready to align the data to our reference genome using STAR (Spliced Transcripts Alignment to a Reference) software.

Find the STAR manual here.
Find the STAR publication here.

The STAR alignment consists of two steps:

1) Index the genome with STAR

2) Align data to indexed genome with STAR

The trimmed Arabidopsis reads are in this directory: /share/bitcpt/Fall2022/CleanData/Arabidopsis_thaliana

The trimmed Tomato reads are in this directory: /share/bitcpt/Fall2022/CleanData/Solanum_lycopersicum

Arabidopsis Genome Assembly directory: /share/bitcpt/Fall2022/Genomes/Arabidopsis_thaliana

Tomato Genome Assembly directory: /share/bitcpt/Fall2022/Genomes/Solanum_lycopersicum

Arabidopsis Genome Index Code

Script

#!/bin/tcsh

#BSUB -J starindices_At_GroupName #job name

#BSUB -n 10 #number of nodes

#BSUB -W 2:0 #time for job to complete

#BSUB -o starindices.out.%J #output file

#BSUB -e starindices.err.%J #error file

# For running star to generate genome index

# Run in working directory /share/bitcpt/Fall2022/UnityID/At

# Must run this in working directory with subdirectory named starindices/

module load conda

conda activate /usr/local/usrapps/bitcpt/star

set IN=/share/bitcpt/Fall2022/referenceGenomes/Arabidopsis_thaliana/tair10

STAR \

--runThreadN 10

--runMode genomeGenerate

--genomeSAindexNbases 12

--genomeDir starindices/

--genomeFastaFiles ${IN}/1.fas ${IN}/2.fas ${IN}/3.fas ${IN}/4.fas ${IN}/5.fas ${IN}Pt.fas ${IN}/Mt.fas

--sjdbGTFfile ${IN}/TAIR10.E41.protein_coding.gtf

--sjdbOverhang 36

Option Breakdown

STAR

this is the command to run star software

--runThreadN 10

set the number of threads, must match #BSUB -n value

--runMode genomeGenerate

STAR can do genome indexing and alignment. Here we're telling STAR that we want to index the genome.

--genomeSAindexNbases 12

default: 14
int: length (bases) of the SA pre-indexing string. Typically between 10 and 15. Longer strings will use much more memory, but allow faster searches. For small genomes, the parameter –genomeSAindexNbases must be scaled down to min(14, log2(GenomeLength)/2 - 1).
set to star manual's recommendation for Arabidopsis genome size
Wolframalpha link shows math where 119667750 is Arabidopsis genome size:
- The theoretical genome length of Arabidposis is 135 Mb, so log2(135000000)/2-1 = ~12.5 , always best to round (down) to the nearest whole number.
- The genome assembly length of Arabidposis is 119.7 Mb, so log2(119667750)/2-1 = 12.4 and we round down to 12

--genomeDir starindices/

Tell STAR where to write the output

--genomeFastaFiles ${IN}/1.fas ${IN}/2.fas ${IN}/3.fas ${IN}/4.fas ${IN}/5.fas ${IN}Pt.fas ${IN}/Mt.fas

Tell STAR where the genome assembly file(s) are

--sjdbGTFfile ${IN}/TAIR10.E41.protein_coding.gtf

Tell STAR where the annotation file is

--sjdbOverhang 36

Length of RNA-seq reads - 1. After cleaning, our Arabidopsis reads are 37 bp. You can find these numbers in the fastp outputs in CleanData/

From the manual: In most cases, the default value of 100 will work as well as the ideal value.

Output:

ls -lh starindices/

total 7.5G

-rw-r--r--. 1 eedelore bitcpt 116 Nov 20 20:12 chrLength.txt

-rw-r--r--. 1 eedelore bitcpt 194 Nov 20 20:12 chrNameLength.txt

-rw-r--r--. 1 eedelore bitcpt 78 Nov 20 20:12 chrName.txt

-rw-r--r--. 1 eedelore bitcpt 131 Nov 20 20:12 chrStart.txt

-rw-r--r--. 1 eedelore bitcpt 6.6M Nov 20 20:12 exonGeTrInfo.tab

-rw-r--r--. 1 eedelore bitcpt 2.6M Nov 20 20:12 exonInfo.tab

-rw-r--r--. 1 eedelore bitcpt 2.2M Nov 20 20:12 geneInfo.tab

-rw-r--r--. 1 eedelore bitcpt 822M Nov 20 20:32 Genome

-rw-r--r--. 1 eedelore bitcpt 682 Nov 20 20:32 genomeParameters.txt

-rw-r--r--. 1 eedelore bitcpt 7.9K Nov 20 20:33 Log.out

-rw-r--r--. 1 eedelore bitcpt 6.7G Nov 20 20:33 SA

-rw-r--r--. 1 eedelore bitcpt 5.9M Nov 20 20:33 SAindex

-rw-r--r--. 1 eedelore bitcpt 4.4M Nov 20 20:30 sjdbInfo.txt

-rw-r--r--. 1 eedelore bitcpt 4.9M Nov 20 20:12 sjdbList.fromGTF.out.tab

-rw-r--r--. 1 eedelore bitcpt 4.0M Nov 20 20:30 sjdbList.out.tab

-rw-r--r--. 1 eedelore bitcpt 2.6M Nov 20 20:12 transcriptInfo.tab

Arabidopsis Alignment Code

Here is a generalized STAR alignment code of Arabidposis. Find the full code in /share/bitcpt/Fall2022/scripts/At.star.align.sh

#!/bin/tcsh

#BSUB -J staralign_At #job name

#BSUB -n 12 #number of threads

#BSUB -W 10:0 #time for job to complete

#BSUB -o staralign_At_%J.out #output file

#BSUB -e staralign_At_%J.err #error file

cd /share/bitcpt/Fall2022/UnityID/At

module load conda

conda activate /usr/local/usrapps/bitcpt/star

# SET VARIABLES

set IN=/share/bitcpt/Fall2022/CleanData/Arabidopsis_thaliana

set index=starindices

set out=AlignedToTranscriptome

################################

## Leaf Rep 1

################################

# RNA-seq data are in format Col-0_Leaf_Rep1_1.fp.fq.gz

set S=Col-0_Leaf_Rep1

set EN=fp.fq.gz

# Print the file name to make sure it is right

echo ${IN}/${S}_1.${EN}

STAR

--runThreadN 12

--runMode alignReads

--genomeDir starindices/

--outFileNamePrefix ${out}/${S}_

--readFilesIn ${IN}/${S}_1.${EN} ${IN}/${S}_2.${EN}

--readFilesCommand zcat

--outSAMtype BAM Unsorted

--twopassMode Basic

--quantMode TranscriptomeSAM

Option breakdown:

--runThreadN 20

set the number of threads, must match #BSUB -n value

--runMode alignReads

STAR software has several different functions. Previously, we used STAR to index the genome. Now we're telling STAR we want it to align reads.

--genomeDir ../starindices/

directory where STAR index files are located

--outFileNamePrefix AlignedToTranscriptome/At_SAM

directory/file-prefix for writing output

--readFilesIn ${IN}/At-Leaf1_L02_1.fp.fq.gz ${IN}/At-Leaf1_L02_2.fp.fq.gz

input files (PE)

--readFilesCommand zcat

zcat reads gzipped input files

--outSAMtype BAM Unsorted

output type set to an unsorted bam

--twopassMode Basic

have STAR align all reads (1st pass), add those discovered junction sites to the genome index and then align reads again (2nd pass). Improves the number of reads detected per splice site.

--quantMode TranscriptomeSAM

generates the Aligned.toTranscriptome.out.bam file, which we need for Quantification step

Output:

ls -lh AlignedToTranscriptome/

total 12G

-rw-r--r--. 1 eedelore bitcpt 6.9G Nov 2 22:55 At_SAM_Aligned.out.bam

-rw-r--r--. 1 eedelore bitcpt 4.7G Nov 2 22:55 At_SAM_Aligned.toTranscriptome.out.bam

-rw-r--r--. 1 eedelore bitcpt 2.0K Nov 2 22:55 At_SAM_Log.final.out

-rw-r--r--. 1 eedelore bitcpt 12K Nov 2 22:55 At_SAM_Log.out

-rw-r--r--. 1 eedelore bitcpt 3.8K Nov 2 22:55 At_SAM_Log.progress.out

-rw-r--r--. 1 eedelore bitcpt 5.0M Nov 2 22:55 At_SAM_SJ.out.tab

drwx--S---. 2 eedelore bitcpt 4.0K Nov 2 22:34 At_SAM_STARgenome

drwx--S---. 2 eedelore bitcpt 4.0K Nov 2 22:34 At_SAM_STARpass1

Tomato Index Code

make sure to check --sjdbOverhang length matches the tomato data (find in the CleanData/ fastp html files)

FOR YOUR FINAL PORTFOLIO

Everyone on your research team will be using the M82_vTS3 reference genome and the same dataset for your Team analysis. For your individual analysis, you must index your individual project reference genome and align your team RNA-seq reads to that genome. You will have to adapt both an indexing and alignment script for this final portfolio task.

Page updated

Report abuse