Picture obtained from https://pxhere.com/
Glycine max (Williams 82 Genome Assembly) indexing
#!/bin/tcsh
#BSUB -J starindices_Soy_CCV-1 #job name
#BSUB -n 12 #number of cores
#BSUB -W 2:0 #maximum wall-clock time for the job (2 hours)
#BSUB -o starindices.out.%J #output file
#BSUB -e starindices.err.%J #error file
# For running star to generate genome index
# Run in working directory /share/bitcpt/S23/cacofre/Portfolio
# Must run this in working directory with subdirectory named starindices/
set STAR=/usr/local/usrapps/bitcpt/star/bin/STAR
set IN=/share/bitcpt/S23/referenceGenomes/Portfolios/Glycine_max_Wm82-ISU01_v2/
${STAR} --runThreadN 12 --runMode genomeGenerate --genomeSAindexNbases 13 --genomeDir starindices/ --genomeFastaFiles ${IN}/glyma.Wm82_ISU01.gnm2.JFPQ.genome_main.fna --sjdbGTFfile ${IN}/glyma.Wm82_ISU01.gnm2.ann1.FGFB.gene_models_main.AGAT.gtf --sjdbOverhang 100
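The script comments note that the working directory needs a subdirectory named starindices/ before the job runs. A minimal sketch of that preparation step, assuming the command is typed from the working directory itself:

```shell
# Create the output directory that --genomeDir points to.
# -p makes this a no-op if starindices/ already exists.
mkdir -p starindices
```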
Set Variables
set STAR=/usr/local/usrapps/bitcpt/star/bin/STAR
The set command defines the variable STAR as the path to the user-maintained STAR executable
set IN=/share/bitcpt/S23/referenceGenomes/Portfolios/Glycine_max_Wm82-ISU01_v2/
The set command defines the variable IN as the path to the reference genome directory
Command Options Breakdown
${STAR}
This is the command that runs the STAR software
It expands the previously set variable 'STAR'
--runThreadN 12
Sets the number of threads; this must match the #BSUB -n value. The maximum on Hazel is 12.
--runMode genomeGenerate
STAR can do genome indexing and alignment. Here we're telling STAR that we want to index the genome.
--genomeSAindexNbases 13
default: 14
int: length (bases) of the SA pre-indexing string. Typically between 10 and 15. Longer strings will use much more memory, but allow faster searches. For small genomes, the parameter --genomeSAindexNbases must be scaled down to min(14, log2(GenomeLength)/2 - 1).
Set to 13 here, following the STAR manual's recommendation for the soybean genome size
The math was checked with WolframAlpha, using ~1015016903 bases as the soybean genome size:
The theoretical genome length of soybean is about 1.1 Gb. With 1015016903 bases, log2(1015016903)/2 - 1 = ~13.96, which is below the cap of 14, so it is rounded down to the nearest whole number (13).
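The arithmetic above can be reproduced on the command line. This is a small sketch using awk rather than anything from the STAR distribution; the variable name len is just an illustrative choice, and the genome size is the value quoted above.

```shell
# min(14, log2(GenomeLength)/2 - 1), rounded down to a whole number.
# awk's log() is the natural log, so dividing by log(2) gives log base 2.
awk -v len=1015016903 'BEGIN { v = log(len) / log(2) / 2 - 1; if (v > 14) v = 14; printf "%d\n", int(v) }'
```

For this genome size the command prints 13, the value passed to --genomeSAindexNbases.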
--genomeDir starindices/
Tell STAR where to write the output
--genomeFastaFiles ${IN}/glyma.Wm82_ISU01.gnm2.JFPQ.genome_main.fna
Tell STAR where the genome assembly file(s) are
--sjdbGTFfile ${IN}/glyma.Wm82_ISU01.gnm2.ann1.FGFB.gene_models_main.AGAT.gtf
Tell STAR where the annotation file is
--sjdbOverhang 100
Ideally, the length of the RNA-seq reads minus 1. However, from the manual: "In most cases, the default value of 100 will work as well as the ideal value."
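If the ideal value is wanted instead of the default, the read length can be measured from a FASTQ file. The sketch below is only an illustration: demo_reads.fastq is a hypothetical two-read file created on the spot (not real data), and in practice the awk command would be pointed at an actual, uncompressed FASTQ file.

```shell
# Hypothetical demo file with two reads of 10 and 8 bases (not real data).
printf '@read1\nACGTACGTAC\n+\nIIIIIIIIII\n@read2\nACGTACGT\n+\nIIIIIIII\n' > demo_reads.fastq
# Sequence lines are every 4th line starting at line 2; print the longest read length minus 1.
awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) } END { print max - 1 }' demo_reads.fastq
```

For the demo file this prints 9 (longest read of 10 bases, minus 1); reads of 101 bases would give 100, matching the value used in the script.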
If the job completes successfully, the command "more starindices.err.XXXXX" will show no message (the error file is empty), whereas the "Successfully completed" message will appear in the starindices.out.XXXXX file.