STAR (Spliced Transcripts Alignment to a Reference) indexing
Reference genome assemblies come with gene locations and gene annotations (e.g., gene functions), especially for well-characterized organisms such as Arabidopsis thaliana.
** Look for splice-aware alignment software, such as STAR.
The diploid potato reference genome assembly is in FASTA format, its annotation is in GTF format, and the quality-checked sequences (reads) are in FASTA files. sjdbOverhang should be set to the maximum read length minus 1; for 100-base reads that is 100 - 1 = 99, which is a safe option.
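The sjdbOverhang arithmetic can be sketched in a few lines of Python. This is only an illustration, not part of the STAR workflow itself: the function name `sjdb_overhang` and the example read lengths are hypothetical, and the rule it applies (longest read minus 1) is the one the STAR manual gives for reads of varying length.

```python
def sjdb_overhang(read_lengths):
    """Return max(read length) - 1, the value suggested for --sjdbOverhang."""
    lengths = list(read_lengths)
    if not lengths:
        raise ValueError("no read lengths supplied")
    return max(lengths) - 1


# Hypothetical read lengths after quality trimming (not taken from the real data)
example_lengths = [100, 98, 75, 100]
print(sjdb_overhang(example_lengths))  # 99
```

If trimming left no read longer than 100 bases, this gives the 99 used in the script.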
Script below:
#!/bin/tcsh
#BSUB -J starindices_Tom #job name
#BSUB -n 10 #number of cores (matches --runThreadN 10)
#BSUB -W 2:0 #wall-clock time limit (2 hours)
#BSUB -o starindices.out.%J #output file
#BSUB -e starindices.err.%J #error file
# For running star to generate genome index
# Run in working directory /share/bitcpt/Fall2022/maharry/Tom
# Must run this in a working directory that contains a subdirectory named starindices
module load conda
conda activate /usr/local/usrapps/bitcpt/star
# Path to the reference genome directory
set IN=/share/bitcpt/Fall2022/referenceGenomes/Solanum_lycopersicum/Portfolio/Pot-Landrace
# Run STAR in genomeGenerate mode: --genomeFastaFiles and --sjdbGTFfile tell STAR where the genome FASTA and GTF annotation files are, and --genomeDir starindices is where the index files are written
STAR --runThreadN 10 --runMode genomeGenerate --genomeSAindexNbases 13 --genomeDir starindices --genomeFastaFiles ${IN}/Pot-Landrace/Diploid-potato_assembly.fasta --sjdbGTFfile ${IN}/Pot-Landrace/Diploid-potato.agat.gtf --sjdbOverhang 99
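The script sets --genomeSAindexNbases 13 without saying where the number comes from. The STAR manual suggests scaling this parameter down from its default of 14 for smaller genomes, using min(14, log2(GenomeLength)/2 - 1). The sketch below is one way to check a value under that rule. The helper names are hypothetical, rounding down to an integer is an assumption (the manual's own examples are approximate), and the 800 Mb figure is an illustrative length rather than the measured size of this assembly.

```python
import math


def fasta_length(lines):
    """Total number of bases in FASTA text, skipping '>' header lines."""
    return sum(len(line.strip()) for line in lines if not line.startswith(">"))


def genome_sa_index_nbases(genome_length):
    """min(14, log2(GenomeLength)/2 - 1), rounded down (rounding is an assumption)."""
    if genome_length < 4:
        raise ValueError("genome length too small for this rule")
    return min(14, math.floor(math.log2(genome_length) / 2 - 1))


# Illustrative genome length only; substitute the real total from the FASTA file
print(genome_sa_index_nbases(800_000_000))  # 13
```

To use the real length, `fasta_length(open(path))` could be run on the assembly FASTA and its result passed to `genome_sa_index_nbases`.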