ssh UnityID@login.hpc.ncsu.edu
# In the UnityID directory, create these specific directories
mkdir At
mkdir Tom
mkdir Portfolio
# Create the following directories in both At and Tom directories:
mkdir AlignedToTranscriptome
mkdir fastqc
mkdir salmon_align_quant
mkdir starindices
mkdir starOutputfiles
mkdir transcriptome
# Load conda into mobaxterm and activate
module load conda
conda activate /usr/local/usrapps/bitcpt/fastqc
# View help manual by writing
fastqc -h
# Script for running the job:
#!/bin/tcsh
#BSUB -J fastqc_At_GroupName #job name
#BSUB -n 20 #number of nodes
#BSUB -W 2:0 #time for job to complete
#BSUB -o fastqc.out.%J #output file
#BSUB -e fastqc.err.%J #error file
fastqc /share/bitcpt/Fall2022/RawData/Arabidopsis_thaliana* -t 20 -outdir ./fastqc
# Copy the script from Dr Sjogren into “At” directory
cp /share/bitcpt/Fall2022/scripts/At.fastqc.sh
# repeat copying the script into Tom directory and change name
Cp /share/bitcpt/Fall2022/scripts/At.fastqc.sh Tom.fastqc.sh
# cp function can be used to rename copied folder by putting the new name at the end of the script
# The script can be modified by using vi script
Vi Tom.fastqc.sh
# the file was changed to be accurate to the tom files
# submit work for both At and Tom
Bsub <At.fastqc.sh
Or
Bsub <Tom.fastqc.sh
# check directory for error files being generated or use script <Bjobs> to check for work in progress
# Inspect the error file
more fastqc.err.JOB#
Complete this step via Globus File Manager
Col-0_Leaf_Rep1_1.fq.gz
Basic statistics - sequence length = 100 bp
Per base sequence quality - Good = all green
Per sequence quality scores - Mean = 36
Per base sequence content - Good= linearizes after the initial noise
Per sequence GC content - Good because it still follows a bell curve (47%)
Per base N content - horizontal at 0, good
Sequence length distribution - Good
Sequence duplication levels - Bad (Percent of seqs remaining if deduplicated = 22.38%
Overrepresented sequences - 5 sequences listed, Warning
Adapter content - Good
Col-0_Leaf_Rep1_2.fq.gz
Basic statistics - sequence length = 100 bp
Per base sequence quality - Good = all green
Per sequence quality scores - Median
Per base sequence content - Good= linearizes after the initial noise
Per sequence GC content - Good because it still follows a bell curve (47%)
Per base N content - horizontal at 0, good
Sequence length distribution - Good
Sequence duplication levels - Percent of sequences remaining if duplicated: 28.55%
Overrepresented sequences - one sequence overrepresented, warning
Adapter content - Good
Col-0_Leaf_Rep2_1.fq.gz
Basic statistics - sequence length = 100 bp
Per base sequence quality - Good = all green
Per sequence quality scores - Median = 36
Per base sequence content - Good= linearizes after the initial noise
Per sequence GC content - Good because it still follows a bell curve (47%)
Per base N content - horizontal at 0, good
Sequence length distribution - Good
Sequence duplication levels: Percent of sequences remaining if duplicated: 20.11%
Overrepresented sequences: 6 sequences overrepresented
Adapter content -Good
Col-0_Leaf_Rep2_2.fq.gz
Basic statistics - sequence length = 100 bp
Per base sequence quality - Good = mostly, green, some yellow
Per sequence quality scores - Median = 36
Per base sequence content - Good= linearizes after the initial noise
Per sequence GC content - Good because it still follows a bell curve (47%)
Per base N content - horizontal at 0, good
Sequence length distribution - Good
Sequence duplication levels: Percent of sequences remaining if duplicated: 26.12%
Overrepresented sequences: 1 overrepresented sequence
Adapter content - Good
Col-0_SAM_rep1_L002_R1.fq.gz
Basic statistics - sequence length = 50 bp
Per base sequence quality - Good = very short
Per sequence quality scores - Median
Per base sequence content - Good= linearizes after the initial noise
Per sequence GC content - Good because it still follows a bell curve (45%)
Per base N content - horizontal at 0, good
Sequence length distribution - Good
Sequence duplication levels - Bad (Percent of seqs remaining if deduplicated = 26.78%
Overrepresented sequences - 2 overrepresented sequence
Adapter content
Col-0_SAM_rep1_L002_R2.fq.gz
Basic statistics - sequence length = 50 bp
Per base sequence quality - Good = all green
Per sequence quality scores - Median
Per base sequence content - Good= linearizes after the initial noise
Per sequence GC content - Good because it still follows a bell curve (45%)
Per base N content - horizontal at 0, good
Sequence length distribution - Good
Sequence duplication levels - Bad (Percent of seqs remaining if deduplicated = 28.63%
Overrepresented sequences - 4 overrepresented sequence
Adapter content - Good
Col-0_SAM_rep2_L002_R1.fq.gz
Basic statistics - sequence length = 50 bp
Per base sequence quality - Good = all green
Per sequence quality scores - Median
Per base sequence content - Good= linearizes after the initial noise
Per sequence GC content - Good because it still follows a bell curve (45%)
Per base N content - horizontal at 0, good
Sequence length distribution - Good
Sequence duplication levels - Bad (Percent of seqs remaining if deduplicated = 23.41%
Overrepresented sequences - 3 overrepresented sequence
Adapter content - Good
Col-0_SAM_rep2_L002_R2.fq.gz
Basic statistics - sequence length = 50 bp
Per base sequence quality - Good = all green
Per sequence quality scores - Median
Per base sequence content - Good= linearizes after the initial noise
Per sequence GC content - Good because it still follows a bell curve (45%)
Per base N content - horizontal at 0, good
Sequence length distribution - Good
Sequence duplication levels - Bad (Percent of seqs remaining if deduplicated = 24.93%
Overrepresented sequences - 6 overrepresented sequence
Adapter content - Good
Col-0_SAM_rep3_L002_R1.fq.gz
Basic statistics - sequence length = 50 bp
Per base sequence quality - Good = all green
Per sequence quality scores - Median
Per base sequence content - Good= linearizes after the initial noise
Per sequence GC content - Good because it still follows a bell curve (45%)
Per base N content - horizontal at 0, good
Sequence length distribution - Good
Sequence duplication levels - Bad (Percent of seqs remaining if deduplicated = 26.87%
Overrepresented sequences - 2 overrepresented sequence
Adapter content - Good
Col-0_SAM_rep3_L002_R2.fq.gz
Basic statistics - sequence length = 50 bp
Per base sequence quality - Good = all green
Per sequence quality scores - Median
Per base sequence content - Good= linearizes after the initial noise
Per sequence GC content - Good because it still follows a bell curve (45%)
Per base N content - horizontal at 0, good
Sequence length distribution - Good
Sequence duplication levels - Bad (Percent of seqs remaining if deduplicated = 28.14%
Overrepresented sequences: 1 overrepresented sequence
Adapter content - Good
The Fastqc data we have received for the 3x tomato sequences have similar quality compared to the data we have analyzed for the Arabidopsis sequences. We are required to still trim most of the sequences but overall quality are very good.
These sequences are good enough for RNA-seq analysis since some of the error warnings are characteristic of RNA-seq, and the 'Per base sequence content' can be improved by trimming. Specific trimming measures will be discussed further in the following section.
#Copy the index script into the At directory
#directory: /share/bitcpt/Fall2022/UnityID/At
cp /share/bitcpt/Fall2022/scripts/At.starindex.sh .
ll #to make sure 'At.starindex.sh' is located in the At directory
#To look at the content of the index script
more At.starindex.sh
#To edit the script
vi At.starindex.sh #press I to edit, "esc" to stop editing, and type ":wq" to exit the file.
#Arabidopsis index script, make sure to have the starindices directory in At
#!/bin/tcsh
#BSUB -J starindices_At_CherryTomatoesAt #job name
#BSUB -n 10 #number of nodes
#BSUB -W 2:0 #time for job to complete
#BSUB -o starindices.out.%J #output file
#BSUB -e starindices.err.%J #error file
# For running star to generate genome index
# Run in working directory /share/bitcpt/Fall2022/UnityID/At
# Must run this in working directory with subdirectory named /starindices
module load conda
conda activate /usr/local/usrapps/bitcpt/star
set IN=/share/bitcpt/Fall2022/referenceGenomes/Arabidopsis_thaliana/tair10
STAR --runThreadN 10 --runMode genomeGenerate --genomeSAindexNbases 12 --genomeD
ir starindices --genomeFastaFiles ${IN}/1.fas ${IN}/2.fas ${IN}/3.fas ${IN}/4.fa
s ${IN}/5.fas ${IN}/Pt.fas ${IN}/Mt.fas --sjdbGTFfile ${IN}/TAIR10.E41.protein_c
oding.gtf --sjdbOverhang 36
#Run the index Arabidopsis script
bsub <At.starindex.sh
#To look at your submitted jobs
bjobs
ll starindices #To look at the output files of starindices
total 1290755
-rw-r--r--. 1 ehernan6 bitcpt 59 Nov 22 13:55 chrLength.txt
-rw-r--r--. 1 ehernan6 bitcpt 75 Nov 22 13:55 chrNameLength.txt
-rw-r--r--. 1 ehernan6 bitcpt 16 Nov 22 13:55 chrName.txt
-rw-r--r--. 1 ehernan6 bitcpt 68 Nov 22 13:55 chrStart.txt
-rw-r--r--. 1 ehernan6 bitcpt 9604848 Nov 22 13:55 exonGeTrInfo.tab
-rw-r--r--. 1 ehernan6 bitcpt 4006498 Nov 22 13:55 exonInfo.tab
-rw-r--r--. 1 ehernan6 bitcpt 914934 Nov 22 13:55 geneInfo.tab
-rw-r--r--. 1 ehernan6 bitcpt 130148145 Nov 22 13:56 Genome
-rw-r--r--. 1 ehernan6 bitcpt 1829 Nov 22 13:56 genomeParameters.txt
-rw-r--r--. 1 ehernan6 bitcpt 8717 Nov 22 13:56 Log.out
-rw-r--r--. 1 ehernan6 bitcpt 1063527691 Nov 22 13:56 SA
-rw-r--r--. 1 ehernan6 bitcpt 97867203 Nov 22 13:56 SAindex
-rw-r--r--. 1 ehernan6 bitcpt 3413660 Nov 22 13:56 sjdbInfo.txt
-rw-r--r--. 1 ehernan6 bitcpt 3483166 Nov 22 13:55 sjdbList.fromGTF.out.tab
-rw-r--r--. 1 ehernan6 bitcpt 2752061 Nov 22 13:56 sjdbList.out.tab
-rw-r--r--. 1 ehernan6 bitcpt 2684730 Nov 22 13:55 transcriptInfo.tab
#!/bin/tcsh
#BSUB -J starindices_Tom_CherryTom #job name
#BSUB -n 10 #number of nodes
#BSUB -W 2:0 #time for job to complete
#BSUB -o starindices.out.%J #output file
#BSUB -e starindices.err.%J #error file
# For running star to generate genome index
# Run in working directory /share/bitcpt/Fall2022/UnityID/Tom
#for the portfolio, work in the portfolio directory and add the sepcific path (/Portolio)
# Must run this in working directory with subdirectory named /starindices
module load conda
conda activate /usr/local/usrapps/bitcpt/star
set IN=/share/bitcpt/Fall2022/referenceGenomes/Solanum_lycopersicum/M82_vTS3
STAR --runThreadN 10 --runMode genomeGenerate --genomeSAindexNbases 13 --genomeDir starindices --genomeFastaFiles ${IN}/M82vTS3_assembly.fasta --sjdbGTFfile ${IN}/M82vTS3.agat.gtf --sjdbOverhang 58
#the red mark is for us to change based on the name of the portfolio file
#!/bin/tcsh
#BSUB -J Tom_staralign #job name
#BSUB -n 12 #number of threads
#BSUB -W 10:0 #time for job to complete
#BSUB -R span[hosts=1] #to keep tasks on one node
#BSUB -R "rusage[mem=20000]" #to request a node with 20MB of memory
#BSUB -o Tom_staralign_%J.out #output file
#BSUB -e Tom_staralign_%J.err #error file
#to align RNA-seq reads to indexed genome using STAR
#STAR cannot make use of HPC MPI, must have -R options to set 1 node & memory
#set threads under 12 on Henry2
#input of indexed genome path is /share/bitcpt/Fall2022/UNITYID/Tom/starindices
#for the portfolio, work in the portfolio directory and add the sepcific path (/Portolio)
#input of sequence reads path is /share/bitcpt/Fall2022/CleanData/Solanum_lycopersicum/
#output of aligned reads will go into STAR_align_Tom subdirectory in working directory
module load conda
conda activate /usr/local/usrapps/bitcpt/star
# SET IN VARIABLES
set IN=/share/bitcpt/Fall2022/CleanData/Solanum_lycopersicum
set index=starindices
set out=AlignedToTranscriptome
################################
## Leaf Rep 1
################################
# RNA-seq data are in format Sl_Leaf_Rep1_3X_1.fp.fq.gz
set S=Sl_Leaf_Rep1_3X
set EN=fp.fq.gz
# Print the file name to make sure it is right
echo ${IN}/${S}_1.${EN}
STAR --runThreadN 12 --runMode alignReads --genomeDir ${index} --outFileNamePrefix ${out}/${S}_ --readFilesIn ${IN}/${S}_1.${EN} ${IN}/${S}_2.${EN} --readFilesCommand zcat --outSAMtype BAM Unsorted --twopassMode Basic --quantMode TranscriptomeSAM
################################
## Leaf Rep 2
################################
# RNA-seq data are in format Sl_Leaf_Rep2_3X_1.fp.fq.gz
set S=Sl_Leaf_Rep2_3X
set EN=fp.fq.gz
# Print the file name to make sure it is right
echo ${IN}/${S}_1.${EN}
STAR --runThreadN 12 --runMode alignReads --genomeDir ${index} --outFileNamePrefix ${out}/${S}_ --readFilesIn ${IN}/${S}_1.${EN} ${IN}/${S}_2.${EN} --readFilesCommand zcat --outSAMtype BAM Unsorted --twopassMode Basic --quantMode TranscriptomeSAM
################################
## Leaf Rep 3
################################
# RNA-seq data are in format Sl_Leaf_Rep3_3X_1.fp.fq.gz
set S=Sl_Leaf_Rep3_3X
set EN=fp.fq.gz
# Print the file name to make sure it is right
echo ${IN}/${S}_1.${EN}
STAR --runThreadN 12 --runMode alignReads --genomeDir ${index} --outFileNamePrefix ${out}/${S}_ --readFilesIn ${IN}/${S}_1.${EN} ${IN}/${S}_2.${EN} --readFilesCommand zcat --outSAMtype BAM Unsorted --twopassMode Basic --quantMode TranscriptomeSAM
################################
## SAM Rep 1
################################
#RNA-seq data are in format Sl_SAM_Rep1_3X_1.fp.fq.gz
set S=Sl_SAM_Rep1_3X
set EN=fp.fq.gz
#Print the file name to make sure it is right
echo ${IN}/${S}_1.${EN}
STAR --runThreadN 12 --runMode alignReads --genomeDir ${index} --outFileNamePrefix ${out}/${S}_ --readFilesIn ${IN}/${S}_1.${EN} ${IN}/${S}_2.${EN} --readFilesCommand zcat --outSAMtype BAM Unsorted --twopassMode Basic --quantMode TranscriptomeSAM
################################
## SAM Rep 2
################################
#RNA-seq data are in format Sl_SAM_Rep2_3X_1.fp.fq.gz
set S=Sl_SAM_Rep2_3X
set EN=fp.fq.gz
#Print the file name to make sure it is right
echo ${IN}/${S}_1.${EN}
STAR --runThreadN 12 --runMode alignReads --genomeDir ${index} --outFileNamePrefix ${out}/${S}_ --readFilesIn ${IN}/${S}_1.${EN} ${IN}/${S}_2.${EN} --readFilesCommand zcat --outSAMtype BAM Unsorted --twopassMode Basic --quantMode TranscriptomeSAM
################################
## SAM Rep 3
################################
#RNA-seq data are in format Sl_SAM_Rep3_3X_1.fp.fq.gz
set S=Sl_SAM_Rep3_3X
set EN=fp.fq.gz
#Print the file name to make sure it is right
echo ${IN}/${S}_1.${EN}
STAR --runThreadN 12 --runMode alignReads --genomeDir ${index} --outFileNamePrefix ${out}/${S}_ --readFilesIn ${IN}/${S}_1.${EN} ${IN}/${S}_2.${EN} --readFilesCommand zcat --outSAMtype BAM Unsorted --twopassMode Basic --quantMode TranscriptomeSAM
################################
## SAM Rep 4
################################
#RNA-seq data are in format Sl_SAM_Rep4_3X_1.fp.fq.gz
set S=Sl_SAM_Rep4_3X
set EN=fp.fq.gz
#Print the file name to make sure it is right
echo ${IN}/${S}_1.${EN}
STAR --runThreadN 12 --runMode alignReads --genomeDir ${index} --outFileNamePrefix ${out}/${S}_ --readFilesIn ${IN}/${S}_1.${EN} ${IN}/${S}_2.${EN} --readFilesCommand zcat --outSAMtype BAM Unsorted --twopassMode Basic --quantMode TranscriptomeSAM
[lmlower@login03 starindices]$ tree
.
├── chrLength.txt
├── chrNameLength.txt
├── chrName.txt
├── chrStart.txt
├── exonGeTrInfo.tab
├── exonInfo.tab
├── geneInfo.tab
├── Genome
├── genomeParameters.txt
├── Log.out
├── SA
├── SAindex
├── sjdbInfo.txt
├── sjdbList.fromGTF.out.tab
├── sjdbList.out.tab
└── transcriptInfo.tab
pwd
/share/bitcpt/Fall2022/unityID/At
#for the portfolio, work in the portfolio directory and add the sepcific path (/Portolio)
## copy script to the current directory
cp /share/bitcpt/Fall2022/scripts/At.salmon.sh .
## edit the script by changing the unityID to your ID when setting the IN variable
vi At.salmon.sh
## submit the job
bsub <At.salmon.sh
##Look at the quant files
cd share/bitcpt/Fall2022/unityID/At
ll salmon_align_quant
total 3
drwxr-sr-x. 5 ehernan6 bitcpt 4096 Nov 28 14:57 Col-0_Leaf_Rep1.quant
drwxr-sr-x. 5 ehernan6 bitcpt 4096 Nov 28 14:59 Col-0_Leaf_Rep2.quant
drwxr-sr-x. 5 ehernan6 bitcpt 4096 Nov 28 15:01 Col-0_SAM_rep1_L002.quant
drwxr-sr-x. 5 ehernan6 bitcpt 4096 Nov 28 15:04 Col-0_SAM_rep2_L002.quant
drwxr-sr-x. 5 ehernan6 bitcpt 4096 Nov 28 15:06 Col-0_SAM_rep3_L002.quant
pwd
/share/bitcpt/Fall2022/unityID/Tom
#for the portfolio, work in the portfolio directory and add the sepcific path (/Portolio)
## copy script to the current directory
cp /share/bitcpt/Fall2022/scripts/At.salmon.sh Tom.salmon.sh .
## edit the script by changing the unityID to your ID when setting the IN variable
vi Tom.salmon.sh
#!/bin/tcsh
#BSUB -J salmonquant_Tom #job name
#BSUB -n 12 #number of threads
#BSUB -W 5:0 #time for job to complete
#BSUB -R "rusage[mem=20000]" #to request a node with 20MB of memory
#BSUB -o salmonquant_Tom.%J.out #output file
#BSUB -e salmonquant_Tom.%J.err #error file
#to quantify aligned reads using salmon in quasi indexing mode
#set threads under 12 on Henry2
#working directory path is /share/bitcpt/Fall2022/UnityID/Tom
#input of aligned reads path is /share/bitcpt/Fall2022/UnityID/Tom/AlignedToTranscriptome
#output of aligned reads will go into salmon_align_quant subdirectory in working directory
module load conda
conda activate /usr/local/usrapps/bitcpt/salmon
##########################
# Set the variables
##########################
set cdna=/share/bitcpt/Fall2022/referenceGenomes/Solanum_lycopersicum/M82_vTS3/M82vTS3_transcriptome.fasta
set IN=/share/bitcpt/Fall2022/lscapozi/Tom/AlignedToTranscriptome
##########################
# Sl-Leaf 1
##########################
set s=Sl_Leaf_Rep1_3X
salmon quant -l A -a ${IN}/${s}_Aligned.toTranscriptome.out.bam --targets ${cdna} -o salmon_align_quant/${s}.quant
##########################
# Sl-Leaf 2
##########################
set s=Sl_Leaf_Rep2_3X
salmon quant -l A -a ${IN}/${s}_Aligned.toTranscriptome.out.bam --targets ${cdna} -o salmon_align_quant/${s}.quant
##########################
# Sl-Leaf 3
###########################
set s=Sl_Leaf_Rep3_3X
salmon quant -l A -a ${IN}/${s}_Aligned.toTranscriptome.out.bam --targets ${cdna} -o salmon_align_quant/${s}.quant
##########################
# Sl-SAM 1
##########################
set s=Sl_SAM_Rep1_3X
salmon quant -l A -a ${IN}/${s}_Aligned.toTranscriptome.out.bam --targets ${cdna} -o salmon_align_quant/${s}.quant
##########################
# Sl-SAM 2
##########################
set s=Sl_SAM_Rep2_3X
salmon quant -l A -a ${IN}/${s}_Aligned.toTranscriptome.out.bam --targets ${cdna} -o salmon_align_quant/${s}.quant
##########################
# Sl-SAM 3
##########################
set s=Sl_SAM_Rep3_3X
salmon quant -l A -a ${IN}/${s}_Aligned.toTranscriptome.out.bam --targets ${cdna} -o salmon_align_quant/${s}.quant
##########################
# Sl-SAM 4
##########################
set s=Sl_SAM_Rep4_3X
salmon quant -l A -a ${IN}/${s}_Aligned.toTranscriptome.out.bam --targets ${cdna} -o salmon_align_quant/${s}.quant
echo Done
##Look at the quant files
#for the portfolio, work in the portfolio directory and add the sepcific path (/Portolio)
cd share/bitcpt/Fall2022/unityID/Tom
ll salmon_align_quant
total 4
drwxr-sr-x. 5 ehernan6 bitcpt 4096 Nov 28 15:04 Sl_Leaf_Rep1_3X.quant
drwxr-sr-x. 5 ehernan6 bitcpt 4096 Nov 28 15:07 Sl_Leaf_Rep2_3X.quant
drwxr-sr-x. 5 ehernan6 bitcpt 4096 Nov 28 15:09 Sl_Leaf_Rep3_3X.quant
drwxr-sr-x. 5 ehernan6 bitcpt 4096 Nov 28 15:12 Sl_SAM_Rep1_3X.quant
drwxr-sr-x. 5 ehernan6 bitcpt 4096 Nov 28 15:15 Sl_SAM_Rep2_3X.quant
drwxr-sr-x. 5 ehernan6 bitcpt 4096 Nov 28 15:17 Sl_SAM_Rep3_3X.quant
drwxr-sr-x. 5 ehernan6 bitcpt 4096 Nov 28 15:20 Sl_SAM_Rep4_3X.quant
Description: Statistical visualization of euclidean distance which means the further our data points are from each other then the more different they are to one another and vice versa. This means that the SAM files are more similar to one another than the Leaf files.
Description: Sample-to-sample distances show that the similarity between the samples of each tissue type is high, shown by the dark blue color. As shown in the Principal Component figure, the SAM samples show high similarity between each other than the leaf samples, which show a blue color lighter than the SAM samples.
Description: Key components of dispersion estimate. First look at the figure to see red trend line, observe that a set of data in blue that is in both sides of red line and more black data points beyond that. It is for the dataset's mean of normalized counts compared to variance. The trend should start high which is the high number of mean and get lower variance . Roughly the points should be around the fit line.
The data points of mean normalized counts and dispersion follow the trend that dispersion decreases as mean of normalized count increases.
Description: Histogram of the p-value for each gene being compared. See a higher significance in over 14,000 genes that are likely under the 0.05 threshold. This is an unadjusted p-value histogram, adjusting these p-values would be beneficial to see more conservative p-values (less than 14,000, which would show us what is changing the most between these groups). Taking the extra step to adjust these p-values is recommended.
Description: Showing fold change: how much bigger r smaller has the transcripts changed between leaf vs. meristem. As we move across the x-axis the number of transcripts changes and we start to see smaller fold changes. More counts - blue dots are coming in and smaller differences are going to be more significant.