Cherry Tomatoes Lab Deliverables

Logging in to the HPC

Getting Organized on the HPC

Running Fastqc

Saving files from HPC

Arabidopsis - Fastqc Analysis

Tomato - Fastqc Analysis

Conclusions

Arabidopsis index script

Indexing Tomato script

Tomato Alignment Code

Tomato Tree

Salmon Script for Quantifying Aligned Reads using Salmon for Arabidopsis files

Salmon Script for Quantifying aligned reads using Salmon for Tomato

DESeq-2 Analysis Results Tomato

Logging in to the HPC

ssh UnityID@login.hpc.ncsu.edu

Getting Organized on the HPC

# In the UnityID directory, create these specific directories

mkdir At

mkdir Tom

mkdir Portfolio

# Create the following directories in both At and Tom directories:

mkdir AlignedToTranscriptome

mkdir fastqc

mkdir salmon_align_quant

mkdir starindices

mkdir starOutputfiles

mkdir transcriptome

Running Fastqc

# Load conda into mobaxterm and activate

module load conda

conda activate /usr/local/usrapps/bitcpt/fastqc

# View help manual by writing

fastqc -h

# Script for running the job:

#!/bin/tcsh

#BSUB -J fastqc_At_GroupName #job name

#BSUB -n 20 #number of nodes

#BSUB -W 2:0 #time for job to complete

#BSUB -o fastqc.out.%J #output file

#BSUB -e fastqc.err.%J #error file

fastqc /share/bitcpt/Fall2022/RawData/Arabidopsis_thaliana* -t 20 -outdir ./fastqc

# Copy the script from Dr Sjogren into “At” directory

cp /share/bitcpt/Fall2022/scripts/At.fastqc.sh

# repeat copying the script into Tom directory and change name

Cp /share/bitcpt/Fall2022/scripts/At.fastqc.sh Tom.fastqc.sh

# cp function can be used to rename copied folder by putting the new name at the end of the script

# The script can be modified by using vi script

Vi Tom.fastqc.sh

# the file was changed to be accurate to the tom files

# submit work for both At and Tom

Bsub <At.fastqc.sh

Bsub <Tom.fastqc.sh

# check directory for error files being generated or use script <Bjobs> to check for work in progress

# Inspect the error file

more fastqc.err.JOB#

Saving files from HPC

Complete this step via Globus File Manager

Arabidopsis - Fastqc Analysis

Col-0_Leaf_Rep1_1.fq.gz

Basic statistics - sequence length = 100 bp

Per base sequence quality - Good = all green

Per sequence quality scores - Mean = 36

Per base sequence content - Good= linearizes after the initial noise

Per sequence GC content - Good because it still follows a bell curve (47%)

Per base N content - horizontal at 0, good

Sequence length distribution - Good

Sequence duplication levels - Bad (Percent of seqs remaining if deduplicated = 22.38%

Overrepresented sequences - 5 sequences listed, Warning

Adapter content - Good

Col-0_Leaf_Rep1_2.fq.gz

Basic statistics - sequence length = 100 bp

Per base sequence quality - Good = all green

Per sequence quality scores - Median

Per base sequence content - Good= linearizes after the initial noise

Per sequence GC content - Good because it still follows a bell curve (47%)

Per base N content - horizontal at 0, good

Sequence length distribution - Good

Sequence duplication levels - Percent of sequences remaining if duplicated: 28.55%

Overrepresented sequences - one sequence overrepresented, warning

Adapter content - Good

Col-0_Leaf_Rep2_1.fq.gz

Basic statistics - sequence length = 100 bp

Per base sequence quality - Good = all green

Per sequence quality scores - Median = 36

Per base sequence content - Good= linearizes after the initial noise

Per sequence GC content - Good because it still follows a bell curve (47%)

Per base N content - horizontal at 0, good

Sequence length distribution - Good

Sequence duplication levels: Percent of sequences remaining if duplicated: 20.11%

Overrepresented sequences: 6 sequences overrepresented

Adapter content -Good

Col-0_Leaf_Rep2_2.fq.gz

Basic statistics - sequence length = 100 bp

Per base sequence quality - Good = mostly, green, some yellow

Per sequence quality scores - Median = 36

Per base sequence content - Good= linearizes after the initial noise

Per sequence GC content - Good because it still follows a bell curve (47%)

Per base N content - horizontal at 0, good

Sequence length distribution - Good

Sequence duplication levels: Percent of sequences remaining if duplicated: 26.12%

Overrepresented sequences: 1 overrepresented sequence

Adapter content - Good

Col-0_SAM_rep1_L002_R1.fq.gz

Basic statistics - sequence length = 50 bp

Per base sequence quality - Good = very short

Per sequence quality scores - Median

Per base sequence content - Good= linearizes after the initial noise

Per sequence GC content - Good because it still follows a bell curve (45%)

Per base N content - horizontal at 0, good

Sequence length distribution - Good

Sequence duplication levels - Bad (Percent of seqs remaining if deduplicated = 26.78%

Overrepresented sequences - 2 overrepresented sequence

Adapter content

Col-0_SAM_rep1_L002_R2.fq.gz

Basic statistics - sequence length = 50 bp

Per base sequence quality - Good = all green

Per sequence quality scores - Median

Per base sequence content - Good= linearizes after the initial noise

Per sequence GC content - Good because it still follows a bell curve (45%)

Per base N content - horizontal at 0, good

Sequence length distribution - Good

Sequence duplication levels - Bad (Percent of seqs remaining if deduplicated = 28.63%

Overrepresented sequences - 4 overrepresented sequence

Adapter content - Good

Col-0_SAM_rep2_L002_R1.fq.gz

Basic statistics - sequence length = 50 bp

Per base sequence quality - Good = all green

Per sequence quality scores - Median

Per base sequence content - Good= linearizes after the initial noise

Per sequence GC content - Good because it still follows a bell curve (45%)

Per base N content - horizontal at 0, good

Sequence length distribution - Good

Sequence duplication levels - Bad (Percent of seqs remaining if deduplicated = 23.41%

Overrepresented sequences - 3 overrepresented sequence

Adapter content - Good

Col-0_SAM_rep2_L002_R2.fq.gz

Basic statistics - sequence length = 50 bp

Per base sequence quality - Good = all green

Per sequence quality scores - Median

Per base sequence content - Good= linearizes after the initial noise

Per sequence GC content - Good because it still follows a bell curve (45%)

Per base N content - horizontal at 0, good

Sequence length distribution - Good

Sequence duplication levels - Bad (Percent of seqs remaining if deduplicated = 24.93%

Overrepresented sequences - 6 overrepresented sequence

Adapter content - Good

Col-0_SAM_rep3_L002_R1.fq.gz

Basic statistics - sequence length = 50 bp

Per base sequence quality - Good = all green

Per sequence quality scores - Median

Per base sequence content - Good= linearizes after the initial noise

Per sequence GC content - Good because it still follows a bell curve (45%)

Per base N content - horizontal at 0, good

Sequence length distribution - Good

Sequence duplication levels - Bad (Percent of seqs remaining if deduplicated = 26.87%

Overrepresented sequences - 2 overrepresented sequence

Adapter content - Good

Col-0_SAM_rep3_L002_R2.fq.gz

Basic statistics - sequence length = 50 bp

Per base sequence quality - Good = all green

Per sequence quality scores - Median

Per base sequence content - Good= linearizes after the initial noise

Per sequence GC content - Good because it still follows a bell curve (45%)

Per base N content - horizontal at 0, good

Sequence length distribution - Good

Sequence duplication levels - Bad (Percent of seqs remaining if deduplicated = 28.14%

Overrepresented sequences: 1 overrepresented sequence

Adapter content - Good

Tomato - Fastqc Analysis

The Fastqc data we have received for the 3x tomato sequences have similar quality compared to the data we have analyzed for the Arabidopsis sequences. We are required to still trim most of the sequences but overall quality are very good.

Conclusions

These sequences are good enough for RNA-seq analysis since some of the error warnings are characteristic of RNA-seq, and the 'Per base sequence content' can be improved by trimming. Specific trimming measures will be discussed further in the following section.

Arabidopsis index script

#Copy the index script into the At directory

#directory: /share/bitcpt/Fall2022/UnityID/At

cp /share/bitcpt/Fall2022/scripts/At.starindex.sh .

ll #to make sure 'At.starindex.sh' is located in the At directory

#To look at the content of the index script

more At.starindex.sh

#To edit the script

vi At.starindex.sh #press I to edit, "esc" to stop editing, and type ":wq" to exit the file.

#Arabidopsis index script, make sure to have the starindices directory in At

#!/bin/tcsh

#BSUB -J starindices_At_CherryTomatoesAt #job name

#BSUB -n 10 #number of nodes

#BSUB -W 2:0 #time for job to complete

#BSUB -o starindices.out.%J #output file

#BSUB -e starindices.err.%J #error file

# For running star to generate genome index

# Run in working directory /share/bitcpt/Fall2022/UnityID/At

# Must run this in working directory with subdirectory named /starindices

module load conda

conda activate /usr/local/usrapps/bitcpt/star

set IN=/share/bitcpt/Fall2022/referenceGenomes/Arabidopsis_thaliana/tair10

STAR --runThreadN 10 --runMode genomeGenerate --genomeSAindexNbases 12 --genomeD

ir starindices --genomeFastaFiles ${IN}/1.fas ${IN}/2.fas ${IN}/3.fas ${IN}/4.fa

s ${IN}/5.fas ${IN}/Pt.fas ${IN}/Mt.fas --sjdbGTFfile ${IN}/TAIR10.E41.protein_c

oding.gtf --sjdbOverhang 36

#Run the index Arabidopsis script

bsub <At.starindex.sh

#To look at your submitted jobs

bjobs

ll starindices #To look at the output files of starindices

total 1290755

-rw-r--r--. 1 ehernan6 bitcpt 59 Nov 22 13:55 chrLength.txt

-rw-r--r--. 1 ehernan6 bitcpt 75 Nov 22 13:55 chrNameLength.txt

-rw-r--r--. 1 ehernan6 bitcpt 16 Nov 22 13:55 chrName.txt

-rw-r--r--. 1 ehernan6 bitcpt 68 Nov 22 13:55 chrStart.txt

-rw-r--r--. 1 ehernan6 bitcpt 9604848 Nov 22 13:55 exonGeTrInfo.tab

-rw-r--r--. 1 ehernan6 bitcpt 4006498 Nov 22 13:55 exonInfo.tab

-rw-r--r--. 1 ehernan6 bitcpt 914934 Nov 22 13:55 geneInfo.tab

-rw-r--r--. 1 ehernan6 bitcpt 130148145 Nov 22 13:56 Genome

-rw-r--r--. 1 ehernan6 bitcpt 1829 Nov 22 13:56 genomeParameters.txt

-rw-r--r--. 1 ehernan6 bitcpt 8717 Nov 22 13:56 Log.out

-rw-r--r--. 1 ehernan6 bitcpt 1063527691 Nov 22 13:56 SA

-rw-r--r--. 1 ehernan6 bitcpt 97867203 Nov 22 13:56 SAindex

-rw-r--r--. 1 ehernan6 bitcpt 3413660 Nov 22 13:56 sjdbInfo.txt

-rw-r--r--. 1 ehernan6 bitcpt 3483166 Nov 22 13:55 sjdbList.fromGTF.out.tab

-rw-r--r--. 1 ehernan6 bitcpt 2752061 Nov 22 13:56 sjdbList.out.tab

-rw-r--r--. 1 ehernan6 bitcpt 2684730 Nov 22 13:55 transcriptInfo.tab

Indexing Tomato script

#!/bin/tcsh

#BSUB -J starindices_Tom_CherryTom #job name

#BSUB -n 10 #number of nodes

#BSUB -W 2:0 #time for job to complete

#BSUB -o starindices.out.%J #output file

#BSUB -e starindices.err.%J #error file

# For running star to generate genome index

# Run in working directory /share/bitcpt/Fall2022/UnityID/Tom

#for the portfolio, work in the portfolio directory and add the sepcific path (/Portolio)

# Must run this in working directory with subdirectory named /starindices

module load conda

conda activate /usr/local/usrapps/bitcpt/star

set IN=/share/bitcpt/Fall2022/referenceGenomes/Solanum_lycopersicum/M82_vTS3

STAR --runThreadN 10 --runMode genomeGenerate --genomeSAindexNbases 13 --genomeDir starindices --genomeFastaFiles ${IN}/M82vTS3_assembly.fasta --sjdbGTFfile ${IN}/M82vTS3.agat.gtf --sjdbOverhang 58

#the red mark is for us to change based on the name of the portfolio file

Tomato Alignment Code

#!/bin/tcsh

#BSUB -J Tom_staralign #job name

#BSUB -n 12 #number of threads

#BSUB -W 10:0 #time for job to complete

#BSUB -R span[hosts=1] #to keep tasks on one node

#BSUB -R "rusage[mem=20000]" #to request a node with 20MB of memory

#BSUB -o Tom_staralign_%J.out #output file

#BSUB -e Tom_staralign_%J.err #error file

#to align RNA-seq reads to indexed genome using STAR

#STAR cannot make use of HPC MPI, must have -R options to set 1 node & memory

#set threads under 12 on Henry2

#input of indexed genome path is /share/bitcpt/Fall2022/UNITYID/Tom/starindices

#for the portfolio, work in the portfolio directory and add the sepcific path (/Portolio)

#input of sequence reads path is /share/bitcpt/Fall2022/CleanData/Solanum_lycopersicum/

#output of aligned reads will go into STAR_align_Tom subdirectory in working directory

module load conda

conda activate /usr/local/usrapps/bitcpt/star

# SET IN VARIABLES

set IN=/share/bitcpt/Fall2022/CleanData/Solanum_lycopersicum

set index=starindices

set out=AlignedToTranscriptome

################################

## Leaf Rep 1

################################

# RNA-seq data are in format Sl_Leaf_Rep1_3X_1.fp.fq.gz

set S=Sl_Leaf_Rep1_3X

set EN=fp.fq.gz

# Print the file name to make sure it is right

echo ${IN}/${S}_1.${EN}

STAR --runThreadN 12 --runMode alignReads --genomeDir ${index} --outFileNamePrefix ${out}/${S}_ --readFilesIn ${IN}/${S}_1.${EN} ${IN}/${S}_2.${EN} --readFilesCommand zcat --outSAMtype BAM Unsorted --twopassMode Basic --quantMode TranscriptomeSAM

################################

## Leaf Rep 2

################################

# RNA-seq data are in format Sl_Leaf_Rep2_3X_1.fp.fq.gz

set S=Sl_Leaf_Rep2_3X

set EN=fp.fq.gz

# Print the file name to make sure it is right

echo ${IN}/${S}_1.${EN}

################################

## Leaf Rep 3

################################

# RNA-seq data are in format Sl_Leaf_Rep3_3X_1.fp.fq.gz

set S=Sl_Leaf_Rep3_3X

set EN=fp.fq.gz

# Print the file name to make sure it is right

echo ${IN}/${S}_1.${EN}

################################

## SAM Rep 1

################################

#RNA-seq data are in format Sl_SAM_Rep1_3X_1.fp.fq.gz

set S=Sl_SAM_Rep1_3X

set EN=fp.fq.gz

#Print the file name to make sure it is right

echo ${IN}/${S}_1.${EN}

################################

## SAM Rep 2

################################

#RNA-seq data are in format Sl_SAM_Rep2_3X_1.fp.fq.gz

set S=Sl_SAM_Rep2_3X

set EN=fp.fq.gz

#Print the file name to make sure it is right

echo ${IN}/${S}_1.${EN}

################################

## SAM Rep 3

################################

#RNA-seq data are in format Sl_SAM_Rep3_3X_1.fp.fq.gz

set S=Sl_SAM_Rep3_3X

set EN=fp.fq.gz

#Print the file name to make sure it is right

echo ${IN}/${S}_1.${EN}

################################

## SAM Rep 4

################################

#RNA-seq data are in format Sl_SAM_Rep4_3X_1.fp.fq.gz

set S=Sl_SAM_Rep4_3X

set EN=fp.fq.gz

#Print the file name to make sure it is right

echo ${IN}/${S}_1.${EN}

Tomato Tree

[lmlower@login03 starindices]$ tree

├── chrLength.txt

├── chrNameLength.txt

├── chrName.txt

├── chrStart.txt

├── exonGeTrInfo.tab

├── exonInfo.tab

├── geneInfo.tab

├── Genome

├── genomeParameters.txt

├── Log.out

├── SA

├── SAindex

├── sjdbInfo.txt

├── sjdbList.fromGTF.out.tab

├── sjdbList.out.tab

└── transcriptInfo.tab

Salmon Script for Quantifying Aligned Reads using Salmon for Arabidopsis files

pwd

/share/bitcpt/Fall2022/unityID/At

#for the portfolio, work in the portfolio directory and add the sepcific path (/Portolio)

## copy script to the current directory

cp /share/bitcpt/Fall2022/scripts/At.salmon.sh .

## edit the script by changing the unityID to your ID when setting the IN variable

vi At.salmon.sh

## submit the job

bsub <At.salmon.sh

##Look at the quant files

cd share/bitcpt/Fall2022/unityID/At

ll salmon_align_quant

total 3

drwxr-sr-x. 5 ehernan6 bitcpt 4096 Nov 28 14:57 Col-0_Leaf_Rep1.quant

drwxr-sr-x. 5 ehernan6 bitcpt 4096 Nov 28 14:59 Col-0_Leaf_Rep2.quant

drwxr-sr-x. 5 ehernan6 bitcpt 4096 Nov 28 15:01 Col-0_SAM_rep1_L002.quant

drwxr-sr-x. 5 ehernan6 bitcpt 4096 Nov 28 15:04 Col-0_SAM_rep2_L002.quant

drwxr-sr-x. 5 ehernan6 bitcpt 4096 Nov 28 15:06 Col-0_SAM_rep3_L002.quant

Salmon Script for Quantifying aligned reads using Salmon for Tomato

pwd

/share/bitcpt/Fall2022/unityID/Tom

#for the portfolio, work in the portfolio directory and add the sepcific path (/Portolio)

## copy script to the current directory

cp /share/bitcpt/Fall2022/scripts/At.salmon.sh Tom.salmon.sh .

## edit the script by changing the unityID to your ID when setting the IN variable

vi Tom.salmon.sh

#!/bin/tcsh

#BSUB -J salmonquant_Tom #job name

#BSUB -n 12 #number of threads

#BSUB -W 5:0 #time for job to complete

#BSUB -R "rusage[mem=20000]" #to request a node with 20MB of memory

#BSUB -o salmonquant_Tom.%J.out #output file

#BSUB -e salmonquant_Tom.%J.err #error file

#to quantify aligned reads using salmon in quasi indexing mode

#set threads under 12 on Henry2

#working directory path is /share/bitcpt/Fall2022/UnityID/Tom

#input of aligned reads path is /share/bitcpt/Fall2022/UnityID/Tom/AlignedToTranscriptome

#output of aligned reads will go into salmon_align_quant subdirectory in working directory

module load conda

conda activate /usr/local/usrapps/bitcpt/salmon

##########################

# Set the variables

##########################

set cdna=/share/bitcpt/Fall2022/referenceGenomes/Solanum_lycopersicum/M82_vTS3/M82vTS3_transcriptome.fasta

set IN=/share/bitcpt/Fall2022/lscapozi/Tom/AlignedToTranscriptome

##########################

# Sl-Leaf 1

##########################

set s=Sl_Leaf_Rep1_3X

salmon quant -l A -a ${IN}/${s}_Aligned.toTranscriptome.out.bam --targets ${cdna} -o salmon_align_quant/${s}.quant

##########################

# Sl-Leaf 2

##########################

set s=Sl_Leaf_Rep2_3X

salmon quant -l A -a ${IN}/${s}_Aligned.toTranscriptome.out.bam --targets ${cdna} -o salmon_align_quant/${s}.quant

##########################

# Sl-Leaf 3

###########################

set s=Sl_Leaf_Rep3_3X

salmon quant -l A -a ${IN}/${s}_Aligned.toTranscriptome.out.bam --targets ${cdna} -o salmon_align_quant/${s}.quant

##########################

# Sl-SAM 1

##########################

set s=Sl_SAM_Rep1_3X

salmon quant -l A -a ${IN}/${s}_Aligned.toTranscriptome.out.bam --targets ${cdna} -o salmon_align_quant/${s}.quant

##########################

# Sl-SAM 2

##########################

set s=Sl_SAM_Rep2_3X

salmon quant -l A -a ${IN}/${s}_Aligned.toTranscriptome.out.bam --targets ${cdna} -o salmon_align_quant/${s}.quant

##########################

# Sl-SAM 3

##########################

set s=Sl_SAM_Rep3_3X

salmon quant -l A -a ${IN}/${s}_Aligned.toTranscriptome.out.bam --targets ${cdna} -o salmon_align_quant/${s}.quant

##########################

# Sl-SAM 4

##########################

set s=Sl_SAM_Rep4_3X

salmon quant -l A -a ${IN}/${s}_Aligned.toTranscriptome.out.bam --targets ${cdna} -o salmon_align_quant/${s}.quant

echo Done

##Look at the quant files

#for the portfolio, work in the portfolio directory and add the sepcific path (/Portolio)

cd share/bitcpt/Fall2022/unityID/Tom

ll salmon_align_quant

total 4

drwxr-sr-x. 5 ehernan6 bitcpt 4096 Nov 28 15:04 Sl_Leaf_Rep1_3X.quant

drwxr-sr-x. 5 ehernan6 bitcpt 4096 Nov 28 15:07 Sl_Leaf_Rep2_3X.quant

drwxr-sr-x. 5 ehernan6 bitcpt 4096 Nov 28 15:09 Sl_Leaf_Rep3_3X.quant

drwxr-sr-x. 5 ehernan6 bitcpt 4096 Nov 28 15:12 Sl_SAM_Rep1_3X.quant

drwxr-sr-x. 5 ehernan6 bitcpt 4096 Nov 28 15:15 Sl_SAM_Rep2_3X.quant

drwxr-sr-x. 5 ehernan6 bitcpt 4096 Nov 28 15:17 Sl_SAM_Rep3_3X.quant

drwxr-sr-x. 5 ehernan6 bitcpt 4096 Nov 28 15:20 Sl_SAM_Rep4_3X.quant

DESeq-2 Analysis Results Tomato

Description: Statistical visualization of euclidean distance which means the further our data points are from each other then the more different they are to one another and vice versa. This means that the SAM files are more similar to one another than the Leaf files.

Description: Sample-to-sample distances show that the similarity between the samples of each tissue type is high, shown by the dark blue color. As shown in the Principal Component figure, the SAM samples show high similarity between each other than the leaf samples, which show a blue color lighter than the SAM samples.

Description: Key components of dispersion estimate. First look at the figure to see red trend line, observe that a set of data in blue that is in both sides of red line and more black data points beyond that. It is for the dataset's mean of normalized counts compared to variance. The trend should start high which is the high number of mean and get lower variance . Roughly the points should be around the fit line.

The data points of mean normalized counts and dispersion follow the trend that dispersion decreases as mean of normalized count increases.

Description: Histogram of the p-value for each gene being compared. See a higher significance in over 14,000 genes that are likely under the 0.05 threshold. This is an unadjusted p-value histogram, adjusting these p-values would be beneficial to see more conservative p-values (less than 14,000, which would show us what is changing the most between these groups). Taking the extra step to adjust these p-values is recommended.

Description: Showing fold change: how much bigger r smaller has the transcripts changed between leaf vs. meristem. As we move across the x-axis the number of transcripts changes and we start to see smaller fold changes. More counts - blue dots are coming in and smaller differences are going to be more significant.

Page updated

Report abuse