Su'ad and I realized in August 2020 that five samples were missing from our RNA-seq analyses. It turns out those files were missing from the zipped file that Josh sent me. Josh re-sent the fastq files to me by zipping all the data.
Here is the command I used to download the data to this directory: /uufs/chpc.utah.edu/common/home/gompert-group1/data/lycaeides/matt_transcriptome/rawdata/
wget --no-check-certificate "https://publications.pathfinder.arcc.uwyo.edu/Botany/Harrison/RNA_Data_LycaeidesMel.tar.gz"
I extracted the archive using tar -xf RNA_Data_LycaeidesMel.tar.gz
The extracted folder is RNA_Data_Lycaeides, which contains two folders: RNA_Lycaeides_melissa (sample IDs are PMKS00) and RNA_Lycaides_melissa2018 (sample IDs are KS00; this one has the missing samples KS005-KS009). I then decompressed the fastq.gz files using:
for f in *.gz; do gunzip "$f"; done
I copied the fastq files of the missing samples to the rawdata/fastq folder. I plan on running just these samples through Trim Galore and the alignments, and then combining all files for the gene counts matrix.
1. Trim Galore
Reads before and after adapter trimming:
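The exact command isn't recorded here, but a typical paired-end Trim Galore run for one of these samples would look roughly like the sketch below (the output directory, the --fastqc flag, and the file names are my assumptions, not the recorded settings):
# trim adapters and low-quality ends from one read pair; --paired keeps mates in sync
trim_galore --paired --fastqc -o /uufs/chpc.utah.edu/common/home/gompert-group1/data/lycaeides/matt_transcriptome/trim_galore/newsamples KS005_R1_001.fastq KS005_R2_001.fastq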
2. TopHat alignment to Trinity de novo reference
-- moved the *fq files for the 5 new samples to the newsamples folder (/uufs/chpc.utah.edu/common/home/gompert-group1/data/lycaeides/matt_transcriptome/trim_galore/newsamples)
I ran TopHat for just the new samples using the script wrap_qsub_slurm_tophat_newsamples.pl. This script needs to be run from the folder in which the fastq files are:
perl ../../txome_assembly/tophat_trinity_ref/wrap_qsub_slurm_tophat_newsamples.pl *R1*fq
Files will get saved in /scratch/general/lustre/tophat/
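For reference, the core TopHat call the wrapper submits per sample is roughly the following (a sketch; the Bowtie index prefix trinity_ref, the thread count, and the trimmed-read file names are assumptions):
# align one trimmed read pair against the Trinity de novo reference; -o sets the per-sample output folder
tophat -p 12 -o /scratch/general/lustre/tophat/KS005 trinity_ref KS005_R1_001_val_1.fq KS005_R2_001_val_2.fq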
Moved the folders for each new sample to this directory: /uufs/chpc.utah.edu/common/home/gompert-group1/data/lycaeides/matt_transcriptome/txome_assembly/tophat_trinity_ref
Copied the accepted_hits.bam file from each sample's TopHat folder, renamed it with the sample name, and moved these BAMs to the newsamples folder in this directory to run the Cufflinks analyses.
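A loop along these lines does the renaming (a sketch assuming the per-sample TopHat folders are named after the samples):
# copy each sample's accepted_hits.bam to newsamples/<sample>.bam
for d in KS00?; do cp "$d/accepted_hits.bam" newsamples/"$d".bam; done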
3. Cufflinks
I reran Cufflinks using the wrap_qsub_slurm_cufflinks_trinity_newsamples.pl script in the folder /uufs/chpc.utah.edu/common/home/gompert-group1/data/lycaeides/matt_transcriptome/cufflinks/trinitydenovo/. The new data files are stored in the "newsamples" folder.
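The per-sample Cufflinks call inside the wrapper is essentially the following (a sketch; the thread count and folder layout are assumptions):
# quantify transcripts from one sample's alignment; writes transcripts.gtf into the -o folder
cufflinks -p 12 -o newsamples/KS005 newsamples/KS005.bam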
4. Final file
I copied the transcripts.gtf file from each sample's folder to the main folder as samplenumber.gtf. I then used the following Linux commands to create the final coverage and FPKM files for each sample.
for file in *.gtf; do awk -F"\t" '$3 == "transcript" { print $1"\t"$3"\t"$9 }' $file > ${file}_transcripts; done
for file in *.gtf; do cut -f 1 ${file}_transcripts > ${file}_scafpos; done
for file in *.gtf; do grep -E -o "FPKM\s\"[0-9.]+\"" ${file}_transcripts | grep -E -o "[0-9.]+" > ${file}_fpkm; done
for file in *.gtf; do grep -E -o "cov\s\"[0-9.]+\"" ${file}_transcripts | grep -E -o "[0-9.]+" > ${file}_cov; done
for file in *.gtf; do grep -E -o "transcript_id\s\"[A-Z0-9.]+\"" ${file}_transcripts | grep -E -o "[A-Z0-9.]+" > ${file}_transid; done
for file in *.gtf; do paste ${file}_scafpos ${file}_transid ${file}_fpkm > ${file}_final_fpkm; done
for file in *.gtf; do paste ${file}_scafpos ${file}_transid ${file}_cov > ${file}_final_cov; done
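Each *_final_fpkm (and *_final_cov) file then has three tab-separated columns: the GTF seqname (Trinity contig/scaffold), the Cufflinks transcript ID, and the FPKM (or coverage) value. A line looks something like this (made-up values):
TRINITY_DN1000_c0_g1_i1	CUFF.1.1	5.23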
I then used the R code in preparefinalfile.R to create the final files: fpkm_withnewsamples.txt and coverage_withnewsamples.txt. I sent these to Su'ad on 17 Aug 2020.
################################################################################################################
################################################################################################################
################################################################################################################
################################################################################################################
RNA SEQ ANALYSES VERSION 2: Following the cmac pipeline
################################################################################################################
Josh mentioned that it would be better to do the analyses with raw counts rather than floats. Also, we have the genome annotation, so I decided to redo the quality trimming and alignments following the same pipeline I used for cmac (which was suggested by Adam from Harvard RC). So here are the steps:
1. FastQC
Ran FastQC to check the quality of the sequences (folder: /uufs/chpc.utah.edu/common/home/gompert-group1/data/lycaeides/matt_transcriptome/fastqc/).
Here is the bash script I used:
#!/bin/bash
#SBATCH -n 12
#SBATCH -N 1
#SBATCH -t 300:00:00
#SBATCH -p usubio-kp
#SBATCH -A usubio-kp
#SBATCH -J fastqc
module load fastqc
for f in /uufs/chpc.utah.edu/common/home/gompert-group1/data/lycaeides/matt_transcriptome/rawdata/fastqc/*.fastq; do fastqc -o "./" "$f"; done
FastQC outputs two files for each sample: (1) an HTML report and (2) a zip file with all the statistics.
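For reference, the summary.txt inside each zip file is tab-separated, one line per FastQC module (the file name here is hypothetical):
PASS	Basic Statistics	KS005_R1_001.fastq
FAIL	Per base sequence content	KS005_R1_001.fastq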
I used the following bash code to compile the summaries and statistics from the zip files for each sample. The output summaries are saved in the folder fq_aggregated.
# Run this script in a directory containing the zip files from FastQC. It aggregates the images of each type into individual folders,
# so looking across the data is quick. Output will be saved in the folder fq_aggregated.
zips=$(ls *.zip)
for i in $zips; do
unzip -o $i &>/dev/null;
done
fastq_folders=${zips//.zip/} # strip .zip from every name to get the extracted folder names
rm -rf fq_aggregated # Remove aggregate folder if present
mkdir fq_aggregated
# Rename Files within each using folder name.
for folder in $fastq_folders; do
folder=${folder%.*}
img_files=$(ls ${folder}/Images/*png);
for img in $img_files; do
img_name=$(basename "$img");
img_name=${img_name%.*}
mkdir -p fq_aggregated/${img_name};
mv $img fq_aggregated/${img_name}/${folder/_fastqc/}.png;
done;
done;
# Concatenate Summaries
for folder in $fastq_folders; do
folder=${folder%.*}
cat ${folder}/summary.txt >> fq_aggregated/summary.txt
done;
# Concatenate Statistics
for folder in $fastq_folders; do
folder=${folder%.*}
head -n 10 ${folder}/fastqc_data.txt | tail -n 7 | awk -v f=${folder/_fastqc/} '{ print $0 "\t" f }' >> fq_aggregated/statistics.txt
rm -rf ${folder}
done
Some of the summary checks are flagged as FAIL. I think running Rcorrector will clean this data up, and I will run FastQC again after running Rcorrector.
2. Rcorrector
Since Rcorrector is not available as a module on the UofU CHPC, I installed it locally in my directory: /uufs/chpc.utah.edu/common/home/u6007910/projects/sam
Install rcorrector:
module load git
git clone https://github.com/mourisl/rcorrector.git #downloads rcorrector repository
cd rcorrector/
ls
make #installs and downloads jellyfish2 if it is not available in the rcorrector path
perl run_rcorrector.pl #test for installation
Running rcorrector:
mkdir rcorrector
bash script for rcorrector:
#!/bin/bash
#SBATCH -n 12
#SBATCH -N 1
#SBATCH -t 50:00:00
#SBATCH -p usubio-kp
#SBATCH -A usubio-kp
#SBATCH -J rcorrect
#SBATCH --mail-type=ALL # Type of email notification- BEGIN,END,FAIL,ALL
#SBATCH --mail-user=samridhi.chaturvedi@gmail.com # Email to which notifications will be sent
module load perl
cd /uufs/chpc.utah.edu/common/home/gompert-group1/data/lycaeides/matt_transcriptome/rcorrector/
for prefix in $(ls /uufs/chpc.utah.edu/common/home/gompert-group1/data/lycaeides/matt_transcriptome/rawdata/fastqc/*.fastq | sed -r 's/_R[12]_001[.]fastq//' | uniq)
do
perl /uufs/chpc.utah.edu/common/home/u6007910/projects/sam/rcorrector/run_rcorrector.pl -t 12 -1 "${prefix}_R1_001.fastq" -2 "${prefix}_R2_001.fastq"
done
After Rcorrector has run, we get output files with the extension ".cor.fq". Use the following steps to run some statistics on these files and get counts of how many reads are tagged "cor" or "unfixable". I saved these numbers in this file:
#get file names
ls *.fq
#get the total number of reads in each file (each FASTQ record is 4 lines; grepping for "^@" can overcount because quality lines may also start with "@"):
for i in *fq; do echo "$i $(( $(wc -l < "$i") / 4 ))"; done
#get reads with "cor" tag
for i in *fq; do grep "cor" $i | wc -l; done
#get reads with "unfixable" tag
for i in *fq; do grep "unfixable" $i | wc -l; done
After this, I ran the script FilterUncorrectabledPEfastq.py to remove the unfixable reads. This outputs files with names unfixrm*cor.fq.
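If I remember the script's interface correctly (double-check with its help message; the -1/-2/-s arguments and file names here are assumptions), the per-sample call looks like:
python FilterUncorrectabledPEfastq.py -1 KS005_R1_001.cor.fq -2 KS005_R2_001.cor.fq -s KS005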