Su'ad and I realized in August 2020 that five samples were missing from our RNA-seq analyses. It turns out those files were missing from the zipped file that Josh sent me. Josh re-sent the fastq files to me by zipping all the data.
Here is the command I used to download the data to this directory: /uufs/chpc.utah.edu/common/home/gompert-group1/data/lycaeides/matt_transcriptome/rawdata/
wget --no-check-certificate "https://publications.pathfinder.arcc.uwyo.edu/Botany/Harrison/RNA_Data_LycaeidesMel.tar.gz"
I extracted the archive using tar -xf RNA_Data_LycaeidesMel.tar.gz
The extracted folder is RNA_Data_Lycaeides, which contains two folders: RNA_Lycaeides_melissa (sample IDs are PMKS00) and RNA_Lycaides_melissa2018 (sample IDs are KS00; this one has the missing samples KS005-KS009). I then decompressed the fastq.gz files using:
for f in *.gz; do gunzip "$f"; done
I copied the fastq files of the missing samples to the rawdata/fastq folder. I plan on running just these samples through Trim Galore and the alignments, and then combining all files for the gene counts matrix.
1. Trim Galore
Reads before and after adapter trimming:
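The exact command isn't recorded here, but a typical paired-end Trim Galore run for one of these samples would look roughly like the sketch below (the output directory, the --fastqc flag, and the file names are my assumptions, not the recorded settings):
# trim adapters and low-quality ends from one read pair; --paired keeps mates in sync
trim_galore --paired --fastqc -o /uufs/chpc.utah.edu/common/home/gompert-group1/data/lycaeides/matt_transcriptome/trim_galore/newsamples KS005_R1_001.fastq KS005_R2_001.fastq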
2. TopHat alignment to Trinity de novo reference
-- moved the *fq files for the 5 new samples to the newsamples folder (/uufs/chpc.utah.edu/common/home/gompert-group1/data/lycaeides/matt_transcriptome/trim_galore/newsamples)
I ran TopHat for just the new samples using the script wrap_qsub_slurm_tophat_newsamples.pl. This script needs to be run from the folder in which the fastq files are:
perl ../../txome_assembly/tophat_trinity_ref/wrap_qsub_slurm_tophat_newsamples.pl *R1*fq
Files will get saved in /scratch/general/lustre/tophat/
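For reference, the core TopHat call the wrapper submits per sample is roughly the following (a sketch; the Bowtie index prefix trinity_ref, the thread count, and the trimmed-read file names are assumptions):
# align one trimmed read pair against the Trinity de novo reference; -o sets the per-sample output folder
tophat -p 12 -o /scratch/general/lustre/tophat/KS005 trinity_ref KS005_R1_001_val_1.fq KS005_R2_001_val_2.fq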
Moved the folders for each new sample to this directory: /uufs/chpc.utah.edu/common/home/gompert-group1/data/lycaeides/matt_transcriptome/txome_assembly/tophat_trinity_ref
Copied the accepted_hits.bam file from each sample's TopHat folder, renamed it with the sample name, and moved these BAMs to the newsamples folder in this directory to run the Cufflinks analyses.
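A loop along these lines does the renaming (a sketch assuming the per-sample TopHat folders are named after the samples):
# copy each sample's accepted_hits.bam to newsamples/<sample>.bam
for d in KS00?; do cp "$d/accepted_hits.bam" newsamples/"$d".bam; done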
3. Cufflinks
I reran Cufflinks using the wrap_qsub_slurm_cufflinks_trinity_newsamples.pl script in the folder /uufs/chpc.utah.edu/common/home/gompert-group1/data/lycaeides/matt_transcriptome/cufflinks/trinitydenovo/. The new data files are stored in the "newsamples" folder.
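The per-sample Cufflinks call inside the wrapper is essentially the following (a sketch; the thread count and folder layout are assumptions):
# quantify transcripts from one sample's alignment; writes transcripts.gtf into the -o folder
cufflinks -p 12 -o newsamples/KS005 newsamples/KS005.bam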
4. Final file
I copied the transcripts.gtf file from each sample's folder to the main folder as samplenumber.gtf. I then used the following Linux commands to create the final coverage and FPKM files for each sample.
for file in *.gtf; do awk -F"\t" '$3 == "transcript" { print $1"\t"$3"\t"$9 }' $file > ${file}_transcripts; done
for file in *.gtf; do cut -f 1 ${file}_transcripts > ${file}_scafpos; done
for file in *.gtf; do grep -E -o "FPKM\s\"[0-9.]+\"" ${file}_transcripts | grep -E -o "[0-9.]+" > ${file}_fpkm; done
for file in *.gtf; do grep -E -o "cov\s\"[0-9.]+\"" ${file}_transcripts | grep -E -o "[0-9.]+" > ${file}_cov; done
for file in *.gtf; do grep -E -o "transcript_id\s\"[A-Z0-9.]+\"" ${file}_transcripts | grep -E -o "[A-Z0-9.]+" > ${file}_transid; done
for file in *.gtf; do paste ${file}_scafpos ${file}_transid ${file}_fpkm > ${file}_final_fpkm; done
for file in *.gtf; do paste ${file}_scafpos ${file}_transid ${file}_cov > ${file}_final_cov; done
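Each *_final_fpkm (and *_final_cov) file then has three tab-separated columns: the GTF seqname (Trinity contig/scaffold), the Cufflinks transcript ID, and the FPKM (or coverage) value. A line looks something like this (made-up values):
TRINITY_DN1000_c0_g1_i1	CUFF.1.1	5.23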
I then used the R code in preparefinalfile.R to create the final files: fpkm_withnewsamples.txt and coverage_withnewsamples.txt. I sent these to Su'ad on 17 Aug 2020.
################################################################################################################
################################################################################################################
################################################################################################################
################################################################################################################
RNA SEQ ANALYSES VERSION 2: Following the cmac pipeline
################################################################################################################
Josh mentioned that it would be better to do the analyses with raw counts rather than floats. Also, we have the genome annotation, so I decided to redo the quality trimming and alignments following the same pipeline I used for cmac (which was suggested by Adam from Harvard RC). So here are the steps:
1. FastQC
Ran FastQC to check the quality of the sequences (folder: /uufs/chpc.utah.edu/common/home/gompert-group1/data/lycaeides/matt_transcriptome/fastqc/).
Here is the bash script I used:
#!/bin/bash
#SBATCH -n 12
#SBATCH -N 1
#SBATCH -t 300:00:00
#SBATCH -p usubio-kp
#SBATCH -A usubio-kp
#SBATCH -J fastqc
module load fastqc
for f in /uufs/chpc.utah.edu/common/home/gompert-group1/data/lycaeides/matt_transcriptome/rawdata/fastqc/*.fastq; do fastqc -o "./" "$f"; done
FastQC outputs two files for each sample: (1) an HTML report and (2) a zip file with all the statistics.
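For reference, the summary.txt inside each zip file is tab-separated, one line per FastQC module (the file name here is hypothetical):
PASS	Basic Statistics	KS005_R1_001.fastq
FAIL	Per base sequence content	KS005_R1_001.fastq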
I used the following bash code to compile the summaries and statistics from the zip files for each sample. The output summaries are saved in the folder fq_aggregated.
# Run this script in a directory containing the zip files from FastQC. It aggregates the images of each type into individual folders,
# so looking across the data is quick. Output will be saved in the folder fq_aggregated.
zips=$(ls *.zip)
for i in $zips; do
unzip -o $i &>/dev/null;
done
fastq_folders=${zips//.zip/} # strip .zip from every name to get the extracted folder names
rm -rf fq_aggregated # Remove aggregate folder if present
mkdir fq_aggregated
# Rename Files within each using folder name.
for folder in $fastq_folders; do
folder=${folder%.*}
img_files=$(ls ${folder}/Images/*png);
for img in $img_files; do
img_name=$(basename "$img");
img_name=${img_name%.*}
mkdir -p fq_aggregated/${img_name};
mv $img fq_aggregated/${img_name}/${folder/_fastqc/}.png;
done;
done;
# Concatenate Summaries
for folder in $fastq_folders; do
folder=${folder%.*}
cat ${folder}/summary.txt >> fq_aggregated/summary.txt
done;
# Concatenate Statistics
for folder in $fastq_folders; do
folder=${folder%.*}
head -n 10 ${folder}/fastqc_data.txt | tail -n 7 | awk -v f=${folder/_fastqc/} '{ print $0 "\t" f }' >> fq_aggregated/statistics.txt
rm -rf ${folder}
done
Some of the summary checks are flagged as FAIL. I think running Rcorrector will clean this data up, and I will run FastQC again after running Rcorrector.
2. Rcorrector
Since Rcorrector is not available as a module on the UofU CHPC, I installed it locally in my directory: /uufs/chpc.utah.edu/common/home/u6007910/projects/sam
Install rcorrector:
module load git
git clone https://github.com/mourisl/rcorrector.git #downloads rcorrector repository
cd rcorrector/
ls
make #installs and downloads jellyfish2 if it is not available in the rcorrector path
perl run_rcorrector.pl #test for installation
Running rcorrector:
mkdir rcorrector
bash script for rcorrector:
#!/bin/bash
#SBATCH -n 12
#SBATCH -N 1
#SBATCH -t 50:00:00
#SBATCH -p usubio-kp
#SBATCH -A usubio-kp
#SBATCH -J rcorrect
#SBATCH --mail-type=ALL # Type of email notification- BEGIN,END,FAIL,ALL
#SBATCH --mail-user=samridhi.chaturvedi@gmail.com # Email to which notifications will be sent
module load perl
cd /uufs/chpc.utah.edu/common/home/gompert-group1/data/lycaeides/matt_transcriptome/rcorrector/
for prefix in $(ls /uufs/chpc.utah.edu/common/home/gompert-group1/data/lycaeides/matt_transcriptome/rawdata/fastqc/*.fastq | sed -r 's/_R[12]_001[.]fastq//' | uniq)
do
perl /uufs/chpc.utah.edu/common/home/u6007910/projects/sam/rcorrector/run_rcorrector.pl -t 12 -1 "${prefix}_R1_001.fastq" -2 "${prefix}_R2_001.fastq"
done
After Rcorrector has run, we get output files with the extension ".cor.fq". Use the following steps to run some statistics on these files and get counts of how many reads are tagged "cor" or "unfixable". I saved these numbers in this file:
#get file names
ls *.fq
#get the total number of reads in each file (each FASTQ record is 4 lines; grepping for "^@" can overcount because quality lines may also start with "@"):
for i in *fq; do echo "$i $(( $(wc -l < "$i") / 4 ))"; done
#get reads with "cor" tag
for i in *fq; do grep "cor" $i | wc -l; done
#get reads with "unfixable" tag
for i in *fq; do grep "unfixable" $i | wc -l; done
After this, I ran the script FilterUncorrectabledPEfastq.py to remove the unfixable reads. This outputs files with names unfixrm*cor.fq.
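If I remember the script's interface correctly (double-check with its help message; the -1/-2/-s arguments and file names here are assumptions), the per-sample call looks like:
python FilterUncorrectabledPEfastq.py -1 KS005_R1_001.cor.fq -2 KS005_R2_001.cor.fq -s KS005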