Folder where the dovetail genome is saved : /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/dovetail_melissa_genome/download/HiC_HiCRise_GLtS4/melissa_blue_21Nov2017_GLtS4
Folder where all the GBS data I generated in Summer 2017 is : /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/lycaeides_dubois/
In this folder there are subfolders. Here is the one where the files for the barcode stuff is: /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/lycaeides_dubois/Sequences
Files in this folder are:
barcodeSam036_037.csv - MAIN FILE, I need for my barcodes
samplesidstemp.txt - list of sample IDs from the plates in the order first 96 are 5-6 plate and next 96 are 7-8 plate.
barcodeGomp012.csv - example file I should create to chip barcodes
barcodeTemplate.csv - file which contains all the barcode information
gomp027_S8_L002_R1_001.fastq.gz - Lycaeides GBS data from this summer
md5sums.txt - just md5sums check file for the data
As I could not get fastqc to work, I decided to ditch it and run the alignment on my own using bwa alignment.
I have done this in the following directory: /uufs/chpc.utah.edu/common/home/gompert-group1/data/lycaeides/lycaeides_dubois/Sequences/phixFiltering
Files in this folder:
Folder phixSeqIndex: contains fasta files and bwa index files for the phiX sequences from the refSeq accession number NC_001422.
gomp027_S8_L002_R1_001.fastq : fastq file for my GBS data copied from the Sequences folder to this folder for safety.
phiXalignment.sh : bash script to align lycaeides fastq file to phiX
gomp027_S8_L002_R1_001_out.sam = sam file from the fastq file
gomp027_S8_L002_R1_001_out.sai = index file
gomp027_S8_L002_R1_001_out.bam = bam file
sample.mapped.sam = contains sequences that mapped to the phiX genome
sample.unmapped.sam = contains sequences that did not map to the phiX genome, this is the file to be used for parsing to remove barcodes
sample.unmapped.bam = bam file to further create fastq file
phiXfiltering.sh = script to convert sam file above to bam file and then separate out mapped and unmapped reads. Then convert the unmapped file to fastq format again to remove barcodes.
gomp027_S8_L002_R1_001_phiXfiltered.fastq = fastq file for further analysis for parsing barcodes
Here are the steps to do this:
A. Get the phiX sequences from NCBI
I downloaded the fasta genome file from ncbi directly to my desktop and secured copied it to this folder on the cluster.
B. Creating index files
I used bwa to index this fasta file :
module load bwa
bwa index -a is NC_001422.fasta
C. Aligning lycaeides fastq GBS data to the phiX genome
Next I had to run bwa to align my data to the phiX fasta file to see if we have any phiX sequences in our data. To do this I used bwa. The commands were in the script phiXalignment.sh (to submit to the cluster) and are as follows:
#!/bin/bash
#SBATCH --job-name=phiXfiltering
#SBATCH --time=96:00:00 #walltime
#SBATCH --nodes=1 #number of cluster nodes
#SBATCH --account=gompert-kp #PI account
#SBATCH --partition=gompert-kp #specify computer cluster, other option is kinspeak
## This script is to align the lycaeides GBS data fastq file with the fastq genome of phiX bacteria. This way I can filter the phiX sequences from my data.
### here are the commands for bwa aln
#-n NUM Max edit distance or the fraction of missing alignments given 2% uniform base error rate
#-l INT Take the first INT subsequence as seed. If INT is larger than the query sequence, seeding will be disabled. For long reads, this option is typically ranged from 25 to 35
#-k INT Maximum edit distance in the seed
#-t INT Number of threads
cd /uufs/chpc.utah.edu/common/home/gompert-group1/data/lycaeides/lycaeides_dubois/Sequences/phixFiltering
module load bwa
bwa aln -n 4 -l 20 -k 2 -t 12 ./phixSeqIndex/NC_001422.fasta gomp027_S8_L002_R1_001.fastq -f gomp027_S8_L002_R1_001_out.sai
bwa samse -n 1 -f gomp027_S8_L002_R1_001_out.sam -r "@RG\tID:test" ./phixSeqIndex/NC_001422.fasta gomp027_S8_L002_R1_001_out.sai gomp027_S8_L002_R1_001.fastq
This will creat a sam file with all the alignments. Next we have to check how many reads aligned to phiX and then remove those reads from the data before proceeding to chipping barcodes.
D. Determining number of mapped reads from the sam file
There are 2 ways in which we can do this. I tried both the ways and both of them work.
Approach 1 : To do this I used grep command:
##find the number of mapped reads
grep -c ^K00 gomp027_S8_L002_R1_001_out.sam #351511256 mapped reads
##column 3 in the sam file contains info for mapped alignment or not. * means the read did not align/map. So first count the number of * in each sequence line
grep ^K00 gomp027_S8_L002_R1_001_out.sam | cut -f 3 | grep -c "*" #number reads not mapped 261207195
## count the ones that do not contain *. This is done using invert mapping by grep -v option. Basically this tells how many lines did not match the pattern, here *.
grep ^K00 gomp027_S8_L002_R1_001_out.sam | cut -f 3 | grep -v -c "*" #261207195 number reads not mapped
## probably dont't need this info but still just tried this to see the pattern of alignments
grep ^K00 gomp027_S8_L002_R1_001_out.sam | cut -f 2 | sort | uniq -c #90304061 number of reads mapped
## now remove the lines which mapped so basically remove the lines which do not have * in the 3rd column
39849211 0 reads aligned in forward direction
50454850 16 reads aligned in reverse direction
261207195 4 reads did not align to the reference
Approach 2 : using samtools view
Using samtools view I can write out two files which contain mapped and unmapped reads. (this should give the same result as grep).
#convert the sam file to bam file
samtools view -S -b gomp027_S8_L002_R1_001_out.sam > gomp027_S8_L002_R1_001_out.bam
#find the number of mapped reads
samtools view -F4 gomp027_S8_L002_R1_001_out.bam > sample.mapped.sam #contains 90304061 reads
#get number of reads
wc -l sample.mapped.sam
#find the number of unmapped reads
samtools view -f4 gomp027_S8_L002_R1_001_out.bam > sample.unmapped.sam #contains 261207195 reads
#get number of reads
wc -l sample.unmapped.sam
These commands will look in the bam/sam file and extract reads depending on what is in column 4 in every line. Unmapped reads have “0” and mapped reads will have some other bigger number. -f4 will check for this “0” while -F4 will do the inverse, similar to the -v flag in grep.
~ 74 % of the reads did not map to phiX. These are clean and will be used for further analysis.
E. Parsing the barcodes
Now that I have the final filtered fastq file. I ran Zach's perl script to parse the barcodes. For this I need the barcode ID file (barcodeSam036_037.csv) and the filtered fastq file gomp027_S8_L002_R1_001_phiXfiltered.fastq. The script for this (parse_barcodesLyc036037.pl) is saved in the Scripts folder. The bash script parseBarcodes.sh is used to run the perl script on the cluster. This script will create a miderrors file and a parsed file. The parsed file should have lines written in 10X faster than the miderrors file. To check this I just test ran the script and when the output files got created, I just did wc -l to check the number of lines written in each file.
The final file used for alignments is: parsed_gomp027_S8_L002_R1_001_phiXfiltered.fastq.
JUST FYI
**************************************************************************************************************
fastqc screen for phiX
Before proceeding to remove barcodes from the sequences, I screened the GBS data for phiX contamination. During de novo assembly, contamination of Illumina data with PhiX control sequence may result in the generation of spurious contigs. This could cause index mis-assignment. Therefore to make sure the reads are not getting mis-assigned, I am screening the data for phiX contamination. To do this I am planning to run fastqc screen. This is a program from Babraham bioinformatics lab. Here is a link to fastqc screen page: https://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/. The program uses any one of the alignment tools (bowtie, bowtie2 or bwa) to screen the reads in the fastq file. In addition, the program does not process the full dataset due to time constraint. Instead, the runs use a subset of the full dataset wherein the reads are selected from the full dataset at regular intervals. There could be variation in results due to the alignment tool used but the results will be roughly similar.
I have installed fastqc on the cluster. The main program is saved in my folder in the directory as fastq_screen_v0.11.4 :
/uufs/chpc.utah.edu/common/home/u6007910/source/
Both zipped file and unzipped folder are here. In addition, I downloaded a test dataset from the webpage to test the program. It is saved in the same directory as fastq_screen_test_dataset.
Now......
I copied the program into the lycaeides_dubois folder where all the genome and sequence data will be. I copied it here: /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/lycaeides_dubois/Source/
****************************************************************************************************************