---
title: "Molly GBS analyses"
author: "Samridhi Chaturvedi"
date: "April 2021"
output:
  pdf_document: default
  html_document: default
---
Genome downloaded from: https://www.ncbi.nlm.nih.gov/assembly/GCF_001443285.1
Folder on UofU cluster: /uufs/chpc.utah.edu/common/home/gompert-group3/data/fish/
Unzipped fastq files: /uufs/chpc.utah.edu/common/home/gompert-group3/data/fish/sequences/
There are two files, which I assume correspond to the two lanes of sequencing.
# 1. PhiX filtering
*A. I first checked how many reads are present in each file:*
```bash
grep ^@K TSU_PLAT_S36_L007_R1_001.fastq | wc -l #323329960
grep ^@K TSU_PLAT_S36_L008_R1_001.fastq | wc -l #325534956
```
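As a quick cross-check (not part of the original commands), the read count can also be computed from the line count, since a fastq record is exactly 4 lines:
```bash
# total lines / 4 = number of reads; avoids relying on the @K header prefix
echo $(( $(wc -l < TSU_PLAT_S36_L007_R1_001.fastq) / 4 ))
echo $(( $(wc -l < TSU_PLAT_S36_L008_R1_001.fastq) / 4 ))
```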
I then performed phiX filtering for both these files separately.
Folder: /uufs/chpc.utah.edu/common/home/gompert-group3/data/fish/phiX/
For this, I used the phiX fasta file I had previously downloaded for the lycaeides data.
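Note that `bwa aln` requires an indexed reference; if the index from the lycaeides project were ever missing, a minimal sketch to rebuild it would be:
```bash
# index the phiX reference once; bwa aln/samse read the index files written next to the fasta
module load bwa
bwa index /uufs/chpc.utah.edu/common/home/u6007910/projects/lyc_dubois/lycaeides_dubois/Parsed/phixFiltering/phixSeqIndex/NC_001422.fasta
```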
*B. I then aligned the raw sequences to the phiX reference using the script **phiXalignment.sh**. This created a SAM file for each lane (2 files).*
```bash
#!/bin/bash
#SBATCH --job-name=phiXfiltering
#SBATCH --time=96:00:00 #walltime
#SBATCH --nodes=1 #number of cluster nodes
#SBATCH --account=gompert-kp #PI account
#SBATCH --partition=gompert-kp #specify computer cluster, other option is kinspeak
## This script aligns the molly GBS fastq files to the genome of the phiX bacteriophage so that phiX reads can be filtered from the data.
### here are the commands for bwa aln
#-n NUM Max edit distance or the fraction of missing alignments given 2% uniform base error rate
#-l INT Take the first INT subsequence as seed. If INT is larger than the query sequence, seeding will be disabled. For long reads, this option is typically ranged from 25 to 35
#-k INT Maximum edit distance in the seed
#-t INT Number of threads
cd /uufs/chpc.utah.edu/common/home/gompert-group3/data/fish/phiX/
module load bwa
#lane1
bwa aln -n 4 -l 20 -k 2 -t 12 /uufs/chpc.utah.edu/common/home/u6007910/projects/lyc_dubois/lycaeides_dubois/Parsed/phixFiltering/phixSeqIndex/NC_001422.fasta /uufs/chpc.utah.edu/common/home/gompert-group3/data/fish/sequences/TSU_PLAT_S36_L007_R1_001.fastq -f TSU_PLAT_S36_L007_R1_001_out.sai
bwa samse -n 1 -f TSU_PLAT_S36_L007_R1_001_out.sam -r "@RG\tID:test" /uufs/chpc.utah.edu/common/home/u6007910/projects/lyc_dubois/lycaeides_dubois/Parsed/phixFiltering/phixSeqIndex/NC_001422.fasta TSU_PLAT_S36_L007_R1_001_out.sai /uufs/chpc.utah.edu/common/home/gompert-group3/data/fish/sequences/TSU_PLAT_S36_L007_R1_001.fastq
#lane2
bwa aln -n 4 -l 20 -k 2 -t 12 /uufs/chpc.utah.edu/common/home/u6007910/projects/lyc_dubois/lycaeides_dubois/Parsed/phixFiltering/phixSeqIndex/NC_001422.fasta /uufs/chpc.utah.edu/common/home/gompert-group3/data/fish/sequences/TSU_PLAT_S36_L008_R1_001.fastq -f TSU_PLAT_S36_L008_R1_001_out.sai
bwa samse -n 1 -f TSU_PLAT_S36_L008_R1_001_out.sam -r "@RG\tID:test" /uufs/chpc.utah.edu/common/home/u6007910/projects/lyc_dubois/lycaeides_dubois/Parsed/phixFiltering/phixSeqIndex/NC_001422.fasta TSU_PLAT_S36_L008_R1_001_out.sai /uufs/chpc.utah.edu/common/home/gompert-group3/data/fish/sequences/TSU_PLAT_S36_L008_R1_001.fastq
```
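As an optional sanity check before filtering, `samtools flagstat` summarizes mapped vs. unmapped reads directly from each SAM file:
```bash
module load samtools
samtools flagstat TSU_PLAT_S36_L007_R1_001_out.sam
samtools flagstat TSU_PLAT_S36_L008_R1_001_out.sam
```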
*C. I then filtered the alignments to extract the unmapped reads, i.e., the reads that did not match phiX; these filtered reads are used for downstream processing.*
```bash
#!/bin/bash
#SBATCH --job-name=phiXfiltering
#SBATCH --time=96:00:00 #walltime
#SBATCH --nodes=1 #number of cluster nodes
#SBATCH --account=usubio-kp #PI account
#SBATCH --partition=usubio-kp #specify computer cluster, other option is kinspeak
## This script uses samtools to count the reads mapped and unmapped to the phiX genome, to write separate files for mapped and unmapped reads, and to create a final fastq file of the unmapped reads.
### samtools view options used below
#-S input is SAM (autodetected in newer samtools versions)
#-b output BAM
#-F4 exclude reads with the unmapped flag (4) set, i.e. keep mapped reads
#-f4 require the unmapped flag (4), i.e. keep unmapped reads
cd /uufs/chpc.utah.edu/common/home/gompert-group3/data/fish/phiX/
module load samtools
#lane1
samtools view -S -b TSU_PLAT_S36_L007_R1_001_out.sam > TSU_PLAT_S36_L007_R1_001_out.bam
samtools view -F4 TSU_PLAT_S36_L007_R1_001_out.bam > TSU_PLAT_S36_L007_R1_001_out.mapped.sam
samtools view -f4 TSU_PLAT_S36_L007_R1_001_out.bam > TSU_PLAT_S36_L007_R1_001_out.unmapped.sam
#extract the unmapped reads directly as BAM (the headerless .unmapped.sam above cannot be converted back to BAM)
samtools view -b -f4 TSU_PLAT_S36_L007_R1_001_out.bam > TSU_PLAT_S36_L007_R1_001_out.unmapped.bam
samtools bam2fq TSU_PLAT_S36_L007_R1_001_out.unmapped.bam > TSU_PLAT_S36_L007_R1_001_out.unmapped.fastq
mv TSU_PLAT_S36_L007_R1_001_out.unmapped.fastq TSU_PLAT_S36_L007_R1_001_phiXfiltered.fastq
#lane2
samtools view -S -b TSU_PLAT_S36_L008_R1_001_out.sam > TSU_PLAT_S36_L008_R1_001_out.bam
samtools view -F4 TSU_PLAT_S36_L008_R1_001_out.bam > TSU_PLAT_S36_L008_R1_001_out.mapped.sam
samtools view -f4 TSU_PLAT_S36_L008_R1_001_out.bam > TSU_PLAT_S36_L008_R1_001_out.unmapped.sam
#extract the unmapped reads directly as BAM, as for lane 1
samtools view -b -f4 TSU_PLAT_S36_L008_R1_001_out.bam > TSU_PLAT_S36_L008_R1_001_out.unmapped.bam
samtools bam2fq TSU_PLAT_S36_L008_R1_001_out.unmapped.bam > TSU_PLAT_S36_L008_R1_001_out.unmapped.fastq
mv TSU_PLAT_S36_L008_R1_001_out.unmapped.fastq TSU_PLAT_S36_L008_R1_001_phiXfiltered.fastq
```
*D. I then checked the number of reads that mapped to the phiX sequences and the number that did not:*
```bash
## lane 1
#number of reads mapped to phiX
wc -l TSU_PLAT_S36_L007_R1_001_out.mapped.sam #65249985
#number of unmapped reads
wc -l TSU_PLAT_S36_L007_R1_001_out.unmapped.sam #258079975
## lane 2
#number of reads mapped to phiX
wc -l TSU_PLAT_S36_L008_R1_001_out.mapped.sam #67017011
#number of unmapped reads
wc -l TSU_PLAT_S36_L008_R1_001_out.unmapped.sam #258517945
```
**~80% of the reads did not map to phiX (e.g., for lane 1: 258,079,975 / 323,329,960 ≈ 79.8%). These clean reads will be used for further analysis.**
# 2. Parse barcodes and split into fastq files
Now that I had the final filtered fastq files, I ran Zach's perl script to parse the barcodes. For this I needed the barcode ID file sent by Chris (diana_barcodeKeys.csv) and the filtered fastq files from the *phiX* folder: TSU_PLAT_S36_L007_R1_001_phiXfiltered.fastq and TSU_PLAT_S36_L008_R1_001_phiXfiltered.fastq.
I combined the two files as they have the same samples:
```bash
cat TSU_PLAT_S36_L007_R1_001_phiXfiltered.fastq TSU_PLAT_S36_L008_R1_001_phiXfiltered.fastq > TSU_PLAT_S36_combined_phiXfiltered.fastq
```
The perl script for this, **parse_barcodesFish.pl**, is saved in the Scripts folder; the bash script parseBarcodes.sh runs it on the cluster. The run creates a parsed file and a miderrors file, and lines should be written to the parsed file roughly 10X faster than to the miderrors file. To check this, I did a test run and, once the output files were created, used wc -l to compare the number of lines written to each file.
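For example, the check looked something like this (the parsed_*/miderrors_* output names are assumptions about the perl script's naming convention):
```bash
# compare lines written so far; the parsed file should be growing ~10X faster
wc -l parsed_TSU_PLAT_S36_combined_phiXfiltered.fastq
wc -l miderrors_TSU_PLAT_S36_combined_phiXfiltered.fastq
```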
I split the reads into 7 (00-06) smaller files to parse barcodes using parallel jobs:
```bash
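# 300,000,000 lines = 75,000,000 reads per chunk; -l must be a multiple of 4 so fastq records are not split across files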
split -l 300000000 -d --additional-suffix=.fastq TSU_PLAT_S36_combined_phiXfiltered.fastq TSU_PLAT_S36_split_phiXfiltered_
```
I then used the **run_parseBarcodes.sh** script to parse the barcodes; it runs the parse_barcodesFish.pl script on the cluster. I combined the parsed and miderrors files from each run into final combined files with cat (see the sketch below). I then used **splitFastq.pl** to split the combined parsed reads into a fastq file for each individual.
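The combining step looked roughly like this (chunk output names follow the assumed parsed_*/miderrors_* convention above):
```bash
# concatenate the per-chunk outputs into single parsed and miderrors files
cat parsed_TSU_PLAT_S36_split_phiXfiltered_0* > parsed_TSU_PLAT_S36_combined.fastq
cat miderrors_TSU_PLAT_S36_split_phiXfiltered_0* > miderrors_TSU_PLAT_S36_combined.fastq
```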
These files were then moved to the following directory: /uufs/chpc.utah.edu/common/home/gompert-group3/data/fish/barcodes/demultiplexed_reads/
# 3. Alignment
3a.
Before starting the alignments, I indexed the *P. latipinna* genome using BWA **3a_run_bwa_index.sh**:
```bash
#!/bin/bash
#SBATCH --job-name=genomeindex
#SBATCH --time=70:00:00 #walltime
#SBATCH --nodes=1 #number of cluster nodes
#SBATCH --account=usubio-kp #PI account
#SBATCH --partition=usubio-kp #specify computer cluster, other option is kinspeak
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samridhi.chaturvedi@gmail.com
cd /uufs/chpc.utah.edu/common/home/gompert-group3/data/fish/ncbi-genomes-2021-01-04/
module load bwa
bwa index -a bwtsw GCF_001443285.1_P_latipinna-1.0_genomic.fasta
```
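`bwa index` writes five companion files (.amb, .ann, .bwt, .pac, .sa) alongside the fasta; a quick ls confirms the index completed:
```bash
ls /uufs/chpc.utah.edu/common/home/gompert-group3/data/fish/ncbi-genomes-2021-01-04/GCF_001443285.1_P_latipinna-1.0_genomic.fasta.*
```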
3b.
Next, I aligned the fastq files to the *P. latipinna* genome using the BWA-MEM algorithm. I used the **3b_run_bwa_mem.sh** bash script to run BWA:
```bash
#!/bin/sh
#SBATCH --time=96:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --account=usubio-kp
#SBATCH --partition=usubio-kp
#SBATCH --job-name=bwamem
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samridhi.chaturvedi@gmail.com
## This script runs bwa mem on each demultiplexed fastq file
cd /uufs/chpc.utah.edu/common/home/gompert-group3/data/fish/barcodes/demultiplexed_reads/
module load bwa
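## bwa mem options used below:
#-t INT number of threads (12 here, though the job requests 8 tasks)
#-k INT minimum seed length [19]
#-r FLOAT re-seed a MEM longer than minSeedLen*FLOAT; lower values give more seeds but run slower [1.5]
#-T INT minimum alignment score to output [30]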
for i in $(ls *.fastq | sed -r 's/.fastq//' | uniq)
do
bwa mem -t 12 -k 15 -r 1.3 -T 30 /uufs/chpc.utah.edu/common/home/gompert-group3/data/fish/ncbi-genomes-2021-01-04/GCF_001443285.1_P_latipinna-1.0_genomic.fasta ${i}.fastq > ${i}.sam
mv ${i}.sam /uufs/chpc.utah.edu/common/home/gompert-group3/data/fish/alignments/samfiles/
done
```
3c.
I then ran the **3c_run_sam_to_bam.sh** script to convert the SAM files to sorted and indexed BAM files.
```bash
#!/bin/sh
#SBATCH --time=96:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --account=usubio-kp
#SBATCH --partition=usubio-kp
#SBATCH --job-name=sam2bam
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samridhi.chaturvedi@gmail.com
## This script converts each SAM file to a sorted, indexed BAM file
cd /uufs/chpc.utah.edu/common/home/gompert-group3/data/fish/alignments/samfiles/
module load samtools
module load bcftools
for i in $(ls *.sam | sed -r 's/.sam//' | uniq);
do
samtools view -b -S -o ${i}.bam ${i}.sam
samtools sort ${i}.bam -o ${i}.sorted.bam
samtools index ${i}.sorted.bam
mv ${i}*bam* /uufs/chpc.utah.edu/common/home/gompert-group3/data/fish/alignments/bamfiles/
done
```
I then got some initial numbers on the reads in the sorted BAM files using the following command:
```bash
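# idxstats columns per reference: name, length, # mapped reads, # unmapped reads; awk sums the mapped column for each bam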
for i in *sorted.bam; do samtools idxstats $i | awk '{s+=$3} END {print s}'; done
```
The per-individual read counts looked reasonable.
# 4. Variant calling and filtering
4a.
I did variant calling using samtools and bcftools using the script **4a_run_variant_calling.sh**
```bash
#!/bin/sh
#SBATCH --time=96:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --account=usubio-kp
#SBATCH --partition=usubio-kp
#SBATCH --job-name=variantcalling
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=samridhi.chaturvedi@gmail.com
echo ------------------------------------------------------
echo SLURM: job identifier is $SLURM_JOBID
echo SLURM: job name is $SLURM_JOB_NAME
echo ------------------------------------------------------
#SAMTOOLs mpileup version 1.5 options used:
#C = adjust mapping quality; recommended:50, disable:0 [0]
#d = max per-file depth; avoids excessive memory usage [250]
#f = faidx indexed reference sequence file
#q = skip alignments with mapQ smaller than INT [0]
#Q = skip bases with baseQ/BAQ smaller than INT [13]
#g = generate genotype likelihoods in BCF format
#t = --output-tags LIST optional tags to output:DP,AD,ADF,ADR,SP,INFO/AD,INFO/ADF,INFO/ADR []
#BCFTOOLs call version 1.6 options used
#v = output variant sites only
#c/m = the original consensus calling method (-c) or the alternative model for multiallelic and rare-variant calling (-m); the two conflict
#p = variant if P(ref|D)<FLOAT with -c [0.5]
#P = prior mutation rate (use bigger for greater sensitivity), use with -m [1.1e-3]
#O = output type: 'b' compressed BCF; 'u' uncompressed BCF; 'z' compressed VCF; 'v' uncompressed VCF (here it is 'v')
#o = write output to a file [standard output]
module load samtools
module load bcftools
cd /uufs/chpc.utah.edu/common/home/gompert-group3/data/fish/alignments/bamfiles/
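# molliesBam.txt lists one sorted BAM file per line; if recreating it, something like:
# ls *.sorted.bam > molliesBam.txt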
samtools mpileup -C 50 -d 250 -f /uufs/chpc.utah.edu/common/home/gompert-group3/data/fish/ncbi-genomes-2021-01-04/GCF_001443285.1_P_latipinna-1.0_genomic.fasta -q 20 -Q 15 -g -I -t DP,DPR -u -b molliesBam.txt -o variantsMollies.bcf
bcftools call -v -c -p 0.01 -P 0.001 -O v -o variantsMollies.vcf variantsMollies.bcf
mv variantsMollies* /uufs/chpc.utah.edu/common/home/gompert-group3/data/fish/variants/
```
I then checked how many variants were called and got some basic statistics for this step:
```bash
#just get number of SNPs called using command line
grep ^NW variantsMollies.vcf | wc -l #331619 SNPs are called
#use BCFTOOLS to get more stats
bcftools stats variantsMollies.vcf > mollies.stats #confirms 331619 SNPs called
#sanity checks for allele frequencies
grep ^NW variantsMollies.vcf | cut -f 8 | grep -o "AF1=[0-9\.]*" | cut -d '=' -f 2 > alleleFreqs.txt #keep only the numeric value so awk can compare it
awk '$1 < 0.005' alleleFreqs.txt | wc -l #89737
awk '$1 > 0.995' alleleFreqs.txt | wc -l #46927
```
4b.
I then filtered variants based on coverage, missing data, and MAF.
Here are the options I set for filtering (the thresholds imply 384 individuals: $minCoverage = 2 × 384 = 768 and $miss = 77 ≈ 20% of 384):
```bash
my $minCoverage = 768; # minimum number of sequences; DP, no. of individuals = X; mincoverage = 2X #change for each species
my $minAltRds = 4; # minimum number of sequences with the alternative allele; AC #10 #copies
my $notFixed = 1.0; # removes loci fixed for alt; AF #check this
my $mq = 30; # minimum mapping quality; MQ
my $miss = 77; # maximum number of individuals with no data #20% of the number of individuals #change for each species
my $minMaf = 0.005; # minimum MAF, as allele count ~ 0.005 # 0.005% of 2X, check AF in the sequences #frequency rare vs common #change for each species
my $ratDifDir = 0.01; ## maximum proportion of reads in reverse (relative to avg.) orientation
```
**After this step, I retained 61863 SNPs.**
4c.
To run the second filtering script, **filtersomemore.pl**, I needed two inputs: (a) a file with the scaffold and position of each SNP, and (b) the maximum coverage depth across SNPs.
To get the depth across sites:
```bash
grep ^NW filtered2x_variantsMollies.vcf | cut -f 8 | cut -d ';' -f 1 | grep -o '[0-9]*' > depth_filtered2X_mollies.txt
```
To create the snpsMollies.txt file (scaffold and position of each SNP) for filtersomemore.pl:
```bash
#to get scaffold and position
grep ^NW filtered2x_variantsMollies.vcf | cut -f 1 | cut -d '_' -f 2 | cut -d '.' -f 1 > scaffolds.txt
grep ^NW filtered2x_variantsMollies.vcf | cut -f 2 > positions.txt
#create final file
paste scaffolds.txt positions.txt > filtered2x_vcf_scaf_pos.txt
```
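A quick optional check that the two columns line up (all three line counts should equal the number of retained SNPs, 61863):
```bash
wc -l scaffolds.txt positions.txt filtered2x_vcf_scaf_pos.txt
```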
I then checked how many scaffolds we had and how many SNPs per scaffold.
```R
sp <- read.table("filtered2x_vcf_scaf_pos.txt", header=FALSE)
length(unique(sp[,1])) #3967 scaffolds
sort(table(sp[,1]), decreasing=TRUE)[1] #scaffold with the most SNPs (15112616, with 199)
```
**We have 3967 scaffolds; scaffold 15112616 has the most SNPs (199).**
I then set the filtering criteria for the next round of filtering as follows.
First, I calculated the maximum coverage cutoff (mean + 2 SD) in R:
```R
x = read.table("depth_filtered2X_mollies.txt", header=FALSE)
cov = mean(x$V1) + 2*(sd(x$V1)) #11690.59
```
**The maximum coverage cutoff for these data is 11690.59.**
```bash
my $maxCoverage = 11691; #maximum depth to avoid repeats, mean + 2sd = 11690.59
my $mind = 5; #don't call SNPs that are closer together than this (bp)
```
I checked how many SNPs would be retained if we changed the $mind filter: the number ranged from 46022 (with $mind = 10) to 57714 (with $mind = 2). I chose $mind = 5.
**Based on this, I retained 52900 SNPs for downstream analyses.**