---
title: "Molly GBS analyses"
author: "Samridhi Chaturvedi"
date: "April 2021"
output:
  pdf_document: default
  html_document: default
---
Genome downloaded from: https://www.ncbi.nlm.nih.gov/assembly/GCF_001443285.1
Folder on UofU cluster: /uufs/chpc.utah.edu/common/home/gompert-group3/data/fish/
Unzipped fastq files: /uufs/chpc.utah.edu/common/home/gompert-group3/data/fish/sequences/
There are two files, which I assume correspond to the two lanes of sequencing.
# 1. PhiX filtering
*A. I first checked how many reads are present in each file:*
```bash
grep ^@K TSU_PLAT_S36_L007_R1_001.fastq | wc -l #323329960
grep ^@K TSU_PLAT_S36_L008_R1_001.fastq | wc -l #325534956
```
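As a quick cross-check (not part of the original commands), the read count can also be computed from the line count, since a fastq record is exactly 4 lines:
```bash
# total lines / 4 = number of reads; avoids relying on the @K header prefix
echo $(( $(wc -l < TSU_PLAT_S36_L007_R1_001.fastq) / 4 ))
echo $(( $(wc -l < TSU_PLAT_S36_L008_R1_001.fastq) / 4 ))
```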
I then performed phiX filtering for both these files separately.
Folder: /uufs/chpc.utah.edu/common/home/gompert-group3/data/fish/phiX/
For this, I used the phiX fasta file I had previously downloaded for the lycaeides data.
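Note that `bwa aln` requires an indexed reference; if the index from the lycaeides project were ever missing, a minimal sketch to rebuild it would be:
```bash
# index the phiX reference once; bwa aln/samse read the index files written next to the fasta
module load bwa
bwa index /uufs/chpc.utah.edu/common/home/u6007910/projects/lyc_dubois/lycaeides_dubois/Parsed/phixFiltering/phixSeqIndex/NC_001422.fasta
```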
*B. I then aligned the raw sequences to the phiX reference using the script **phiXalignment.sh**. This created a SAM file for each lane (2 files).*
```bash
#!/bin/bash
#SBATCH --job-name=phiXfiltering
#SBATCH --time=96:00:00 #walltime
#SBATCH --nodes=1 #number of cluster nodes
#SBATCH --account=gompert-kp #PI account
#SBATCH --partition=gompert-kp #specify computer cluster, other option is kinspeak
## This script aligns the molly GBS fastq files to the genome of the phiX bacteriophage so that phiX reads can be filtered from the data.
### here are the commands for bwa aln
#-n NUM Max edit distance or the fraction of missing alignments given 2% uniform base error rate
#-l INT Take the first INT subsequence as seed. If INT is larger than the query sequence, seeding will be disabled. For long reads, this option is typically ranged from 25 to 35
#-k INT Maximum edit distance in the seed
#-t INT Number of threads
cd /uufs/chpc.utah.edu/common/home/gompert-group3/data/fish/phiX/
module load bwa
#lane1
bwa aln -n 4 -l 20 -k 2 -t 12 /uufs/chpc.utah.edu/common/home/u6007910/projects/lyc_dubois/lycaeides_dubois/Parsed/phixFiltering/phixSeqIndex/NC_001422.fasta /uufs/chpc.utah.edu/common/home/gompert-group3/data/fish/sequences/TSU_PLAT_S36_L007_R1_001.fastq -f TSU_PLAT_S36_L007_R1_001_out.sai
bwa samse -n 1 -f TSU_PLAT_S36_L007_R1_001_out.sam -r "@RG\tID:test" /uufs/chpc.utah.edu/common/home/u6007910/projects/lyc_dubois/lycaeides_dubois/Parsed/phixFiltering/phixSeqIndex/NC_001422.fasta TSU_PLAT_S36_L007_R1_001_out.sai /uufs/chpc.utah.edu/common/home/gompert-group3/data/fish/sequences/TSU_PLAT_S36_L007_R1_001.fastq
#lane2
bwa aln -n 4 -l 20 -k 2 -t 12 /uufs/chpc.utah.edu/common/home/u6007910/projects/lyc_dubois/lycaeides_dubois/Parsed/phixFiltering/phixSeqIndex/NC_001422.fasta /uufs/chpc.utah.edu/common/home/gompert-group3/data/fish/sequences/TSU_PLAT_S36_L008_R1_001.fastq -f TSU_PLAT_S36_L008_R1_001_out.sai
bwa samse -n 1 -f TSU_PLAT_S36_L008_R1_001_out.sam -r "@RG\tID:test" /uufs/chpc.utah.edu/common/home/u6007910/projects/lyc_dubois/lycaeides_dubois/Parsed/phixFiltering/phixSeqIndex/NC_001422.fasta TSU_PLAT_S36_L008_R1_001_out.sai /uufs/chpc.utah.edu/common/home/gompert-group3/data/fish/sequences/TSU_PLAT_S36_L008_R1_001.fastq
```
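As an optional sanity check before filtering, `samtools flagstat` summarizes mapped vs. unmapped reads directly from each SAM file:
```bash
module load samtools
samtools flagstat TSU_PLAT_S36_L007_R1_001_out.sam
samtools flagstat TSU_PLAT_S36_L008_R1_001_out.sam
```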
*C. I then filtered the alignments to extract the unmapped reads, i.e., the reads that did not match phiX; these filtered reads are used for downstream processing.*
```bash
#!/bin/bash
#SBATCH --job-name=phiXfiltering
#SBATCH --time=96:00:00 #walltime
#SBATCH --nodes=1 #number of cluster nodes
#SBATCH --account=usubio-kp #PI account
#SBATCH --partition=usubio-kp #specify computer cluster, other option is kinspeak
## This script uses samtools to count the reads mapped and unmapped to the phiX genome, to write separate files for mapped and unmapped reads, and to create a final fastq file of the unmapped reads.
### samtools view options used below
#-S input is SAM (autodetected in newer samtools versions)
#-b output BAM
#-F4 exclude reads with the unmapped flag (4) set, i.e. keep mapped reads
#-f4 require the unmapped flag (4), i.e. keep unmapped reads
cd /uufs/chpc.utah.edu/common/home/gompert-group3/data/fish/phiX/
module load samtools
#lane1
samtools view -S -b TSU_PLAT_S36_L007_R1_001_out.sam > TSU_PLAT_S36_L007_R1_001_out.bam
samtools view -F4 TSU_PLAT_S36_L007_R1_001_out.bam > TSU_PLAT_S36_L007_R1_001_out.mapped.sam
samtools view -f4 TSU_PLAT_S36_L007_R1_001_out.bam > TSU_PLAT_S36_L007_R1_001_out.unmapped.sam
#extract the unmapped reads directly as BAM (the headerless .unmapped.sam above cannot be converted back to BAM)
samtools view -b -f4 TSU_PLAT_S36_L007_R1_001_out.bam > TSU_PLAT_S36_L007_R1_001_out.unmapped.bam
samtools bam2fq TSU_PLAT_S36_L007_R1_001_out.unmapped.bam > TSU_PLAT_S36_L007_R1_001_out.unmapped.fastq
mv TSU_PLAT_S36_L007_R1_001_out.unmapped.fastq TSU_PLAT_S36_L007_R1_001_phiXfiltered.fastq
#lane2
samtools view -S -b TSU_PLAT_S36_L008_R1_001_out.sam > TSU_PLAT_S36_L008_R1_001_out.bam
samtools view -F4 TSU_PLAT_S36_L008_R1_001_out.bam > TSU_PLAT_S36_L008_R1_001_out.mapped.sam
samtools view -f4 TSU_PLAT_S36_L008_R1_001_out.bam > TSU_PLAT_S36_L008_R1_001_out.unmapped.sam
#extract the unmapped reads directly as BAM, as for lane 1
samtools view -b -f4 TSU_PLAT_S36_L008_R1_001_out.bam > TSU_PLAT_S36_L008_R1_001_out.unmapped.bam
samtools bam2fq TSU_PLAT_S36_L008_R1_001_out.unmapped.bam > TSU_PLAT_S36_L008_R1_001_out.unmapped.fastq
mv TSU_PLAT_S36_L008_R1_001_out.unmapped.fastq TSU_PLAT_S36_L008_R1_001_phiXfiltered.fastq
```
*D. I then checked the number of reads that mapped to the phiX sequences and the number that did not:*
```bash
## lane 1
#number of reads mapped to phiX
wc -l TSU_PLAT_S36_L007_R1_001_out.mapped.sam #65249985
#number of unmapped reads
wc -l TSU_PLAT_S36_L007_R1_001_out.unmapped.sam #258079975
## lane 2
#number of reads mapped to phiX
wc -l TSU_PLAT_S36_L008_R1_001_out.mapped.sam #67017011
#number of unmapped reads
wc -l TSU_PLAT_S36_L008_R1_001_out.unmapped.sam #258517945
```
**~80% of the reads did not map to phiX (e.g., for lane 1: 258,079,975 / 323,329,960 ≈ 79.8%). These clean reads will be used for further analysis.**
# 2. Parse barcodes and split into fastq files
Now that I had the final filtered fastq files, I ran Zach's perl script to parse the barcodes. For this I needed the barcode ID file sent by Chris (diana_barcodeKeys.csv) and the filtered fastq files from the *phiX* folder: TSU_PLAT_S36_L007_R1_001_phiXfiltered.fastq and TSU_PLAT_S36_L008_R1_001_phiXfiltered.fastq.
I combined the two files as they have the same samples:
```bash
cat TSU_PLAT_S36_L007_R1_001_phiXfiltered.fastq TSU_PLAT_S36_L008_R1_001_phiXfiltered.fastq > TSU_PLAT_S36_combined_phiXfiltered.fastq
```
The perl script for this, **parse_barcodesFish.pl**, is saved in the Scripts folder; the bash script parseBarcodes.sh runs it on the cluster. The run creates a parsed file and a miderrors file, and lines should be written to the parsed file roughly 10X faster than to the miderrors file. To check this, I did a test run and, once the output files were created, used wc -l to compare the number of lines written to each file.
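For example, the check looked something like this (the parsed_*/miderrors_* output names are assumptions about the perl script's naming convention):
```bash
# compare lines written so far; the parsed file should be growing ~10X faster
wc -l parsed_TSU_PLAT_S36_combined_phiXfiltered.fastq
wc -l miderrors_TSU_PLAT_S36_combined_phiXfiltered.fastq
```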
I split the reads into 7 (00-06) smaller files to parse barcodes using parallel jobs:
```bash
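# 300,000,000 lines = 75,000,000 reads per chunk; -l must be a multiple of 4 so fastq records are not split across files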
split -l 300000000 -d --additional-suffix=.fastq TSU_PLAT_S36_combined_phiXfiltered.fastq TSU_PLAT_S36_split_phiXfiltered_
```
I then used the **run_parseBarcodes.sh** script to parse the barcodes; it runs the parse_barcodesFish.pl script on the cluster. I combined the parsed and miderrors files from each run into final combined files with cat (see the sketch below). I then used **splitFastq.pl** to split the combined parsed reads into a fastq file for each individual.
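The combining step looked roughly like this (chunk output names follow the assumed parsed_*/miderrors_* convention above):
```bash
# concatenate the per-chunk outputs into single parsed and miderrors files
cat parsed_TSU_PLAT_S36_split_phiXfiltered_0* > parsed_TSU_PLAT_S36_combined.fastq
cat miderrors_TSU_PLAT_S36_split_phiXfiltered_0* > miderrors_TSU_PLAT_S36_combined.fastq
```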
These files were then moved to the following directory: /uufs/chpc.utah.edu/common/home/gompert-group3/data/fish/barcodes/demultiplexed_reads/
# 3. Alignment
3a.
Before starting the alignments, I indexed the *P. latipinna* genome using BWA **3a_run_bwa_index.sh**:
```bash
#!/bin/bash
#SBATCH --job-name=genomeindex
#SBATCH --time=70:00:00 #walltime
#SBATCH --nodes=1 #number of cluster nodes
#SBATCH --account=usubio-kp #PI account
#SBATCH --partition=usubio-kp #specify computer cluster, other option is kinspeak
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samridhi.chaturvedi@gmail.com
cd /uufs/chpc.utah.edu/common/home/gompert-group3/data/fish/ncbi-genomes-2021-01-04/
module load bwa
bwa index -a bwtsw GCF_001443285.1_P_latipinna-1.0_genomic.fasta
```
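`bwa index` writes five companion files (.amb, .ann, .bwt, .pac, .sa) alongside the fasta; a quick ls confirms the index completed:
```bash
ls /uufs/chpc.utah.edu/common/home/gompert-group3/data/fish/ncbi-genomes-2021-01-04/GCF_001443285.1_P_latipinna-1.0_genomic.fasta.*
```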
3b.
Next, I aligned the fastq files to the *P. latipinna* genome using the BWA-MEM algorithm. I used the **3b_run_bwa_mem.sh** bash script to run BWA:
```bash
#!/bin/sh
#SBATCH --time=96:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --account=usubio-kp
#SBATCH --partition=usubio-kp
#SBATCH --job-name=bwamem
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samridhi.chaturvedi@gmail.com
## This script runs bwa mem on each demultiplexed fastq file
cd /uufs/chpc.utah.edu/common/home/gompert-group3/data/fish/barcodes/demultiplexed_reads/
module load bwa
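## bwa mem options used below:
#-t INT number of threads (12 here, though the job requests 8 tasks)
#-k INT minimum seed length [19]
#-r FLOAT re-seed a MEM longer than minSeedLen*FLOAT; lower values give more seeds but run slower [1.5]
#-T INT minimum alignment score to output [30]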
for i in $(ls *.fastq | sed -r 's/.fastq//' | uniq)
do
bwa mem -t 12 -k 15 -r 1.3 -T 30 /uufs/chpc.utah.edu/common/home/gompert-group3/data/fish/ncbi-genomes-2021-01-04/GCF_001443285.1_P_latipinna-1.0_genomic.fasta ${i}.fastq > ${i}.sam
mv ${i}.sam /uufs/chpc.utah.edu/common/home/gompert-group3/data/fish/alignments/samfiles/
done
```
3c.
I then ran the **3c_run_sam_to_bam.sh** script to convert the SAM files to sorted and indexed BAM files.
```bash
#!/bin/sh
#SBATCH --time=96:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --account=usubio-kp
#SBATCH --partition=usubio-kp
#SBATCH --job-name=sam2bam
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samridhi.chaturvedi@gmail.com
## This script converts each SAM file to a sorted, indexed BAM file
cd /uufs/chpc.utah.edu/common/home/gompert-group3/data/fish/alignments/samfiles/
module load samtools
module load bcftools
for i in $(ls *.sam | sed -r 's/.sam//' | uniq);
do
samtools view -b -S -o ${i}.bam ${i}.sam
samtools sort ${i}.bam -o ${i}.sorted.bam
samtools index ${i}.sorted.bam
mv ${i}*bam* /uufs/chpc.utah.edu/common/home/gompert-group3/data/fish/alignments/bamfiles/
done
```
I then got some initial numbers on the reads in the sorted BAM files using the following command:
```bash
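# idxstats columns per reference: name, length, # mapped reads, # unmapped reads; awk sums the mapped column for each bam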
for i in *sorted.bam; do samtools idxstats $i | awk '{s+=$3} END {print s}'; done
```
The per-individual read counts looked reasonable.
# 4. Variant calling and filtering
4a.
I did variant calling using samtools and bcftools using the script **4a_run_variant_calling.sh**
```bash
#!/bin/sh
#SBATCH --time=96:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --account=usubio-kp
#SBATCH --partition=usubio-kp
#SBATCH --job-name=variantcalling
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=samridhi.chaturvedi@gmail.com
echo ------------------------------------------------------
echo SLURM: job identifier is $SLURM_JOBID
echo SLURM: job name is $SLURM_JOB_NAME
echo ------------------------------------------------------
#SAMTOOLs mpileup version 1.5 options used:
#C = adjust mapping quality; recommended:50, disable:0 [0]
#d = max per-file depth; avoids excessive memory usage [250]
#f = faidx indexed reference sequence file
#q = skip alignments with mapQ smaller than INT [0]
#Q = skip bases with baseQ/BAQ smaller than INT [13]
#g = generate genotype likelihoods in BCF format
#t = --output-tags LIST optional tags to output:DP,AD,ADF,ADR,SP,INFO/AD,INFO/ADF,INFO/ADR []
#BCFTOOLs call version 1.6 options used
#v = output variant sites only
#c/m = the original consensus calling method (-c) or the alternative model for multiallelic and rare-variant calling (-m); the two conflict
#p = variant if P(ref|D)<FLOAT with -c [0.5]
#P = prior mutation rate (use bigger for greater sensitivity), use with -m [1.1e-3]
#O = output type: 'b' compressed BCF; 'u' uncompressed BCF; 'z' compressed VCF; 'v' uncompressed VCF (here it is 'v')
#o = write output to a file [standard output]
module load samtools
module load bcftools
cd /uufs/chpc.utah.edu/common/home/gompert-group3/data/fish/alignments/bamfiles/
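# molliesBam.txt lists one sorted BAM file per line; if recreating it, something like:
# ls *.sorted.bam > molliesBam.txt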
samtools mpileup -C 50 -d 250 -f /uufs/chpc.utah.edu/common/home/gompert-group3/data/fish/ncbi-genomes-2021-01-04/GCF_001443285.1_P_latipinna-1.0_genomic.fasta -q 20 -Q 15 -g -I -t DP,DPR -u -b molliesBam.txt -o variantsMollies.bcf
bcftools call -v -c -p 0.01 -P 0.001 -O v -o variantsMollies.vcf variantsMollies.bcf
mv variantsMollies* /uufs/chpc.utah.edu/common/home/gompert-group3/data/fish/variants/
```
I then checked how many variants were called and got some basic statistics for this step:
```bash
#just get number of SNPs called using command line
grep ^NW variantsMollies.vcf | wc -l #331619 SNPs are called
#use BCFTOOLS to get more stats
bcftools stats variantsMollies.vcf > mollies.stats #confirms 331619 SNPs called
#sanity checks for allele frequencies
grep ^NW variantsMollies.vcf | cut -f 8 | grep -o "AF1=[0-9\.]*" | cut -d '=' -f 2 > alleleFreqs.txt #keep only the numeric value so awk can compare it
awk '$1 < 0.005' alleleFreqs.txt | wc -l #89737
awk '$1 > 0.995' alleleFreqs.txt | wc -l #46927
```
4b.
I then filtered variants based on coverage, missing data, and MAF.
Here are the options I set for filtering (the thresholds imply 384 individuals: $minCoverage = 2 × 384 = 768 and $miss = 77 ≈ 20% of 384):
```bash
my $minCoverage = 768; # minimum number of sequences; DP, no. of individuals = X; mincoverage = 2X #change for each species
my $minAltRds = 4; # minimum number of sequences with the alternative allele; AC #10 #copies
my $notFixed = 1.0; # removes loci fixed for alt; AF #check this
my $mq = 30; # minimum mapping quality; MQ
my $miss = 77; # maximum number of individuals with no data #20% of the number of individuals #change for each species
my $minMaf = 0.005; # minimum MAF, as allele count ~ 0.005 # 0.005% of 2X, check AF in the sequences #frequency rare vs common #change for each species
my $ratDifDir = 0.01; ## maximum proportion of reads in reverse (relative to avg.) orientation
```
**After this step, I retained 61863 SNPs.**
4c.
To run the second filtering script, **filtersomemore.pl**, I needed two inputs: (a) a file with the scaffold and position of each SNP, and (b) the maximum coverage depth across SNPs.
To get the depth across sites:
```bash
grep ^NW filtered2x_variantsMollies.vcf | cut -f 8 | cut -d ';' -f 1 | grep -o '[0-9]*' > depth_filtered2X_mollies.txt
```
To create the snpsMollies.txt file (scaffold and position of each SNP) for filtersomemore.pl:
```bash
#to get scaffold and position
grep ^NW filtered2x_variantsMollies.vcf | cut -f 1 | cut -d '_' -f 2 | cut -d '.' -f 1 > scaffolds.txt
grep ^NW filtered2x_variantsMollies.vcf | cut -f 2 > positions.txt
#create final file
paste scaffolds.txt positions.txt > filtered2x_vcf_scaf_pos.txt
```
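A quick optional check that the two columns line up (all three line counts should equal the number of retained SNPs, 61863):
```bash
wc -l scaffolds.txt positions.txt filtered2x_vcf_scaf_pos.txt
```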
I then checked how many scaffolds we had and how many SNPs per scaffold.
```R
sp <- read.table("filtered2x_vcf_scaf_pos.txt", header=FALSE)
length(unique(sp[,1])) #3967 scaffolds
sort(table(sp[,1]), decreasing=TRUE)[1] #scaffold with the most SNPs (15112616, with 199)
```
**We have 3967 scaffolds; scaffold 15112616 has the most SNPs (199).**
I then set the filtering criteria for the next round of filtering as follows.
First, I calculated the maximum coverage cutoff (mean + 2 SD) in R:
```R
x = read.table("depth_filtered2X_mollies.txt", header=FALSE)
cov = mean(x$V1) + 2*(sd(x$V1)) #11690.59
```
**The maximum coverage cutoff for these data is 11690.59.**
```bash
my $maxCoverage = 11691; #maximum depth to avoid repeats, mean + 2sd = 11690.59
my $mind = 5; #don't call SNPs that are closer together than this (bp)
```
I checked how many SNPs would be retained if we changed the $mind filter: the number ranged from 46022 (with $mind = 10) to 57714 (with $mind = 2). I chose $mind = 5.
**Based on this, I retained 52900 SNPs for downstream analyses.**