Updated on Dec 7th 2018.
I am trying to plot coverage per SNP across males and females. I am using samtools depth command on bam files to do this. Here is the bash script I am running to get the output file (note the working directory):
#!/bin/bash
#SBATCH -n 12
#SBATCH -N 1
#SBATCH -t 300:00:00
#SBATCH -p usubio-kp
#SBATCH -A usubio-kp
#SBATCH -J samtools
cd /uufs/chpc.utah.edu/common/home/gompert-group1/data/lycaeides/lycaeides_dubois/Alignments/bamfiles
module load samtools
samtools depth *M.fastq.sorted.bam > males_depth.txt
samtools depth *F.fastq.sorted.bam > females_depth.txt
After I created the files, I did the following to plot coverage across individuals for each SNP on the Z (1631) chromosome.
#check the total number of SNPs in the files
wc -l females_depth.txt #67586022
wc -l males_depth.txt #81760391
#grab the SNPs on Scaffold 1631 in both files above
grep ^Scaffold_1631 males_depth.txt > scaf1631_males_depth.txt
wc -l z_males_depth.txt #7843564
grep ^Scaffold_1631 females_depth.txt > scaf1631_females_depth.txt
wc -l scaf1631_females_depth.txt #5502804
######################################################################################################################################
Determining mean coverage across scaffolds
awk -F, '{print $1,$3,$2,$4}' OFS=, lyc_v2.csv > lyc_v2.csv
I have so far created a temp folder in the bamfiles directory to test for renaming files. The script works. I need to reorder the columns.
Folder where this is done: /uufs/chpc.utah.edu/common/home/gompert-group1/data/lycaeides/lycaeides_dubois/Alignments/bamfiles
Files needed for this are bam files (input to samtools) and .ids files.
Number of males = 479
Number of females = 352
Unknown = 4
Mapped scripts across males: 478640161
Mapped scripts across females: 354286447
1. I used samtools idxstats to create ids files for each bamfile. This command takes bam file as an input and then writes out a file in which number of mapped reads are listed in a column.
Usage for running across files: for f in mem*.sorted.bam; do samtools idxstats $f >> ./$f.ids; done
Number of lines or scaffolds for each file should be 1652.
Here is what ouput file looks like:
Scaffold Sequence length Number of mapped reads Number of unmapped reads
Scaffold_1;HRSCAF=6 969 0 0
Scaffold_2;HRSCAF=10 67112 7 0
Scaffold_3;HRSCAF=25 63677 7 0
Scaffold_4;HRSCAF=30 15840280 6005 0
Scaffold_5;HRSCAF=41 51524 9 0
Scaffold_6;HRSCAF=43 22424 47 0
Once, I had all these files I could use the information on number of mapped reads to calculate numbers for males and females.
2. Create master file for males and females.
I created one master file for males and females which combined all the ids files for each individual male and female respectively.
I did this by:
cat mem*F*.ids > outFemales.ids #581504 (Number of female ind. (352) * 1652).
cat mem*M*.ids > outMales.ids #791308 (Number of male individuals (479) * 1652).
3. Used R to get mean number of reads across individuals for each scaffold and then plot:
library(plyr)
males = read.table("outMales.ids", header = FALSE)
females = read.table("outFemales.ids", header = FALSE)
library(plyr)
m.scafs = ddply(males, "males[,1]", numcolwise(mean))
f.scafs = ddply(females, "females[,1]", numcolwise(mean))
plot(m.scafs$V3,f.scafs$V3, xlab= "males", ylab = "females")
dev.copy(pdf, 'sexchromosomes_mappedreads.pdf')
dev.off()
#sort the combined file to find out odd scaffolds for possible W chrom.
mf = cbind(m.scafs, f.scafs)
mf.sort = mf[order(mf[,3]),]
write.table(mf.sort, file = "mf_averages_mappedreads.txt")
Possible Z chromosome: row 705: Scaffold_1631;HRSCAF=2129 21870511 55486.95 0
Possible W chromosome: Scaffold_1353;HRSCAF=1677 1259 0 0 Scaffold_1353;HRSCAF=1677 1259 0.00852272727272727 0