This is a pipeline do SNP annotations for Timema. Directory on cluster for files:/uufs/chpc.utah.edu/common/home/u6007910/projects/timema_adaptation/analyses/annotation/inputfiles/pcs_allsnps_annot
1. The input files for the annotation are these three files created for the three PCs I ran for baypass:
a. pc1_bin_out.csv
b. pc2_bin_out.csv
c. pc3_bin_out.csv
d. timema_cristinae_genomeannotation.txt = genome annotation file
2. I went into R and subset these files to just have the median bayesfactor columns in each species.
```R
#read in files
pc1<-read.csv("pc1_bin_out.csv", header=T, sep=",")
pc2<-read.csv("pc2_bin_out.csv", header=T, sep=",")
pc3<-read.csv("pc3_bin_out.csv", header=T, sep=",")
#subset columns
pc1sub<-pc1[,c(1:3,36:43)]
pc2sub<-pc2[,c(1:3,36:43)]
pc3sub<-pc3[,c(1:3,36:43)]
#write out files
write.table(pc1sub, "pc1_bin_medbf.csv", col.names=T, row.names=F, sep=",", quote=F)
write.table(pc2sub, "pc2_bin_medbf.csv", col.names=T, row.names=F, sep=",", quote=F)
write.table(pc3sub, "pc3_bin_medbf.csv", col.names=T, row.names=F, sep=",", quote=F)
```
3. Now I will get scaffold bin ranges to finally get the annotations I need. This will give me start and stop position for each bin which can be used for annotations. This will be done using the scipt timema_create_scafbin_ranges.py
```bash
python timema_create_scafbin_ranges.py pc1_bin_medbf.csv pc1_scafbin_ranges.out
python timema_create_scafbin_ranges.py pc2_bin_medbf.csv pc2_scafbin_ranges.out
python timema_create_scafbin_ranges.py pc3_bin_medbf.csv pc3_scafbin_ranges.out
```
4. Now I will annotate each scaffold bin based on the start and stop positions in the previous files created. This file will create all the annotations for SNPs which are in the list of the timema annotation file. However, it misses a few SNPs. I will fix that in next step. For this script, I manually changed the input and output file for each PC and created one annotation file per PC.
```bash
python timema_create_snp_annotations.py
#Each file has the following number of lines:
wc -l pc*_snp_annot*
# 9344 pc1_snp_annotation_table.txt
# 9344 pc2_snp_annotation_table.txt
# 9344 pc3_snp_annotation_table.txt
```
5. For some reason, the output files have missing values in the last three columns if there is no IPR info in the genome annotation file. I fixed this manually through awk in bash as follows:
```bash
awk 'BEGIN { FS = OFS = "\t" } { for(i=1; i<=NF; i++) if($i ~ /^ *$/) $i = "~~" }; 1' pc1_snp_annotation_table.txt > pc1_snp_annotation_table_edit.txt
awk 'BEGIN { FS = OFS = "\t" } { for(i=1; i<=NF; i++) if($i ~ /^ *$/) $i = "~~" }; 1' pc2_snp_annotation_table.txt > pc2_snp_annotation_table_edit.txt
awk 'BEGIN { FS = OFS = "\t" } { for(i=1; i<=NF; i++) if($i ~ /^ *$/) $i = "~~" }; 1' pc3_snp_annotation_table.txt > pc3_snp_annotation_table_edit.txt
```
6. Since a few SNPs were deleted from the annotation file, I will do another step of merging which helps me to create the final file which will have all the SNPs. For this I used the script timema_combine_snps_annot.py. Before that I again created the final file in R.
```R
pc1<-read.csv("pc1_bin_medbf.csv", header=T, sep=",")
pc2<-read.csv("pc2_bin_medbf.csv", header=T, sep=",")
pc3<-read.csv("pc3_bin_medbf.csv", header=T, sep=",")
pc1ranges<-read.csv("pc1_scafbin_ranges.out", header=T, sep=",")
pc2ranges<-read.csv("pc2_scafbin_ranges.out", header=T, sep=",")
pc3ranges<-read.csv("pc3_scafbin_ranges.out", header=T, sep=",")
pc1com<-cbind(pc1[,c(1:3)],pc1ranges[2:3],pc1[,c(4:11)])
pc2com<-cbind(pc2[,c(1:3)],pc2ranges[2:3],pc2[,c(4:11)])
pc3com<-cbind(pc3[,c(1:3)],pc3ranges[2:3],pc3[,c(4:11)])
write.table(pc1com, file="pc1_bin_medbf_ranges.csv", col.names=T, row.names=F, sep=",", quote=F)
write.table(pc2com, file="pc2_bin_medbf_ranges.csv", col.names=T, row.names=F, sep=",", quote=F)
write.table(pc3com, file="pc3_bin_medbf_ranges.csv", col.names=T, row.names=F, sep=",", quote=F)
```
7. Now I will combine this file with whatever annotations I have from step 4 and 5. The bins which do not have any annotations will get "missing". I will manually replace them with "NA". For this I used the script timema_combine_snps_annot.py
```bash
python timema_combine_snps_annot.py pc1_bin_medbf_ranges.csv pc1_snp_annotation_table_edit.txt final_pc1_snp_annottable.csv
python timema_combine_snps_annot.py pc2_bin_medbf_ranges.csv pc2_snp_annotation_table_edit.txt final_pc2_snp_annottable.csv
python timema_combine_snps_annot.py pc3_bin_medbf_ranges.csv pc3_snp_annotation_table_edit.txt final_pc3_snp_annottable.csv
```
I again edited these files to replace missing with NA:
```bash
perl -p -i -e 's/missing/NA/g' final_pc1_snp_annottable.csv
perl -p -i -e 's/missing/NA/g' final_pc2_snp_annottable.csv
perl -p -i -e 's/missing/NA/g' final_pc3_snp_annottable.csv
```