Searching/
Fetching LD SNPs
outlook email title: searching LD SNPs
June 20th 2021- email from CH
outlook email title: searching LD SNPs
June 20th 2021- email from CH
Files stored: /space/chen-syn01/1/data/cinliu/data/clozapine/data_April2021/LD_proxy/
Instructions and explanation: /space/chen-syn01/1/data/cinliu/data/clozapine/data_April2021/LD_proxy/code_and_instructions
Stored according to R2 cut off:
/space/chen-syn01/1/data/cinliu/data/clozapine/data_April2021/LD_proxy/R09
/space/chen-syn01/1/data/cinliu/data/clozapine/data_April2021/LD_proxy/R08
README file
syn01/1/data/cinliu/data/clozapine/data_April2021
Ambiguous SNPs:
6 ambiguous SNPs were found in All_CLZ_samples_snplist1.raw
80 ambiguous SNPs were found in All_CLZ_samples_snplist2.raw
21 ambiguous SNPs were found in All_TOP_samples_snplist1.raw
81 ambiguous SNPs were found in All_TOP_samples_snplist2.raw
Note: the ambiguous SNPs found in the CLZ data were all included in the ambiguous SNPs found TOP data.
After the ambiguous SNPs listed above are removed -
When comparing the CLZ data with the TOP data:
488 SNPs from snplist1 had a different counted allele.
5 SNPs from snplist2 had a different counted allele.
986 SNPs from All_TOP_samples_snplist1.raw could not find SNP name matches in All_CLZ_samples_snplist1.raw (this makes sense because the TOP data set has 986 more SNPs than the CLZ data set).
1 SNP from All_TOP_samples_snplist2.raw could not find SNP name matches in All_CLZ_samples_snplist2.raw (this makes sense because the TOP data set has 1 more SNP than the CLZ data set).
Details of the exact checking process/code can be found here:
R code that generates new .raw files that changed the counted allele of TOP data to match CLZ data are located on the server:
/space/chen-syn01/1/data/cinliu/data/clozapine/data_April2021/counted_allele_correction
Hello there! If you're reading this, and you were not involved in the project at the time it was happening, I can imagine at this point you are probably a bit confused just reading through the log record of what happened. So let me try to explain to you in plain words what is happening.
We have these 4 files named :
All_CLZ_samples_snplist1_a1updated.raw
All_CLZ_samples_snplist2_a1updated.raw
All_TOP_samples_snplist1.raw
All_TOP_samples_snplist2.raw
Each file looks a little something like this:
Each of these raw files were generated from plink (see https://ucsdhs-my.sharepoint.com/:p:/g/personal/cil001_health_ucsd_edu/EQqeRJc5m5NNtl7B-SMwN3UBxxeoSBHCjBMnjLWFfoxKUA?e=xIEBVo for powerpoint expalination).
But basically these are genetics data. The FID is Family ID, IID is individual ID. So each row is a person.
The PAT, MAT, SEX, PHENOTYPE columns are other info taken about the person in this particular case we don't really worry about them in this project so just igore them for now.
Now see the column 'X1.1733219.A.G_A' and all the columns after it all look something similar to it. These are actual data of that person's DNA, SNP data.
Let me break it down, here's how to read 'X1.1733219.A.G_A'
"X1" expalin we are in chromosome 1
"1733219" is the base pair position. You can think of this, the exact location of where within the chromosome.
So we know we are in chromosome 1, but where within chromosome 1? This is what the base pair position explains, we are at location 1733219 within chromosome 1.
what the.raw files look like:
when looking for the SNP count calculation
subtract 6 columns for basic info
subtract the number of ambiguous SNPs accordingly
CLZ1 53686-6-6 = 53674
CLZ2 524-6-80 = 438
TOP1 54687-6-21= 54660
TOP2 526-6-81 = 439
53674 SNPs in CLZ snplist1
438 SNPs in CLZ snplist2
54660 SNPs in TOP snplist1
439 SNPs in TOP snplist2
We would like to take all the SNPs from the 4 list and input them into LDproxy. We want to comobine the LDproxy into 4 big lists in the end so we can have all the LDproxy SNP names in one .txt file
That file has been generated with a R2 > 0.8 cut off : /space/chen-syn01/1/data/cinliu/data/clozapine/data_April2021/LD_proxy/R08
That is the end goal of this project, 4 simple .txt files with SNP ID fetched from LDproxy - the SNPs we input into LDproxy will be from 4 files:
/space/chen-syn01/1/data/cinliu/data/clozapine/data_April2021/All_CLZ_samples_snplist1_a1updated.raw
/space/chen-syn01/1/data/cinliu/data/clozapine/data_April2021/All_CLZ_samples_snplist2_a1updated.raw
/space/chen-syn01/1/data/cinliu/data/clozapine/data_April2021/All_TOP_samples_snplist1_updated.raw
/space/chen-syn01/1/data/cinliu/data/clozapine/data_April2021/All_TOP_samples_snplist2_updated.raw
Can you help with searching LD SNPs using this tool, LDproxy (https://ldlink.nci.nih.gov/?tab=apiaccess)? This tool has a command line feature. You can try its online graphic user interface (GUI) version first.
https://ldlink.nci.nih.gov/?tab=ldproxy (select CEU population, using the default 500000 window size). For example, if you search LD SNPs of rs429358, the tool will fetch several LD SNPs.
Eventually, we hope to have a script to fetch all LD SNPs of the TOP/Clozapine SNPs.
https://ldlink.nci.nih.gov/?tab=ldproxy
This is a tool where you can basically enter a gene name, select the population, and it will return a list of genes that are the proxy variant for that gene. We want to save the results of these genes.
From Slack June 22nd
Here is an example script using the package, though the code is not exactly doing the same thing as I ask you to do, but similar
"/home/wed009/gwas_chromatin/LD_proxy/scripts/LD_friend_enigma_UKB.R"
June23rd - conversation in thread with CH
We will need CHR, BP, rs IDs, R^2 at least.
CHR and BP in hg19
We will filter out SNPs R^2 <0.2
If too many, then we will tighten the threshold and remove R^2 <0.5
You can merge the files after removing the SNPs with R2 <0.2 and we 'll see the total number.
You don't need to keep note and you can simply remove duplicates
Summary of what's happening right now
You can merge the files after removing the SNPs with R2 <0.2 and we 'll see the total number.
Run script on server and local computer. This is going to take a while... (all in all it took me about 1 week to finish running everything)
/space/chen-syn01/1/data/cinliu/data/clozapine/data_April2021/LD_proxy/
(expand to see details and links to files)
R changed from 0.2 to 0.5
Let's use CHR:POS rather than rs numbers.
Generate lists with just SNP IDs in the CHR:POS format
Can you give me lists with only SNP IDs in the CHR:POS format (ie, the 2nd column without "chr" prefix)? Please remove other columns. Then a text file format rather than tsv should be sufficient. You have helped make these SNP ID files for TOP and CLZ data before.
Hello Chi Hua,
Thank you for waiting! The final results are here:
I will also attach direct links to the files here:
CLZ snplist1 with R >0.2 (1769733 SNPs found): All_CLZ_samples_snplist1_LDproxy_r02.tsv
CLZ snplist1 with R > 0.5 (1000272 SNPs found): All_CLZ_samples_snplist1_LDproxy_r05.tsv
CLZ snplist2 with R > 0.2 (25131 SNPs found):All_CLZ_samples_snplist2_a1updated_LDproxy_FULL_List.tsv
TOP snplist1 with R > 0.2 (176932 SNPs found): All_TOP_samples_snplist1_LDproxy_r02.tsv
TOP snplist1 with R >0.5 (1000267 SNPs found): All_TOP_samples_snplist1_LDproxy_r05.tsv
TOP snplist2 with R >0.2 (26204 SNPs found):All_TOP_samples_snplist2_LDproxy_FULL_List.tsv
Everything else in the folder is there for record keeping.
I also made a folder that contains only the scripts used to run the files if it needs to be reproduced in the future: scripts_only
Let's go ahead to have R>0.5 list then.
Can you make snplist2 also with the threshold of R >0.5?
Can you give me lists with SNP IDs in the CHR:POS format (ie, the 2nd column without "chr" prefix)?
Files are now in CHR:POS format, R >0.5
New files and code are here:
Question: is CHR:POS format preferred over RS number? Because I did find that if we check duplication by CHR:POS (previously duplication was just checked by RS_number), an additional 104 SNPs are duplicated in CLZ and TOP snplist1. (Snplist2 remains the same for both CLZ and TOP).
A rather small amount, but in case you are preferred having those 104 CHR:POS duplications removed, I have made a removed version in the RS_name_and_CHR_POS_duplicate_removed folder.
If not, we just want to remove the SNPs by RS_number duplications only, which can be found in the RS_names_double_removed_only folder.
Let's use CHR:POS rather than rs numbers.
The .txt files for the SNP IDs have been generated here: SNP_ID_List
Code remain in the same location, just updated: 3_remove_duplicated.R
rm(list=ls())
savedir = "/Users/nini/Desktop/2021lab/CH/LDproxy_fetch/" #UPDATE location of save loaction
clz1 <- "~/Desktop/2021lab/CH/LDproxy_fetch/All_CLZ_samples_snplist1_a1updated_LDproxy/clzsnplist1_all_server_and_local.tsv"
clz2 <- "/Users/nini/Desktop/2021lab/CH/LDproxy_fetch/All_CLZ_samples_snplist2_a1updated_LDproxy_FULL_List.tsv"
top1 <- "~/Desktop/2021lab/CH/LDproxy_fetch/All_TOP_samples_snplist1_LDproxy/topsnplist1_all_server_and_local.tsv"
top2 <- "/Users/nini/Desktop/2021lab/CH/LDproxy_fetch/All_TOP_samples_snplist2_LDproxy_FULL_List.tsv"
df_name <- c("All_CLZ_samples_snplist1_a1updated","All_CLZ_samples_snplist2_a1updated","All_TOP_samples_snplist1","All_TOP_samples_snplist2")
df_list <- c(clz1,clz2,top1,top2)
for (i in 1:4) {
df <- read.delim(df_list[i])
print(df_list[i])
df2 <- df[df$R2 >0.5,] #only keep those that have R >0.5
nodouble <- df2[!duplicated(df2$RS_Number), ] #remove duplicated RS_number
nodouble <- nodouble[!nodouble$R2 == "R2",] #remove the row that is column names
#print(nrow(nodouble))
nodouble <- nodouble[!duplicated(nodouble$Coord), ] #remove duplicated RS
nodouble$Coord <- gsub("chr", "", nodouble$Coord) #change to CHR:POS from
nodouble <- nodouble[c(2,1,3,4,5,6,7,8,9,10)] #reorder so CHR:POS is the first column
write.table(nodouble,file = paste0(savedir,df_name[i],"_LDproxy_r05.tsv"),row.name = F, col.names = T, quote = F,sep='\t')
#print(nrow(nodouble))
df3 <- nodouble[,c(1)] #only selecting the CHR:POS
write.table(df3,file = paste0(savedir,df_name[i],"_LDproxy_r05_SNPID_list.txt"),quote = FALSE,row.names = FALSE, col.names = "CHR:POS")
#print(length(df3))
}
/space/chen-syn01/1/data/cinliu/data/clozapine/data_April2021/LD_proxy/All_*name*_LDproxy/*name*_LDproxy_individual_files
stores the individual files fetched from LDproxy
scp -r /Users/nini/Desktop/2021lab/from_onedrive/data/R05 cinliu@137.110.172.99:/space/chen-syn01/1/data/cinliu/data/clozapine/data_April2021/LD_proxy/
scp -r /Users/nini/Desktop/2021lab/from_onedrive/data/R09 cinliu@137.110.172.99:/space/chen-syn01/1/data/cinliu/data/clozapine/data_April2021/LD_proxy/