Searching/

Fetching LD SNPs

outlook email title: searching LD SNPs

June 20th 2021- email from CH

Files stored: /space/chen-syn01/1/data/cinliu/data/clozapine/data_April2021/LD_proxy/

Instructions and explanation: /space/chen-syn01/1/data/cinliu/data/clozapine/data_April2021/LD_proxy/code_and_instructions

Stored according to R2 cut off:

/space/chen-syn01/1/data/cinliu/data/clozapine/data_April2021/LD_proxy/R09

/space/chen-syn01/1/data/cinliu/data/clozapine/data_April2021/LD_proxy/R08

README file

syn01/1/data/cinliu/data/clozapine/data_April2021

Ambiguous SNPs:

6 ambiguous SNPs were found in All_CLZ_samples_snplist1.raw

80 ambiguous SNPs were found in All_CLZ_samples_snplist2.raw

21 ambiguous SNPs were found in All_TOP_samples_snplist1.raw

81 ambiguous SNPs were found in All_TOP_samples_snplist2.raw

Note: the ambiguous SNPs found in the CLZ data were all included in the ambiguous SNPs found TOP data.

After the ambiguous SNPs listed above are removed -

When comparing the CLZ data with the TOP data:

488 SNPs from snplist1 had a different counted allele.

5 SNPs from snplist2 had a different counted allele.

986 SNPs from All_TOP_samples_snplist1.raw could not find SNP name matches in All_CLZ_samples_snplist1.raw (this makes sense because the TOP data set has 986 more SNPs than the CLZ data set).

1 SNP from All_TOP_samples_snplist2.raw could not find SNP name matches in All_CLZ_samples_snplist2.raw (this makes sense because the TOP data set has 1 more SNP than the CLZ data set).

Details of the exact checking process/code can be found here:

https://ucsdhs-my.sharepoint.com/:f:/g/personal/cil001_health_ucsd_edu/Ej6wsITU2I9Oi0KHjLTeU6gBZniUJoQpFJIrwOXBp2EKbQ?e=0Gqe8c

R code that generates new .raw files that changed the counted allele of TOP data to match CLZ data are located on the server:

/space/chen-syn01/1/data/cinliu/data/clozapine/data_April2021/counted_allele_correction

An explanation of what's happening - from Nini

Hello there! If you're reading this, and you were not involved in the project at the time it was happening, I can imagine at this point you are probably a bit confused just reading through the log record of what happened. So let me try to explain to you in plain words what is happening.

We have these 4 files named :

All_CLZ_samples_snplist1_a1updated.raw
All_CLZ_samples_snplist2_a1updated.raw
All_TOP_samples_snplist1.raw
All_TOP_samples_snplist2.raw

Each file looks a little something like this:

Each of these raw files were generated from plink (see https://ucsdhs-my.sharepoint.com/:p:/g/personal/cil001_health_ucsd_edu/EQqeRJc5m5NNtl7B-SMwN3UBxxeoSBHCjBMnjLWFfoxKUA?e=xIEBVo for powerpoint expalination).

But basically these are genetics data. The FID is Family ID, IID is individual ID. So each row is a person.

The PAT, MAT, SEX, PHENOTYPE columns are other info taken about the person in this particular case we don't really worry about them in this project so just igore them for now.

Now see the column 'X1.1733219.A.G_A' and all the columns after it all look something similar to it. These are actual data of that person's DNA, SNP data.

Let me break it down, here's how to read 'X1.1733219.A.G_A'
"X1" expalin we are in chromosome 1
"1733219" is the base pair position. You can think of this, the exact location of where within the chromosome.
- So we know we are in chromosome 1, but where within chromosome 1? This is what the base pair position explains, we are at location 1733219 within chromosome 1.

what the.raw files look like:

when looking for the SNP count calculation
- subtract 6 columns for basic info
- subtract the number of ambiguous SNPs accordingly

CLZ1 53686-6-6 = 53674

CLZ2 524-6-80 = 438

TOP1 54687-6-21= 54660

TOP2 526-6-81 = 439

53674 SNPs in CLZ snplist1

438 SNPs in CLZ snplist2

54660 SNPs in TOP snplist1

439 SNPs in TOP snplist2

We would like to take all the SNPs from the 4 list and input them into LDproxy. We want to comobine the LDproxy into 4 big lists in the end so we can have all the LDproxy SNP names in one .txt file

That file has been generated with a R2 > 0.8 cut off : /space/chen-syn01/1/data/cinliu/data/clozapine/data_April2021/LD_proxy/R08

That is the end goal of this project, 4 simple .txt files with SNP ID fetched from LDproxy - the SNPs we input into LDproxy will be from 4 files:

/space/chen-syn01/1/data/cinliu/data/clozapine/data_April2021/All_CLZ_samples_snplist1_a1updated.raw
/space/chen-syn01/1/data/cinliu/data/clozapine/data_April2021/All_CLZ_samples_snplist2_a1updated.raw
/space/chen-syn01/1/data/cinliu/data/clozapine/data_April2021/All_TOP_samples_snplist1_updated.raw
/space/chen-syn01/1/data/cinliu/data/clozapine/data_April2021/All_TOP_samples_snplist2_updated.raw

Original instructions

Can you help with searching LD SNPs using this tool, LDproxy (https://ldlink.nci.nih.gov/?tab=apiaccess)? This tool has a command line feature. You can try its online graphic user interface (GUI) version first.

https://ldlink.nci.nih.gov/?tab=ldproxy (select CEU population, using the default 500000 window size). For example, if you search LD SNPs of rs429358, the tool will fetch several LD SNPs.

Eventually, we hope to have a script to fetch all LD SNPs of the TOP/Clozapine SNPs.

Test out the GUI

https://ldlink.nci.nih.gov/?tab=ldproxy

This is a tool where you can basically enter a gene name, select the population, and it will return a list of genes that are the proxy variant for that gene. We want to save the results of these genes.

Additional instructions

From Slack June 22nd

Here is an example script using the package, though the code is not exactly doing the same thing as I ask you to do, but similar

"/home/wed009/gwas_chromatin/LD_proxy/scripts/LD_friend_enigma_UKB.R"

June23rd - conversation in thread with CH

We will need CHR, BP, rs IDs, R^2 at least.

CHR and BP in hg19
We will filter out SNPs R^2 <0.2
- If too many, then we will tighten the threshold and remove R^2 <0.5
You can merge the files after removing the SNPs with R2 <0.2 and we 'll see the total number.
You don't need to keep note and you can simply remove duplicates

Summary of what's happening right now

You can merge the files after removing the SNPs with R2 <0.2 and we 'll see the total number.

Using it via command line

Run script on server and local computer. This is going to take a while... (all in all it took me about 1 week to finish running everything)

/space/chen-syn01/1/data/cinliu/data/clozapine/data_April2021/LD_proxy/

Summary of changes (between results round 1-2)

(expand to see details and links to files)

R changed from 0.2 to 0.5
Let's use CHR:POS rather than rs numbers.
Generate lists with just SNP IDs in the CHR:POS format
- Can you give me lists with only SNP IDs in the CHR:POS format (ie, the 2nd column without "chr" prefix)? Please remove other columns. Then a text file format rather than tsv should be sufficient. You have helped make these SNP ID files for TOP and CLZ data before.

Results (round1)

Hello Chi Hua,

Thank you for waiting! The final results are here:

fetch_LD_proxy

I will also attach direct links to the files here:

CLZ snplist1 with R >0.2 (1769733 SNPs found): All_CLZ_samples_snplist1_LDproxy_r02.tsv
CLZ snplist1 with R > 0.5 (1000272 SNPs found): All_CLZ_samples_snplist1_LDproxy_r05.tsv
CLZ snplist2 with R > 0.2 (25131 SNPs found):All_CLZ_samples_snplist2_a1updated_LDproxy_FULL_List.tsv

TOP snplist1 with R > 0.2 (176932 SNPs found): All_TOP_samples_snplist1_LDproxy_r02.tsv
TOP snplist1 with R >0.5 (1000267 SNPs found): All_TOP_samples_snplist1_LDproxy_r05.tsv
TOP snplist2 with R >0.2 (26204 SNPs found):All_TOP_samples_snplist2_LDproxy_FULL_List.tsv

Everything else in the folder is there for record keeping.

I also made a folder that contains only the scripts used to run the files if it needs to be reproduced in the future: scripts_only

Feedback for round1 results (from CH)

Let's go ahead to have R>0.5 list then.

Can you make snplist2 also with the threshold of R >0.5?

Can you give me lists with SNP IDs in the CHR:POS format (ie, the 2nd column without "chr" prefix)?

Results (round2)

Files are now in CHR:POS format, R >0.5

New files and code are here:

R05_CHR_POS_format

Question: is CHR:POS format preferred over RS number? Because I did find that if we check duplication by CHR:POS (previously duplication was just checked by RS_number), an additional 104 SNPs are duplicated in CLZ and TOP snplist1. (Snplist2 remains the same for both CLZ and TOP).

A rather small amount, but in case you are preferred having those 104 CHR:POS duplications removed, I have made a removed version in the RS_name_and_CHR_POS_duplicate_removed folder.

If not, we just want to remove the SNPs by RS_number duplications only, which can be found in the RS_names_double_removed_only folder.

Feedback (round2)

Let's use CHR:POS rather than rs numbers.

Final results (round3)

The .txt files for the SNP IDs have been generated here: SNP_ID_List

Code remain in the same location, just updated: 3_remove_duplicated.R

3_remove_duplicated.R

rm(list=ls())

savedir = "/Users/nini/Desktop/2021lab/CH/LDproxy_fetch/" #UPDATE location of save loaction

clz1 <- "~/Desktop/2021lab/CH/LDproxy_fetch/All_CLZ_samples_snplist1_a1updated_LDproxy/clzsnplist1_all_server_and_local.tsv"

clz2 <- "/Users/nini/Desktop/2021lab/CH/LDproxy_fetch/All_CLZ_samples_snplist2_a1updated_LDproxy_FULL_List.tsv"

top1 <- "~/Desktop/2021lab/CH/LDproxy_fetch/All_TOP_samples_snplist1_LDproxy/topsnplist1_all_server_and_local.tsv"

top2 <- "/Users/nini/Desktop/2021lab/CH/LDproxy_fetch/All_TOP_samples_snplist2_LDproxy_FULL_List.tsv"

df_name <- c("All_CLZ_samples_snplist1_a1updated","All_CLZ_samples_snplist2_a1updated","All_TOP_samples_snplist1","All_TOP_samples_snplist2")

df_list <- c(clz1,clz2,top1,top2)

for (i in 1:4) {

df <- read.delim(df_list[i])

print(df_list[i])

df2 <- df[df$R2 >0.5,] #only keep those that have R >0.5

nodouble <- df2[!duplicated(df2$RS_Number), ] #remove duplicated RS_number

nodouble <- nodouble[!nodouble$R2 == "R2",] #remove the row that is column names

#print(nrow(nodouble))

nodouble <- nodouble[!duplicated(nodouble$Coord), ] #remove duplicated RS

nodouble$Coord <- gsub("chr", "", nodouble$Coord) #change to CHR:POS from

nodouble <- nodouble[c(2,1,3,4,5,6,7,8,9,10)] #reorder so CHR:POS is the first column

write.table(nodouble,file = paste0(savedir,df_name[i],"_LDproxy_r05.tsv"),row.name = F, col.names = T, quote = F,sep='\t')

#print(nrow(nodouble))

df3 <- nodouble[,c(1)] #only selecting the CHR:POS

write.table(df3,file = paste0(savedir,df_name[i],"_LDproxy_r05_SNPID_list.txt"),quote = FALSE,row.names = FALSE, col.names = "CHR:POS")

#print(length(df3))

}

/space/chen-syn01/1/data/cinliu/data/clozapine/data_April2021/LD_proxy/All_*name*_LDproxy/*name*_LDproxy_individual_files
- stores the individual files fetched from LDproxy

scp -r /Users/nini/Desktop/2021lab/from_onedrive/data/R05 cinliu@137.110.172.99:/space/chen-syn01/1/data/cinliu/data/clozapine/data_April2021/LD_proxy/

scp -r /Users/nini/Desktop/2021lab/from_onedrive/data/R09 cinliu@137.110.172.99:/space/chen-syn01/1/data/cinliu/data/clozapine/data_April2021/LD_proxy/

Page updated

Report abuse