Convert file from plink to raw

April 21st 2021 - CH Slack

path of this project on server: /space/chen-syn01/1/data/cinliu/data/clozapine/data_April2021/

Details: /space/chen-syn01/1/data/cinliu/data/clozapine/data_April2021/1.plink2raw.sh

Original instructions

Could you help convert our newly arrived TOP and CLZ data from plink to raw data format by modifying this script?

"/space/chen-syn01/1/data/chen/clozapine/data_April2021/1.plink2raw.sh"

There are four Plink datasets.

TOP_snplist1
TOP_snplist2
CLZ_snplist1
CLZ_snplist2

in this folder "/space/chen-syn01/1/data/chen/clozapine/data_April2021/"

Could you also check whether TOP and CLZ samples use the same A1?

53674 SNPs in CLZ snplist1

438 SNPs in CLZ snplist2

54681 SNPs in TOP snplist1

54680

520 SNPs in TOP snplist2
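
The SNP counts above can be re-derived from the PLINK filesets, because a .bim file holds one variant per line. A minimal sketch, assuming each dataset's .bim file sits next to its .bed and .fam files; `count_snps` is a hypothetical helper name and the three-variant demo fileset is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: count the variants in a PLINK binary fileset.
# A .bim file has one line per variant, so the line count is the SNP count.
count_snps() {
    local prefix=$1
    if [ ! -f "${prefix}.bim" ]; then
        echo "missing ${prefix}.bim" >&2
        return 1
    fi
    # Count non-empty lines only, so a trailing blank line is ignored.
    awk 'NF > 0 { n++ } END { print n + 0 }' "${prefix}.bim"
}

# Demo on a made-up three-variant fileset (not the real data).
demo_dir=$(mktemp -d)
printf '%s\n' \
    '1 1:100:A:G 0 100 A G' \
    '1 1:200:C:T 0 200 C T' \
    '2 2:300:A:T 0 300 A T' > "${demo_dir}/demo.bim"
count_snps "${demo_dir}/demo"   # prints 3
```

On the server, a call such as `count_snps "${path}/All_CLZ_samples_snplist1"` would be expected to print 53674 if the count listed above is right.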

Outputs

saved location on server: syn01/1/data/cinliu/data/clozapine/data_April2021

README.docx

Data summary of the data from:

syn01/1/data/cinliu/data/clozapine/data_April2021

Ambiguous SNPs:

6 ambiguous SNPs were found in All_CLZ_samples_snplist1.raw

80 ambiguous SNPs were found in All_CLZ_samples_snplist2.raw

21 ambiguous SNPs were found in All_TOP_samples_snplist1.raw

81 ambiguous SNPs were found in All_TOP_samples_snplist2.raw

Note: the ambiguous SNPs found in the CLZ data were all included in the ambiguous SNPs found in the TOP data.
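
A SNP is strand-ambiguous when its two alleles are complementary (A/T or C/G), so the strand cannot be inferred from the alleles alone. The check can be sketched directly on the header of a .raw file. This is only an illustration, not the R code used for the counts above: `list_ambiguous` is a hypothetical helper, and it assumes the column names have the form chr:pos:A1:A2_counted (R's read.table rewrites the separators to dots, which the sketch also accepts). Apart from 11:121335728, the SNP names in the demo are made up.

```shell
#!/bin/bash
# Hypothetical helper: list strand-ambiguous SNP columns in a PLINK .raw header.
# Assumes names like chr:pos:A1:A2_counted (or chr.pos.A1.A2_counted);
# columns that do not split into exactly five parts (FID, IID, ...) are skipped.
list_ambiguous() {
    awk 'NR == 1 {
        for (i = 1; i <= NF; i++) {
            n = split($i, part, "[._:]")
            if (n != 5) continue
            pair = toupper(part[3] part[4])
            if (pair == "AT" || pair == "TA" || pair == "CG" || pair == "GC")
                print $i
        }
        exit
    }' "$1"
}

# Demo header with two ambiguous SNPs (A/T and C/G) and two unambiguous ones.
demo_raw=$(mktemp)
printf '%s\n' \
    'FID IID PAT MAT SEX PHENOTYPE 11:121335728:A:G_G 1:500:A:T_A 2:600:C:G_G 3:700:C:T_T' \
    'f1 i1 0 0 1 -9 0 1 2 NA' > "$demo_raw"
list_ambiguous "$demo_raw"
# prints:
# 1:500:A:T_A
# 2:600:C:G_G
```

Piping the output through `wc -l` gives a count that can be compared with the numbers listed above.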

After the ambiguous SNPs listed above are removed, comparing the CLZ data with the TOP data shows:

488 SNPs from snplist1 had a different counted allele.

5 SNPs from snplist2 had a different counted allele.

986 SNPs in All_TOP_samples_snplist1.raw had no matching SNP name in All_CLZ_samples_snplist1.raw (expected, because the TOP data set has 986 more SNPs than the CLZ data set).

1 SNP in All_TOP_samples_snplist2.raw had no matching SNP name in All_CLZ_samples_snplist2.raw (expected, because the TOP data set has 1 more SNP than the CLZ data set).
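
The comparison above comes down to matching column names with and without the trailing "_counted allele" suffix. A rough sketch of that idea on the two header lines, assuming the first six columns are the usual FID, IID, PAT, MAT, SEX and PHENOTYPE; `compare_counted` and its DIFF / TOP_ONLY labels are hypothetical, and the demo SNPs other than 11:121335728 are made up.

```shell
#!/bin/bash
# Hypothetical helper: compare the headers of a CLZ and a TOP .raw file.
#   DIFF <clz_name> <top_name>  same SNP, different counted allele
#   TOP_ONLY <top_name>         SNP present in TOP but not in CLZ
compare_counted() {
    { head -n 1 "$1"; head -n 1 "$2"; } | awk '
        {
            for (i = 7; i <= NF; i++) {          # skip FID ... PHENOTYPE
                base = $i
                sub(/_[^_]*$/, "", base)         # strip the "_<counted allele>" suffix
                if (NR == 1) clz[base] = $i      # line 1: CLZ header
                else if (!(base in clz)) print "TOP_ONLY", $i
                else if (clz[base] != $i) print "DIFF", clz[base], $i
            }
        }'
}

# Demo: one SNP with a different counted allele, one shared, one TOP-only.
demo_clz=$(mktemp)
demo_top=$(mktemp)
echo 'FID IID PAT MAT SEX PHENOTYPE 11:121335728:A:G_G 5:10:C:T_T' > "$demo_clz"
echo 'FID IID PAT MAT SEX PHENOTYPE 11:121335728:A:G_A 5:10:C:T_T 7:20:G:A_A' > "$demo_top"
compare_counted "$demo_clz" "$demo_top"
# prints:
# DIFF 11:121335728:A:G_G 11:121335728:A:G_A
# TOP_ONLY 7:20:G:A_A
```

Counting the DIFF and TOP_ONLY lines (for example with `grep -c '^DIFF'`) on the ambiguity-filtered files should reproduce the figures above if the README numbers hold.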

Details of the exact checking process/code can be found here:

https://ucsdhs-my.sharepoint.com/:f:/g/personal/cil001_health_ucsd_edu/Ej6wsITU2I9Oi0KHjLTeU6gBZniUJoQpFJIrwOXBp2EKbQ?e=0Gqe8c

The R code that generates the new .raw files, in which the counted allele of the CLZ data is changed to match the TOP data, is located on the server:

/space/chen-syn01/1/data/cinliu/data/clozapine/data_April2021/counted_allele_correction

Output

1.plink2raw.sh

#!/bin/bash

path=/space/chen-syn01/1/data/chen/clozapine/data_April2021

outpath=/space/chen-syn01/1/data/cinliu/data/clozapine/data_April2021

/home/mtlo/plink1.9/plink --bfile ${path}/All_CLZ_samples_snplist1 --recode A --out ${outpath}/All_CLZ_samples_snplist1

/home/mtlo/plink1.9/plink --bfile ${path}/All_CLZ_samples_snplist2 --recode A --out ${outpath}/All_CLZ_samples_snplist2

/home/mtlo/plink1.9/plink --bfile ${path}/All_TOP_samples_snplist1 --recode A --out ${outpath}/All_TOP_samples_snplist1

/home/mtlo/plink1.9/plink --bfile ${path}/All_TOP_samples_snplist2 --recode A --out ${outpath}/All_TOP_samples_snplist2

# then output file, ALL_TOP_samples.raw, can be read in R using the line below

#raw<-read.table("/space/chen-syn01/1/data/chen/clozapine/ALL_TOP_samples.raw",h=T)
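
The four conversions in the script differ only in the dataset prefix, so they could also be written as a loop. The sketch below is one possible variant, not the script that was actually run: `convert_all` is a hypothetical function, the check for the .bed/.bim/.fam files is an added safeguard, and the `PLINK` variable is introduced only so the binary can be swapped (for example `PLINK=echo` for a dry run).

```shell
#!/bin/bash
# Sketch: the same four PLINK conversions as a loop, skipping incomplete filesets.
# PLINK defaults to the binary used in 1.plink2raw.sh and can be overridden.
PLINK=${PLINK:-/home/mtlo/plink1.9/plink}

convert_all() {
    local in_dir=$1 out_dir=$2 prefix ext ok
    for prefix in All_CLZ_samples_snplist1 All_CLZ_samples_snplist2 \
                  All_TOP_samples_snplist1 All_TOP_samples_snplist2; do
        ok=1
        for ext in bed bim fam; do
            [ -f "${in_dir}/${prefix}.${ext}" ] || ok=0
        done
        if [ "$ok" -eq 0 ]; then
            echo "skipping ${prefix}: incomplete fileset" >&2
            continue
        fi
        # --recode A writes <out>.raw with one additive (0/1/2) column per SNP.
        "$PLINK" --bfile "${in_dir}/${prefix}" --recode A --out "${out_dir}/${prefix}"
    done
}
```

With the variables from the script above, the call would be `convert_all "$path" "$outpath"`.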

After the .raw files are generated, we check for ambiguous SNPs and check that the same A1 is being used.

https://ucsdhs-my.sharepoint.com/:f:/g/personal/cil001_health_ucsd_edu/EtlE1Q9D4S9Hl6RLohmrMr8B-gi1gXnIG0rLVkcTCFFygQ?e=fvbtUC

May 6th 2021

Ambiguous SNP check

https://ucsdhs-my.sharepoint.com/:f:/g/personal/cil001_health_ucsd_edu/Ej6wsITU2I9Oi0KHjLTeU6gBZniUJoQpFJIrwOXBp2EKbQ?e=GKIWDM

generate_a1_updated_raw.R

rm(list=ls())
library(stringr)

wd <- "/space/chen-syn01/1/data/cinliu/data/clozapine/data_April2021/"
setwd(wd)

#### AMBIGUOUS SNPS CHECK
#wd <- "/Users/nini/Desktop/2021lab/CH/data_April2021/"
#setwd(wd)

### ambiguous function
ambig_check <- function(dat) {
  ALL_Am <- vector()
  for (i in 1:nrow(dat)) {
    if (dat$A1[i] == "A" & dat$A2[i] == "T" |
        dat$A1[i] == "T" & dat$A2[i] == "A" |
        dat$A1[i] == "C" & dat$A2[i] == "G" |
        dat$A1[i] == "G" & dat$A2[i] == "C") {
      #print(paste0(dat$chr[i],":",dat$pos[i],":",dat$A1[i],":",dat$A2[i],"_",dat$counted_Allele[i]))
      ALL_Am[i] <- i
    }
  }
  ALL_Am <- ALL_Am[!is.na(ALL_Am)] #remove the NAs
  return(ALL_Am) #return the row number of the ambiguous SNP in the data
}

namelist <- c("All_CLZ_samples_snplist1","All_CLZ_samples_snplist2","All_TOP_samples_snplist1","All_TOP_samples_snplist2")

#Ambiguous SNP check
listofdfs <- list()
for (j in 1:length(namelist)) {
  df <- read.table(paste0(wd,namelist[j],".raw"),h=T)
  df_names <- colnames(df)
  df_names_split <- gsub("_","\\.",df_names)
  nsplit <- str_split(df_names_split, "\\.")
  new_df <- as.data.frame(do.call(rbind,nsplit))
  colnames(new_df) <- c("chr","pos","A1","A2","counted_Allele")
  ambig <- ambig_check(new_df) #will return the ambiguous SNPs in the dataframe
  df_ambig <- df_names[ambig]
  #write.csv(df_ambig,paste0(namelist[j],"_ambig_SNPs.csv"))
  df_noambig <- df[,-ambig] #remove ambiguous SNPs
  listofdfs[[j]] <- df_noambig # save your dataframes into the list
}
d1 <- listofdfs[[1]] #CLZ snplist1
d2 <- listofdfs[[2]] #CLZ snplist2
d3 <- listofdfs[[3]] #TOP snplist1
d4 <- listofdfs[[4]] #TOP snplist2
#### ambiguous SNPs removed

#########################################
###### SNPLIST 1 #######
#SNP name matching check
d1_names <- colnames(d1)
d3_names <- colnames(d3)
#The NAs will tell us the mismatch
summary(match(d1_names,d3_names)) #difference could be due to anything
summary(match(d3_names,d1_names)) #difference could be due to anything
print("Remove the counted allele to see if NA above is caused by the counted allele.")
summary(match(gsub('^(.*).$', '\\1', d1_names),gsub('^(.*).$', '\\1', d3_names))) #check if difference is caused by different counted allele
summary(match(gsub('^(.*).$', '\\1', d3_names),gsub('^(.*).$', '\\1', d1_names))) #check if difference is caused by different counted allele
print(paste("Note: Please make sure the difference of the NAs listed above equals",
            ncol(d3)-ncol(d1),
            "(snplist1 column difference). Otherwise the files produced might not be accurate and further steps of checking are required. File for more detailed check here: https://ucsdhs-my.sharepoint.com/:f:/g/personal/cil001_health_ucsd_edu/EkW_D2G5D_1FjNTExFE433oBZmw7U2uBlhXcdGFAsiLgGA?e=sOPgdB"))
##save output list of counted allele changed
## Details of the flip check can be found in https://ucsdhs-my.sharepoint.com/:f:/g/personal/cil001_health_ucsd_edu/EkW_D2G5D_1FjNTExFE433oBZmw7U2uBlhXcdGFAsiLgGA?e=yiop74
#snplist1_a1changed <- d1_names[is.na(match(d1_names,d3_names))]
#snplist1_a1changed <- as.data.frame(snplist1_a1changed)
#colnames(snplist1_a1changed) <- "from_CLZ_snplist1"
#write.csv(snplist1_a1changed,"snplist1_A1_changed.csv", row.names = FALSE)
##end
###cross check before flipping
head(d1$X11.121335728.A.G_G)
head(d3$X11.121335728.A.G_A)
#check the different counted allele is one of the other counted allele
a1 <- d1[is.na(match(d1_names,d3_names))]
a1_split <- gsub("_","\\.",colnames(a1))
nsplit <- str_split(a1_split, "\\.")
new_df <- as.data.frame(do.call(rbind,nsplit))
colnames(new_df) <- c("chr","pos","A1","A2","counted_Allele")
new_df$A1 <- as.character(new_df$A1)
new_df$A2 <- as.character(new_df$A2)
for (i in 1:nrow(new_df)) {
  if (new_df$counted_Allele[i] == new_df$A1[i]) {
    new_df$new_counted_A[i] <- new_df$A2[i]
  } else {
    new_df$new_counted_A[i] <- new_df$A1[i]
  }
}
new_names_d1 <- paste0(new_df$chr,".",new_df$pos,".",new_df$A1,".",new_df$A2,"_",new_df$new_counted_A)
b <- colnames(d1)
b[246]
b[is.na(match(d1_names,d3_names))==TRUE] <- new_names_d1
b[246]
colnames(d1) <- b
#Confirm that the names are the same after the allele switch in d1
summary(match(colnames(d1),colnames(d3))) #should have 0 NA
summary(match(colnames(d3),colnames(d1))) #should have 986 NA
#flip the count from 0,1,2 to 2,1,0 for the alleles that had counted alleles changed
d1[is.na(match(d1_names,d3_names))] <- 2-d1[is.na(match(d1_names,d3_names))]
#cross check after flip
head(d1$X11.121335728.A.G_A)
head(d3$X11.121335728.A.G_A)
#save new .raw file
write.table(d1, "All_CLZ_samples_snplist1_a1updated.raw", col.names = T)

#########################################
###### SNPLIST 2 #######
#Name matching check
d2_names <- colnames(d2)
d4_names <- colnames(d4)
#The NAs will tell us the mismatch
summary(match(d2_names,d4_names)) #difference could be due to anything
summary(match(d4_names,d2_names)) #difference could be due to anything
print("Remove the counted allele to see if NA above is caused by the counted allele.")
summary(match(gsub('^(.*).$', '\\1', d2_names),gsub('^(.*).$', '\\1', d4_names))) #check if difference is caused by different counted allele
summary(match(gsub('^(.*).$', '\\1', d4_names),gsub('^(.*).$', '\\1', d2_names))) #check if difference is caused by different counted allele
print(paste("Note: Please make sure the difference of the NAs listed above equals",
            ncol(d4)-ncol(d2),
            "(snplist2 column difference). Otherwise the files produced might not be accurate and further steps of checking are required. File for more detailed check here: https://ucsdhs-my.sharepoint.com/:f:/g/personal/cil001_health_ucsd_edu/EkW_D2G5D_1FjNTExFE433oBZmw7U2uBlhXcdGFAsiLgGA?e=sOPgdB"))
##save output list of counted allele changed
## Details of the flip check can be found in https://ucsdhs-my.sharepoint.com/:f:/g/personal/cil001_health_ucsd_edu/EkW_D2G5D_1FjNTExFE433oBZmw7U2uBlhXcdGFAsiLgGA?e=yiop74
#snplist2_a1changed <- d2_names[is.na(match(d2_names,d4_names))]
#snplist2_a1changed <- as.data.frame(snplist2_a1changed)
#colnames(snplist2_a1changed) <- "from_CLZ_snplist2"
#write.csv(snplist2_a1changed,"snplist2_A1_changed.csv", row.names = FALSE)
##end
###cross check before flipping
head(d2$X1.197591406.C.A_A)
head(d4$X1.197591406.C.A_C)
#check the different counted allele is one of the other counted allele
a1 <- d2[is.na(match(d2_names,d4_names))]
a1_split <- gsub("_","\\.",colnames(a1))
nsplit <- str_split(a1_split, "\\.")
new_df <- as.data.frame(do.call(rbind,nsplit))
colnames(new_df) <- c("chr","pos","A1","A2","counted_Allele")
new_df$A1 <- as.character(new_df$A1)
new_df$A2 <- as.character(new_df$A2)
for (i in 1:nrow(new_df)) {
  if (new_df$counted_Allele[i] == new_df$A1[i]) {
    new_df$new_counted_A[i] <- new_df$A2[i]
  } else {
    new_df$new_counted_A[i] <- new_df$A1[i]
  }
}
new_names_d2 <- paste0(new_df$chr,".",new_df$pos,".",new_df$A1,".",new_df$A2,"_",new_df$new_counted_A)
b <- colnames(d2)
b[43]
b[is.na(match(d2_names,d4_names))==TRUE] <- new_names_d2
b[43]
colnames(d2) <- b
#Confirm that the names are the same after the allele switch in d2
summary(match(colnames(d2),colnames(d4))) #should have 0 NA
summary(match(colnames(d4),colnames(d2))) #should have 1 NA
#flip count for the alleles that had counted alleles changed
d2[is.na(match(d2_names,d4_names))] <- 2-d2[is.na(match(d2_names,d4_names))]
#cross check after flip
head(d2$X1.197591406.C.A_C)
head(d4$X1.197591406.C.A_C)
#save new .raw file
write.table(d2, "All_CLZ_samples_snplist2_a1updated.raw", col.names = T)
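
The flip step in the R code relies on the additive coding of a .raw file: each genotype is the number of copies (0, 1 or 2) of the counted allele, so for a biallelic SNP, switching to the other allele turns a count x into 2 - x. A small worked sketch of that recoding on a single column; `flip_column` is a hypothetical helper, the four demo samples are made up, and missing genotypes are assumed to be written as NA.

```shell
#!/bin/bash
# Hypothetical helper: rename one genotype column of a .raw-style file and
# recode its values as 2 - x, leaving missing values (NA) untouched.
flip_column() {
    local file=$1 col=$2 new_name=$3
    awk -v c="$col" -v name="$new_name" '
        NR == 1              { $c = name }      # header: new counted allele
        NR > 1 && $c != "NA" { $c = 2 - $c }    # 0,1,2 -> 2,1,0
        { print }
    ' "$file"
}

# Demo: column 7 counts allele G; after the flip it counts allele A.
demo_flip=$(mktemp)
printf '%s\n' \
    'FID IID PAT MAT SEX PHENOTYPE 11:121335728:A:G_G' \
    'f1 i1 0 0 1 -9 0' \
    'f2 i2 0 0 2 -9 1' \
    'f3 i3 0 0 1 -9 2' \
    'f4 i4 0 0 2 -9 NA' > "$demo_flip"
flip_column "$demo_flip" 7 '11:121335728:A:G_A'
# prints:
# FID IID PAT MAT SEX PHENOTYPE 11:121335728:A:G_A
# f1 i1 0 0 1 -9 2
# f2 i2 0 0 2 -9 1
# f3 i3 0 0 1 -9 0
# f4 i4 0 0 2 -9 NA
```

Heterozygotes (1) stay unchanged, which is why comparing `head()` of the column before and after the flip, as the R code does, is a quick sanity check.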

May 8th 2021

Hello Chi Hua, new raw files for TOP and CLZ have been generated with their mismatched alleles flipped and ambiguous SNPs removed.

The new CLZ raw files: /space/chen-syn01/1/data/cinliu/data/clozapine/data_April2021/*_a1updated.raw

The code: /space/chen-syn01/1/data/cinliu/data/clozapine/data_April2021/counted_allele_correction

I also uploaded the code and README to the onedrive if it’s easier to access for you: https://ucsdhs-my.sharepoint.com/:f:/g/personal/cil001_health_ucsd_edu/Ej6wsITU2I9Oi0KHjLTeU6gBZniUJoQpFJIrwOXBp2EKbQ?e=e2IUT3

I’ve modified the code so that the R file “generate_a1updated_raw.R” and the README now contain all the information. Please let me know if any issues come up. Thank you for waiting for this one!
