Convert file from plink to raw
April 21st 2021 - CH Slack
path of this project on server: /space/chen-syn01/1/data/cinliu/data/clozapine/data_April2021/
Details: /space/chen-syn01/1/data/cinliu/data/clozapine/data_April2021/1.plink2raw.sh
Could you help convert our newly arrived TOP and CLZ data from plink to raw data format by modifying this script?
"/space/chen-syn01/1/data/chen/clozapine/data_April2021/1.plink2raw.sh"
There are four Plink datasets.
TOP_snplist1
TOP_snplist2
CLZ_snplist1
CLZ_snplist2
in this folder "/space/chen-syn01/1/data/chen/clozapine/data_April2021/"
Could you also check whether TOP and CLZ samples use the same A1?
53674 SNPs in CLZ snplist1
438 SNPs in CLZ snplist2
54681 SNPs in TOP snplist1
54680
520 SNPs in TOP snplist2
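A minimal sketch of how counts like the ones above could be reproduced. It assumes the standard PLINK layouts: one line per SNP in the .bim file, and six leading columns (FID IID PAT MAT SEX PHENOTYPE) before the SNP columns in a .raw header. The function names are hypothetical and are not part of 1.plink2raw.sh.

```shell
#!/bin/bash
# Count SNPs in a PLINK binary fileset: the .bim file has one line per SNP.
count_bim_snps() {
    wc -l < "$1.bim" | tr -d ' '
}

# Count SNP columns in a .raw file. The header is split into one token per
# line first, so very wide headers do not hit per-line field limits in awk.
count_raw_snps() {
    n=$(head -n 1 "$1" | tr -s ' \t' '\n' | grep -c .)
    echo $(( n - 6 ))   # drop FID IID PAT MAT SEX PHENOTYPE
}
```

For example, `count_bim_snps All_CLZ_samples_snplist1` takes the fileset prefix, while `count_raw_snps All_CLZ_samples_snplist1.raw` takes the full file name.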
saved location on server: syn01/1/data/cinliu/data/clozapine/data_April2021
Summary of the data in:
syn01/1/data/cinliu/data/clozapine/data_April2021
Ambiguous SNPs:
6 ambiguous SNPs were found in All_CLZ_samples_snplist1.raw
80 ambiguous SNPs were found in All_CLZ_samples_snplist2.raw
21 ambiguous SNPs were found in All_TOP_samples_snplist1.raw
81 ambiguous SNPs were found in All_TOP_samples_snplist2.raw
Note: the ambiguous SNPs found in the CLZ data were all included in the ambiguous SNPs found in the TOP data.
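These notes do not define "ambiguous" or show how the check was run, so the following is only a sketch under an assumption: that ambiguous means strand-ambiguous, i.e. the two alleles are A/T or C/G. It reads the alleles from columns 5 and 6 of a .bim file; the function name is hypothetical.

```shell
#!/bin/bash
# Print the IDs of SNPs whose allele pair is A/T or C/G (assumed definition
# of "ambiguous"). .bim columns: chr, SNP ID, cM, bp, allele 1, allele 2.
list_ambiguous_snps() {
    awk '{
        pair = toupper($5) toupper($6)
        if (pair == "AT" || pair == "TA" || pair == "CG" || pair == "GC")
            print $2
    }' "$1"
}
```

Running `list_ambiguous_snps All_TOP_samples_snplist2.bim | wc -l` would then give a count comparable to the ones listed above, if the assumption holds.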
After the ambiguous SNPs listed above were removed, comparing the CLZ data with the TOP data showed:
488 SNPs from snplist1 had a different counted allele.
5 SNPs from snplist2 had a different counted allele.
986 SNPs from All_TOP_samples_snplist1.raw had no SNP name match in All_CLZ_samples_snplist1.raw (this makes sense because the TOP data set has 986 more SNPs than the CLZ data set).
1 SNP from All_TOP_samples_snplist2.raw had no SNP name match in All_CLZ_samples_snplist2.raw (this makes sense because the TOP data set has 1 more SNP than the CLZ data set).
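The comparison above can be sketched from the .raw headers alone, since `--recode A` names each SNP column as SNP ID, underscore, counted allele. This is an illustrative shell version and not the R code actually used; the function names and the output labels are hypothetical.

```shell
#!/bin/bash
# Print "SNP<TAB>counted_allele" for every SNP column in a .raw header.
# The allele is taken as the text after the last underscore.
raw_header_alleles() {
    head -n 1 "$1" | tr -s ' \t' '\n' | grep . | tail -n +7 |
        awk '{ i = match($0, /_[^_]*$/)
               print substr($0, 1, i - 1) "\t" substr($0, i + 1) }'
}

# Compare TOP ($1) against CLZ ($2): report TOP SNPs that are absent from
# CLZ and SNPs present in both whose counted allele differs.
compare_counted_alleles() {
    {
        raw_header_alleles "$2" | awk '{ print "CLZ\t" $0 }'
        raw_header_alleles "$1" | awk '{ print "TOP\t" $0 }'
    } | awk -F'\t' '
        $1 == "CLZ"   { clz[$2] = $3; next }
        !($2 in clz)  { print $2 "\tmissing_in_CLZ"; next }
        clz[$2] != $3 { print $2 "\tdifferent_counted_allele" }
    '
}
```

Piping the result through `cut -f2 | sort | uniq -c` would summarize how many SNPs fall into each group.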
Details of the exact checking process/code can be found here:
The R code that generates new .raw files, with the counted allele of the TOP data changed to match the CLZ data, is located on the server:
/space/chen-syn01/1/data/cinliu/data/clozapine/data_April2021/counted_allele_correction
Output
1.plink2raw.sh
#!/bin/bash
# Convert the four PLINK binary filesets to additive-coded .raw files (--recode A).
# path: input PLINK filesets; outpath: where the .raw files are written.
path=/space/chen-syn01/1/data/chen/clozapine/data_April2021
outpath=/space/chen-syn01/1/data/cinliu/data/clozapine/data_April2021
/home/mtlo/plink1.9/plink --bfile "${path}/All_CLZ_samples_snplist1" --recode A --out "${outpath}/All_CLZ_samples_snplist1"
/home/mtlo/plink1.9/plink --bfile "${path}/All_CLZ_samples_snplist2" --recode A --out "${outpath}/All_CLZ_samples_snplist2"
/home/mtlo/plink1.9/plink --bfile "${path}/All_TOP_samples_snplist1" --recode A --out "${outpath}/All_TOP_samples_snplist1"
/home/mtlo/plink1.9/plink --bfile "${path}/All_TOP_samples_snplist2" --recode A --out "${outpath}/All_TOP_samples_snplist2"
# The output file, ALL_TOP_samples.raw, can then be read in R using the line below
#raw<-read.table("/space/chen-syn01/1/data/chen/clozapine/ALL_TOP_samples.raw",h=T)
After the .raw files are generated, we check for ambiguous SNPs and check that the same A1 is being used.
May 6th 2021
Ambiguous SNP check
generate_a1_updated_raw.R
May 8th
Hello Chi Hua, new raw files for TOP and CLZ have been generated with their mismatched alleles flipped and ambiguous SNPs removed.
The new CLZ raw files: /space/chen-syn01/1/data/cinliu/data/clozapine/data_April2021/*_a1updated.raw
The code: /space/chen-syn01/1/data/cinliu/data/clozapine/data_April2021/counted_allele_correction
I also uploaded the code and README to the onedrive if it’s easier to access for you: https://ucsdhs-my.sharepoint.com/:f:/g/personal/cil001_health_ucsd_edu/Ej6wsITU2I9Oi0KHjLTeU6gBZniUJoQpFJIrwOXBp2EKbQ?e=e2IUT3
I’ve modified the code so now the R file “generate_a1updated_raw.R” and the README contain all the information. Please let me know if any issues come up. Thank you for waiting for this one!
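For reference, flipping the counted allele in a .raw file amounts to replacing each dosage x with 2 - x (leaving NA untouched) and renaming the column to the new allele. The sketch below illustrates that idea in awk; it is not the R code mentioned above. The function name and the two-column flip list (SNP, tab, new counted allele) are assumptions made for the example. Some awk implementations cap the number of fields per line, so files with tens of thousands of SNP columns may need gawk.

```shell
#!/bin/bash
# $1: .raw file; $2: flip list with lines "SNP<TAB>new_counted_allele".
# Writes the corrected .raw content to stdout.
flip_counted_alleles() {
    awk -v flips="$2" '
        BEGIN {
            # Load the SNPs to flip and the allele each should count afterwards.
            while ((getline line < flips) > 0) {
                split(line, f, "\t")
                newallele[f[1]] = f[2]
            }
        }
        NR == 1 {
            # Header: find the columns to flip and rename them.
            for (i = 7; i <= NF; i++) {
                snp = $i
                sub(/_[^_]*$/, "", snp)      # strip the "_allele" suffix
                if (snp in newallele) { flip[i] = 1; $i = snp "_" newallele[snp] }
            }
            print; next
        }
        {
            # Data rows: dosage of the other allele is 2 minus the old dosage.
            for (i in flip) if ($i != "NA") $i = 2 - $i
            print
        }
    ' "$1"
}
```

A call such as `flip_counted_alleles All_TOP_samples_snplist1.raw flips.tsv > out.raw` would write the corrected file, where `flips.tsv` and `out.raw` are placeholder names.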