Data matching percentage check
(A1A2 check and flip)
email title: "data matching percentage check"
March 25th, 2021 (date the final version of the file was sent out)
Location on server: /space/chen-syn01/1/data/cinliu/data/dat4Tyler
Goal:
We wanted to know how different the old and the new data are.
column by column check
A1 A2 match test
After the check, if A1 and A2 do not match up, we flip the coding 0,1,2 -> 2,1,0
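The A1 A2 match test above can be sketched roughly as follows. This is only an illustration, not the code that was actually run: it assumes each data set comes with a per-SNP allele table with columns named snp, A1 and A2 (hypothetical names), and it treats a SNP as qualifying for the flip when the two alleles are exchanged between the old and the new data.

```r
# Hypothetical allele tables: one row per SNP, with A1/A2 columns.
# The column names and toy values are assumptions for illustration only.
old_alleles <- data.frame(snp = c("rs1", "rs2", "rs3"),
                          A1  = c("A", "C", "G"),
                          A2  = c("G", "T", "A"),
                          stringsAsFactors = FALSE)
new_alleles <- data.frame(snp = c("rs1", "rs2", "rs3"),
                          A1  = c("A", "T", "G"),
                          A2  = c("G", "C", "T"),
                          stringsAsFactors = FALSE)

# Line the two tables up by SNP name.
m <- merge(old_alleles, new_alleles, by = "snp", suffixes = c(".old", ".new"))

# Same orientation: A1 and A2 agree. Swapped: A1 and A2 are exchanged.
same    <- m$A1.old == m$A1.new & m$A2.old == m$A2.new
swapped <- m$A1.old == m$A2.new & m$A2.old == m$A1.new

# SNPs that would qualify for the 0,1,2 -> 2,1,0 flip under this assumption.
flip_snps <- m$snp[swapped & !same]
print(flip_snps)
```

SNPs where neither condition holds (rs3 in the toy data) are left alone here; how those were handled in the real analysis is not covered by this sketch.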
("ClzTOP-dataTOP.RData" vs. "NEW_TOP_DATA.RData"):
39065 SNPs qualified for making the A1 A2 flip.
After the A1 A2 flip, the perfect matches become 43.87%.
I will attach the summary text file to this email for more details.
Summary:
Percentage of matches (factoring out the extra FIDs in new data): 90.74
Percentage of matches (factoring out multiple factors)*: 95.1
Number of matches*: 85614177
Number of mismatches*: 4408173
*Note: the calculation factors out the information below:
FIDs removed from old data (ClzTOP-dataTOP.RData) because no matches were found: 53
FIDs removed from new data (NEW_TOP_DATA.RData) because no matches were found: 1096
SNPs removed from old data (ClzTOP-dataTOP.RData) because no matches were found: 838
SNPs removed from new data (NEW_TOP_DATA.RData) because of duplication: 122
Columns listed below were also removed from old data (ClzTOP-dataTOP.RData):
y, gender, C1, C2, C3, C4
Columns listed below were also removed from new data (NEW_TOP_DATA.RData):
FID, PAT, MAT, SEX, PHENOTYPE
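As a rough illustration of how the numbers in the summary fit together, the sketch below restricts two toy genotype matrices to their shared FIDs and SNPs and then counts matching cells. The matrices, their names (old, new) and the use of row names for FIDs are assumptions made for the example, not the real objects stored in the .RData files. The last lines apply the same formula to the match and mismatch counts reported above.

```r
# Toy stand-ins for the two data sets (hypothetical values).
# Rows are individuals (named by FID), columns are SNPs coded 0/1/2.
old <- matrix(c(0, 1, 2,
                1, 1, 0,
                2, 0, 1),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("f1", "f2", "f3"), c("rs1", "rs2", "rs3")))
new <- matrix(c(0, 1, 1,
                2, 0, 1,
                1, 2, 2,
                0, 0, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(c("f1", "f3", "f2", "f9"), c("rs1", "rs2", "rs4")))

# Keep only the FIDs and SNPs present in both data sets, in the same order.
fids <- intersect(rownames(old), rownames(new))
snps <- intersect(colnames(old), colnames(new))
old_common <- old[fids, snps, drop = FALSE]
new_common <- new[fids, snps, drop = FALSE]

# Cell-by-cell comparison over the shared block.
n_match    <- sum(old_common == new_common, na.rm = TRUE)
n_mismatch <- sum(old_common != new_common, na.rm = TRUE)
pct_match  <- round(100 * n_match / (n_match + n_mismatch), 2)
print(c(matches = n_match, mismatches = n_mismatch, percent = pct_match))

# The same formula applied to the counts reported in the summary.
reported_pct <- round(100 * 85614177 / (85614177 + 4408173), 1)
print(reported_pct)
```

With the reported counts, 85614177 / (85614177 + 4408173) comes out at about 0.951, which agrees with the 95.1 figure in the summary.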
I've uploaded the R code used and the list of SNP names that had the 0,1,2 -> 2,1,0 flip into a OneDrive folder here:
More information you might be interested in:
21 ambiguous SNPs were found in the new data set. All 21 SNPs are within the 122 triallelic variant SNPs and were removed for this analysis.
No ambiguous SNPs were found in the old data set (I think the previous postdoc already cleaned it).
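For reference, SNP names that occur more than once (like the 122 removed from the new data because of duplication) could be located along these lines. This is a sketch on made-up names; it assumes duplication means a repeated column name, and it drops every copy of a duplicated SNP, which may or may not be what the original script did.

```r
# Made-up SNP column names; rs2 and rs4 appear more than once.
snp_names <- c("rs1", "rs2", "rs2", "rs3", "rs4", "rs4", "rs4")

# Names that occur more than once.
dup_names <- unique(snp_names[duplicated(snp_names)])

# Assumption: drop every copy of a duplicated SNP, not just the extras.
keep       <- !(snp_names %in% dup_names)
kept_names <- snp_names[keep]

print(dup_names)
print(kept_names)
```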
For "ClzTOP-dataClz.RData" (which I copied with scp from the server, /space/syn09/1/data/nsanyal/GWASinlps/ClzTOP/ClzTOP_anal/ClzTOP-dataClz.RData), no FID matched the new data, so I couldn't run any analysis for it. In other words, there was a 0% match. Or maybe Anu was referring to a different set of Clz data?
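The FID overlap check described here amounts to something like the following. The ID vectors are invented for the example; fid_clz and fid_new are hypothetical names for the FID columns of the two data sets.

```r
# Invented FIDs standing in for the Clz data and the new data.
fid_clz <- c("a1", "a2", "a3")
fid_new <- c("b1", "b2", "b3", "b4")

# FIDs present in both data sets, and the share of Clz FIDs that were found.
shared      <- intersect(fid_clz, fid_new)
pct_overlap <- 100 * length(shared) / length(fid_clz)

print(length(shared))
print(pct_overlap)
```

When the intersection is empty, as in this toy case, there is no shared block of individuals left to compare.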
(does not alter the data in any way; it only checks and prints a summary, and also generates a list of SNPs that qualify for flipping)
(this is where we do the 0,1,2 -> 2,1,0 flip, at d1[,i] <- 2 - d1[,i])
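A minimal sketch of the flip step on toy data is shown below. The name d1 follows the line quoted above; the matrix contents and the flip_snps vector are made up for illustration. Subtracting from 2 maps 0 -> 2, 1 -> 1 and 2 -> 0, and leaves missing values as NA.

```r
# Toy genotype matrix coded 0/1/2 (values are made up), columns are SNPs.
d1 <- matrix(c(0, 1, 2,
               2, 2, 0,
               1, NA, 1),
             nrow = 3,
             dimnames = list(NULL, c("rs1", "rs2", "rs3")))

# Hypothetical list of SNPs that qualified for the flip.
flip_snps <- c("rs2")

# Flip only the listed columns: 0,1,2 -> 2,1,0.
for (i in which(colnames(d1) %in% flip_snps)) {
  d1[, i] <- 2 - d1[, i]
}

print(d1)
```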