UKBio Data Extraction and PRSice

Outlook mail title: Data extraction and PRSice

Oct 19th 2020

Outlook email title: UKBio Factor extraction commands explanation

Nov 15th 2020

(Also see: Slack messages with Hao from Oct 19th 2020

Slack with CH, Nov 3rd 2020)

Location on server:

/space/chen-syn01/1/data/cinliu/data/UKB_extraction/UKB_factor_extraction_redord.zip

/space/chen-syn01/1/data/cinliu/data/UKB_extraction/UKB_factor_extraction_redord

Please scp the zip file to your local computer and open the directory there for the best experience.

The folder was originally saved on OneDrive, and many of the file names are not in a server-friendly format, so working with the directory on the server is not recommended.

Once unzipped, the contents will be the same as: /space/chen-syn01/1/data/cinliu/data/UKB_extraction/UKB_factor_extraction_redord

Explanations are in /UKB_factor_extraction_redord/UKBio factors extraction explanation.docx, which you can open once you have unzipped UKB_factor_extraction_redord.zip

Initial instructions

PRSice learning

Meeting zoom call with Hao

Key points:

We would like to extract some data from the UK Biobank first, then take that data and run it in PRSice
- first, learn how to extract data from the UK Biobank; ideally we would like to have it in an Excel sheet
- second, learn how to use PRSice, but this is a later step; only worry about it after step 1 is figured out.

Email info from Hao (email title: Data extraction and PRSice):

Here are the resources for data extraction:

UK Biobank Data Showcase: http://biobank.ndph.ox.ac.uk/showcase/
Showcase User Guide: as attached https://biobank.ndph.ox.ac.uk/ukb/ukb/exinfo/ShowcaseUserGuide.pdf

PRS calculation:

Sample script: as attached (PRS.sh)

PRSice documentation page: https://www.prsice.info/

PRSice tutorial: https://choishingwan.github.io/PRS-Tutorial/prsice/

I’ll send you the list of variables soon.

The list of covariates was sent via Slack on Nov 1st 2020

File name: factors.docx

1. Less education
   a. Age completed full time education (845)
   b. Qualifications (6138)
2. Hearing loss
   a. Hearing difficulty (2247)
   b. Conductive and sensorineural hearing loss (H90)
   c. Other hearing loss (H91)
   d. Hearing aid user (3393)
3. Traumatic brain injury
   a. Fracture of skull and facial bones (S02)
   b. Intracranial injury (S06)
4. Hypertension (SBP > 140 mmHg)
   a. Essential hypertension (I10)
   b. Secondary hypertension (I15)
5. Alcohol (21 drinks/week)
   a. Alcohol drinker status (22117)
   b. Alcohol consumed (100580)
6. Obesity (BMI > 30kg/m2)
   a. Body mass index (21001)
   b. Body mass index – impedance (23104)
7. Smoking
   a. Age of stopping smoking (22507)
   b. Age started smoking in current smokers (3436)
   c. Age started smoking in former smokers (2867)
   d. Pack years of smoking (20161)
   e. Smoking status (20116)
   f. Smoking/smokers in household (1259)
8. Depression
   a. Depressive episode (F32)
9. Social isolation
   a. Loneliness, isolation (2020)
10. Physical inactivity
   a. Number of days/weeks of vigorous physical activity 10+ minutes (10962)
   b. Number of days/weeks of moderate physical activity 10+ minutes (10971)
   c. Time spent doing light physical activity (104920)
   d. Time spent doing moderate physical activity (104910)
   e. Time spent doing vigorous physical activity (104900)
11. Air pollution
   a. Nitrogen dioxide air pollution: 2010 (24003)
   b. Particulate matter air pollution (pm10) : 2010 (24005)
   c. Particulate matter air pollution (pm2.5) : 2010 (24006)
12. Diabetes
   a. Diabetes diagnosed by doctor (2443)
   b. Diabetes mellitus (250)
13. Sleep
   a. Sleep duration (1160)
   b. Insomnia (1200)

Screenshot of factors.docx

UKBio

How to read the data from UKBio

https://biobank.ctsu.ox.ac.uk/~bbdatan/Accessing_UKB_data_v2.3.pdf

Page 21 shows how to read the .csv data files

Page 15 shows how to read the html file. The html file helps you understand what the keys mean in the csv file.

Outputs

/space/chen-syn01/1/data/cinliu/data/UKB_extraction/UKB_factor_extraction_redord.zip

Original email message:

The attached file "full command recorded" are the commands I used to extract the data for the UKB factors.

Since much of it has manual parts I have attached a document "UKBio factors extraction explanation" that explains the details.

I will also link a OneDrive folder: UKB factor extraction redord

This folder contains some extra material in case it is helpful:

"supplementary log history" - the terminal history I kept when I was doing all of the actions
"UKB ID list.xlsx" - the excel sheets I used to help me keep track
"readUKB.R" - an R file that just combines the final csv files into one Excel sheet

UKBio factors extraction explanation.docx (expand to see text only; the original doc has images too, I'll try to show them below)

Full command log:

Full command record

Log into the server

ssh cinliu@137.110.172.99

cd /space/chen-syn01/1/data/cinliu/data/fIDlist/

cat idlist.txt

cd /space/gwas-syn1/1/data/GWAS/UKBioBank/phenotypes/Baskets/

ls -l

awk 'BEGIN{ FS="," }{ for(fn=1;fn<=NF;fn++) {print fn" = "$fn;}; exit; }' ukb43789.csv > /space/chen-syn01/1/data/cinliu/data/fIDlist/list43789.csv

awk 'BEGIN{ FS="," }{ for(fn=1;fn<=NF;fn++) {print fn" = "$fn;}; exit; }' ukb42438.csv> /space/chen-syn01/1/data/cinliu/data/fIDlist/list42438.csv

awk 'BEGIN{ FS="," }{ for(fn=1;fn<=NF;fn++) {print fn" = "$fn;}; exit; }' ukb42012.csv> /space/chen-syn01/1/data/cinliu/data/fIDlist/list42012.csv

awk 'BEGIN{ FS="," }{ for(fn=1;fn<=NF;fn++) {print fn" = "$fn;}; exit; }' ukb41296.csv> /space/chen-syn01/1/data/cinliu/data/fIDlist/list41296.csv

awk 'BEGIN{ FS="," }{ for(fn=1;fn<=NF;fn++) {print fn" = "$fn;}; exit; }' ukb40545.csv> /space/chen-syn01/1/data/cinliu/data/fIDlist/list40545.csv

awk 'BEGIN{ FS="," }{ for(fn=1;fn<=NF;fn++) {print fn" = "$fn;}; exit; }' ukb40544.csv> /space/chen-syn01/1/data/cinliu/data/fIDlist/list40544.csv

awk 'BEGIN{ FS="," }{ for(fn=1;fn<=NF;fn++) {print fn" = "$fn;}; exit; }' ukb40543.csv> /space/chen-syn01/1/data/cinliu/data/fIDlist/list40543.csv

awk 'BEGIN{ FS="," }{ for(fn=1;fn<=NF;fn++) {print fn" = "$fn;}; exit; }' ukb40542.csv> /space/chen-syn01/1/data/cinliu/data/fIDlist/list40542.csv

awk 'BEGIN{ FS="," }{ for(fn=1;fn<=NF;fn++) {print fn" = "$fn;}; exit; }' ukb40541.csv> /space/chen-syn01/1/data/cinliu/data/fIDlist/list40541.csv

awk 'BEGIN{ FS="," }{ for(fn=1;fn<=NF;fn++) {print fn" = "$fn;}; exit; }' ukb40539.csv> /space/chen-syn01/1/data/cinliu/data/fIDlist/list40539.csv

awk 'BEGIN{ FS="," }{ for(fn=1;fn<=NF;fn++) {print fn" = "$fn;}; exit; }' ukb37384.csv> /space/chen-syn01/1/data/cinliu/data/fIDlist/list37384.csv

awk 'BEGIN{ FS="," }{ for(fn=1;fn<=NF;fn++) {print fn" = "$fn;}; exit; }' ukb37115.csv> /space/chen-syn01/1/data/cinliu/data/fIDlist/list37115.csv

awk 'BEGIN{ FS="," }{ for(fn=1;fn<=NF;fn++) {print fn" = "$fn;}; exit; }' ukb37113.csv> /space/chen-syn01/1/data/cinliu/data/fIDlist/list37113.csv

awk 'BEGIN{ FS="," }{ for(fn=1;fn<=NF;fn++) {print fn" = "$fn;}; exit; }' ukb37112.csv> /space/chen-syn01/1/data/cinliu/data/fIDlist/list37112.csv

awk 'BEGIN{ FS="," }{ for(fn=1;fn<=NF;fn++) {print fn" = "$fn;}; exit; }' ukb37111.csv> /space/chen-syn01/1/data/cinliu/data/fIDlist/list37111.csv

awk 'BEGIN{ FS="," }{ for(fn=1;fn<=NF;fn++) {print fn" = "$fn;}; exit; }' ukb37110.csv> /space/chen-syn01/1/data/cinliu/data/fIDlist/list37110.csv

awk 'BEGIN{ FS="," }{ for(fn=1;fn<=NF;fn++) {print fn" = "$fn;}; exit; }' ukb37109.csv> /space/chen-syn01/1/data/cinliu/data/fIDlist/list37109.csv

awk 'BEGIN{ FS="," }{ for(fn=1;fn<=NF;fn++) {print fn" = "$fn;}; exit; }' ukb37108.csv> /space/chen-syn01/1/data/cinliu/data/fIDlist/list37108.csv

awk 'BEGIN{ FS="," }{ for(fn=1;fn<=NF;fn++) {print fn" = "$fn;}; exit; }' ukb37107.csv> /space/chen-syn01/1/data/cinliu/data/fIDlist/list37107.csv

awk 'BEGIN{ FS="," }{ for(fn=1;fn<=NF;fn++) {print fn" = "$fn;}; exit; }' ukb32537.csv> /space/chen-syn01/1/data/cinliu/data/fIDlist/list32537.csv

logout

scp cinliu@137.110.172.99:/space/chen-syn01/1/data/cinliu/data/fIDlist/idlist.txt /Users/niniliu/Desktop/2020lab/hao/DataUKB

scp cinliu@137.110.172.99:/space/chen-syn01/1/data/cinliu/data/fIDlist/*.csv /Users/niniliu/Desktop/2020lab/hao/DataUKB

grep -f idlist.txt list43789.csv

grep -f idlist.txt list42438.csv

grep -f idlist.txt list42012.csv

grep -f idlist.txt list41296.csv

grep -f idlist.txt list40545.csv

grep -f idlist.txt list40544.csv

grep -f idlist.txt list40543.csv

grep -f idlist.txt list40542.csv

grep -f idlist.txt list40541.csv

grep -f idlist.txt list40539.csv

grep -f idlist.txt list37384.csv

grep -f idlist.txt list37115.csv

grep -f idlist.txt list37113.csv

grep -f idlist.txt list37112.csv

grep -f idlist.txt list37111.csv

grep -f idlist.txt list37110.csv

grep -f idlist.txt list37109.csv

grep -f idlist.txt list37108.csv

grep -f idlist.txt list37107.csv

grep -f idlist.txt list32537.csv

ssh cinliu@137.110.172.99

cd /space/gwas-syn1/1/data/GWAS/UKBioBank/phenotypes/Baskets/

awk -F "," '{print$1,$367,$368,$369,$389,$390,$391,$392,$393,$394,$395,$396,$397,$398,$399,$400,$401,$402,$403,$404,$405,$406,$407,$408,$409,$410,$411,$412}' ukb40545.csv > /space/chen-syn01/1/data/cinliu/data/UKBcsv/40545.csv

awk -F "," '{print$1,$958,$959,$960,$961,$1323,$1324,$1325,$1326,$14462,$14463,$14464,$14465,$14466,$9471,$9472,$9473,$9474,$11060,$11061,$11062,$11063,$9616,$1339,$1340,$1341,$1342,$1132,$1133,$1134,$1135,$8779,$8780,$8781,$8782,$8631,$8632,$8633,$8634,$629,$630,$631,$632,$878,$879,$880,$881,$6414,$6415,$15512,$15513,$15514,$15515,$15516,$15507,$15508,$15509,$15510,$15511,$15502,$15503,$15504,$15505,$15506,$1020,$593,$610}' ukb40539.csv > /space/chen-syn01/1/data/cinliu/data/UKBcsv/40539.csv

awk -F "," '{print$1,$4500}' ukb37110.csv > /space/chen-syn01/1/data/cinliu/data/UKBcsv/37110.csv

scp cinliu@137.110.172.99:/space/chen-syn01/1/data/cinliu/data/UKBcsv/*.csv /Users/niniliu/Desktop/2020lab/hao/DataUKB/


EXPLANATION

This is a step by step explanation of how I extracted the UKBio bank data.

(I really did it half manually so apologies in advance if that causes any inconvenience.)

PART1: locating the data

cd /space/chen-syn01/1/data/cinliu/data/fIDlist/

cat idlist.txt

Copy and paste the ‘Data-field ID number’ of each given factor into a .txt file called “idlist.txt”

(located here: /space/chen-syn01/1/data/cinliu/data/fIDlist/idlist.txt)
The list looks like this:

Go to /space/gwas-syn1/1/data/GWAS/UKBioBank/phenotypes/Baskets

Use the ls -l command to view the timestamps of the files

Using the following command I was able to view just the first row, with each column's number printed:

awk 'BEGIN{ FS="," }{ for(fn=1;fn<=NF;fn++) {print fn" = "$fn;}; exit; }' filename.csv

For example:

So the displayed output is everything in the first row only; the number in front of the equal sign is the column number.

Using the command:

awk 'BEGIN{ FS="," }{ for(fn=1;fn<=NF;fn++) {print fn" = "$fn;}; exit; }' ukbxxxx.csv> /space/chen-syn01/1/data/cinliu/data/fIDlist/listxxxx.csv

I saved the first-row information, with column numbers, of the 20 most recent .csv files into separate lists in /space/chen-syn01/1/data/cinliu/data/fIDlist/

I used an Excel workbook to keep track of which CSVs had been recorded (sheet name ‘csv list’):
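The repeated per-file commands can also be written as a loop. This is only a sketch, not what was actually run: the function names print_header_columns and list_headers are made up for illustration, and it assumes the basket files are named ukb<number>.csv with no spaces in their names.

```shell
# Print "column_number = header" for the first row of one CSV
# (same awk one-liner as in the command record).
print_header_columns() {
    awk 'BEGIN{ FS="," }{ for (fn = 1; fn <= NF; fn++) { print fn" = "$fn }; exit }' "$1"
}

# list_headers BASKET_DIR OUT_DIR [N]
# Writes OUT_DIR/list<number>.csv for the N most recently modified ukb*.csv files.
# (hypothetical helper; N defaults to 20)
list_headers() {
    basket_dir=$1
    out_dir=$2
    n=${3:-20}
    mkdir -p "$out_dir"
    # ls -t sorts newest first; head keeps the N most recent files.
    ( cd "$basket_dir" && ls -t ukb*.csv | head -n "$n" ) | while read -r f; do
        id=${f#ukb}      # strip the "ukb" prefix
        id=${id%.csv}    # strip the ".csv" suffix
        print_header_columns "$basket_dir/$f" > "$out_dir/list$id.csv"
    done
}

# Example call with the paths from these notes:
# list_headers /space/gwas-syn1/1/data/GWAS/UKBioBank/phenotypes/Baskets \
#              /space/chen-syn01/1/data/cinliu/data/fIDlist 20
```

Sorting by modification time stands in for reading the timestamps from ls -l by hand; if the most recent baskets are chosen differently, the file list would need to be given explicitly.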
https://ucsdhs-my.sharepoint.com/:x:/g/personal/cil001_health_ucsd_edu/Ea3BxBdGOcdMnQl6UExyTSkBj2mVpOCNsEnIWpw5jo7Xhg?e=Bi6ErI

Now, using grep -f idlist.txt listxxxx.csv, look through the lists manually and record the location (column number and file number) in the Excel workbook

(this step takes a while because it’s just manual)

For some reason the command on our server shows everything and just highlights the matches in red. My computer only returns the lines with a match, so I scp'd the files to my laptop and did this part locally. Below is the log history.

But basically what returns for each command is something that looks like this:

I just manually look through the field IDs (right-hand side of the equal sign) to see if there is any match; if there is a match, I note the information down in the Excel workbook: https://ucsdhs-my.sharepoint.com/:x:/g/personal/cil001_health_ucsd_edu/Ea3BxBdGOcdMnQl6UExyTSkBj2mVpOCNsEnIWpw5jo7Xhg?e=tCQ1mS
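Part of the manual checking comes from grep -f matching the IDs as substrings, so an ID such as 845 also hits a column number 845 or a field like 20845. A possible alternative, sketched below, compares each ID only against the field part of the header. The function name match_fields is hypothetical, and the sketch assumes each list line looks like `2 = "845-0.0"` (column number, equal sign, header). ICD codes such as H90 in idlist.txt are not field IDs and will simply not match.

```shell
# match_fields IDLIST LISTFILE
# Prints the lines of LISTFILE whose field number (the part of the header
# before "-") is exactly one of the IDs in IDLIST.
match_fields() {
    awk '
        NR == FNR {                       # first file: load the wanted field IDs
            gsub(/[ \t\r]+$/, "")         # drop trailing whitespace / Windows line ends
            if ($0 != "") ids[$0] = 1     # skip blank lines
            next
        }
        {
            header = $3                   # third token, e.g. "845-0.0"
            gsub(/"/, "", header)         # remove the surrounding quotes
            split(header, parts, "-")     # parts[1] is the field number
            if (parts[1] in ids) print $0
        }
    ' "$1" "$2"
}

# Example: loop over all saved header lists
# for f in list*.csv; do echo "== $f"; match_fields idlist.txt "$f"; done
```

The matched lines still give the column number on the left of the equal sign, so they can be recorded in the tracking workbook as before.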

PART2: extracting the data

Once the locations (i.e. column numbers) have been found, I log back into our server and export the desired columns to separate csv files

Scp the csv files and combine them into one Excel sheet (using R, see ‘readUKB.R’)
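The column export in Part 2 can be sketched as a small helper. Note that the logged awk commands print with awk's default output separator (a space), so the saved files are space-separated despite the .csv name; the sketch below sets the output separator to a comma instead, which is an assumption about the desired output. The function name extract_columns is made up, and splitting on every comma assumes no field contains a comma inside quotes.

```shell
# extract_columns COLS INPUT
# COLS is a comma-separated list of column numbers, e.g. "1,367,368".
# Prints those columns of INPUT, comma-separated, in the order given.
extract_columns() {
    awk -v cols="$1" '
        BEGIN { FS = OFS = ","; n = split(cols, c, ",") }
        {
            line = $(c[1])
            for (i = 2; i <= n; i++) line = line OFS $(c[i])
            print line
        }
    ' "$2"
}

# Example (column numbers taken from the command record):
# extract_columns 1,4500 ukb37110.csv > /space/chen-syn01/1/data/cinliu/data/UKBcsv/37110.csv
```

Keeping the column numbers in one string makes it easier to paste them straight from the tracking workbook and to spot a missing `$`, which is easy to overlook in a long hand-typed print list.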


Background information

There are 2 main types of files I used (accessed from /space/gwas-syn1/1/data/GWAS/UKBioBank/phenotypes/Baskets):

The main data = ukbxxxx.csv

The coding key for the field = ukbxxxx.html

The main data

The actual data of the .csv file will look something like this when opened in Excel:

The meaning of the column headers:

eid = the encoded participant identifier for the project in question
The rest of the columns are formatted as F-I.A

F = field number
I = instance index - used to distinguish data for a field which was gathered at different times
A = array index - used to distinguish multiple pieces of data for that field which were gathered at the same time

Note: More information on data can be found on page 21-23 of Accessing_UKB_data_v2.3.pdf
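To make the F-I.A layout concrete, here is a small sketch that splits one column header into its three parts. The function name parse_ukb_header is hypothetical, and the header is assumed to have already had its surrounding quotes removed.

```shell
# parse_ukb_header HEADER
# Splits a header of the form F-I.A into field, instance and array index.
parse_ukb_header() {
    h=$1
    field=${h%%-*}         # everything before the first "-"
    rest=${h#*-}           # everything after the first "-", i.e. "I.A"
    instance=${rest%%.*}   # before the "."
    array=${rest#*.}       # after the "."
    printf 'field=%s instance=%s array=%s\n' "$field" "$instance" "$array"
}

parse_ukb_header 20116-0.0   # prints: field=20116 instance=0 array=0
```

So 20116-0.0 is field 20116 (smoking status in the factor list), first instance, first array item.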

After the data is extracted you can look in the .html files to see the details of how each item is coded

Example: ukb37384.html

Click on the hyperlinks in the description to see the details of the encoding.

For example, if you want to know what the coding for UID 19-0.0 means, click on data-coding “100260”; it will lead you to:
https://biobank.ctsu.ox.ac.uk/crystal/coding.cgi?id=100260

PRSice

Follow instructions:

https://choishingwan.github.io/PRS-Tutorial/

QC steps

following the quality control steps from https://choishingwan.github.io/PRS-Tutorial/base/

Notes from Hao:

if the OR is lower than 1 you need to flip the ratio

Flipping the ratio of the OR scores using R code

import data
look for OR that is less than 1: dat$OR < 1
change the decimals into fractions
save the numerator and the denominator of the fraction
calculate the flipped ratio of the numerator and denominator
flip the A1 and A2
recombine into the original data

rm(list=ls())
library(MASS)      # fractions()
library(pgirmess)  # write.delim()

# flip the OR that is smaller than one
setwd("/Users/niniliu/Desktop/2020lab/PRS")
EUP_PSOR_2012 <- read.delim("/Users/niniliu/Desktop/2020lab/PRS/EUP_PSOR_2012.QC")

# make a copy
dat2 <- EUP_PSOR_2012
dat1 <- dat2[dat2$OR < 1, ]
head(dat1$OR)

# change to fractions
dat1$fraction <- fractions(dat1$OR)

# split fraction function
getfracs <- function(frac) {
  tmp <- strsplit(attr(frac,"fracs"), "/")[[1]]
  list(numerator=as.numeric(tmp[1]), denominator=as.numeric(tmp[2]))
}

# quick check of getfracs on a known value
x <- fractions(.375)
fracs <- getfracs(x)
fracs$numerator
getfracs(dat1$fraction[2])

# store the numerator and denominator for every row with OR < 1
# (the original loop hard-coded the row count as 1:533396)
for (i in seq_len(nrow(dat1))) {
  fract <- getfracs(dat1$fraction[i])
  dat1$numerator[i] <- fract$numerator
  dat1$denominator[i] <- fract$denominator
}
dat1$OR[1]

# flipped ratio: denominator / numerator
for (i in seq_len(nrow(dat1))) {
  dat1$OR[i] <- dat1$denominator[i]/dat1$numerator[i]
}

# swap A1 and A2 (assumed to be columns 5 and 6) and drop the helper columns
colnames(dat1)[5] <- "A2"
colnames(dat1)[6] <- "A1"
dat1 <- dat1[c(1,2,3,4,6,5,7,8,9,10,11)]

# recombine into the original data and save
dat2[dat2$OR < 1, ] <- dat1
write.delim(dat2, "EUP_PSOR_2012.ORQC",row.names = FALSE, quote = FALSE, sep = "\t")
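The same flip can be sketched in the terminal with awk. This is not the R method above: instead of going through fractions it takes the reciprocal 1/OR directly, which gives the same flipped ratio without the rounding that the fraction approximation can introduce. The function name flip_or is made up, and the sketch assumes a tab-delimited file whose header row contains columns named exactly A1, A2 and OR.

```shell
# flip_or INPUT
# For rows with 0 < OR < 1: replace OR by 1/OR and swap the A1 and A2 alleles.
# Other rows and the header are printed unchanged.
flip_or() {
    awk '
        BEGIN { FS = OFS = "\t" }
        NR == 1 {                               # header: remember column positions
            for (i = 1; i <= NF; i++) col[$i] = i
            print
            next
        }
        {
            v = $(col["OR"]) + 0                # numeric value of the OR column
            if (v > 0 && v < 1) {
                $(col["OR"]) = sprintf("%.10g", 1 / v)
                tmp = $(col["A1"])
                $(col["A1"]) = $(col["A2"])
                $(col["A2"]) = tmp
            }
            print
        }
    ' "$1"
}

# Example with the file names from the R script:
# flip_or EUP_PSOR_2012.QC > EUP_PSOR_2012.ORQC
```

Looking the columns up by name avoids relying on A1 and A2 being columns 5 and 6, but it does depend on the header spelling matching exactly.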

After following the rest of the QC steps, run the PRSice program in the terminal