Required Computational Skills
Overview
This page contains some useful Unix, awk, and PLINK commands, as well as information on text editors and on working with gzipped files and tar directory archives. It is not intended as complete documentation for these tools. Rather, the examples listed below cover the minimum you need to know to use this pipeline effectively.
Unix Commands
A more complete guide to working in a Unix environment is available here. You can also run man <command> (e.g. man mkdir) to read a command's manual page directly in the terminal.
Changing the current working directory
Determine the current directory:
pwd
Change the working directory:
cd my_rp_out/
Creating/Removing directories
To create a directory:
mkdir my_new_directory/
To remove a directory (including its subdirectories):
rm -r my_new_directory/
Note: Be careful when using rm! Recovering directories and files in the event of accidental deletion may not be possible.
Showing content
Use the cat command to show the content of a file:
cat my_data
Use the echo command to print a string:
echo "Hello World"
stdout
Standard output (stdout) is the text or data that a program writes, by default to the terminal.
Redirect output
To pass the stdout of one command to another command as its stdin, use a pipe "|":
cat my_data | wc -l
To save stdout to a file, use the redirection operator ">" (a greater-than sign). This overwrites the file if it already exists:
cat my_data > my.output
To append stdout to the end of an existing (or new) file, use a double greater-than sign ">>":
cat my_data2 >> my.output
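As a small end-to-end sketch of the three redirection forms above, the following works in a temporary directory. The file contents and the variable names workdir and n_lines are made up for illustration.

```shell
# Scratch directory so that no real files are touched.
workdir=$(mktemp -d)

# Two small example files (contents are made up for illustration).
printf 'a\nb\nc\n' > "$workdir/my_data"
printf 'd\ne\n' > "$workdir/my_data2"

# ">" creates my.output, or overwrites it if it already exists.
cat "$workdir/my_data" > "$workdir/my.output"

# ">>" appends to the end of my.output instead of overwriting it.
cat "$workdir/my_data2" >> "$workdir/my.output"

# "|" passes the file content to wc, which counts the lines.
n_lines=$(cat "$workdir/my.output" | wc -l)
echo "my.output has $n_lines lines"
```

Running the ">" line a second time would reset my.output to the three lines of my_data, which is why ">>" is the safer choice when collecting output from several commands.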
Count the number of lines in a file
wc -l my.output
Print the first N (here 5) lines of a file
head -n 5 my.output
Print the last N (here 5) lines of a file
tail -n 5 my.output
Sort a file by one column (here column 5; use -k5,5n instead for numeric order)
sort -k5,5 my.output
Print only unique rows and show how often each one occurs in a file. uniq only merges identical rows that are adjacent, so sort the file first (see further down for usage in context)
sort my.output | uniq -c
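Because uniq compares only neighbouring lines, its counts depend on whether the input is sorted. A minimal sketch with made-up data (the variable names are illustrative) shows the difference:

```shell
workdir=$(mktemp -d)
printf 'case\ncontrol\ncase\ncase\ncontrol\n' > "$workdir/my.output"

# Unsorted input: repeated values that are not adjacent are counted separately.
unsorted_counts=$(uniq -c "$workdir/my.output")

# Sorted input: every distinct value appears once with its total count.
sorted_counts=$(sort "$workdir/my.output" | uniq -c)

echo "$unsorted_counts"
echo "---"
echo "$sorted_counts"
```

The unsorted variant reports four groups for this input, the sorted one reports two (3 case, 2 control).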
Print all lines of a file matching a string
To extract all lines from an uncompressed file that match a string:
grep "my.search.text" my_data
Note: grep is a very powerful tool, and it is well worth learning this classic Unix program in more depth.
Stringing Multiple Commands Together
Take an input text file, search for all lines that match "cas_", and then count the matching lines:
grep "cas_" my_data | wc -l
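The same pipeline can be tried on a throwaway file. The sample IDs below are hypothetical; only the "cas_" prefix matters. The sketch also shows grep -c, a standard grep option that counts matching lines without the extra wc step.

```shell
workdir=$(mktemp -d)
# Hypothetical sample IDs; only the "cas_" prefix matters here.
printf 'cas_001 2\ncon_001 1\ncas_002 2\n' > "$workdir/my_data"

# Pipe the matching lines into wc to count them.
n_pipe=$(grep "cas_" "$workdir/my_data" | wc -l)

# grep -c counts matching lines directly.
n_direct=$(grep -c "cas_" "$workdir/my_data")

echo "pipe: $n_pipe, grep -c: $n_direct"
```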
Awk Commands
Awk is a full programming language, but it can also be used as a simple one-line tool for modifying text files quickly. For more information, see here. Here are some examples.
Printing a subset of columns from a file
For example, with the following command, we can extract columns 1, 5, 6, and 10 and print them to a new file:
awk '{ print $1,$5,$6,$10 }' my_data > my_condensed_data
Rearranging the order of columns in a file
For example, with the following command, we print first column 2 from the original file, and then column 1 from the original file
awk '{ print $2,$1 }' my_data > my_rearranged_data
Counting the number of columns in each row of a file and showing how often each column count occurs
awk '{ print NF }' my_data | sort -k1,1n | uniq -c
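This check is a quick way to spot malformed rows. The sketch below uses made-up data in which one row is missing a column; the second awk command, an addition for illustration, prints the line numbers of the offending rows.

```shell
workdir=$(mktemp -d)
# Made-up data: three complete rows and one row with a missing column.
printf 'a b c\nd e f\ng h\ni j k\n' > "$workdir/my_data"

# Tally how many rows have each column count.
col_counts=$(awk '{ print NF }' "$workdir/my_data" | sort -k1,1n | uniq -c)
echo "$col_counts"

# Print the line numbers of rows that do not have the expected 3 columns.
bad_lines=$(awk 'NF != 3 { print NR }' "$workdir/my_data")
echo "rows with unexpected column count: $bad_lines"
```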
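Changing the column delimiter in a file
For example, the following command rewrites whitespace-delimited columns as tab-delimited columns. The assignment $1=$1 forces awk to rebuild each line with the new output separator (OFS); without it, print $0 emits the line unchanged.
awk 'BEGIN { OFS="\t" } { $1=$1; print }' my_data > my_data_w_tabs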
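Awk applies the output separator OFS only when it rebuilds a line, which happens after a field has been assigned. A minimal sketch with made-up data compares a command that reassigns a field with one that only prints $0:

```shell
workdir=$(mktemp -d)
printf 'rs1 1 100\nrs2 1 200\n' > "$workdir/my_data"

# Assigning a field to itself ($1=$1) makes awk rebuild $0 using OFS.
awk 'BEGIN { OFS="\t" } { $1=$1; print }' "$workdir/my_data" > "$workdir/my_data_w_tabs"

# Without an assignment, $0 is printed exactly as it was read.
awk 'BEGIN { OFS="\t" } { print $0 }' "$workdir/my_data" > "$workdir/my_data_unchanged"

head -n 1 "$workdir/my_data_w_tabs"
head -n 1 "$workdir/my_data_unchanged"
```

Only my_data_w_tabs ends up tab-delimited; my_data_unchanged is identical to the input.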
Extracting lines from a file that meet certain conditions
For example, with the following command, for each row in the file my_data, we check whether the number of columns equals 9 and column 1 equals 2 before printing that row to a new file my_passing_data.
awk '{ if (NF == 9 && $1 == 2) print $0 }' my_data > my_passing_data
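To see the filter in action, the sketch below runs it on three made-up rows, of which only the first satisfies both conditions. It also shows the shorter pattern-only form, which awk treats as "print every line for which the condition holds".

```shell
workdir=$(mktemp -d)
# Made-up rows: only the first has 9 columns and a 2 in column 1.
printf '2 a b c d e f g h\n1 a b c d e f g h\n2 a b c\n' > "$workdir/my_data"

# Explicit if-statement, as in the command above.
awk '{ if (NF == 9 && $1 == 2) print $0 }' "$workdir/my_data" > "$workdir/my_passing_data"

# Equivalent shorthand: a bare condition prints each line that satisfies it.
awk 'NF == 9 && $1 == 2' "$workdir/my_data" > "$workdir/my_passing_data_short"

cat "$workdir/my_passing_data"
```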
PLINK Commands
PLINK is a genetics software tool created by Shaun Purcell. The full documentation for PLINK is here. We will only go over a couple of data management commands.
Be aware that a newer and much faster version of PLINK, plink2, is available. Please use it carefully, since it is still in development. (link)
Binary PLINK file format
A binary PLINK file consists of three files ending in .fam, .bim, and .bed.
A .fam file consists of 6 columns where each row describes one sample: a Family ID (FID), an Individual ID (IID), a Paternal ID (PID), a Maternal ID (MID), a code for the sex of the sample (1=Male, 2=Female; other values, conventionally 0, are treated as unknown), and a code for the phenotype (1=Control, 2=Case, -9=Unknown).
A .bim file consists of 6 columns where each row describes one SNP: chromosome number, SNP ID, genetic distance (usually 0), chromosomal position (in base pairs), allele 1, and allele 2.
A .bed file is a binary file encoding the genotype data corresponding to the samples in the .fam file and the markers in the .bim file.
NEVER manually edit .fam, .bim, or .bed files. Use the data management commands in PLINK to modify your files.
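Reading these files is safe, though, and the Unix and awk commands from the earlier sections are handy for sanity checks. The following sketch inspects a hypothetical .fam file (the sample rows are made up) without modifying it: it confirms that every row has 6 columns and tallies the phenotype codes in column 6.

```shell
workdir=$(mktemp -d)
fam="$workdir/my.original.data.fam"
# Hypothetical .fam content: FID, IID, two parent IDs, sex, phenotype.
printf 'FAM1 ID1 0 0 1 2\nFAM2 ID2 0 0 2 1\nFAM3 ID3 0 0 1 2\nFAM4 ID4 0 0 2 -9\n' > "$fam"

# Count rows that do not have exactly 6 columns (should be 0).
n_bad=$(awk 'NF != 6' "$fam" | wc -l)
echo "rows without 6 columns: $n_bad"

# Tally phenotype codes (column 6): 1=Control, 2=Case, -9=Unknown.
pheno_counts=$(awk '{ print $6 }' "$fam" | sort | uniq -c)
echo "$pheno_counts"
```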
Making a new PLINK file
Use the following command to make a new binary PLINK file where my.original.data is the prefix root before the extensions .bed, .bim, and .fam:
plink --bfile my.original.data --make-bed --out my.new.data --allow-no-sex
Removing samples
You can use the following command to remove samples from a binary PLINK file:
plink --bfile my.original.data --remove dropSamples.txt --make-bed --out my.subset.data --allow-no-sex
Note: PLINK only removes the first instance of the sample name, so --remove will not work if you want to remove all instances of a sample from your file (e.g. NA12898 appearing multiple times).
Therefore, we recommend always making a text file with the samples you want to keep in your dataset (rather than the ones you want to remove) and using the following command:
plink --bfile my.original.data --keep keepSamples.txt --make-bed --out my.subset.data --allow-no-sex
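One way to derive such a keep list is to start from the .fam file and a list of samples to drop. The sketch below is only an illustration with made-up sample IDs; it assumes both dropSamples.txt and keepSamples.txt hold one FID and IID pair per line.

```shell
workdir=$(mktemp -d)
# Hypothetical inputs; ID2 appears twice in the .fam on purpose.
printf 'FAM1 ID1 0 0 1 2\nFAM2 ID2 0 0 2 1\nFAM2 ID2 0 0 2 1\nFAM3 ID3 0 0 1 2\n' > "$workdir/my.original.data.fam"
printf 'FAM2 ID2\n' > "$workdir/dropSamples.txt"

# First file (NR == FNR): remember each FID/IID pair that should be dropped.
# Second file: print FID and IID of every sample that is not in that set.
awk 'NR == FNR { drop[$1 " " $2]; next }
     !(($1 " " $2) in drop) { print $1, $2 }' \
    "$workdir/dropSamples.txt" "$workdir/my.original.data.fam" > "$workdir/keepSamples.txt"

cat "$workdir/keepSamples.txt"
```

Every row matching a dropped FID/IID pair is left out, including repeated ones. Note that the NR == FNR idiom needs a non-empty drop list; with an empty dropSamples.txt the rows of the .fam file would be read as the drop list instead.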
Updating phenotypes
To update phenotypes, first make a text file with three columns (FID, IID, new phenotype code [1=Control, 2=Case, -9=Unknown]) in which each row corresponds to one sample.
Next, use the following PLINK command to generate a new file set with the phenotype in the .fam corresponding to the new phenotype codes:
plink --bfile my.original.data --pheno newPhenotypes --make-bed --out my.newpheno.data --allow-no-sex
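Before handing the phenotype file to PLINK, it can help to check that it matches the three-column layout described above. The following is a minimal sketch with made-up rows, one of which deliberately carries an invalid code:

```shell
workdir=$(mktemp -d)
# Hypothetical phenotype file; the last row carries an invalid code (3).
printf 'FAM1 ID1 2\nFAM2 ID2 1\nFAM3 ID3 -9\nFAM4 ID4 3\n' > "$workdir/newPhenotypes"

# Report rows that lack 3 columns or use a code other than 1, 2, or -9.
bad_rows=$(awk 'NF != 3 || ($3 != 1 && $3 != 2 && $3 != -9)' "$workdir/newPhenotypes")
echo "rows to fix: $bad_rows"
```

An empty result means that every row has three columns and a recognised phenotype code.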
Text Editors
Some facility with using a command line text editor is required for using Ricopili. Some example software tools are listed below.
Emacs
Emacs is a text editor used by programmers. For a complete tutorial, see here.
Vim
Vim is another text editor used by programmers. For a complete tutorial, see here.
Only a few commands should be sufficient for basic text-file editing:
i (insert)
esc (escape key)
:w (write file)
:q (quit)
/ (search)
yy (copy line)
p (paste line)
Working with .gz files
A .gz file is a compressed version of a regular text file. The compressed version takes a lot less disk space than an uncompressed version.
Creating a .gz file
gzip my.file.txt
Decompressing a .gz file
gunzip my.file.txt.gz
Using the cat command on a .gz file
zcat my.file.txt.gz
Using the less command on a .gz file
zless my.file.txt.gz
The last two commands might not work in your setup. If so, decompress to stdout with gunzip -c and pipe the output:
gunzip -c my.file.txt.gz | cat
gunzip -c my.file.txt.gz | less
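Piping from gunzip -c also lets you run the earlier commands on compressed data without writing a decompressed copy to disk. A small round-trip sketch with a made-up file:

```shell
workdir=$(mktemp -d)
printf 'cas_001 2\ncon_001 1\ncas_002 2\n' > "$workdir/my.file.txt"

# gzip replaces my.file.txt with my.file.txt.gz.
gzip "$workdir/my.file.txt"

# Stream the decompressed text into other tools; the .gz file stays as it is.
n_lines=$(gunzip -c "$workdir/my.file.txt.gz" | wc -l)
n_cases=$(gunzip -c "$workdir/my.file.txt.gz" | grep -c "cas_")
echo "$n_lines lines, $n_cases matching cas_"
```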
Working with .tar files
A tar file is an archive that bundles the full contents of a directory into one file. This file can also be compressed (tarball).
Creating a directory archive
tar -cvf my_directory.tar my_directory/
Creating a compressed directory archive
tar -czvf my_directory.tar.gz my_directory/
Extracting a .tar file
tar -xvf my_directory.tar
Extracting a compressed .tar.gz file
tar -xvzf my_directory.tar.gz
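The sketch below runs a full round trip in a temporary directory with a made-up file. It uses two standard tar options that are not shown above: -t lists the contents of an archive without extracting it, and -C changes into the given directory before archiving or extracting.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/my_directory"
echo "hello" > "$workdir/my_directory/notes.txt"

# -C enters $workdir first, so the archive stores the relative path my_directory/.
tar -czf "$workdir/my_directory.tar.gz" -C "$workdir" my_directory

# -t lists the archive contents without extracting anything.
listing=$(tar -tzf "$workdir/my_directory.tar.gz")
echo "$listing"

# Extract into a separate directory so the original is not overwritten.
mkdir "$workdir/restored"
tar -xzf "$workdir/my_directory.tar.gz" -C "$workdir/restored"
cat "$workdir/restored/my_directory/notes.txt"
```

Listing an archive first is a cheap way to check where its files will land before you extract it.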
Further weblinks / tutorials
Awk:
http://www.grymoire.com/Unix/Awk.html
http://www.vectorsite.net/tsawk.html
Regular Expressions:
http://www.regular-expressions.info/
sed:
http://www.panix.com/~elflord/unix/sed.html
grep: