This page contains some useful Unix, awk, and PLINK commands, as well as information on text editors and on working with gzipped files and tar directory archives. It is not intended as complete documentation for these tools. Rather, the examples listed below cover the minimum knowledge needed to use this pipeline effectively.
A more complete guide to working in a Unix environment is available here. Also try man <command> (e.g. man mkdir) to read a manual page directly in the terminal.
Determine the current directory:
pwd
Change the working directory:
cd my_rp_out/
To create a directory:
mkdir my_new_directory/
To remove a directory (including its subdirectories):
rm -r my_new_directory/
Note: Be careful when using rm! Recovering directories and files in the event of accidental deletion may not be possible.
Use the cat command to show the content of a file:
cat my_data
Use the echo command to print text to the terminal (stdout):
echo "Hello World"
Standard output (stdout) is the text or data coming out of a program.
To transfer text stored in stdout to another command as stdin, use a pipe "|":
cat my_data | wc -l
To save text stored in stdout to a file, use a greater-than sign ">" (this overwrites the file if it already exists):
cat my_data > my.output
To append text stored in stdout to the end of an existing (or new) file, use a double greater-than sign ">>":
cat my_data2 >> my.output
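The pipe and the two redirection forms above can be combined in one short session. This is only a minimal sketch: the file names (demo_a, demo_b, demo.output) and the temporary directory are hypothetical and exist purely for illustration.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)

# Create two small input files (hypothetical names).
printf 'line1\nline2\n' > "$workdir/demo_a"
printf 'line3\n' > "$workdir/demo_b"

# ">" creates the target file, or overwrites it if it exists.
cat "$workdir/demo_a" > "$workdir/demo.output"
# ">>" appends to the end of the target file.
cat "$workdir/demo_b" >> "$workdir/demo.output"

# "|" sends the stdout of cat to the stdin of wc.
# tr removes the padding spaces that some wc versions print.
n_lines=$(cat "$workdir/demo.output" | wc -l | tr -d ' ')
echo "demo.output has $n_lines lines"

rm -r "$workdir"
```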
wc -l my.output
head -n 5 my.output
tail -n 5 my.output
sort -k5,5 my.output
uniq -c my.output
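One caveat for the last command: uniq only collapses identical lines that are adjacent, so the input is normally sorted first. The sketch below uses made-up data in a temporary file to show the difference.

```shell
tmpfile=$(mktemp)
printf 'case\ncontrol\ncase\ncase\ncontrol\n' > "$tmpfile"

# Without sorting, uniq -c reports one row per run of adjacent duplicates.
n_runs=$(uniq -c "$tmpfile" | wc -l | tr -d ' ')

# Sorting first groups identical lines, so each distinct value gets one row.
n_groups=$(sort "$tmpfile" | uniq -c | wc -l | tr -d ' ')

echo "unsorted: $n_runs rows, sorted: $n_groups rows"
rm -f "$tmpfile"
```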
To extract all lines from an uncompressed file that match a string:
grep "my.search.text" my_data
Note: grep is a very powerful tool, and it is worth learning this classic Unix program in more depth.
Take an input text file, search for all lines that match "cas_", and then count the number of matching lines:
grep "cas_" my_data | wc -l
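The same count can also be obtained with grep's -c option, which reports the number of matching lines directly. A small sketch with a made-up input file:

```shell
tmpfile=$(mktemp)
printf 'cas_001 A\ncon_001 B\ncas_002 C\n' > "$tmpfile"

# Pipe form, as in the command above.
count_pipe=$(grep "cas_" "$tmpfile" | wc -l | tr -d ' ')
# Equivalent form: -c makes grep print the number of matching lines.
count_c=$(grep -c "cas_" "$tmpfile")

echo "pipe: $count_pipe, grep -c: $count_c"
rm -f "$tmpfile"
```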
Awk is a full programming language that can also be used as a simple one-line tool for modifying text files quickly. For more information, see here. Some examples follow.
For example, with the following command, we can extract columns 1, 5, 6, and 10 and print them to a new file:
awk '{ print $1,$5,$6,$10 }' my_data > my_condensed_data
With the following command, we print column 2 of the original file first, followed by column 1:
awk '{ print $2,$1 }' my_data > my_rearranged_data
awk '{ print NF }' my_data | sort -k1,1n | uniq -c
# Assigning $1=$1 forces awk to rebuild each line with the tab output separator
awk 'BEGIN { OFS="\t" } { $1=$1; print }' my_data > my_data_w_tabs
For example, with the following command, for each row in the file my_data, we check whether the number of columns equals 9 and column 1 equals 2 before printing that row to a new file my_passing_data.
awk '{ if (NF == 9 && $1 == 2) print $0 }' my_data > my_passing_data
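To see the filter in action, the sketch below runs it on three made-up rows. It uses awk's pattern-only form, which prints every line for which the condition is true and is equivalent to the if ... print $0 form above.

```shell
tmpfile=$(mktemp)
# Three made-up rows: only the first has 9 columns AND column 1 equal to 2.
printf '2 a b c d e f g h\n1 a b c d e f g h\n2 a b c\n' > "$tmpfile"

# A bare condition with no action prints the matching lines.
passing=$(awk 'NF == 9 && $1 == 2' "$tmpfile")
echo "$passing"

rm -f "$tmpfile"
```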
PLINK is a genetics software tool created by Shaun Purcell. The full documentation for PLINK is here. We will only go over a couple of data management commands.
Be aware that a newer and much faster version of PLINK, plink2, is available. Please use it carefully, since it is still in development. (link)
A binary PLINK file consists of three files ending in .fam, .bim, and .bed.
A .fam file consists of 6 columns where each row describes one sample: a Family ID (FID), an Individual ID (IID), a Paternal ID (PID), a Maternal ID (MID), a code for the sex of the sample (1=Male, 2=Female, -9=Unknown), and a code for the phenotype (1=Control, 2=Case, -9=Unknown).
A .bim file consists of 6 columns where each row describes one SNP: chromosome number, SNP ID, genetic distance (usually 0), chromosomal position (in base pairs), allele 1, and allele 2.
A .bed file is a binary file encoding the genotype data corresponding to the samples in the .fam file and the markers in the .bim file.
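Because .fam and .bim are plain text, they can be inspected (read-only) with the tools introduced earlier. The sketch below counts cases in a .fam file and SNPs on chromosome 1 in a .bim file. The file names and contents are made up for illustration.

```shell
workdir=$(mktemp -d)
# Made-up .fam rows (6 columns; the phenotype code is the last column).
printf 'F1 I1 0 0 1 2\nF2 I2 0 0 2 1\nF3 I3 0 0 2 2\n' > "$workdir/demo.fam"
# Made-up .bim rows (6 columns; the chromosome is the first column).
printf '1 rs1 0 1000 A G\n1 rs2 0 2000 C T\n2 rs3 0 1500 G A\n' > "$workdir/demo.bim"

# Count samples coded as cases (column 6 equals 2).
n_cases=$(awk '$6 == 2' "$workdir/demo.fam" | wc -l | tr -d ' ')
# Count SNPs on chromosome 1 (column 1 equals 1).
n_chr1=$(awk '$1 == 1' "$workdir/demo.bim" | wc -l | tr -d ' ')

echo "cases: $n_cases, chr1 SNPs: $n_chr1"
rm -r "$workdir"
```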
NEVER manually edit .fam, .bim, or .bed files. Use the data management commands in PLINK to modify your files.
Use the following command to make a new binary PLINK file where my.original.data is the prefix root before the extensions .bed, .bim, and .fam:
plink --bfile my.original.data --make-bed --out my.new.data --allow-no-sex
You can use the following command to remove samples from a binary PLINK file:
plink --bfile my.original.data --remove dropSamples.txt --make-bed --out my.subset.data --allow-no-sex
Note: PLINK only removes the first instance of the sample name. Therefore, --remove will not work if you want to remove all instances of a sample from your file (e.g. NA12898 appearing multiple times).
For this reason, we recommend always making a text file with the samples you want to keep in your dataset (rather than the ones you want to remove) and using the following command:
plink --bfile my.original.data --keep keepSamples.txt --make-bed --out my.subset.data --allow-no-sex
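A keep file lists one sample per line as FID and IID. One possible way to build it is to filter the .fam file with awk. The sketch below keeps every sample except those with the IID NA12898, using made-up data. The plink call appears only as a comment, since it needs a real fileset.

```shell
workdir=$(mktemp -d)
# Made-up .fam content in which NA12898 appears twice.
printf 'F1 NA12898 0 0 1 2\nF2 NA12899 0 0 2 1\nF3 NA12898 0 0 1 2\n' > "$workdir/demo.fam"

# Write FID and IID for every sample whose IID is not NA12898.
awk '$2 != "NA12898" { print $1, $2 }' "$workdir/demo.fam" > "$workdir/keepSamples.txt"

keep_list=$(cat "$workdir/keepSamples.txt")
echo "$keep_list"

# The list would then be passed to PLINK as in the command above:
# plink --bfile my.original.data --keep keepSamples.txt --make-bed --out my.subset.data --allow-no-sex
rm -r "$workdir"
```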
To update phenotypes, first make a text file with three columns (FID, IID, new phenotype code [1=Control, 2=Case, -9=Unknown]) in which each row corresponds to one sample.
Next, use the following PLINK command to generate a new file set with the phenotype in the .fam corresponding to the new phenotype codes:
plink --bfile my.original.data --pheno newPhenotypes --make-bed --out my.newpheno.data --allow-no-sex
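The three-column phenotype file can be generated from the .fam file rather than typed by hand. The sketch below assumes a hypothetical naming convention in which case IIDs start with "cas_". Adapt the condition to however case status is recorded in your data.

```shell
workdir=$(mktemp -d)
# Made-up .fam content with unknown phenotypes (-9).
printf 'F1 cas_001 0 0 1 -9\nF2 con_001 0 0 2 -9\n' > "$workdir/demo.fam"

# Emit FID, IID, and a phenotype code: 2 if the IID starts with "cas_", else 1.
awk '{ pheno = ($2 ~ /^cas_/) ? 2 : 1; print $1, $2, pheno }' \
    "$workdir/demo.fam" > "$workdir/newPhenotypes"

first_row=$(head -n 1 "$workdir/newPhenotypes")
last_row=$(tail -n 1 "$workdir/newPhenotypes")
echo "$first_row"
echo "$last_row"
rm -r "$workdir"
```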
Some facility with using a command line text editor is required for using Ricopili. Some example software tools are listed below.
Emacs is a text editor used by programmers. For a complete tutorial, see here.
Vim is another text editor used by programmers. For a complete tutorial, see here.
In Vim, only a few commands are needed for basic text-file editing:
i (insert)
esc (escape key)
:w (write file)
:q (quit)
/ (search)
yy (copy line)
p (paste line)
A .gz file is a compressed version of a regular text file. The compressed version takes a lot less disk space than an uncompressed version.
gzip my.file.txt
gunzip my.file.txt.gz
zcat my.file.txt.gz
zless my.file.txt.gz
gunzip -c my.file.txt.gz | cat
gunzip -c my.file.txt.gz | less
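Two patterns are often handy with the commands above: compressing without deleting the original file, and processing a compressed file without writing an uncompressed copy to disk. A minimal sketch with a made-up file:

```shell
workdir=$(mktemp -d)
printf 'a\nb\nc\n' > "$workdir/demo.txt"

# -c writes the compressed data to stdout, so the original file is kept.
gzip -c "$workdir/demo.txt" > "$workdir/demo.txt.gz"

# Record whether the uncompressed original is still present.
original_kept=no
[ -f "$workdir/demo.txt" ] && original_kept=yes

# Count lines by streaming the decompressed text through a pipe.
n_lines=$(gunzip -c "$workdir/demo.txt.gz" | wc -l | tr -d ' ')
echo "original kept: $original_kept, lines: $n_lines"

rm -r "$workdir"
```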
A tar file is an archive that bundles all contents of a directory into one file. This file can also be compressed (a tarball).
tar -cvf my_directory.tar my_directory/
tar -czvf my_directory.tar.gz my_directory/
tar -xvf my_directory.tar
tar -xvzf my_directory.tar.gz
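Before extracting an archive, it can be useful to list its contents with the -t option. The sketch below builds a small tarball from a made-up directory and checks that an expected file is listed.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/demo_dir"
printf 'hello\n' > "$workdir/demo_dir/file.txt"

# -C switches to the parent directory so paths in the archive are relative.
tar -czf "$workdir/demo_dir.tar.gz" -C "$workdir" demo_dir

# -t lists the archive members without extracting anything.
n_matches=$(tar -tzf "$workdir/demo_dir.tar.gz" | grep -c "file.txt")
echo "entries matching file.txt: $n_matches"

rm -r "$workdir"
```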
Awk:
http://www.grymoire.com/Unix/Awk.html
http://www.vectorsite.net/tsawk.html
Regular Expressions:
http://www.regular-expressions.info/
sed:
http://www.panix.com/~elflord/unix/sed.html
grep:
plink --bfile my.original.data --make-bed --out my.new.data --allow-no-sex