Required Computational Skills

Overview

This page contains some useful Unix, awk, and PLINK commands, as well as information on text editors and on working with gzipped files and tar directory archives. It is not intended as complete documentation of these tools. Rather, the examples below cover the minimum you need to know to use this pipeline effectively.

Unix Commands

A more complete guide to working in a Unix environment is available here. You can also run man <command> (e.g. man mkdir) to read a command's manual page directly in the terminal.

Changing the current working directory

Determine the current directory:

pwd

Change the working directory:

cd my_rp_out/

Creating/Removing directories

To create a directory:

mkdir my_new_directory/

To remove a directory (including its subdirectories):

rm -r my_new_directory/

Note: Be careful when using rm! Recovering directories and files in the event of accidental deletion may not be possible.

Showing content

Use the cat command to show the content of a file:

cat my_data

Use the echo command to print text to the terminal:

echo "Hello World"

stdout

Standard output (stdout) is the text or data that a program writes out. By default it is printed to the terminal.

Redirect output

To pass the stdout of one command to another command as its input (stdin), use a pipe "|":

cat my_data | wc -l

To save the stdout of a command to a file, use the redirection operator ">". If the file already exists, it is overwritten:

cat my_data > my.output

To append the stdout of a command to the end of an existing (or new) file, use the append operator ">>":

cat my_data2 >> my.output
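
The three forms above can be tried together in a small sketch. The file contents are made up for illustration, and everything happens in a temporary directory so that no real files are touched:

```shell
# Work in a throwaway directory (assumption: mktemp is available).
workdir=$(mktemp -d)
cd "$workdir"

# Create two small example files with hypothetical content.
printf 'a\nb\nc\n' > my_data
printf 'd\ne\n' > my_data2

# ">" creates my.output, or overwrites it if it already exists.
cat my_data > my.output

# ">>" appends to the end of my.output instead of overwriting it.
cat my_data2 >> my.output

# "|" hands the text to wc -l, which counts the lines.
n_lines=$(cat my.output | wc -l)
echo "my.output has $n_lines lines"
```

Running the ">" line a second time would reset my.output to the three lines of my_data, which is why ">>" is the safer choice when collecting output from several commands.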

Count the number of lines in a file

wc -l my.output

Print the first N (here 5) lines of a file

head -n 5 my.output

Print the last N (here 5) lines of a file

tail -n 5 my.output

Sort the file on the Nth (here 5th) column (see further down for usage in context)

sort -k5,5 my.output

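Print only unique rows and show how often each one occurs in a file. Note that uniq only merges identical lines that are adjacent, so the input usually has to be sorted first (see further down for usage in context)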

uniq -c my.output
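
Because uniq only compares neighbouring lines, it is usually combined with sort. The following sketch uses a made-up file of labels (labels.txt is a hypothetical name) to show the difference:

```shell
# Work in a throwaway directory (assumption: mktemp is available).
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical file: one label per line, duplicates are not adjacent.
printf 'case\ncontrol\ncase\ncontrol\ncase\n' > labels.txt

# Without sorting, no two neighbouring lines are equal,
# so uniq reports five separate groups.
unsorted_groups=$(uniq -c labels.txt | wc -l)

# After sorting, identical lines are adjacent and are counted together.
# The second awk only reshapes "count label" into "label=count".
sorted_counts=$(sort labels.txt | uniq -c | awk '{ print $2 "=" $1 }' | paste -sd, -)

echo "groups without sort: $unsorted_groups"
echo "counts with sort: $sorted_counts"
```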

Print all lines of a file matching a string

To extract all lines from an uncompressed file that match a string:

grep "my.search.text" my_data

Note: grep is a very powerful tool, and it is worth learning this classic Unix program in more depth.

Stringing Multiple Commands Together

Take an input text file, search for all lines that match "cas_", and then count the number of matching lines:

grep "cas_" my_data | wc -l
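
Pipes can chain more than two commands. As a sketch, assume a made-up my_data file with a sample ID in column 1 and a site name in column 2 (both hypothetical); the last pipeline counts the "cas_" samples per site:

```shell
# Work in a throwaway directory (assumption: mktemp is available).
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical input: sample ID and site, separated by a space.
printf 'cas_001 siteA\ncon_001 siteA\ncas_002 siteB\ncas_003 siteA\n' > my_data

# Two-step pipeline: filter, then count lines.
n_pipe=$(grep "cas_" my_data | wc -l)

# grep -c counts the matching lines directly, without wc.
n_direct=$(grep -c "cas_" my_data)

# Longer pipeline: keep "cas_" lines, take column 2, sort, count per site.
per_site=$(grep "cas_" my_data | awk '{ print $2 }' | sort | uniq -c \
  | awk '{ print $2 "=" $1 }' | paste -sd, -)

echo "cases: $n_pipe (pipe), $n_direct (grep -c); per site: $per_site"
```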

Awk Commands

Awk is a full programming language that can also be used as a simple one-line tool for modifying text files quickly. For more information, see here. Here are some examples.

Printing a subset of columns from a file

For example, the following command extracts columns 1, 5, 6, and 10 and writes them to a new file:

awk '{ print $1,$5,$6,$10 }' my_data > my_condensed_data

Rearranging the order of columns in a file

For example, the following command prints column 2 of the original file first, followed by column 1:

awk '{ print $2,$1 }' my_data > my_rearranged_data

Counting the number of columns in each row of a file and showing how often each column count occurs

awk '{ print NF }' my_data | sort -k1,1n | uniq -c

Changing the column delimiter in a file

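The following command rewrites each line with tabs between the columns. Assigning a field to itself ($1 = $1) forces awk to rebuild the line using the output field separator (OFS); printing $0 without such an assignment would leave the original delimiters unchanged.

awk 'BEGIN { OFS="\t" } { $1 = $1; print }' my_data > my_data_w_tabs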
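
To check the effect of OFS, here is a small sketch on a made-up, space-separated two-line file. It compares printing $0 untouched with printing after a field assignment and then counts the tabs in the result:

```shell
# Work in a throwaway directory (assumption: mktemp is available).
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical space-separated input.
printf 'rs1 1 1000\nrs2 1 2000\n' > my_data

# Printing $0 unchanged keeps the original spaces, even with OFS set.
unchanged=$(awk 'BEGIN { OFS="\t" } { print $0 }' my_data | head -n 1)

# Assigning a field to itself makes awk rebuild the line using OFS.
awk 'BEGIN { OFS="\t" } { $1 = $1; print }' my_data > my_data_w_tabs

# Count the tab characters on the first output line (3 columns -> 2 tabs).
n_tabs=$(head -n 1 my_data_w_tabs | tr -cd '\t' | wc -c)

echo "unchanged line: $unchanged"
echo "tabs in converted line: $n_tabs"
```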

Extracting lines from a file that meet certain conditions

For example, with the following command, for each row in the file my_data, we check whether the number of columns equals 9 and column 1 equals 2 before printing that row to a new file my_passing_data.

awk '{ if (NF == 9 && $1 == 2) print $0 }' my_data > my_passing_data
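
The same filter can be written in awk's shorter pattern form, where a condition without an action prints every line for which it is true. The sketch below uses a made-up four-line input and also writes the rejected rows to a second file (my_failing_data is a hypothetical name):

```shell
# Work in a throwaway directory (assumption: mktemp is available).
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical input: rows 1 and 4 have 9 columns and start with 2,
# row 2 starts with 1, row 3 has only 8 columns.
printf '2 a b c d e f g h\n1 a b c d e f g h\n2 a b c d e f g\n2 x b c d e f g h\n' > my_data

# Pattern-only form: lines matching the condition are printed.
awk 'NF == 9 && $1 == 2' my_data > my_passing_data

# Negating the condition collects the rows that were filtered out.
awk '!(NF == 9 && $1 == 2)' my_data > my_failing_data

n_pass=$(wc -l < my_passing_data)
n_fail=$(wc -l < my_failing_data)
echo "passing: $n_pass, failing: $n_fail"
```

Keeping the failing rows in their own file makes it easy to check why they were dropped.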

PLINK Commands

PLINK is a genetics software tool created by Shaun Purcell. The full documentation for PLINK is here. We will only go over a couple of data management commands.

Be aware that a newer and much faster version of PLINK, plink2, is available (link). Use it with care, since it is still in development.

Binary PLINK file format

A binary PLINK file consists of three files ending in .fam, .bim, and .bed.

    • A .fam file consists of 6 columns where each row describes one sample: a Family ID (FID), an Individual ID (IID), a Paternal ID (PID), a Maternal ID (MID), a code for the sex of the sample (1=Male, 2=Female, -9=Unknown), and a code for the phenotype (1=Control, 2=Case, -9=Unknown).

    • A .bim file consists of 6 columns where each row describes one SNP: chromosome number, SNP ID, genetic distance (usually 0), chromosomal position (in base pairs), allele 1, and allele 2.

    • A .bed file is a binary file encoding the genotype data corresponding to the samples in the .fam file and the markers in the .bim file.

NEVER manually edit .fam, .bim, or .bed files. Use the data management commands in PLINK to modify your files.
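
Reading these files is safe, and the .fam and .bim files are plain text, so the Unix and awk commands above can be used to inspect them. As a sketch, the following builds a small made-up .fam file (the sample IDs are hypothetical) and summarises it without changing it:

```shell
# Work in a throwaway directory (assumption: mktemp is available).
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical .fam content: FID IID PID MID sex phenotype.
printf 'FAM1 IND1 0 0 1 2\nFAM2 IND2 0 0 2 1\nFAM3 IND3 0 0 2 2\nFAM4 IND4 0 0 1 -9\n' > my.original.data.fam

# Sanity check: every row should have exactly 6 columns.
bad_rows=$(awk 'NF != 6' my.original.data.fam | wc -l)

# Count samples by phenotype code in column 6.
n_cases=$(awk '$6 == 2' my.original.data.fam | wc -l)
n_controls=$(awk '$6 == 1' my.original.data.fam | wc -l)
n_unknown=$(awk '$6 != 1 && $6 != 2' my.original.data.fam | wc -l)

echo "cases=$n_cases controls=$n_controls unknown=$n_unknown malformed=$bad_rows"
```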

Making a new PLINK file

Use the following command to make a new binary PLINK file, where my.original.data is the prefix root before the extensions .bed, .bim, and .fam:

plink --bfile my.original.data --make-bed --out my.new.data --allow-no-sex

Removing samples

You can use the following command to remove samples from a binary PLINK file, where dropSamples.txt is a text file listing the FID and IID of one sample per line:

plink --bfile my.original.data --remove dropSamples.txt --make-bed --out my.subset.data --allow-no-sex

Note: PLINK only removes the first instance of the sample name. As a result, --remove will not work if you want to remove all instances of a sample from your file (e.g. NA12898 appearing multiple times).

For this reason, we recommend always making a text file with the samples you want to keep in your dataset (rather than the ones you want to remove) and using the following command:

plink --bfile my.original.data --keep keepSamples.txt --make-bed --out my.subset.data --allow-no-sex
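
One possible way to build keepSamples.txt is to derive it from the .fam file with awk, leaving out the samples listed in dropSamples.txt. This is only a sketch: the sample IDs are made up, and it assumes both files hold FID in column 1 and IID in column 2 and that dropSamples.txt is not empty:

```shell
# Work in a throwaway directory (assumption: mktemp is available).
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical inputs: a 3-sample .fam file and one sample to exclude.
printf 'FAM1 IND1 0 0 1 2\nFAM2 IND2 0 0 2 1\nFAM3 IND3 0 0 2 2\n' > my.original.data.fam
printf 'FAM2 IND2\n' > dropSamples.txt

# While reading the first file (NR == FNR), remember each FID/IID pair.
# While reading the .fam file, print FID and IID of every pair not remembered.
awk 'NR == FNR { drop[$1, $2] = 1; next }
     !(($1, $2) in drop) { print $1, $2 }' \
    dropSamples.txt my.original.data.fam > keepSamples.txt

n_keep=$(wc -l < keepSamples.txt)
echo "keeping $n_keep samples"
```

Because every row of the .fam file is tested, a sample that appears more than once is excluded each time it appears.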

Updating phenotypes

To update phenotypes, first make a text file with three columns (FID, IID, new phenotype code [1=Control, 2=Case, -9=Unknown]), where each row corresponds to one sample.

Next, use the following PLINK command to generate a new file set with the phenotype in the .fam corresponding to the new phenotype codes:

plink --bfile my.original.data --pheno newPhenotypes --make-bed --out my.newpheno.data --allow-no-sex
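
The three-column phenotype file can also be generated with awk instead of being typed by hand. The sketch below assumes, purely for illustration, that case IIDs start with "cas_" and control IIDs with "con_"; this naming convention and the sample IDs are hypothetical:

```shell
# Work in a throwaway directory (assumption: mktemp is available).
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical .fam content with unknown phenotypes (-9) in column 6.
printf 'FAM1 cas_001 0 0 1 -9\nFAM2 con_001 0 0 2 -9\nFAM3 unk_001 0 0 2 -9\n' > my.original.data.fam

# Write FID, IID and a phenotype code derived from the IID prefix:
# 2 for "cas_", 1 for "con_", -9 for everything else.
awk '{
       pheno = -9
       if ($2 ~ /^cas_/) pheno = 2
       else if ($2 ~ /^con_/) pheno = 1
       print $1, $2, pheno
     }' my.original.data.fam > newPhenotypes

cat newPhenotypes
```

The .fam file itself is only read here; the new codes reach the data set through the PLINK --pheno command.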

Text Editors

Some familiarity with a command-line text editor is required for using Ricopili. Some example tools are listed below.

Emacs

Emacs is a text editor used by programmers. For a complete tutorial, see here.

Vim

Vim is another text editor used by programmers. For a complete tutorial, see here.

Only a few commands are needed for basic text-file editing:

  • i (insert)

  • esc (escape key)

  • :w (write file)

  • :q (quit)

  • / (search)

  • yy (copy line)

  • p (paste line)

Working with .gz files

A .gz file is a compressed version of a regular text file. The compressed version takes a lot less disk space than an uncompressed version.

Creating a .gz file

gzip my.file.txt

Decompressing a .gz file

gunzip my.file.txt.gz

Using the cat command on a .gz file

zcat my.file.txt.gz

Using the less command on a .gz file

zless my.file.txt.gz

The last two commands might not work in your setup. In that case, decompress to stdout and pipe the output:

gunzip -c my.file.txt.gz | cat

gunzip -c my.file.txt.gz | less
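
The same pattern works with any of the commands above, so a compressed file can be searched or counted without writing an uncompressed copy to disk. A sketch with a made-up three-line file:

```shell
# Work in a throwaway directory (assumption: mktemp is available).
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical text file with one sample ID per line.
printf 'cas_001\ncon_001\ncas_002\n' > my.file.txt

# gzip replaces my.file.txt with my.file.txt.gz.
gzip my.file.txt

# Decompress to stdout and count the lines matching "cas_".
n_cases=$(gunzip -c my.file.txt.gz | grep -c "cas_")

# Only the compressed file exists on disk afterwards.
ls
echo "lines matching cas_: $n_cases"
```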

Working with .tar files

A tar file bundles all contents of a directory into a single archive file. The archive can also be compressed, in which case it is often called a tarball.

Creating a directory archive

tar -cvf my_directory.tar my_directory/

Creating a compressed directory archive

tar -czvf my_directory.tar.gz my_directory/

Extracting a .tar file

tar -xvf my_directory.tar

Extracting a .tar.gz file

tar -xvzf my_directory.tar.gz
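
Before extracting an archive, it can be useful to list its contents with the -t option, and to extract into a separate directory with -C so that existing files are not overwritten. A sketch with a made-up directory (notes.txt and restored are hypothetical names):

```shell
# Work in a throwaway directory (assumption: mktemp is available).
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical directory with one small file.
mkdir my_directory
printf 'hello\n' > my_directory/notes.txt

# Create a compressed archive (no -v, to keep the output quiet).
tar -czf my_directory.tar.gz my_directory/

# -t lists the archive contents without extracting anything.
listing=$(tar -tzf my_directory.tar.gz)
echo "$listing"

# -C extracts into another directory, which must already exist.
mkdir restored
tar -xzf my_directory.tar.gz -C restored
restored_text=$(cat restored/my_directory/notes.txt)
```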

Further weblinks / tutorials

Awk:

http://www.grymoire.com/Unix/Awk.html

http://www.vectorsite.net/tsawk.html

Regular Expressions:

http://www.regular-expressions.info/

sed:

http://www.panix.com/~elflord/unix/sed.html

grep:

http://www.panix.com/~elflord/unix/grep.html
