re-doing enrichment analyses

Post date: Dec 10, 2014 11:18:23 PM

Using MySQL and (modifications of) Victors new scripts.

1. Ran create_load_annotations_db.sh to create the melissa_genome MySQL database.

2. Ran mysql_convert2innodb.sh to use a different engine for this.

3. Structural annotations from projects/lycaeides_hostplant/struct_annot/

All annotations within 1000 bp of each SNP (more than one could be reported because -c 0)

perl 01_retrieve_structural_annotations.pl -i SnpList.txt -o structOut1kb.txt -d 1000 -c 0 -db melissa_genome

Annotations were present for 58760 of 206047 SNPs.

All annotations within 0 bp of each SNP

perl 01_retrieve_structural_annotations.pl -i SnpList.txt -o structOut.txt -d 0 -c 0 -db melissa_genome

Annotations were present for 44965 of 206047 SNPs.

NOTE, the new script does not pick-up repeat regions from repeatmasker, so I will keep the original structural annotation results for structural enrichments. I will use the 1000 bp results for functional enrichments.

Functional annotations: these analyses and results are in projects/lycaeides_hostplant/funcAnnot/

1. Use the perl wrapper runt_extract_GO_IPR.pl (I wrote this) to run Victor's script extract_GO_IPR.pl. This does not depend on the SNP data set but rather extracts the GO terms and InterPro domains for all identified proteins. The original annotations it uses are in /home/zgompert/labs/evolution/data/lycaeides/melissa_genome/Annotation/functional_annotations/. I then write the scaffold specific information to proteins/prot*. I concatenated the information in these prot*txt files into a single file, prot.txt.

2. Retrieve the functional annotation for each SNP,

perl 02_retrieve_functional_annotations.pl -a ../struct_annot/structOut.txt -f funcannots.txt -o snpFuncAnnots.txt

3. Generate a list of the number of SNPs with each GO term (molecular function only) for the whole genome. I chose not to incorporate the GO term hierarchy.

perl 03_table_SNPs_GO.pl - snpFuncAnnots.txt -gom remote -mf

4. Considering the top 0.1% of SNPs from the GWA analysis, generate similar SNP GO counts for these. The infiles for 03_table_SNPs_GO.pl are the snplist*txt files in projects/lycaeides_hostplant/funcAnnot/testSnpSets/. I ran run_03_table_SNPs_GO.pl to run the count scrip on each of these as described above. The results are the snplists*molfun.dsv files in the same directory.

5. Test for significant enrichments at the SNP-level with no hierarchy for all GO terms that show up 2 or more times within a SNP set (treatment by response variable). P-values are based on 1000 permutations and I used FDR to control for false-discoveries.

The complete results are in individual files in projects/lycaeides_hostplant/funcAnnot/enrichment/. completeListOfFunctions.txt in the funcAnnot directory contains the counts of each GO term across all conditions (i.e. this is the complete set of GO terms for the top SNPs). Tables summarizing these results are in the manuscript.

###########################################

Here are Victors notes (see /home/zgompert/Annotations/scripts for his original scripts):

# Functional enrichment tests

# -----------------------------------------------------------------------------

There are some previous requirements:

- a mysql database with structural annotations (loaded and accessed using

Bio::DB::SeqFeature::Store) -> see section titled "Setting up database

with annotations" below

- a table with the functional annotations for proteins/genes in the form

of GO terms and InterPRO accession number (although the last one is not

really used) -> see section title "Summarise functional annotations

(GO terms and Interpro accn)" below

Optional, but recommended:

- a mysql local copy of the GO term database (only termdb)

This protocol allows comparing the counts of GO terms in two lists of SNPs

and find out if there are any GO terms significantly enriched in one of the list

relative to the other. For clarity, 'test' is the list of SNPs for which we want

to know if there is any enrichment in any particular function, whereas 'background'

is the list of SNPs used to estimate a null distribution for randomization tests.

The pipeline consists of four steps. The first three steps must be run

for both the test and the background lists of SNPs.

1.- 01_retrieve_structural_annotations.pl

Retrieve all structural annotations (genes, CDS, UTRs, TEs) within

a certain distance of each SNP and output a table summarising the info

(each row per SNP). There is an option to report only the closest gene

instead of all genes within the specified distance. This option will

also affect the report of distances to CDS, UTRs, and TEs (only

the distance to the nearest ones will be reported)

Syntax:

01_retrieve_structural_annotations.pl

-i <SNPs input file>

-o <output file>

-n <number of cpus> (optional, default=2)

-d <max. distance to seek genes, CDS, UTRs, and TEs> (optional, default=0)

-db <database name> (optional, default=tcris_draft_02)

-c Report only the nearest genes, CDS, UTRs, and TEs (default=0)

Output:

16 columns:

1 scaf -> scaffold

2 pos -> position

#3 lg -> linkage group

#4 ord -> order (in linkage group)

5 gene -> gene/s

(there may be several genes separated by '|' if there

are several genes within the specified distance)

6 aed -> annotation edit distance/s (measure quality of annotation)

(there may be several values separated by '|' if there

are several genes within the specified distance)

7 gstart -> start position of gene/s

(there may be several values separated by '|' if there

are several genes within the specified distance)

8 gstop -> end position of gene/s

(there may be several values separated by '|' if there

are several genes within the specified distance)

9 gdist -> distance to gene/s

(there may be several values separated by '|' if there

are several genes within the specified distance)

10 ongene -> is SNP within a gene? 0=no, 1=yes

11 cdist -> distance to CDS/s (exons actually)

(there may be several values separated by '|' if there

are several exon/s within the specified distance)

12 oncds -> is SNP within a coding region? 0=no, 1=yes

13 udist -> distance to UTR/s

(there may be several values separated by '|' if there

are several UTRs within the specified distance)

14 onutr -> is SNP within a UTR? 0=no, 1=yes

15 tdist -> distance to TE/s (transposable element)

(there may be several values separated by '|' if there

are several TEs within the specified distance)

16 onte -> is SNP within a TE? 0=no, 1=yes

2.- 02_retrieve_functional_annotations.pl

Retrieve all functional annotations of the genes reported for every SNP

in the previous step. It will look for

Syntax:

02_retrieve_functional_annotations.pl

-a <input file: SNPs structural annotations>

-f <input file: genes functional annotations>

-o <output file>

Output:

10 columns:

1 scaf -> scaffold

2 pos -> position

3 lg -> linkage group

4 ord -> order (in linkage group)

5 gene -> gene/s

(there may be several genes separated by '|' if there

are several genes within the specified distance)

6 gdist -> distance to gene/s

(there may be several values separated by '|' if there

are several genes within the specified distance)

7 GO -> gene ontology term

8 GO_text -> gene ontology term description

9 interpro -> InterPro accession number

10 interpro_text -> InterPro accession number description

3.- 03_table_SNPs_GO.pl

Summarise the GO terms present in the list of SNPs and output a table

of GO terms sorted by number of counts

Syntax:

03_table_SNPs_GO.pl

-a <input file: SNPs functional annotations>

-hier consider GO hierarchy when counting GOs (optional)

-gom <local|remote> GO database mirror (default=local)

-bp output biological process GOs (optional)

-cc output cellular component GOs (optional)

-mf output molecular function GOs

-all output also parental GOs not on the list (optional)

Output:

1-3 files, depending on GO families (bp, cc, mf)

Each file contains:

6 commented lines:

Total number of unique GO terms

Total number of unique molecular function GO terms

Total number of GO terms

Total number of molecular function GO terms

Total number of SNPs with at least one GO term

Total number of SNPs with at least one molecular function GO term

followed by a table with 3 columns: GO, nSNPs, description

4.- 04_randomization_test.R

Estimate a null frequency distribution for each GO term by sampling

SNPs without replacement from the background list

Syntax:

04_randomization_test.R

There are 8 arguments that must be passed in order:

1.- working directory

2.- SNPs GO table - test

3.- SNPs GO table - background

4.- output

5.- minimum number of SNPs required for testing

6.- number of replicates

7.- number of cores

8.- type of GO term (bp, cc, mf, all)

Output:

9 columns:

1 GO

2 GO.text

3 nobs

4 totobs

5 nexp

6 totexp

7 pobs

8 pexp

9 pval

# -----------------------------------------------------------------------------

# =============================================================================

# Setting up database with annotations

# -----------------------------------------------------------------------------

create_load_annotations_db.sh

This script will create a mysql database and will load the annotations

from GFF3 files

It requires BioPerl (Bio::DB::SeqFeature::Store module) and mysql

The following variables must be edited :

ANNOT_DIR -> directory with GFF3 files (they must have a .gff extension)

DBHOST -> database host (e.g. localhost if it is local)

DBPORT -> port to access database over the network

DBUSER -> mysql user (with permissions to create at least this database)

User password is securely introduced when executed

It will also add an extra table for the translation of scaffold names

among draft versions (i.e. with different linkage groups and order).

It currently (Dec2014) works the 3 versions of T. cristinae genome.

This will need to be customised accordingly for future new drafts or

other species.

mysql_convert2innodb.sh

This will change the storage engine used by mysql to InnoDB, which is

the default engine for mysql 5.5+ and it is generally faster than

the previous default engine (MyISAM)

The following variables must be edited :

DBHOST -> database host (e.g. localhost if it is local)

DBPORT -> port to access database over the network

DBUSER -> mysql user (with permissions to create at least this database)

User password is securely introduced when executed

# Tweaking of mysql configuration:

Locate file my.cnf (/etc/my.cnf in Scientific Linux 6.x, based on CentOS) and

edit accordingly. The following one is the configuration for diogenes

(12 cores, 32GB RAM) after changing the default data directory

from /var/lib/mysql/ to /home/mysql

# The MySQL server

[mysqld]

port = 3306

datadir=/home/mysql

socket=/home/mysql/mysql.sock

symbolic-links=0

skip-external-locking

key_buffer_size = 4G

max_allowed_packet = 1M

table_open_cache = 512

sort_buffer_size = 2M

read_buffer_size = 2M

read_rnd_buffer_size = 8M

myisam_sort_buffer_size = 64M

thread_cache_size = 8

query_cache_size = 32M

# Try number of CPU's*2 for thread_concurrency

thread_concurrency = 24

max_connections=1000

# Replication Master Server (default)

# binary logging is required for replication

log-bin=mysql-bin

# required unique id between 1 and 2^32 - 1

# defaults to 1 if master-host is not set

# but will not function as a master if omitted

server-id = 1

# Uncomment the following if you are using InnoDB tables

innodb_data_home_dir = /home/mysql

innodb_data_file_path = ibdata1:2000M;ibdata2:10M:autoextend

innodb_log_group_home_dir = /home/mysql

# You can set .._buffer_pool_size up to 50 - 80 %

# of RAM but beware of setting memory usage too high

innodb_buffer_pool_size = 20G

innodb_additional_mem_pool_size = 1G

# Set .._log_file_size to 25 % of buffer pool size

innodb_log_file_size = 5G

innodb_log_buffer_size = 8M

innodb_flush_log_at_trx_commit = 1

innodb_lock_wait_timeout = 50

innodb_thread_concurrency = 0

[mysqldump]

quick

max_allowed_packet = 16M

[mysql]

no-auto-rehash

# Remove the next comment character if you are not familiar with SQL

safe-updates

[myisamchk]

key_buffer_size = 2G

sort_buffer_size = 256M

read_buffer = 2M

write_buffer = 2M

[mysqlhotcopy]

interactive-timeout

[mysqld_safe]

log-error=/var/log/mysqld.log

pid-file=/var/run/mysqld/mysqld.pid

# -----------------------------------------------------------------------------

# Summarise functional annotations (GO terms and Interpro accn)

# -----------------------------------------------------------------------------

extract_GO_IPR.pl

This script will collect GO terms and InterPRO accession numbers from multiple

matches using different databases (Pfam, Panther, etc.) after running

InterProScan. It reports a list of unique GO terms and Interpro accession

numbers (if there is a GO term in multiple matches, it is reported only once),

but does not account for GO terms hierarchy (i.e. a gene can have GO

terms that are parent and child).

Syntax:

extract_GO_IPR.pl

-i <input file>

-o <output file>

-a <directory with protein annotations (GFF files)>

-gom <GO database mirror (local|remote)> (optional, default: local)

-pdb <database> (extract GO terms from one particular database only

[e.g. Pfam]; optional, default: none

# -----------------------------------------------------------------------------