Post date: Dec 10, 2014 11:18:23 PM
Using MySQL and (modifications of) Victors new scripts.
1. Ran create_load_annotations_db.sh to create the melissa_genome MySQL database.
2. Ran mysql_convert2innodb.sh to use a different engine for this.
3. Structural annotations from projects/lycaeides_hostplant/struct_annot/
All annotations within 1000 bp of each SNP (more than one could be reported because -c 0)
perl 01_retrieve_structural_annotations.pl -i SnpList.txt -o structOut1kb.txt -d 1000 -c 0 -db melissa_genome
Annotations were present for 58760 of 206047 SNPs.
All annotations within 0 bp of each SNP
perl 01_retrieve_structural_annotations.pl -i SnpList.txt -o structOut.txt -d 0 -c 0 -db melissa_genome
Annotations were present for 44965 of 206047 SNPs.
NOTE, the new script does not pick-up repeat regions from repeatmasker, so I will keep the original structural annotation results for structural enrichments. I will use the 1000 bp results for functional enrichments.
Functional annotations: these analyses and results are in projects/lycaeides_hostplant/funcAnnot/
1. Use the perl wrapper runt_extract_GO_IPR.pl (I wrote this) to run Victor's script extract_GO_IPR.pl. This does not depend on the SNP data set but rather extracts the GO terms and InterPro domains for all identified proteins. The original annotations it uses are in /home/zgompert/labs/evolution/data/lycaeides/melissa_genome/Annotation/functional_annotations/. I then write the scaffold specific information to proteins/prot*. I concatenated the information in these prot*txt files into a single file, prot.txt.
2. Retrieve the functional annotation for each SNP,
perl 02_retrieve_functional_annotations.pl -a ../struct_annot/structOut.txt -f funcannots.txt -o snpFuncAnnots.txt
3. Generate a list of the number of SNPs with each GO term (molecular function only) for the whole genome. I chose not to incorporate the GO term hierarchy.
perl 03_table_SNPs_GO.pl - snpFuncAnnots.txt -gom remote -mf
4. Considering the top 0.1% of SNPs from the GWA analysis, generate similar SNP GO counts for these. The infiles for 03_table_SNPs_GO.pl are the snplist*txt files in projects/lycaeides_hostplant/funcAnnot/testSnpSets/. I ran run_03_table_SNPs_GO.pl to run the count scrip on each of these as described above. The results are the snplists*molfun.dsv files in the same directory.
5. Test for significant enrichments at the SNP-level with no hierarchy for all GO terms that show up 2 or more times within a SNP set (treatment by response variable). P-values are based on 1000 permutations and I used FDR to control for false-discoveries.
The complete results are in individual files in projects/lycaeides_hostplant/funcAnnot/enrichment/. completeListOfFunctions.txt in the funcAnnot directory contains the counts of each GO term across all conditions (i.e. this is the complete set of GO terms for the top SNPs). Tables summarizing these results are in the manuscript.
###########################################
Here are Victors notes (see /home/zgompert/Annotations/scripts for his original scripts):
# Functional enrichment tests
# -----------------------------------------------------------------------------
There are some previous requirements:
- a mysql database with structural annotations (loaded and accessed using
Bio::DB::SeqFeature::Store) -> see section titled "Setting up database
with annotations" below
- a table with the functional annotations for proteins/genes in the form
of GO terms and InterPRO accession number (although the last one is not
really used) -> see section title "Summarise functional annotations
(GO terms and Interpro accn)" below
Optional, but recommended:
- a mysql local copy of the GO term database (only termdb)
This protocol allows comparing the counts of GO terms in two lists of SNPs
and find out if there are any GO terms significantly enriched in one of the list
relative to the other. For clarity, 'test' is the list of SNPs for which we want
to know if there is any enrichment in any particular function, whereas 'background'
is the list of SNPs used to estimate a null distribution for randomization tests.
The pipeline consists of four steps. The first three steps must be run
for both the test and the background lists of SNPs.
1.- 01_retrieve_structural_annotations.pl
Retrieve all structural annotations (genes, CDS, UTRs, TEs) within
a certain distance of each SNP and output a table summarising the info
(each row per SNP). There is an option to report only the closest gene
instead of all genes within the specified distance. This option will
also affect the report of distances to CDS, UTRs, and TEs (only
the distance to the nearest ones will be reported)
Syntax:
01_retrieve_structural_annotations.pl
-i <SNPs input file>
-o <output file>
-n <number of cpus> (optional, default=2)
-d <max. distance to seek genes, CDS, UTRs, and TEs> (optional, default=0)
-db <database name> (optional, default=tcris_draft_02)
-c Report only the nearest genes, CDS, UTRs, and TEs (default=0)
Output:
16 columns:
1 scaf -> scaffold
2 pos -> position
#3 lg -> linkage group
#4 ord -> order (in linkage group)
5 gene -> gene/s
(there may be several genes separated by '|' if there
are several genes within the specified distance)
6 aed -> annotation edit distance/s (measure quality of annotation)
(there may be several values separated by '|' if there
are several genes within the specified distance)
7 gstart -> start position of gene/s
(there may be several values separated by '|' if there
are several genes within the specified distance)
8 gstop -> end position of gene/s
(there may be several values separated by '|' if there
are several genes within the specified distance)
9 gdist -> distance to gene/s
(there may be several values separated by '|' if there
are several genes within the specified distance)
10 ongene -> is SNP within a gene? 0=no, 1=yes
11 cdist -> distance to CDS/s (exons actually)
(there may be several values separated by '|' if there
are several exon/s within the specified distance)
12 oncds -> is SNP within a coding region? 0=no, 1=yes
13 udist -> distance to UTR/s
(there may be several values separated by '|' if there
are several UTRs within the specified distance)
14 onutr -> is SNP within a UTR? 0=no, 1=yes
15 tdist -> distance to TE/s (transposable element)
(there may be several values separated by '|' if there
are several TEs within the specified distance)
16 onte -> is SNP within a TE? 0=no, 1=yes
2.- 02_retrieve_functional_annotations.pl
Retrieve all functional annotations of the genes reported for every SNP
in the previous step. It will look for
Syntax:
02_retrieve_functional_annotations.pl
-a <input file: SNPs structural annotations>
-f <input file: genes functional annotations>
-o <output file>
Output:
10 columns:
1 scaf -> scaffold
2 pos -> position
3 lg -> linkage group
4 ord -> order (in linkage group)
5 gene -> gene/s
(there may be several genes separated by '|' if there
are several genes within the specified distance)
6 gdist -> distance to gene/s
(there may be several values separated by '|' if there
are several genes within the specified distance)
7 GO -> gene ontology term
8 GO_text -> gene ontology term description
9 interpro -> InterPro accession number
10 interpro_text -> InterPro accession number description
3.- 03_table_SNPs_GO.pl
Summarise the GO terms present in the list of SNPs and output a table
of GO terms sorted by number of counts
Syntax:
03_table_SNPs_GO.pl
-a <input file: SNPs functional annotations>
-hier consider GO hierarchy when counting GOs (optional)
-gom <local|remote> GO database mirror (default=local)
-bp output biological process GOs (optional)
-cc output cellular component GOs (optional)
-mf output molecular function GOs
-all output also parental GOs not on the list (optional)
Output:
1-3 files, depending on GO families (bp, cc, mf)
Each file contains:
6 commented lines:
Total number of unique GO terms
Total number of unique molecular function GO terms
Total number of GO terms
Total number of molecular function GO terms
Total number of SNPs with at least one GO term
Total number of SNPs with at least one molecular function GO term
followed by a table with 3 columns: GO, nSNPs, description
4.- 04_randomization_test.R
Estimate a null frequency distribution for each GO term by sampling
SNPs without replacement from the background list
Syntax:
04_randomization_test.R
There are 8 arguments that must be passed in order:
1.- working directory
2.- SNPs GO table - test
3.- SNPs GO table - background
4.- output
5.- minimum number of SNPs required for testing
6.- number of replicates
7.- number of cores
8.- type of GO term (bp, cc, mf, all)
Output:
9 columns:
1 GO
2 GO.text
3 nobs
4 totobs
5 nexp
6 totexp
7 pobs
8 pexp
9 pval
# -----------------------------------------------------------------------------
# =============================================================================
# =============================================================================
# =============================================================================
# Setting up database with annotations
# -----------------------------------------------------------------------------
create_load_annotations_db.sh
This script will create a mysql database and will load the annotations
from GFF3 files
It requires BioPerl (Bio::DB::SeqFeature::Store module) and mysql
The following variables must be edited :
ANNOT_DIR -> directory with GFF3 files (they must have a .gff extension)
DBHOST -> database host (e.g. localhost if it is local)
DBPORT -> port to access database over the network
DBUSER -> mysql user (with permissions to create at least this database)
User password is securely introduced when executed
It will also add an extra table for the translation of scaffold names
among draft versions (i.e. with different linkage groups and order).
It currently (Dec2014) works the 3 versions of T. cristinae genome.
This will need to be customised accordingly for future new drafts or
other species.
mysql_convert2innodb.sh
This will change the storage engine used by mysql to InnoDB, which is
the default engine for mysql 5.5+ and it is generally faster than
the previous default engine (MyISAM)
The following variables must be edited :
DBHOST -> database host (e.g. localhost if it is local)
DBPORT -> port to access database over the network
DBUSER -> mysql user (with permissions to create at least this database)
User password is securely introduced when executed
# Tweaking of mysql configuration:
Locate file my.cnf (/etc/my.cnf in Scientific Linux 6.x, based on CentOS) and
edit accordingly. The following one is the configuration for diogenes
(12 cores, 32GB RAM) after changing the default data directory
from /var/lib/mysql/ to /home/mysql
# The MySQL server
[mysqld]
port = 3306
datadir=/home/mysql
socket=/home/mysql/mysql.sock
symbolic-links=0
skip-external-locking
key_buffer_size = 4G
max_allowed_packet = 1M
table_open_cache = 512
sort_buffer_size = 2M
read_buffer_size = 2M
read_rnd_buffer_size = 8M
myisam_sort_buffer_size = 64M
thread_cache_size = 8
query_cache_size = 32M
# Try number of CPU's*2 for thread_concurrency
thread_concurrency = 24
max_connections=1000
# Replication Master Server (default)
# binary logging is required for replication
log-bin=mysql-bin
# required unique id between 1 and 2^32 - 1
# defaults to 1 if master-host is not set
# but will not function as a master if omitted
server-id = 1
# Uncomment the following if you are using InnoDB tables
innodb_data_home_dir = /home/mysql
innodb_data_file_path = ibdata1:2000M;ibdata2:10M:autoextend
innodb_log_group_home_dir = /home/mysql
# You can set .._buffer_pool_size up to 50 - 80 %
# of RAM but beware of setting memory usage too high
innodb_buffer_pool_size = 20G
innodb_additional_mem_pool_size = 1G
# Set .._log_file_size to 25 % of buffer pool size
innodb_log_file_size = 5G
innodb_log_buffer_size = 8M
innodb_flush_log_at_trx_commit = 1
innodb_lock_wait_timeout = 50
innodb_thread_concurrency = 0
[mysqldump]
quick
max_allowed_packet = 16M
[mysql]
no-auto-rehash
# Remove the next comment character if you are not familiar with SQL
safe-updates
[myisamchk]
key_buffer_size = 2G
sort_buffer_size = 256M
read_buffer = 2M
write_buffer = 2M
[mysqlhotcopy]
interactive-timeout
[mysqld_safe]
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid
# -----------------------------------------------------------------------------
# Summarise functional annotations (GO terms and Interpro accn)
# -----------------------------------------------------------------------------
extract_GO_IPR.pl
This script will collect GO terms and InterPRO accession numbers from multiple
matches using different databases (Pfam, Panther, etc.) after running
InterProScan. It reports a list of unique GO terms and Interpro accession
numbers (if there is a GO term in multiple matches, it is reported only once),
but does not account for GO terms hierarchy (i.e. a gene can have GO
terms that are parent and child).
Syntax:
extract_GO_IPR.pl
-i <input file>
-o <output file>
-a <directory with protein annotations (GFF files)>
-gom <GO database mirror (local|remote)> (optional, default: local)
-pdb <database> (extract GO terms from one particular database only
[e.g. Pfam]; optional, default: none
# -----------------------------------------------------------------------------