pccx
pccx: phylogeny based coverage calculation and extension
Overview
pccx (phylogeny based coverage calculation and extension) is a simple application for target selection in structural genomics. It can be used to:
evaluate how well a protein family has been covered with structural information
choose sequences to optimally extend this coverage
It is implemented in Java as part of the forester package. Currently, three scoring methods are implemented:
sum of 1/branch-segment-sum
sum of 1/branch-length-sum
sum of -ln(branch-length-sum)
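The three scores can be illustrated on a toy tree. The following is only a minimal sketch under stated assumptions, not the forester implementation: it assumes that every external node contributes one term based on its path to the nearest covered external node, that a covered node itself counts as one segment (or its own branch length, as the "for self: 1/branch-length" note in the pccx output suggests), and that lengths are floored at the minimum branch length of 0.001 that pccx reports. The class name, the tree, and all method names are hypothetical.

```java
import java.util.*;

class CoverageScoreSketch {
    enum Method { SEGMENTS, LENGTH, LOG_LENGTH }

    // Hypothetical toy tree ((A:0.2,B:0.3):0.1,(C:0.1,D:0.5):0.4), stored as
    // child -> parent and child -> length of the branch leading to the child.
    static final Map<String, String> PARENT = new HashMap<>();
    static final Map<String, Double> LENGTH = new HashMap<>();
    static final double MIN_BRANCH_LENGTH = 0.001; // value reported in the pccx -d output

    static void add(String child, String parent, double length) {
        PARENT.put(child, parent);
        LENGTH.put(child, length);
    }

    static {
        add("n1", "root", 0.1); add("n2", "root", 0.4);
        add("A", "n1", 0.2);    add("B", "n1", 0.3);
        add("C", "n2", 0.1);    add("D", "n2", 0.5);
    }

    // Returns {branch segments, summed branch length} on the path between a and b.
    static double[] path(String a, String b) {
        Set<String> ancestorsOfA = new HashSet<>();
        for (String n = a; n != null; n = PARENT.get(n)) ancestorsOfA.add(n);
        double segments = 0, length = 0;
        String lca = b;
        while (!ancestorsOfA.contains(lca)) { // climb from b to the common ancestor
            segments++; length += LENGTH.get(lca); lca = PARENT.get(lca);
        }
        for (String n = a; !n.equals(lca); n = PARENT.get(n)) { // climb from a
            segments++; length += LENGTH.get(n);
        }
        return new double[] {segments, length};
    }

    static double score(List<String> externals, Set<String> covered, Method method) {
        double sum = 0;
        for (String ext : externals) {
            double segments = Double.POSITIVE_INFINITY, length = Double.POSITIVE_INFINITY;
            if (covered.contains(ext)) {
                // Assumption: a covered node counts one segment and its own branch length.
                segments = 1;
                length = Math.max(LENGTH.get(ext), MIN_BRANCH_LENGTH);
            } else {
                for (String c : covered) { // keep the values for the nearest covered node
                    double[] p = path(ext, c);
                    segments = Math.min(segments, p[0]);
                    length = Math.min(length, Math.max(p[1], MIN_BRANCH_LENGTH));
                }
            }
            switch (method) {
                case SEGMENTS:   sum += 1.0 / segments;    break;
                case LENGTH:     sum += 1.0 / length;      break;
                case LOG_LENGTH: sum += -Math.log(length); break;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        List<String> externals = List.of("A", "B", "C", "D");
        Set<String> covered = Set.of("A");
        for (Method m : Method.values()) {
            System.out.println(m + " raw score: " + score(externals, covered, m));
        }
    }
}
```

With only A covered, the segment-based terms in this sketch are 1 (A itself), 1/2 (B), 1/4 (C) and 1/4 (D). How pccx normalizes the raw score is not stated here, so the sketch returns raw sums only.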
Download
Most current version (might be unstable): forester.jar
Source code is available at GitHub: https://github.com/cmzmasek/forester
Usage
java -cp path/to/forester.jar org.forester.tools.pccx [options] <phylogen(y|ies) infile> [external node name 1] [name 2] ... [name n]
Options:
-d: 1/distance based scoring method (instead of branch counting based)
-ld: -ln(distance) based scoring method (instead of branch counting based)
-x[=<n>]: optimally extend coverage by <n> external nodes. Omit <n>, or use 0 or a negative value, for complete coverage extension.
-o=<file>: write output to <file>
-i=<file>: read (new-line separated) external node names from <file>
-p=<file>: write output as annotated phylogeny to <file> (only the first phylogeny in the infile is used)
Examples
For the examples, a phylogeny based on the Malate/L-lactate dehydrogenase alignment from Pfam 21.0 is used (download: Ldh_2.nhx).
As of 2007-05-25, the following seven sequences from this family have at least one structure in the PDB (PDB identifiers first, sequence name in parentheses): 1s20, 1nxu (DLGD_ECOLI); 1rfm (COMC_METJA); 1v9n (MDH_PYRHO); 1vbi (Q746L8_THET2); 1wtj, 2cwf (Q4U331_PSESM); 1xrh (ALLD_ECOLI); and 1z2i (Q7CRW4_AGRT5).
To calculate a coverage score for a given phylogeny using the default "sum of 1/branch-segment-sum" scoring method, and to write the annotated phylogeny to a file with -p:
% java -cp path/to/forester.jar org.forester.tools.pccx Ldh_2.nhx DLGD_ECOLI COMC_METJA MDH_PYRHO Q746L8_THET2 Q4U331_PSESM ALLD_ECOLI Q7CRW4_AGRT5 -p=Ldh_2_b7.nhx
Output:
Options: scoring method: sum of 1/branch-segment-sum
Normalized score: 0.1497663297543091
Raw score : 33.84719052447385
Wrote annotated phylogeny to "Ldh_2_b7.nhx"
To calculate a coverage score for a given phylogeny using the "sum of 1/branch-length-sum" scoring method (option -d):
% java -cp path/to/forester.jar org.forester.tools.pccx -d Ldh_2.nhx DLGD_ECOLI COMC_METJA MDH_PYRHO Q746L8_THET2 Q4U331_PSESM ALLD_ECOLI Q7CRW4_AGRT5
Output:
Options: scoring method: sum of 1/branch-length-sum [for self: 1/branch-length] [min branch length: 0.0010]
Normalized score: 0.12868805358848912
Raw score : 7623.40971285036
To optimally extend coverage by 10 more sequences:
% java -cp path/to/forester.jar org.forester.tools.pccx -x=10 Ldh_2.nhx DLGD_ECOLI COMC_METJA MDH_PYRHO Q746L8_THET2 Q4U331_PSESM ALLD_ECOLI Q7CRW4_AGRT5 -p=Ldh_2_b7_x10.nhx
Output:
Options: scoring method: sum of 1/branch-segment-sum
Printing 10 names to extend coverage in an optimal manner:
before:
Normalized score: 0.1497663297543091
Raw score : 33.84719052447385
0 Q3PGX6_PARDE 0.16718837131297096
1 Q6D702_ERWCT 0.18360557829584423
2 Q1V2K0_9RICK 0.1942380873796807
3 Q7PI68_ANOGA 0.2046462755533554
4 Q5QTW6_IDILO 0.21426464391066183
5 Q2T3J0_BURTA 0.22353129908439665
6 Q5WAN1_BACSK 0.23244837758112122
7 Q8UIX7_AGRT5 0.2404779463407785
8 Q323Z3_SHIBS 0.2481616097766543
9 Q8YB95_BRUME 0.25576625930608254
after:
Normalized score: 0.25576625930608254
Raw score : 57.803174603174654
Wrote annotated phylogeny to "Ldh_2_b7_x10.nhx"
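The step-by-step scores in this output, each higher than the one before, would fit a greedy procedure that repeatedly adds the external node giving the largest score increase. The following is only a sketch of such a greedy extension under that assumption; it is not taken from the forester source. The six-node segment matrix, the class and method names, the normalization by the number of external nodes, and the tie-breaking rule (keep the earlier candidate) are all hypothetical choices made for this illustration.

```java
import java.util.*;

class GreedyExtensionSketch {
    // Hypothetical toy data: branch-segment counts between the external nodes
    // of the ladder-shaped tree (A,(B,(C,(D,(E,F))))).
    static final String[] NAMES = {"A", "B", "C", "D", "E", "F"};
    static final int[][] SEGMENTS = {
        {0, 3, 4, 5, 6, 6},
        {3, 0, 3, 4, 5, 5},
        {4, 3, 0, 3, 4, 4},
        {5, 4, 3, 0, 3, 3},
        {6, 5, 4, 3, 0, 2},
        {6, 5, 4, 3, 2, 0},
    };

    // Assumed score: a covered node adds 1, any other external node adds
    // 1/(segments to its nearest covered node); divided by the node count.
    static double normalizedScore(Set<Integer> covered) {
        double sum = 0;
        for (int i = 0; i < NAMES.length; i++) {
            if (covered.contains(i)) { sum += 1; continue; }
            int nearest = Integer.MAX_VALUE;
            for (int c : covered) nearest = Math.min(nearest, SEGMENTS[i][c]);
            sum += 1.0 / nearest;
        }
        return sum / NAMES.length;
    }

    // Greedily picks up to n additional nodes, one per step.
    static List<String> extend(Set<Integer> initiallyCovered, int n) {
        Set<Integer> covered = new TreeSet<>(initiallyCovered);
        List<String> picked = new ArrayList<>();
        for (int step = 0; step < n && covered.size() < NAMES.length; step++) {
            int best = -1;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (int cand = 0; cand < NAMES.length; cand++) {
                if (covered.contains(cand)) continue;
                covered.add(cand);                 // try the candidate
                double s = normalizedScore(covered);
                covered.remove(cand);              // and undo the trial
                // The small tolerance makes ties resolve to the earlier candidate.
                if (s > bestScore + 1e-12) { best = cand; bestScore = s; }
            }
            covered.add(best);
            picked.add(NAMES[best]);
            System.out.println(step + " " + NAMES[best] + " " + bestScore);
        }
        return picked;
    }

    public static void main(String[] args) {
        Set<Integer> covered = Set.of(0); // only A has a structure
        System.out.println("before: " + normalizedScore(covered));
        extend(covered, 3);
    }
}
```

A greedy choice of this kind is simple and re-scores the whole tree for every candidate, which is fine for a sketch but would be slow on large trees. It also does not guarantee the best possible set of n nodes.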
Comparison of scoring methods currently implemented in pccx
As in the examples above, a phylogeny based on the Malate/L-lactate dehydrogenase alignment from Pfam 21.0 is used.
The graph was produced with gnuplot.
Background
Brenner S.E. (2000). Target selection for structural genomics. Nature Structural Biology, 7, 967-969.
Rodrigues A.P.C., Grant B.J., and Hubbard R.E. (2006). sgTarget: a target selection resource for structural genomics. Nucleic Acids Research, 34, W225-W230.