pccx

pccx: phylogeny based coverage calculation and extension

Overview

pccx (phylogeny based coverage calculation and extension) is a simple application for target selection in structural genomics. It can be used to:

  • evaluate how well a protein family has been covered with structural information

  • choose sequences to optimally extend this coverage

It is implemented in Java as part of the forester package. Currently, three scoring methods are implemented:

  • sum of 1/branch-segment-sum

  • sum of 1/branch-length-sum

  • sum of -ln(branch-length-sum)
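
The three methods differ in how the path between an external node and a node with a known structure is turned into a score contribution. The following Java sketch illustrates this under assumptions that are not stated on this page: each external node is scored against its closest covered node, and branch-length sums are clamped to a small minimum before inverting or taking the logarithm (the value 0.001 is borrowed from the "min branch length" reported in the example output for 1/branch-length-sum). How a covered node itself is scored is not modeled. The class and method names are hypothetical and are not part of the forester API.

```java
// Illustrative sketch only; not the forester implementation.
final class ScoringSketch {

    // Assumed lower bound on branch-length sums, avoiding 1/0 and ln(0).
    static final double MIN_BRANCH_LENGTH = 0.001;

    // "1/branch-segment-sum": segments is the number of branch segments
    // on the path to the closest covered node.
    static double inverseSegmentCount(int segments) {
        if (segments < 1) {
            throw new IllegalArgumentException("segments must be >= 1");
        }
        return 1.0 / segments;
    }

    // "1/branch-length-sum": lengthSum is the summed branch length
    // on the path to the closest covered node.
    static double inverseLength(double lengthSum) {
        return 1.0 / Math.max(lengthSum, MIN_BRANCH_LENGTH);
    }

    // "-ln(branch-length-sum)": same input as above, log-transformed.
    static double negativeLogLength(double lengthSum) {
        return -Math.log(Math.max(lengthSum, MIN_BRANCH_LENGTH));
    }

    public static void main(String[] args) {
        // Three external nodes that are 1, 2 and 4 segments away
        // from their closest covered node.
        int[] segments = {1, 2, 4};
        double raw = 0.0;
        for (int s : segments) {
            raw += inverseSegmentCount(s);
        }
        System.out.println(raw); // prints 1.75
    }
}
```

In all three variants a shorter path to a covered node yields a larger contribution, so the raw score grows as structures are spread more evenly over the tree.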

Usage

java -cp path/to/forester.jar org.forester.tools.pccx [options] <phylogen(y|ies) infile> [external node name 1] [name 2] ... [name n]

Options:

-d: 1/distance based scoring method (instead of branch counting based)

-ld: -ln(distance) based scoring method (instead of branch counting based)

-x[=<n>]: optimally extend coverage by <n> external nodes. Omit <n>, or use 0 or a negative value, for complete coverage extension.

-o=<file>: write output to <file>

-i=<file>: read (new-line separated) external node names from <file>

-p=<file>: write output as annotated phylogeny to <file> (only the first phylogeny in the phylogenies infile is used)

Examples

For the examples, a phylogeny based on the Malate/L-lactate dehydrogenase alignment from Pfam 21.0 is used (download: Ldh_2.nhx).

As of 2007-05-25, the following seven sequences from this family have at least one structure in the PDB (PDB IDs first, sequence name in parentheses): 1s20, 1nxu (DLGD_ECOLI); 1rfm (COMC_METJA); 1v9n (MDH_PYRHO); 1vbi (Q746L8_THET2); 1wtj, 2cwf (Q4U331_PSESM); 1xrh (ALLD_ECOLI); and 1z2i (Q7CRW4_AGRT5).

To calculate a coverage score for a given phylogeny using the default "sum of 1/branch-segment-sum" scoring method, and to write the annotated phylogeny to a file:

% java -cp path/to/forester.jar org.forester.tools.pccx Ldh_2.nhx DLGD_ECOLI COMC_METJA MDH_PYRHO Q746L8_THET2 Q4U331_PSESM ALLD_ECOLI Q7CRW4_AGRT5 -p=Ldh_2_b7.nhx

Output:

Options: scoring method: sum of 1/branch-segment-sum

Normalized score: 0.1497663297543091

Raw score : 33.84719052447385

Wrote annotated phylogeny to "Ldh_2_b7.nhx"
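
The output does not say how the normalized score is derived from the raw score. Dividing one by the other gives a hint: for the values above the ratio is very close to 226, and the same holds for the "after" scores in the extension example further below. A plausible reading, which is an assumption rather than documented behavior, is that the raw score is divided by the highest score attainable for this phylogeny. A minimal Java check of the arithmetic (the class name is hypothetical):

```java
// Illustrative arithmetic check; not part of forester.
final class NormalizationCheck {

    // Ratio of raw score to normalized score.
    static double ratio(double raw, double normalized) {
        return raw / normalized;
    }

    public static void main(String[] args) {
        // Values copied from the pccx output shown in this document.
        double before = ratio(33.84719052447385, 0.1497663297543091);
        double after = ratio(57.803174603174654, 0.25576625930608254);
        System.out.println(Math.round(before)); // prints 226
        System.out.println(Math.round(after));  // prints 226
    }
}
```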

To calculate a coverage score for a given phylogeny using the "sum of 1/branch-length-sum" scoring method (option -d):

% java -cp path/to/forester.jar org.forester.tools.pccx -d Ldh_2.nhx DLGD_ECOLI COMC_METJA MDH_PYRHO Q746L8_THET2 Q4U331_PSESM ALLD_ECOLI Q7CRW4_AGRT5

Output:

Options: scoring method: sum of 1/branch-length-sum [for self: 1/branch-length] [min branch length: 0.0010]

Normalized score: 0.12868805358848912

Raw score : 7623.40971285036

To optimally extend coverage by 10 more sequences:

% java -cp path/to/forester.jar org.forester.tools.pccx -x=10 Ldh_2.nhx DLGD_ECOLI COMC_METJA MDH_PYRHO Q746L8_THET2 Q4U331_PSESM ALLD_ECOLI Q7CRW4_AGRT5 -p=Ldh_2_b7_x10.nhx

Output:

Options: scoring method: sum of 1/branch-segment-sum

Printing 10 names to extend coverage in an optimal manner:

before:

Normalized score: 0.1497663297543091

Raw score : 33.84719052447385

0 Q3PGX6_PARDE 0.16718837131297096

1 Q6D702_ERWCT 0.18360557829584423

2 Q1V2K0_9RICK 0.1942380873796807

3 Q7PI68_ANOGA 0.2046462755533554

4 Q5QTW6_IDILO 0.21426464391066183

5 Q2T3J0_BURTA 0.22353129908439665

6 Q5WAN1_BACSK 0.23244837758112122

7 Q8UIX7_AGRT5 0.2404779463407785

8 Q323Z3_SHIBS 0.2481616097766543

9 Q8YB95_BRUME 0.25576625930608254

after:

Normalized score: 0.25576625930608254

Raw score : 57.803174603174654

Wrote annotated phylogeny to "Ldh_2_b7_x10.nhx"
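
The stepwise increase of the normalized score in the listing above is what one would expect from a greedy procedure that, at each step, adds the external node giving the largest score gain. Whether pccx works exactly this way is not stated here, so the following Java sketch is only an illustration under that assumption. It uses a hypothetical matrix of pairwise segment counts instead of a real phylogeny, the 1/branch-segment-sum scoring, and diagonal entries of 1 (assuming a covered node scores 1). All names are hypothetical and not part of the forester API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative greedy sketch; not the forester implementation.
final class GreedyExtensionSketch {

    // Each node contributes 1/segments to its closest covered node.
    static double score(int[][] segments, List<Integer> covered) {
        double total = 0.0;
        for (int node = 0; node < segments.length; node++) {
            int closest = Integer.MAX_VALUE;
            for (int c : covered) {
                closest = Math.min(closest, segments[node][c]);
            }
            if (closest != Integer.MAX_VALUE) {
                total += 1.0 / closest;
            }
        }
        return total;
    }

    // Repeatedly add the uncovered node that raises the score the most.
    static List<Integer> extend(int[][] segments, List<Integer> covered, int n) {
        List<Integer> current = new ArrayList<>(covered);
        List<Integer> added = new ArrayList<>();
        for (int step = 0; step < n; step++) {
            int best = -1;
            double bestScore = score(segments, current);
            for (int cand = 0; cand < segments.length; cand++) {
                if (current.contains(cand)) {
                    continue;
                }
                current.add(cand);                   // try the candidate
                double s = score(segments, current);
                current.remove(current.size() - 1);  // undo the trial (by index)
                if (s > bestScore) {
                    bestScore = s;
                    best = cand;
                }
            }
            if (best < 0) {
                break; // no candidate improves the score
            }
            current.add(best);
            added.add(best);
        }
        return added;
    }

    public static void main(String[] args) {
        // Hypothetical tree ((A,B),(C,D)) with nodes indexed 0..3.
        int[][] segments = {
            {1, 2, 4, 4},
            {2, 1, 4, 4},
            {4, 4, 1, 2},
            {4, 4, 2, 1}
        };
        // Start with A covered and extend by two nodes.
        System.out.println(extend(segments, List.of(0), 2)); // prints [2, 1]
    }
}
```

In this toy case the first pick is a node from the uncovered clade (C), since that raises the score more than adding the sibling of an already covered node.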

Comparison of scoring methods currently implemented in pccx

As in the examples above, a phylogeny based on the Malate/L-lactate dehydrogenase alignment from Pfam 21.0 is used.

The graph was produced with gnuplot.

Background

Brenner S.E. (2000). Target selection for structural genomics. Nature Structural Biology, 7, 967-969.

Rodrigues A.P.C., Grant B.J., and Hubbard R.E. (2006). sgTarget: a target selection resource for structural genomics. Nucleic Acids Research, 34, W225-W230.