Sampling

Introduction

Algorithms:

Greedy

Finding the most important neighbor

Greedy - top neighbours

Metropolis

Iterative simulated annealing

Postprocessing:

Removing overlaps

Evaluation

Introduction

Candidate complexes can be sampled as subgraphs of the entire protein interaction network. Alternatively, a reduced protein interaction network containing only the proteins in known complexes can be used. This may allow a better evaluation of the sampling algorithm and, more importantly, is more efficient.

Common steps in all algorithms:

Every node is used as a seed node for the growth process that samples complexes.
All of the sampling algorithms below start growth from the subgraph formed by the seed node and its highest-weight neighbor.
Each algorithm has a maximum threshold on the number of growth steps.

Algorithms:

Greedy

This is the simplest growth strategy for building complexes from high-confidence interactions.

1. The neighbor connected to the current subgraph by the highest edge weight is added to the subgraph.
2. The ML model is used to check whether the new subgraph is a complex.
3. If it is not a complex, remove the neighbor just added and return the previous subgraph as the complex.
4. Otherwise, repeat from step 1.
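
The greedy steps above can be sketched as follows. This is a minimal illustration, not the project's implementation: the graph is assumed to be a dict of dicts of edge weights, `score_fn` is a hypothetical stand-in for the ML model, and the 0.5 cutoff for "is a complex" and the default `max_steps` are assumptions.

```python
def greedy_grow(graph, seed, score_fn, max_steps=10, threshold=0.5):
    """Grow a candidate complex from `seed`, always adding the neighbor
    joined to the current subgraph by the heaviest edge.

    graph: dict mapping node -> {neighbor: edge weight} (undirected).
    score_fn: hypothetical stand-in for the ML model; takes a set of nodes
              and returns the probability that they form a complex.
    threshold: assumed cutoff above which a subgraph counts as a complex.
    """
    if not graph[seed]:
        return {seed}
    # Starting subgraph: the seed plus its highest-weight neighbor.
    subgraph = {seed, max(graph[seed], key=graph[seed].get)}
    for _ in range(max_steps):
        # Best edge weight linking each outside node to the subgraph.
        frontier = {}
        for u in subgraph:
            for v, w in graph[u].items():
                if v not in subgraph:
                    frontier[v] = max(w, frontier.get(v, w))
        if not frontier:
            break
        candidate = max(frontier, key=frontier.get)
        # If adding the neighbor breaks the complex, return the previous subgraph.
        if score_fn(subgraph | {candidate}) < threshold:
            break
        subgraph.add(candidate)
    return subgraph
```

Growth stops when the model rejects an addition, when no neighbors remain, or when the step limit is reached.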

Finding the most important neighbor

Intuitively, the choice of which neighbor to add at each step determines the accuracy of the complexes formed. The previous strategy checks only the neighbor with the highest edge weight; to improve on it, we can check more neighbors to find the best node to add.

The next 3 algorithms use the following strategy to choose the 'most important neighbor' of a node (the candidate node to add to a complex):

1. Sort the neighbors of the node by edge weight. This ensures that sampled complexes have high-confidence interactions; a higher edge weight can also indicate a higher probability of belonging to the complex.
2. Pick the subset V of the node's neighbors with the highest edge weights. This step is for efficiency: a node can have a large number of neighbors, and checking all of them is inefficient.
3. For each neighbor in V, add it to the subgraph and compute the score of the new complex (its probability of being a complex) using the ML model.
4. Sort the neighbors in V by these scores.
5. To allow some exploration (for instance, because better complex nodes may be found as neighbors of a currently low-scoring node), we introduce an exploration probability and set it to a low value. With this low probability, we choose a neighbor at random from those in step 4 as the 'most important neighbor'. With high probability, we choose the highest-scoring neighbor from step 4.
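
As a rough sketch of this selection strategy, assuming the same dict-of-dicts graph and a hypothetical `score_fn` in place of the ML model (the names `subset_size` and `explore_prob` and their default values are illustrative, not taken from the project):

```python
import random

def most_important_neighbor(graph, node, subgraph, score_fn,
                            subset_size=5, explore_prob=0.05, rng=random):
    """Choose the 'most important neighbor' of `node` relative to `subgraph`.

    Returns a (neighbor, score) pair, or None if `node` has no neighbor
    outside `subgraph`. `score_fn` is a hypothetical stand-in for the ML model.
    """
    outside = [v for v in graph[node] if v not in subgraph]
    if not outside:
        return None
    # Steps 1-2: sort by edge weight and keep only the heaviest few (subset V).
    outside.sort(key=lambda v: graph[node][v], reverse=True)
    subset = outside[:subset_size]
    # Steps 3-4: score each trial subgraph and sort neighbors by that score.
    scored = sorted(((v, score_fn(subgraph | {v})) for v in subset),
                    key=lambda pair: pair[1], reverse=True)
    # Step 5: explore with low probability, otherwise take the top scorer.
    if rng.random() < explore_prob:
        return rng.choice(scored)
    return scored[0]
```

Passing a seeded `random.Random` instance as `rng` makes the exploration step reproducible.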

Greedy - top neighbours

1. For each node in the current complex, find its 'most important neighbor' among a random subset of its neighbors. The size of this subset is min(number of neighbors, threshold value).
2. From the list of all important neighbors obtained in step 1, find the most important one using the same strategy and add it to the subgraph.
3. Same as steps 2, 3 and 4 of the Greedy algorithm.

Metropolis

1. Same as steps 1 and 2 of the Greedy - top neighbours algorithm.
2. Same as steps 2 and 3 of the Greedy algorithm.
3. If the score of the current complex is greater than that of the previous complex, accept the new node addition.
4. Otherwise, accept the node addition with a low probability, termed the metropolis probability (this is exploration again), and reject it with high probability.
5. If the score of the complex has not improved for a long time (e.g. 10 steps), return the current complex.
6. Repeat from step 1.
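
The accept/reject logic above might look roughly like the sketch below. The candidate selection is abstracted into a hypothetical `propose` callable, `score_fn` again stands in for the ML model, and the parameter names and defaults (`metropolis_prob`, `patience`, `threshold`) are assumptions made for illustration.

```python
import random

def metropolis_grow(start, propose, score_fn, metropolis_prob=0.1,
                    patience=10, max_steps=100, threshold=0.5, rng=random):
    """Grow a complex, occasionally accepting additions that do not improve it.

    propose: hypothetical callable; given the current set of nodes it returns
             the next node to try, or None when no candidate is left.
    score_fn: hypothetical stand-in for the ML model.
    """
    current = set(start)
    current_score = score_fn(current)
    stale_steps = 0  # steps since the score last improved
    for _ in range(max_steps):
        node = propose(current)
        if node is None:
            break
        new_score = score_fn(current | {node})
        if new_score < threshold:
            break  # not a complex: return the previous subgraph
        if new_score > current_score:
            # Improvement: always accept.
            current.add(node)
            current_score = new_score
            stale_steps = 0
            continue
        # No improvement: accept anyway with low probability (exploration).
        if rng.random() < metropolis_prob:
            current.add(node)
            current_score = new_score
        stale_steps += 1
        if stale_steps >= patience:
            break
    return current
```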

Iterative simulated annealing

Same as the Metropolis algorithm, with the metropolis probability replaced by the probability

p = e^((current_score - old_score)/T)

where the temperature is updated at each iteration as T = T_old/alpha, starting from T0.

The analogy is with the annealing of metals: the temperature T decreases slowly with time (here, iterations). Because the exponent is negative whenever the score has not improved, a lower T gives a lower acceptance probability p, so the algorithm explores less toward the end.
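
To make the schedule concrete, here is a small sketch of the acceptance probability and the cooling step. The values T0 = 1.0 and alpha = 1.5 are illustrative assumptions, and returning 1 for improvements is a choice made here so that the result is always a valid probability.

```python
import math

def acceptance_probability(current_score, old_score, temperature):
    """p = e^((current_score - old_score) / T) for non-improving moves.

    Improvements return 1.0 (an assumption of this sketch), which also
    avoids overflow in math.exp at very low temperatures.
    """
    if current_score >= old_score:
        return 1.0
    return math.exp((current_score - old_score) / temperature)

def cool(temperature, alpha):
    """T = T_old / alpha; alpha > 1 lowers the temperature each iteration."""
    return temperature / alpha

# Illustrative run: the same score drop is accepted less often as T falls.
T = 1.0  # T0, assumed value
for step in range(5):
    print(step, round(T, 4), acceptance_probability(0.6, 0.8, T))
    T = cool(T, alpha=1.5)
```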

Postprocessing:

Complexes with only 2 nodes are removed.

Removing overlaps

An overlap threshold is supplied. A complex is removed if its overlap (shared nodes) with any larger complex exceeds the threshold value.
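
A possible sketch of both postprocessing steps is given below. How the overlap is measured is an assumption here: it is taken as the fraction of the smaller complex's nodes that also appear in the larger one. Each complex is compared only against the larger (or equal-sized, earlier) complexes already kept, which is a simplification chosen for this sketch.

```python
def remove_overlaps(complexes, overlap_threshold=0.5, min_size=3):
    """Drop tiny complexes and complexes that overlap too much with a bigger one.

    Overlap is measured (by assumption) as |small & big| / |small|.
    """
    # Largest first, so each complex is compared against bigger kept ones.
    ordered = sorted((set(c) for c in complexes), key=len, reverse=True)
    kept = []
    for c in ordered:
        if len(c) < min_size:
            continue  # complexes with only 2 nodes (or fewer) are removed
        if any(len(c & k) / len(c) > overlap_threshold for k in kept):
            continue  # overlaps too much with a larger complex
        kept.append(c)
    return kept
```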

Evaluation

We compare the set of predicted complexes with known complexes to evaluate the algorithm.

First, we construct a set of reduced protein complexes containing only proteins that are present in the known complexes. We retain only complexes with 3 or more nodes.

We compare the predicted complex set and the known complex set, and say that a predicted complex recovers a known complex if,

where, p is an input threshold parameter between 0 and 1 and

A - Number of proteins only in the predicted complex
B - Number of proteins only in the known complex
C - Number of proteins in the overlap between two

Precision, Recall and F1 measures are calculated using this information, i.e.

Precision = No. of predicted complexes that recover a known complex/No. of predicted complexes

Recall = No. of recovered known complexes/No. of known complexes
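
The measures above can be computed as in the following sketch. The recovery test used here is an assumption: a predicted complex is taken to recover a known one when C/(A+C) > p and C/(B+C) > p, which is one common form of such a criterion in terms of A, B, C and p. F1 is the usual harmonic mean of precision and recall.

```python
def recovers(predicted, known, p=0.5):
    """Assumed criterion: the overlap covers more than a fraction p of both."""
    a = len(predicted - known)  # A: proteins only in the predicted complex
    b = len(known - predicted)  # B: proteins only in the known complex
    c = len(predicted & known)  # C: proteins in the overlap
    if c == 0:
        return False
    return c / (a + c) > p and c / (b + c) > p

def evaluate(predicted_complexes, known_complexes, p=0.5):
    """Return (precision, recall, f1) for sets of predicted and known complexes."""
    predicted = [set(c) for c in predicted_complexes]
    known = [set(c) for c in known_complexes]
    # Predicted complexes that recover at least one known complex.
    hits_pred = sum(any(recovers(pc, kc, p) for kc in known) for pc in predicted)
    # Known complexes recovered by at least one predicted complex.
    hits_known = sum(any(recovers(pc, kc, p) for pc in predicted) for kc in known)
    precision = hits_pred / len(predicted) if predicted else 0.0
    recall = hits_known / len(known) if known else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```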
