Feature extraction

Protein complexes take many different structural forms, but these forms are not random. We aim to use the structure and inherent topological patterns of known complexes to identify new ones.

Topological features are extracted for each positive and negative complex to construct the final training and test datasets. Each complex is described by the following 18 features:

Number of nodes (proteins) in the subgraph
Density of the subgraph
First 3 singular values of the adjacency matrix of the subgraph
Edge weight (protein interaction) statistics - mean, max and variance. For example, mean edge weight is the mean of all edge weights in the subgraph
Clustering coefficient statistics - mean, max and variance
Degree statistics - mean, max, median and variance
Degree correlation statistics - mean, max and variance
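
The feature list above can be sketched in a few lines of NumPy. This is a minimal illustration, not the project's actual implementation: the function names `stats` and `extract_features` are hypothetical, the subgraph is assumed to be undirected, free of self-loops, and given as a symmetric weighted adjacency matrix, the singular values are assumed to be taken from the weighted matrix, "degree correlation" is assumed to mean the average degree of a node's neighbours, and variance is assumed to be the population variance.

```python
import numpy as np

def stats(values, with_median=False):
    """Return mean, max, (median,) and population variance of a 1-D array."""
    v = np.asarray(values, dtype=float)
    if v.size == 0:              # e.g. a subgraph with no edges
        v = np.zeros(1)
    out = [v.mean(), v.max()]
    if with_median:
        out.append(np.median(v))
    out.append(v.var())
    return out

def extract_features(w):
    """Build the 18-feature vector from a symmetric weighted adjacency matrix.

    Assumes an undirected subgraph with no self-loops and at least one node.
    """
    w = np.asarray(w, dtype=float)
    a = (w > 0).astype(float)    # unweighted adjacency matrix
    n = a.shape[0]
    deg = a.sum(axis=1)
    n_edges = deg.sum() / 2
    density = 2 * n_edges / (n * (n - 1)) if n > 1 else 0.0

    # First 3 singular values (weighted matrix is an assumption), zero-padded
    sv = np.linalg.svd(w, compute_uv=False)
    sv = np.pad(sv, (0, max(0, 3 - sv.size)))[:3]

    # Each undirected edge appears once in the upper triangle
    edge_w = w[np.triu_indices(n, k=1)]
    edge_w = edge_w[edge_w > 0]

    # Local clustering: triangles through a node / pairs of its neighbours
    triangles = np.diag(a @ a @ a) / 2
    pairs = deg * (deg - 1) / 2
    clustering = np.divide(triangles, pairs, out=np.zeros(n), where=pairs > 0)

    # Degree correlation, taken here as average neighbour degree (assumption)
    nbr_deg = np.divide(a @ deg, deg, out=np.zeros(n), where=deg > 0)

    return np.array([n, density, *sv,
                     *stats(edge_w),
                     *stats(clustering),
                     *stats(deg, with_median=True),
                     *stats(nbr_deg)])
```

Calling `extract_features` on the adjacency matrix of each positive or negative subgraph gives one row of the dataset, with the features in the same order as the list above.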

To eliminate negative complexes (random walks) that resemble known complexes, we remove every negative sample (row) whose feature values exactly match those of any positive sample.
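
As a rough sketch of this filtering step, assuming each sample is a sequence of feature values and that "matches exactly" means equality on every feature (the function name `drop_matching_negatives` is hypothetical):

```python
def drop_matching_negatives(positives, negatives):
    """Keep only the negative rows that equal no positive row exactly."""
    # Tuples are hashable, so a set gives fast exact-match lookups
    positive_rows = {tuple(row) for row in positives}
    return [row for row in negatives if tuple(row) not in positive_rows]
```

Because the comparison is exact, a negative sample that differs from a positive one in even a single feature value is kept.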
