Data

All experiment data is available here.

Details of the files in each experiment are given below.

Toy network experiment

Input data:

  • Toy network, available as a weighted edge list. Format: node1 node2 edge-weight

  • All raw toy communities, available as node lists. Format: node1 node2 node3 .. Each line represents a community.
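The two input formats above can be read with a few lines of Python. This is only a minimal sketch: it assumes the fields are whitespace-separated and that the network is undirected, and the function names are hypothetical, not part of the released code.

```python
from collections import defaultdict


def read_weighted_edge_list(lines):
    """Parse 'node1 node2 edge-weight' lines into a nested-dict adjacency map."""
    graph = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        u, v, w = parts[0], parts[1], float(parts[2])
        graph[u][v] = w
        graph[v][u] = w  # assumption: edges are undirected
    return dict(graph)


def read_communities(lines):
    """Parse one community per line ('node1 node2 node3 ..') into sets of nodes."""
    return [set(line.split()) for line in lines if line.strip()]


# In-memory example; an open file object can be passed in the same way.
graph = read_weighted_edge_list(["a b 0.5", "b c 1.0"])
communities = read_communities(["a b c", "c d"])
```

Both functions accept any iterable of lines, so they work on an open file as well as on a list of strings.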

Intermediate output results:

  • Training toy communities, available as node lists. Format: node1 node2 node3 .. Each line represents a community.

  • Testing toy communities, available as node lists. Format: node1 node2 node3 .. Each line represents a community.

  • Training toy communities feature matrix, available as a spreadsheet in which each row is the feature vector of a positive or negative community. Format: density, number of nodes, degree statistics (maximum, mean, median and variance), clustering coefficient (CC) statistics (maximum, mean and variance), edge weight statistics (mean, maximum and variance), degree correlation (DC) statistics (mean, variance and maximum), 3 singular values of the subgraph's adjacency matrix, and label (1 for a positive community, 0 for a negative community)

  • Testing toy communities feature matrix, available as a spreadsheet in which each row is the feature vector of a positive or negative community. Format: density, number of nodes, degree statistics (maximum, mean, median and variance), clustering coefficient (CC) statistics (maximum, mean and variance), edge weight statistics (mean, maximum and variance), degree correlation (DC) statistics (mean, variance and maximum), 3 singular values of the subgraph's adjacency matrix, and label (1 for a positive community, 0 for a negative community)
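The column order described above gives 18 features followed by the label. The sketch below shows one way to split a row into named features and a label. The column names are hypothetical shorthand, and it assumes each row contains exactly these 19 values in the listed order, with no extra index column.

```python
# Hypothetical column names, following the order given in the format description.
FEATURE_COLUMNS = [
    "density", "n_nodes",
    "degree_max", "degree_mean", "degree_median", "degree_var",
    "cc_max", "cc_mean", "cc_var",
    "edge_weight_mean", "edge_weight_max", "edge_weight_var",
    "dc_mean", "dc_var", "dc_max",
    "sv_1", "sv_2", "sv_3",
]


def split_row(row):
    """Split one feature-matrix row into (features dict, integer label)."""
    values = [float(x) for x in row]
    if len(values) != len(FEATURE_COLUMNS) + 1:
        raise ValueError(
            f"expected {len(FEATURE_COLUMNS) + 1} values, got {len(values)}"
        )
    features = dict(zip(FEATURE_COLUMNS, values[:-1]))
    label = int(values[-1])  # 1 = positive community, 0 = negative community
    return features, label
```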

Output results:

  • Trained toy community fitness function, available as pickled files of a data pre-processor and a machine learning model from sklearn. The files can be loaded in Python with the pickle module, for example: import pickle; obj = pickle.load(open(filename, 'rb'))

  • Learned toy communities, available as node lists. Format: node1 node2 node3 .. nodeN Score. Each line represents a community. The score is the value of the community fitness function for that community.

  • Learned toy communities, available as edge lists. Format: node1 node2 edge-weight. A blank line (i.e. two newline characters) separates the edges of one community from another community's edges.
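Loading the pickled pre-processor and model can be sketched as follows. Note that `pickle.load` expects a file object opened in binary mode, not a file name. The helper name and the file names in the comments are hypothetical. Unpickling sklearn objects also requires scikit-learn to be installed (ideally a version compatible with the one used for training), and pickle files should only be loaded from trusted sources.

```python
import pickle


def load_pickled(path):
    """Load a single pickled object from `path`."""
    # pickle.load needs a binary file object rather than a path string.
    with open(path, "rb") as f:
        return pickle.load(f)


# Hypothetical usage; the actual file names depend on the downloaded data:
# preprocessor = load_pickled("toy_preprocessor.pkl")
# model = load_pickled("toy_model.pkl")
# Assumption: the pre-processor exposes transform() and the model exposes
# predict(), as is typical for sklearn objects.
# predictions = model.predict(preprocessor.transform(feature_rows))
```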

hu.MAP experiment

Input data:

  • hu.MAP PPI (protein-protein interaction) network, available as a weighted edge list. Format: gene_ID1 gene_ID2 edge-weight

  • All raw human protein complexes from CORUM, available as node lists. Format: gene_ID1 gene_ID2 gene_ID3 .. Each line represents a protein complex.

Intermediate output results:

  • Training complexes from CORUM, available as node lists. Format: gene_ID1 gene_ID2 gene_ID3 .. Each line represents a protein complex.

  • Testing complexes from CORUM, available as node lists. Format: gene_ID1 gene_ID2 gene_ID3 .. Each line represents a protein complex.

  • Training data, i.e. feature matrix of CORUM complexes (with edge weights from hu.MAP PPI network), available as a spreadsheet in which each row is the feature vector of a positive or negative protein complex. Format: density, number of nodes, degree statistics (maximum, mean, median and variance), clustering coefficient (CC) statistics (maximum, mean and variance), edge weight statistics (mean, maximum and variance), degree correlation (DC) statistics (mean, variance and maximum), 3 singular values of the subgraph's adjacency matrix, and label (1 for a positive protein complex, 0 for a negative protein complex)

  • Testing data, i.e. feature matrix of CORUM complexes (with edge weights from hu.MAP PPI network), available as a spreadsheet in which each row is the feature vector of a positive or negative protein complex. Format: density, number of nodes, degree statistics (maximum, mean, median and variance), clustering coefficient (CC) statistics (maximum, mean and variance), edge weight statistics (mean, maximum and variance), degree correlation (DC) statistics (mean, variance and maximum), 3 singular values of the subgraph's adjacency matrix, and label (1 for a positive protein complex, 0 for a negative protein complex)

Output results:

  • Trained community fitness function of CORUM complexes (with edge weights from hu.MAP), available as pickled files of a data pre-processor and a machine learning model from sklearn. The files can be loaded in Python with the pickle module, for example: import pickle; obj = pickle.load(open(filename, 'rb'))

  • Learned protein complexes from hu.MAP PPI network, available as node lists. Format: Excel file with the following columns: learned complex name (the name of the most similar CORUM complex, prepended by the Jaccard coefficient similarity), proteins in the learned complex as gene names (gene_name1 gene_name2 gene_name3 .. gene_nameN), proteins in the learned complex as gene IDs (gene_ID1 gene_ID2 gene_ID3 .. gene_IDN), and score (the value of the community fitness function for the learned protein complex)

  • Learned protein complexes from hu.MAP PPI network, available as gene ID edge lists. Format: gene_ID1 gene_ID2 edge-weight. A blank line (i.e. two newline characters) separates the edges of one complex from another complex's edges.

  • Learned protein complexes from hu.MAP PPI network, available as gene name edge lists. Format: gene_name1 gene_name2 edge-weight. A blank line (i.e. two newline characters) separates the edges of one complex from another complex's edges.
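The learned complex names above are based on the most similar CORUM complex under the Jaccard coefficient, i.e. the size of the intersection of two protein sets divided by the size of their union. A minimal sketch of that matching step is given below. The function names and the dictionary layout are assumptions for illustration, and the exact naming format used in the Excel file is not reproduced here.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two collections of proteins."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar(learned, known):
    """Return (name, jaccard) of the known complex most similar to `learned`.

    `known` maps a complex name to its member proteins (hypothetical layout).
    """
    if not known:
        return None, 0.0
    best = max(known, key=lambda name: jaccard(learned, known[name]))
    return best, jaccard(learned, known[best])
```

For example, a learned complex {A, B} compared against known complexes x = {A, B, C} and y = {D} is matched to x with a Jaccard coefficient of 2/3.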

Yeast experiments

Input data:

  • DIP yeast PPI network, available as a weighted edge list. Format: gene_ID1* gene_ID2 edge-weight

  • Yeast protein complexes from MIPS, available as node lists. Format: gene_ID1 gene_ID2 gene_ID3 .. Each line represents a protein complex.

  • Yeast protein complexes from TAP-MS, available as node lists. Format: gene_ID1 gene_ID2 gene_ID3 .. Each line represents a protein complex.

Experiment 1: Training on TAP-MS and Testing on MIPS:

Experiment 1 Intermediate output results:

  • Training data, i.e. feature matrix of TAP-MS complexes (with edge weights from the DIP PPI network), available as a spreadsheet in which each row is the feature vector of a positive or negative protein complex. Format: density, number of nodes, degree statistics (maximum, mean, median and variance), clustering coefficient (CC) statistics (maximum, mean and variance), edge weight statistics (mean, maximum and variance), degree correlation (DC) statistics (mean, variance and maximum), 3 singular values of the subgraph's adjacency matrix, and label (1 for a positive protein complex, 0 for a negative protein complex)

  • Testing data, i.e. feature matrix of MIPS complexes (with edge weights from the DIP PPI network), available as a spreadsheet in which each row is the feature vector of a positive or negative protein complex. Format: density, number of nodes, degree statistics (maximum, mean, median and variance), clustering coefficient (CC) statistics (maximum, mean and variance), edge weight statistics (mean, maximum and variance), degree correlation (DC) statistics (mean, variance and maximum), 3 singular values of the subgraph's adjacency matrix, and label (1 for a positive protein complex, 0 for a negative protein complex)

Experiment 1 output results:

  • Trained community fitness function of TAP-MS complexes (with edge weights from DIP PPI network), available as pickled files of a data pre-processor and a machine learning model from sklearn. The files can be loaded in Python with the pickle module, for example: import pickle; obj = pickle.load(open(filename, 'rb'))

  • Learned protein complexes from DIP PPI network, available as node lists. Format: gene_ID1 gene_ID2 gene_ID3 .. gene_IDN Score. Each line represents a protein complex. The score is the value of the community fitness function for that protein complex.

  • Learned protein complexes from DIP PPI network, available as edge lists. Format: gene_ID1 gene_ID2 edge-weight. A blank line (i.e. two newline characters) separates the edges of one complex from another complex's edges.
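The two output formats above (scored node lists and blank-line-separated edge lists) can be parsed as sketched below. This assumes whitespace-separated fields, with the score as the last field on each node-list line. The function names are hypothetical.

```python
def read_scored_complexes(lines):
    """Parse 'gene_ID1 gene_ID2 .. gene_IDN Score' lines into (members, score)."""
    complexes = []
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank lines and lines without members
        *members, score = parts
        complexes.append((members, float(score)))
    return complexes


def read_edge_list_blocks(lines):
    """Parse edge lists in which a blank line separates one complex from the next."""
    blocks, current = [], []
    for line in lines:
        parts = line.split()
        if not parts:  # a blank line closes the current complex
            if current:
                blocks.append(current)
                current = []
            continue
        u, v, w = parts  # raises ValueError if a line is not 'id1 id2 weight'
        current.append((u, v, float(w)))
    if current:
        blocks.append(current)
    return blocks
```

Each function takes any iterable of lines, so an open file object can be passed directly.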

Experiment 2: Training on MIPS and Testing on TAP-MS:

Experiment 2 Intermediate output results:

  • Training data, i.e. feature matrix of MIPS complexes (with edge weights from the DIP PPI network), available as a spreadsheet in which each row is the feature vector of a positive or negative protein complex. Format: density, number of nodes, degree statistics (maximum, mean, median and variance), clustering coefficient (CC) statistics (maximum, mean and variance), edge weight statistics (mean, maximum and variance), degree correlation (DC) statistics (mean, variance and maximum), 3 singular values of the subgraph's adjacency matrix, and label (1 for a positive protein complex, 0 for a negative protein complex)

  • Testing data, i.e. feature matrix of TAP-MS complexes (with edge weights from the DIP PPI network), available as a spreadsheet in which each row is the feature vector of a positive or negative protein complex. Format: density, number of nodes, degree statistics (maximum, mean, median and variance), clustering coefficient (CC) statistics (maximum, mean and variance), edge weight statistics (mean, maximum and variance), degree correlation (DC) statistics (mean, variance and maximum), 3 singular values of the subgraph's adjacency matrix, and label (1 for a positive protein complex, 0 for a negative protein complex)

Experiment 2 output results:

  • Trained community fitness function of MIPS complexes (with edge weights from DIP PPI network), available as pickled files of a data pre-processor and a machine learning model from sklearn. The files can be loaded in Python with the pickle module, for example: import pickle; obj = pickle.load(open(filename, 'rb'))

  • Learned protein complexes from DIP PPI network, available as node lists. Format: gene_ID1 gene_ID2 gene_ID3 .. gene_IDN Score. Each line represents a protein complex. The score is the value of the community fitness function for that protein complex.

  • Learned protein complexes from DIP PPI network, available as edge lists. Format: gene_ID1 gene_ID2 edge-weight. A blank line (i.e. two newline characters) separates the edges of one complex from another complex's edges.

Experiment 3: Training on MIPS and Testing on MIPS:

(Note: this experiment was performed only to allow comparison with NN, an existing supervised method.)

The train and test feature matrices are both the MIPS complexes' feature matrix from experiment 2. The trained community fitness function of MIPS complexes is the same as in experiment 2.

Experiment 3 output results:

  • Learned protein complexes from DIP PPI network, available as node lists. Format: gene_ID1 gene_ID2 gene_ID3 .. gene_IDN Score. Each line represents a protein complex. The score is the value of the community fitness function for that protein complex.

  • Learned protein complexes from DIP PPI network, available as edge lists. Format: gene_ID1 gene_ID2 edge-weight. A blank line (i.e. two newline characters) separates the edges of one complex from another complex's edges.

*Here, gene ID refers to the OLN (ordered locus name).