CS524 project 3: Molecular Universe


Source code and executable for Mac may be found here.
To run on Mac OS X, go to MolecularUniverse/build/Debug.

A YouTube video of Molecular Universe in action may be found here:



Introduction: 

One of the major challenges in systems biology is moving fluidly between the increasingly well-characterized 'small picture', consisting of individual genes and their behavior in specific circumstances, and the relatively poorly understood 'big picture', characterized by massive system-level data sets. Without the ability to toggle between these views, a biologist might struggle to bridge the gap between their understanding of a local molecular phenomenon, such as the activation of a signal-transduction pathway or the activating interaction between two proteins, and responses that are unmistakably system-wide.


Signal transduction pathway (right) and metabolic map (above) (KEGG database).


To 'bridge the gap', I decided to use a large display wall to show a large set of molecular pathways in a single view, with smooth movement between the pathways in view.

This project was, in many respects, more challenging than previous projects, and, as a result, it can be most accurately characterized as a 'work in progress'. However, it does afford the ability to examine a set of pathways from a different perspective, which I hope to expand and develop more fully in the future.


File processing

File processing consumed an inordinate portion of the time spent developing this project, and my goals in file processing shifted as the project developed. The following diagram illustrates the complexity of my data processing workflow:




Five data processing programs and a shell script ran to produce a set of data files, including:
- A complete human vtk graph data set
- Several versions of 1086 pathway vtk data files, with node local positions computed by graph-layout algorithms (force directed, clustered, circular, circular then force directed... )
- Sorted pathway name list
- Sorted gene name list
- List of pathways associated with each gene
- Several vtk Graph data files where nodes represent pathways and edges represent the degree of similarity between pathways, computed as the number of genes in common. These pathway nodes were distributed in space using a force-directed algorithm.
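The similarity measure in the last item (number of genes in common) can be sketched in plain C++. This is a minimal illustration, not the project's actual code: the names `sharedGeneCount`, `SimilarityEdge`, and `buildSimilarityEdges` are hypothetical, and each pathway is assumed to be available as a set of gene identifiers.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: number of genes two pathways have in common.
// std::set keeps its elements sorted, which std::set_intersection requires.
std::size_t sharedGeneCount(const std::set<std::string>& a,
                            const std::set<std::string>& b) {
    std::vector<std::string> common;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(common));
    return common.size();
}

// One weighted edge of the pathway-similarity graph (hypothetical type).
struct SimilarityEdge {
    std::size_t pathwayA;  // index of the first pathway
    std::size_t pathwayB;  // index of the second pathway
    std::size_t weight;    // number of shared genes
};

// Compare every pair of pathways; keep an edge only when they share a gene.
std::vector<SimilarityEdge> buildSimilarityEdges(
        const std::vector<std::set<std::string>>& pathways) {
    std::vector<SimilarityEdge> edges;
    for (std::size_t i = 0; i < pathways.size(); ++i) {
        for (std::size_t j = i + 1; j < pathways.size(); ++j) {
            const std::size_t w = sharedGeneCount(pathways[i], pathways[j]);
            if (w > 0) edges.push_back(SimilarityEdge{i, j, w});
        }
    }
    return edges;
}
```

With 1086 pathways this all-pairs loop makes 589,155 comparisons, so a quadratic pass of this kind is unlikely to dominate a pipeline that runs in minutes.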

The entire data processing pipeline runs in under 5 minutes. It was written in C++ using VTK. The code for this data processing step can be found on this page.

The original data sets are as follows:

    EDGE DATA
        Interaction data containing binary, undirected interactions between proteins was obtained from the Pathway Commons website. This data was available in several formats, the most convenient of which was the 'tab-delimited network' format, which contains binding information in the following format:

        CPATH_RECORD_ID_A    INTERACTION_TYPE    CPATH_RECORD_ID_B    GENE_SYMBOL_A    GENE_SYMBOL_B    INTERACTION_DATA_SOURCE    INTERACTION_PUBMED_ID
        100001    INTERACTS_WITH    265077    MT1G    NOT_SPECIFIED    INTACT    PUBMED:20711500;
        100001    INTERACTS_WITH    5187    MT1G    SPINK7    BIOGRID    PUBMED:12646258;
        100001    INTERACTS_WITH    5187    MT1G    SPINK7    BIOGRID    PUBMED:12970870;
        100001    INTERACTS_WITH    5187    MT1G    SPINK7    HPRD    PUBMED:12970870;
        ....

        CPATH_RECORD_ID_A = the pathway commons record id for component A of the interaction 

        INTERACTION_TYPE = the general classification of the interaction, which includes the following categories:
            INTERACTS_WITH
            REACTS_WITH
            IN_SAME_COMPONENT
            SEQUENTIAL_CATALYSIS
            STATE_CHANGE
            METABOLIC_CATALYSIS
            CO_CONTROL

        CPATH_RECORD_ID_B = the pathway commons record id for component B of the interaction

        GENE_SYMBOL_A = one common name for component A in the interaction

        GENE_SYMBOL_B = one common name for component B in the interaction

        INTERACTION_DATA_SOURCE = the database in which this interaction is recorded

        INTERACTION_PUBMED_ID = the PubMed identifier of the publication reporting the interaction

        The edge attribute file can be found on this page under 'homo-sapiens-9606-edge-attributes.txt'.
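The sample rows above list the same MT1G/SPINK7 interaction three times, once per data source or publication. A minimal sketch of reading such lines and keeping each undirected pair once follows. It assumes the fields are separated by tab characters and that collapsing repeated rows is desirable; the source does not say whether the project did this. `splitTabs` and `uniqueEdges` are hypothetical names.

```cpp
#include <algorithm>
#include <set>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Split one line on tab characters (interior empty fields are kept).
std::vector<std::string> splitTabs(const std::string& line) {
    std::vector<std::string> fields;
    std::string field;
    std::istringstream in(line);
    while (std::getline(in, field, '\t')) fields.push_back(field);
    return fields;
}

// Collect each undirected interaction once, keyed by the two CPATH record ids.
// The header line and lines with fewer than three fields are skipped.
// Assumption: rows that differ only in source or PubMed id count as one edge.
std::set<std::pair<long, long>> uniqueEdges(
        const std::vector<std::string>& lines) {
    std::set<std::pair<long, long>> edges;
    for (const std::string& line : lines) {
        const std::vector<std::string> f = splitTabs(line);
        if (f.size() < 3 || f[0] == "CPATH_RECORD_ID_A") continue;
        const long a = std::stol(f[0]);  // throws if the id is not numeric
        const long b = std::stol(f[2]);
        // Store the smaller id first so (a, b) and (b, a) are the same edge.
        edges.emplace(std::min(a, b), std::max(a, b));
    }
    return edges;
}
```

Keying on the CPATH record ids rather than the gene symbols avoids merging unrelated partners whose symbol is NOT_SPECIFIED.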

    NODE DATA
        This interaction data was related to gene data, found in a separate file, 'homo-sapiens-9606-node-attributes.txt'.
        This data was formatted as follows:

        CPATH_RECORD_ID    GENE_SYMBOL    UNIPROT_ACCESSION    ENTREZ_GENE_ID    CHEBI_ID    NODE_TYPE    NCBI_TAX_ID
        100001    MT1G    P13640    4495    NOT_SPECIFIED    protein    9606
        100003    MFAP5    B0AZL6    8076    NOT_SPECIFIED    protein    9606
        100005    MGST2    Q99735    4258    NOT_SPECIFIED    protein    9606
        100007    MTBP    Q9HA89    27085    NOT_SPECIFIED    protein    9606

        CPATH_RECORD_ID = The pathway commons record id for the gene

        GENE_SYMBOL = one common name for the gene

        UNIPROT_ACCESSION = the accession number for the protein in the UniProt database

        ENTREZ_GENE_ID = the gene id in the Entrez Gene database

        CHEBI_ID = the identifier in the Chemical Entities of Biological Interest (ChEBI) database

        NODE_TYPE = 'protein' or 'small_molecule'

        NCBI_TAX_ID = the National Center for Biotechnology Information (NCBI) taxonomy identifier of the organism (9606 for Homo sapiens)
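A node line in this layout can be read into a small record type. The sketch below is illustrative only: `NodeRecord`, `parseNodeLine`, and `isSpecified` are hypothetical names, and tab-separated fields are assumed. Identifiers other than the CPATH record id are kept as text because any of them may hold the placeholder NOT_SPECIFIED.

```cpp
#include <sstream>
#include <string>
#include <vector>

// One row of the node attribute file (hypothetical type).
struct NodeRecord {
    long cpathId;
    std::string geneSymbol;
    std::string uniprotAccession;
    std::string entrezGeneId;  // text, since it may be NOT_SPECIFIED
    std::string chebiId;       // text, since it may be NOT_SPECIFIED
    std::string nodeType;      // 'protein' or 'small_molecule'
    std::string ncbiTaxId;
};

// Parse one tab-separated node line.
// Returns false for the header line or a line with fewer than seven fields.
bool parseNodeLine(const std::string& line, NodeRecord& out) {
    std::vector<std::string> f;
    std::string field;
    std::istringstream in(line);
    while (std::getline(in, field, '\t')) f.push_back(field);
    if (f.size() < 7 || f[0] == "CPATH_RECORD_ID") return false;
    out = NodeRecord{std::stol(f[0]), f[1], f[2], f[3], f[4], f[5], f[6]};
    return true;
}

// True when a field carries a real identifier rather than the placeholder.
bool isSpecified(const std::string& value) {
    return value != "NOT_SPECIFIED";
}
```

Checking `isSpecified` before using an identifier as a lookup key makes it explicit which records would otherwise drop out of later processing steps.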


    PATHWAY DATA
        Pathway data was obtained from the Pathway Commons database. Although pathway data was available in several formats, the 'gene set enrichment' format was selected. This data was formatted as follows:

Gap-filling DNA repair synthesis and ligation in GG-NER    REACTOME    15962:protein:DNLI1_HUMAN:P18858:LIG1:3978    49999:protein:RFA2_HUMAN:Q5TEJ5:RPA2:6118    49461:protein:RFA1_HUMAN:A8K0Y9:RPA1:6117    50525:protein:RFA3_HUMAN:P35244:RPA3:6119    15960:protein:DPOD2_HUMAN:P49005:POLD2:5425    14002:protein:DPOD1_HUMAN:Q96H98:POLD1:5424    14894:protein:DPOD3_HUMAN:B7ZAI6:POLD3:10714    15548:protein:DPOD4_HUMAN:Q9HCU8:POLD4:57804    93240:protein:PCNA_HUMAN:B2R897:PCNA:5111    49928:protein:RFC3_HUMAN:O15252:RFC3:5983    49392:protein:RFC5_HUMAN:P40937:RFC5:5985    50533:protein:RFC1_HUMAN:A8K6E7:RFC1:5981    50835:protein:RFC4_HUMAN:Q6FHX7:RFC4:5984    50077:protein:RFC2_HUMAN:Q9BU93:RFC2:5982    15166:protein:DPOE1_HUMAN:Q07864:POLE:5426    14578:protein:DPOE2_HUMAN:P56282:POLE2:5427   
       
        Each line contained the name of the pathway, its source database, and a colon-delimited record of identifiers for each gene in the pathway.

        In the course of this project I attempted to use several other pathway data files, with limited success, due to incomplete encoding of nodes with Entrez Gene identifiers.
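A line in this format can be split in two stages: first into tab-separated columns, then each member column on ':'. The sketch below makes two assumptions that the source does not state outright: that the columns are separated by tabs (the pathway name itself contains spaces), and that the six colon-separated parts are, in order, CPATH record id, node type, UniProt entry name, UniProt accession, gene symbol, and Entrez Gene id, as the sample line suggests. `Pathway`, `PathwayMember`, `split`, and `parsePathwayLine` are hypothetical names.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// The two member fields this sketch keeps (hypothetical type).
struct PathwayMember {
    long cpathId;
    std::string geneSymbol;
};

// One pathway line (hypothetical type).
struct Pathway {
    std::string name;
    std::string source;
    std::vector<PathwayMember> members;
};

// Split text on a single delimiter character.
std::vector<std::string> split(const std::string& text, char delim) {
    std::vector<std::string> parts;
    std::string part;
    std::istringstream in(text);
    while (std::getline(in, part, delim)) parts.push_back(part);
    return parts;
}

// Parse "name <tab> source <tab> member <tab> member ...".
// Returns false when the line lacks a name and a source column.
bool parsePathwayLine(const std::string& line, Pathway& out) {
    const std::vector<std::string> f = split(line, '\t');
    if (f.size() < 2) return false;
    out.name = f[0];
    out.source = f[1];
    out.members.clear();
    for (std::size_t i = 2; i < f.size(); ++i) {
        const std::vector<std::string> parts = split(f[i], ':');
        if (parts.size() < 6) continue;  // skip blank or malformed members
        // parts[0] is assumed to be the CPATH id, parts[4] the gene symbol.
        out.members.push_back(PathwayMember{std::stol(parts[0]), parts[4]});
    }
    return true;
}
```

Keeping the CPATH record id for each member means the pathway can be joined to the node and edge files without relying on Entrez Gene ids.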



Using Molecular Universe

Molecular Universe opens on a distant view of a set of pathways. The personal computer version visualizes 300 pathways; the Wall version visualizes all 1086.

Manually exploring the space:
Zoom in and out by scrolling, or by using two fingers on your trackpad.
Rotate about the actors in space by holding down the right mouse button, or by clicking and dragging the mouse.
Shift right and left by holding 'Shift' while dragging the mouse.
Spin the actors around in space by holding 'Control' (or the 'Apple key' on Mac) while dragging the mouse.

Manual exploration allows you to target a pathway and explore it from multiple perspectives.



Built-in exploration features:

Since it is difficult to find pathways through random exploration, targeted navigation is provided through several mechanisms.

In the lower left-hand corner of the window, there is a 'Select Pathway' box. By scrolling through the list or typing a gene of interest, the user can move immediately to view the corresponding pathway.

In addition, the user may target any pathway listed in the Selected-Pathway-List-Box, on the left-hand side. Clicking on one of these pathways moves the camera toward it.

Note: Significant problems arose in directing the camera to the appropriate position for each pathway, and this feature needs to be improved for better targeting. In addition, if the user tries to move through space manually after targeting a particular pathway, the camera continues to pursue its target position over the selected pathway instead of responding to the user. In future versions of this program it will be important to develop a custom 'interactor' class that is more responsive to user requests.
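One possible shape for the fix is a camera glide that stops as soon as the user takes over. The sketch below is independent of VTK and is not the program's current behavior: `CameraGlide` is a hypothetical type, and how `cancel()` would be wired to mouse or keyboard events in a custom interactor is left open.

```cpp
#include <array>

// Hypothetical camera glide toward a target position.
struct CameraGlide {
    std::array<double, 3> position;  // current camera position
    std::array<double, 3> target;    // position over the selected pathway
    bool active;                     // false once the user has taken over

    // Call from the user-input handler so manual exploration wins.
    void cancel() { active = false; }

    // Move a fraction of the remaining distance toward the target.
    // Intended to be called once per frame; does nothing after cancel().
    void step(double fraction) {
        if (!active) return;
        for (int i = 0; i < 3; ++i) {
            position[i] += (target[i] - position[i]) * fraction;
        }
    }
};
```

Because each step covers a fixed fraction of the remaining distance, the motion slows as the camera approaches the pathway, and a single `cancel()` call is enough to hand control back to the user.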


Gene selection and highlighting:

Individual genes of interest may be selected by the user in the 'Human gene selection' box.

After a gene has been selected, it is added to the 'gene selected' box.



Pathways associated with these genes are listed in the box below, allowing the user to target and view pathways that are likely to be of interest.
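This lookup amounts to an inverted index from gene to pathways. A minimal sketch follows, assuming pathway membership is available as a map from pathway name to gene symbols; `GeneToPathways`, `buildGeneIndex`, and `pathwaysForSelection` are hypothetical names, not the program's own.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Gene symbol -> names of the pathways that contain it (hypothetical alias).
using GeneToPathways = std::map<std::string, std::set<std::string>>;

// Invert a pathway -> genes table into a gene -> pathways index.
GeneToPathways buildGeneIndex(
        const std::map<std::string, std::vector<std::string>>& pathwayGenes) {
    GeneToPathways index;
    for (const auto& entry : pathwayGenes) {
        for (const std::string& gene : entry.second) {
            index[gene].insert(entry.first);
        }
    }
    return index;
}

// Union of the pathways associated with every selected gene.
// Genes missing from the index contribute nothing.
std::set<std::string> pathwaysForSelection(
        const GeneToPathways& index,
        const std::vector<std::string>& selectedGenes) {
    std::set<std::string> result;
    for (const std::string& gene : selectedGenes) {
        const auto it = index.find(gene);
        if (it != index.end()) {
            result.insert(it->second.begin(), it->second.end());
        }
    }
    return result;
}
```

Returning a `std::set` gives a sorted list with no duplicates, which suits a list box where each pathway should appear once.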







What can be seen using Molecular Universe:

The most apparent feature of Molecular Universe is the ability to view related pathways in one view.

Here, for instance, a set of signalling pathways and cascades appear in close proximity, indicating that they share many genes in common.

And here, a set of activation and energy-harnessing pathways appear in one view. 



Though of limited utility at this time, three-dimensionality does provide the ability to overlay pathways and view them from a variety of perspectives.



Limitations:

Despite my best efforts, I was unable to create a program that runs at comfortable interactive speeds on a standard personal computer when the entire data set is loaded. As a result, I load a subset of the pathway graphs to explore the possibilities that this view has to offer.
I hope in the future to develop a more efficient program that allows the user to truly bridge the divide between 'system-view' and 'local-view'.

Data processing was a significant challenge.  Several data processing programs were written, and abandoned after it became clear that the data set being used was not sufficiently complete or was too complex.  For instance, one pathway data file used entrez-gene ids to identify genes.  Believing this to be a reliable identifier, I used these data sets to compute all files in the pipeline.  Problems arose, however, when it became clear that elements and pathways were excluded from my algorithms because nodes were identified as 'not-specified', even when entrez-gene ids existed in the PubMed databases.

After encountering these problems, I re-wrote my programs to use 'pathway-commons' ids.  Since the data was from the pathway commons website, most genes and molecules were encoded with a unique 'CPath_ID'. 

Further challenges involved using VTK formats for graphs. In addition to having a global vertex index into the graph, each gene also had a set of indices into the pathways to which it belonged. Translating between a gene's index in one pathway and its index in another was a significant organizational challenge. Further, gene symbols, which are recognizable to scientists, had to be mapped to CPath identifiers as well as to vertex ids in different pathways so the information could be presented in a human-understandable form.
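One way to organize this bookkeeping is a pair of lookup tables that route every translation through a global identifier. The sketch below is a hypothetical design, not the program's actual data structure: `PathwayIndex` is an invented name, and the global id is assumed to be something stable such as the CPath record id.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical index relating per-pathway vertex ids to a global gene id.
class PathwayIndex {
public:
    // Record that the gene with this global id is vertex localId
    // in the named pathway.
    void add(const std::string& pathway, long globalId, long localId) {
        localToGlobal_[Key(pathway, localId)] = globalId;
        globalToLocal_[Key(pathway, globalId)] = localId;
    }

    // Translate a vertex id in pathway 'from' into the vertex id of the
    // same gene in pathway 'to'. Returns -1 when the vertex is unknown or
    // the gene does not occur in the destination pathway.
    long translate(const std::string& from, long localId,
                   const std::string& to) const {
        const auto g = localToGlobal_.find(Key(from, localId));
        if (g == localToGlobal_.end()) return -1;
        const auto l = globalToLocal_.find(Key(to, g->second));
        return l == globalToLocal_.end() ? -1 : l->second;
    }

private:
    using Key = std::pair<std::string, long>;  // (pathway name, id)
    std::map<Key, long> localToGlobal_;
    std::map<Key, long> globalToLocal_;
};
```

For example, after `add("A", 5187, 1)` and `add("B", 5187, 0)`, the call `translate("A", 1, "B")` yields 0. A third table from gene symbol to global id could be added in the same style to cover the symbol lookups.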

Though this program begins to tackle these challenges, better methods are needed to truly organize these data sets efficiently.


Future work:

The first step is to improve the appearance of the networks on individual planes. This may be accomplished through a variety of means, most importantly layout and edge presentation. I hope to adopt the method presented in a set of recent publications from the Caleydo group: using a different but related set of data from the KEGG database, nodes are positioned in space according to their positions in the hand-drawn images found in KEGG. These files are in XML format and can be read by XML readers. After nodes have been positioned in this manner, I hope to overlay the original image on the plane, thus relating hand-drawn features to interactive and selectable nodes.

Better interaction with the nodes and pathways in the graph will be a top priority. The user must be able to see gene text labels either as a default state, upon hovering, or on clicking. I worked to include this feature in the current version, but there were a variety of problems. First, if text was displayed by default, the program ran slowly. Second, displaying text on user selection requires a better means of organizing the data, so that the location of the selection in space can be quickly related to a gene name.

Three-dimensionality provides a variety of new ways to approach biological network exploration. For instance, color need not be used to indicate measures of gene activity and expression, such as DNA microarray data, as is typical in two-dimensional gene network exploration programs. With three dimensions, these measures can instead be represented as elevation above the plane.
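As a small illustration of the idea, an expression value could be mapped linearly onto a height above the pathway plane. This is a sketch of one possible mapping, not an implemented feature: `elevationFor` and its parameters are hypothetical, and a linear, clamped scale is an assumption.

```cpp
#include <algorithm>

// Map an expression measurement onto a height above the pathway plane.
// Values at or below minValue sit on the plane (height 0); values at or
// above maxValue reach maxHeight; values in between scale linearly.
double elevationFor(double value, double minValue, double maxValue,
                    double maxHeight) {
    if (maxValue <= minValue) return 0.0;  // degenerate range: stay on plane
    const double t = (value - minValue) / (maxValue - minValue);
    return std::min(1.0, std::max(0.0, t)) * maxHeight;
}
```

Clamping keeps outlying measurements from pushing a node far away from its pathway plane.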

I hope to augment the current view with a 'stacked planes' view, where pathways selected by the user may be stacked on top of each other in a new window.  This allows the user to pick which pathways are of interest to him or her, and view them all at once. 
 
Use of large displays and touch displays will also feature prominently in future work on this project.

