Research Projects

I. Autonomous Data Integration and Workflow Querying

VisFlow Project: Autonomous Integration of Heterogeneous Internet Resources: In our recent research, we have shown that our data integration system, VisFlow, is general and powerful enough to support a wide class of application domains, including the life sciences and ecology. The VisFlow middleware system supports integration of databases with widely heterogeneous schemas, representation styles, and data types. Several technologies that have been maturing over the years play a significant role in shaping its architecture, including a workflow-like query language, BioFlow, for querying hidden-web resources. Apart from the language itself, ontology generation, wrapper construction, mediation, and query processing are integral parts of this project. BioFlow supports traditional querying and workflow description in a single platform without “impedance mismatch.” We have proposed an algebraic characterization for BioFlow, named Integra, so that it can be implemented directly in a way similar to relational algebra. VisFlow is a completely visual-language-based query-building platform on the web and is currently available, free of charge, for public use.

AutoPipe Project: Graph based Autonomous Analysis Pipeline Composition: This project advances VisFlow further by introducing higher levels of abstraction so that users are able to view their application world naturally. The main components of this research are sophisticated graph representation, summarization, and matching techniques with constraints. We are developing technologies to discover knowledge resources (such as databases, analytics, online tools, etc.) and to summarize and catalogue them on an on-demand basis when a query is asked. Queries are usually framed in natural language as a series of steps, and the AutoPipe system attempts to map the query intent onto the underlying resources that are already catalogued, or to discover one when none can be found in the repository. It uses story understanding and context recognition technologies to parse and map the computational steps expressed in natural language. The high-level descriptions are transformed into layered or summarized graph representations so that matches between potential resources and query needs can be found using the graph matching techniques we are developing.
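To make the matching step concrete, the sketch below uses the off-the-shelf networkx library to test whether a small query graph of pipeline steps embeds in a catalogued resource graph; the graphs, labels, and the simple label-equality constraint are hypothetical stand-ins for AutoPipe's summarized representations and constraint machinery.

    # Minimal sketch: label-aware subgraph matching with networkx.
    # Graphs and 'kind' labels are hypothetical stand-ins for AutoPipe's
    # summarized resource and query graphs.
    import networkx as nx
    from networkx.algorithms import isomorphism

    resource = nx.DiGraph()
    resource.add_edges_from([("fetch", "align"), ("align", "annotate"),
                             ("annotate", "report")])
    nx.set_node_attributes(resource, {n: {"kind": n} for n in resource})

    query = nx.DiGraph()
    query.add_edges_from([("align", "annotate")])
    nx.set_node_attributes(query, {n: {"kind": n} for n in query})

    # Match only nodes whose 'kind' labels agree (a simple constraint).
    matcher = isomorphism.DiGraphMatcher(
        resource, query,
        node_match=lambda a, b: a["kind"] == b["kind"])
    print(matcher.subgraph_is_isomorphic())  # True: the query embeds in the resource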

II. Biological Network Analysis

An Integrative Approach to Disease Candidate Gene Prioritization and Gene Regulatory Network Discovery

Network databases such as protein-protein interaction networks, phylogenies, and gene networks are large, and their analysis takes a substantial amount of processing time. One of the key techniques used in such analyses requires matching graphs based on isomorphism and nearness. The graph and tree matching algorithms we have developed are used in the following three projects to improve matching accuracy and scalability. Using the techniques we have developed for network analysis, we are also developing NetDB, a system we intend to be more powerful and intuitive than the popular Neo4j, along with a declarative query language for general graph data management.

GeneScope Project: Disease Candidate Gene Prioritization: Many variations of the so-called guilt-by-association (GBA) approach have been used to predict functions for a fraction of the huge number of proteins whose roles in biology are unknown or poorly characterized. Because genes associated with a particular disease (i.e., disease genes) are likely to be well connected within the interaction network, we identify the most well-connected subnetworks from a large number of possible subnetworks by searching through chromosomal loci, each of which has many candidate disease genes, for a subset of genes that are well connected in the interaction network. To manage the high computational overhead of searching for modules (connected communities or subnetworks) in networks, we have developed an efficient method for selecting candidate genes from each locus. We selected a number of disease genes pertaining to a specific disease from the OMIM Morbid Map. For each disease gene, we created an artificial locus by defining a region containing the fifty genes nearest to it according to a genomic distance measure. In our experiments, we varied the percentage of known disease genes used as seed nodes and applied our approach to recover the remaining disease genes in the extracted subgraphs. Our stochastic approach reveals that we are able to identify most of the relevant subgraphs in an interaction network, and its stochastic nature guarantees significantly improved turnaround time compared to the brute-force approach. The approach is thus likely to scale to any biological network data.
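A minimal greedy variant of the module search can be sketched as follows; the interaction network, seeds, and candidates are hypothetical, and the sketch omits the locus construction and stochastic sampling that make the real GeneScope search scalable.

    # Sketch of guilt-by-association module growth (hypothetical data;
    # the real GeneScope search is stochastic and locus-aware).
    import networkx as nx

    ppi = nx.Graph([("A", "B"), ("B", "C"), ("C", "D"), ("B", "E"), ("E", "F")])
    seeds = {"A", "B"}                 # known disease genes
    candidates = {"C", "E", "F"}       # genes drawn from artificial loci

    module = set(seeds)
    while candidates:
        # connectivity of each candidate to the current module
        score = {g: sum(1 for n in ppi[g] if n in module) for g in candidates}
        best = max(score, key=score.get)
        if score[best] == 0:           # no remaining candidate is connected
            break
        module.add(best)
        candidates.remove(best)
    print(module)                      # e.g. {'A', 'B', 'C', 'E', 'F'}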

NetExpress Project: Gene Regulatory Network: Neurodegeneration is linked to a number of major diseases, including multiple sclerosis, leukodystrophies, and Alzheimer’s and Parkinson’s diseases. Mutant gene expression in oligodendrocytes is also hypothesized to be responsible for protein misfolding and retention in the endoplasmic reticulum (ER), leading to hypomyelination via the unfolded protein response (UPR). One arm of this metabolic stress response, called the PERK pathway, plays a significant role in the attenuation of global protein translation and in cell cycle arrest. Recent studies in our collaborator’s laboratory on rumpshaker (rsh) and myelin synthesis deficiency (msd) mutant mice suggest that crosstalk between the PERK and p53 pathways may be the key to deciphering the mechanism of oligodendrocyte cell death. This possibility is plausible because p53 is known to be involved in the cell cycle progression checkpoint at the G1-S boundary, and because ATF4 interacts with p53 via ATF3 in the PERK pathway. However, initial laboratory studies cast doubt on the involvement of p53 in oligodendrocyte death under UPR conditions.

This observation raises the possibility of novel pathways in oligodendrocytes that are influenced by p53. The goal of the current study is to develop a computational approach to define the role of p53 and investigate its contribution to the UPR signaling cascade. To accomplish this task, my collaborator and I have developed a toolkit, called NetExpress, that can be used to identify and extract the set of the k most promising genes from a gene expression time course. We designed NetExpress to support multiple data formats, such as Affymetrix, Agilent, and Illumina. A simple user interface requires only setting an expression threshold, in the form of fold change per unit time, to distinguish true expression changes from background noise. The output is a set of k co-regulated genes listed in order of significance and promise. The expression threshold acts as a modulator that indirectly controls the value of k: the higher the threshold, the smaller the k. Additionally, NetExpress allows modifying the default values of the STEM statistical clustering algorithm directly through a customization screen. It also allows inspecting the model profiles from which the gene lists are produced. Further details of the individual profiles may be viewed, downloaded, and saved. The gene lists produced by NetExpress can optionally be piped to the regulatory network prediction module. This module supports prediction of gene regulatory networks using three algorithms: RNN, Genie3, and BicAT-Plus. These algorithms have different objectives and strengths and thus offer an array of choices. Users can run any or all three algorithms to generate candidate networks and then derive a consensus network for validation through network analysis, using our graph matching system TraM and databases such as Reactome and KEGG. Recently, UI’s Center for Modeling Complex Interactions (CMCI) invited me to submit a pilot grant application on NetExpress under the NIH INBRE grant program for possible funding after a first round of reviews.
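The thresholding step can be pictured with a few lines of Python; the data, time points, and the per-hour fold-change formula below are simplifying assumptions, not NetExpress's exact computation.

    # Sketch: filter a time course by fold change per unit time (hypothetical data).
    import pandas as pd

    # rows = genes, columns = expression at hours 0, 2, 4
    expr = pd.DataFrame({"t0": [10.0, 8.0], "t2": [40.0, 9.0], "t4": [90.0, 9.5]},
                        index=["geneA", "geneB"])
    threshold = 1.5        # required fold change per unit time

    hours = [0, 2, 4]
    # maximum fold change per hour over consecutive time points
    rate = pd.concat(
        [(expr.iloc[:, i + 1] / expr.iloc[:, i]) / (hours[i + 1] - hours[i])
         for i in range(len(hours) - 1)], axis=1).max(axis=1)
    selected = expr[rate >= threshold]   # geneA passes, geneB does not
    print(selected.index.tolist())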

PhyloBase Project: Evolutionary Validation of Genes and Gene Products: Phylogenetic validation of results is a goal of many biological analyses, or an intermediate step in the search for information or in isolating gene products for analysis. The tool-driven search engines supported in some of the phylogenetic databases are not powerful or flexible enough to aid researchers adequately in their search for evolutionary evidence. Our system, PhyloBase, aims to provide a storage mechanism and a convenient visual-language-based querying capability over all of these phylogenies in an integrative fashion. The optimization strategies employed in PhyloBase’s query language, PhyQL, leverage subgraph isomorphism, shortest-path, and hub indexing techniques to speed up processing. We expect to design a new list intersection method and an improved cost estimation technique to further reduce processing time. The PhyQL optimization techniques serve as a prelude to the implementation of the full-scale graph query processing engine, NetDB, that we are currently planning.
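As a baseline for the planned list intersection work, the sketch below shows the standard two-pointer merge over sorted posting lists of node IDs; the improved method and cost estimator mentioned above are still being designed.

    # Baseline merge-based intersection of two sorted ID lists,
    # as used when combining index posting lists during query processing.
    def intersect(a, b):
        i = j = 0
        out = []
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                out.append(a[i])
                i += 1
                j += 1
            elif a[i] < b[j]:
                i += 1
            else:
                j += 1
        return out

    print(intersect([1, 4, 7, 9, 12], [2, 4, 9, 30]))  # [4, 9]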

III. Social Analysis and Social Networking System for Education

Cascading Experiences Project: Social Network Analysis of Learners’ Online Activities: Tracking learners over time to understand the impact of an educational experience has been a longstanding challenge. Historically, there has been no way to do so, but now, with massive amounts of data being shared and stored online, education researchers have an unprecedented opportunity to study actual evidence of the long-term effects of educational experiences through the application of “big data” analytics and visualization technologies. This project builds tools that allow educators to identify the long-term, cascading effects of virtually any educational experience, including out-of-school STEM-related learning experiences. If successful, educators would be able to discover whether visiting a science center exhibition, watching a science special on television, or even participating in a classroom science lesson, for example, results in an increase in online social media discussion of these topics, or whether an experience stimulates the public to initiate searches, through search engines, for further information on the presented topics. This information, in turn, will allow educators not only to gauge how effective their programs are but also to see which aspects of the educational experience generate further interest and engagement.

The research has the potential to be relevant far beyond informal science education by advancing the use of data mining and data analysis processes to better understand how individuals communicate, interact and learn over time, across a range of social networks (such as Facebook, Twitter, and LinkedIn) and online platforms (such as Google, Yahoo! and Bing). Some of the questions we are exploring include:

  1. Do learners who engage in informal STEM education experiences like visiting a science museum or watching science-related television shows, further their learning through discussions and sharing of information on social media networks?
  2. Which types of data are present in online platforms that are relevant for understanding the cascading impacts of learning experiences over time?
  3. Do patterns of learning and online engagement evolve independently post-experience, and are there ways to track these pathways over time using big data analytics?
  4. Is it possible to use advanced data visualization techniques to make such a model easily understandable and usable by practitioners?

GreenShip Project: A Social Networking System for K-12, College and Life Long Learners: Our experience with the Cascading Experiences project suggests that educational sites such as EdModo, SumDog, MineCraftEdu, and edConnectr offer platforms where educators and learners can come together to share thoughts, ideas, and educational materials, often in a personalized manner. Some sites support interactions among members to foster peer-to-peer, collaborative, and group learning. But these sites often do not consider FERPA-related confidentiality and privacy. More seriously, they ignore the potential for abuse and bullying that negatively impacts younger children and adolescents. To support online learning and track learners, we plan to develop an entirely new social networking system in which members will be able to learn and socialize with minimal risk to their educational and other confidential information and, equally importantly, their social reputation.

Personal social reputation in online social networks is fundamentally different from enterprise reputation. Cyber-bullying, stalking, and revenge porn are considerable hazards, and contemporary social network systems such as Facebook do very little to combat such cyber-attacks. In our new system, we use a new friends’ social network model and a new query language, called GreenShip. We believe GreenShip users will be better equipped to combat online harassment and reputation damage than users of traditional online social networks. We are also developing a more granular information sharing and privacy model based on object-oriented languages, called PiQL, which we expect to be more intuitive and powerful than those of current social networking systems such as Facebook.
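As a rough illustration of what object-granular sharing might look like (all names here are hypothetical; PiQL's actual model and syntax are still under development), each attribute of a profile object could carry its own audience, checked at read time:

    # Hypothetical sketch of object-granular privacy checks in the spirit of PiQL.
    class ProfileField:
        def __init__(self, value, audience):
            self.value = value
            self.audience = set(audience)  # circles allowed to read this field

    class Profile:
        def __init__(self):
            self.fields = {}
        def set(self, name, value, audience):
            self.fields[name] = ProfileField(value, audience)
        def read(self, name, reader_circles):
            field = self.fields[name]
            if field.audience & set(reader_circles):
                return field.value
            return None                    # denied: no overlapping circle

    p = Profile()
    p.set("school", "Moscow High", audience={"classmates", "teachers"})
    print(p.read("school", {"teachers"}))  # 'Moscow High'
    print(p.read("school", {"public"}))    # None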

IV. Computational Thinking: CS for All

Online Programming for CS0, CS1, CS2 and Databases

MindReader Project: As a largely rural state, Idaho lacks qualified teachers and instructors in its schools to offer the support and mentoring required for effective learning, especially for cyber-learners. The vast majority of teachers are concentrated in cities such as Moscow, Boise, and Idaho Falls. To mitigate this deficiency, the State of Idaho is embarking upon an ambitious initiative to encourage digital learning through the Idaho Digital Learning Academy (IDLA), with complementary activities through the Idaho STEM Action Center. These initiatives aim to increase availability and access, improve effectiveness, and reduce cost. While they focus on instruction delivery and content development, they are substantially less focused on effective online mentoring and assessment. My CS for All project aims to support these initiatives by developing online tools for digital, automated mentoring and authentic assessment for STEM learners.

The UI’s Computer Science Department is capable of offering a Computational Thinking course as a dual-credit AP class with IDLA support. Currently, it is offered at only one high school. We believe this course can be improved with digital assistive technology that provides mentoring and assessment resources fully automatically. When perfected, the proposed technology can be used to support other IDLA classes as well.

Online Tutoring and Authentic Assessment of Introductory Programming Classes: Once instruction is delivered, learners need assistive tools to guide them in assimilating the content in a self-paced, on-demand, and interactive manner. The goal should be to progressively prod the learner with low-level hints that encourage critical thinking and analysis, helping her devise the solution or response on her own. Such strategies can be coupled with a personalized learning profile and the subject’s learning objectives set out in the curricular requirements. An intelligent tutoring agent can learn individual learning strengths and limitations and help the learner accordingly. Once the learner is ready to demonstrate how much she has learned, and thus solve problems requiring mastery and analytical thinking, an active online assessment system can evaluate subjective responses to questions. Instead of relying on predefined assessment items, such a system can understand each student’s response and map it to a continuum of increasingly rigorous questions. For such an evaluation system, a large body (big data) of candidate responses, from which the best match can be mined and ranked, will be instrumental. Pairing the ranking with the learning objective and curricular requirement can aid in estimating a learner’s knowledge depth. In this project, we plan to build such a system, called MindReader, to assist learners with cyber mentoring and online assessment.
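The mining-and-ranking step can be sketched with standard text-similarity machinery; the response bank below is hypothetical, and the TF-IDF model is an illustrative stand-in rather than MindReader's actual matching technique, which is structural, as described next.

    # Sketch: rank stored candidate responses against a student's answer
    # by cosine similarity (hypothetical response bank).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    bank = ["a loop that sums the elements of an array",
            "recursion that reverses a linked list",
            "a loop that finds the maximum element of an array"]
    answer = ["iterate over the array and add each element to a running sum"]

    vec = TfidfVectorizer().fit(bank + answer)
    scores = cosine_similarity(vec.transform(answer), vec.transform(bank))[0]
    ranked = sorted(zip(scores, bank), reverse=True)
    print(ranked[0][1])   # best-matching model response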

Tutoring and Assessment in MindReader: Under the assumption that a sound knowledgebase can be used to deductively understand code segments in a hierarchical fashion, by first deconstructing code and then reconstructing it from elementary knowledge and equivalence rules, MindReader automatically offers tutoring support and formative assessment to novice learners in imperative programming classes such as C++. The system is currently poised to be used to assist high school teachers participating in a certification program to teach dual-credit and AP classes. MindReader is able to understand a wide variety of the elementary algorithms students learn in entry-level programming classes in languages such as Java, C++, and Python. It is able to assess student assignments and guide students toward correct and better code in real time, without human assistance. A complementary system for developing and delivering web-based interactive autonomous lectures, called vTutor, is scheduled to be presented at the ICWL 2018 conference in August 2018.
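A toy version of the deconstruction idea, assuming exact tree identity after canonical renaming (MindReader itself reasons over a knowledgebase of equivalence rules rather than exact identity), can be written against Python's own ast module:

    # Sketch: two code fragments are 'equivalent' here if their ASTs agree
    # after variables are renamed in order of first use.
    import ast

    class Canon(ast.NodeTransformer):
        def __init__(self):
            self.names = {}
        def visit_Name(self, node):
            node.id = self.names.setdefault(node.id, f"v{len(self.names)}")
            return node

    def equivalent(src_a, src_b):
        def canon(src):
            return ast.dump(Canon().visit(ast.parse(src)))
        return canon(src_a) == canon(src_b)

    print(equivalent("s = 0\nfor x in xs:\n    s += x",
                     "t = 0\nfor v in vals:\n    t += v"))   # True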

Programming with CoCo (Conceptual Computing): With the increasing emphasis being placed on computational thinking as a fundamental skill, as part of MindReader we are developing an array of conceptual programming toolkits to help students learn programming at multiple abstraction levels and reinforce their knowledge conceptually. We are developing a new visual language, called Patch, with which students can express their solutions to eScience computational problems using abstract visual tools. Patch is closer to high-level procedural languages such as C++ or Java than Scratch or Snap!, but combines simplicity and expressive power in a single platform. We are also building a visual language, called LogChart, which can be used to express program logic using predefined icons similar to flowcharts and mapped to C++, Java, or Python automatically. Finally, we are developing an NLP-based programming platform, called DiaGram, which can be used to express programming logic at an even higher level of abstraction and effortlessly mapped to a target language. We have already presented an initial version of this language, under the name BioSmart, at the 2017 ACM BCB conference, and an advanced version of it on the mobile iOS platform, called Cyrus, for teaching SQL querying has been submitted to the 2018 IEEE TALE conference for review. Cyrus maps natural language queries to a wide class of SQL queries; it is under extensive testing and is set to be used in our first database class in Fall 2018 and in AP classes offered through IDLA. Initial results have demonstrated significant success with the overall CoCo project.
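The flavor of such mappings can be conveyed with a deliberately tiny pattern-based translator; this single-rule sketch is a toy, not how Cyrus or DiaGram actually work, both of which use full NLP pipelines.

    # Toy sketch of one natural-language-to-SQL mapping rule.
    import re

    def to_sql(question):
        # e.g. "show the name of students where gpa is 4.0"
        m = re.match(r"show the (\w+) of (\w+) where (\w+) is (\S+)", question)
        if m:
            col, table, wcol, val = m.groups()
            return f"SELECT {col} FROM {table} WHERE {wcol} = {val};"
        return None

    print(to_sql("show the name of students where gpa is 4.0"))
    # SELECT name FROM students WHERE gpa = 4.0;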

V. Smart Interfaces to Improve Usability and Value

Database Usability Project: In theory, the only way for database administrators to prevent unauthorized access and malicious computations is to constantly design customized interfaces that allow users tailored access to the information they need. Needless to say, practical constraints prevent such customized access. Users must therefore sift through a huge pile of responses to extract the needed information, or settle for a less-than-perfect query response in the hope of approximating what they are looking for. Our CoIN language, described below, reconciles these competing needs and allows databases to accept ad hoc queries from unknown users and respond to them without violating disclosure constraints or database security. Such a language also opens up unlimited analysis and querying options for users, because they will no longer be restricted by the limited number of interfaces traditional public databases support.

CoIN as a Smart Interface Design Language: Our goal is to propose a mechanism by which web interfaces can accept arbitrary user-defined query constraints without compromising database safety or allowing unauthorized access to database contents, and to demonstrate the conditions under which this relaxation is possible. We introduce the notion of monotone constraints and show that all monotone constraints preserve the intended semantics of the interface views. We then introduce an algorithm to reduce all user-supplied constraints to admissible monotone constraints, such that the generated response A is a subset of one of the views Vi the database intended to disclose.
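The safety argument is easy to see in miniature with hypothetical relations: a user constraint applied conjunctively on top of a view predicate can only shrink the answer, so A is a subset of Vi by construction.

    # Sketch: conjunctive (monotone) constraints can only narrow a view.
    rows = [{"dept": "sales", "salary": 60000},
            {"dept": "sales", "salary": 45000},
            {"dept": "hr",    "salary": 70000}]

    view = lambda r: r["dept"] == "sales"          # what the interface discloses
    user = lambda r: r["salary"] > 50000           # admissible user constraint

    V = [r for r in rows if view(r)]
    A = [r for r in rows if view(r) and user(r)]
    assert all(r in V for r in A)                  # A is a subset of the view
    print(A)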

Deep Web Querying using DQL: In this research, we develop a generic structured query model that can be used to retrieve information from the deep web. Using this query model, the contributions of a community of researchers can be combined freely, leading to a system that can be improved incrementally each time someone develops a novel technique to improve an operator. We propose to access deep web content that is “relevant” and “permissible” for a query dynamically at run time, so that we are not required to harvest or index it ahead of time. Our vision is a decomposable deep web querying framework capable of accepting the most capable implementation for each module, as sketched below. While challenges remain, we show that such a model and architecture are entirely feasible today by suitable adaptation of recent research in this area.
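The decomposability we have in mind can be expressed as a set of narrow interfaces, each independently replaceable; the module names and signatures below are hypothetical, and the sketch shows only the plumbing, not any concrete technique.

    # Sketch of a pluggable deep-web query pipeline; each stage can be
    # swapped for a community-contributed implementation.
    from typing import Protocol

    class SourceSelector(Protocol):
        def select(self, query: str) -> list[str]: ...      # relevant form URLs

    class FormFiller(Protocol):
        def fill(self, url: str, query: str) -> str: ...    # result page HTML

    class ResultExtractor(Protocol):
        def extract(self, page: str) -> list[dict]: ...     # structured tuples

    def run(query: str, selector: SourceSelector, filler: FormFiller,
            extractor: ResultExtractor) -> list[dict]:
        tuples = []
        for url in selector.select(query):       # chosen at run time,
            page = filler.fill(url, query)       # no ahead-of-time harvesting
            tuples.extend(extractor.extract(page))
        return tuples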

Semantic Cognition of Linked Open Data for Analytics Orchestration: The emergence of data science and big data research is forcing us to rethink how we query large amounts of complex data, and it emphasizes the need for analytics that process these data into knowledge and for applications that use that knowledge to generate information. In this research, we are investigating how to query large amounts of data by automatically orchestrating workflows and analytics, mapping natural language queries into contextual structured queries in successive steps. Such an approach can also be leveraged to design analytics from an English description of the process. At this early stage, we are developing a platform for understanding plain English text stories and answering questions about their content. The idea is to accept a set of paragraphs in English, convert the entities and their relationships into an RDF database, use a contextual querying framework to map natural language questions into SPARQL queries that compute the response in RDF, and then map the RDF response back into natural language sentences to be read out by a voice interpreter. We plan to use this platform to analyze legal documents to generate legal arguments and to query the contents of web documents.
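A thumbnail of the round trip, using the rdflib library with a hypothetical two-triple story (the contextual mapping of questions to SPARQL is, of course, the research problem itself):

    # Sketch: entities extracted from text become RDF, questions become SPARQL.
    from rdflib import Graph, Namespace

    EX = Namespace("http://example.org/")
    g = Graph()
    g.add((EX.Alice, EX.filed, EX.Lawsuit1))       # "Alice filed a lawsuit..."
    g.add((EX.Lawsuit1, EX.names, EX.Bob))         # "...naming Bob."

    # "Who did Alice's lawsuit name?" mapped to SPARQL:
    q = """SELECT ?who WHERE {
             <http://example.org/Alice> <http://example.org/filed> ?suit .
             ?suit <http://example.org/names> ?who . }"""
    for row in g.query(q):
        print(row.who)    # http://example.org/Bob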

MapBase: Biological ID Mapping: Traditionally, biological objects such as genes, proteins and pathways are represented by a convenient identifier, or ID, which is then used to cross-reference, link and describe objects in biological databases. Relationships among the objects are often established using non-trivial and computationally complex ID mapping systems or converters, and are stored in authoritative databases such as UniGene, GeneCards, PIR and BioMart. Despite best efforts, such mappings are largely incomplete and riddled with false negatives. Consequently, data integration using record linkage that relies on these mappings produces poor-quality data, inadvertently leading to erroneous conclusions. In this research, we address this largely ignored dimension of data integration, examine how the ubiquitous use of identifiers in biological databases is a significant barrier to knowledge fusion using distributed computational pipelines, and propose two algorithms for ad hoc, restriction-free ID mapping of arbitrary types using online resources. Our focus is on improving the quality of data integration and aggregation that relies on ID mappings generated by these mappers, as well as on the warehouses, such as PIR, GeneCards, BioMart and UniGene, that maintain them. We identify two principal factors that impact the quality of integration: materialized view maintenance, and access methods to mappings for information aggregation. This is a collaborative project involving PIR, GeneCards, BioMart, and the University of Antonio Narino, Colombia.
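One source of the false negatives we target is naive composition of partial mappings; the sketch below (with hypothetical IDs) shows how composing two incomplete maps silently drops identifiers, the behavior our algorithms are designed to detect and repair.

    # Sketch: composing incomplete ID mappings loses identifiers silently.
    # All mappings here are illustrative, not authoritative.
    gene_to_uniprot = {"PLP1": "P60201", "MBP": "P02686"}
    uniprot_to_pdb = {"P02686": "4MBP"}     # no structure mapped for P60201

    def compose(m1, m2):
        out, missing = {}, []
        for k, v in m1.items():
            if v in m2:
                out[k] = m2[v]
            else:
                missing.append(k)           # would be a false negative downstream
        return out, missing

    mapped, missing = compose(gene_to_uniprot, uniprot_to_pdb)
    print(mapped)    # {'MBP': '4MBP'}
    print(missing)   # ['PLP1'] -- flagged for ad hoc online resolution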

VI. Veracity through CrowdSourcing

The CrowdCure Project: In this research, we are developing a novel approach to annotating and curating biological database content using crowd computing. While the proposed approach and the CrowdCure system are designed for curating literature-mined protein-protein interaction data, they are amenable to substantial generalization. We are leveraging a powerful uncertainty management framework, called Information Source Tracking, that already has a complete theoretical foundation for a query language addressing uncertain data and information sources such as trust-deficient curators. We are also breaking new ground on crowd curation using micro-tasks on a time budget with personalization. The declarative query language CureQL that we are building assimilates all of these developments into a single platform.
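The bookkeeping that Information Source Tracking provides can be miniaturized as follows; the curators, trust scores, and derivation rule are hypothetical, and the full framework supports a complete query algebra over these annotations.

    # Sketch: facts annotated with contributing sources; derived facts
    # combine source sets, and confidence discounts untrusted curators.
    trust = {"curator1": 0.9, "curator2": 0.5}

    interacts = {("P53", "ATF3"): {"curator1"},
                 ("ATF3", "ATF4"): {"curator1", "curator2"}}

    def derive(a, b, c):
        """Infer (a, c) from (a, b) and (b, c), unioning source sets."""
        sources = interacts[(a, b)] | interacts[(b, c)]
        confidence = 1.0
        for s in sources:
            confidence *= trust[s]        # all contributing sources must hold
        return sources, confidence

    print(derive("P53", "ATF3", "ATF4"))  # ({'curator1', 'curator2'}, 0.45)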

Graph Summarization for Network Functional Equivalence: This project is aimed at supporting my AutoPipe, GeneScope, NetExpress and MindReader projects, as all of them use graph matching techniques in some form. Summarizing graphs in functional terms supports substitutability and helps in understanding functional equivalence. For example, we could substitute a functionally equivalent gene network for a diseased gene network as a therapeutic measure, replace a code segment in a program, or choose a functionally equivalent web service when developing computational pipelines in VisFlow. As part of the MindReader project, we are developing a crowd-sourced database of program code snippets believed to be equivalent, for annotation. Once annotated, the graph representations of these code segments will incrementally serve as model solution fragments for automated grading and tutoring in online programming classes. We also plan to use this database to generate semantically relevant error messages during tutoring in computer programming MOOCs such as MindReader.
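One simple structural form of such summarization groups nodes with identical neighborhoods and collapses each group; the sketch below uses networkx's quotient construction as a baseline, whereas our functional-equivalence criteria are richer than raw neighborhood identity.

    # Sketch: summarize a graph by merging nodes with identical neighborhoods.
    import networkx as nx

    G = nx.Graph([("a", "x"), ("b", "x"), ("x", "y"), ("c", "y"), ("d", "y")])

    def same_role(u, v):
        return set(G[u]) - {v} == set(G[v]) - {u}   # interchangeable nodes

    S = nx.quotient_graph(G, same_role)
    print(S.nodes())   # e.g. groups {a, b}, {c, d}, {x}, {y}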