Human lifestyle, agriculture and industry burden our environment with thousands of synthetic chemicals, many of which may evoke adverse outcomes in biological systems, from single cells to living individuals to whole ecosystems. The great majority of chemicals on the market remain unregulated, and their toxicity and adverse effects on organisms, including humans, are yet to be determined.
Exposure of an organism to chemicals can trigger a range of events at the molecular level, which may subsequently propagate to responses at the cellular, organ, organism and population level. Adverse Outcome Pathways (AOPs) organize knowledge about this sequence of biological processes that can lead to adverse outcomes. The AOP concept was initially developed to curate mechanistic biology knowledge associated with chemical toxicity with the goal of supporting chemical hazard and risk assessment; it is now increasingly used by the biomedical community to organize knowledge about disease pathways.
An AOP is structured as a sequential chain that starts with a molecular initiating event (MIE), which can be triggered by a chemical, a biological agent, etc., and progresses through a series of causally linked key events (KEs) across multiple levels of biological organization, culminating in an adverse outcome (AO). Depending on end-user needs, the adverse outcome may be allocated to the individual, population, or ecosystem level. An AOP is thus a directed, structured representation of the biological events leading to an AO. Because different AOPs may share key events, AOPs together form an AOP network.
The structural nature of the AOP supersedes the utility of past hazard/risk-assessment tools; it was developed to rely on collaborative (human) expert-sourcing, and it allows for modular data deposition in a Wiki-based knowledge base (https://aopwiki.org/). Furthermore, the AOPs are composed of vertices and edges, thus lending themselves to graph-based analyses.
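To make the graph view concrete, the following minimal sketch merges two linear AOP chains into one directed network and lists the events they share. The AOP names and event labels are invented for illustration and are not AOP-Wiki entries; `build_network` and `shared_events` are hypothetical helper names.

```python
# Hypothetical AOPs (names and events are illustrative, not AOP-Wiki entries).
# Each AOP is an ordered chain: MIE -> key events -> adverse outcome.
AOPS = {
    "AOP-A": ["MIE: receptor binding", "KE: oxidative stress",
              "KE: cell death", "AO: liver fibrosis"],
    "AOP-B": ["MIE: enzyme inhibition", "KE: oxidative stress",
              "KE: inflammation", "AO: lung injury"],
}


def build_network(aops):
    """Merge linear AOP chains into one directed graph (node -> successors)."""
    graph = {}
    for chain in aops.values():
        for node in chain:
            graph.setdefault(node, set())  # every event becomes a vertex
        for upstream, downstream in zip(chain, chain[1:]):
            graph[upstream].add(downstream)  # directed edge: cause -> effect
    return graph


def shared_events(aops):
    """Return events that occur in more than one AOP, sorted by name."""
    owners = {}
    for name, chain in aops.items():
        for node in set(chain):
            owners.setdefault(node, set()).add(name)
    return sorted(node for node, names in owners.items() if len(names) > 1)


network = build_network(AOPS)
print("shared events:", shared_events(AOPS))
print("successors of the shared event:", sorted(network["KE: oxidative stress"]))
```

Because the two chains meet at one shared key event, that vertex has two outgoing edges in the merged graph, which is what turns a set of linear AOPs into a network.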
Promoting the development of AOP applications is intended to enhance communication between the scientists who generate toxicological data and the potential end users of this information, such as regulators, modelers or risk assessors [1]. The AOP framework has the potential to support and enhance the use of mechanistic data in regulatory decision-making. The development and publication of new AOPs is, for instance, promoted by the OECD and the European Union.
A Short Introduction Video for AOP
There is a need for research that rapidly predicts adverse outcome pathways and/or extends existing AOP networks. AOP data form a network (i.e., a graph): a collection of key events, represented as nodes, that are interconnected via links, as shown in the figure on the right [2]. Most of the current AOP network is populated manually by human experts who review the research literature to develop and deposit individual "putative" linear AOP chains. One traditional way to find putative AOPs is to use existing knowledge from the AOP Wiki [3], extract key events that are common to different AOPs, and compose them into a new pathway. Efficient methods for AOP development and weight-of-evidence assembly are lacking.
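The traditional approach described above can be sketched as a path search: merge the known chains, then enumerate every route from a starting event to an adverse outcome that is not already a known AOP. This is only an illustration under stated assumptions; the chains are invented, `putative_pathways` is a hypothetical helper, and any pathway it returns is a candidate for expert review rather than an established AOP.

```python
# Invented example chains; each list runs from an initiating event to an outcome.
CHAINS = [
    ["MIE: receptor binding", "KE: oxidative stress",
     "KE: cell death", "AO: liver fibrosis"],
    ["MIE: enzyme inhibition", "KE: oxidative stress",
     "KE: inflammation", "AO: lung injury"],
]


def putative_pathways(chains):
    """List start-to-outcome paths that cross between known chains."""
    edges = {}
    for chain in chains:
        for upstream, downstream in zip(chain, chain[1:]):
            edges.setdefault(upstream, set()).add(downstream)

    known = {tuple(chain) for chain in chains}
    starts = sorted({chain[0] for chain in chains})
    outcomes = {chain[-1] for chain in chains}
    found = []

    def walk(path):
        node = path[-1]
        if node in outcomes:
            if tuple(path) not in known:  # keep only newly composed pathways
                found.append(list(path))
            return
        for nxt in sorted(edges.get(node, ())):
            if nxt not in path:  # guard against cycles
                walk(path + [nxt])

    for start in starts:
        walk([start])
    return found


for pathway in putative_pathways(CHAINS):
    print(" -> ".join(pathway))
```

With these two chains the search yields two candidates, each entering through one AOP's initiating event and leaving through the other AOP's outcome via the shared key event.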
Despite extensive buy-in from the research and regulatory communities for using the AOP framework to guide risk assessment, there are at present only 257 existing AOPs and few common KEs. As a result, the existing AOP network is still very small, with a couple of hundred graph nodes and links. To facilitate AOP development, we plan to build predictive models based on various deep neural networks (DNNs) that predict and evaluate the KE-to-KE links most likely to be genuine associations. Such predicted nodes and links can guide experts to focus their efforts on a much smaller set of related key events instead of testing and assessing an effectively unbounded number of arbitrary combinations.
Since our goal is to predict AOP links, we need training data (training samples) to train our DNN. A well-trained DNN will detect linkage patterns in the training data we provide. Based on the detected linkage patterns, the DNN can then predict linkages that are missing from the existing AOP network.
From the above discussion, we know that a large amount of training data is the basic requirement for training a DNN. The second, and even more important, requirement is the quality of the training data. If we provide random training data with sample linkages that never existed among objects in our environment, the DNN will learn to predict fictional linkages that are totally useless.
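One common way to turn known links into training samples for link prediction is to label observed links as positives and sample unlinked pairs as negatives. The source does not say how its training samples are labeled, so the sketch below is an assumption throughout; `make_training_samples` and the example links are hypothetical.

```python
import random

# Invented directed links between concepts (upstream, downstream).
KNOWN_LINKS = [
    ("chemical X", "oxidative stress"),
    ("oxidative stress", "cell death"),
    ("cell death", "liver fibrosis"),
]


def make_training_samples(known_links, seed=0):
    """Return (source, target, label) triples: 1 = known link, 0 = sampled non-link."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    positives = set(known_links)
    nodes = sorted({node for link in known_links for node in link})
    # Every ordered pair of distinct nodes that is not a known link.
    candidates = [(a, b) for a in nodes for b in nodes
                  if a != b and (a, b) not in positives]
    negatives = rng.sample(candidates, k=min(len(positives), len(candidates)))
    samples = [(a, b, 1) for a, b in sorted(positives)]
    samples += [(a, b, 0) for a, b in negatives]
    rng.shuffle(samples)
    return samples


for source, target, label in make_training_samples(KNOWN_LINKS):
    print(label, source, "->", target)
```

A sampled negative may in fact be a true but not yet documented link, which is one reason the quality of the positive links matters so much: the model can only separate real from fictional linkages if the positives are reliable.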
So, where can we get a large amount of high-quality training data containing many small, fragmented networks of objects in our environment, so that a DNN can learn to predict a more complete AOP network? We propose extracting such high-quality training data from three major sources:
The first source is the biomedical journal literature. While each paper in the life-science area may discuss only a very small network of interconnected objects, from those millions of papers we can collect hundreds of millions of high-quality networks of objects. That linkage information has high accuracy, as it is scientifically verified by the paper authors and carefully reviewed by experts before publication. However, different authors may describe the same medical concepts and relations using different terminology (e.g., cancer and tumor). Unless all synonyms of the same concept or relation are carefully converted into a uniform code, frequent and important connections between concepts will be mistakenly represented with much lower frequencies. This misrepresentation may lead DNNs to falsely disregard important connections.
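The effect of synonyms on link frequencies can be shown with a small count. In this sketch the synonym table and the codes (`C_NEOPLASM`, `C_SMOKING`) are made up for illustration and are not real UMLS identifiers; `normalize` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical synonym table; the codes are placeholders, not real UMLS CUIs.
SYNONYM_TO_CODE = {
    "cancer": "C_NEOPLASM",
    "tumor": "C_NEOPLASM",
    "smoking": "C_SMOKING",
    "tobacco use": "C_SMOKING",
}

# Invented term pairs, as they might be extracted from different papers.
EXTRACTED_PAIRS = [
    ("smoking", "cancer"),
    ("tobacco use", "tumor"),
    ("Smoking", "tumor"),
    ("asbestos", "cancer"),
]


def normalize(term):
    """Map a term to its uniform code; unmapped terms are kept in lower case."""
    key = term.lower()
    return SYNONYM_TO_CODE.get(key, key)


raw_counts = Counter(EXTRACTED_PAIRS)
coded_counts = Counter((normalize(a), normalize(b)) for a, b in EXTRACTED_PAIRS)

print("most frequent raw pair:  ", raw_counts.most_common(1))
print("most frequent coded pair:", coded_counts.most_common(1))
```

Before normalization every surface form of the smoking-cancer link is counted once, so none stands out; after normalization the three mentions collapse into a single pair with a count of three.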
Secondly, we also plan to add more than 32 million relations documented in various well-established medical ontologies (e.g., MeSH, ICD, SNOMED-CT) to our training set. The NLM's UMLS database collects the relations between medical concepts in these ontologies.
The third dataset we plan to include in our DNN training process is the existing AOP network. Compared with the large number of shorter links in journal papers, the existing AOP network has a very small number of longer links that describe deeper chains of cause-effect relationships. Since different experts may describe the same concepts using different terminology, the difficulty in using this dataset is the same as in parsing the text of medical journal papers. We have parsed the text in the current AOP network using the text parsing tools described above and extracted a couple of thousand relations.
There are three major tasks in this project: (1) text parsing, (2) training neural network models, and (3) integrating the outputs produced by the previous two tasks. Amazon Web Services (AWS) and Amazon Simple Storage Service (S3) offer an integrated computing platform well suited to these three tasks. In the following we explain how we plan to utilize AWS and S3 for the proposed project:
We have extracted about one million pairs of related biological terms from more than 16 million biomedical journal papers. However, these term pairs must be pre-processed into a meaningful numeric representation before they can be used to train our deep neural networks (DNNs). We explain the steps of our approach below:
All the different terms that describe the same biological meaning are first converted into the same Concept Unique Identifier (CUI) defined in UMLS. In other words, each pair of biological terms is first represented as a pair of CUIs. While different terms with the same biological meaning are now represented by the same CUI, this representation cannot capture the similarity between different terms with similar biological meanings (e.g., bacterial pneumonia and viral pneumonia).
To capture the similarity between different terms, we plan to train a word embedding model that converts each CUI into a vector in a high-dimensional space (e.g., 200 dimensions) such that CUIs representing similar meanings are mapped close to each other.
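The property the embedding is meant to provide can be illustrated with cosine similarity. The sketch below does not train an embedding model: the 4-dimensional vectors are written by hand (a real model would learn vectors of around 200 dimensions), and the concept keys are plain labels standing in for CUIs.

```python
import math

# Hand-written toy vectors standing in for learned CUI embeddings.
EMBEDDINGS = {
    "bacterial pneumonia": [0.9, 0.8, 0.1, 0.0],
    "viral pneumonia":     [0.8, 0.9, 0.2, 0.0],
    "bone fracture":       [0.0, 0.1, 0.9, 0.8],
}


def cosine(u, v):
    """Cosine similarity: near 1.0 for the same direction, 0.0 for orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(concept, embeddings):
    """Return the other concept whose vector is most similar to `concept`."""
    others = (name for name in embeddings if name != concept)
    return max(others, key=lambda name: cosine(embeddings[concept], embeddings[name]))


query = "bacterial pneumonia"
for name in EMBEDDINGS:
    if name != query:
        print(f"{query} vs {name}: {cosine(EMBEDDINGS[query], EMBEDDINGS[name]):.2f}")
print("nearest:", nearest(query, EMBEDDINGS))
```

Two distinct identifiers that would otherwise be unrelated symbols become comparable: the two pneumonia concepts score far higher with each other than either does with the unrelated concept.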
We plan to test different DNN architectures, such as the autoencoder (AE), the denoising autoencoder (DAE), and the long short-term memory (LSTM) network, to predict possible relations between pairs of biological terms or key events (KEs).
We plan to propose a new measure for evaluating the quality of our predictions. The idea is to measure how much the target term and the predicted term overlap in the semantic hierarchy defined in UMLS. We refer to this measure as Semantic Correctness.
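The exact formula for Semantic Correctness is not given here, so the following is only one plausible reading of "overlap in the semantic hierarchy": the Jaccard overlap between the ancestor sets of the target and the predicted term. The toy hierarchy, the single-parent simplification, and the names `lineage` and `semantic_overlap` are all assumptions made for illustration.

```python
# Toy single-parent hierarchy (child -> parent); real UMLS hierarchies are
# larger and a concept may have several parents.
PARENT = {
    "bacterial pneumonia": "pneumonia",
    "viral pneumonia": "pneumonia",
    "pneumonia": "lung disease",
    "lung disease": "disease",
    "hepatitis": "liver disease",
    "liver disease": "disease",
}


def lineage(term):
    """Return the term together with all of its ancestors up to the root."""
    nodes = set()
    while term is not None:
        nodes.add(term)
        term = PARENT.get(term)
    return nodes


def semantic_overlap(target, predicted):
    """Jaccard overlap of the two lineages (assumed stand-in for the real measure)."""
    a, b = lineage(target), lineage(predicted)
    return len(a & b) / len(a | b)


for predicted in ("bacterial pneumonia", "viral pneumonia", "hepatitis"):
    score = semantic_overlap("bacterial pneumonia", predicted)
    print(f"target=bacterial pneumonia predicted={predicted}: {score:.2f}")
```

Under this reading, an exact match scores 1.0, a sibling concept that shares most of the lineage scores well above one half, and a concept from a different branch scores close to zero, so a near miss is rewarded more than an unrelated prediction.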
The test data we plan to use to evaluate the predictive ability of our models will come from at least the following two sources:
In our preliminary study, we evaluated the potential of the existing knowledge in the above three major data sources for AOP discovery and development. We first downloaded the AOP network from the AOP Wiki 2.0 [3]. We parsed the English descriptions in the network using the tools mentioned above and converted all terminology in the AOPs into a uniform coding system.
We found 3,084 relationships among stressors, MIEs (molecular initiating events), KEs (key events), AOs (adverse outcomes), stressor-chemicals, and stressor-events. High-performance graphics processing units (GPUs) were used to determine which of the 3,084 relationships can be found among the hundreds of millions of relationships in the UMLS and NLM databases. Of these, 610 (20%) were found in the UMLS database, and about 1,837 (60%) were found in the abstracts of 16 million biomedical papers on NLM. Combining the searches over both the UMLS and NLM databases, 1,983 (64%) of the relationships from the AOP Wiki were found; some sub-categories, such as stressor-chemicals, had a much higher hit ratio (78%).
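The hit ratios above follow directly from the reported counts, and the counts also imply how many relationships appear in both sources. The short calculation below uses only the figures stated in the text; the overlap estimate assumes that the combined figure counts relationships found in at least one of the two sources (a set union), which the text suggests but does not state outright.

```python
TOTAL = 3084  # relationships extracted from the AOP Wiki

# Counts reported in the text for each search.
FOUND = {
    "UMLS": 610,
    "NLM abstracts": 1837,
    "UMLS or NLM": 1983,
}


def hit_ratio(found, total):
    """Fraction of AOP relationships that were located in a source."""
    return found / total


for source, count in FOUND.items():
    print(f"{source}: {count}/{TOTAL} = {hit_ratio(count, TOTAL):.1%}")

# Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|
# (valid only if the combined count is the size of the union).
found_in_both = FOUND["UMLS"] + FOUND["NLM abstracts"] - FOUND["UMLS or NLM"]
print("found in both sources:", found_in_both)
print("found in neither source:", TOTAL - FOUND["UMLS or NLM"])
```

Under the union assumption, 464 relationships are found in both sources and 1,101 in neither, so the literature abstracts contribute most of the coverage that UMLS alone does not provide.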
Our findings [5, 6] indicate that an AOP-discovery system that utilizes UMLS and NLM article data to predict new probable AOP relationships could substantially accelerate AOP development and contribute to weight of evidence analyses.
Furthermore, we are currently training our neural network models to learn complex relations among 89,230,566 pairs of biological concepts that we extracted from the 16 million biomedical papers. We are also developing a novel quantitative method to measure the semantic correctness of our predictions. Our preliminary results show that our AOP models can predict possible connections between biological concepts with high semantic correctness: 85% of the predicted relations have a semantic correctness greater than 50%. The following figure shows sample preliminary results from testing the semantic correctness of 10,000 KE pairs. Our initial results will be published soon.
While we have obtained a few very early, preliminary results that indicate the feasibility of predicting AOPs to accelerate toxicological analysis, this project still faces some challenges, as discussed above. Please refer to the earlier section "Aim of Project".
For the cited references, please see the reference page.