Refer all questions to Dr. Edward Njoo (edward.njoo@asdrp.org) & the Njoo research group (ASDRP Chemistry Department)
Towards the Development of a Versatile Machine Learning Platform for Pharmacophore-Guided Drug Discovery
Investigators:
Robert Downing Department of Computer Science & Engineering, ASDRP
Edward Njoo Department of Chemistry, Biochemistry & Physics, ASDRP
“Chemical space is vast but most of it is biologically uninteresting: blank, lightless galaxies exist within it into which good ideas at their peril wander.” - Brian Shoichet, University of California, San Francisco. Published in Nature 2004, 432(7019), 862-865.
Competitive Landscape
1.1 From High Throughput Screening (HTS) to High Throughput Virtual Screening (HTVS)
Traditional high-throughput screening (HTS) of chemical libraries involves the chemical synthesis, preparation, and biological evaluation of large collections of compounds against a particular target. This process, while scientifically insightful, is slow (such campaigns take on the order of years to decades from start to finish) and costly, both in personnel time and in resources (chemical synthesis, solvents, instrumentation and laboratory amenities, etc.). In fact, every year, billions of dollars are spent making chemical compounds that end up having little to no biological activity. One of the major costs of the pharmaceutical industry, which ends up being passed along to the consumer, arises from the fact that probing such structures experimentally requires large teams of medicinal chemists to be employed every year in the struggle to make lead compounds, most of which will never work.
As such, the use of computers to streamline the drug discovery workflow, known as high throughput virtual screening (HTVS), has entered the competitive arena as a potential alternative that dramatically reduces the time and material investment required to arrive at early hypotheses of the structure-activity relationship (SAR) of compounds of interest against a particular target. Moreover, such computational screens allow a much broader library of entities in the chemical universe to be screened at any given time, without being limited to chemical entities that are readily commercially available or easily synthesized in a set number of steps.
1.2 Current Methods in High Throughput Virtual Screening (HTVS)
Current methodology in HTVS involves molecular docking, a biophysical simulation approach that seeks to determine and quantify the most thermodynamically stable interaction between a “ligand” (in this case a small molecule) and its receptor (a protein or nucleic acid that it targets). Many software applications do this, each using a different algorithm, including Schrödinger Glide, AutoDock Vina / AutoDock 4, SwissDock, etc. Each package has its own methods for treating ligand flexibility, receptor flexibility, and the relative importance / thermodynamic weight of predicted chemical interactions between ligand and receptor, as well as its own docking pose search decision tree. An exhaustive review of different approaches to docking is beyond the scope of this synopsis, but in the drug discovery arena, there is no shortage of examples of leads identified via HTVS approaches.
However, current docking algorithms incur a fairly steep increase in computational cost for added accuracy and/or search exhaustiveness. This is because these search algorithms are largely based on force fields and molecular mechanics, and iteratively evaluating atom-to-atom force-field terms is computationally costly, especially when thousands of compounds are screened simultaneously.
The use of machine learning in drug discovery is not unique to our group. However, most virtual chemical libraries are composed of billions of possible structures, many of which, while possible to produce, are non-trivial to make in a laboratory and rely upon esoteric starting materials.
There have been prior efforts at using machine learning on simplified chemical representations - SMILES (Simplified Molecular Input Line Entry System) - to predict chemical reactivity or drug efficacy. While computationally much simpler and easier to handle, such algorithms do not take into account the three-dimensionality of compounds, and thus often fail to account for the possibility of multiple, 3-dimensional conformer states.
Project Phases
2.1 Phase 1: Pharmacophore detection
The biological activity of a small molecule bioactive is largely defined by its binding interaction with some sort of macromolecular target, namely, a nucleic acid or protein. The strength and nature of such interactions are largely governed by noncovalent interactions such as hydrogen bonding, pi-stack interactions, coulombic / electrostatic attractions, and Van der Waals / hydrophobic interactions. Among these, the first three are principally important, and these can be attributed to (a) the 3D locations of heteroatoms capable of donating / accepting hydrogen bond interactions (N, O, P, S, F) or capable of forming the centers of point charges (-SO3-, -COO-, -NH3+, etc.) and (b) the 3D locations of aromatic rings. Aromatic rings are defined here as 5-, 6-, or 7-membered, planar, pi-conjugated systems that might either be carbocyclic (in which case no overlap exists with condition (a)) or heterocyclic (in which case a heteroatom might also be part of an aromatic ring system, e.g. indole, furan, imidazole, pyridine systems). As such, we envision that the structural elements of a bioactive compound can be reduced to a 3-dimensional vector map of heteroatoms and aromatic rings.
For the purposes of this work we assume that aliphatic side chains [heavy in Csp3-H bonds] are less important in drug-target binding and hence can be largely ignored in the initial search algorithms.
This seems trivial at first inspection, but the challenge arises in extracting the 3D coordinates of heteroatoms and centers of aromatic ring systems from cartesian coordinates of individual atoms where no information about atom-to-atom connectivity is given. Other molecular descriptors are based on data provided through PaDEL, an open source software package that captures molecular descriptors and fingerprints.
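One way to approach this extraction step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the platform's actual implementation: bonds are guessed from interatomic distances and approximate covalent radii, rings of 5 to 7 heavy atoms are found by depth-first search, and aromaticity is approximated by planarity plus at most three neighbors per ring atom (a proxy only; it does not verify pi-conjugation). All function names, the `tolerance` and `max_deviation` values, and the idealized pyridine geometry are hypothetical.

```python
import itertools
import numpy as np

# Approximate single-bond covalent radii in angstroms (illustrative subset).
COVALENT_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66,
                  "F": 0.57, "P": 1.07, "S": 1.05}
HETEROATOMS = {"N", "O", "P", "S", "F"}


def infer_bonds(elements, coords, tolerance=0.4):
    """Guess connectivity: atoms are bonded if closer than r_i + r_j + tolerance."""
    neighbors = {i: set() for i in range(len(elements))}
    for i, j in itertools.combinations(range(len(elements)), 2):
        cutoff = COVALENT_RADII[elements[i]] + COVALENT_RADII[elements[j]] + tolerance
        if np.linalg.norm(coords[i] - coords[j]) < cutoff:
            neighbors[i].add(j)
            neighbors[j].add(i)
    return neighbors


def find_rings(elements, neighbors, sizes=(5, 6, 7)):
    """Enumerate heavy-atom cycles of the requested sizes by depth-first search."""
    rings = set()

    def extend(path):
        for nxt in neighbors[path[-1]]:
            if elements[nxt] == "H":
                continue
            if nxt == path[0] and len(path) in sizes:
                rings.add(frozenset(path))  # closed a cycle back to the start atom
            elif nxt not in path and nxt > path[0] and len(path) < max(sizes):
                extend(path + [nxt])  # start atom is always the lowest index

    for start, element in enumerate(elements):
        if element != "H":
            extend([start])
    return [sorted(ring) for ring in rings]


def is_planar_sp2_ring(ring, neighbors, coords, max_deviation=0.1):
    """Aromaticity proxy (assumption): <= 3 neighbors per atom and a flat ring."""
    if any(len(neighbors[i]) > 3 for i in ring):
        return False
    centered = coords[ring] - coords[ring].mean(axis=0)
    normal = np.linalg.svd(centered)[2][-1]  # direction of least variance
    return bool(np.max(np.abs(centered @ normal)) < max_deviation)


def extract_features(elements, coords):
    """Return (heteroatom list, ring-center list) from elements and XYZ only."""
    coords = np.asarray(coords, dtype=float)
    neighbors = infer_bonds(elements, coords)
    heteroatoms = [(e, coords[i]) for i, e in enumerate(elements) if e in HETEROATOMS]
    centers = [coords[ring].mean(axis=0)
               for ring in find_rings(elements, neighbors)
               if is_planar_sp2_ring(ring, neighbors, coords)]
    return heteroatoms, centers


# Idealized pyridine on a regular hexagon (made-up geometry, for illustration only).
angles = np.radians(np.arange(0, 360, 60))
ring_xyz = [(1.39 * np.cos(a), 1.39 * np.sin(a), 0.0) for a in angles]
h_xyz = [(2.48 * np.cos(a), 2.48 * np.sin(a), 0.0) for a in angles[1:]]
elements = ["N"] + ["C"] * 5 + ["H"] * 5
heteroatoms, ring_centers = extract_features(elements, ring_xyz + h_xyz)
```

For this toy input the sketch reports a single nitrogen heteroatom and a single ring center at the centroid of the six ring atoms. Fused or bridged ring systems, and non-planar rings that are nonetheless conjugated, would need more careful handling than this proxy provides.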
2.2 Phase 2: Accounting for Conformer Ensembles
Molecules are not static, rigid bodies - rather they enjoy some degree of flexibility, and these modes of flexibility are at times important in the binding of a ligand to its receptor. Moreover, each conformational state of a compound generates a novel 3D pharmacophore “fingerprint”. Traditional methods of accounting for molecular flexibility usually involve conformer searches using molecular mechanics, which locate local thermodynamic minima by altering angles and dihedrals among rotatable bonds. Chemical structures used for the initial screen were first subjected to geometric optimization by density functional theory (DFT).
Here, we propose that in most small molecule “drug-like” entities, conformational flexibility can be accounted for via simple 3-dimensional geometric transformations of key fragments. We classify most conformational changes as (a) rotation of C-C bonds in biphenyl systems; (b) chair-flips in non-aromatic heteroatomic ring systems; and (c) dihedral rotation across Csp3-Csp3 or Csp3-Csp2 centers. Where multiple sequential degrees of freedom in molecular flexibility exist, each is iteratively treated. A conformer is disallowed if it contains a steric clash, that is, if the boundaries of any two atoms' respective radii come within a predetermined distance of each other. An independent fingerprint is generated for each allowed conformer, and together these fingerprints form the ensemble representing each compound.
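The dihedral-rotation case (c) and the steric filter can be sketched as below. This is a hedged illustration rather than the group's code: the rotation uses Rodrigues' formula about the bond axis, clashes are judged against Bondi-style van der Waals radii with a hypothetical `allowed_overlap` threshold, and the four-atom chain with its coordinates is invented purely to exercise the functions. Bonded and 1,3-related pairs are passed in as `excluded`, since those always sit inside each other's radii.

```python
import itertools
import numpy as np

# Van der Waals radii in angstroms (Bondi-style values, illustrative subset).
VDW_RADII = {"H": 1.20, "C": 1.70, "N": 1.55, "O": 1.52, "F": 1.47}


def rotate_about_bond(coords, axis_start, axis_end, moving, angle_deg):
    """Rotate the atoms listed in `moving` about the axis_start -> axis_end bond."""
    coords = np.asarray(coords, dtype=float)
    origin = coords[axis_end]
    k = coords[axis_end] - coords[axis_start]
    k = k / np.linalg.norm(k)  # unit vector along the rotatable bond
    theta = np.radians(angle_deg)
    v = coords[moving] - origin
    # Rodrigues' rotation formula applied to every moving atom at once.
    rotated = (v * np.cos(theta)
               + np.cross(k, v) * np.sin(theta)
               + np.outer(v @ k, k) * (1.0 - np.cos(theta)))
    out = coords.copy()
    out[moving] = rotated + origin
    return out


def has_steric_clash(elements, coords, excluded, allowed_overlap=0.5):
    """True if any non-excluded pair is closer than r_i + r_j - allowed_overlap."""
    for i, j in itertools.combinations(range(len(elements)), 2):
        if (i, j) in excluded:
            continue
        limit = VDW_RADII[elements[i]] + VDW_RADII[elements[j]] - allowed_overlap
        if np.linalg.norm(coords[i] - coords[j]) < limit:
            return True
    return False


def enumerate_conformers(elements, coords, axis, moving, excluded, step_deg=60):
    """Scan one rotatable bond and keep only clash-free conformers."""
    kept = []
    for angle in range(0, 360, step_deg):
        candidate = rotate_about_bond(coords, axis[0], axis[1], moving, angle)
        if not has_steric_clash(elements, candidate, excluded):
            kept.append((angle, candidate))
    return kept


# Toy four-carbon chain in a syn arrangement (made-up coordinates).
chain_elements = ["C", "C", "C", "C"]
chain_coords = [(-0.5, 1.4, 0.0), (0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, 1.4, 0.0)]
skip_pairs = {(0, 1), (1, 2), (2, 3), (0, 2), (1, 3)}  # bonded and 1,3 pairs
conformers = enumerate_conformers(chain_elements, chain_coords,
                                  axis=(1, 2), moving=[3], excluded=skip_pairs)
allowed_angles = [angle for angle, _ in conformers]
```

With these made-up numbers the syn-like rotamers are rejected and only the more extended ones survive. Treating several rotatable bonds would mean applying the same scan to each bond in turn, as the text describes.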
2.3 Phase 3: Generation of Vector Maps of 3D Pharmacophore Fingerprints and Complementarity Scoring; Generation of a Training Set
The relative 3D positions of the identified pharmacophoric features (aromatic rings and heteroatoms) are defined as a vector space. Concurrently, a vector space is generated labeling the 3D positions of pharmacophoric features on the target receptor. The complementarity between the two vector spaces is quantified and correlated with known, empirically obtained binding affinities. The aim is to build this correlation from a library of compounds of known binding affinity to the target. The relative weights of the noncovalent interactions (pi-stack, hydrogen bond, electrostatic), as well as the distance decay of each interaction, are adjustable parameters that can be iteratively refined through machine learning to fit the modeled data.
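To make the idea of a weighted, distance-decaying complementarity score concrete, here is a minimal Python sketch. Everything specific in it is an assumption made for illustration: the feature labels (`donor`, `acceptor`, `aromatic`, `cation`, `anion`), the pairing rules, the exponential form of the decay, and the numerical weights and decay lengths. The source text does not specify the functional form actually used.

```python
import numpy as np

# Hypothetical pairing rules: which ligand/receptor feature types interact.
INTERACTIONS = {
    frozenset(["donor", "acceptor"]): "hbond",
    frozenset(["aromatic"]): "pi_stack",          # aromatic paired with aromatic
    frozenset(["cation", "anion"]): "electrostatic",
}


def complementarity_terms(ligand, receptor, decay):
    """Sum exp(-d / decay) over all complementary ligand/receptor feature pairs.

    `ligand` and `receptor` are lists of (feature_type, xyz) tuples; `decay`
    maps each interaction name to an assumed decay length in angstroms.
    """
    terms = {name: 0.0 for name in decay}
    for lig_kind, lig_xyz in ligand:
        for rec_kind, rec_xyz in receptor:
            name = INTERACTIONS.get(frozenset([lig_kind, rec_kind]))
            if name is None:
                continue  # this pair of feature types does not interact
            distance = np.linalg.norm(np.asarray(lig_xyz, dtype=float)
                                      - np.asarray(rec_xyz, dtype=float))
            terms[name] += float(np.exp(-distance / decay[name]))
    return terms


def complementarity_score(terms, weights):
    """Weighted sum of the per-interaction terms."""
    return sum(weights[name] * value for name, value in terms.items())


# Made-up feature maps and parameters, for illustration only.
ligand_map = [("donor", (0.0, 0.0, 0.0)), ("aromatic", (3.0, 0.0, 0.0))]
receptor_map = [("acceptor", (0.0, 0.0, 2.0)), ("aromatic", (3.0, 0.0, 4.0))]
decay_lengths = {"hbond": 2.0, "pi_stack": 4.0, "electrostatic": 5.0}
weights = {"hbond": 2.0, "pi_stack": 1.0, "electrostatic": 3.0}

terms = complementarity_terms(ligand_map, receptor_map, decay_lengths)
score = complementarity_score(terms, weights)
```

Keeping the per-interaction terms separate from the weights is a deliberate choice in this sketch: the terms depend only on geometry and decay lengths, so the weights can be refit against binding data without recomputing distances.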
These structural complementarity descriptors are combined with similar analyses performed on molecular descriptors provided through PaDEL, which will eventually be subjected to similar parameter refinement algorithms.
2.4 Phase 4: Implementation of a Learning Set
From here, in order to assess the predictivity of the search algorithm, a library of known chemical entities with known binding affinities will be screened using both descriptors from PaDEL and complementarity features from the aforementioned pharmacophore map search, and the predicted binding scores will be compared to empirical data. This comparison is used iteratively to tune the adjustable parameters for each descriptor.
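The tuning loop can be illustrated with the simplest possible case, a linear least-squares fit of descriptor weights to measured affinities. This sketch assumes each compound has already been reduced to one row of numeric descriptors (complementarity terms and/or PaDEL descriptors); the function names are hypothetical, and the data are synthetic placeholders generated in the code, not experimental values. Non-linear parameters such as distance-decay lengths would need an iterative optimizer instead.

```python
import numpy as np


def fit_weights(descriptors, affinities):
    """Least-squares fit of one weight per descriptor column plus an intercept."""
    descriptors = np.asarray(descriptors, dtype=float)
    design = np.column_stack([descriptors, np.ones(len(descriptors))])
    coef, *_ = np.linalg.lstsq(design, np.asarray(affinities, dtype=float), rcond=None)
    return coef[:-1], coef[-1]


def predict(descriptors, weights, intercept):
    """Predicted binding score for each compound (row)."""
    return np.asarray(descriptors, dtype=float) @ weights + intercept


def rmse(predicted, observed):
    """Root-mean-square error between predicted and empirical values."""
    diff = np.asarray(predicted, dtype=float) - np.asarray(observed, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


# Synthetic placeholder data: 20 "compounds", 3 descriptors each, with
# affinities generated from known weights so the fit can be checked.
rng = np.random.default_rng(0)
toy_descriptors = rng.random((20, 3))
true_weights = np.array([2.0, 1.0, 3.0])
toy_affinities = toy_descriptors @ true_weights + 0.5

fitted_weights, fitted_intercept = fit_weights(toy_descriptors, toy_affinities)
error = rmse(predict(toy_descriptors, fitted_weights, fitted_intercept), toy_affinities)
```

Because the toy affinities are noise-free, the fit recovers the generating weights and the error is essentially zero. With real data the error on held-out compounds, not on the fitting set, is the meaningful measure of predictivity.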
A set of non-nucleoside reverse transcriptase inhibitors (NNRTIs) was used as the learning set for development of the algorithm described previously, that is, “extracting the 3D coordinates of heteroatoms and centers of aromatic ring systems from cartesian coordinates of individual atoms where no information about atom-to-atom connectivity is given.”
Consisting of known compounds whose effectiveness has been established [by the FDA], this set of NNRTIs would serve as the test bench against which the algorithm would be constructed. The algorithm would (a) discover a common denominator among the compounds [i.e., the presence and relative locations of heteroatoms]; (b) evaluate the ‘downstream’ topology of each compound for the presence of a ‘ring’ structure [aromatic or non-aromatic]; and (c) use these rings, together with the heteroatoms, to indicate a probable associative relationship between ‘novel’ compounds and the known effective NNRTIs.
2.5 Phase 5: Implementation of a Training Set
Upon demonstrating that this algorithm performed as desired [within an acceptable range of error], a training dataset was created of compounds having the same ‘class’ attributes as those of the learning dataset, but whose effectiveness was unknown. This training exercise would produce a set of novel compounds that could then be evaluated using more traditional methods, thereby measuring the performance of the algorithm.
2.6 Phase 6: Implementation of the Results Set
Once the algorithm meets the desired level of performance (i.e., yields results that lie within the desired confidence interval), it would be integrated into whichever system is found to be most effective for screening large [public] datasets, enabling HTVS of natural compounds to produce a result set that could then be effectively tested in situ for effectiveness against a specified target.
Such a platform would be easily extendable to other such systems beyond those which are currently studied - reverse transcriptase with respect to HIV therapy, and, more recently, target proteins in the SARS-CoV-2 replication cycle.
RECENT PUBLICATIONS
Baranwal, Tanish; Huang, Howard; Avadhani, Udbhav; Goyal, Anya; Samavedam, Akhil; Kale, Mihir; Nepani, Tvisha; Hu, Timothy; Srikanth, Vishak; Downing, R.A.; Njoo, E.S. "Quantitative Definition of Chemical Complexity Through Gaussian Mixture Models and Autoencoder Approaches" Journal of Emerging Investigators 2022, manuscript accepted.
Ashok, Bhavesh; Baranwal, Tanish; Avadhani, Udbhav; Biddala, Geethika; Nepani, Tvisha; Srikanth, Vishak; Zaceria, Luqman; Williams, Natalia; Downing, R.A.*; Njoo, E.S. "Development of a novel machine learning platform to identify structural trends among NNRTI HIV-1 reverse transcriptase inhibitors." Journal of Emerging Investigators 2021, in press.
Ashok, Bhavesh; Bajaj, Ayush; Adwankar, Rohan; Surapaneni, Atri; Surapaneni, Anvi; Chen, Allen; Sun, Stephanie; Chattopadhyay, Kushal; Wu, Jeslyn; Liang, Andrew; Poosarla, Ayeeshi; Mageswaran, Karankumar; Rao, Isha; Kharshingher, Sania; Booma, Sushruth; Njoo, E.S.; Downing, R.A. "Pharmacophore-based screening and identification of molecular level descriptors applied to non-nucleoside reverse transcriptase inhibitors (NNRTIs)" National High School Journal of Science 2020, manuscript accepted.
"A universal chemical synthesis platform was extended by on line NMR to allow adjustments of the encoded reactions on the fly. The approach was demonstrated on the class of Grignard reactions showing robust analytical results under harsh conditions. Real-time process analytics and the use of straightforward feedback control algorithm enhance the usability of available synthesis formula considering existing deviations." [https://doi.org/10.1002/anie.202106323]
Join us for an exciting new project that will require willingness & dedication in pursuing a new platform for the automated synthesis of novel chemical compounds:
Applied Engineering in designing, assembling & prototyping a robotic 'helper' central to the platform, while also engineering the chemical storage reservoirs, measuring system & instrument integration
Software engineering of an artificial intelligence system responsible for initiation, command & control, & real-time retrieval/analysis of intermediate results to make in-flight modifications to the system, and
Chemical design, provisioning & monitoring of reactants to bring the principles of high throughput to the chemical synthesis of novel compounds, addressing current & future problems through advanced chemistry.