Chemoinformatics & In-silico tools

CTB runs a broad range of computational processes to provide in-silico support for experimental efforts to identify small-molecule and peptidomimetic modulators of novel, high-value targets.

LBSSX

The application of ligand-based virtual screening techniques, otherwise known as molecular similarity measures, underpins many facets of cheminformatics, support of medicinal chemistry efforts, and fundamental drug discovery. We currently make use of an in-house developed platform, LBSSX (Ligand Based Structural Similarity 10), which includes 10 diverse similarity methods (fingerprint-, graph-, and 3D-based). Applied alone or in combination, these techniques allow prioritisation of commercially available analogues for hit-compound SAR exploration, library generation, diversity evaluation, and even target prediction when paired with public assay data from repositories such as ChEMBL (reintegrated into the LBSSX platform with each new release). Alongside these traditional applications of molecular similarity, we also use the platform to generate molecular descriptors that represent molecules in our machine learning/AI workflows.
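As a minimal sketch of the kind of similarity-based prioritisation described above (the fingerprint representation and ranking helper here are illustrative, not LBSSX's actual methods), a Tanimoto coefficient over bit fingerprints can rank a library against a query hit:

```python
# Illustrative only: fingerprints are modelled as sets of "on" bit positions.
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto (Jaccard) coefficient: |A & B| / |A | B|."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

def rank_similars(query: set, library: list) -> list:
    """Sort (id, fingerprint) pairs by similarity to the query, best first."""
    return sorted(library, key=lambda entry: tanimoto(query, entry[1]), reverse=True)

query = {1, 4, 9, 12}
library = [("cmpd_a", {1, 4, 9}), ("cmpd_b", {2, 3}), ("cmpd_c", {1, 4, 9, 12, 20})]
ranked = rank_similars(query, library)  # cmpd_c (0.8) ranks above cmpd_a (0.75)
```

In practice each of the platform's ten methods would supply its own representation and similarity function behind the same ranking interface.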



MassFinder - automated analysis of plate-based size exclusion chromatography (pbSEC) screening results

One of CTB’s primary screening methods for the identification of small-molecule or fragment-based binders to target proteins, integrated in the Label Free Affinity Platform (LFAP), is plate-based size exclusion chromatography (pbSEC). With many libraries, acquired from commercial sources or industrial partners or created internally, screened against a range of target proteins for varied purposes, manual analysis of instrument output for each compound screened is a laborious process. This critical task in hit identification and confirmation has now been fully automated with the MassFinder software, which reads mass spectrometry data from a range of source machines and instruments. With compound pools defined, proprietary mass spectrometry file formats are parsed and cumulative ion counts are calculated for the masses present within each compound screening pool. A MassFinder score (MFScore) is then assigned to each compound screened, and a shortlist is presented for follow-up manual inspection by visualisation of a generated MassFinder ion profile. The image to the right shows the difference in ion profiles for a binder (top) and a non-binder (bottom): the green line shows compound ion counts obtained when incubated with protein, and the red line shows ion counts without protein (control). An ideal hit comes through as a single peak on the green line, i.e. only in the presence of protein.
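The core comparison can be sketched as follows. This is a hypothetical simplification, not the published MFScore formula: it simply contrasts a compound's cumulative ion counts with protein present against the no-protein control, so that enrichment with protein scores highly.

```python
def mf_score(counts_with_protein, counts_control, epsilon=1.0):
    """Toy MFScore-style ratio of cumulative ion counts (protein vs. control).

    Values well above 1 suggest the compound co-eluted with the protein,
    i.e. a candidate binder; epsilon guards against division by zero.
    """
    total_protein = sum(counts_with_protein)
    total_control = sum(counts_control)
    return total_protein / (total_control + epsilon)

binder = mf_score([120, 950, 300], [10, 15, 8])   # strong enrichment with protein
non_binder = mf_score([12, 9, 11], [10, 14, 9])   # similar signal either way
```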

PuLSE - Phage Library Sequence Evaluation

The design of highly diverse phage display libraries is based on the assumption that DNA bases are incorporated at similar rates within the randomised sequence. As library complexity increases and expected copy numbers of unique sequences decrease, the exploration of library space becomes sparser and the presence of truly random sequences becomes critical. To test this assumption and catch poor-quality libraries, we have developed PuLSE (Phage Library Sequence Evaluation) as a tool for assessing the randomness, and therefore diversity, of phage display libraries. PuLSE operates on a collection of next generation sequencing reads representative of a phage display library and reports on the library in terms of unique DNA sequence counts and positions, translated peptide sequences, and normalised ‘expected’ occurrences from base to residue codon frequencies. Reporting allows at-a-glance quantitative quality control of a phage library in terms of sequence coverage at both the DNA base and translated protein residue level, a capability that has been missing from existing toolsets and the literature.
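The underlying observed-versus-expected comparison can be illustrated in a few lines (a toy sketch, not PuLSE's actual reporting): under the uniform-incorporation assumption, any one randomised DNA sequence of length L is expected to appear in N reads roughly N/4^L times, so the ratio of observed to expected counts flags over-represented clones.

```python
from collections import Counter

def observed_vs_expected(reads):
    """Ratio of observed copies to the copies expected under uniform
    base incorporation (each of the 4 bases equally likely per position)."""
    n = len(reads)
    length = len(reads[0])
    expected = n * (0.25 ** length)  # expected copies of any single sequence
    counts = Counter(reads)
    return {seq: count / expected for seq, count in counts.items()}

reads = ["ACG", "ACG", "TTG", "GCA"]
ratios = observed_vs_expected(reads)
# "ACG" appears twice in 4 reads of length 3, against an expectation of
# 4/64 copies, so it is heavily over-represented.
```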


Drug combinations

Using any kind of network and pathway information, from basic knowledge of nodes to highly sophisticated dynamic rewiring data, the CTB software pipeline applies several selection and prioritisation modes to find novel, IP-free combinations of two or three drugs, development compounds, or experimental compounds.
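The enumeration step can be sketched simply (the coverage-based score below is illustrative only, standing in for the pipeline's actual selection and prioritisation modes): generate all two- and three-compound combinations and rank them by how many distinct pathway nodes they jointly cover.

```python
from itertools import combinations

# Toy network knowledge: which pathway nodes each compound hits.
drug_targets = {
    "drug_a": {"node1", "node2"},
    "drug_b": {"node2", "node3"},
    "drug_c": {"node4"},
}

def ranked_combinations(drug_targets, sizes=(2, 3)):
    """Enumerate combinations and rank by distinct pathway-node coverage."""
    scored = []
    for k in sizes:
        for combo in combinations(sorted(drug_targets), k):
            coverage = set().union(*(drug_targets[d] for d in combo))
            scored.append((combo, len(coverage)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

best_combo, best_score = ranked_combinations(drug_targets)[0]
# The triple covers all four nodes, so it outranks every pair.
```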

CTB-ConeExclusion

Solution-based virtual screening codes are not always applicable to compounds on solid support. We have addressed this shortcoming with the development of CTB-ConeExclusion, which operates on the output of solution-based structural virtual screening codes. Acting as a post-docking filter, CTB-ConeExclusion allows the selection of compounds with poses amenable to synthesis on solid support.
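Conceptually, a filter of this kind asks whether a cone opening from the compound's attachment point is free of protein atoms, so that a solid-support linker could occupy that space. The geometric check below is a hypothetical sketch of that idea, not CTB-ConeExclusion's actual algorithm:

```python
import math

def _angle_deg(v1, v2):
    """Angle in degrees between two 3D vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / (n1 * n2)))))

def cone_is_clear(attachment_xyz, cone_axis, protein_atoms, half_angle_deg=30.0):
    """Keep a docked pose only if no protein atom falls inside the exclusion
    cone opening from the attachment atom along the attachment vector."""
    for atom in protein_atoms:
        to_atom = tuple(p - a for p, a in zip(atom, attachment_xyz))
        if _angle_deg(cone_axis, to_atom) < half_angle_deg:
            return False  # a protein atom blocks the linker cone
    return True

# An atom directly along the cone axis blocks it; one behind it does not.
blocked = cone_is_clear((0, 0, 0), (0, 0, 1), [(0, 0, 5)])
clear = cone_is_clear((0, 0, 0), (0, 0, 1), [(0, 0, -5)])
```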

In addition to acting as a post-docking filter, in the literature we exemplify its application to library design with a purine example. Purines can be conjugated at position 2, 6, 8, or 9; if the structure of the binding pocket of a target protein is known, only one of these positions will be a suitable starting point for on-bead screening. To streamline on-bead screening of small-molecule libraries, CTB has developed a software package which prioritises, accepts, or rejects available bead-based libraries for HTS.

A larger overarching project making use of this code aims to reveal fundamental properties of classic kinase inhibition scaffolds.

CTB-Core

The CTB-Core platform links combinatorial chemistry and virtual screening at the Auer Lab. It encompasses many aspects of cheminformatics, virtual screening, and modelling for compounds on solid support. Specifically designed to complement our CONA technology, the CTB-Core platform aids in library planning, building block selection, and hit optimisation. By integrating a number of technologies, we are able to operate on targets with or without known structure and known ligands. A decision tree based on the available information on a target’s 3D structure, its binders/inhibitors, and the representation of the scaffold of interest in the CTB-Core Knowledge Base defines the hit identification strategy for most targets. Current developments in machine learning techniques are addressing protein-sequence-driven library design, allowing operation on targets with no known active compounds or X-ray structure. An exciting application of the CTB-Core platform is library reuse/repurposing; drawing on our expertise in drug repurposing and bioisostere replacement, this has been applied in a library context and allows reuse of libraries for optimum efficiency and impact. With a database of enumerated scaffolds, and the combinatorial explosion of unique small molecules possible, we have developed an approach known as “Representative Sets”, in which the core scaffold of a molecule is evaluated in a computationally affordable manner, without sacrificing the diversity afforded by a normally inaccessibly large number of possible R-groups.
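The “Representative Sets” idea can be sketched as follows (the R-group classes and scaffold notation here are invented for illustration): instead of enumerating every possible R-group combination on a scaffold, a small representative per property class is chosen at each position, and only those products are enumerated and evaluated.

```python
from itertools import product

# One representative per hypothetical property class at each position.
r1_representatives = ["H", "CH3", "Ph"]   # e.g. small / aliphatic / aromatic
r2_representatives = ["OH", "NH2"]        # e.g. two H-bond-donor classes

def representative_products(scaffold_template, r1, r2):
    """Enumerate only the representative R-group combinations on a scaffold."""
    return [scaffold_template.format(R1=a, R2=b) for a, b in product(r1, r2)]

virtual_set = representative_products("core({R1})({R2})",
                                      r1_representatives, r2_representatives)
# Six representatives stand in for what full enumeration over thousands of
# commercially available R-groups would make computationally inaccessible.
```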

CTB-Morph

Conceptually, this software package allows the semi-automatic development of a peptidomimetic from a peptide. The necessary input is a peptide sequence and the structure of a protein-peptide or protein-protein interaction. Natural peptides are very important chemical tools for target validation and might even represent potential drug candidates; however, while specificity is often achieved, it is often not easy to develop a natural peptide with the necessary high affinity and stability. The complete set of worldwide commercially available non-natural amino acids, amino-carboxylic acids, di-acids, and amines serves as the input for position-specific replacement of natural amino acids.
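The position-specific replacement step can be sketched as an enumeration over per-position candidate monomers (the monomer names and positions below are placeholders, not output of CTB-Morph):

```python
from itertools import product

peptide = ["Ala", "Gly", "Phe"]
# Allowed replacement monomers per position, including the native residue;
# positions without an entry are kept native.
replacements = {
    0: ["Ala", "Aib"],             # a non-natural helix-stabilising analogue
    2: ["Phe", "2-Nal", "Cha"],    # aromatic/hydrophobic analogues
}

def enumerate_mimetics(peptide, replacements):
    """Enumerate candidate peptidomimetics by position-specific substitution."""
    options = [replacements.get(i, [res]) for i, res in enumerate(peptide)]
    return [list(combo) for combo in product(*options)]

candidates = enumerate_mimetics(peptide, replacements)  # 2 * 1 * 3 = 6 candidates
```

In the real workflow each candidate would then be scored against the protein-peptide interaction structure rather than kept wholesale.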

Bioisostere-centric fragmentation

We use known actives for target classes, and even for individual proteins, to build fragment libraries. The high hit rates of fragment libraries, coupled with enrichment for moieties known to be active against a target, aid the discovery of medicinal chemistry starting points. Our custom software pipelines can evaluate a target space in relation to known actives: how they fragment, the properties of the resultant fragments, and their commercial availability.
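A typical property filter on the resultant fragments might follow the commonly cited “rule of three” for fragment libraries. The sketch below is illustrative: the property values are placeholders rather than computed from real structures, and the thresholds are the standard literature ones, not necessarily the pipeline's exact cutoffs.

```python
def passes_rule_of_three(frag):
    """Rule-of-three fragment filter: MW <= 300, cLogP <= 3,
    H-bond donors <= 3, H-bond acceptors <= 3."""
    return (frag["mw"] <= 300
            and frag["clogp"] <= 3
            and frag["hbd"] <= 3
            and frag["hba"] <= 3)

fragments = [
    {"id": "frag1", "mw": 180.2, "clogp": 1.2, "hbd": 1, "hba": 2},
    {"id": "frag2", "mw": 420.5, "clogp": 4.1, "hbd": 2, "hba": 5},
]
library = [f["id"] for f in fragments if passes_rule_of_three(f)]  # keeps frag1
```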


PyBindingCurve / KMath

Multi-component interactions yield intricate binding stoichiometries described by polynomial equations for which there are often no direct analytical solutions. We have developed PyBindingCurve to address this, using direct analytical solutions for protein-ligand systems where possible, and Lagrange-multiplier- or kinetics-based numerical solutions where not. The Python package makes simulation, fitting of experimental parameters to data, and integration into existing pipelines simple, presenting unique opportunities in experimental planning and data analysis.
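The simplest case with a direct analytical solution is 1:1 protein-ligand binding, where the complex concentration [PL] is the physically meaningful root of a quadratic in the total protein P0, total ligand L0, and dissociation constant Kd. The standalone sketch below shows that solution; PyBindingCurve itself wraps this and the harder multi-component cases behind one interface.

```python
import math

def complex_concentration(p0, l0, kd):
    """[PL] for 1:1 binding: the smaller root of
    [PL]^2 - (P0 + L0 + Kd)*[PL] + P0*L0 = 0."""
    b = p0 + l0 + kd
    return (b - math.sqrt(b * b - 4.0 * p0 * l0)) / 2.0

# With Kd = 0 (infinitely tight binding) all of the limiting species is complexed.
tight = complex_concentration(10.0, 10.0, 0.0)    # -> 10.0
weak = complex_concentration(10.0, 10.0, 100.0)   # far less complex forms
```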

Small molecule libraries

The Auer group has access to a wide range of diverse commercial, proprietary, and in-house small molecule libraries for hit finding. These libraries span known bioactives, fragment libraries built on diverse hit-finding strategies (currently seven), diversity-based libraries, and target-class-focused libraries.

Artificial Intelligence / Machine Learning

CTB employs AI/machine learning (ML) techniques in a number of projects, with currently three people working in the field. The projects are multidisciplinary and diverse in nature, broadly separable into three categories: imaging, chemoinformatics, and analysis of instrument data. Within the field of imaging, we work with phenotypic screening data, developing techniques and identifying compound modes of action. To augment our CONA assays, we apply computer vision techniques to hit bead identification and simplification of assay readouts. Within chemoinformatics, we have built upon our existing expertise, adding AI/ML techniques and building predictors of physicochemical properties for small molecules. Work with an industrial partner also aims to predict the targets of small molecules, and gives us access to proprietary compound archives with novel chemistry. The analysis of instrument data has led us to apply techniques to NMR and mass spectrometry data. Our aim in all projects is the exploitation of data in ways inaccessible without the insights afforded by AI/ML techniques. We utilise many different techniques in the creation of our predictors and generators, including feed-forward, recurrent, and convolutional neural networks, along with more exotic autoencoder and adversarial autoencoder architectures. Models are trained either on the University of Edinburgh’s supercomputing facilities, on Google’s Compute Engine, or on in-house GPGPU-accelerated hardware.

Machine Learning / AI for Kinase Drug Discovery

Protein kinases are arguably one of the two most attractive and important drug target classes in oncology. Their importance within further therapeutic fields is also rising, with interest in their use in treating immune system disorders, degenerative disease, and infectious disease. Owing to this, there is a need to efficiently screen active compounds for a growing number of novel kinase targets. Conventional drug discovery approaches are based either on assay development and high throughput screening, or on structure-based design. Projects in areas like infectious disease are restricted by a number of factors, including: i) target validation, i.e. biological knowledge of which kinase is the most important one in bacteria, viruses, etc.; ii) the availability of sufficiently high amounts of enzyme for assay development, or of the 3D structure of the target kinase; iii) for orphan targets, the absence of known active hits or lead compounds. We are therefore applying machine learning techniques to kinase drug discovery with the goal of creating a new branch of virtual screening: sequence-based virtual screening. With this newly developed screening method, we will predict active compounds for novel orphan or unexplored kinase drug targets using only their sequence information, accelerating the development and application of kinase inhibitors.
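Conceptually, sequence-based virtual screening pairs a representation of the kinase sequence with a representation of the compound in a single feature vector for a learned activity predictor. The featurisations below are deliberately toy stand-ins (dipeptide k-mer counts for the sequence, a bit list for the compound), not the group's actual models:

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def kmer_counts(sequence, k=2):
    """Count dipeptide k-mers over the 20x20 amino-acid alphabet (400 features)."""
    kmers = [a + b for a in AMINO_ACIDS for b in AMINO_ACIDS]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return [counts.get(km, 0) for km in kmers]

def pair_features(kinase_seq, compound_bits):
    """Concatenate sequence features and compound fingerprint bits into one vector."""
    return kmer_counts(kinase_seq) + list(compound_bits)

# A toy 8-residue sequence paired with a 4-bit compound fingerprint.
features = pair_features("MKKLGSTA", [1, 0, 1, 1])
# 400 dipeptide counts followed by the 4 fingerprint bits.
```

A classifier trained on such paired vectors for known kinase-compound activities could then score compounds against a kinase for which only the sequence is available.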

Phenotypic Virtual Screening: Deriving predictive machine learning models from high-content screening datasets

Phenotype-based drug discovery has been shown to be an effective method for discovering first-in-class drugs. By studying candidate drugs on a comparatively high level – on the level of the induced phenotype, rather than the induced activity of proteins within the cell – we can more easily identify compounds which induce something resembling a desired phenotype. Our dataset, screened at the MRC Institute of Genetics and Molecular Medicine (IGMM), University of Edinburgh, by the group of Prof Neil Carragher, includes over 12,000 compounds screened against 8 distinct breast cancer cell lines spanning 4 clinical subtypes. This affords us a rich dataset with which we can exploit machine learning techniques to find automated methods of identifying – and perhaps ultimately generating – compounds with high therapeutic potential.

Our shorter-term goals include developing a ‘phenotypic metric’, and effectively adjusting for batch effects. Our phenotypic metric aims to quantify the magnitude of the response that any given compound induces. Adjusting for batch effects is a canonical problem in high-content screening: since our compounds are screened on different plates, and are therefore subject to different conditions, ‘nuisance variation’ is introduced to our datasets, which we are working to minimise.
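One common way to adjust for such batch effects is per-plate robust z-scoring, expressing each well's readout relative to its own plate's median and median absolute deviation (MAD). This is a standard approach offered as an illustration, not necessarily the group's exact method:

```python
import statistics

def robust_z_per_plate(plate_values):
    """Robust z-score each plate's wells against that plate's median and MAD,
    removing plate-to-plate scale and offset ('nuisance variation')."""
    normalised = {}
    for plate, values in plate_values.items():
        med = statistics.median(values)
        mad = statistics.median(abs(v - med) for v in values) or 1.0
        normalised[plate] = [(v - med) / mad for v in values]
    return normalised

plates = {
    "plate1": [10.0, 12.0, 11.0, 30.0],
    "plate2": [100.0, 120.0, 110.0, 300.0],  # same pattern, 10x scale shift
}
z = robust_z_per_plate(plates)
# The outlier well scores identically on both plates despite the scale shift.
```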

This decade has ushered in dramatic advances in processing power, which have enabled machine learning models to process ever larger datasets with ever-increasing speed. One consequence of this is that we now have the ability to employ deep learning models to detect low-level features in imaging data, yielding rich descriptions of images and consequently powerful predictive models. We are working on convolutional autoencoders to profile cell morphology and compounds.