Deep Learning-Based Activity Prediction for Estrogen Receptor Modulators and Application to Natural Product Screening
Jongkeun Choi*
Department of Biological and Chemical Engineering, Chungwoon University, Sukgol-ro 113, Michuhol-gu Incheon, 22100, Republic of Korea
Correspondence to: Jongkeun Choi, jkchoi@chungwoon.ac.kr
Received: September 11, 2024; Revised: November 20, 2024; Accepted: February 11, 2025; Published: February 14, 2025
NATPRO J. 2024; 1: 5-15 Open Access
https://doi.org/10.23177/NJ024.901
Abstract
This study investigated the use of deep learning to predict the estrogen receptor (ER) activity of small molecules, a task central to drug discovery for hormone-dependent diseases. Active and inactive compound data were collected from PubChem and BindingDB, and their chemical properties, including molecular weight, polar surface area, and hydrogen bond characteristics, were analyzed. Molecular descriptors and Morgan Fingerprints were calculated using RDKit. Chemical space analysis using principal component analysis (PCA) visualization revealed that this approach alone was insufficient for distinguishing between active and inactive compounds. Therefore, two TensorFlow-based deep learning models were developed: one using molecular descriptors and the other using Morgan Fingerprints. Both models were trained on BindingDB and PubChem datasets with varying activity thresholds and molecular weight restrictions. The Morgan Fingerprint-based model consistently outperformed the descriptor-based model, achieving up to 99.82% accuracy and an AUC of 0.9994 on the BindingDB dataset. To validate practical applicability, 81,442 compounds from the NPASS natural product database were screened using the best-performing model. This virtual screening identified 3,577 potential ER-active candidates, including known active compounds and novel potential modulators. The results highlighted the superiority of Morgan Fingerprints in capturing relevant structural features for activity prediction and emphasized the importance of high-quality datasets in model development. This study also demonstrated the potential of deep learning to expedite drug discovery, particularly in identifying promising candidates from large compound libraries. Future work will need to include free energy calculations using molecular dynamics and experimental validation.
This publication is licensed under CC-BY-NC-ND 4.0.
Copyright © 2024 The Asian Society of Natural Products
Keywords
Artificial intelligence, deep learning, estrogen receptor, drug discovery, TensorFlow, ligand activity prediction, Morgan Fingerprints, virtual screening
Graphic Abstract
Introduction
Artificial intelligence (AI) is a branch of computer science that aims to mimic human intelligence in order to solve complex problems and learn from data. In recent years, remarkable advances in deep learning have driven innovation across various fields [1]. Deep learning uses multilayered artificial neural networks to learn complex patterns, achieving outstanding performance in areas such as image recognition, speech recognition, natural language processing, autonomous driving, and healthcare [2]. In the medical field, deep learning models are used to analyze medical images, such as X-rays and magnetic resonance imaging (MRI) scans, for disease diagnosis and to predict molecular structures for drug discovery [3]. The integration of AI into drug discovery has become increasingly prominent, particularly in medicinal chemistry. Techniques such as AlphaFold for protein structure prediction and deep learning models that relate small-molecule structures to biological activity are transforming the search for new drug candidates [4,5]. Traditional methods, such as high-throughput screening (HTS), are often time-consuming and costly, with low success rates. In contrast, in silico methodologies using deep learning offer significant reductions in time and cost, as well as the ability to identify novel drug candidates that might be missed by conventional approaches [6]. The versatility of deep learning has led to its integration into various facets of drug discovery. For instance, it has been employed to predict drug-target interactions, assess drug-drug similarity interactions, and evaluate drug sensitivity and side effects [7,8]. A systematic literature review covering over 300 articles from 2000 to 2022 highlighted how recent advances in deep learning technologies have streamlined the drug discovery process by reducing development time and costs [7].
Moreover, the integration of explainable AI techniques supports better decision-making in drug development by providing transparency in deep learning model predictions [7]. AI is revolutionizing drug discovery across various stages, from target identification to drug repurposing, offering increased efficiency, cost reduction, and improved success rates. Significant milestones have been achieved, such as AI-designed drugs entering clinical trials and the development of AI-generated antibodies, demonstrating the technology's potential to accelerate the drug development process [9]. However, non-negligible challenges remain, including algorithmic biases, ethical concerns, and the need for extensive computational resources. Overcoming these hurdles, while prioritizing rigorous validation and ethical considerations, is important for the future of AI in drug discovery [9].
A variety of tools have been developed, and some can be freely used in drug discovery projects. Among these, Google's TensorFlow, a popular deep learning framework, has been instrumental by providing a robust platform for building AI models [6,10]. TensorFlow's flexibility and scalability, along with its high-level API, Keras, enable researchers to efficiently develop complex models [11]. Its support for distributed processing allows handling large-scale data, making it an invaluable tool in AI research and application [11].
The estrogen receptor (ER) is a nuclear receptor that binds estrogen, a female hormone, and regulates various physiological processes [12]. Upon binding estrogen, the receptor undergoes conformational changes, activating or repressing specific gene transcription through interactions with coactivator and corepressor proteins [13, 14]. These interactions are crucial in the context of hormone-dependent diseases, such as breast cancer and endometrial cancer, making ER a vital target for therapeutic intervention [5, 15]. In recent years, several studies have indicated that AI can support the treatment of estrogen receptor-related diseases and the discovery of new drugs. Rajpura et al. used a deep learning model to predict the binding affinity of small molecules to estrogen receptors, comparing it with a simple linear regression model [5]. Vamathevan et al. utilized symbolic AI and deep neural networks to explore these interactions, focusing on predicting ligand activity on targets such as estrogen receptors [10]. Their work highlights AI's transformative potential in drug discovery, particularly through the precise modeling of intricate interactions. Similarly, Cui et al. employed graph neural networks and machine learning techniques to identify promising drug candidates for breast cancer treatment, emphasizing the crucial role of non-Euclidean data structures in capturing the complexity of interactions with estrogen receptors, which is vital for developing effective therapies [16]. Additionally, deep learning can aid cell detection and classification in breast cancer tissues, enhancing the understanding of the tumor microenvironment in estrogen receptor-positive cancers and paving the way for improved diagnostic and therapeutic strategies [17].
This study aimed to investigate the utility of TensorFlow-based deep learning models for predicting the activity of ERα-binding ligands. Using compound data from the PubChem and BindingDB databases, we sought to develop a deep learning model capable of identifying potential modulators of estrogen receptor activity. In building the model, models using molecular descriptors were compared with those using Morgan fingerprints. The results revealed that the Morgan fingerprint-based model performed better than the molecular descriptor-based model. Furthermore, applying the developed model to a natural product database demonstrated the usefulness of the deep learning approach for identifying candidates, thereby contributing to the discovery of novel therapeutic agents.
Materials and methods
Data Preparation
The data for compounds with biological assay results related to the estrogen receptor were sourced from PubChem [18] and BindingDB [19]. These datasets were downloaded in CSV format. Each file contained information that allowed compounds to be classified as active or inactive. From PubChem, we collected 8,936 active compounds and 959,967 inactive compounds. Additionally, 4,622 active compounds were obtained from BindingDB. These compounds were organized into separate files: active.csv for active compounds and inactive.csv for inactive compounds. These files were structured to serve as inputs for the TensorFlow-based predictive modeling process.
Development of a Deep Learning Program Using AI
The development of the deep learning program was facilitated by using advanced AI services such as Monica, Google's Gemini, and Microsoft's Copilot. These AI tools were used interactively to draft, refine, and complete the deep learning code. The process involved iterative dialogue with these AI models, enabling the incorporation of best practices and advanced machine-learning techniques. AI-assisted development allowed for efficient troubleshooting and optimization of the model architecture, leading to a robust program capable of predicting compound activity with high accuracy. Through this collaborative approach, the program was tailored to effectively utilize both molecular fingerprints and descriptors, thereby enhancing its predictive capabilities across diverse chemical datasets.
Analysis of Basic Property Distributions in Compound Database
To gain insight into the fundamental properties of the compounds in the dataset, a comprehensive analysis was performed using Python libraries such as Pandas, NumPy, Matplotlib, and Seaborn. The analysis also aimed to visualize the distribution of key molecular properties, facilitating a better understanding of the chemical space occupied by active and inactive compounds. Data were loaded from CSV files containing assay results from PubChem and BindingDB, categorized into active and inactive compounds. Key molecular properties, including molecular weight, polar surface area, XLogP, hydrogen bond donors (HBD), and hydrogen bond acceptors (HBA), were extracted together with the SMILES strings. A variety of plots, including density distributions and scatter plots, were then generated to visualize these properties. Specifically, the density distributions of molecular weight, LogP, and polar surface area, as well as the distributions of HBD and HBA, were analyzed to compare active and inactive compounds from PubChem and active compounds from BindingDB. This step provided a broad overview of the chemical space occupied by each dataset and identified differences between datasets, serving as a foundation for planning subsequent analyses.
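The per-group comparison described above can be sketched with Pandas as follows. This is a minimal illustration, not the code used in the study: the column names (`mw`, `xlogp`) and the small in-memory tables are hypothetical stand-ins for the contents of active.csv and inactive.csv.

```python
import pandas as pd


def summarize_properties(frames, columns):
    """Label each table with its group name and report mean and SD per group."""
    labeled = [df.assign(group=name) for name, df in frames.items()]
    combined = pd.concat(labeled, ignore_index=True)
    return combined.groupby("group")[columns].agg(["mean", "std"])


# Hypothetical toy data; in practice these would come from pd.read_csv(...).
active = pd.DataFrame({"mw": [400.0, 420.0, 440.0], "xlogp": [4.0, 4.5, 5.0]})
inactive = pd.DataFrame({"mw": [300.0, 340.0, 380.0], "xlogp": [2.0, 3.0, 4.0]})

summary = summarize_properties(
    {"active": active, "inactive": inactive}, ["mw", "xlogp"]
)
print(summary.round(2))
```

The resulting table has one row per group and a (property, statistic) column pair for each property, which is the layout needed for a mean ± SD comparison.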
Analysis of Chemical Space
To analyze the chemical space of the molecular datasets, two distinct computational methodologies were employed: one utilizing molecular descriptors and the other employing Morgan Fingerprints [20]. Both approaches involved dimensionality reduction to elucidate structural diversity and relationships within the chemical space. The first approach used molecular descriptors, which are numerical values representing various physicochemical properties of molecules, calculated using RDKit's Descriptors module [21]. The dataset was preprocessed to handle missing values and extreme outliers, and principal component analysis (PCA) was employed to reduce the descriptor space to two and three dimensions. The second approach involved generating Morgan Fingerprints, a widely used method for encoding molecular structures into fixed-size bit arrays. Using RDKit, Morgan Fingerprints with a radius of 2 and a bit length of 2048 were computed for each molecule. This encoding captured the structural features of the molecules, which were then subjected to PCA to reduce the feature space to three principal components, facilitating visualization in 2D or 3D space. The analysis was conducted separately for active and inactive compounds as well as for the combined dataset, allowing for a comprehensive exploration of the chemical space. For both methodologies, 2D and 3D scatter plots were generated to visualize the distribution of molecules in the reduced chemical space.
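The fingerprint route described above can be sketched as follows, assuming RDKit and scikit-learn are installed. The six SMILES strings are arbitrary illustrative molecules rather than compounds from the study, and the helper names are hypothetical. Features are standardized before PCA, following the description in the Results section.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def morgan_matrix(smiles_list, radius=2, n_bits=2048):
    """Return one Morgan bit-vector row per parsable SMILES string."""
    generator = rdFingerprintGenerator.GetMorganGenerator(
        radius=radius, fpSize=n_bits
    )
    rows = []
    for smiles in smiles_list:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:  # skip strings RDKit cannot parse
            continue
        rows.append(generator.GetFingerprintAsNumPy(mol))
    return np.array(rows, dtype=np.float64)


def project_pca(features, n_components=3):
    """Standardize the features and project them onto principal components."""
    scaled = StandardScaler().fit_transform(features)
    return PCA(n_components=n_components).fit_transform(scaled)


# Arbitrary example molecules (illustration only).
example_smiles = [
    "CCO",
    "c1ccccc1O",
    "CC(=O)Oc1ccccc1C(=O)O",
    "Oc1ccc(cc1)C(C)(C)c1ccc(O)cc1",
    "CN1C=NC2=C1C(=O)N(C)C(=O)N2C",
    "CC(C)Cc1ccc(cc1)C(C)C(=O)O",
]
X = morgan_matrix(example_smiles)
coords = project_pca(X, n_components=3)
print(X.shape, coords.shape)
```

The first two columns of `coords` give the 2D scatter plot coordinates and all three give the 3D view.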
Molecular Feature Extraction and Prediction Models
Two distinct methodologies for predicting molecular activity were employed: one based on molecular descriptors and the other on Morgan Fingerprints. Using RDKit, we computed 1,826 molecular descriptors, representing the physicochemical properties of each molecule, and Morgan Fingerprints with a radius of 2 and a bit length of 2,048, which capture diverse substructures with each bit representing a specific fragment. To address the data imbalance between active and inactive compounds (ratio approximately 1:7), the synthetic minority oversampling technique (SMOTE) [23] was adopted. The predictive model was constructed using TensorFlow's deep neural network framework. The model architecture consisted of an input layer matching the molecular descriptor dimension (1,826) or the Morgan fingerprint dimension (2,048), followed by two hidden layers with 128 and 64 neurons, respectively. A dropout rate of 0.5 was applied between layers to mitigate overfitting, and the output layer used sigmoid activation for binary classification. The model was compiled using binary cross-entropy loss and the Adam optimizer. The dataset was split into training (80%) and test (20%) sets using a random state of 42 for reproducibility. Training continued for 25 epochs with a batch size of 32, and 20% of the training data was reserved for validation. The performance of the model was evaluated using accuracy, precision, recall, F1 score, and area under the ROC curve (AUC) [22,23]. In their standard form, with TP, TN, FP, and FN denoting the numbers of true positives, true negatives, false positives, and false negatives, the first four metrics are defined by the following equations:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
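The architecture described above can be sketched in Keras as follows. This is a minimal sketch under stated assumptions, not the program used in the study. The ReLU activation of the hidden layers and the exact placement of the dropout layers are assumptions, since the text specifies only the layer sizes, the dropout rate, the sigmoid output, the loss, and the optimizer. SMOTE oversampling is omitted.

```python
import tensorflow as tf


def build_classifier(input_dim, dropout_rate=0.5):
    """Two hidden layers (128, 64) with dropout and a sigmoid output."""
    model = tf.keras.Sequential(
        [
            tf.keras.Input(shape=(input_dim,)),
            tf.keras.layers.Dense(128, activation="relu"),  # ReLU is assumed
            tf.keras.layers.Dropout(dropout_rate),
            tf.keras.layers.Dense(64, activation="relu"),  # ReLU is assumed
            tf.keras.layers.Dropout(dropout_rate),
            tf.keras.layers.Dense(1, activation="sigmoid"),
        ]
    )
    model.compile(
        optimizer="adam",
        loss="binary_crossentropy",
        metrics=["accuracy", tf.keras.metrics.AUC(name="auc")],
    )
    return model


# 2048 inputs for Morgan fingerprints; 1826 would be used for descriptors.
model = build_classifier(input_dim=2048)
n_params = model.count_params()
print(n_params)
```

With feature and label arrays in hand, the stated training settings would correspond to a call such as `model.fit(X_train, y_train, epochs=25, batch_size=32, validation_split=0.2)`, where `X_train` and `y_train` are placeholder names.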
Predictive Modeling of Compound Activity in NPASS Database
The Morgan Fingerprint-based model was used to predict the activity of compounds in the NPASS database, with the goal of screening large chemical datasets for potential active candidates [24]. A pre-trained model, stored in H5 file format, was loaded using TensorFlow's Keras API. The SMILES strings in the NPASS database were converted into RDKit molecule objects. Morgan Fingerprints with a radius of 2 and a bit length of 2048 were calculated for each molecule. These fingerprints served as input features for the model, which predicted the activity of each compound. Compounds with prediction scores above 0.5 were classified as active, and these active compounds were extracted and saved in CSV file format.
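The final selection step can be illustrated with a short sketch that uses only the Python standard library. Model loading and fingerprint calculation are left out, and the scores below are made-up placeholders standing in for model outputs. The function and column names are hypothetical.

```python
import csv
import io


def select_actives(smiles_list, scores, threshold=0.5):
    """Keep (SMILES, score) pairs whose score is strictly above the threshold."""
    return [
        (smiles, score)
        for smiles, score in zip(smiles_list, scores)
        if score > threshold
    ]


def write_hits_csv(hits, handle):
    """Write the selected compounds to an open text handle as CSV."""
    writer = csv.writer(handle)
    writer.writerow(["smiles", "score"])
    writer.writerows(hits)


# Placeholder inputs: arbitrary molecules with invented prediction scores.
smiles = ["CCO", "c1ccccc1O", "CC(=O)O", "Oc1ccc(cc1)C(C)(C)c1ccc(O)cc1"]
scores = [0.12, 0.50, 0.08, 0.93]

hits = select_actives(smiles, scores)
buffer = io.StringIO()
write_hits_csv(hits, buffer)
print(buffer.getvalue())
```

A score of exactly 0.5 is not selected here, which matches the "above 0.5" wording. A file opened with `open(path, "w", newline="")` can be passed in place of the in-memory buffer.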
Results & Discussion
Distribution of molecular properties in database
The distribution and average values of molecular properties for compounds downloaded from PubChem and BindingDB are shown in Figure 1 and Table 1. While some differences between active and inactive compounds are observed, these differences are generally subtle. Active compounds from both PubChem and BindingDB show slightly higher average molecular weights (414 ± 194 and 447 ± 214, respectively) compared to inactive compounds (343 ± 96). This suggests that active compounds may tend to be slightly larger molecules. LogP values demonstrate a trend towards higher values in active compounds (4.29 ± 2.20 for PubChem and 4.78 ± 2.23 for BindingDB) compared to inactive compounds (2.77 ± 2.58). This indicates that active compounds might generally be more lipophilic, which could influence their ability to cross cell membranes or interact with hydrophobic binding pockets. Interestingly, the polar surface area shows similar mean values across all three groups (82.0 ± 38.0 for inactive, 80.9 ± 69.5 for PubChem active, and 79.6 ± 77.0 for BindingDB active). The larger standard deviations in active compounds suggest a wider range of polar surface areas among these compounds. Similarly, active compounds show slightly higher average numbers of hydrogen bond donors and acceptors, but the differences are modest.
Table 1. Comparison of molecular properties for inactive and active compounds from PubChem and BindingDB
Figure 1. Distribution of molecular properties and relationship between hydrogen bond donors (HBD) and acceptors (HBA). The upper panels show the distribution of molecular weight, logP, and polar surface area (PSA), where blue lines represent inactive compounds from PubChem, red lines represent active compounds from PubChem, and green lines represent active compounds from BindingDB. The lower panel displays a scatter plot of HBD (x-axis) vs. HBA (y-axis), with each point representing an individual compound using the same color scheme as in the upper panels.
In addition, active compounds from BindingDB generally show slightly higher average values compared to those from PubChem, particularly in molecular weight and LogP. This could reflect differences in the nature of compounds included in each database or variations in the criteria used for defining activity. While these molecular properties provide some insights, they may not fully capture the factors contributing to a compound's biological activity. The three-dimensional distribution of functional groups and overall molecular structure likely play crucial roles in determining activity. To gain a more comprehensive understanding of the structural features that distinguish active from inactive compounds, we calculated additional molecular descriptors and Morgan fingerprints. Principal component analysis (PCA) was then performed on these features to identify key patterns and differences between active and inactive compounds.
Chemical space analysis: Principal component analysis
Figure 2. PCA of molecular properties for active and inactive compounds. Molecular descriptors were calculated using RDKit. (a) and (b) show 2D PCA plots, while (c) and (d) present 3D PCA visualizations. The color scheme is consistent with Figure 1: blue points represent inactive compounds from PubChem, red points indicate active compounds from PubChem, and green points denote active compounds from BindingDB. These plots compare the distribution of active compounds from both PubChem and BindingDB against inactive compounds from PubChem in the PCA space, providing insights into the molecular property differences between active and inactive compounds.
To further explore the structural features that distinguish active compounds from inactive ones, PCA was performed on two sets of molecular features: conventional molecular descriptors and Morgan fingerprints (Figures 2 and 3). The analysis included three datasets: active and inactive compounds from PubChem, and active compounds from BindingDB. The analysis pipeline was developed in Python, using RDKit for molecular processing, scikit-learn for PCA, and matplotlib for visualization. For each feature set, both 2D and 3D PCA plots were created. Active compounds from PubChem were represented in red, inactive compounds in blue, and active compounds from BindingDB in green to ensure clear visualization. The PCA was conducted on standardized data to account for different feature scales.
Figure 3. PCA based on Morgan fingerprints for active and inactive compounds. (a) and (b) display 2D PCA plots, while (c) and (d) show 3D PCA visualizations. The color scheme remains consistent with previous figures: blue points represent inactive compounds from PubChem, red points indicate active compounds from PubChem, and green points denote active compounds from BindingDB. These plots illustrate the distribution of active compounds from both PubChem and BindingDB compared to inactive compounds from PubChem in the PCA space derived from Morgan fingerprints. This visualization provides insights into the structural differences between active and inactive compounds as captured by Morgan fingerprints, offering a complementary perspective to the molecular descriptor-based analysis in Figure 2.
Interestingly, the PCA results showed that Morgan fingerprints provided a more distinct separation between active and inactive compounds compared to conventional molecular descriptors. This superior discrimination ability suggests that Morgan fingerprints capture more relevant structural information for predicting compound activity. The improved separation can be attributed to Morgan fingerprints encoding specific structural fragments and their connectivity, which are crucial for protein-ligand interactions. In contrast, conventional molecular descriptors, while offering useful overall characteristics of molecules, may not fully capture the subtle structural features that determine a compound's biological activity. Moreover, the clear separation achieved with Morgan fingerprints indicates their potential utility in developing more accurate predictive models for compound activity.
Descriptor-Based Prediction
The performance of descriptor-based prediction models for compound activity was investigated using various datasets and criteria in this study (Table 2 and Table 3). Models trained on BindingDB and PubChem datasets were compared to explore the impact of different activity thresholds and molecular weight restrictions.
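The dataset variants compared here differ in the activity cutoff and, for some models, in the allowed molecular weight range. A minimal sketch of such a filter is given below. The record fields (`activity_um`, `mw`) and the example values are hypothetical, and the code is meant only to show how the criteria combine.

```python
def filter_compounds(records, max_activity_um, mw_range=None):
    """Keep records below the activity cutoff and, optionally, within an MW range."""
    kept = []
    for record in records:
        if record["activity_um"] >= max_activity_um:
            continue  # not potent enough under this threshold
        if mw_range is not None:
            low, high = mw_range
            if not (low <= record["mw"] <= high):
                continue  # outside the allowed molecular weight window
        kept.append(record)
    return kept


# Invented example records (activity in micromolar, molecular weight in Da).
records = [
    {"id": "A", "activity_um": 0.05, "mw": 272.4},
    {"id": "B", "activity_um": 12.0, "mw": 410.0},
    {"id": "C", "activity_um": 85.0, "mw": 702.9},
    {"id": "D", "activity_um": 250.0, "mw": 318.2},
]

strict = filter_compounds(records, max_activity_um=1.0)
loose = filter_compounds(records, max_activity_um=100.0)
loose_mw = filter_compounds(records, max_activity_um=100.0, mw_range=(250, 650))
print([r["id"] for r in strict], [r["id"] for r in loose], [r["id"] for r in loose_mw])
```

A stricter cutoff yields a smaller set of actives, which is the trade-off between dataset size and label quality examined in this section.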
To evaluate classification performance, accuracy, precision, recall, F1 score, and area under the ROC curve (AUC) were used, since each provides insight into a different aspect of model efficacy [25,26]. Accuracy measures the proportion of correctly predicted instances out of the total instances, offering a broad overview of model performance. However, it may not be sufficient in cases of class imbalance. Precision focuses on the quality of positive predictions, answering how many instances labeled as positive are actually positive. Recall, also known as sensitivity, measures the model's ability to identify all relevant positive instances, which is crucial in scenarios where missing positive instances can be costly. The F1 score, the harmonic mean of precision and recall, offers a balanced measure when both false positives and false negatives must be accounted for. It is particularly useful when the class distribution is uneven. AUC provides an aggregate measure of performance across all classification thresholds. The ROC curve plots the true positive rate against the false positive rate at various threshold settings. The AUC represents the likelihood that the model will rank a randomly chosen positive instance higher than a randomly chosen negative one. An AUC of 0.5 indicates no discriminative power (akin to random guessing), while an AUC of 1.0 reflects perfect discrimination.
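The metrics described above can be made concrete with a short, dependency-free sketch. The confusion-matrix counts and scores are invented for illustration. The AUC is computed directly from its ranking interpretation, as the fraction of positive-negative pairs in which the positive receives the higher score, with ties counted as one half.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def auc_by_ranking(positive_scores, negative_scores):
    """Probability that a random positive outranks a random negative."""
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Invented example values.
metrics = classification_metrics(tp=90, fp=10, tn=85, fn=15)
auc = auc_by_ranking([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
print(metrics, auc)
```

The pairwise loop is quadratic in the number of compounds, so it is suitable only for illustration. Library routines compute the same quantity from sorted scores.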
Table 2. Comparison of descriptor-based model performance across activity thresholds using BindingDB dataset.
Table 3. Comparison of descriptor-based model performance across activity thresholds using PubChem dataset.
The results were examined in terms of test set performance, loss values, and the trade-off between precision and recall. Notably, the models showed close agreement between training and test set performance, indicating robust generalization. For instance, in the BindingDB dataset with the strictest activity criterion (<1 μM), the test accuracy of 98.15% closely mirrored the training accuracy of 98.16%, with correspondingly low loss values (0.0304 for test, 0.0302 for training). This consistency across datasets suggests minimal overfitting, a critical factor in developing reliable predictive models for real-world applications in drug discovery. Models using BindingDB data consistently outperformed those using PubChem data, with accuracies ranging from 91.45% to 98.16% and AUC values between 0.9671 and 0.9949 for BindingDB models, compared to 81.12–84.10% accuracy and 0.8918–0.9249 AUC for PubChem models.
The precision, recall, and F1 scores provide deeper insights into the models' performance characteristics. For the BindingDB model with a <1 μM activity threshold, high precision (97.69%) indicates a low false positive rate, crucial for minimizing costly experimental validations of inactive compounds. The exceptional recall (98.62%) suggests the model's strong ability to identify truly active compounds, vital for not missing potential drug candidates. The resulting F1 score of 98.15% represents a well-balanced model performance, harmonizing precision and recall. This balance is particularly important in drug discovery, where both false positives (leading to wasted resources) and false negatives (missing potential leads) can have significant consequences. The consistently high AUC values, peaking at 0.9949 for the strictest BindingDB model, further corroborate the models' excellent discriminative power across various classification thresholds. In contrast, limiting molecular weight ranges in the PubChem dataset slightly enhanced model performance, with an optimal range of 250–650 Da yielding an accuracy of 83.89% and AUC of 0.9218.
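As a consistency check, the reported F1 score can be recomputed from the reported precision and recall using the standard harmonic-mean definition:

```latex
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
    = \frac{2 \times 0.9769 \times 0.9862}{0.9769 + 0.9862}
    = \frac{1.9268}{1.9631}
    \approx 0.9815
```

This agrees with the stated F1 score of 98.15% for the BindingDB model at the <1 μM threshold.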
The models showed high consistency between training and test set performances, indicating minimal overfitting. Interestingly, while stricter activity criteria reduced the dataset size for BindingDB, they improved model performance, highlighting the importance of data quality over quantity. These findings demonstrate the critical role of data curation and the need for careful selection of training datasets to enhance predictive modeling in drug discovery.
Morgan Fingerprint-Based Prediction
The performance of compound activity prediction models based on Morgan fingerprints was evaluated using the same methodology previously applied to molecular descriptor-based models (Table 4 and Table 5). Both approaches were tested on BindingDB and PubChem datasets to enable a direct comparison. The Morgan fingerprint-based prediction models demonstrated consistently better performance compared to the descriptor-based prediction models across both BindingDB and PubChem datasets. This superiority was particularly pronounced in the BindingDB dataset, where Morgan fingerprint models achieved remarkably high accuracies ranging from 99.36% to 99.82% and AUC values between 0.9980 and 0.9994. In contrast, the descriptor-based models for BindingDB, while still performing well, showed lower accuracies of 91.45% to 98.16% and AUC values of 0.9671 to 0.9949. This substantial improvement in performance suggests that Morgan fingerprints may be capturing structural features more relevant to compound activity than traditional molecular descriptors.
The performance gap between the two methods was even more evident in the PubChem dataset. Morgan fingerprint models achieved accuracies of 95.87% to 96.57% and AUC values of 0.9849 to 0.9863, significantly outperforming the descriptor-based models, which only managed accuracies of 81.12% to 84.10% and AUC values of 0.8918 to 0.9246. This consistent outperformance across different datasets underscores the robustness and versatility of the Morgan fingerprint approach in capturing molecular features relevant to activity prediction.
Both methods showed improvements in performance with stricter activity criteria and molecular weight restrictions, but this trend was more consistent and pronounced in the Morgan fingerprint models. For instance, in the BindingDB dataset, as the activity threshold became more stringent (from <100 μM to <1 μM), the Morgan fingerprint models showed a steady increase in accuracy from 99.36% to 99.82%. This improvement was mirrored in other metrics such as precision, recall, and F1 score, all of which reached near-perfect values at the strictest activity threshold. The descriptor-based models also improved with stricter criteria, but not to the same extent or with the same consistency.
The balance between precision and recall was notably better in the Morgan fingerprint models, particularly for the BindingDB dataset. At the strictest activity threshold (<1 μM), the Morgan fingerprint model achieved a precision of 99.65% and a recall of 100%, resulting in an F1 score of 99.81%. This near-perfect balance indicates the model's exceptional ability to correctly identify active compounds while minimizing false positives and false negatives. In comparison, the best-performing descriptor-based model on the same dataset achieved a precision of 97.69%, recall of 98.62%, and an F1 score of 98.15%, which, while impressive, falls short of the Morgan fingerprint model's performance.
Both approaches demonstrated better performance on the BindingDB dataset compared to the PubChem dataset, but this difference was more pronounced in the Morgan fingerprint models. This observation highlights the importance of data quality and curation in building effective predictive models. The superior performance of Morgan fingerprint models across different datasets and conditions suggests that this method of molecular representation may be more effective in capturing the structural intricacies that contribute to compound activity. These findings have significant implications for the field of drug discovery, indicating that Morgan fingerprints could be a more reliable choice for molecular representation in activity prediction tasks, potentially leading to more accurate identification of promising drug candidates and more efficient use of resources in the drug discovery pipeline.
Application of deep learning model for activity prediction and virtual screening on natural product databases
Table 4. Comparison of Morgan Fingerprint-based model performance across activity thresholds using BindingDB dataset
Table 5. Comparison of Morgan Fingerprint-based model performance across activity thresholds using PubChem dataset
For screening bioactive compounds from natural product databases, a deep learning model based on Morgan fingerprints was employed, which was trained on a dataset combining compounds with activity < 100 μM from BindingDB and inactive compounds from PubChem. This relatively less strict activity threshold was chosen to identify a wider range of potential estrogen receptor modulators, allowing for compounds with varying levels of activity. We applied this model to screen natural products from the NPASS database, which contains a total of 81,442 compounds. This virtual screening approach, designed to capture a diverse set of potentially active compounds, identified 3,577 compounds as potentially active against the estrogen receptor. The use of a more inclusive activity criterion in the training set aimed to improve the model's ability to identify a broader range of structural features that may influence estrogen receptor activity, including those with moderate effects. Additionally, to validate the model's predictions, a literature review was conducted for the compounds flagged as active. Table 6 presents a subset of these compounds that have been previously reported in the literature as having activity against the estrogen receptor, confirming the effectiveness of our deep learning-based virtual screening approach.
Furthermore, our model identified novel compounds that have not been previously associated with estrogen receptor activity. Table 7 presents examples of these newly identified potential estrogen receptor modulators, demonstrating the model's capacity to uncover new lead candidates.
These results demonstrate the utility of deep learning-based virtual screening in drug discovery, particularly for identifying both known active compounds and potential new leads in large natural product databases. The approach narrowed the 81,442 NPASS compounds down to a more manageable set of 3,577 potentially active compounds, about 4.4% of the library, which substantially reduces the number of candidates requiring follow-up.
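The size of this reduction follows directly from the two counts reported in the text. The short calculation below uses only those counts. The variable names are illustrative.

```python
LIBRARY_SIZE = 81_442     # NPASS compounds screened (from the text)
PREDICTED_ACTIVE = 3_577  # compounds flagged as potentially active (from the text)

# Fraction of the library that the model flagged.
hit_fraction = PREDICTED_ACTIVE / LIBRARY_SIZE
# How many times smaller the flagged set is than the full library.
fold_reduction = LIBRARY_SIZE / PREDICTED_ACTIVE

print(f"flagged: {hit_fraction:.2%} of the library")
print(f"library reduced about {fold_reduction:.1f}-fold")
```

Note that the flagged fraction describes how selective the model was on NPASS. It is not an estimate of how many flagged compounds are truly active, which only follow-up validation can establish.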
Table 6. Natural compounds screened virtually from NPASS dataset. Previously reported estrogen receptor-active compounds identified by our model
While these findings are promising, it is important to note that further validation is necessary. The activity of these newly identified compounds should be confirmed through additional computational methods, such as free energy calculations, and ultimately through experimental verification. Nonetheless, this study showed the power of combining deep learning techniques with traditional cheminformatics approaches to expedite the identification of bioactive natural products, potentially accelerating the drug discovery pipeline for estrogen receptor modulators.
Table 7. Natural compounds screened virtually from NPASS dataset. Examples of novel compounds predicted to have estrogen receptor activity
Conclusion
This study demonstrates the efficacy of deep learning models, particularly those utilizing Morgan fingerprints, in predicting estrogen receptor activity. The Morgan fingerprint-based approach significantly outperformed the molecular descriptor-based approach, achieving high accuracy and AUC values, especially when trained on high-quality datasets such as BindingDB. The successful application of the model to the NPASS database, identifying both known active compounds and potential new candidates, underscores the power of AI in streamlining drug discovery processes. By narrowing a chemical space of over 80,000 compounds down to a manageable set of potential candidates, the method offers a time- and cost-effective approach to lead identification. While promising, these results also highlight the critical importance of data quality and proper curation in developing reliable predictive models. Future work should focus on experimental validation of the newly identified compounds and on further refinement of the model to enhance its predictive capabilities across diverse chemical spaces.
References
1. Sarker, I. H. Deep Learning: a comprehensive overview on techniques, taxonomy, applications and research directions. SN Comput. Sci. 2021, 2, 420. DOI: 10.1007/s42979-021-00815-1
2. Alzubaidi, L.; Zhang, J.; Humaidi, A. J.; Al-Dujaili, A.; Duan, Y.; Al-Shamma, O.; Santamaría, J.; Fadhel, M. A.; Al-Amidie, M.; Farhan, L. Review of deep learning: concepts, CNN architectures, challenges, applications, future directions. J. Big Data, 2021, 8, 53. DOI: 10.1186/s40537-021-00444-8
3. Najjar, R. Redefining radiology: a review of artificial intelligence integration in medical imaging. Diagnostics, 2023, 13, 2760. DOI: 10.3390/diagnostics13172760
4. Tunyasuvunakool, K.; Adler, J.; Wu, Z.; Green, T.; Zielinski, M.; Žídek, A.; Bridgland, A.; Cowie, A.; Meyer, C.; Laydon, A.; Velankar, S.; Kleywegt, G. J.; Bateman, A.; Evans, R.; Pritzel, A.; Figurnov, M.; Ronneberger, O.; Bates, R.; Kohl, S. A. A.; Potapenko, A.; Ballard, A. J.; Romera-Paredes, B.; Nikolov, S.; Jain, R.; Clancy, E.; Reiman, D.; Petersen, S.; Senior, A. W.; Kavukcuoglu, K.; Birney, E.; Kohli, P.; Jumper, J.; Hassabis, D. Highly accurate protein structure prediction for the human proteome. Nature, 2021, 596, 590-596. DOI: 10.1038/s41586-021-03828-1
5. Rajpura, H. R.; Ngom, A. Predicting small molecule potency to inhibit estrogen receptors using machine learning and deep learning approaches. 2018 International Conference on Artificial Intelligence and Data Processing (IDAP), Malatya, Turkey, 2018, 1-5. DOI: 10.1109/IDAP.2018.8620864
6. Paul, D.; Sanap, G.; Shenoy, S.; Kalyane, D.; Kalia, K.; Tekade, R. K. Artificial intelligence in drug discovery and development. Drug Discov. Today, 2021, 26, 80-93. DOI: 10.1016/j.drudis.2020.10.010
7. Askr, H.; Elgeldawi, E.; Ella, H. A.; Elshaier, Y. A. M. M.; Gomaa, M. M.; Hassanien, A. E. Deep learning in drug discovery: an integrative review and future challenges. Artif. Intell. Rev. 2023, 56, 5975-6037. DOI: 10.1007/s10462-022-10306-1
8. Vora, L. K.; Gholap, A. D.; Jetha, K.; Thakur, R. R. S.; Solanki, H. K.; Chavda, V. P. Artificial intelligence in pharmaceutical technology and drug delivery design. Pharmaceutics, 2023, 15, 1916. DOI: 10.3390/pharmaceutics15071916
9. Abou Hajal, A.; Al Meslamani, A. Z. Insights into artificial intelligence utilisation in drug discovery. J. Med. Econ. 2024, 27, 304-308. DOI: 10.1080/13696998.2024.2315864
10. Vamathevan, J.; Clark, D.; Czodrowski, P.; Dunham, I.; Ferran, E.; Lee, G.; Li, B.; Madabhushi, A.; Shah, P.; Spitzer, M.; Zhao, S. Applications of machine learning in drug discovery and development. Nat. Rev. Drug Discov. 2019, 18, 463-477. DOI: 10.1038/s41573-019-0024-5
11. Hagg, A.; Kirschner, K. N. Open-source machine learning in computational chemistry. J. Chem. Inf. Model. 2023, 63, 4505-4532. DOI: 10.1021/acs.jcim.3c00643
12. Fuentes, N.; Silveyra, P. Estrogen receptor signaling mechanisms. Adv. Protein Chem. Struct. Biol. 2019, 116, 135-170. DOI: 10.1016/bs.apcsb.2019.01.001
13. Katzenellenbogen, B. S.; Katzenellenbogen, J. A. Estrogen receptor transcription and transactivation: Estrogen receptor alpha and estrogen receptor beta: regulation by selective estrogen receptor modulators and importance in breast cancer. Breast Cancer Res. 2000, 2, 335-344. DOI: 10.1186/bcr78
14. Kraichely, D. M.; Sun, J.; Katzenellenbogen, J. A.; Katzenellenbogen, B. S. Conformational changes and coactivator recruitment by novel ligands for estrogen receptor-alpha and estrogen receptor-beta: correlations with biological character and distinct differences among SRC coactivator family members. Endocrinology, 2000, 141, 3534-3545. DOI: 10.1210/endo.141.10.7698
15. Patel, H. K.; Bihani, T. Selective estrogen receptor modulators (SERMs) and selective estrogen receptor degraders (SERDs) in cancer treatment. Pharmacol. Ther. 2018, 186, 1-24. DOI: 10.1016/j.pharmthera.2017.12.012
16. Cui, C.; Ding, X.; Wang, D.; Chen, L.; Xiao, F.; Xu, T.; Zheng, M.; Luo, X.; Jiang, H.; Chen, K. Drug repurposing against breast cancer by integrating drug-exposure expression profiles and drug-drug links based on graph neural network. Bioinformatics, 2021, 37, 2930-2937. DOI: 10.1093/bioinformatics/btab191
17. Nederlof, I.; Hajizadeh, S.; Sobhani, F.; Raza, S. E. A.; AbdulJabbar, K.; Harkes, R.; van de Vijver, M. J.; Salgado, R.; Desmedt, C.; Kok, M.; Yuan, Y.; Horlings, H. M. Spatial interplay of lymphocytes and fibroblasts in estrogen receptor-positive HER2-negative breast cancer. NPJ Breast Cancer, 2022, 8, 56. DOI: 10.1038/s41523-022-00416-y
18. Kim, S.; Chen, J.; Cheng, T.; Gindulyte, A.; He, J.; He, S.; Li, Q.; Shoemaker, B. A.; Thiessen, P. A.; Yu, B.; Zaslavsky, L.; Zhang, J.; Bolton, E. E. PubChem in 2021: new data content and improved web interfaces. Nucleic Acids Res. 2021, 49, D1388-D1395. DOI: 10.1093/nar/gkaa971
19. Gilson, M. K.; Liu, T.; Baitaluk, M.; Nicola, G.; Hwang, L.; Chong, J. BindingDB in 2015: A public database for medicinal chemistry, computational chemistry and systems pharmacology. Nucleic Acids Res. 2016, 44, D1045-D1053. DOI: 10.1093/nar/gkv1072
20. Capecchi, A.; Probst, D.; Reymond, J. L. One molecular fingerprint to rule them all: drugs, biomolecules, and the metabolome. J. Cheminform. 2020, 12, 43. DOI: 10.1186/s13321-020-00445-4
21. Bento, A. P.; Hersey, A.; Félix, E.; Landrum, G.; Gaulton, A.; Atkinson, F.; Bellis, L. J.; De Veij, M.; Leach, A. R. An open source chemical structure curation pipeline using RDKit. J. Cheminform. 2020, 12, 51. DOI: 10.1186/s13321-020-00456-1
22. Chicco, D.; Jurman, G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics, 2020, 21, 6. DOI: 10.1186/s12864-019-6413-7
23. Hicks, S. A.; Strümke, I.; Thambawita, V.; Hammou, M.; Riegler, M. A.; Halvorsen, P.; Parasa, S. On evaluation metrics for medical applications of artificial intelligence. Sci. Rep. 2022, 12, 5979. DOI: 10.1038/s41598-022-09954-8
24. Zeng, X.; Zhang, P.; He, W.; Qin, C.; Chen, S.; Tao, L.; Wang, Y.; Tan, Y.; Gao, D.; Wang, B.; Chen, Z.; Chen, W.; Jiang, Y. Y.; Chen, Y. Z. NPASS: natural product activity and species source database for natural product research, discovery and tool development. Nucleic Acids Res. 2018, 46, D1217-D1222. DOI: 10.1093/nar/gkx1026
25. Tharwat, A. Classification assessment methods. Appl. Comput. Inform. 2021, 17, 168-192. DOI: 10.1016/j.aci.2018.08.003
26. Vakili, M.; Ghamsari, M.; Rezaei, M. Performance analysis and comparison of machine and deep learning algorithms for IoT data classification. arXiv, 2001.09636. DOI: 10.48550/arXiv.2001.09636
27. Ruh, M. F.; Zacharewski, T.; Connor, K.; Howell, J.; Chen, I.; Safe, S. Naringenin: a weakly estrogenic bioflavonoid that exhibits antiestrogenic activity. Biochem. Pharmacol. 1995, 50, 1485-1493. DOI: 10.1016/0006-2952(95)02061-6
28. Galluzzo, P.; Ascenzi, P.; Bulzomi, P.; Marino, M. The nutritional flavanone naringenin triggers antiestrogenic effects by regulating estrogen receptor α-palmitoylation. Endocrinology, 2008, 149, 2567-2575. DOI: 10.1210/en.2007-1173
29. Mersereau, J. E.; Levy, N.; Staub, R. E.; Baggett, S.; Zogric, T.; Chow, S.; Ricke, W. A.; Tagliaferri, M.; Cohen, I.; Bjeldanes, L. F.; Leitman, D. C. Liquiritigenin is a plant-derived highly selective estrogen receptor β agonist. Mol. Cell. Endocrinol. 2008, 283, 49-57. DOI: 10.1016/j.mce.2007.11.020
30. Tzenov, Y. R.; Andrews, P.; Voisey, K.; Gai, L.; Carter, B.; Whelan, K.; Popadiuk, C.; Kao, K. R. Selective estrogen receptor modulators and betulinic acid act synergistically to target ERα and SP1 transcription factor dependent Pygopus expression in breast cancer. J. Clin. Pathol. 2016, 69, 518-526. DOI: 10.1136/jclinpath-2015-203395
31. Kim, Y. G.; Park, Y. H.; Yang, E. Y.; Park, W. S.; Park, K. S. Inhibition of tamoxifen's therapeutic effects by emodin in estrogen receptor-positive breast cancer cell lines. Ann. Surg. Treat. Res. 2019, 97, 230-238. DOI: 10.4174/astr.2019.97.5.230
32. Ye, W.; Chang, H. L.; Wang, L. S.; Huang, Y. W.; Shu, S.; Dowd, M. K.; Wan, P. J.; Sugimoto, Y.; Lin, Y. C. Modulation of multidrug resistance gene expression in human breast cancer cells by (-)-gossypol-enriched cottonseed oil. Anticancer Res. 2007, 27, 107-116. PMID: 17352222 https://ar.iiarjournals.org/content/27/1A/107
33. Kim, S. D.; Kim, Y.; Kim, M.; Jeong, H.; Choi, S. H.; Ryu, H. W.; Oh, S. R.; Lee, S. W.; Li, W. Y.; Wu, H. H.; Zhu, Y.; Wang, X.; Chang, M.; Song, Y. S. Estrogenic properties of Prunus cerasoides extract and its constituents in MCF‐7 cell and evaluation in estrogen‐deprived rodent models. Phytother. Res. 2020, 34, 1347-1357. DOI: 10.1002/ptr.6604
34. Collingwood, T. N.; Urnov, F. D.; Wolffe, A. P. Nuclear receptors: coactivators, corepressors and chromatin remodeling in the control of transcription. J. Mol. Endocrinol. 1999, 23, 255-275. DOI: 10.1677/jme.0.0230255