ML challenges
QM9 dataset
QM9 is a comprehensive dataset that provides geometric, energetic, electronic and thermodynamic properties computed by quantum chemical calculations (DFT at the B3LYP/6-31G(2df,p) level). The data include HOMO, LUMO, enthalpy, Gibbs energy, etc. for 134K molecules.
Datasets from MoleculeNet in deepchem or Quantum-Machine.org
https://deepchem.readthedocs.io/en/latest/api_reference/moleculenet.html#bace-dataset
http://quantum-machine.org/datasets/
Regression analysis
Data number : 134K (only a subset, at minimum 30,000 data points, was used for the calculations)
Descriptor: RDKit, 290 descriptors
Ratio of training and test = 8:2
Target: HOMO, LUMO and Energy gap
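The regression setup above can be sketched as follows; synthetic data stand in for the RDKit descriptor matrix and the HOMO values, so the names and shapes here are placeholders, not the actual QM9 pipeline.

```python
# Sketch of the regression workflow: 8:2 train/test split, Random forest,
# r2 and MAE scores. Synthetic data stand in for descriptors and targets.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))          # stand-in for the descriptor matrix
y = X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=500)  # stand-in target

# Ratio of training and test = 8:2, as in the text
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)
for name, Xs, ys in [("training", X_tr, y_tr), ("test", X_te, y_te)]:
    pred = model.predict(Xs)
    print(f"{name}: r2 = {r2_score(ys, pred):.2f}, "
          f"MAE = {mean_absolute_error(ys, pred):.3f}")
```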
Estimation of models
In all estimations, the Random Forest (RF) model gave the best results, so the following calculations were performed with the RF model.
HOMO
training r2 = 0.99, MAE = 0.0018, test r2 = 0.90, MAE = 0.00047
LUMO
training r2 = 0.99, MAE = 0.0019, test r2 = 0.97, MAE = 0.0052
Energy gap
training r2 = 0.99, MAE = 0.0024, test r2 = 0.96, MAE = 0.0063
The descriptors include sufficient information to describe the energy properties.
Descriptor selection
The dominant descriptors were selected for the HOMO energy.
In the first round of selection, PEOE_VSA1, one of the molecular surface area descriptors, was selected. After selecting 10 descriptors:
training: r2 = 0.97, MAE = 0.0033, test: r2 = 0.85, MAE = 0.0076
In the second round of selection, TPSA and NHOHCount were selected. TPSA is also a molecular surface area descriptor. In the following rounds, no strong descriptors were found, but PEOE, fr_nitrile and fr_aniline were selected.
Mostly, the energy-related properties can be explained by the surface area descriptors, which relate to the volume of the electron cloud. Many descriptors related to nitrogen-containing functional groups were also selected.
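The selection rounds described above can be sketched as repeatedly fitting a model, recording the most important descriptor, and removing it before refitting; the data and descriptor names here are synthetic placeholders.

```python
# Sketch of round-by-round descriptor selection via Random forest
# feature importances. Synthetic data; "desc_*" names are hypothetical.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
names = [f"desc_{i}" for i in range(10)]   # hypothetical descriptor names
X = rng.normal(size=(300, 10))
y = 2 * X[:, 3] + X[:, 7] + 0.1 * rng.normal(size=300)

selected = []
Xw = X.copy()
remaining = list(names)
for _ in range(3):                          # three selection rounds
    rf = RandomForestRegressor(random_state=0).fit(Xw, y)
    top = int(np.argmax(rf.feature_importances_))
    selected.append(remaining.pop(top))
    Xw = np.delete(Xw, top, axis=1)         # drop the dominant descriptor
print(selected)
```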
Response plot for the test data
BACE dataset
The BACE dataset provides quantitative IC50 and qualitative (binary label) binding results for a set of inhibitors of human beta-secretase 1 (BACE-1); it was downloaded from MoleculeNet. IC50 indicates the concentration of a chemical necessary to inhibit a biological function by half.
Datasets from MoleculeNet in deepchem
https://deepchem.readthedocs.io/en/latest/api_reference/moleculenet.html#bace-dataset
Regression analysis
Data number : 1513
Descriptor: RDKit, 290 descriptors
Ratio of training and test = 8:2
Selection of descriptors
Since the score was comparatively low for this molecular dataset, several other descriptors were tested.
MACCSKey
Linear model: training r2 = 0.61, MAE = 0.64, test r2 = 0.56, MAE = 0.72
Ridge model: training r2 = 0.61, MAE = 0.64, test r2 = 0.56, MAE = 0.72
Circular
Linear model: training r2 = 0.97, MAE = 0.12, test r2 = 0.54, MAE = 0.66
Ridge model: training r2 = 0.97, MAE = 0.08, test r2 = 0.12, MAE = 0.97
RDKit
Linear model: training r2 = 0.71, MAE = 0.54, test r2 = 0.66, MAE = 0.66
Ridge model: training r2 = 0.70, MAE = 0.55, test r2 = 0.64, MAE = 0.67
As a result, the RDKit descriptors still showed the best result and were used here.
Rough estimation of the model
Linear model: training r2 = 0.71, MAE = 0.54, test r2 = 0.66, MAE = 0.66
Ridge model: training r2 = 0.70, MAE = 0.55, test r2 = 0.64, MAE = 0.67
K-nearest neighbor: training r2 = 0.77, MAE = 0.46, test r2 = 0.62, MAE = 0.66
Partial least square: training r2 = 0.47, MAE = 0.78, test r2 = 0.52, MAE = 0.78
Random forest: training r2 = 0.94, MAE = 0.21, test r2 = 0.72, MAE = 0.56
Hyperparameter optimization of the best model: Random forest model
n_estimators : 90, max_depth: 15
training r2 = 0.94, MAE = 0.23, test r2 = 0.73, MAE = 0.56
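The grid-search step can be sketched as follows; this is a minimal illustration with synthetic data standing in for the BACE descriptors, and the grid values (chosen to bracket the reported optimum) are assumptions, not the exact search used in the text.

```python
# Sketch of hyperparameter optimization for a Random forest model.
# The data here are synthetic placeholders, not the BACE descriptors.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 15))                     # stand-in descriptor matrix
y = X[:, 0] + 0.5 * X[:, 1] + 0.2 * rng.normal(size=300)  # stand-in target

# Grid bracketing the reported optimum (n_estimators = 90, max_depth = 15)
grid = {"n_estimators": [50, 90, 150], "max_depth": [5, 15, None]}
search = GridSearchCV(RandomForestRegressor(random_state=0), grid,
                      cv=3, scoring="r2")
search.fit(X, y)
print(search.best_params_)
```

The same pattern applies to any of the models above by swapping the estimator and the parameter grid.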
For molecular datasets with a large number of descriptors, the Lasso model does not work well. For example, hyperparameter optimization gave:
alpha = 0.033
training r2 = 0.60, MAE = 0.66, test r2 = 0.61, MAE = 0.70
The L1 norm contributed almost nothing to the result.
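The effect of the L1 penalty can be inspected directly by counting how many coefficients Lasso drives to zero; this is a sketch on synthetic data, with the alpha taken from the optimization above.

```python
# Sketch: how the L1 penalty of Lasso zeroes out uninformative coefficients.
# Synthetic data; only the first feature carries signal.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 50))
y = X[:, 0] + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.033).fit(X, y)   # alpha from the optimization above
n_zero = int(np.sum(lasso.coef_ == 0))
print(f"{n_zero} of {X.shape[1]} coefficients set to zero")
```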
Descriptor selection
Stepwise regression: after each regression, the most dominant descriptor was removed and the regression was run again. Here are the descriptors dominating the IC50 activity.
BertzCT: A topological index meant to quantify “complexity” of molecules
Chi0, Chi1, Kappa1, Kappa2: topological indices carrying information on bond fragments, shape, etc. Refer to
http://www.edusoft-lc.com/molconn/manuals/400/chaptwo.html
HeavyAtomCount, HeavyAtomMolWt, ExactMolWt, MolWt: Molecular weight
Based on these selected features, the biological activity can mostly be explained by the structure (bonding, shape). Heavy atoms could also possibly affect the result.
Classification analysis for IC50 activity (binary label)
Model selection
kNN: training: 0.86, test: 0.77
SVM: training: 0.88, test: 0.83
DT: training: 0.99, test: 0.76
RF: training: 0.99, test: 0.83
LDA: training: 0.86, test: 0.79
Hyper parameter optimization for the best model: SVM
C: 3.02, training: 0.92, test: 0.83
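The optimization of C for the SVM classifier can be sketched as below; the data are a synthetic stand-in for the BACE binary labels, and the grid of C values is an assumption.

```python
# Sketch: tuning the SVM regularization parameter C by cross-validated
# grid search. Synthetic classification data stand in for the BACE labels.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Search C over a few decades around 1
search = GridSearchCV(SVC(), {"C": np.logspace(-1, 1, 7)}, cv=5)
search.fit(X_tr, y_tr)
print(search.best_params_, search.score(X_te, y_te))
```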
It is interesting to look into the 'true-false' (80 molecules) and 'false-true' (90 molecules) candidates. The former could be molecules whose structures are not yet understood by the function, potentially molecules which have not been investigated well. The latter could include molecules which contain the active structure as an ingredient but did not work for some reason. The former list often includes imidazole and pyrrole structures as well as cationic amines; the latter list has molecules with peptide bonds and cationic amines.
Response plot for the training data
Confusion matrix for the training data
Molecules of 'active but not predicted' by the function
Molecules of 'not active but predicted' by the function
Ti-O material dataset
The data for materials containing Ti and O were downloaded from the Materials Project. The data can be downloaded with the pymatgen module via the Materials Project REST API.
https://pymatgen.org/pymatgen.ext.matproj.html
https://materialsproject.org/
The energy, volume, formation energy per atom, density and total magnetization data were downloaded, and the energy was predicted using only descriptors derived from the chemical formulae.
Data number : 3830
Descriptor: Xenonpy, 290 descriptors
Ratio of training and test = 8:2
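The download step can be sketched with the legacy pymatgen MPRester.query interface; note that this is a hypothetical sketch, "YOUR_API_KEY" is a placeholder for a personal Materials Project API key, and the exact property names are assumptions based on the legacy REST API.

```python
# Hypothetical sketch of downloading Ti-O entries via the legacy
# pymatgen Materials Project REST API. "YOUR_API_KEY" is a placeholder.
def ti_o_query():
    # MongoDB-style criteria: every entry whose element list contains
    # both Ti and O, plus the properties analysed in the text.
    criteria = {"elements": {"$all": ["Ti", "O"]}}
    properties = ["pretty_formula", "energy", "volume",
                  "formation_energy_per_atom", "density",
                  "total_magnetization"]
    return criteria, properties

def fetch_ti_o(api_key="YOUR_API_KEY"):
    # Requires pymatgen and network access; imported lazily here.
    from pymatgen.ext.matproj import MPRester
    criteria, properties = ti_o_query()
    with MPRester(api_key) as mpr:
        return mpr.query(criteria=criteria, properties=properties)
```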
Rough estimation of models
Linear model: training r2 = 0.570, MAE = 95.5, test r2 = 0.153, MAE = 133
Lasso model: not applicable
Ridge model: training r2 = 0.547, MAE = 98, test r2 = 0.353, MAE = 127
K-nearest neighbor: training r2 = 0.68, MAE = 76.9, test r2 = 0.446, MAE = 111
Partial least square: training r2 = 0.40, MAE = 115, test r2 = 0.39, MAE = 130
Random forest: training r2 = 0.94, MAE = 33.7, test r2 = 0.56, MAE = 93
Hyperparameter optimization of ridge model, and random forest model
Ridge model
alpha : 46.6
training r2 = 0.44, MAE = 116, test r2 = 0.42, MAE = 118
Random forest model
n_estimators : 200, max_depth : 15
training r2 = 0.94, MAE = 33.7, test r2 = 0.56, MAE = 95
Unlike for molecules, there are few candidates for good descriptors. Here, one of the candidates, the Xenonpy descriptors, was used. However, the discrepancy between the actual and predicted values was large. The reason was examined by comparing the chemicals that agreed and disagreed with the actual values (50% difference). The elements in the 'well-predicted' and 'poorly-predicted' lists are shown by WordCloud.
Some clear contrasts were found in the wordclouds for the 'well-predicted' and 'poorly-predicted' elements. Phosphorus and manganese appear in the 'well-predicted' list, while sodium, lanthanum and silicon appear in the 'poorly-predicted' one. There is an unexplained but clear consistency in the elements, and the descriptors could be improved accordingly.
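The element counts behind such wordclouds can be reproduced by parsing element symbols out of the chemical formulas; the formula lists below are hypothetical examples, and the sketch counts symbol occurrences only (it ignores stoichiometric numbers).

```python
# Sketch: counting element symbols across lists of chemical formulas,
# as input for a wordcloud. Stoichiometric numbers are ignored.
import re
from collections import Counter

def element_counts(formulas):
    counts = Counter()
    for f in formulas:
        # An element symbol is one capital letter, optionally followed
        # by one lowercase letter (e.g. "Ti", "O", "La").
        counts.update(re.findall(r"[A-Z][a-z]?", f))
    return counts

# Hypothetical examples of well-/poorly-predicted formulas
well = ["TiMnO3", "TiP2O7"]
poor = ["NaTiO2", "La2Ti2O7"]
print(element_counts(well))
print(element_counts(poor))
```

The resulting Counter objects can be passed to a wordcloud library as frequency dictionaries.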
The response plot for the training data
List of elements which are well-predicted
List of elements which are poorly-predicted
This page introduces the application of machine learning to various chemical datasets. The contents are not aimed at professionals but are an introduction to how ML can be applied to extract information from databases. The analyses were mostly made with general machine learning (ML) techniques and programmed in Python (scikit-learn, Xenonpy, MoleculeNet).
HIV dataset
The HIV dataset was introduced by the Drug Therapeutics Program (DTP) AIDS Antiviral Screen, which tested the ability to inhibit HIV replication for over 40,000 compounds. Screening results were evaluated and placed into three categories: confirmed inactive (CI), confirmed active (CA) and confirmed moderately active (CM). The data were analyzed as a 1/0 activity by combining CA and CM as active reagents.
The dataset was downloaded from MoleculeNet in deepchem (HIV.csv)
Data number : 41127
Descriptor: RDKit, 208 descriptors
Ratio of training and test = 8:2
https://deepchem.readthedocs.io/en/latest/api_reference/moleculenet.html#hiv-datasets
As an initial check of the data, principal component analysis (PCA) was performed, and the first three principal components were plotted in the figure. It seems that distinguishing 1/0 would be possible.
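The PCA step can be sketched as below; synthetic data stand in for the RDKit descriptor matrix, and the three component scores are what would be plotted in 3D.

```python
# Sketch: extracting the first three principal-component scores
# for a 3D plot. Synthetic data stand in for the descriptor matrix.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 8))      # stand-in for the descriptor matrix

pca = PCA(n_components=3)
scores = pca.fit_transform(X)      # one (PC1, PC2, PC3) row per molecule
print(scores.shape, pca.explained_variance_ratio_)
```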
Rough selection of models
kNN, training accuracy: 0.977, test accuracy: 0.972
SVM, training accuracy: 0.975, test accuracy: 0.971
DT, training accuracy: 1.0, test accuracy: 0.945 (overfitting)
RF, training accuracy: 1.000, test accuracy: 0.972
LDA, training accuracy: 0.961, test accuracy: 0.961
Hyperparameter optimization for the best model: k nearest neighbor or Random forest
n_neighbors : 4, p: 1
training accuracy: 0.979, test accuracy: 0.976
The 'predicted active' molecules are listed, saved as a molecular list, and shown on the right. However, it is interesting to note that the lists of 'predicted inactive' but active molecules and 'predicted active' but inactive molecules provide more insight. The former corresponds to molecules whose information has not yet been included in the prediction function. The latter indicates molecules that have the structure/properties of 'active' molecules but did not work; similar molecules could have the potential to be active.
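These two lists are simply the false negatives and false positives of the classifier, which can be extracted as in this minimal sketch (the labels here are hypothetical examples):

```python
# Sketch: extracting 'predicted inactive but active' (false negatives)
# and 'predicted active but inactive' (false positives) indices.
import numpy as np

# Hypothetical true labels and predictions (1 = active, 0 = inactive)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 0, 0])

false_neg = np.where((y_true == 1) & (y_pred == 0))[0]  # active, missed
false_pos = np.where((y_true == 0) & (y_pred == 1))[0]  # inactive, flagged
print(false_neg, false_pos)
```

The resulting index arrays can be used to pull the corresponding SMILES strings out of the dataset for inspection.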
3D plot of scores of principal components
(Blue: active, Red: inactive)
List of 'predicted active' and active molecules
List of 'predicted inactive' but active molecules
ESOL dataset
The original source paper and the data are included as supporting information at the following site. The original paper describes the prediction of aqueous solubility (log S) using several descriptors. (2874 data, R2 = 0.69, MAE = 0.75)
https://pubs.acs.org/doi/10.1021/ci034243x
The dataset was downloaded from MoleculeNet in Deepchem (ESOL_delaney-processed.csv). The dataset has been reduced to 1128 data points.
https://deepchem.readthedocs.io/en/latest/api_reference/moleculenet.html#delaney-datasets
Descriptor: RDKit (208 descriptors)
Ratio of training and test = 8:2
Rough selection of models
Linear model: training r2 = 0.95, MAE = 0.35, test r2 = 0.89, MAE = 0.48
Lasso model (alpha = 0.01): training r2 = 0.91, MAE = 0.47, test r2 = 0.91, MAE = 0.50
Ridge model (alpha = 0.5): training r2 = 0.94, MAE = 0.37, test r2 = 0.92, MAE = 0.46
K-nearest neighbor (n = 5): training r2 = 0.89, MAE = 0.49, test r2 = 0.85, MAE = 0.60
Partial least square (n = 1): training r2 = 0.57, MAE = 1.02, test r2 = 0.59, MAE = 1.03
Random forest: training r2 = 0.98, MAE = 0.17, test r2 = 0.92, MAE = 0.44
Hyperparameter optimization for the best model: Random forest
n_estimators : 50, max_depth: 15
training r2 = 0.98, MAE = 0.18, test r2 = 0.92, MAE = 0.44
Descriptor selection
Stepwise regression: after each regression, the most dominant descriptor was removed and the regression was run again. Here are the descriptors dominating the solubility.
MolLogP: octanol/water partition coefficient
MolMR: molecular refractivity, a measure of the total polarizability
Chi0v: one of the connectivity indices
LabuteASA: Labute's approximate surface area of the molecule
HeavyAtomMolWt, MolWt, ExactMolWt, SlogP_VSA2
It is natural that solubility-related descriptors are selected. Descriptors for polarizability and molecular weight are frequently selected to describe the solubility.
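If RDKit is installed, the descriptors named above can be computed directly from a SMILES string; ethanol is used here only as a small illustrative molecule.

```python
# Sketch: computing a few of the selected descriptors with RDKit.
from rdkit import Chem
from rdkit.Chem import Descriptors

mol = Chem.MolFromSmiles("CCO")      # ethanol, as a small example
print(Descriptors.MolLogP(mol),      # octanol/water partition coefficient
      Descriptors.MolMR(mol),        # molecular refractivity
      Descriptors.MolWt(mol))        # molecular weight
```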
Response plot for the training/test data
using the best model