ML challenges

QM9 dataset

QM9 is a comprehensive dataset that provides geometric, energetic, electronic, and thermodynamic properties computed by quantum chemical calculations (DFT at the B3LYP/6-31G(2df,p) level). The data include HOMO, LUMO, enthalpy, Gibbs free energy, etc. for ~134k molecules.

Datasets from MoleculeNet in DeepChem or Quantum-Machine.org
https://deepchem.readthedocs.io/en/latest/api_reference/moleculenet.html
http://quantum-machine.org/datasets/

Regression analysis
Data number: 134k (a subset was used for the calculations; minimum 30,000)
Descriptor: RDKit, 290 descriptors
Ratio of training and test = 8:2

Target: HOMO, LUMO and Energy gap

Estimation of models
In all estimations, the Random Forest (RF) model gave the best results, so the following calculations were performed with the RF model.
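A minimal sketch of this train/evaluate step with scikit-learn, where synthetic data stands in for the RDKit descriptor matrix (dataset, feature count, and split values here are illustrative assumptions, not the actual QM9 run):

```python
# Sketch: fit a Random Forest regressor on an 80/20 split and report
# r2 and MAE, as done for the HOMO/LUMO/gap targets.
# Synthetic data stands in for the RDKit descriptor matrix.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=50, noise=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_tr, y_tr)

for name, Xs, ys in [("training", X_tr, y_tr), ("test", X_te, y_te)]:
    pred = model.predict(Xs)
    print(f"{name}: r2 = {r2_score(ys, pred):.2f}, MAE = {mean_absolute_error(ys, pred):.2f}")
```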

HOMO
training r2 = 0.99, MAE = 0.0018, test r2 = 0.90, MAE = 0.00047

LUMO
training r2 = 0.99, MAE = 0.0019, test r2 = 0.97, MAE = 0.0052

Energy gap
training r2 = 0.99, MAE = 0.0024, test r2 = 0.96, MAE = 0.0063

The descriptors include sufficient information to describe these energy properties.

Descriptor selection

The determinant descriptors were selected for the HOMO energy.
In the first round of selection, PEOE_VSA1, one of the molecular surface area descriptors, was selected. After selecting 10 descriptors:
training: r2 = 0.97, MAE = 0.0033, test: r2 = 0.85, MAE = 0.0076
In the second round of selection, TPSA and NHOHCount were selected; TPSA is also a molecular surface area descriptor. In the following rounds, no strong descriptors were found, but further PEOE_VSA descriptors, fr_nitrile, and fr_aniline were selected.

Overall, the energy-related properties can be explained largely by the surface-area descriptors, which relate to the volume of the electron cloud. Many descriptors related to nitrogen-containing functional groups were also selected.
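The first-round ranking can be sketched with Random Forest feature importances; the descriptor names below are hypothetical placeholders, not the actual RDKit set:

```python
# Sketch: rank descriptors by Random Forest feature importance and keep
# the top 10, mimicking the first selection round.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=30, n_informative=5, random_state=1)
names = [f"desc_{i}" for i in range(X.shape[1])]  # placeholder descriptor names

rf = RandomForestRegressor(n_estimators=100, random_state=1).fit(X, y)
order = np.argsort(rf.feature_importances_)[::-1]  # most important first
top10 = [names[i] for i in order[:10]]
print(top10)
```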

Response plot for the test data

BACE dataset

The BACE dataset provides quantitative IC50 and qualitative (binary label) binding results for a set of inhibitors of human beta-secretase 1 (BACE-1), and was downloaded from MoleculeNet. IC50 indicates the concentration of a compound required to inhibit a biological function by half.

Datasets from MoleculeNet in deepchem
https://deepchem.readthedocs.io/en/latest/api_reference/moleculenet.html#bace-dataset

Regression analysis
Data number: 1513
Descriptor: RDKit, 290 descriptors
Ratio of training and test = 8:2

Selection of descriptors
Since the score was comparatively low for this molecular dataset, several other descriptor sets were tested.

MACCS keys
Linear model: training r2 = 0.61, MAE = 0.64, test r2 = 0.56, MAE = 0.72
Ridge model: training r2 = 0.61, MAE = 0.64, test r2 = 0.56, MAE = 0.72

Circular
Linear model: training r2 = 0.97, MAE = 0.12, test r2 = 0.54, MAE = 0.66
Ridge model: training r2 = 0.97, MAE = 0.08, test r2 = 0.12, MAE = 0.97

RDKit
Linear model: training r2 = 0.71, MAE = 0.54, test r2 = 0.66, MAE = 0.66
Ridge model: training r2 = 0.70, MAE = 0.55, test r2 = 0.64, MAE = 0.67

As a result, the RDKit descriptors still showed the best performance and were used here.

Rough estimation of models

Linear model: training r2 = 0.71, MAE = 0.54, test r2 = 0.66, MAE = 0.66
Ridge model: training r2 = 0.70, MAE = 0.55, test r2 = 0.64, MAE = 0.67
K-nearest neighbor: training r2 = 0.77, MAE = 0.46, test r2 = 0.62, MAE = 0.66
Partial least square: training r2 = 0.47, MAE = 0.78, test r2 = 0.52, MAE = 0.78
Random forest: training r2 = 0.94, MAE = 0.21, test r2 = 0.72, MAE = 0.56

Hyperparameter optimization of the best model: Random forest model

n_estimators: 90, max_depth: 15
training r2 = 0.94, MAE = 0.23, test r2 = 0.73, MAE = 0.56
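The hyperparameter optimization step can be sketched with a cross-validated grid search; the parameter values below are chosen for illustration (the source does not state the search grid):

```python
# Sketch: grid search over n_estimators and max_depth for the
# Random Forest model, scored by cross-validated r2.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=20, noise=1.0, random_state=0)

grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [50, 90], "max_depth": [5, 15]},
    cv=3,
    scoring="r2",
)
grid.fit(X, y)
print(grid.best_params_)
```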

For molecular datasets with a large number of descriptors, the Lasso model does not work well. For example, hyperparameter optimization gave
alpha = 0.033
training r2 = 0.60, MAE = 0.66, test r2 = 0.61, MAE = 0.70
The L1-norm penalty contributed almost nothing to the result.
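A minimal check of this behavior: at such a small alpha, Lasso drives few coefficients to zero, so the L1 penalty barely changes the fit (synthetic standardized data; the alpha value is taken from the text, everything else is an assumption):

```python
# Sketch: fit Lasso at the small optimized alpha and count how many
# coefficients are actually driven to zero.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=50, n_informative=40,
                       noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=0.033, max_iter=10000).fit(X, y)
n_zero = int(np.sum(lasso.coef_ == 0.0))
print(f"{n_zero} of {lasso.coef_.size} coefficients are zero")
```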

Descriptor selection
Stepwise regression: after each regression, the most dominant descriptor was removed and the regression repeated. Here are the descriptors dominating the binding affinity.
BertzCT: a topological index meant to quantify the "complexity" of molecules
Chi0, Chi1, Kappa1, Kappa2: topological indices carrying information on bond structure, molecular shape, etc. Refer to
http://www.edusoft-lc.com/molconn/manuals/400/chaptwo.html
HeavyAtomCount, HeavyAtomMolWt, ExactMolWt, MolWt: molecular weight descriptors

Based on these selected features, the biological activity can be explained mostly by the molecular structure (bonding, shape). Heavy atoms may also affect the result.
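The stepwise removal procedure can be sketched with Random Forest feature importances (synthetic data and placeholder names stand in for the RDKit descriptors):

```python
# Sketch of stepwise regression: fit, record the most important
# descriptor, remove it, and repeat.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=20, n_informative=5, random_state=0)
names = [f"desc_{i}" for i in range(X.shape[1])]  # placeholder descriptor names

dominant = []
for _ in range(5):
    rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
    top = int(np.argmax(rf.feature_importances_))
    dominant.append(names[top])
    X = np.delete(X, top, axis=1)  # remove the dominant descriptor
    names.pop(top)
print(dominant)
```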

Classification analysis for IC50 activity (binary label)

Model selection
kNN: training: 0.86, test: 0.77
SVM: training: 0.88, test: 0.83
DT: training: 0.99, test: 0.76
RF: training: 0.99, test: 0.83
LDA: training: 0.86, test: 0.79

Hyperparameter optimization for the best model: SVM
C: 3.02, training: 0.92, test: 0.83

It is interesting to look into the 'true but predicted false' (80 molecules) and 'false but predicted true' (90 molecules) candidates. The former could be molecules whose structures are not yet captured by the model, potentially molecules that have not been investigated well. The latter could include molecules that contain an active substructure as an ingredient but did not work for some reason. The former list often includes imidazole and pyrrole structures as well as cationic amines; the latter contains molecules with peptide bonds and cationic amines.
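Extracting these two candidate lists amounts to collecting the false negatives and false positives of the classifier; a minimal sketch on synthetic stand-in data (the C value follows the text, everything else is an assumption):

```python
# Sketch: after classification, collect the 'active but predicted
# inactive' (false negatives) and 'inactive but predicted active'
# (false positives) samples for inspection.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = SVC(C=3.02).fit(X_tr, y_tr)
pred = clf.predict(X_te)

false_neg = np.where((y_te == 1) & (pred == 0))[0]  # active, but missed
false_pos = np.where((y_te == 0) & (pred == 1))[0]  # inactive, but flagged
print(len(false_neg), len(false_pos))
```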

Response plot for the training data

Confusion matrix for the training data

Molecules of 'active but not predicted' by the function

Molecules of 'not active but predicted' by the function

Ti-O material dataset

The data for materials containing Ti and O were downloaded from the Materials Project. The data can be downloaded with the pymatgen module via the Materials Project REST API.
https://pymatgen.org/pymatgen.ext.matproj.html
https://materialsproject.org/
The energy, volume, formation energy per atom, density, and total magnetization data were downloaded, and the energy was predicted using only descriptors derived from the chemical formulae.
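The download step can be sketched with pymatgen's legacy REST client; this requires a Materials Project API key, so the network call below only runs when one is provided (the property names mirror the text, but the exact query is an assumption):

```python
# Sketch: download Ti-O entries via the (legacy) pymatgen Materials
# Project REST interface. The query only runs if MP_API_KEY is set.
import os

criteria = {"elements": {"$all": ["Ti", "O"]}}
properties = ["energy", "volume", "formation_energy_per_atom",
              "density", "total_magnetization"]

api_key = os.environ.get("MP_API_KEY")
if api_key:
    from pymatgen.ext.matproj import MPRester  # optional dependency
    with MPRester(api_key) as mpr:
        data = mpr.query(criteria=criteria, properties=properties)
    print(len(data), "entries downloaded")
```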

Data number: 3830
Descriptor: XenonPy, 290 descriptors
Ratio of training and test = 8:2

Rough estimation of models
Linear model: training r2 = 0.570, MAE = 95.5, test r2 = 0.153, MAE = 133
Lasso model: not applicable
Ridge model: training r2 = 0.547, MAE = 98, test r2 = 0.353, MAE = 127
K-nearest neighbor: training r2 = 0.68, MAE = 76.9, test r2 = 0.446, MAE = 111
Partial least square: training r2 = 0.40, MAE = 115, test r2 = 0.39, MAE = 130
Random forest: training r2 = 0.94, MAE = 33.7, test r2 = 0.56, MAE = 93

Hyperparameter optimization of the ridge and random forest models
Ridge model
alpha: 46.6
training r2 = 0.44, MAE = 116, test r2 = 0.42, MAE = 118

Random forest model
n_estimators: 200, max_depth: 15
training r2 = 0.94, MAE = 33.7, test r2 = 0.56, MAE = 95

Unlike molecules, there are few candidate descriptors for materials. Here, one of the candidates, the XenonPy descriptor, was used. However, the discrepancy between the actual and predicted values was large. The reason was examined by comparing the chemicals whose predictions agreed and disagreed with the actual values (by more than 50%). The elements appearing in the 'well-predicted' and 'poorly predicted' lists are shown with WordCloud.

Some clear contrasts were found in the word clouds for 'well-predicted' and 'poorly predicted' elements. Phosphorus and manganese appear in the 'well-predicted' list, while sodium, lanthanum, and silicon appear in the 'poorly predicted' one. There is a clear, though unexplained, consistency among the elements, and the descriptors could be improved accordingly.

The response plot for the training data

List of elements which are well-predicted

List of elements which are poorly predicted

This page introduces the application of machine learning to various chemical datasets. The content is not aimed at professionals but serves as an introduction to how ML can be applied to extract information from databases. The analyses were mostly performed with general machine learning (ML) techniques, programmed in Python (scikit-learn, XenonPy, MoleculeNet).

HIV dataset

The HIV dataset was introduced by the Drug Therapeutics Program (DTP) AIDS Antiviral Screen, which tested the ability to inhibit HIV replication for over 40,000 compounds. Screening results were evaluated and placed into three categories: confirmed inactive (CI), confirmed active (CA), and confirmed moderately active (CM). The data were analyzed as a 1/0 activity label by combining CA and CM as active reagents.

The dataset was downloaded from MoleculeNet in deepchem (HIV.csv)
Data number: 41127
Descriptor: RDKit, 208 descriptors
Ratio of training and test  =  8:2
https://deepchem.readthedocs.io/en/latest/api_reference/moleculenet.html#hiv-datasets

As an initial check of the data, principal component analysis (PCA) was performed, and the first three principal components are plotted in the figure. It seems that distinguishing the 1/0 classes would be possible.
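This initial PCA check can be sketched as follows; synthetic data stands in for the RDKit descriptor matrix, and in practice the three score columns would be passed to a 3D scatter plot colored by the 1/0 label:

```python
# Sketch: reduce the descriptor matrix to three principal components
# for an initial look at class separability.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=30, random_state=0)
scores = PCA(n_components=3).fit_transform(StandardScaler().fit_transform(X))
print(scores.shape)  # one 3-component score vector per molecule
```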

Rough selection of models
kNN: training accuracy: 0.977, test accuracy: 0.972
SVM: training accuracy: 0.975, test accuracy: 0.971
DT: training accuracy: 1.000, test accuracy: 0.945 (overfitting)
RF: training accuracy: 1.000, test accuracy: 0.972
LDA: training accuracy: 0.961, test accuracy: 0.961

Hyperparameter optimization for the best model: k-nearest neighbor or random forest
n_neighbors: 4, p: 1

training accuracy: 0.979, test accuracy: 0.976
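The optimized kNN settings from the text (n_neighbors=4, Manhattan distance p=1) can be sketched on synthetic stand-in data:

```python
# Sketch: k-nearest-neighbor classifier with the optimized settings,
# reporting training and test accuracy on an 80/20 split.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

knn = KNeighborsClassifier(n_neighbors=4, p=1).fit(X_tr, y_tr)
print(f"training accuracy: {knn.score(X_tr, y_tr):.3f}, "
      f"test accuracy: {knn.score(X_te, y_te):.3f}")
```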

The 'predicted active' molecules are listed, saved as a molecular list, and shown on the right. However, the list of 'predicted inactive' but active molecules, and that of 'predicted active' but inactive molecules, provide more insight. The former (on the right) corresponds to molecules whose information is not yet captured by the prediction function. The latter indicates molecules that have the structure/properties of 'active' molecules but did not work; similar molecules could have the potential to be active.

3D plot of scores of principal components
(Blue: active, Red: inactive)

List of 'predicted active' and active molecules

List of 'predicted inactive' but active molecules

ESOL dataset

The original source paper and the data are included as supporting information at the following site. The original paper describes the prediction of aqueous solubility (log S) using several descriptors (2874 data points, R2 = 0.69, MAE = 0.75).

https://pubs.acs.org/doi/10.1021/ci034243x


The dataset was downloaded from MoleculeNet in DeepChem (ESOL_delaney-processed.csv). The dataset has been reduced to 1128 entries.

https://deepchem.readthedocs.io/en/latest/api_reference/moleculenet.html#delaney-datasets

Descriptor: RDKit (208 descriptors)

Ratio of training and test = 8:2


Rough selection of models

Linear model: training r2 = 0.95, MAE = 0.35, test r2 = 0.89, MAE = 0.48

Lasso model (alpha = 0.01): training r2 = 0.91, MAE = 0.47, test r2 = 0.91, MAE = 0.50

Ridge model (alpha = 0.5): training r2 = 0.94, MAE = 0.37, test r2 = 0.92, MAE = 0.46

K-nearest neighbor (n = 5): training r2 = 0.89, MAE = 0.49, test r2 = 0.85, MAE = 0.60

Partial least square (n = 1): training r2 = 0.57, MAE = 1.02, test r2 = 0.59, MAE = 1.03

Random forest: training r2 = 0.98, MAE = 0.17, test r2 = 0.92, MAE = 0.44

Hyperparameter optimization for the best model: Random forest
n_estimators: 50, max_depth: 15
training: r2 = 0.98, MAE = 0.18, test: r2 = 0.92, MAE = 0.44

Descriptor selection
Stepwise regression: after each regression, the most dominant descriptor was removed and the regression repeated. Here are the descriptors dominating the solubility.
MolLogP: octanol/water partition coefficient
MolMR: molecular refractivity, a measure of the total polarizability
Chi0v: one of the connectivity indices
LabuteASA: Labute's approximate surface area of the molecule
HeavyAtomMolWt, MolWt, ExactMolWt, SlogP_VSA2

It is natural that solubility-related descriptors are selected. Descriptors for polarizability and molecular weight are frequently selected to describe the solubility.

Response plot for the training/test data
using the best model