Platinum-based drugs such as cisplatin and carboplatin are among the most widely used cancer chemotherapeutics. However, their clinical use is limited by significant side effects — including nephrotoxicity, neurotoxicity, and drug resistance — because the active Pt(II) species attacks not only tumour DNA but also healthy tissue indiscriminately.
Pt(IV) complexes offer a promising solution. They are six-coordinate, kinetically inert prodrugs that remain inactive until they reach the tumour environment, where they are reduced by cellular reducing agents (such as ascorbate and glutathione) to release the active four-coordinate, square-planar Pt(II) species:
Pt(IV)(axial₁)(axial₂)(ligands) + 2e⁻ → Pt(II)(ligands) + axial₁⁻ + axial₂⁻
The reduction potential (E, in V vs NHE) measures how readily this reduction occurs. It is the key design parameter for Pt(IV) prodrugs:
Higher E (less negative, toward +1 V) → more easily reduced → the drug activates quickly, but may also activate prematurely before reaching the tumour.
Lower E (more negative, toward −1 V) → harder to reduce → more stable, more selective for the reducing environment of the tumour.
The tumour microenvironment has a significantly higher concentration of reducing agents than normal tissue, so a moderately negative E (roughly −0.4 to 0 V) is generally desired: stable enough to survive in circulation, reactive enough to activate selectively in tumour tissue. This project applies machine learning (ML) to predict E from molecular structure, enabling rapid computational screening of new Pt(IV) candidates, and reproduces the work of Vigna et al. (2024), J. Chem. Inf. Model.
DataSet
- 142 Pt(IV) complexes with experimentally measured reduction potentials
Method
1. SMILES → RDKit mol objects (dative bonds applied for metal coordination)
2. ECFP fingerprints generated; AlvaDesc and LUMO descriptors loaded
3. One-hot encode categorical descriptors; MinMax scale LUMO
4. Remove highly correlated features (Pearson r > 0.9): 5514 → ~3290 features
5. Dataset A benchmark: evaluate 5 tree-based ML models (Random Forest, Gradient Boosting, XGBoost, ExtraTree, DecisionTree) using Leave-One-Out Cross-Validation (LOOCV)
6. Feature selection: each model selects its most important features using SelectFromModel
7. Recursive Feature Elimination (RFE) on the best model (ExtraTreeRegressor, ETR) to identify the top-20 most informative features (see the sketch below)
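A condensed sketch of steps 4–7 in scikit-learn, assuming the merged descriptor table `X` (a pandas DataFrame) and the measured potentials `y` are already loaded; hyperparameters and names are illustrative, not the exact settings of the original workflow:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Step 4: drop one feature from every pair with |Pearson r| > 0.9
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
X = X.drop(columns=[c for c in upper.columns if (upper[c] > 0.9).any()])

# Step 5: LOOCV benchmark (shown for one of the five tree models)
etr = ExtraTreesRegressor(n_estimators=500, random_state=0)
pred = cross_val_predict(etr, X, y, cv=LeaveOneOut())
print(f"R2 = {r2_score(y, pred):.3f}, RMSE = {np.sqrt(mean_squared_error(y, pred)):.3f} V")

# Step 6: keep features above median importance
sfm = SelectFromModel(etr, threshold="median").fit(X, y)
X_sel = X.loc[:, sfm.get_support()]

# Step 7: RFE down to the 20 most informative features
rfe = RFE(ExtraTreesRegressor(n_estimators=500, random_state=0),
          n_features_to_select=20).fit(X_sel, y)
print(list(X_sel.columns[rfe.support_]))
```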
RESULTS
After hyperparameter optimisation (max_depth=19, min_samples_leaf=1, min_samples_split=4) and outlier removal (142 → 134 compounds), the model reaches R² = 0.918 with RMSE = 0.126 V under LOOCV, a strong result that matches the accuracy reported in the original paper.
The two most important descriptors reveal clear structural rules:
### 1. F07(C-Cl) — A Carbon Exactly 7 Bonds from an Axial Chlorido Ligand [most important]
It is triggered by two common axial ligand patterns:
Pattern A — Axial benzoate ligand:
```
Cl – Pt – O – C(=O) – c(ipso) – c(ortho) – c(meta) – c(para)
   1    2    3      4         5          6         7
```
The para carbon of the phenyl ring is exactly 7 bonds from Cl.
Pattern B — Axial alkyl carboxylate (≥4 carbons):
```
Cl – Pt – O – C(=O) – C1 – C2 – C3 – C4
   1    2    3      4    5    6    7
```
The 4th carbon of the chain is exactly 7 bonds from Cl.
Both patterns identify compounds that have both an axial Cl ligand and a substantial organic axial carboxylate ligand. The boxplot (Figure 3, left) shows that when this descriptor is present, E tends to be concentrated around −0.7 V — indicating that this structural motif (mixed Cl + carboxylate axial ligands) consistently produces moderately stable prodrugs.
### 2. B03(O-O) — Oxalate Bidentate Chelating Ligand [second most important]
This descriptor is True when two oxygen atoms are exactly 3 bonds apart in the molecular graph, i.e. the pattern **O=C–C=O**. This is the hallmark of the **oxalate bidentate ligand**, which coordinates to Pt through one oxygen from each carboxylate end, forming a 5-membered chelate ring:
```
O O
‖ ‖
Pt–O–C – C–O–(Pt) ← 5-membered chelate ring
```
Compounds with oxalate axial ligands (Figure 3, right) have more negative E values (harder to reduce), meaning oxalate stabilises the Pt(IV) state strongly. This makes oxalate-containing complexes more selective prodrugs with controlled activation.
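The motif can be checked directly with an RDKit SMARTS query; a minimal sketch (the SMARTS and the oxalic-acid SMILES are illustrative stand-ins for the ligand fragment):

```python
from rdkit import Chem

# B03(O-O): two oxygens exactly 3 bonds apart, i.e. the O=C-C=O motif
motif = Chem.MolFromSmarts("O=C-C=O")

mol = Chem.MolFromSmiles("OC(=O)C(=O)O")  # oxalic acid as a stand-in
print(mol.HasSubstructMatch(motif))        # True -> descriptor fires
```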
Representative Pt(IV) complexes across the range of reduction potentials are shown.
| Complex | Axial ligands | E (V vs NHE) | Interpretation |
| --- | --- | --- | --- |
| [Pt(NH₃)₂Cl₄] | All-Cl axial ligands | +0.596 | Easiest to reduce; very reactive |
| [Pt(NH₃)₂(OAc)₂Cl₂] | Acetate + Cl axial | +0.511 | Easily reduced |
| [Pt(dach)(propanoate)₂Cl₂] | Propanoate + Cl axial | +0.445 | High E; fast activation |
| [Pt(en)(OH)₂(oxalate)] | Oxalate chelate, no Cl | −0.067 | Moderate; selectively activated |
| [Pt(dach)(OH)₂(oxalate)] | Oxalate + DACH | −0.810 | Hard to reduce; very stable |
| [Pt(NH₃)₂(acetate)(hexanoyl-amino)(squarate)] | Squarate chelate | −1.001 | Hardest to reduce in dataset |
Distribution of reduction potentials across the 142 compounds.
Comparison of all five models
Feature importance from the optimised ExtraTreeRegressor.
Representative Pt(IV) complexes across the range of reduction potentials. Top row: easily reduced (high E), bottom row: hard to reduce (low E).
Mycobacterium tuberculosis (Mtb) is the causative agent of tuberculosis (TB), which remains one of the leading infectious disease killers worldwide. Drug discovery for TB is slow and expensive — understanding how a compound kills Mtb (its mechanism of action, MoA) is critical for prioritising candidates and avoiding resistance.
This project reproduces and extends the work of Liu et al. (2022), who showed that a deep learning model trained on large-scale chemical-genetic interaction profiles (CGIP) can predict the MoA of new compounds purely from their molecular structure.
> Liu et al. (2022) *"Deep learning-driven prediction of drug mechanism of action from large-scale chemical-genetic interaction profiles"*, Journal of Cheminformatics.
DataSet
The Chemical-Genetic Interaction Profile (CGIP) dataset (Johnson et al., 2019):
| Property | Value |
| --- | --- |
| Source | https://www.chemicalgenomicsoftb.com |
| Compounds | 47,217 |
| Mtb hypomorph strains | 152 genes |
| Measurement | Wald test Z-score (growth inhibition) |
| Activity threshold | Median Z-score ≤ −4 → active |
Method
The 152 individual genes were first reduced to 13 biologically meaningful clusters using Gene Ontology (GO) semantic similarity rather than data-driven clustering. Data-driven clustering was not used because Z-score profiles are broadly positively correlated across all genes (general toxicity dominates the signal), making functional groupings invisible from the data alone.
The prediction task is therefore multi-label binary classification: given a SMILES string, predict a 13-bit vector indicating which functional gene clusters the compound inhibits.
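A sketch of how the 13-bit labels could be assembled, assuming a Z-score table `z` (compounds × 152 genes, pandas DataFrame) and a hypothetical `gene2cluster` dict mapping each gene to one of the 13 GO clusters:

```python
import pandas as pd

# z: DataFrame of Wald Z-scores (rows = compounds, columns = genes)
# gene2cluster: {"geneA": 1, "geneB": 7, ...}  -- hypothetical mapping
labels = {}
for c in range(1, 14):
    genes = [g for g, cl in gene2cluster.items() if cl == c]
    # active on cluster c if the median Z over the cluster's genes is <= -4
    labels[f"G{c}"] = (z[genes].median(axis=1) <= -4).astype(int)

labels = pd.DataFrame(labels, index=z.index)  # 13-bit target per compound
```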
MACHINE LEARNING WORKFLOW
The model is a **Directed Message Passing Neural Network (D-MPNN)**, implemented via the Chemprop library.
```
SMILES string
   ↓
Molecular graph (atoms = nodes, bonds = edges)
   ↓  D-MPNN message passing
Molecule-level embedding + 200 RDKit 2D normalised descriptors
   ↓
Feed-forward network → 13 output probabilities (G1–G13)
```
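Training could then be driven through Chemprop's Python API; a sketch assuming Chemprop v1.x and a hypothetical `cgip_labels.csv` containing a SMILES column plus the 13 binary label columns:

```python
from chemprop.args import TrainArgs
from chemprop.train import cross_validate, run_training

args = TrainArgs().parse_args([
    "--data_path", "cgip_labels.csv",               # SMILES + G1..G13 labels
    "--dataset_type", "classification",             # multi-label binary task
    "--features_generator", "rdkit_2d_normalized",  # the 200 extra descriptors
    "--no_features_scaling",                        # already normalised
    "--save_dir", "cgip_model",
])
mean_score, std_score = cross_validate(args=args, train_func=run_training)
```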
RESULTS
Training results
| Metric | Value |
| --- | --- |
| Mean AUROC | 0.737 ± 0.003 |
| Mean AUPRC | 0.148 (random baseline ~0.030) |
| Mean F1 (Youden threshold) | 0.155 |
| Optimal threshold range | 0.16–0.38 (cluster-dependent) |
The second figure compares the binary ground truth (left panel; green = active, white = inactive) with the predicted probability for each cluster (right panel). The model generally assigns higher scores to clusters where the ground truth is 1, and lower scores where it is 0 — confirming the predictions are informative rather than random.
We applied the model to 19 real drugs not present in the training data, spanning known Mtb drugs and common non-Mtb drugs used as negative controls. Fluoroquinolones (Ciprofloxacin, Moxifloxacin) received the highest scores from both models (0.88–0.97), independently and consistently. This is biologically correct — fluoroquinolones inhibit DNA gyrase (GyrA/GyrB), a validated Mtb drug target.
ROC Curves per Gene Cluster
Heatmap: Ground Truth vs Predicted Scores
Molecular Structures of Selected Test Drugs
Predicted Scores for Real Drugs (Both Models)
The electronic band gap is one of the most fundamental properties of a material,
determining whether it behaves as an insulator, semiconductor, or conductor. A
band gap of zero characterises metals; semiconductors typically show gaps of
0–3 eV; wide-gap materials (> 3 eV) are used in optoelectronics and power
devices. Accurate and fast prediction of the band gap enables large-scale
computational screening of new functional materials without the cost of
full density-functional theory (DFT) calculations.
In this study we reproduced and extended the gradient-boosted machine learning
workflow described by Jung, Jung & Cole (J. Chem. Inf. Model. 2024, 64, 1187)
to predict the band gap (in eV) of inorganic materials directly from their
chemical formula — no crystal structure information is required.
TARGET VARIABLE
Property : Band gap (eV)
Source : Materials Project DFT database
Dataset : 3,000 non-metallic inorganic compounds
INPUT DESCRIPTORS (FEATURES)
The descriptor groups used are:
• Stoichiometry — number of constituent elements; L2, L3, L5, L7,
L10 norms of the elemental fractions.
• Magpie statistics — mean, minimum, maximum, range, and standard
deviation of 22 element-level properties across all
atoms in the formula (e.g. atomic number, atomic
mass, electronegativity, valence electron count,
melting point, molar volume).
• PymatgenData stats — same statistical aggregation applied to properties
from the pymatgen elemental database (Mendeleev
number, atomic radius, electronegativity X, etc.).
• DemlData stats — formation energy, electronegativity, and ionisation
energy statistics from the Deml dataset.
• MEGNet element embeddings — 16-dimensional learned element embeddings
                       from the MEGNet graph-neural-network model,
                       aggregated (mean, minimum, maximum) over the
                       composition.
• Meredig descriptors — 15 composition features including fraction of
s/p/d/f valence electrons and mean atomic radius.
• BandCenter — estimated band centre from Mulliken
electronegativity (geometric mean).
• IonProperty / OxidationStates — ionic character and weighted statistics of
formal oxidation states.
MACHINE LEARNING WORKFLOW
The pipeline follows four sequential stages:
1. Feature analysis (ANOVA F-test + Mutual Information)
Each base descriptor is ranked by its linear (ANOVA F-statistic) and
non-linear (mutual information) correlation with the band gap. The top-50
ANOVA descriptors are selected as seeds for feature engineering.
2. Feature engineering
Ratio features (A/B) and binary marker features are constructed from the
seed descriptors, capturing interaction effects and presence/absence
information that single descriptors cannot express.
3. Gradient-Boosted Feature Selection (GBFS)
A LightGBM regressor is trained using a quick grid search over 27
hyperparameter combinations (n_estimators × learning_rate × num_leaves).
The best model is then used to rank all 2,500 features by gain importance.
A greedy recursive forward-selection adds features one at a time in order
of importance, stopping when the validation R² fails to improve for 30
consecutive steps. This yields a compact, high-performing feature subset.
4. Model type : LightGBM (Gradient Boosted Decision Trees)
Objective : Regression (minimise mean squared error)
Importance : Gain (total reduction in loss across all tree splits)
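A condensed sketch of the GBFS stage (grid search omitted), assuming LightGBM's scikit-learn API and pre-split train/validation frames `X_train`/`X_val`, `y_train`/`y_val`; the hyperparameters are placeholders:

```python
import lightgbm as lgb
from sklearn.metrics import r2_score

base = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05, num_leaves=31)
base.fit(X_train, y_train)

# rank every feature by total gain across all tree splits
gain = base.booster_.feature_importance(importance_type="gain")
ranked = [X_train.columns[i] for i in gain.argsort()[::-1]]

selected, best, stall = [], -float("inf"), 0
for feat in ranked:                    # greedy forward selection
    trial = selected + [feat]
    m = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05, num_leaves=31)
    m.fit(X_train[trial], y_train)
    score = r2_score(y_val, m.predict(X_val[trial]))
    if score > best:
        best, selected, stall = score, trial, 0
    else:
        stall += 1
        if stall >= 30:                # stop after 30 non-improving steps
            break
```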
RESULTS
The final model was evaluated on the held-out 20 % test set (600 compounds
not seen during any stage of training or feature selection):
MAE = 0.33 eV
RMSE = 0.51 eV
R² = 0.92
The response plot (predicted vs. actual band gap) is shown above.
The model explains 92 % of the variance in band gap across a wide range of
inorganic materials using only 31 composition-derived features.
DOMINANT DESCRIPTORS
The table below lists the top 10 features ranked by LightGBM gain importance
(% of total gain). All top features are ratio descriptors engineered from the
base featurizer output.
Rank Feature Gain %
---- ------------------------------------------------------------ ------
1 MEGNetElementData mean embedding 15 / mean Number 45.3 %
2 mean Number / MagpieData avg_dev Electronegativity 7.7 %
3 MEGNetElementData minimum embedding 6 / frac p valence e– 6.4 %
4 frac d valence electrons / frac p valence electrons 4.5 %
5 avg d valence electrons / MEGNetElementData mean embedding 15 3.1 %
6 PymatgenData std_dev mendeleev_no / Pymatgen max mendeleev_no 3.0 %
7 avg d valence electrons / PymatgenData std_dev X 2.9 %
8 PymatgenData maximum X / PymatgenData std_dev X 2.8 %
9 MEGNetElementData mean embedding 15 / MagpieData mean Row 2.0 %
10 MagpieData mean Row / MagpieData mode Electronegativity 1.9 %
Response plot for the test data
QM9 is a comprehensive dataset that provides geometric, energetic, electronic, and thermodynamic properties computed by quantum chemical calculations (DFT, B3LYP/6-31G(2df,p)). The data include HOMO, LUMO, enthalpy, Gibbs energy, etc. for 134K molecules.
Datasets from MoleculeNet in deepchem or Quantum-Machine.org
https://deepchem.readthedocs.io/en/latest/api_reference/moleculenet.html#bace-dataset
http://quantum-machine.org/datasets/
Regression analysis
Data number : 134 K (a subset was used for the calculations; at minimum 30,000 entries)
Descriptor: RDKit, 290 descriptors
Ratio of training and test = 8:2
Target: HOMO, LUMO and Energy gap
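A sketch of the descriptor generation and 8:2 split, assuming lists `smiles` and `y` (e.g. HOMO values) have already been extracted from QM9:

```python
from rdkit import Chem
from rdkit.Chem import Descriptors
from rdkit.ML.Descriptors.MoleculeDescriptors import MolecularDescriptorCalculator
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

names = [n for n, _ in Descriptors._descList]        # full RDKit descriptor set
calc = MolecularDescriptorCalculator(names)
X = [calc.CalcDescriptors(Chem.MolFromSmiles(s)) for s in smiles]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("test r2 =", rf.score(X_te, y_te))
```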
Estimation of models
In all estimations, the Random Forest (RF) model provided the best results, and the following calculations were performed using the RF model.
HOMO
training r2 = 0.99, MAE = 0.0018, test r2 = 0.90, MAE = 0.00047
LUMO
training r2 = 0.99, MAE = 0.0019, test r2 = 0.97, MAE = 0.0052
Energy gap
training r2 = 0.99, MAE = 0.0024, test r2 = 0.96, MAE = 0.0063
The descriptors include sufficient information to describe the energy properties.
Descriptor selection
The most determinant descriptors were selected for the HOMO energy.
In the first round of selection, PEOE_VSA1 was selected, which is one of the molecular surface area descriptors. After selecting 10 descriptors,
training: r2 = 0.97, MAE = 0.0033, test: r2 = 0.85, MAE = 0.0076
In the second round of selection, TPSA and NHOHCount were selected. TPSA is also a molecular surface area descriptor. In the following rounds, no single strong descriptor was found, but descriptors from the PEOE_VSA family, fr_nitrile, and fr_aniline were selected.
Mostly, the energy-related properties can be explained by the surface area descriptors, which relate to the volume of the electron cloud. Many descriptors related to nitrogen-containing functional groups were also selected.
Response plot for the test data
The BACE dataset provides quantitative IC50 and qualitative (binary label) binding results for a set of inhibitors of human beta-secretase 1 (BACE-1); it was downloaded from MoleculeNet. IC50 indicates how much of a compound is necessary to inhibit a biological function.
Datasets from MoleculeNet in deepchem
https://deepchem.readthedocs.io/en/latest/api_reference/moleculenet.html#bace-dataset
Regression analysis
Data number : 1513
Descriptor: RDKit, 290 descriptors
Ratio of training and test = 8:2
Selection of descriptors
Since the score was comparatively low for this molecular dataset, several other descriptors were tested.
MACCS keys
Linear model: training r2 = 0.61, MAE = 0.64, test r2 = 0.56, MAE = 0.72
Ridge model: training r2 = 0.61, MAE = 0.64, test r2 = 0.56, MAE = 0.72
Circular (ECFP)
Linear model: training r2 = 0.97, MAE = 0.12, test r2 = 0.54, MAE = 0.66
Ridge model: training r2 = 0.97, MAE = 0.08, test r2 = 0.12, MAE = 0.97
RDKit
Linear model: training r2 = 0.71, MAE = 0.54, test r2 = 0.66, MAE = 0.66
Ridge model: training r2 = 0.70, MAE = 0.55, test r2 = 0.64, MAE = 0.67
As a result, the RDKit descriptors still showed the best result and were used here.
Rough estimation of the model
Linear model: training r2 = 0.71, MAE = 0.54, test r2 = 0.66, MAE = 0.66
Ridge model: training r2 = 0.70, MAE = 0.55, test r2 = 0.64, MAE = 0.67
K-nearest neighbor: training r2 = 0.77, MAE = 0.46, test r2 = 0.62, MAE = 0.66
Partial least square: training r2 = 0.47, MAE = 0.78, test r2 = 0.52, MAE = 0.78
Random forest: training r2 = 0.94, MAE = 0.21, test r2 = 0.72, MAE = 0.56
Hyperparameter optimization of the best model: Random forest
n_estimators: 90, max_depth: 15
training r2 = 0.94, MAE = 0.23, test r2 = 0.73, MAE = 0.56
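The optimization above can be reproduced with a simple grid search; a sketch assuming pre-split `X_train`/`y_train` and illustrative grid values:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [50, 90, 150], "max_depth": [10, 15, 20]},
    scoring="r2",
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_)   # e.g. {'max_depth': 15, 'n_estimators': 90}
```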
For molecular datasets with a large number of descriptors, the Lasso model does not work well. For example, hyperparameter optimization gave:
alpha = 0.033
training r2 = 0.60, MAE = 0.66, test r2 = 0.61, MAE = 0.70
The L1-norm penalty contributed little to the result.
Descriptor selection
Stepwise regression: after each regression, the most dominant descriptors were removed and the model was regressed again. Here are the descriptors dominating the activity.
BertzCT: A topological index meant to quantify “complexity” of molecules
Chi0, Chi1, Kappa1, Kappa2: topological indices carrying information on bond structure, shape, etc. (see http://www.edusoft-lc.com/molconn/manuals/400/chaptwo.html)
HeavyAtomCount, HeavyAtomMolWt, ExactMolWt, MolWt: molecular weight descriptors
Based on these selected features, the biological activity can mostly be explained by the structure (bonding, shape). It also seems that heavy atoms could affect the result.
Classification analysis for IC50 activity (binary label)
Model selection
kNN: training: 0.86, test: 0.77
SVM: training: 0.88, test: 0.83
DT: training: 0.99, test: 0.76
RF: training: 0.99, test: 0.83
LDA: training: 0.86, test: 0.79
Hyperparameter optimization for the best model: SVM
C: 3.02, training: 0.92, test: 0.83
It is interesting to look into the 'true-false' candidates (80 molecules, active but predicted inactive) and the 'false-true' candidates (90 molecules, inactive but predicted active). The former could be molecules whose structures the model cannot yet capture, potentially molecules that have not been investigated well. The latter could include molecules that contain an active structural motif but did not work for some reason. The former list often includes imidazole and pyrrole structures as well as cationic amines; the latter contains molecules with peptide bonds and cationic amines.
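The two lists can be pulled out of the predictions directly; a sketch assuming a fitted classifier `clf` and aligned `X_test`, `y_test`, and `smiles_test` arrays (names hypothetical):

```python
pred = clf.predict(X_test)

# active in the assay but predicted inactive ("true-false")
false_neg = [s for s, t, p in zip(smiles_test, y_test, pred) if t == 1 and p == 0]
# inactive in the assay but predicted active ("false-true")
false_pos = [s for s, t, p in zip(smiles_test, y_test, pred) if t == 0 and p == 1]
```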
Response plot for the training data
Confusion matrix for the training data
Molecules of 'active but not predicted' by the function
Molecules of 'not active but predicted by the function'
The data for compounds containing Ti and O was downloaded from the Materials Project. The data can be downloaded with the pymatgen module via the Materials Project REST API.
https://pymatgen.org/pymatgen.ext.matproj.html
https://materialsproject.org/
The energy, volume, formation energy per atom, density, and total magnetization data were downloaded, and the energy was predicted using only descriptors derived from the chemical formulae.
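A download sketch using pymatgen's legacy MPRester interface (a personal API key from materialsproject.org is required; the property names follow the legacy REST API):

```python
from pymatgen.ext.matproj import MPRester

with MPRester("YOUR_API_KEY") as mpr:
    entries = mpr.query(
        criteria={"elements": {"$all": ["Ti", "O"]}},  # compounds with Ti and O
        properties=["pretty_formula", "energy", "volume",
                    "formation_energy_per_atom", "density",
                    "total_magnetization"],
    )
print(len(entries))
```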
Data number : 3830
Descriptor: XenonPy, 290 descriptors
Ratio of training and test = 8:2
Rough estimation of models
Linear model: training r2 = 0.570, MAE = 95.5, test r2 = 0.153, MAE = 133
Lasso model: not applicable
Ridge model: training r2 = 0.547, MAE = 98, test r2 = 0.353, MAE = 127
K-nearest neighbor: training r2 = 0.68, MAE = 76.9, test r2 = 0.446, MAE = 111
Partial least square: training r2 = 0.40, MAE = 115, test r2 = 0.39, MAE = 130
Random forest: training r2 = 0.94, MAE = 33.7, test r2 = 0.56, MAE = 93
Hyperparameter optimization of the ridge and random forest models
Ridge model
alpha: 46.6
training r2 = 0.44, MAE = 116, test r2 = 0.42, MAE = 118
Random forest model
n_estimators: 200, max_depth: 15
training r2 = 0.94, MAE = 33.7, test r2 = 0.56, MAE = 95
Unlike molecular datasets, inorganic compositions have few good descriptor candidates. Here one candidate, the XenonPy descriptor set, was used; however, the discrepancy between actual and predicted values was large. To examine the reason, the compounds were split into those whose predictions agreed with the actual values and those that disagreed (by more than 50 %). The elements occurring in the 'well-predicted' and 'poorly-predicted' sets are shown as word clouds.
Some clear contrasts were found between the word clouds for 'well-predicted' and 'poorly predicted' elements: phosphorus and manganese appear among the 'well-predicted', while sodium, lanthanum, and silicon appear among the 'poorly predicted'. There is an unexplained but clear consistency in the elements, and the descriptors could be improved for them.
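A sketch of the word-cloud step, assuming the element symbols of each group were collected into lists (`good_elems` and `bad_elems` are hypothetical names):

```python
from collections import Counter

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# frequency = how often each element occurs in the well-predicted compounds
wc = WordCloud(background_color="white").generate_from_frequencies(Counter(good_elems))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```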
The response plot for the training data
List of elements which are well-predicted
List of elements which are poorly-predicted
This page introduces the application of machine learning to various chemical datasets. The content is not aimed at professionals; it is an introduction to how ML can be applied to extract information from databases. The analyses were mostly made with general machine learning (ML) techniques, programmed in Python (scikit-learn, XenonPy, DeepChem/MoleculeNet).
The HIV dataset was introduced by the Drug Therapeutics Program (DTP) AIDS Antiviral Screen, which tested the ability to inhibit HIV replication for over 40,000 compounds. Screening results were evaluated and placed into three categories: confirmed inactive (CI), confirmed active (CA), and confirmed moderately active (CM). The data was analyzed for 1/0 activity by combining CA and CM as active reagents.
The dataset was downloaded from MoleculeNet in deepchem (HIV.csv)
Data number : 41127
Descriptor: RDKit, 208 descriptors
Ratio of training and test = 8:2
https://deepchem.readthedocs.io/en/latest/api_reference/moleculenet.html#hiv-datasets
As an initial check of the data, principal component analysis (PCA) was performed, and the first three principal components are plotted in the figure. It seems that the 1/0 distinction would be possible.
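A sketch of this PCA check, assuming the scaled descriptor matrix `X` and a numpy label vector `y` (1 = active, 0 = inactive):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

scores = PCA(n_components=3).fit_transform(X)   # first three components

ax = plt.figure().add_subplot(projection="3d")
ax.scatter(*scores[y == 1].T, c="blue", s=5, label="active")
ax.scatter(*scores[y == 0].T, c="red", s=5, label="inactive")
ax.legend()
plt.show()
```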
Rough selection of models
kNN: training accuracy: 0.977, test accuracy: 0.972
SVM: training accuracy: 0.975, test accuracy: 0.971
DT: training accuracy: 1.000, test accuracy: 0.945 (overfitting)
RF: training accuracy: 1.000, test accuracy: 0.972
LDA: training accuracy: 0.961, test accuracy: 0.961
Hyperparameter optimization for the best model: k-nearest neighbor (random forest performed comparably)
n_neighbors: 4, p: 1
training accuracy: 0.979, test accuracy: 0.976
The 'predicted active' molecules are listed, saved as a molecular list, and shown on the right. However, it is the list of 'predicted inactive' but active molecules, and that of 'predicted active' but inactive molecules, that provide more insight. The former corresponds to molecules whose information has not yet been captured by the prediction function. The latter indicates molecules that have the structure/properties of 'active' molecules but did not work; similar molecules could have the potential to be active.
3D plot of scores of principal components
(Blue: active, Red: inactive)
List of 'predicted active' and active molecules
List of 'predicted inactive' but active molecules
The original source paper and its data are included as supporting information at the following site. The original paper describes the prediction of aqueous solubility (log S) using several descriptors (2874 data points, R2 = 0.69, MAE = 0.75).
https://pubs.acs.org/doi/10.1021/ci034243x
The dataset was downloaded from MoleculeNet in DeepChem (ESOL_delaney-processed.csv). The dataset has been reduced to 1128 compounds.
https://deepchem.readthedocs.io/en/latest/api_reference/moleculenet.html#delaney-datasets
Descriptor: RDKit (208 descriptors)
Ratio of training and test = 8:2
Rough selection of models
Linear model: training r2 = 0.95, MAE = 0.35, test r2 = 0.89, MAE = 0.48
Lasso model (alpha = 0.01): training r2 = 0.91, MAE = 0.47, test r2 = 0.91, MAE = 0.50
Ridge model (alpha = 0.5): training r2 = 0.94, MAE = 0.37, test r2 = 0.92, MAE = 0.46
K-nearest neighbor (n = 5): training r2 = 0.89, MAE = 0.49, test r2 = 0.85, MAE = 0.60
Partial least square (n = 1): training r2 = 0.57, MAE = 1.02, test r2 = 0.59, MAE = 1.03
Random forest: training r2 = 0.98, MAE = 0.17, test r2 = 0.92, MAE = 0.44
Hyperparameter optimization for the best model: Random forest
n_estimators: 50, max_depth: 15
training r2 = 0.98, MAE = 0.18, test r2 = 0.92, MAE = 0.44
Descriptor selection
Stepwise regression: after each regression, the most dominant descriptors were removed and the model was regressed again. Here are the descriptors dominating the solubility.
MolLogP: octanol/water partition coefficient (log P)
MolMR: molecular refractivity, a measure of the total polarizability
Chi0v: one of the connectivity indices
LabuteASA: Labute's approximate surface area
HeavyAtomMolWt, MolWt, ExactMolWt, SlogP_VSA2
It is natural that solubility-related descriptors are selected. Descriptors for polarizability and molecular weight are frequently selected to describe solubility.
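A sketch of the remove-and-refit loop used for this descriptor selection, assuming a pandas feature frame `X_train` and target `y_train` (names hypothetical), with the tuned random forest settings:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X_work, dominant = X_train.copy(), []
for _ in range(8):                                   # a few selection rounds
    rf = RandomForestRegressor(n_estimators=50, max_depth=15, random_state=0)
    rf.fit(X_work, y_train)
    top = X_work.columns[np.argmax(rf.feature_importances_)]
    dominant.append(top)                             # record the dominant descriptor
    X_work = X_work.drop(columns=[top])              # remove it and regress again
print(dominant)
```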
Response plot for the training/test data
using the best model