Write the report on Google Colab.
Classify the molecules in the Esol data set by high (1) or low (0) solubility, where high solubility is defined as logS being above 0. Use the six descriptors in the dataset file to train both a logistic regression model (using sklearn) and a NN-model (using Keras). In addition, make a simple "null model" that predicts 0 for all molecules.
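A minimal sketch of the logistic regression and null models in sklearn (the data here are synthetic stand-ins for the six descriptors in the dataset file, and the split is an arbitrary choice; the Keras NN would take the same `X` and `y`):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the six ESOL descriptors and logS values
# (replace with the columns from the dataset file).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
logS = X @ rng.normal(size=6) + rng.normal(scale=0.5, size=200)
y = (logS > 0).astype(int)          # high (1) vs low (0) solubility

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

logreg = LogisticRegression().fit(X_train, y_train)
null = DummyClassifier(strategy="constant", constant=0).fit(X_train, y_train)

print("logreg accuracy:", logreg.score(X_test, y_test))
print("null accuracy:  ", null.score(X_test, y_test))
```

The null model's accuracy is simply the fraction of class-0 molecules in the test set, which is the baseline the trained models have to beat.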
Compute and discuss the confusion matrix as well as the accuracy, precision, sensitivity, and specificity of all three models. In particular, discuss why accuracy is not a good metric for this particular problem.
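All four metrics can be read directly off the confusion matrix. A sketch with toy labels (the 80/20 class balance is an illustration, not the actual ESOL balance) showing how a constant-0 null model can still score high accuracy on an imbalanced set:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Imbalanced toy labels: 80% low-solubility (0), 20% high (1).
y_true = np.array([0] * 80 + [1] * 20)
y_pred = np.zeros_like(y_true)      # the "null model": always predict 0

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy    = (tp + tn) / (tp + tn + fp + fn)
precision   = tp / (tp + fp) if (tp + fp) else 0.0
sensitivity = tp / (tp + fn)        # recall for the positive class
specificity = tn / (tn + fp)

print(accuracy, precision, sensitivity, specificity)  # 0.8 0.0 0.0 1.0
```

Despite never finding a single soluble molecule (sensitivity = 0), the null model reaches 80% accuracy, which is why accuracy alone is misleading here.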
Which model works best?
Use the Esol data set to train a Keras NN model that predicts solubility from "fingerprint as bit vector" (the actual value, not whether it is above or below 0, like last time). The goal is to get the lowest possible error on the validation set. You must use early stopping and present a plot of the training and validation loss for your final model.
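A minimal Keras sketch of the training setup (the random bit vectors, layer sizes, patience, and epoch budget are all placeholder choices; the real inputs are the fingerprint bit vectors from the dataset):

```python
import numpy as np
from tensorflow import keras

# Random stand-ins: 2048-bit fingerprints -> logS values.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 2048)).astype("float32")
y = rng.normal(size=300).astype("float32")

model = keras.Sequential([
    keras.Input(shape=(2048,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                     restore_best_weights=True)
history = model.fit(X, y, validation_split=0.2, epochs=200,
                    callbacks=[stop], verbose=0)

# For the report, plot training vs validation loss from history.history:
# plt.plot(history.history["loss"]); plt.plot(history.history["val_loss"])
```

`restore_best_weights=True` makes the model roll back to the epoch with the lowest validation loss, so the reported error corresponds to the curve's minimum rather than the last epoch.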
Make a scatter plot of the predicted vs experimental values for your training and test sets.
For your training set you should notice a handful of points with identical predicted values but very different experimental values (you may have to run ca. 30 epochs to see them clearly). Explain this by making figures of the molecules and comparing their fingerprint bit vectors. HINT
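Once the fingerprints are available as numpy arrays, the comparison itself is a few lines. A sketch with toy vectors (in practice the arrays would come from RDKit Morgan fingerprints for the suspicious molecules):

```python
import numpy as np

# Toy stand-ins for two molecules' fingerprint bit vectors; in practice
# these come from RDKit, e.g. np.array(fp) for a Morgan bit vector.
fp_a = np.zeros(2048, dtype=int); fp_a[[5, 17, 901]] = 1
fp_b = np.zeros(2048, dtype=int); fp_b[[5, 17, 901]] = 1   # same bits set

identical = np.array_equal(fp_a, fp_b)
tanimoto = (fp_a & fp_b).sum() / (fp_a | fp_b).sum()
print(identical, tanimoto)   # True 1.0
```

If two different molecules map to identical bit vectors, the network receives identical inputs and must produce identical predictions, no matter how different the experimental solubilities are.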
For the test set compute the RMSE for your model. Compare this value to the RMSE for a "null model" that predicts the average solubility for all data molecules. Finally, compare the RMSE for the null model to the standard deviation (SD) of the solubility values. Why is it important for the RMSE to be significantly lower than the SD?
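The null-model comparison comes down to a few lines of numpy (the values below are synthetic stand-ins). Note that the RMSE of a model that predicts the mean of a set of values is, on that same set, exactly the population SD, which is why beating the SD is the minimum bar for a useful model:

```python
import numpy as np

rng = np.random.default_rng(0)
y_test = rng.normal(loc=-3.0, scale=2.0, size=100)     # toy solubility values
y_pred = y_test + rng.normal(scale=0.5, size=100)      # toy model predictions

rmse_model = np.sqrt(np.mean((y_test - y_pred) ** 2))
rmse_null  = np.sqrt(np.mean((y_test - y_test.mean()) ** 2))
sd = np.std(y_test)                                    # population SD

print(rmse_model, rmse_null, sd)   # rmse_null equals sd exactly here
```

An RMSE close to the SD therefore means the model explains essentially no variance beyond guessing the average.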
Is there a correlation between the error and the solubility, i.e. is your model better for some ranges of solubility and, if so, why?
How sensitive are your results to the choice of training and test set?
Repeat assignment 1 but with the following changes:
1. Use molecular fingerprints and graph convolution together with a NN. Try different atom and bond descriptors to get the lowest validation error.
2. Use train, validation, and test sets
3. Use early stopping that monitors the precision multiplied by the sensitivity. Keras can't do this out of the box, so you need to implement early stopping from scratch, by iteratively training the model for one epoch at a time and computing and monitoring the metric.
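The loop structure for the hand-rolled early stopping can be sketched as follows. Here sklearn's `SGDClassifier.partial_fit` stands in for one epoch of training on synthetic data; with a Keras model you would instead call `model.fit(..., epochs=1)` inside the loop and snapshot `model.get_weights()` whenever the monitored metric improves (the patience value is an arbitrary choice):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data for a binary solubility classification task.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))
y = (X[:, 0] + 0.3 * rng.normal(size=400) > 0).astype(int)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

model = SGDClassifier(random_state=0)
best_score, patience, wait = -np.inf, 5, 0
for epoch in range(100):
    model.partial_fit(X_tr, y_tr, classes=[0, 1])   # "one epoch" of training
    y_hat = model.predict(X_val)
    # recall of the positive class is the sensitivity
    score = (precision_score(y_val, y_hat, zero_division=0)
             * recall_score(y_val, y_hat, zero_division=0))
    if score > best_score:
        best_score, wait = score, 0   # with Keras: also save model.get_weights()
    else:
        wait += 1
        if wait >= patience:          # no improvement for `patience` epochs
            break

print("stopped after epoch", epoch, "best precision*sensitivity:", best_score)
```

`zero_division=0` guards the early epochs, where the model may predict only one class and precision would otherwise be undefined.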
You need to use a new dataset. Several datasets are described in this paper, but you can also find other datasets in the scientific literature (subject to my approval). You can also find links to the datasets in the .py files here.
The questions are the same as for the previous assignments. Which method works best, and why?
You also need to introduce your dataset and theory behind the methods you use.