Assignments

Assignment 1

Write the report in Google Colab.

Classify the molecules in the Esol data set by high (1) or low (0) solubility, where high solubility is defined as logS being above 0. Use the data set to train both a logistic regression model (using sklearn) and a NN-model (using Keras). In addition, make a simple "null model" that predicts 0 for all molecules.

Compute and discuss the confusion matrix as well as the accuracy, precision, sensitivity, and specificity of all three models. In particular, discuss why accuracy is not a good metric for this particular problem.
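As a starting point, the four metrics can be computed directly from the confusion-matrix counts. The sketch below is only an illustration: the logS values are made up (they are not taken from the Esol data set), and the helper names `confusion_counts`, `safe_div`, and `classification_metrics` are hypothetical. It evaluates the null model that predicts 0 for every molecule.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def safe_div(numerator, denominator):
    """Return NaN instead of raising when a metric is undefined (0/0)."""
    return numerator / denominator if denominator else float("nan")


def classification_metrics(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "precision": safe_div(tp, tp + fp),
        "sensitivity": safe_div(tp, tp + fn),  # true positive rate
        "specificity": safe_div(tn, tn + fp),  # true negative rate
    }


# Made-up logS values, for illustration only.
logS = [-3.2, -1.5, 0.4, -2.8, -0.7, 1.1, -4.0, -2.2, -0.1, -5.3]
y_true = [1 if s > 0 else 0 for s in logS]  # high solubility: logS above 0
y_null = [0] * len(y_true)                  # null model: always predict 0

null_metrics = classification_metrics(y_true, y_null)
for name, value in null_metrics.items():
    print(f"{name}: {value}")
```

For the sklearn and Keras models, `sklearn.metrics.confusion_matrix(y_true, y_pred)` gives the same counts, laid out as [[tn, fp], [fn, tp]] for labels 0 and 1.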

Which model works best?


Assignment 2

Use the Esol data set to train a Keras NN model that predicts solubility from "fingerprint as bit vector" (the actual logS value, not the high/low class used in Assignment 1). The goal is to get the lowest possible error on the test set.

Make a scatter plot of the predicted vs experimental values for your training and test sets.

For your training set you should notice a handful of points with identical predicted values but very different experimental values (you may have to run ca. 30 epochs to see them clearly). Explain this by making figures of the molecules and comparing their fingerprint bit vectors. HINT
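One possible way to compare the bit vectors is to group molecules by their fingerprint and list the groups that contain more than one molecule. The sketch below uses made-up 4-bit vectors and made-up logS values; in your notebook the vectors would come from whatever fingerprint code you already use. The function name `group_by_fingerprint` is hypothetical.

```python
from collections import defaultdict


def group_by_fingerprint(fingerprints, values):
    """Map each bit vector shared by several molecules to [(index, value), ...]."""
    groups = defaultdict(list)
    for idx, (bits, value) in enumerate(zip(fingerprints, values)):
        groups[tuple(bits)].append((idx, value))  # tuples are hashable, lists are not
    # Keep only the bit vectors that occur for more than one molecule.
    return {bits: members for bits, members in groups.items() if len(members) > 1}


# Made-up data, for illustration only.
fps = [
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [1, 0, 1, 0],  # same bits as molecule 0
    [0, 0, 0, 1],
]
logS = [-1.0, -2.5, -4.0, 0.3]

duplicates = group_by_fingerprint(fps, logS)
for bits, members in duplicates.items():
    member_values = [value for _, value in members]
    spread = max(member_values) - min(member_values)
    print(f"bits={bits} molecules={[i for i, _ in members]} logS spread={spread:.2f}")
```

The indices in each group tell you which molecules to draw side by side.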

For the test set, compute the RMSE for your model. Compare this value to the RMSE for a "null model" that predicts the average solubility for all molecules in the data set. Finally, compare the RMSE for the null model to the standard deviation (SD) of the solubility values. Why is it important for the RMSE to be significantly lower than the SD?
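The three numbers can be computed with a few lines of NumPy. This is a minimal sketch with made-up values standing in for the experimental and predicted solubilities; the names `rmse`, `y_test`, `y_model`, and `y_null` are assumptions. Here the null model uses the mean of the same values it is scored on, which is a simplification: if you take the mean from a different set of molecules, the null-model RMSE will differ somewhat.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root-mean-square error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up values, for illustration only.
y_test = np.array([-3.2, -1.5, 0.4, -2.8, -0.7])   # "experimental" logS
y_model = np.array([-3.0, -1.9, 0.1, -2.5, -1.2])  # "predicted" logS

# Null model: predict the same average value for every molecule.
y_null = np.full_like(y_test, y_test.mean())

rmse_model = rmse(y_test, y_model)
rmse_null = rmse(y_test, y_null)
sd = float(np.std(y_test))  # population SD (ddof=0)

print(f"RMSE model: {rmse_model:.3f}")
print(f"RMSE null:  {rmse_null:.3f}")
print(f"SD:         {sd:.3f}")
```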

Is there a correlation between the error and the solubility, i.e. is your model better for some ranges of solubility and, if so, why?
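One way to look at this question is to compute the absolute error per molecule, then check its correlation with the experimental value and its average within a few solubility ranges. The sketch below uses made-up numbers, and the bin edges are arbitrary assumptions; a scatter plot of error against experimental logS would show the same information.

```python
import numpy as np

# Made-up values, for illustration only.
y_exp = np.array([-6.1, -4.8, -3.9, -3.0, -2.2, -1.4, -0.6, 0.3])
y_pred = np.array([-4.9, -4.1, -3.6, -2.9, -2.3, -1.6, -0.9, -0.4])

abs_err = np.abs(y_pred - y_exp)

# Pearson correlation between experimental solubility and absolute error.
r = float(np.corrcoef(y_exp, abs_err)[0, 1])
print(f"correlation(logS, |error|) = {r:.2f}")

# Mean absolute error within solubility ranges (bin edges chosen arbitrarily).
edges = np.array([-7.0, -4.0, -2.0, 1.0])
bin_idx = np.digitize(y_exp, edges[1:-1])  # 0, 1, or 2 for the three ranges
bin_mae = [float(abs_err[bin_idx == b].mean()) for b in range(len(edges) - 1)]

for b, mae in enumerate(bin_mae):
    print(f"logS in [{edges[b]}, {edges[b + 1]}): mean |error| = {mae:.2f}")
```

A linear correlation coefficient can miss patterns where the error is large at both ends of the range, which is why the binned averages are printed as well.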

How sensitive are your results to the choice of training and test set?
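A simple way to probe this is to repeat the train/test split with different random seeds and look at the spread of the test RMSE. The sketch below is a stand-in, not the assignment's model: it uses synthetic random bit vectors and a linear least-squares fit in place of your Keras network, and the names `split_indices` and `fit_and_score` are hypothetical. In your notebook you would replace `fit_and_score` with code that trains and evaluates your own model.

```python
import numpy as np


def split_indices(n, test_fraction, seed):
    """Return (train_idx, test_idx) for a random split controlled by seed."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_test = int(round(n * test_fraction))
    return perm[n_test:], perm[:n_test]


def fit_and_score(X_train, y_train, X_test, y_test):
    """Stand-in model: linear least squares with an intercept; returns test RMSE."""
    A_train = np.c_[X_train, np.ones(len(X_train))]
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    A_test = np.c_[X_test, np.ones(len(X_test))]
    pred = A_test @ coef
    return float(np.sqrt(np.mean((y_test - pred) ** 2)))


# Synthetic data, for illustration only: 200 "molecules" with 16-bit vectors.
data_rng = np.random.default_rng(0)
X = data_rng.integers(0, 2, size=(200, 16)).astype(float)
y = X @ data_rng.normal(size=16) + data_rng.normal(scale=0.5, size=200)

scores = []
for seed in range(10):
    train_idx, test_idx = split_indices(len(y), 0.2, seed)
    scores.append(fit_and_score(X[train_idx], y[train_idx], X[test_idx], y[test_idx]))

print(f"test RMSE over {len(scores)} splits: "
      f"mean = {np.mean(scores):.3f}, SD = {np.std(scores):.3f}")
```

If the SD across splits is comparable to the difference between two models you are comparing, a single split is not enough to say which one is better.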



Assignment 3

Compare your results from Assignment 2 with a graph-based representation of the molecules. Is it better or worse? I suggest you use DeepChem for the graph convolution.

Answer the same questions as for Assignment 2.

Final Project

For the final project you can use any dataset you want and any ML method you want. The questions are the same as for the assignments you have completed. What method works best and why? What molecules have large errors and why?

You can use the Delaney dataset if you want, so that your final project is a comparison of the methods you have learned about so far. But you will learn a lot more from working with a new dataset. There are several datasets described in this paper, but you can also find other datasets in the scientific literature.