Zoe


question. When evaluating prospective drugs, one key attribute is the solubility of the chemical. Solubility in the bloodstream, the gut, or other in vivo environments is crucial for drug delivery. If a drug is insoluble, solid precipitates may accumulate, which can be extremely harmful. If the compound is partially soluble, one must ascertain exactly how much of it is actually dissolved and delivered in order to control dosage. However, experimental determination of solubility is difficult or even impossible in some cases. It therefore becomes important to predict solubility accurately from properties that are easy to measure or compute.

One such easily computed metric is the calculated partition coefficient (cLogP) of a molecule. The method considers small fragments that are common to many molecules. By reducing a large molecule to counts of these fragments (13 in the original implementation) and discarding extraneous segments, it computes an overall estimate of the partition coefficient, a value that correlates strongly with solubility.
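
To make the fragment idea concrete, here is a minimal sketch of an additive fragment estimate. The fragment names and contribution values are hypothetical placeholders chosen for illustration; they are not the 13 fragments or the coefficients of the original cLogP method.

```python
from collections import Counter

# Hypothetical per-fragment contributions to logP.
# These numbers are placeholders, not published coefficients.
FRAGMENT_CONTRIBUTIONS = {
    "CH3": 0.5,
    "CH2": 0.5,
    "OH": -1.0,
    "C6H5": 2.0,
}


def estimate_logp(fragments, contributions=FRAGMENT_CONTRIBUTIONS):
    """Sum count * contribution over recognised fragments.

    Fragments missing from the table are skipped, mirroring the
    "discard extraneous segments" step described in the text.
    """
    counts = Counter(fragments)
    return sum(
        contributions[name] * count
        for name, count in counts.items()
        if name in contributions
    )


# Ethanol written as a bag of (hypothetical) fragments.
ethanol = ["CH3", "CH2", "OH"]
print(estimate_logp(ethanol))
```

The point of the sketch is only that the whole molecule collapses to a handful of integer counts, which is the limitation the next paragraph pushes against.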

With modern computational power, there is no need to reduce complicated molecules to only 13 integer values (the count of each fragment). I wanted to combine several approaches to this problem in order to build a more successful model of solubility.


data. My dataset, AqSolDB (1), contained 9,946 molecules. For each compound, it gave the chemical structure in SMILES notation, which encodes the atoms and the covalent bonds between them, along with 16 molecular properties (see example below). The molecules were common small organic compounds that show up regularly in medicines, with an average molecular weight of 266.7.
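
As a rough sketch of how records like these can be read, the snippet below parses a tiny in-memory CSV and averages the molecular weights. The column names `SMILES` and `MolWt` are assumptions about the file layout, and the three rows are textbook molecules (ethanol, benzene, acetic acid) used for illustration, not records copied from the dataset.

```python
import csv
import io

# Illustrative rows only; column names are assumed, not confirmed.
SAMPLE_CSV = """SMILES,MolWt
CCO,46.07
c1ccccc1,78.11
CC(=O)O,60.05
"""


def load_rows(text):
    """Parse CSV text into a list of dicts with a numeric weight."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {"smiles": row["SMILES"], "mol_wt": float(row["MolWt"])}
        for row in reader
    ]


def mean_mol_wt(rows):
    """Average molecular weight over all parsed rows."""
    return sum(row["mol_wt"] for row in rows) / len(rows)


rows = load_rows(SAMPLE_CSV)
print(len(rows), round(mean_mol_wt(rows), 2))
```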




distribution of solubility and cutoffs for each of the categories: insoluble, slightly soluble, and soluble
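
A minimal sketch of turning a numeric solubility value into one of the three categories. The two cutoff values here are hypothetical defaults, and the assumption that solubility is expressed as a log value (higher means more soluble) is mine; the cutoffs actually used are the ones shown in the figure.

```python
def solubility_class(log_s, insoluble_below=-4.0, soluble_above=-2.0):
    """Map a log-solubility value onto three ordered categories.

    Both cutoffs are placeholder values, not the ones used in the project.
    """
    if insoluble_below >= soluble_above:
        raise ValueError("insoluble_below must be less than soluble_above")
    if log_s < insoluble_below:
        return "insoluble"
    if log_s < soluble_above:
        return "slightly soluble"
    return "soluble"


for value in (-6.2, -3.0, 0.5):
    print(value, solubility_class(value))
```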

a plan. Some research in this area considers every covalent bond within a molecule, representing a drug (usually with hundreds of atoms) as a graph of atoms connected by bonds (picture 1). A simple example of this method, the graph convolutional network (GCN), trains a neural network on these atom-bond graphs to recognize patterns in their structures (e.g. rings, long carbon chains) and predict solubility. However, when I implemented this method, it failed to make successful predictions (averaging 35% accuracy in my test). I hypothesized that the lack of information about the macro structure of the molecule was preventing success. I therefore aimed to incorporate a few overall molecular properties into the model to improve results.
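
To illustrate the idea, here is a pure-Python sketch of one graph-convolution step followed by pooling and concatenation with global properties. It uses simple mean aggregation over each atom and its neighbours, which is a simplified variant of a GCN layer, and concatenation is just one plausible way to combine the two kinds of information; neither is claimed to be the exact architecture used in this project. The weights and global property values are placeholders.

```python
def graph_conv(features, edges, weights):
    """One message-passing step: average each atom with its bonded
    neighbours (self-loop included), apply a linear map, then ReLU."""
    n_atoms = len(features)
    in_dim = len(features[0])
    out_dim = len(weights[0])
    neighbours = [{i} for i in range(n_atoms)]  # start with self-loops
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    output = []
    for i in range(n_atoms):
        group = neighbours[i]
        mean = [sum(features[j][k] for j in group) / len(group)
                for k in range(in_dim)]
        row = [max(0.0, sum(mean[k] * weights[k][m] for k in range(in_dim)))
               for m in range(out_dim)]
        output.append(row)
    return output


def mean_pool(features):
    """Collapse per-atom vectors into one vector for the whole molecule."""
    n_atoms = len(features)
    return [sum(row[k] for row in features) / n_atoms
            for k in range(len(features[0]))]


def molecule_vector(features, edges, weights, global_props):
    """Graph embedding followed by whole-molecule properties."""
    hidden = graph_conv(features, edges, weights)
    return mean_pool(hidden) + list(global_props)


# Ethanol heavy atoms C-C-O with one-hot features [is_carbon, is_oxygen].
atoms = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
bonds = [(0, 1), (1, 2)]
identity = [[1.0, 0.0], [0.0, 1.0]]   # placeholder weights, not trained
placeholder_globals = (0.0, 1.0)      # stand-ins for cLogP and MolMR

vec = molecule_vector(atoms, bonds, identity, placeholder_globals)
print(vec)
```

In a real model the resulting vector would feed a trained classifier; the sketch only shows where the global properties enter.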

some details

choosing a couple key properties

When choosing the most relevant molecular properties from the 16 numerical values in the dataset, I considered only those that described global properties. I was wary of values like the number of rings or the number of hydrogen-bond donors or acceptors, because they describe only particular fragments of the molecule, information that the GCN would theoretically capture. To verify this intuition, I trained a very simple random forest classifier on all 16 attributes and then examined the relative importance of each feature to the final prediction (which had 51% accuracy) (picture 2). By far the most important features were cLogP, which is known to predict solubility, and molecular refractivity (MolMR), a measure of a molecule's overall polarizability, and so of its size, derived from how it interacts with light.
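
The feature-importance check can be sketched with scikit-learn as follows. This assumes scikit-learn and NumPy are installed, and it runs on synthetic data with hypothetical feature names rather than the 16 dataset attributes, so the printed ranking says nothing about the real molecules.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: only the first column carries signal.
rng = np.random.default_rng(0)
feature_names = ["informative", "noise_a", "noise_b"]  # hypothetical names
X = rng.normal(size=(300, 3))

# Three classes defined by thresholds on the first column alone.
y = np.digitize(X[:, 0], bins=[-0.5, 0.5])

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X, y)

# feature_importances_ holds impurity-based importances that sum to 1.
ranked = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

Impurity-based importances can favour features with many distinct values, so a ranking like this is a first look rather than proof that a feature matters.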


cLogP and MolMR

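cLogP. Technically, logP is the base-10 logarithm of the ratio of a compound's concentration in octanol to its concentration in water, measured for the un-ionized form once the two phases reach equilibrium; cLogP is the computed estimate of that value. Effectively, it quantifies lipophilicity, which indicates how readily a molecule can cross a phospholipid membrane.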
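
For reference, the standard definition of the octanol-water partition coefficient can be written as below, where the concentrations refer to the un-ionized compound at equilibrium; cLogP is a computed estimate of this quantity.

```latex
\log P = \log_{10}\!\left(
  \frac{[\text{solute}]_{\text{octanol}}}{[\text{solute}]_{\text{water}}}
\right)
```

For example, a compound that is 100 times more concentrated in the octanol phase than in the water phase has log P = 2, while one that splits evenly has log P = 0.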

Molecular Refractivity. Basically, MolMR describes how strongly a molecule's electron cloud is polarized by light. Larger, heavier molecules have more electrons to polarize, so MolMR is almost perfectly correlated with the molecular weight of the molecule, with slightly more information about its size and electronic structure.
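
As a worked example, the physical quantity behind this descriptor can be computed from the Lorentz-Lorenz relation, MR = (n^2 - 1) / (n^2 + 2) * M / rho. The sketch below applies it to water using approximate textbook values. The MolMR column in the dataset is presumably a computed estimate rather than a measurement made this way; that is an assumption on my part.

```python
def molar_refractivity(refractive_index, molar_mass, density):
    """Lorentz-Lorenz molar refractivity.

    molar_mass in g/mol and density in g/cm^3 give a result in cm^3/mol.
    """
    n_squared = refractive_index ** 2
    return (n_squared - 1.0) / (n_squared + 2.0) * molar_mass / density


# Approximate values for liquid water near room temperature.
water_mr = molar_refractivity(
    refractive_index=1.333,
    molar_mass=18.015,
    density=0.997,
)
print(round(water_mr, 2))  # roughly 3.7 cm^3/mol
```

Because the result scales with molar mass divided by density, it has units of volume per mole, which is why it tracks molecular size so closely.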

results.