A high-throughput bioinformatics tool for the intelligent processing of lipid molecule libraries into a common set of fragments for force field parametrization
The purpose of molecular dynamics (MD) simulation is to avoid where possible the extremely high computational cost of performing quantum mechanical simulations and the great expense of wet lab experiments by simulating molecules using classical forces. These constants associated with these forces are assigned based on based on data obtained from quantum mechanical simulations or physical data.
One of the most important biological targets for molecular dynamics is the modeling of lipid membranes. A great number of hugely important biological processes happen at cellular and organelle membranes which are largely composed of lipid molecules. Perhaps most significant among these are the various processes by which molecules can penetrate the intracellular membrane. Being able to rapidly computationally model the interactions between drug molecules or carriers and cellular membranes using molecular dynamics could greatly accelerate the pace of drug delivery research.
The first step in creating a molecular mechanics force field is to determine a functional form (the set of forces involved and how they are calculated). Many of the most ubiquitous force fields in use today are all-atom, which means that all the forces in the force field are calculated based on pairwise Newtonian interactions between atoms. The forces typically used in an all-atom force field can be divided into two categories: bonded and non-bonded interactions.
The non-bonded interactions typically considered are van der Waals and electrostatic interactions. The constants involved in these interactions can be uniquely determined by the elements and charges of the two atoms involved, so determining which constants to use is easy.
Bonded interactions typically include bond length, bond angle, and bond torsion angles. Unlike the constants involved in non-bonded interactions, the constants in the bonded terms tend to be specific to not just the two atoms in question, but the surrounding chemical environment. For instance, the equilibrium length and bond strength of a carbon-carbon bond depends heavily on what else those carbons are bound to. Thus, one of the major problems involved in creating a force field is deciding how to assign bonded interaction constants.
Most force fields were originally developed for proteins, and for that class of molecules the solution is easy. Since proteins are composed of a small set of discrete amino acids, each atom and bond in each amino acid can have its own bonded interaction parameters. To obtain these, quantum mechanical simulations are run on each amino acid to determine the relevant constants for each atom in the residue. Given any arbitrary protein, it is then easy to discretize it into amino acids, identify each atom as atom X on amino acid Y, and then assign appropriate parameters.
For lipids, however, the problem is much more difficult. Lipids are not composed of a small set of discrete building blocks. It is difficult to know which lipid fragments to run quantum mechanical simulations on to obtain parameters in order to get full and non-redundant coverage of all common lipids. Similarly, given an arbitrary lipid it is difficult to know which bonded parameters to assign to a given interaction since lipids cannot be readily discretized in the same way as proteins.
For this reason, existing lipid force fields only have parameters defined for a very small set of membrane lipids. By way of example, the recent extension of the CHARMM force field to lipids only rendered it capable of simulating four very similar lipids: lauroylpropanediol-phosphorycholine (LPPC), 1,2-dimyristoyl-sn-phosphatidylcholine (DMPC), 1,2-dilauroyl-sn-glycero-1-phosorylethanolamine (DLPE), and 1,2-dioleoyl-sn-phosphatidylcholine (DOPC). By comparison, there are tens of thousands of known unique lipid structures. Therefore, there is a great need for an informatics tool capable of intelligently and systematically breaking down large numbers of arbitrary lipids from databases into a smaller set of fragments just as proteins can be naturally broken down into amino acids.
Lipid Fragmenter, the novel algorithm and bioinformatics tool presented on this website, is a first step towards resolving the issue outlined above. The purpose of the tool is to systematically break down any lipid into a set of smaller fragments. By running this tool on a large library of lipids, the fragments required to be parametrized to obtain complete coverage of common lipids could be identified. Once those fragments are all parametrized appropriately for a force field, running the tool on an individual lipid would supply identification of which parameters should be assigned to each atom.
While the tool is primarily designed to fragment lipids in preparation for force field parametrization, it is capable of breaking down any class of molecule according to the rules detailed in the below section. Accordingly, it could be employed in a range of bio and cheminformatics research involving the detection of certain chemical functionalities in large sets of molecules. For example, it could be used to further process bioinformatic data obtained from the output of mass spectrometry processing software like SimLipid in order to learn more about the frequencies of functional groups present on a cell membrane.
Read in a molecule from an XYZ file and identify all heteroatoms. Heteroatoms are defined as all non hydrocarbon atoms, as well as any non-sp3 carbons
Using a modified Quick Union algorithm optimized with tree weighting and path compression, merge all non-hydrogen atoms bound to a heteroatom to its group
Assign every continuous set of carbon atoms not yet assigned to a group to the same group
Assign all hydrogen atoms to the same group as their parent atom
Detach all groups from each other by disconnecting the bonds and adds methyl groups to cap the ends of the fragments in appropriate 3D locations.
Write each fragment to disk in Tinker XYZ format.
Before fragmentation
Fragmentation locations
The algorithm, implemented here in Go, is capable of fragmenting one typical membrane lipid in 4-5 ms on a modern consumer-grade CPU (AMD Ryzen R5 3600), making it suitable for high throughput analysis of large libraries of lipids. The algorithm could be easily multi-threaded to process multiple molecules in parallel using Go Routines if a further speed-up is desired, but I did not consider this measure to be worth implementing considering its current speed would be sufficient to process well in excess of 500,000 lipids per hour. For reference, the largest lipid structure database in the world, LIPID MAPS' Lipidomics, contains fewer than 50,000 unique structures.
Using Lipid Fragmenter on individual lipids
Making use of Lipid Fragmenter is very easy. The program is entirely self contained and has no dependencies or requirements.
Download the exe file from github using the button below
Prepare a file containing your target lipid in Tinker XYZ format
Run the program using command prompt as follows: "/path/to/lipidFragmenter.exe /path/to/lipid.xyz"
Fragment XYZ files will be deposited in the same directory as the lipid XYZ file
Using Lipid Fragmenter with online lipid databases and large sets of lipids
For using Lipid Fragmenter in conjunction with various online lipid databases which store their molecules in a variety of formats, the use of OPENBABEL is recommended for the preparation of input files in Tinker XYZ format. OPENBABEL publishes libraries for Python, Java, C#, C++ and more on their website.
As Lipid Fragmenter is a self-contained executable file, automation scripts for running many lipids written in Bash or the programming language of your choice can easily call it. In Python, for example, it could be run using os.system().
If you wish to remove fragments with identical connectivity from Lipid Fragmenter's output, conversion of the output Tinker XYZ files to SMILES strings is recommended. This step is not performed automatically as outputting in the SMILES format and purging fragments with identical connectivity entails a significant loss of information which in many cases is undesirable.