Protein Structure Solution Road Map
Roger S. Rowlett
Gordon & Dorothy Kline Professor, Emeritus
Colgate University Department of Chemistry
This page describes typical steps and benchmarks for solving a protein structure. This is not a detailed description of protein structure solution, but rather a general guide to how to proceed. Details about how to use specific tools and software packages to accomplish these tasks are described in the additional pages. There are several popular software suites for solving protein structures, including CCP4, CNS, and Phenix. All of these have been used in our lab. However, we have found that the traditional CCP4i interface to the CCP4 suite is among the easiest to use for undergraduate students and newcomers to protein crystallography, with a good balance between ease of use and control of the structure solution process.
Any successful protein structure solution starts with a good data set. An ideal data set takes advantage of the full diffracting power of the crystal. One should take care to collect as complete a data set as possible, with good redundancy so that reflections will have good statistical weighting. Data processing software can detect many more significant reflections than are visible to the eye, so err on the side of collecting data to 0.1-0.2 Å higher resolution than visual inspection suggests. If time is available, it is not unrealistic to collect frames of up to 30 minutes each on a home source, especially for weakly diffracting crystals. Most data collection software will process and analyze your data on the fly. On a home source, strive to collect a data set that achieves or exceeds these benchmarks:
Typically, the data collection software will output an unsorted data set of Ihkl values (intensities), with one value for each measured reflection in the data set. The data should be output in a format readable by your processing software. For CCP4, this should be the binary .mtz format.
Unsorted data sets of Ihkl values should be sorted, and multiple measurements of each Ihkl averaged to give unique Ihkl ± σ(Ihkl) values. The σ(Ihkl) values can be used as weighting factors when the data are used for structure solution. The CCP4 programs sortmtz and scala can perform this task, as can the CCP4 program aimless.
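The averaging step can be illustrated with a minimal Python sketch. It assumes an inverse-variance weighted mean and ignores scaling, symmetry-equivalent reflections, and outlier rejection, all of which real merging programs handle. The function name merge_reflections and the tuple layout are hypothetical and chosen only for illustration. This is not how scala or aimless are implemented.

```python
from collections import defaultdict
from math import sqrt


def merge_reflections(observations):
    """Average repeated measurements of each reflection.

    observations: iterable of (h, k, l, intensity, sigma) tuples.
    Returns a dict mapping (h, k, l) to (mean intensity, sigma of the mean),
    using inverse-variance weights (an assumption made for this sketch).
    """
    groups = defaultdict(list)
    for h, k, l, intensity, sigma in observations:
        groups[(h, k, l)].append((intensity, sigma))

    merged = {}
    for hkl in sorted(groups):  # sorted output, one entry per unique hkl
        weights = [1.0 / (sigma * sigma) for _, sigma in groups[hkl]]
        weight_sum = sum(weights)
        mean_i = sum(w * i for w, (i, _) in zip(weights, groups[hkl])) / weight_sum
        merged[hkl] = (mean_i, sqrt(1.0 / weight_sum))
    return merged


observations = [
    (1, 0, 0, 100.0, 10.0),
    (0, 1, 0, 50.0, 5.0),
    (1, 0, 0, 110.0, 10.0),
]
merged = merge_reflections(observations)
```

Two equally precise measurements of the same reflection average to their simple mean, and the uncertainty of that mean shrinks by a factor of √2. This is why higher redundancy gives better statistical weighting.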
Values of Ihkl must be converted to |Fhkl| ± σ(|Fhkl|) for use in data fitting and electron density map calculation. The σ(|Fhkl|) values are used as weighting factors when the data are used for structure solution. This conversion normally includes Bayesian methods to deal with the negative values of Ihkl that are encountered for weak reflections. The CCP4 program ctruncate can perform this task, and can be called as an option from either scala or aimless.
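To see why negative intensities need special treatment, consider the naive conversion |F| = √I, with σ(|F|) ≈ σ(I) / (2|F|) from first-order error propagation. The Python sketch below implements only this naive conversion. It is not the Bayesian procedure that ctruncate applies, and the function name naive_amplitude is hypothetical.

```python
from math import sqrt


def naive_amplitude(intensity, sigma_i):
    """Convert an intensity to an amplitude by a plain square root.

    Returns (|F|, sigma(|F|)), or None when the intensity is zero or
    negative, where the square root (or the propagated error) is undefined.
    A Bayesian treatment is used in practice so that such weak reflections
    still yield usable amplitude estimates.
    """
    if intensity <= 0:
        return None
    amplitude = sqrt(intensity)
    sigma_f = sigma_i / (2.0 * amplitude)  # first-order error propagation
    return amplitude, sigma_f


strong = naive_amplitude(100.0, 10.0)  # well-measured reflection
weak = naive_amplitude(-5.0, 3.0)      # weak reflection measured as negative
```

The naive approach fails outright for the weak reflection. Simply discarding such measurements would bias the data, which is the motivation for the Bayesian step.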
Some of your data should be set aside to guard against model bias and overfitting. These data are never used during structure refinement, but are used to independently monitor how well your structure conforms to the reflection data. Typically 5% of your data should be set aside for this purpose, but 1000 or so reflections are sufficient even if this is less than 5% of the total. It is merely necessary to have a random, statistically significant quantity of data for this purpose. Typically R (the residual error of the fitted data) and Rfree (the residual error of the set-aside data) should agree within 5% or so for an unbiased structure solution.
To solve a protein structure, phase information is required to complement the measured reflection intensities. There are myriad ways of obtaining phase information, but the overwhelming majority of protein structures are solved by molecular replacement.
Molecular replacement (MR) involves choosing a protein whose structure is known, and which has significant sequence similarity to the target protein, to obtain provisional phase information to solve the structure. Briefly, an MR protein is placed in an optimal way in the known unit cell, consistent with the collected data, and provisional phases are calculated from that molecular arrangement. The MR phases are combined with experimental intensity information (expressed as |Fhkl|) to calculate electron density, which can be used to better place the target protein in the unit cell. The CCP4 program phaser is excellent at finding MR solutions.
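The set-aside scheme and the two residuals can be sketched in Python as follows. The sketch assumes that the observed and calculated amplitudes are already on the same scale, and it uses the conventional definition R = Σ| |Fobs| − |Fcalc| | / Σ|Fobs|. The function names, the fixed random seed, and the cap of 1000 reflections as a default argument are illustrative choices, not the behavior of any CCP4 program.

```python
import random


def assign_free_flags(n_reflections, fraction=0.05, max_free=1000, seed=0):
    """Randomly flag about 5% of reflections as the free set, capped at max_free."""
    n_free = min(round(n_reflections * fraction), max_free)
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    free = set(rng.sample(range(n_reflections), n_free))
    return [i in free for i in range(n_reflections)]


def r_factor(f_obs, f_calc):
    """Residual between observed and calculated amplitudes (same scale assumed)."""
    return sum(abs(o - c) for o, c in zip(f_obs, f_calc)) / sum(f_obs)


def r_work_and_free(f_obs, f_calc, free_flags):
    """Return (R, Rfree): R over the working set, Rfree over the set-aside data."""
    work = [(o, c) for o, c, free in zip(f_obs, f_calc, free_flags) if not free]
    held = [(o, c) for o, c, free in zip(f_obs, f_calc, free_flags) if free]
    r_work = r_factor(*zip(*work))
    r_free = r_factor(*zip(*held))
    return r_work, r_free


flags_small = assign_free_flags(2000)    # 5% of 2000 is 100 reflections
flags_large = assign_free_flags(50000)   # 5% would be 2500, capped at 1000
example_r = r_factor([10.0, 20.0, 30.0], [9.0, 22.0, 30.0])
```

Because the free reflections never influence refinement, a gap between R and Rfree that keeps widening is a warning sign of overfitting.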
Choosing a molecular replacement model
A good MR model should have as much sequence similarity as possible with the target protein. MR models with >40% sequence identity are nearly always successful. MR models with 30-40% sequence identity often work but may be challenging. With MR models of less than 30% identity, finding a suitable solution is frequently challenging or impossible. MR models may be used as is, or with side chains truncated to reflect differences in sequence between the target and the MR model. The CCP4 program chainsaw can truncate MR models in various ways. Normally, only the protein component of the MR model is used for phasing. Water molecules and cofactors should normally be omitted.
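As a rough aid for screening candidate MR models, the identity thresholds above can be expressed in a short Python sketch. It assumes the two sequences have already been aligned with some other tool, and it computes identity over columns where neither sequence has a gap. Other conventions for the denominator exist and give slightly different percentages. The function names and the example sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


def mr_outlook(identity):
    """Map a percent identity onto the rules of thumb given in the text."""
    if identity > 40:
        return "nearly always successful"
    if identity >= 30:
        return "often works but may be challenging"
    return "frequently challenging or impossible"


# Made-up aligned fragments, used only to exercise the functions.
identity = percent_identity("MKT-AL", "MRTGAL")
outlook = mr_outlook(identity)
```

These categories are rules of thumb rather than guarantees. The quality of the resulting electron density map remains the real test of an MR solution.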
Checking the solution
A good MR solution should produce an initial electron density map that is interpretable and generally conforms to the protein model. In addition, symmetry mates should pack together reasonably (evidence of protein-protein contacts and no or minimal molecular overlap) and show clear evidence of solvent channels. Finally, known cofactors missing from the model (e.g. metal ions) should clearly show up in difference maps if the solution is correct.
The initial phase solution (e.g., MR model) is the starting point for structure refinement. Normally the first step is to carry out an initial refinement to improve phases. The CCP4 program refmac can carry out refinement tasks.
The full structure of the protein will include the protein itself, cofactors (e.g. metal ions, organic molecules), ordered water molecules, and possibly additional ligands (counterions, buffer molecules, cryoprotectants, and/or other molecules). As the model becomes more accurate, phases improve, and electron density becomes clearer, allowing visualization of additional molecular details. The following sequence of rebuilding is suggested. Several cycles of rebuilding and refinement may be required for each step as electron density maps improve. Coot (WinCoot for Windows) is the tool of choice for inspecting and rebuilding structures. R and Rfree should decrease after each rebuilding step. It is important to get the protein, known cofactors, and solvent shell right before trying to interpret electron density for bound ligands.
You are not done yet! Your model should be checked for adherence to geometric standards for bond angles and lengths, and for unlikely side chain conformations. Any anomalies should be inspected closely and either fixed or verified. A negative result for a validation check does not mean your structure is wrong, merely that there is an unusual feature that is unlikely to be present unless clearly justified by the electron density. Here is a suggested list of things to be checked. Most can be done in Coot.
Re-refine the structure after each validation inspection and fix application. This is a tedious but important process to ensure that your structure is the highest possible quality. Once you have completed all validation checks, you are ready for deposition and hopefully peer-reviewed publication!