Protein Structure Solution Road Map

Roger S. Rowlett

Gordon & Dorothy Kline Professor, Emeritus

Colgate University Department of Chemistry

This page describes typical steps and benchmarks for solving a protein structure. This is not a detailed description of protein structure solution, but rather a general guide to how to proceed. Details about how to use specific tools and software packages to accomplish these tasks are described in the additional pages. There are several popular software suites for solving protein structures, including CCP4, CNS, and Phenix. All of these have been used in our lab. However, we have found that the traditional CCP4i interface to the CCP4 suite is among the easiest to use for undergraduate students and newcomers to protein crystallography, with a good balance between ease of use and control of the structure solution process.

Collecting a data set

Any successful protein structure solution starts with a good data set. An ideal data set takes advantage of the full diffracting power of the crystal. One should take care to collect as complete a data set as possible, with good redundancy so that reflections will have good statistical weighting. Data processing software can see many more significant reflections than seem visible to the eye, so err on the side of collecting data to a 0.1-0.2 Å higher resolution than suggested by visual inspection. If there is time available, it is not unrealistic to collect frames of up to 30 minutes each on a home source, especially for weakly diffracting crystals. Most data collection software will process and analyze your data on the fly. On a home source, strive to collect a data set that achieves or exceeds these benchmarks:

  • Completeness >90% in all shells (the innermost shell may be less if the beam stop is large, especially in dual beam systems)
  • Rsym < 0.10 overall
  • Mosaicity <1.0° and comparable in all three axes
  • High redundancy (>4.0) in all shells
  • I/σ(I) > 10.0 overall and 1.0-2.0 in last shell (don't be afraid to go farther as useful data may be present in shells with I/σ(I) as low as 0.5)
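The benchmarks above can be turned into a quick self-check. The following Python sketch is only an illustration: the DataSetStats container and the check_benchmarks function are hypothetical names, not part of CCP4 or any data collection package, and you would fill in the numbers by hand from your processing log. The last-shell check is one reading of the guidance above: a last-shell I/σ(I) above 2.0 is flagged because the crystal may diffract further, while low values are not flagged.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataSetStats:
    """Summary statistics copied from a processing log (hypothetical container)."""
    min_shell_completeness: float   # percent, lowest value over all shells
    rsym_overall: float             # fractional, e.g. 0.07
    mosaicity: float                # degrees
    min_shell_redundancy: float     # lowest redundancy over all shells
    i_over_sigma_overall: float
    i_over_sigma_last_shell: float


def check_benchmarks(stats: DataSetStats) -> List[str]:
    """Return one warning per home-source benchmark that is not met."""
    warnings = []
    if stats.min_shell_completeness <= 90.0:
        warnings.append("completeness is not >90% in every shell")
    if stats.rsym_overall >= 0.10:
        warnings.append("overall Rsym is not <0.10")
    if stats.mosaicity >= 1.0:
        warnings.append("mosaicity is not <1.0 degrees")
    if stats.min_shell_redundancy <= 4.0:
        warnings.append("redundancy is not >4.0 in every shell")
    if stats.i_over_sigma_overall <= 10.0:
        warnings.append("overall I/sigma(I) is not >10.0")
    if stats.i_over_sigma_last_shell > 2.0:
        # Assumed interpretation: strong last shell means data were cut short.
        warnings.append("last-shell I/sigma(I) is >2.0; consider higher resolution")
    return warnings


if __name__ == "__main__":
    stats = DataSetStats(95.0, 0.06, 0.4, 5.2, 18.0, 1.5)
    print(check_benchmarks(stats) or "all benchmarks met")
```

An empty list means every benchmark is met. The innermost-shell exception for a large beam stop is not modeled here, so treat a completeness warning as a prompt to look at the shell table rather than as a verdict.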

Typically, the data collection software will output an unsorted data set of Ihkl values (intensities), with one value for each measured reflection in the data set. The data should be output in a format readable by your processing software. For CCP4, this should be a binary .mtz format.

Preparing data for structure solution

Merging

Unsorted data sets of Ihkl values should be sorted, and multiple measurements of each Ihkl averaged to give unique Ihkl ± σ(Ihkl) values. The σ(Ihkl) values can be used as weighting factors when the data are used for structure solution. The CCP4 programs sortmtz and scala can perform this task, as can the CCP4 program aimless.
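To make the idea of sorting and averaging concrete, here is a minimal Python sketch that groups repeated measurements by their hkl index and combines them with an inverse-variance weighted mean. This is a simplified illustration and not the algorithm used by scala or aimless: it assumes the measurements are already on a common scale, it does not merge symmetry-equivalent reflections, and it does no outlier rejection. The function name merge_reflections is hypothetical.

```python
import math
from collections import defaultdict


def merge_reflections(observations):
    """Merge repeated measurements of each reflection.

    observations: iterable of ((h, k, l), intensity, sigma) tuples.
    Returns a dict, ordered by hkl, mapping (h, k, l) to
    (weighted mean intensity, sigma of the weighted mean).
    """
    grouped = defaultdict(list)
    for hkl, intensity, sigma in observations:
        grouped[hkl].append((intensity, sigma))

    merged = {}
    for hkl in sorted(grouped):
        # Inverse-variance weights: precise measurements count for more.
        weights = [1.0 / sigma ** 2 for _, sigma in grouped[hkl]]
        total_weight = sum(weights)
        mean_i = sum(w * i for w, (i, _) in zip(weights, grouped[hkl])) / total_weight
        merged[hkl] = (mean_i, math.sqrt(1.0 / total_weight))
    return merged


if __name__ == "__main__":
    data = [((1, 0, 0), 100.0, 10.0), ((0, 1, 0), 50.0, 5.0), ((1, 0, 0), 110.0, 10.0)]
    for hkl, (mean_i, sigma) in merge_reflections(data).items():
        print(hkl, round(mean_i, 2), round(sigma, 2))
```

Note how the merged σ for a reflection measured twice is smaller than either individual σ, which is why high redundancy improves the statistical weighting of the data.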

Conversion to structure factors

Values of Ihkl must be converted to |Fhkl| ± σ(|Fhkl|) for use in data fitting and electron density map calculation. The σ(|Fhkl|) values are used as weighting factors when the data are used for structure solution. This conversion normally includes Bayesian methods to deal with negative values of Ihkl that are encountered in weak reflections. The CCP4 program ctruncate can perform this task, and can be called as an option from either scala or aimless.
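The Python sketch below shows the naive version of this conversion, |F| = √I with σ(|F|) obtained by simple error propagation, mainly to illustrate why weak reflections need special handling. It is not what ctruncate does: the naive formula has no answer for zero or negative intensities, and its σ estimate becomes unreliable as I approaches zero, which is the gap the Bayesian treatment fills. The function name is hypothetical.

```python
import math


def naive_intensity_to_amplitude(intensity, sigma_i):
    """Convert I and sigma(I) to |F| and sigma(|F|) by simple error propagation.

    Returns None for zero or negative intensities, where the square root
    is undefined or the propagated sigma diverges.
    """
    if intensity <= 0.0:
        return None
    amplitude = math.sqrt(intensity)
    # d|F|/dI = 1 / (2 * sqrt(I)), so sigma(|F|) = sigma(I) / (2 * |F|)
    sigma_f = sigma_i / (2.0 * amplitude)
    return amplitude, sigma_f


if __name__ == "__main__":
    print(naive_intensity_to_amplitude(100.0, 10.0))   # a strong reflection
    print(naive_intensity_to_amplitude(-5.0, 3.0))     # a weak reflection measured as negative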

Establish free R data

Some of your data should be set aside to guard against model bias and overfitting. This data is never used during structure refinement, but is used to independently monitor the conformance of your structure to the reflection data. Typically 5% of your data should be set aside for this purpose, but 1000 or so reflections are sufficient even if this is less than 5% of the total. It's merely necessary to have a random, statistically significant quantity of data for this purpose. Typically R (the residual error of the fitted data) and Rfree (the residual error of the set-aside data) should agree within 5% or so for an unbiased structure solution.
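As an illustration of the bookkeeping involved, the Python sketch below picks a random free set (5% of the reflections, capped at about 1000) and computes a residual using the standard definition R = Σ| |Fobs| - |Fcalc| | / Σ|Fobs|. The function names, the default cap, and the fixed random seed are assumptions made for this example, and the R calculation omits the scale factor that refinement programs apply between observed and calculated amplitudes. In practice the CCP4 tools assign the free R flags for you.

```python
import random


def assign_free_flags(n_reflections, fraction=0.05, max_free=1000, seed=0):
    """Return a list of booleans, True for reflections set aside for Rfree."""
    n_free = min(round(n_reflections * fraction), max_free)
    rng = random.Random(seed)  # fixed seed so the same set is chosen every time
    free_indices = set(rng.sample(range(n_reflections), n_free))
    return [i in free_indices for i in range(n_reflections)]


def r_factor(f_obs, f_calc):
    """Residual between observed and calculated amplitudes (no scaling applied)."""
    return sum(abs(o - c) for o, c in zip(f_obs, f_calc)) / sum(f_obs)


def r_and_rfree(f_obs, f_calc, free_flags):
    """Compute R over the working set and Rfree over the set-aside reflections."""
    work = [(o, c) for o, c, free in zip(f_obs, f_calc, free_flags) if not free]
    free = [(o, c) for o, c, free in zip(f_obs, f_calc, free_flags) if free]
    r_work = r_factor(*zip(*work))
    r_free = r_factor(*zip(*free))
    return r_work, r_free


if __name__ == "__main__":
    flags = assign_free_flags(50000)
    print(sum(flags), "reflections set aside out of", len(flags))
```

The important point is that the free set is chosen once, before any refinement, and the same flags are carried through every later step.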

Phase Solution

To solve a protein structure, phase information is required to complement the measured reflection intensities. There are myriad ways of obtaining phase information, but the overwhelming majority of protein structures are solved by molecular replacement.

Molecular replacement

Molecular replacement (MR) involves choosing a protein whose structure is known, and which has significant sequence similarity to the target protein, to obtain provisional phase information to solve the structure. Briefly, an MR protein is placed in an optimal way in the known unit cell consistent with the collected data, and provisional phases are calculated from that molecular arrangement. The MR phases are combined with experimental intensity information (expressed as |Fhkl|) to calculate electron density, which can be used to better place the target protein in the unit cell. The CCP4 program phaser is excellent at finding MR solutions.

Choosing a molecular replacement model

A good MR model should have as much sequence similarity as possible with the target protein. MR models with >40% sequence identity are nearly always successful. MR models with 30-40% sequence identity often work but may be challenging. With MR models of less than 30% identity, finding a suitable solution is frequently challenging or impossible. MR models may be used as is, or with side chains truncated to reflect differences in sequence between the target and the MR model. The CCP4 program chainsaw can truncate MR models in various ways. Normally, only the protein component of the MR model is used for phasing. Water molecules and cofactors should normally be omitted.
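The identity thresholds above are easy to apply once you have a pairwise alignment of the target and the candidate model. The Python sketch below assumes the two sequences have already been aligned by some other tool and are given as equal-length strings with "-" for gaps. Percent identity is computed over the aligned (non-gap) columns, which is only one of several conventions in use, and both function names are hypothetical.

```python
def percent_identity(aligned_target, aligned_model):
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_target) != len(aligned_model):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(aligned_target, aligned_model)
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


def mr_outlook(identity):
    """Map percent identity onto the rough expectations described above."""
    if identity > 40.0:
        return "nearly always successful"
    if identity >= 30.0:
        return "often works but may be challenging"
    return "frequently challenging or impossible"


if __name__ == "__main__":
    identity = percent_identity("ACDEFGHIK-", "ACDQFGHLKM")
    print(f"{identity:.1f}% identity: {mr_outlook(identity)}")
```

These categories are rules of thumb, not guarantees, so a model near a boundary is still worth trying.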

Checking the solution

A good MR solution should produce an initial electron density map that is interpretable and generally conforms to the protein model. In addition, symmetry mates should pack together reasonably (evidence of protein-protein contacts and no or minimal molecular overlap), and show clear evidence of solvent channels. Finally, missing known cofactors (e.g. metal ions) should clearly show up in difference maps if the solution is correct.

Rebuilding and refinement

The initial phase solution (e.g., MR model) is the starting point for structure refinement. Normally the first step is to carry out an initial refinement to improve phases. The CCP4 program refmac can carry out refinement tasks.

Rebuilding

The full structure of the protein will include the protein itself, cofactors (e.g. metal ions, organic molecules), ordered water molecules, and possibly additional ligands (counterions, buffer molecules, cryoprotectants, and/or other molecules). As the model becomes more accurate, phases improve, and electron density becomes clearer, allowing visualization of additional molecular details. The following sequence of rebuilding is suggested. Several cycles of rebuilding and refinement may be required for each step as electron density maps improve. Coot (WinCoot for Windows) is the tool of choice for inspecting and rebuilding structures. R and Rfree should decrease after each rebuilding step. It is important to get the protein, known cofactors, and solvent shell right before trying to interpret electron density for bound ligands.

  1. Step through the protein residue by residue, altering amino acids as necessary to match the target protein sequence, and provisionally orienting side chains into the electron density. Occasional peptide flips may be necessary in the main chain as well to conform with electron density. Save, refine, and reinspect until satisfied.
  2. Add known cofactors (e.g. metal ions), make any explicit links between protein and cofactors, and save, refine, and reinspect as needed.
  3. Add water molecules to the structure. The initial addition can be done using the Findwaters option in Coot or in refmac. Refine and reinspect. Manually add or subtract waters as justified by the electron density, refining and reinspecting as necessary until satisfied.
  4. Evaluate remaining bits of difference density in the electron density maps and decide what, if any, ligands are compatible with the observed maps. It is common to find monoatomic and polyatomic ions, buffer molecules, cryoprotectants, and other materials found in the crystallization matrix. Use chemical judgment to make reasonable choices. Refine and reinspect as necessary.
  5. When the R-factor can be reduced no further, and all reasonable difference density is accounted for, call it a day (or two or three or more). Depending on the quality of the original data the final R-factor is typically in the range of 10-24%. Higher than this may suggest problems with the structure solution.
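It helps to keep a simple log of R and Rfree after each rebuilding and refinement cycle. The Python sketch below reviews such a log against the guidance on this page: both values should fall from cycle to cycle, R and Rfree should stay within about 5% of each other, and a final R above roughly 24% may indicate a problem. The function name is hypothetical, the values are typed in by hand from your refinement output, and "within 5%" is interpreted here as an absolute difference of 0.05, which is an assumption.

```python
def review_cycles(history, max_gap=0.05, max_final_r=0.24):
    """Review a list of (R, Rfree) pairs, one per refinement cycle.

    Values are fractions (0.22 means 22%). Returns a list of notes
    describing anything that departs from the expected behavior.
    """
    notes = []
    if not history:
        return notes

    # Compare each cycle with the one before it.
    for cycle, (prev, curr) in enumerate(zip(history, history[1:]), start=2):
        if curr[0] >= prev[0] or curr[1] >= prev[1]:
            notes.append(f"cycle {cycle}: R or Rfree did not decrease")

    r_last, rfree_last = history[-1]
    if rfree_last - r_last > max_gap:
        notes.append("final Rfree exceeds R by more than the allowed gap")
    if r_last > max_final_r:
        notes.append("final R is above the typical range")
    return notes


if __name__ == "__main__":
    log = [(0.35, 0.38), (0.28, 0.31), (0.22, 0.25)]
    print(review_cycles(log) or "no concerns")
```

A note from this check is a reason to go back and inspect the most recent changes, not proof that the model is wrong.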

Validation

You are not done yet! Your model should be checked for adherence to geometric standards for bond angles and lengths, and for unlikely side chain conformations. Any anomalies should be inspected closely and either fixed or verified. A negative result for a validation check does not mean your structure is wrong, merely that there is an unusual feature that is unlikely to be present unless clearly justified by the electron density. Here is a suggested list of things to be checked. Most can be done in Coot.

  • Ramachandran plot - inspect all residues with unusual phi-psi angles. Fix or confirm as necessary.
  • Incorrect chiral volumes - this check ensures you did not accidentally mangle a residue from the expected L-configuration to its enantiomeric D-configuration during rebuilding and refinement. Fix any errors as necessary.
  • Check/Delete waters - this check makes suggestions about the validity of water molecules added to your model. Inspect all suggestions and verify or delete suspect waters.
  • Geometry analysis - this check flags residues with unusual conformations. Inspect and fix as necessary.
  • GLN and ASN B-factor outliers - this check flags Asn and Gln residues whose side chains may need to be flipped 180 degrees to better satisfy electron density or hydrogen bonding contacts.
  • Rotamer analysis - this check flags residues that have unusual rotamer conformations. Inspect all outliers and fix as necessary. This check will frequently identify Val and Leu side chains that are flipped the "wrong" way.

Re-refine the structure after each validation inspection and fix application. This is a tedious but important process to ensure that your structure is the highest possible quality. Once you have completed all validation checks, you are ready for deposition and hopefully peer-reviewed publication!