Phase Solution

Roger S. Rowlett

Gordon & Dorothy Kline Professor, Emeritus

Colgate University Department of Chemistry

The most difficult problem in this modeling process is obtaining information about the phase of the observed reflections. (The intensities are accurately measured in your experimental data set.) In order to produce accurate electron density maps, it is essential to have both accurate intensity and phase information. Approximate phases can be obtained by collecting additional data on heavy atom derivatives of the same protein (multiple isomorphous replacement), by examining anomalous scattering of endogenous heavy atoms in the protein (useful for certain metalloenzymes or selenomethionine-substituted proteins), or by using a starting model derived from a homologous protein (molecular replacement).

The simplest method of obtaining phase estimates for X-ray diffraction data analysis is molecular replacement, which involves building a provisional model of the target protein based on the structure of a highly homologous protein, and placing it in the appropriate orientation in the unit cell. The initial phases are calculated based on the positions of all the atoms in the molecular replacement model, and such phases are often sufficient to obtain a usable electron density map that can be used to refine the structure of the target protein. Two excellent tools for solving structures by molecular replacement are EPMR and Phaser, both of which are detailed here. Phaser is the simplest to use, and is the preferred method as it is both fast and efficient. EPMR may be a good alternative in situations where many search models must be placed in the asymmetric unit, and Phaser is unable to find a solution.

Constructing a molecular replacement model

For any molecular replacement solution, it is necessary to construct a reasonable molecular replacement search model. Select a molecular replacement protein that is as homologous as possible to the target protein, and examine a sequence alignment of the two proteins. A molecular replacement solution may be possible if the proteins are more than 30% identical. The molecular replacement protein should be modified as follows to make it as similar as possible to the target protein:

  • If the molecular replacement protein has extra residues, either internally or at the N- or C-termini, remove them.
  • Leave as is any residues that are identical in both proteins
  • For mismatches, change the molecular replacement residue to Ala except:
    • Pro, Gly or Ala residues in molecular replacement model should be left as is
    • Gly should be used where Gly appears in the target protein
    • No substitution is necessary for Asn/Asp or Gln/Glu
    • Phe in the molecular replacement protein is allowed to substitute for Tyr in the target protein
    • Val in the molecular replacement protein is allowed to substitute for Ile in the target protein
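The substitution rules above can be written as a small lookup. The following Python sketch is illustrative only: the function names, the use of one-letter codes with "-" for alignment gaps, the reading of the Asn/Asp and Gln/Glu rule as applying in both directions, and the order in which the exceptions are checked are all assumptions, not behavior prescribed by Coot or CHAINSAW.

```python
from typing import Optional

# Model residues that are left untouched even when they mismatch the target.
KEEP_AS_IS = {"P", "G", "A"}
# (model, target) pairs for which no substitution is needed.
TOLERATED = {("N", "D"), ("D", "N"), ("Q", "E"), ("E", "Q"), ("F", "Y"), ("V", "I")}


def search_model_residue(model: str, target: str) -> Optional[str]:
    """Return the residue to keep in the search model, or None to drop the position."""
    if model == "-":
        return None  # the model has no residue here, so there is nothing to edit
    if target == "-":
        return None  # extra residue in the model: remove it
    if model == target:
        return model  # identical residues are left as is
    if model in KEEP_AS_IS:
        return model  # Pro, Gly or Ala in the model stay unchanged
    if target == "G":
        return "G"  # use Gly wherever the target has Gly
    if (model, target) in TOLERATED:
        return model  # Asn/Asp, Gln/Glu, Phe for Tyr, Val for Ile
    return "A"  # every other mismatch becomes Ala


def build_search_sequence(model_aln: str, target_aln: str) -> str:
    """Apply the rules column by column to two aligned, equal-length sequences."""
    if len(model_aln) != len(target_aln):
        raise ValueError("aligned sequences must have the same length")
    kept = (search_model_residue(m, t) for m, t in zip(model_aln, target_aln))
    return "".join(r for r in kept if r is not None)
```

The output is only the intended residue sequence of the edited search model; the coordinate edits themselves would still be made in Coot or CHAINSAW.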

The necessary modifications can be easily made using Coot or (even easier) using CHAINSAW in CCP4. For search models that have extra loops or deletions compared to the target protein, the Phyre server is an excellent way to build a reasonable search model. Upload a target sequence, and Phyre will return an ensemble model from the best available sequence matches in the PDB. Phyre will return a monomer, so you may have to create oligomer search models by aligning this chain with various chains in an oligomeric version of your best search model in Pymol or some other program, and combining these orientations in a text file.


Note: for solving the structure of mutant proteins, the ideal search model is an existing solved structure of the wild-type protein. No modifications need be made to the residues of the molecular replacement model in this case.


For the purpose of generating an initial electron density map it is probably wise to remove all cofactors (e.g., coenzymes, metal ions), bound species (e.g., buffers, solvents, ions), and solvent.
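One simple way to strip cofactors, bound species, and solvent is to drop the HETATM records from the PDB file. The Python sketch below assumes a standard fixed-column PDB file; the function name is hypothetical. Note that modified residues such as selenomethionine are also stored as HETATM records and would be removed too, and that any CONECT records are left untouched.

```python
from typing import Iterable, List


def strip_heteroatoms(pdb_lines: Iterable[str]) -> List[str]:
    """Drop HETATM records (and their ANISOU records) and keep everything else."""
    kept = []
    dropped_previous = False  # True while the last coordinate record was a HETATM
    for line in pdb_lines:
        record = line[:6]
        if record == "HETATM":
            dropped_previous = True
            continue
        if record == "ANISOU" and dropped_previous:
            continue  # anisotropic B-factors belonging to a removed atom
        if record == "ATOM  ":
            dropped_previous = False
        kept.append(line)
    return kept
```

The same result can be obtained interactively in Coot or PyMOL; the script is only convenient when many search models must be prepared.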

Finding molecular replacement solutions using EPMR

Before an electron density map can be generated, it is necessary to place the search model (molecular replacement protein) in the appropriate location of the unit cell. There are a number of programs capable of doing this, but among the best is EPMR, the instructions for which are described here.

File preparation

The first task is to convert the structure factor file from MTZ format to a format readable by EPMR. This task can be accomplished in the CCP4i environment by choosing the task Convert from MTZ in the Reflection Data Utilities menu. The CCP4i task window for carrying out these actions is shown in Figure 2. You should probably exclude reflections that are marked for Free R calculation.

Figure 2. Convert from MTZ task window. Required fields are highlighted in color. Data fields in MTZ file that are to be converted to user-defined format are listed in the MTZ File Labels section.


EPMR also requires an additional file that contains the unit cell dimensions and the space group number. This file should contain a single line giving the values of a, b, c, α, β, γ, and the International Tables space group number, separated by spaces. The unit cell parameters and space group number can be found in the log file of truncate. Give the file the .cel extension. File 5 is an example for a C2 crystal (space group #5):

File 5

epmr .cel file

232.66 144.73 52.41 90 93.96 90 5
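If many data sets are processed, the .cel file can be written by a short script instead of by hand. The Python sketch below follows the single-line layout described above; the function names are hypothetical, and the number formatting (up to seven significant figures, trailing zeros dropped) is an assumption chosen to reproduce File 5.

```python
def format_cel_line(a, b, c, alpha, beta, gamma, space_group_number):
    """Return 'a b c alpha beta gamma space_group_number' separated by spaces."""
    cell = " ".join(f"{float(value):.7g}" for value in (a, b, c, alpha, beta, gamma))
    return f"{cell} {int(space_group_number)}"


def write_cel_file(path, a, b, c, alpha, beta, gamma, space_group_number):
    """Write the one-line unit cell file; 'path' should end in .cel."""
    line = format_cel_line(a, b, c, alpha, beta, gamma, space_group_number)
    with open(path, "w") as handle:
        handle.write(line + "\n")
```

Calling format_cel_line(232.66, 144.73, 52.41, 90, 93.96, 90, 5) gives the line shown in File 5.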


Running EPMR

EPMR uses an efficient evolutionary search algorithm to find one of many good fits of the search model to the reflection data during each trial. The search is repeated for many trials, starting with different initial orientations of the search model. The result of the best of these trials is assumed to be (and often is) close to the global best fit, providing a good model for estimating phase data and constructing the first electron density map. The program is customizable by including various switches in the command line, some of which are outlined below:

  • -o filename sets the stem for the filenames of the output PDB files, which will look something like filename.1.best.pdb.
  • -mn instructs EPMR to place n molecules of the search model into the unit cell. The default is to place one molecule in the unit cell.
  • -tn instructs EPMR to use the correlation coefficient n as the cutoff value for determining what is a satisfactory molecular replacement solution. When placing more than one molecule in the unit cell, it is usually desirable to set this value to 1.0 to force a more exhaustive search for the best fit for the first molecule placed. This often improves the chance of success for finding a satisfactory solution for multiple placements. The default is a correlation coefficient of 0.45 for one molecule or 0.30 for the first of multiple molecules.
  • -hn gives EPMR the high-resolution limit of data to be used in the search. The default value is 4 Å. Occasionally, using slightly higher resolution data can help find a satisfactory solution. This value should normally be set to 5 Å or higher resolution.
  • -ln gives EPMR the low-resolution limit of data to be used in the search. The default value is 15 Å. If accurately measured low resolution reflections are available, including data out to 25-30 Å can be useful.
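The idea of repeated evolutionary trials can be illustrated with a toy example. The Python sketch below is not EPMR's actual algorithm: the population size, mutation width, selection scheme, and all function names are assumptions chosen for brevity. In a real search the score would be the correlation coefficient between observed and calculated structure factors, and the parameters would describe the rotation and translation of the search model.

```python
import random
from typing import Callable, List, Sequence, Tuple

Score = Callable[[Sequence[float]], float]


def evolve(score: Score, n_params: int, rng: random.Random,
           population: int = 30, generations: int = 40,
           sigma: float = 0.3) -> Tuple[List[float], List[float]]:
    """One trial: return the best parameter set and the best score per generation."""
    pop = [[rng.uniform(-1.0, 1.0) for _ in range(n_params)] for _ in range(population)]
    history = []
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        history.append(score(pop[0]))
        survivors = pop[: population // 2]  # keep the better half unchanged
        children = [[p + rng.gauss(0.0, sigma) for p in parent] for parent in survivors]
        pop = survivors + children  # mutated copies replace the worse half
    best = max(pop, key=score)
    history.append(score(best))
    return best, history


def best_of_trials(score: Score, n_params: int, trials: int = 5, seed: int = 0):
    """Repeat the search from different random starts and keep the best trial."""
    results = [evolve(score, n_params, random.Random(seed + t)) for t in range(trials)]
    return max(results, key=lambda result: result[1][-1])
```

Because the best member of each generation always survives, the best score within a trial never decreases; running several trials from different starting points reduces the risk of reporting a poor local optimum.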

The general format for invoking the program is:

epmr -o filestem filename.cel filename.pdb filename.epmr

where filestem is the stem of the output PDB filename, filename.cel is the unit cell information file, filename.pdb is the molecular replacement search model in PDB format, and filename.epmr is the reflection list file in EPMR format. The command line, which can be quite long, is best put into an executable Linux script file named epmr.sh, an example of which is shown in File 6. The command can be invoked to run in the background by typing epmr.sh & at the prompt.


File 6

A typical EPMR executable file

epmr -m3 -t1.0 -o 3dimer hica08.cel dimer.pdb hica08.epmr > 3dimer.log


The script in File 6 will do an exhaustive search (correlation coefficient of 1.0) to place 3 molecules of dimer.pdb in the unit cell described by hica08.cel, using hica08.epmr reflection data. The best fits for the three placed dimers will be written out as 3dimer.1.best.pdb, 3dimer.2.best.pdb, and 3dimer.3.best.pdb. The real-time output of the program will be sent to the file 3dimer.log, which can be monitored by using the tail -f command. EPMR, even as efficient as it is, will take a substantial amount of time to find a molecular replacement solution for a large unit cell, especially if multiple molecules must be placed.
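When several searches with different settings are queued, it can help to assemble the command line programmatically. The Python sketch below builds an argument list using only the switches described in this section (-m, -t, -h, -l, -o); the function and parameter names are hypothetical, and the sketch does not itself run EPMR.

```python
from typing import List, Optional


def build_epmr_command(stem: str, cel_file: str, pdb_file: str, reflection_file: str,
                       n_molecules: Optional[int] = None,
                       cc_cutoff: Optional[float] = None,
                       high_res: Optional[float] = None,
                       low_res: Optional[float] = None) -> List[str]:
    """Return the EPMR argument list; switches left as None keep EPMR's defaults."""
    args = ["epmr"]
    if n_molecules is not None:
        args.append(f"-m{n_molecules}")
    if cc_cutoff is not None:
        args.append(f"-t{cc_cutoff}")
    if high_res is not None:
        args.append(f"-h{high_res}")
    if low_res is not None:
        args.append(f"-l{low_res}")
    args += ["-o", stem, cel_file, pdb_file, reflection_file]
    return args
```

With stem "3dimer", the hica08 files, n_molecules=3, and cc_cutoff=1.0, the list corresponds to the command in File 6 without the output redirection. The list could be passed to subprocess.run with stdout pointed at a log file.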

Preliminary determination of suitability of the molecular replacement search

A decent molecular replacement solution will have an R-factor no larger than ≈0.45. If R>0.50 it is unlikely that the molecular replacement solution will be useful. If the R-factor is satisfactory, then the packing of molecules placed in the unit cell by EPMR should be examined by loading the file into Pymol or Coot and enabling display of symmetry mates. If there are no obvious clashes between symmetry mates, and the symmetry-generated molecules pack well into the unit cell with clear solvent channels and no gaps between molecules, you should proceed. Otherwise, re-evaluate your molecular replacement solution and perhaps try again using different conditions.
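For reference, the conventional crystallographic R-factor is the sum of |Fobs - k·Fcalc| divided by the sum of Fobs. The Python sketch below computes it with a single overall scale factor k (the ratio of the two sums, a simplifying assumption; EPMR's own scaling may differ) and applies the thresholds quoted above. The function names and the label "borderline" for values between 0.45 and 0.50 are not from EPMR.

```python
from typing import Sequence


def r_factor(f_obs: Sequence[float], f_calc: Sequence[float]) -> float:
    """R = sum|Fo - k*Fc| / sum(Fo), with k = sum(Fo) / sum(Fc)."""
    if len(f_obs) != len(f_calc) or not f_obs:
        raise ValueError("need two non-empty lists of equal length")
    scale = sum(f_obs) / sum(f_calc)
    residual = sum(abs(fo - scale * fc) for fo, fc in zip(f_obs, f_calc))
    return residual / sum(f_obs)


def assess(r: float) -> str:
    """Classify an R-factor using the rule-of-thumb cutoffs of 0.45 and 0.50."""
    if r <= 0.45:
        return "decent"
    if r > 0.50:
        return "unlikely to be useful"
    return "borderline"
```

A satisfactory R-factor is only the first test; the packing check described above is still required.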

Preparing EPMR PDB file output for use by CCP4 or Phenix

If you have placed several molecules of a search model into the unit cell, they should be consolidated and reformatted before proceeding. First, the files should be concatenated using a text editor; any REMARK lines can be removed. Next, the file should be reformatted so that each protein chain has a different SEGID. This can be done in any text editor.
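The consolidation can also be scripted. The Python sketch below assumes fixed-column PDB records in which the segment identifier occupies columns 73-76 (the legacy SEGID field), assigns one letter per input file, drops REMARK and END records, and keeps only the first CRYST1 record. The function names and the lettering scheme are hypothetical, and the sketch handles at most 26 input files.

```python
import string
from typing import Iterable, List, Sequence


def set_segid(line: str, segid: str) -> str:
    """Write a left-justified segment ID into columns 73-76 of a coordinate record."""
    padded = line.rstrip("\n").ljust(80)
    return padded[:72] + segid.ljust(4)[:4] + padded[76:]


def merge_models(models: Sequence[Iterable[str]]) -> List[str]:
    """Concatenate PDB files, giving the atoms of each file a distinct SEGID."""
    merged = []
    for index, lines in enumerate(models):
        segid = string.ascii_uppercase[index]  # A for the first file, B for the second, ...
        for line in lines:
            record = line[:6].rstrip()
            if record in ("REMARK", "END"):
                continue
            if record == "CRYST1" and index > 0:
                continue  # one unit cell record is enough
            if record in ("ATOM", "HETATM"):
                merged.append(set_segid(line, segid))
            else:
                merged.append(line.rstrip("\n"))
    merged.append("END")
    return merged
```

Each element of models would typically be the list of lines read from one of the filename.n.best.pdb files. If a file contains several chains, each chain would need its own identifier, which this sketch does not attempt.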

Finding molecular replacement solutions using Phaser

Phaser is probably the most popular (and powerful) molecular replacement program. Phaser is most conveniently run via CCP4i, and one feature of Phaser can be used to estimate the number of protein molecules present in the asymmetric unit prior to running either Phaser or EPMR.

Estimating the number of protein molecules in the asymmetric unit

A utility within Phaser can utilize Matthews Probability calculations to estimate the most likely number of protein molecules within the asymmetric unit of the unit cell. This task can be carried out in the CCP4i interface by the following steps:

  • In the CCP4i main task window select Molecular Replacement from the task menu and click on Analysis...Phaser Cell Content Analysis. A task window will open (Figure 3).

Figure 3. Phaser task window set up for Matthews probability estimation. Required fields are highlighted in color.


  • Enter a job name (e.g., content analysis) and the input file name (the structure factor MTZ file for the entire dataset)
  • Under Define composition of the asymmetric unit, choose protein, select the molecular weight option and enter the molecular weight of the search model and the number of these molecules you expect in the asymmetric unit. If you don’t know how many search models are reasonable to enter, try “1.” Alternatively, you can select the sequence file option and then enter the filename of a sequence file for the search model in FASTA format. The sequence file option will of course give you the most precise cell content analysis and solvent content.

Example sequence file for a dimer in FASTA format

>2A8D:A

MDKIKQLFANNYSWAQRMKEENSTYFKELADHQTPHYLWIGCSDSRVPAEKLTNLEPGELFVHRNVANQVIHTDFNCLSV

VQYAVDVLKIEHIIICGHTNCGGIHAAMADKDLGLINNWLLHIRDIWFKHGHLLGKLSPEKRADMLTKINVAEQVYNLGR

TSIVKSAWERGQKLSLHGWVYDVNDGFLVDQGVMATSRETLEISYRNAIARLSILDEENILKKDHLENT

>2A8D:B

MDKIKQLFANNYSWAQRMKEENSTYFKELADHQTPHYLWIGCSDSRVPAEKLTNLEPGELFVHRNVANQVIHTDFNCLSV

VQYAVDVLKIEHIIICGHTNCGGIHAAMADKDLGLINNWLLHIRDIWFKHGHLLGKLSPEKRADMLTKINVAEQVYNLGR

TSIVKSAWERGQKLSLHGWVYDVNDGFLVDQGVMATSRETLEISYRNAIARLSILDEENILKKDHLENT


  • Start the job by selecting Run…Run Now at the lower left of the task window. The job will be entered into the job list in the CCP4i window, and you can monitor its status.
  • When the job is finished, examine the log file from the View Files from Jobs menu in the administration functions pane of the CCP4i window to verify that the job has run correctly.
  • Examine the Results tab and/or the log file to determine the most likely number of search models contained in the asymmetric unit. You will need this information to run Phaser.
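The quantity underlying this analysis is the Matthews coefficient, V_M = V / (Z × n × MW), where V is the unit cell volume, Z the number of symmetry-equivalent positions in the space group, n the number of molecules per asymmetric unit, and MW the molecular weight. The Python sketch below computes V_M and the usual solvent content estimate, 1 - 1.23/V_M; it does not reproduce the probability weighting that Phaser applies, and the function names and the 25,000 Da molecular weight in the example are hypothetical.

```python
import math


def unit_cell_volume(a, b, c, alpha, beta, gamma):
    """Volume in cubic Å of a general (triclinic) cell; angles in degrees."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    return a * b * c * math.sqrt(1 - ca * ca - cb * cb - cg * cg + 2 * ca * cb * cg)


def matthews_coefficient(volume, n_symops, n_copies, mol_weight):
    """V_M in cubic Å per dalton."""
    return volume / (n_symops * n_copies * mol_weight)


def solvent_fraction(v_m):
    """Approximate solvent content as a fraction, assuming typical protein density."""
    return 1.0 - 1.23 / v_m


if __name__ == "__main__":
    # Cell from File 5 (space group C2, Z = 4); the molecular weight is a made-up value.
    volume = unit_cell_volume(232.66, 144.73, 52.41, 90, 93.96, 90)
    for n in range(1, 9):
        v_m = matthews_coefficient(volume, 4, n, 25000.0)
        print(n, round(v_m, 2), round(solvent_fraction(v_m), 3))
```

Values of V_M for protein crystals typically fall between roughly 1.7 and 3.5 Å³/Da, so copy numbers giving values far outside that range are unlikely.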

Performing molecular replacement calculations in Phaser

Phaser is a fast, highly automated program for finding molecular replacement solutions for multiple protein molecules (search models) in an asymmetric unit. Phaser is conveniently run in the CCP4i environment.

  • In the CCP4i main task window select Molecular Replacement from the task menu and click on Model Generation...Phaser MR. A task window will open (Figure 4).


Figure 4. Phaser task window set up for molecular replacement solution. Required fields are highlighted in color.


  • Enter a job name (e.g., phaser) and the input file name (the sorted structure factor MTZ file for the entire dataset)
  • Under Mode for molecular replacement, select automated search.
  • The MTZ in file should be the sorted and merged structure factor file. Verify that Phaser has successfully read in your F and sigF columns and has properly identified the space group from the MTZ file header.
  • If you are sure of the space group, Run Phaser with the MTZ space group. If you know the point group (e.g. P222) but do not know the exact space group (e.g., P222, P212121, P21212, etc.) then Run Phaser with all alternative space groups.
  • In the Define ensembles section, provide an ensemble name and enter the filename of the search model, as well as an estimate of the homology of the search model to the protein of interest. Do not enter 100% if the two proteins are not perfectly identical. Underestimates are better than overestimates of homology. When using a wild-type protein as a search model for site-directed mutants, a value of 90% is OK. Your search model should not contain cofactors, water molecules, or ligands.
  • Under Define composition of the asymmetric unit, choose protein and enter the molecular weight of the search model, and the number of these molecules you expect in the asymmetric unit based on Matthews probability analysis. (Alternatively, you can select the sequence file option and input a FASTA format sequence file.)
  • Under Search parameters, enter the ensemble name and the number of copies to be placed in the asymmetric unit.
  • If your search model has potentially disordered termini that might cause packing clashes in the final stages of molecular replacement evaluation, or if you do not get a solution due to packing failure, you may want to alter the Packing criterion to be more tolerant. This option is found under Expert parameters. You can allow a certain absolute number or percentage of the best-packed solutions, or allow a maximum number of molecular clashes in the solution (try 20-30). This option is not normally needed.
  • Start the job by selecting Run…Run Now at the lower left of the task window. The job will be entered into the job list in the CCP4i window, and you can monitor its status. Phaser jobs can take from a few minutes to 24 hours, depending on the complexity of the problem and the various selection criteria. CCP4i jobs will continue to run even if you exit CCP4i and log out of your account.
  • When the job is finished, examine the Results or log file by double-clicking on the job in the administration pane of CCP4i to verify that the job has run correctly.
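The homology estimate requested in the Define ensembles step is usually taken from a pairwise sequence alignment. The Python sketch below computes percent identity over the aligned (ungapped) columns; this is one of several common definitions, and the function name and the "-" gap character are assumptions.

```python
def percent_identity(aln_a: str, aln_b: str) -> float:
    """Percent of ungapped alignment columns in which the two residues match."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(aln_a.upper(), aln_b.upper())
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)
```

Since underestimates are safer than overestimates, the computed value can be rounded down before it is entered in the task window.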

Molecular replacement with density modification and automated tracing

Molecular replacement is relatively routine when there is a high degree of sequence and structure homology between search model and target protein. (Typical requirements for a successful search are > 30% identity in sequence, and less than 2 Å rms difference in atomic positions.) For difficult cases near the limits of sequence identity or rms difference in atomic positions, a simple Phaser or EPMR search is very unlikely to yield an interpretable electron density map. One possible approach in these borderline cases is a combination of molecular replacement search (Phaser or EPMR), density modification (PARROT), followed by auto-tracing (BUCCANEER). For some cases, this approach, which borrows methodology from experimental phasing, works remarkably well.

Initial MR Search

  • Prepare a search model. The Phyre server is highly recommended for preparing a good search model for a novel protein structure, or from models that have borderline sequence homology. For large unit cells with many chains per asymmetric unit, consider searching with dimers, trimers, or tetramers. (These oligomers will have to be constructed in Pymol or some other program based on the symmetry operators of a close search model oligomer.) It is unlikely that you will be able to successfully place into the asymmetric unit more than 4 copies of anything with a marginal search model, although it may be worth trying. Consider trying various versions of the search model:
    • Full search model oligomer with side chains intact
    • Search model pruned in CHAINSAW to the last common atom in the search model and target protein
    • Search model truncated to poly-Ala
  • Initiate Phaser or EPMR searches using appropriate search models. Have a cup of coffee.
  • If a reasonable solution is found, examine the results in Coot to see if crystal packing is reasonable.
  • If crystal packing is reasonable (all protein chains and symmetry partners have contacts, overlaps are not severe) then run a few cycles of rigid body refinement or a full refinement in Refmac to generate initial phases. You may choose to examine this result in Coot to see if the electron density map is interpretable, but this is not likely as phases will normally be very poor at this point. If your original search model included full side chains, consider running Refmac with the search model truncated to poly-Ala as well.

Density modification

The initial MR solution may pack well in the unit cell, and be approximately correctly positioned, but the mean phase error of the resulting electron density map may be quite high. Density modification, especially if non-crystallographic symmetry is available, may significantly improve phases to the point that maps are interpretable. PARROT, which is part of the CCP4 suite, is an excellent option for accomplishing this task. The following steps are typical (see Figure 15 for a PARROT task window):

  • Choose as input files the .pdb and .mtz files output from the refinement of the Phaser or EPMR solution
  • If the MR solution is an oligomer, select the option Get NCS from MR/partial model. The S/N of the resulting maps will be enhanced by the square root of the number of copies in the ASU. For a large number of copies, e.g. hexamer or octamer, this can be quite significant.
  • Use the Free-R flag to monitor model bias
  • Select an output filename for the resulting density-modified .mtz file.
  • Enter the number of cycles of phase improvement (5-10 are typical)
  • Enter the solvent content fraction (e.g., 0.468, not 46.8) in the unit cell. You can obtain this number from the Phaser or EPMR log file.
  • Select Run...Run now. Have several cups of coffee.
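Two of the numbers in these steps are easy to get wrong, so a short Python sketch may help: the expected signal-to-noise gain from NCS averaging (the square root of the number of copies) and the conversion of a solvent content quoted as a percentage into the fraction that PARROT expects. The function names are hypothetical, and the rule that any value above 1 is a percentage is an assumption.

```python
import math


def ncs_signal_gain(n_copies: int) -> float:
    """Idealized S/N improvement from averaging n_copies NCS-related copies."""
    if n_copies < 1:
        raise ValueError("n_copies must be at least 1")
    return math.sqrt(n_copies)


def as_fraction(solvent_content: float) -> float:
    """Return solvent content as a fraction, converting from percent if needed."""
    value = solvent_content / 100.0 if solvent_content > 1.0 else solvent_content
    if not 0.0 < value < 1.0:
        raise ValueError("solvent content must lie between 0 and 1")
    return value
```

For example, a hexamer gives an idealized gain of about 2.4 and an octamer about 2.8, which is why averaging is most valuable when many copies are present.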

Figure 15. PARROT task window


The improved electron density map can be inspected in Coot, along with the original MR model. In Coot,

  • Open the coordinate (.pdb) file for the MR model using File...Open
  • Open the PARROT electron density map from the .mtz file using Auto Open MTZ...
  • The improved density can be found in the map labeled "parrot.F_phi.F, parrot.F_phi.phi"

Auto-tracing

It may be possible to rebuild the original MR search model into the improved density from PARROT, but more than likely, there will be many sequence registration errors and/or ambiguities of sequence alignment to the electron density based on a poly-Ala model. It is likely that a better result can be obtained in less time by autobuilding as much of the protein chain as possible. The CCP4 program BUCCANEER is one effective option. The following steps are typical (see Figure 16 for a BUCCANEER task window):

  • Open a BUCCANEER task window and select the task model building from experimental phases
  • For de novo building, enter a filename that contains the sequence of the protein model in FASTA format
  • Enter as the input .mtz file the improved phases .mtz file output by PARROT.
  • Use the Free-R flag to monitor model bias
  • Select a filename for the output .pdb file.
  • Normally, you should Apply anisotropy correction to input data
  • Select a number of building cycles to perform (5 is typical)
  • To improve the probability of correct sequence registration, select "Assign a sequence when a definite match is found" for both initial and subsequent cycles of building.
  • Select Run...Run now and have several cups of coffee while the model builds.

Figure 16. BUCCANEER task window


The output solution can be inspected in Coot. The output .pdb and .mtz files are found in a subdirectory of the CCP4 project directory labeled "39_buccaneer_pipeline..." where the initial number is the CCP4 job number. The files are named refine.pdb and refine.mtz.

  • Open the coordinate (.pdb) file for the BUCCANEER model using File...Open
  • Open the electron density map from the .mtz file using Auto Open MTZ...

Examining crystal packing in Coot

It is important to verify that your molecular replacement solution is sensible. Examining crystal packing can provide important information, including:

  1. Determining if your molecular replacement solution is compatible with crystal packing symmetry
  2. Determining, for partial solutions, where and how many missing components might be placed in the asymmetric unit or unit cell

Crystal packing can be conveniently examined using Coot:

  • Open the Results pane for the Phaser job by double-clicking on the job in the administration pane of CCP4i. At the bottom of the Results pane, click on the Coot button under Structure and Electron Density. Coot will open the PDB solution file and automatically load electron density maps from the associated MTZ file.
  • Go to Draw...Cell and Symmetry and select "Show Symmetry Atoms" and "Show Unit Cell"
  • Click on "Symmetry by Molecule" and select "Display Near Chains"
  • Adjust the "Symmetry Atom Display Radius" until you can see enough copies of symmetry molecules to define crystal packing. Start with 10-15 Å and work your way up as required.

If the space group and molecular replacement solution are correct,

  1. Every molecule in the unit cell(s) should have a molecular contact with several other protein molecules.
  2. Solvent channels should be visible down various principal axes of the unit cell
  3. No molecules should be observed "floating in space" without other intermolecular contacts. (If this is the case, you may have a partial molecular replacement solution, and are missing additional protein assemblies in the asymmetric unit.) The empty space should provide some clue about how much additional protein must be packed into the asymmetric unit to complete the solution.