X-ray Diffraction Data Processing

Roger S. Rowlett

Gordon & Dorothy Kline Professor, Emeritus

Colgate University Department of Chemistry

There are many software suites that can be used to analyze protein X-ray diffraction data. For data collected on an Oxford Diffraction system, integration and scaling in CrysalisPro is recommended. For data collected elsewhere, e.g., at a synchrotron, instructions for integration and scaling using the programs MOSFLM and SCALA are described.

Processing Data in CrysalisPro

Normally, CrysalisPro will process data during data collection. If you are satisfied with the experiment as it was originally set up, then it is only necessary to import the output .hkl file into the CNS workflow or the output .mtz file into the CCP4 workflow. Note: before the .mtz file can be used in CCP4, it will be necessary to merge the reflection file and convert intensities to structure factor values as described below.

Reprocessing reflection data in CrysalisPro

For major changes to the CrysalisPro data processing done during data collection, you can and should completely reprocess (integrate and scale) the data from scratch. Reasons you might want to do this include disregarding certain ranges of frames, altering the resolution limits, or manually assigning the proper space group. The following instructions describe a typical reprocessing task:

  • Copy the entire experiment folder created by CrysalisPro to another computer running Windows XP or later. (Refinalization should not be performed on the computer controlling the XRD instrument.) It is also possible to run CrysalisPro in wine under Linux, although it will be considerably slower.
  • Start CrysalisPro on the computer you have copied data to.
  • In the Select Experiment dialog box, select Browse experiment, navigate to the desired experiment folder, and open the .par file for the experiment. The experiment will be added to the list of previous experiments opened in CrysalisPro on that computer.
  • Select the desired experiment, and click on Open selected. The first image in the experiment should be displayed.
  • Click on Start/Stop
  • Click on Data reduction with options
  • The first dialog box appears. Select the family of space groups you expect your data to belong to (e.g., (P)rimitive, (C)-centered, (I)-centered, (R) hexagonal, etc.) If you don't know, select (P)rimitive and CrysalisPro may be able to figure it out. Select Next.
  • The second dialog box allows you to edit the start and end image number of each run to be processed. Click Next when finished.
  • The third dialog box opens. You should normally select Auto select optimal prediction approach and Follow significant sample wobbling. If your crystal slipped during the run (you might notice this when indexing is suddenly lost part way through the run) you can select Follow sudden changes in orientation. This works well in this scenario, but will dramatically increase computational time. Start with a +/- 2 degree search in 4 steps; you may have to widen the search and increase the number of steps (up to 10) to make the necessary correction. You may also want to click on Clear all data from previous run to start with a clean slate. Click on Next to continue.
  • The fourth dialog box appears. Ensure that Smart background is selected. Unless you are short of memory, uncheck Reduce background accumulation to SHORT. Click Next to continue.
  • The fifth dialog box opens. Check Use Friedel mates as equivalent unless you want to keep anomalous scattering data. Verify that the correct outlier rejection criteria are being used for the expected space group. Click Next to continue.
  • The last dialog box opens. You may change the name of the output files (recommended) by clicking on the Change output name button. Both finalization options (Space group determination and Completeness determination) should be checked. For difficult space group determinations, you may want to use the Manual (interactive) space group determination option.
  • Click Finish to start the analysis and go have a cup of coffee. (It is recommended to watch the process to ensure that spots are indexed correctly across all frames.)
  • Refinalize the data as described next. This is useful to (1) limit resolution if desired and obtain merging statistics for the output data set, and (2) produce an MTZ file for CCP4.

Refinalizing reflection data in CrysalisPro

The following instructions describe a typical refinalization task. If the data has been previously processed to your satisfaction, you can simply load the processed data as described above and begin the finalization process.

  • Click on the "Inspect Data Reduction Results and Refinalize..." icon on the toolbar. A dialog box will open with a summary of the results of the concurrent data reduction that occurred during data collection.
  • To alter the data reduction, click on Refinalize. A data reduction finalizing dialog box will open. Change the following parameters as required:
    • Use Outlier rejection should be enabled
    • Friedel mates equivalent should be selected unless anomalous scattering data is required
    • Automated Empirical correction should normally be selected
    • Select Auto or Interactive space group determination as desired.
    • If you desire to change resolution limits for the re-processed data, select Manual and enter the desired resolution limits, otherwise select Automated
    • Under Output, verify that Export options specify that an mtz file is selected. Change the output file name for the .hkl and .mtz files (recommended) by clicking on Change. It is recommended that you add "-refinal" to your file name to distinguish the hkl and/or mtz files from the originally processed data.
    • Click on OK to refinalize data.
    • The summary table for the refinalized data will be found at the bottom of the Command Shell window. Open the Command Shell by clicking on the second icon from the top on the left menu bar of CrysalisPro.


Merging a CrysalisPro MTZ file for use in CCP4

To process data from CrysalisPro in CCP4, it is necessary to sort reflections by h, k, l and merge the reflection data without applying scale factors. (The data has already been scaled in CrysalisPro.) This can be easily accomplished in one go in the CCP4i GUI using the program Aimless. Start CCP4i and add your project to the project directory list, if necessary. An alternate method is to use the programs sortmtz and scala, but scala has been deprecated in the latest CCP4 release, so the latter method is not recommended.

Sorting

Note: If using aimless to process data, sorting separately with sortmtz is unnecessary. Sorting is required if you process data in scala.


    • In the Module drop-down menu, choose Program List, then scroll down and find Sortmtz and click on it to open a task window.
    • Enter an appropriate job title, e.g., sortmtz
    • Select your input file in the MTZ in field, and select an appropriate filename for your MTZ out file. The default sorting order (ascending) on H K L M/ISYM BATCH is appropriate.
    • If desired and necessary, you may change the space group assignment (e.g., I222 to I212121) by clicking on the Change space group... button. Under Reindex Details, press the Change spacegroup to button and type in the new space group assignment.
    • Select Run...Run Now to start the job.

Merging

There are two ways of completing this task, one using scala, and one using aimless. Aimless is strongly preferred, as scala is no longer updated:

Using SCALA

    • In the Program List of the Module drop-down menu, choose Scala. (Alternatively, you can find Scala under the Data Reduction and Analysis module as Find Symmetry, Scale and Merge (Scala).)
    • Enter an appropriate job title, e.g., merge.
    • Select Run Ctruncate... and check output to a single MTZ file to ensure that Scala will convert intensities to structure factor values after merging.
    • Optionally (this is recommended) select Ensure unique data and add FreeR column... and select a fraction of data to set aside for statistical purposes. A typical fraction is 0.05.
    • Select as the input file the MTZ file output by the sortmtz job previously run (MTZ in field)
    • Type in an appropriate name for the output file in the MTZ out field, e.g. d44n01-sorted-sf.mtz. It is a good practice to end file names in "-sf" that contain your structure factors for that dataset.
    • In the Define Output Datasets section, enter appropriate names for the Project, Crystal, and Dataset name. Typically the Project name would be the protein studied (e.g. HICA), the Crystal name would reflect the specific variant and crystal number used, e.g., D44N-001, and the Dataset name would be 'all' for a single dataset.
    • Under Scaling Protocol, select constant under Scale. CrysalisPro has already scaled your intensities; you merely want to merge duplicates together at this point.
    • Under Scaling Details select "Refine Scale Factors for 0 cycles" (otherwise you will get an error message about negative scales)
    • Select Run...Run Now to start the job.
    • This output file is ready for further processing, e.g., Phaser.
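
To make the "convert intensities to structure factor values" step concrete, the following is a minimal sketch of the simplest possible conversion, F = sqrt(I), with first-order error propagation. It is not what Ctruncate does: Ctruncate applies a French-Wilson style statistical treatment that also handles weak and negative intensities, which a plain square root cannot. The function name naive_amplitude is hypothetical and used for illustration only.

```python
import math


def naive_amplitude(intensity, sigma_i):
    """Naively convert one merged intensity to (F, sigF).

    Returns None when the intensity is not positive, because a plain
    square root is undefined there. This is exactly the case a
    French-Wilson style treatment is designed to handle.
    """
    if intensity <= 0.0:
        return None
    f = math.sqrt(intensity)
    # First-order error propagation: dF = dI / (2 * sqrt(I))
    sig_f = sigma_i / (2.0 * f)
    return f, sig_f


if __name__ == "__main__":
    print(naive_amplitude(100.0, 4.0))
    print(naive_amplitude(-5.0, 3.0))
```

The None result for negative intensities illustrates why the conversion is left to Ctruncate rather than done by hand.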

Using AIMLESS

This is the preferred method for merging data, and does not require a sorted file for input. You may input your un-merged mtz file from CrysalisPro into aimless.

    • In the Program List of the Module drop-down menu, choose Aimless. (Alternatively, you can find Aimless under the Data Reduction and Analysis module as Symmetry, Scale & Merge (Aimless).)
    • Enter an appropriate job title, e.g., merge.
    • Tick the box Option to skip scaling & just merge
    • Select Run Ctruncate... to ensure that Aimless will convert intensities to structure factor values after merging
    • Optionally (this is recommended) select Ensure unique data and add FreeR column... and select a fraction of data to set aside for statistical purposes. A typical fraction is 0.05.
    • In the HKLIN line, select as the input file the desired MTZ file of intensities from CrysalisPro.
    • Type in an appropriate name for the output file in the HKLOUT field, e.g. d44n01-sorted-sf.mtz.
    • Enter appropriate names for the Crystal name, Project name, and Dataset name. Typically the project name would be the protein (e.g., HICA), the crystal name is its variant and identifier (e.g., D44N-001), and the dataset name describes the origin of the data (e.g., all).
    • Select Run...Run Now to start the job.
    • This output file is ready for further processing, e.g., Phaser.
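
As an illustration of what "merge without scaling" means, the sketch below groups repeated observations of the same reflection and averages them with inverse-variance weights, applying no scale factors. It assumes the indices have already been mapped to the asymmetric unit and that each observation carries a sigma. The function name merge_observations and the tuple layout are hypothetical; Aimless itself uses a more elaborate error model and outlier rejection.

```python
from collections import defaultdict


def merge_observations(observations):
    """Merge duplicate observations without applying scale factors.

    observations: iterable of (h, k, l, I, sigI), with indices already
    reduced to the asymmetric unit (an assumption of this sketch).
    Returns a dict {(h, k, l): (mean_I, sigma_mean)} sorted by h, k, l.
    """
    groups = defaultdict(list)
    for h, k, l, i_obs, sig in observations:
        groups[(h, k, l)].append((i_obs, sig))

    merged = {}
    for hkl in sorted(groups):  # sorting by h, k, l mirrors the sort step
        weights = [1.0 / (sig * sig) for _, sig in groups[hkl]]
        wsum = sum(weights)
        mean_i = sum(w * i for w, (i, _) in zip(weights, groups[hkl])) / wsum
        merged[hkl] = (mean_i, wsum ** -0.5)
    return merged


if __name__ == "__main__":
    obs = [
        (1, 0, 0, 100.0, 2.0),
        (1, 0, 0, 110.0, 2.0),
        (0, 1, 0, 50.0, 5.0),
    ]
    for hkl, (i_mean, sig) in merge_observations(obs).items():
        print(hkl, round(i_mean, 2), round(sig, 3))
```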

Processing Data in MOSFLM and SCALA

MOSFLM is a program for integrating single crystal diffraction data from area detectors, maintained by Harry Powell, Medical Research Council Laboratory of Molecular Biology, Cambridge. The most convenient way to run MOSFLM is through IMOSFLM in the CCP4 GUI. IMOSFLM provides data suitable for scaling in SCALA.

Indexing and Integrating data using IMOSFLM

Startup and configuration

  • To start IMOSFLM, start a CCP4i session, select your project, and select Start iMosflm from the "Data Reduction and Analysis...Data Processing using MOSFLM" menu item. The IMOSFLM task window should open (Figure 2).

Figure 2. IMOSFLM task window.


  • Click on the images icon or select Session...Add Images and navigate to the directory where the desired image files are located. Select the first image and click on Open. All images will be loaded into IMOSFLM, and an image view window will open.
  • To configure the experiment settings, select View...Experiment Settings
    • Check Reverse direction of spindle rotation if analyzing data at CHESS or another facility that utilizes reverse phi rotation.
    • Enter the Project name, Crystal name, and Dataset name for the images selected.
    • Enter the appropriate X and Y beam positions, and check the distance and λ for the experiment.
  • Set the beamstop shadow before proceeding with indexing or integration:
    • Click on the zoom icon in the image display window and use the mouse to draw a zoom box around the center of the image. The display window will zoom to the defined area when the mouse button is released.
    • Click on the circle fitting icon in the image display window. Three additional icons will appear (Figure 3).

Figure 3. IMOSFLM display window.


    • Click on the top circle-fitting icon and define a circle around the beamstop by clicking around the periphery of the beamstop shadow.
    • When you have defined a series of points around the beamstop, click on the middle circle-fitting icon to define a circular area. It should be highlighted in green.
    • Remove the points used to define the beamstop shadow by clicking on the bottom circle-fitting icon.
    • If the green area is distracting, you can remove it from the display by clicking on the green icon in the display window.
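
The idea behind defining the beamstop shadow from a handful of clicked points can be sketched as a least-squares circle fit. The code below uses a standard closed-form algebraic fit; it is an illustration under that assumption and is not taken from IMOSFLM, whose internal fitting method is not described here. The function name fit_circle and the pixel coordinates are hypothetical.

```python
import math


def fit_circle(points):
    """Least-squares circle through (x, y) points; returns (cx, cy, r).

    Uses the closed-form algebraic fit on mean-centred coordinates.
    Needs at least three points that are not collinear.
    """
    n = len(points)
    if n < 3:
        raise ValueError("need at least three points")
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    u = [x - mx for x, _ in points]
    v = [y - my for _, y in points]

    suu = sum(a * a for a in u)
    svv = sum(b * b for b in v)
    suv = sum(a * b for a, b in zip(u, v))
    suuu = sum(a ** 3 for a in u)
    svvv = sum(b ** 3 for b in v)
    suvv = sum(a * b * b for a, b in zip(u, v))
    svuu = sum(b * a * a for a, b in zip(u, v))

    det = suu * svv - suv * suv
    if abs(det) < 1e-12:
        raise ValueError("points are collinear; no unique circle")
    rhs_u = 0.5 * (suuu + suvv)
    rhs_v = 0.5 * (svvv + svuu)
    uc = (rhs_u * svv - rhs_v * suv) / det
    vc = (rhs_v * suu - rhs_u * suv) / det
    radius = math.sqrt(uc * uc + vc * vc + (suu + svv) / n)
    return uc + mx, vc + my, radius


if __name__ == "__main__":
    clicks = [(110.0, 200.0), (100.0, 210.0), (90.0, 200.0), (100.0, 190.0)]
    print(fit_circle(clicks))
```

Clicking more points than the minimum of three averages out small errors in where each click lands on the edge of the shadow.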

Indexing the first frame

  • Click on Indexing in the taskbar of IMOSFLM
  • Select one image to index by clicking on the Pick first image icon or by typing the number of the first image in the Images box and pressing enter.
  • Click on the Index button. If necessary, fix the Max cell edge value in the top of the IMOSFLM task window to a reasonable value to limit spurious and unreasonable cell symmetry assignments.
  • Identify the most likely spacegroup from the list provided. Reasonable solutions are highlighted with a green icon. Typically, the highest symmetry spacegroup (the lowest on the list) with a reasonably low penalty function is the correct choice. You should notice a large gap in penalty scores between acceptable and unacceptable space groups. IMOSFLM will suggest the most likely spacegroup for you in the pull-down dialog box. If there are several equivalent solutions (e.g., I222 and I212121) and you know the correct screw axes for the crystal, you can select it now.
  • In the display window, turn off the display of spots used for indexing by clicking on the red cross icon.
  • Estimate mosaicity by clicking on the Estimate mosaicity button at the bottom of the task window. Note: If mosaicity is more than 0.7° it may be underestimated by MOSFLM.
  • Examine the predictions (yellow and blue boxes) to see if they match spots on the image. If there are more spots than predictions, consider increasing the value for the mosaicity. If there are more predictions than spots, consider reducing the value for the mosaicity.

Cell refinement

  • Click on the Cell Refinement task in the IMOSFLM taskbar
  • Before proceeding further, establish data processing settings. You can access these by selecting View...Processing Options and clicking on the Processing tab.
    • Set the resolution limits, especially the High resolution limit.
    • Adjust the spot separation, if necessary. (The IMOSFLM default is usually adequate)
    • Check the Spots "close" box if spots are very close or overlapping, or if you get a warning message during cell refinement or data processing
    • Select a file name for the integration MTZ file.
  • Select in the Images bar the frames to perform cell refinement with. Several segments will give a better cell refinement than a single segment. Choose segments 20-45° apart. For example, 1-8, 21-28, 41-48 would refine three 8-frame segments starting at frames 1, 21, and 41.
  • Fix any parameters you don't want refined during cell refinement by checking the appropriate box in the Cell Refinement task window. Frequently, the mosaicity is fixed if it is large and does not refine stably.
  • Start cell refinement by clicking Process in the task window. If you wish to monitor the progress of refinement predictions in the display window, click on the Show Predictions icon in the task window.
  • Monitor the central spot profiles in the task window. If things go well, the central spot profile should be centered in the processing box, and refined cell dimensions should appear in the task window.
  • If warnings are encountered, examine them in detail in the log file (mosflm.lp) and make corrections to processing parameters, if necessary.
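
The choice of image segments for cell refinement can be worked out from the oscillation angle per frame. The sketch below is a hypothetical helper (refinement_segments is not an IMOSFLM function) that returns frame ranges a chosen number of degrees apart. With an assumed 1° oscillation per frame and 20° separation it reproduces the 1-8, 21-28, 41-48 example given above.

```python
def refinement_segments(oscillation_deg, separation_deg, frames_per_segment,
                        n_segments, first_frame=1):
    """Return a list of (start, end) frame ranges for cell refinement.

    oscillation_deg: rotation per image, in degrees
    separation_deg: desired angular distance between segment starts
    """
    if oscillation_deg <= 0:
        raise ValueError("oscillation angle must be positive")
    # Number of frames between the starts of successive segments
    step = max(1, round(separation_deg / oscillation_deg))
    segments = []
    for i in range(n_segments):
        start = first_frame + i * step
        segments.append((start, start + frames_per_segment - 1))
    return segments


def as_images_string(segments):
    """Format segments in the 'a-b, c-d' style typed into the Images bar."""
    return ", ".join(f"{start}-{end}" for start, end in segments)


if __name__ == "__main__":
    print(as_images_string(refinement_segments(1.0, 20.0, 8, 3)))
    print(as_images_string(refinement_segments(0.5, 20.0, 8, 3)))
```

For data collected with finer slicing, the same angular separation corresponds to a larger gap in frame numbers, which is why the second call spaces segments 40 frames apart.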

Integration

  • Click on the Integration task in the IMOSFLM taskbar
  • Select the images you would like to integrate in the Images text bar. For example 1-250 would integrate the first 250 images of the dataset.
  • If you fixed mosaicity to perform cell refinement, you should probably fix it for integration as well by checking the appropriate box in the task window.
  • Modify the output MTZ filename if required by typing it in the text box at the top of the task window.
  • If you wish to monitor the progress of refinement predictions in the display window, click on the Show Predictions icon in the task window.
  • Click on Process to begin integration.
  • Monitor spot profiles in the task window and the predictions in the display window to verify the integration is proceeding satisfactorily.
  • If warnings are encountered, examine them in detail in the log file (mosflm.lp). Make corrections to processing parameters and re-integrate, if necessary.

Scaling reflection data from MOSFLM

If integration has gone well, you can proceed to scaling data using Aimless. The most convenient way to use Aimless is through the CCP4i interface. In general the Aimless default settings are very good, and scaling of data is quite transparent. The following procedure is typical for scaling a single data set. For merging and scaling multiple datasets, see the next section.

  • Start CCP4i by issuing the command ccp4i at the prompt. The CCP4i graphical interface will open (Figure 4).


Figure 4. Main task window for CCP4i. Tasks are listed in the left pane, jobs in the middle pane, and administration functions in the right pane.


  • If you have not already done so, set up and select a project directory by clicking on Directories&ProjectDir in the administration pane.
  • Select the Data Reduction and Analysis module in CCP4i (upper left menu bar) and click on Symmetry, Scale, Merge (Aimless). A task window will open (Figure 5). You will need to enter a job title, select the appropriate MTZ file to be scaled (from your MOSFLM integration or CrysalisPro integration/scaling), define an output MTZ filename (different from the input MTZ filename) and (optionally) the estimated number of residues in the asymmetric unit. The latter is useful if you would like to obtain an estimated average b-factor for the data set from a Wilson plot. If PNAME, XNAME, and DNAME were set in MOSFLM before integration, these will be read into the job under Define Output Datasets. These parameters are mandatory if you are merging two or more datasets together in CCP4i. The Scaling Protocol defaults should be fine in 99% of cases.
  • Ensure that Run Truncate... and output a single MTZ file is selected, so that your scaled data will be converted to structure factors before output.
  • Optionally, enter the estimated number of residues per asymmetric unit to get accurate Wilson Plot statistics, including the average estimated b-factor for the dataset.
  • Optionally, this is a good time to set aside reflection data for calculating Rfree. This can be done by selecting Ensure unique data & add FreeR... and a fraction of data to set aside. The default of 5% (0.05) is usually acceptable.


Figure 5. The Aimless task window in CCP4i. Mandatory fields are highlighted in color.


  • The scaling job is started by selecting Run…Run Now at the lower left of the task window. The job will be entered into the job list in the CCP4i window, and you can monitor its status.
  • When the job is finished, examine the scaling statistics by selecting View Files from Job…View Log File from the administration pane. In the log file window, select Show Summary.
    • The overall Rmerge should be very low, typically ≈0.05 for an excellent data set. Overall Rmerge values > 0.10 may be cause for concern, although low resolution or weak data sets may have Rmerge values > 0.10. For a typical data set the Rmerge values by shell should increase monotonically from low- to high-resolution shells.
    • The overall I/σ(I) value should typically be ≈20. An overall I/σ(I) < 10 describes a data set with weak intensities. For a typical data set I/σ(I) should decrease monotonically from low- to high-resolution shells. You may want to disregard shells with I/σ(I) < 1-2, and re-scale and/or re-integrate with reduced resolution limits. Aimless will estimate the highest possible usable resolution shell based on CC1/2 values, and this is a more modern method of setting resolution cutoffs.
    • Examine the completeness of the data set. Data that is 85-90% complete should be sufficient to solve a structure, although more completeness is better if practical. A quality data set will also have approximately the same degree of completeness in each shell, with perhaps a fall-off at high resolution where spot intensities are weaker or are limited to the corners of the images. Gaps in completeness in low- or mid-resolution shells may indicate problems with ice rings and/or integration.
    • Examine the multiplicity of the data set. A typical data set will have an average multiplicity of 4 or more. This is a measure of the average number of times a reflection intensity Ihkl (or its Friedel mate, I–(hkl)) has been independently measured. Higher multiplicities will result in a more precise data set. There may be a fall-off in multiplicity at high resolution where spot intensities are weaker.
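
The statistics discussed above can be made concrete with a small calculation. The sketch below computes Rmerge as the sum of |I - <I>| over the sum of I for reflections measured more than once, together with average multiplicity and completeness. It assumes indices are already mapped to the asymmetric unit and uses unweighted means. The function name merging_statistics and the parameter n_possible (the number of unique reflections expected to the resolution limit) are hypothetical, and the values Aimless reports are computed per shell with its own conventions.

```python
from collections import defaultdict


def merging_statistics(observations, n_possible):
    """Overall Rmerge, multiplicity, and completeness for one data set.

    observations: iterable of (h, k, l, I) in the asymmetric unit
    n_possible: number of unique reflections expected (an input here)
    """
    groups = defaultdict(list)
    for h, k, l, intensity in observations:
        groups[(h, k, l)].append(intensity)
    if not groups:
        raise ValueError("no observations supplied")

    numerator = 0.0
    denominator = 0.0
    for values in groups.values():
        if len(values) < 2:
            continue  # single measurements carry no agreement information
        mean_i = sum(values) / len(values)
        numerator += sum(abs(v - mean_i) for v in values)
        denominator += sum(values)

    n_obs = sum(len(v) for v in groups.values())
    n_unique = len(groups)
    return {
        "rmerge": numerator / denominator if denominator else float("nan"),
        "multiplicity": n_obs / n_unique,
        "completeness": n_unique / n_possible,
    }


if __name__ == "__main__":
    obs = [
        (1, 0, 0, 100.0), (1, 0, 0, 110.0),
        (0, 1, 0, 50.0), (0, 1, 0, 50.0), (0, 1, 0, 50.0),
    ]
    print(merging_statistics(obs, n_possible=4))
```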

Merging and scaling multiple data sets in CCP4i

Frequently in protein X-ray crystallography it is necessary to combine several datasets in order to solve a structure. Such situations might include:

  • combining several datasets from different phi rotations of the same crystal. This situation might arise from an interrupted data collection run where the initial data set was not sufficiently complete, and for which it was impossible or impractical to resume the run exactly where it left off. Combining two sets will allow the construction of a suitably complete data set.
  • combining datasets from the same crystal using different camera distances. This situation is very useful when a crystal has large unit cell dimensions (and therefore closely spaced spots), where it is difficult to collect a complete dataset which includes high-resolution data as well as well-resolved low-resolution reflections. In this case the low resolution data can be collected as a separate dataset with a longer camera distance, allowing better separation of low-resolution reflections. Overlapping low-resolution reflections are discarded in the high-resolution data set.
  • combining several datasets from different crystals of the same protein in the same space group. This situation might arise when crystals have a limited lifetime in the X-ray beam, and no single data set is complete enough for structure solution.

Aimless can be used to scale and merge datasets in one go. (This is the preferred approach). However, it is also possible to sort and merge data manually, and scale using SCALA if desired.


Using SCALA

To merge datasets, the second and subsequent datasets must be renumbered so that batches of reflections (collections of reflections from a frame of data) will have unique, non-conflicting batch numbers. The resulting sorted datasets are then combined and sorted by reflection, and then finally re-scaled to render them consistent with each other.

Sorting and merging intensity data

  • Open a CCP4i session as previously described.
  • Select the Data Reduction and Analysis in CCP4i (upper left menu bar) and click on Utilities...Sort/Modify/Combine MTZ files. A task window will open (Figure 6):


Figure 6. Sort/Modify/Combine MTZ files task window. Mandatory fields are highlighted in color.


  • Enter a job name (e.g., renumber), and select the appropriate MTZ input filename. The input dataset should be a reflection file, e.g. intensity data output from MOSFLM, and not a structure factor file. An output file name will be generated, or you can change it to something else.
  • Select Reset the Batch number(s) and enter a number for the first batch. This number should be larger than the highest batch (frame) number in the other dataset. It is simplest to add a multiple of 1000 to the original batch number.
  • Start the job by selecting Run…Run Now at the lower left of the task window. The job will be entered into the job list in the CCP4i window, and you can monitor its status.
  • When the job is finished, examine the log file from the View Files from Jobs menu in the administration functions pane of the CCP4i window to verify that the job has run correctly.
  • Open a new Sort/Modify/Combine MTZ files task window. Enter a job name (e.g., combine) and enter the MTZ filename of the renumbered MTZ (reflection intensity) file from the previous steps. Click on Add File and enter the MTZ (reflection intensity) filename of the sorted intensities corresponding to the dataset you wish to combine it with. Finally, select an output MTZ filename.
  • Start the job by selecting Run…Run Now at the lower left of the task window. The job will be entered into the job list in the CCP4i window, and you can monitor its status.
  • When the job is finished, examine the log file from the View Files from Jobs menu in the administration functions pane of the CCP4i window to verify that the job has run correctly.
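
The batch renumbering rule (add a multiple of 1000 that clears the highest batch number in the other dataset) can be sketched as follows. The helper names batch_offset and renumber_batches are hypothetical and only illustrate the arithmetic; the actual renumbering is done by the CCP4i task.

```python
def batch_offset(max_batch_other, step=1000):
    """Smallest multiple of `step` that is larger than max_batch_other."""
    return (max_batch_other // step + 1) * step


def renumber_batches(batches, max_batch_other, step=1000):
    """Shift batch numbers so they cannot collide with the other dataset."""
    offset = batch_offset(max_batch_other, step)
    return [b + offset for b in batches]


if __name__ == "__main__":
    first_dataset = list(range(1, 251))    # batches 1-250
    second_dataset = list(range(1, 181))   # batches 1-180, would collide
    renumbered = renumber_batches(second_dataset, max(first_dataset))
    print(renumbered[0], renumbered[-1])
    # No batch number is shared between the two datasets any more
    print(set(first_dataset).isdisjoint(renumbered))
```

In this example the value to enter for the first batch of the second dataset would be 1001.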

Scaling merged intensity data

  • To scale and merge the sorted files, open a Scale and Merge Intensities task window.
  • Enter a job name (e.g., merge) and select as your input MTZ file the sorted and combined MTZ (reflection intensity) file created in the previous job. Select an output MTZ filename, and in the Define Output Datasets section check Combine all input datasets into a single output dataset. Change the output dataset name to something descriptive like all.
  • Ensure that Run Truncate... and output a single MTZ file is selected, so that your scaled data will be converted to structure factors before output.
  • Optionally, enter the estimated number of residues per asymmetric unit to get accurate Wilson Plot statistics, including the average estimated b-factor for the dataset.
  • Optionally, this is a good time to set aside reflection data for calculating Rfree. This can be done by selecting Ensure unique data & add FreeR... and a fraction of data to set aside. The default of 5% (0.05) is usually acceptable.
  • Start the job by selecting Run…Run Now at the lower left of the task window. The job will be entered into the job list in the CCP4i window, and you can monitor its status.
  • When the job is finished, examine the log file from the View Files from Jobs menu in the administration functions pane of the CCP4i window to verify that the job has run correctly. Examine the scaling statistics to verify that the combined data set is satisfactory. Combined datasets may not have monotonically varying values of Rmerge, I/σ(I), or multiplicity by shell because of discontinuities in the merged data. However, the merged data should still have overall statistics that conform to what is expected for a usable dataset.
  • The output MTZ file from this procedure is ready for further processing as described in Phase Solution.
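
Setting aside a fraction of reflections for Rfree amounts to picking a random subset of the unique reflections once and never refining against it. The sketch below illustrates that idea only; select_free_set is a hypothetical name, and the flag values and column layout that CCP4 actually writes to the MTZ file follow its own convention, which is not reproduced here.

```python
import random


def select_free_set(hkls, fraction=0.05, seed=0):
    """Pick a reproducible random subset of unique reflections.

    hkls: iterable of (h, k, l) tuples
    fraction: portion of reflections to set aside (0.05 = 5%)
    seed: fixed seed so the same free set is chosen every time
    """
    unique = sorted(set(hkls))
    if not unique:
        return set()
    n_free = max(1, round(fraction * len(unique)))
    rng = random.Random(seed)
    return set(rng.sample(unique, n_free))


if __name__ == "__main__":
    reflections = [(h, k, 0) for h in range(40) for k in range(25)]
    free = select_free_set(reflections)
    print(len(reflections), len(free))
```

Keeping the selection reproducible matters because the same free set should be carried through every later refinement of the same dataset.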


Using Aimless

This is the preferred method for merging data, and does not require a sorted file for input. You may input your un-merged mtz file from CrysalisPro or MOSFLM into aimless.

    • In the Program List of the Module drop-down menu, choose Aimless. (Alternatively, you can find Aimless under the Data Reduction and Analysis module as Symmetry, Scale & Merge (Aimless).) See Figure 7.
    • Enter an appropriate job title, e.g., merge.
    • Select Run Ctruncate... to ensure that Aimless will convert intensities to structure factor values after merging
    • Optionally (this is recommended) select Ensure unique data and add FreeR column... and select a fraction of data to set aside for statistical purposes. A typical fraction is 0.05.
    • In the HKLIN line(s), add multiple input files (MTZ format) from your image processing program (e.g., CrysalisPro or MOSFLM).
    • Type in an appropriate name for the output file in the HKLOUT field, e.g. d44n01-merged-sf.mtz.
    • Enter appropriate names for the Crystal name, Project name, and Dataset name. Typically the project name would be the protein (e.g., HICA), the crystal name is its variant and identifier (e.g., D44N-001), and the dataset name describes the origin of the data (e.g., highres or lowres).
    • Select Run...Run Now to start the job.
    • This output file is ready for further processing, e.g., Phaser.

Figure 7. Aimless task window for scaling and merging multiple data sets.


Reindexing Data Sets in CCP4i

Sometimes the general space group is known but the exact space group, including screw axes, is not, and the data set must later be re-indexed to conform to standard conventions. For example, you may know that a particular crystal is in a primitive orthorhombic space group (e.g., P222, P212121, P21212, P21221, P22121, P2221, P2212, P2122). Of these space groups, only P222, P212121, P21212, and P2221 are recognized as standard space groups. The others are non-standard variants in which the h, k, l indices have been permuted. To convert one of these non-standard space groups into a standard one, the reflection data indices must be appropriately swapped. For example, to convert reflection data from P22121 to the standard P21212, it is necessary to rearrange the indices hkl into klh. This is conveniently done in CCP4i:

  • Open a CCP4i session as previously described.
  • Select the Reflection Data Utilities module in CCP4i (upper left menu bar) and click on Reindex Reflections. A task window will open (Figure 8).

Figure 8. Reindex Reflections task window. Mandatory fields are highlighted in color.


  • Enter a job name (e.g., reindex), and select the appropriate MTZ input filename. An output file name will be generated, or you can change it to something else.
  • Under the Reindex Details section of the form, select entering reflection transformation. In this example, we have selected h=k, k=l, l=h to permute the indices hkl to klh.
  • Check the box Change spacegroup to and enter the proper, standard space group, here P21212.
  • Start the job by selecting Run…Run Now at the lower left of the task window. The job will be entered into the job list in the CCP4i window, and you can monitor its status.
  • When the job is finished, examine the log file from the View Files from Jobs menu in the administration functions pane of the CCP4i window to verify that the job has run correctly.
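
The index permutation used in this example (h=k, k=l, l=h) can be checked with a few lines of code. The sketch below is illustrative only, and reindex_klh is a hypothetical name. It also shows why the permutation moves P22121 to P21212: the axial reflections that are systematically absent for odd index along k and l in the old setting end up along h and k in the new one.

```python
def reindex_klh(hkl):
    """Apply the transformation h' = k, k' = l, l' = h to one reflection."""
    h, k, l = hkl
    return (k, l, h)


def reindex_all(reflections):
    """Reindex a list of (h, k, l) tuples and sort the result by h, k, l."""
    return sorted(reindex_klh(hkl) for hkl in reflections)


if __name__ == "__main__":
    print(reindex_klh((1, 2, 3)))
    # Axial reflections absent in P22121 (0k0 and 00l with odd index)
    for absent in [(0, 3, 0), (0, 0, 5)]:
        print(absent, "->", reindex_klh(absent))
```

After the permutation the absences lie on the h00 and 0k0 axes, which is the pattern expected for screw axes along a and b in P21212.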