This work was led by Chufeng Li (now CFEL, DESY)
This project was initiated at the NSLS-FMX-20190306 beamtime. The original aim was to create an interface between the Dozor+DIALS modules and popular SFX tools such as CrystFEL and DatView (see the relevant pages about the NSLS-ASU partnership project software development plan in the same hierarchical space). For computational efficiency and fast real-time feedback on data collection, this Python-based EZ-Hit-Finding suite was developed as an alternative. (Specific comparisons of computation time and output data format between the current Dozor/DIALS hit-finding pipeline and the Python-based EZ-Hit-Finding suite are in the To-Do list.)
The predecessor of this Python-based EZ-Hit-Finding suite is "CXI_MAKER", which was designed to convert the .mccd files collected from the Mar-CCD detector at the BioCARS beamline of the APS into .cxi (.h5) files, followed by hit-finding and writing of the peak-list information into the .cxi file, adopting the same path convention as the Cheetah and CrystFEL suites. The major aim was to interface the data format to existing and widely adopted SFX data analysis software pipelines. "CXI_MAKER" was written in MATLAB and was also packaged into a stand-alone installation package that can be launched without a MATLAB license.
The basic algorithm adopted in EZ-Hit-Finding is binary-image-based connectivity analysis with adaptive thresholding. As the name "EZ" suggests, it was not intended to outperform existing hit-finding algorithms such as Cheetah at the start of development. However, some new features are incorporated in this "EZ" approach. Detailed peak-profiling analysis is built in, such as estimation of the peak area, centroid, weighted centroid (center of mass), direction and length of the long and short axes, aspect ratio, moments, etc. These features may aid research where peak profiling is needed (such as BCDI, pink-beam SX at the ASU CXLS, and mosaicity analysis). The EZ-Hit-Finding framework is left open for further development in the future.
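For illustration, the following is a minimal Python sketch of this general approach using scipy and scikit-image. It is not the actual EZ-Hit-Finding implementation, and it uses a fixed rather than adaptive threshold; the function and parameter names are chosen for this example only.
import numpy as np
from scipy import ndimage as ndi
from skimage import measure

def sketch_peak_finder(image, adc_thld=100.0, min_pix=5, max_pix=200):
    # Threshold into a binary image, label connected pixel groups,
    # then profile each candidate peak with regionprops.
    binary = image > adc_thld
    labels, _ = ndi.label(binary)
    props = measure.regionprops(labels, intensity_image=image)
    peaks = [p for p in props if min_pix <= p.area <= max_pix]
    profiles = []
    for p in peaks:
        profiles.append({
            'area': p.area,
            'centroid': p.centroid,                     # geometric centroid
            'center_of_mass': p.weighted_centroid,      # intensity-weighted centroid
            'long_axis': p.major_axis_length,
            'short_axis': p.minor_axis_length,
            'aspect_ratio': p.major_axis_length / max(p.minor_axis_length, 1e-12),
            'orientation': p.orientation,               # direction of the long axis
        })
    return profiles

# A frame would then count as a hit if len(profiles) >= min_peak.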
Below, I will show some basic usage of these tools.
Usage:
mpiexec -n 5 python -u /home/chufengl/CFL_reposit/NSLS_FMX_tools/NSLS_FMX_utils_mpis.py <Eiger_file_list> <adc_thld> <min_pix> <mask_file> <min_peak> <region>
Eiger_file_list: list file of the image files to be hit-found (usually generated with the bash "find" command)
adc_thld: pixel value threshold
min_pix: minimal number of pixels for a peak
mask_file: relative or absolute path to the mask file
min_peak: minimal number of peaks for a hit
region: options are 'ALL', 'C', and 'Q', meaning hit-finding on the full image, its central part, or one quadrant of it, to accelerate the hit-finding process and provide fast feedback.
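For example, an interactive run with illustrative parameter values (here None is passed for the mask file, as in the Slurm template below, and eiger_files.lst is a placeholder list name):
mpiexec -n 5 python -u /home/chufengl/CFL_reposit/NSLS_FMX_tools/NSLS_FMX_utils_mpis.py eiger_files.lst 100 5 None 10 ALL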
Single-event/frame and single-file hit-finding modes are also available as functions if NSLS_FMX_utils_mpis.py is imported as a module. These modes can be used for diagnosis. The comparison of computation performance has been researched by
. (can append the results here for a rough sense.)
Good examples of files and scripts can be found here: /bioxfel/chufengl/NSLS_FMX_tools/
To launch it through a batch job management system, e.g. Agave Slurm, you can use the following template:
#!/bin/bash
#SBATCH --job-name=HF_CFL
#SBATCH -p parallel
##SBATCH --mem-per-cpu=8000
#SBATCH -n 30
##SBATCH -N 1
#SBATCH -t 2-12:00
##SBATCH --mem-per-cpu=2000
##SBATCH --array=20%8
##SBATCH -A chufengl
#SBATCH -o HF_%j.out
#SBATCH -e HF_%j.err
##SBATCH --mail-type=ALL
#SBATCH --mail-type=ALL # notifications for job done & fail
#SBATCH --mail-user=chufengl@asu.edu # send-to address
export HDF5_PLUGIN_PATH=/bioxfel/software/h5plugin
mpiexec -n 30 python -u /home/chufengl/CFL_reposit/NSLS_FMX_tools/NSLS_FMX_utils_mpis.py ../../agave_test.lst 100 5 None 10 $1
The number of cores can be changed according to your needs; however, it is recommended to use fewer than 50, for a reason that will be discussed later.
Output files:
The suite mainly outputs three kinds of files:
HF_*.out: the console output in interactive mode, and the output file in batch mode; it reflects the current status of the ongoing hit-finding job.
*HIT-rank*.log: the basic hit-finding statistics of each process/worker.
*eve-rank*.lst: the list file of the events identified as hits, following the CrystFEL event-list convention.
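For reference, each entry in such an event list pairs a file path with an event identifier in CrystFEL's '//' notation, along the lines of the following (illustrative file name and event number):
/data/FMX/eiger_data_000001.h5 //3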
It is great to have a well-established HPC system coupled with data collection and analysis, such as psana at LCLS. However, for some light sources and data repository systems, optimization of the computation job management is lacking. Therefore, it is an important problem at the strategic level to organize the computational resources and match the hit-finding to the data collection approach so that fast real-time feedback can be provided.
With this in mind, the computation performance evaluation module was designed to evaluate the computation time of each process. An object-oriented framework is adopted, so that it can be easily modified, further developed, and potentially interfaced to a GUI. Classes are defined for each hit-finding working folder, and each working folder can be treated as an object instance of these classes. Attributes and methods are created for the classes, such as
.folder_name,
.out_file_name,
.comp_time_lst,
.comp_time_mean,
.comp_time_min,
.comp_time_max,
.Get_hit_stats_single(rank=),
.Get_hit_stats_all()
These attributes and methods allow us to instantly access the computation status and the hit-finding statistics on-the-fly.
The classes are defined in the module comput_info.py. When the script is run as a main program, it provides diagnostics of the computation performance:
'''
comput_info.py constructs the object-oriented framework to extract the computation information.
Usage:
python comput_info.py <path> <key_word>
<path>: the path to the computation trial folder_list
<key_word>: the key_word of the trial folders to be used with the wildcards.
e.g. : python comput_info.py . core
'''
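For interactive use, the same classes can be imported instead of running the script as a main program. The snippet below is only a hypothetical sketch: the class name CompFolder and its constructor argument are assumptions for illustration, while the attributes and methods are the ones listed above.
import comput_info

trial = comput_info.CompFolder('core_30')        # hypothetical class name; one working folder per instance
print(trial.folder_name, trial.out_file_name)
print(trial.comp_time_mean, trial.comp_time_min, trial.comp_time_max)
trial.Get_hit_stats_single(rank=0)               # hit-finding statistics of one MPI rank
trial.Get_hit_stats_all()                        # statistics aggregated over all ranks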
Fig. 1. The computation time (seconds) as a function of the number of processes.
Fig. 2. The on-the-fly hit rate as a function of time.
Discussion:
We can see from Fig. 1 that the computation time does not necessarily decrease as the number of processes increases. This is because, when more processes read different chunks of the same file, I/O becomes the speed bottleneck instead of the hit-finding computation itself. For a system without an HPC job management layer, such as NSLS-FMX, the best strategy might be to wrap a smaller number of frames into each .h5 file. This breaks the file-reading bottleneck and provides efficient, fast feedback, as in Fig. 2.
Future work includes perfecting the object-oriented framework so that it can be readily deployed to GUIs.
To extend the "fast-n-EZ" peak-finding features to .cxi files, some tools have been developed to
a) redo peak-finding after Cheetah, or after .cxi files have been obtained from XTC;
b) output the peak-list information as a small, portable .pkl file with complete and retrievable information;
c) write peak lists to .cxi file paths, to avoid redoing hit-finding with CrystFEL, which only outputs the peak lists in the .stream file.
https://github.com/chufengl/NSLS_FMX_tools
Usage:
CXI_hit_finder_EZ.py <CXI_file_name> <thld> <min_pix> <max_pix> <mask_file> <min_peak>
thld: pixel value threshold
min_pix: minimal number of pixels for a peak
max_pix: maximal number of pixels for a peak
mask_file: name of the mask file
min_peak: minimal number of peaks for a hit
By running this script, you can redo the hit finding rather quickly; in my tests it is fast even on a laptop. It is designed to run on a single .cxi file, but you can write a bash script to launch an array of jobs.
It will yield a .pkl file that includes all the peak-finding information, such as the peak number, peak coordinates, and peak intensities. This file is needed in the next step.
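As a quick way to inspect that output, the .pkl file can be loaded with the standard pickle module. This is only a sketch: the file name is illustrative, and the exact layout of the pickled object is not specified here, so inspect it before use.
import pickle

with open('run0001_peak_info.pkl', 'rb') as f:   # illustrative file name
    peak_info = pickle.load(f)
print(type(peak_info))                           # check the layout (dict, list, ...) before using it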
If you need to tweak your hit-finding parameters (quasi-)interactively, there is a "Parameter_tweaking mode" for you:
First, open an ipython session. Second, import CXI_hit_finder as a module. Third, run the following:
label_filtered_sorted,weighted_centroid_filtered,props= \
CXI_hit_finder.single_peak_finder(CXI_file_name,event_no,thld,min_pix,max_pix,mask_file,interact='True')
When you are satisfied with the parameters, then you can use them for the whole data set.
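After the call, the returned values can be inspected to judge the parameters, e.g. (a trivial check; the exact contents of each returned object are not restated here):
print(weighted_centroid_filtered)    # peak positions found with the current parameters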
2. peak_list_write.py :
Usage:
python peak_list_write.py <w_CXI_file_name> <pkl_file_name>
w_CXI_file_name: the .cxi file to which the peak list is written;
usually, this should be a copy of the raw file.
pkl_file_name: the .pkl file generated from CXI_hit_finder_EZ.py
You can use it to write the peak lists from the hit-finding stage into the .cxi file, following the Cheetah and CrystFEL convention. In this way, the .cxi files can be handled directly by CrystFEL in the indexing stage, without spending a long time on hit-finding each time you change your parameters (the peak lists are otherwise recorded only in the .stream file, which is difficult to interface with other tools; maybe T. White has a way to record the peak lists in some other form).
Attention: to protect the raw data, the peak lists are NOT necessarily written to the raw .cxi file. Instead, you can make a copy of the raw .cxi file in case you do not wish to alter it.
This copy is the "w_CXI_file_name" in the syntax. For data safety, the user has to use "chmod" to give write permission to the file so that the writing can succeed; this is deliberately not done in the code.
To check whether the peak lists have been successfully written into the designated .cxi file, use “h5dump -n” to confirm.
The new peak-list HDF5 path should be "/entry_1/CFL_peaks/". You can change the path name as you wish. This does not conflict with the default "/entry_1/result_1/" used by Cheetah peak-finding,
which means that peak lists from Cheetah (if any) and from the CXI_EZ peak-finding tools can coexist in the .cxi file.
Therefore, modify your CrystFEL indexing command so that it includes "--peaks=cxi --hdf5-peaks=<path>", so that indexing uses the correct peak lists.
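For example (illustrative file names; only the --peaks and --hdf5-peaks options are specific to this workflow):
indexamajig -i hits_eve.lst -o indexed.stream -g detector.geom -p mycell.cell --peaks=cxi --hdf5-peaks=/entry_1/CFL_peaks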
You can control the peak detection on the command line. Firstly, you can choose the peak detection method using --peaks=method. There are three possibilities for "method" here. --peaks=hdf5 will take the peak locations from the HDF5 file. It expects a two dimensional array, by default at /processing/hitfinder/peakinfo, whose size in the first dimension equals the number of peaks and whose size in the second dimension is three. The first two columns contain the fast scan and slow scan coordinates, the third contains the intensity. However, the intensity will be ignored since the pattern will always be re-integrated using the unit cell provided by the indexer on the basis of the peaks. You can tell indexamajig where to find this table inside each HDF5 file using --hdf5-peaks=path.
--peaks=cxi works similarly to this, but expects four separate HDF5 datasets beneath path, nPeaks, peakXPosRaw, peakYPosRaw and peakTotalIntensity. See the specification for the CXI file format at http://www.cxidb.org/ for more details.
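The following is a minimal h5py sketch of what this layout looks like when written by hand. It illustrates the CXI peak-list convention only; it is not the actual peak_list_write.py implementation, and the file name, array sizes, and zero-filled values are placeholders.
import h5py
import numpy as np

n_frames, max_peaks = 544, 1024                       # illustrative sizes
n_peaks = np.zeros(n_frames, dtype=np.int64)          # number of peaks per frame
fs_pos = np.zeros((n_frames, max_peaks))              # fast-scan peak coordinates
ss_pos = np.zeros((n_frames, max_peaks))              # slow-scan peak coordinates
intensity = np.zeros((n_frames, max_peaks))           # peak total intensities

with h5py.File('w_copy_of_raw.cxi', 'a') as f:        # a writable copy of the raw file
    grp = f.require_group('/entry_1/CFL_peaks')
    grp.create_dataset('nPeaks', data=n_peaks)
    grp.create_dataset('peakXPosRaw', data=fs_pos)
    grp.create_dataset('peakYPosRaw', data=ss_pos)
    grp.create_dataset('peakTotalIntensity', data=intensity)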
Fig. 3. Configuration of the CPU-memory system on which the speed test was performed.
Fig. 4. Single thread: a 1 GB .cxi file with 544 events took 37 seconds of hit finding with the "ALL" region option, i.e. about 68 ms per pattern on average.
Because different X-ray detectors record the collected data in different formats, the EZ-hit-finder has recently been extended to accept as many file formats as needed, for more general applicability.
The approach to this generalization is to divide the pipeline into the following stages:
listing all image files/events; this can easily be done with bash tools such as "find";
reading each image file, whatever its format, into memory, e.g. as a numpy ndarray;
feeding the in-memory image to the hit-finder kernel;
outputting log files and event lists, for real-time feedback and later file transfer;
saving peak lists, including the number of peaks, the peak intensities, and the coordinates, in the same convention that Cheetah and CrystFEL adopt, in a .pkl binary file;
(optional) writing the peak lists back to file (mostly .h5 or .cxi files) for later processing, such as indexing and integration.
Upon the demands of data collection and pre-processing at ALBA, the EZ-hit-finder was first extended to handle .cbf files as a test case. For all other image file formats, FabIO is imported as the upstream module to read the image data and convert it to the in-memory form, as sketched below. The development version can also be found here.
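A minimal sketch of this format-agnostic reading stage; the function run_hit_finder_kernel is only a placeholder name for the kernel described above, and the file name is illustrative:
import fabio
import numpy as np

def load_image(path):
    # Read a detector image in any format supported by FabIO into a numpy ndarray.
    return fabio.open(path).data.astype(np.float64)

img = load_image('example_00001.cbf')   # illustrative file name
# hand the in-memory image to the hit-finder kernel (placeholder name):
# hits = run_hit_finder_kernel(img, thld=100, min_pix=5, max_pix=200, min_peak=10)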
Repository:
https://github.com/chufengl/NSLS_FMX_tools
Script:
Setting up the environment:
install Anaconda3
conda env create -f ALBA_SX.yml
conda activate EZ_hit-finder (source activate EZ_hit-finder)
Usage:
ALBA_SX_uitils.py <cbf_file_list_file> <thld> <min_pix> <max_pix> <mask_file> <min_peak> <Region>
thld: pixel value threshold
min_pix: minimal number of pixels for a peak
max_pix: maximal number of pixels for a peak
mask_file: name of the mask file
min_peak: minimal number of peaks for a hit
Region: 'ALL', 'Q', or 'C'
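e.g. (illustrative list name, parameter values, and mask file):
python ALBA_SX_uitils.py test_split.lst00000 100 5 200 mask.npy 10 ALL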
Parameter_tweaking mode (Run in ipython env, after importing the 'ALBA_SX_uitils_mpi.py' as a module):
label_filtered_sorted,weighted_centroid_filtered,props= \
single_peak_finder(CBF_file_name,thld,min_pix,max_pix,mask_file,interact='True')
All test results are in the folder:
/data/bioxfel/data/2018/ALBA-2018-Sep-MartinGarcia/analysis/chufengl/scripts
test_split.lst00000HIT.log is the log file of the hit-finding process;
test_split.lst00000eve.lst is the event list of the "hits";
test_split.lst00000_peak_info.pkl is the peak-list information, which can be used later for indexing.
Tips:
Split the .cbf file list into smaller batches using "split -d", so that multiple hit-finding processes can be launched simultaneously; see the example commands after this list.
Copy only the "hits" to your hard drive, using "rsync --files-from=*.eve.lst".
Always check the configuration of the computation system before launching multiple processes.
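For example (illustrative list names, chunk size, and paths; if the event list contains anything other than bare file names, strip the extra fields before feeding it to rsync):
split -a 5 -d -l 500 all_cbf.lst test_split.lst
rsync -av --files-from=test_split.lst00000eve.lst /path/to/raw/cbf/ /local/hits/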
On Agave with a single process, one image takes 160 ms on average (for the 'ALL' region option).
With Region='C', each pattern takes 74 ms on average, and the hit rate remains almost unchanged.