Small Data

Also in confluence: LINK

Introduction

Typical data acquisition rates at XPP range from 0.1 to 1 GB/s. For a typical 10-minute scan, the raw data file size is therefore on the order of 50-500 GB. Crunching through such a large data volume, even with only primitive arithmetic operations, can be time consuming.

For many established types of measurements at XPP, we developed the SMALL DATA format to serve as a starting point for user data analysis. It is a preprocessed, condensed, smaller HDF5 file. Its content is configurable based on the type of measurement and the analysis requirements. The size is typically 2-3 orders of magnitude smaller than the raw data file. The SMALL DATA generation/translation is now based on general psana code with a hutch-specific set of "default" detectors.

The production can be run from a standalone script. It is straightforward to add user code in Python that works on e.g. waveforms or images and adds the results to the littleData. There is still an option to add similar code to the standard HDF5 translator, but this is a more complicated procedure and only recommended if you have prior experience and do not expect to need much help. For fast turnaround, the standard translator needs the data to be broken into steps, something that introduces wait times and is not necessary for every scan. For details please contact your POC/PAL in advance of your experiment.

The SMALL DATA is very 'portable' for most experiments, with file sizes for a typical run on the order of 1 GB. Recent users have adapted our Matlab examples into Mathematica, Igor, Python, etc.
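Since the small data is a plain HDF5 file, it can be opened with any HDF5 library. The sketch below builds a tiny stand-in file and reads it back with h5py; the field names (ipm2/sum, lightStatus/laser, lightStatus/xray) follow the conventions described in this document, but real files contain many more fields, with one entry per event.

```python
import h5py
import numpy as np

# Build a tiny stand-in file mimicking the smallData layout
# (real files have many more fields, one entry per event).
with h5py.File("smalldata_demo.h5", "w") as f:
    f["ipm2/sum"] = np.random.rand(1000)               # beamline I zero per event
    f["lightStatus/laser"] = np.ones(1000, dtype=int)  # laser on/off marker
    f["lightStatus/xray"] = np.ones(1000, dtype=int)   # x-ray on/off marker

# Reading back is plain h5py: slash-separated names address the groups
with h5py.File("smalldata_demo.h5", "r") as f:
    print(list(f.keys()))                              # top-level groups
    i0 = f["ipm2/sum"][:]
    laser_on = f["lightStatus/laser"][:].astype(bool)
```

Each dataset is a flat per-event array, so event selection reduces to boolean indexing with numpy.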

What's in the small data

The small data by default contains most analysis-relevant point detector values, event IDs and markers. Below is a list with brief descriptions, roughly in order of relevance for most users. The fields are applicable for all hutches, either through smallData (fields in red) or the "default detector" setup (fields in magenta).

  • lightStatus: contains binary flags for the laser and X-rays, indicating the light status of each event.

  • ipm2: beamline I zero, downstream of the Be lens, upstream of the attenuators.

  • ipm3: last beamline I zero before the beam exits through the beamline diamond window.

  • diodeU: user point detectors, usually referred to as user IPM.

  • scan: values of the motors/virtual-motors that were being scanned. Also referred to commonly as the control PVs for the DAQ.

  • enc: readback positions of selected motors. Typically contains "lasDelay" for the calibcycle-less way of doing delay scans at XPP. enc/lasDelay is in ps (as is the per-event time tool correction).

  • tt: time tool correction values. tt/ttCorr contains the correction value with the calibration applied in the littleData creation. More details below.

  • "UserData": these are generated by user plug-ins such as ROIs and have names defined by the user. This is described in more detail below.

  • damage: binary data about the status of the detectors in the data. If 0 for a detector in use, the event needs to be rejected.

  • epics: common EPICS PVs (slits, goniometer motors and a few standard laser motors)

  • epicsUser: user added experiment specific PVs (user motors, temperatures of cryojets/lakeshore read backs,...)

  • ebeam: electron beam parameters. The most typically used one is L3Energy, the electron energy measured at the beginning of the undulator, which is related to the actual photon energy of each pulse.

  • phase_cav: the output values of the Joe Frisch phase cavity for electron beam arrival time monitoring. A good indicator of the machine's overall timing status; it was used for timing correction in the early days, but is less used nowadays since 'tt' took over.

  • gas_detector: FEE gas detector measurement of the pink-beam pulse energy; this is upstream of the HOMS.

  • ipm1: beamline I zero upstream in hutch 2, largely irrelevant for user analysis.

  • ipm1c: beamline I zero upstream in hutch 2, largely irrelevant for user analysis.

  • ipm_hx2: beamline I zero upstream in hutch 2, wave8 version. Only useful for experiments in pink beam.

  • lombpm: dectris quad detector on the LODCM diagnostic tower.

  • lomdiode: additional PIPS diodes in the LODCM. ch0 is the PIPS diode on the LODCM diagnostic tower for photon energy calibration. ch1 is the PIPS diode on the LODCM 2nd xtal tower for initial alignment of the second crystal.

  • UserDataCfg: configuration information.

  • adc: analog input/output voltages.

  • evr: contains binary event code status for each event code, such as Evr/140.

  • fiducials/event_time/EvtID (old): contains the event fiducial and time stamp.

  • l3t: level-3 trigger status used during data acquisition; the thresholds were typically set up via xpppython.

If you have older data, the fields might have somewhat different names, but they should be similar enough to find them in this list.
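A typical first step with these fields is event selection: reject events flagged in "damage" and split the rest by light status. The sketch below demonstrates this on a mock file; the field names (damage/ipm2, lightStatus/laser, ipm2/sum) follow the list above, but the exact set in your file depends on the configured detectors.

```python
import h5py
import numpy as np

# Mock smallData file: one entry per event, field names as in the list above
n = 500
with h5py.File("smalldata_filter_demo.h5", "w") as f:
    f["damage/ipm2"] = (np.arange(n) % 50 != 0).astype(int)  # 0 = bad event
    f["lightStatus/laser"] = (np.arange(n) % 2).astype(int)  # alternating pump
    f["ipm2/sum"] = np.random.rand(n)

with h5py.File("smalldata_filter_demo.h5", "r") as f:
    good = f["damage/ipm2"][:] == 1            # reject damaged events
    laser_on = f["lightStatus/laser"][:] == 1  # pumped events
    sel = good & laser_on                      # pumped AND undamaged
    i0 = f["ipm2/sum"][:][sel]
```

The same boolean mask can be applied to every other per-event array (delays, ROI sums, ...), which keeps all fields aligned event by event.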

Timetool information:

In addition to tt/ttCorr, we keep the results of the online algorithm (on which ttCorr is also based). The algorithm fits a step in the ratio of the current event's data and a reference image. The step position corresponds to a given time delay between the optical laser and the X-rays.

  • FLTPOS The step position in pixels.

  • FLTPOS_PS The step position in picoseconds. This uses the calibration available while the data was taken.

  • FLTPOS_FWHM The step width. This is a useful variable to decide if a fit is believable.

  • AMPL The step amplitude. Should correlate with the X-ray intensity (e.g. IPM2/sum).
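In practice the corrected delay per event is the nominal (encoder) delay plus the time tool correction, and the fit results are used to reject events with an unbelievable fit. A minimal sketch on mock data, assuming the online-algorithm results are stored under tt/ (exact names and sensible cut values vary between experiments):

```python
import h5py
import numpy as np

# Mock per-event data: nominal delay plus time tool fit results
n = 300
rng = np.random.default_rng(0)
with h5py.File("tt_demo.h5", "w") as f:
    f["enc/lasDelay"] = np.repeat(np.linspace(-1, 1, 3), 100)  # nominal delay, ps
    f["tt/ttCorr"] = rng.normal(0.0, 0.1, n)                   # per-event correction, ps
    f["tt/FLTPOS_FWHM"] = rng.uniform(50, 300, n)              # step width, pixels
    f["tt/AMPL"] = rng.uniform(0.0, 0.1, n)                    # step amplitude

with h5py.File("tt_demo.h5", "r") as f:
    delay = f["enc/lasDelay"][:] + f["tt/ttCorr"][:]           # corrected delay, ps
    # reject events where the step fit is too wide or too weak
    # (cut values here are illustrative, not recommended defaults)
    sel = (f["tt/FLTPOS_FWHM"][:] < 250) & (f["tt/AMPL"][:] > 0.01)
    delay_sel = delay[sel]
```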

User Data:

In addition, as configured, the little data can contain different forms of (ideally) condensed area or waveform detector information. This is where the data reduction happens. This content is located within UserData, with names definable by the users. Typical options are:

  • A region of interest of the area detector. Applicable when the focus is e.g. a small diffraction peak.

    • the sum and center-of-mass of these ROIs are always stored. It is possible to also store all the pixels in that area and/or to add projections.

  • Radial integrations of a liquid/gas scattering pattern, or a powder diffraction pattern.

    • this can also do a 2-d bin with q & phi bins.

  • A low resolution (binned) image of the whole detector.

  • Photon 'droplets'

  • Full resolution (time) sorted and binned images. This is also typically referred to as the cube.

Users somewhat familiar with python/numpy/scipy are welcome to develop their own preprocessing/condensing methods.
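To illustrate what such a condensing step computes, here is a minimal numpy sketch of the ROI sum and center-of-mass mentioned above. The function name and ROI convention are illustrative only, not the actual plug-in interface:

```python
import numpy as np

def roi_features(img, roi):
    """Sum and center-of-mass of a rectangular ROI (r0, r1, c0, c1)."""
    r0, r1, c0, c1 = roi
    sub = img[r0:r1, c0:c1].astype(float)
    total = sub.sum()
    rows, cols = np.indices(sub.shape)
    if total > 0:
        com = (r0 + (rows * sub).sum() / total,   # row COM in detector coords
               c0 + (cols * sub).sum() / total)   # column COM in detector coords
    else:
        com = (np.nan, np.nan)                    # empty ROI: COM undefined
    return total, com

# A small "diffraction peak" on an otherwise empty detector image
img = np.zeros((100, 100))
img[40:43, 60:63] = 1.0
total, com = roi_features(img, (30, 60, 50, 80))
print(total, com)   # 9.0 (41.0, 61.0)
```

Storing only these few numbers per event, instead of the full image, is what shrinks the file by orders of magnitude.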

More User Data:

It is now also possible to store more "exotic" user data, e.g. generated from correlating two detectors or performing user-defined procedures on detector data that DetObject does not yet provide.

More details about setting up user data are given here: 3. Configuring User Data using SmallDataAna_psana

Binned Data:

There is an additional format where data is stored as a sum over events falling into a chosen "bin", typically delay time bins. This is a second processing step relying on the "smallData"; only the default smallData is necessary. This allows for optimal memory treatment when using many time bins, and it restricts processing to only those events that pass a given event selection. The code can be run as post-processing on smallData files, and it is also possible to add full-resolution images of the larger area detectors. This is also typically referred to as the cube. More information on this can be found here: 4. Binned Data Production
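The core idea, selecting events and summing a per-event signal into delay bins, can be sketched with plain numpy (synthetic data; the real cube production handles full detector images and runs in parallel):

```python
import numpy as np

# Synthetic per-event data standing in for smallData fields
rng = np.random.default_rng(1)
n = 2000
delay = rng.uniform(-2, 2, n)              # corrected delay per event, ps
signal = rng.random(n)                     # e.g. an ROI sum per event
good = rng.random(n) > 0.1                 # event selection (damage, I0 cuts, ...)

edges = np.linspace(-2, 2, 21)             # 20 delay bins of 0.1 ps... here 0.2 ps
idx = np.digitize(delay[good], edges) - 1  # bin index for each selected event

# Sum signal and count events per bin (this is what the "cube" stores)
sums = np.zeros(len(edges) - 1)
counts = np.zeros(len(edges) - 1, dtype=int)
for i, s in zip(idx, signal[good]):
    if 0 <= i < len(sums):
        sums[i] += s
        counts[i] += 1
binned = sums / np.maximum(counts, 1)      # per-bin average signal
```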

How to generate the small data

The small data generation takes advantage of the local development of the PSANA framework. It can be customized and run both against the xtc files while they are being written on the ffb system (ongoing experiments only) and offline against the .xtc files. Parallel computing is built in. For a typical experiment, at the moment, Silke will help set up the little data processing/generation in the directory /reg/d/psdm/xpp/xpp*****/results/smalldata_tools.

A "driver" python file (typically called SmallDataProducer.py) can then be edited to, e.g., choose and optimize a different ROI on area detectors, define beam center for radial integration, define delay time range and bin sizes, etc.

The default output will be saved to

/reg/d/psdm/xpp/xpp*****/hdf5/smalldata

To run a single DAQ run, you can use (in the res directory or in your own private release):

smallDataRun -r <#> -e xpp....

During data taking, you can omit the "-e experiment_name" parameter and the jobs will be sent to a special queue with priority access.

The full list of options is here:

smallDataRun options

(ana-1.2.9) snelson@psanaphi107:/reg/d/psdm/xpp/xpptut15/results/smalldata_tools$ ./examples/smallDataRun -h

usage: ./examples/smallDataRun options

OPTIONS:

-r run# (NEEDED!)

-e expname (def: current experiment, example: xppe7815)

-d directory for output files (default: ftc diretory of specified experiment)

-q queue name (default: psanaq if exp specified, psneh(hi)prioq for current exp)

-j #: run on # number of cores

-n number of events (for testing)

-s run locally

During data taking we typically run a "procServ" process (similar to a cron job) that checks the database for the last run number; if it finds one that has not been processed yet, it submits a job to the batch queue. The number of jobs is tuned to use as few cores as necessary to process data at data-taking speed, keeping the time before the files are available to a minimum while keeping the queue as empty as possible. This is useful in cases where the reduction parameters are stable for sets of runs (most XPP and XCS experiments fall into this category).
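The watcher logic can be sketched as below. This is a hedged illustration only: get_last_run is a stand-in for the database query, and the real procServ setup differs in its details.

```python
import subprocess

def check_and_submit(get_last_run, submitted, dry_run=True):
    """One polling iteration: submit any run not yet processed.

    get_last_run is a placeholder for the run-database query; dry_run
    skips the actual submission so the sketch is safe to run anywhere.
    """
    run = get_last_run()                        # latest run number, or None
    if run is not None and run not in submitted:
        cmd = ["./examples/smallDataRun", "-r", str(run)]
        if not dry_run:
            subprocess.run(cmd, check=True)     # would submit the batch job
        submitted.add(run)                      # remember processed runs
    return submitted

# One polling iteration with a stubbed run-number source:
seen = check_and_submit(lambda: 42, set())
```

In production this runs in a loop with a short sleep between iterations, so each new run is picked up within seconds of appearing in the database.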

Configuration of the smallData

How to configure smallData, in particular how to extract features of "big data" area detectors, is described on its own subpage.

(Optional:) Create your local copy of little data generation module

First, create your own working directory and setup the psana environment:

source /reg/g/psdm/etc/psconda.sh

Then, check out smalldata_tools from GitHub:

https://github.com/slac-lcls/smalldata_tools

smallDataRun contains the name of the driver script (by default SmallDataProducer); if you did not follow the steps above exactly, check that it points to the script you are editing.

Analysis Examples - python

While you can simply use the h5py python module to read the data, we have created an IPython-based analysis module that also comes with xppmodules.

You can start the interactive ipython session:

smalldata_tools/examples/runSmallDataAna -r <#> [-e <expname>]

If no experiment name is passed, the current experiment is assumed. Two objects will be created. The first is called "anaps"; this object needs the xtc files to be present, since it relies on the psana machinery, and is predominantly meant to help with setting up the littleData reduction. These functions are described and illustrated here: 4. Configuring User Data with the help of anaps. The second object, "ana", helps you take a look at your littleData and can do many simple analysis jobs. More details and examples can be found here: 5. Analyzing your data with "ana". Tab completion will let you see the functions available for either object.

Analysis Examples - MATLAB - (careful: old)

The content of the little data can be browsed through HDFView. You will immediately see that the new format has fewer layers of structure and more intuitive data field names. A first example of Little Data files and the corresponding simple Matlab analysis is located at

/reg/g/xpp/xppcode/matlab/tutorial/LittleData/cs140

This example illustrates the analysis of a diffraction peak intensity as a function of pump-probe delay with basic time sorting. We thank Wei-Sheng Lee for kindly agreeing to share some of the scans from his previous experiment for the tutorial.

A second example involves the processing of the whole big CSPAD with time sorting. Due to the requirements of Q-space coverage and resolution, the data condensing cannot be achieved through an ROI or image resolution reduction. As a result, the output in these so-called 'cube' HDF5 files is not event based any more but rather time-bin based. The events are filtered and sorted based on predetermined criteria, and only the sum of all events falling into each designated time bin is saved. The current version of the preprocessed output can be found at
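Reading such a time-bin based file differs from the event-based case only in that the first axis is the delay bin. A sketch on a mock file; the field names used here (binVar, cspad_sum, nEntries) are illustrative, not the final convention:

```python
import h5py
import numpy as np

# Hypothetical cube layout: the first axis is the delay bin, not the event
nbins, ny, nx = 10, 32, 32
with h5py.File("cube_demo.h5", "w") as f:
    f["binVar"] = np.linspace(-1, 1, nbins)     # bin centers (delay, ps)
    f["cspad_sum"] = np.ones((nbins, ny, nx))   # summed detector image per bin
    f["nEntries"] = np.full(nbins, 100)         # events summed into each bin

with h5py.File("cube_demo.h5", "r") as f:
    # Per-event average image in each delay bin
    avg = f["cspad_sum"][:] / f["nEntries"][:][:, None, None]
```

Normalizing each summed image by its event count, as above, is the usual first step before comparing different delay bins.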

/reg/g/xpp/xppcode/matlab/tutorial/LittleData/cspad

The final naming convention and structure are still under discussion, but the example gives a rough idea of the process for now. This is part of the dataset that led to the published femtosecond diffuse scattering work that can be found here.

You will still need some of the XPP hdf5 importing functions to run the examples. They are located at

/reg/g/xpp/xppcode/matlab/hdf5tools
