We have organized a data science competition to stimulate both the ML and HEP communities to renew the toolkit of physicists in preparation for the advent of the next generation of particle detectors at the Large Hadron Collider at CERN. With event rates already reaching hundreds of millions of collisions per second, physicists must sift through tens of petabytes of data per year. Ever better software is needed for processing and filtering the most promising events. This has allowed the LHC to fulfill its rich physics programme: understanding the private life of the Higgs boson, searching for the elusive dark matter, or elucidating the dominance of matter over anti-matter in the observable Universe.
To mobilise the scientific community around this problem, we have organized the TrackML challenge, whose objective is to use machine learning to quickly reconstruct particle tracks from points left in the silicon detectors. The challenge has been conducted in two phases:
- the Accuracy phase (May-Aug 2018): favoring innovation in algorithms reaching the highest accuracy, with no speed concern. This phase was accepted as an official IEEE WCCI 2018 competition (Rio de Janeiro, July 2018) and was hosted by Kaggle. It is now completed and the winners have been announced.
- the Throughput phase (Sep 2018 - March 2019): focussing on speed optimisation. This phase is an official NeurIPS 2018 competition (Montreal, December 2018). The Throughput phase ran on Codalab.
We will have a grand final workshop at CERN with prize delivery in spring 2019. Subsequent events will be organized, in particular a dedicated workshop on tracking at University Paris-Saclay in 2019 (already funded). The challenge dataset will remain available for further studies on the CERN Open Data portal.
In more detail: in each collision, about 10,000 tracks (helical trajectories originating approximately from the center of the detector) each leave about 10 precise 3D points. The core pattern recognition task of tracking is to associate the resulting 100,000 3D points into tracks. Current studies show that traditional algorithms suffer from a combinatorial explosion of CPU time.
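To give a feel for the combinatorial explosion, the short sketch below counts how many three-point candidates a completely naive search would have to consider, using the round numbers quoted above. It is only an illustration: the variable names are ours, and practical algorithms use geometry to discard most of these combinations up front.

```python
import math

# Round numbers taken from the text above.
n_tracks = 10_000
points_per_track = 10
n_points = n_tracks * points_per_track  # about 100,000 points per collision

# Every possible group of three points, with no geometric pruning at all.
naive_triplets = math.comb(n_points, 3)

# Groups of three points that really belong to the same track.
true_triplets = n_tracks * math.comb(points_per_track, 3)

print(f"points per collision : {n_points:,}")
print(f"naive triplets       : {naive_triplets:.3e}")
print(f"same-track triplets  : {true_triplets:,}")
print(f"useful fraction      : {true_triplets / naive_triplets:.1e}")
```

Only a tiny fraction of the naive candidates correspond to real tracks, which is why brute-force enumeration does not scale.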
There is a strong potential for application of Machine Learning techniques to this tracking issue. The problem can be related to representation learning, to combinatorial optimization, to clustering (associating together the hits which were deposited by the same particle), and even to time series prediction. An essential question is how to efficiently exploit a priori knowledge about geometrical constraints (structural priors).
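As a toy illustration of the clustering view and of a structural prior, the sketch below groups hits by their direction as seen from the origin. It rests on strong simplifying assumptions that are ours, not the challenge's: tracks are straight lines from the origin (real tracks are helical), there is no noise, and the function names, bin width, and radii are arbitrary choices for the example.

```python
import math
from collections import defaultdict

def toy_hits(directions, radii):
    """Straight tracks from the origin: one hit per (track, radius)."""
    hits = []
    for particle_id, (phi, theta) in enumerate(directions):
        for r in radii:
            x = r * math.sin(theta) * math.cos(phi)
            y = r * math.sin(theta) * math.sin(phi)
            z = r * math.cos(theta)
            hits.append((particle_id, x, y, z))
    return hits

def cluster_by_direction(hits, bin_width=0.05):
    """Group hit indices whose (phi, theta) fall into the same angular bin."""
    groups = defaultdict(list)
    for hit_index, (_, x, y, z) in enumerate(hits):
        phi = math.atan2(y, x)                   # azimuthal angle
        theta = math.atan2(math.hypot(x, y), z)  # polar angle
        key = (round(phi / bin_width), round(theta / bin_width))
        groups[key].append(hit_index)
    return list(groups.values())

# Three toy particles, ten hits each, at arbitrary radii.
directions = [(0.3, 1.0), (1.2, 1.4), (-2.0, 0.8)]
radii = [30.0 * (i + 1) for i in range(10)]
hits = toy_hits(directions, radii)
clusters = cluster_by_direction(hits)
print(len(clusters), [len(c) for c in clusters])
```

The prior used here is that all hits of one particle share a direction from the origin. With curved trajectories and measurement noise this simple binning breaks down, and that is where learned representations or more careful geometric models come in.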
The score is based on the fraction of points which have been correctly associated together.
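A minimal sketch of such a score is given below. It is a simplified, unweighted illustration and not the official scoring code: the matching rule (a reconstructed track counts only if most of its points come from one particle and it contains most of that particle's points) and the function name are assumptions made for this example.

```python
from collections import Counter

def fraction_correctly_grouped(truth, predicted):
    """truth: hit_id -> particle_id; predicted: hit_id -> track_id.

    Returns the fraction of hits that sit in a well-matched track.
    """
    particle_size = Counter(truth.values())

    # Collect the hits assigned to each reconstructed track.
    track_hits = {}
    for hit_id, track_id in predicted.items():
        track_hits.setdefault(track_id, []).append(hit_id)

    correct = 0
    for hit_ids in track_hits.values():
        # Most frequent true particle among the hits of this track.
        counts = Counter(truth[h] for h in hit_ids)
        particle_id, n_shared = counts.most_common(1)[0]
        majority_of_track = 2 * n_shared > len(hit_ids)
        majority_of_particle = 2 * n_shared > particle_size[particle_id]
        if majority_of_track and majority_of_particle:
            correct += n_shared
    return correct / len(truth)

truth = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "b"}
predicted = {1: 10, 2: 10, 3: 10, 4: 20, 5: 20, 6: 30}
score = fraction_correctly_grouped(truth, predicted)
print(score)
```

In this example particle "a" is fully recovered, particle "b" is split into a two-point track that still counts and a one-point track that does not, so five of the six points are credited.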
The Accuracy phase focusses on reaching the highest score. Participants train on a large dataset, apply their algorithm to a test dataset (for which the ground truth is held out on the challenge platform), and submit their solution (a text file specifying the grouping of the 3D points). The challenge platform computes the score from a fraction of the test dataset and updates the leaderboard live. At the end of the challenge, the score obtained from the held-out dataset (reaching per-mille accuracy) is used to compute the final leaderboard and determine the winners.
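As an illustration of what such a solution file might look like, the sketch below writes one line per 3D point with the identifier of the track it was assigned to. The column names (`event_id`, `hit_id`, `track_id`) and the helper `write_submission` are assumptions for this example; the exact format is defined by the challenge platform.

```python
import csv
import io

def write_submission(assignments, out):
    """Write (event_id, hit_id, track_id) rows as comma-separated text."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["event_id", "hit_id", "track_id"])  # assumed header
    for event_id, hit_id, track_id in sorted(assignments):
        writer.writerow([event_id, hit_id, track_id])

# Three points of one event, two of them grouped into the same track.
buffer = io.StringIO()
write_submission([(0, 2, 7), (0, 1, 7), (0, 3, 8)], buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

Writing to a file instead only requires passing an open text file handle in place of the `io.StringIO` buffer.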
Submissions accompanied by documented software have been examined by a jury to award additional prizes (an NVIDIA Titan V100 GPU, an invitation to NeurIPS 2018 or to the CERN spring 2019 workshop) based on the originality of the algorithm with respect to the combinatorial approach of the HEP domain.
The Throughput phase focusses on speed. There is still no limit on the resources used for training. However, the evaluation takes place on the challenge platform in a controlled environment. The software (written in C/C++, Python, or Go), possibly using open source libraries, should run in a Docker container with 2 i686 processor cores and 4 GB of memory. The score used in this phase combines the accuracy used in the first phase with the speed.