Can Machine Learning (ML) assist High Energy Physics (HEP) in discovering and characterizing new particles?

We are organizing a data science competition to encourage the ML and HEP communities to renew the physicists' toolkit in preparation for the next generation of particle detectors at the Large Hadron Collider (LHC) at CERN. With event rates already reaching hundreds of millions of collisions per second, physicists must sift through tens of petabytes of data per year. Ever better software is needed to process and filter the most promising events. This will allow the LHC to fulfill its rich physics programme: understanding the private life of the Higgs boson, searching for the elusive dark matter, and elucidating the dominance of matter over anti-matter in the observable Universe.

To mobilise the scientific community around this problem, we are organizing the TrackML challenge, whose objective is to use machine learning to quickly reconstruct particle tracks from the points left in the silicon detectors. The challenge will be conducted in two phases:

  • the ongoing Accuracy phase (May-Aug 2018): favoring innovative algorithms that reach the highest accuracy, with no concern for speed. This phase has been accepted as an official IEEE WCCI 2018 competition (Rio de Janeiro, July 2018) and is hosted by Kaggle.
  • the Throughput phase (Jul-Oct 2018): focussing on speed optimisation. This phase has been accepted as an official NIPS 2018 competition (Montreal, December 2018) and will be hosted by Codalab.

We will hold a grand finale workshop at CERN, with a prize ceremony, in spring 2019. Subsequent events will be organized, in particular a dedicated workshop on tracking at University Paris-Saclay in 2019 (already funded). The challenge dataset will remain available for further studies on the CERN Open Data portal.

In more detail: in each collision, about 10,000 tracks (helicoidal trajectories originating approximately from the center of the detector) will each leave about 10 precise 3D points. The core pattern recognition task in tracking is to associate these 100,000 3D points into tracks. Current studies show that traditional algorithms suffer from a combinatorial explosion of the CPU time.
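
To give a feel for where such a combinatorial explosion can come from, the sketch below counts how many hit pairs and hit triplets exist among the points of one collision. It assumes a deliberately naive approach in which every combination is a candidate track seed; real reconstruction software prunes candidates far more aggressively. The function name naive_seed_count is hypothetical.

```python
from math import comb

N_TRACKS = 10_000     # approximate number of tracks per collision (from the text)
HITS_PER_TRACK = 10   # approximate number of 3D points left by each track (from the text)
n_hits = N_TRACKS * HITS_PER_TRACK


def naive_seed_count(n_points: int, seed_size: int) -> int:
    """Number of unordered combinations of `seed_size` points, with no pruning.

    Assumption for illustration only: every combination is treated as a
    candidate seed, which is not how production tracking code works.
    """
    return comb(n_points, seed_size)


if __name__ == "__main__":
    for seed_size in (2, 3):
        count = naive_seed_count(n_hits, seed_size)
        print(f"{seed_size}-hit combinations among {n_hits} points: {count:.3e}")
```

Even before any fitting is done, the number of candidates grows roughly as the square or cube of the number of points, which is why the point density of future detectors is a concern.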

Machine Learning techniques have strong potential for this tracking problem. It can be related to representation learning, to combinatorial optimization, to clustering (grouping together the hits deposited by the same particle), and even to time series prediction. An essential question is how to efficiently exploit a priori knowledge of the geometrical constraints (structural priors).
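
As a toy illustration of the clustering view, and of how a structural prior can be used, the sketch below groups hits by their direction as seen from the center of the detector. It assumes straight tracks that start exactly at the origin, so it ignores the helical curvature of real trajectories; the function names and the max_angle_rad threshold are hypothetical and not part of the challenge.

```python
import math
from typing import List, Sequence, Tuple

Hit = Tuple[float, float, float]  # (x, y, z) position of one 3D point


def direction(hit: Hit) -> Tuple[float, float, float]:
    """Unit vector from the origin to the hit (hit must not be at the origin)."""
    x, y, z = hit
    r = math.sqrt(x * x + y * y + z * z)
    return (x / r, y / r, z / r)


def cluster_by_direction(hits: Sequence[Hit], max_angle_rad: float = 0.01) -> List[int]:
    """Greedy clustering: one label per hit, in input order.

    A hit joins the first cluster whose seed direction lies within
    `max_angle_rad` of its own direction; otherwise it starts a new cluster.
    Assumption: straight-line tracks from the origin (no curvature).
    """
    cos_min = math.cos(max_angle_rad)
    seeds: List[Tuple[float, float, float]] = []
    labels: List[int] = []
    for hit in hits:
        d = direction(hit)
        for track_id, s in enumerate(seeds):
            if d[0] * s[0] + d[1] * s[1] + d[2] * s[2] >= cos_min:
                labels.append(track_id)
                break
        else:
            # No existing cluster is close enough: this hit seeds a new one.
            seeds.append(d)
            labels.append(len(seeds) - 1)
    return labels


if __name__ == "__main__":
    hits = [(1, 0, 0), (0, 1, 0), (2, 0, 0), (0, 2, 0), (3, 0, 0.001)]
    print(cluster_by_direction(hits))
```

The prior (tracks point back to the center) reduces a 3D grouping problem to a comparison of directions; handling curved trajectories would need a richer parametrisation than this sketch provides.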

The score is based on the fraction of points that have been correctly associated together.
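
A minimal sketch of a score of this kind is given below. It is a simplified, unweighted version: the rule used to decide that a reconstructed track "matches" a particle (the track's dominant particle must own more than half of the track's points, and the track must contain more than half of that particle's points) is an assumption made for illustration, not the official scoring definition. The function name association_score is hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, List


def association_score(truth: Dict[int, Hashable], predicted: Dict[int, Hashable]) -> float:
    """Fraction of points that sit in a track matched to their true particle.

    truth:     point id -> true particle id
    predicted: point id -> reconstructed track id (ids must also be in `truth`)
    """
    if not truth:
        return 0.0
    particle_size = Counter(truth.values())

    # Group point ids by reconstructed track.
    tracks: Dict[Hashable, List[int]] = defaultdict(list)
    for point_id, track_id in predicted.items():
        tracks[track_id].append(point_id)

    correct = 0
    for point_ids in tracks.values():
        counts = Counter(truth[p] for p in point_ids)
        particle, n_shared = counts.most_common(1)[0]
        # Assumed matching rule: strict majority in both directions.
        if 2 * n_shared > len(point_ids) and 2 * n_shared > particle_size[particle]:
            correct += n_shared
    return correct / len(truth)


if __name__ == "__main__":
    truth = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "b"}
    predicted = {1: 0, 2: 0, 3: 1, 4: 1, 5: 1, 6: 1}  # point 3 is misassigned
    print(association_score(truth, predicted))
```

Points missing from the prediction simply count as not associated, so an incomplete solution is penalised rather than rejected.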

The Accuracy phase will focus on reaching the highest score. Participants will train on a large dataset, apply their algorithm to a test dataset (for which the ground truth is held out on the challenge platform), and submit their solution (a text file specifying the grouping of the 3D points). The challenge platform will compute the score from a fraction of the test dataset and update the leaderboard live. At the end of the challenge, the score obtained from the held-out dataset (precise to the per-mille level) will be used to compute the final leaderboard and determine the winners.
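
The solution file could, for instance, be produced as in the sketch below. The layout is an assumption for illustration: a comma-separated text file with one line per point and two hypothetical columns, hit_id and track_id. The actual format is the one specified on the challenge platform.

```python
import csv
import io
from typing import Mapping, TextIO


def write_submission(assignments: Mapping[int, int], out: TextIO) -> None:
    """Write one 'hit_id,track_id' line per point, sorted by hit_id.

    The column names and the CSV layout are assumptions, not the
    official submission format.
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["hit_id", "track_id"])
    for hit_id in sorted(assignments):
        writer.writerow([hit_id, assignments[hit_id]])


if __name__ == "__main__":
    buffer = io.StringIO()
    write_submission({2: 7, 1: 7, 3: 9}, buffer)
    print(buffer.getvalue(), end="")
```

Points that share a track_id are declared to belong to the same track; the numeric value of the label itself carries no meaning in this sketch.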

Submissions accompanied by documented software will be examined by a jury, which will award additional prizes (an NVIDIA Titan V100 GPU, an invitation to NIPS 2018 or to the CERN spring 2019 workshop) based on the originality of the algorithm with respect to the combinatorial approach used in the HEP domain.

The Throughput phase will focus on speed. There will still be no limit on the resources used for training. However, the evaluation will take place on the challenge platform in a controlled environment. The software (written in C/C++, Python, or Go), possibly using open source libraries, should run on an i686 processor with 2 GB (tbc) of memory. (At this point, a separate track with modern processors/GPUs is also being considered.) The score used in this phase will combine the accuracy measure from the first phase with the speed.