R&D Projects

ChatGPT for structured and semi-structured data

ChatGPT is a well-known service which provides a linguistic user interface over LLM networks pre-trained on a massive amount of publicly available data. It can answer a variety of questions on a diverse list of topics, including but not limited to writing recipes, code, etc. In today's world we also have a massive amount of structured data (in the form of databases) and semi-structured data (in a variety of different formats), which is quite often treated as meta-data. The common interface to this kind of data is via SQL, GraphQL and similar query-based approaches. But such queries require specific knowledge, such as schemas or domain knowledge, and it would be extremely useful to study how the ChatGPT approach can be applied to this domain. The project we foresee is to train a neural network on structured and semi-structured data sources and provide a linguistic interface (a ChatGPT prompt) to explore these datasets. Another sub-project would be to provide ChatGPT for ML hub repositories.
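
A minimal sketch of such a linguistic interface is shown below, assuming a hypothetical ask_llm call (stubbed here with a canned answer so the example runs end-to-end): the table schema is embedded into the prompt so the model needs no prior knowledge of it, and the generated SQL is executed against the database.

    import sqlite3

    def ask_llm(prompt):
        # Hypothetical call to a ChatGPT-like model; stubbed with a
        # canned answer so this sketch runs without any external service.
        return "SELECT name, size FROM datasets WHERE size > 1000000"

    def nl_query(question, schema, db):
        # Embed the schema into the prompt, then execute the SQL the
        # model produces against the local database.
        prompt = f"Given the schema:\n{schema}\nWrite SQL for: {question}"
        sql = ask_llm(prompt)
        return db.execute(sql).fetchall()

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE datasets (name TEXT, size INTEGER)")
    db.execute("INSERT INTO datasets VALUES ('/A/B/RAW', 2000000)")
    schema = "datasets(name TEXT, size INTEGER)"
    print(nl_query("which datasets are larger than 1 MB?", schema, db))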

Swift for TensorFlow in HEP

Recently, Google announced Swift for TensorFlow, a next-generation system for deep learning and differentiable computing [1]. We would like to explore Swift4TF in the context of HEP, build an ML model using one of the CMS use-cases, and benchmark it against traditional Python-based frameworks (PyTorch, Keras+TF, etc.). In particular, we would like to understand whether a Swift4TF compiler-optimized model can outperform Python-based training on specific hardware resources (GPUs, TPUs), and whether it can provide a significant performance boost at the inference phase for different HEP-specific ML models.

[1] https://www.tensorflow.org/swift
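
As a reference point for such benchmarks, the Python-side baseline might look like the following minimal Keras sketch; the dataset, feature count and layer sizes are arbitrary placeholders rather than an actual CMS use-case.

    import time
    import numpy as np
    import tensorflow as tf

    # Toy stand-in for a CMS dataset: 10k events with 20 features each.
    x = np.random.rand(10000, 20).astype("float32")
    y = np.random.randint(0, 2, size=10000)

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(20,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")

    # Wall-clock timing of training gives a simple benchmark number to
    # compare against a Swift4TF implementation of the same model.
    start = time.time()
    model.fit(x, y, epochs=5, batch_size=256, verbose=0)
    print(f"training took {time.time() - start:.2f}s")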

Port CMS web Python-based applications to Go

The CMS web cluster consists of a dozen individual services written in Python. We would like to port the most critical ones, such as the Data Bookkeeping System backed by an Oracle DB, to a Go-based implementation. The latter provides native concurrency and aims to improve the performance of our services. The project involves designing and implementing a new framework in Go, and scaling it up on a Kubernetes infrastructure to achieve high throughput for user queries. The Oracle database consists of O(100K) datasets, O(10M) blocks, and O(100M) files; its total size reaches 0.5 TB, and the system should sustain an average load of 500+ concurrent clients.
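
To make the throughput target concrete, the sketch below shows the kind of load test we would run against the ported service; the URL and endpoint are hypothetical placeholders.

    import asyncio
    import aiohttp

    URL = "http://localhost:8080/dbs/datasets"  # hypothetical endpoint

    async def fetch(session):
        async with session.get(URL) as resp:
            return resp.status

    async def main(nclients=500):
        # Fire nclients concurrent requests, mimicking the 500+ client
        # load the Go-based service is expected to sustain.
        async with aiohttp.ClientSession() as session:
            statuses = await asyncio.gather(
                *(fetch(session) for _ in range(nclients)))
        ok = sum(1 for s in statuses if s == 200)
        print(f"{ok}/{nclients} requests succeeded")

    asyncio.run(main())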

Exploiting Apache Spark platform for CMS computing analytics

CERN IT provides a set of Hadoop clusters featuring more than 5 PBytes of raw storage, with different open-source, user-level tools available for analytical purposes. The CMS experiment started collecting a large set of computing meta-data, e.g. dataset and file-access logs, in 2015. These records represent a valuable, yet scarcely investigated, set of information that needs to be cleaned, categorized and analyzed. CMS can use this information to discover useful patterns, enhance the overall efficiency of distributed data management, improve CPU and site utilization, and reduce task completion time. Here we evaluate the Apache Spark platform for CMS needs. This project targets two main use-cases: CMS analytics and ML studies, where efficient processing of billions of records stored on HDFS plays an important role.

For more information see https://github.com/vkuznet/CMSSpark  
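
A minimal PySpark sketch of the kind of aggregation such jobs perform is shown below; the HDFS path and column names are illustrative placeholders, not the actual CMS schema.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("cms-analytics").getOrCreate()

    # Illustrative path/columns; real CMSSpark jobs read CMS meta-data
    # (dataset and file-access logs) stored on the CERN Hadoop clusters.
    df = spark.read.csv("hdfs:///cms/access_logs", header=True)

    # Rank datasets by number of accesses, a typical popularity query.
    popular = (df.groupBy("dataset")
                 .agg(F.count("*").alias("naccesses"))
                 .orderBy(F.desc("naccesses"))
                 .limit(10))
    popular.show()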

TensorFlow as a Service (TFaaS)

The CMS experiment at CERN uses various Machine Learning (ML) techniques, including DNNs, in various physics and computing related projects. The popularity of Google's TensorFlow framework makes it an excellent choice for applying ML algorithms within the CMS workflow pipeline. The project intends to build an end-to-end data-service to serve TF trained models to the CMSSW framework. Sub-project: port PyTorch/fast.ai models into TensorFlow.

For more information see https://github.com/vkuznet/TFaaS
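
For illustration, querying such a data-service from Python could look like the sketch below; the endpoint path and JSON payload layout are assumptions rather than the actual TFaaS API, which is documented in the repository above.

    import json
    import urllib.request

    # Hypothetical TFaaS-style prediction request: send named inputs to
    # a served TF model over HTTP and read back the predictions.
    payload = {"model": "my_model", "inputs": [1.0, 2.0, 3.0]}
    req = urllib.request.Request(
        "http://localhost:8083/json",           # assumed endpoint
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp))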

Machine Learning as a Service for the CMS experiment

Machine Learning as a Service (MLaaS) for the CMS experiment is a big project with many components, including the TFaaS service mentioned above. It aims to provide a full pipeline: reading PBytes of data accessible from remote sites (via XRootD), training ML models from the native HEP data-format (ROOT), and serving predictions via TFaaS.
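
The reading step of such a pipeline can be sketched with the uproot Python library, which reads ROOT files, including remotely over XRootD (given XRootD bindings are installed); the file URL and branch names below are placeholders.

    import uproot

    # Placeholder XRootD URL and branch names; a real pipeline would
    # stream many such remote ROOT files into the training step.
    file = uproot.open("root://eos.example.cern.ch//store/file.root")
    tree = file["Events"]
    arrays = tree.arrays(["Electron_pt", "Electron_eta"])
    print(arrays["Electron_pt"])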

Integrate PyTorch models into TensorFlow as a Service solution

As part of TFaaS we either need to develop a translation layer to port PyTorch models into TensorFlow, or provide an interface to manage PyTorch models directly in TFaaS via the go-torch library. The former can be done via the Open Neural Network Exchange format (https://onnx.ai/), while the latter will require working on LibTorch C++ bindings for Go and porting the go-torch library into the TFaaS framework.
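
The ONNX route can be sketched as follows: torch.onnx.export is the standard PyTorch export API, while the model, shapes and the follow-up conversion tool (e.g. onnx-tf) are illustrative choices.

    import torch
    import torch.nn as nn

    # Placeholder PyTorch model standing in for a real CMS model.
    model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
    model.eval()

    # Export to ONNX; the resulting file can then be converted to a
    # TensorFlow graph (e.g. with the onnx-tf converter) for TFaaS.
    dummy = torch.randn(1, 20)
    torch.onnx.export(model, dummy, "model.onnx",
                      input_names=["features"], output_names=["score"])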

Develop Neural Network components to handle Jagged Arrays

HEP data are stored in the ROOT data-format. Each HEP event may contain a different number of objects, e.g. a varying number of electrons per event. Such data are represented as jagged arrays, and we're looking into developing the layers/components necessary to handle these data-structures.
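
One existing building block for such variable-length data is TensorFlow's RaggedTensor; the sketch below feeds per-event electron lists (made-up values) through a simple per-event reduction.

    import tensorflow as tf

    # Three events with 2, 1 and 3 electrons respectively: a jagged
    # structure that a plain dense tensor cannot represent directly.
    electron_pt = tf.ragged.constant(
        [[35.0, 20.1], [50.5], [22.0, 18.3, 10.2]])

    # A per-event reduction turns the jagged input into a fixed-size
    # feature that downstream dense layers can consume.
    event_features = tf.reduce_sum(electron_pt, axis=1)
    print(event_features)  # one summed pT value per event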

Event Classification using Deep Learning Networks

Perform feasibility studies of mainstream ML toolkits with CMS ROOT-based files. We're looking at the creation of a common framework to explore Big Data datasets within Machine Learning (ML)/Deep Learning (DL) frameworks. The task of this project is two-fold: explore how to handle PBs of data within existing frameworks, and perform event classification of CMS data via (un)supervised learning.

For more information see https://github.com/vkuznet/DLEventClassification
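
As a small illustration of the unsupervised side, the sketch below clusters events by kinematic features; the data are a random stand-in, whereas a real study would read these features from CMS ROOT files.

    import numpy as np
    from sklearn.cluster import KMeans

    # Random stand-in for per-event features (e.g. pT, eta, missing ET).
    events = np.random.rand(1000, 3)

    # Group events into candidate classes without labels; cluster
    # populations can then be compared against known physics processes.
    kmeans = KMeans(n_clusters=4, n_init=10).fit(events)
    print(np.bincount(kmeans.labels_))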