Machine Learning for Computing Center Operation
Anomaly detection of Grid service logs
ICEPP at the University of Tokyo operates a computing center for the ATLAS experiment at the Large Hadron Collider (LHC). The center's computing resources are part of the Worldwide LHC Computing Grid (WLCG), a global collaboration of around 170 computing centers in more than 40 countries that manages the large amount of data produced by the LHC experiments.
Stable and reliable operation of the computing center is crucial to the success of the experiment. Machine Learning (ML) is a promising approach to automating or improving computing center operations. The group is discussing and studying several applications of ML to computing center operations.
A Grid computing site consists of various services, including Grid middleware such as the Computing Element and the Storage Element. Ensuring safe and stable operation of these services is a key responsibility of site administrators. The logs produced by the services provide useful information for understanding the status of the site. However, monitoring and analyzing the service logs every day is time-consuming for site administrators. Therefore, a support framework that detects anomalous logs and alerts site administrators was developed using ML techniques.
Typical ML classification requires pre-defined labels. It is difficult to collect enough anomalous logs to build an ML model that covers all possible pre-defined anomalies. Therefore, unsupervised ML methods based on word embedding and clustering algorithms are used to detect anomalous logs.
The framework was tested using simple sshd logs. The left plot shows word vector distributions for the logs. As expected, sshd logs that show unusual login counts or a login from an unusual host appear away from the normal cluster and are identified as anomalous events. The framework also shows reasonable anomaly detection accuracy for a Grid middleware service (DPM). The accuracy was greater than 80% after hyperparameter tuning.
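The idea of flagging log lines that fall away from the normal cluster can be illustrated with a minimal Python sketch. It is not the framework's implementation. It makes several simplifying assumptions: token-count vectors stand in for learned word embeddings, the cosine distance to the centroid of all lines stands in for a clustering algorithm, and the cut at k = 1.5 standard deviations is an arbitrary choice. The function names and the sshd-style sample lines are made up for illustration.

```python
import math
import statistics
from collections import Counter

# Made-up sshd-style messages for illustration only (not real site logs).
SAMPLE_LINES = [
    "Accepted publickey for alice from 192.0.2.5 port 50022 ssh2",
    "Accepted publickey for bob from 192.0.2.6 port 50100 ssh2",
    "Accepted publickey for alice from 192.0.2.5 port 50311 ssh2",
    "Accepted publickey for carol from 192.0.2.7 port 50412 ssh2",
    "Accepted publickey for bob from 192.0.2.6 port 50533 ssh2",
    "Failed password for invalid user admin from 203.0.113.9 port 4444 ssh2",
]


def tokenize(line):
    """Lower-case a log line and mask purely numeric tokens such as ports."""
    return ["<num>" if tok.isdigit() else tok for tok in line.lower().split()]


def vectorize(lines):
    """Token-count vectors over a shared vocabulary.

    Assumption: this is a simple stand-in for learned word embeddings.
    """
    counts = [Counter(tokenize(line)) for line in lines]
    vocab = sorted(set().union(*counts))
    return [[c[tok] for tok in vocab] for c in counts]


def cosine_distance(u, v):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def anomaly_scores(lines):
    """Score each line by its cosine distance from the centroid of all lines."""
    vectors = vectorize(lines)
    centroid = [sum(col) / len(col) for col in zip(*vectors)]
    return [cosine_distance(vec, centroid) for vec in vectors]


def flag_anomalies(lines, k=1.5):
    """Indices of lines scoring above mean + k * std (k is a hypothetical cut)."""
    scores = anomaly_scores(lines)
    threshold = statistics.mean(scores) + k * statistics.pstdev(scores)
    return [i for i, score in enumerate(scores) if score > threshold]


if __name__ == "__main__":
    for index in flag_anomalies(SAMPLE_LINES):
        print("possible anomaly:", SAMPLE_LINES[index])
```

No labels are needed: a line is scored only by how far it lies from the bulk of the data. Swapping in a learned embedding or a density-based clustering method would change `vectorize` and `anomaly_scores` but not the overall flow.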