Senior Specialist, Artificial Intelligence at Ericsson Research.
The evolution toward AI-native networks necessitates seamless Artificial Intelligence (AI) integration, which demands privacy-preserving data management, scalable model deployment, and optimized resource allocation. To address these challenges, we introduce key AI enablers: Data Operations (DataOps), Machine Learning Operations (MLOps), and AI as a Service (AIaaS). DataOps enables secure, privacy-preserving data collection through techniques such as Differential Privacy (DP) and secure aggregation, facilitating AI-driven insights without exposing sensitive user data. MLOps enhances the AI life cycle by leveraging distributed learning paradigms, including horizontal and Vertical Federated Learning (VFL), supported by Split Learning (SL). AIaaS extends these capabilities by exposing AI models and services through standardized APIs, enabling on-demand training, inference, and automation. By integrating AIaaS with DataOps and MLOps, networks can achieve greater intelligence, adaptability, and compliance with privacy and interoperability standards. This paper introduces a coherent architectural framework and operational strategy for embedding AI-driven intelligence into 6G networks, with a focus on key innovations in data governance, AI model coordination, and service exposure.
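The two privacy mechanisms named in the abstract can be illustrated in miniature. The sketch below assumes scalar client updates and plain shared random masks (real secure aggregation derives the pairwise masks from a cryptographic key agreement rather than trusting a shared RNG); it shows Laplace noise for DP and that pairwise masks cancel in the server-side sum:

```python
import math
import random

def dp_noisy(value, sensitivity, epsilon, rng):
    """Laplace mechanism: noise with scale sensitivity/epsilon,
    drawn by inverse-CDF sampling."""
    u = rng.random() - 0.5
    scale = sensitivity / epsilon
    return value - scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def secure_aggregate(updates, rng):
    """Pairwise additive masking: client i adds +m_ij, client j adds -m_ij,
    so individual updates are hidden but the masks cancel in the sum."""
    masked = list(updates)
    n = len(masked)
    for i in range(n):
        for j in range(i + 1, n):
            m = rng.uniform(-1e3, 1e3)  # secret shared by clients i and j
            masked[i] += m
            masked[j] -= m
    return sum(masked)  # equals sum(updates) up to float rounding
```

The server sees only masked values and their aggregate; in a DP deployment, each client would apply `dp_noisy` to its update before masking.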
Telecommunication networks exhibit complex, time-varying dynamics due to network reconfiguration, hardware reinstallation, and other factors related to user behavior. This requires machine learning (ML) models that are robust to dynamic and time-varying data distributions in order to support reliable network operations. This motivates us to focus on building a generalized ML model that is robust to temporal changes in data distributions by continuously training on the various data distributions, that is, concepts, observed at different times. In a federated learning (FL) setting, when there is a concept drift at one client, iterative aggregation of client model weights causes model entanglement. This may slow down both the forgetting of the data concept learned in previous rounds and the learning of the new data concept in the current round. Traditionally, models are trained on both old and new datasets to prevent performance degradation due to catastrophic forgetting while simultaneously adapting to new data. This requires storing both the old and the new datasets. The continuous growth of the training set size with changing concepts increases the number of arithmetic operations on the data, which subsequently increases the computation requirements. In this article, we propose and evaluate various sample selection methods to sustain the overall performance of ML models while reducing the number of training samples. The proposed methods reduced the training set size, addressed the high computation requirements, reduced the impact of catastrophic forgetting, and helped the models adapt to new data concepts better.
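As one concrete flavor of sample selection (a minimal sketch; the article evaluates several strategies, and this reservoir-sampling variant is only an illustrative stand-in), a fixed-capacity buffer can keep a uniform random subset of all samples seen so far, bounding the training-set size as concepts change:

```python
import random

def update_buffer(buffer, capacity, stream, seen, rng):
    """Reservoir sampling: keep a uniform random subset of every sample
    seen so far, so the training set never grows beyond `capacity`."""
    for sample in stream:
        seen += 1
        if len(buffer) < capacity:
            buffer.append(sample)       # buffer not yet full
        else:
            j = rng.randrange(seen)     # replace with prob capacity/seen
            if j < capacity:
                buffer[j] = sample
    return buffer, seen
```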
Artificial Neural Networks (NNs) are unable to learn tasks continually using a single model, which leads to forgetting old knowledge, known as catastrophic forgetting. This is one of the shortcomings that usually plague intelligent systems based on NN models. Federated Learning (FL) is a decentralized approach to training machine learning models on multiple local clients without exchanging raw data. A paradigm that handles model learning in both settings, federated and continual, is known as Federated Continual Learning (FCL). In this work, we propose a novel FCL algorithm, called FedCluLearn, which uses a stream micro-cluster indexing scheme to deal with catastrophic forgetting. FedCluLearn interprets the federated training process as a stream clustering scenario. It stores statistics, similar to micro-clusters in stream clustering algorithms, about the learned concepts at the server and updates them at each training round to reflect the current local updates of the clients. FedCluLearn uses only active concepts in each training round to build the global model, meaning it temporarily forgets the knowledge that is not relevant to the current situation. In addition, the proposed algorithm is flexible in that it can consider the age of local updates to reflect the greater importance of more recent data. The proposed FCL approach has been benchmarked against three baseline algorithms by evaluating its performance in several control and real-world data experiments. The implementation of FedCluLearn and the experimental results are available at this link.
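The server-side concept statistics can be sketched with classic stream-clustering feature vectors (count, linear sum, squared sum). This is a hypothetical simplification of FedCluLearn's indexing scheme, not the published algorithm: each client update is routed to the nearest stored concept, and only the concepts hit in the current round (the active concepts) would be aggregated into the global model:

```python
import math

class MicroCluster:
    """Concept summary kept at the server: sample count plus per-dimension
    linear and squared sums, as in classic stream-clustering CF vectors."""
    def __init__(self, dim):
        self.n, self.ls, self.ss = 0, [0.0] * dim, [0.0] * dim

    def add(self, vec):
        self.n += 1
        for d, v in enumerate(vec):
            self.ls[d] += v
            self.ss[d] += v * v

    def centroid(self):
        return [s / self.n for s in self.ls]

def route_update(clusters, update):
    """Index of the concept closest to a client's update; clusters hit in
    the current round are the active concepts used for aggregation."""
    return min(range(len(clusters)),
               key=lambda i: math.dist(clusters[i].centroid(), update))
```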
Distributed intelligence (DI) is becoming increasingly important in mobile networks due to privacy and regulatory restrictions on data movement. This trend is fueled by a surge in the data volume required for training machine learning models and by the increasing computational and storage capabilities of devices and network nodes near data collection points. Techniques such as Federated and Split Learning, applied in horizontal and vertical feature spaces, enable efficient training of machine learning models while maintaining data privacy. Despite the adoption of DI techniques in mobile networks, there are still gaps and areas in the mobile network architecture where DI is either not applied or not exploited to its full potential. To that end, this paper provides a state-of-the-art overview of the adoption of DI in mobile networks from the perspective of standardization. We identify the importance of machine learning model lifecycle management (ML LCM) and position DI as an extension of it. From this overview, gaps and limitations are identified and presented. To overcome these limitations, we perform a thought experiment that lowers ML LCM with DI capability from the Core Network (CN) to Radio Access Networks (RAN), which are distributed by nature. In that setting, a set of Key Performance Indicators (KPIs) to quantify the performance of different DI techniques is introduced. These KPIs aim at quantifying computational overhead, memory/storage requirements, communication footprint, model efficiency and robustness, flexibility, and privacy. We evaluate these KPIs on a RAN use case, that of secondary carrier prediction, and show the interplay between model efficiency and computational footprint, which are important metrics to consider when migrating from a centralized to a distributed setting. Finally, we provide additional enhancements to improve these KPIs and thus further promote the adoption of DI techniques.
Optimization of radio hardware and AI-based network management software yields significant energy savings in radio access networks. The execution of the underlying Machine Learning (ML) models, which enable energy savings through recommended actions, may require additional compute and energy, highlighting the opportunity to explore and adopt accurate and energy-efficient ML technologies. This work evaluates the novel use of sparsely structured Neural Circuit Policies (NCPs) in a use case to estimate the energy consumption of base stations. Sparsity in ML models yields reduced memory, computation, and energy demand, hence facilitating a low-cost and scalable solution. We also evaluate the generalization capability of NCPs in comparison to traditional and widely used ML models such as Long Short-Term Memory (LSTM), by quantifying their sensitivity to varying model hyper-parameters (HPs). NCPs demonstrated a clear reduction in computational overhead and energy consumption. Moreover, the results indicated that NCPs are robust to varying HPs such as the number of epochs and the number of neurons in each layer, making them a suitable option to ease model management and to reduce energy consumption in Machine Learning Operations (MLOps) in telecommunications.
The integration of Artificial Intelligence (AI) into the 6G architecture, referred to as AI-native 6G architecture, signifies a transformative era for communication technology. Nevertheless, practical implementation encounters challenges including architectural complexities, data quality concerns, and operational difficulties in managing machine learning models, allocating resources, and implementing intent-based management. In this paper, we present a comprehensive approach to address these challenges in emerging 6G networks through AI. Our approach involves two steps: first, we identify impairments hindering progress, analyzing the importance of addressing operational challenges in Machine Learning Operations (MLOps), 6G evolution, and democratizing AI, while addressing interoperability issues and complexities in the translation of business intents into network configurations. Based on this analysis, we highlight AI enablers—architectural enhancements, MLOps, Data Operations (DataOps), AI as a Service (AIaaS), and intent-based management—as essential solutions for practical AI implementation in 6G networks. We conclude that architectural improvements prioritize privacy, security, and data accuracy, while MLOps and DataOps optimize the management of the AI life cycle. Privacy-aware data collection and training employ federated learning and split learning, AIaaS streamlines AI access, and intent-based management with integrated AI enhances decision-making through advanced algorithms.
The success of Federated Learning (FL) hinges upon the active participation and contributions of edge devices as they collaboratively train a global model while preserving data privacy. Understanding the behavior of individual clients within the FL framework is essential for enhancing model performance, ensuring system reliability, and protecting data privacy. However, analyzing client behavior poses a significant challenge due to the decentralized nature of FL, the variety of participating devices, and the complex interplay between client models throughout the training process. This research proposes a novel approach based on eccentricity analysis to address the challenges associated with understanding the different clients’ behavior in the federation. We study how the eccentricity analysis can be applied to monitor the clients’ behaviors through the training process by assessing the eccentricity metrics of clients’ local models and clients’ data representation in the global model. The Kendall ranking method is used for evaluating the correlations between the defined eccentricity metrics and the clients’ benefit from the federation and influence on the federation, respectively. Our initial experiments on a publicly available data set demonstrate that the defined eccentricity measures can provide valuable information for monitoring the clients’ behavior and eventually identify clients with deviating behavioral patterns.
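The underlying eccentricity notion can be sketched with the classic TEDA-style formula, assuming each client is summarized by a vector (e.g., flattened model weights). This is a simplified stand-in for the metrics defined in the paper:

```python
import math

def eccentricities(points):
    """Per-sample eccentricity xi_i = 1/k + ||mu - x_i||^2 / (k * sigma^2),
    where sigma^2 is the mean squared distance to the mean. Deviating
    points get high eccentricity; the values sum to 2 by construction."""
    k = len(points)
    dim = len(points[0])
    mu = [sum(p[d] for p in points) / k for d in range(dim)]
    sq = [math.dist(mu, p) ** 2 for p in points]
    var = sum(sq) / k
    if var == 0.0:                 # degenerate case: all points identical
        return [2.0 / k] * k
    return [1.0 / k + s / (k * var) for s in sq]
```

A client whose local model drifts away from the population would show up with a dominant eccentricity value across rounds.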
In telecommunications, information delivery is performed over inherently distributed elements such as application servers, packet core network equipment, radio base stations, and mobile user equipment (UE). Predicting key performance indicators (KPIs) for services is important for mobile operators in preventing customer churn. It is also important to achieve accurate estimations in order to troubleshoot and localize faults in the end-to-end transmission by collecting data from potentially causal distributed measurement points located on the distributed delivery elements. Due to data privacy regulations, there is an increasing interest in decentralized machine learning models, including Split Learning (SL). A distributed learning technique such as parallel SL enables accurate joint model training, but it requires all distributed elements to synchronize and correlate on a unique identifier, such as a sample ID, to align the vertically distributed data samples, which causes dependency and high signaling overhead between the collaborating nodes. In this paper, we demonstrate a Translator Model that imputes the remote model parameters based on the local ones in an SL setting. This way, the sample alignment overhead during model inference is addressed, and a significant reduction in communication cost is achieved.
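A translator of this kind can be sketched as a small regression model fitted on pairs of local and remote cut-layer activations collected during training; at inference, the local side imputes the remote output and skips per-sample alignment. The linear form and the plain SGD loop below are illustrative assumptions, not the paper's exact model:

```python
import random

def train_translator(local_acts, remote_acts, lr=0.05, epochs=200):
    """Fit W, b so that W @ local + b approximates the remote activations."""
    din, dout = len(local_acts[0]), len(remote_acts[0])
    W = [[0.0] * din for _ in range(dout)]
    b = [0.0] * dout
    for _ in range(epochs):
        for x, y in zip(local_acts, remote_acts):
            pred = [sum(W[o][i] * x[i] for i in range(din)) + b[o]
                    for o in range(dout)]
            for o in range(dout):
                err = pred[o] - y[o]       # squared-error gradient
                for i in range(din):
                    W[o][i] -= lr * err * x[i]
                b[o] -= lr * err
    return W, b

def impute(W, b, x):
    """Stand-in for the remote branch during inference."""
    return [sum(w * xi for w, xi in zip(row, x)) + bo
            for row, bo in zip(W, b)]
```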
In telecommunication networks, many factors that influence QoE are inherently distributed in the network, related to the creation, delivery, and presentation of the content at the end-user terminal. Split Learning (SL) is a scalable distributed machine learning (ML) technique that enables joint training and inference on decentralized datasets. These decentralized datasets can be potentially large and sensitive, and cannot be collected at a central server. SL allows decentralized models to be trained where the corresponding local data are collected. Still, the computation cost of SL at the decentralized nodes can be high, negatively impacting the training time and energy consumption, especially when there are too many local decentralized ML input attributes. In this paper, we present a Gradient trajectory based local feature selection (GS) technique that removes noncontributing input features from local nodes early in supervised SL training, where those nodes do not have access to the target label, e.g., a QoE metric. With the proposed approach, more than 50% reduction in memory and, on average, more than 20% reduction in computation complexity were observed in an empirical study.
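The core selection step can be sketched as follows (a hypothetical simplification of GS: rank features by their accumulated absolute gradient over the early training steps and keep the top fraction; the actual criterion in the paper may differ):

```python
def select_features(grad_history, keep_ratio=0.5):
    """grad_history: one gradient vector w.r.t. the input features per
    training step. Features whose gradients stay small contribute little
    to the loss and are dropped from the local node's input."""
    n_feat = len(grad_history[0])
    scores = [sum(abs(g[i]) for g in grad_history) for i in range(n_feat)]
    k = max(1, int(n_feat * keep_ratio))
    ranked = sorted(range(n_feat), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])  # indices of the features to keep
```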
The development of Quality of Experience (QoE) models using Machine Learning (ML) is challenging, since datasets can be difficult to share between research entities when protecting the intellectual property of the ML model and the confidentiality of user studies in compliance with data protection regulations such as the General Data Protection Regulation (GDPR). This makes distributed machine learning techniques that do not necessitate sharing of data or attribute names appealing. One suitable use case within the scope of QoE is the task of mapping QoE indicators for the perception of quality, such as Mean Opinion Scores (MOS), in a distributed manner. In this article, we present Distributed Ensemble Learning (DEL) and Vertical Federated Learning (vFL) to address this context. Both approaches can be applied to datasets that have different feature sets, i.e., split features. The DEL approach is ML model-agnostic and achieves up to 12% accuracy improvement by ensembling various generic and specific models. The vFL approach is based on neural networks and achieves on-par accuracy with a conventional Fully Centralized machine learning model, while exhibiting statistically significant performance superior to that of the Isolated local models, with an average accuracy improvement of 26%. Moreover, energy-efficient vFL with a reduced network footprint and training time is obtained by further tuning the model hyper-parameters.
The increasing complexity of managing an immense number of network elements and their dynamically changing environment necessitates machine learning based recommendation models to guide human experts in setting appropriate network configurations to sustain end-user Quality of Experience (QoE). In this paper, we present and demonstrate a generative Conditional Variational AutoEncoder (CVAE)-based technique to reconstruct realistic underlying QoE factors together with improvement suggestions in a video streaming use case. In our experiment setting, consisting of a set of what-if scenarios, our approach pinpointed the potential changes to the QoE factors required to improve the estimated video Mean Opinion Scores (MOS).
Machine Learning (ML) based Quality of Experience (QoE) models potentially suffer from over-fitting due to limitations including low data volume and limited participant profiles. This prevents the models from becoming generic. Consequently, these trained models may under-perform when tested outside the experimented population. One reason for the limited datasets, which we refer to in this paper as small QoE data lakes, is that these datasets often contain user-sensitive information and are only collected through expensive user studies with special user consent. Thus, sharing of datasets amongst researchers is often not allowed. In recent years, privacy-preserving machine learning models have become important, and so have techniques that enable model training without sharing datasets, relying instead on secure communication protocols. Following this trend, in this paper, we present Round-Robin based Collaborative Machine Learning model training, where the model is trained in a sequential manner amongst the collaborating partner nodes. We benchmark this work against our customized Federated Learning mechanism as well as conventional Centralized and Isolated Learning methods.
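Round-robin training can be sketched in a few lines; the scalar model and the SGD step below are illustrative assumptions, standing in for an arbitrary model and local optimizer:

```python
def local_step(theta, data, lr=0.05):
    """One local pass of SGD on squared error against the node's samples."""
    for x in data:
        theta -= lr * 2.0 * (theta - x)
    return theta

def round_robin_train(nodes, theta, rounds):
    """Pass the model sequentially through the partner nodes; the
    parameters travel, while raw datasets never leave their owners."""
    for _ in range(rounds):
        for data in nodes:
            theta = local_step(theta, data)
    return theta
```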
Rapid changes in the sensitive behaviour and profiles of distributed mobile network elements necessitate privacy-preserving distributed learning mechanisms such as Federated Learning (FL). Moreover, such a mechanism needs to be robust so that it seamlessly sustains the jointly trained model accuracy. In order to provide automated management of the learning process in FL on datasets that are not independently and identically distributed (non-iid), we propose a Multi-Arm Bandit (MAB) based method that helps the federation select the nodes that benefit the overall model. This automated selection of the training nodes in each round yielded an improvement in accuracy while decreasing the network footprint.
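An epsilon-greedy variant conveys the idea (a minimal sketch; the paper's MAB formulation may differ): keep a running reward per client, e.g. the validation-accuracy change after including that client's update, exploit the best clients, and occasionally explore:

```python
import random

def select_clients(avg_reward, n_pick, eps, rng):
    """With probability eps explore a random subset, otherwise exploit the
    clients whose past updates benefited the global model the most."""
    n = len(avg_reward)
    if rng.random() < eps:
        return sorted(rng.sample(range(n), n_pick))
    top = sorted(range(n), key=lambda i: avg_reward[i], reverse=True)
    return top[:n_pick]

def record_reward(avg_reward, counts, client, reward):
    """Incremental running mean of each client's observed benefit."""
    counts[client] += 1
    avg_reward[client] += (reward - avg_reward[client]) / counts[client]
```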
Quality of Experience (QoE) models need good generalization, which necessitates a sufficient amount of user-labeled datasets associated with measurements of the underlying QoE factors. However, obtaining QoE datasets is often costly, since they are preferably collected from many subjects with diverse backgrounds, and consequently dataset sizes and representations are limited. Models can be improved by sharing and merging the collected local datasets; however, regulations such as GDPR make data sharing difficult, as those local user datasets might contain sensitive information about the subjects. A privacy-preserving machine learning approach such as Federated Learning (FL) is a potential candidate that enables sharing of QoE data models between collaborators without exposing the ground truth, but only by sharing the securely aggregated form of the extracted model parameters. While FL can enable seamless QoE model management, if collaborators do not have the same level of data quality, more iterations of information sharing over a communication channel might be necessary for the models to reach an acceptable accuracy. In this paper, we present an ensemble-based Bayesian synthetic data generation method for FL, LOO (Leave-One-Out), which reduces the training time by 30% and the network footprint in the communication channel by 60%.
The development of QoE models by means of Machine Learning (ML) is challenging, among other reasons due to small-size datasets, a lack of diversity in user profiles in the source domain, and too much diversity in the target domains of QoE models. Furthermore, datasets can be hard to share between research entities, as the machine learning models and the user data collected from user studies may be IPR- or GDPR-sensitive. This makes a decentralized learning-based framework appealing for sharing and aggregating learned knowledge between the local models that map the obtained metrics to user QoE, such as Mean Opinion Scores (MOS). In this paper, we present a transfer learning-based ML model training approach, which allows decentralized local models to share generic indicators on MOS to learn a generic base model, and then to customize the generic base model further using additional features that are unique to the specific localized (and potentially sensitive) QoE nodes. We show that the proposed approach is agnostic to the specific ML algorithms stacked upon each other, as it does not necessitate the collaborating localized nodes to run the same ML algorithm. Our reproducible results reveal the advantages of stacking various generic and specific models with corresponding weight factors. Moreover, we identify the optimal combination of algorithms and weight factors for the corresponding localized QoE nodes.
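The stacking with weight factors can be sketched as a convex blend of a generic base model and a node-specific model, with the weight chosen on a local validation set (an illustrative simplification; the paper stacks several generic and specific models):

```python
def stacked_predict(x, generic, specific, w):
    """Blend the shared base model with the node's specialized model."""
    return w * generic(x) + (1.0 - w) * specific(x)

def best_weight(val_xy, generic, specific, grid=21):
    """Grid-search the blend weight minimizing squared error on a local
    validation set of (input, MOS) pairs."""
    def mse(w):
        return sum((stacked_predict(x, generic, specific, w) - y) ** 2
                   for x, y in val_xy) / len(val_xy)
    return min((i / (grid - 1) for i in range(grid)), key=mse)
```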
On Network Performance Indicators for Network Promoter Score Estimation [Machine learning model distillation in QoE]
Estimating the user-perceived quality of offered services from the massive number of Key Performance Indicators (KPIs) measured in diverse components has become a necessity for mobile network operators. The goal is first to obtain a good estimator for poor Quality of Experience (QoE), which can potentially be achieved with machine learning, and then to pinpoint the features that contribute to the poor performance. There is often a tradeoff between the accuracy and the interpretability of models. In this paper, we address this tradeoff by first developing a robust but complex teacher machine learning model that maps the subjective Net Promoter Score (NPS) values computed from user quality feedback to the underlying subset of KPI metrics. Next, we develop a more interpretable student model supervised by the pre-trained teacher model. Eventually, the compact student decision tree model learns to mimic the behavior of the teacher model with at least 10% improved accuracy on the test set compared to the conventional way of directly training the decision tree model. In the last step, we extract the rules and the important influential features of the distilled student model.
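The teacher-student step can be sketched with a one-split stump standing in for the student decision tree (a hypothetical simplification; the paper trains a full tree): the teacher labels a transfer set, and the student is fitted on those pseudo-labels:

```python
def fit_stump(xs, labels):
    """One-split decision stump: try each threshold and predict the
    majority label on each side of the split."""
    def majority(vals):
        return max(set(vals), key=vals.count) if vals else 0
    best = None
    for t in sorted(set(xs)):
        left = [l for x, l in zip(xs, labels) if x <= t]
        right = [l for x, l in zip(xs, labels) if x > t]
        pl, pr = majority(left), majority(right)
        acc = sum((pl if x <= t else pr) == l for x, l in zip(xs, labels))
        if best is None or acc > best[0]:
            best = (acc, t, pl, pr)
    _, t, pl, pr = best
    return lambda x: pl if x <= t else pr

def distill(teacher, transfer_xs):
    """Train the interpretable student on teacher-provided pseudo-labels."""
    return fit_stump(transfer_xs, [teacher(x) for x in transfer_xs])
```

The split threshold and leaf predictions of the fitted student are directly readable, which is the interpretability payoff of the distillation step.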
The importance of cellular networks continuously increases as we assume ubiquitous connectivity in our daily lives. As a result, the underlying core telecom systems have very high reliability and availability requirements that are sometimes hard to meet. This study presents a proactive approach that could help satisfy these high reliability and availability requirements by predicting future base station alarms. A data set containing 231 internal performance measures from cellular (4G) base stations is correlated with a data set containing base station alarms. Next, two experiments are used to investigate (i) the alarm prediction performance of six machine learning models, and (ii) how different predict-ahead times (ranging from 10 minutes to 48 hours) affect the predictive performance. A 10-fold cross-validation evaluation approach and statistical analysis suggested that the Random Forest models showed the best performance. Further, the results indicate the feasibility of predicting severe alarms one hour in advance with a precision of 0.812 (±0.022, 95% CI), recall of 0.619 (±0.027), and F1-score of 0.702 (±0.022). A model interpretation package, ELI5, was used to identify the most influential features in order to gain model insight. Overall, the results are promising and indicate the potential of an early-warning system that enables a proactive means for achieving high reliability and availability requirements.
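The predict-ahead setup can be sketched by shifting the alarm labels against the performance measures (a minimal illustration of the data preparation only; variable names are hypothetical):

```python
def make_predict_ahead_dataset(measures, alarms, horizon):
    """Pair the performance-measure vector at time t with the alarm flag
    at t + horizon, so a classifier learns to warn `horizon` steps ahead."""
    n = len(measures) - horizon
    xs = [measures[t] for t in range(n)]
    ys = [alarms[t + horizon] for t in range(n)]
    return xs, ys
```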