An intriguing property of CMOS devices is that their power consumption and electromagnetic (EM) emanations depend on the data they process. Consequently, a device's internal data can be partially inferred by observing its power consumption or EM emanations. Side-Channel Analysis (SCA) exploits this characteristic to recover secrets from cryptographic devices by analyzing the power consumption or EM emanation data, commonly referred to as power/EM traces, where each trace is a sequence of measurements over time. Historically, various statistical techniques such as Pearson's correlation, mutual information, principal component analysis, and probabilistic generative models have been employed in SCA. However, these classical approaches face several limitations. First, they require the explicit selection of informative features from the power/EM traces through a separate pre-processing step, which sometimes relies on internal device information that may not always be accessible. Second, higher-order SCAs depend on selecting appropriate higher-order statistics, which may not be optimal when the power consumption or EM emanation behavior deviates from the well-known idealized models. Third, classical SCAs struggle with implementations protected by jitter-based countermeasures, necessitating additional pre-processing to mitigate the effects of jitter. These pre-processing steps often depend on heuristic properties of power traces, limiting their effectiveness in many scenarios.
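To make the classical workflow concrete, the following sketch runs a first-order correlation attack on synthetic traces. It is a minimal illustration, not a procedure from this work: the leakage model (the Hamming weight of the plaintext byte XOR-ed with the key byte), the trace dimensions, the leaking sample index, and the noise level are all assumptions chosen for the demo, and the signed (rather than absolute) correlation is ranked because the toy XOR model, lacking an S-box, makes the complement key perfectly anti-correlated.

    import numpy as np

    rng = np.random.default_rng(0)
    HW = np.array([bin(v).count("1") for v in range(256)])  # Hamming weight table

    # Synthetic campaign: 2000 traces of 100 samples; the Hamming weight of
    # (plaintext XOR key) leaks at sample index 40, buried in Gaussian noise.
    n_traces, true_key = 2000, 0x5A
    plaintexts = rng.integers(0, 256, n_traces)
    traces = rng.normal(0.0, 1.0, (n_traces, 100))
    traces[:, 40] += HW[plaintexts ^ true_key]

    def pearson(hyp, trcs):
        """Pearson correlation of a leakage hypothesis with every trace sample."""
        hyp = hyp - hyp.mean()
        trcs = trcs - trcs.mean(axis=0)
        return (hyp @ trcs) / (np.linalg.norm(hyp) * np.linalg.norm(trcs, axis=0))

    # Rank every key hypothesis by its best correlation over all samples.
    scores = [pearson(HW[plaintexts ^ k].astype(float), traces).max()
              for k in range(256)]
    print(hex(int(np.argmax(scores))))  # prints 0x5a

Note how the attack scans every sample for the informative one; in practice an analyst narrows this search through the feature-selection step described above, and higher-order variants must additionally combine several samples with a chosen statistic before correlating, which is exactly where the choice of combining function becomes critical.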
Deep learning (DL)-based SCAs overcome many of the limitations of classical SCAs. Unlike classical methods, DL-based SCAs do not require prior selection of informative features, as DL models can be trained to automatically assign higher weight to informative features and lower weight to uninformative ones. Additionally, because DL models are capable of learning arbitrary input-output mappings, they have been highly successful in higher-order SCA. Furthermore, since certain DL models exhibit shift-invariance, they have been found to be highly effective against jitter-based countermeasures.
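The shift-invariance property can be illustrated with a toy convolutional pipeline. The sketch below is an assumption-laden numpy illustration, not any model used in this work: it uses circular convolution followed by ReLU and global average pooling, chosen so that the invariance is exact, and shows that shifting the input trace merely permutes the feature map while leaving the pooled output unchanged.

    import numpy as np

    def conv_gap(trace, kernel):
        """Circular 1-D convolution + ReLU + global average pooling."""
        n, k = len(trace), len(kernel)
        feats = np.array([kernel @ np.take(trace, np.arange(i, i + k), mode="wrap")
                          for i in range(n)])
        return np.maximum(feats, 0.0).mean()  # pooling discards position information

    rng = np.random.default_rng(1)
    trace, kernel = rng.normal(size=500), rng.normal(size=11)
    shifted = np.roll(trace, 137)  # e.g., the effect of a random-delay countermeasure
    print(np.isclose(conv_gap(trace, kernel), conv_gap(shifted, kernel)))  # True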
While DL-based SCAs have demonstrated significant potential in addressing many of the limitations of classical SCA, they still fall short in several attack scenarios:
The ability to capture dependencies between features that are far apart is crucial for a DL model to successfully attack many implementations protected by masking countermeasures. Similarly, shift invariance is essential for a DL model to perform effectively against implementations protected by jitter-based countermeasures. Therefore, a major challenge in DL-based SCA research is to develop models that can learn long-range dependencies while remaining shift-invariant.
Power/EM traces can be hundreds of thousands of features long. Thus, building DL models that can be trained efficiently on such long inputs, while preserving other desirable properties such as shift-invariance, remains a significant challenge.
As power/EM traces become longer, the informative features within them become increasingly sparse. Consequently, building and training shift-invariant DL models to perform optimally on such data becomes more challenging.
To address the above challenges, we developed four DL models: TransNet, LinNet, EstraNet, and HierNet, each improving on and generalizing its predecessor. The initial model, TransNet, is a shift-invariant Transformer Network (TN)-based model whose computational complexity is quadratic in the trace length, making it suitable only for traces on the order of a thousand features. In contrast, the final model, HierNet, is a hierarchical TN-based model whose computational cost is linear in the trace length, making it scalable to traces hundreds of thousands of features long. HierNet replaces the standard self-attention layer of a TN model with novel components: Gaussian Positional attention (GaussiP attention) and self-attention with absolute positional encoding. GaussiP attention has linear computational complexity in the trace length, overcoming the quadratic cost of standard self-attention. Moreover, because it incorporates relative positional encoding, GaussiP attention is shift-invariant. The attention scores it generates are sparse, a property that makes it effective on longer traces where informative features are sparse. At the top layer, HierNet employs a self-attention layer with absolute positional encoding, which makes HierNet configurable to attack different kinds of jitter-based countermeasures such as random delay, clock jitter, or shuffling.
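To convey the intuition behind GaussiP attention, the following numpy sketch implements a deliberately simplified, scalar-valued variant; it is not the layer as defined in this work. The simplifications are ours: the attention score between positions i and j is taken to be a Gaussian in the relative offset (j - i) with a fixed centre mu and width sigma, and the kernel is truncated at three standard deviations. Dependence on relative offsets alone yields shift invariance, while the truncation makes the cost linear in the trace length and the scores sparse.

    import numpy as np

    def gaussip_attention_1d(values, mu=0.0, sigma=5.0):
        """Simplified Gaussian positional attention over a 1-D value sequence.

        The weight position i gives to position j is a normalised Gaussian in
        the relative offset (j - i), truncated at +/- 3 sigma. Each position
        therefore attends to only O(sigma) neighbours: linear cost in the
        sequence length instead of quadratic, with sparse attention scores.
        """
        half = int(np.ceil(3 * sigma))
        offsets = np.arange(-half, half + 1)
        scores = np.exp(-((offsets - mu) ** 2) / (2 * sigma**2))
        scores /= scores.sum()  # normalised attention weights
        out = np.zeros_like(values, dtype=float)
        for o, w in zip(offsets, scores):
            lo, hi = max(0, -o), min(len(values), len(values) - o)
            out[lo:hi] += w * values[lo + o:hi + o]  # out[i] += w * values[i + o]
        return out

Because the weights depend only on relative offsets, shifting the input shifts the output by the same amount (up to boundary effects), mirroring the shift invariance of the full layer; in the actual GaussiP attention, the computation operates on value vectors rather than scalars, and mu and sigma would not be fixed constants as they are in this illustration.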