EREBA: Blackbox Energy Testing of Adaptive Neural Networks

Introduction

Due to energy concerns in many application fields, mobile and embedded computing in particular, a large set of energy-saving adaptive neural networks (AdNNs) have been developed to reduce computation and save energy. Nonetheless, the important problem of testing the energy robustness of AdNNs has received little attention. Because accuracy-based testing inputs do not transfer to energy testing, existing techniques that generate such inputs cannot be applied here. To address this challenge, in this paper we present EREBA, an energy-oriented black-box testing framework for testing the energy robustness of the major categories of AdNNs.


Based on the key insight that the energy-consumption pattern of AdNNs is stepwise, EREBA explores and infers the relation between inputs and the energy consumption of AdNNs. Extensive implementation and evaluation using three state-of-the-art AdNNs demonstrate that test inputs generated by EREBA can increase the energy consumption of AdNNs by up to 1,000% compared to the original inputs. Our results also show that retraining an AdNN with test inputs generated via EREBA can achieve more than 70% energy savings. The code can be found here. This work has been accepted at the International Conference on Software Engineering (ICSE 2022).

Adaptive Neural Networks (AdNNs)

In this work, we focus on evaluating the energy robustness of AdNNs. The main objective of AdNNs is to avoid executing some of the layers in a neural network when they are not needed. AdNNs produce intermediate outputs before each layer or block (in the case of ResNet). A shallow computing unit (a linear classifier, or a simple CNN or RNN) is generally added between two layers or blocks to compute these intermediate outputs. Predefined conditions are used to evaluate the intermediate outputs; if an output attains a certain threshold, the condition is fulfilled. Once the condition is fulfilled, the AdNN decides that certain layers or blocks are not required for the inference, and those layers or blocks are skipped.

This type of AdNN is called a Conditional-skipping AdNN. Alternatively, if an AdNN terminates the operations within a block or the network early, by deciding that the later operations are not required, we call it an Early-termination AdNN. In this work, we discuss the robustness of three models: SkipNet, BlockDrop, and BranchyNet.
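To make the mechanism concrete, below is a minimal PyTorch sketch of an early-termination network in the spirit of BranchyNet. It is our illustration, not code from any of the three models: the architecture, layer sizes, and the confidence threshold are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EarlyExitNet(nn.Module):
    """Minimal early-termination AdNN sketch (BranchyNet-style).

    A shallow side branch after the first block produces an intermediate
    prediction; if its confidence reaches the threshold, the remaining
    layers are skipped, which is where the energy saving comes from.
    """

    def __init__(self, num_classes: int = 10, threshold: float = 0.9):
        super().__init__()
        self.threshold = threshold  # hypothetical confidence threshold
        self.block1 = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8),
        )
        # shallow computing unit: a linear classifier between blocks
        self.exit1 = nn.Linear(32 * 8 * 8, num_classes)
        self.block2 = nn.Sequential(
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.exit2 = nn.Linear(64, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # assumes a single-image batch so the exit decision is scalar
        h = self.block1(x)
        logits1 = self.exit1(h.flatten(1))
        if F.softmax(logits1, dim=1).max() >= self.threshold:
            return logits1  # early termination: later layers never run
        return self.exit2(self.block2(h).flatten(1))
```

A conditional-skipping AdNN differs only in where the decision is applied: instead of returning early, it bypasses individual blocks (e.g., `h = h if skip else block(h)`) and continues to the final classifier.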

Different Thresholds for ETP and ITP


Approach

We use an estimator model, which predicts the energy consumption of AdNNs, to generate test images. The training procedure of the estimator model is shown in the figure above.

Using the estimator model, we propose two types of testing: Input-based testing and Universal testing. The figure above illustrates the creation of Input-based test images. In Universal testing, we try to find the worst-case input for a model, i.e., the single input that increases the model's energy consumption the most.
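The sketch below shows one plausible way Input-based test generation can use the estimator: ascend the gradient of the predicted energy with respect to the input pixels. This is our hedged illustration, not the paper's exact procedure; `estimator`, the step size, the iteration count, and the penalty weight `c` are all assumptions.

```python
import torch

def generate_energy_test_input(estimator, x_orig, step=0.01, steps=50, c=10.0):
    """Perturb an image so the estimator predicts higher energy.

    `estimator` is assumed to be a differentiable model mapping an image
    batch to a scalar predicted energy per image; `c` is a hypothetical
    trade-off weight that keeps the test image close to the original.
    """
    x = x_orig.clone().detach().requires_grad_(True)
    for _ in range(steps):
        # maximize predicted energy, penalize distance to the original
        objective = estimator(x).sum() - (1.0 / c) * torch.mean((x - x_orig) ** 2)
        objective.backward()
        with torch.no_grad():
            x += step * x.grad.sign()  # FGSM-style ascent step
            x.clamp_(0.0, 1.0)         # stay a valid image
        x.grad.zero_()
    return x.detach()
```

Universal testing can reuse the same loop with a single shared perturbation optimized over many images rather than one `x_orig`.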

Mean Absolute Error (MAE) per input of the estimator model.

For the CIFAR-10 dataset, the MAE per input is 12.8, 3.27, and 22.62 for the RANet, BlockDrop, and BranchyNet models, respectively. For the CIFAR-100 dataset, the MAE per input is 31.15, 3.41, and 11.3 for the RANet, BlockDrop, and BranchyNet models, respectively.
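For reference, the per-input MAE is just the mean absolute difference between the estimator's predicted energy and the measured energy; a minimal sketch with placeholder values:

```python
import numpy as np

# placeholders for per-input predicted and measured energy values
predicted = np.array([10.2, 8.1, 12.5])
measured = np.array([9.8, 8.9, 11.0])

mae_per_input = np.mean(np.abs(predicted - measured))
print(f"MAE per input: {mae_per_input:.2f}")
```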


Correlation Study

Correlation between estimated and measured energy values. To illustrate the correlation between the energy consumption predicted by the estimator model and the actual energy consumption, we present the Pearson correlation coefficient (r) and the correlation p-value. If two sets of values are correlated, the r value is significant and the p-value is low.

For the CIFAR-100 dataset, the r values are 0.38, 0.17, and 0.31 for the RANet, BlockDrop, and BranchyNet models, respectively, and all p-values are less than 0.0005. These results indicate that the values are correlated.

For the CIFAR-10 dataset, the r value for RANet is only 0.021, but the p-value is 0.03; therefore, the values are still likely to be (weakly) correlated. For the BlockDrop model, the r value and p-value are 0.004 and 0.66, which suggests that the values are less likely to be correlated. However, if we consider only the inputs whose energy consumption is above the 75th-percentile value, the p-value drops to 0.14, suggesting correlation. Therefore, if the estimator model can accurately predict high-energy-consuming inputs (i.e., clearly differentiate between low/mid and high energy-consuming inputs), we can use it to generate energy-expensive testing inputs.
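These statistics can be reproduced from raw measurements with SciPy; the arrays below are placeholders for per-input estimated and measured energy values.

```python
import numpy as np
from scipy.stats import pearsonr

# placeholders: per-input estimated vs. measured energy
estimated = np.random.rand(1000)
measured = np.random.rand(1000)

r, p = pearsonr(estimated, measured)
print(f"all inputs: r = {r:.3f}, p = {p:.4f}")

# restrict to high-energy inputs (above the 75th percentile),
# as in the BlockDrop analysis above
hi = measured > np.percentile(measured, 75)
r_hi, p_hi = pearsonr(estimated[hi], measured[hi])
print(f"high-energy subset: r = {r_hi:.3f}, p = {p_hi:.4f}")
```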

Evaluation

We measure the usefulness of EREBA using four characteristics: Effectiveness, Sensitivity, Quality, and Robustness.

Effectiveness

We evaluated our techniques against the CIFAR-10-C and CIFAR-10-P datasets. We first evaluated each baseline dataset on the three Adaptive Neural Networks and recorded the average percentage increase in AdNN-reduced FLOPs. The two tables below show the results: the left table shows the results on corruption data, while the right table shows the results on perturbation data.
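For readers reproducing this step: CIFAR-10-C is distributed as one .npy array per corruption (10,000 images for each of five severities, stacked severity-1 first) plus a shared labels.npy. A minimal loading sketch, assuming the archive is unpacked at ./CIFAR-10-C (a hypothetical path):

```python
import numpy as np

images = np.load("CIFAR-10-C/fog.npy")     # (50000, 32, 32, 3), uint8
labels = np.load("CIFAR-10-C/labels.npy")  # (50000,)

severity = 3  # severity levels run from 1 to 5
lo, hi = (severity - 1) * 10000, severity * 10000
fog_images, fog_labels = images[lo:hi], labels[lo:hi]
```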


Corruptions and Perturbations

We picked the top-5 performing corruptions and perturbations from the above tables for each model and measured energy consumption on those. The figures below compare the baseline techniques with EREBA. Test images generated by EREBA can increase the energy consumption of AdNNs by up to 1,000%.

Percentage of energy increased by EREBA and baseline techniques on BlockDrop (CIFAR-10)

Percentage of energy increased by EREBA and baseline techniques on BranchyNet (CIFAR-10)


Percentage of energy increased by EREBA and baseline techniques on RANet (CIFAR-10)


Effect Size Tests

Effect size is a statistical concept that measures the strength of the relationship between two variables on a numeric scale. The Pearson correlation is one metric that can quantify this strength. Therefore, for the CIFAR-100 data, the aforementioned table reports the Pearson correlation coefficient (r) and p-value between the percentage of energy-consumption increase achieved by Input-based testing inputs and that achieved by inputs generated with the baseline techniques. For most of the cases, the energy-increase percentages are unlikely to be correlated. Only for BranchyNet do we find a significant negative correlation between the Input-based and perturbation-induced increases in energy consumption.

Sensitivity

By exploring the sensitivity of EREBA, we find the relation between the change in energy consumption of AdNNs and the magnitude of the perturbation added in Input-based testing. The magnitude of the perturbation is defined as the average squared difference between the test images and the original images; a minimal sketch of this metric follows. The results are shown after the sketch. With added perturbation, BranchyNet shows the highest increase in energy consumption.
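A minimal sketch of this metric, assuming image tensors with matching shapes:

```python
import torch

def perturbation_magnitude(x_test: torch.Tensor, x_orig: torch.Tensor) -> float:
    """Average squared pixel difference between test and original images."""
    return torch.mean((x_test - x_orig) ** 2).item()
```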

Average energy consumption increase of testing inputs constrained by magnitude of perturbation for BlockDrop, RANet and BranchyNet (CIFAR-10)


Quality

To explore the difference in semantic quality between the test images generated through Input-based testing and the original images, we measure the PSNR between those images; a sketch of the metric follows. The results are shown below.
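PSNR follows directly from the mean squared error; a minimal sketch, assuming pixel values in [0, 1]:

```python
import torch

def psnr(x_test: torch.Tensor, x_orig: torch.Tensor, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB; higher means the test image
    is visually closer to the original (infinite for identical images)."""
    mse = torch.mean((x_test - x_orig) ** 2)
    return (10.0 * torch.log10(max_val ** 2 / mse)).item()
```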

PSNR of the generated images (CIFAR-10)


Robustness

In this section, we analyze the robustness of EREBA by exploring its behavior against practical corruptions (e.g., fog, snow, frost). The results are shown below. The increase in energy consumption achieved by EREBA using corrupted images is in fact higher than that achieved using the original CIFAR-10 dataset, because the target AdNNs are not tuned to the corrupted images.


Effect of corruptions on EREBA with BlockDrop (CIFAR-10)

Effect of corruptions on EREBA with BranchyNet (CIFAR-10)


Effect of corruptions on EREBA with RANet (CIFAR-10)


Functional Accuracy

Functional Accuracy of Test Inputs. We classify 600 original inputs and the corresponding Input-based testing inputs generated for all three models (with C values of 1, 10, and 100 for BlockDrop, RANet, and BranchyNet, respectively) and compare the predictions in the two cases; a counting sketch follows. For RANet, BlockDrop, and BranchyNet, we find that 370, 521, and 74 predictions, respectively, are the same between the testing and original inputs. As the value of C increases, the perturbation magnitude also increases, and therefore the rate of misclassification becomes higher.
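A sketch of the counting step, assuming `model`, `originals`, and `test_inputs` are available from the generation steps above:

```python
import torch

@torch.no_grad()
def count_matching_predictions(model, originals, test_inputs):
    """Count how many test inputs keep the model's original prediction."""
    pred_orig = model(originals).argmax(dim=1)
    pred_test = model(test_inputs).argmax(dim=1)
    return (pred_orig == pred_test).sum().item()
```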


Adversarial Training Results

Energy consumption (J) for the original and retrained versions of AdNNs. (Left) BranchyNet under testing inputs; (Right) BranchyNet under original images (CIFAR-10)

Energy consumption (J) for the original and retrained versions of AdNNs. (Left) RANet under testing inputs; (Right) RANet under original images (CIFAR-10)
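For context, a hedged sketch of how such retraining might look: fine-tune the AdNN on a mix of original and EREBA-generated inputs, keeping the original labels. This is a generic fine-tuning loop of our own; the paper's retraining objective and schedule may differ.

```python
import torch
import torch.nn as nn

def retrain_with_test_inputs(adnn, originals, test_inputs, labels,
                             epochs=5, lr=1e-4):
    """Fine-tune an AdNN on original plus energy-expensive inputs so that
    perturbations no longer defeat its skipping/early-exit policy."""
    x = torch.cat([originals, test_inputs])
    y = torch.cat([labels, labels])  # generated inputs keep their labels
    opt = torch.optim.Adam(adnn.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    adnn.train()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(adnn(x), y)
        loss.backward()
        opt.step()
    return adnn
```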