Abstract
Deep learning (DL) models are trained on sampled data, whose distribution differs from that of real-world data (i.e., the distribution shift), which reduces model robustness. Various testing techniques have been proposed, including distribution-unaware and distribution-aware methods. However, distribution-unaware testing does not explicitly consider the distribution of test cases and may therefore generate redundant errors (within the same distribution), limiting its effectiveness. Distribution-aware testing techniques primarily focus on generating test cases that follow the training distribution, missing out-of-distribution data that may also be valid and should be considered in the testing process.
In this paper, we propose a novel distribution-guided approach for generating \textit{valid} test cases with \textit{diverse} distributions, which can better evaluate the model robustness (i.e., generating hard-to-detect errors) and enhance the model robustness (i.e., enriching training data). Unlike existing testing techniques that optimize individual test cases, DistXplore optimizes test suites that represent specific distributions. To evaluate and enhance the model robustness, we design two metrics: \textit{distribution difference}, which maximizes the distributional similarity between two different classes of data to generate hard-to-detect errors, and \textit{distribution diversity}, which increases the distribution diversity of generated test cases to enhance the model robustness. To evaluate the effectiveness of DistXplore in model evaluation and enhancement, we compare DistXplore with 9 state-of-the-art baselines on 8 models across 4 datasets. The evaluation results show that DistXplore not only detects more errors (e.g., 2$\times$+ on average), but also identifies more hard-to-detect errors (e.g., 12.1%+ on average); furthermore, DistXplore achieves a higher improvement in empirical robustness (e.g., 5.3% more accuracy improvement than the baselines on average).
Figure 1: Data Sampling and an illustrative example of a DL system
Valid Data: The task-relevant data for a specific task (e.g., digit classification)
Invalid Data: The task-irrelevant data for a specific task (e.g., noisy data, audio data, and tabular data when the task is digit classification)
In-Distribution Data: The part of the valid data that follows the distribution of the collected training data.
Out-of-Distribution Data: The remaining part of the valid data, which falls outside the distribution of the collected training data.
Machine learning (ML) aims to learn a model from sampled data (i.e., training data) in order to make decisions for a specific task. Due to the huge input space, it is impossible to collect all data for training. In practice, some high-quality data that follows a certain distribution is collected for training. As shown in the left part of Figure 1, for the digit classification task, there is a huge amount of task-relevant data for digits (i.e., the valid data shown in the dashed rectangle) in the whole input space (i.e., all data shown in the left part). The task-irrelevant data (e.g., noisy data, audio data, and tabular data) is called invalid data with respect to the given task. A small part of the valid data (e.g., datasets a and b in Figure 1) is collected for training the model. However, the training distribution is often not exactly the same as the distribution of the valid data (i.e., the distribution shift), which greatly affects the robustness. A fundamental assumption is that the model is intended to handle in-distribution (ID) data that follows the distribution of the training data, but it is hard for the model to correctly predict data that does not follow the training distribution (e.g., datasets c, d, and e in Figure 1), i.e., out-of-distribution (OOD) data, which motivates the need for testing before deployment.
Existing distribution-unaware techniques: Since data distribution is ignored by these techniques (e.g., DeepTest, DeepHunter, and TensorFuzz), they tend to generate redundant errors within the same distribution, which limits their effectiveness in testing and retraining.
Existing distribution-aware techniques: These techniques typically characterize the training distribution via a Variational Auto-Encoder (VAE) or a Generative Adversarial Network (GAN) and generate in-distribution data, while out-of-distribution data is considered "invalid". However, we argue that out-of-distribution data merely does not follow the distribution of the collected training data; it may still be valid and should be handled properly in the real deployment environment. For example, as shown in Figure 2, the inputs in the right column are considered invalid by existing distribution-aware testing, yet they are still visually valid.
We argue that visually valid data should be handled properly by a well-trained model, so the question becomes what kind of valid data should be generated by DL testing.
The goals of DL testing are model evaluation and model enhancement. For model evaluation, errors that cannot be easily detected (i.e., so-called strong errors) are more useful for revealing the weaknesses of the system. Taking traditional software as a comparison, we usually discover few bugs in traditional software because there are many defenses (e.g., parsers, exception handling) that filter invalid or vulnerable inputs. Similarly, existing defense techniques (e.g., adversarial example detection) can also provide defenses for DL systems, which is ignored by existing testing techniques. For model enhancement, testing should generate valid data with diverse distributions that can be added to the training data to improve the model's generalizability and robustness.
To this end, we take distribution into consideration and propose a novel distribution-guided testing framework (named DistXplore) to generate stronger valid data with more diverse distributions.
Figure 2: Examples of OOD data that are considered invalid by distribution-aware techniques but are visually valid
DistXplore is a novel distribution-guided testing framework for evaluating and enhancing DL systems. It adopts a search-based approach (i.e., a genetic algorithm) to adaptively generate test cases under the guidance of distribution. Unlike existing techniques that optimize test cases individually, DistXplore performs the optimization on a test suite that represents a specific distribution.
For model evaluation, DistXplore minimizes the distribution difference between the data in two different classes, which allows it to generate statistically indistinguishable errors that are difficult to defend against.
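The concrete distribution-difference measure is defined later in the paper; purely as an illustration, a kernel-based Maximum Mean Discrepancy (MMD) between a test suite and the training data of a target class could serve as the suite-level fitness. The function names (`rbf_kernel`, `suite_distance`) and the choice of an RBF kernel below are our assumptions, not the paper's implementation.

\begin{verbatim}
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel matrix between two batches of flattened inputs (N, D), (M, D)."""
    sq_dists = np.sum(x**2, 1)[:, None] + np.sum(y**2, 1)[None, :] - 2.0 * x @ y.T
    return np.exp(-gamma * sq_dists)

def suite_distance(suite, target_class_data, gamma=1.0):
    """Squared MMD between a generated test suite and one class of training data.
    A smaller value means the suite is statistically closer to that class, i.e.,
    misclassified members are harder for distribution-based defenses to flag."""
    k_ss = rbf_kernel(suite, suite, gamma)
    k_tt = rbf_kernel(target_class_data, target_class_data, gamma)
    k_st = rbf_kernel(suite, target_class_data, gamma)
    return k_ss.mean() + k_tt.mean() - 2.0 * k_st.mean()
\end{verbatim}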
To enhance the model robustness, we propose a metric to measure the distribution diversity of the test cases, which guides DistXplore to generate test suites with various distributions. Test cases with diverse distributions are more likely to cover unseen data and improve the model robustness.
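The exact diversity metric is also introduced later in the paper; as a hedged sketch, one plausible way to score diversity is to count how many distinct distance intervals (between each generated suite and a reference class) are covered. The bucketing scheme and function name below are hypothetical.

\begin{verbatim}
import numpy as np

def distribution_diversity(suite_distances, n_bins=10):
    """Hypothetical diversity score in [0, 1]: fraction of distance intervals
    covered by the generated suites. `suite_distances` holds the distribution
    distance (e.g., the MMD sketch above) of each suite to a reference class;
    suites spread over many intervals cover more unseen distributions."""
    d = np.asarray(suite_distances, dtype=float)
    d_max = d.max() + 1e-12  # avoid division by zero
    bins = np.minimum((d / d_max * n_bins).astype(int), n_bins - 1)
    return len(set(bins.tolist())) / n_bins

# Example: three suites landing in three different intervals -> 0.3
print(distribution_diversity([0.02, 0.35, 0.90], n_bins=10))
\end{verbatim}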
Figure 3: Illustration of generating test suites
Figure 3 shows the main idea of DistXplore. DistXplore generates test cases with diverse distributions for each class: it calculates the distribution difference between a test suite from one class and the training data of each of the other classes. The key insight is that classification is performed based on the relationship among all classes.
As shown in the right part of Figure 3, given an initial test suite sampled from the training data of a class, DistXplore aims to generate new test suites that have different distribution distances to the training data of the other classes. The distribution curve of the test suites (i.e., the red curve) shifts between the original distribution and the target distribution.
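A minimal sketch of how such a suite-level search could look, assuming simple additive-noise mutation and greedy, elitist selection; the actual DistXplore search operators and hyper-parameters are described in the approach section and may differ.

\begin{verbatim}
import numpy as np

def evolve_suite(seed_suite, target_class_data, distance_fn,
                 generations=50, offspring=20, eps=0.05, rng=None):
    """Hedged sketch of suite-level search: mutate the whole suite, keep the
    candidate whose distribution is closest to the target class, so the suite's
    distribution curve gradually shifts toward the target distribution.
    `distance_fn(suite, data)` is any distribution-difference measure
    (e.g., the MMD sketch above); all hyper-parameters are illustrative."""
    rng = rng or np.random.default_rng(0)
    best = seed_suite.copy()
    best_score = distance_fn(best, target_class_data)
    for _ in range(generations):
        for _ in range(offspring):
            candidate = np.clip(best + rng.uniform(-eps, eps, size=best.shape),
                                0.0, 1.0)
            score = distance_fn(candidate, target_class_data)
            if score < best_score:  # elitist selection on suite fitness
                best, best_score = candidate, score
    return best
\end{verbatim}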
For robustness enhancement, we adopt DistXplore to generate test suites with diverse distributions (not only hard-to-defend errors) that can enrich the training data (i.e., add unseen data) to improve the model robustness.
Code:https://anonymous.4open.science/r/DistXplore
Data:https://drive.google.com/drive/folders/1rgZA2xuMLhcYE40u4llWMxqEsew4rbzb?usp=sharing