On this page, we present our experimental evaluation results to demonstrate the effectiveness of the proposed framework.
RQ1: Is Olive more efficient than existing approaches? In this RQ, we compare the efficiency of Olive with the BaB baseline approaches to assess whether Olive performs better.
The table shows the average speedup w.r.t. BaB over solved problem instances (i.e., instances for which at least one approach gives a verdict, true or false) and the number of problem instances that are not solved by BaB but are solved by other approaches. The results clearly show the performance advantage of the three incremental verification approaches over BaB: the incremental approaches achieve a significant speedup over BaB. Thanks to this speedup, for all models, the incremental approaches manage to solve more verification problem instances within the timeout. This insight is consistent with that of Ivan [Ugare et al. 2023], indicating that incremental verification, which leverages existing knowledge about the verification of 𝑁, can indeed bring advantages to the verification of a similar network 𝑁∗.
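To make these two metrics concrete, the following is a minimal sketch (not the paper's evaluation script) of how they could be computed from per-instance verification times; the dictionary format, the timeout handling, and the use of the arithmetic mean are assumptions for illustration.

```python
# Hypothetical per-instance verification times in seconds; None means the
# approach hit the timeout without returning a verdict (true/false).
TIMEOUT = 3600.0
times_bab   = {"inst1": 120.0, "inst2": None,  "inst3": 900.0}
times_olive = {"inst1": 15.0,  "inst2": 600.0, "inst3": 450.0}

# "Solved" instances: at least one approach returned a verdict.
solved = [i for i in times_bab
          if times_bab[i] is not None or times_olive[i] is not None]

# Average speedup w.r.t. BaB over solved instances (timeouts counted as TIMEOUT).
speedups = [(times_bab[i] or TIMEOUT) / (times_olive[i] or TIMEOUT) for i in solved]
avg_speedup = sum(speedups) / len(speedups)

# Instances not solved by BaB but solved by the other approach.
extra_solved = [i for i in solved
                if times_bab[i] is None and times_olive[i] is not None]
print(avg_speedup, len(extra_solved))
```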
RQ2.1 How does Olive perform for violated and certified verification problems, respectively?
(Box plots for each benchmark model: MNIST L2, MNIST L4, OVAL Base, OVAL Deep, OVAL Wide.)
The figures above show the breakdown of Olive's results for the problem instances where counterexamples exist (i.e., violated problem instances) and where no counterexample exists (i.e., certified problem instances). For each benchmark model, we produce box plots presenting the speedup rates of Olive-G (Greedy) and Olive-B (Balanced) over the Ivan baseline for each individual problem instance; the left plot in each sub-figure shows the results for violated instances, and the right plot shows the results for certified instances.
By observing these, we can further confirm our conjecture that Olive performs significantly better than Ivan on the problem instances where counterexamples exist, while its performance is as good as Ivan's on the certified problem instances. For all benchmark models, it is evident that the speedup rates for violated instances are significantly better than those for certified instances. While in most cases the speedup distributions for certified instances are around 1, i.e., Olive's performance on those instances is similar to Ivan's, the speedup distributions for violated instances are much greater than 1, signifying the superiority of Olive on these violated problem instances in terms of efficiency.
For some violated problem instances, such performance differences are also very small; this can be because, under the specified parameters, weight pruning and quantization do not trigger a big difference between the original model N and the updated model N*.
RQ2.2 Detailed comparison of the BaB baseline, Ivan (annotated as reuse), Olive-G (annotated as greedy), and Olive-B (annotated as balanced), with detailed verification times for certified and violated problem instances
RQ3 Which strategy of Olive can achieve better performance?
In most of the benchmark models, our proposed Olive achieves higher efficiency than Ivan on many problem instances. In 13 out of 15 models, there are problem instances for which Olive achieves more than 10× speedup; specifically, in MNISTL2 with Quant(Int8), the speedup achieved by Olive can exceed 400×, and in MNISTL4 with Prune(12%), the speedup can reach around 120×. Moreover, in MNISTL2 with Prune(7%), MNISTL4 with Prune(7%), OVAL21BASE with Prune(0.1%), and OVAL21BASE with Prune(1%), we observe a significant number of problem instances for which Olive is much faster than Ivan, and a large portion of these instances take more than 10 minutes to verify, which signifies that they are complex verification problems.
We observe that both Olive-G and Olive-B have many points above the horizontal line of speedup 1, which indicates that there are many problem instances for which both Olive-G and Olive-B outperform Ivan. We then count, for Olive-G and Olive-B respectively, the number of problem instances on which they significantly outperform Ivan. We find that, first, in many benchmark models the performances of Olive-G and Olive-B are comparable, e.g., in MNISTL4 with Prune(12%) and in OVAL21WIDE with Prune(1%); moreover, in general, Olive-B outperforms Olive-G in many cases. For instance, in OVAL21BASE with all weight pruning options Prune(0.1%), Prune(1%), and Prune(3%), there appears a cluster of Olive-B points that outperform Ivan with no Olive-G points in their vicinity. These results show the performance advantage of Olive-B over Olive-G, and this superiority can be attributed to Olive-B's balanced strategy, which does not focus only on the exploitation of suspicious sub-problems.
RQ4 How do different hyperparameters influence the performance of Olive-B?
The parameters we used in our paper are as follows, following the guidelines provided in the literature [1]:
self.alpha = 0.5  # 𝛼 in line 3
self.C = 1.2      # 𝐶 in line 3
self.sigma = 0.5  # 𝜎 in ComputeReward
self.c = 0.2      # 𝑐 in line 15
Recall Algorithm 3 for Olive-B above; there are multiple hyperparameters that can influence the performance of Olive-B, including 𝐶 and 𝛼 related to progressive widening, 𝑐 related to UCB1, and 𝜎 related to reward computation. To understand how these hyperparameters influence Olive-B, we randomly select 25 instances from each of the 15 models (covering all of the models with different architectures, under all different model alteration methods), leading to 375 instances in total. These instances encompass both certified and violated cases, as well as some cases not solved by any approach, because they may still be solved by the variant approaches.
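For concreteness, below is a minimal sketch (a simplified assumption, not the paper's implementation) of where each of these hyperparameters enters an MCTS-style search such as Olive-B: 𝐶 and 𝛼 bound the number of children a node may have (progressive widening), 𝑐 scales the exploration bonus in UCB1, and 𝜎 mixes the two reward attributes. The node structure and the exact reward formula are illustrative only.

```python
import math

class Node:
    """Hypothetical search-tree node; the real node carries BaB sub-problem state."""
    def __init__(self, depth=0, lower_bound=0.0):
        self.children = []
        self.visits = 0
        self.total_reward = 0.0
        self.depth = depth
        self.lower_bound = lower_bound  # approximated lower bound of the sub-problem

def compute_reward(node, sigma=0.5):
    # Hypothetical reward mixing node depth and the approximated lower bound;
    # sigma trades off the two attributes (the paper's exact formula may differ).
    return sigma * (1.0 / (1.0 + node.depth)) + (1.0 - sigma) * (-node.lower_bound)

def may_expand(node, C=1.2, alpha=0.5):
    # Progressive widening: allow a new child only while the number of children
    # stays below C * visits^alpha.
    return len(node.children) < C * max(node.visits, 1) ** alpha

def ucb1_select(node, c=0.2):
    # UCB1: average reward (exploitation) plus an exploration bonus scaled by c.
    def score(child):
        if child.visits == 0:
            return float("inf")
        exploit = child.total_reward / child.visits
        explore = c * math.sqrt(math.log(max(node.visits, 1)) / child.visits)
        return exploit + explore
    return max(node.children, key=score)
```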
RQ4.1 Overall observation
Here, we can observe that the speedup ranges from 5.33× (by RewDepth) to 7.67× (by RewBound), which exhibits a consistent performance advantage of Olive-B over Ivan.
RQ4.2 The distributions of instances solved by variants of Olive-B, across different speedups and time costs
(Reward variants compared: Olive-B (Default), RewDepth (Reward Depth), RewBound (Reward Bound).)
RewBound, which defines the reward purely by approximated lower bounds, outperforms the default configuration; in contrast, RewDepth, which defines the reward purely by node depth, underperforms the default configuration. First, this result suggests the effectiveness of approximated lower bounds in guiding the search for counterexamples. While this may suggest that defining rewards purely by approximated lower bounds is good, the merit can also be attributed to a “balance” flavor of that setting, because approximated lower bounds are independent of node depth. In general, the optimal ratio between these two attributes may still depend on the concrete problem (see the short sketch after this discussion).
RewDepth shows a slight improvement (49 instances in 0-100s) in quick solutions, due to its greediness, but its effectiveness diminishes for longer-running instances with only 14 solved in the 200-400s range, which suggests a limitation in capturing problem complexity.
RewBound exhibits enhanced performance for medium-difficulty problems, solving 42 instances in 0-100s and notably 18 instances in the 200-400s range, indicating that defining reward in this way provides effective guidance for more complex problems.
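One way to read these two variants is as the extreme settings of the 𝜎-mixed reward sketched earlier; whether the paper defines them exactly this way is an assumption, so the snippet below is illustrative only.

```python
# Illustrative only: RewDepth and RewBound as the two extremes of the
# hypothetical sigma-mixed reward sketched earlier.
def reward_depth(node):
    return compute_reward(node, sigma=1.0)  # reward driven purely by node depth

def reward_bound(node):
    return compute_reward(node, sigma=0.0)  # reward driven purely by the approximated lower bound
```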
(UCB1 variants compared: UCB1Exploit, UCB1Balance, UCB1Explore, UCB1_B2 (σ=0.6), UCB1_B3 (σ=0.8).)
Different settings regarding UCB1 perform very similarly, and all of them are comparable with the default setting. In particular, UCB1Explore exhibits the worst performance, which indicates that too much emphasis on exploration may not bring good performance.
(Progressive-widening variants compared: PWExplore1, PWExplore2, PWBalance, PWExploit.)
Different settings regarding progressive widening are also comparable with the default setting, but PWExploit can be a bit worse. This is expected because, with PWExploit, the progressive widening is too greedy and rarely expands a new arm, thus missing the chance of finding a counterexample in arms other than the one suggested by the template (see the toy comparison after this discussion).
PWExplore1 shows remarkable efficiency for the most challenging problems, solving 52 instances in the 600-1000s range.
PWExplore2 excels in quicker solutions with 46 instances in 0-100s and still performs well for medium-difficulty problems with 15 instances in the 400-600s range. This result indicates that strategies favoring exploration are generally useful in progressive widening.
PWBalance also performs well for quick solutions, achieving speedups of over 10× for 46 instances in the 0-100s range.
In contrast, PWExploit solves 41 instances in 0-100s but only 26 instances in the 200-400s range. This stresses that different settings of progressive widening can significantly impact the efficiency of the algorithm for problems of different complexities.
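To illustrate why an exploitation-leaning setting rarely expands a new arm, the toy comparison below evaluates the progressive-widening limit 𝐶 · 𝑛^𝛼 for a few hypothetical (𝐶, 𝛼) choices; the concrete values used by PWExplore, PWBalance, and PWExploit are not taken from the paper.

```python
import math

# Hypothetical progressive-widening settings (C, alpha); not the paper's values.
settings = {
    "PWExplore-like": (2.0, 0.7),
    "PWBalance-like": (1.2, 0.5),  # the default in the earlier sketch
    "PWExploit-like": (1.0, 0.3),
}

for name, (C, alpha) in settings.items():
    # Maximum number of children allowed after n visits: ceil(C * n^alpha).
    limits = [math.ceil(C * n ** alpha) for n in (1, 10, 100, 1000)]
    print(name, limits)
```

The exploitation-leaning setting grows the number of arms much more slowly, so the search keeps revisiting the arm suggested by the template instead of opening new ones.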
RQ5 How is the performance of Olive subject to the similarity of 𝑁∗ to 𝑁?
We find that, as the dissimilarity between the two networks increases, the performances of our approaches Olive-G and Olive-B evidently decrease. For the problem instances of OVAL21BASE, this decrease is monotonic, and the problem instances of OVAL21DEEP exhibit the same trend. Note that the Ivan baseline demonstrates a similar phenomenon, showing that sufficient similarity between the two networks is a necessary condition for incremental verification to be effective. Specifically, for the selected models OVAL21BASE and OVAL21DEEP, both require a similarity such that the weight pruning percentage is lower than 3%; otherwise, the incremental verification approaches do not perform as well as the BaB approach.
[1] Cameron B. Browne et al. "A Survey of Monte Carlo Tree Search Methods." IEEE Transactions on Computational Intelligence and AI in Games 4.1 (2012): 1-43.