ai-CPS benchmark - RQ2: Falsification

RQ2: Effectiveness of Falsification

In this section, we introduce the falsification tools in our paper, and discuss their effectiveness on AI-enabled CPS.

What is Falsification?

Falsification is a search-based testing technique based on hill climbing for CPS quality assurance, which aims to find a time-variant input signal u violating a formal specification describing the desired behavior. As illustrated by the figure below, the AI-enabled CPS represented by M^c takes input signal u and produces output signal M^c(u), which can be regarded as a black-box. Further, the system consists of an AI controller C and a physical system (Plant). The Plant is a system whose dynamics are given by a black-box function M, while AI-controller C with the inputs y(t) and u(t) outputs a control command c(t). Given a temporal specification φ, usually in the form of Signal Temporal Logic (STL) [1], falsification aims to hunt for a temporal signal u to the system that violates φ.

For this propose, Donzé [2] presents a quantitative robustness semantics of temporal formulas. Robust semantics extends the classical boolean satisfaction relation in the following sense: given a specification φ and an input signal u, it tells not only whether φ is satisfied or violated, but also how robustly the formula is satisfied or violated. Falsification is usually cast as an optimization approach that generates input signals with the aim of minimizing robustness. The search process terminates when an input signal with negative robustness is found, which means violate φ.

Falsification of AI-enabled CPS

Experiment Design

Falsification Tools. Up to the present, different optimization-based falsification methods have been proposed, and some well-known established tools (e.g., Breach [3] and S-TaLiRo [4] ) have also been developed. In RQ2, we select Global Nelder-Mead (GNM) and CMAES for Breach, and Simulated Annealing (SA) and stochastic optimization with adaptive restart (SOAR) for S-TaLiRo. Note that, although Breach and S-TaLiRo are based on the basic falsification algorithm, they have integrated different back-end optimization solvers to better achieve the search-based testing of CPS. In RQ2, we will call these four algorithms in abbreviated form for simplicity, namely, GNM-BR, CMAES-BR, SA-ST and SOAR-ST.

Experiment setup. Due to the randomness of falsification algorithms, we repeat each configuration for 30 times with an upper limit of 300 simulations each time. A corresponding falsification rate FR = # successful trials/ # total trials will be obtained. The time and #sim in the table below are the average time and simulation numbers of the successful trials.

Falsification performance comparison between four existing falsification algorithms

Effectiveness of Falsification

1. Adaptive Cruise Control (ACC)

To visually view the effectiveness of different falsification tools, we use histogram to present the abilities of different falsification algorithms for specific cases. For ACC-DDPG, both GNM-BR and SA-ST find one counterexample, compared with CMAES-BR and SOAR-ST. This also reflects the performance of ACC-DDPG controller to a certain extent. For ACC-TD3, GNM-BR and CMAES-BR completely outperform SA-ST and SOAR-ST. As for ACC-SAC and ACC-T, only CMAES-BR has slightly better performance.

2. Lane Keeping Assistant (LKA)

For LKA, only LKA-DDPG cannot be falsified by these four algorithms. For four other CPS, there is no obvious difference on the effectiveness of these falsifiaiotn algorithms.

3. Automatic Parking Valet (APV)

As for APV, there is no difference on the effectiveness of four falsification algorithms. Obviously, DRL-based controller has worse performance than traditional controller.

4. Exothermic Chemical Reactor (CSTR)

All of them cannot falsify CSTR-DDPG, CSTR-PPO and CSTR-TD3. This proves the performance of DRL-controllers to a certain extent. For CSTR-T, only GNM-BR does not achieve 100% FR.

5. Land a Rocket (LR)

For LR-DDPG, the falsification rate of four algorithms achieve 100%. For LR-T, all four algorithms detect some counterexamples, but not all.

6. Abstract Fuel Control (AFC)

The results of AFC clearly present the effectiveness of four falsification algorithms.

For AFC-DDPG, none of these four algorithms can find the input signal that violates the hard safety S1.
For AFC-PPO, only GNM-BR achieves a 100% falsification rate and CMAES-BR succeeds 10 times. Instead, SA-ST and SOAR-ST fails to seek an input signal that violates the hard safety S1.
The result of AFC-A2C is similar to AFC-PPO.
As for AFC-T, GNM-BR and CMAES-BR are inferior to two other algorithms.

7. Wind Turbine (WT)

Four falsification algorithms show a consistency on the WT-DDPG and WT-PPO. Instead, for WT-T, The FR of GNM-BR and SOAR-ST has reached 100% and CMAES-BR fails to find any counterexample after 30 trials.

8. Steam Condenser (SC)

For all SC systems with DRL controllers, the performance of four algorithms is completely consistent. However, The falsification rate of four algorithms achieve 100%. This is also shows that DRL controllers perform better than traditional PID controller from the side.

9. Water Tank (WTK)

For WTK-DDPG and WTK-TD3, no disagreement is observed for these four methods. Same with SC, 100% falsification rate on WTK-T also proves the effectiveness of falsification tools on traditional CPS.

Summary and Discussion

Summary.

Although falsification is the state-of-the-art validation method. However, it sometimes will fail to find counterexamples on AI-enabled CPS.
Different algorithms have distinct performance on the same AI-enabled CPS.
DRL-based controllers have more powerful and safer performance, compared with traditional controller, such as WTK. However, there are also some exceptions like APV.

Discussion.

The performance of these four falsification algorithms has been evaluated and analyzed in detail. For the failed falsification cases, on the one hand, it partly depends on the concrete search algorithm; on the other hand, the trajectory of AI-enabled controllers with more local minimum or non-differentiable points makes the search process more difficult. The above results prove the famous no free lunch theorem in optimization, and also motivates us to develop more effective algorithms to handle these emerging AI-enabled CPS.
From another perspective, new DRL training methods and strategies are urgently needed to better train the controllers that perform some complex control tasks.

Reference

Maler, O., Nickovic, D.: Monitoring temporal properties of continuous signals. In: Lakhnech, Y., Yovine, S. (eds.) FORMATS 2004 and FTRTFT 2004. LNCS, vol. 3253, pp. 152–166. Springer, Heidelberg (2004)
Donzé A, Maler O. Robust satisfaction of temporal logic over real-valued signals. International Conference on Formal Modeling and Analysis of Timed Systems. Springer, Berlin, Heidelberg, 2010: 92-106.
Donzé, A.: Breach, A toolbox for verification and parameter synthesis of hybrid systems. In: Touili, T., Cook, B., Jackson, P. (eds.) CAV 2010. LNCS, vol. 6174, pp. 167–170. Springer, Heidelberg (2010)
Annpureddy, Y., Liu, C., Fainekos, G., Sankaranarayanan, S.: S-taLiRo: A tool for temporal logic falsification for hybrid systems. In: Abdulla, P.A., Leino, K.R.M. (eds.) TACAS 2011. LNCS, vol. 6605, pp. 254–257. Springer, Heidelberg (2011)

Google Sites

Report abuse