RQ3: How effective are different optimization methods in falsifying physics engine-based robotics manipulation tasks?

The findings of RQ1 also emphasize the need for testing support in the development of AI-enabled robotics applications with Isaac Sim. Building on this insight, we develop a Python-based falsification framework that can be used directly with physics simulators as well as OpenAI Gym environments. However, although falsification has proved effective on traditional CPSs, its efficacy for robotics tasks simulated with modern physics simulators remains unclear. To address this gap, we introduce RQ3 to compare the performance of three optimization methods, i.e., random sampling, Nelder-Mead, and dual annealing, in falsifying robotics tasks with AI software controllers. Using our benchmark, we conduct a falsification test that assesses the robustness and reliability of the AI controllers and helps identify potential failures and vulnerabilities in the system. This test also demonstrates the extensibility and applicability of our benchmark.
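To make the setup concrete, the following is a minimal sketch of how such a falsification loop can be organized, using SciPy's `dual_annealing` to minimize STL robustness over a task's input space. The `simulate` and `robustness` functions here are illustrative stand-ins (a real setup would roll out the trained policy in Isaac Sim or a Gym environment and evaluate the actual STL specification); they are not part of the framework described in the text.

```python
import numpy as np
from scipy.optimize import dual_annealing

def simulate(x0, horizon=50):
    """Hypothetical stand-in for a task simulation: map an initial
    condition x0 to a trajectory (a real version would run the policy
    in the simulator)."""
    t = np.arange(horizon)
    return x0 * np.cos(0.3 * t)

def robustness(trajectory, bound=1.5):
    """Robustness of the STL spec "always |s(t)| < bound": positive
    while the spec holds, negative once it is violated."""
    return bound - np.max(np.abs(trajectory))

def falsify(bounds, max_simulations=300, seed=0):
    """Falsification = minimize robustness over the input space; a
    negative optimum is a counterexample that violates the spec."""
    result = dual_annealing(
        lambda x: robustness(simulate(x[0])),
        bounds=bounds,
        maxfun=max_simulations,  # budget on task simulations
        seed=seed,
    )
    return result.x, result.fun

x_star, rob = falsify(bounds=[(-2.0, 2.0)])
print(rob < 0)  # a violation exists, since |x0| can exceed the bound
```

Random sampling and Nelder-Mead can be dropped into the same loop by swapping the optimizer while keeping the robustness objective and the simulation budget fixed.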

Falsification Test

We test the trained AI controllers from RQ2 with the proposed falsification framework. For each AI controller, we conduct 30 falsification trials, each with a budget of at most 300 task simulations.
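This trial protocol can be sketched as a simple harness that runs independent trials under a fixed simulation budget and records the success statistics reported below. The harness and the toy objective are hypothetical illustrations, not the paper's actual implementation.

```python
import numpy as np

def run_trials(objective, sample, n_trials=30, budget=300, seed=0):
    """Run n_trials independent falsification trials; each trial may
    spend at most `budget` task simulations. A trial succeeds as soon
    as the robustness objective goes negative."""
    rng = np.random.default_rng(seed)
    successes, sims_used = 0, []
    for _ in range(n_trials):
        for i in range(1, budget + 1):
            if objective(sample(rng)) < 0:  # robustness < 0 => falsified
                successes += 1
                sims_used.append(i)         # simulations this trial needed
                break
    mean_sims = np.mean(sims_used) if sims_used else float("nan")
    return successes, mean_sims

# Toy objective: robustness is negative on ~10% of the input space,
# so random sampling falsifies it quickly.
succ, mean_sims = run_trials(
    objective=lambda x: x - 0.1,     # falsified when x < 0.1
    sample=lambda rng: rng.uniform(0.0, 1.0),
)
```

The same harness works for any optimizer, as long as it respects the per-trial budget and reports whether a negative-robustness input was found.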

For each method, the following table reports the number of successful falsifications, the average time taken, and the average number of task simulations required for a successful falsification.


The results of our falsification test show that dual annealing outperforms the other methods. While it achieves one fewer successful falsification than the random approach in the BP and PH tasks with the TRPO controller, it holds a clear advantage in the remaining tasks. In contrast, Nelder-Mead performs poorly in robotics manipulation tasks. A possible reason is that the robustness of an STL specification is a highly nonlinear function of the system inputs, so the search landscape contains a large number of local optima. In such cases, heuristic direct search methods like Nelder-Mead are prone to getting stuck in local optima and, therefore, may fail to falsify these tasks.
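This failure mode can be illustrated on a small multimodal surrogate for a robustness landscape (hypothetical; a real landscape would come from simulated trajectories). From a poor starting point, Nelder-Mead converges to a nearby local optimum, while dual annealing searches globally within the same bounds.

```python
import numpy as np
from scipy.optimize import minimize, dual_annealing

def robustness(x):
    """Illustrative multimodal surrogate for an STL robustness
    landscape: a shallow quadratic bowl with an oscillatory term,
    giving many local minima and a negative global minimum."""
    x = np.atleast_1d(x)[0]
    return 0.05 * (x - 3.0) ** 2 + np.sin(5.0 * x)

# Nelder-Mead from a poor start converges to a nearby local optimum ...
local = minimize(robustness, x0=[-4.0], method="Nelder-Mead")

# ... while dual annealing escapes local optima within the same bounds
# and reaches the negative (falsifying) region of the landscape.
best = dual_annealing(robustness, bounds=[(-5.0, 5.0)], seed=0)

print(local.fun > best.fun)  # True: the local search was trapped
```

The contrast mirrors the observed results: the more local optima the robustness function has, the more a global method like dual annealing benefits over a heuristic direct search.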

It is also worth noting that achieving high rewards during training does not necessarily indicate that an AI controller will reliably accomplish its desired task: high reward does not guarantee task completion. Our tests further show that even well-trained AI controllers can still be falsified by state-of-the-art falsification techniques. This highlights the necessity of incorporating falsification techniques into the testing framework of AI-enabled robotics manipulation.