The rapid proliferation and increasing complexity of software demand robust quality assurance, with graphical user interface (GUI) testing playing a pivotal role. Crowdsourced testing has proven effective in this context by leveraging the diversity of human testers to achieve rich, scenario-based coverage across varied devices, user behaviors, and usage environments. In parallel, automated testing, particularly with the advent of large language models (LLMs), offers significant advantages in controllability, reproducibility, and efficiency, enabling scalable and systematic exploration. However, automated approaches often lack the behavioral diversity characteristic of human testers, limiting their capability to fully simulate real-world testing dynamics. To address this gap, we present PersonaTester, a novel personified-LLM-based framework designed to automate crowdsourced GUI testing. By injecting representative personas, defined along three orthogonal dimensions (testing mindset, exploration strategy, and interaction habit), into LLM-based agents, PersonaTester enables the simulation of diverse human-like testing behaviors in a controllable and repeatable manner. Experimental results demonstrate that PersonaTester faithfully reproduces the behavioral patterns of real crowdworkers, exhibiting strong intra-persona consistency and clear inter-persona variability (117.86% to 126.23% improvement over the baseline). Moreover, persona-guided testing agents consistently generate more effective test events and trigger more crashes (100+) and functional bugs (11) than the baseline without personas, thus substantially advancing the realism and effectiveness of automated crowdsourced GUI testing.
This paper presents PersonaTester, a novel framework that enables personified LLM agents to simulate diverse and realistic human-like GUI testing behaviors; to our knowledge, it is the first work to apply the persona concept to software testing.
This paper introduces a structured three-dimensional persona schema that systematically models testing mindsets, exploration strategies, and interaction habits for test exploration.
Empirical experiments show that persona-guided LLM agents enhance test diversity and effectiveness, outperforming the non-personified baseline in bug triggering.
Persona B (Path 1)
Testing Mindset A. sequential & coherent, Exploration Strategy b. core function focused, Interaction Habit ii. valid & short input
Persona C (Path 2)
Testing Mindset A. sequential & coherent, Exploration Strategy c. input oriented, Interaction Habit iii. invalid input
Persona E (Path 3)
Testing Mindset B. divergent & non-linear, Exploration Strategy a. click oriented, Interaction Habit ii. valid & short input
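The persona combinations above can be encoded as a structured schema. The following sketch is a hypothetical Python rendering of the three-dimensional persona model; the enum labels mirror the examples listed (e.g., Persona B = A/b/ii), but the class names, field names, and `to_prompt` helper are illustrative assumptions, not PersonaTester's actual API.

```python
from dataclasses import dataclass
from enum import Enum

# Dimension 1: how the tester thinks about the app under test.
class TestingMindset(Enum):
    SEQUENTIAL_COHERENT = "A. sequential & coherent"
    DIVERGENT_NONLINEAR = "B. divergent & non-linear"

# Dimension 2: how the tester chooses what to explore next.
class ExplorationStrategy(Enum):
    CLICK_ORIENTED = "a. click oriented"
    CORE_FUNCTION_FOCUSED = "b. core function focused"
    INPUT_ORIENTED = "c. input oriented"

# Dimension 3: how the tester fills in text inputs.
class InteractionHabit(Enum):
    VALID_SHORT_INPUT = "ii. valid & short input"
    INVALID_INPUT = "iii. invalid input"

@dataclass(frozen=True)
class Persona:
    name: str
    mindset: TestingMindset
    strategy: ExplorationStrategy
    habit: InteractionHabit

    def to_prompt(self) -> str:
        """Render the persona as a system-prompt fragment for an LLM agent."""
        return (
            f"You are crowdworker {self.name}. "
            f"Testing mindset: {self.mindset.value}. "
            f"Exploration strategy: {self.strategy.value}. "
            f"Interaction habit: {self.habit.value}."
        )

# Persona B from the listing above: sequential & coherent,
# core function focused, valid & short input.
persona_b = Persona(
    "B",
    TestingMindset.SEQUENTIAL_COHERENT,
    ExplorationStrategy.CORE_FUNCTION_FOCUSED,
    InteractionHabit.VALID_SHORT_INPUT,
)
print(persona_b.to_prompt())
```

Because the three dimensions are orthogonal, every combination of enum values yields a distinct, well-formed persona that can be injected as a system-prompt fragment.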
RQ1.1 Intra-Cluster Cohesion
RQ1.2 Inter-Cluster Separation
RQ2.1 General Test Generation Effectiveness
RQ2.2 Input Test Generation Effectiveness
RQ3.1 Crash Bug Triggering Capability
RQ3.2 Functional Bug Triggering Capability
Inter-Cluster Separation between Random-Strategy Exploration and Personified LLM Agents
(random-strategy exploration cannot complete specific tasks and can only randomly explore the apps under test, so its exploration trend differs markedly)
We replace the GPT models with DeepSeek models and keep all other configurations the same (1 non-personified agent and 9 distinct personified agents). The results show similar patterns, although the similarity drops slightly, because the DeepSeek models are somewhat weaker than the GPT models on our tasks. Nevertheless, this illustrates that the design of PersonaTester achieves the intended crowdworker simulation and expresses the persona distinctions across different backbone LLMs.
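The backbone-swap setup described above can be sketched as a configuration that holds the agent line-up (1 non-personified + 9 personified agents) fixed while varying only the underlying model. The function and key names below are illustrative assumptions for exposition, not PersonaTester's actual configuration format, and the model identifiers are placeholders.

```python
def make_experiment(backbone: str) -> dict:
    """Build an experiment config: one baseline agent plus nine persona agents,
    all driven by the same LLM backbone."""
    agents = [{"id": "baseline", "persona": None, "model": backbone}]
    agents += [
        {"id": f"persona-{i}", "persona": f"P{i}", "model": backbone}
        for i in range(1, 10)
    ]
    return {"backbone": backbone, "agents": agents}

# Swapping the backbone changes only the "model" field; the 1 + 9
# agent structure is identical across both runs.
gpt_run = make_experiment("gpt-backbone")
deepseek_run = make_experiment("deepseek-backbone")
assert len(gpt_run["agents"]) == len(deepseek_run["agents"]) == 10
```

Keeping everything except the backbone constant is what lets the comparison isolate the model's contribution to simulation fidelity.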