Online Appendix - Effects of Variability in Models: A Family of Experiments

This site is an online appendix for the paper "Effects of Variability in Models: A Family of Experiments". In the paper, we report our findings from a comparative study of variability representations and their effect on model comprehension for three model types. We choose three variability representations: enumerative, annotative, and compositional. Additionally, we consider three model types: class diagrams, state machine diagrams, and activity diagrams. Our findings correspond to a family of three experiments, each comparing variability representations and their effect on the comprehension of one model type. As participants, we recruit undergraduate and graduate students from three universities, each situated in a different country.

Software product lines are inherently diverse and complex, and often some form of simplification is desired before implementing and testing them. Variability representations help developers model product lines by providing specialized notations that incorporate variability. The enumerative representation is the simplest: it comprises one model per variant. While the enumerative representation offers ease and adaptability, it poses scalability issues as the number of variants increases. The annotative representation is more concise; it comprises a unified model whose elements carry annotations that represent features. Different kinds of text highlighting and formatting (e.g., colors, bold, italics) can be leveraged to enhance annotative representations and make them more understandable. However, as the number of features increases, these models become cluttered and soon pose a trade-off between richness/diversity and understandability. The compositional representation consists of reasonably sized sub-models, which are composed to form the model of one variant. The sub-models can capture features and functionalities. However, with the compositional representation, developers face the extra cognitive step of composing the various sub-models and merging them into one, a step governed by its own set of rules. Since model comprehension underlies many subsequent developer activities (prototyping, implementation, testing), we study the effect of the choice of variability mechanism on different types of models. We choose the model types due to their popularity and the degree of familiarity our participants have with them. Also, with the chosen model types, our goal is to cover models that capture both the static and the dynamic structure of a system.
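
To make the three representations more concrete, the sketch below encodes a toy product line as plain Python data. It is only an illustration under assumed names (Phone, Display, Camera) and does not reflect the UML notation or the material used in the experiments.

    # Illustrative sketch only: a toy product line with features "Base" and
    # "Camera", encoded as plain Python data. All names are hypothetical and
    # do not correspond to the experiment material.

    # Enumerative: one complete model per variant.
    enumerative = {
        "BasicPhone":  {"elements": ["Phone", "Display"]},
        "CameraPhone": {"elements": ["Phone", "Display", "Camera"]},
    }

    # Annotative: a single unified model whose elements carry feature
    # annotations (presence conditions).
    annotative = {
        "Phone":   "Base",
        "Display": "Base",
        "Camera":  "Camera",
    }

    def derive_annotative(selected_features):
        """Keep only the elements whose annotation is a selected feature."""
        return [elem for elem, feature in annotative.items()
                if feature in selected_features]

    # Compositional: one small sub-model per feature, composed on demand.
    sub_models = {
        "Base":   ["Phone", "Display"],
        "Camera": ["Camera"],
    }

    def compose(selected_features):
        """Merge the sub-models of the selected features into one variant."""
        variant = []
        for feature in selected_features:
            variant.extend(sub_models[feature])
        return variant

    print(derive_annotative({"Base", "Camera"}))  # ['Phone', 'Display', 'Camera']
    print(compose(["Base", "Camera"]))            # ['Phone', 'Display', 'Camera']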

To this end, for each experiment, we created a model for every representation featured in the experiment. For the first experiment, we created nine class diagrams (3 sub-systems × 3 representations). We chose three sub-systems: Simulink, a phone management system, and a project management system. The representations featured in the first experiment were enumerative, annotative, and compositional. We dropped the enumerative representation from the last two experiments, as it led to results very similar to those of the annotative representation; after dropping it, we saw a better contrast in the results. For the second experiment, we created four state machine diagrams (2 sub-systems × 2 variability representations). We chose Robocode (xXXXXXX) because we aimed to keep the subject system's complexity from affecting the results. Robocode is a programming game used in teaching complex systems in software engineering, and our participants were familiar with the system because it was their term project. We chose two high-level features of the sub-systems to create the material. For the third experiment, we created four activity diagrams (2 sub-systems × 2 variability representations). We chose two sub-systems: an airline ticket reservation system and an email service provider.

In each experiment, participants were required to perform a set of tasks. To ensure homogeneity, we kept the same task types in all three experiments. The three task types were understanding variants, comparing two variants, and comparing all variants. We designed the task types to reflect common activities performed by developers in variant-rich systems. The experiments were designed such that each participant interacted with each sub-system and each variability representation exactly once. We followed a Latin-square design in all experiments.
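
For concreteness, the sketch below shows the kind of Latin-square assignment described above for an experiment with two sub-systems and two representations; the group and sub-system names are hypothetical.

    # Minimal sketch of a Latin-square assignment for one experiment with two
    # sub-systems and two variability representations (the names are
    # hypothetical). Each group sees every sub-system and every representation
    # exactly once.

    sub_systems = ["SubSystemA", "SubSystemB"]
    representations = ["annotative", "compositional"]

    def latin_square(sub_systems, representations):
        """One row of (sub-system, representation) pairs per group, built by
        cyclically shifting the order of the representations."""
        assignments = {}
        for g in range(len(representations)):
            row = [(system, representations[(i + g) % len(representations)])
                   for i, system in enumerate(sub_systems)]
            assignments[f"Group {g + 1}"] = row
        return assignments

    for group, row in latin_square(sub_systems, representations).items():
        print(group, row)
    # Group 1 [('SubSystemA', 'annotative'), ('SubSystemB', 'compositional')]
    # Group 2 [('SubSystemA', 'compositional'), ('SubSystemB', 'annotative')]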

Before the experiment, participants were required to provide certain information: demographics, a check for color-blindness, experience with the model type, experience with programming, and experience with variability representations. We conducted both a qualitative and a quantitative analysis for our evaluation. For the quantitative evaluation, we measured two aspects: completion time and accuracy. Completion time was measured in minutes. To measure accuracy, we compared participant responses to reference responses that all authors vetted as correct. Responses were marked as correct (1), partially correct (0.5), or incorrect (0). After the experiment, participants were asked to rate the understandability and difficulty of each variability representation. All self-reported data, including demographics and ratings, were collected on a 5-point Likert scale.
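
The sketch below illustrates how such a scoring scheme can be aggregated; it is not the analysis from the replication package (that is done in the shared R scripts), and the data shown is made up.

    # Illustrative scoring sketch; this is not the analysis shipped in the
    # replication package, and all data shown here is made up. Each response
    # is marked correct (1), partially correct (0.5), or incorrect (0), and
    # accuracy and completion time are aggregated per representation.

    from statistics import mean

    SCORES = {"correct": 1.0, "partial": 0.5, "incorrect": 0.0}

    # (representation, task type, verdict, completion time in minutes)
    responses = [
        ("annotative",    "understand_variant", "correct",   6.0),
        ("annotative",    "compare_two",        "partial",   9.5),
        ("compositional", "understand_variant", "partial",   8.0),
        ("compositional", "compare_all",        "incorrect", 12.0),
    ]

    def summarize(responses):
        """Mean accuracy and mean completion time (minutes) per representation."""
        per_rep = {}
        for rep, _task, verdict, minutes in responses:
            per_rep.setdefault(rep, []).append((SCORES[verdict], minutes))
        return {rep: (mean(score for score, _ in rows),
                      mean(time for _, time in rows))
                for rep, rows in per_rep.items()}

    print(summarize(responses))
    # {'annotative': (0.75, 7.75), 'compositional': (0.25, 10.0)}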

For models with a scope and size similar to our examples and for similar tasks, we can conclude that:

  • Annotative variability resulted in better comprehensibility than compositional variability for all task types.

  • The compositional mechanism can impair comprehensibility in tasks that require a good overview of all variants.

  • Annotative variability is preferred over the compositional one by a majority of the participants for all task types in all model types.

  • The preferred variability mechanism depends on the task at hand.

Replication Package

Below is a replication package for our family of experiments. We share the material used in our experiments, as well as the questionnaires used in each experiment. We also share the raw responses from the participants, both the subjective assessments and the task responses. In addition, we share our evaluation (scoring) of the task responses. Lastly, we share our R scripts for interpreting the responses and comparing the representations, as well as for plotting the figures included in the paper.