Our review of IR-based trace recovery compares 79 publications containing 132 empirical studies, systematically derived according to established procedures. Our study constitutes the most extensive summary of publications on IR-based trace recovery yet published, enabling an overview of the topic based on empirical results.
More than 10 IR models have been applied to trace recovery. More studies have evaluated algebraic IR models (i.e., VSM and LSI) than probabilistic models (e.g., BIM, PIN, LM, LDA). A visible trend, in line with developments in the general field of IR, is that the probabilistic subset of statistical language models has received increased attention in recent years. While extracting data from the primary publications, it became clear that the inconsistent use of IR terminology is an issue in the field. In an attempt to homogenize the language, we present structure in the form of a hierarchy of IR models and a collection of consistent IR terminology.
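To make the distinction between the two model families concrete, the sketch below ranks candidate trace links with one algebraic model (LSI, i.e., VSM reduced to a latent space) and one probabilistic topic model (LDA), using scikit-learn. It is an illustration only; the artifact texts, the number of dimensions, and the use of cosine similarity are our own assumptions, not taken from any primary publication.

```python
# Ranking candidate trace links with one algebraic model (LSI) and one
# probabilistic model (LDA); texts and dimensions are invented placeholders.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

requirements = ["the user shall log in with a password",
                "the system shall export reports as pdf"]
code_docs = ["def authenticate(user, password): ...",
             "def export_report(report, fmt='pdf'): ..."]
corpus = requirements + code_docs
n = len(requirements)

# Algebraic: TF-IDF vectors (VSM) reduced to a latent semantic space (LSI).
lsi = TruncatedSVD(n_components=2).fit_transform(
    TfidfVectorizer().fit_transform(corpus))

# Probabilistic: per-document topic distributions from LDA over term counts.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(
    CountVectorizer().fit_transform(corpus))

for name, space in (("LSI", lsi), ("LDA", lda)):
    # Similarities between the two disjoint artifact sets (bipartite recovery).
    print(name, cosine_similarity(space[:n], space[n:]).round(2))
```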
In the 132 mapped empirical studies, artifacts from the entire development process have been linked. The dominant artifact type is requirements at various levels of abstraction, followed by source code. Substantially fewer studies have been conducted on test artifacts, and only single publications have targeted user manuals and defect reports. Furthermore, a majority of the evaluations of IR-based trace recovery have been made on bipartite datasets, i.e., only trace links between two disjoint sets of artifacts were recovered.
Among the 79 primary publications mapped in our study, we conclude that the heterogeneity of reporting detail obstructs the aggregation of empirical evidence. Also, most evaluations have been conducted on small bipartite datasets containing fewer than 500 artifacts, which is a severe threat to external validity. Furthermore, a majority of evaluations have used artifacts originating from a university environment, or a dataset of proprietary artifacts from NASA. As a result, the two small datasets EasyClinic and CM-1 constitute the de facto benchmark in IR-based trace recovery. Another validity threat to the applicability of IR-based trace recovery is that a clear majority of the evaluations have been conducted in "the cave of IR evaluation", as defined by Järvelin and Ingwersen. Thus, we argue that in-vivo evaluations, in which IR-based trace recovery is studied within the full complexity of an industrial setting, are needed to demonstrate the feasibility of the approach and to motivate further studies on the topic. As such, our empirical findings intensify the recent call for additional empirical studies by CoEST, the Center of Excellence for Software Traceability.
Based on our synthesized results from IR-based trace recovery tools, we found no empirical evidence that the technology-oriented research on tools has resulted in more accurate trace recovery. No IR model regularly outperforms the classic VSM with TF-IDF feature weighting applied to text preprocessed by stop word removal and stemming. As long as trace recovery is based on nothing but the limited NL content of artifacts, there appears to be little value in merely hunting for improved P-R values on small datasets. Instead, our recommendation is to focus on the evaluations rather than the technical details of the tools. Simply evaluating VSM, the classic IR model with several open-source implementations available, for trace recovery in an industrial context has great potential to contribute to the field.
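For reference, a minimal sketch of that baseline follows, assuming NLTK's Porter stemmer and scikit-learn's TF-IDF implementation; the artifact texts and the cosine-similarity ranking are illustrative assumptions rather than a prescription from the primary publications.

```python
# A minimal VSM baseline for trace recovery: stop word removal, stemming,
# TF-IDF weighting, and cosine-similarity ranking. Requires scikit-learn
# and NLTK; all artifact texts below are invented for illustration.
import re

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

stemmer = PorterStemmer()

def tokenize(text):
    # Lowercase, keep alphabetic tokens, drop stop words, then stem.
    return [stemmer.stem(t) for t in re.findall(r"[a-z]+", text.lower())
            if t not in ENGLISH_STOP_WORDS]

# Source and target artifacts of a bipartite dataset.
requirements = ["The user shall log in with a valid password.",
                "The system shall export each report as PDF."]
source_code = ["def authenticate(user, password): ...",
               "def export_report(report): ..."]

matrix = TfidfVectorizer(tokenizer=tokenize).fit_transform(
    requirements + source_code)

# One row of candidate trace links, ranked by similarity, per requirement.
for req, sims in zip(requirements,
                     cosine_similarity(matrix[:len(requirements)],
                                       matrix[len(requirements):])):
    print(req, "->", sims.round(2))
```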
The strongest empirical evidence in favor of IR-based trace recovery tools comes from a set of controlled experiments on student subjects, reporting that tool-supported subjects outperform manual control groups. However, our results show that only a quarter of the reported P-R values in the primary publications reach the "acceptable" level as defined by Huffman Hayes, Dekhtyar, and Sundaram. This suggests that more research is required on how accurate candidate trace links need to be for an engineer to benefit from them.
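For orientation, those quality levels are defined as joint recall/precision thresholds. The sketch below encodes the thresholds as they are commonly quoted from that work; treat the exact figures as an assumption to be verified against the original publication.

```python
# Classify a P-R value into the quality levels of Huffman Hayes et al.
# The thresholds are the commonly quoted ones (an assumption; verify
# against the original publication before relying on them).
def quality_level(recall, precision):
    if recall >= 0.8 and precision >= 0.5:
        return "excellent"
    if recall >= 0.7 and precision >= 0.3:
        return "good"
    if recall >= 0.6 and precision >= 0.2:
        return "acceptable"
    return "unacceptable"

print(quality_level(recall=0.65, precision=0.25))  # acceptable
```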
In several primary publications, it is not made clear whether a query-based or matrix-based evaluation style was used. Also, the heterogeneous reporting styles of P-R values make aggregating the accuracy of candidate trace links challenging. We argue that the standard measures, precision at fixed recall levels and P-R at specific document cut-offs, should be reported, complemented by secondary measures such as mean average precision and discounted cumulative gain. Moreover, based on P-R values extracted from the query-based evaluations in the primary publications, we show that IR-based trace recovery is considerably more sensitive to the choice of input dataset than to the applied IR model.
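To make the recommended measures concrete, the sketch below computes precision at a document cut-off, average precision for one query (whose mean over all queries gives mean average precision), and binary-relevance DCG, in their textbook formulations; the ranking and gold-standard links are our own placeholders.

```python
# Textbook formulations of the recommended measures, computed over one
# ranked list of candidate trace links; ranking and gold set are invented.
import math

ranked = ["doc3", "doc1", "doc7", "doc2", "doc9"]  # candidate links, best first
relevant = {"doc1", "doc2", "doc4"}                # gold-standard trace links

def precision_at(k):
    # Fraction of the top-k candidate links that are correct.
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision():
    # Sum of precision at the rank of each retrieved relevant document,
    # divided by the total number of relevant documents; averaging this
    # value over all queries yields mean average precision (MAP).
    score, hits = 0.0, 0
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            score += hits / i
    return score / len(relevant)

def dcg_at(k):
    # Binary-relevance discounted cumulative gain: each relevant document
    # contributes 1 / log2(rank + 1).
    return sum(1 / math.log2(i + 1)
               for i, d in enumerate(ranked[:k], start=1) if d in relevant)

print(precision_at(5), average_precision(), dcg_at(5))
```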