Abstract:
Empirical studies that compare competing software project cost prediction systems frequently report conflicting results. This paper systematically reviews the published empirical evidence comparing two widely studied approaches: regression analysis and analogy-based prediction. A documented search strategy identified 20 empirical studies, of which 45% offer some support for analogy, 35% for regression and 20% are undecided. Results are inconsistent both between and within studies, even where the same data set and techniques are used, suggesting that variations in experimental procedure and data set characteristics contribute to the inconsistency. We conclude that, using broad definitions of each technique, neither dominates, and that research should focus on when each technique is preferable rather than on which is better overall.
Introduction:
Software engineering as a discipline demands theories grounded in real-world evidence, yet researchers frequently find that investigations of similar phenomena conducted in different studies produce conflicting results. These differences may arise from a number of issues, including heterogeneous sampling, measurement or reporting methods, which make comparing and combining results problematic. This study focuses upon empirical studies that have compared the accuracy levels of two competing software project cost prediction systems, namely regression analysis and analogy. We seek to answer the question: to what extent are the empirical results consistent both between and within studies?
Empirical studies in software engineering are used to investigate the efficacy of methods and the impact of various factors on productivity, quality or cost. Hence, ‘the reliable combination of results from independent experiments is an essential building block in any discipline attempting to build a solid empirical foundation’ [1]. However, different studies of the same phenomenon frequently report different findings, perhaps because the data have been collected or analysed in a non-standardised manner. This problem is compounded by difficulties in defining the population to which the results can generalise, and thus in obtaining a representative sample [2]. In addition, the selection of studies is influenced by heterogeneity of measures, environmental factors, publication bias and the ‘file drawer problem’ [3]. Therefore, in order to perform ‘a study of studies’ [4], that is, a meta-analysis [5], a systematic and documented procedure needs to be used to search for and screen relevant studies, code the results, and provide a quantitative summary of the findings [1, 6].
The existence of context variables poses a serious challenge to forming a body of knowledge in empirical software engineering. Basili, Shull et al. [7] suggest researchers should build models using a common framework for data collection to represent common observations that would allow generalisation [8]. However, this would involve replication of individual yet comparable studies in which results are refined rather than combined. In such cases, authors typically dismiss seemingly contradictory results rather than use them [9].
The need for empirical validation of different and often competing software project effort prediction systems has led to hundreds of studies being conducted. Yet there remains a lack of synthesised findings. Thus, to the best of our knowledge, this investigation is the first systematic comparison of the empirical evidence for two competing prediction systems.
The remainder of this paper is organised as follows. The next section, Section 2, briefly describes regression and analogy techniques as applied in effort prediction, and goes on to summarise work in which both regression and analogy-based approaches are used. This body of work comprises the papers from conference proceedings and refereed software engineering journals identified using our search strategy, which is documented in Section 3. Section 4 details the results of the analysis, while the conclusions are given in Section 5.
Related Work:
An effort prediction system can be derived using a number of techniques, for example expert judgement, statistical methods (e.g. regression) and, more recently, machine learning approaches (such as Artificial Neural Networks and Case-Based Reasoning (CBR)). These prediction systems typically have a primary cost factor, such as size (typically lines of code (LOC) or function points (FP)), and a number of adjustment factors (cost drivers) which characterise the project and influence effort. Cost drivers are used to adjust the preliminary estimate provided by the primary cost factor [10].
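To make this structure concrete, the following sketch shows a preliminary size-based estimate being adjusted by multiplicative cost drivers. It is illustrative only: the coefficients, driver names and multiplier values are our own assumptions and are not drawn from any of the studies reviewed here.

```python
# Illustrative sketch only: coefficients and cost-driver values are invented,
# not taken from any reviewed study or calibrated model.

def predict_effort(size_kloc, cost_drivers, a=3.0, b=1.1):
    """Preliminary estimate from the primary cost factor (size), adjusted by cost drivers."""
    preliminary = a * (size_kloc ** b)        # size-based estimate (person-months)
    adjustment = 1.0
    for multiplier in cost_drivers.values():  # e.g. product complexity, team experience
        adjustment *= multiplier
    return preliminary * adjustment

# A hypothetical 10 KLOC project with two illustrative cost drivers.
print(predict_effort(10, {"complexity": 1.15, "experience": 0.90}))
```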
As the ‘best’ technique frequently varies among studies, some researchers recommend that at least two prediction approaches be used in order to reduce risk [11]. This can be achieved through the application of multiple techniques to different subsets of the data, or by using more than one technique to produce a range of estimated values [12]. Despite the adoption of these recommendations, no convergent results have been obtained to date [e.g. 13, 14]. The most commonly applied methods for software effort prediction are regression [15] and analogy-based techniques [16]. In this paper, we investigate the consistency of results when using regression and analogy-based prediction methods. In the following subsections we define regression and analogy-based techniques, then briefly outline and synthesise results from papers included in this study.
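As a simple illustration of the two families of techniques compared in this review, the sketch below contrasts a log-linear regression fitted by ordinary least squares with an analogy-based (CBR) estimate that averages the effort of the k most similar past projects. The data are synthetic and single-featured; the sketch is not intended to reproduce any of the prediction systems evaluated in the reviewed studies.

```python
import numpy as np

# Synthetic, invented data: project size (function points) and effort (person-hours).
size   = np.array([100.0, 250.0, 400.0, 650.0, 900.0])
effort = np.array([900.0, 2100.0, 3600.0, 5200.0, 8100.0])

# Regression: ordinary least squares fitted to log-transformed size and effort.
slope, intercept = np.polyfit(np.log(size), np.log(effort), deg=1)

def regression_estimate(new_size):
    return float(np.exp(intercept + slope * np.log(new_size)))

# Analogy (CBR): average the effort of the k most similar completed projects.
def analogy_estimate(new_size, k=2):
    distances = np.abs(size - new_size)   # single feature, so distance is just the difference
    nearest = np.argsort(distances)[:k]   # indices of the k closest analogues
    return float(effort[nearest].mean())

new_project = 500.0
print(regression_estimate(new_project), analogy_estimate(new_project))
```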
Varying results have been found within and among studies using multiple techniques and measures, and many researchers [19] suggest that the success of a technique is dependent on data set characteristics. For example, Myrtveit and Stensrud [20] concluded that results are sensitive to experimental design after finding contradictory results when replicating previous studies which had claimed analogy outperformed regression. Similarly, Briand et al. [21] found analogy-based prediction systems were less robust than regression models when using data external to the organisation for which the model is built. In contrast, Mendes and Kitchenham [22] suggest that CBR predicts better across a large heterogeneous data set, and regression is better for within-company predictions. Finnie, Wittig and Desharnais [23, 24] claimed that, because of the complexities involved in software development projects, regression models were less effective than analogy-based prediction systems, which benefit from human judgement and intuition. Shepperd and Schofield [16] found that analogy gave better results than regression in terms of accuracy, and Angelis and Stamelos [25] also found analogy-based methods superior except when using transformed data derived from non-parametric bootstrap methods. Mendes et al. [26] and Jeffery et al. [27] found that, overall, stepwise regression outperformed CBR. However, in later work, Mendes [28] claimed that CBR gave better prediction accuracy. Briand et al. [21] concluded that using OLS is ‘probably sufficient’ (p. 385). Niessink and van Vliet [29] proposed that analogy offers an alternative to regression models when applied to a heterogeneous data set.
The studies outlined above provide clear evidence of a lack of synthesis and consistency when attempting to determine which is the better prediction method: regression or analogy. In order to address this, we attempt to compare the consistency of results and conclusions within and between these studies in which regression and analogy-based prediction techniques are used. The rationale for our methods is described in the following section.
Conclusions:
Over recent years there has been considerable activity by the empirical software engineering research community to investigate and compare competing software project cost prediction systems. In this paper we have examined two approaches that have received significant attention over the past ten years, namely regression and analogy-based techniques. A systematic review of the literature identified 20 empirical studies comparing the relative accuracy levels yielded by the two approaches. Unfortunately, these studies yield little clear evidence as to which technique should be preferred, with 45% offering some support for analogy, 35% for regression and 20% undecided.
The problem we need to address is: why are the results inconsistent? One might expect different results when models are generated from different data sets; however, in some cases results were inconsistent despite utilising the same data set and the same prediction techniques. For example, [33], [36] and [22] each used the Desharnais [41] data set, but found conflicting results. In [33], CBR outperformed LSR in 2 out of the 3 comparisons performed in terms of MMRE, whereas LSR outperformed analogy in both comparisons in [36] in terms of MMRE and Pred(25). And although CBR resulted in a lower MMRE in [22], Pred(25) was greater in the comparisons with both regression models. In [33] and [22], projects with missing values were excluded; in [36], missing data were replaced by ‘random samples from the other projects’ (p. 864). In [33] the holdout sample comprised 15% of the projects, in [36] it was 22%, whereas in [22] the entire data set was used to generate the regression models. Given the inconsistencies within and between these papers, we might conclude that variations in method are responsible.
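For reference, the two accuracy indicators used throughout the reviewed studies, MMRE and Pred(25), can be computed as in the sketch below. The example figures are invented and are chosen to illustrate how the two measures can rank the same pair of prediction systems differently, which is one source of the apparent inconsistency discussed above.

```python
import numpy as np

def mmre(actual, predicted):
    """Mean Magnitude of Relative Error: mean of |actual - predicted| / actual (lower is better)."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.mean(np.abs(actual - predicted) / actual))

def pred(actual, predicted, level=0.25):
    """Pred(25): proportion of predictions within 25% of actual effort (higher is better)."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.mean(np.abs(actual - predicted) / actual <= level))

# Invented figures for two hypothetical prediction systems on the same projects.
actual  = [100, 200, 300, 400, 500]
model_a = [105, 195, 310, 390, 2000]   # accurate on most projects, one gross overestimate
model_b = [130, 260, 390, 520, 650]    # consistently about 30% above actual

# Model A is preferred under Pred(25); model B is preferred under MMRE.
print(mmre(actual, model_a), pred(actual, model_a))
print(mmre(actual, model_b), pred(actual, model_b))
```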
The lack of standardisation in software engineering research methodology leads to heterogeneous sampling, measurement and reporting techniques. These problems are compounded by the impact of different contexts and of variants of the methods, and from our results there would seem to be some evidence of both. As commented previously, there are conflicting results from the same data set, and therefore presumably the same context. On the other hand, this is evident in a minority of results, and other researchers have reported conflicting results within the same study (e.g. [31, 36, 37]). The likely impact of context upon the comparative accuracy of prediction techniques has also been noted in the simulation work of Shepperd and Kadoda [36]. Indeed, it is hardly surprising that different data set characteristics will favour different techniques.
In order that our findings may be generalised, we need to consider the validity of our work. Because the validity of our study is contingent on the validity of the studies on which it is based, we firstly consider threats to validity within these papers. Studies such as these cannot be conducted blind, therefore researcher bias needs to be considered. For example, a research group might have greater expertise in a particular method, or they may have pioneered a particular technique. Either or both of these could manifest in a disproportionate amount of time being spent on a ‘pet technique’ [42] in comparison to others. Clearly, this can lead to inconsistent results when comparing results from seemingly similar studies derived from different research groups. One solution might be to investigate the background of each research group with the aim of identifying whether or not they are pioneering, or are simply in favour of, a particular technique. Alternatively, the paper could state where the research group’s interests lie. Furthermore, as protocols were rarely described, replicating these studies would be problematic in most cases. For the purposes of this study, we assumed the data were unbiased.
If this were not the case, the validity of our work might be affected. A further possible threat is that we included only papers published in English; however, we do not consider this a serious threat. More importantly, we made judgements as to what constituted regression and analogy-based techniques when each is in fact ‘a family’ of techniques. Therefore we grouped together and treated equally all regression techniques except log-linear, and ignored any transformations of skewed variables. For analogy-based techniques we ignored parameters such as the feature subset, adaptation, number of analogies and distance measures used. Had we used narrower definitions for each technique, our results would likely have been more consistent. However, we took the view that if a technique is so sensitive to minor variants in its deployment, the community needs to be aware of this factor.
The discriminant analysis failed to find a significant level of discrimination among the variables. Hence we conclude that factors other than those accounted for in the present study might contribute to the inconsistencies within and between results. We thus conclude that, using the available evidence and broad definitions of each technique, neither technique dominates in the sense of always being preferable. Therefore, as a starting point, researchers should ask questions such as ‘when might it be better to use technique A rather than B?’ as opposed to ‘is technique A better than B?’
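For readers unfamiliar with the procedure referred to above, the following sketch shows one way such a discriminant analysis could be set up. It is not the analysis performed in this study: the study-level variables, their codings and the values are invented purely for illustration.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical coding of study-level variables (here: data set size and holdout %)
# and outcomes (0 = regression favoured, 1 = analogy favoured, 2 = undecided).
# All values are invented; they are not the coded studies from this review.
X = np.array([[ 81, 15], [ 77, 22], [ 81, 10], [149, 10],
              [ 63, 30], [166, 25], [ 21, 20], [136, 33]])
y = np.array([1, 0, 2, 0, 1, 1, 2, 0])

# Fit the discriminant functions and report in-sample classification accuracy;
# a low score suggests the coded variables do not discriminate between outcomes.
lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.score(X, y))
```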