Abstract:
Managing large-scale software projects involves a number of activities such as viewpoint extraction, feature detection, and requirements management, all of which require a human analyst to perform the arduous task of organizing requirements into meaningful topics and themes. Automating these tasks through the use of data mining techniques such as clustering could potentially increase both the efficiency of performing the tasks and the reliability of the results. Unfortunately, the unique characteristics of this domain present difficult challenges for standard clustering algorithms. These include high-dimensional, sparse, noisy data sets resulting from short and ambiguous expressions of need, as well as the need for the interactive engagement of stakeholders at various stages of the process.
In this paper, we propose a semi-supervised clustering framework, based on a combination of consensus-based and constrained clustering techniques, which can effectively handle these challenges. Specifically, we provide a probabilistic analysis for informative constraint generation based on a co-association matrix, and utilize consensus clustering to combine multiple constrained partitions in order to generate high-quality, robust clusters. Our approach is validated through a series of experiments on six well-studied TREC data sets and on two sets of user requirements.
Introduction:
Software development projects include a number of human-intensive activities that can benefit significantly from automated support. For example, activities such as feature detection, requirements elicitation [7], and certain types of automated traceability [14] all rely upon a human analyst to organize an extensive set of requirements into meaningful topics and themes. This is illustrated in the requirements elicitation process, where stakeholders document their needs as short unstructured statements which must then be manually reviewed, analyzed, and classified. The challenge of performing these tasks in a large project can be daunting. However, data mining techniques such as clustering can be used to organize and manage stakeholders’ feature requests in order to increase efficiency, provide scalable software engineering processes, and improve reliability of the results.
Unfortunately, the unique characteristics of this domain, such as high-dimensional, sparse, noisy data sets resulting from short and ambiguous expressions of need, present difficult challenges for standard clustering algorithms. Furthermore, the context in which the clusters will be used dictates the need to create extremely fine-grained, high-quality clusters, sometimes containing as few as 10-20 requirements. Although high quality is clearly a goal of all clustering algorithms, it is especially important in the requirements domain because project stakeholders will directly interact with and scrutinize the generated clusters. Clusters that lack a clear and dominant theme, or that contain even a few misplaced requirements, may cause project stakeholders to lose trust in the automated approach and lead to poor adoption of related tools and processes.
In prior studies, we broadly investigated the use of standard clustering techniques such as K-means, agglomerative hierarchical clustering, bisecting, and probabilistic techniques to determine if any of these basic approaches could consistently return requirements clusters at the quality needed to support the proposed software engineering activities [15]. Each of the algorithms was evaluated against several requirements datasets, using both standard coupling and cohesion metrics, and also by comparing the generated clusters to known answer sets. We also conducted a subjective user analysis of the answer sets because this provided more insight into the strengths and weaknesses of the clustering process. For example, one of the smaller datasets we evaluated represented a set of 366 feature requests gathered from MS students describing their needs for an Amazon-like student-centric web portal. A subjective analysis found that the generated clusters included very few highly cohesive ones and that almost all clusters contained misfits. Furthermore, there were a significant number of clusters containing no obviously dominant theme. As a result of this extensive study, which included multiple algorithms and data sets, we concluded that fully automated single-technique clustering algorithms do not appear to produce sufficiently high-quality results to adequately support the targeted software engineering tasks.
In this paper, we propose a semi-supervised clustering framework, based on a combination of consensus-based and constrained clustering techniques, which can effectively handle the challenges described above. The new approach takes advantage of the high levels of interactive user feedback expected in the requirements elicitation task to constrain future clusterings in an ensemble clustering framework. The quality of the initial baseline clustering in the ensemble is also significantly improved through a consensus-based approach. Each clustering is generated by selecting a sub-sample of needs and then using the generated clusters to classify remaining needs. This ensemble is then used to identify a set of constraints that maximize the benefits obtained from the costly constraint collection process. The framework is tested against six TREC datasets, which have been used in related work [23], and two sets of feature requests. All of these are discussed in greater detail later in the paper.
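The sub-sampling and ensemble steps described above can be illustrated with a minimal sketch. It assumes plain K-means as the base clusterer, nearest-centroid assignment for classifying the needs left out of each sub-sample, and a simple heuristic that treats pairs whose co-association value is closest to 0.5 as the most informative candidates for user feedback. These choices, along with all function and parameter names, are illustrative assumptions and are not the probabilistic constraint analysis or the implementation used in this paper.

```python
import numpy as np

def kmeans(X, k, rng, n_iter=20):
    """Plain Lloyd's K-means (assumed base clusterer); returns the centroids."""
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign(X, centroids)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids

def assign(X, centroids):
    """Label every row of X with the index of its nearest centroid."""
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def co_association(X, k, n_runs=30, sample_frac=0.6, seed=0):
    """Fraction of ensemble runs in which each pair of items shares a cluster."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(X)
    m = max(k, int(sample_frac * n))
    co = np.zeros((n, n))
    for _ in range(n_runs):
        # Cluster a random sub-sample, then classify all items against it.
        sample = rng.choice(n, size=m, replace=False)
        centroids = kmeans(X[sample], k, rng)
        labels = assign(X, centroids)
        co += labels[:, None] == labels[None, :]
    return co / n_runs

def most_uncertain_pairs(co, n_pairs):
    """Hypothetical heuristic: pairs whose co-association is nearest 0.5."""
    i, j = np.triu_indices_from(co, k=1)
    order = np.argsort(np.abs(co[i, j] - 0.5), kind="stable")[:n_pairs]
    return [(int(i[t]), int(j[t])) for t in order]
```

In this sketch, a call such as `co = co_association(X, k=5)` followed by `most_uncertain_pairs(co, 10)` would return ten index pairs on which the ensemble disagrees most. A user could then label each pair as belonging together or apart, and those labels would serve as constraints for subsequent clusterings.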
The remainder of the paper is laid out as follows. Sections 2 and 3 provide a background discussion of constrained and consensus clustering, both of which are adopted in our proposed framework. Section 4 then introduces our consensus-based constrained clustering framework, and Section 5 reports on a series of experiments we conducted to validate it within the requirements domain. Section 6 concludes with an overall analysis of the results. Notations that are used throughout the remainder of this paper are defined in Table 1.
Conclusion and Future Work:
This paper has described a new framework for clustering high-dimensional datasets such as requirements documents. The framework adopts a hybrid model that combines consensus and constrained clustering techniques, and selects the constraints that are expected to maximize supervisory potential for improving cluster quality. The reported experimental results demonstrated the effectiveness of this approach, especially for clustering short documents into finely grained partitions. These characteristics closely match those of the targeted requirements domain, and in fact the clustering results were especially promising for the SUGAR and STUDENT datasets. In future work we intend to build a far more extensive set of requirements-related datasets and corresponding answer sets, so that we can further assess and fine-tune the usefulness of our framework.
The work in this paper was primarily motivated by our research in automating and scaling up components of the requirements process, and our subsequent observations that rudimentary clustering techniques did not produce sufficiently cohesive clusters to support our intended tasks. The clustering improvements obtained through use of the framework described in this paper have significantly mitigated this problem, to the extent that they are expected to support future research and tool development that will move us towards higher levels of automation in the requirements engineering domain.