Abstract:
Software systems are often developed by organizations consisting of many teams of individuals working together. In The Mythical Man Month, Brooks states that product quality is strongly affected by organizational structure. Unfortunately, there has been little empirical evidence to date to substantiate this assertion.
In this paper we present a metric scheme that quantifies organizational complexity in relation to the product development process, and we investigate whether these metrics predict failure-proneness. In our case study, the organizational metrics, when applied to data from Windows Vista, were statistically significant predictors of failure-proneness. The precision and recall measures for identifying failure-prone binaries using the organizational metrics were significantly higher than those obtained using traditional metrics such as churn, complexity, coverage, dependencies, and pre-release bug measures that have been used to date to predict failure-proneness. Our results provide empirical evidence that the organizational metrics are related to, and are effective predictors of, failure-proneness.
Introduction:
Software engineering is a complex engineering activity: it involves interactions among people, processes, and tools to develop a complete product. In practice, commercial software development is performed by teams ranging in size from tens to thousands of individuals. Often these people work within an organizational structure, reporting to a manager or set of managers.
The intersection of people [9], processes [29], and organization [33] with the area of identifying problem-prone components early in the development process using software metrics (e.g. [12, 23, 27, 30]) has been studied extensively in recent years. Early indicators of software quality help software engineers and managers determine the reliability of the system, estimate and prioritize work items, focus on areas that require more testing and inspection, and, in general, identify “problem spots” so that unanticipated situations can be managed. Often such estimates are obtained from measures like code churn, code complexity, code coverage, and code dependencies. But these studies often ignore one of the most influential factors in software development: people and organizational structure. This gap serves as our main motivation to understand the intersection between organizational structure and software quality: How does organizational complexity influence quality? Can we identify measures of organizational structure? How well do they predict quality, e.g., do they identify problem components better than previously used metrics?
Conway’s Law states that “organizations that design systems are constrained to produce systems which are copies of the communication structures of these organizations” [8]. Similarly, Fred Brooks argues in The Mythical Man Month [6] that product quality is strongly affected by organizational structure. With the advent of global software development, where teams are distributed across the world, the implications of Conway’s law [14] for software quality are significant. To the best of our knowledge, there has been little or no empirical evidence regarding the relationship between organizational structure and direct measures of software quality such as failures.
In this paper we investigate this relationship between organizational structure and software quality by proposing a set of eight measures that quantify organizational complexity. These eight measures provide a balanced view of organizational complexity from the code viewpoint. The organizational metrics capture, in quantifiable form, issues such as the organizational distance between developers; the number of developers working on a component; the amount of multi-tasking developers are doing across organizations; and the amount of change to a component within the context of its organization. Using these measures we empirically evaluate the efficacy of the organizational metrics at identifying failure-prone binaries in Windows Vista.
The rest of the paper is organized as follows. Section 2 describes related work on organizational structure and on predicting defects/failures. Section 3 highlights our contribution, and Section 4 describes the organizational metric suite. Section 5 presents our case study and the results of our investigation of the relationship between organizational metrics and quality. Section 6 discusses the threats to validity, and Section 7 presents the conclusions and future work.
Software Organizational Studies:
From a historical perspective, Fred Brooks in his classic book The Mythical Man Month [6] provides an analogy in the chapter “Why did the (mythical) Tower of Babel Fail?” He observes that the people had (1) a clear mission; (2) manpower; (3) (raw) materials; (4) time; and (5) technology. The project failed because of communication, and its consequent organization [6]. Brooks further states that in software systems, schedule disasters, functional misfits, and system bugs arise from a lack of communication between different teams. Quoting Brooks [6]: “The purpose of organization is to reduce the amount of communication and coordination necessary; hence organization is a radical attack on the communication problems…”. In 1968 Conway [8] also observed from his study (organizations produce designs which are copies of their communication structures) that the flexibility of an organization is important to effective design [8]. He further argued that ways must be found to reward design managers for keeping their organizations lean and flexible, indicating the importance of organization for design quality [8]. In a similar vein, Parnas [32] indicated that a software module is “a responsibility assignment rather than a subprogram”, underscoring the importance of organizational structure in the software industry.
We summarize here recent work from the perspective of organizational structure, communication, and coordination. Herbsleb and Grinter [13] look at Conway’s law from the perspective of global software development. Their paper explores global software development in a team organizational context, based on teams working in Germany and the UK. Based on their empirical case study, they provide recommendations for the problems geographically distributed organizations face with respect to communication barriers and coordination mechanisms. They observed that the primary barriers to team coordination were lack of unplanned contact; knowing the right person to contact about specific issues; the cost of initiating contact; effective communication; and lack of trust. Further, Herbsleb and Mockus [15] formulate and evaluate an empirical theory of coordination for understanding engineering decisions from the viewpoint of coordination within software projects. Their paper is one of the closest in scale, size, and motivation to our study, though our study focuses on predicting quality using the organizational metrics (with the underlying relationship between organizational structure and coordination). Mockus et al. [22] also investigate how different individuals across geographical boundaries contribute to open source projects (Apache and Mozilla). Perry et al. [33] discuss and motivate the need to consider the larger development picture, which encompasses organizational and social as well as technological factors. They discuss quantitatively measuring people factors and report on the results of two experiments: one a self-reported diary of developer activities, the other an observational study of developer activities. These two experiments were also used to assess the efficacy of each technique for quantifying people factors.
Software Metrics and Faults/Failures:
Code Churn: Graves et al. [12] predict fault incidences using software change history based on a weighted time damp model using the sum of contributions from all changes to a module, where large and/or recent changes contribute the most to fault potential [12]. Ostrand et al. [31] use information of file status such as new, changed, unchanged files along with other explanatory variables such as lines of code, age, prior faults etc. as predictors in a negative binomial regression equation to successfully predict (high accuracy for faults found in both early and later stages of development) the number of faults in a multiple release software system. Nagappan and Ball [25] in a prior study on Windows Server 2003 showed the use of relative code churn measures (relative churn measures are normalized values of the various measures obtained during the evolution of the system) to predict defect density at strong statistically significant levels. Zimmermann et al. [37] mined source code repositories of eight large scale open source systems (IBM Eclipse, Postgres, KOffice, gcc, Gimp, JBoss, JEdit and Python) to predict where future changes will take place in these systems. The top three recommendations made by their system identified a correct location for future change with an accuracy of 70%.
Code Complexity: Khoshgoftaar et al. [18] studied two consecutive releases of a large legacy system (containing over 38,000 procedures in 171 modules) for telecommunications. Discriminant analysis identified fault-prone modules based on 16 static software product metrics. Their model, when used on the second release, showed type I and type II misclassification rates of 21.7% and 19.1%, respectively, and an overall misclassification rate of 21.0%. From the O-O (object-oriented) perspective, the CK metric suite [7] consists of six metrics (designed primarily as object-oriented design measures): weighted methods per class (WMC), coupling between objects (CBO), depth of inheritance (DIT), number of children (NOC), response for a class (RFC), and lack of cohesion among methods (LCOM). The CK metrics have also been investigated in the context of fault-proneness. Basili et al. [1] studied the fault-proneness of software programs using eight student projects. They observed that WMC, CBO, DIT, NOC, and RFC were correlated with defects, while LCOM was not. Further, Briand et al. [5] performed an industrial case study and observed CBO, RFC, and LCOM to be associated with the fault-proneness of a class. Within five Microsoft projects, Nagappan et al. [27] identified complexity metrics that predict post-release failures and reported how to systematically build predictors for post-release failures from history.
Code Dependencies: Podgurski and Clarke [34] presented a formal model of program dependencies as the relationship between two pieces of code inferred from the program text. Schröter et al. [35] showed that import dependencies can predict defects. They proposed an alternate way of predicting failures for Java classes: rather than looking at the complexity of a class, they looked exclusively at the components that a class uses. For Eclipse, the open source IDE, they found that using compiler packages results in a significantly higher failure-proneness (71%) than using GUI packages (14%). Prior work at Microsoft [24] on the Windows Server 2003 system illustrates that code dependencies can be used to successfully identify failure-prone binaries, with precision and recall values of around 73% and 75%, respectively.
Code Coverage: Hutchins et al. [16] evaluated the all-edges and all-uses coverage criteria in an experiment with 130 fault-seeded versions of seven programs and observed that test sets achieving coverage levels over 90% usually showed significantly better fault detection than randomly chosen test sets of the same size. In addition, significant improvements in the effectiveness of coverage-based tests usually occurred as coverage increased from 90% to 100%. Frankl and Weiss [11] evaluated all-edges and all-uses coverage using nine subject programs. Error-exposing ability was shown to be positively and strongly correlated with the percentage of covered definition-use associations in four of the nine subjects. Error-exposing ability was also shown to be positively correlated with the percentage of covered edges in four (different) subjects, but the relationship was weaker.
Combination of metrics: Denaro et al. [10] calculated 38 different software metrics (lines of code, Halstead software metrics, nesting levels, cyclomatic complexity, knots, number of comparison operators, loops, etc.) for the open source Apache 1.3 and Apache 2.0 projects. Using logistic regression models built from the Apache 1.3 data, they verified the models against the Apache 2.0 project with high correctness/completeness. Khoshgoftaar et al. [19] use code churn as a measure of software quality in a program of 225,000 lines of assembly language. Using eight complexity measures, including code churn, they found neural networks and multiple regression to be efficient predictors of software quality, as measured by gross change in the code. Nagappan et al. [26] used code churn, code complexity, and code coverage measures to predict post-release field failures in Windows Server 2003 using logistic regression models built with Windows XP data. The models identified failure-prone binaries with a statistically significant, positive, and strong correlation between actual and estimated failures.
Pre-release bugs: Biyani and Santhanam [4] show for four industrial systems at IBM there is a very strong relationship between development defects per module and field defects per module. This allows building of prediction models based on development defects to identify field defects.
Organization Metrics:
Number of Engineers (NOE): This is the absolute number of unique engineers who have touched a binary and are still employed by the company.
Implication: The more people who touch the code, the higher the chance of defective code, as there is a greater need for coordination among the engineers [6]. Brooks [6] states that if N engineers touch a piece of code, there are (N*(N-1))/2 theoretical communication paths for the N engineers to communicate among themselves.
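As an illustrative sketch (not the paper's actual tooling), NOE and Brooks's communication-path count can be computed from a hypothetical edit log; the record format and engineer names below are assumptions for illustration only.

```python
# Sketch: NOE from a hypothetical per-binary edit log, plus Brooks's
# theoretical communication paths, (N*(N-1))/2.

def number_of_engineers(edit_log):
    """NOE: unique engineers who have touched the binary."""
    return len({entry["engineer"] for entry in edit_log})

def communication_paths(n):
    """Theoretical pairwise communication paths among n engineers."""
    return n * (n - 1) // 2

# Hypothetical edit log: one record per check-in to the binary's sources.
edits = [
    {"engineer": "alice"}, {"engineer": "bob"},
    {"engineer": "carol"}, {"engineer": "alice"},
]
noe = number_of_engineers(edits)   # 3 unique engineers
paths = communication_paths(noe)   # 3 pairwise paths among them
```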
Number of Ex-Engineers (NOEE): This is the total number of unique engineers who have touched a binary and have left the company as of the release date of the software system (in our case A.dll).
Implications: This measure deals with knowledge transfer. If the employee(s) who worked on a piece of code leaves the company then there is a likelihood that the new person taking over might not be familiar with the design rationale, the reasoning behind certain bug fixes, and information about other stake holders in the code.
Edit Frequency (EF): This is the total number of times the source code that makes up the binary was edited. An edit is when an engineer checks code out of the version control system, alters it, and checks it back in again. This is independent of the number of lines of code altered during the edit.
Implications: This measure serves two purposes. First, if a binary has too many edits, it can indicate a lack of stability/control in the code from the perspectives of reliability, performance, etc., even if a small number of engineers were making the majority of the edits. Second, it provides a more complete view of the distribution of the edits: did a single engineer make the majority of the edits, or were they widely distributed among the engineers? The EF cross-balances with NOE and NOEE to make sure that a few engineers making all the edits do not inflate our measurements and ultimately affect our prediction model. Also, if the engineers who made most of the edits have left the company (NOEE), this can lead to the knowledge-transfer issues discussed above.
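The interplay of NOE, NOEE, and EF can be sketched over a toy version-control history; the record schema, engineer names, and departure set below are hypothetical, not Microsoft's actual data format.

```python
# Sketch under assumptions: the version-control history of a binary is a list
# of engineer names, one per check-in; `departed` is the set of engineers who
# have left the company as of the release date.
from collections import Counter

def org_activity_metrics(edit_log, departed):
    """Compute NOE, NOEE and EF plus the per-engineer edit distribution."""
    per_engineer = Counter(edit_log)
    engineers = set(per_engineer)
    return {
        "NOE": len(engineers - departed),    # unique engineers still employed
        "NOEE": len(engineers & departed),   # unique engineers who have left
        "EF": sum(per_engineer.values()),    # total edits, size-independent
        "distribution": dict(per_engineer),  # who made how many edits
    }

m = org_activity_metrics(["alice", "alice", "bob", "dave", "alice"],
                         departed={"dave"})
# EF is 5 but alice dominates the distribution, which is why EF is
# cross-balanced against NOE and NOEE rather than used alone.
```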
Depth of Master Ownership (DMO): This metric determines the level of ownership of the binary depending on the number of edits done. The organization level of the person whose reporting engineers perform more than 75% of the rolled up edits is deemed as the DMO. The DMO metric determines the binary owner based on activity on that binary. Our choice of 75% is based on prior historical information on Windows to quantify ownership.
Implications: The deeper in the tree the ownership lies, the more focused the activities, communication, and responsibility. A deeper level of ownership indicates less diffusion of activities and a single point of approval/control, which should improve intellectual control. If a binary does not have a clear owner (or has a very low DMO at which 75% of the edits roll up), there could be issues with decision-making when performing a risky bug fix, a lack of engineers to follow up if there is an issue, difficulty understanding intersecting code dependencies, etc. A management owner who has not made a large number of edits (i.e., is not familiar with the code) may not be able to make the above decisions without affecting code quality.
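The DMO roll-up can be sketched over a toy reporting tree: walk down from the master owner for as long as a single report's subtree still accounts for at least 75% of all edits. The tree shape, names, and edit counts below are hypothetical, for illustration only.

```python
# Minimal sketch of the DMO roll-up, assuming a reporting tree given as
# manager -> list of direct reports, and per-engineer edit counts.

def rolled_up_edits(manager, reports, edits):
    """Total edits by everyone in manager's subtree (including the manager)."""
    total = edits.get(manager, 0)
    for r in reports.get(manager, []):
        total += rolled_up_edits(r, reports, edits)
    return total

def depth_of_master_ownership(root, reports, edits, threshold=0.75):
    """Deepest manager whose subtree covers >= threshold of all edits."""
    grand_total = rolled_up_edits(root, reports, edits)
    owner, depth = root, 0
    while True:
        candidates = [r for r in reports.get(owner, [])
                      if rolled_up_edits(r, reports, edits)
                      >= threshold * grand_total]
        if not candidates:
            return owner, depth
        owner, depth = candidates[0], depth + 1

reports = {"vp": ["mgr_a", "mgr_b"], "mgr_a": ["dev1", "dev2"]}
edits = {"dev1": 70, "dev2": 20, "mgr_b": 10}
# mgr_a's subtree holds 90 of 100 edits (>= 75%), but no single report of
# mgr_a does, so mgr_a at depth 1 is the DMO.
owner, depth = depth_of_master_ownership("vp", reports, edits)
```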
Percentage of Org contributing to development (PO): The ratio of the number of people reporting into the DMO-level owner relative to the size of the master owner's organization.
Implications: The lower the percentage, the more local the ownership and contributions to the binary, leading to lower coordination/communication overhead across organizations, improved synchronization among individuals, better intellectual control, and a single point of contact. This metric minimizes the impact of an unbalanced organization, whereby the DMO may be two levels deep but 90% of the total organization reports into that DMO.
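A hedged sketch of the PO ratio, assuming a toy reporting tree with a hypothetical DMO ("mgr_a") and master owner ("vp"); all names are illustrative.

```python
# Sketch: PO = size of the DMO owner's organization relative to the
# master owner's organization, over a hypothetical reporting tree.

def org_size(manager, reports):
    """Number of people reporting (transitively) into manager."""
    return sum(1 + org_size(r, reports) for r in reports.get(manager, []))

reports = {"vp": ["mgr_a", "mgr_b"],
           "mgr_a": ["dev1", "dev2"],
           "mgr_b": ["dev3"]}
po = org_size("mgr_a", reports) / org_size("vp", reports)
# 2 people report into the DMO out of 5 in the master org: PO = 0.4
```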
Level of Organizational Code Ownership (OCO): The percentage of edits made by the organization that contains the binary owner or, if there is no owner, by the organization that made the majority of the edits to that binary.
Implications: The more the development contributions belong to a single organization, the more the contributors share a common culture, focus, and social cohesion. The more diverse the contributors to the code, the higher the chances of defective code, e.g., synchronization issues, mismatches, build breaks. If a binary has a defined owner, this measure identifies whether the remaining edits to the binary were performed by people in the same organization (common culture). The measure is particularly important when a binary does not have a defined owner, as it indicates how much control any single organization has over the binary. Also, if there is a large PO value because several of the engineers have worked on the binary only a few times, the OCO measure counter-balances that by taking into account the development activity in terms of edits. Example: the ratio is 200/(200+40+10), where 200 is the highest proportion of edits, made in the org reporting to AB, and the total of 200+40+10 is taken across all three orgs.
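The worked example above can be reproduced as a small sketch; "AB" is the owning org from the text, while the other org names are placeholders we introduce for illustration.

```python
# Sketch of OCO for the text's worked example: the owning org (reporting to
# AB) made 200 of the 250 total edits across three contributing orgs.

def org_code_ownership(edits_by_org, owning_org):
    """Fraction of all edits made by the owning organization."""
    return edits_by_org[owning_org] / sum(edits_by_org.values())

# "AB" comes from the text; "org_x" and "org_y" are hypothetical names.
edits_by_org = {"AB": 200, "org_x": 40, "org_y": 10}
oco = org_code_ownership(edits_by_org, "AB")  # 200 / (200+40+10) = 0.8
```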
Overall Organization Ownership (OOW): This is the ratio of the number of people at the DMO level making edits to a binary relative to the total number of engineers editing the binary. A high value is good.
Implications: As with the previous ownership measures, the more the activities belong to a single organization, the more the contributors share a common culture, focus, and social cohesion. Furthermore, the bigger the organizational distance, the greater the chance of miscommunication and misunderstanding of goals, focus, etc. This measure counter-balances OCO and PO to account for a common phenomenon in large teams: “super” engineers. These engineers have considerable experience in the code base and contribute a substantial amount of code to the system. We do not want one or a few such engineers to dominate our measures, nor do we want them to be ignored. PO, OCO, and OOW account for this interrelationship.
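A minimal sketch of the OOW ratio, assuming we already know which editing engineers sit in the DMO owner's organization; all names below are hypothetical.

```python
# Sketch: OOW = editors who belong to the DMO-level organization, as a
# fraction of all engineers who edited the binary.

def overall_org_ownership(dmo_org_members, all_editors):
    """Fraction of the binary's editors who are in the DMO organization."""
    return len(dmo_org_members & all_editors) / len(all_editors)

all_editors = {"alice", "bob", "carol", "dave"}
dmo_org = {"alice", "bob", "carol", "erin"}  # erin did not edit this binary
oow = overall_org_ownership(dmo_org, all_editors)  # 3 of 4 editors: 0.75
```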
Organization Intersection Factor (OIF): A measure of the number of different organizations that contribute greater than 10% of edits, as measured at the level of the overall org owners.
Implications: The greater the OIF, the more diffused the contributions to a binary, implying a lack of strong ownership by one particular org. This measure is particularly important when a binary has no owner, as it identifies how diffused the ownership is across the total organization.
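OIF can be sketched as a simple count over per-organization edit totals; the org names and edit counts below are hypothetical.

```python
# Sketch: OIF counts organizations contributing more than 10% of a
# binary's edits.

def org_intersection_factor(edits_by_org, threshold=0.10):
    """Number of orgs whose edit share exceeds the threshold."""
    total = sum(edits_by_org.values())
    return sum(1 for e in edits_by_org.values() if e > threshold * total)

oif = org_intersection_factor(
    {"org_a": 50, "org_b": 30, "org_c": 15, "org_d": 5})
# org_a, org_b and org_c each exceed 10% of the 100 edits, so OIF = 3,
# indicating diffused contributions from three organizations.
```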
Conclusion & Future Work:
In this paper we have reported on our empirical investigation of organizational metrics from the perspective of software quality. We defined a set of organizational measures that quantify the complexity of a software development organization and used them to study the effect that an organizational structure has on software quality. More generally, it is beneficial to obtain early estimates of software quality (e.g., failure-proneness) to help inform decisions on testing, code inspections, and design rework, as well as the financial costs associated with a delayed release. Our organizational measures predict failure-proneness in Windows Vista with significant precision, recall, and sensitivity. Our study also compares prediction models built using organizational metrics against traditional code churn, code complexity, code coverage, code dependency, and pre-release defect measures, showing that organizational metrics are better predictors of failure-proneness than the traditional metrics used so far.