Software Metrics

Within the software engineering community, there is much confusion and inconsistency over the use of the terms metric and measure. In this report, a metric is defined to be the mathematical definition, algorithm, or function used to obtain a quantitative assessment of a product or process. The actual numerical value produced by a metric is a measure. Thus, for example, cyclomatic complexity is a metric, but the value of this metric is the cyclomatic complexity measure.

Data on individual errors can be used to calculate metrics values. Two general classes of metrics include the following:

Management metrics, which assist in the control or management of the development process; and

Quality metrics, which are predictors or indicators of product qualities.

Management metrics can be used for controlling any industrial production or manufacturing activity. They are used to assess resources, cost, and task completion. Examples of resource-related metrics include elapsed calendar time, effort, and machine usage. Typical software metrics that estimate task completion include the percentage of modules coded or the percentage of statements tested. Other management metrics used in project control include defect-related metrics. Information on the nature and origin of defects is used to estimate the costs associated with defect discovery and removal. Defect rates for the current project can be compared to those of past projects to ensure that the current project is behaving as expected.

Quality metrics are used to estimate characteristics or qualities of a software product. Examples of these metrics include complexity metrics, and readability indexes for software documents. The use of these metrics for quality assessment is based on the assumptions that the metric measures some inherent property of the software, and that the inherent property itself influences the behavioral characteristics of the final product.

Some metrics may be both management metrics and quality metrics, i.e., they can be used for both project control and quality assessment. These metrics include simple size metrics (e.g., lines of code, number of function points) and primitive problem, fault, or error metrics. For example, size is used to predict project effort and time scales, but it can also be used as a quality predictor, since larger projects may be more complex and difficult to understand, and thus more error-prone.

A disadvantage of some metrics is that they lack an interpretation scale that allows for consistent interpretation, such as that used when measuring temperature (in degrees Celsius) or length (in meters). This is particularly true of metrics for software quality characteristics (e.g., maintainability, reliability, usability). Measures must be interpreted relatively, through comparison with plans and expectations, comparison with similar past projects, or comparison with similar components within the current project. While some metrics are mathematically based, most, including reliability models, have not been proven.

Since there is virtually an infinite number of possible metrics, users must have some criteria for choosing which metrics to apply to their particular projects. Ideally, a metric should possess all of the following characteristics:

  • -- Simple - definition and use of the metric is simple
  • -- Objective - different people will give identical values; allows for consistency, and prevents individual bias
  • -- Easily collected - the cost and effort to obtain the measure is reasonable
  • -- Robust - metric is insensitive to irrelevant changes; allows for useful comparison
  • -- Valid - metric measures what it is supposed to; this promotes trustworthiness of the measure

Within the software engineering community, two philosophies on measurement are embodied by two major standards organizations. A draft standard on software quality metrics sponsored by the Institute of Electrical and Electronics Engineers (IEEE) Software Engineering Standards Subcommittee supports the single-value concept: a single numerical value can be computed to indicate the quality of the software, and the number is computed by measuring and combining the measures for attributes related to several quality characteristics. The international community, represented by the ISO/IEC organization through its Joint Technical Committee, Subcommittee 7 for software engineering, appears to be adopting the view that a range of values, rather than a single number, is more appropriate for representing overall quality.


Metrics Throughout the Lifecycle

Metrics enable the estimation of work required in each phase, in terms of the budget and schedule. They also allow for the percentage of work completed to be assessed at any point during the phase, and establish criteria for determining the completion of the phase.

The general approach to using metrics, which is applicable to each lifecycle phase, is as follows:

  • -- Select the appropriate metrics to be used to assess activities and outputs in each phase of the lifecycle.
  • -- Determine the goals or expected values of the metrics.
  • -- Determine or compute the measures, or actual values.
  • -- Compare the actual values with the expected values or goals.
  • -- Devise a plan to correct any observed deviations from the expected values.

Some complications may arise when applying this approach to software. First, there are often many possible causes for deviations from expectations, and for each cause there may be several different types of corrective action. Therefore, the actual cause must be determined from among the possible causes before the appropriate corrective action can be taken. In addition, the expected values themselves may be inappropriate when no accurate models are available to estimate them.

In addition to monitoring using expected values derived from other projects, metrics can also identify anomalous components that are unusual with respect to the values of other components in the same project. In this case, project monitoring is based on internally generated project norms, rather than estimates from other projects.

The metrics described in the following subsections comprise a representative sample of management and quality metrics that can be used in the lifecycle phases to support error analysis. This section does not evaluate or compare metrics, but provides definitions to help readers decide which metrics may be useful for a particular application.



Metrics Used in All Phases

Primitive metrics such as those listed below can be collected throughout the lifecycle. These metrics can be plotted using bar graphs, histograms, and Pareto charts as part of statistical process control. The plots can be analyzed by management to identify the phases that are most error prone, to suggest steps to prevent the recurrence of similar errors, to suggest procedures for earlier detection of faults, and to make general improvements to the development process.

Problem Metrics

Primitive problem metrics.

Number of problem reports per phase, priority, category, or cause

Number of reported problems per time period

Number of open real problems per time period

Number of closed real problems per time period

Number of unevaluated problem reports

Age of open real problem reports

Age of unevaluated problem reports

Age of real closed problem reports

Time when errors are discovered

Rate of error discovery

Cost and Effort Metrics

Primitive cost and effort metrics.

Time spent

Elapsed time

Staff hours

Staff months

Staff years

Change Metrics

Primitive change metrics.

Number of revisions, additions, deletions, or modifications

Number of requests to change the requirements specification and/or design during lifecycle phases after the requirements phase

Fault Metrics

Primitive fault metrics. These metrics assess the efficiency and effectiveness of fault resolution/removal activities, and check that sufficient effort is available for fault resolution/removal.

Number of unresolved faults at planned end of phase

Number of faults that, although fully diagnosed, have not been corrected, and number of outstanding change requests

Number of requirements and design faults detected during reviews and walkthroughs


Requirements Metrics

The main reasons to measure requirements specifications are to provide early warnings of quality problems, to enable more accurate project predictions, and to help improve the specifications.

Primitive size metrics. These metrics involve a simple count. Large components are assumed to have a larger number of residual errors, and are more difficult to understand than small components; as a result, their reliability and extendibility may be affected.

Number of pages or words

Number of requirements

Number of functions

Requirements traceability. This metric is used to assess the degree of traceability by measuring the percentage of requirements that have been implemented in the design. It is also used to identify requirements that are either missing from, or in addition to, the original requirements. The measure is computed using the equation RT = R1/R2 x 100%, where R1 is the number of requirements met by the architecture (design), and R2 is the number of original requirements.

Completeness (CM). Used to determine the completeness of the software specification during the requirements phase. This metric uses eighteen primitives (e.g., number of functions not satisfactorily defined, number of functions, number of defined functions, number of defined functions not used, number of referenced functions, and number of decision points). It then uses ten derivatives (e.g., functions satisfactorily defined, data references having an origin, defined functions used, reference functions defined), which are derived from the primitives.

The metric is the weighted sum of the ten derivatives, expressed as CM = Σ wi Di, where the summation is from i=1 to i=10, each weight wi has a value between 0 and 1, the sum of the weights is 1, and each Di is a derivative with a value between 0 and 1. The values of the primitives also can be used to identify problem areas within the requirements specification.
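As an illustration, the following sketch (in Python) shows how CM would be computed once the ten derivatives have been evaluated. The weights and derivative values are hypothetical.

# Illustrative sketch: computing the completeness metric CM as a weighted sum
# of the ten derivative values. The weights and derivative values below are
# hypothetical; in practice they come from analysis of the specification.

weights = [0.15, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.05]  # must sum to 1
derivatives = [0.9, 0.8, 1.0, 0.7, 0.95, 0.85, 0.75, 1.0, 0.9, 0.6]     # each Di in [0, 1]

assert abs(sum(weights) - 1.0) < 1e-9  # the weights are required to sum to 1

CM = sum(w * d for w, d in zip(weights, derivatives))
print(f"Completeness CM = {CM:.3f}")  # values closer to 1 indicate a more complete specification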

Fault-days number. Specifies the number of days that faults remain in the software product, from their creation to their removal. This measure uses two primitives: the phase, date, or time that the fault was introduced, and the phase, date, or time that the fault was removed. The fault-days for the ith fault, FDi, is the number of days from the creation of the fault to its removal. The measure is calculated as the sum over all faults: FD = Σ FDi.

This measure is an indicator of the quality of the software design and development process. A high value may be indicative of untimely removal of faults and/or existence of many faults, due to an ineffective development process.
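The following sketch illustrates the calculation, using hypothetical fault records that hold only the creation and removal dates of each fault.

# Illustrative sketch: computing the fault-days number (FD) from hypothetical
# fault records. FD is the sum of the individual fault-day counts FDi.

from datetime import date

faults = [  # (date introduced, date removed) -- hypothetical data
    (date(2023, 1, 10), date(2023, 2, 1)),
    (date(2023, 1, 20), date(2023, 1, 25)),
    (date(2023, 3, 5),  date(2023, 4, 12)),
]

fault_days = [(removed - introduced).days for introduced, removed in faults]
FD = sum(fault_days)

print("Fault-days per fault:", fault_days)   # [22, 5, 38]
print("Fault-days number FD =", FD)          # 65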


Function points. This measure was originated by Allan Albrecht at IBM in the late 1970s, and was further developed by Charles Symons. It uses a weighted sum of the number of inputs, outputs, master files, and inquiries in a product to predict development size [ALBRECHT]. To count function points, the first step is to classify each component by using standard guides to rate each component as having low, average, or high complexity. The second step is to tabulate function component counts. This is done by entering the appropriate counts in the Function Counting Form, multiplying by the weights on the form, and summing the totals for each component type to obtain the Unadjusted Function Point Count. The third step is to rate each application characteristic from 0 to 5 using a rating guide, and then adding all the ratings together to obtain the Characteristic Influence Rating. Finally, the number of function points is calculated using the equation

FP = Unadjusted Function Point Count * (0.65 + 0.01 * Characteristic Influence Rating)
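The sketch below illustrates the calculation. The component counts, complexity weights, and characteristic ratings shown are hypothetical; the actual weights and rating guides are defined by the standard function point counting forms.

# Illustrative sketch of the function point calculation described above,
# using hypothetical counts, weights, and ratings.

# Unadjusted count: number of each component type, already multiplied by its
# complexity weight from the counting form (illustrative values).
weighted_counts = {
    "inputs": 4 * 4,        # 4 average-complexity inputs, weight 4
    "outputs": 3 * 5,       # 3 average-complexity outputs, weight 5
    "inquiries": 2 * 4,     # 2 average-complexity inquiries, weight 4
    "master_files": 1 * 10, # 1 average-complexity master file, weight 10
}
unadjusted_fp = sum(weighted_counts.values())

# Characteristic Influence Rating: sum of the application characteristic
# ratings, each rated 0-5 (hypothetical ratings).
characteristic_ratings = [3, 2, 4, 1, 0, 5, 3, 2, 2, 1, 0, 3, 4, 2]
influence_rating = sum(characteristic_ratings)

FP = unadjusted_fp * (0.65 + 0.01 * influence_rating)
print(f"Unadjusted FP = {unadjusted_fp}, Influence rating = {influence_rating}, FP = {FP:.1f}")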


Design Metrics

The main reasons for computing metrics during the design phase are the following: they give an early indication of project status; enable selection among alternative designs; identify potential problems early in the lifecycle; limit complexity; and help in deciding how to modularize so that the resulting modules are both testable and maintainable. In general, good design practices involve high cohesion of modules, low coupling of modules, and effective modularity.

Size Metrics

Primitive size metrics. These metrics are used to estimate the size of the design or design documents.

Number of pages or words

DLOC (lines of PDL)

Number of modules

Number of functions

Number of inputs and outputs

Number of interfaces

(Estimated) number of modules (NM). Provides a measure of product size, against which the completeness of subsequent module-based activities can be assessed. The estimate for the number of modules is given by NM = S/M, where S is the estimated size in LOC and M is the median module size found in similar projects. The estimate NM can be compared to the median number of modules for other projects.


Fault Metrics

Primitive fault metrics. These metrics identify potentially fault-prone modules as early as possible.

Number of faults associated with each module

Number of requirements faults and structural design faults detected during detailed design


Complexity Metrics

Primitive complexity metrics. These metrics identify, early in development, modules which are potentially complex or hard to test. [ROOK]

Number of parameters per module

Number of states or data partitions per parameter

Number of branches in each module

Coupling. Coupling is the manner and degree of interdependence between software modules. Module coupling is rated based on the type of coupling, using a standard rating chart, which can be found in [SQE]. According to the chart, data coupling is the best type of coupling, and content coupling is the worst. The better the coupling, the lower the rating.

Cohesion. Cohesion is the degree to which the tasks performed within a single software module are related to the module's purpose. The module cohesion value for a module is assigned using a standard rating chart. According to the chart, the best cohesion level is functional, and the worst is coincidental, with the better levels having lower values. Case studies have shown that fault rate correlates highly with cohesion strength.

(Structural) fan-in / fan-out. Fan-in/fan-out represents the number of modules that call/are called by a given module. Identifies whether the system decomposition is adequate (e.g., no modules which cause bottlenecks, no missing levels in the hierarchical decomposition, no unused modules ("dead" code), identification of critical modules). May be useful to compute maximum, average, and total fan-in/fan-out.

Information flow metric (C). Represents the total number of combinations of an input source to an output destination, given by C = Ci * (fan-in * fan-out)^2, where Ci is a code metric, which may be omitted. The product inside the parentheses represents the total number of paths through a module.
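The sketch below computes the information flow metric for a few hypothetical modules; the module names, code metric values, and fan-in/fan-out counts are assumptions for illustration.

# Illustrative sketch: computing the information flow metric C for each module
# from hypothetical fan-in/fan-out counts. The optional code metric Ci (e.g., a
# size measure) is included here; set it to 1 to omit it.

modules = {  # module name: (Ci, fan_in, fan_out) -- hypothetical values
    "parse_input":   (120, 3, 2),
    "update_state":  (200, 5, 4),
    "report_writer": (80,  2, 1),
}

for name, (ci, fan_in, fan_out) in modules.items():
    c = ci * (fan_in * fan_out) ** 2
    print(f"{name}: C = {c}")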



Design Inspection Metrics

Staff hours per major defect detected. Used to evaluate the efficiency of the design inspection processes. The following primitives are used: time expended in preparation for inspection meeting (T1), time expended in conduct of inspection meeting (T2), number of major defects detected during the ith inspection (Si), and total number of inspections to date (I). The staff hours per major defect detected is given below, with the summations being from i=1 to i=I.

M = Σ(T1 + T2)i / ΣSi



This measure is applied to new code, and should fall between three and five. If there is significant deviation from this range, then the matter should be investigated. (May be adapted for code inspections). [IEEE982.2]

Defect Density (DD). Used after design inspections of new development or large block modifications in order to assess the inspection process. The following primitives are used: total number of unique defects detected during the ith inspection or ith lifecycle phase (Di), total number of inspections to date (I), and number of source lines of design statements in thousands (KSLOD). The measure is calculated by the ratio

DD = ΣDi / KSLOD, where the sum is from i=1 to i=I.

This measure can also be used in the implementation phase, in which case the number of source lines of executable code in thousands (KSLOC) should be substituted for KSLOD. [IEEE982.2]
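The sketch below illustrates both design inspection measures (staff hours per major defect and defect density), using hypothetical inspection records and a hypothetical design size.

# Illustrative sketch: computing the two design inspection measures above from
# hypothetical inspection records. Each record holds preparation time, meeting
# time (both in staff hours), and the number of unique major defects found.

inspections = [  # (T1 prep hours, T2 meeting hours, defects found) -- hypothetical
    (6.0, 2.0, 3),
    (4.0, 1.5, 2),
    (8.0, 2.5, 4),
]
KSLOD = 2.4  # thousands of source lines of design statements (hypothetical)

total_hours = sum(t1 + t2 for t1, t2, _ in inspections)
total_defects = sum(d for _, _, d in inspections)

M = total_hours / total_defects   # staff hours per major defect detected
DD = total_defects / KSLOD        # defect density per KSLOD

print(f"Staff hours per major defect M = {M:.2f}")     # ~2.67
print(f"Defect density DD = {DD:.2f} defects/KSLOD")   # ~3.75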


Test Related Metrics

Test related primitives. Checks that each module will be / has been adequately tested, or assesses the effectiveness of early testing activities. [ROOK]

Number of integration test cases planned/executed involving each module

Number of black box test cases planned/executed per module

Number of requirements faults detected during testing (also re-assesses quality of requirements specification)


5.3.1.4. Implementation Metrics

Metrics used during the implementation phase can be grouped into four basic types: size metrics, control structure metrics, data structure metrics, and other code metrics.

Size Metrics

Lines of Code (LOC). Although lines of code is one of the most popular metrics, it has no standard definition(6). The predominant definition for line of code is "any line of a program text that is not a comment or blank line, regardless of the number of statements or fragments of statements on the line." [SQE] It is an indication of size, which allows for estimation of effort, timescale, and total number of faults. For the same application, the length of a program partly depends on the language the code is written in, thus making comparison using LOC difficult. However, LOC can be a useful measure if the projects being compared are consistent in their development methods (e.g., use the same language, coding style). Because of its disadvantages, the use of LOC as a management metric (e.g., for project sizing beginning from the requirements phase) is controversial, but there are uses for this metric in error analysis, such as to estimate the values of other metrics. The advantages of this metric are that it is conceptually simple, easily automated, and inexpensive. [SQE]

Halstead software science metrics. This set of metrics was developed by Maurice Halstead, who claimed they could be used to evaluate the mental effort and time required to create a program, and how compactly a program is expressed. These metrics are based on four primitives listed below:

n1 = number of unique operators

n2 = number of unique operands

N1 = total occurrences of operators

N2 = total occurrences of operands

The program length measure, N, is the sum of N1 and N2. Other software science metrics are listed below. [SQE]

Vocabulary: n = n1+ n2

Predicted length: N^ = (n1 * log2 n1) + (n2 * log2 n2)

Program volume: V = N * log2 n

Effort: E = (n1 * N2 * N * log2 n) / (2 * n2)

Time: T = E/18, where 18 is the Stroud number (Halstead's assumed rate of elementary mental discriminations per second)

Predicted number of bugs: B = V/3000
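The following sketch computes these software science measures from the four primitives; the primitive counts are hypothetical, and in practice they are obtained by scanning the source code for operators and operands.

# Illustrative sketch: computing the Halstead software science measures from
# hypothetical primitive counts.

import math

n1, n2 = 15, 25    # unique operators, unique operands (hypothetical)
N1, N2 = 130, 170  # total operator and operand occurrences (hypothetical)

N = N1 + N2                                      # program length
n = n1 + n2                                      # vocabulary
N_hat = n1 * math.log2(n1) + n2 * math.log2(n2)  # predicted length
V = N * math.log2(n)                             # program volume
E = (n1 * N2 * N * math.log2(n)) / (2 * n2)      # effort
T = E / 18                                       # time (seconds), using the Stroud number 18
B = V / 3000                                     # predicted number of bugs

print(f"N={N}, n={n}, N^={N_hat:.1f}, V={V:.1f}, E={E:.1f}, T={T:.1f}s, B={B:.2f}")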



Control Structure Metrics

Number of entries/exits per module. Used to assess the complexity of a software architecture, by counting the number of entry and exit points for each module. The equation to determine the measure for the ith module is simply mi = ei + xi, where ei is the number of entry points for the ith module, and xi is the number of exit points for the ith module. [IEEE982.2]

Cyclomatic complexity (C). Used to determine the structural complexity of a coded module in order to limit its complexity, thus promoting understandability. In general, high complexity leads to a high number of defects and maintenance costs. Also used to identify minimum number of test paths to assure test coverage. The primitives for this measure include the number of nodes (N), and the number of edges (E), which can be determined from the graph representing the module. The measure can then be computed with the formula, C = E - N + 1. [IEEE982.2], [SQE]
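The sketch below counts the edges and nodes of a small, hypothetical control flow graph and applies the formula given above. (Note that some references define cyclomatic complexity as E - N + 2 for a single-entry, single-exit module.)

# Illustrative sketch: deriving cyclomatic complexity from a hypothetical
# control flow graph, represented as an adjacency list, using C = E - N + 1.

control_flow = {  # node: list of successor nodes -- hypothetical module graph
    "entry": ["decision"],
    "decision": ["then_branch", "else_branch"],
    "then_branch": ["join"],
    "else_branch": ["join"],
    "join": ["exit"],
    "exit": [],
}

num_nodes = len(control_flow)
num_edges = sum(len(succs) for succs in control_flow.values())

C = num_edges - num_nodes + 1
print(f"E = {num_edges}, N = {num_nodes}, cyclomatic complexity C = {C}")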



Data Structure Metrics

Amount of data. This measure can be determined by primitive metrics such as Halstead's n2 and N2, number of inputs/outputs, or the number of variables. These primitive metrics can be obtained from a compiler cross reference. [SQE]

Live variables. For each line in a section of code, determine the number of live variables (i.e., variables whose values could change during execution of that section of code). The average number of live variables per line of code is the sum of the number of live variables for each line, divided by the number of lines of code. [SQE]

Variable scope. The variable scope is the number of source statements between the first and last reference of the variable. For example, if variable A is first referenced on line 10, and last referenced on line 20, then the variable scope for A is 9. To determine the average variable scope for variables in a particular section of code, first determine the variable scope for each variable, sum up these values, and divide by the number of variables [SQE]. With large scopes, the understandability and readability of the code is reduced.

Variable spans. The variable span is the number of source statements between successive references of the variable. For each variable, the average span can be computed. For example, if the variable X is referenced on lines 13, 18, 20, 21, and 23, the average span would be the sum of all the spans divided by the number of spans, i.e., (4+1+0+1)/4 = 1.5. With large spans, it is more likely that a far back reference will be forgotten. [SQE]
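The following sketch computes the span and scope measures from a list of reference line numbers, reproducing the example above for variable X.

# Illustrative sketch: computing span and scope for a single variable from the
# lines on which it is referenced (data matches the example for variable X).

references = [13, 18, 20, 21, 23]  # lines on which variable X is referenced

# Span: number of source statements between successive references.
spans = [b - a - 1 for a, b in zip(references, references[1:])]
average_span = sum(spans) / len(spans)

# Scope: number of source statements between the first and last reference.
scope = references[-1] - references[0] - 1

print("Spans:", spans)               # [4, 1, 0, 1]
print("Average span:", average_span) # 1.5
print("Variable scope:", scope)      # 9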

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
(6) The SEI has made an effort to provide a complete definition for LOC. See [PARK].
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Test Metrics

Test metrics may be of two types: metrics related to test results or the quality of the product being tested; and metrics used to assess the effectiveness of the testing process.

PRODUCT METRICS
Defect/Error/Fault Metrics

Primitive defect/error/fault metrics. These metrics can be effectively used with SPC techniques, such as bar charts, and Pareto diagrams. These metrics can also be used to form percentages (e.g., percentage of logic errors = number of logic errors / total number of errors).

Number of faults detected in each module

Number of requirements, design, and coding faults found during unit and integration testing

Number of errors by type (e.g., number of logic errors, number of computational errors, number of interface errors, number of documentation errors)

Number of errors by cause or origin

Number of errors by severity (e.g., number of critical errors, number of major errors, number of cosmetic errors)

Fault density (FD). This measure is computed by dividing the number of faults by the size (usually in KLOC, thousands of lines of code). It may be weighted by severity using the equation

FDw = (W1 S/N + W2 A/N + W3 M/N) / Size

where

N = total number of faults

S = number of severe faults

A = number of average severity faults

M = number of minor faults

Wi = weighting factors (defaults are 10, 3, and 1)

FD can be used to perform the following: predict remaining faults by comparison with expected fault density; determine if sufficient testing has been completed based on predetermined goals; establish standard fault densities for comparison and prediction. [IEEE982.2], [SQE]
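The sketch below computes both the unweighted and the weighted fault density from hypothetical fault counts, using the default weights given above.

# Illustrative sketch: computing fault density and the severity-weighted
# variant FDw from hypothetical fault counts.

size_kloc = 12.0                    # size in KLOC (hypothetical)
severe, average, minor = 4, 10, 26  # fault counts by severity (hypothetical)

N = severe + average + minor
FD = N / size_kloc        # unweighted fault density (faults per KLOC)

W1, W2, W3 = 10, 3, 1     # default severity weights
FDw = (W1 * severe / N + W2 * average / N + W3 * minor / N) / size_kloc

print(f"Fault density FD = {FD:.2f} faults/KLOC")
print(f"Weighted fault density FDw = {FDw:.3f}")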

Defect age. Defect age is the time between when a defect is introduced and when it is detected or fixed. Assign the numbers 1 through 6 to each of the lifecycle phases from requirements to operation and maintenance. The defect age is then the difference of the numbers corresponding to the phase in which the defect was introduced and the phase in which it was detected. The average defect age = Σ(phase detected - phase introduced) / number of defects, the sum being taken over all defects. [SQE]

Defect response time. This measure is the time between when a defect is detected and when it is fixed or closed. [SQE]

Defect cost. The cost of a defect may be a sum of the cost to analyze the defect, the cost to fix it, and the cost of failures already incurred due to the defect. [SQE]

Defect removal efficiency (DRE). The DRE is the percentage of defects that have been removed during a process, computed with the equation:

DRE = (Number of defects removed / Number of defects at start of process) x 100%

The DRE can also be computed for each lifecycle phase and plotted on a bar graph to show the relative defect removal efficiencies for each phase. Or, the DRE may be computed for a specific process (e.g., design inspection, code walkthrough, unit test, six- month operation, etc.). [SQE]
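The following sketch computes DRE for several processes; the process names and defect counts are hypothetical.

# Illustrative sketch: computing defect removal efficiency (DRE) per process
# from hypothetical defect counts. For each process, "defects at start" are
# those present when that removal activity began.

processes = {  # process: (defects removed, defects at start) -- hypothetical
    "design inspection": (30, 40),
    "code walkthrough":  (25, 35),
    "unit test":         (18, 24),
}

for name, (removed, at_start) in processes.items():
    dre = removed / at_start * 100
    print(f"{name}: DRE = {dre:.0f}%")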



PROCESS METRICS

Test case metrics

Primitive test case metrics.

Total number of planned white/black box test cases run to completion

Number of planned integration tests run to completion

Number of unplanned test cases required during test phase



Coverage metrics(7)

Statement coverage. Measures the percentage of statements executed (to assure that each statement has been tested at least once). [SQE]

Branch coverage. Measures the percentage of branches executed. [SQE]

Path coverage. Measures the percentage of program paths executed. It is generally impractical and inefficient to test all the paths in a program. The count of the number of paths may be reduced by treating all possible loop iterations as one path. [SQE] Path coverage may be used to ensure 100 percent coverage of critical (safety or security related) paths.

Data flow coverage. Measures the definition and use of variables and data structures. [SQE]

Test coverage. Measures the completeness of the testing process. Test coverage is the percentage of requirements implemented (in the form of defined test cases or functional capabilities) multiplied by the percentage of the software structure (in units, segments, statements, branches, or path test results) tested. [AIRFORCE]



Failure metrics

Mean time to failure (MTTF). Gives an estimate of the mean time to the next failure, by accurately recording failure times ti, the elapsed time between the ith and the (i-1)st failures, and computing the average of all the failure times. This metric is the basic parameter required by most software reliability models. High values imply good reliability. [IEEE982.2]
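As a simple illustration, the sketch below averages a hypothetical set of observed inter-failure times.

# Illustrative sketch: estimating mean time to failure (MTTF) from hypothetical
# observed failure times ti (elapsed time, in hours, between successive failures).

inter_failure_times = [12.0, 20.5, 31.0, 44.5, 58.0]  # hours between failures (hypothetical)

MTTF = sum(inter_failure_times) / len(inter_failure_times)
print(f"Estimated MTTF = {MTTF:.1f} hours")  # higher values imply better reliability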

Failure rate. Used to indicate the growth in the software reliability as a function of test time, and is usually used with reliability models. This metric requires two primitives: ti, the observed time between failures for a given severity level i, and fi, the number of failures of a given severity level in the ith time interval. The failure rate λ(t) can be estimated from the reliability function R(t), which is obtained from the cumulative probability distribution F(t) of the time until the next failure, using a software reliability estimation model, such as the non-homogeneous Poisson process (NHPP) or a Bayesian type model. The failure rate is λ(t) = [-1/R(t)] [dR(t)/dt], where R(t) = 1 - F(t). [IEEE982.2]

Cumulative failure profile. Uses a graphical technique to predict reliability, to estimate additional testing time needed to reach an acceptable reliability level, and to identify modules and subsystems that require additional testing. This metric requires one primitive, fi, the total number of failures of a given severity level i in a given time interval. Cumulative failures are plotted on a time scale. The shape of the curve is used to project when testing will be complete, and to assess reliability. It can provide an indication of clustering of faults in modules, suggesting further testing for these modules. A non-asymptotic curve also indicates the need for continued testing. [IEEE982.2]

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
(7) Commercial tools are available for statement coverage, branch coverage, and path coverage, but only private tools exist for data flow coverage. [BEIZER] Coverage tools report the percentage of items covered and list what is not covered. [SQE]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

5.3.1.6. Installation and Checkout Metrics

Most of the metrics used during the test phase are also applicable during installation and checkout. The specific metrics used will depend on the type of testing performed. If acceptance testing is conducted, a requirements trace may be performed to determine what percentage of the requirements are satisfied in the product (i.e., number of requirements fulfilled divided by the total number of requirements).


5.3.1.7. Operation and Maintenance Metrics

Every metric that can be applied during software development may also be applied during maintenance. The purposes may differ somewhat. For example, requirements traceability may be used to ensure that maintenance requirements are related to predecessor requirements, and that the test process covers the same test areas as for the development. Metrics that were used during development may be used again during maintenance for comparison purposes (e.g., measuring the complexity of a module before and after modification). Elements of support, such as customer perceptions, training, hotlines, documentation, and user manuals, can also be measured.

Change Metrics

Primitive change metrics.

Number of changes

Cost/effort of changes

Time required for each change

LOC added, deleted, or modified

Number of fixes, or enhancements

Customer Related Metrics

Customer ratings. These metrics are based on results of customer surveys, which ask customers to provide a rating or a satisfaction score (e.g., on a scale of one to ten) of a vendor's product or customer services (e.g., hotlines, fixes, user manual). Ratings and scores can be tabulated and plotted in bar graphs.

Customer service metrics.

Number of hotline calls received

Number of fixes for each type of product

Number of hours required for fixes

Number of hours for training (for each type of product)