Software Metrics:
Within the software engineering community, there is much confusion and
inconsistency over the use of the terms metric and measure. In this report, a
metric is defined to be the mathematical definition, algorithm, or function
used to obtain a quantitative assessment of a product or process. The actual
numerical value produced by a metric is a measure. Thus, for example,
cyclomatic complexity is a metric, but the value of this metric is the
cyclomatic complexity measure.
Data on individual errors can be used to calculate metrics values. Two general
classes of metrics include the following:
Management metrics, which assist in the control or management of the development process; and
Quality metrics, which are predictors or indicators of the product qualities.
Management metrics can be used for controlling any industrial production or
manufacturing activity. They are used to assess resources, cost, and task
completion. Examples of resource-related metrics include elapsed calendar time,
effort, and machine usage. Typical metrics for software task
completion include percentage of modules coded and percentage of statements
tested. Other management metrics used in project control include defect-related
metrics. Information on the nature and origin of defects is used to estimate
costs associated with defect discovery and removal. Defect rates for the
current project can be compared to those of past projects to ensure that the
current project is behaving as expected.
Quality metrics are used to estimate characteristics or qualities of a software
product. Examples of these metrics include complexity metrics, and readability
indexes for software documents. The use of these metrics for quality assessment
is based on the assumptions that the metric measures some inherent property of
the software, and that the inherent property itself influences the behavioral
characteristics of the final product.
Some metrics may be both management metrics and quality metrics, i.e., they can
be used for both project control and quality assessment. These metrics include
simple size metrics (e.g., lines of code, number of function points) and
primitive problem, fault, or error metrics. For example, size is used to
predict project effort and time scales, but it can also be used as a quality
predictor, since larger projects may be more complex and difficult to
understand, and thus more error-prone.
A disadvantage of some metrics is that they do not have an interpretation
scale that allows for consistent interpretation, as with measuring
temperature (in degrees Celsius) or length (in meters). This is particularly
true of metrics for software quality characteristics (e.g., maintainability,
reliability, usability). Measures must be interpreted relatively, through comparison
with plans and expectations, comparison with similar past projects, or
comparison with similar components within the current project. While some
metrics are mathematically based, most, including reliability models, have not
been proven.
Since there is virtually an infinite number of possible metrics, users must
have some criteria for choosing which metrics to apply to their particular
projects. Ideally, a metric should possess all of the following
characteristics:
- Simple - definition and use of the metric are simple
- Objective - different people will give identical values; allows for consistency, and prevents individual bias
- Easily collected - the cost and effort to obtain the measure are reasonable
- Robust - metric is insensitive to irrelevant changes; allows for useful comparison
- Valid - metric measures what it is supposed to; this promotes trustworthiness of the measure
Within the software engineering community, two philosophies on measurement are embodied by two major standards organizations. A draft standard on software quality metrics sponsored by the Institute of Electrical and Electronics Engineers Software Engineering Standards Subcommittee supports the single value concept. This concept is that a single numerical value can be computed to indicate the quality of the software; the number is computed by measuring and combining the measures for attributes related to several quality characteristics. The international community, represented by the ISO/IEC organization through its Joint Technical Committee, Subcommittee 7 for software engineering, appears to be adopting the view that a range of values, rather than a single number, is more appropriate for representing overall quality.
Metrics Throughout the Lifecycle
Metrics enable the estimation of work required in each phase, in terms of
the budget and schedule. They also allow for the percentage of work completed
to be assessed at any point during the phase, and establish criteria for
determining the completion of the phase.
The general approach to using metrics, which is applicable to each lifecycle
phase, is as follows:
- Select the appropriate metrics to be used to assess activities and outputs in each phase of the lifecycle.
- Determine the goals or expected values of the metrics.
- Determine or compute the measures, or actual values.
- Compare the actual values with the expected values or goals.
- Devise a plan to correct any observed deviations from the expected values.
Some complications may be involved when applying this approach to software.
First, there will often be many possible causes for deviations from
expectations and for each cause there may be several different types of
corrective actions. Therefore, it must be determined which of the possible
causes is the actual cause before the appropriate corrective action can be
taken. In addition, the expected values themselves may be inappropriate when
no sufficiently accurate models are available to estimate them.
In addition to monitoring using expected values derived from other projects,
metrics can also identify anomalous components that are unusual with respect to
the values of other components in the same project. In this case, project monitoring
is based on internally generated project norms, rather than estimates from
other projects.
The metrics described in the following subsections comprise a representative
sample of management and quality metrics that can be used in the lifecycle
phases to support error analysis. This section does not evaluate or compare
metrics, but provides definitions to help readers decide which metrics may be
useful for a particular application.
Metrics Used in All Phases
Primitive metrics such as those listed below can be collected throughout the
lifecycle. These metrics can be plotted using bar graphs, histograms, and
Pareto charts as part of statistical process control. The plots can be analyzed
by management to identify the phases that are most error prone, to suggest
steps to prevent the recurrence of similar errors, to suggest procedures for
earlier detection of faults, and to make general improvements to the
development process.
Problem Metrics
Primitive problem metrics.
Number of problem reports per phase, priority, category, or cause
Number of reported problems per time period
Number of open real problems per time period
Number of closed real problems per time period
Number of unevaluated problem reports
Age of open real problem reports
Age of unevaluated problem reports
Age of real closed problem reports
Time when errors are discovered
Rate of error discovery
Cost and Effort Metrics
Primitive cost and effort metrics.
Time spent
Elapsed time
Staff hours
Staff months
Staff years
Change Metrics
Primitive change metrics.
Number of revisions, additions, deletions, or modifications
Number of requests to change the requirements specification and/or design during lifecycle phases after the requirements phase
Fault Metrics
Primitive fault metrics. These metrics assess the efficiency and effectiveness of fault resolution/removal activities, and check that sufficient effort is available for fault resolution/removal.
Number of unresolved faults at planned end of phase
Number of faults that, although fully diagnosed, have not been corrected, and number of outstanding change requests
Number of requirements and design faults detected during reviews and walkthroughs
Requirements Metrics
The main reasons to measure requirements specifications are to provide early warnings of quality problems, to enable more accurate project predictions, and to help improve the specifications.
Primitive size metrics. These metrics involve a simple count. Large components are assumed to have a larger number of residual errors, and are more difficult to understand than small components; as a result, their reliability and extendibility may be affected.
Number of pages or words
Number of requirements
Number of functions
Requirements traceability. This metric is used to assess the degree of traceability by measuring the percentage of requirements that have been implemented in the design. It is also used to identify requirements that are either missing from, or in addition to, the original requirements. The measure is computed using the equation RT = R1/R2 x 100%, where R1 is the number of requirements met by the architecture (design), and R2 is the number of original requirements.
Completeness (CM). Used to determine the completeness of the
software specification during the requirements phase. This metric uses eighteen primitives
(e.g., number of functions not satisfactorily defined, number of functions,
number of defined functions, number of defined functions not used, number of
referenced functions, and number of decision points). It then uses ten derivatives
(e.g., functions satisfactorily defined, data references having an origin,
defined functions used, reference functions defined), which are derived from
the primitives. The metric is the weighted sum of the derivatives,
$CM = \sum_{i=1}^{10} w_i D_i$,
where each weight $w_i$ has a value between 0 and 1, the sum of the
weights is 1, and each $D_i$ is a derivative
with a value between 0 and 1. The values of the primitives also can be used to
identify problem areas within the requirements specification.
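The weighted sum can be sketched in a few lines of Python. This is only an illustration: the function name `completeness`, the equal weights, and the ten derivative values are hypothetical, and the computation of the derivatives from the eighteen primitives is not shown.

```python
import math

def completeness(weights, derivatives):
    """Weighted sum CM = sum(w_i * D_i) over the derivatives."""
    if len(weights) != len(derivatives):
        raise ValueError("need exactly one weight per derivative")
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must sum to 1")
    if any(not 0.0 <= v <= 1.0 for v in list(weights) + list(derivatives)):
        raise ValueError("weights and derivatives must lie between 0 and 1")
    return sum(w * d for w, d in zip(weights, derivatives))

# Hypothetical data: equal weights and ten made-up derivative values.
weights = [0.1] * 10
derivatives = [1.0, 0.9, 0.8, 1.0, 0.7, 1.0, 0.95, 0.85, 1.0, 0.9]
cm = completeness(weights, derivatives)
print(f"CM = {cm:.2f}")
```

A value of CM close to 1 suggests a more complete specification; low individual derivatives point to the problem areas.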
Fault-days number.
Specifies the number of days that faults spend in the software product from their
creation to their removal. This measure uses two primitives: the phase, date,
or time that the fault was introduced, and the phase, date, or time that the
fault was removed. The fault days for the ith fault ($FD_i$) is the number of days from the creation of
the fault to its removal. The measure is calculated as follows: $FD = \sum_i FD_i$, the sum being over all faults.
This measure is an indicator of the quality of the software design and
development process. A high value may be indicative of untimely removal of faults
and/or existence of many faults, due to an ineffective development process.
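As an illustration, the fault-days number can be computed from a fault log that records the date each fault was introduced and removed. The sketch below assumes dates (rather than phases) are available; the function name `fault_days` and the sample log are hypothetical.

```python
from datetime import date

def fault_days(faults):
    """Sum the days each fault spent in the product (FD = sum of FD_i)."""
    return sum((removed - introduced).days for introduced, removed in faults)

# Hypothetical fault log: (date introduced, date removed).
faults = [
    (date(2024, 1, 10), date(2024, 1, 25)),  # 15 days
    (date(2024, 2, 1), date(2024, 3, 1)),    # 29 days (2024 is a leap year)
    (date(2024, 3, 5), date(2024, 3, 6)),    # 1 day
]
total = fault_days(faults)
print("Fault-days number:", total)
```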
Function points. This measure was originated by Allan Albrecht at
IBM in the late 1970's, and was further developed by Charles Symons. It uses a
weighted sum of the number of inputs, outputs, master files and inquiries in a
product to predict development size [ALBRECHT]. To count function points, the
first step is to classify each component by using standard guides to rate each
component as having low, average, or high complexity. The second basic step is
to tabulate function component counts. This is done by entering the appropriate
counts in the Function Counting Form, multiplying by the weights on the form,
and summing up the totals for each component type to obtain the Unadjusted
Function Point Count. The third step is to rate each application characteristic
from 0 to 5 using a rating guide, and then adding all the ratings together to
obtain the Characteristic Influence Rating. Finally, the number of function
points is calculated using the equation
FP = Unadjusted Function Point Count * (0.65 + 0.01 * Characteristic Influence Rating)
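The counting steps can be sketched as follows. The weights in `WEIGHTS` are the commonly published values for this method and are included here as an assumption; in practice they should be taken from the Function Counting Form in use. The component counts and the fourteen ratings are hypothetical sample data, as are the function names.

```python
# Assumed weights per component type and complexity level (take the real
# values from the counting form in use).
WEIGHTS = {
    "inputs":       {"low": 3, "average": 4, "high": 6},
    "outputs":      {"low": 4, "average": 5, "high": 7},
    "inquiries":    {"low": 3, "average": 4, "high": 6},
    "master_files": {"low": 7, "average": 10, "high": 15},
}

def unadjusted_fp(counts):
    """Multiply each component count by its weight and sum the totals."""
    return sum(WEIGHTS[kind][level] * n
               for kind, levels in counts.items()
               for level, n in levels.items())

def function_points(counts, ratings):
    """FP = unadjusted count * (0.65 + 0.01 * sum of the 0-5 ratings)."""
    if any(not 0 <= r <= 5 for r in ratings):
        raise ValueError("each characteristic rating must be between 0 and 5")
    return unadjusted_fp(counts) * (0.65 + 0.01 * sum(ratings))

# Hypothetical counts, already classified as low/average/high complexity.
counts = {
    "inputs": {"low": 5, "average": 2},
    "outputs": {"average": 4},
    "inquiries": {"low": 3},
    "master_files": {"average": 1, "high": 1},
}
ratings = [3] * 14  # assumes fourteen application characteristics
ufp = unadjusted_fp(counts)
fp = function_points(counts, ratings)
print(ufp, round(fp, 2))
```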
Design Metrics
The main reasons for computing metrics during the design phase are the
following: to give an early indication of project status; to enable selection of
alternative designs; to identify potential problems early in the lifecycle;
to limit complexity; and to help in deciding how to modularize so the resulting
modules are both testable and maintainable. In general, good design practices
involve high cohesion of modules, low coupling of modules, and effective
modularity.
Size Metrics
Primitive size metrics. These metrics are used to estimate the size of the design or design documents.
Number of pages or words
DLOC (lines of PDL)
Number of modules
Number of functions
Number of inputs and outputs
Number of interfaces
(Estimated) number of modules (NM). Provides a measure of product size, against which the completeness of subsequent module-based activities can be assessed. The estimate for the number of modules is given by NM = S/M, where S is the estimated size in LOC and M is the median module size found in similar projects. The estimate NM can be compared to the median number of modules for other projects.
Fault Metrics
Primitive fault metrics. These metrics identify potentially fault-prone modules as early as possible.
Number of faults associated with each module
Number of requirements faults and structural design faults detected during detailed design
Complexity Metrics
Primitive complexity metrics. These metrics identify, early in development, modules that are potentially complex or hard to test. [ROOK]
Number of parameters per module
Number of states or data partitions per parameter
Number of branches in each module
Coupling. Coupling is the manner and degree of interdependence between software modules. Module coupling is rated based on the type of coupling, using a standard rating chart, which can be found in [SQE]. According to the chart, data coupling is the best type of coupling, and content coupling is the worst. The better the coupling, the lower the rating.
Cohesion. Cohesion is the degree to which the tasks performed within a single software module are related to the module's purpose. The module cohesion value for a module is assigned using a standard rating chart, which can be found in. According to the chart, the best cohesion level is functional, and the worst is coincidental, with the better levels having lower values. Case studies have shown that fault rate correlates highly with cohesion strength.
(Structural) fan-in / fan-out. Fan-in/fan-out represents the number of modules that call/are called by a given module. Identifies whether the system decomposition is adequate (e.g., no modules which cause bottlenecks, no missing levels in the hierarchical decomposition, no unused modules ("dead" code), identification of critical modules). May be useful to compute maximum, average, and total fan-in/fan-out.
Information flow metric (C). Represents the total number of combinations of an input source to an output destination, given by $C = C_i \times (\text{fan-in} \times \text{fan-out})^2$, where $C_i$ is a code metric, which may be omitted. The product inside the parentheses represents the total number of paths through a module.
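A small sketch of structural fan-in/fan-out and the information flow metric, reading the formula as the code metric times the square of the product of fan-in and fan-out. The call graph, module names, and function names are hypothetical, and the code metric is omitted by defaulting it to 1.

```python
def fan_in_out(calls):
    """Return (fan_in, fan_out) dicts from a caller -> callees mapping."""
    fan_out = {module: len(set(callees)) for module, callees in calls.items()}
    fan_in = {module: 0 for module in calls}
    for callees in calls.values():
        for callee in set(callees):
            fan_in[callee] = fan_in.get(callee, 0) + 1
    return fan_in, fan_out

def information_flow(fan_in, fan_out, code_metric=1):
    """C = code_metric * (fan_in * fan_out) ** 2."""
    return code_metric * (fan_in * fan_out) ** 2

# Hypothetical call graph: each module maps to the modules it calls.
calls = {
    "main": ["parse", "report"],
    "parse": ["read"],
    "report": ["read", "format"],
    "read": [],
    "format": [],
}
fan_in, fan_out = fan_in_out(calls)
flow = {m: information_flow(fan_in[m], fan_out[m]) for m in calls}
print(flow)
```

Modules with a fan-in of zero other than the top-level module may be unused, and a module with an unusually large value of C is a candidate bottleneck.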
Design Inspection Metrics
Staff hours per major defect detected. Used to evaluate the efficiency of the design inspection processes. The following primitives are used: time expended in preparation for inspection meeting (T1), time expended in conduct of inspection meeting (T2), number of major defects detected during the ith inspection ($S_i$), and total number of inspections to date (I). The staff hours per major defect detected is given below, with the summations being from i=1 to i=I.
$M = \frac{\sum_{i=1}^{I} (T_1 + T_2)_i}{\sum_{i=1}^{I} S_i}$
This measure is applied to new code, and should fall between three and five. If
there is significant deviation from this range, then the matter should be
investigated. (May be adapted for code inspections). [IEEE982.2]
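For illustration, the measure can be computed from a list of inspection records and checked against the three-to-five range. The record layout (T1, T2, number of major defects), the function name, and the sample values are hypothetical.

```python
def staff_hours_per_major_defect(inspections):
    """M = sum of (T1 + T2) over all inspections / sum of major defects."""
    total_hours = sum(t1 + t2 for t1, t2, _ in inspections)
    total_defects = sum(s for _, _, s in inspections)
    if total_defects == 0:
        raise ValueError("no major defects detected; M is undefined")
    return total_hours / total_defects

# Hypothetical records: (preparation hours, meeting hours, major defects).
inspections = [(6.0, 4.0, 3), (5.0, 3.0, 2), (8.0, 6.0, 3)]
m = staff_hours_per_major_defect(inspections)
in_expected_range = 3.0 <= m <= 5.0
print(m, in_expected_range)
```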
Defect Density (DD).
Used after design inspections of new development or large block modifications
in order to assess the inspection process. The following primitives are used:
total number of unique defects detected during the ith inspection or ith
lifecycle phase ($D_i$), total number of inspections to date (I), and
number of source lines of design statements in thousands (KSLOD). The measure
is calculated by the ratio
$DD = \frac{\sum_{i=1}^{I} D_i}{KSLOD}$
This measure can also be used in the implementation phase, in which case the
number of source lines of executable code in thousands (KSLOC) should be
substituted for KSLOD. [IEEE982.2]
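A minimal sketch of the defect density ratio follows. The function name and the sample counts are hypothetical; the size argument is KSLOD during design or KSLOC during implementation.

```python
def defect_density(defects_per_inspection, size_in_thousands):
    """DD = total unique defects over all inspections / size in thousands."""
    if size_in_thousands <= 0:
        raise ValueError("size must be positive")
    return sum(defects_per_inspection) / size_in_thousands

# Hypothetical data: three inspections of an 8 KSLOD design.
dd = defect_density([12, 7, 5], 8.0)
print("Defects per KSLOD:", dd)
```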
Test Related Metrics
Test related primitives. Checks that each module will be / has been adequately tested, or assesses the effectiveness of early testing activities. [ROOK]
Number of integration test cases planned/executed involving each module
Number of black box test cases planned/executed per module
Number of requirements faults detected during testing (also re-assesses quality of requirements specification)
5.3.1.4. Implementation Metrics
Metrics used during the implementation phase can be
grouped into four basic types: size metrics, control structure metrics, data
structure metrics, and other code metrics.
Size Metrics
Lines of Code (LOC). Although lines of code is one of the most popular metrics, it has no standard definition(6). The predominant definition for line of code is "any line of a program text that is not a comment or blank line, regardless of the number of statements or fragments of statements on the line." [SQE] It is an indication of size, which allows for estimation of effort, timescale, and total number of faults. For the same application, the length of a program partly depends on the language the code is written in, thus making comparison using LOC difficult. However, LOC can be a useful measure if the projects being compared are consistent in their development methods (e.g., use the same language, coding style). Because of its disadvantages, the use of LOC as a management metric (e.g., for project sizing beginning from the requirements phase) is controversial, but there are uses for this metric in error analysis, such as to estimate the values of other metrics. The advantages of this metric are that it is conceptually simple, easily automated, and inexpensive. [SQE]
Halstead software science metrics. This set of metrics was developed by Maurice Halstead, who claimed they could be used to evaluate the mental effort and time required to create a program, and how compactly a program is expressed. These metrics are based on four primitives listed below:
n1 = number of unique operators
n2 = number of unique operands
N1 = total occurrences of operators
N2 = total occurrences of operands
The program length measure, N, is the sum of N1 and N2. Other software science metrics are listed below. [SQE]
Vocabulary: n = n1 + n2
Predicted length: N^ = (n1 * log2 n1) + (n2 * log2 n2)
Program volume: V = N * log2 n
Effort: E = (n1 * N2 * N * log2 n) / (2 * n2)
Time: T = E/S; Halstead used S = 18 (a constant unrelated to the bug count B below)
Predicted number of bugs: B = V/3000
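The software science measures can be computed directly from the four primitives. The sketch below assumes the operator and operand counts have already been obtained (for example, from a parser); the function name, the constant name, and the sample counts are hypothetical.

```python
import math

HALSTEAD_TIME_DIVISOR = 18  # constant used by Halstead in T = E / 18

def halstead(n1, n2, N1, N2):
    """Compute the software science measures from the four primitives."""
    length = N1 + N2                      # program length N
    vocabulary = n1 + n2                  # vocabulary n
    predicted_length = n1 * math.log2(n1) + n2 * math.log2(n2)
    volume = length * math.log2(vocabulary)
    effort = (n1 * N2 * length * math.log2(vocabulary)) / (2 * n2)
    return {
        "length": length,
        "vocabulary": vocabulary,
        "predicted_length": predicted_length,
        "volume": volume,
        "effort": effort,
        "time": effort / HALSTEAD_TIME_DIVISOR,
        "bugs": volume / 3000,
    }

# Hypothetical counts for a small module.
measures = halstead(n1=8, n2=8, N1=30, N2=20)
for name, value in measures.items():
    print(f"{name}: {value:.2f}")
```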
Control Structure Metrics
Number of entries/exits per module. Used to assess the complexity of a software architecture, by counting the number of entry and exit points for each module. The equation to determine the measure for the ith module is simply mi = ei + xi, where ei is the number of entry points for the ith module, and xi is the number of exit points for the ith module. [IEEE982.2]
Cyclomatic complexity (C). Used to determine the structural complexity of a coded module in order to limit its complexity, thus promoting understandability. In general, high complexity leads to more defects and higher maintenance costs. Also used to identify the minimum number of test paths needed to assure test coverage. The primitives for this measure include the number of nodes (N) and the number of edges (E), which can be determined from the graph representing the module. The measure can then be computed with the formula C = E - N + 1. [IEEE982.2], [SQE] This form applies when the graph is closed by an edge from the exit node back to the entry node; for a flow graph without that added edge, the equivalent formula is C = E - N + 2.
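The computation can be sketched from an edge list. The sketch assumes that the formula C = E - N + 1 is applied to a graph closed by an edge from the exit node to the entry node, and that C = E - N + 2 is used otherwise; this edge-counting convention is an assumption, and the function name and sample graph are hypothetical.

```python
def cyclomatic_complexity(edges, closed=True):
    """Compute C from a list of (source, destination) edges.

    closed=True: the graph already includes an exit -> entry edge,
    so C = E - N + 1. closed=False: plain flow graph, C = E - N + 2.
    """
    nodes = {node for edge in edges for node in edge}
    return len(edges) - len(nodes) + (1 if closed else 2)

# Hypothetical module containing a single if/else decision.
flow_graph = [
    ("entry", "cond"),
    ("cond", "then"),
    ("cond", "else"),
    ("then", "exit"),
    ("else", "exit"),
]
c_open = cyclomatic_complexity(flow_graph, closed=False)
c_closed = cyclomatic_complexity(flow_graph + [("exit", "entry")], closed=True)
print(c_open, c_closed)
```

Both conventions give the same value for the same module, which here corresponds to the two test paths through the decision.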
Data Structure Metrics
Amount of data. This measure can be determined by primitive metrics such as Halstead's n2 and N2, number of inputs/outputs, or the number of variables. These primitive metrics can be obtained from a compiler cross reference. [SQE]
Live variables. For each line in a section of code, determine the number of live variables (i.e., variables whose values could change during execution of that section of code). The average number of live variables per line of code is the sum of the number of live variables for each line, divided by the number of lines of code. [SQE]
Variable scope. The variable scope is the number of source statements between the first and last reference of the variable. For example, if variable A is first referenced on line 10, and last referenced on line 20, then the variable scope for A is 9. To determine the average variable scope for variables in a particular section of code, first determine the variable scope for each variable, sum up these values, and divide by the number of variables [SQE]. With large scopes, the understandability and readability of the code are reduced.
Variable spans. The variable span is the number of source statements between successive references of the variable. For each variable, the average span can be computed. For example, if the variable X is referenced on lines 13, 18, 20, 21, and 23, the average span would be the sum of all the spans divided by the number of spans, i.e., (4+1+0+1)/4 = 1.5. With large spans, it is more likely that a far back reference will be forgotten. [SQE]
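The scope and span examples can be reproduced with a short sketch. It assumes one statement per line, so that the number of statements between two references is the difference of the line numbers minus one; the function names are hypothetical.

```python
def variable_scope(ref_lines):
    """Statements strictly between the first and last reference."""
    return max(ref_lines) - min(ref_lines) - 1

def average_span(ref_lines):
    """Average number of statements between successive references."""
    lines = sorted(ref_lines)
    spans = [later - earlier - 1 for earlier, later in zip(lines, lines[1:])]
    if not spans:
        raise ValueError("need at least two references to form a span")
    return sum(spans) / len(spans)

scope_a = variable_scope([10, 20])              # variable A in the text
avg_span_x = average_span([13, 18, 20, 21, 23])  # variable X in the text
print(scope_a, avg_span_x)
```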
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
(6) The SEI has made an effort to provide a
complete definition for LOC. See [PARK].
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Test Metrics
Test metrics may be of two types: metrics related
to test results or the quality of the product being tested; and metrics used to
assess the effectiveness of the testing process.
PRODUCT METRICS
Defect/Error/Fault Metrics
Primitive defect/error/fault metrics. These metrics can be effectively used with SPC techniques, such as bar charts, and Pareto diagrams. These metrics can also be used to form percentages (e.g., percentage of logic errors = number of logic errors / total number of errors).
Number of faults detected in each module
Number of requirements, design, and coding faults found during unit and integration testing
Number of errors by type (e.g., number of logic errors, number of computational errors, number of interface errors, number of documentation errors)
Number of errors by cause or origin
Number of errors by severity (e.g., number of critical errors, number of major errors, number of cosmetic errors)
Fault density (FD). This measure is computed by dividing the number of faults by the size (usually in KLOC, thousands of lines of code). It may be weighted by severity using the equation
FDw = (W1 S/N + W2 A/N + W3 M/N) / Size
where
N = total number of faults
S = number of severe faults
A = number of average severity faults
M = number of minor faults
Wi = weighting factors (defaults are 10, 3, and 1)
FD can be used to perform the following: predict remaining faults by comparison with expected fault density; determine if sufficient testing has been completed based on predetermined goals; establish standard fault densities for comparison and prediction. [IEEE982.2], [SQE]
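A sketch of the plain and severity-weighted fault density follows, using the default weights of 10, 3, and 1. The function names and the fault counts are hypothetical, and size is taken in KLOC.

```python
def fault_density(total_faults, size_kloc):
    """FD = number of faults / size."""
    return total_faults / size_kloc

def weighted_fault_density(severe, average, minor, size_kloc,
                           weights=(10, 3, 1)):
    """FDw = (W1*S/N + W2*A/N + W3*M/N) / Size."""
    n = severe + average + minor
    if n == 0:
        return 0.0
    w1, w2, w3 = weights
    return (w1 * severe / n + w2 * average / n + w3 * minor / n) / size_kloc

# Hypothetical data: 20 faults found in a 10 KLOC product.
fd = fault_density(20, 10.0)
fdw = weighted_fault_density(severe=2, average=5, minor=13, size_kloc=10.0)
print(fd, round(fdw, 2))
```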
Defect age. Defect age is the time from when a defect
is introduced to when it is detected or fixed. Assign the numbers 1 through 6
to each of the lifecycle phases from requirements to operation and maintenance.
The defect age is then the difference of the numbers corresponding to the phase
introduced and phase detected. The average defect age is
$\frac{\sum (\text{phase detected} - \text{phase introduced})}{\text{number of defects}}$,
the sum being over all the defects. [SQE]
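The phase numbering and the average can be sketched as follows. The six phase names follow the lifecycle phases used in this section; the function name and the defect records are hypothetical.

```python
PHASES = [
    "requirements",
    "design",
    "implementation",
    "test",
    "installation and checkout",
    "operation and maintenance",
]
# Number the phases 1 through 6.
PHASE_NUMBER = {name: number for number, name in enumerate(PHASES, start=1)}

def average_defect_age(defects):
    """Mean of (phase detected - phase introduced) over all defects."""
    ages = [PHASE_NUMBER[detected] - PHASE_NUMBER[introduced]
            for introduced, detected in defects]
    return sum(ages) / len(ages)

# Hypothetical defect records: (phase introduced, phase detected).
defects = [
    ("requirements", "test"),       # age 3
    ("design", "design"),           # age 0
    ("implementation", "test"),     # age 1
    ("requirements", "design"),     # age 1
]
avg_age = average_defect_age(defects)
print("Average defect age:", avg_age)
```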
Defect response time. This measure is the time from when a defect is detected to when it is fixed or closed. [SQE]
Defect cost. The cost of a defect may be a sum of the cost to analyze the defect, the cost to fix it, and the cost of failures already incurred due to the defect. [SQE]
Defect removal efficiency (DRE). The DRE is the percentage of defects that have been removed during a process, computed with the equation:
DRE = (Number of defects removed / Number of defects at start of process) x 100%
The DRE can also be computed for each lifecycle phase and plotted on a bar graph to show the relative defect removal efficiencies for each phase. Or, the DRE may be computed for a specific process (e.g., design inspection, code walkthrough, unit test, six-month operation, etc.). [SQE]
PROCESS METRICS
Test case metrics
Primitive test case metrics.
Total number of planned white/black box test cases run to completion
Number of planned integration tests run to completion
Number of unplanned test cases required during test phase
Coverage metrics(7)
Statement coverage. Measures the percentage of statements executed (to assure that each statement has been tested at least once). [SQE]
Branch coverage. Measures the percentage of branches executed. [SQE]
Path coverage. Measures the percentage of program paths executed. It is generally impractical and inefficient to test all the paths in a program. The count of the number of paths may be reduced by treating all possible loop iterations as one path. [SQE] Path coverage may be used to ensure 100 percent coverage of critical (safety or security related) paths.
Data flow coverage. Measures the definition and use of variables and data structures. [SQE]
Test coverage. Measures the completeness of the testing process. Test coverage is the percentage of requirements implemented (in the form of defined test cases or functional capabilities) multiplied by the percentage of the software structure (in units, segments, statements, branches, or path test results) tested. [AIRFORCE]
Failure metrics
Mean time to failure (MTTF). Gives an estimate of the mean time to the next failure, by accurately recording failure times ti, the elapsed time between the ith and the (i-1)st failures, and computing the average of all the failure times. This metric is the basic parameter required by most software reliability models. High values imply good reliability. [IEEE982.2]
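As an illustration, the mean time to failure can be computed from recorded failure times. The sketch assumes the log holds cumulative times measured from the start of observation at time zero, from which the interfailure times are derived; the function name and the sample log are hypothetical.

```python
def mean_time_to_failure(failure_times):
    """Average the elapsed times between successive failures.

    failure_times: ascending cumulative times of each failure,
    measured from the start of observation (time 0).
    """
    if not failure_times:
        raise ValueError("at least one failure time is required")
    previous = [0.0] + list(failure_times[:-1])
    intervals = [t - p for p, t in zip(previous, failure_times)]
    return sum(intervals) / len(intervals)

# Hypothetical log: failures observed at 12, 30, 55, and 100 hours.
mttf = mean_time_to_failure([12.0, 30.0, 55.0, 100.0])
print("MTTF (hours):", mttf)
```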
Failure rate. Used to indicate the growth in the software reliability as a function of test time and is usually used with reliability models. This metric requires two primitives: ti, the observed time between failures for a given severity level i, and fi, the number of failures of a given severity level in the ith time interval. The failure rate $\lambda(t)$ can be estimated from the reliability function R(t), which is obtained from the cumulative probability distribution F(t) of the time until the next failure, using a software reliability estimation model, such as the non-homogeneous Poisson process (NHPP) or Bayesian type model. The failure rate is $\lambda(t) = -\frac{1}{R(t)} \frac{dR(t)}{dt}$, where R(t) = 1 - F(t). [IEEE982.2]
Cumulative failure profile. Uses a graphical technique to predict reliability, to estimate additional testing time needed to reach an acceptable reliability level, and to identify modules and subsystems that require additional testing. This metric requires one primitive, fi, the total number of failures of a given severity level i in a given time interval. Cumulative failures are plotted on a time scale. The shape of the curve is used to project when testing will be complete, and to assess reliability. It can provide an indication of clustering of faults in modules, suggesting further testing for these modules. A non-asymptotic curve also indicates the need for continued testing. [IEEE982.2]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
(7) Commercial tools are available for statement
coverage, branch coverage, and path coverage, but only private tools exist for
data flow coverage. [BEIZER] Coverage tools report the percentage of items
covered and list what is not covered. [SQE]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
5.3.1.6. Installation and Checkout Metrics
Most of the metrics used during the test phase are also applicable during installation and checkout. The specific metrics used will depend on the type of testing performed. If acceptance testing is conducted, a requirements trace may be performed to determine what percentage of the requirements are satisfied in the product (i.e., number of requirements fulfilled divided by the total number of requirements).
5.3.1.7. Operation and Maintenance Metrics
Every metric that can be applied during software
development may also be applied during maintenance. The purposes may differ
somewhat. For example, requirements traceability may be used to ensure that
maintenance requirements are related to predecessor requirements, and that the
test process covers the same test areas as for the development. Metrics that
were used during development may be used again during maintenance for
comparison purposes (e.g., measuring the complexity of a module before and
after modification). Elements of support, such as customer perceptions,
training, hotlines, documentation, and user manuals, can also be measured.
Change Metrics
Primitive change metrics.
Number of changes
Cost/effort of changes
Time required for each change
LOC added, deleted, or modified
Number of fixes or enhancements
Customer Related Metrics
Customer ratings. These metrics are based on results of customer surveys, which ask customers to provide a rating or a satisfaction score (e.g., on a scale of one to ten) of a vendor's product or customer services (e.g., hotlines, fixes, user manual). Ratings and scores can be tabulated and plotted in bar graphs.
Customer service metrics.
Number of hotline calls received
Number of fixes for each type of product
Number of hours required for fixes
Number of hours for training (for each type of product)