Software Testing and Analysis: Process, Principles, and Techniques
by Mauro Pezzè and Michal Young
John Wiley & Sons, 2008 (510 pages)
ISBN: 9780471455936
Providing students and professionals with strategies for reliable,
cost-effective software development, this guide covers a full
spectrum of topics from basic principles and underlying theory to
organizational and process issues in real-world application.
Table of Contents
Software Testing and Analysis-Process, Principles, and Techniques
Preface
Part I - Fundamentals of Test and Analysis
Chapter 1 - Software Test and Analysis in a Nutshell
Chapter 2 - A Framework for Test and Analysis
Chapter 3 - Basic Principles
Chapter 4 - Test and Analysis Activities Within a Software Process
Part II - Basic Techniques
Chapter 5 - Finite Models
Chapter 6 - Dependence and Data Flow Models
Chapter 7 - Symbolic Execution and Proof of Properties
Chapter 8 - Finite State Verification
Part III - Problems and Methods
Chapter 9 - Test Case Selection and Adequacy
Chapter 10 - Functional Testing
Chapter 11 - Combinatorial Testing
Chapter 12 - Structural Testing
Chapter 13 - Data Flow Testing
Chapter 14 - Model-Based Testing
Chapter 15 - Testing Object-Oriented Software
Chapter 16 - Fault-Based Testing
Chapter 17 - Test Execution
Chapter 18 - Inspection
Chapter 19 - Program Analysis
Part IV - Process
Chapter 20 - Planning and Monitoring the Process
Chapter 21 - Integration and Component-based Software Testing
Chapter 22 - System, Acceptance, and Regression Testing
Chapter 23 - Automating Analysis and Test
Chapter 24 - Documenting Analysis and Test
Bibliography
Index
List of Figures
List of Tables
List of Sidebars
Back Cover
You can't test quality into a software product, but neither can you
build a quality software product without test and analysis. Software
test and analysis is increasingly recognized, in research and in
industrial practice, as a core challenge in software engineering and
computer science. Software Testing and Analysis: Process,
Principles, and Techniques is the first book to present a range of
complementary software test and analysis techniques in an
integrated, coherent fashion. It covers a full spectrum of topics from
basic principles and underlying theory to organizational and process
issues in real-world application. The emphasis throughout is on
selecting a complementary set of practical techniques to achieve an
acceptable level of quality at an acceptable cost.
Highlights of the book include
Interplay among technical and non-technical issues in crafting
an approach to software quality, with chapters devoted to
planning and monitoring the software quality process.
A selection of practical techniques ranging from inspection to
automated program and design analyses to unit, integration,
system, and regression testing, with technical material set in
the context of real-world problems and constraints in software
development.
A coherent view of the state of the art and practice, with
technical and organizational approaches to push the state of
practice toward the state of the art.
Throughout, the text covers techniques that are suitable for near-term application, with sufficient technical background to help you
know how and when to apply them. Exercises reinforce the
instruction and ensure that you master each topic before
proceeding.
By incorporating software testing and analysis techniques into
modern practice, Software Testing and Analysis: Process, Principles,
Preface
Overview
This book addresses software test and analysis in the context of an
overall effort to achieve quality. It is designed for use as a primary
textbook for a course in software test and analysis or as a
supplementary text in a software engineering course, and as a
resource for software developers.
The main characteristics of this book are:
It assumes that the reader's goal is to achieve a suitable balance of cost, schedule,
and quality. It is not oriented toward critical systems for which ultra-high reliability
must be obtained regardless of cost, nor will it be helpful if one's aim is to cut cost or
schedule regardless of consequence.
It presents a selection of techniques suitable for near-term application, with sufficient
technical background to understand their domain of applicability and to consider
variations to suit technical and organizational constraints. Techniques of only
historical interest and techniques that are unlikely to be practical in the near future
are omitted.
It promotes a vision of software testing and analysis as integral to modern software
engineering practice, equally as important and technically demanding as other
aspects of development. This vision is generally consistent with current thinking on
the subject, and is approached by some leading organizations, but is not universal.
It treats software testing and static analysis techniques together in a coherent
framework, as complementary approaches for achieving adequate quality at
acceptable cost.
Chapter List
Chapter 1: Software Test and Analysis in a Nutshell
Chapter 2: A Framework for Test and Analysis
Chapter 3: Basic Principles
Chapter 4: Test and Analysis Activities Within a Software Process
specialized tests and analyses designed particularly for the case at hand.
Software is among the most variable and complex of artifacts engineered on a regular
basis. Quality requirements of software used in one environment may be quite different
and incompatible with quality requirements of a different environment or application
domain, and its structure evolves and often deteriorates as the software system grows.
Moreover, the inherent nonlinearity of software systems and uneven distribution of faults
complicate verification. If an elevator can safely carry a load of 1000 kg, it can also
safely carry any smaller load, but if a procedure correctly sorts a set of 256 elements, it
may fail on a set of 255 or 53 or 12 elements, as well as on 257 or 1023.
The cost of software verification often exceeds half the overall cost of software
development and maintenance. Advanced development technologies and powerful
supporting tools can reduce the frequency of some classes of errors, but we are far
from eliminating errors and producing fault-free software. In many cases new
development approaches introduce new subtle kinds of faults, which may be more
difficult to reveal and remove than classic faults. This is the case, for example, with
distributed software, which can present problems of deadlock or race conditions that are
not present in sequential programs. Likewise, object-oriented development introduces
new problems due to the use of polymorphism, dynamic binding, and private state that
are absent or less pronounced in procedural software.
The variety of problems and the richness of approaches make it challenging to choose
and schedule the right blend of techniques to reach the required level of quality within
cost constraints. There are no fixed recipes for attacking the problem of verifying a
software product. Even the most experienced specialists do not have pre-cooked
solutions, but need to design a solution that suits the problem, the requirements, and the
development environment.
If fewer than 90% of all statements are executed by the functional tests, this is taken as an indication
that either the interface specifications are incomplete (if the missing coverage
corresponds to visible differences in behavior), or else additional implementation
complexity hides behind the interface. Either way, additional test cases are devised
based on a more complete description of unit behavior.
Integration and system tests are generated by the quality team, working from a catalog
of patterns and corresponding tests. The behavior of some subsystems or components
is modeled as finite state machines, so the quality team creates test suites that exercise
program paths corresponding to each state transition in the models.
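The transition-coverage idea can be sketched concretely. The following Java fragment is only an illustration of the approach described above; the model, its states, and the class names are invented for the sketch and are not Chipmunk's.

    import java.util.*;

    // Minimal sketch: derive one test obligation per transition of a finite state model.
    // The model (states of a customer order) is hypothetical, used only for illustration.
    public class TransitionCoverage {
        record Transition(String from, String event, String to) {}

        public static void main(String[] args) {
            List<Transition> model = List.of(
                new Transition("Cart",     "checkout", "Payment"),
                new Transition("Payment",  "approve",  "Shipping"),
                new Transition("Payment",  "decline",  "Cart"),
                new Transition("Shipping", "deliver",  "Closed"));

            // One obligation per transition: reach 'from', apply 'event',
            // and check that the observable state corresponds to 'to'.
            for (Transition t : model) {
                System.out.printf("Test: in state %-8s apply %-8s expect %s%n",
                                  t.from(), t.event(), t.to());
            }
        }
    }

A test suite that discharges every obligation in this list exercises each modeled state transition at least once.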
Scaffolding and oracles for integration testing are part of the overall system architecture.
Oracles for individual components and units are designed and implemented by
programmers using tools for annotating code with conditions and invariants. The
Chipmunk developers use a home-grown test organizer tool to bind scaffolding to code,
schedule test runs, track faults, and organize and update regression test suites.
The quality plan includes analysis and test activities for several properties distinct from
functional correctness, including performance, usability, and security. Although these are
an integral part of the quality plan, their design and execution are delegated in part or
whole to experts who may reside elsewhere in the organization. For example, Chipmunk
maintains a small team of human factors experts in its software division. The human
factors team will produce look-and-feel guidelines for the Web purchasing system, which
together with a larger body of Chipmunk interface design rules can be checked during
inspection and test. The human factors team also produces and executes a usability
testing plan.
Parts of the portfolio of verification and validation activities selected by Chipmunk are
illustrated in Figure 1.1. The quality of the final product and the costs of the quality
assurance activities depend on the choice of the techniques to accomplish each activity.
Most important is to construct a coherent plan that can be monitored. In addition to
monitoring schedule progress against the plan, Chipmunk records faults found during
each activity, using this as an indicator of potential trouble spots. For example, if the
number of faults found in a component during design inspections is high, additional
dynamic test time will be planned for that component.
Figure 1.1: Main analysis and testing activities through the software life
cycle.
in its organization, it is fruitful to begin measuring reliability when debug testing is yielding
less than one fault ("bug") per day of tester time. For some application domains, Chipmunk
has gathered a large amount of historical usage data from which to define an operational
profile, and these profiles can be used to generate large, statistically valid sets of randomly
generated tests. If the sample thus tested is a valid model of actual executions, then
projecting actual reliability from the failure rate of test cases is elementary. Unfortunately, in
many cases such an operational profile is not available.
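As a rough sketch of profile-driven statistical testing (the operations, their weights, and the helper methods below are invented), test cases are drawn in proportion to expected usage and reliability is projected from the observed failure rate over that sample.

    import java.util.*;

    // Sketch of statistical testing from an operational profile.
    // The profile (operation -> relative frequency) is hypothetical.
    public class OperationalProfileSampling {
        public static void main(String[] args) {
            Map<String, Double> profile = Map.of(
                "browseCatalog",  0.70,
                "configureModel", 0.20,
                "placeOrder",     0.10);

            Random rnd = new Random(42);
            int failures = 0, runs = 10_000;
            for (int i = 0; i < runs; i++) {
                String op = sample(profile, rnd.nextDouble());
                if (!execute(op)) failures++;   // stand-in for running a generated test case
            }
            // The failure rate of a profile-weighted sample estimates in-use reliability.
            System.out.printf("estimated reliability = %.4f%n",
                              1.0 - (double) failures / runs);
        }

        static String sample(Map<String, Double> profile, double r) {
            double cumulative = 0.0;
            for (var e : profile.entrySet()) {
                cumulative += e.getValue();
                if (r <= cumulative) return e.getKey();
            }
            return profile.keySet().iterator().next();
        }

        // Placeholder for generating and executing one random test of the given operation.
        static boolean execute(String operation) { return true; }
    }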
Chipmunk has an idea of how the Web sales facility will be used, but it cannot construct and
validate a model with sufficient detail to obtain reliability estimates from a randomly
generated test suite. They decide, therefore, to use the second major approach to verifying
reliability, using a sample of real users. This is commonly known as alpha testing if the tests
are performed by users in a controlled environment, observed by the development
organization. If the tests consist of real users in their own environment, performing actual
tasks without interference or close monitoring, it is known as beta testing. The Chipmunk
team plans a very small alpha test, followed by a longer beta test period in which the
software is made available only in retail outlets. To accelerate reliability measurement after
subsequent revisions of the system, the beta test version will be extensively instrumented,
capturing many properties of a usage profile.
Summary
The quality process has three distinct goals: improving a software product (by
preventing, detecting, and removing faults), assessing the quality of the software
product (with respect to explicit quality goals), and improving the long-term quality and
cost-effectiveness of the quality process itself. Each goal requires weaving quality
assurance and improvement activities into an overall development process, from product
inception through deployment, evolution, and retirement.
Each organization must devise, evaluate, and refine an approach suited to that
organization and application domain. A well-designed approach will invariably combine
several test and analysis techniques, spread across stages of development. An array of
fault detection techniques are distributed across development stages so that faults are
removed as soon as possible. The overall cost and cost-effectiveness of techniques
depends to a large degree on the extent to which they can be incrementally re-applied
as the product evolves.
Further Reading
This book deals primarily with software analysis and testing to improve and assess the
dependability of software. That is not because qualities other than dependability are
unimportant, but rather because they require their own specialized approaches and
techniques. We offer here a few starting points for considering some other important
properties that interact with dependability. Norman's The Design of Everyday Things
[Nor90] is a classic introduction to design for usability, with basic principles that apply to
both hardware and software artifacts. A primary reference on usability for interactive
computer software, and particularly for Web applications, is Nielsen's Designing Web
Usability [Nie00]. Bishop's text Computer Security: Art and Science [Bis02] is a good
introduction to security issues. The most comprehensive introduction to software safety
is Leveson's Safeware [Lev95].
Exercises
1.1 Philip has studied "just-in-time" industrial production methods and is convinced that
they should be applied to every aspect of software development. He argues that
test case design should be performed just before the first opportunity to execute
the newly designed test cases, never earlier. What positive and negative
consequences do you foresee for this just-in-time test case design approach?

1.2 A newly hired project manager at Chipmunk questions why the quality manager is
involved in the feasibility study phase of the project, rather than joining the team
only when the project has been approved, as at the new manager's previous
company. What argument(s) might the quality manager offer in favor of her
involvement in the feasibility study?

1.3 Chipmunk procedures call for peer review not only of each source code module,
but also of test cases and scaffolding for testing that module. Anita argues that
inspecting test suites is a waste of time; any time spent on inspecting a test case
designed to detect a particular class of fault could more effectively be spent
inspecting the source code to detect that class of fault. Anita's project manager, on
the other hand, argues that inspecting test cases and scaffolding can be cost-effective
when considered over the whole lifetime of a software product. What
argument(s) might Anita's manager offer in favor of this conclusion?

1.4 The spiral model of software development prescribes sequencing incremental
prototyping phases for risk reduction, beginning with the most important project
risks. Architectural design for testability involves, in addition to defining testable
interface specifications for each major module, establishing a build order that
supports thorough testing after each stage of construction. How might spiral
development and design for test be complementary or in conflict?

1.5 You manage an online service that sells downloadable video recordings of classic
movies. A typical download takes one hour, and an interrupted download must be
restarted from the beginning. The number of customers engaged in a download at
any given time ranges from about 10 to about 150 during peak hours. On average,
your system goes down (dropping all connections) about two times per week, for
an average of three minutes each time. If you can double availability or double
mean time between failures, but not both, which will you choose? Why?

1.6 Having no a priori operational profile for reliability measurement, Chipmunk will
depend on alpha and beta testing to assess the readiness of its online purchase
functionality for public release. Beta testing will be carried out in retail outlets, by
retail store personnel, and then by customers with retail store personnel looking on.
How might this beta testing still be misleading with respect to reliability of the
software as it will be used at home and work by actual customers? What might
Chipmunk do to ameliorate potential problems from this reliability misestimation?
1.7 The junior test designers of Chipmunk Computers are annoyed by the procedures
for storing test cases together with scaffolding, test results, and related
documentation. They blame the extra effort needed to produce and store such data
for delays in test design and execution. They argue for reducing the data to store
to the minimum required for reexecuting test cases, eliminating details of test
documentation, and limiting test results to the information needed for generating
oracles. What argument(s) might the quality manager use to convince the junior
test designers of the usefulness of storing all this information?
Figure 2.1: Validation activities check work products against actual user
requirements, while verification activities check consistency of work
products.
Validation activities refer primarily to the overall system specification and the final code.
With respect to overall system specification, validation checks for discrepancies
between actual needs and the system specification as laid out by the analysts, to ensure
that the specification is an adequate guide to building a product that will fulfill its goals.
With respect to final code, validation aims at checking discrepancies between actual
need and the final product, to reveal possible failures of the development process and to
make sure the product meets end-user expectations. Validation checks between the
specification and final product are primarily checks of decisions that were left open in the
specification (e.g., details of the user interface or product features). Chapter 4 provides
a more thorough discussion of validation and verification activities in particular software
process models.
We have omitted one important set of verification checks from Figure 2.1 to avoid
clutter. In addition to checks that compare two or more artifacts, verification includes
checks for self-consistency and well-formedness. For example, while we cannot judge
that a program is "correct" except in reference to a specification of what it should do, we
can certainly determine that some programs are "incorrect" because they are ill-formed.
We may likewise determine that a specification itself is ill-formed because it is
inconsistent (requires two properties that cannot both be true) or ambiguous (can be
interpreted to require some property or not), or because it does not satisfy some other
well-formedness constraint that we impose, such as adherence to a standard imposed
by a regulatory agency.
Validation against actual requirements necessarily involves human judgment and the
potential for ambiguity, misunderstanding, and disagreement. In contrast, a specification
should be sufficiently precise and unambiguous that there can be no disagreement about
whether a particular system behavior is acceptable. While the term testing is often used
informally both for gauging usefulness and verifying the product, the activities differ in
both goals and approach. Our focus here is primarily on dependability, and thus primarily
on verification rather than validation, although techniques for validation and the relation
between the two is discussed further in Chapter 22.
Dependability properties include correctness, reliability, robustness, and safety.
Correctness is absolute consistency with a specification, always and in all
circumstances. Correctness with respect to nontrivial specifications is almost never
achieved. Reliability is a statistical approximation to correctness, expressed as the
likelihood of correct behavior in expected use. Robustness, unlike correctness and
reliability, weighs properties as more and less critical, and distinguishes which properties
should be maintained even under exceptional circumstances in which full functionality
cannot be maintained. Safety is a kind of robustness in which the critical property to be
maintained is avoidance of particular hazardous behaviors. Dependability properties are
discussed further in Chapter 4.
[1] A part of the diagram is a variant of the well-known "V model" of verification and validation.
construct a logical proof. How long would this take? If we ignore implementation details
such as the size of the memory holding a program and its data, the answer is "forever."
That is, for most programs, exhaustive testing cannot be completed in any finite amount
of time.
Suppose we do make use of the fact that programs are executed on real machines with
finite representations of memory values. Consider the following trivial Java class:
class Trivial {
    static int sum(int a, int b) { return a + b; }
}
The Java language definition states that the representation of an int is 32 binary digits,
and thus there are only 2^32 × 2^32 = 2^64 ≈ 1.8 × 10^19 different inputs on which the method
Trivial.sum() need be tested to obtain a proof of its correctness. At one nanosecond
(10^-9 seconds) per test case, this will take approximately 1.8 × 10^10 seconds, or nearly
600 years.
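The estimate is easy to check with a throwaway calculation (ours, not part of the original example):

    // Back-of-the-envelope check of the exhaustive-testing estimate above.
    public class ExhaustiveTestEstimate {
        public static void main(String[] args) {
            double inputs  = Math.pow(2, 32) * Math.pow(2, 32); // all (a, b) int pairs = 2^64
            double seconds = inputs * 1e-9;                     // one nanosecond per test case
            double years   = seconds / (60.0 * 60 * 24 * 365);
            System.out.printf("inputs = %.2e, seconds = %.2e, years = %.0f%n",
                              inputs, seconds, years);
            // Prints roughly: inputs = 1.84e+19, seconds = 1.84e+10, years = 585
        }
    }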
A technique for verifying a property can be inaccurate in one of two directions (Figure
2.2). It may be pessimistic, meaning that it is not guaranteed to accept a program even
if the program does possess the property being analyzed, or it can be optimistic if it
may accept some programs that do not possess the property (i.e., it may not detect all
violations). Testing is the classic optimistic technique, because no finite number of tests
can guarantee correctness. Many automated program analysis techniques for properties
of program behaviors[3] are pessimistic with respect to the properties they are designed
to verify. Some analysis techniques may give a third possible answer, "don't know." We
can consider these techniques to be either optimistic or pessimistic depending on how
we interpret the "don't know" result. Perfection is unobtainable, but one can choose
techniques that err in only a particular direction.
int i, sum;
int first=1;
for (i=0; i<10; ++i) {
if (first) {
sum=0; first=0;
}
sum += i;
}
It is impossible in general to determine whether each control flow path can be executed,
and while a human will quickly recognize that the variable sum is initialized on the first
iteration of the loop, a compiler or other static analysis tool will typically not be able to
rule out an execution in which the initialization is skipped on the first iteration. Java neatly
solves this problem by making code like this illegal; that is, the rule is that a variable
must be initialized on all program control paths, whether or not those paths can ever be
executed.
Software developers are seldom at liberty to design new restrictions into the
programming languages and compilers they use, but the same principle can be applied
through external tools, not only for programs but also for other software artifacts.
Consider, for example, the following condition that we might wish to impose on
requirements documents:
1. Each significant domain term shall appear with a definition in the glossary of
the document.
This property is nearly impossible to check automatically, since determining whether a
particular word or phrase is a "significant domain term" is a matter of human judgment.
Moreover, human inspection of the requirements document to check this requirement will
be extremely tedious and error-prone. What can we do? One approach is to separate
the decision that requires human judgment (identifying words and phrases as
"significant") from the tedious check for presence in the glossary.
1.a Each significant domain term shall be set off in the requirements document by the
use of a standard style term. The default visual representation of the term style is a
single underline in printed documents and purple text in on-line displays.
1.b Each word or phrase in the term style shall appear with a definition in the glossary
of the document.
Property (1a) still requires human judgment, but it is now in a form that is much more
amenable to inspection. Property (1b) can be easily automated in a way that will be
completely precise (except that the task of determining whether definitions appearing in
the glossary are clear and correct must also be left to humans).
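A minimal sketch of how the mechanical half, property (1b), might be automated: it assumes a hypothetical export convention in which text in the term style is marked as _term_, and the document snippet and glossary are invented.

    import java.util.*;
    import java.util.regex.*;

    // Sketch of an automated check for property (1b): every marked domain term
    // must appear in the glossary. The _term_ marking convention is assumed.
    public class GlossaryCheck {
        public static void main(String[] args) {
            String requirements = "The _order dispatcher_ notifies the _fulfillment center_ "
                                + "when a _custom configuration_ is approved.";
            Set<String> glossary = Set.of("order dispatcher", "fulfillment center");

            Matcher m = Pattern.compile("_([^_]+)_").matcher(requirements);
            while (m.find()) {
                String term = m.group(1);
                if (!glossary.contains(term)) {
                    System.out.println("Missing glossary entry: " + term);
                }
            }
        }
    }

Running the sketch reports the one marked term that the invented glossary omits; the human judgment remains confined to marking the terms in the first place.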
As a second example, consider a Web-based service in which user sessions need not
directly interact, but they do read and modify a shared collection of data on the server.
In this case a critical property is maintaining integrity of the shared data. Testing for this
property is notoriously difficult, because a "race condition" (interference between writing
data in one process and reading or writing related data in another process) may cause
an observable failure only very rarely.
Fortunately, there is a rich body of applicable research results on concurrency control
that can be exploited for this application. It would be foolish to rely primarily on direct
testing for the desired integrity properties. Instead, one would choose a (well-known,
formally verified) concurrency control protocol, such as the two-phase locking protocol,
and rely on some combination of static analysis and program testing to check
conformance to that protocol. Imposing a particular concurrency control protocol
substitutes a much simpler, sufficient property (two-phase locking) for the complex
property of interest (serializability), at some cost in generality; that is, there are
programs that violate two-phase locking and yet, by design or dumb luck, satisfy
serializability of data access.
It is a common practice to further impose a global order on lock accesses, which again
simplifies testing and analysis. Testing would identify execution sequences in which data
is accessed without proper locks, or in which locks are obtained and relinquished in an
order that does not respect the two-phase protocol or the global lock order, even if data
integrity is not violated on that particular execution, because the locking protocol failure
indicates the potential for a dangerous race condition in some other execution that might
occur only rarely or under extreme load.
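The global lock-order convention can be illustrated with a small sketch (the Account class and its fields are ours, not from the text): if every thread acquires locks in one canonical order, testing and analysis need only check conformance to that order rather than search for rare interleavings.

    // Sketch of a global lock order: locks are always acquired in ascending id order,
    // so no cycle of waiting threads can form. The names are illustrative only.
    public class Account {
        private final long id;
        private long balance;

        public Account(long id, long balance) { this.id = id; this.balance = balance; }

        public static void transfer(Account from, Account to, long amount) {
            // Canonical order: lock the account with the smaller id first.
            Account first  = from.id < to.id ? from : to;
            Account second = from.id < to.id ? to : from;
            synchronized (first) {
                synchronized (second) {
                    from.balance -= amount;
                    to.balance   += amount;
                }
            }
        }
    }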
With the adoption of coding conventions that make locking and unlocking actions easy to
recognize, it may be possible to rely primarily on flow analysis to determine
conformance with the locking protocol, with the role of dynamic testing reduced to a
"back-up" to raise confidence in the soundness of the static analysis. Note that the
critical decision to impose a particular locking protocol is not a post-hoc decision that
can be made in a testing "phase" at the end of development. Rather, the plan for
verification activities with a suitable balance of cost and assurance is part of system
design.
[3]Why
throughout development, but to the extent possible will also attempt to codify usability
guidelines in a form that permits verification. For example, if the usability group determines
that the fuel gauge should always be visible when the fuel level is below a quarter of a tank,
then this becomes a specified property that is subject to verification. The graphical interface
also poses a challenge in effectively checking output. This must be addressed partly in the
architectural design of the system, which can make automated testing feasible or not
depending on the interfaces between high-level operations (e.g., opening or closing a
window, checking visibility of a window) and low-level graphical operations and
representations.
Summary
Verification activities are comparisons to determine the consistency of two or more
software artifacts, or self-consistency, or consistency with an externally imposed criterion.
Verification is distinct from validation, which is consideration of whether software fulfills its
actual purpose. Software development always includes some validation and some
verification, although different development approaches may differ greatly in their relative
emphasis.
Precise answers to verification questions are sometimes difficult or impossible to obtain, in
theory as well as in practice. Verification is therefore an art of compromise, accepting some
degree of optimistic inaccuracy (as in testing) or pessimistic inaccuracy (as in many static
analysis techniques) or choosing to check a property that is only an approximation of what
we really wish to check. Often the best approach will not be exclusive reliance on one
technique, but careful choice of a portfolio of test and analysis techniques selected to obtain
acceptable results at acceptable cost, and addressing particular challenges posed by
characteristics of the application domain or software.
Further Reading
The "V" model of verification and validation (of which Figure 2.1 is a variant) appears in
many software engineering textbooks, and in some form can be traced at least as far back
as Myers' classic book [Mye79]. The distinction between validation and verification as given
here follows Boehm [Boe81], who has most memorably described validation as "building
the right system" and verification as "building the system right."
Exercises
2.1 The Chipmunk marketing division is worried about the start-up time of the new
version of the RodentOS operating system (an imaginary operating system of
Chipmunk). The marketing division representative suggests a software requirement
stating that the start-up time shall not be annoying to users.
Explain why this simple requirement is not verifiable and try to reformulate the
requirement to make it verifiable.
2.2
2.3
Analysis and testing (A&T) has been common practice since the
earliest software projects. A&T activities were for a long time based
on common sense and individual skills. It has emerged as a distinct
discipline only in the last three decades.
This chapter advocates six principles that characterize various
approaches and techniques for analysis and testing: sensitivity,
redundancy, restriction, partition, visibility, and feedback. Some of
these principles, such as partition, visibility, and feedback, are quite
general in engineering. Others, notably sensitivity, redundancy, and
restriction, are specific to A&T and contribute to characterizing A&T
as a discipline.
3.1 Sensitivity
Human developers make errors, producing faults in software. Faults
may lead to failures, but faulty software may not fail on every
execution. The sensitivity principle states that it is better to fail
every time than sometimes.
Consider the cost of detecting and repairing a software fault. If it is detected immediately
(e.g., by an on-the-fly syntactic check in a design editor), then the cost of correction is very
small, and in fact the line between fault prevention and fault detection is blurred. If a fault is
detected in inspection or unit testing, the cost is still relatively small. If a fault survives initial
detection efforts at the unit level, but triggers a failure detected in integration testing, the
cost of correction is much greater. If the first failure is detected in system or acceptance
testing, the cost is very high indeed, and the most costly faults are those detected by
customers in the field.
A fault that triggers a failure on every execution is unlikely to survive past unit testing. A
characteristic of faults that escape detection until much later is that they trigger failures only
rarely, or in combination with circumstances that seem unrelated or are difficult to control.
For example, a fault that results in a failure only for some unusual configurations of
customer equipment may be difficult and expensive to detect. A fault that results in a failure
randomly but very rarely - for example, a race condition that only occasionally causes data
corruption - may likewise escape detection until the software is in use by thousands of
customers, and even then be difficult to diagnose and correct.
The small C program in Figure 3.1 has three faulty calls to string
copy procedures. The calls to strcpy, strncpy, and stringCopy all pass
a source string "Muddled," which is too long to fit in the array
middle. The vulnerability of strcpy is well known, and is the culprit in
the by-now-standard buffer overflow attacks on many network
services. Unfortunately, the fault may or may not cause an
observable failure depending on the arrangement of memory (in this
case, it depends on what appears in the position that would be
middle[7], which will be overwritten with a newline character). The
standard recommendation is to use strncpy in place of strcpy. While
strncpy avoids overwriting other memory, it truncates the input
without warning, and sometimes without properly null-terminating
the output. The replacement function stringCopy, on the other hand,
uses an assertion to ensure that, if the target string is too long, the
program always fails in an observable manner.
/**
 * Worse than broken: Are you feeling lucky?
 */

#include <assert.h>
#include <string.h>
#include <stdio.h>

char before[ ] = "=Before=";
char middle[ ] = "Middle";
char after[ ] = "=After=";

void show() {
    printf("%s\n%s\n%s\n", before, middle, after);
}

void stringCopy(char *target, const char *source, int howBig);

int main(int argc, char *argv[]) {
    show();
    strcpy(middle, "Muddled");                      /* Fault, but may not fail */
    show();
    strncpy(middle, "Muddled", sizeof(middle));     /* Fault, may not fail */
    show();
    stringCopy(middle, "Muddled", sizeof(middle));  /* Guaranteed to fail */
    show();
}

/* Sensitive version of strncpy; can be counted on to fail
 * in an observable way EVERY time the source is too large
 * for the target, unlike the standard strncpy or strcpy.
 */
void stringCopy(char *target, const char *source, int howBig) {
    assert(strlen(source) < howBig);
    strcpy(target, source);
}
Similarly, skilled test designers can derive excellent test suites, but
the quality of the test suites depends on the mood of the designers.
Systematic testing criteria may not do better than skilled test
designers, but they can reduce the influence of external factors,
such as the tester's mood.
[1]Existence of a general, reliable test coverage criterion would allow
3.2 Redundancy
Redundancy is the opposite of independence. If one part of a software artifact
(program, design document, etc.) constrains the content of another, then they are not
entirely independent, and it is possible to check them for consistency.
The concept and definition of redundancy are taken from information theory. In
communication, redundancy can be introduced into messages in the form of error-detecting and error-correcting codes to guard against transmission errors. In software
test and analysis, we wish to detect faults that could lead to differences between
intended behavior and actual behavior, so the most valuable form of redundancy is in the
form of an explicit, redundant statement of intent.
Where redundancy can be introduced or exploited with an automatic, algorithmic check
for consistency, it has the advantage of being much cheaper and more thorough than
dynamic testing or manual inspection. Static type checking is a classic application of this
principle: The type declaration is a statement of intent that is at least partly redundant
with the use of a variable in the source code. The type declaration constrains other parts
of the code, so a consistency check (type check) can be applied.
An important trend in the evolution of programming languages is introduction of additional
ways to declare intent and automatically check for consistency. For example, Java
enforces rules about explicitly declaring each exception that can be thrown by a method.
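A small illustration of this kind of checkable redundancy (the class and method are invented): both the declared parameter and return types and the throws clause restate intent that the Java compiler cross-checks against the method body and against every caller.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    // Sketch: redundant statements of intent that the compiler checks.
    public class ConfigLoader {
        // 'throws IOException' is redundant with the body, which can raise it,
        // and every caller must either handle or re-declare the exception.
        public static String firstLine(String path) throws IOException {
            try (BufferedReader in = new BufferedReader(new FileReader(path))) {
                return in.readLine();
            }
        }
    }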
Checkable redundancy is not limited to program source code, nor is it something that
can be introduced only by programming language designers. For example, software
design tools typically provide ways to check consistency between different design views
or artifacts. One can also intentionally introduce redundancy in other software artifacts,
even those that are not entirely formal. For example, one might introduce rules quite
analogous to type declarations for semistructured requirements specification documents,
and thereby enable automatic checks for consistency and some limited kinds of
completeness.
When redundancy is already present - as between a software specification document
and source code - then the remaining challenge is to make sure the information is
represented in a way that facilitates cheap, thorough consistency checks. Checks that
can be implemented by automatic tools are usually preferable, but there is value even in
organizing information to make inconsistency easier to spot in manual inspection.
Of course, one cannot always obtain cheap, thorough checks of source code and other
documents. Sometimes redundancy is exploited instead with run-time checks. Defensive
programming, explicit run-time checks for conditions that should always be true if the
program is executing correctly, is another application of redundancy in programming.
3.3 Restriction
When there are no acceptably cheap and effective ways to check a property,
sometimes one can change the problem by checking a different, more restrictive
property or by limiting the check to a smaller, more restrictive class of programs.
Consider the problem of ensuring that each variable is initialized before it is used, on
every execution. Simple as the property is, it is not possible for a compiler or analysis
tool to precisely determine whether it holds. See the program in Figure 3.2 for an
illustration. Can the variable k ever be uninitialized the first time i is added to it? If
someCondition(0) always returns true, then k will be initialized to zero on the first time
through the loop, before k is incremented, so perhaps there is no potential for a run-time
error - but method someCondition could be arbitrarily complex and might even depend
on some condition in the environment. Java's solution to this problem is to enforce a
stricter, simpler condition: A program is not permitted to have any syntactic control paths
on which an uninitialized reference could occur, regardless of whether those paths could
actually be executed. The program in Figure 3.2 has such a path, so the Java compiler
rejects it.
Figure 3.2: Can the variable k ever be uninitialized the first time i is added to it? The
property is undecidable, so Java enforces a simpler, stricter property.
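The figure's code is not reproduced in this excerpt. The following sketch (ours, not the book's figure) shows the pattern the caption and the surrounding discussion describe; the Java compiler rejects it.

    // Sketch only: k is assigned a value just on the path where someCondition(i) is
    // true, so there is a syntactic path on which k is used before initialization.
    // Java rejects the method ("variable k might not have been initialized") even
    // if that path can never actually be executed, so this class does not compile.
    public class Restriction {
        static boolean someCondition(int i) { return i >= 0; } // arbitrarily complex in general

        static void questionable() {
            int k;
            for (int i = 0; i < 10; ++i) {
                if (someCondition(i)) {
                    k = 0;
                }
                k += i;   // compile-time error: k might not have been initialized
            }
        }
    }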
Java's rule for initialization before use is a program source code restriction that enables
precise, efficient checking of a simple but important property by the compiler. Another
example of restriction is the simple, stateless design of the Hypertext Transport Protocol
(HTTP) 1.0 of the World-Wide-Web, which made Web servers not only much simpler and
more robust but also much easier to test.
3.4 Partition
Partition, often also known as "divide and conquer," is a general engineering principle.
Dividing a complex problem into subproblems to be attacked and solved independently is
probably the most common human problem-solving strategy. Software engineering in
particular applies this principle in many different forms and at almost all development
levels, from early requirements specifications to code and maintenance. Analysis and
testing are no exception: the partition principle is widely used and exploited.
Partitioning can be applied both at process and technique levels. At the process level,
we divide complex activities into sets of simple activities that can be attacked
independently. For example, testing is usually divided into unit, integration, subsystem,
and system testing. In this way, we can focus on different sources of faults at different
steps, and at each step, we can take advantage of the results of the former steps. For
instance, we can use units that have been tested as stubs for integration testing. Some
static analysis techniques likewise follow the modular structure of the software system
to divide an analysis problem into smaller steps.
Many static analysis techniques first construct a model of a system and then analyze the
model. In this way they divide the overall analysis into two subtasks: first simplify the
system to make the proof of the desired properties feasible and then prove the property
with respect to the simplified model. The question "Does this program have the desired
property?" is decomposed into two questions, "Does this model have the desired
property?" and "Is this an accurate model of the program?"
Since it is not possible to execute the program with every conceivable input, systematic
testing strategies must identify a finite number of classes of test cases to execute.
Whether the classes are derived from specifications (functional testing) or from program
structure (structural testing), the process of enumerating test obligations proceeds by
dividing the sources of information into significant elements (clauses or special values
identifiable in specifications, statements or paths in programs), and creating test cases
that cover each such element or certain combinations of elements.
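As a small concrete illustration of this enumeration (the specification clauses and values are invented), each class derived from the specification contributes at least one representative test input:

    import java.util.*;

    // Sketch: one representative input per specification-derived class, rather than
    // attempting to sample the whole input domain. The classes are illustrative.
    public class PartitionSelection {
        public static void main(String[] args) {
            // Classes for an imagined "discount for order total" function.
            Map<String, Integer> representatives = new LinkedHashMap<>();
            representatives.put("negative total (invalid)",    -1);
            representatives.put("zero total (boundary)",        0);
            representatives.put("below discount threshold",    99);
            representatives.put("at threshold (boundary)",     100);
            representatives.put("above threshold",             250);

            representatives.forEach((cls, input) ->
                System.out.printf("class %-28s -> test input %d%n", cls, input));
        }
    }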
3.5 Visibility
Visibility means the ability to measure progress or status against goals. In software
engineering, one encounters the visibility principle mainly in the form of process visibility,
and then mainly in the form of schedule visibility: ability to judge the state of development
against a project schedule. Quality process visibility also applies to measuring achieved
(or predicted) quality against quality goals. The principle of visibility involves setting goals
that can be assessed as well as devising methods to assess their realization.
Visibility is closely related to observability, the ability to extract useful information from a
software artifact. The architectural design and build plan of a system determines what
will be observable at each stage of development, which in turn largely determines the
visibility of progress against goals at that stage.
A variety of simple techniques can be used to improve observability. For example, it is
no accident that important Internet protocols like HTTP and SMTP (Simple Mail
Transport Protocol, used by Internet mail servers) are based on the exchange of simple
textual commands. The choice of simple, human-readable text rather than a more
compact binary encoding has a small cost in performance and a large payoff in
observability, including making construction of test drivers and oracles much simpler.
Use of human-readable and human-editable files is likewise advisable wherever the
performance cost is acceptable.
A variant of observability through direct use of simple text encodings is providing readers
and writers to convert between other data structures and simple, human-readable and
editable text. For example, when designing classes that implement a complex data
structure, designing and implementing also a translation from a simple text format to the
internal structure, and vice versa, will often pay back handsomely in both ad hoc and
systematic testing. For similar reasons it is often useful to design and implement an
equality check for objects, even when it is not necessary to the functionality of the
software product.
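Both suggestions can be sketched together (the class and its text format are ours): a human-readable round-trip plus an equals method make ad hoc checks and automated oracles much easier to write.

    import java.util.Objects;

    // Sketch: a data structure with a simple text round-trip and an equality check
    // defined for testing. The "x,y" format is an illustrative choice.
    public class GridPoint {
        private final int x, y;

        public GridPoint(int x, int y) { this.x = x; this.y = y; }

        // Writer: simple, human-readable and human-editable representation.
        public String toText() { return x + "," + y; }

        // Reader: rebuild the structure from the same text form.
        public static GridPoint fromText(String s) {
            String[] parts = s.split(",");
            return new GridPoint(Integer.parseInt(parts[0].trim()),
                                 Integer.parseInt(parts[1].trim()));
        }

        @Override public boolean equals(Object o) {
            return o instanceof GridPoint p && p.x == x && p.y == y;
        }
        @Override public int hashCode() { return Objects.hash(x, y); }

        public static void main(String[] args) {
            GridPoint p = new GridPoint(3, 7);
            // The round-trip check can serve directly as a test oracle.
            assert p.equals(GridPoint.fromText(p.toText()));
            System.out.println(p.toText());
        }
    }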
3.6 Feedback
Feedback is another classic engineering principle that applies to analysis and testing.
Feedback applies both to the process itself (process improvement) and to individual
techniques (e.g., using test histories to prioritize regression testing).
Systematic inspection and walkthrough derive part of their success from feedback.
Participants in inspection are guided by checklists, and checklists are revised and
refined based on experience. New checklist items may be derived from root cause
analysis, analyzing previously observed failures to identify the initial errors that lead to
them.
Summary
Principles constitute the core of a discipline. They form the basis of methods,
techniques, methodologies and tools. They permit understanding, comparing, evaluating
and extending different approaches, and they constitute the lasting basis of knowledge
of a discipline.
The six principles described in this chapter are
Sensitivity: better to fail every time than sometimes,
Redundancy: making intentions explicit,
Restriction: making the problem easier,
Partition: divide and conquer,
Visibility: making information accessible, and
Feedback: applying lessons from experience in process and techniques.
Principles are identified heuristically by searching for a common denominator of
techniques that apply to various problems and exploit different methods, sometimes
borrowing ideas from other disciplines, sometimes observing recurrent phenomena.
Potential principles are validated by finding existing and new techniques that exploit the
underlying ideas. Generality and usefulness of principles become evident only with time.
The initial list of principles proposed in this chapter is certainly incomplete. Readers are
invited to validate the proposed principles and identify additional principles.
Further Reading
Analysis and testing is a relatively new discipline. To our knowledge, the principles
underlying analysis and testing have not been discussed in the literature previously.
Some of the principles advocated in this chapter are shared with other software
engineering disciplines and are discussed in many books. A good introduction to
software engineering principles is the third chapter of Ghezzi, Jazayeri, and Mandrioli's
book on software engineering [GJM02].
Exercises
3.1 Indicate which principles guided the following choices:
    1. Use an externally readable format also for internal files, when possible.
    2. Collect and analyze data about faults revealed and removed from the code.
    3. Separate test and debugging activities; that is, separate the design and
       execution of test cases to reveal failures (test) from the localization and
       removal of the corresponding faults (debugging).
    4. Distinguish test case design from execution.
    5. Produce complete fault reports.
    6. Use information from test case design to improve requirements and design
       specifications.
    7. Provide interfaces for fully inspecting the internal state of a class.
architectural design may suggest structures and interfaces that not only facilitate testing
earlier in development, but also make key interfaces simpler and more precisely defined.
There is also another reason for carrying out quality activities at the earliest opportunity
and for preferring earlier to later activities when either could serve to detect the same
fault: The single best predictor of the cost of repairing a software defect is the time
between its introduction and its detection. A defect introduced in coding is far cheaper to
repair during unit test than later during integration or system test, and most expensive if
it is detected by a user of the fielded system. A defect introduced during requirements
engineering (e.g., an ambiguous requirement) is relatively cheap to repair at that stage,
but may be hugely expensive if it is only uncovered by a dispute about the results of a
system acceptance test.
It is quite possible to build systems that are very reliable, relatively free from hazards,
and completely useless. They may be unbearably slow, or have terrible user
interfaces and unfathomable documentation, or they may be missing several crucial
features. How should these properties be considered in software quality? One answer is
that they are not part of quality at all unless they have been explicitly specified, since
quality is the presence of specified properties. However, a company whose products are
rejected by its customers will take little comfort in knowing that, by some definitions,
they were high-quality products.
We can do better by considering quality as fulfillment of required and desired properties,
as distinguished from specified properties. For example, even if a client does not
explicitly specify the required performance of a system, there is always some level of
performance that is required to be useful.
One of the most critical tasks in software quality analysis is making desired properties
explicit, since properties that remain unspecified (even informally) are very likely to
surface unpleasantly when it is discovered that they are not met. In many cases these
implicit requirements can not only be made explicit, but also made sufficiently precise
that they can be made part of dependability or reliability. For example, while it is better
to explicitly recognize usability as a requirement than to leave it implicit, it is better yet to
augment[1] usability requirements with specific interface standards, so that a deviation
from the standards is recognized as a defect.
[1]Interface
properties that are less dependent on specification and that do distinguish among
failures depending on severity.
Software safety is an extension of the well-established field of system safety into
software. Safety is concerned with preventing certain undesirable behaviors, called
hazards. It is quite explicitly not concerned with achieving any useful behavior apart from
whatever functionality is needed to prevent hazards. Software safety is typically a
concern in "critical" systems such as avionics and medical systems, but the basic
principles apply to any system in which particularly undesirable behaviors can be
distinguished from run-of-the-mill failure. For example, while it is annoying when a word
processor crashes, it is much more annoying if it irrecoverably corrupts document files.
The developers of a word processor might consider safety with respect to the hazard of
file corruption separately from reliability with respect to the complete functional
requirements for the word processor.
Just as correctness is meaningless without a specification of allowed behaviors, safety
is meaningless without a specification of hazards to be prevented, and in practice the
first step of safety analysis is always finding and classifying hazards. Typically, hazards
are associated with some system in which the software is embedded (e.g., the medical
device), rather than the software alone. The distinguishing feature of safety is that it is
concerned only with these hazards, and not with other aspects of correct functioning.
The concept of safety is perhaps easier to grasp with familiar physical systems. For
example, lawn-mowers in the United States are equipped with an interlock device,
sometimes called a "dead-man switch." If this switch is not actively held by the operator,
the engine shuts off. The dead-man switch does not contribute in any way to cutting
grass; its sole purpose is to prevent the operator from reaching into the mower blades
while the engine runs.
One is tempted to say that safety is an aspect of correctness, because a good system
specification would rule out hazards. However, safety is best considered as a quality
distinct from correctness and reliability for two reasons. First, by focusing on a few
hazards and ignoring other functionality, a separate safety specification can be much
simpler than a complete system specification, and therefore easier to verify. To put it
another way, while a good system specification should rule out hazards, we cannot be
confident that either specifications or our attempts to verify systems are good enough to
provide the degree of assurance we require for hazard avoidance. Second, even if the
safety specification were redundant with regard to the full system specification, it is
important because (by definition) we regard avoidance of hazards as more crucial than
satisfying other parts of the system specification.
Correctness and reliability are contingent upon normal operating conditions. It is not
reasonable to expect a word processing program to save changes normally when the
file does not fit in storage, or to expect a database to continue to operate normally when
the computer loses power, or to expect a Web site to provide completely satisfactory
service to all visitors when the load is 100 times greater than the maximum for which it
was designed. Software that fails under these conditions, which violate the premises of
its design, may still be "correct" in the strict sense, yet the manner in which the software
fails is important. It is acceptable that the word processor fails to write the new file that
does not fit on disk, but unacceptable to also corrupt the previous version of the file in
the attempt. It is acceptable for the database system to cease to function when the
power is cut, but unacceptable for it to leave the database in a corrupt state. And it is
usually preferable for the Web system to turn away some arriving users rather than
becoming too slow for all, or crashing. Software that gracefully degrades or fails "softly"
outside its normal operating parameters is robust.
Software safety is a kind of robustness, but robustness is a more general notion that
concerns not only avoidance of hazards (e.g., data corruption) but also partial
functionality under unusual situations. Robustness, like safety, begins with explicit
consideration of unusual and undesirable situations, and should include augmenting
software specifications with appropriate responses to undesirable events.
Figure 4.1 illustrates the relation among dependability properties.
Quality analysis should be part of the feasibility study. The sidebar on page 47 shows an
excerpt of the feasibility study for the Chipmunk Web presence. The primary quality
requirements are stated in terms of dependability, usability, and security. Performance,
portability and interoperability are typically not primary concerns at this stage, but they
may come into play when dealing with other qualities.
4.5 Analysis
Analysis techniques that do not involve actual execution of program source code play a
prominent role in overall software quality processes. Manual inspection techniques and
automated analyses can be applied at any development stage. They are particularly well
suited to the early stages of specifications and design, where the lack of executability of
many intermediate artifacts reduces the efficacy of testing.
Inspection, in particular, can be applied to essentially any document including
requirements documents, architectural and more detailed design documents, test plans
and test cases, and of course program source code. Inspection may also have
secondary benefits, such as spreading good practices and instilling shared standards of
quality. On the other hand, inspection takes a considerable amount of time and requires
meetings, which can become a scheduling bottleneck. Moreover, re-inspecting a
changed component can be as expensive as the initial inspection. Despite the versatility
of inspection, therefore, it is used primarily where other techniques are either
inapplicable or where other techniques do not provide sufficient coverage of common
faults.
Automated static analyses are more limited in applicability (e.g., they can be applied to
some formal representations of requirements models but not to natural language
documents), but are selected when available because substituting machine cycles for
human effort makes them particularly cost-effective. The cost advantage of automated
static analyses is diminished by the substantial effort required to formalize and properly
structure a model for analysis, but their application can be further motivated by their
ability to thoroughly check for particular classes of faults for which checking with other
techniques is very difficult or expensive. For example, finite state verification techniques
for concurrent systems require construction and careful structuring of a formal design
model, and address only a particular family of faults (faulty synchronization structure).
Yet they are rapidly gaining acceptance in some application domains because that family of
faults is difficult to detect in manual inspection and resists detection through dynamic
testing.
Excerpt of Web Presence Feasibility Study
Purpose of this document
This document was prepared for the Chipmunk IT management team. It describes the
results of a feasibility study undertaken to advise Chipmunk corporate management
whether to embark on a substantial redevelopment effort to add online shopping
functionality to the Chipmunk Computers' Web presence.
Goals
Sometimes the best aspects of manual inspection and automated static analysis can be
obtained by carefully decomposing properties to be checked. For example, suppose a
desired property of requirements documents is that each special term in the application
domain appear in a glossary of terms. This property is not directly amenable to an
automated static analysis, since current tools cannot distinguish meaningful domain
terms from other terms that have their ordinary meanings. The property can be checked
with manual inspection, but the process is tedious, expensive, and error-prone. A hybrid
approach can be applied if each domain term is marked in the text. Manually checking
that domain terms are marked is much faster and therefore less expensive than
manually looking each term up in the glossary, and marking the terms permits effective
automation of cross-checking with the glossary.
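As a minimal illustrative sketch, assuming domain terms are marked in the text with a [[term]] convention (the markup, class, and method names here are hypothetical), the cross-check against the glossary might be automated as follows:

    import java.util.*;
    import java.util.regex.*;

    /** Sketch of automated cross-checking of marked domain terms against a glossary. */
    public class GlossaryCheck {
        // Assumed markup convention: domain terms are written as [[term]] in the document.
        private static final Pattern MARKED_TERM = Pattern.compile("\\[\\[([^\\]]+)\\]\\]");

        /** Returns the marked terms that do not appear in the glossary. */
        public static Set<String> missingTerms(String documentText, Set<String> glossaryTerms) {
            Set<String> missing = new TreeSet<>();
            Matcher m = MARKED_TERM.matcher(documentText);
            while (m.find()) {
                String term = m.group(1).trim().toLowerCase();
                if (!glossaryTerms.contains(term)) {
                    missing.add(term);
                }
            }
            return missing;
        }

        public static void main(String[] args) {
            Set<String> glossary = new HashSet<>(Arrays.asList("purchase order", "line item"));
            String doc = "Each [[purchase order]] contains one or more [[line item]]s "
                       + "and a [[shipping authorization]].";
            // Prints: Terms missing from glossary: [shipping authorization]
            System.out.println("Terms missing from glossary: " + missingTerms(doc, glossary));
        }
    }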
4.6 Testing
Despite the attractiveness of automated static analyses when they
are applicable, and despite the usefulness of manual inspections for
a variety of documents including but not limited to program source
code, dynamic testing remains a dominant technique. A closer look,
though, shows that dynamic testing is really divided into several
distinct activities that may occur at different points in a project.
Tests are executed when the corresponding code is available, but
testing activities start earlier, as soon as the artifacts required for
designing test case specifications are available. Thus, acceptance
and system test suites should be generated before integration and
unit test suites, even if executed in the opposite order.
Early test design has several advantages. Tests are specified
independently from code and when the corresponding software
specifications are fresh in the mind of analysts and developers,
facilitating review of test design. Moreover, test cases may highlight
inconsistencies and incompleteness in the corresponding software
specifications. Early design of test cases also allows for early repair
of software specifications, preventing specification faults from
propagating to later stages in development. Finally, programmers
may use test cases to illustrate and clarify the software
specifications, especially for errors and unexpected conditions.
No engineer would build a complex structure from parts that have not themselves been
subjected to quality control. Just as the "earlier is better" rule dictates using inspection to
reveal flaws in requirements and design before they are propagated to program code, the
same rule dictates module testing to uncover as many program faults as possible before
they are incorporated in larger subsystems of the product. At Chipmunk, developers are
expected to perform functional and structural module testing before a work assignment is
considered complete and added to the project baseline. The test driver and auxiliary files
are part of the work product and are expected to make reexecution of test cases, including
result checking, as simple and automatic as possible, since the same test cases will be
used over and over again as the product evolves.
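As a minimal illustrative sketch of such a re-executable test driver, assuming a JUnit-style framework and a toy class under test (both are assumptions for illustration, not part of the Chipmunk example):

    import static org.junit.Assert.assertEquals;
    import org.junit.Test;

    /** A toy class under test (hypothetical), bundled here to keep the sketch self-contained. */
    class LineCounter {
        static int countLines(String text) {
            if (text.isEmpty()) return 0;
            return text.split("\n", -1).length;
        }
    }

    /**
     * Re-executable unit tests: each case checks its own result, so the suite can be
     * rerun automatically, with no manual result checking, as the product evolves.
     */
    public class LineCounterTest {
        @Test
        public void emptyTextHasNoLines() {
            assertEquals(0, LineCounter.countLines(""));
        }

        @Test
        public void trailingNewlineStartsANewLine() {
            assertEquals(3, LineCounter.countLines("a\nb\n"));
        }
    }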
For example, a persistent class of faults such as buffer overflows might be attributed to the use of programming languages with unconstrained pointers and without array bounds checking,
which may in turn be attributed to performance concerns and a requirement for
interoperability with a large body of legacy code. The countermeasure could involve
differences in programming methods (e.g., requiring use of certified "safe" libraries for
buffer management), or improvements to quality assurance activities (e.g., additions to
inspection checklists), or sometimes changes in management practices.
Summary
Test and analysis activities are not a late phase of the development process, but rather
a wide set of activities that pervade the whole process. Designing a quality process with
a suitable blend of test and analysis activities for the specific application domain,
development environment, and quality goals is a challenge that requires skill and
experience.
A well-defined quality process must fulfill three main goals: improving the software
product during and after development, assessing its quality before delivery, and
improving the process within and across projects. These challenging goals can be
achieved by increasing visibility, scheduling activities as early as practical, and
monitoring results to adjust the process. Process visibility - that is, measuring and
comparing progress to objectives - is a key property of the overall development
process. Performing A&T activities early produces several benefits: It increases control
over the process, it hastens fault identification and reduces the costs of fault removal, it
provides data for incrementally tuning the development process, and it accelerates
product delivery. Feedback is the key to improving the process by identifying and
removing persistent errors and faults.
Further Reading
Qualities of software are discussed in many software engineering textbooks; the
discussion in Chapter 2 of Ghezzi, Jazayeri, and Mandrioli [GJM02] is particularly useful.
Process visibility is likewise described in software engineering textbooks, usually with an
emphasis on schedule. Musa [Mus04] describes a quality process oriented particularly
to establishing a quantifiable level of reliability based on models and testing before
release. Chillarege et al. [CBC+92] present principles for gathering and analyzing fault
data, with an emphasis on feedback within a single process but applicable also to quality
process improvement.
Exercises
4.1 We have stated that 100% reliability is indistinguishable from correctness, but they are not quite identical. Under what circumstance might an incorrect program be 100% reliable? Hint: Recall that a program may be more or less reliable depending on how it is used, but a program is either correct or incorrect regardless of usage.
4.2 We might measure the reliability of a network router as the fraction of all packets that are correctly routed, or as the fraction of total service time in which packets are correctly routed. When might these two measures be different?
4.3 If I am downloading a very large file over a slow modem, do I care more about the availability of my internet service provider or its mean time between failures?
Chapter List
Chapter 5: Finite Models
Chapter 6: Dependence and Data Flow Models
Chapter 7: Symbolic Execution and Proof of Properties
Chapter 8: Finite State Verification
5.1 Overview
A model is a representation that is simpler than the artifact it represents but preserves (or
at least approximates) some important attributes of the actual artifact. Our concern in this
chapter is with models of program execution, and not with models of other (equally
important) attributes such as the effort required to develop the software or its usability. A
good model (or, more precisely, a good class of models) must typically be compact, predictive, semantically meaningful, and sufficiently general.
For the sake of sufficient generality, we may accept some restrictions on design and programming style, and we may choose conventional concurrency control protocols over novel
approaches for the same reason. However, if a program analysis technique for C programs
is applicable only to programs without pointer variables, we are unlikely to find much use for
it.
Since design models are intended partly to aid in making and evaluating design decisions,
they should share these characteristics with models constructed primarily for analysis.
However, some kinds of models - notably the widely used UML design notations - are
designed primarily for human communication, with less attention to semantic meaning and
prediction.
Models are often used indirectly in evaluating an artifact. For example, some models are
not themselves analyzed, but are used to guide test case selection. In such cases, the
qualities of being predictive and semantically meaningful apply to the model together with
the analysis or testing technique applied to another artifact, typically the actual program or
system.
Graph Representations
We often use directed graphs to represent models of programs.
Usually we draw them as "box and arrow" diagrams, but to reason
about them it is important to understand that they have a well-defined mathematical meaning, which we review here.
A directed graph is composed of a set of nodes N and a relation E on the set (that is, a set
of ordered pairs), called the edges. It is conventional to draw the nodes as points or
shapes and to draw the edges as arrows. For example:
Typically, the nodes represent entities of some kind, such as procedures or classes or
regions of source code. The edges represent some relation among the entities. For
example, if we represent program control flow using a directed graph model, an edge (a,b)
would be interpreted as the statement "program region a can be directly followed by
program region b in program execution."
We can label nodes with the names or descriptions of the entities they represent. If nodes a
and b represent program regions containing assignment statements, we might draw the two
nodes and an edge (a,b) connecting them in this way:
Drawings of graphs can be refined in many ways, for example, depicting some relations as
attributes rather than directed edges. Important as these presentation choices may be for
clear communication, only the underlying sets and relations matter for reasoning about
models.
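As a minimal illustrative sketch, a directed graph can be represented directly as a set of nodes and a set of ordered pairs; the Java class below is an assumption for illustration only:

    import java.util.*;

    /** A directed graph as a set of nodes N and a relation E of ordered pairs, as described above. */
    public class DiGraph<N> {
        private final Set<N> nodes = new HashSet<>();
        private final Set<List<N>> edges = new HashSet<>();   // each edge is an ordered pair (a, b)

        public void addNode(N n) { nodes.add(n); }

        public void addEdge(N a, N b) {
            addNode(a);
            addNode(b);
            edges.add(Arrays.asList(a, b));
        }

        /** Successors of a: all b such that (a, b) is in E. */
        public Set<N> successors(N a) {
            Set<N> succ = new HashSet<>();
            for (List<N> e : edges) {
                if (e.get(0).equals(a)) succ.add(e.get(1));
            }
            return succ;
        }

        public static void main(String[] args) {
            DiGraph<String> g = new DiGraph<>();
            g.addEdge("a", "b");   // program region a can be directly followed by region b
            g.addEdge("b", "c");
            System.out.println(g.successors("a"));   // prints [b]
        }
    }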
since the first and third program execution steps modify only the
omitted attribute. The relation between (2a) and (2b) illustrates
introduction of nondeterminism, because program execution states
with different successor states have been merged.
Finite models of program execution are inevitably imperfect. Collapsing the potentially
infinite states of actual execution into a finite number of representative model states
necessarily involves omitting some information. While one might hope that the omitted
information is irrelevant to the property one wishes to verify, this is seldom completely true.
In Figure 5.1, parts 2(a) and 2(b) illustrate how abstraction can cause a set of deterministic
transitions to be modeled by a nondeterministic choice among transitions, thus making the
analysis imprecise. This in turn can lead to "false alarms" in analysis of models.
[1] We put aside, for the moment, the possibility of parallel or concurrent execution.
Figure 5.2: Building blocks for constructing intraprocedural control flow graphs. Other
control constructs are represented analogously. For example, the for construct of C,
C++, and Java is represented as if the initialization part appeared before a while
loop, with the increment part at the end of the while loop body.
In terms of program execution, we can say that a control flow graph model retains some
information about the program counter (the address of the next instruction to be
executed), and elides other information about program execution (e.g., the values of
variables). Since information that determines the outcome of conditional branches is
elided, the control flow graph represents not only possible program paths but also some
paths that cannot be executed. This corresponds to the introduction of nondeterminism
illustrated in Figure 5.1.
The nodes in a control flow graph could represent individual program statements, or
even individual machine operations, but it is desirable to make the graph model as
compact and simple as possible. Usually, therefore, nodes in a control flow graph model
of a program represent not a single point but rather a basic block, a maximal program
region with a single entry and a single exit point.
/**
 * Remove/collapse multiple newline characters.
 *
 * @param argStr string to collapse newlines in.
 * @return String
 */
public static String collapseNewlines(String argStr)
{
    char last = argStr.charAt(0);
    StringBuffer argBuf = new StringBuffer();
    for (int cIdx = 0; cIdx < argStr.length(); cIdx++) {
        char ch = argStr.charAt(cIdx);
        if (ch != '\n' || last != '\n') {
            argBuf.append(ch);
            last = ch;
        }
    }
    return argBuf.toString();
}
Figure 5.3: A Java method to collapse adjacent newline characters, from the
StringUtilities class of the Velocity project of the open source Apache project. (c)
2001 Apache Software Foundation, used with permission.
Figure 5.4: A control flow graph corresponding to the Java method in Figure 5.3. The
for statement and the predicate of the if statement have internal control flow
branches, so those statements are broken across basic blocks.
Some analysis algorithms are simplified by introducing a distinguished node to represent
procedure entry and another to represent procedure exit. When these distinguished start
and end nodes are used in a CFG, a directed edge leads from the start node to the
node representing the first executable block, and a directed edge from each procedure
exit (e.g., each return statement and the last sequential block in the program) to the
distinguished end node. Our practice will be to draw a start node identified with the
procedure or method signature, and to leave the end node implicit.
The intraprocedural control flow graph may be used directly to define thoroughness
criteria for testing (see Chapters 9 and 12). Often the control flow graph is used to
define another model, which in turn is used to define a thoroughness criterion. For
example, some criteria are defined by reference to linear code sequences and jumps
(LCSAJs), which are essentially subpaths of the control flow graph from one branch to
another. Figure 5.5 shows the LCSAJs derived from the control flow graph of Figure 5.4.
From     Sequence of basic blocks      To
entry    b1 b2 b3                      jX
entry    b1 b2 b3 b4                   jT
entry    b1 b2 b3 b4 b5                jE
entry    b1 b2 b3 b4 b5 b6 b7          jL
jX       b8                            return
jL       b3 b4                         jT
jL       b3 b4 b5                      jE
jL       b3 b4 b5 b6 b7                jL
Figure 5.5: Linear code sequences and jumps (LCSAJs) corresponding to the Java
method in Figure 5.3 and the control flow graph in Figure 5.4. Note that proceeding to
the next sequential basic block is not considered a "jump" for purposes of identifying
LCSAJs.
For use in analysis, the control flow graph is usually augmented with other information.
For example, the data flow models described in the next chapter are constructed using a
CFG model augmented with information about the variables accessed and modified by
each program statement.
Not all control flow is represented explicitly in program text. For example, if an empty
string is passed to the collapseNewlines method of Figure 5.3, the exception
java.lang.StringIndexOutOfBoundsException will be thrown by String.charAt, and
execution of the method will be terminated. This could be represented in the CFG as a
directed edge to an exit node. However, if one includes such implicit control flow edges
for every possible exception (for example, an edge from each reference that might lead
to a null pointer exception), the CFG becomes rather unwieldy.
More fundamentally, it may not be simple or even possible to determine which of the
implicit control flow edges can actually be executed. We can reason about the call to
argStr.charAt(cIdx) within the body of the for loop and determine that cIdx must always
be within bounds, but we cannot reasonably expect an automated tool for extracting
control flow graphs to perform such inferences. Whether to include some or all implicit
control flow edges in a CFG representation therefore involves a trade-off between
possibly omitting some execution paths or representing many spurious paths. Which is
preferable depends on the uses to which the CFG representation will be put.
Even the representation of explicit control flow may differ depending on the uses to
which a model is put. In Figure 5.3, the for statement has been broken into its
constituent parts (initialization, comparison, and increment for next iteration), each of
which appears at a different point in the control flow. For some kinds of analysis, this
breakdown would serve no useful purpose. Similarly, a complex conditional expression in
Java or C is executed by "short-circuit" evaluation, so the single expression i > 0 && i < 10
can be broken across two basic blocks (the second test is not executed if the first
evaluates to false). If this fine level of execution detail is not relevant to an analysis, we
may choose to ignore short-circuit evaluation and treat the entire conditional expression
as if it were fully evaluated.
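As a small illustration, the two methods below behave identically, but the first relies on short-circuit evaluation and therefore spans two basic blocks in a fine-grained control flow graph, while the second makes the internal branch explicit:

    class ShortCircuitDemo {
        static void doSomething() { System.out.println("in range"); }

        static void shortCircuitForm(int i) {
            // "i < 10" is evaluated only if "i > 0" holds, so a fine-grained CFG
            // splits this condition across two basic blocks.
            if (i > 0 && i < 10) {
                doSomething();
            }
        }

        static void explicitForm(int i) {
            // Equivalent control flow with the internal branch made explicit.
            if (i > 0) {
                if (i < 10) {
                    doSomething();
                }
            }
        }

        public static void main(String[] args) {
            shortCircuitForm(5);   // prints "in range"
            explicitForm(5);       // prints "in range"
        }
    }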
Figure 5.6: Overapproximation in a call graph. Although the method A.check() can
never actually call C.foo(), a typical call graph construction will include it as a possible
call.
If a call graph model represents different behaviors of a procedure depending on where
the procedure is called, we call it context-sensitive. For example, a context-sensitive
model of collapseNewlines might distinguish between one call in which the argument
string cannot possibly be empty, and another in which it could be. Context-sensitive
analyses can be more precise than context-insensitive analyses when the model includes
some additional information that is shared or passed among procedures. Information not
only about the immediate calling context, but about the entire chain of procedure calls
may be needed, as illustrated in Figure 5.7. In that case the cost of context-sensitive
analysis depends on the number of paths from the root (main program) to each lowest
level procedure. The number of paths can be exponentially larger than the number of
procedures, as illustrated in Figure 5.8.
Figure 5.7: The Java code above can be represented by the context-insensitive call
graph at left. However, to capture the fact that method depends never attempts to
store into a nonexistent array element, it is necessary to represent parameter values
that differ depending on the context in which depends is called, as in the context-sensitive call graph on the right.
Figure 5.8: The number of paths in a call graph - and therefore the number of calling
contexts in a context-sensitive analysis - can be exponentially larger than the number
of procedures, even without recursion.
public class C {

    public static C cFactory(String kind) {
        if (kind == "C") return new C();
        if (kind == "S") return new S();
        return null;
    }

    void foo() {
        System.out.println("You called the parent's method");
    }

    public static void main(String args[]) {
        (new A()).check();
    }
}

class S extends C {
    void foo() {
        System.out.println("You called the child's method");
    }
}

class A {
    void check() {
        C myC = C.cFactory("S");
        myC.foo();
    }
}
The Java compiler uses a typical call graph model to enforce the language rule that all
checked exceptions are either handled or declared in each method. The throws clauses
in a method declaration are provided by the programmer, but if they were not, they
would correspond exactly to the information that a context-insensitive analysis of
exception propagation would associate with each procedure (which is why the compiler
can check for completeness and complain if the programmer omits an exception that can
be thrown).
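A minimal illustration of the rule being enforced (the class and method names are hypothetical): the summary the compiler uses for readFirstLine is exactly its throws clause, so each caller must either handle IOException or declare it in turn.

    import java.io.*;

    class ExceptionPropagationDemo {
        // The context-insensitive summary of this method, for the compiler's purposes,
        // is exactly its throws clause: it may propagate IOException.
        static String readFirstLine(String path) throws IOException {
            try (BufferedReader r = new BufferedReader(new FileReader(path))) {
                return r.readLine();
            }
        }

        // A caller must either declare IOException as well ...
        static String firstLineOrDeclare(String path) throws IOException {
            return readFirstLine(path);
        }

        // ... or handle it; otherwise the program does not compile.
        static String firstLineOrDefault(String path) {
            try {
                return readFirstLine(path);
            } catch (IOException e) {
                return "";
            }
        }
    }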
Figure 5.9: Finite state machine (Mealy machine) description of line-end conversion
procedure, depicted as a state transition diagram (top) and as a state transition table
(bottom). An omission is obvious in the tabular representation, but easy to overlook in
the state transition diagram.
There are three kinds of correctness relations that we may reason about with respect to
finite state machine models, illustrated in Figure 5.10. The first is internal properties,
such as completeness and determinism. Second, the possible executions of a model,
described by paths through the FSM, may satisfy (or not) some desired property. Third,
the finite state machine model should accurately represent possible behaviors of the
program. Equivalently, the program should be a correct implementation of the finite state
machine model. We will consider each of the three kinds of correctness relation in turn
with respect to the FSM model of Figure 5.9.
Figure 5.10: Correctness relations for a finite state machine model. Consistency and
completeness are internal properties, independent of the program or a higher-level
specification. If, in addition to these internal properties, a model accurately represents
a program and satisfies a higher-level specification, then by definition the program
itself satisfies the higher-level specification.
The graph on the right is called the dual of the graph on the left. Taking the dual of
the graph on the right, one obtains again the graph on the left.
The choice between associating nodes or edges with computations performed by a
program is only a matter of convention and convenience, and is not an important
difference between CFG and FSM models. In fact, aside from this minor difference in
customary presentation, the control flow graph is a particular kind of finite state
machine model in which the abstract states preserve some information about control
flow (program regions and their execution order) and elide all other information about
program state.
Many details are purposely omitted from the FSM model depicted in Figure 5.9, but it is
also incomplete in an undesirable way. Normally, we require a finite state machine
specification to be complete in the sense that it prescribes the allowed behavior(s) for
any possible sequence of inputs or events. For the line-end conversion specification, the
state transition diagram does not include a transition from state l on carriage return; that
is, it does not specify what the program should do if it encounters a carriage return
immediately after a line feed.
An alternative representation of finite state machines, including Mealy machines, is the
state transition table, also illustrated in Figure 5.9. There is one row in the transition
table for each state node and one column for each event or input. If the FSM is
complete and deterministic, there should be exactly one transition in each table entry.
Since this table is for a Mealy machine, the transition in each table entry indicates both
the next state and the response (e.g., d / emit means "emit and then proceed to state
d"). The omission of a transition from state l on a carriage return is glaringly obvious
when the state transition diagram is written in tabular form.
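A minimal sketch of how the tabular form lends itself to mechanical checking, assuming the transition table is represented as a map (the Java names and the partially filled table below are illustrative assumptions): any (state, input) pair without an entry is reported as an omission.

    import java.util.*;

    /** Sketch: a Mealy machine as an explicit transition table, with a completeness check. */
    class MealyTableDemo {
        enum State { E, W, L, D }              // abstract states e, w, l, d of the line-end conversion FSM
        enum Input { CR, LF, EOF, OTHER }

        /** A transition names the next state and the output action (e.g., "emit", "append"). */
        record Transition(State next, String output) {}

        // Only a few entries are filled in here; the point is the check below, not the full table.
        static final Map<State, Map<Input, Transition>> TABLE = new EnumMap<>(State.class);
        static {
            Map<Input, Transition> fromE = new EnumMap<>(Input.class);
            fromE.put(Input.LF, new Transition(State.E, "emit"));
            fromE.put(Input.CR, new Transition(State.L, "emit"));
            TABLE.put(State.E, fromE);
        }

        /** Reports every (state, input) pair with no transition - omissions that are easy to miss in a diagram. */
        static List<String> missingEntries() {
            List<String> missing = new ArrayList<>();
            for (State s : State.values()) {
                for (Input in : Input.values()) {
                    Map<Input, Transition> row = TABLE.getOrDefault(s, Map.of());
                    if (!row.containsKey(in)) missing.add(s + " on " + in);
                }
            }
            return missing;
        }

        public static void main(String[] args) {
            System.out.println("Missing transitions: " + missingEntries());
        }
    }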
[Table omitted: the state abstraction function, mapping abstract states such as w (Within line) and d (Done) to concrete program states, identified by program locations and variable values in the procedure of Figure 5.11.]
[Figure 5.11 listing omitted.]
Figure 5.11: Procedure to convert among DOS, Unix, and Macintosh line ends.
[State transition table: one row for each state (e, w, l, d) and one column for each input (LF, CR, EOF, other); each entry gives the next state and the output, for example e / emit or w / append.]
Figure 5.12: Completed finite state machine (Mealy machine) description of line-end
conversion procedure, depicted as a state-transition table (bottom). The omitted
transition in Figure 5.9 has been added.
With this state abstraction function, we can check conformance between the source
code and each transition in the FSM. For example, the transition from state e to state l
is interpreted to mean that, if execution is at the head of the loop with pos equal to zero
and atCR also zero (corresponding to state e), and the next character encountered is a
carriage return, then the program should perform operations corresponding to the emit
action and then enter a state in which pos is zero and atCR is 1 (corresponding to state
l). It is easy to verify that this transition is implemented correctly. However, if we
examine the transition from state l to state w, we will discover that the code does not
correspond because the variable atCR is not reset to zero, as it should be. If the
program encounters a carriage return, then some text, and then a line feed, the line feed
will be discarded - a program fault.
The fault in the conversion program was actually detected by the authors through
testing, and not through manual verification of correspondence between each transition
and program source code. Making the abstraction function explicit was nonetheless
important to understanding the nature of the error and how to repair it.
Summary
Models play many of the same roles in software development as in engineering of other
kinds of artifacts. Models must be much simpler than the artifacts they describe, but
must preserve enough essential detail to be useful in making choices. For models of
software execution, this means that a model must abstract away enough detail to
represent the potentially infinite set of program execution states by a finite and suitably
compact set of model states.
Some models, such as control flow graphs and call graphs, can be extracted from
programs. The key trade-off for these extracted models is precision (retaining enough
information to be predictive) versus the cost of producing and storing the model. Other
models, including many finite state machine models, may be constructed before the
program they describe, and serve as a kind of intermediate-level specification of
intended behavior. These models can be related to both a higher-level specification of
intended behavior and the actual program they are intended to describe.
The relation between finite state models and programs is elaborated in Chapter 6.
Analysis of models, particularly those involving concurrent execution, is described further
in Chapter 8.
Further Reading
Finite state models of computation have been studied at least since the neural models of
McCulloch and Pitts [MP43], and modern finite state models of programs remain close
to those introduced by Mealy [Mea55] and Moore [Moo56]. Lamport [Lam89] provides
the clearest and most accessible introduction the authors know regarding what a finite
state machine model "means" and what it means for a program to conform to it. Guttag
[Gut77] presents an early explication of the abstraction relation between a model and a
program, and why the abstraction function goes from concrete to abstract and not vice
versa. Finite state models have been particularly important in development of reasoning
and tools for concurrent (multi-threaded, parallel, and distributed) systems; Pezzè,
Taylor, and Young [PTY95] overview finite models of concurrent programs.
Exercises
5.1 We construct large, complex software systems by breaking them into manageable pieces. Likewise, models of software systems may be decomposed into more manageable pieces. Briefly describe how the requirements of model compactness, predictiveness, semantic meaningfulness, and sufficient generality apply to approaches for modularizing models of programs. Give examples where possible.
5.2 Models are used in analysis, but construction of models from programs often requires some form of analysis. Why bother, then? If one is performing an initial analysis to construct a model to perform a subsequent analysis, why not just merge the initial and subsequent analysis and dispense with defining and constructing the model? For example, if one is analyzing Java code to construct a call graph and class hierarchy that will be used to detect overriding of inherited methods, why not just analyze the source code directly for method overriding?
5.3 Linear code sequence and jump (LCSAJ) makes a distinction between "sequential" control flow and other control flow. Control flow graphs, on the other hand, make no distinction between sequential and nonsequential control flow. Considering the criterion of model predictiveness, is there a justification for this distinction?
5.4 What upper bound can you place on the number of basic blocks in a program, relative to program size?
5.5 A directed graph is a set of nodes and a set of directed edges. A mathematical relation is a set of ordered pairs.
    1. If we consider a directed graph as a representation of a relation, can we ever have two distinct edges from one node to another?
As with other abstraction functions used in reasoning about programs, the mapping is
from concrete representation to abstract representation, and not from abstract to
concrete. This is because the mapping from concrete to abstract is many-to-one, and its
inverse is therefore not a mathematical function (which by definition maps each object in
the domain set into a single object in the range).
1  public int gcd(int x, int y) { /* A: def x, y */
2      int tmp;                   /*    def tmp   */
3      while (y != 0) {           /* B: use y     */
4          tmp = x % y;           /* C: def tmp; use x, y */
5          x = y;                 /* D: def x; use y      */
6          y = tmp;               /* E: def y; use tmp    */
7      }
8      return x;                  /* F: use x */
9  }
Figure 6.1: Java implementation of Euclid's algorithm for calculating the greatest
common divisor of two positive integers. The labels A-F are provided to relate
statements in the source code to graph nodes in subsequent figures.
Each definition-use pair associates a definition of a variable (e.g., the assignment to y in
line 6) with a use of the same variable (e.g., the expression y!=0 in line 3). A single
definition can be paired with more than one use, and vice versa. For example, the
definition of variable y in line 6 is paired with a use in line 3 (in the loop test), as well as
additional uses in lines 4 and 5. The definition of x in line 5 is associated with uses in
lines 4 and 8.
A definition-use pair is formed only if there is a program path on which the value
assigned in the definition can reach the point of use without being overwritten by another
value. If there is another assignment to the same value on the path, we say that the first
definition is killed by the second. For example, the declaration of tmp in line 2 is not
paired with the use of tmp in line 6 because the definition at line 2 is killed by the
definition at line 4. A definition-clear path is a path from definition to use on which the
definition is not killed by another definition of the same variable. For example, with
reference to the node labels in Figure 6.2, path E,B,C,D is a definition-clear path from
the definition of y in line 6 (node E of the control flow graph) to the use of y in line 5
(node D). Path A,B,C,D,E is not a definition-clear path with respect to tmp because of
the intervening definition at node C.
Figure 6.3: Data dependence graph of GCD method in Figure 6.1, with nodes for
statements corresponding to the control flow graph in Figure 6.2. Each directed edge
represents a direct data dependence, and the edge label indicates the variable that
transmits a value from the definition at the head of the edge to the use at the tail of
the edge.
The data dependence graph in Figure 6.3 captures only dependence through flow of
data. Dependence of the body of the loop on the predicate governing the loop is not
represented by data dependence alone. Control dependence can also be represented
with a graph, as in Figure 6.5, which shows the control dependencies for the GCD
method. The control dependence graph shows direct control dependencies, that is,
where execution of one statement controls whether another is executed. For example,
execution of the body of a loop or if statement depends on the result of a predicate.
Control dependence differs from the sequencing information captured in the control flow
graph. The control flow graph imposes a definite order on execution even when two
statements are logically independent and could be executed in either order with the
same results. If a statement is control- or data-dependent on another, then their order of
execution is not arbitrary. Program dependence representations typically include both
data dependence and control dependence information in a single graph with the two
kinds of information appearing as different kinds of edges among the same set of nodes.
A node in the control flow graph that is reached on every execution path from entry point
to exit is control dependent only on the entry point. For any other node N, reached on
some but not all execution paths, there is some branch that controls execution of N in the
sense that, depending on which way execution proceeds from the branch, execution of N
either does or does not become inevitable. It is this notion of control that control
dependence captures.
The notion of dominators in a rooted, directed graph can be used to make this intuitive
notion of "controlling decision" precise. Node M dominates node N if every path from the
root of the graph to N passes through M. A node will typically have many dominators,
but except for the root, there is a unique immediate dominator of node N, which is
closest to N on any path from the root and which is in turn dominated by all the other
dominators of N. Because each node (except the root) has a unique immediate
dominator, the immediate dominator relation forms a tree.
The point at which execution of a node becomes inevitable is related to paths from a
node to the end of execution - that is, to dominators that are calculated in the reverse of
the control flow graph, using a special "exit" node as the root. Dominators in this
direction are called post-dominators, and dominators in the normal direction of execution
can be called pre-dominators for clarity.
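A minimal illustrative sketch of the standard iterative computation of dominator sets, assuming a simple map-based graph representation (the names are assumptions); running the same computation on the reverse graph, with the exit node as root, yields post-dominators.

    import java.util.*;

    /** Sketch: iterative computation of dominator sets, Dom(n), for a rooted directed graph. */
    class Dominators {
        /** pred maps each node to its predecessors; every node of the graph must appear as a key. */
        static <N> Map<N, Set<N>> compute(Map<N, Set<N>> pred, N root) {
            Set<N> all = pred.keySet();
            Map<N, Set<N>> dom = new HashMap<>();
            for (N n : all) dom.put(n, new HashSet<>(all));   // start with "dominated by everything"
            dom.put(root, new HashSet<>(Set.of(root)));       // the root is dominated only by itself
            boolean changed = true;
            while (changed) {
                changed = false;
                for (N n : all) {
                    if (n.equals(root)) continue;
                    // Dom(n) = {n} ∪ intersection of Dom(p) over all predecessors p of n
                    Set<N> newDom = null;
                    for (N p : pred.get(n)) {
                        newDom = (newDom == null) ? new HashSet<>(dom.get(p)) : intersect(newDom, dom.get(p));
                    }
                    if (newDom == null) newDom = new HashSet<>();
                    newDom.add(n);
                    if (!newDom.equals(dom.get(n))) { dom.put(n, newDom); changed = true; }
                }
            }
            return dom;
        }

        private static <N> Set<N> intersect(Set<N> a, Set<N> b) {
            Set<N> r = new HashSet<>(a); r.retainAll(b); return r;
        }
    }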
We can use post-dominators to give a more precise definition of control dependence.
Consider again a node N that is reached on some but not all execution paths. There
must be some node C with the following property: C has at least two successors in the
control flow graph (i.e., it represents a control flow decision); C is not post-dominated by
N (N is not already inevitable when C is reached); and there is a successor of C in the
control flow graph that is post-dominated by N. When these conditions are true, we say
node N is control-dependent on node C. Figure 6.4 illustrates the control dependence
calculation for one node in the GCD example, and Figure 6.5 shows the control
dependence relation for the method as a whole.
Figure 6.4: Calculating control dependence for node E in the control flow graph of the
GCD method. Nodes C, D, and E in the gray region are post-dominated by E; that is,
execution of E is inevitable in that region. Node B has successors both within and
outside the gray region, so it controls whether E is executed; thus E is
control-dependent on B.
Figure 6.5: Control dependence tree of the GCD method. The loop test and the
return statement are reached on every possible execution path, so they are control-dependent only on the entry point. The statements within the loop are control-dependent on the loop test.
This rule can be broken down into two parts to make it a little more intuitive and more
efficient to implement. The first part describes how node E receives values from its
predecessor D, and the second describes how it modifies those values for its
successors:
In this form, we can easily express what should happen at the head of the while loop
(node B in Figure 6.2), where values may be transmitted both from the beginning of the
procedure (node A) and through the end of the body of the loop (node E). The beginning
of the procedure (node A) is treated as an initial definition of parameters and local
variables. (If a local variable is declared but not initialized, it is treated as a definition to
the special value "uninitialized.")
Remarkably, the reaching definitions can be calculated simply and efficiently, first
initializing the reaching definitions at each node in the control flow graph to the empty
set, and then applying these equations repeatedly until the results stabilize. The
algorithm is given as pseudocode in Figure 6.6.
Algorithm Reaching definitions
Input: A control flow graph G = (nodes, edges); pred(n), succ(n), gen(n), and kill(n) for each node n.
Output: Reach(n) = the reaching definitions at each node n.
for n ∈ nodes loop ReachOut(n) = {} ; end loop;
workList = nodes ;
while (workList ≠ {}) loop
    n = any node in workList ; workList = workList \ {n} ;
    oldVal = ReachOut(n) ;
    Reach(n) = ∪ { ReachOut(m) | m ∈ pred(n) } ;
    ReachOut(n) = (Reach(n) \ kill(n)) ∪ gen(n) ;
    if ( ReachOut(n) ≠ oldVal ) then
        // Propagate changed value to successor nodes
        workList = workList ∪ succ(n) ;
    end if;
end loop;
Figure 6.6: An iterative work-list algorithm to compute reaching definitions by
applying each flow equation until the solution stabilizes.
The similarity to the set equations for reaching definitions is striking. Both propagate
sets of values along the control flow graph in the direction of program execution (they
are forward analyses), and both combine sets propagated along different control flow
paths. However, reaching definitions combines propagated sets using set union, since a
definition can reach a use along any execution path. Available expressions combines
propagated sets using set intersection, since an expression is considered available at a
node only if it reaches that node along all possible execution paths. Thus we say that,
while reaching definitions is a forward, any-path analysis, available expressions is a
forward, all-paths analysis. A work-list algorithm to implement available expressions
analysis is nearly identical to that for reaching definitions, except for initialization and the
flow equations, as shown in Figure 6.7.
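To make the shared pattern concrete, the following is a minimal illustrative sketch of a forward work-list analysis parameterized by the merge operator, assuming a simple map-based CFG representation (all names here are assumptions): set union gives an any-path analysis in the style of reaching definitions, while set intersection, with Out sets initialized to the full token set, gives an all-paths analysis in the style of available expressions.

    import java.util.*;
    import java.util.function.BinaryOperator;

    /** Sketch of a generic forward work-list data flow analysis over a control flow graph. */
    class ForwardAnalysis<N, T> {
        private final Map<N, Set<N>> pred, succ;          // CFG structure; pred maps every node (entry -> empty set)
        private final Map<N, Set<T>> gen, kill;           // per-node gen and kill sets
        private final BinaryOperator<Set<T>> merge;       // union (any-path) or intersection (all-paths)
        private final Set<T> init;                        // initial Out value: {} for union, all tokens for intersection

        ForwardAnalysis(Map<N, Set<N>> pred, Map<N, Set<N>> succ,
                        Map<N, Set<T>> gen, Map<N, Set<T>> kill,
                        BinaryOperator<Set<T>> merge, Set<T> init) {
            this.pred = pred; this.succ = succ; this.gen = gen; this.kill = kill;
            this.merge = merge; this.init = init;
        }

        /** Iterates the flow equations Out(n) = (In(n) \ kill(n)) ∪ gen(n) until they stabilize. */
        Map<N, Set<T>> solve() {
            Map<N, Set<T>> out = new HashMap<>();
            for (N n : pred.keySet()) out.put(n, new HashSet<>(init));
            Deque<N> workList = new ArrayDeque<>(pred.keySet());
            while (!workList.isEmpty()) {
                N n = workList.remove();
                // In(n): merge the Out sets of all predecessors of n.
                Set<T> in = null;
                for (N m : pred.get(n)) {
                    in = (in == null) ? new HashSet<>(out.get(m)) : merge.apply(in, out.get(m));
                }
                if (in == null) in = new HashSet<>();        // entry node: no predecessors
                Set<T> newOut = new HashSet<>(in);
                newOut.removeAll(kill.getOrDefault(n, Set.of()));
                newOut.addAll(gen.getOrDefault(n, Set.of()));
                if (!newOut.equals(out.get(n))) {            // changed: successors must be revisited
                    out.put(n, newOut);
                    workList.addAll(succ.getOrDefault(n, Set.of()));
                }
            }
            return out;
        }

        /** Set union and intersection as merge operators. */
        static <T> BinaryOperator<Set<T>> union() {
            return (a, b) -> { Set<T> r = new HashSet<>(a); r.addAll(b); return r; };
        }
        static <T> BinaryOperator<Set<T>> intersection() {
            return (a, b) -> { Set<T> r = new HashSet<>(a); r.retainAll(b); return r; };
        }
    }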
G, K, and U can be any events we care to check, so long as we can mark their
occurrences in a control flow graph.
An example problem of this kind is variable initialization. We noted in Chapter 3 that Java
requires a variable to be initialized before use on all execution paths. The analysis that
enforces this rule is an instance of Avail. The tokens propagated through the control flow
graph record which variables have been assigned initial values. Since there is no way to
"uninitialize" a variable in Java, the kill sets are empty. Figure 6.8 repeats the source
code of an example program from Chapter 3. The corresponding control flow graph is
shown with definitions and uses in Figure 6.9 and annotated with gen and kill sets for the
initialized variable check in Figure 6.10.
[Figure 6.8: source code of the example program repeated from Chapter 3; listing omitted.]
Figure 6.9: Control flow graph of the source code in Figure 6.8, annotated with
variable definitions and uses.
Figure 6.10: Control flow graph of the source code in Figure 6.8, annotated with gen
and kill sets for checking variable initialization using a forward, all-paths Avail
analysis. (Empty gen and kill sets are omitted.) The Avail set flowing from node G to
node C will be {i,k}, but the Avail set flowing from node B to node C is {i}. The
all-paths analysis intersects these values, so the resulting Avail(C) is {i}. This value
propagates through nodes C and D to node F, which has a use of k as well as a
definition. Since k ∉ Avail(F), a possible use of an uninitialized variable is detected.
Reaching definitions and available expressions are forward analyses; that is, they
propagate values in the direction of program execution. Given a control flow graph
model, it is just as easy to propagate values in the opposite direction, backward from
nodes that represent the next steps in computation. Backward analyses are useful for
determining what happens after an event of interest. Live variables is a backward
analysis that determines whether the value held in a variable may be subsequently used.
Because a variable is considered live if there is any possible execution path on which it is
used, a backward, any-path analysis is used.
A variable is live at a point in the control flow graph if, on some execution path, its
current value may be used before it is changed. Live variables analysis can be
expressed as set equations as before. Where Reach and Avail propagate values to a
node from its predecessors, Live propagates values from the successors of a node. The
gen sets are variables used at a node, and the kill sets are variables whose values are
replaced. Set union is used to combine values from adjacent nodes, since a variable is
live at a node if it is live at any of the succeeding nodes.
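In the gen and kill notation used above, the standard form of these equations is (reconstructed here; note the use of set union over successors, reflecting the backward, any-path character of the analysis):

    LiveOut(n) = ∪ { Live(m) | m ∈ succ(n) }
    Live(n) = gen(n) ∪ (LiveOut(n) \ kill(n))

A backward, all-paths analysis would instead combine successor values with set intersection.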
These set equations can be implemented using a work-list algorithm analogous to those
already shown for reaching definitions and available expressions, except that successor
edges are followed in place of predecessors and vice versa.
Like available expressions analysis, live variables analysis is of interest in testing and
analysis primarily as a pattern for recognizing properties of a certain form. A backward,
any-paths analysis allows us to check properties of the following form:
"After D occurs, there is at least one execution path on which G occurs with no
intervening occurrence of K."
Again we choose tokens that represent properties, using gen sets to mark occurrences
of G events (where a property becomes true) and kill sets to mark occurrences of K
events (where a property ceases to be true).
One application of live variables analysis is to recognize useless definitions, that is,
assigning a value that can never be used. A useless definition is not necessarily a
program error, but is often symptomatic of an error. In scripting languages like Perl and
Python, which do not require variables to be declared before use, a useless definition
typically indicates that a variable name has been misspelled, as in the common gateway
interface (CGI) script of Figure 6.11.
class SampleForm(FormData):
    """ Used with Python cgi module
        to hold and validate data """
[remainder of the Figure 6.11 listing omitted]
Figure 6.11: Part of a CGI program (Web form processing) in Python. The
misspelled variable name in the data validation method will be implicitly declared and
will not be rejected by the Python compiler or interpreter, which could allow invalid
data to be treated as valid. The classic live variables data flow analysis can show that
the assignment to valid is a useless definition, suggesting that the programmer
probably intended to assign the value to a different variable.
We have so far seen a forward, any-path analysis (reaching definitions), a forward, all-paths analysis (available expressions), and a backward, any-path analysis (live variables).
One might expect, therefore, to round out the repertoire of patterns with a backward,
all-paths analysis, and this is indeed possible. Since there is no classical name for this
combination, we will call it "inevitability" and use it for properties of the form
"After D occurs, G always occurs with no intervening occurrence of K"
or, informally,
"D inevitably leads to G before K"
Examples of inevitability checks might include ensuring that interrupts are reenabled after
executing an interrupt-handling routine in low-level code, files are closed after opening
them, and so on.
The gen set at a program point would include any variable that is assigned a tainted value at that point. Sets of tainted variables would
be propagated forward to a node from its predecessors, with set union where a node in
the control flow graph has more than one predecessor (e.g., the head of a loop).
There is one fundamental difference between such an analysis and the classic data flow
analyses we have seen so far: The gen and kill sets associated with a program point are
not constants. Whether or not the value assigned to a variable is tainted (and thus
whether the variable belongs in the gen set or in the kill set) depends on the set of
tainted variables at that program point, which will vary during the course of the analysis.
There is a kind of circularity here - the gen set and kill set depend on the set of tainted
variables, and the set of tainted variables may in turn depend on the gen and kill set.
Such circularities are common in defining flow analyses, and there is a standard
approach to determining whether they will make the analysis unsound. To convince
ourselves that the analysis is sound, we must show that the output values computed by
each flow equation are monotonically increasing functions of the input values. We will
say more precisely what "increasing" means below.
The determination of whether a computed value is tainted will be a simple function of the
set of tainted variables at a program point. For most operations of one or more
arguments, the output is tainted if any of the inputs are tainted. As in Perl, we may
designate one or a few operations (operations used to check an input value for validity)
as taint removers. These special operations always return an untainted value regardless
of their inputs.
Suppose we evaluate the taintedness of an expression with the input set of tainted
variables being {a,b}, and again with the input set of tainted variables being {a,b,c}. Even
without knowing what the expression is, we can say with certainty that if the expression
is tainted in the first evaluation, it must also be tainted in the second evaluation, in which
the set of tainted input variables is larger. This also means that adding elements to the
input tainted set can only add elements to the gen set for that point, or leave it the
same, and conversely the kill set can only grow smaller or stay the same. We say that
the computation of tainted variables at a point increases monotonically.
To be more precise, the monotonicity argument is made by arranging the possible
values in a lattice. In the sorts of flow analysis framework considered here, the lattice is
almost always made up of subsets of some set (the set of definitions, or the set of
tainted variables, etc.); this is called a powerset lattice because the powerset of set A is
the set of all subsets of A. The bottom element of the lattice is the empty set, the top is
the full set, and lattice elements are ordered by inclusion as in Figure 6.12. If we can
follow the arrows in a lattice from element x to element y (e.g., from {a} to {a,b,c}), then
we say y > x. A function f is monotonically increasing if y ≥ x implies f(y) ≥ f(x).
Figure 6.12: The powerset lattice of set {a,b,c}. The powerset contains all subsets of
the set and is ordered by set inclusion.
Not only are all of the individual flow equations for taintedness monotonic in this sense,
but in addition the function applied to merge values where control flow paths come
together is also monotonic: if A ⊆ B, then A ∪ C ⊆ B ∪ C (and likewise A ∩ C ⊆ B ∩ C).
If we have a set of data flow equations that is monotonic in this sense, and if we begin
by initializing all values to the bottom element of the lattice (the empty set in this case),
then we are assured that an iterative data flow analysis will converge on a unique
minimum solution to the flow equations.
The standard data flow analyses for reaching definitions, live variables, and available
expressions can all be justified in terms of powerset lattices. In the case of available
expressions, though, and also in the case of other all-paths analyses such as the one we
have called "inevitability," the lattice must be flipped over, with the empty set at the top
and the set of all variables or propositions at the bottom. (This is why we used the set of
all tokens, rather than the empty set, to initialize the Avail sets in Figure 6.7.)
a[i] = 13;
k = a[j];
Are these two lines a definition-use pair? They are if the values of i and j are equal,
which might be true on some executions and not on others. A static analysis cannot, in
general, determine whether they are always, sometimes, or never equal, so a source of
imprecision is necessarily introduced into data flow analysis.
Pointers and object references introduce the same issue, often in less obvious ways.
Consider the following snippet:
    a[2] = 42;
    i = b[2];
It seems that there cannot possibly be a definition-use pair involving these two lines,
since they involve none of the same variables. However, arrays in Java are dynamically
allocated objects accessed through pointers. Pointers of any kind introduce the
possibility of aliasing, that is, of two different names referring to the same storage
location. For example, the two lines above might have been part of the following
program fragment:
    int[] a = new int[3];
    int[] b = a;
    a[2] = 42;
    i = b[2];
Here a and b are aliases, two different names for the same dynamically allocated array
object, and an assignment to part of a is also an assignment to part of b.
The same phenomenon, and worse, appears in languages with lower-level pointer
manipulation. Perhaps the most egregious example is pointer arithmetic in C:
    p = &b;
    *(p + i) = k;
It is impossible to know which variable is defined by the second line. Even if we know
the value of i, the result is dependent on how a particular compiler arranges variables in
memory.
Dynamic references and the potential for aliasing introduce uncertainty into data flow
analysis. In place of a definition or use of a single variable, we may have a potential
definition or use of a whole set of variables or locations that could be aliases of each
other. The proper treatment of this uncertainty depends on the use to which the analysis
will be put. For example, if we seek strong assurance that v is always initialized before it
is used, we may not wish to treat an assignment to a potential alias of v as initialization,
but we may wish to treat a use of a potential alias of v as a use of v.
A useful mental trick for thinking about treatment of aliases is to translate the uncertainty
introduced by aliasing into uncertainty introduced by control flow. After all, data flow
analysis already copes with uncertainty about which potential execution paths will
actually be taken; an infeasible path in the control flow graph may add elements to an
any-paths analysis or remove results from an all-paths analysis. It is usually appropriate
to treat uncertainty about aliasing consistently with uncertainty about control flow. For
example, considering again the first example of an ambiguous reference:

    a[i] = 13;
    k = a[j];

we can imagine rewriting it so that the possibility of aliasing is expressed explicitly as a control flow choice:

    a[i] = 13;
    if (i == j) {
        k = a[i];
    } else {
        k = a[j];
    }
In the (imaginary) transformed code, we could treat all array references as distinct,
because the possibility of aliasing is fully expressed in control flow. Now, if we are using
an any-path analysis like reaching definitions, the potential aliasing will result in creating
a definition-use pair. On the other hand, an assignment to a[j] would not kill a previous
assignment to a[i]. This suggests that, for an any-path analysis, gen sets should include
everything that might be referenced, but kill sets should include only what is definitely
referenced.
If we were using an all-paths analysis, like available expressions, we would obtain a
different result. Because the sets of available expressions are intersected where control
flow merges, a definition of a[i] would make only that expression, and none of its
potential aliases, available. On the other hand, an assignment to a[j] would kill a[i]. This
suggests that, for an all-paths analysis, gen sets should include only what is definitely
referenced, but kill sets should include all the possible aliases.
Even in analysis of a single procedure, the effect of other procedures must be
considered at least with respect to potential aliases. Consider, for example, this
fragment of a Java method:
[Code fragment omitted: a Java method whose parameters fromCust and toCust are references to CustInfo objects and which obtains four PhoneNum references, among them fromHome and fromWork.]
We cannot determine whether the two arguments fromCust and toCust are references
to the same object without looking at the context in which this method is called.
Moreover, we cannot determine whether fromHome and fromWork are (or could be)
references to the same object without more information about how CustInfo objects are
treated elsewhere in the program.
Sometimes it is sufficient to treat all nonlocal information as unknown. For example, we
could treat the two CustInfo objects as potential aliases of each other, and similarly treat
the four PhoneNum objects as potential aliases. Sometimes, though, large sets of
aliases will result in analysis results that are so imprecise as to be useless. Therefore
data flow analysis is often preceded by an interprocedural analysis to calculate sets of
aliases or the locations that each pointer or reference can refer to.
Figure 6.13: Spurious execution paths result when procedure calls and returns are
treated as normal edges in the control flow graph. The path (A,X,Y,D) appears in the
combined graph, but it does not correspond to an actual execution order.
It is possible to represent procedure calls and returns precisely, for example by making
a copy of the called procedure for each point at which it is called. This would result in a
context-sensitive analysis. The shortcoming of context-sensitive analysis was already
mentioned in the previous chapter: The number of different contexts in which a
procedure must be considered could be exponentially larger than the number of
procedures. In practice, a context-sensitive analysis can be practical for a small group of
closely related procedures (e.g., a single Java class), but is almost never a practical
option for a whole program.
Some interprocedural properties are quite independent of context and lend themselves
naturally to analysis in a hierarchical, piecemeal fashion. Such a hierarchical analysis can
be both precise and efficient. The analyses that are provided as part of normal
compilation are often of this sort. The unhandled exception analysis of Java is a good
example: Each procedure (method) is required to declare the exceptions that it may
throw without handling. If method M calls method N in the same or another class, and if
N can throw some exception, then M must either handle that exception or declare that it,
too, can throw the exception. This analysis is simple and efficient because, when
analyzing method M, the internal structure of N is irrelevant; only the results of the
analysis at N (which, in Java, is also part of the signature of N) are needed.
Two conditions are necessary to obtain an efficient, hierarchical analysis like the
exception analysis routinely carried out by Java compilers. First, the information needed
to analyze a calling procedure must be small: It must not be proportional either to the
size of the called procedure, or to the number of procedures that are directly or
indirectly called. Second, it is essential that information about the called procedure be
independent of the caller; that is, it must be context-independent. When these two
conditions are true, it is straightforward to develop an efficient analysis that works
upward from leaves of the call graph. (When there are cycles in the call graph from
recursive or mutually recursive procedures, an iterative approach similar to data flow
analysis algorithms can usually be devised.)
Unfortunately, not all important properties are amenable to hierarchical analysis.
Potential aliasing information, which is essential to data flow analysis even within
individual procedures, is one of those that are not. We have seen that potential aliasing
can depend in part on the arguments passed to a procedure, so it does not have the
context-independence property required for an efficient hierarchical analysis. For such
an analysis, additional sacrifices of precision must be made for the sake of efficiency.
Even when a property is context-dependent, an analysis for that property may be
context-insensitive, although the context-insensitive analysis will necessarily be less
precise as a consequence of discarding context information. At the extreme, a linear
time analysis can be obtained by discarding both context and control flow information.
Context- and flow-insensitive algorithms for pointer analysis typically treat each
statement of a program as a constraint. For example, on encountering an assignment
    x = y;

where y is a pointer, such an algorithm simply notes that x may refer to any of the same
objects that y may refer to. References(x) ⊇ References(y) is a constraint that is
completely independent of the order in which statements are executed. A procedure call,
in such an analysis, is just an assignment of values to arguments. Using efficient data
structures for merging sets, some analyzers can process hundreds of thousands of lines
of source code in a few seconds. The results are imprecise, but still much better than
the worst-case assumption that any two compatible pointers might refer to the same
object.
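A minimal illustrative sketch of this constraint view, assuming a toy representation of variables and abstract objects (the names are assumptions, and this is far simpler than published algorithms such as Steensgaard's): each assignment contributes a subset constraint, and the constraints are applied repeatedly until nothing changes.

    import java.util.*;

    /** Sketch: flow-insensitive points-to analysis as subset constraints References(x) ⊇ References(y). */
    class PointsToSketch {
        // Constraint "lhs ⊇ rhs", generated from an assignment lhs = rhs (or a call binding an argument to a parameter).
        record Constraint(String lhs, String rhs) {}

        static Map<String, Set<String>> solve(List<Constraint> constraints,
                                              Map<String, Set<String>> initial) {
            Map<String, Set<String>> refs = new HashMap<>();
            initial.forEach((v, objs) -> refs.put(v, new HashSet<>(objs)));
            boolean changed = true;
            while (changed) {                       // iterate until no constraint adds anything new
                changed = false;
                for (Constraint c : constraints) {
                    Set<String> lhs = refs.computeIfAbsent(c.lhs(), k -> new HashSet<>());
                    Set<String> rhs = refs.getOrDefault(c.rhs(), Set.of());
                    if (lhs.addAll(rhs)) changed = true;
                }
            }
            return refs;
        }

        public static void main(String[] args) {
            // y may refer to object o1 (e.g., from "y = new T()"); then "x = y" adds the constraint x ⊇ y.
            Map<String, Set<String>> initial = Map.of("y", Set.of("o1"));
            List<Constraint> constraints = List.of(new Constraint("x", "y"));
            System.out.println(solve(constraints, initial));   // both x and y may refer to o1
        }
    }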
The best approach to interprocedural pointer analysis will often lie somewhere between
the astronomical expense of a precise, context- and flow-sensitive pointer analysis and
the imprecision of the fastest context- and flow-insensitive analyses. Unfortunately, there
is not one best algorithm or tool for all uses. In addition to context and flow sensitivity,
important design trade-offs include the granularity of modeling references (e.g., whether
individual fields of an object are distinguished) and the granularity of modeling the
program heap (that is, which allocated objects are distinguished from each other).
Summary
Data flow models are used widely in testing and analysis, and the data flow analysis
algorithms used for deriving data flow information can be adapted to additional uses.
The most fundamental model, complementary to models of control flow, represents the
ways values can flow from the points where they are defined (computed and stored) to
points where they are used.
Data flow analysis algorithms efficiently detect the presence of certain patterns in the
control flow graph. Each pattern involves some nodes that initiate the pattern and some
that conclude it, and some nodes that may interrupt it. The name "data flow analysis"
reflects the historical development of these analyses for compilers, but the same
algorithms can be used to detect many other patterns of events along control flow paths.
An any-path analysis determines whether there is any control flow path from the initiation
to the conclusion of a pattern without passing through an interruption. An all-paths
analysis determines whether every path from the initiation necessarily reaches a
concluding node without first passing through an interruption. Forward analyses check
for paths in the direction of execution, and backward analyses check for paths in the
opposite direction. The classic data flow algorithms can all be implemented using simple
work-list algorithms.
A limitation of data flow analysis, whether for the conventional purpose or to check other
properties, is that it cannot distinguish between a path that can actually be executed and
a path in the control flow graph that cannot be followed in any execution. A related
limitation is that it cannot always determine whether two names or expressions refer to
the same object.
Fully detailed data flow analysis is usually limited to individual procedures or a few
closely related procedures (e.g., a single class in an object-oriented program). Analyses
that span whole programs must resort to techniques that discard or summarize some
information about calling context, control flow, or both. If a property is independent of
calling context, a hierarchical analysis can be both precise and efficient. Potential
aliasing is a property for which calling context is significant. There is therefore a tradeoff between very fast but imprecise alias analysis techniques and more precise but much
more expensive techniques.
Further Reading
Data flow analysis techniques were originally developed for compilers, as a systematic
way to detect opportunities for code-improving transformations and to ensure that those
transformations would not introduce errors into programs (an all-too-common experience
with early optimizing compilers). The compiler construction literature remains an
important source of reference information for data flow analysis, and the classic "Dragon
Book" text [ASU86] is a good starting point.
Fosdick and Osterweil recognized the potential of data flow analysis to detect program
errors and anomalies that suggested the presence of errors more than two decades ago
[FO76]. While the classes of data flow anomaly detected by Fosdick and Osterweil's
system have largely been obviated by modern strongly typed programming languages,
they are still quite common in modern scripting and prototyping languages. Olender and
Osterweil later recognized that the power of data flow analysis algorithms for
recognizing execution patterns is not limited to properties of data flow, and developed a
system for specifying and checking general sequencing properties [OO90, OO92].
Interprocedural pointer analyses - either directly determining potential aliasing relations,
or deriving a "points-to" relation from which aliasing relations can be derived - remain
an area of active research. At one extreme of the cost-versus-precision spectrum of
analyses are completely context- and flow-insensitive analyses like those described by
Steensgaard [Ste96]. Many researchers have proposed refinements that obtain
significant gains in precision at small costs in efficiency. An important direction for future
work is obtaining acceptably precise analyses of a portion of a large program, either
because a whole program analysis cannot obtain sufficient precision at acceptable cost
or because modern software development practices (e.g., incorporating externally
developed components) mean that the whole program is never available in any case.
Rountev et al. present initial steps toward such analyses [RRL99]. A very readable
overview of the state of the art and current research directions (circa 2001) is provided
by Hind [Hin01].
Exercises
For a graph G = (N, V) with a root r ∈ N, node m dominates node n if every path
from r to n passes through m. The root node is dominated only by itself.
The relation can be restated using flow equations.
1. When dominance is restated using flow equations, will it be stated in the
form of an any-path problem or an all-paths problem? Forward or
backward? What are the tokens to be propagated, and what are the gen
and kill sets?
2. Give a flow equation for Dom(n).
3. If the flow equation is solved using an iterative data flow analysis, what
should the set Dom(n) be initialized to at each node n?
4. Implement the iterative data flow analysis as a program that computes the
Dom relation of an input graph. The first line of input to your program is
an integer between 1 and 100 indicating the number k of nodes in the
graph. Each subsequent line of input will consist of two integers, m and n,
representing an edge from node m to node n. Node 0 designates the root,
and all other nodes are designated by integers between 0 and k − 1. The
end of the input is signaled by the pseudo-edge (−1,−1).
The output of your program should be a sequence of lines, each
containing two integers separated by blanks. Each line represents one
edge of the Dom relation of the input graph.
5. The Dom relation itself is not a tree. The immediate dominators relation is
a tree. Write flow equations to calculate immediate dominators, and then
modify the program from part (4) to compute the immediate dominance
relation.
Overview
Symbolic execution builds predicates that characterize the conditions
under which execution paths can be taken and the effect of the
execution on program state. Extracting predicates through symbolic
execution is the essential bridge from the complexity of program
behavior to the simpler and more orderly world of logic. It finds
important applications in program analysis, in generating test data,
and in formal verification[1] (proofs) of program correctness.
Conditions under which a particular control flow path is taken can be
determined through symbolic execution. This is useful for identifying
infeasible program paths (those that can never be taken) and paths
that could be taken when they should not. It is fundamental to
generating test data to execute particular parts and paths in a
program.
Deriving a logical representation of the effect of execution is essential in methods that
compare a program's possible behavior to a formal specification. We have noted in earlier
chapters that proving the correctness of a program is seldom an achievable or useful goal.
Nonetheless the basic methods of formal verification, including symbolic execution, underpin
practical techniques in software analysis and testing. Symbolic execution and the techniques
of formal verification find use in several domains:
Rigorous proofs of properties of (small) critical subsystems, such as a safety kernel
of a medical device;
 1 #include <string.h>
 2 /** Binary search for key in sorted array dictKeys, returning
 3  *  corresponding value from dictValues or null if key does
 4  *  not appear in dictKeys. Standard binary search algorithm,
 5  *  as described in any elementary text on data structures and algorithms.
 6  **/
 7
 8 char * binarySearch( char *key, char *dictKeys[ ], char *dictValues[ ],
 9                      int dictSize) {
10
11     int low = 0;
12     int high = dictSize - 1;
13     int mid;
14     int comparison;
15
16     while (high >= low) {
17         mid = (high + low) / 2;
18         comparison = strcmp( dictKeys[mid], key );
19         if (comparison < 0) {
20             /* dictKeys[mid] too small; look higher */
21             low = mid + 1;
22         } else if ( comparison > 0) {
23             /* dictKeys[mid] too large; look lower */
24             high = mid - 1;
25         } else {
26             /* found */
27             return dictValues[mid];
28         }
29     }
30     return 0; /* null means not found */
31 }
32
Figure 7.1: The binary search procedure, whose line numbers are referred to in the text and exercises of this chapter.
Figure 7.2: Hand-tracing an execution step with concrete values (left) and symbolic
values (right).
When tracing execution with concrete values, it is clear enough what to do with a branch
statement, for example, an if or while test: The test predicate is evaluated with the
current values, and the appropriate branch is taken. If the values bound to variables are
symbolic expressions, however, both the True and False outcomes of the decision may
be possible. Execution can be traced through the branch in either direction, and
execution of the test is interpreted as adding a constraint to record the outcome. For
example, consider the loop test while (high >= low) in the binary search procedure of
Figure 7.1, with low and high bound to the symbolic values L and H. If we trace execution
of the test assuming a True outcome (leading to a second iteration of the loop), the loop
condition becomes a constraint in the symbolic state immediately after the while test:

H ≥ L

Later, when we consider the branch assuming a False outcome of the test, the new
constraint is negated and becomes

¬(H ≥ L)

or, equivalently, L > H.
Execution can proceed in this way down any path in the program. One can think of
"satisfying" the predicate by finding concrete values for the symbolic variables that make
it evaluate to True; this corresponds to finding data values that would force execution of
that program path. If no such satisfying values are possible, then that execution path
cannot be executed with any data values; we say it is an infeasible path.
A weaker predicate W, obtained by dropping some of the conjuncts from the full path
condition, is necessary to execute the path, but it may not be sufficient. Showing that W
cannot be satisfied is still tantamount to showing that the execution path is infeasible.
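One simple way to picture the machinery described above is sketched below in Java (the class and method names, and the use of plain strings for symbolic expressions, are invented for illustration; a real symbolic executor would use structured terms and a constraint solver to decide satisfiability).

import java.util.*;

/** A toy symbolic state: variable bindings plus a path condition.
 *  Expressions and constraints are kept as strings for illustration. */
public class SymbolicState {
    private final Map<String, String> bindings = new LinkedHashMap<>();
    private final List<String> pathCondition = new ArrayList<>();

    public void bind(String var, String symbolicExpr) { bindings.put(var, symbolicExpr); }

    /** Add a constraint to a copy of this state: pass the branch test for the
     *  True outcome, or its negation for the False outcome. */
    public SymbolicState assume(String constraint) {
        SymbolicState s = copy();
        s.pathCondition.add(constraint);
        return s;
    }
    private SymbolicState copy() {
        SymbolicState s = new SymbolicState();
        s.bindings.putAll(bindings);
        s.pathCondition.addAll(pathCondition);
        return s;
    }
    @Override public String toString() {
        return "bindings=" + bindings + "  path condition=" + pathCondition;
    }

    public static void main(String[] args) {
        // Symbolic trace of one iteration of the binary search loop (Figure 7.1).
        SymbolicState init = new SymbolicState();
        init.bind("low", "L");
        init.bind("high", "H");

        SymbolicState inLoop = init.assume("H >= L");                       // while test, True branch
        inLoop.bind("mid", "(H + L) / 2");                                  // line 17
        SymbolicState tooSmall = inLoop.assume("dictKeys[(H+L)/2] < key");  // comparison < 0
        tooSmall.bind("low", "(H + L) / 2 + 1");                            // line 21

        SymbolicState exitLoop = init.assume("L > H");                      // False branch: negated test
        System.out.println(tooSmall);
        System.out.println(exitLoop);
    }
}

The two states printed at the end correspond to the "probe too small" path through one loop iteration and to the loop-exit path whose condition is L > H.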
The precondition of the binary search procedure is that the dictionary be sorted:

(∀ i, j : 0 ≤ i < j < size : dictKeys[i] ≤ dictKeys[j])

Here we interpret s ≤ t for strings as indicating lexical order consistent with the C library
strcmp; that is, we assume that s ≤ t whenever strcmp(s,t) ≤ 0. For convenience we
will abbreviate the predicate above as sorted.
We can associate the following assertion with the while statement at line 16:

(∀ i : 0 ≤ i < size : dictKeys[i] = key ⇒ low ≤ i ≤ high)

In other words, we assert that the key can appear only between low and high, if it
appears anywhere in the array. We will abbreviate this condition as inrange.
Inrange must be true when we first reach the loop, because at that point the range
low…high is the same as 0…size − 1. For each path through the body of the loop, the
symbolic executor would begin with the invariant assertion above, and determine that it
is true again after following that path. We say the invariant is preserved.
While the inrange predicate should be true on each iteration, it is not the complete loop
invariant. The sorted predicate remains true and will be used in reasoning. In principle it
is also part of the invariant, although in informal reasoning we may not bother to write it
down repeatedly. The full invariant is therefore sorted ∧ inrange.
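To make these obligations concrete, here is a transliteration of the procedure into Java with the precondition and the inrange invariant checked as run-time assertions (the class and helper names are mine; this is a dynamic check of the invariant on the paths actually executed, not the static, all-paths argument carried out by the symbolic executor).

/** Binary search with the loop invariant checked at run time (enable with java -ea). */
public class CheckedBinarySearch {
    static String binarySearch(String key, String[] dictKeys, String[] dictValues) {
        assert isSorted(dictKeys) : "precondition: dictKeys must be sorted";
        int low = 0;
        int high = dictKeys.length - 1;
        while (high >= low) {
            // inrange: if key occurs at all, it occurs at some index in [low, high]
            assert inRange(key, dictKeys, low, high);
            int mid = (high + low) / 2;
            int comparison = dictKeys[mid].compareTo(key);
            if (comparison < 0) {
                low = mid + 1;
            } else if (comparison > 0) {
                high = mid - 1;
            } else {
                return dictValues[mid];
            }
        }
        return null;    // postcondition: key occurs nowhere in dictKeys
    }
    static boolean isSorted(String[] a) {
        for (int i = 1; i < a.length; i++) if (a[i - 1].compareTo(a[i]) > 0) return false;
        return true;
    }
    static boolean inRange(String key, String[] a, int low, int high) {
        for (int i = 0; i < a.length; i++)
            if (a[i].equals(key) && (i < low || i > high)) return false;
        return true;
    }
    public static void main(String[] args) {
        String[] k = { "ant", "bee", "cat" }, v = { "1", "2", "3" };
        System.out.println(binarySearch("bee", k, v));   // 2
        System.out.println(binarySearch("dog", k, v));   // null
    }
}

Running with assertions enabled exercises exactly the proof obligations discussed below, one concrete path at a time.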
Let us consider the path from line 16 through line 21 and back to the loop test. We begin
by assuming that the loop invariant assertion holds at the beginning of the segment.
Where expressions in the invariant refer to program variables whose values may
change, they are replaced by symbols representing the initial values of those variables.
The variable bindings will be

low ↦ L, high ↦ H

We need not introduce symbols to represent the values of dictKeys, dictVals, key, or
size. Since those variables are not changed in the procedure, we can use the variable
names directly. The condition, instantiated with symbolic values, will be

sorted ∧ (∀ i : 0 ≤ i < size : dictKeys[i] = key ⇒ L ≤ i ≤ H)
Passing through the while test into the body of the loop adds the clause H ≥ L to this
condition. Execution of line 17 adds a binding of ⌊(H + L)/2⌋ to variable mid, where ⌊x⌋
is the integer obtained by rounding x toward zero. As we have discussed, this can be
simplified with an assertion so that the bindings and condition become

low ↦ L, high ↦ H, mid ↦ M

sorted ∧ (∀ i : 0 ≤ i < size : dictKeys[i] = key ⇒ L ≤ i ≤ H) ∧ H ≥ L ∧ L ≤ M ≤ H
Tracing the execution path into the first branch of the if statement to line 21, we add the
constraint that strcmp(dictKeys[mid], key) returns a negative value, which we interpret
as meaning the probed entry is lexically less than the string value of the key. Thus we
arrive at the symbolic constraint

sorted ∧ (∀ i : 0 ≤ i < size : dictKeys[i] = key ⇒ L ≤ i ≤ H)
       ∧ H ≥ L ∧ L ≤ M ≤ H ∧ dictKeys[M] < key

The assignment in line 21 then modifies a variable binding without otherwise disturbing
the conditions, giving us the binding low ↦ M + 1 in place of low ↦ L.
Finally, we trace execution back to the while test at line 16. Now our obligation is to
show that the invariant still holds when instantiated with the changed set of variable
bindings. The sorted condition has not changed, and showing that it is still true is trivial.
The interesting part is the inrange predicate, which is instantiated with the new value of
low and thus becomes

(∀ i : 0 ≤ i < size : dictKeys[i] = key ⇒ M + 1 ≤ i ≤ H)
Now the verification step is to show that this predicate is a logical consequence of the
predicate describing the program state. This step requires purely logical and
mathematical reasoning, and might be carried out either by a human or by a theorem-proving tool. It no longer depends in any way upon the program. The task performed by
the symbolic executor is essentially to transform a question about a program (is the
invariant preserved on a particular path?) into a question of logic alone.
The path through the loop on which the probed key is too large, rather than too small,
proceeds similarly. The path on which the probed key matches the sought key returns
from the procedure, and our obligation there (trivial in this case) is to verify that the
contract of the procedure has been met.
The other exit from the procedure occurs when the loop terminates without locating a
matching key. The contract of the procedure is that it should return the null pointer
(represented in the C language by 0) only if the key appears nowhere in
dictKeys[0..size-1]. Since the null pointer is returned whenever the loop terminates, the
postcondition of the loop is that key is not present in dictKeys.
The loop invariant is used to show that the postcondition holds when the loop terminates.
What symbolic execution can verify immediately after a loop is that the invariant is true
but the loop test is false. Thus we have

sorted ∧ (∀ i : 0 ≤ i < size : dictKeys[i] = key ⇒ L ≤ i ≤ H) ∧ L > H

Knowing that presence of the key in the array implies L ≤ H, and that in fact L > H, we
can conclude that the key is not present. Thus the postcondition is established, and the
procedure fulfills its contract by returning the null pointer in this case.
Finding and verifying a complete set of assertions, including an invariant assertion for
each loop, is difficult in practice. Even the small example above is rather tedious to verify
by hand. More realistic examples can be quite demanding even with the aid of symbolic
execution tools. If it were easy or could be fully automated, we might routinely use this
method to prove the correctness of programs. Writing down a full set of assertions
formally, and rigorously verifying them, is usually reserved for small and extremely
critical modules, but the basic approach we describe here can also be applied in a much
less formal manner and is quite useful in finding holes in an informal correctness
argument.
The meaning of a Hoare triple {pre} block {post} is that if the program is in a state satisfying the precondition
pre at entry to the block, then after execution of the block it will be in a state satisfying
the postcondition post.
There are standard templates, or schemata, for reasoning with triples. In the previous
section we were following this schema for reasoning about while loops:
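In standard Hoare-logic notation the schema is the familiar inference rule for while loops, reproduced here in LaTeX form:

\[
\frac{\{\, I \land C \,\}\ S\ \{\, I \,\}}
     {\{\, I \,\}\ \texttt{while}\,(C)\ S\ \{\, I \land \lnot C \,\}}
\]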
The formula above the line is the premise of an inference, and the formula below the line
is the conclusion. An inference rule states that if we can verify the premise, then we can
infer the conclusion. The premise of this inference rule says that the loop body preserves
invariant I: If the invariant I is true before the loop, and if the condition C governing the
loop is also true, then the invariant is established again after executing the loop body S.
The conclusion says that the loop as a whole takes the program from a state in which
the invariant is true to a state satisfying a postcondition composed of the invariant and
the negation of the loop condition.
The important characteristic of these rules is that they allow us to compose proofs about
small parts of the program into proofs about larger parts. The inference rule for while
allows us to take a triple about the body of a loop and infer a triple about the whole
loop. There are similar rules for building up triples describing other kinds of program constructs.
This style of reasoning essentially lets us summarize the effect of a block of program
code by a precondition and a postcondition. Most importantly, we can summarize the
effect of a whole procedure in the same way. The contract of the procedure is a
precondition (what the calling client is required to provide) and a postcondition (what the
called procedure promises to establish or return). Once we have characterized the
contract of a procedure in this way, we can use that contract wherever the procedure is
called. For example, we might summarize the effect of the binary search procedure this
way: if dictKeys is sorted (the precondition), then the value returned is the element of
dictValues corresponding to key if key appears anywhere in dictKeys[0..size-1], and the
null pointer otherwise (the postcondition).
Explicit consideration of the abstract model, abstraction function, and structural invariant
of a class or other data structure model is the basis not only of formal or informal
reasoning about correctness, but also of designing test cases and test oracles.
Summary
Symbolic execution is a bridge from an operational view of program execution to logical
and mathematical statements. The basic symbolic execution technique is like hand
execution using symbols rather than concrete values. To use symbolic execution for
loops, procedure calls, and data structures encapsulated in modules (e.g., classes), it is
necessary to proceed hierarchically, composing facts about small parts into facts about
larger parts. Compositional reasoning is closely tied to strategies for specifying intended
behavior.
Symbolic execution is a fundamental technique that finds many different applications.
Test data generators use symbolic execution to derive constraints on input data. Formal
verification systems combine symbolic execution to derive logical predicates with
theorem provers to prove them. Many development tools use symbolic execution
techniques to perform or check program transformations, for example, unrolling a loop
for performance or refactoring source code.
Human software developers can seldom carry out symbolic execution of program code
in detail, but often use it (albeit informally) for reasoning about algorithms and data
structure designs. The approach to specifying preconditions, postconditions, and
invariants is also widely used in programming, and is at least partially supported by tools
for run-time checking of assertions.
Further Reading
The techniques underlying symbolic execution were developed by Floyd [Flo67] and
Hoare [Hoa69], although the fundamental ideas can be traced all the way back to Turing
and the beginnings of modern computer science. Hantler and King [HK76] provide an
excellent clear introduction to symbolic execution in program verification. Kemmerer and
Eckman [KE85] describe the design of an actual symbolic execution system, with
discussion of many pragmatic details that are usually glossed over in theoretical
descriptions.
Generation of test data using symbolic execution was pioneered by Clarke [Cla76], and
Howden [How77, How78] described an early use of symbolic execution to test
programs. The PREfix tool described by Bush, Pincus, and Sielaff [BPS00] is a modern
application of symbolic testing techniques with several refinements and simplifications for
adequate performance on large programs.
Exercises
7.1 We introduce symbols to represent variables whose value may change, but we do
not bother to introduce symbols for variables whose value remains unchanged in
the code we are symbolically executing. Why are new symbols necessary in the
former case but not in the latter?
7.2 Demonstrate that the statement return dictValues[mid] at line 27 of the binary
search program of Figure 7.1 always returns the value of the input key.
7.3 Compute an upper bound to the number of iterations through the while loop of the
binary search program of Figure 7.1.
7.4 The body of the loop of the binary search program of Figure 7.1 can be modified
as follows:

    if (comparison < 0) {
        /* dictKeys[mid] too small; look higher */
        low = mid + 1;
    }
    if ( comparison > 0) {
        /* dictKeys[mid] too large; look lower */
        high = mid - 1;
    }
    if (comparison == 0) {
        /* found */
        return dictValues[mid];
    }

Demonstrate that the path that traverses the false branch of all three statements is
infeasible.
7.5 Write the pre- and postconditions for a program that finds the index of the
maximum element in a nonempty set of integers.
8.1 Overview
Most important properties of program execution are undecidable in general, but finite
state verification can automatically prove some significant properties of a finite model of
the infinite execution space. Of course, there is no magic: We must carefully reconcile
and balance trade-offs among the generality of the properties to be checked, the class
of programs or models that can be checked, computational effort, and human effort to
use the techniques.
Symbolic execution and formal reasoning can prove many properties of program
behavior, but the power to prove complex properties is obtained at the cost of devising
complex conditions and invariants and expending potentially unbounded computational
effort. Construction of control and data flow models, on the other hand, can be fully and
efficiently automated, but is typically limited to very simple program properties. Finite
state verification borrows techniques from symbolic execution and formal verification, but
like control and data flow analysis, applies them to models that abstract the potentially
infinite state space of program behavior into finite representations. Finite state
verification techniques fall between basic flow analyses and full-blown formal verification
in the richness of properties they can address and in the human guidance and
computational effort they require.
Since even simple properties of programs are undecidable in general, one cannot expect
an algorithmic technique to provide precise answers in all cases. Often finite state
verification is used to augment or substitute for testing when the optimistic inaccuracy of
testing (due to examining only a sample of the program state space) is unacceptable.
Techniques are therefore often designed to provide results that are tantamount to formal
proofs of program properties. In trade for this assurance, both the programs and
properties that can be checked are severely restricted. Restrictions on program
constructs typically appear in procedures for deriving a finite state model from a
program, generating program code from a design model, or verifying consistency
between a program and a separately constructed model.
Finite state verification techniques include algorithmic checks, but it is misleading to
characterize them as completely automated. Human effort and considerable skill are
usually required to prepare a finite state model and a suitable specification for the
automated analysis step. Very often there is an iterative process in which the first
several attempts at verification produce reports of impossible or unimportant faults,
which are addressed by repeatedly refining the specification or the model.
The automated step can be computationally costly, and the computational cost can
impact the cost of preparing the model and specification. A considerable amount of
manual effort may be expended just in obtaining a model that can be analyzed within
available time and memory, and tuning a model or specification to avoid combinatorial
explosion is itself a demanding task. The manual task of refining a model and
testing, treating the model as a kind of specification. The combination of finite state
verification and conformance testing is often more effective than directly testing for the
property of interest, because a discrepancy that is easily discovered in conformance
testing may very rarely lead to a run-time violation of the property (e.g., it is much easier
to detect that a particular lock is not held during access to a shared data structure than
to catch the occasional data race that the lock protects against).
A property to be checked can be implicit in a finite state verification tool (e.g., a tool
specialized just for detecting potential null pointer references), or it may be expressed in
a specification formalism that is purposely limited to a class of properties that can be
effectively verified using a particular checking technique. Often the real property of
interest is not amenable to efficient automated checking, but a simpler and more
restrictive property is. That is, the property checked by a finite state verification tool may
be sufficient but not necessary for the property of interest. For example, verifying
freedom from race conditions on a shared data structure is much more difficult than
verifying that some lock is always held by threads accessing that structure; the latter is
a sufficient but not necessary condition for the former. This means that we may exclude
correct software that we are not able to verify, but we can be sure that the accepted
software satisfies the property of interest.
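As a small, invented Java illustration of this trade: verifying that the counter below is free of data races in general is hard, but verifying the stronger, simpler property that every access to value holds the object's monitor is a local, almost syntactic check, and it implies freedom from data races on that field.

/** Illustrates a lock discipline that is sufficient (but not necessary) for race freedom:
 *  every access to the protected field holds the object's monitor. */
public class GuardedCounter {
    private int value = 0;   // accessed only while holding this object's lock

    public synchronized void increment() { value = value + 1; }
    public synchronized int  get()       { return value; }

    public static void main(String[] args) throws InterruptedException {
        GuardedCounter c = new GuardedCounter();
        Thread a = new Thread(() -> { for (int i = 0; i < 100_000; i++) c.increment(); });
        Thread b = new Thread(() -> { for (int i = 0; i < 100_000; i++) c.increment(); });
        a.start(); b.start(); a.join(); b.join();
        System.out.println(c.get());   // always 200000: no lost updates
    }
}

A class that safely avoided the lock on some path would be rejected by the stronger check even if it were actually race-free, which is exactly the kind of conservative exclusion described above.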
[1] Note that one may independently derive several different models from one program,
but deriving one program from several different models is much more difficult.
Figure 8.3: Finite state models of individual threads executing the lookup and reInit
methods from Figure 8.2. Each state machine may be replicated to represent
concurrent threads executing the same method.
Java threading rules ensure that in a system state in which one thread has obtained a
monitor lock, the other thread cannot make a transition to obtain the same lock. We can
observe that the locking prevents both threads from concurrently calling the initialize
method. However, another race condition is possible, between two concurrent threads
each executing the lookup method.
Tracing possible executions by hand - "desk checking" multi-threaded execution - is
capable in principle of finding the race condition between two concurrent threads
executing the lookup method, but it is at best tedious and in general completely
impractical. Fortunately, it can be automated, and many state space analysis tools can
explore millions of states in a short time. For example, a model of the faulty code from
Figure 8.2 was coded in the Promela modeling language and submitted to the Spin
verification tool. In a few seconds, Spin systematically explored the state space and
reported a race condition, as shown in Figure 8.5.
Depth= 10 States= 51 Transitions= 92 Memory= 2.30
pan: assertion violated !(modifying) (at depth 17)
pan: wrote pan_in.trail
(Spin Version 4.2.5 -- 2 April 2005)
...
0.16 real
0.00 user
0.03 sys
Figure 8.5: Excerpts of Spin verification tool transcript. Spin has performed a depth-first search of possible executions of the model, exploring 10 states and 51 state
transitions in 0.16 seconds before finding a sequence of 17 transitions from the initial
state of the model to a state in which one of the assertions in the model evaluates to
False.
A few seconds of automated analysis to find a critical fault that can elude extensive
testing seems a very attractive option. Indeed, finite state verification should be a key
component of strategies for eliminating faults in multi-threaded and distributed programs,
as well as some kinds of security problems (which are similarly resistant to systematic
sampling in conventional program testing) and some other domains. On the other hand,
we have so far glossed over several limitations and problems of state space exploration,
each of which also appears in other forms of finite state verification. We will consider
two fundamental and related issues in the following sections: the size of the state space
to be explored, and the challenge of obtaining a model that is sufficiently precise without
making the state space explosion worse.
in Figure 8.4 can be used to represent acquiring a monitor lock, because execution
blocks at this point until locked has the value False. The guard is enclosed in an
atomic block to prevent another process taking the lock between evaluation of the
guard condition and execution of the statement.
The concept of enabling or blocking in guarded commands is used in conditional and
looping constructs. Alternatives in an if … fi construct, marked syntactically with ::,
begin with guarded commands. If none of the alternatives is enabled (all of the guards
evaluate to False), then the whole if construct blocks. If more than one of the guarded
alternatives is enabled, the if construct does not necessarily choose the first among
them, as a programmer might expect from analogous if else if constructs in
conventional programming languages. Any of the enabled alternatives can be
nondeterministically chosen for execution; in fact the Spin tool will consider the
possible consequences of each choice. The do … od construct similarly chooses
nondeterministically among enabled alternatives, but repeats until a break or goto is
evaluated in one of the guarded commands.
The simplest way to check properties of a Promela model is with assertions, like the
two assert statements in Figure 8.4. Spin searches for any possible execution
sequence in which an assertion can be violated. Sequencing properties can also be
specified in the form of temporal logic formulas, or encoded as state machines.
Figure 8.6: A Spin guided simulation trace describes each of the 17 steps from the
initial model state to the state in which the assertion !(modifying) is violated. For
example, in step 8, one of the two processes (threads) simulating execution of the
Lookup method sets the global variable modifying to True, represented as the integer
value 1. A graphical representation of this trace is presented in Figure 8.7.
Figure 8.7: A graphical interpretation of Spin guided simulation output (Figure 8.6) in
terms of Java source code (Figure 8.2) and state machines (Figure
8.3).
Safety and Liveness Properties
Properties of concurrent systems can be divided into simple safety properties,
sequencing safety properties, and liveness properties.
Simple safety properties divide states of the system into "good" (satisfying the
property) and "bad" (violating the property). They are easiest to specify, and least
expensive to check, because we can simply provide a predicate to be evaluated at
each state. Often simple safety properties relate the local state of one process to
local states of other processes. For example, the assertion assert(! modifying) in the
Promela code of Figure 8.4 states a mutual exclusion property between two
instances of the lookup process. When simple safety properties are expressed in
temporal logic, they have the form □p, where p is a simple predicate with no
temporal modalities.
Safety properties about sequences of events are similar, but treat the history of
events preceding a state as an attribute of that state. For example, an assertion that
two operations a and b strictly alternate is a safety property about the history of
those events; a "bad" state is one in which a or b is about to be performed out of
order. Sequencing properties can be specified in temporal logic, but do not require it:
They are always equivalent to simple safety properties embedded in an "observer"
process. Checking a sequencing property adds the same degree of complexity to the
verification process as adding an explicit observer process, whether there is a real
observer (which is straightforward to encode for some kinds of model, and nearly
impossible for others) or whether the observer is implicit in the checking algorithm (as
it would be using a temporal logic predicate with the Spin tool).
True liveness properties, sometimes called "eventuality" properties, are those that
can only be violated by an infinite length execution. For example, if we assert that p
must eventually be true (◇p), the assertion is violated only by an execution that runs
forever with p continuously false. Liveness properties are useful primarily as a way of
abstracting over sequences of unknown length. For example, fairness properties are
an important class of liveness properties. When we say, for example, that a mutual
exclusion protocol must be fair, we do not generally mean that all processes have an
equal chance to obtain a resource; we merely assert that no process can be starved
forever. Liveness properties (including fairness properties) must generally be stated in
temporal logic, or encoded directly in a Büchi automaton that appears similar to a
deterministic finite state acceptor but has different rules for acceptance. A finite state
verification tool finds violations of liveness properties by searching for execution loops
in which the predicate that should eventually be true remains false; this adds
considerably to the computational cost of verification.
A common mnemonic for safety and liveness is that safety properties say "nothing
bad happens," while liveness properties say "something good eventually happens."
Properties involving real time (e.g., "the stop signal is sent within 5 seconds of
receiving the damage signal") are technically safety properties in which the "bad thing"
is expiration of a timer. However, naive models involving time are so expensive that it
is seldom practical to simply add a clock to a model and use simple safety properties.
Usually it is best to keep reasoning about time separate from verifying untimed
properties with finite state verification.
[2] In fact even a correctly implemented double-check pattern can fail in Java due to
properties of the Java memory model, as discussed below.
        ...
        rightFork!Down;
        leftFork!Down;
        myState = Thinking;
    od;
}

#define NumSeats 10
chan forkInterface[NumSeats] = [0] of {mtype} ;
init {
    int i = 0;
    do :: i < NumSeats ->
            run fork( forkInterface[i] );
            i = i + 1;
       :: i >= NumSeats -> break;
    od;
    i = 0;
    do :: i < NumSeats ->
            /* second fork assumed to be the neighbor's: (i+1) mod NumSeats */
            run philosopher( forkInterface[i], forkInterface[(i+1) % NumSeats] );
            i = i + 1;
       :: i >= NumSeats-1 -> break;
    od;
}
Figure 8.8: The classic dining philosophers problem in Promela. The number of
unique states explored before finding the potential deadlock (with default settings)
grows from 145 with 5 philosophers, to 18,313 with 10 philosophers, to 148,897 with
15 philosophers.
The known complexity results strongly imply that, in the worst case, no finite state
verification technique can be practical. Worst case complexity results, however, say
nothing about the performance of verification techniques on typical problems. Experience
with a variety of automated techniques tells us a fair amount about what to expect: Many
techniques work very well when applied on well-designed models, within a limited
domain, but no single finite state verification technique gives satisfactory results on all
problems. Moreover, crafting a model that accurately and succinctly captures the
essential structure of a system, and that can be analyzed with reasonable performance
by a given verification tool, requires creativity and insight as well as understanding of the
verification approach used by that tool.
An Illustration of State Space Explosion
[3] It is a useful exercise to try this, because even though the number of reachable states
is quite small, it is remarkably difficult to enumerate them by hand without making
mistakes. Programmers who attempt to devise clever protocols for concurrent operation
face the same difficulty, and if they do not use some kind of automated formal
verification, it is not an exaggeration to say they almost never get it right.
and more effective than testing alone. Consider again our simple example of
misapplication of the double-check pattern in Figure 8.2. Tens of thousands of test
executions can fail to reveal the race condition in this code, depending on the way
threads are scheduled on a particular hardware platform and Java virtual machine
implementation. Testing for a discrepancy between model and program, on the other
hand, is fairly straightforward because the model of each individual state machine can be
checked independently (in fact all but one are trivial). The complexity that stymies testing
comes from nondeterministic interleaving of their execution, but this interleaving is
completely irrelevant to conformance testing.
Figure 8.9: A simple data race in Java. The possible ending values of i depend on
how the statement i = i+1 in one thread is interleaved with the same sequence in the
other thread.
Figure 8.10: Coarse and fine-grain models of the same program from Figure 8.9. In
the coarse-grain model, i will be increased by 2, but other outcomes are possible in
the finer grain model in which the shared variable i is loaded into temporary variable
or register, updated locally, and then stored.
Figure 8.11: The lost update problem, in which only one of the two increments
affects the final value of i. The illustrated sequence of operations from the program of
Figure 8.9 can be found using the finer grain model of Figure 8.10, but is not revealed
by the coarser grain model.
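For readers who want to observe the effect directly, the following Java sketch is in the spirit of Figure 8.9 (the figure's own listing is not reproduced here, so the class and field names below are invented): two unsynchronized threads each perform read-modify-write increments of a shared variable, and lost updates show up as a final value below the expected total.

/** A lost-update race in the spirit of Figure 8.9: i = i + 1 is not atomic,
 *  so two unsynchronized increments may overlap and one may be lost. */
public class RaceSketch {
    static int i = 0;          // shared variable, no synchronization

    public static void main(String[] args) throws InterruptedException {
        Runnable increment = () -> {
            for (int k = 0; k < 100_000; k++) {
                i = i + 1;     // compiled to load, add, store: interleavings can lose updates
            }
        };
        Thread a = new Thread(increment);
        Thread b = new Thread(increment);
        a.start(); b.start();
        a.join();  b.join();
        // Often prints less than 200000, but the race manifests nondeterministically.
        System.out.println("final value of i: " + i);
    }
}

Because the interleaving that loses an update is rare and scheduler-dependent, many runs may print the expected value, which is precisely why testing alone is a weak defense against this fault.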
Even representing each memory access as an individual action is not always sufficient.
Programming language definitions usually allow compilers to perform some
rearrangements in the order of instructions. What appears to be a simple store of a
value into a memory cell may be compiled into a store into a local register, with the
actual store to memory appearing later (or not at all, if the value is replaced first). Two
loads or stores to different memory locations may also be reordered for reasons of
efficiency. Moreover, when a machine instruction to store a value into memory is
executed by a parallel or distributed computer, the value may initially be placed in the
cache memory of a local processor, and only later written into a memory area accessed
by other processors. These reorderings are not under programmer control, nor are they
directly visible, but they can lead to subtle and unpredictable failures in multi-threaded
programs.
As an example, consider once again the flawed program of Figure 8.2. Suppose we
corrected it to use the double-check idiom only for lazy initialization and not for updates
of the data structure. It would still be wrong, and unfortunately it is unlikely we would
discover the flaw through finite state verification. Our model in Promela assumes that
memory accesses occur in the order given in the Java program, but Java does not
guarantee that they will be executed in that order. In particular, while the programmer
may assume that initialization invoked in line 18 of the Java program is completed before
field ref is set in line 19, Java makes no such guarantee.
Breaking sequences of operations into finer pieces exacerbates the state explosion
problem, but as we have seen, making a model too coarse risks failure to detect some
possible errors. Moreover, conformance testing may not be much help in determining
whether a model depends on unjustified assumptions of atomicity. Interruptions in a
sequence of program operations that are mistakenly modeled as an atomic action may
not only be extremely rare and dependent on uncontrolled features of the execution
environment, such as system load or the activity of connected devices, but may also
depend on details of a particular language compiler.
Conformance testing is not generally effective in detecting that a finite state model of a
program relies on unwarranted assumptions of atomicity and ordering of memory
accesses, particularly when those assumptions may be satisfied by one compiler or
machine (say, in the test environment) and not by another (as in the field). Tools for
extracting models, or for generating code from models, have a potential advantage in
that they can be constructed to assume no more than is actually guaranteed by the
programming language.
Many state space analysis tools will attempt to dynamically determine when a sequence
of operations in one process can be treated as if it were atomic without affecting the
results of analysis. For example, the Spin verification tool uses a technique called partial
order reduction to recognize when the next event from one process can be freely
reordered with the next event from another, so only one of the orders need be checked.
Many finite state verification tools provide analogous facilities, and though they cannot
completely compensate for the complexity of a model that is more fine-grained than
necessary, they reduce the penalty imposed on the cautious model-builder.
The extensional representation, given above, lists the elements of the set. The same set
can be represented intensionally as

{ x ∈ N | x mod 2 = 0 ∧ 0 < x < 20 }

The predicate x mod 2 = 0 ∧ 0 < x < 20, which is true for elements included in the set
and false for excluded elements, is called a characteristic function. The length of the
representation of the characteristic function does not necessarily grow with the size of
the set it describes. For example, the set
contains four times as many elements as the one above, and yet the length of the
representation is the same.
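In programming terms a characteristic function is just a predicate. The small Java fragment below (the class name and the larger bound are invented for illustration) makes the point that the description stays the same size no matter how many elements satisfy it.

import java.util.function.IntPredicate;

public class CharacteristicFunction {
    public static void main(String[] args) {
        // Characteristic function of { x | x mod 2 = 0 and 0 < x < 20 }.
        IntPredicate small = x -> x % 2 == 0 && 0 < x && x < 20;
        // The same one-line predicate with a larger bound describes a much larger set.
        IntPredicate large = x -> x % 2 == 0 && 0 < x && x < 80;
        System.out.println(small.test(10) + " " + large.test(50));   // true true
    }
}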
It could be advantageous to use similarly compact representations for sets of reachable
states and transitions among them. For example, ordered binary decision diagrams
(OBDDs) are a representation of Boolean functions that can be used to describe the
characteristic function of a transition relation. Transitions in the model state space are
pairs of states (the state before and the state after executing the transition), and the
Boolean function represented by the OBDD takes a pair of state descriptions and
returns True exactly if there is a transition between such a pair of states. The OBDD is
built by an iterative procedure that corresponds to a breadth-first expansion of the state
space (i.e., creating a representation of the whole set of states reachable in k + 1 steps
from the set of states reachable in k steps). If the OBDD representation does not grow
too large to be manipulated in memory, it stabilizes when all the transitions that can
occur in the next step are already represented in the OBDD form.
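Writing T for the transition relation and R_k for the set of states reachable in at most k steps, the iteration just described can be summarized as follows (a standard formulation, stated here in LaTeX rather than taken verbatim from the text):

\[
R_0 = \{ s_0 \}, \qquad
R_{k+1} = R_k \cup \{\, s' \mid \exists\, s \in R_k .\ (s, s') \in T \,\}
\]

The iteration stops when R_{k+1} = R_k; each set R_k, like T itself, is represented by an OBDD over the Boolean state variables, so the cost depends on the size of those OBDDs rather than on the raw number of states.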
Finding a compact intensional representation of the model state space is not, by itself,
enough. In addition we must have an algorithm for determining whether that set satisfies
the property we are checking. For example, an OBDD can be used to represent not only
the transition relation of a set of communicating state machines, but also a class of
Figure 8.12: Ordered binary decision diagram (OBDD) encoding of the Boolean
proposition a ∧ b ⇒ c, which is equivalent to a ⇒ (b ⇒ c). The formula and OBDD
structure can be thought of as a function from the Boolean values of a, b, and c to a
single Boolean value True or False.
However, M1 is only an approximation of the real system, and we find that the verification
finds a violation of P because of some execution sequences that are possible in M1 but
not in the real system. In the first approach, we examine the counter-example (an
execution trace of M1 that violates P but is impossible in the real system) and create a
new model M2 that is more precise in a way that will eliminate that particular execution
trace (and many similar traces). We attempt verification again with the refined model:
If verification fails again, we repeat the process to obtain a new model M3, and so on,
until verification succeeds with some "good enough" model Mk or we obtain a counterexample that corresponds to an execution of the actual program.
One kind of model that can be iteratively refined in this way is Boolean programs. The
initial Boolean program model of an (ordinary) program omits all variables; branches (if,
while, etc.) refer to a dummy Boolean variable whose value is unknown. Boolean
programs are refined by adding variables, with assignments and tests - but only Boolean
variables. For instance, if a counter-example produced by trying to verify a property of a
pump controller shows that the waterLevel variable cannot be ignored, a Boolean
program might be refined by adding a Boolean variable corresponding to a predicate in
which waterLevel is tested (say, waterLevel < highLimit), rather than adding the variable
waterLevel itself. For some kinds of interprocedural control flow analysis, it is possible
to completely automate the step of choosing additional Boolean variables to refine Mi
into Mi+1 and eliminate some spurious executions.
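As a purely hypothetical illustration of that refinement step (only the predicate waterLevel < highLimit comes from the text; the surrounding pump-controller code is invented), a fragment and two successive Boolean-program abstractions of it might look like this in Java form:

import java.util.Random;

/** Illustration of Boolean-program refinement (names are invented;
 *  only waterLevel < highLimit comes from the text's example). */
public class BooleanProgramSketch {
    static final Random nondet = new Random();   // stands in for "unknown" choices

    // Original program fragment.
    static void controller(int waterLevel, int highLimit) {
        if (waterLevel < highLimit) { pumpOn(); } else { pumpOff(); }
    }

    // Initial Boolean program: no variables at all, so the branch is nondeterministic.
    static void controllerAbstract0() {
        if (nondet.nextBoolean()) { pumpOn(); } else { pumpOff(); }
    }

    // Refined Boolean program: b represents the predicate waterLevel < highLimit.
    static void controllerAbstract1(boolean b) {
        if (b) { pumpOn(); } else { pumpOff(); }
    }

    static void pumpOn()  { System.out.println("pump on");  }
    static void pumpOff() { System.out.println("pump off"); }

    public static void main(String[] args) {
        controller(3, 10);          // original
        controllerAbstract0();      // branch chosen nondeterministically
        controllerAbstract1(true);  // b abstracts waterLevel < highLimit
    }
}

The refined model still overapproximates the program, but it can no longer take the spurious branch that ignored the water level entirely.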
In the second approach, M remains fixed,[4] but premises that constrain executions to be
checked are added to the property P. When bogus behaviors of M violate P, we add
constraint C1 to rule them out and try the modified verification problem:
If the modified verification problem fails because of additional bogus behaviors, we try
again with new constraints C2:
[4]In
Exclude self loops from "links" relations; that is, specify that a page should not
be directly linked to itself.
Allow at most one type of link between two pages. Note that relations need not
be symmetric; that is, the relation between A and B is distinct from the relation
between B and A, so there can be a link of type private from A to B and a link of
type public from B back to A.
Require the Web site to be connected; that is, require that there be at least one
way of following links from the home page to each other page of the site.
A data model can be visualized as a diagram with nodes corresponding to sets and
edges representing relations, as in Figure 8.14.
A ∪ B = B ∪ A                          commutative law
A ∩ B = B ∩ A                          " "
(A ∪ B) ∪ C = A ∪ (B ∪ C)              associative law
(A ∩ B) ∩ C = A ∩ (B ∩ C)              " "
A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)        distributive law
etc.
These and many other laws together make up relational algebra, which is used
extensively in database processing and has many other uses.
It would be inconvenient to write down a data model directly as a collection of
mathematical formulas. Instead, we use some notation whose meaning is the same as
the mathematical formulas, but is easier to write, maintain, and comprehend. Alloy is
one such modeling notation, with the additional advantage that it can be processed by a
finite state verification tool.
The definition of the data model as sets and relations can be formalized and verified with
relational algebra by specifying signatures and constraints. Figure 8.15 presents a
formalization of the data model of the Web site in Alloy. Keyword sig (signature)
identifies three sets: Pages, User, and Site. The definition of set Pages also defines
three disjoint relations among pages: linksPriv (private links), linksPub (public links), and
linksMain (maintenance links). The definition of User also defines a relation between
users and pages. User is partitioned into three disjoint sets (Administrator, Registered,
and Unregistered). The definition of Site aggregates pages into the site and identifies the
home page. Site is defined static since it is a fixed classification of objects.
module WebSite

// Pages include three disjoint sets of links
sig Page{ disj linksPriv, linksPub, linksMain: set Page }
// Each type of link points to a particular class of page
fact connPub{ all p: Page, s: Site | p.linksPub in s.unres }
fact connPriv{ all p: Page, s: Site | p.linksPriv in s.res }
fact connMain{ all p: Page, s: Site | p.linksMain in s.main }
// Self loops are not allowed
fact noSelfLoop{ no p: Page | p in p.linksPriv+p.linksPub+p.linksMain }

// Users are characterized by the set of pages that they can access
sig User{ pages: set Page }
// Users are partitioned into three sets
part sig Administrator, Registered, Unregistered extends User
// Unregistered users can access only the home page and unrestricted pages
fact accUnregistered{
    all u: Unregistered, s: Site | u.pages = (s.home+s.unres) }
// Registered users can access home, restricted and unrestricted pages
fact accRegistered{
    all u: Registered, s: Site |
        u.pages = (s.home+s.res+s.unres)
}
// Administrators can access all pages
fact accAdministrator{
    all u: Administrator, s: Site |
        u.pages = (s.home+s.res+s.unres+s.main)
}

// A web site includes one home page and three disjoint sets
// of pages: restricted, unrestricted and maintenance
// [declaration of the static signature Site, with its home page, its page sets
//  res, unres, and main, and the site connectivity constraint, is omitted here]
Figure 8.15: Alloy model of a Web site with different kinds of pages, users, and
access rights (data model part). Continued in Figure 8.16.
module WebSite
...
// We consider one Web site that includes one home page
// and some other pages
fun initSite() {
    one s: Site | one s.home and
        some s.res and
        some s.unres and
        some s.main
}

// We consider one administrator and some registered and unregistered users
fun initUsers() { one Administrator and
    some Registered and
    some Unregistered }

fun init() {
    initSite() and initUsers()
}

// ANALYSIS

// Verify if there exists a solution
// with sets of cardinality at most 5
run init for 5

// Check if unregistered users can visit all unrestricted pages,
// i.e., all unrestricted pages are connected to the home page by
// at least a path of public links.
assert browsePub{
    all p: Page, s: Site | p in s.unres implies s.home in p.*(~linksPub)  // closure expression is an assumed completion
}
check browsePub for 3
Figure 8.16: Alloy model of a Web site with different kinds of pages, users, and
access rights, continued from Figure 8.15.
The keyword fact introduces constraints.[6] The constraints connPub, connPriv and
connMain restrict the target of the links relations, while noSelfLoop excludes links from a
page to itself. The constraints accAdministrator, accRegistered, and accUnregistered
map users to pages. The constraint that follows the definition of Site forces the Web site
to be connected by requiring each page to belong to the transitive closure of links
starting from the Web page (operator ^).
A relational algebra specification may be over- or underconstrained. Overconstrained
specifications are not satisfiable by any implementation, while underconstrained
specifications allow undesirable implementations; that is, implementations that violate
important properties.
In general, specifications identify infinite sets of solutions, each characterized by a
different set of objects and relations (e.g., the infinite set of Web sites with different sets
of pages, users and correct relations among them). Thus in general, properties of a
relational specification are undecidable because proving them would require examining
an infinite set of possible solutions. While attempting to prove absence of a solution may
be inconclusive, often a (counter) example that invalidates a property can be found within
a finite set of small models.
We can verify a specification over a finite set of solutions by limiting the cardinality of the
sets. In the example, we first verify that the model admits solutions for sets with at most
five elements (run init for 5 issued after an initialization of the system.) A positive
outcome indicates that the specification is not overconstrained - there are no logical
contradictions. A negative outcome would not allow us to conclude that no solution
exists, but tells us that no "reasonably small" solution exists.
We then verify that the example is not underconstrained with respect to property
browsePub that states that unregistered users must be able to visit all unrestricted
pages by accessing the site from the home page. The property is asserted by requiring
that all unrestricted pages belong to the reflexive transitive closure of the linkPub relation
from the home page (here we use operator * instead of ^ because the home page is
included in the closure). If we check whether the property holds for sets with at most
three elements (check browsePub for 3) we obtain a counter-example like the one
shown in Figure 8.17, which shows how the property can be violated.
Figure 8.17: A Web site that violates the "browsability" property, because public
page Page_2 is not reachable from the home page using only unrestricted links. This
diagram was generated by the Alloy tool.
The simple Web site in the example consists of two unrestricted pages (page_1, the
home page, and Page_2), one restricted page (page_0), and one unregistered user
(user_2). User_2 cannot visit one of the unrestricted pages (Page_2) because the only
path from the home page to Page_2 goes through the restricted page page_0. The
property is violated because unrestricted browsing paths can be "interrupted" by
restricted pages or pages under maintenance, for example, when a previously
unrestricted page is reserved or disabled for maintenance by the administrator.
The problem appears only when there are public links from maintenance or reserved
pages, as we can check by excluding them:
fact descendant{
    all p: Page, s: Site | p in s.main+s.res implies no p.linksPub
}
This new specification would not find any counter-example in a space of cardinality 3.
We cannot conclude that no larger counter-example exists, but we may be satisfied that
there is no reason to expect this property to be violated only in larger models.
Summary
Finite state verification techniques fill an important niche in verifying critical properties of
programs. They are particularly crucial where nondeterminism makes program testing
ineffective, as in concurrent execution. In principle, finite state verification of concurrent
execution and of data models can be seen as systematically exploring an enormous
space of possible program states. From a user's perspective, the challenge is to
construct a suitable model of the software that can be analyzed with reasonable
expenditure of human and computational resources, captures enough significant detail
for verification to succeed, and can be shown to be consistent with the actual software.
Further Reading
There is a large literature on finite state verification techniques reaching back at least to
the 1960s, when Bartlett et al. [BSW69] employed what is recognizably a manual
version of state space exploration to justify the correctness of a communication
protocol. A number of early state space verification tools were developed initially for
communication protocol verification, including the Spin tool. Holzmann's journal
description of Spin's design and use [Hol97], though now somewhat out of date, remains
an adequate introduction to the approach, and a full primer and reference manual
[Hol03] is available in book form.
The ordered binary decision diagram representation of Boolean functions, used in the
first symbolic model checkers, was introduced by Randal Bryant [Bry86]. The
representation of transition relations as OBDDs in this chapter is meant to illustrate
basic ideas but is simplified and far from complete; Bryant's survey paper [Bry92] is a
good source for understanding applications of OBDDs, and Huth and Ryan [HR00]
provide a thorough and clear step-by-step description of how OBDDs are used in the
SMV symbolic model checker.
Model refinement based on iterative refinements of an initial coarse model was
introduced by Ball and Rajamani in the tools Slam [BR01a] and Bebop [BR01b], and by
Henzinger and his colleagues in Blast [HJMS03]. The complementary refinement
approach of FLAVERS was introduced by Dwyer and colleagues [DCCN04].
Automated analysis of relational algebra for data modeling was introduced by Daniel
Jackson and his students with the Alloy notation and associated tools [Jac02].
Exercises
8.1 We stated, on the one hand, that finite state verification falls between basic flow
analysis and formal verification in power and cost, but we also stated that finite
state verification techniques are often designed to provide results that are
tantamount to formal proofs of program properties. Are these two statements
contradictory? If not, how can a technique that is less powerful than formal
verification produce results that are tantamount to formal proofs?
8.4 A property like "if the button is pressed, then eventually the elevator will come" is
classified as a liveness property. However, the stronger real-time version "if the
button is pressed, then the elevator will arrive within 30 seconds" is technically a
safety property rather than a liveness property. Why?
[6] The order in which relations and constraints are given is irrelevant. We list constraints
after the relations they refer to.
Chapter List
Chapter 9: Test Case Selection and Adequacy
Chapter 10: Functional Testing
Chapter 11: Combinatorial Testing
Chapter 12: Structural Testing
Chapter 13: Data Flow Testing
Chapter 14: Model-Based Testing
Chapter 15: Testing Object-Oriented Software
Chapter 16: Fault-Based Testing
Chapter 17: Test Execution
Chapter 18: Inspection
Chapter 19: Program Analysis
Required Background
Chapter 2
The fundamental problems and limitations of test case selection are a
consequence of the undecidability of program properties. A grasp of the basic
problem is useful in understanding Section 9.3.
9.1 Overview
Experience suggests that software that has passed a thorough set of
systematic tests is likely to be more dependable than software that
has been only superficially or haphazardly tested. Surely we should
require that each software module or subsystem undergo thorough,
systematic testing before being incorporated into the main product.
But what do we mean by thorough testing? What is the criterion by
which we can judge the adequacy of a suite of tests that a software
artifact has passed?
Ideally, we should like an "adequate" test suite to be one that
ensures correctness of the product. Unfortunately, that goal is not
attainable. The difficulty of proving that some set of test cases is
adequate in this sense is equivalent to the difficulty of proving that
the program is correct. In other words, we could have "adequate"
testing in this sense only if we could establish correctness without
any testing at all.
In practice we settle for criteria that identify inadequacies in test
suites. For example, if the specification describes different treatment
in two cases, but the test suite does not check that the two cases
are in fact treated differently, then we may conclude that the test
suite is inadequate to guard against faults in the program logic. If no
test in the test suite executes a particular program statement, we
might similarly conclude that the test suite is inadequate to guard
against faults in that statement. We may use a whole set of
(in)adequacy criteria, each of which draws on some source of
information about the program and imposes a set of obligations that
an adequate set of test cases ought to satisfy. If a test suite fails to
satisfy some criterion, the obligation that has not been satisfied may
provide some useful information about improving the test suite. If a
set of test cases satisfies all the obligations by all the criteria, we still
do not know definitively that it is a well-designed and effective test
suite, but we have at least some evidence of its thoroughness.
Throughout this book we use the terms defined below in a consistent manner. Unfortunately, the terms we will need are not always used consistently in the literature, despite the existence of an IEEE standard that defines several of them. The terms we will use are defined as follows.
Test case A test case is a set of inputs, execution conditions, and a pass/fail
criterion. (This usage follows the IEEE standard.)
Test case specification A test case specification is a requirement to be satisfied by
one or more actual test cases. (This usage follows the IEEE standard.)
Test obligation A test obligation is a partial test case specification, requiring some
property deemed important to thorough testing. We use the term obligation to
distinguish the requirements imposed by a test adequacy criterion from more
complete test case specifications.
Test suite A test suite is a set of test cases. Typically, a method for functional
testing is concerned with creating a test suite. A test suite for a program, system, or
individual unit may be made up of several test suites for individual modules,
subsystems, or features. (This usage follows the IEEE standard.)
Test or test execution We use the term test or test execution to refer to the activity
of executing test cases and evaluating their results. When we refer to "a test," we
mean execution of a single test case, except where context makes it clear that the
reference is to execution of a whole test suite. (The IEEE standard allows this and
other definitions.)
Adequacy criterion A test adequacy criterion is a predicate that is true (satisfied) or false (not satisfied) of a ⟨program, test suite⟩ pair. Usually a test adequacy
criterion is expressed in the form of a rule for deriving a set of test obligations from
another artifact, such as a program or specification. The adequacy criterion is then
satisfied if every test obligation is satisfied by at least one test case in the suite.
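To make the definition concrete, the following minimal sketch (hypothetical types and names, not from the book or any standard library) treats an adequacy criterion as a rule that derives test obligations from an artifact and checks that each obligation is satisfied by at least one test case in the suite.

import java.util.List;
import java.util.function.Function;

public class AdequacySketch {
    interface TestCase { String input(); }                       // hypothetical, simplified test case
    interface Obligation { boolean satisfiedBy(TestCase tc); }

    // The criterion is satisfied iff every derived obligation is met by some test case.
    static boolean adequate(String artifact,
                            Function<String, List<Obligation>> obligationRule,
                            List<TestCase> suite) {
        for (Obligation ob : obligationRule.apply(artifact)) {
            if (suite.stream().noneMatch(ob::satisfiedBy)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Toy rule: regardless of the artifact, require one empty and one non-empty input.
        Function<String, List<Obligation>> rule = src -> List.of(
                tc -> tc.input().isEmpty(),
                tc -> !tc.input().isEmpty());
        List<TestCase> suite = List.of(() -> "", () -> "some text");
        System.out.println(adequate("artifact under test", rule, suite));   // prints true
    }
}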
Test specifications drawn from program source code require coverage of particular
elements in the source code or some model derived from it. For example, we might
require a test case that traverses a loop one or more times. The general term for testing
based on program structure is structural testing, although the term white-box testing or
glass-box testing is sometimes used.
Previously encountered faults can be an important source of information regarding useful
test cases. For example, if previous products have encountered failures or security
breaches due to buffer overflows, we may formulate test requirements specifically to
check handling of inputs that are too large to fit in provided buffers. These fault-based
test specifications usually draw also from interface specifications, design models, or
source code, but add test requirements that might not have been otherwise considered.
A common form of fault-based testing is fault-seeding, purposely inserting faults in
source code and then measuring the effectiveness of a test suite in finding the seeded
faults, on the theory that a test suite that finds seeded faults is likely also to find other
faults.
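As a toy illustration of fault seeding (a hypothetical example, not taken from the book), we can plant a small off-by-one fault in a buffer-capacity check and ask whether an existing test suite distinguishes the seeded version from the original; a suite that includes the boundary length detects it, one that does not misses it.

public class FaultSeedingDemo {
    // Original check: the string fits if its length does not exceed the buffer capacity.
    static boolean fits(String s, int capacity) {
        return s.length() <= capacity;
    }

    // Seeded fault: '<=' weakened to '<' (an off-by-one boundary fault).
    static boolean fitsSeeded(String s, int capacity) {
        return s.length() < capacity;
    }

    public static void main(String[] args) {
        String[] suite = { "", "abc", "abcdefgh" };   // hypothetical test inputs, capacity 8
        int capacity = 8;
        boolean detected = false;
        for (String t : suite) {
            if (fits(t, capacity) != fitsSeeded(t, capacity)) {
                System.out.println("seeded fault detected by input of length " + t.length());
                detected = true;
            }
        }
        if (!detected) System.out.println("test suite misses the seeded fault");
    }
}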
Test specifications need not fall cleanly into just one of the categories. For example, test specifications drawn from a model of a program might be considered specification-based if the model is produced during program design, or structural if it is derived from the program source code.
Consider the Java method of Figure 9.1. We might apply a general rule that requires
using an empty sequence wherever a sequence appears as an input; we would thus
create a test case specification (a test obligation) that requires the empty string as
input.[1] If we are selecting test cases structurally, we might create a test obligation that
requires the first clause of the if statement on line 15 to evaluate to true and the second
clause to evaluate to false, and another test obligation on which it is the second clause
that must evaluate to true and the first that must evaluate to false.
 1  /**
 2   * Remove/collapse multiple spaces.
 3   *
 4   * @param String string to remove multiple spaces from.
 5   * @return String
 6   */
 7  public static String collapseSpaces(String argStr)
 8  {
 9      char last = argStr.charAt(0);
10      StringBuffer argBuf = new StringBuffer();
11
12      for (int cIdx = 0; cIdx < argStr.length(); cIdx++)
13      {
14          char ch = argStr.charAt(cIdx);
15          if (ch != ' ' || last != ' ')
16          {
17              argBuf.append(ch);
18              last = ch;
19          }
20      }
21
22      return argBuf.toString();
23  }
Figure 9.1: A Java method for collapsing sequences of blanks, excerpted from the
StringUtils class of Velocity version 1.3.1, an Apache Jakarta project. Apache
Group, used by permission.
[1] Constructing and using catalogs of general rules like this is described in Chapter 10.
A criterion that requires each statement of the method to be executed can be met by a test suite containing a single test case, with the input value (value of argStr) being "doesn'tEvenHaveSpaces." Requiring both the true and false branches of each
test to be taken subsumes the previous criterion and forces us to at least provide an
input with a space that is not copied to the output, but it can still be satisfied by a suite
with just one test case. We might add a requirement that the loop be iterated zero
times, once, and several times, thus requiring a test suite with at least three test cases.
The obligation to execute the loop body zero times would force us to add a test case
with the empty string as input, and like the specification-based obligation to consider an
empty sequence, this would reveal a fault in the code.
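A minimal driver (hypothetical, not part of the book) makes the empty-sequence obligation concrete: with the collapseSpaces method of Figure 9.1, copied below for self-containment, the call argStr.charAt(0) fails on an empty string, so this single test case immediately reveals the fault.

public class CollapseSpacesEmptyInputTest {
    // Copied from Figure 9.1 (StringUtils.collapseSpaces of Velocity 1.3.1).
    static String collapseSpaces(String argStr) {
        char last = argStr.charAt(0);        // line 9 of the figure: fails if argStr is empty
        StringBuffer argBuf = new StringBuffer();
        for (int cIdx = 0; cIdx < argStr.length(); cIdx++) {
            char ch = argStr.charAt(cIdx);
            if (ch != ' ' || last != ' ') {
                argBuf.append(ch);
                last = ch;
            }
        }
        return argBuf.toString();
    }

    public static void main(String[] args) {
        try {
            System.out.println("\"" + collapseSpaces("") + "\"");
            System.out.println("empty-string obligation satisfied without failure");
        } catch (StringIndexOutOfBoundsException e) {
            // charAt(0) on the empty string throws, revealing the fault.
            System.out.println("empty-string test case reveals a fault: " + e);
        }
    }
}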
Should we consider a more demanding adequacy criterion, as indicated by the
subsumes relation among criteria, to be a better criterion? The answer would be "yes" if
we were comparing the guarantees provided by test adequacy criteria: If criterion A
subsumes criterion B, and if any test suite satisfying B in some program is guaranteed to
find a particular fault, then any test suite satisfying A is guaranteed to find the same fault
in the program. This is not as good as it sounds, though. Twice nothing is nothing.
Adequacy criteria do not provide useful guarantees for fault detection, so comparing
guarantees is not a useful way to compare criteria.
A better statistical measure of test effectiveness is whether the probability of finding at
least one program fault is greater when using one test coverage criterion than another.
Of course, such statistical measures can be misleading if some test coverage criteria
require much larger numbers of test cases than others. It is hardly surprising if a
criterion that requires at least 300 test cases for program P is more effective, on
average, than a criterion that requires at least 50 test cases for the same program. It
would be better to know, if we have 50 test cases that satisfy criterion B, is there any
value in finding 250 test cases to finish satisfying the "stronger" criterion A, or would it
be just as profitable to choose the additional 250 test cases at random?
Although theory does not provide much guidance, empirical studies of particular test
adequacy criteria do suggest that there is value in pursuing stronger criteria, particularly
when the level of coverage attained is very high. Whether the extra value of pursuing a
stronger adequacy criterion is commensurate with the cost almost certainly depends on
a plethora of particulars, and can only be determined by monitoring results in individual
organizations.
It is likely that future theoretical progress must begin with a quite different conception of the fundamental goals of a theory of test adequacy.
The trend in research is toward empirical, rather than theoretical, comparison of the
effectiveness of particular test selection techniques and test adequacy criteria. Empirical
approaches to measuring and comparing effectiveness are still at an early stage. A
major open problem is to determine when, and to what extent, the results of an empirical
assessment can be expected to generalize beyond the particular programs and test
suites used in the investigation. While empirical studies have to a large extent displaced
theoretical investigation of test effectiveness, in the longer term useful empirical
investigation will require its own theoretical framework.
Further Reading
Goodenough and Gerhart made the original attempt to formulate a theory of "adequate"
testing [GG75]; Weyuker and Ostrand extended this theory to consider when a set of
test obligations is adequate to ensure that a program fault is revealed [WO80].
Gourlay's exposition of a mathematical framework for adequacy criteria is among the
most lucid developments of purely analytic characterizations [Gou83]. Hamlet and Taylor
show that, if one takes statistical confidence in (absolute) program correctness as the
goal, none of the standard coverage testing techniques improve on random testing
[HT90], from which an appropriate conclusion is that confidence in absolute correctness
is not a reasonable goal of systematic testing. Frankl and Iakounenko's study of test
effectiveness [FI98] is a good example of the development of empirical methods for
assessing the practical effectiveness of test adequacy criteria.
Related Topics
Test adequacy criteria and test selection techniques can be categorized by the sources
of information they draw from. Functional testing draws from program and system
specifications, and is described in Chapters 10, 11, and 14. Structural testing draws
from the structure of the program or system, and is described in Chapters 12 and 13.
The techniques for testing object-oriented software described in Chapter 15 draw on
both functional and structural approaches. Selection and adequacy criteria based on
consideration of hypothetical program faults are described in Chapter 16.
Exercises
9.1 Deterministic finite state machines (FSMs), with states representing classes of program states and transitions representing external inputs and observable program actions or outputs, are sometimes used in modeling system requirements. We can design test cases consisting of sequences of program inputs that trigger FSM transitions and the predicted program actions expected in response. We can also define test coverage criteria relative to such a model. Which of the following coverage criteria subsume which others?
State coverage For each state in the FSM model, there is a test case that visits that state.
Transition coverage For each transition in the FSM model, there is a test case that traverses that transition.
Path coverage For all finite-length subpaths from a distinguished start state in the FSM model, there is at least one test case that includes a corresponding subpath.
State-pair coverage For each state r in the FSM model, for each state s reachable from r along some sequence of transitions, there is at least one test case that passes through state r and then reaches state s.
9.2 Adequacy criteria may be derived from specifications (functional criteria) or code (structural criteria). The presence of infeasible elements in a program may make it impossible to obtain 100% coverage. Since we cannot possibly cover infeasible elements, we might define a coverage criterion to require 100% coverage of feasible elements (e.g., execution of all program statements that can actually be reached in program execution). We have noted that feasibility of program elements is undecidable in general. Suppose we instead are using a functional test adequacy criterion, based on logical conditions describing inputs and outputs. It is still possible to have infeasible elements (logical condition A might be inconsistent with logical condition B, making the conjunction A ∧ B infeasible). Would you expect distinguishing feasible from infeasible elements to be easier or harder for functional criteria, compared to structural criteria? Why?
9.3 Suppose test suite A satisfies adequacy criterion C1. Test suite B satisfies adequacy criterion C2, and C2 subsumes C1. Can we be certain that faults revealed by A will also be revealed by B?
10.1 Overview
In testing and analysis aimed at verification[2] - that is, at finding any discrepancies
between what a program does and what it is intended to do - one must obviously refer
to requirements as expressed by users and specified by software engineers. A
functional specification, that is, a description of the expected behavior of the program, is
the primary source of information for test case specification.
Functional testing, also known as black-box or specification-based testing, denotes
techniques that derive test cases from functional specifications. Usually functional testing
techniques produce test case specifications that identify classes of test cases and are
instantiated to produce individual test cases.
The core of functional test case design is partitioning[3] the possible behaviors of the
program into a finite number of homogeneous classes, where each such class can
reasonably be expected to be consistently correct or incorrect. In practice, the test case
designer often must also complete the job of formalizing the specification far enough to
serve as the basis for identifying classes of behaviors. An important side benefit of test
design is highlighting the weaknesses and incompleteness of program specifications.
Deriving functional test cases is an analytical process that decomposes specifications
into test cases. The myriad aspects that must be taken into account during functional
test case specification makes the process error prone. Even expert test designers can
miss important test cases. A methodology for functional test design helps by
decomposing the functional test design process into elementary steps. In this way, it is
possible to control the complexity of the process and to separate human intensive
activities from activities that can be automated.
Sometimes, functional testing can be fully automated. This is possible, for example,
when specifications are given in terms of some formal model, such as a grammar or an
extended state machine specification. In these (exceptional) cases, the creative work is
performed during specification and design of the software. The test designer's job is
then limited to the choice of the test selection criteria, which defines the strategy for
generating test case specifications. In most cases, however, functional test design is a
human intensive activity. For example, when test designers must work from informal
specifications written in natural language, much of the work is in structuring the
specification adequately for identifying test cases.
[1] We use the term program generically for the artifact under test, whether that artifact is a complete application or an individual unit together with a test harness. This is consistent with usage in the testing research literature.
[2] Here we focus on verification rather than validation. Validation, that is, checking the program behavior and its specifications with respect to the users' expectations, is treated in Chapter 22.
[3] We are using the term partition in a common but rather sloppy sense. A true partition would form disjoint classes, the union of which is the entire space. Partition testing separates the behaviors or input space into classes whose union is the entire space, but the classes may not be disjoint.
Neither structural nor fault-based criteria would be likely to suggest such a missing potential input, even if "missing case" were included in the catalog of potential faults.
Functional specifications often address semantically rich domains, and we can use domain
information in addition to the cases explicitly enumerated in the program specification. For
example, while a program may manipulate a string of up to nine alphanumeric characters,
the program specification may reveal that these characters represent a postal code, which
immediately suggests test cases based on postal codes of various localities. Suppose the
program logic distinguishes only two cases, depending on whether they are found in a table
of U.S. zip codes. A structural testing criterion would require testing of valid and invalid U.S.
zip codes, but only consideration of the specification and richer knowledge of the domain
would suggest test cases that reveal missing logic for distinguishing between U.S.-bound
mail with invalid U.S. zip codes and mail bound for other countries.
Functional testing can be applied at any level of granularity where some form of
specification is available, from overall system testing to individual units, although the level of
granularity and the type of software influence the choice of the specification styles and
notations, and consequently the functional testing techniques that can be used.
In contrast, structural and fault-based testing techniques are invariably tied to program
structures at some particular level of granularity and do not scale much beyond that level.
The most common structural testing techniques are tied to fine-grain program structures
(statements, classes, etc.) and are applicable only at the level of modules or small
collections of modules (small subsystems, components, or libraries).
As an extreme example, suppose we are allowed to select only three test cases for a
program that breaks a text buffer into lines of 60 characters each. Suppose the first test
case is a buffer containing 40 characters, and the second is a buffer containing 30
characters. As a final test case, we can choose a buffer containing 16 characters or a
buffer containing 100 characters. Although we cannot prove that the 100-character buffer is
the better test case (and it might not be; the fact that 16 is a power of 2 might have some
unforeseen significance), we are naturally suspicious of a set of tests that is strongly biased
toward lengths less than 60.
Figure 10.1: The Java class roots, which finds roots of a quadratic equation (code listing not reproduced here). The case analysis in the implementation is incomplete: It does not properly handle the case in which b^2 - 4ac = 0 and a = 0.
We cannot anticipate all such faults, but experience teaches that
boundary values identifiable in a specification are disproportionately
valuable. Uniform random generation of even large numbers of test
cases is ineffective at finding the fault in this program, but selection
of a few "special values" based on the specification quickly uncovers
it.
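Since the listing is omitted here, the following purely hypothetical sketch (not the book's Figure 10.1 code) shows the kind of incomplete case analysis the caption describes: the degenerate case a = 0 is handled on one path but not where the discriminant is zero, so only a handful of "special" inputs expose the problem.

public class Roots {
    double root1, root2;
    int numRoots;   // 0, 1, or 2 real roots

    void calculate(double a, double b, double c) {
        double disc = b * b - 4 * a * c;
        if (disc < 0) {
            numRoots = 0;                              // no real roots
        } else if (disc == 0) {
            root1 = root2 = -b / (2 * a);              // fault: when a == 0 this yields NaN
            numRoots = 1;                              // instead of handling the degenerate case
        } else if (a != 0) {
            root1 = (-b + Math.sqrt(disc)) / (2 * a);
            root2 = (-b - Math.sqrt(disc)) / (2 * a);
            numRoots = 2;
        } else {
            root1 = root2 = -c / b;                    // linear case bx + c = 0 handled here
            numRoots = 1;
        }
    }
}

Uniform random sampling of the real-valued inputs essentially never produces a = 0 with a zero discriminant, whereas a specification-based "special value" does so immediately.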
Of course, it is unlikely that anyone would test only with random values. Regardless of the
overall testing strategy, most test designers will also try some "special" values. The test
designer's intuition comports with the observation that random sampling is an ineffective
way to find singularities in a large input space. The observation about singularities can be
generalized to any characteristic of input data that defines an infinitesimally small portion of
the complete input data space. If again we have just three real-valued inputs a, b, and c,
there is an infinite number of choices for which b = c, but random sampling is unlikely to
generate any of them because they are an infinitesimal part of the complete input data
space.
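A small throwaway experiment (hypothetical, not from the book) makes the point quantitative: sampling the three inputs uniformly at random essentially never produces the singular relations a = 0 or b = c.

import java.util.Random;

public class RandomSamplingSingularities {
    public static void main(String[] args) {
        Random rnd = new Random(42);
        int trials = 10_000_000;
        int hits = 0;
        for (int i = 0; i < trials; i++) {
            // Arbitrary input range; any continuous range makes the same point.
            double a = rnd.nextDouble() * 200 - 100;
            double b = rnd.nextDouble() * 200 - 100;
            double c = rnd.nextDouble() * 200 - 100;
            if (a == 0.0 || b == c) hits++;            // singular cases of interest
        }
        System.out.println(hits + " singular samples out of " + trials);   // almost surely 0
    }
}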
The observation about special values and random samples is by no means limited to
numbers. Consider again, for example, breaking a text buffer into lines. Since line breaks
are permitted at blanks, we would consider blanks a "special" value for this problem. While
random sampling from the character set is likely to produce a buffer containing a sequence
of at least 60 nonblank characters, it is much less likely to produce a sequence of 60
blanks.
Partition testing typically increases the cost of each test case, since
in addition to generation of a set of classes, creation of test cases
from each class may be more expensive than generating random
test data. In consequence, partition testing usually produces fewer
test cases than random testing for the same expenditure of time
and money. Partitioning can therefore be advantageous only if the
average value (fault detection effectiveness) is greater.
If we were able to group together test cases with such perfect knowledge that the outcome
of test cases in each class were uniform (either all successes or all failures), then partition
testing would be at its theoretical best. In general we cannot do that, nor can we even
quantify the uniformity of classes of test cases. Partitioning by any means, including
specification-based partition testing, is always based on experience and judgment that
leads one to believe that certain classes of test case are "more alike" than others, in the
sense that failure-prone test cases are likely to be concentrated in some classes. When we
appealed earlier to the test designer's intuition that one should try boundary cases and
special values, we were actually appealing to a combination of experience (many failures
occur at boundary and special cases) and knowledge that identifiable cases in the
specification often correspond to classes of input that require different treatment by an
implementation.
Given a fixed budget, the optimum may not lie in only partition testing or only random
testing, but in some mix that makes use of available knowledge. For example, consider
again the simple numeric problem with three inputs, a, b, and c. We might consider a few
special cases of each input, individually and in combination, and we might consider also a
few potentially significant relationships (e.g., a = b). If no faults are revealed by these few
test cases, there is little point in producing further arbitrary partitions - one might then turn
to random generation of a large number of test cases.
[4] Note that the relative value of different test cases would be quite
Identification of functional features that can be tested separately is different from module
decomposition. In both cases we apply the divide and conquer principle, but in the
former case, we partition specifications according to the functional behavior as
perceived by the users of the software under test,[5] while in the latter, we identify logical
units that can be implemented separately. For example, a Web site may require a sort
function, as a service routine, that does not correspond to an external functionality. The
sort function may be a functional feature at module testing, when the program under test
is the sort function itself, but is not a functional feature at system test, while deriving test
cases from the specifications of the whole Web site. On the other hand, the registration
of a new user profile can be identified as one of the functional features at system-level
testing, even if such functionality is spread across several modules. Thus, identifying
functional features does not correspond to identifying single modules at the design level,
but rather to suitably slicing the specifications to attack their complexity incrementally.
Independently testable features are described by identifying all the inputs that form their
execution environments. Inputs may be given in different forms depending on the notation
used to express the specifications. In some cases they may be easily identifiable. For
example, they can be the input alphabet of a finite state machine specifying the behavior
of the system. In other cases, they may be hidden in the specification. This is often the
case for informal specifications, where some inputs may be given explicitly as
parameters of the functional unit, but other inputs may be left implicit in the description.
For example, a description of how a new user registers at a Web site may explicitly
indicate the data that constitutes the user profile to be inserted as parameters of the
functional unit, but may leave implicit the collection of elements (e.g., database) in which
the new profile must be inserted.
Trying to identify inputs may help in distinguishing different functions. For example, trying
to identify the inputs of a graphical tool may lead to a clearer distinction between the
graphical interface per se and the associated callbacks to the application. With respect
to the Web-based user registration function, the data to be inserted in the database are
part of the execution environment of the functional unit that performs the insertion of the
user profile, while the combination of fields that can be used to construct such data is
part of the execution environment of the functional unit that takes care of the
management of the specific graphical interface.
Identify Representative Classes of Values or Derive a Model The execution
environment of the feature under test determines the form of the final test cases, which
are given as combinations of values for the inputs to the unit. The next step of a testing
process consists of identifying which values of each input should be selected to form test
cases. Representative values can be identified directly from informal specifications
expressed in natural language. Alternatively, representative values may be selected
indirectly through a model, which can either be produced only for the sake of testing or
be available as part of the specification. In both cases, the aim of this step is to identify
the values for each input in isolation, either explicitly through enumeration or implicitly
through a suitable model, but not to select suitable combinations of such values (i.e., test
case specifications). In this way, we separate the problem of identifying the
representative values for each input from the problem of combining them to obtain
meaningful test cases, thus splitting a complex step into two simpler steps.
Most methods that can be applied to informal specifications rely on explicit enumeration of representative values by the test designer.
Consider the input of a procedure that searches for occurrences of a complex pattern in
a Web database. Its input may be characterized by the length of the pattern and the
presence of special characters in the pattern, among other aspects. Interesting values
for the length of the pattern may be zero, one, or many. Interesting values for the
presence of special characters may be zero, one, or many. However, the combination of
value "zero" for the length of the pattern and value "many" for the number of special
characters in the pattern is clearly impossible.
The test case specifications represented by the Cartesian product of all possible inputs
must be restricted by ruling out illegal combinations and selecting a practical subset of
the legal combinations. Illegal combinations are usually eliminated by constraining the set
of combinations. For example, in the case of the complex pattern presented above, we
can constrain the choice of one or more special characters to a positive length of the
pattern, thus ruling out the illegal cases of patterns of length zero containing special
characters.
Selection of a practical subset of legal combinations can be done by adding information
that reflects the hazard of the different combinations as perceived by the test designer
or by following combinatorial considerations. In the former case, for example, we can
identify exceptional values and limit the combinations that contain such values. In the
pattern example, we may consider only one test for patterns of length zero, thus
eliminating many combinations that would otherwise be derived for patterns of length
zero. Combinatorial considerations reduce the set of test cases by limiting the number of
combinations of values of different inputs to a subset of the inputs. For example, we can
generate only tests that exhaustively cover all combinations of values for inputs
considered pair by pair.
Depending on the technique used to reduce the space represented by the Cartesian
product, we may be able to estimate the number of test cases generated and
modify the selected subset of test cases according to budget considerations. Subsets of
combinations of values (i.e., potential special cases) can often be derived from models
of behavior by applying suitable test selection criteria that identify subsets of interesting
behaviors among all behaviors represented by a model, for example by constraining the
iterations on simple elements of the model itself. In many cases, test selection criteria
can be applied automatically.
Generate Test Cases and Instantiate Tests The test generation process is completed
by turning test case specifications into test cases and instantiating them. Test case
specifications can be turned into test cases by selecting one or more test cases for
each test case specification. Test cases are implemented by creating the scaffolding
required for their execution.
[5] Here the word "user" designates the individual using the specified service. It can be the user of the system, when dealing with a system specification, but it can be another module of the system, when dealing with a module or unit specification.
In choosing an approach, it is important to evaluate all relevant costs. For example, generating a large
number of random test cases may necessitate design and construction of sophisticated
test oracles, or the cost of training to use a new tool may exceed the advantages of
adopting a new approach.
Scaffolding costs Each test case specification must be converted to a concrete test
case, executed many times over the course of development, and checked each time for
correctness. If generic scaffolding code required to generate, execute, and judge the
outcome of a large number of test cases can be written just once, then a combinatorial
approach that generates a large number of test case specifications is likely to be
affordable. If each test case must be realized in the form of scaffolding code written by
hand - or worse, if test execution requires human involvement - then it is necessary to
invest more care in selecting small suites of test case specifications.
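The point about writing generic scaffolding once can be made concrete with a small hypothetical harness (not from the book): a single driver that runs any number of generated inputs through the unit under test and judges each outcome with an oracle supplied as a function.

import java.util.List;
import java.util.function.BiPredicate;
import java.util.function.Function;

public class GenericScaffolding<I, O> {
    private final Function<I, O> unitUnderTest;
    private final BiPredicate<I, O> oracle;    // judges whether the output is acceptable

    GenericScaffolding(Function<I, O> unitUnderTest, BiPredicate<I, O> oracle) {
        this.unitUnderTest = unitUnderTest;
        this.oracle = oracle;
    }

    int runAll(List<I> generatedInputs) {
        int failures = 0;
        for (I input : generatedInputs) {
            O output = unitUnderTest.apply(input);
            if (!oracle.test(input, output)) {
                System.out.println("FAIL on input: " + input);
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // Example: binary encoding as the unit under test, decoding back as a partial oracle.
        GenericScaffolding<Integer, String> harness = new GenericScaffolding<>(
                Integer::toBinaryString,
                (in, out) -> Integer.parseInt(out, 2) == in);
        int failures = harness.runAll(List.of(0, 1, 42, 255));
        System.out.println(failures + " failures");    // prints: 0 failures
    }
}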
Many engineering activities require careful analysis of trade-offs. Functional testing is no
exception: Successfully balancing the many aspects is a difficult and often
underestimated problem that requires skilled designers. Functional testing is not an
exercise of choosing the optimal approach, but a complex set of activities for finding a
suitable combination of models and techniques that yield a set of test cases to satisfy
cost and quality constraints. This balancing extends beyond test design to software
design for test. Appropriate design not only improves the software development
process, but can greatly facilitate the job of test designers and lead to substantial
savings.
As illustrated in this chapter, test cases can be derived in two broad ways, either by
identifying representative values or by deriving a model of the unit under test. A
variety of formal models could be used in testing. The research challenge lies in
identifying a trade-off between costs of creating formal models and savings in
automatically generating test cases.
Development of a general framework for deriving test cases from a range of
formal specifications. Currently research addresses techniques for generating
test cases from individual formal methods. Generalization of techniques will allow
more combinations of formal methods and testing.
Another important research area is fed by interest in different specification and design
paradigms (e.g., software architectures, software design patterns, and service- oriented
applications). Often these approaches employ new graphical or textual notations.
Research is active in investigating different approaches to automatically or
semiautomatically deriving test cases from these artifacts and studying the effectiveness
of existing test case generation techniques.
Increasing size and complexity of software systems is a challenge to testing. Existing
functional testing techniques do not take advantage of test cases available for parts of
the artifact under test. Compositional approaches that derive test cases for a given system by taking advantage of test cases available for its subsystems are an important open research problem.
Further Reading
Functional testing techniques, sometimes called black-box testing or specification-based
testing, are presented and discussed by several authors. Ntafos [DN81] makes the case
for random rather than systematic testing; Frankl, Hamlet, Littlewood, and Strigini
[FHLS98] is a good starting point to the more recent literature considering the relative
merits of systematic and statistical approaches.
Related topics
Readers interested in practical techniques for deriving functional test specifications from
informal specifications and models may continue with the next two chapters, which
describe several functional testing techniques. Readers interested in the
complementarities between functional and structural testing may continue with Chapters
12 and 13, which describe structural and data flow testing.
Exercises
10.1 In the Extreme Programming (XP) methodology (see the sidebar on page 381), a written description of a desired feature may be a single sentence, and the first step to designing the implementation of that feature is designing and implementing a set of test cases. Does this aspect of the XP methodology contradict our assertion that test cases are a formalization of specifications?
10.2
1. Compute the probability of selecting a test case that reveals the fault in line 19 of program Root of Figure 10.1 by randomly sampling the input domain, assuming that type double has range -2^31 ... 2^31 - 1.
2. Compute the probability of randomly selecting a test case that reveals a fault if lines 13 and 19 were both missing the condition a ≠ 0.
Required Background
Chapter 10
Understanding the limits of random testing and the needs of a systematic
approach motivates the study of combinatorial as well as model-based testing
techniques. The general functional testing process illustrated in Section 10.3
helps position combinatorial techniques within the functional testing process.
11.1 Overview
In this chapter, we introduce three main techniques that are successfully used in
industrial environments and represent modern approaches to systematically derive test
cases from natural language specifications: the category-partition approach to identifying
attributes, relevant values, and possible combinations; combinatorial sampling to test a
large number of potential interactions of attributes with a relatively small number of
inputs; and provision of catalogs to systematize the manual aspects of combinatorial
testing.
The category-partition approach separates identification of the values that characterize
the input space from the combination of different values into complete test cases. It
provides a means of estimating the number of test cases early, sizing a subset of cases to meet cost constraints, and monitoring testing progress.
Pairwise and n-way combination testing provide systematic ways to cover interactions
among particular attributes of the program input space with a relatively small number of
test cases. Like the category-partition method, these techniques separate identification of characteristic values from generation of combinations, but they provide greater control
over the number of combinations generated.
The manual step of identifying attributes and representative sets of values can be made
more systematic using catalogs that aggregate and synthesize the experience of test
designers in a particular organization or application domain. Some repetitive steps can
be automated, and the catalogs facilitate training for the inherently manual parts.
These techniques address different aspects and problems in designing a suite of test
cases from a functional specification. While one or another may be most suitable for a
specification with given characteristics, it is also possible to combine ideas from each.
An experienced test designer has seen enough failures on "degenerate" inputs to test empty collections wherever a collection is allowed.
The number of options that can (or must) be configured for a particular model of
computer may vary from model to model. However, the category-partition method
makes no direct provision for structured data, such as sets of ⟨slot, selection⟩ pairs. A
typical approach is to "flatten" collections and describe characteristics of the whole
collection as parameter characteristics. Typically the size of the collection (the length of
a string, for example, or in this case the number of required or optional slots) is one
characteristic, and descriptions of possible combinations of elements (occurrence of
special characters in a string, for example, or in this case the selection of required and
optional components) are separate parameter characteristics.
Suppose the only significant variation among ⟨slot, selection⟩ pairs was between pairs that are compatible and pairs that are incompatible. If we treated each pair as a separate characteristic, and assumed n slots, the category-partition method would generate all 2^n combinations of compatible and incompatible slots. Thus we might have a
test case in which the first selected option is compatible, the second is compatible, and
the third incompatible, and a different test case in which the first is compatible but the
second and third are incompatible, and so on. Each of these combinations could be
combined in several ways with other parameter characteristics. The number of
combinations quickly explodes. Moreover, since the number of slots is not actually fixed,
we cannot even place an upper bound on the number of combinations that must be
considered. We will therefore choose the flattening approach and select possible
patterns for the collection as a whole.
Identifying and Bounding Variation
It may seem that drawing a boundary between a fixed program and a variable set of
parameters would be the simplest of tasks for the test designer. It is not always so.
Consider a program that produces HTML output. Perhaps the HTML is based on a
template, which might be encoded in constants in C or Java code, or might be
provided through an external data file, or perhaps both: it could be encoded in a C or Java source code file that is generated at compile time from a data file. If the HTML template is identified in one case as a parameter to be varied in testing, it seems it
should be so identified in all three of these variations, or even if the HTML template is
embedded directly in print statements of the program, or in an XSLT transformation
script.
The underlying principle for identifying parameters to be varied in testing is
anticipation of variation in use. Anticipating variation is likewise a key part of
architectural and detailed design of software. In a well-designed software system,
module boundaries reflect "design secrets," permitting one part of a system to be
modified (and retested) with minimum impact on other parts. The most frequent
changes are facilitated by making them input or configurable options. The best
software designers identify and document not only what is likely to change, but how
often and by whom. For example, a configuration or template file that may be
modified by a user will be clearly distinguished from one that is considered a fixed
part of the system.
Ideally the scope of anticipated change is both clearly documented and consonant
with the program design. For example, we expect to see client-customizable aspects
of HTML output clearly isolated and documented in a configuration file, not embedded
in an XSLT script file and certainly not scattered about in print statements in the code.
Thus, the choice to encode something as "data" rather than "program" should at least
be a good hint that it may be a parameter for testing, although further consideration
of the scope of variation may be necessary. Conversely, defining the parameters for
variation in test design can be part of the architectural design process of setting the
scope of variation anticipated for a given product or release.
Should the representative values of the flattened collection of pairs be one compatible
selection, one incompatible selection, all compatible selections, all incompatible
selections, or should we also include a mix of 2 or more compatible and 2 or more
incompatible selections? Certainly the latter is more thorough, but whether there is
sufficient value to justify the cost of this thoroughness is a matter of judgment by the test
designer.
We have oversimplified by considering only whether a selection is compatible with a slot.
It might also happen that the selection does not appear in the database. Moreover, the
selection might be incompatible with the model, or with a selected component of another
slot, in addition to the possibility that it is incompatible with the slot for which it has been
selected. If we treat each such possibility as a separate parameter characteristic, we
will generate many combinations, and we will need semantic constraints to rule out
combinations like there are three options, at least two of which are compatible with the
model and two of which are not, and none of which appears in the database. On the
other hand, if we simply enumerate the combinations that do make sense and are worth
testing, then it becomes more difficult to be sure that no important combinations have
been omitted. Like all design decisions, the way in which collections and complex data
are broken into parameter characteristics requires judgment based on a combination of
analysis and experience.
B. Identify Representative Values This step consists of identifying a list of
representative values (more precisely, a list of classes of values) for each of the
parameter characteristics identified during step A. Representative values should be
identified for each category independently, ignoring possible interactions among values for different categories, which are considered in the next step.
Table 11.1: Parameter characteristics and value classes for the computer configuration example. Constraints, discussed in step C, are shown in square brackets.

Parameter: Model
  Model number
    malformed                                 [error]
    not in database                           [error]
    valid
  Number of required slots for selected model (#SMRS)
    0                                         [single]
    1                                         [property RSNE], [single]
    many                                      [property RSNE], [property RSMANY]
  Number of optional slots for selected model (#SMOS)
    0                                         [single]
    1                                         [property OSNE], [single]
    many                                      [property OSNE], [property OSMANY]

Parameter: Components
  Correspondence of selection with model slots
    omitted slots                             [error]
    extra slots                               [error]
    mismatched slots                          [error]
    complete correspondence
  Number of required components with non-empty selection
    0                                         [if RSNE], [error]
    < number of required slots                [if RSNE], [error]
    = number of required slots                [if RSMANY]
  Number of optional components with non-empty selection
    0
    < number of optional slots                [if OSNE]
    = number of optional slots                [if OSMANY]
  Required component selection
    some default                              [single]
    all valid
    ≥ 1 incompatible with slot
    ≥ 1 incompatible with another selection
    ≥ 1 incompatible with model
    ≥ 1 not in database                       [error]
  Optional component selection
    some default                              [single]
    all valid
    ≥ 1 incompatible with slot
    ≥ 1 incompatible with another selection
    ≥ 1 incompatible with model
    ≥ 1 not in database                       [error]

Parameter: Product database
  Number of models in database
    0                                         [error]
    1                                         [single]
    many
  Number of components in database
    0                                         [error]
    1                                         [single]
    many
Identifying relevant values is an important but tedious task. Test designers may improve
manual selection of relevant values by using the catalog approach described in Section
11.4, which captures the informal approaches used in this section with a systematic
application of catalog entries.
C. Generate Test Case Specifications A test case specification for a feature is given
as a combination of value classes, one for each identified parameter characteristic.
Unfortunately, the simple combination of all possible value classes for each parameter
characteristic results in an unmanageable number of test cases (many of which are
impossible) even for simple specifications. For example, in Table 11.1 we find 7 categories with 3 value classes, 2 categories with 6 value classes, and one with four value classes, potentially resulting in 3^7 × 6^2 × 4 = 314,928 test cases, which would be
acceptable only if the cost of executing and checking each individual test case were very
small. However, not all combinations of value classes correspond to reasonable test
case specifications. For example, it is not possible to create a test case from a test
case specification requiring a valid model (a model appearing in the database) where
the database contains zero models.
The category-partition method allows one to omit some combinations by indicating value
classes that need not be combined with all other values. The label [error] indicates a
value class that need be tried only once, in combination with non-error values of other
parameters. When [error] constraints are considered in the category-partition
specification of Table 11.1, the number of combinations to be considered is reduced to
1 × 3 × 3 × 1 × 1 × 3 × 5 × 5 × 2 × 2 + 11 = 2711. Note that we have treated "component not in
database" as an error case, but have treated "incompatible with slot" as a normal case
of an invalid configuration; once again, some judgment is required.
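A small sketch (hypothetical, and relying on the value-class counts of Table 11.1 as reconstructed above) reproduces the two figures quoted in the text: the unconstrained product of value classes, and the count obtained when each [error] class is tried only once in combination with non-error values.

public class CategoryPartitionCounts {
    public static void main(String[] args) {
        int[] total    = { 3, 3, 3, 4, 3, 3, 6, 6, 3, 3 };   // value classes per category
        int[] nonError = { 1, 3, 3, 1, 1, 3, 5, 5, 2, 2 };   // after removing [error] classes

        long all = 1, nonErrorCombinations = 1;
        int errorClasses = 0;
        for (int i = 0; i < total.length; i++) {
            all *= total[i];
            nonErrorCombinations *= nonError[i];
            errorClasses += total[i] - nonError[i];
        }
        System.out.println(all);                                  // 314928
        System.out.println(nonErrorCombinations + errorClasses);  // 2700 + 11 = 2711
    }
}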
Although the reduction from 314,928 to 2,711 is impressive, the number of derived test
cases may still exceed the budget for testing such a simple feature. Moreover, some
values are not erroneous per se, but may only be useful or even valid in particular
combinations. For example, the number of optional components with non-empty
selection is relevant to choosing useful test cases only when the number of optional slots
is greater than 1. A number of non-empty choices of required component greater than
zero does not make sense if the number of required slots is zero.
Erroneous combinations of valid values can be ruled out with the property and if-property
constraints. The property constraint groups values of a single parameter characteristic
to identify subsets of values with common properties. The property constraint is
indicated with the label [property PropertyName], where PropertyName identifies the property for later reference. For example, property RSNE (required slots non-empty) in Table
11.1 groups values that correspond to non-empty sets of required slots for the
parameter characteristic Number of Required Slots for Selected Model (#SMRS), i.e.,
values 1 and many. Similarly, property OSNE (optional slots non-empty) groups non-empty values for the parameter characteristic Number of Optional Slots for Selected
Model (#SMOS).
The if-property constraint bounds the choices of values for a parameter characteristic
that can be combined with a particular value selected for a different parameter
characteristic. The if-property constraint is indicated with the label [if PropertyName], where PropertyName identifies a property defined with the property constraint. For example,
the constraint if RSNE attached to value 0 of parameter characteristic Number of
required components with non-empty selection limits the combination of this value with
values 1 and many of the parameter characteristic Number of Required Slots for
Selected Model (#SMRS). In this way, we rule out illegal combinations like Number of
required components with non-empty selection = 0 with Number of Required Slots for
Selected Model (#SMRS) = 0.
The property and if-property constraints introduced in Table 11.1 further reduce the
number of combinations to be considered to 1 × 3 × 1 × 1 × (3 + 2 + 1) × 5 × 5 × 2 × 2 + 11 = 1811.
The number of combinations can be further reduced by iteratively adding property and if-property constraints and by introducing the new single constraint, which is indicated with the label [single] and acts like the error constraint, i.e., it limits the number of occurrences of a given value in the selected combinations to 1.
Test designers can introduce new property, if-property, and single constraints to reduce
the total number of combinations when needed to meet budget and schedule limits.
Placement of these constraints reflects the test designer's judgment regarding
combinations that are least likely to require thorough coverage.
The single constraints introduced in Table 11.1 reduce the number of combinations to be considered to 1 × 1 × 1 × 1 × 1 × 3 × 4 × 4 × 1 × 1 + 19 = 67, which may be a
reasonable balance between cost and quality for the considered functionality. The
number of combinations can also be reduced by applying the pairwise and n-way
combination testing techniques, as explained in the next section.
The set of combinations of value classes for the parameter characteristics can be turned
into test case specifications by simply instantiating the identified combinations. Table
11.2 shows an excerpt of test case specifications. The error tag in the last column
indicates test case specifications corresponding to the error constraint. Corresponding
test cases should produce an error indication. A dash indicates no constraints on the
choice of values for the parameter or environment element.
Table 11.2: An excerpt of test case specifications derived from the value classes given in Table 11.1. Each row combines one value class for each parameter characteristic: model number; number of required and optional slots; correspondence of the selection with the model slots; number of required and optional components with non-empty selection; and the required and optional component selections. A dash indicates no constraint on the choice of value.
Legend: EQR = equal to the number of required slots; EQO = equal to the number of optional slots; in-slot = ≥ 1 incompatible with slot; in-other = ≥ 1 incompatible with another slot; in-mod = ≥ 1 incompatible with model.
Choosing meaningful names for parameter characteristics and value classes allows
(semi)automatic generation of test case specifications.
[1] At this point, readers may ignore the items in square brackets, which indicate constraints identified in step C of the category-partition method.
of all combinations are identical, and contain 3 × 3 = 9 pairs of classes. When we add the third parameter, fonts, generating all combinations requires combining each value class from fonts with every pair of display mode × screen size, a total of 27 tuples; extending from n to n + 1 parameters is multiplicative. However, if we are generating pairs of values from display mode, screen size, and fonts, we can add value classes of fonts to existing elements of display mode × screen size in a way that covers all the pairs of fonts × screen size and all the pairs of fonts × display mode without increasing the number of combinations at all (see Table 11.4). The key is that each tuple of three elements contains three pairs, and by carefully selecting the value classes of the tuples we can make each tuple cover up to three different pairs.
Table 11.3: Parameters and values controlling Chipmunk Web site display

Display Mode         Language      Fonts              Color         Screen Size
full-graphics        English       Minimal            Monochrome    Hand-held
text-only            French        Standard           Color-map     Laptop
limited-bandwidth    Spanish       Document-loaded    16-bit        Full-size
                     Portuguese                       True-color
Table 11.4: Covering all pairs of value classes for three parameters by extending the cross-product of two parameters

Display mode         Screen size    Fonts
Full-graphics        Hand-held      Minimal
Full-graphics        Laptop         Standard
Full-graphics        Full-size      Document-loaded
Text-only            Hand-held      Standard
Text-only            Laptop         Document-loaded
Text-only            Full-size      Minimal
Limited-bandwidth    Hand-held      Document-loaded
Limited-bandwidth    Laptop         Minimal
Limited-bandwidth    Full-size      Standard
Table 11.5 shows 17 tuples that cover all pairwise combinations of value classes of the five parameters. The entries not specified in the table ("-") correspond to open choices.
Each of them can be replaced by any legal value for the corresponding parameter.
Leaving them open gives more freedom for selecting test cases.
Generating combinations that efficiently cover all pairs of classes (or triples, or ...) is
nearly impossible to perform manually for many parameters with many value classes
(which is, of course, exactly when one really needs to use the approach). Fortunately,
efficient heuristic algorithms exist for this task, and they are simple enough to
incorporate in tools.
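As a concrete illustration (a naive sketch, not from the book and not the algorithm of any particular tool), the following greedy heuristic repeatedly picks the complete tuple that covers the largest number of not-yet-covered pairs. It enumerates every candidate tuple, which is feasible only for tiny examples such as the three parameters of Table 11.4; real tools generate candidates more cleverly.

import java.util.*;

public class GreedyPairwise {
    static String[][] params = {
        {"full-graphics", "text-only", "limited-bandwidth"},   // Display mode
        {"Hand-held", "Laptop", "Full-size"},                  // Screen size
        {"Minimal", "Standard", "Document-loaded"}             // Fonts
    };

    public static void main(String[] args) {
        // All pairs (parameter index = value, parameter index = value) that must be covered.
        Set<String> uncovered = new HashSet<>();
        for (int i = 0; i < params.length; i++)
            for (int j = i + 1; j < params.length; j++)
                for (String a : params[i])
                    for (String b : params[j])
                        uncovered.add(pair(i, a, j, b));

        // Enumerate all complete tuples once (feasible here: 3 x 3 x 3 = 27 candidates).
        List<String[]> candidates = new ArrayList<>();
        enumerate(new String[params.length], 0, candidates);

        // Greedily pick the candidate covering the most not-yet-covered pairs.
        List<String[]> suite = new ArrayList<>();
        while (!uncovered.isEmpty()) {
            String[] best = null;
            int bestGain = -1;
            for (String[] cand : candidates) {
                int gain = 0;
                for (int i = 0; i < cand.length; i++)
                    for (int j = i + 1; j < cand.length; j++)
                        if (uncovered.contains(pair(i, cand[i], j, cand[j]))) gain++;
                if (gain > bestGain) { bestGain = gain; best = cand; }
            }
            suite.add(best);
            for (int i = 0; i < best.length; i++)
                for (int j = i + 1; j < best.length; j++)
                    uncovered.remove(pair(i, best[i], j, best[j]));
        }
        for (String[] t : suite) System.out.println(String.join(", ", t));
    }

    static void enumerate(String[] partial, int k, List<String[]> out) {
        if (k == params.length) { out.add(partial.clone()); return; }
        for (String v : params[k]) { partial[k] = v; enumerate(partial, k + 1, out); }
    }

    static String pair(int i, String a, int j, String b) {
        return i + "=" + a + "|" + j + "=" + b;
    }
}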
The tuples in Table 11.5 cover all pairwise combinations of value choices for the five parameters of the example. In many cases not all choices may be allowed. For example, the specification of the Chipmunk Web site display may indicate that monochrome displays are limited to hand-held devices. In this case, the tuples covering the pairs ⟨Monochrome, Laptop⟩ and ⟨Monochrome, Full-size⟩, i.e., the fifth and ninth tuples of Table 11.5, would not correspond to legal inputs. We can restrict the set
of legal combinations of value classes by adding suitable constraints. Constraints can be
expressed as tuples with wild-cards that match any possible value class. The patterns
describe combinations that should be omitted from the sets of tuples.
Table 11.5: Covering all pairs of value classes for the five parameters. The table lists 17 tuples over the five parameters (Language, Color, Display Mode, Fonts, and Screen Size); entries left open ("-") may be bound to any legal value of the corresponding parameter.
Such patterns can, for example, indicate that tuples containing the pair ⟨Monochrome, Laptop⟩ or the pair ⟨Monochrome, Full-size⟩ as values for the fourth and fifth parameters are not allowed in the relation of Table 11.5.
Tuples that cover all pairwise combinations of value classes without violating the
constraints can be generated by simply removing the illegal tuples and adding legal
tuples that cover the removed pairwise combinations. Open choices must be bound
consistently in the remaining tuples, e.g., tuple
must become
Constraints can also be expressed with sets of tables to indicate only the legal
combinations, as illustrated in Table 11.6, where the first table indicates that the value
class Hand-held for parameter Screen size can be combined with any value class of
parameter Color, including Monochrome, while the second table indicates that the value
classes Laptop and Full-size for parameter Screen size can be combined with all value classes except Monochrome for parameter Color.
Table 11.6: Constraints expressed as a pair of tables of legal combinations

Hand-held devices
  Display Mode:   full-graphics, text-only, limited-bandwidth
  Language:       English, French, Spanish, Portuguese
  Fonts:          Minimal, Standard, Document-loaded
  Color:          Monochrome, Color-map, 16-bit, True-color
  Screen size:    Hand-held

Laptop and Full-size devices
  Display Mode:   full-graphics, text-only, limited-bandwidth
  Language:       English, French, Spanish, Portuguese
  Fonts:          Minimal, Standard, Document-loaded
  Color:          Color-map, 16-bit, True-color
  Screen size:    Laptop, Full-size
If constraints are expressed as a set of tables that give only legal combinations, tuples
can be generated without changing the heuristic. Although the two approaches express
the same constraints, the number of generated tuples can be different, since different
tables may indicate overlapping pairs and thus result in a larger set of tuples. Other
ways of expressing constraints may be chosen according to the characteristics of the
specification and the preferences of the test designer.
As in other approaches that begin with an informal description, it is not possible to give a
precise recipe for extracting the significant elements. The result will depend on the
capability and experience of the test designer.
Consider the informal specification of a function for converting URL-encoded form data
into the original data entered through an html form. An informal specification is given in
Figure 11.2.[2]
cgi decode: Function cgi decode translates a cgi-encoded string to a plain ASCII
string, reversing the encoding applied by the common gateway interface (CGI) of
most Web servers.
CGI translates spaces to +, and translates most other non-alphanumeric characters
to hexadecimal escape sequences. cgi decode maps + to a blank (space), "%xy" (where x and y are hexadecimal digits) to the corresponding ASCII character, and alphanumeric characters to themselves.
INPUT: encoded A string of characters, representing the input CGI sequence. It can
contain:
alphanumeric characters
the character +
the substring "%xy", where x and y are hexadecimal digits.
encoded is terminated by a null character.
OUTPUT: decoded A string containing the plain ASCII characters corresponding to
the input CGI sequence.
Alphanumeric characters are copied into the output in the corresponding
position
A blank is substituted for each + character in the input.
A single ASCII character with hexadecimal value xy (base 16) is substituted for each
substring "%xy" in the input.
OUTPUT: return value cgi decode returns
0 for success
1 if the input is malformed
The description of cgi decode indicates several possible results. These can be
represented as a set of postconditions:
POST 1 if the input string Encoded contains alphanumeric characters, they are copied to
the corresponding position in the output string.
POST 2 if the input string Encoded contains + characters, they are replaced by ASCII
space characters in the corresponding positions in the output string.
POST 3 if the input string Encoded contains CGI-hexadecimals, they are replaced by
the corresponding ASCII characters.
POST 4 if the input string Encoded is a valid sequence, cgi decode returns 0.
POST 5 if the input string Encoded contains a malformed CGI-hexadecimal, i.e., a
substring "%xy", where either x or y is absent or is not a hexadecimal digit, cgi decode
returns 1.
POST 6 if the input string Encoded contains any illegal character, cgi decode returns 1.
The postconditions should, together, capture all the expected outcomes of the module
under test. When there are several possible outcomes, it is possible to capture all of
them in one complex postcondition or in several simple postconditions; here we have
chosen a set of simple contingent postconditions, each of which captures one case. The
informal specification does not distinguish among cases of malformed input strings, but
the test designer may make further distinctions while refining the specification.
Although the description of cgi decode does not mention explicitly how the results are
obtained, we can easily deduce that it will be necessary to scan the input sequence.
This is made explicit in the following operation:
OP 1 Scan the input string Encoded.
In general, a description may refer either explicitly or implicitly to elementary operations
which help to clearly describe the overall behavior, like definitions help to clearly
describe variables. As with variables, they are not strictly necessary for describing the
relation between pre- and postconditions, but they serve as additional information for
deriving test cases.
The result of step 1 for cgi decode is summarized in Figure 11.3.
[Figure 11.3: Elementary items of the cgi decode specification identified in step 1: the
assumed precondition PRE 1 (the input string Encoded is a null-terminated string of
characters), the validated precondition PRE 2, the postconditions POST 1 through POST 6
listed above, the variables VAR 1-VAR 3, the definitions DEF 1-DEF 3, and the operation
OP 1 (Scan Encoded).]
STEP 2 Derive a first set of test case specifications from preconditions, postconditions,
and definitions In this step, an initial set of test case specifications is derived by applying
a set of rules to the elementary items identified in step 1.
Validated Preconditions A simple validated precondition identifies two classes of input: values
that satisfy the precondition and values that do not. We thus derive two test case
specifications.
A compound precondition, given as a Boolean expression with and or or connectives, identifies
several classes of inputs. Although in general one could derive a different test case
specification for each possible combination of truth values of the elementary
conditions, usually we derive only a subset of test case specifications using the
modified condition decision coverage (MC/DC) approach, which is illustrated in
Section 14.3 and in Chapter 12. In short, we derive a set of combinations of
elementary conditions such that each elementary condition can be shown to
independently affect the outcome of each decision. For each elementary condition
C, there are two test case specifications in which the truth values of all conditions
except C are the same, and the compound condition as a whole evaluates to True
for one of those test cases and False for the other.
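As a small worked illustration of this rule, consider a hypothetical compound precondition
A or B (it is not one of the cgi decode preconditions). The C sketch below lists the three
combinations that satisfy MC/DC for a two-condition disjunction - N + 1 test cases for N = 2
elementary conditions - and checks that flipping each condition alone flips the outcome.

#include <stdio.h>
#include <stdbool.h>

/* Hypothetical compound precondition: A or B. */
static bool pre(bool a, bool b) { return a || b; }

int main(void) {
    /* Three combinations suffice for MC/DC of a two-condition disjunction:
         (1) A=true,  B=false -> true
         (2) A=false, B=true  -> true
         (3) A=false, B=false -> false
       Pair (1)/(3) shows that A independently affects the outcome;
       pair (2)/(3) shows the same for B. */
    struct { bool a, b; } tc[] = { { true, false }, { false, true }, { false, false } };

    printf("A affects the outcome: %s\n",
           pre(tc[0].a, tc[0].b) != pre(tc[2].a, tc[2].b) ? "yes" : "no");
    printf("B affects the outcome: %s\n",
           pre(tc[1].a, tc[1].b) != pre(tc[2].a, tc[2].b) ? "yes" : "no");
    return 0;
}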
Assumed Preconditions We do not derive test case specifications for cases that
violate assumed preconditions, since there is no defined behavior and thus no way to
judge the success of such a test case. We also do not derive test cases when the whole
input domain satisfies the condition, since test cases for these would be redundant. We
generate test cases from assumed preconditions only when the MC/DC criterion
generates more than one class of valid combinations (i.e., when the condition is a logical
disjunction of more elementary conditions).
Postconditions In all cases in which postconditions are given in a conditional form, the
condition is treated like a validated precondition, i.e., we generate a test case
specification for cases that satisfy and cases that do not satisfy the condition.
Definition Definitions that refer to input or output values and are given in conditional
form are treated like validated preconditions. We generate a set of test case
specifications for cases that satisfy and cases that do not satisfy the specification. The
test cases are generated for each variable that refers to the definition.
The elementary items of the specification identified in step 1 are scanned sequentially
and a set of test cases is derived applying these rules. While scanning the
specifications, we generate test case specifications incrementally. When new test case
specifications introduce a refinement of an existing case, or vice versa, the more general
case becomes redundant and can be eliminated. For example, if an existing test case
specification requires a non-empty set, and we have to add two test case specifications
that require a size that is a power of two and one which is not, the existing test case
specification can be deleted because the new test cases must include a non-empty set.
Scanning the elementary items of the cgi decode specification given in Figure 11.3, we
proceed as follows:
PRE 1 The first precondition is a simple assumed precondition. We do not generate any
test case specification. The only condition would be "encoded: a null terminated string of
characters," but this matches every test case and thus it does not identify a useful test
case specification.
PRE 2 The second precondition is a simple validated precondition. We generate two test
case specifications, one that satisfies the condition and one that does not:
TC-PRE2-1 Encoded: a sequence of CGI items
TC-PRE2-2 Encoded: not a sequence of CGI items
All postconditions in the cgi decode specification are given in a conditional form with a
simple condition. Thus, we generate two test case specifications for each of them. The
generated test case specifications correspond to a case that satisfies the condition and
a case that violates it.
POST 1:
TC-POST1-1 Encoded: contains one or more alphanumeric characters
TC-POST1-2 Encoded: does not contain any alphanumeric characters
POST 2:
TC-POST2-1 Encoded: contains one or more + characters
TC-POST2-2 Encoded: does not contain any + character
POST 3:
TC-POST3-1 Encoded: contains one or more CGI-hexadecimals
TC-POST3-2 Encoded: does not contain any CGI-hexadecimal
POST 4: We do not generate any new useful test case specifications, because the two
specifications are already covered by the specifications generated from PRE 2.
POST 5: We generate only the test case specification that satisfies the condition. The
test case specification that violates the specification is redundant with respect to the test
case specifications generated from POST 3
TC-POST5-1 : Encoded contains one or more malformed CGI-hexadecimals
POST 6: As for POST 5, we generate only the test case specification that satisfies the
condition. The test case specification that violates the specification is redundant with
respect to several of the test case specifications generated so far.
TC-POST6-1 Encoded: contains one or more illegal characters
None of the definitions in the specification of cgi decode is given in conditional terms, and thus no test case specifications are generated at this step.
[Figure 11.4 lists each elementary item of the specification (PRE 2, POST 1-POST 6,
VAR 1-VAR 3, DEF 1-DEF 3, OP 1) together with the test case specifications derived from
it so far: TC-PRE2-1 and TC-PRE2-2, TC-POST1-1 and TC-POST1-2, TC-POST2-1 and
TC-POST2-2, TC-POST3-1 and TC-POST3-2, TC-POST5-1, and TC-POST6-1.]
Figure 11.4: Test case specifications for cgi decode generated after step 2
STEP 3 Complete the test case specifications using catalogs The aim of this step is
to generate additional test case specifications from variables and operations used or
defined in the computation. The catalog is scanned sequentially. For each entry of the
catalog we examine the elementary components of the specification and add cases to
cover all values in the catalog. As when scanning the test case specifications during step
2, redundant test case specifications are eliminated.
Table 11.7 shows a simple catalog that we will use for the cgi decode example. A
catalog is structured as a list of kinds of elements that can occur in a specification. Each
catalog entry is associated with a list of generic test case specifications appropriate for
that kind of element. We scan the specification for elements whose type is compatible
with the catalog entry, then generate the test cases defined in the catalog for that entry.
For example, the catalog of Table 11.7 contains an entry for Boolean variables. When
we find a Boolean variable in the specification, we instantiate the catalog entry by
generating two test case specifications, one that requires a True value and one that
requires a False value.
[Table 11.7: A simple test catalog. Each entry pairs a kind of specification element with
generic test case specifications, each labeled in, out, or in/out; for example:
Boolean - True [in/out], False [in/out];
Enumeration - each enumerated value [in/out], some value outside the enumerated set [in];
Sequence - Empty [in/out], A single element [in/out], Incorrectly terminated [in];
Scan - PP occurs contiguously [in].]
Each generic test case in the catalog is labeled in, out, or in/out, meaning that a test
case specification is appropriate if applied to an input variable, or to an output variable,
or in both cases. In general, erroneous values should be used when testing the behavior
of the system with respect to input variables, but are usually impossible to produce when
testing the behavior of the system with respect to output variables. For example, when
the value of an input variable can be chosen from a set of values, it is important to test
the behavior of the system for all enumerated values and some values outside the
enumerated set, as required by entry ENUMERATION of the catalog. However, when
the value of an output variable belongs to a finite set of values, we should derive a test
case for each possible outcome, but we cannot derive a test case for an impossible
outcome, so entry ENUMERATION of the catalog specifies that the choice of values
outside the enumerated set is limited to input variables. Intermediate variables, if
present, are treated like output variables.
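The scanning performed in step 3 can be pictured as a small table-driven procedure: each
catalog entry carries its generic test cases and their in/out labels, and instantiation filters
them by the direction of the element at hand. The sketch below is illustrative only; the
structure layout, the entry contents, and the function names are assumptions rather than
the format used by any actual catalog-based tool.

#include <stdio.h>
#include <string.h>

/* A generic test case from a catalog entry, with its applicability label. */
struct generic_case {
    const char *description;
    const char *applicability;   /* "in", "out", or "in/out" */
};

/* Fragment of a catalog: one array of generic cases per kind of element. */
static const struct generic_case boolean_entry[] = {
    { "value True",  "in/out" },
    { "value False", "in/out" },
};
static const struct generic_case enumeration_entry[] = {
    { "each enumerated value",                 "in/out" },
    { "some value outside the enumerated set", "in"     },
};

/* Instantiate an entry for one element of the specification, skipping
   the "in"-only cases when the element is an output variable. */
static void instantiate(const char *element, int is_input,
                        const struct generic_case *entry, int n) {
    for (int i = 0; i < n; ++i) {
        if (!is_input && strcmp(entry[i].applicability, "in") == 0)
            continue;
        printf("test case for %s: %s\n", element, entry[i].description);
    }
}

int main(void) {
    instantiate("Return value (VAR 3)", 0, boolean_entry, 2);
    instantiate("encoded", 1, enumeration_entry, 2);
    return 0;
}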
Entry Boolean of the catalog applies to Return value (VAR 3). The catalog requires a
test case that produces the value True and one that produces the value False. Both
cases are already covered by test cases TC-PRE2-1 and TC-PRE2-2 generated for
precondition PRE 2, so no test case specification is actually added.
Entry Enumeration of the catalog applies to any variable whose values are chosen from
an explicitly enumerated set of values. In the example, the values of CGI item (DEF 3)
and of improper CGI hexadecimals in POST 5 are defined by enumeration. Thus, we
can derive new test case specifications by applying entry enumeration to POST 5 and
to any variable that can contain CGI items.
The catalog requires creation of a test case specification for each enumerated value and
for some excluded values. For encoded, which should consist of CGI-items as defined in
DEF 3, we generate a test case specification where a CGI-item is an alphanumeric
character, one where it is the character +, one where it is a CGI-hexadecimal, and one
where it is an illegal value. We can easily ascertain that all the required cases are
already covered by test case specifications for TC-POST1-1, TC-POST1-2, TC-POST2-1, TC-POST2-2, TC-POST3-1, and TC-POST3-2, so any additional test case
specifications would be redundant.
From the enumeration of malformed CGI-hexadecimals in POST 5, we derive the
following test cases: %y, %x, %ky, %xk, %xy (where x and y are hexadecimal digits and
k is not). Note that the first two cases, %x (the second hexadecimal digit is missing) and
%y (the first hexadecimal digit is missing) are identical, and %x is distinct from %xk only
if %x are the last two characters in the string. A test case specification requiring a
correct pair of hexadecimal digits (%xy) is a value out of the range of the enumerated
set, as required by the catalog.
The added test case specifications are:
TC-POST5-2 encoded: terminated with %x, where x is a hexadecimal digit
TC-POST5-3 encoded: contains %ky, where k is not a hexadecimal digit and y is a
hexadecimal digit.
TC-POST5-4 encoded: contains %xk, where x is a hexadecimal digit and k is not.
The test case specification corresponding to the correct pair of hexadecimal digits is
redundant, having already been covered by TC-POST3-1. The test case TC-POST5-1
can now be eliminated because it is more general than the combination of TC-POST5-2, TC-POST5-3, and TC-POST5-4.
Entry Range applies to any variable whose values are chosen from a finite range. In the
example, ranges appear three times in the definition of hexadecimal digit. Ranges also
appear implicitly in the reference to alphanumeric characters (the alphabetic and numeric
ranges from the ASCII character set) in DEF 3. For hexadecimal digits we will try the
special values / and : (the characters that appear before 0 and after 9 in the ASCII
encoding), the values 0 and 9 (lower and upper bounds of the first interval), and some
value between 0 and 9; similarly @ and G (the characters before A and after F), A and F,
and some value between A and F for the second interval; and ` and g (the characters
before a and after f), a and f, and some value between a and f for the third interval.
These values will be instantiated for variable encoded, and result in 30 additional test
case specifications (5 values for each subrange, giving 15 values for each hexadecimal
digit and thus 30 for the two digits of CGI-hexadecimal). The full set of test case
specifications is shown in Table 11.8. These test case specifications are more specific
than (and therefore replace) test case specifications TC-POST3-1, TC-POST5-3, and
TC-POST5-4.
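The boundary values chosen above follow directly from the ASCII ordering of the three
hexadecimal-digit intervals. The short C sketch below, which is an illustration rather than
part of the derivation method, prints whether the standard library classifies each candidate
character as a hexadecimal digit; the characters just outside each interval (/, :, @, G, `,
and g) are exactly the ones expected to be rejected.

#include <stdio.h>
#include <ctype.h>

int main(void) {
    /* Lower bound - 1, lower bound, an interior value, upper bound, and
       upper bound + 1 for each of the ASCII intervals 0-9, A-F, and a-f. */
    const char candidates[] = { '/', '0', '5', '9', ':',
                                '@', 'A', 'C', 'F', 'G',
                                '`', 'a', 'c', 'f', 'g' };
    for (size_t i = 0; i < sizeof candidates; ++i)
        printf("%c -> %s\n", candidates[i],
               isxdigit((unsigned char) candidates[i]) ? "hex digit" : "not a hex digit");
    return 0;
}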
Table 11.8: Summary table: Test case specifications for cgi decode generated
with a catalog.
TC-POST2-2      TC-POST3-2      TC-POST5-2      TC-VAR1-1      TC-VAR1-2      TC-VAR1-3
TC-DEF2-1   %/y     TC-DEF2-2   %0y     TC-DEF2-3           TC-DEF2-4   %9y
TC-DEF2-5   %:y     TC-DEF2-6   %@y     TC-DEF2-7   %Ay     TC-DEF2-8
TC-DEF2-9   %Fy     TC-DEF2-10  %Gy     TC-DEF2-11  %`y     TC-DEF2-12  %ay
TC-DEF2-13          TC-DEF2-14  %fy     TC-DEF2-15  %gy     TC-DEF2-16  %x/
TC-DEF2-17  %x0     TC-DEF2-18          TC-DEF2-19  %x9     TC-DEF2-20  %x:
TC-DEF2-21  %x@     TC-DEF2-22  %xA     TC-DEF2-23          TC-DEF2-24  %xF
TC-DEF2-25  %xG     TC-DEF2-26  %x`     TC-DEF2-27  %xa     TC-DEF2-28
TC-DEF2-29  %xf     TC-DEF2-30  %xg     TC-DEF2-31  %$      TC-DEF2-32  %xyz
TC-DEF3-1           TC-DEF3-2           TC-DEF3-3   c, with c in [1..8]
TC-DEF3-4           TC-DEF3-5           TC-DEF3-6           TC-DEF3-7
TC-DEF3-8   a, with a in [B..Y]         TC-DEF3-9           TC-DEF3-10
TC-DEF3-11          TC-DEF3-12          TC-DEF3-13  a, with a in [b..y]
TC-DEF3-14          TC-DEF3-15
TC-OP1-1            TC-OP1-2            TC-OP1-3    %xy'    TC-OP1-4    a$
TC-OP1-5    +$      TC-OP1-6    %xy$    TC-OP1-7    aa      TC-OP1-8    ++
TC-OP1-9    %xy%zw  TC-OP1-10   %x%yz
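A few of the specifications in Table 11.8 translate directly into an executable check. The
driver below is a minimal sketch: it assumes a C implementation with the signature
int cgi_decode(char *encoded, char *decoded), as in Figure 13.1 of Chapter 13, and it
checks only the return value required by POST 4 through POST 6; the metavariables x and
y are instantiated with concrete hexadecimal digits.

#include <stdio.h>
#include <string.h>

/* Assumed to be provided by the unit under test (see Figure 13.1). */
extern int cgi_decode(char *encoded, char *decoded);

struct spec { const char *id; const char *input; int expected_rc; };

int main(void) {
    const struct spec specs[] = {
        { "TC-POST2-1", "a+b",    0 },   /* contains a + character                */
        { "TC-POST3-1", "%3D",    0 },   /* well-formed CGI-hexadecimal           */
        { "TC-POST5-2", "test%3", 1 },   /* terminated with %x                    */
        { "TC-DEF2-10", "%G7",    1 },   /* first digit just above the A-F range  */
        { "TC-DEF2-25", "%7G",    1 },   /* second digit just above the A-F range */
    };
    char in[64], out[64];
    int failures = 0;

    for (size_t i = 0; i < sizeof specs / sizeof specs[0]; ++i) {
        /* Zero-filled buffers keep the scan bounded even for malformed
           inputs that end with an incomplete % escape. */
        memset(in, 0, sizeof in);
        memset(out, 0, sizeof out);
        strcpy(in, specs[i].input);
        int rc = cgi_decode(in, out);
        if (rc != specs[i].expected_rc) {
            printf("%s failed: input \"%s\", expected %d, got %d\n",
                   specs[i].id, specs[i].input, specs[i].expected_rc, rc);
            ++failures;
        }
    }
    printf("%d failing test case(s)\n", failures);
    return failures != 0;
}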
Further Reading
Category partition testing is described by Ostrand and Balcer [OB88]. The combinatorial
approach described in this chapter is due to Cohen, Dalal, Fredman, and Patton
[CDFP97]; the algorithm described by Cohen et al. is patented by Bellcore. Catalog-based testing of subsystems is described in Marick's The Craft of Software Testing
[Mar97].
Related topics
Readers interested in learning additional functional testing techniques may continue with
the next chapter, which describes model-based testing techniques. Readers interested in
the complementarities between functional and structural testing as well as readers
interested in testing the decision structures and control and data flow graphs may
continue with the following chapters that describe structural and data flow testing.
Readers interested in the quality of specifications may proceed to Chapter 18, which
describes inspection techniques.
Exercises
11.1 When designing a test suite with the category partition method, sometimes it is
useful to determine the number of test case specifications that would be
generated from a set of parameter characteristics (categories) and value classes
(choices) without actually generating or enumerating them. Describe how to
quickly determine the number of test cases in these cases:
1. Parameter characteristics and value classes are given, but no
constraints (error, single, property, or if-property) are used.
2. Only the constraints error and single are used (without property and
if-property).
When property and if-property are also used, they can interact in
ways that make a quick closed-form calculation of the number of test
cases difficult or impossible.
3. Sketch an algorithm for counting the number of test cases that would be
generated when property and if-property are used. Your algorithm should be
simple, and may not be more efficient than actually generating each test
case specification.
Suppose we are constructing a tool for combinatorial testing. Our tool will read a
specification in exactly the same form as the input of a tool for the category
partition method, except that it will achieve pairwise coverage rather than
exhaustive coverage of values. However, we notice that it is sometimes not
possible to cover all pairs of choices. For example, we might encounter the
following specification:
C1
   V1  [ property P1 ]
   V2  [ property P2 ]
C2
   V3  [ property P3 ]
   V4  [ property P4 ]
C3
   V5  [ if P1 ]
   V6  [ if P4 ]
Our tool prints a warning that it is unable to create any complete test case
specification that pairs value V2 with V3.
1. Explain why the values V2 and V3 cannot be paired in a test case
specification.
11.3
V5  [ if P1 ]
V6  [ if P4 ]
V7  [ error ]
Would it be satisfactory to cover the test obligation C1 = V2, C2 = V3 with the
complete test case specification C1 = V2, C2 = V3, C3 = V7? In general, should values
marked error be used to cover pairs of parameter characteristics?
V5  [ if P1 ]
V6  [ if P4 ]
V7  [ single ]
Would it be a good idea to use V7 to complete a test case specification
matching V2 with V3? Does your answer depend on whether the single
constraint has been used just to reduce the total number of test cases
or to identify situations that are really treated as special cases in the
program specification and code?
11.4 Derive parameter characteristics, representative values, and semantic constraints
from the following specification of an Airport connection check function, suitable
for generating a set of test case specifications using the category partition
method.
Airport connection check The airport connection check is part of an (imaginary)
travel reservation system. It is intended to check the validity of a single connection
between two flights in an itinerary. It is described here at a fairly abstract level, as
it might be described in a preliminary design before concrete interfaces have been
worked out.
Specification
Signature: Valid Connection(Arriving Flight: flight, Departing Flight: flight) returns Validity Code
Validity Code 0 (OK) is returned if Arriving Flight and
Departing Flight make a valid connection (the arriving airport of the first is the
departing airport of the second) and there is sufficient time between arrival and
departure according to the information in the airport database described below.
Otherwise, a validity code other than 0 is returned, indicating why the connection
is not valid.
Data types
Flight: A "flight" is a structure consisting of
A unique identifying flight code, three alphabetic characters followed by up
to four digits. (The flight code is not used by the valid connection
function.)
The originating airport code (3 characters, alphabetic)
11.5
Derive a set of test cases for the Airport Connection Check example of Exercise
11.4 using the catalog-based approach.
Extend the catalog of Table 11.7 as needed to deal with specification constructs.
[2] The informal specification is ambiguous and inconsistent, i.e., it is the kind of spec one
is most likely to encounter in practice.
Required Background
Chapter 5
The material on control flow graphs and related models of program structure is
required to understand this chapter.
Chapter 9
The introduction to test case adequacy and test case selection in general sets the
context for this chapter. It is not strictly required for understanding this chapter, but
is helpful for understanding how the techniques described in this chapter should be
applied.
[1]In this chapter we use the term program generically for the
12.1 Overview
Testing can reveal a fault only when execution of the faulty element
causes a failure. For example, if there were a fault in the statement
at line 31 of the program in Figure 12.1, it could be revealed only
with test cases in which the input string contains the character %
followed by two hexadecimal digits, since only these cases would
cause this statement to be executed. Based on this simple
observation, a program has not been adequately tested if some of
its elements have not been executed.[2] Control flow testing criteria
are defined for particular classes of elements by requiring the
execution of all such elements of the program. Control flow
elements include statements, branches, conditions, and paths.
Figure 12.1: The C function cgi decode, which translates a cgi-encoded string to a plain ASCII string (reversing the encoding
applied by the common gateway interface of most Web servers).
Unfortunately, a set of correct program executions in which all
control flow elements are exercised does not guarantee the absence
of faults. Execution of a faulty statement may not always result in a
failure. The state may not be corrupted when the statement is
executed with some data values, or a corrupt state may not
propagate through execution to eventually lead to failure. Let us
assume, for example, that we had erroneously typed 6 instead of 16 in
the statement at line 31 of the program in Figure 12.1. Test cases
that execute the faulty statement with value 0 for variable digit_high
would not corrupt the state, thus leaving the fault unrevealed
despite having executed the faulty statement.
The statement at line 26 of the program in Figure 12.1 contains a
fault, since variable eptr used to index the input string is
incremented twice without checking the size of the string. If the
input string contains a character % in one of the last two positions,
eptr will point beyond the end of the string when it is later used to
index the string. Execution of the program with a test case where
specification and implementation, or they may reveal flaws of the software or its
development process: inadequacy of the specifications that do not include cases present in
the implementation; coding practice that radically diverges from the specification; or
inadequate functional test suites.
does not satisfy the statement adequacy criterion because it does not execute
statement ok = 1 at line 29. The test suite T0 results in statement coverage of .94
(17/18), or node coverage of .91 (10/11), relative to the control flow graph of Figure 12.2.
On the other hand, a test suite with only test case
Figure 12.2: Control flow graph of function cgi decode from Figure
12.1
causes all statements to be executed and thus satisfies the statement adequacy
criterion, reaching a coverage of 1.
Coverage is not monotone with respect to the size of test suites; test suites that contain
fewer test cases may achieve a higher coverage than test suites that contain more test
cases. T1 contains only one test case, while T0 contains three test cases, but T1
achieves a higher coverage than T0. (Test suites used in this chapter are summarized in
Table 12.1.)
Table 12.1: Sample test suites for C
function cgi decode from Figure 12.1
T0 = { "", "test", "test+case%1Dadequacy" }
T1 = { "adequate+test%0Dexecution%7U" }
T2
T3 = { "", "+%0D+%4J" }
T4 = { "first+test%9Ktest%K9" }
Criteria can be satisfied by many test suites of different sizes. Both T1 and T2
cause all statements to be executed and thus satisfy the statement adequacy criterion
for program cgi decode, although one consists of a single test case and the other
consists of four test cases.
Notice that while we typically wish to limit the size of test suites, in some cases we may
prefer a larger test suite over a smaller suite that achieves the same coverage. A test
suite with fewer test cases may be more difficult to generate or may be less helpful in
debugging. Let us suppose, for example, that we omitted the 1 in the statement at line
31 of the program in Figure 12.1. Both test suites T1 and T2 would reveal the fault,
resulting in a failure, but T2 would provide better information for localizing the fault, since
the program fails only for test case "%1D", the only test case of T2 that exercises the
statement at line 31.
On the other hand, a test suite obtained by adding test cases to T2 would satisfy the
statement adequacy criterion, but would not have any particular advantage over T2 with
respect to the total effort required to reveal and localize faults. Designing complex test
cases that exercise many different elements of a unit is seldom a good way to optimize
a test suite, although it may occasionally be justifiable when there is large and
unavoidable fixed cost (e.g., setting up equipment) for each test case regardless of
complexity.
Control flow coverage may be measured incrementally while executing a test suite. In
this case, the contribution of a single test case to the overall coverage that has been
achieved depends on the order of execution of test cases. For example, in test suite T2,
execution of test case "%1D" exercises 16 of the 18 statements of the program cgi
decode, but it exercises only 1 new statement if executed after "%A." The increment of
coverage due to the execution of a specific test case does not measure the absolute
efficacy of the test case. Measures independent from the order of execution may be
obtained by identifying independent statements. However, in practice we are only
interested in the coverage of the whole test suite, and not in the contribution of individual
test cases.
Figure 12.3: The control flow graph of C function cgi decode which is obtained from
the program of Figure 12.1 after removing node F.
satisfies the statement adequacy criterion for program cgi decode but does not exercise
the false branch from node D in the control flow graph model of the program.
The branch adequacy criterion requires each branch of the program to be executed by
at least one test case.
Let T be a test suite for a program P. T satisfies the branch adequacy criterion for P, iff,
for each branch B of P, there exists at least one test case in T that causes execution of
B.
This is equivalent to stating that every edge in the control flow graph model of program
P belongs to some execution path exercised by a test case in T .
The branch coverage CBranch of T for P is the fraction of branches of program P
executed by at least one test case in T. T satisfies the branch adequacy criterion iff
CBranch = 1.
We can consider entry and exit from the control flow graph as branches, so that
branch adequacy will imply statement adequacy even for units with no other control flow.
As trivial as this fault seems, it can easily be overlooked if only the outcomes of complete
Boolean expressions are explored. The branch adequacy criterion can be satisfied, and
both branches exercised, with test suites in which the first comparison evaluates always to
False and only the second is varied. Such tests do not systematically exercise the first
comparison and will not reveal the fault in that comparison. Condition adequacy criteria
overcome this problem by requiring different basic conditions of the decisions to be
separately exercised. The basic conditions, sometimes also called elementary conditions,
are comparisons, references to Boolean variables, and other Boolean-valued expressions
whose component subexpressions are not Boolean values.
The simplest condition adequacy criterion, called basic condition coverage requires each
basic condition to be covered. Each basic condition must have a True and a False outcome
at least once during the execution of the test suite.
A test suite T for a program P covers all basic conditions of P, that is,
it satisfies the basic condition adequacy criterion, iff each basic
condition in P has a true outcome in at least one test case in T and a
false outcome in at least one test case in T .
The basic condition coverage CBasic Condition of T for P is the
fraction of the total number of truth values assumed by the basic
conditions of program P during the execution of all test cases in T .
T satisfies the basic condition adequacy criterion if CBasic Condition = 1. Notice that the total
number of truth values that the basic conditions can take is twice the number of basic
conditions, since each basic condition can assume value true or false. For example, the
program in Figure 12.1 contains five basic conditions, which in sum may take ten possible
truth values. Three basic conditions correspond to the simple decisions at lines 18, 22, and
24 - decisions that each contain only one basic condition. Thus they are covered by any test
suite that covers all branches. The remaining two conditions occur in the compound decision
at line 27. In this case, test suites T1 and T3 cover the decisions without covering the basic
conditions. Test suite T1 covers the decision since it has an outcome False for the substring
%0D and an outcome True for the substring %7U of test case
"adequate+test%0Dexecution%7U." However, test suite T1 does not cover the first
condition, since it has only outcome False. To satisfy the basic condition adequacy criterion,
we need to add an additional test case that produces outcome True for the first condition
(e.g., test case "basic%K7").
The basic condition adequacy criterion can be satisfied without satisfying branch coverage.
For example, the test suite T4 (see Table 12.1)
satisfies the basic condition adequacy criterion, but not the branch
adequacy criterion, since the outcome of the decision at
line 27 is always True. Thus branch and basic condition adequacy
criteria are not directly comparable.
An obvious extension that includes both the basic condition and the
branch adequacy criteria is called branch and condition adequacy
criterion, with the obvious definition: A test suite satisfies the
branch and condition adequacy criterion if it satisfies both the
branch adequacy criterion and the condition adequacy criterion.
A more complete extension that includes both the basic condition and the branch adequacy
criteria is the compound condition adequacy criterion,[5] which requires a test for each
possible evaluation of compound conditions. It is most natural to visualize compound
condition adequacy as covering paths to leaves of the evaluation tree for the expression.
For example, the compound condition at line 27 would require covering the three paths in
the following tree:
Notice that due to the left-to-right evaluation order and short-circuit evaluation of logical OR
expressions in the C language, the value True for the first condition does not need to be
combined with both values False and True for the second condition. The number of test
cases required for compound condition adequacy can, in principle, grow exponentially with
the number of basic conditions in a decision (all 2^N combinations of N basic conditions),
which would make compound condition coverage impractical for programs with very
complex conditions. Short-circuit evaluation is often effective in reducing this to a more
manageable number, but not in every case. The number of test cases required to achieve
compound condition coverage even for expressions built from N basic conditions combined
only with short-circuit Boolean operators like the && and || of C and Java can still be
exponential in the worst case.
Consider the number of cases required for compound condition coverage of the following
two Boolean expressions, each with five basic conditions. For the expression
a && b && c && d && e, compound condition coverage requires:

Test Case    a       b       c       d       e
(1)          True    True    True    True    True
(2)          True    True    True    True    False
(3)          True    True    True    False   -
(4)          True    True    False   -       -
(5)          True    False   -       -       -
(6)          False   -       -       -       -

[Table omitted: the corresponding truth-value combinations, numbered (1) through (11) and
beyond, required for compound condition coverage of the second expression.]
An alternative approach that can be satisfied with the same number of test cases for
Boolean expressions of a given length regardless of short-circuit evaluation is the modified
condition/decision coverage or MC/DC, also known as the modified condition adequacy
criterion. The modified condition/decision criterion requires that each basic condition be
shown to independently affect the outcome of each decision. That is, for each basic
condition C, there are two test cases in which the truth values of all evaluated conditions
except C are the same, and the compound condition as a whole evaluates to True for one
of those test cases and False for the other. The modified condition adequacy criterion can
be satisfied with N + 1 test cases, making it an attractive compromise between number of
required test cases and thoroughness of the test. It is required by important quality
standards in aviation, including RTCA/DO-178B, "Software Considerations in Airborne
Systems and Equipment Certification," and its European equivalent EUROCAE ED-12B.
[Table omitted: an MC/DC-adequate subset of test cases for the second expression, giving
for each the truth values of the basic conditions a through e and the decision outcome, with
the independently-affecting values underlined.]
The values underlined in the table independently affect the outcome of the decision. Note
that the same test case can cover the values of several basic conditions. For example, test
case (1) covers value True for the basic conditions a, c and e. Note also that this is not the
only possible set of test cases to satisfy the criterion; a different selection of Boolean
combinations could be equally effective.
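Applied to the two-condition decision at line 27 of cgi decode (digit_high == -1 ||
digit_low == -1; the listing appears with the same line numbers in Figure 13.1), N + 1 = 3
test cases suffice. The inputs below are one possible instantiation - a sketch, not one of the
suites of Table 12.1 - with the truth values each input induces noted in the comments.

#include <stdio.h>

/* Assumes the cgi_decode implementation of Figure 13.1 is linked in. */
extern int cgi_decode(char *encoded, char *decoded);

int main(void) {
    char out[16];
    /* "%K7": digit_high == -1 is True (second condition short-circuited) -> decision True  */
    /* "%7U": digit_high == -1 is False, digit_low == -1 is True          -> decision True  */
    /* "%0D": digit_high == -1 is False, digit_low == -1 is False         -> decision False */
    char tc1[] = "%K7", tc2[] = "%7U", tc3[] = "%0D";
    printf("%%K7 -> rc %d\n", cgi_decode(tc1, out));   /* expected 1 */
    printf("%%7U -> rc %d\n", cgi_decode(tc2, out));   /* expected 1 */
    printf("%%0D -> rc %d\n", cgi_decode(tc3, out));   /* expected 0 */
    return 0;
}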
[5]Compound condition adequacy is also known as multiple condition
coverage.
Figure 12.4: Deriving a tree from a control flow graph to derive subpaths for
boundary/interior testing. Part (i) is the control flow graph of the C function cgi
decode, identical to Figure 12.2 but showing only node identifiers without source
code. Part (ii) is a tree derived from part (i) by following each path in the control flow
graph up to the first repeated node. The set of paths from the root of the tree to each
leaf is the required set of subpaths for boundary/interior coverage.
Figures 12.5 - 12.7 illustrate a fault that may not be uncovered using statement or
decision testing, but will assuredly be detected if the boundary/interior path criterion is
satisfied. The program fails if the loop body is executed exactly once - that is, if the
search key occurs in the second position in the list.
typedef struct cell {
    itemtype itemval;
    struct cell *link;
} *list;
#define NIL ((struct cell *) 0)

itemtype search( list *l, keytype k)
{
    struct cell *p = *l;
    struct cell *back = NIL;

    /* Case 1: List is empty */
    if (p == NIL) {
        return NULLVALUE;
    }

    /* Case 2: Key is at front of list */
    if (k == p->itemval) {
        return p->itemval;
    }

    /* Default: Simple (but buggy) sequential search */
    p=p->link;
    while (1) {
        if (p == NIL) {
            return NULLVALUE;
        }
        if (k==p->itemval) { /* Move to front */
            back->link = p->link;
            p->link = *l;
            *l=p;
            return p->itemval;
        }
        back=p; p=p->link;
    }
}
Figure 12.5: A C function for searching and dynamically rearranging a linked list,
excerpted from a symbol table package. Initialization of the back pointer is missing,
causing a failure only if the search key is found in the second position in the
list.
Figure 12.6: The control flow graph of C function search with move-to-front
feature.
so that in total N branches result in 2^N paths that must be traversed. Moreover, choosing
input data to force execution of one particular path may be very difficult, or even
impossible if the conditions are not independent.[6]
Since coverage of non-looping paths is expensive, we can consider a variant of the
boundary/interior criterion that treats loop boundaries similarly but is less stringent with
respect to other differences among paths.
A test suite T for a program P satisfies the loop boundary adequacy criterion iff, for
each loop l in P,
In at least one execution, control reaches the loop, and then the loop control
condition evaluates to False the first time it is evaluated.[7]
In at least one execution, control reaches the loop, and then the body of the loop
is executed exactly once before control leaves the loop.
In at least one execution, the body of the loop is repeated more than once.
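For the search function of Figure 12.5, these three situations might be exercised as in the
sketch below. The types itemtype and keytype and the constant NULLVALUE are not shown
in the excerpt and are assumed here to be int and -1; the one-iteration case, with the key
in the second position, is exactly the one that dereferences the uninitialized back pointer.

#include <stdio.h>
#include <stdlib.h>

typedef int itemtype;                 /* assumed; not shown in Figure 12.5 */
typedef int keytype;                  /* assumed; not shown in Figure 12.5 */
#define NULLVALUE ((itemtype) -1)     /* assumed sentinel                  */

typedef struct cell {
    itemtype itemval;
    struct cell *link;
} *list;

extern itemtype search(list *l, keytype k);

/* Build a list from an array of values, front to back. */
static list make_list(const itemtype *vals, int n) {
    list head = NULL;
    for (int i = n - 1; i >= 0; --i) {
        list c = malloc(sizeof *c);
        c->itemval = vals[i];
        c->link = head;
        head = c;
    }
    return head;
}

int main(void) {
    itemtype vals[] = { 10, 20, 30 };

    /* Loop not entered: the key is found at the front of the list. */
    list l0 = make_list(vals, 3);
    printf("front:  %d\n", search(&l0, 10));

    /* Loop body executed more than once: the key is in the third position. */
    list l2 = make_list(vals, 3);
    printf("third:  %d\n", search(&l2, 30));

    /* Loop body executed exactly once: the key is in the second position.
       This is the case that dereferences the uninitialized back pointer,
       so the run typically aborts here, revealing the fault. */
    list l1 = make_list(vals, 3);
    printf("second: %d\n", search(&l1, 20));

    return 0;
}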
One can define several small variations on the loop boundary criterion. For example, we
might excuse from consideration loops that are always executed a definite number of
times (e.g., multiplication of fixed-size transformation matrices in a graphics application).
In practice we would like the last part of the criterion to be "many times through the
loop" or "as many times as possible," but it is hard to make that precise (how many is
"many?").
It is easy enough to define such a coverage criterion for loops, but how can we justify it?
Why should we believe that these three cases - zero times through, once through, and
several times through - will be more effective in revealing faults than, say, requiring an
even and an odd number of iterations? The intuition is that the loop boundary coverage
criteria reflect a deeper structure in the design of a program. This can be seen by their
relation to the reasoning we would apply if we were trying to formally verify the
correctness of the loop. The basis case of the proof would show that the loop is
executed zero times only when its postcondition (what should be true immediately
following the loop) is already true. We would also show that an invariant condition is
established on entry to the loop, that each iteration of the loop maintains this invariant
condition, and that the invariant together with the negation of the loop test (i.e., the
condition on exit) implies the postcondition. The loop boundary criterion does not require
us to explicitly state the precondition, invariant, and postcondition, but it forces us to
exercise essentially the same cases that we would analyze in a proof.
There are additional path-oriented coverage criteria that do not explicitly consider loops.
Among these are criteria that consider paths up to a fixed length. The most common
such criteria are based on Linear Code Sequence and Jump (LCSAJ). An LCSAJ is
defined as a body of code through which the flow of control may proceed sequentially,
terminated by a jump in the control flow.
[7] For a while or for loop, this is equivalent to saying that the loop body is executed zero
times.
Although such complications may arise even in conventional procedural programs (e.g., using function
pointers in C), they are most prevalent in object-oriented programming. Not surprisingly,
then, approaches to systematically exercising sequences of procedure calls are
beginning to emerge mainly in the field of object-oriented testing, and we therefore cover
them in Chapter 15.
[8] The "unit" in this case is the C source file, which provided a single data abstraction
through several related C functions, much as a C++ or Java class would provide a single
abstraction through several methods. The search function was analogous in this case to
a private (internal) method of a class.
Figure 12.8: The subsumption relation among structural test adequacy criteria
described in this chapter.
The hierarchy can be roughly divided into a part that relates requirements for covering
program paths and another part that relates requirements for covering combinations of
conditions in branch decisions. The two parts come together at branch coverage. Above
branch coverage, path-oriented criteria and condition-oriented criteria are generally
separate, because there is considerable cost and little apparent benefit in combining
them. Statement coverage is at the bottom of the subsumes hierarchy for systematic
coverage of control flow. Applying any of the structural coverage criteria, therefore,
implies at least executing all the program statements.
Procedure call coverage criteria are not included in the figure, since they do not concern
internal control flow of procedures and are thus incomparable with the control flow
coverage criteria.
The other main option is requiring justification of each element left uncovered. This is the
approach taken in some quality standards, notably RTCA/DO-178B and EUROCAE ED-12B for modified condition/decision coverage (MC/DC). Explaining why each element is
uncovered has the salutary effect of distinguishing between defensive coding and sloppy
coding or maintenance, and may also motivate simpler coding styles. However, it is
more expensive (because it requires manual inspection and understanding of each
element left uncovered) and is unlikely to be cost-effective for criteria that impose test
obligations for large numbers of infeasible paths. This problem, even more than the large
number of test cases that may be required, leads us to conclude that stringent path-oriented coverage criteria are seldom cost-effective.
Further Reading
The main structural adequacy criteria are presented in Myers' The Art of Software
Testing [Mye79], which has been a preeminent source of information for more than two
decades. It is a classic despite its age, which is evident from the limited set of
techniques addressed and the programming language used in the examples. The
excellent survey by Adrion et al. [ABC82] remains the best overall survey of testing
techniques, despite similar age. Frankl and Weyuker [FW93] provide a modern
treatment of the subsumption hierarchy among structural coverage criteria.
Boundary/interior testing is presented by Howden [How75]. Woodward et al. [WHH80]
present LCSAJ testing. Cyclomatic testing is described by McCabe [McC83]. Residual
test coverage measurement is described by Pavlopoulou and Young [PY99].
Related Topics
Readers with a strong interest in coverage criteria should continue with the next chapter,
which presents data flow testing criteria. Others may wish to proceed to Chapter 15,
which addresses testing object-oriented programs. Readers wishing a more
comprehensive view of unit testing may continue with Chapter 17 on test scaffolding
and test data generation. Tool support for structural testing is discussed in Chapter 23.
Exercises
12.1 Let us consider the following loop, which appears in C lexical analyzers generated by
flex:[9]
for (n=0;
Devise a set of test cases that satisfy the compound condition adequacy criterion and
satisfy the modified condition adequacy criterion with respect to this loop.
12.2 The following if statement appears in the Java source code of Grappa,[10] a graph layout
tool from AT&T Laboratories:

if (pos < parseArray.length
    && (parseArray[pos] == '{'
        || parseArray[pos] == '}'
        || parseArray[pos] == '|'))
    continue;
}
1. Derive a set of test case specifications and show that it satisfies the MC/DC criterion for this
statement. For brevity, abbreviate each of the basic conditions as follows:
Room for pos < parseArray.length
Open for parseArray[pos] == '{'
Close for parseArray[pos] == '}'
Bar for parseArray[pos] == '|'
12.3 Prove that the number of test cases required to satisfy the modified condition adequacy criterion for a predicate with N basic conditions is at most N + 1.
12.4 The number of basis paths (cyclomatic complexity) does not depend on whether nodes of the control flow
graph are individual statements or basic blocks that may contain several statements.
12.5 Derive the subsumption hierarchy for the call graph coverage criteria described in this
chapter, and give a justification of the relationships.
12.6 If the modified condition/decision adequacy criterion requires a test case that is not feasible because of
interdependent basic conditions, should this always be taken as an indication of a defect?
Why or why not?
[9] Flex is a widely used generator of lexical analyzers. Flex was written by Vern Paxson
and is compatible with the original AT&T lex written by M.E. Lesk. This excerpt is from
version 2.5.4 of flex, distributed with the Linux operating system.
[10] The statement appears in file Table.java. This source code is copyright 1996, 1997,
1998 by AT&T Corporation. Grappa is distributed as open source software, available at
the time of this writing from http://www.graphviz.org. Formatting of the line has been
altered for readability in this printed form.
Required Background
Chapter 6
At least the basic data flow models presented in Chapter 6, Section 6.1, are
required to understand this chapter, although algorithmic details of data flow
analysis can be deferred. Section 6.5 of that chapter is important background
for Section 13.4 of the current chapter. The remainder of Chapter 6 is useful
background but not strictly necessary to understand and apply data flow testing.
Chapter 9
The introduction to test case adequacy and test case selection in general sets
the context for this chapter. It is not strictly required for understanding this
chapter, but is helpful for understanding how the techniques described in this
chapter should be applied.
Chapter 12
The data flow adequacy criteria presented in this chapter complement control
flow adequacy criteria. Knowledge about control flow adequacy criteria is
desirable but not strictly required for understanding this chapter.
13.1 Overview
We have seen in Chapter 12 that structural testing criteria are practical for single
elements of the program, from simple statements to complex combinations of conditions,
but become impractical when extended to paths. Even the simplest path testing criteria
require covering large numbers of paths that tend to quickly grow far beyond test suites
of acceptable size for nontrivial programs.
Close examination of paths that need to be traversed to satisfy a path selection criterion
often reveals that, among a large set of paths, only a few are likely to uncover faults that
could have escaped discovery using condition testing coverage. Criteria that select paths
based on control structure alone (e.g., boundary interior testing) may not be effective in
identifying these few significant paths because their significance depends not only on
control flow but on data interactions.
Data flow testing is based on the observation that computing the wrong value leads to a
failure only when that value is subsequently used. Focus is therefore moved from control
flow to data flow. Data flow testing criteria pair variable definitions with uses, ensuring
that each computed value is actually used, and thus selecting from among many
execution paths a set that is more likely to propagate the result of erroneous
computation to the point of an observable failure.
Consider, for example, the C function cgi decode of Figure 13.1, which decodes a string
that has been transmitted through the Web's Common Gateway Interface. Data flow
testing criteria would require one to execute paths that first define (change the value of)
variable eptr (e.g., by incrementing it at line 37) and then use the new value of variable
eptr (e.g., using variable eptr to update the array indexed by dptr at line 34). Since a
value defined in one iteration of the loop is used on a subsequent iteration, we are
obliged to execute more than one iteration of the loop to observe the propagation of
information from one iteration to the next.
 1
 2  /* External file hex_values.h defines Hex_Values[128]
 3   *  with value 0 to 15 for the legal hex digits (case-insensitive)
 4   *  and value -1 for each illegal digit including special characters
 5   */
 6
 7  #include "hex_values.h"
 8  /**  Translate a string from the CGI encoding to plain ascii
 9   *     '+' becomes space, %xx becomes byte with hex value xx,
10   *     other alphanumeric characters map to themselves.
11   *   Returns 0 for success, positive for erroneous input
12   *     1 = bad hexadecimal digit
13   */
14  int cgi_decode(char *encoded, char *decoded) {
15      char *eptr = encoded;
16      char *dptr = decoded;
17      int ok = 0;
18      while (*eptr) {
19          char c;
20          c = *eptr;
21
22          if (c == '+') {          /* Case 1: '+' maps to blank */
23              *dptr = ' ';
24          } else if (c == '%') {   /* Case 2: '%xx' is hex for character xx */
25              int digit_high = Hex_Values[*(++eptr)];
26              int digit_low  = Hex_Values[*(++eptr)];
27              if (digit_high == -1 || digit_low == -1) {
28                  /* *dptr = '?'; */
29                  ok = 1;          /* Bad return code */
30              } else {
31                  *dptr = 16 * digit_high + digit_low;
32              }
33          } else {                 /* Case 3: All other characters map to themselves */
34              *dptr = *eptr;
35          }
36          ++dptr;
37          ++eptr;
38      }
39      *dptr = '\0';                /* Null terminator for string */
40      return ok;
41  }
Figure 13.1: The C function cgi_decode, which translates a cgi-encoded string to a
plain ASCII string (reversing the encoding applied by the common gateway interface
of most Web servers). This program is also used in Chapter 12 and also presented in
Figure 12.1 of Chapter 12.
Table 13.1: Definitions and uses for the C function cgi_decode of Figure 13.1.

Variable       Definitions    Uses
encoded        14             15
decoded        14             16
*eptr
eptr
dptr           16, 36
ok             17, 29         40
c              20             22, 24
digit_high     25             27, 31
digit_low      26             27, 31
Hex_Values                    25, 26
We will initially consider treatment of arrays and pointers in the current example in a
somewhat ad hoc fashion and defer discussion of the general problem to Section 13.4.
Variables eptr and dptr are used for indexing the input and output strings. In program cgi
decode, we consider the variables as both indexes (eptr and dptr) and strings (*eptr and
*dptr). The assignment *dptr = *eptr is treated as a definition of the string *dptr as well
as a use of the index dptr, the index eptr, and the string *eptr, since the result depends
on both indexes as well as the contents of the source string. A change to an index is
treated as a definition of both the index and the string, since a change of the index
changes the value accessed by it. For example, in the statement at line 36 (++dptr), we
have a use of variable dptr followed by a definition of variables dptr and *dptr.
It is somewhat counterintuitive that we have definitions of the string *eptr, since it is easy
to see that the program is scanning the encoded string without changing it. For the
purposes of data flow testing, though, we are interested in interactions between
computation at different points in the program. Incrementing the index eptr is a
"definition" of *eptr in the sense that it can affect the value that is next observed by a use
of *eptr.
Pairing definitions and uses of the same variable v identifies potential data interactions
through v - definition-use pairs (DU pairs). Table 13.2 shows the DU pairs in program cgi
decode of Figure 13.1. Some pairs of definitions and uses from Table 13.1 do not occur
in Table 13.2, since there is no definition-clear path between the two statements. For
example, the use of variable eptr at line 15 cannot be reached from the increment at line
37, so there is no DU pair (37, 15). The definitions of variables *eptr and eptr at line 25,
are paired only with the respective uses at line 26, since successive definitions of the
two variables at line 26 kill the definition at line 25 and eliminate definition-clear paths to
any other use.
Table 13.2: DU pairs for C function cgi_decode. Variable Hex_Values does not
appear because it is not defined (modified) within the procedure.
Variable       DU pairs
*eptr
eptr
dptr
ok             (17, 40), (29, 40)
c              (20, 22), (20, 24)
digit_high     (25, 27), (25, 31)
digit_low      (26, 27), (26, 31)
encoded        (14, 15)
decoded        (14, 16)
A DU pair requires the existence of at least one definition-clear path from definition to
use, but there may be several. Additional uses on a path do not interfere with the pair.
We sometimes use the term DU path to indicate a particular definition-clear path
between a definition and a use. For example, let us consider the definition of *eptr at line
37 and the use at line 34. There are infinitely many paths that go from line 37 to the use
at line 34. There is one DU path that does not traverse the loop while going from 37 to
34. There are infinitely many paths from 37 back to 37, but only two DU paths, because
the definition at 37 kills the previous definition at the same point.
Data flow testing, like other structural criteria, is based on information obtained through
static analysis of the program. We discard definitions and uses that cannot be statically
paired, but we may select pairs even if none of the statically identifiable definition-clear
paths is actually executable. In the current example, we have made use of information
that would require a quite sophisticated static data flow analyzer, as discussed in
Section 13.4.
The all DU pairs adequacy criterion assures a finer grain coverage than statement and
branch adequacy criteria. If we consider, for example, function cgi decode, we can
easily see that statement and branch coverage can be obtained by traversing the while
loop no more than once, for example, with the test suite Tbranch = {"+","%3D","%FG","t"}
while several DU pairs cannot be covered without executing the while loop at least twice.
The pairs that may remain uncovered after statement and branch coverage correspond
to occurrences of different characters within the source string, and not only at the
beginning of the string. For example, the DU pair (37, 25) for variable *eptr can be
covered with a test case TCDU-pairs = "test%3D", where the hexadecimal escape sequence
occurs inside the input string, but not with "%3D." The test suite TDU-pairs, obtained by
adding the test case TCDU-pairs to the test suite Tbranch, satisfies the all DU pairs
adequacy criterion, since it adds both the cases of a hexadecimal escape sequence and
an ASCII character occurring inside the input string.
One DU pair might belong to many different execution paths. The All DU paths
adequacy criterion extends the all DU pairs criterion by requiring each simple (non
looping) DU path to be traversed at least once, thus including the different ways of
pairing definitions and uses. This can reveal a fault by exercising a path on which a
definition of a variable should have appeared but was omitted.
A test suite T for a program P satisfies the all DU paths adequacy criterion iff, for each
simple DU path dp of P, there exists at least one test case in T that exercises a path
that includes dp.
The test suite TDU-pairs does not satisfy the all DU paths adequacy criterion, since both
DU pairs (37, 37) for variable eptr and (36, 23) for variable dptr each correspond to
two simple DU paths, and in both cases one of the two paths is not covered by test
cases in TDU-pairs. The uncovered paths correspond to a test case that includes the
character + occurring within the input string (e.g., test case TCDU-paths = "test+case").
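The suites discussed above translate into a small driver. The sketch below assumes that
the cgi_decode implementation of Figure 13.1 is linked in; it runs Tbranch together with the
two test cases that extend it toward all DU pairs and all DU paths coverage, checking the
decoded output only for well-formed inputs (the specification does not fix the output for the
malformed input "%FG").

#include <stdio.h>
#include <string.h>

/* Assumed to be provided by the unit under test (Figure 13.1). */
extern int cgi_decode(char *encoded, char *decoded);

struct tc { const char *input; int rc; const char *decoded; /* NULL: not checked */ };

int main(void) {
    const struct tc suite[] = {
        /* Tbranch: statement and branch coverage, traversing the loop no more than once. */
        { "+",         0, " "          },
        { "%3D",       0, "="          },
        { "%FG",       1, NULL         },   /* malformed: output not checked */
        { "t",         0, "t"          },
        /* Added for all DU pairs: escape sequence inside the string. */
        { "test%3D",   0, "test="      },
        /* Added for all DU paths: '+' inside the string. */
        { "test+case", 0, "test case"  },
    };
    char in[64], out[64];
    int failures = 0;

    for (size_t i = 0; i < sizeof suite / sizeof suite[0]; ++i) {
        memset(in, 0, sizeof in);
        memset(out, 0, sizeof out);
        strcpy(in, suite[i].input);
        int rc = cgi_decode(in, out);
        if (rc != suite[i].rc ||
            (suite[i].decoded != NULL && strcmp(out, suite[i].decoded) != 0)) {
            printf("failed on \"%s\": rc %d, decoded \"%s\"\n", suite[i].input, rc, out);
            ++failures;
        }
    }
    printf("%d failure(s)\n", failures);
    return failures != 0;
}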
Although the number of simple DU paths is often quite reasonable, in the worst case it
can be exponential in the size of the program unit. This can occur when the code
between the definition and use of a particular variable is essentially irrelevant to that
variable, but contains many control paths, as illustrated by the example in Figure 13.2:
The code between the definition of ch in line 2 and its use in line 12 does not modify ch,
but the all DU paths coverage criterion would require that each of the 256 paths be
exercised.
Figure 13.2: A C procedure with a large number of DU paths: there is only one DU pair
for variable ch, pairing the definition in the
procedure header with the final print statement, but there are 256 definition-clear
paths between those statements - exponential in the number of intervening if
statements.
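A procedure with the shape described for Figure 13.2 - a single definition of ch, eight
intervening if statements on an unrelated input, and one final use - might look like the
hypothetical sketch below; it illustrates how there can be exactly one DU pair but
2^8 = 256 definition-clear paths between the definition and the use.

#include <stdio.h>

/* Hypothetical procedure shaped like Figure 13.2: ch is defined once
   (as a parameter) and used once (in the final print statement). */
static void many_paths(char ch, unsigned flags) {   /* definition of ch */
    int set = 0;
    if (flags & 0x01) set++;    /* each of these eight branches doubles  */
    if (flags & 0x02) set++;    /* the number of definition-clear paths  */
    if (flags & 0x04) set++;    /* from the definition of ch to its use, */
    if (flags & 0x08) set++;    /* giving 2^8 = 256 simple DU paths      */
    if (flags & 0x10) set++;
    if (flags & 0x20) set++;
    if (flags & 0x40) set++;
    if (flags & 0x80) set++;
    printf("%c: %d bits set\n", ch, set);            /* use of ch */
}

int main(void) {
    many_paths('x', 0x2Au);
    return 0;
}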
We normally consider both All DU paths and All DU pairs adequacy criteria as relatively
powerful and yet practical test adequacy criteria, as depicted in Figure 12.8 on page
231. However, in some cases, even the all DU pairs criterion may be too costly. In these
cases, we can refer to a coarser grain data flow criterion, the All definitions adequacy
criterion, which requires pairing each definition with at least one use.
A test suite T for a program P satisfies the all definitions adequacy criterion for P iff, for
each definition def of P, there exists at least one test case in T that exercises a DU pair
that includes def.
The corresponding coverage measure is the proportion of covered definitions, where we
say a definition is covered only if the value is used before being killed:
The all definitions coverage CDef of T for P is the fraction of definitions of program P
covered by at least one test case in T.
4      p = &j + 1;
5      q = &k;
6      *p = 30;
7      *q = *q + 55;
8      printf("p=%d, q=%d\n", *p, *q);
9  }
Figure 13.3: Pointers to objects in the program stack can create essentially arbitrary
definition-use associations, particularly when combined with pointer arithmetic as in
this example.
Further Reading
The concept of test case selection using data flow information was apparently first
suggested in 1976 by Herman [Her76], but that original paper is not widely accessible.
The more widely known version of data flow test adequacy criteria was developed
independently by Rapps and Weyuker [RW85] and by Laski and Korel [LK83]. The
variety of data flow testing criteria is much broader than the handful of criteria described
in this chapter; Clarke et al. present a formal comparison of several criteria [CPRZ89].
Frankl and Weyuker consider the problem of infeasible paths and how they affect the
relative power of data flow and other structural test adequacy criteria [FW93].
Marx and Frankl consider the problem of aliases and application of alias analysis on
individual program paths [MF96]. A good example of modern empirical research on
costs and effectiveness of structural test adequacy criteria, and data flow test coverage
in particular, is Frankl and Iakounenko [FI98].
Related Topics
The next chapter discusses model-based testing. Section 14.4 shows how control and
data flow models can be used to derive test cases from specifications. Chapter 15
illustrates the use of data flow analysis for structural testing of object oriented programs.
Readers interested in the use of data flow for program analysis can proceed with
Chapter 19.
Exercises
13.1 Sometimes a distinction is made between uses of values in predicates (p-uses) and other "computational" uses in statements (c-uses). New criteria can be defined using that distinction, for example:
all p-use some c-use: for all definitions and uses, exercise all (def, p-use) pairs and at least one (def, c-use) pair
all c-use some p-use: for all definitions and uses, exercise all (def, c-use) pairs and at least one (def, p-use) pair
1. Provide a precise definition of these criteria.
2. Describe the differences in the test suites derived by applying the different criteria to function cgi_decode in Figure 13.1.
13.2 Demonstrate the subsume relation between all p-use some c-use, all c-use some p-use, all DU pairs, all DU paths, and all definitions.
13.3 How would you treat the buf array in the transduce procedure shown in Figure 16.1?
Required Background
Chapter 10
The rationale of systematic approaches to functional testing is a key motivation
for the techniques presented in this chapter.
Chapters 12 and 13
The material on control and data flow graphs is required to understand Section
14.4, but it is not necessary to comprehend the rest of the chapter.
14.1 Overview
Combinatorial approaches to specification-based testing (Chapter 11) primarily select
combinations of orthogonal choices. They can accommodate constraints among choices,
but their strength is in systematically distributing combinations of (purportedly)
independent choices. The human effort in applying those techniques is primarily in
characterizing the elements to be combined and constraints on their combination, often
starting from informal or semistructured specifications.
Specifications with more structure can be exploited to help test designers identify input
elements, constraints, and significant combinations. The structure may be explicit and
available in a specification, for example, in the form of a finite state machine or
grammar. It may be derivable from a semiformal model, such as class and object
diagrams, with some guidance by the designer. Even if the specification is expressed in
natural language, it may be worthwhile for the test designer to manually derive one or
more models from it, to make the structure explicit and suitable for automatic derivation
of test case specifications.
Models can be expressed in many ways. Formal models (e.g., finite state machines or
grammars) provide enough information to allow one to automatically generate test
cases. Semiformal models (e.g, class and object diagrams) may require some human
judgment to generate test cases. This chapter discusses some of the most common
models used to express requirements specifications. Models used for object-oriented
design are discussed in Chapter 15.
Models can provide two kinds of help. They describe the structure of the input space
and thus allow test designers to take advantage of work done in software requirements
analysis and design. Moreover, discrepancies from the model can be used as an implicit
fault model to help identify boundary and error cases.
The utility of models for generating test cases is an important factor in determining the
cost-effectiveness of producing formal or semiformal specifications. The return on
investment for model building should be evaluated not only in terms of reduced
specification errors and avoided misinterpretation, but also improved effectiveness and
reduced effort and cost in test design.
Testing from a finite state machine specification is typically based on a basic strategy of checking each state transition. One way to understand this basic
strategy is to consider that each transition is essentially a specification of a precondition
and postcondition, for example, a transition from state S to state T on stimulus i means
"if the system is in state S and receives stimulus i, then after reacting it will be in state T
." For instance, the transition labeled accept estimate from state Wait for acceptance to
state Repair (maintenance station) of Figure 14.2 indicates that if an item is on hold
waiting for the customer to accept an estimate of repair costs, and the customer
accepts the estimate, then the item is designated as eligible for repair.
A faulty system could violate any of these precondition, postcondition pairs, so each
should be tested. For example, the state Repair (maintenance station) can be arrived at
through three different transitions, and each should be checked.
Details of the approach taken depend on several factors, including whether system
states are directly observable or must be inferred from stimulus/response sequences,
whether the state machine specification is complete as given or includes additional,
implicit transitions, and whether the size of the (possibly augmented) state machine is
modest or very large.
The transition coverage criterion requires each transition in a finite state model to be
traversed at least once. Test case specifications for transition coverage are often
transition coverage given as state sequences or transition sequences. For example, the
test suite T-Cover in Table 14.1 is a set of four paths, each beginning at the initial state,
which together cover all transitions of the finite state machine of Figure 14.2. T-Cover
thus satisfies the transition coverage criterion.
Table 14.1: A test suite satisfying the transition coverage criterion with respect to
the finite state machine of Figure 14.2
T-Cover
TC-1 0-2-4-1-0
TC-2 0-5-2-4-5-6-0
TC-3 0-3-5-9-6-0
TC-4 0-3-5-7-5-8-7-8-9-7-9-6-0
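Measuring transition coverage for a suite like T-Cover is mechanical once the machine and the test paths are represented explicitly. The following minimal Java sketch is hypothetical (states are encoded as integers, as in the paths above, and the machine in main is a toy example, not the maintenance machine of Figure 14.2); it reports the transitions a suite leaves uncovered, and the suite satisfies the transition coverage criterion exactly when that set is empty.

import java.util.*;

public class TransitionCoverage {
    // A transition is an ordered pair of states; a test case is a path given as a
    // sequence of states, as in the rows of Table 14.1.
    record Transition(int from, int to) {}

    static Set<Transition> uncovered(Set<Transition> fsm, List<List<Integer>> suite) {
        Set<Transition> missing = new HashSet<>(fsm);
        for (List<Integer> path : suite) {
            for (int i = 0; i + 1 < path.size(); i++) {
                missing.remove(new Transition(path.get(i), path.get(i + 1)));
            }
        }
        return missing;   // empty iff the suite satisfies the transition coverage criterion
    }

    public static void main(String[] args) {
        // A toy three-state machine, not the one of Figure 14.2.
        Set<Transition> fsm = Set.of(new Transition(0, 1), new Transition(1, 2),
                                     new Transition(2, 0), new Transition(1, 0));
        List<List<Integer>> suite = List.of(List.of(0, 1, 2, 0), List.of(0, 1, 0));
        System.out.println("Uncovered transitions: " + uncovered(fsm, suite));
    }
}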
The transition coverage criterion depends on the assumption that the finite state machine
model is a sufficient representation of all the "important" state, for example, that
transitions out of a state do not depend on how one reached that state. Although it can
be considered a logical flaw, in practice one often finds state machines that exhibit
"history sensitivity," (i.e., the transitions from a state depend on the path by which one
reached that state). For example, in Figure 14.2, the transition taken from state Wait for
component when the component becomes available depends on how the state was
entered. This is a flaw in the model - there really should be three distinct Wait for
component states, each with a well-defined action when the component becomes
available. However, sometimes it is more expedient to work with a flawed state machine
model than to repair it, and in that case test suites may be based on more than the
simple transition coverage criterion.
Coverage criteria designed to cope with history sensitivity include single state path
coverage, single transition path coverage, and boundary interior loop coverage. Each
of these criteria requires execution of paths that include certain subpaths in the FSM.
The single state path coverage criterion requires each subpath that traverses states at
most once to be included in a path that is exercised. The single transition path coverage
criterion requires each subpath that traverses transitions at most once to be included in
a path that is exercised. The boundary interior loop coverage criterion requires each
distinct loop of the state machine to be exercised the minimum, an intermediate, and the
maximum or a large number of times[1]. These criteria may be practical for very small
and simple finite state machine specifications, but since the number of even simple paths
(without repeating states) can grow exponentially with the number of states, they are
often impractical.
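The growth in the number of simple paths is easy to observe directly. The following hypothetical sketch (states are again integers; the example machine is a complete graph in which every state can reach every other, which is the worst case) counts the subpaths that traverse each state at most once, that is, the subpaths the single state path coverage criterion requires to be covered.

import java.util.*;

public class SimplePaths {
    // Counts the paths starting at `state` that visit no state twice.
    static long count(Map<Integer, List<Integer>> succ, int state, Set<Integer> visited) {
        visited.add(state);
        long paths = 1;                                   // the path that ends here
        for (int next : succ.getOrDefault(state, List.of())) {
            if (!visited.contains(next)) {
                paths += count(succ, next, visited);
            }
        }
        visited.remove(state);
        return paths;
    }

    public static void main(String[] args) {
        int n = 8;                                        // a complete machine with 8 states
        Map<Integer, List<Integer>> succ = new HashMap<>();
        for (int i = 0; i < n; i++) {
            List<Integer> targets = new ArrayList<>();
            for (int j = 0; j < n; j++) if (j != i) targets.add(j);
            succ.put(i, targets);
        }
        long total = 0;
        for (int start = 0; start < n; start++) total += count(succ, start, new HashSet<>());
        System.out.println("Simple paths: " + total);     // already 109,600 for 8 states
    }
}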
Specifications given as finite state machines are typically incomplete: They do not
include a transition for every possible (state, stimulus) pair. Often the missing transitions
are implicitly error cases. Depending on the system, the appropriate interpretation may
be that these are don't care transitions (since no transition is specified, the system may
do anything or nothing), self transitions (since no transition is specified, the system
should remain in the same state), or (most commonly) error transitions that enter a
distinguished state and possibly trigger some error-handling procedure. In at least the
latter two cases, thorough testing includes the implicit as well as the explicit state
transitions. No special techniques are required: The implicit transitions are simply added
to the representation before test cases are selected.
The presence of implicit transitions with a don't care interpretation is typically an implicit
or explicit statement that those transitions are impossible (e.g., because of physical
constraints). For example, in the specification of the maintenance procedure of Figure
14.2, the effect of event lack of component is specified only for the states that
represent repairs in progress, because only in those states might we discover that a needed component is missing.
Sometimes it is possible to test don't care transitions even if they are believed to be
impossible in the fielded system, because the system does not prevent the triggering
event from occurring in a test configuration. If it is not possible to produce test cases for
the don't care transitions, then it may be appropriate to pass them to other validation or
verification activities, for example, by including explicit assumptions in a requirements or
specification document that will undergo inspection.
[1]The
boundary interior path coverage was originally proposed for structural coverage of
program control flow, and is described in Chapter 12.
Pricing The pricing function determines the adjusted price of a configuration for a
particular customer. The scheduled price of a configuration is the sum of the
scheduled price of the model and the scheduled price of each component in the
configuration. The adjusted price is either the scheduled price, if no discounts are
applicable, or the scheduled price less any applicable discounts.
There are three price schedules and three corresponding discount schedules,
Business, Educational, and Individual. The Business price and discount schedules
apply only if the order is to be charged to a business account in good standing. The
Educational price and discount schedules apply to educational institutions. The
Individual price and discount schedules apply to all other customers. Account classes
and rules for establishing business and educational accounts are described further in
[].
A discount schedule includes up to three discount levels, in addition to the possibility
of "no discount." Each discount level is characterized by two threshold values, a value
for the current purchase (configuration schedule price) and a cumulative value for
purchases over the preceding 12 months (sum of adjusted price).
Educational prices The adjusted price for a purchase charged to an educational
account in good standing is the scheduled price from the educational price schedule.
No further discounts apply.
Business account discounts Business discounts depend on the size of the current
purchase as well as business in the preceding 12 months. A tier 1 discount is
applicable if the scheduled price of the current order exceeds the tier 1 current order
threshold, or if total paid invoices to the account over the preceding 12 months
exceeds the tier 1 year cumulative value threshold. A tier 2 discount is applicable if
the current order exceeds the tier 2 current order threshold, or if total paid invoices to
the account over the preceding 12 months exceeds the tier 2 cumulative value
threshold. A tier 2 discount is also applicable if both the current order and 12 month
cumulative payments exceed the tier 1 thresholds.
Individual discounts Purchase by individuals and by others without an established
account in good standing is based on current value alone (not on cumulative
purchases). A tier 1 individual discount is applicable if the scheduled price of the
configuration in the current order exceeds the tier 1 current order threshold. A tier 2
individual discount is applicable if the scheduled price of the configuration exceeds the
tier 2 current order threshold.
Special-price nondiscountable offers Sometimes a complete configuration is
offered at a special, non-discountable price. When a special, nondiscountable price is
available for a configuration, the adjusted price is the nondiscountable price or the
regular price after any applicable discounts, whichever is less.
Figure 14.3: The functional specification of feature Pricing of the Chipmunk Web
site.
When functional specifications can be given as Boolean expressions, we can apply any
of the condition testing approaches described in Chapter 12, Section 12.4. A good test
suite should at least exercise all elementary conditions occurring in the expression. For
simple conditions we might derive test case specifications for all possible combinations
of truth values of the elementary conditions. For complex formulas, when testing all 2^n combinations of n elementary conditions is apt to be too expensive, the modified condition/decision coverage criterion (Chapter 12, Section 12.4) derives a small set of test conditions such that each elementary condition independently affects the outcome.
We can produce different models of the decision structure of a specification depending
on the original specification and on the technique we want to use for deriving test cases.
If the original specification is expressed informally as in Figure 14.3, we can transform it
into either a Boolean expression, a graph, or a tabular model before applying a test
case generation technique.
Techniques for deriving test case specifications from decision structures were originally
developed for graph models, and in particular cause-effect graphs, which have been
used since the early 1970s. Cause-effect graphs are tedious to derive and do not scale
well to complex specifications. Tables, on the other hand, are easy to work with and
scale well.
The rows of a decision table represent basic conditions, and columns represent
combinations of basic conditions. The last row of the table indicates the expected
outputs for each combination. Cells of the table are labeled either True, False, or don't care (usually written -), to indicate the truth value of the basic condition. Thus, each column is equivalent to a logical expression joining the required values (negated, in the case of False entries) and omitting the basic conditions with don't care values.[2]
Decision tables can be augmented with a set of constraints that limit the possible combinations of basic conditions. A constraint language can be based on Boolean logic. Often it is useful to add some shorthand notations for common combinations, such as at-most-one(C1, ..., Cn) and exactly-one(C1, ..., Cn), which are tedious to express with the standard Boolean connectives.
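Such constraints are straightforward to check mechanically over a candidate combination. The sketch below uses a hypothetical representation, not code from the Chipmunk system: a column is a map from basic-condition names to truth values, a condition absent from the map plays the role of a don't care entry, and the two helpers check the at-most-one shorthand and an implication constraint such as CP > CT2 implies CP > CT1.

import java.util.*;

public class Constraints {
    // True if at most one of the named conditions is assigned True in the column.
    static boolean atMostOne(Map<String, Boolean> col, String... conds) {
        int trueCount = 0;
        for (String c : conds) {
            if (Boolean.TRUE.equals(col.get(c))) trueCount++;
        }
        return trueCount <= 1;
    }

    // Violated only when the premise is known True and the conclusion is known False.
    static boolean implies(Map<String, Boolean> col, String premise, String conclusion) {
        return !(Boolean.TRUE.equals(col.get(premise))
                 && Boolean.FALSE.equals(col.get(conclusion)));
    }

    public static void main(String[] args) {
        Map<String, Boolean> column = Map.of("EduAc", false, "BusAc", true,
                                             "CP > CT2", true, "CP > CT1", true);
        System.out.println(atMostOne(column, "EduAc", "BusAc"));      // true
        System.out.println(implies(column, "CP > CT2", "CP > CT1"));  // true
    }
}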
Figure 14.4 shows the decision table for the functional specification of feature pricing of
the Chipmunk Web site presented in Figure 14.3.
The rows of the decision table are the basic conditions EduAc (educational account), BusAc (business account), CP > CT1 and CP > CT2 (current purchase exceeds the tier 1 or tier 2 current-order threshold), YP > YT1 and YP > YT2 (cumulative purchases over the preceding 12 months exceed the tier 1 or tier 2 threshold), SP > Sc (special price better than scheduled price), and SP > T1 and SP > T2 (special price better than the tier 1 or tier 2 discounted price). The columns, grouped into the Educational, Individual, and Business cases, give combinations of truth values of these conditions, with don't care entries written -, and the last row (Out) gives the expected output for each combination: Edu (educational price), ND (no discount), T1 (tier 1 discount), T2 (tier 2 discount), or SP (special price). (The individual cell entries are not reproduced here.)
Constraints
at-most-one(EduAc, BusAc)
at-most-one(YP < YT1, YP > YT2)
at-most-one(CP < CT1, CP > CT2)
YP > YT2 implies YP > YT1
CP > CT2 implies CP > CT1
SP > T2 implies SP > T1
Figure 14.4: A decision table for the functional specification of feature Pricing of the
Chipmunk Web site of Figure 14.3.
The informal specification of Figure 14.3 identifies three customer profiles: educational,
business, and individual. Figure 14.4 has only rows Educational account (EduAc) and
Business account (BusAc). The choice individual corresponds to the combination False,
False for choices EduAc and BusAc, and is thus redundant. The informal specification of
Figure 14.3 indicates different discount policies depending on the relation between the
current purchase and two progressive thresholds for the current purchase and the yearly
cumulative purchase. These cases correspond to rows 3 through 6 of Figure 14.4.
Conditions on thresholds that do not correspond to individual rows in the table can be
defined by suitable combinations of values for these rows. Finally, the informal
specification of Figure 14.3 distinguishes the cases in which special offer prices do not
exceed either the scheduled or the tier 1 or tier 2 prices. Rows 7 through 9 of the table,
suitably combined, capture all possible cases of special prices without redundancy.
Constraints formalize the compatibility relations among different basic conditions listed in
the table. For example, a cumulative purchase exceeding threshold tier 2 also exceeds
threshold tier 1.
The basic condition adequacy criterion requires generation of a test case specification
for each column in the table. Don't care entries of the table can be filled out arbitrarily,
as long as constraints are not violated.
The compound condition adequacy criterion requires a test case specification for each
combination of truth values of basic conditions. The compound condition adequacy
criterion generates a number of cases exponential in the number of basic conditions (2^n combinations for n conditions) and can thus be applied only to small sets of basic
conditions.
For the modified condition/decision adequacy criterion (MC/DC), each column in the
table represents a test case specification. In addition, for each of the original columns,
MC/DC generates new columns by modifying each of the cells containing True or False.
If modifying a truth value in one column results in a test case specification consistent with an existing column (agreeing in all places where neither is don't care), the two test cases are represented by one merged column, provided they can be merged without violating constraints.
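The generation step can be mechanized roughly as follows. In this hypothetical sketch a column is again a map from condition names to Boolean values, with null standing for don't care; for one column it produces the variant columns obtained by flipping each True or False cell in turn. Merging with existing columns and checking constraints, as described above, would then be applied to the result. For a column with EduAc = True and SP > Sc = False (and everything else don't care), the two variants produced are exactly the two columns discussed for column 1 of Figure 14.4 below.

import java.util.*;

public class McdcVariants {
    // Flip each cell that is True or False; leave don't care cells (null) alone.
    static List<Map<String, Boolean>> variants(Map<String, Boolean> column) {
        List<Map<String, Boolean>> result = new ArrayList<>();
        for (Map.Entry<String, Boolean> cell : column.entrySet()) {
            if (cell.getValue() == null) continue;         // don't care: nothing to flip
            Map<String, Boolean> flipped = new HashMap<>(column);
            flipped.put(cell.getKey(), !cell.getValue());
            result.add(flipped);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Boolean> col1 = new HashMap<>();
        col1.put("EduAc", true);       // educational account
        col1.put("SP > Sc", false);    // special price not better than scheduled price
        variants(col1).forEach(System.out::println);
    }
}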
The MC/DC criterion formalizes the intuitive idea that a thorough test suite would not only test positive combinations of values - combinations that lead to specified outputs - but also negative combinations - combinations that differ from the specified ones and thus should produce different outputs, in some cases among the specified ones and in other cases leading to error conditions.
Applying MC/DC to column 1 of Figure 14.4 generates two additional columns: one for
Educational Account = False and Special Price better than scheduled price = False,
and the other for Educational Account = True and Special Price better than scheduled
price = True. Both columns are already in the table (columns 3 and 2, respectively) and
thus need not be added.
Similarly, from column 2, we generate two additional columns corresponding to
Educational Account = False and Special Price better than scheduled price = True, and
Educational Account = True and Special Price better than scheduled price = False,
also already in the table.
Generation of a new column for each possible variation of the Boolean values in the
columns, varying exactly one value for each new column, produces 78 new columns, 21
of which can be merged with columns already in the table. Figure 14.5 shows a table
obtained by suitably joining the generated columns with the existing ones. Many don't
care cells from the original table are assigned either True or False values, to allow
merging of different columns or to obey constraints. The few don't-care entries left can
be set randomly to obtain a complete test case.
The rows of this table are the same basic conditions as in Figure 14.4 (EduAc, BusAc, CP > CT1, YP > YT1, CP > CT2, YP > YT2, SP > Sc, SP > T1, and SP > T2), and the abbreviations are also those of Figure 14.4. The columns are the merged test case specifications obtained by the MC/DC generation step, with many former don't care entries now fixed to True or False, and the last row (Out) again gives the expected output (Edu, ND, T1, T2, or SP) for each column. (The individual cell entries are not reproduced here.)
Figure 14.5: The set of test cases generated for feature Pricing of the Chipmunk
Web site applying the modified condition/decision coverage (MC/DC) adequacy criterion.
There are many ways of merging columns that generate different tables. The table in
Figure 14.5 may not be the optimal one - the one with the fewest columns. The objective
in test design is not to find an optimal test suite, but rather to produce a cost-effective test suite with an acceptable trade-off between the cost of generating and executing test cases and the effectiveness of the tests.
The table in Figure 14.5 fixes the entries as required by the constraints, while the initial
table in Figure 14.4 does not. Keeping constraints separate from the table
corresponding to the initial specification increases the number of don't care entries in
the original table, which in turn increases the opportunity for merging columns when
generating new cases with the MC/DC criterion. For example, if business account
(BusAc) = False, the constraint at-most-one(EduAc, BusAc) can be satisfied by
assigning either True or False to entry educational account. Fixing either choice
prematurely may later make merging with a newly generated column impossible.
[2]The
Process shipping order The Process shipping order function checks the validity of
orders and prepares the receipt.
A valid order contains the following data:
cost of goods If the cost of goods is less than the minimum processable
order (MinOrder), then the order is invalid.
shipping address The address includes name, address, city, postal code,
and country.
preferred shipping method If the address is domestic, the shipping method must be land freight, expedited land freight, or overnight air. If the address is international, the shipping method must be either air freight or expedited air freight; a shipping cost is computed based on address and shipping method.
type of customer A customer can be individual, business, or educational.
preferred method of payment Individual customers can use only credit
cards, while business and educational customers can choose between credit
card and invoice.
card information If the method of payment is credit card, fields credit card
number, name on card, expiration date, and billing address, if different from
shipping address, must be provided. If credit card information is not valid, the
user can either provide new data or abort the order.
The outputs of Process shipping order are
validity Validity is a Boolean output that indicates whether the order can be
processed.
total charge The total charge is the sum of the value of goods and the
computed shipping costs (only if validity = true).
payment status If all data are processed correctly and the credit card
information is valid or the payment method is by invoice, payment status is set
to valid, the order is entered, and a receipt is prepared; otherwise validity =
false.
Figure 14.6: Functional specification of the feature Process shipping order of the
Chipmunk Web site.
The informal specification in Figure 14.6 can be modeled with a control flow graph,
where the nodes represent computations and branches represent control flow consistent
with the dependencies among computations, as illustrated in Figure 14.7. Given a control
or a data flow graph model, we can generate test case specifications using the criteria
originally devised for structural testing and described in Chapters 12 and 13.
T-node
Case    Too small   Ship where   Ship method   Cust type   Pay method   Same addr   CC valid
TC-1    No          Int          Air           Bus         CC           No          Yes
TC-2    No          Dom          Air           Ind         CC           -           No (abort)
Abbreviations: Too small, Ship where, Ship method, Cust type, Pay method, Same addr, and CC valid refer to the corresponding items of the specification in Figure 14.6; Int and Dom stand for international and domestic addresses, CC for credit card, Bus and Ind for business and individual customers.
Figure 14.8: Test suite T-node, comprising test case specifications TC-1 and TC-2,
exercises each of the nodes in a control flow graph model of the specification in
Figure 14.6.
The branch adequacy criterion requires each branch to be exercised at least once: each
edge of the graph must be traversed by at least one test case. Test suite T-branch
(Figure 14.9) covers all branches of the control flow graph of Figure 14.7 and thus
satisfies the branch adequacy criterion.
T-branch
Case    Too small   Ship where   Ship method   Cust type   Pay method   Same addr   CC valid
TC-1    No          Int          Air           Bus         CC           No          Yes
TC-2    No          Dom          Land          -           -            -           -
TC-3    Yes         -            -             -           -            -           -
TC-4    No          Dom          Air           -           -            -           -
TC-5    No          Int          Land          -           -            -           -
TC-6    No          -            -             Edu         Inv          -           -
TC-7    No          -            -             -           CC           Yes         -
TC-8    No          -            -             -           CC           -           No (abort)
TC-9    No          -            -             -           CC           -           No (no abort)
Abbreviations: as in Figure 14.8; Edu stands for educational customers, Inv for payment by invoice, Land for land freight.
Figure 14.9: Test suite T-branch exercises each of the decision outcomes in a control
flow graph model of the specification in Figure 14.6.
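Each row of a suite like T-branch can be read as committing to the outcomes of some decisions in the control flow model while leaving the others open. The hypothetical sketch below (its field names mirror the column headings of Figure 14.9; it is not code of the Chipmunk system) lists the decision outcomes fixed by a single test case specification; a suite satisfies the branch adequacy criterion when, over all of its rows, every outcome of every decision in the graph appears at least once.

import java.util.*;

public class ShippingTestSpec {
    // A null field is a don't care entry in the corresponding column.
    String tooSmall, shipWhere, shipMethod, custType, payMethod, sameAddr, ccValid;

    ShippingTestSpec(String tooSmall, String shipWhere, String shipMethod,
                     String custType, String payMethod, String sameAddr, String ccValid) {
        this.tooSmall = tooSmall; this.shipWhere = shipWhere; this.shipMethod = shipMethod;
        this.custType = custType; this.payMethod = payMethod;
        this.sameAddr = sameAddr; this.ccValid = ccValid;
    }

    // The decision outcomes this test case specification commits to.
    Map<String, String> exercisedOutcomes() {
        Map<String, String> outcomes = new LinkedHashMap<>();
        if (tooSmall != null)   outcomes.put("cost of goods below MinOrder", tooSmall);
        if (shipWhere != null)  outcomes.put("shipping address domestic or international", shipWhere);
        if (shipMethod != null) outcomes.put("preferred shipping method", shipMethod);
        if (custType != null)   outcomes.put("type of customer", custType);
        if (payMethod != null)  outcomes.put("method of payment", payMethod);
        if (sameAddr != null)   outcomes.put("billing address same as shipping", sameAddr);
        if (ccValid != null)    outcomes.put("credit card information valid", ccValid);
        return outcomes;
    }

    public static void main(String[] args) {
        // TC-2 of T-branch: a domestic order shipped by land freight; other choices don't care.
        ShippingTestSpec tc2 = new ShippingTestSpec("No", "Dom", "Land", null, null, null, null);
        System.out.println(tc2.exercisedOutcomes());
    }
}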
In principle, other test adequacy criteria described in Chapters 12 and 13 can be applied
to more complex control structures derived from specifications, such as loops. A good
functional specification should rarely result in a complex control structure, but data flow
testing may be useful at a much coarser structure (e.g., to test interaction of
transactions through a database).
Advanced search The Advanced search function allows for searching elements in
the Web site database.
The key for searching can be:
a simple string, i.e., a simple sequence of characters
a compound string, i.e.,
a string terminated with the character *, used as a wildcard, or
a string composed of substrings included in braces and separated
with commas, used to indicate alternatives
a combination of strings, i.e., a set of strings combined with the Boolean
operators NOT, AND, OR, and grouped within parentheses to change the
priority of operators.
Examples:
laptop The routine searches for string "laptop"
{DVD*,CD*} The routine searches for strings that start with substring "DVD"
or "CD" followed by any number of characters.
NOT (C2021*) AND C20* The routine searches for strings that start with
substring "C20" followed by any number of characters, except substring "21."
Figure 14.10: Functional specification of the feature Advanced search of the
Chipmunk Web site.
Figure 14.14: The derivation tree of a test case for functionality Advanced Search
derived from the BNF specification of Figure 14.11.
The simple production coverage criterion is subsumed by a richer criterion that applies
boundary conditions on the number of times each recursive production is applied
successively. To generate test cases for boundary conditions we need to choose a
minimum and maximum number of applications of each recursive production and then
generate a test case for the minimum, maximum, one greater than minimum and one
smaller than maximum. The approach is essentially similar to boundary interior path
testing of program loops (see Section 12.5 of Chapter 12, page 222), where the "loop"
in this case is in repeated applications of a production.
To apply the boundary condition criterion, we need to annotate recursive productions
with limits. Names and limits are shown in Figure 14.15, which extends the grammar of
Figure 14.13. Alternatives within compound productions are broken out into individual
productions. Production names are added for reference, and limits are added to
recursive productions. In the example of Figure 14.15, the limit of productions
compSeq1 and optCompSeq1 is set to 16; we assume that each model can have at
most 16 required and 16 optional components.
Production     Weight
Model          1
compSeq1       10
compSeq2       0
optCompSeq1    10
optCompSeq2    0
Comp           1
OptComp        1
modNum         1
CompTyp        1
CompVal        1
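A generator that respects such annotations can be sketched roughly as follows. The grammar fragment, production names, and output format are simplified and hypothetical, not the actual Chipmunk grammar; the point is only that controlling the number of applications of a recursive production lets the boundary values (minimum, one more than minimum, one less than maximum, and maximum) be targeted directly. Weights like those listed above would, in a probabilistic generator, bias the choice among alternative productions, a weight of zero disabling an alternative.

public class GrammarGen {

    // Model ::= modNum compSeq
    static String model(int compSeqApplications) {
        return "M0001" + compSeq(compSeqApplications);
    }

    // compSeq ::= Comp compSeq | <empty>
    // The annotation limit (16 in Figure 14.15) bounds the recursive alternative;
    // here it is expanded exactly n times so boundary values can be hit directly.
    static String compSeq(int n) {
        return (n == 0) ? "" : " " + comp() + compSeq(n - 1);
    }

    // Comp ::= CompTyp CompVal   (dummy component type and value)
    static String comp() {
        return "(HD 60)";
    }

    public static void main(String[] args) {
        // Boundary condition criterion for a recursive production with limit 16.
        for (int n : new int[] {0, 1, 15, 16}) {
            System.out.println(n + " components: " + model(n));
        }
    }
}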
design models for test design, which involves adding or disambiguating the semantics of
notations intended for human communication. A challenge for future design notations is
to provide a better foundation for analysis and testing without sacrificing the
characteristics that make them useful for communicating and recording design decisions.
An important issue in modeling, and by extension in model-based testing, is how to use
multiple model "views" to together form a comprehensive model of a program. More
work is needed on test design that uses more than one modeling view, or on the
potential interplay between test specifications derived from different model views of the
same program.
As with many other areas of software testing and analysis, more empirical research is
also needed to characterize the cost and effectiveness of model-based testing
approaches. Perhaps even more than in other areas of testing research, this is not only
a matter of carrying out experiments and case studies, but is at least as much a matter
of understanding how to pose questions that can be effectively answered by
experiments and whose answers generalize in useful ways.
Further Reading
Myers' classic text [Mye79] describes a number of techniques for testing decision
structures. Richardson, O'Malley, and Tittle [ROT89] and Stocks and Carrington [SC96]
among others attempt to generate test cases based on the structure of (formal)
specifications. Beizer's Black Box Testing [Bei95] is a popular presentation of
techniques for testing based on control and data flow structure of (informal)
specifications.
Test design based on finite state machines has long been important in the domain of
communication protocol development and conformance testing; Fujiwara, von Bochmann,
Amalou, and Ghedamsi [FvBK+91] is a good introduction. Gargantini and Heitmeyer
[GH99] describe a related approach applicable to software systems in which the finite
state machine is not explicit but can be derived from a requirements specification.
Generating test suites from context-free grammars is described by Celentano et al.
[CCD+80] and apparently goes back at least to Hanford's test generator for an IBM PL/I
compiler [Han70]. The probabilistic approach to grammar-based testing is described by
Sirer and Bershad [SB99], who use annotated grammars to systematically generate
tests for Java virtual machine implementations.
Heimdahl et al. [HDW04] provide a cautionary note regarding how naive model-based testing can go wrong, while a case study by Pretschner et al. [PPW+05] suggests that model-based testing is particularly effective in revealing errors in informal specifications.
Related Topics
Readers interested in testing based on finite state machines may proceed to Chapter
15, in which finite state models are applied to testing object-oriented programs.
Exercises
14.1 Derive sets of test cases for functionality Maintenance from the FSM specification in Figure 14.2.
1. Derive a test suite that satisfies the Transition Coverage criterion.
2. Derive a test suite that satisfies the Single State Path Coverage criterion.
14.2 Discuss how the test suite derived for functionality Maintenance by applying Transition Coverage to the FSM specification of Figure 14.2 (Exercise 14.1) must be modified under the following assumptions.
1. How must it be modified if the implicit transitions are error conditions?
2. How must it be modified if the implicit transitions are self-transitions?
14.3 Finite state machine specifications are often augmented with variables that may be tested and changed by state transitions. The same system can often be described by a machine with more or fewer states, depending on how much information is represented by the states themselves and how much is represented by extra variables. For example, in Figure 5.9 (page 69), the state of the buffer (empty or not) is represented directly by the states, but we could also represent that information with a variable empty and merge states Empty buffer and Within line of the finite state machine into a single Gathering state to obtain a more compact finite state machine, as in this diagram:
For the following questions, consider only scalar variables with a limited set of possible
values, like the Boolean variable empty in the example.
1. How can we systematically transform a test case for one version of the
specification into a test suite for the other? Under what conditions is this
transformation possible? Consider transformation in both directions, merging
states by adding variables and splitting states to omit variables.
2. If a test suite satisfies the transition coverage criterion for the version with more states, will a corresponding test suite (converting each test case as you described in part (a)) necessarily satisfy the transition coverage criterion for the version with fewer states?
3. Conversely, if a test suite satisfies the transition coverage criterion for the
version of the specification with fewer states, will a corresponding test suite
(converted as you described in part (a)) necessarily satisfy the transition
coverage criterion for the version with more states?
4. How might you combine transition coverage with decision structure testing
methods to select test suites independently from the information coded
explicitly in the states or implicitly in the state variable?
Required Background
Chapters 11, 12, 13, and 14
This chapter builds on basic functional, structural, and model-based testing
techniques, including data flow testing techniques. Some basic techniques,
described more thoroughly in earlier chapters, are recapped very briefly here to
provide flexibility in reading order.
Chapter 5
Many of the techniques described here employ finite state machines for modeling
object state.
15.1 Overview
Object-oriented software differs sufficiently from procedural software to justify
reconsidering and adapting approaches to software test and analysis. For example,
methods in object-oriented software are typically shorter than procedures in other
software, so faults in complex intraprocedural logic and control flow occur less often and
merit less attention in testing. On the other hand, short methods together with
encapsulation of object state suggest greater attention to interactions among method
calls, while polymorphism, dynamic binding, generics, and increased use of exception
handling introduce new classes of fault that require attention.
Some traditional test and analysis techniques are easily adapted to object-oriented
software. For example, code inspection can be applied to object-oriented software much
as it is to procedural software, albeit with different checklists. In this chapter we will be
concerned mostly with techniques that require more substantial revision (like
conventional structural testing techniques) and with the introduction of new techniques
for coping with problems associated with object-oriented software.
Figure 15.3: An excerpt from the class diagram of the Chipmunk Web presence that
shows the hierarchy rooted in class LineItem.
Summary: Relevant Characteristics of Object-Oriented Software
State Dependent Behavior Testing techniques must consider the state in which
methods are invoked. Testing techniques that are oblivious to state (e.g., traditional
coverage of control structure) are not effective in revealing state-dependent faults.
Encapsulation The effects of executing object-oriented code may include outputs,
modification of object state, or both. Test oracles may require access to private
(encapsulated) information to distinguish between correct and incorrect behavior.
Inheritance Test design must consider the effects of new and overridden methods on
the behavior of inherited methods, and distinguish between methods that require new
test cases, ancestor methods that can be tested by reexecuting existing test cases,
and methods that do not need to be retested.
Polymorphism and Dynamic Binding A single method call may be dynamically
bound to different methods depending on the state of the computation. Tests must
exercise different bindings to reveal failures that depend on a particular binding or on
interactions between bindings for different calls.
Abstract Classes Abstract classes cannot be directly instantiated and tested, yet
they may be important interface elements in libraries and components. It is necessary
to test them without full knowledge of how they may be instantiated.
Exception Handling Exception handling is extensively used in modern object-oriented programming. The textual distance between the point where an exception is thrown and the point where it is handled, and the dynamic determination of the binding, make it important to explicitly test exceptional as well as normal control flow.
Concurrency Modern object-oriented languages and toolkits encourage and
sometimes even require multiple threads of control (e.g., the Java user interface
construction toolkits AWT and Swing). Concurrency introduces new kinds of possible
failures, such as deadlock and race conditions, and makes the behavior of a system
dependent on scheduler decisions that are not under the tester's control.
Inheritance brings in optimization issues. Child classes may share several methods with
their ancestors. Sometimes an inherited method must be retested in the child class,
despite not having been directly changed, because of interaction with other parts of the
class that have changed. Many times, though, one can establish conclusively that the
behavior of an inherited method is really unchanged and need not be retested. In other
cases, it may be necessary to rerun tests designed for the inherited method, but not
necessary to design new tests.
Most object-oriented languages allow a variable to be dynamically bound to objects of different classes, as long as those classes remain within the hierarchy rooted at the declared type of the variable. For
example, variable subsidiary of method getYTDPurchased() in Figure 15.4 can be
dynamically bound to different classes of the Account hierarchy, and thus the invocation
of method subsidiary.getYTDPurchased() can be bound dynamically to different
methods.
int totalPurchased = 0;
for (Enumeration e = subsidiaries.elements(); e.hasMoreElements(); )
{
    Account subsidiary = (Account) e.nextElement();
    totalPurchased += subsidiary.getYTDPurchased();
}
for (Enumeration e = customers.elements(); e.hasMoreElements(); )
{
    Customer aCust = (Customer) e.nextElement();
    totalPurchased += aCust.getYearlyPurchase();
}
ytdPurchased = totalPurchased;
ytdPurchasedValid = true;
return totalPurchased;
}
...
}
Figure 15.4: Part of a Java implementation of Class Account. The abstract class is
specialized by the regional markets served by Chipmunk into USAccount, UKAccount,
JPAccount, EUAccount and OtherAccount, which differ with regard to shipping methods,
taxes, and currency. A corporate account may be associated with several individual
customers, and large companies may have different subsidiaries with accounts in
different markets. Method getYTDPurchased() sums the year-to-date purchases of all
customers using the main account and the accounts of all subsidiaries.
Dynamic binding to different methods may affect the whole computation. Testing a call
by considering only one possible binding may not be enough. Test designers need
testing techniques that select subsets of possible bindings that cover a sufficient range
of situations to reveal faults in possible combinations of bindings.
Some classes in an object-oriented program are intentionally left incomplete and cannot
be directly instantiated. These abstract classes[2] must be extended through subclasses;
only subclasses that fill in the missing details (e.g., method bodies) can be instantiated.
For example, both classes LineItem of Figure 15.3 and Account of Figure 15.4 are
abstract.
If abstract classes are part of a larger system, such as the Chipmunk Web presence,
and if they are not part of the public interface to that system, then they can be tested by
testing all their child classes: classes Model, Component, CompositeItem, and
SimpleItem for class LineItem and classes USAccount, UKAccount, JPAccount,
EUAccount and OtherAccount for class Account. However, we may need to test an
abstract class either prior to implementing all child classes, for example if not all child
classes will be implemented by the same engineers in the same time frame, or without
knowing all its implementations, for example if the class is included in a library whose
reuse cannot be fully foreseen at development time. In these cases, test designers need
techniques for selecting a representative set of instances for testing the abstract class.
Exceptions were originally introduced in programming languages independently of object-oriented features, but they play a central role in modern object-oriented programming languages and in object-oriented design methods. Their prominent role in object-oriented programs, and the complexity of propagation and handling of exceptions during program execution, call for careful attention and specialized techniques in testing.
The absence of a main execution thread in object-oriented programs makes them well
suited for concurrent and distributed implementations. Although many object-oriented
programs are designed for and executed in sequential environments, the design of
object-oriented applications for concurrent and distributed environments is becoming
very frequent.
Object-oriented design and programming greatly impact analysis and testing. However,
test designers should not make the mistake of ignoring traditional technology and
methodologies. A specific design approach mainly affects detailed design and code, but
there are many aspects of software development and quality assurance that are largely
independent of the use of a specific design approach. In particular, aspects related to
planning, requirements analysis, architectural design, deployment and maintenance can
be addressed independently of the design approach. Figure 15.5 indicates the scope of
the impact of object-oriented design on analysis and testing.
Figure 15.5: The impact of object-oriented design and coding on analysis and
testing.
[1]Object-oriented
System and acceptance testing check overall system behavior against user and system
requirements. Since these requirements are (at least in principle) independent of the
design approach, system and acceptance testing can be addressed with traditional
techniques. For example, to test the business logic subsystem of the Chipmunk Web
presence, test designers may decide to derive test cases from functional specifications
using category-partition and catalog based methods (see Chapter 11).
Steps in Object-Oriented Software Testing
Object-oriented testing can be broken into three phases, progressing from individual
classes toward consideration of integration and interactions.
Intraclass: Testing classes in isolation (unit testing)
1. If the class-under-test is abstract, derive a set of instantiations to cover
significant cases. Instantiations may be taken from the application (if
available) and/or created just for the purpose of testing.
2. Design test cases to check correct invocation of inherited and overridden
methods, including constructors. If the class-under-test extends classes that
have previously been tested, determine which inherited methods need to be
retested and which test cases from ancestor classes can be reused.
3. Design a set of intraclass test cases based on a state machine model of
specified class behavior.
4. Augment the state machine model with structural relations derived from
class source code and generate additional test cases to cover structural
features.
5. Design an initial set of test cases for exception handling, systematically
exercising exceptions that should be thrown by methods in the class under
test and exceptions that should be caught and handled by them.
6. Design an initial set of test cases for polymorphic calls (calls to superclass
or interface methods that can be bound to different subclass methods
depending on instance values).
Interclass: Testing class integration (integration testing)
1. Identify a hierarchy of clusters of classes to be tested incrementally.
2. Design a set of functional interclass test cases for the cluster-under-test.
3. Add test cases to cover data flow between method calls.
4. Integrate the intraclass exception-handling test sets with interclass
The superstate modelSelected is decomposed into its two component states, with entries to modelSelected directed to the default initial state workingConfiguration.
In covering the state machine model, we have chosen sets of transition sequences that
together exercise each individual transition at least once. This is the transition adequacy
criterion introduced in Chapter 14. The stronger history-sensitive criteria described in
that chapter are also applicable in principle, but are seldom used because of their cost.
Even transition coverage may be impractical for complex statecharts. The number of
states and transitions can explode in "flattening" a statechart that represents multiple
threads of control. Unlike flattening of ordinary superstates, which leaves the number of
elementary states unchanged while replicating some transitions, flattening of concurrent
state machines (so-called AND-states) produces new states that are combinations of
elementary states.
Figure 15.8 shows the statechart specification of class Order of the business logic of the
Chipmunk Web presence. Figure 15.9 shows the corresponding "flattened" state
machine. Flattening the AND-state results in a number of states equal to the Cartesian
product of the elementary states (3 3 = 9 states) and a corresponding number of
transitions. For instance, transition add item that exits state not scheduled of the
statechart corresponds to three transitions exiting the states not schedXcanc no fee, not
schedXcanc fee, and not schedXnot canc, respectively. Covering all transitions at least
once may result in a number of test cases that exceeds the budget for testing the class.
In this case, we may forgo flattening and use simpler criteria that take advantage of the
hierarchical structure of the statechart.
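Flattening an AND-state amounts to forming the Cartesian product of the elementary substates of its concurrent regions, as in the following minimal sketch; the region and state names loosely paraphrase those mentioned in the text rather than reproducing the exact labels of Figure 15.8.

import java.util.*;

public class FlattenAndState {
    public static void main(String[] args) {
        List<String> scheduling = List.of("not scheduled", "scheduled", "shipped");
        List<String> cancellation = List.of("canc no fee", "canc fee", "not canc");
        List<String> flattened = new ArrayList<>();
        for (String a : scheduling) {
            for (String b : cancellation) {
                flattened.add(a + " X " + b);    // e.g., "not scheduled X canc no fee"
            }
        }
        System.out.println(flattened.size() + " states: " + flattened);   // 3 x 3 = 9 states
    }
}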
A simpler alternative is the simple transition coverage criterion, which requires the execution of all transitions that appear in the statechart. The criterion requires that each statechart transition is exercised at least once, but does not guarantee that transitions are exercised in all possible states. For example, transition add item, which leaves the initial state, is exercised from at least one substate, but not from all possible substates as required by the transition coverage adequacy criterion.
Table 15.2: A test suite that satisfies the simple transition
coverage adequacy criterion for the statechart of Figure
15.8. Transitions are indicated without parameters for
simplicity.
get_shipping_cost()
get_discount()
purchase()
place_order()
24_hours()
5_days()
schedule()
ship()
deliver()
schedule()
suspend()
5_days()
schedule()
ship()
deliver()
purchase()
place_order()
schedule()
cancel()
Figure 15.10: Part of a class diagram of the Chipmunk Web presence. Classes
Account, LineItem, and CSVdb are abstract.
Figure 15.11: Use/include relation for the class diagram in Figure 15.10. Abstract
classes are not included. Two classes are related if one uses or includes the other.
Classes that are higher in the diagram include or use classes that are lower in the
diagram.
Interclass testing strategies usually proceed bottom-up, starting from classes that
depend on no others. The implementation-level use/include relation among classes
typically parallels the more abstract, logical depends relation among modules (see
sidebar on page 292), so a bottom-up strategy works well with cluster-based testing.
For example, we can start integrating class SlotDB with class Slot, and class
Component with class ComponentDB, and then proceed incrementally integrating
classes ModelDB and Model, up to class Order.
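Deriving such an order mechanically amounts to a topological sort of the depends relation. The following minimal sketch uses the class names of Figure 15.10, but the dependence edges listed are illustrative assumptions rather than a transcription of Figure 15.11.

import java.util.*;

public class IntegrationOrder {
    public static void main(String[] args) {
        // dependsOn.get(C) lists the classes that must be integrated and tested before C.
        Map<String, List<String>> dependsOn = new LinkedHashMap<>();
        dependsOn.put("Slot", List.of());
        dependsOn.put("SlotDB", List.of("Slot"));
        dependsOn.put("Component", List.of());
        dependsOn.put("ComponentDB", List.of("Component"));
        dependsOn.put("Model", List.of("SlotDB", "ComponentDB"));
        dependsOn.put("ModelDB", List.of("Model"));
        dependsOn.put("Order", List.of("ModelDB"));

        List<String> order = new ArrayList<>();
        Set<String> placed = new HashSet<>();
        while (placed.size() < dependsOn.size()) {          // assumes the relation is acyclic
            for (Map.Entry<String, List<String>> e : dependsOn.entrySet()) {
                if (!placed.contains(e.getKey()) && placed.containsAll(e.getValue())) {
                    order.add(e.getKey());
                    placed.add(e.getKey());
                }
            }
        }
        System.out.println("Integrate and test in this order: " + order);
    }
}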
Dependence
The hierarchy of clusters for interclass testing is based on a conceptual relation of
dependence, and not directly on concrete relations among implementation classes (or
implementation-level design documentation).
Module A depends on module B if the functionality of B must be present for the
functionality of A to be provided.
If A and B are implemented as classes or clusters of closely related classes, it is
likely that the logical depends relation will be reflected in concrete relations among
the classes. Typically, the class or classes in A will either call methods in the class or
classes in B, or classes in A will have references to classes in B forming a contains
relation among their respective objects.
Concrete relations among classes do not always indicate dependence. It is common
for contained objects to have part-of relations with their ancestors in the containment
hierarchy, but the dependence is normally from container to contained object and not
vice versa. It is also common to find calls from framework libraries to methods that
use those libraries. For example, the SAX API for parsing XML is an event-driven
parsing framework, which means the parsing library makes calls (through interfaces)
on methods provided by the application. This style of event handling is most familiar to
Java programmers through the standard Java graphical user interface libraries. It is
clear that the application depends on the library and not vice versa.
The depends relation is as crucial to other software development processes as it is to
testing. It is essential to building a system as a set of incremental releases, and to
scheduling and managing the construction of each release. The depends relation may
be documented in UML package diagrams, and even if not documented explicitly it is
surely manifest in the development build order. Test designers may (and probably
should) be involved in defining the build order, but should not find themselves in the
position of discovering or re-creating it after the fact.
another interaction that should not be permitted at that point yields a test case that
checks error handling.
Figure 15.12 shows a possible pattern of interactions among objects, when a customer
assembling an order O first selects the computer model C20, then adds a hard disk
HD60 that is not compatible with the slots of the selected model, and then adds "legal"
hard disk HD20. The sequence diagram indicates the sequence of interactions among
objects and suggests possible testing scenarios. For example, it suggests adding a
component after having selected a model. In other words, it indicates interesting states
of objects of type ModelDB and Slots when testing class Model.
Figure 15.12: A (partial) sequence diagram that specifies the interactions among
objects of type Order, Model, ModelDB, Component, ComponentDB, Slots, and
SlotDB, to select a computer, add an illegal component, and then add a legal
one.
Unlike statecharts, which should describe all possible sequences of transitions that an
object can undergo, interaction diagrams illustrate selected interactions that the
designers considered significant because they were typical, or perhaps because they
were difficult to understand. Deriving test cases from interaction diagrams is useful as a
way of choosing some significant cases among the enormous variety of possible
interaction sequences, but it is insufficient as a way of ensuring thorough testing.
Integration tests should at the very least repeat coverage of individual object states and
transitions in the context of other parts of the cluster under test.
Figure 15.2: More of the Java implementation of class Model. Because of the way
method isLegalConfig is implemented (see Figure 15.1), all methods that modify slots
must reset the private variable legalConfig.
The chief difference between functional testing techniques for object-oriented software
and their counterparts for procedural software (Chapters 10, 11, and 14) is the central
role of object state and of sequences of method invocations to modify and observe
object state. Similarly, structural test design must be extended beyond consideration of
control and data flow in a single method to take into account how sequences of method
invocations interact. For example, tests of isLegalConfiguration would not be sufficient
without considering the prior state of private variable legalConfig.
Since the state of an object is comprised of the values of its instance variables, the
number of possible object states can be enormous. We might choose to consider only
the instance variables that do not appear in the specification, and add only those to the
state machine representation of the object state. In the class Model example, we will
have to add only the state of the Boolean variable legalConfig, which can at most double
the number of states (and at worst quadruple the number of transitions). While we can
model the concrete values of a single Boolean variable like legalConfig, this approach
would not work if we had a dozen such variables, or even a single integer variable
introduced in the implementation. To reduce the enormous number of states obtained by
considering the combinations of all values of the instance variables, we could select a
few representative values.
Another way to reduce the number of test cases based on interaction through instance
variable values while remaining sensitive enough to catch many common oversights is to
model not the values of the variables, but the points at which the variables receive those
values. This is the same intuition behind data flow testing described in Chapter 13,
although it requires some extension to cover sequences in which one method defines
(sets) a variable and another uses that variable. Definition-use pairs for the instance
variables of an object are computed on an intraclass control flow graph that joins all the
methods of a single class, and thus allows pairing of definitions and uses that occur in
different methods.
Figure 15.13 shows a partial intraclass control flow graph of class Model. Each method
is modeled with a standard control flow graph (CFG), just as if it were an independent
procedure, except that these are joined to allow paths that invoke different methods in
sequence. To allow sequences of method calls, the class itself is modeled with a node
class Model connected to the CFG of each method. Method Model includes two extra
statements that correspond to the declarations of variables legalConfig and modelDB
that are initialized when the constructor is invoked.[3]
Figure 15.13: A partial intraclass control flow graph for the implementation of class
Model in Figures 15.1 and 15.2.
Sometimes definitions and uses are made through invocation of methods of other
classes. For example, method addComponent calls method contains of class
componentDB. Moreover, some variables are structured; for example, the state variable
slot is a complex object. For the moment, we simply "unfold" the calls to external
methods, and we treat arrays and objects as if they were simple variables.
A test case to exercise a definition-use pair (henceforth DU pair) is a sequence of
method invocations that starts with a constructor, and includes the definition followed by
the use without any intervening definition (a definition-clear path). A suite of test cases
can be designed to satisfy a data flow coverage criterion by covering all such pairs. In
that case we say the test suite satisfies the all DU pairs adequacy criterion.
Consider again the private variable legalConfig in class Model, Figures 15.1 and 15.2.
There are two uses of legalConfig, both in method isLegalConfiguration, one in the if and
one in the return statement; and there are several definitions in methods addComponent,
removeComponent, checkConfiguration and in the constructor, which initializes
legalConfig to False. The all DU pairs adequacy criterion requires a test case to
exercise each definition followed by each use of legalConfig with no intervening
definitions.
Specifications do not refer to the variable legalConfig and thus do not directly consider
method interactions through legalConfig or contribute to defining test cases to exercise
such interactions. This is the case, for example, in the invocation of method
checkConfiguration in isLegalConfiguration: The specification suggests that a single
invocation of method isLegalConfiguration can be sufficient to test the interactions
involving this method, while calls to method checkConfiguration in isLegalConfiguration
indicate possible failures that may be exposed only after two calls of method
isLegalConfiguration. In fact, a first invocation of isLegalConfiguration with value True for
legalConfig implies a call to checkConfiguration and consequent new definitions of
legalConfig. Only a second call to isLegalConfiguration would exercise the use of the
new value in the if statement, thus revealing failures that may derive from bad updates of
legalConfig in checkConfiguration.
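The following JUnit-style sketch illustrates such a two-invocation test. It is only a sketch: the constructor argument and the String-valued slot and component identifiers ("M1", "S1", "C1") are hypothetical stand-ins for the actual signatures of Figures 15.1 and 15.2.

import static org.junit.Assert.assertTrue;
import org.junit.Test;

public class ModelLegalConfigDUTest {

    @Test
    public void secondCallUsesValueDefinedByCheckConfiguration() {
        Model model = new Model("M1");       // constructor defines legalConfig
        model.selectModel("M1");
        model.addComponent("S1", "C1");      // redefines legalConfig

        // First call: uses legalConfig in the if, then invokes checkConfiguration,
        // which defines legalConfig again.
        boolean first = model.isLegalConfiguration();

        // Second call: uses the value defined inside checkConfiguration in the if
        // statement - the definition-use pair that a single call cannot exercise.
        boolean second = model.isLegalConfiguration();

        assertTrue("repeated queries on an unchanged object should agree",
                first == second);
    }
}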
The all DU pairs adequacy criterion ensures that every assignment to a variable is tested
at each of the uses of that variable, but like other structural coverage criteria it is not
particularly good at detecting missing code. For example, if the programmer omitted an
assignment to legalConfig, there would be no DU pair connecting the missing assignment
to the use. However, assignments to legalConfig are correlated with updates to slots,
and all DU pairs coverage with respect to slots is likely to reveal a missing assignment
to the Boolean variable. Correlation among assignments to related fields is a common
characteristic of the structure of object-oriented software.
Method calls and complex state variables complicate data flow analysis of object-oriented software, as procedure calls and structured variables do in procedural code. As
discussed in Chapters 6 and 13, there is no universal recipe to deal with interclass calls.
Test designers must find a suitable balance between costs and benefits.
A possible approach to deal with interclass calls consists in proceeding incrementally
following the dependence relation, as we did for functional interclass testing. The
dependence relation that can be derived from code may differ from the dependence
relation derived from specifications. However, we can still safely assume that well-designed systems present at most a small number of easily breakable cycles. The
dependencies of the implementation and specification of class Model are the same and
are shown in Figure 15.11.
Leaf classes of the dependence hierarchy can be analyzed in isolation by identifying
definitions and uses of instance variables, as just shown. The data flow information
collected on leaf classes can be summarized by marking methods that access but do not
modify the state as Inspectors; methods that modify, but do not otherwise access the
state, as Modifiers; and methods that both access and modify the state as
Inspector/Modifiers.
When identifying inspectors, modifiers and inspector/modifiers, we consider the whole
object state. Thus, we mark a method as inspector/modifier even if it uses just one
instance variable and modifies a different one. This simplification is crucial to scalability,
since distinguishing uses and definitions of each individual variable would quickly lead to
an unmanageable amount of information while climbing the dependence hierarchy.
If methods contain more than one execution path, we could summarize the whole
method as an inspector, modifier, or inspector/modifier, or we could select a subset of
paths to be considered independently. A single method might include Inspector, Modifier,
and Inspector/Modifier paths.
Once the data flow information of leaf classes has been summarized, we can proceed
with classes that only use or contain leaf classes. Invocations of modifier methods and
inspector/modifiers of leaf classes are considered as definitions. Invocations of
inspectors and inspector/modifiers are treated as uses. When approximating
inspector/modifiers as uses, we assume that the method uses the values of the instance
variables for computing the new state. This is a common way of designing methods, but
some methods may fall outside this pattern. Again, we trade precision for scalability and
reduced cost.
We can then proceed incrementally analyzing classes that depend only on classes
already analyzed, until we reach the top of the hierarchy. In this way, each class is
always considered in isolation, and the summary of information at each step prevents
exponential growth of information, thus allowing large classes to be analyzed, albeit at a
cost in precision.
Figure 15.14 shows the summary information for classes Slot, ModelDB, and Model.
The summary information for classes Slot and ModelDB can be used for computing
structural coverage of class Model without unfolding the method calls. The summary
information for class Model can be used to compute structural coverage for class Order
without knowing the structure of the classes used by class Order. Method
checkConfiguration is not included in the summary information because it is private. The
three paths in checkConfiguration are included in the summary information of the calling
method isLegalConfiguration.
Class Slot
  Slot()        modifier
  bind()        modifier
  unbind()      modifier
  isBound()     inspector

Class ModelDB
  ModelDB()     modifier
  getModel()    inspector
  findModel()   inspector

Class Model
  Model()                            modifier
  selectModel()                      modifier
  deselectModel()                    modifier
  addComponent()                     inspector/modifier
  removeComponent()                  inspector/modifier
  isLegalConfiguration() [paths]     inspector/modifier

Figure 15.14: Summary information for structural interclass testing for classes Slot,
ModelDB, and Model. Lists of CFG nodes in square brackets indicate different paths,
when methods include more than one path.
While summary information is usually derived from child classes, sometimes it is useful
to provide the same information without actually performing the analysis, as we have
done when analyzing class Model. This is useful when we cannot perform data flow
analysis on the child classes, as when child classes are delivered as a closed
component without source code, or are not available yet because the development is still
in progress.
[3] We model the initializations in the declarations of legalConfig and modelDB as statements that are executed when the constructor is invoked.
An (abstract) check for equivalence can be used in a test oracle if test cases exercise
two sequences of method calls that should (or should not) produce the same object
state. Comparing objects using this equivalent scenarios approach is particularly
suitable when the classes being tested are an instance of a fairly simple abstract data
type, such as a dictionary structure (which includes hash tables, search trees, etc.), or a
sequence or collection.
Table 15.3 shows two sequences of method invocations, one equivalent and one nonequivalent to test case TCE for class Model. The equivalent sequence is obtained by
removing "redundant" method invocations - invocations that bring the system to a
previous state. In the example, method deselectModel cancels the effect of previous
invocations of method selectModel and addComponent. The nonequivalent sequence is
obtained by selecting a legal subset of method invocations that bring the object to a
different state.
Table 15.3: Equivalent and nonequivalent scenarios (invocation
sequences) for test case TCE from Table 15.1 for class Model.

  Test Case TCE            Scenario TCE1            Scenario TCE2
  selectModel(M1)          selectModel(M2)          selectModel(M2)
  addComponent(S1,C1)      addComponent(S1,C1)      addComponent(S1,C1)
  addComponent(S2,C2)      isLegalConfiguration()   addComponent(S2,C2)
  isLegalConfiguration()                            isLegalConfiguration()
  deselectModel()
  selectModel(M2)
  addComponent(S1,C1)
  isLegalConfiguration()
                           EQUIVALENT               NON-EQUIVALENT
Producing equivalent sequences is often quite simple. While finding nonequivalent
sequences is even easier, choosing a few good ones is difficult. One approach is to
hypothesize a fault in the method that "generated" the test case, and create a sequence
that could be equivalent if the method contained that fault. For example, test case TCE
was designed to test method deselectModel. The nonequivalent sequence of Table 15.3
leads to a state that could be produced if method deselectModel did not clear all slots,
leaving component C2 bound to slot S2 in the final configuration.
One sequence of method invocations is equivalent to another if the two sequences lead
to the same object state. This does not necessarily mean that their concrete
representation is bit-for-bit equal. For example, method addComponent binds a
component to a slot by creating a new Slot object (Figure 15.2). Starting from two
identical Model objects, and calling addComponent on both with exactly the same
parameters, would result in two objects that represent the same information but that
nonetheless would contain references to distinct Slot objects. The default equals method
inherited from class Object, which makes a bit-for-bit comparison, would not consider
them equivalent. A good practice is to add a suitable observer method to a class (e.g.,
by overriding the default equals method in Java).
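A sketch of such an observer follows. The traversal of slots and the accessor boundComponent() are hypothetical names introduced only for illustration; the point is that equality is defined on the abstract state (which component, if any, is bound to each slot) rather than on the concrete Slot references.

// Added to class Model; a sketch only, since the field and accessor names are assumed.
@Override
public boolean equals(Object other) {
    if (!(other instanceof Model)) {
        return false;
    }
    Model that = (Model) other;
    if (this.slots.size() != that.slots.size()) {
        return false;                        // different numbers of slots: different states
    }
    for (int i = 0; i < this.slots.size(); i++) {
        Slot s1 = this.slots.get(i);
        Slot s2 = that.slots.get(i);
        if (s1.isBound() != s2.isBound()) {
            return false;                    // a slot bound in one object but not the other
        }
        if (s1.isBound() && !s1.boundComponent().equals(s2.boundComponent())) {
            return false;                    // same slot bound to different components
        }
    }
    return true;
}

An equals override should be paired with a consistent hashCode; alternatively, the observer can be a separate method used only by test oracles, leaving the inherited equals untouched.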
[4] A "friend" class in C++ is permitted direct access to private variables in another class.
There is no direct equivalent in Java or Smalltalk, although in Java one could obtain a
somewhat similar effect by using package visibility for variables and placing oracles in
the same package.
Figure 15.15: A method call in which the method itself and two of its parameters can
be dynamically bound to different classes.
The explosion in possible combinations is essentially the same combinatorial explosion
encountered if we try to cover all combinations of attributes in functional testing, and the
same solutions are applicable. The combinatorial testing approach presented in Chapter
11 can be used to choose a set of combinations that covers each pair of possible
bindings (e.g., Business account in Japan, Education customer using Chipmunk Card),
rather than all possible combinations (Japanese business customer using Chipmunk
card). Table 15.4 shows 15 cases that cover all pairwise combinations of calls for the
example of Figure 15.15.
Table 15.4: A set of test case specifications that cover all pairwise
combinations of the possible polymorphic bindings of Account, Credit, and
creditCard.
[Table body: 15 test case specifications, each a triple drawn from Account
(USAccount, UKAccount, EUAccount, JPAccount, OtherAccount), Credit (EduCredit,
BizCredit, individualCredit), and creditCard (VISACard, AmExpCard, ChipmunkCard),
chosen so that every pair of bindings appears in at least one triple.]
The combinations in Table 15.4 were of dynamic bindings in a single call. Bindings in a
sequence of calls can also interact. Consider, for example, method getYTDPurchased
of class Account shown in Figure 15.4 on page 278, which computes the total yearly
purchase associated with one account to determine the applicable discount. Chipmunk
offers tiered discounts to customers whose total yearly purchase reaches a threshold,
considering all subsidiary accounts.
The total yearly purchase for an account is computed by method getYTDPurchased,
which sums purchases by all customers using the account and all subsidiaries. Amounts
are always recorded in the local currency of the account, but getYTDPurchased sums
the purchases of subsidiaries even when they use different currencies (e.g., when some
are bound to subclass USAccount and others to EUAccount). The intra- and interclass
testing techniques presented in the previous section may fail to reveal this type of fault.
The problem can be addressed by selecting test cases that cover combinations of
polymorphic calls and bindings. To identify sequential combinations of bindings, we must
first identify individual polymorphic calls and binding sets, and then select possible
sequences.
Let us consider for simplicity only the method getYTDPurchased. This method is called
once for each customer and once for each subsidiary of the account and in both cases
can be dynamically bound to methods belonging to any of the subclasses of Account
(UKAccount, EUAccount, and so on). At each of these calls, variable totalPurchased is
used and changed, and at the end of the method it is used twice more (to set an
instance variable and to return a value from the method).
Data flow analysis may be used to identify potential interactions between possible
bindings at a point where a variable is modified and points where the same value is
used. Any of the standard data flow testing criteria could be extended to consider each
possible method binding at the point of definition and the point of use. For instance, a
single definition-use pair becomes n × m pairs if the point of definition can be bound in n
ways and the point of use can be bound in m ways. If this is impractical, a weaker but
still useful alternative is to vary both bindings independently, which results in m or n pairs
(whichever is greater) rather than their product. Note that this weaker criterion would be
very likely to reveal the fault in getYTDPurchased, provided the choices of binding at
each point are really independent rather than going through the same set of choices in
lockstep. In many cases, binding sets are not mutually independent, so the selection of
combinations is limited.
15.10 Inheritance
Inheritance does not introduce new classes of faults except insofar as it is associated
with polymorphism and dynamic binding, which we have already discussed, and
exception handling, which is discussed in Section 15.12. It does provide an opportunity
for optimization by reusing test cases and even test executions. Subclasses share
methods with ancestors. Identifying which methods do not need to be retested and
which test cases can be reused may significantly reduce testing effort.
Methods of a subclass can be categorized as
New if they are newly defined in the subclass - that is, they do not occur in the ancestor.
New methods include those with the same name but different parameters than methods
in ancestor classes.
Recursive if they are inherited from the ancestor without change - that is, they occur
only in the ancestor.
Redefined if they are overridden in the subclass - that is, they occur in both the ancestor and the subclass.
Abstract new if they are newly defined and abstract in the subclass.
Abstract recursive if they are inherited from the ancestor, where they are abstract.
Abstract redefined if they are redefined in the subclass, and they are abstract in the
ancestor.
When testing a base class, one that does not specialize a previously tested class, we
can summarize the testing information in a simple table that indicates the sets of
generated and executed test cases. Such a table is called a testing history.
In general we will have four sets of test cases for a method: intraclass functional,
intraclass structural, interclass functional, and interclass structural. For methods that do
not call methods in other classes, we will have only intraclass test cases, since no
integration test focuses on such methods. For abstract methods, we will only have
functional test cases, since we do not have the code of the method. Each set of test
cases is marked with a flag that indicates whether the test set can be executed.
Table 15.5 shows a testing history for class LineItem, whose code is shown in Figure
15.16. Methods validItem, getWeightGm, getHeightCm, getWidthCm, and getDepthCm
are abstract and do not interact with external classes; thus we only have intraclass
functional test cases that cannot be directly executed. Method getUnitPrice is abstract,
but from the specifications (not shown here) we can infer that it interacts with other
classes; thus we have both intra- and interclass functional test cases. Both the
constructor and method getExtendedPrice are implemented and interact with other
classes (Order and AccountType, respectively), and thus we have all four sets of test
cases.
Table 15.5: Testing history for class LineItem

  Method         Intra funct   Intra struct   Inter funct   Inter struct
  LineItem       TSLI1,Y
  validItem      TSvI1,N
  getUnitPrice   TSgUP1,N                     TSgUP3,N
  getHeightCm    TSgHC1,N
  getWidthCm     TSgWC1,N
  getDepthCm     TSgDC1,N
[Figure 15.16: Java source code of the abstract class LineItem (listing not reproduced).]
[Testing history table (caption not recovered):]

  Method         Intra funct   Intra struct   Inter funct   Inter struct
  LineItem       TSLI1,N       TSLI2,N        TSLI3,N       TSLI4,N
  validItem      TSvI1,N
  getHeightCm    TSgHC1,N
  getWidthCm     TSgWC1,N
  getDepthCm     TSgDC1,N
package Orders;
import Accounts.AccountType;
import Prices.Pricelist;
import java.util.*;
/**
 * A composite line item includes a "wrapper" item for the whole
 * bundle and a set of zero or more component items.
 */
public abstract class CompositeItem extends LineItem {
    /**
     * A composite item has some unifying name and base price
     * (which might be zero) and has zero or more additional parts,
     * which are themselves line items.
     */
    private Vector parts = new Vector();
    /**
     * Constructor from LineItem, links to an encompassing Order.
     */
    public CompositeItem(Order order) {
        super(order);
    }
    public int getUnitPrice(AccountType accountType) {
        ...
15.11 Genericity
Generics, also known as parameterized types or (in C++) as templates, are an
important tool for building reusable components and libraries. A generic class (say,
linked lists) is designed to be instantiated with many different parameter types (e.g.,
LinkedList<String> and LinkedList<Integer>). We can test only instantiations, not the
generic class itself, and we may not know in advance all the different ways a generic
class might be instantiated.
A generic class is typically designed to behave consistently over some set of permitted
parameter types. Therefore the testing (and analysis) job can be broken into two parts:
showing that some instantiation is correct and showing that all permitted instantiations
behave identically.
Testing a single instantiation raises no particular problems, provided we have source
code for both the generic class and the parameter class. Roughly speaking, we can
design test cases as if the parameter were copied textually into the body of the generic
class.
Consider first the case of a generic class that does not make method calls on, nor
access fields of, its parameters. Ascertaining this property is best done by inspecting
the source code, not by testing it. If we can nonetheless conjecture some ways in which
the generic and its parameter might interact (e.g., if the generic makes use of some
service that a parameter type might also make use of, directly or indirectly), then we
should design test cases aimed specifically at detecting such interaction.
Gaining confidence in an unknowable set of potential instantiations becomes more
difficult when the generic class does interact with the parameter class. For example,
Java (since version 1.5) has permitted a declaration like this:
class PriorityQueue<E extends Comparable<E>> { ... }
The generic PriorityQueue class will be able to make calls on the methods of interface
Comparable. Now the behavior of PriorityQueue<E> is not independent of E, but it
should be dependent only in certain very circumscribed ways, and in particular it should
behave correctly whenever E obeys the requirements of the contract implied by
Comparable.
The contract imposed on permitted parameters is a kind of specification, and
specification-based (functional) test selection techniques are an appropriate way to
select representative instantiations of the generic class. For example, if we read the
interface specification for java.lang.Comparable, we learn that most but not all classes
that implement Comparable also satisfy the rule
(x.compareTo(y) == 0) == (x.equals(y))
Explicit mention of this condition strongly suggests that test cases should include
instantiations with classes that do obey this rule (class String, for example) and others
that do not (e.g., class BigDecimal with two BigDecimal values 4.0 and 4.00).
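The difference can be seen directly in a small, self-contained check; the two classes below are the kind of representative parameters suggested by the interface specification.

import java.math.BigDecimal;

public class ComparableConsistencyDemo {
    public static void main(String[] args) {
        // String keeps compareTo consistent with equals.
        String a = "abc", b = "abc";
        System.out.println((a.compareTo(b) == 0) == a.equals(b));   // true

        // BigDecimal does not: 4.0 and 4.00 compare as equal but differ in scale.
        BigDecimal x = new BigDecimal("4.0");
        BigDecimal y = new BigDecimal("4.00");
        System.out.println(x.compareTo(y) == 0);   // true
        System.out.println(x.equals(y));           // false
    }
}

Instantiating the generic class under test with both kinds of parameter (for example, a priority queue of String and one of BigDecimal) probes whether its behavior really depends only on the contract of Comparable.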
15.12 Exceptions
Programs in modern object-oriented languages use exceptions to separate handling of
error cases from the primary program logic, thereby simplifying normal control flow.
Exceptions also greatly reduce a common class of faults in languages without exception-handling constructs. One of the most common faults in C programs, for example, is
neglecting to check for the error indications returned by a C function. In a language like
Java, an exception is certain to interrupt normal control flow.
The price of separating exception handling from the primary control flow logic is
introduction of implicit control flows. The point at which an exception is caught and
handled may be far from the point at which it is thrown. Moreover, the association of
exceptions with handlers is dynamic. In most object-oriented languages and procedural
languages that provide exception handling, an exception propagates up the stack of
calling methods until it reaches a matching handler.
Since exceptions introduce a kind of control flow, one might expect that it could be
treated like other control flow in constructing program models and deriving test cases.
However, treating every possible exception this way would create an unwieldy control
flow graph accounting for potential exceptions at every array subscript reference, every
memory allocation, every cast, and so on, and these would be multiplied by matching
them to every handler that could appear immediately above them on the call stack.
Worse, many of these potential exceptions are actually impossible, so the burden would
not be just in designing test cases for each of them but in deciding which can actually
occur. It is more practical to consider exceptions separately from normal control flow in
test design.
We can dismiss from consideration exceptions triggered by program errors signaled by
the underlying system (subscript errors, bad casts, etc.), since exercising these
exceptions adds nothing to other efforts to prevent or find the errors themselves. If a
method A throws an exception that indicates a programming error, we can take almost
the same approach. However, if there are exception handlers for these program error
exceptions, such as we may find in fault-tolerant programs or in libraries that attempt to
maintain data consistency despite errors in client code, then it is necessary to test the
error recovery code (usually by executing it together with a stub class with the
programming error). This is different and much less involved than testing the error
recovery code coupled with every potential point at which the error might be present in
actual code.
Exceptions that indicate abnormal cases but not necessarily program errors (e.g.,
exhaustion of memory or premature end-of-file) require special treatment. If the handler
for these is local (e.g., a Java try block with an exception handler around a group of file
operations), then the exception handler itself requires testing. Whether to test each
individual point where exceptions bound to the same handler might be raised (e.g., each
individual file operation within the same try block) is a matter of judgment.
The remaining exceptions are those that are allowed to propagate beyond the local
context in which they are thrown. For example, suppose method A makes a call to
method B, within a Java try block with an exception handler for exceptions of class E.
Suppose B has no exception handler for E and makes a call to method C, which throws
E. Now the exception will propagate up the chain of method calls until it reaches the
handler in A. There could be many such chains, which depend in part on overriding
inherited methods, and it is difficult (sometimes even impossible) to determine all and
only the possible pairings of points where an exception is thrown with handlers in other
methods.
Since testing all chains through which exceptions can propagate is impractical, it is best
to make it unnecessary. A reasonable design rule to enforce is that, if a method
propagates an exception without catching it, the method call should have no other effect.
If it is not possible to ensure that method execution interrupted by an exception has no
effect, then an exception handler should be present (even if it propagates the same
exception by throwing it again). Then, it should suffice to design test cases to exercise
each point at which an exception is explicitly thrown by application code, and each
handler in application code, but not necessarily all their combinations.
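A minimal sketch of the design rule, with hypothetical class and method names, looks like this: the handler restores the state observed on entry and then propagates the same exception.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class InventoryBatch {
    private final List<String> items = new ArrayList<>();

    // Either the whole batch is added or none of it: if load fails partway through,
    // the partially added items are removed before the exception is rethrown, so a
    // propagated exception leaves the object state unchanged.
    public void addAll(List<String> batch) throws IOException {
        int sizeOnEntry = items.size();
        try {
            for (String item : batch) {
                items.add(load(item));      // load may throw IOException
            }
        } catch (IOException e) {
            while (items.size() > sizeOnEntry) {
                items.remove(items.size() - 1);   // restore the state observed on entry
            }
            throw e;                               // propagate the same exception
        }
    }

    private String load(String item) throws IOException {
        if (item == null) {
            throw new IOException("missing item");
        }
        return item.trim();
    }
}

Under this rule, one test case per explicit throw point (here, a batch whose second element is null) and one per handler suffice; the oracle simply checks that the observable state after the failed call equals the state before it.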
Further Reading
Many recent books on software testing and software engineering address object-oriented software to at least some degree. The most complete book-length account of
current methods is Binder's Testing Object Oriented Systems [Bin00].
Structural state-based testing is discussed in detail by Buy, Orso, and Pezzè [BOP00].
The data flow approach to testing software with polymorphism and dynamic binding was
initially proposed by Orso [Ors98]. Harrold, McGregor, and Fitzpatrick [HMF92] provide
a detailed discussion of the use of testing histories for selecting test cases for
subclasses.
Thévenod-Fosse and Waeselynck describe statistical testing using statechart
specifications [TFW93]. An excellent paper by Doong and Frankl [DF94] introduces
equivalent scenarios. Although Doong and Frankl discuss their application with algebraic
specifications (which are not much used in practice), the value of the approach does not
hinge on that detail.
Related Topics
Basic functional and structural testing strategies are treated briefly here, and readers
who have not already read Chapters 10, 11, and 12 will find there a more thorough
presentation of the rationale and basic techniques for those approaches. Chapters 13
and 14 likewise present the basic data flow and model-based testing approaches in
more detail. As integration testing progresses beyond small clusters of classes to major
subsystems and components, the interclass testing techniques described in this chapter
will become less relevant, and component testing techniques presented in Chapter 21
more important. The system and acceptance testing techniques described in Chapter 22
are as appropriate to object-oriented software as they are to mixed and purely
procedural software systems.
Exercises
15.1  The set of test cases given in Table 15.1 is not the smallest test suite that
      satisfies the transition coverage criterion for the finite state machine (FSM) of
      Figure 15.7.
      1. Derive a smaller set of test cases that satisfy the transition coverage
         criterion for the FSM.
      2. Compare the two sets of test cases. What are the advantages of each?
      3. Derive a suite of test cases that satisfies the simple transition coverage
         criterion but does not satisfy the transition coverage criterion.
15.2  The test cases given in Table 15.1 assume that transitions not given explicitly are
      "don't care," and thus we do not exercise them. Modify the test suite, first
      assuming that omitted transitions are "error" transitions. Next, modify the same
      test suite, but instead assuming that the omitted transitions are "self" transitions.
      Are the two modified test suites different? Why or why not?
15.3  Generate at least one equivalent and one nonequivalent scenario for at least one
      of the test cases TCA, ..., TCE of Table 15.1.
15.4
Required Background
Chapter 9
The introduction to test case selection and adequacy sets the context for this
chapter. Though not strictly required, it is helpful in understanding how the
techniques described in this chapter should be applied.
Chapter 12
Some basic knowledge of structural testing criteria is required to understand the
comparison of fault-based with structural testing criteria.
16.1 Overview
Engineers study failures to understand how to prevent similar failures in the future. For
example, failure of the Tacoma Narrows Bridge in 1940 led to new understanding of
oscillation in high wind and to the introduction of analyses to predict and prevent such
destructive oscillation in subsequent bridge design. The causes of an airline crash are
likewise extensively studied, and when traced to a structural failure they frequently result
in a directive to apply diagnostic tests to all aircraft considered potentially vulnerable to
similar failures.
Experience with common software faults sometimes leads to improvements in design
methods and programming languages. For example, the main purpose of automatic
memory management in Java is not to spare the programmer the trouble of releasing
unused memory, but to prevent the programmer from making the kind of memory
management errors (dangling pointers, redundant deallocations, and memory leaks) that
frequently occur in C and C++ programs. Automatic array bounds checking cannot
prevent a programmer from using an index expression outside array bounds, but can
make it much less likely that the fault escapes detection in testing, as well as limiting the
damage incurred if it does lead to operational failure (eliminating, in particular, the buffer
overflow attack as a means of subverting privileged programs). Type checking reliably
detects many other faults during program translation.
Of course, not all programmer errors fall into classes that can be prevented or statically
detected using better programming languages. Some faults must be detected through
testing, and there too we can use knowledge about common faults to be more effective.
The basic concept of fault-based testing is to select test cases that would distinguish the
program under test from alternative programs that contain hypothetical faults. This is
usually approached by modifying the program under test to actually produce the
hypothetical faulty programs. Fault seeding can be used to evaluate the thoroughness of
a test suite (that is, as an element of a test adequacy criterion), or for selecting test
cases to augment a test suite, or to estimate the number of faults in a program.
Distinct behavior of an alternate program R for a test t The behavior of an
alternate program R is distinct from the behavior of the original program P for a test
t if R and P produce a different result for t, or if the output of R is not defined for t.
Distinguished set of alternate programs for a test suite T A set of alternate
programs is distinguished by a test suite T if each alternate program in the set can be
distinguished from the original program by at least one test in T.
Fault-based testing can guarantee fault detection only if the competent programmer
hypothesis and the coupling effect hypothesis hold. But guarantees are more than we
expect from other approaches to designing or evaluating test suites, including the
structural and functional test adequacy criteria discussed in earlier chapters. Fault-based
testing techniques can be useful even if we decline to take the leap of faith required to
fully accept their underlying assumptions. What is essential is to recognize the
dependence of these techniques, and any inferences about software quality based on
fault-based testing, on the quality of the fault model. This also implies that developing
better fault models, based on hard data about real faults rather than guesses, is a good
investment of effort.
Mutants should be plausible as faulty programs. Mutant programs that are rejected by a
compiler, or that fail almost all tests, are not good models of the faults we seek to
uncover with systematic testing.
We say a mutant is valid if it is syntactically correct. A mutant obtained from the
program of Figure 16.1 by substituting while for switch in the statement at line 13 would
not be valid, since it would result in a compile-time error. We say a mutant is useful if, in
addition to being valid, its behavior differs from the behavior of the original program for
no more than a small subset of program test cases. A mutant obtained by substituting 0
for 1000 in the statement at line 4 would be valid, but not useful, since the mutant would
be distinguished from the program under test by all inputs and thus would not give any
useful information on the effectiveness of a test suite. Defining mutation operators that
produce valid and useful mutations is a nontrivial task.
Figure 16.1: Program transduce converts line endings among Unix, DOS, and
Macintosh conventions. The main procedure, which selects the output line end
convention, and the output procedure emit are not shown.
Since mutants must be valid, mutation operators are syntactic patterns defined relative
to particular programming languages. Figure 16.2 shows some mutation operators for
the C language. Constraints are associated with mutation operators to guide selection of
test cases likely to distinguish mutants from the original program. For example, the
mutation operator svr (scalar variable replacement) can be applied only to variables of
compatible type (to be valid), and a test case that distinguishes the mutant from the
original program must execute the modified statement in a state in which the original
variable and its substitute have different values.
[Figure 16.2: A sample set of mutation operators for the C language, with associated
constraints. The operators are grouped into operand modifications (crp, scr, acr, svr,
csr, asr, car, sar, cnr, vie), expression modifications (replace e by abs(e); aor,
arithmetic operator replacement; lcr, logical connector replacement; ror, relational
operator replacement; uoi), and statement modifications (sdl, statement deletion; ses;
replacement of the label of one case with another). The full table is not reproduced
here.]
[Table body: mutants Mi (ror, line 28), Mj (ror, line 32), Mk, and Ml (ssr, line 16),
involving changes such as (pos > 0) to (pos >= 0) and atCR = 0 to pos = 0, with kill
marks for test cases 1U, 1D, 2U, and 2D.]
Figure 16.3: A sample set of mutants for program Transduce generated with mutation
operators from Figure 16.2. x indicates the mutant is killed by the test case in the
column head.
The test suite TS kills Mj, which can be distinguished from the original program by test cases 1D, 2U, 2D,
and 2M. Mutants Mi, Mk, and Ml are not distinguished from the original program by any
test in TS. We say that mutants not killed by a test suite are live.
A mutant can remain live for two reasons:
The mutant can be distinguished from the original program, but the test suite T
does not contain a test case that distinguishes them (i.e., the test suite is not
adequate with respect to the mutant).
The mutant cannot be distinguished from the original program by any test case
(i.e., the mutant is equivalent to the original program).
Given a set of mutants SM and a test suite T, the fraction of nonequivalent mutants killed
by T measures the adequacy of T with respect to SM. Unfortunately, the problem of
identifying equivalent mutants is undecidable in general, and we could err either by
claiming that a mutant is equivalent to the program under test when it is not or by
counting some equivalent mutants among the remaining live mutants.
The adequacy of the test suite TS evaluated with respect to the four mutants of Figure
16.3 is 25%. However, we can easily observe that mutant Mi is equivalent to the original
program (i.e., no input would distinguish it). Conversely, mutants Mk and Ml seem to be
nonequivalent to the original program: There should be at least one test case that
distinguishes each of them from the original program. Thus the adequacy of TS,
measured after eliminating the equivalent mutant Mi, is 33%.
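The computation is easy to mechanize; the sketch below uses a hand-written kill matrix standing in for the data of Figure 16.3 (only Mj is killed by TS, and Mi is the mutant judged equivalent).

public class MutationScore {
    public static void main(String[] args) {
        String[]  mutants    = { "Mi",  "Mj",  "Mk",  "Ml" };
        boolean[] killed     = { false, true,  false, false };
        boolean[] equivalent = { true,  false, false, false };

        int killedCount = 0, nonEquivalent = 0;
        for (boolean k : killed) {
            if (k) killedCount++;
        }
        for (boolean e : equivalent) {
            if (!e) nonEquivalent++;
        }

        // Over all four mutants: 1/4 = 25%.
        System.out.printf("Adequacy over all mutants: %.0f%%%n",
                100.0 * killedCount / mutants.length);
        // Discarding the equivalent mutant Mi: 1/3, about 33%.
        System.out.printf("Adequacy over nonequivalent mutants: %.0f%%%n",
                100.0 * killedCount / nonEquivalent);
    }
}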
Mutant Ml is killed by test case Mixed, which represents the unusual case of an input file
containing both DOS- and Unix-terminated lines. We would expect that Mixed would also
kill Mk, but this does not actually happen: Both Mk and the original program produce the
same result for Mixed. This happens because both the mutant and the original program
fail in the same way.[1] The use of a simple oracle for checking the correctness of the
outputs (e.g., checking each output against an expected output) would reveal the fault.
The test suite TS2 obtained by adding test case Mixed to TS would be 100% adequate
(relative to this set of mutants) after removing the fault.
Mutation Analysis vs. Structural Testing
For typical sets of syntactic mutants, a mutation-adequate test suite will also be
adequate with respect to simple structural criteria such as statement or branch
coverage. Mutation adequacy can simulate and subsume a structural coverage criterion
if the set of mutants can be killed only by satisfying the corresponding test coverage
obligations.
Statement coverage can be simulated by applying the mutation operator sdl (statement
deletion) to each statement of a program. To kill a mutant whose only difference from
the program under test is the absence of statement S requires executing the mutant and
the program under test with a test case that executes S in the original program. Thus to
kill all mutants generated by applying the operator sdl to statements of the program
under test, we need a test suite that causes the execution of each statement in the
original program.
Branch coverage can be simulated by applying the operator cpr (constant for predicate
replacement) to all predicates of the program under test with constants True and False.
To kill a mutant that differs from the program under test for a predicate P set to the
constant value False, we need to execute the mutant and the program under test with a
test case that causes the execution of the True branch of P. To kill a mutant that differs
from the program under test for a predicate P set to the constant value True,we need to
execute the mutant and the program under test with a test case that causes the
execution of the False branch of P.
A test suite that satisfies a structural test adequacy criterion may or may not kill all the
corresponding mutants. For example, a test suite that satisfies the statement coverage
adequacy criterion might not kill an sdl mutant if the value computed at the statement
does not affect the behavior of the program on some possible executions.
[1] The program was in regular use by one of the authors and was believed to be correct.
Discovery of the fault came as a surprise while using it as an example for this chapter.
and thus there are about 6000 untagged chub remaining in the lake.
It may be tempting to also ask fishermen to report the number of trout caught and to
perform a similar calculation to estimate the ratio between chub and trout. However,
this is valid only if trout and chub are equally easy to catch, or if one can adjust the
ratio using a known model of trout and chub vulnerability to fishing.
Counting residual faults A similar procedure can be used to estimate the number of
faults in a program: Seed a given number S of faults in the program. Test the
program with some test suite and count the number of revealed faults. Measure the
number of seeded faults detected, DS, and also the number of natural faults DN
detected. Estimate the total number of faults remaining in the program, assuming the
test suite is as effective at finding natural faults as it is at finding seeded faults, using
the formula DS/S = DN/N, that is, N = (DN × S)/DS, where N is the estimated total
number of natural faults and N - DN of them remain undetected.
Fault seeding can be used statistically in another way: To estimate the number of faults
remaining in a program. Usually we know only the number of faults that have been
detected, and not the number that remains. However, again to the extent that the fault
model is a valid statistical model of actual fault occurrence, we can estimate that the
ratio of actual faults found to those still remaining should be similar to the ratio of seeded
faults found to those still remaining.
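A worked example of the estimate, with invented numbers, makes the arithmetic concrete.

public class FaultSeedingEstimate {
    public static void main(String[] args) {
        int seeded       = 25;   // S: faults seeded into the program
        int seededFound  = 20;   // DS: seeded faults detected by the test suite
        int naturalFound = 15;   // DN: natural faults detected by the same suite

        // If the suite detects natural faults at the same rate as seeded faults,
        // DS / S = DN / N, hence N = DN * S / DS.
        double estimatedNatural = (double) naturalFound * seeded / seededFound;
        double estimatedRemaining = estimatedNatural - naturalFound;

        System.out.printf("Estimated natural faults: %.2f, still undetected: %.2f%n",
                estimatedNatural, estimatedRemaining);   // 18.75 and 3.75
    }
}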
Once again, the necessary assumptions are troubling, and one would be unwise to place
too much confidence in an estimate of remaining faults. Nonetheless, a prediction with
known weaknesses is better than a seat-of-the-pants guess, and a set of estimates
derived in different ways is probably the best one can hope for.
While the focus of this chapter is on fault-based testing of software, related techniques
can be applied to whole systems (hardware and software together) to evaluate fault
tolerance. Some aspects of fault-based testing of hardware are discussed in the sidebar
on page 323.
Further Reading
Software testing using fault seeding was developed by Hamlet [Ham77] and
independently by DeMillo, Lipton, and Sayward [DLS78]. Underlying theories for fault-based testing, and in particular on the conditions under which a test case can distinguish
faulty and correct versions of a program, were developed by Morell [Mor90] and
extended by Thompson, Richardson, and Clarke [TRC93]. Statistical mutation using a
Bayesian approach to grow the sample until sufficient evidence has been collected has
been described by Sahinoglu and Spafford [SS90]. Weak mutation was proposed by
Howden [How82]. The sample mutation operators used in this chapter are adapted from
the Mothra software testing environment [DGK+88].
Exercises
16.1
16.3  Motivate the need for the competent programmer and the coupling effect hypotheses.
      Would mutation analysis still make sense if these hypotheses did not hold? Why?
16.4
Required Background
Chapter 7
Reasoning about program correctness is closely related to test oracles that
recognize incorrect behavior at run-time.
Chapters 9 and 10
Basic concepts introduced in these chapters are essential background for
understanding the distinction between designing a test case specification and
executing a test case.
Chapters 11 through 16
These chapters provide more context and concrete examples for understanding
the material presented here.
17.1 Overview
Designing tests is creative; executing them should be as mechanical as compiling the
latest version of the product, and indeed a product build is not complete until it has
passed a suite of test cases. In many organizations, a complete build-and-test cycle
occurs nightly, with a report of success or problems ready each morning.
The purpose of run-time support for testing is to enable frequent hands-free reexecution
of a test suite. A large suite of test data may be generated automatically from a more
compact and abstract set of test case specifications. For unit and integration testing,
and sometimes for system testing as well, the software under test may be combined
with additional "scaffolding" code to provide a suitable test environment, which might, for
example, include simulations of other software and hardware resources. Executing a
large number of test cases is of little use unless the observed behaviors are classified
as passing or failing. The human eye is a slow, expensive, and unreliable instrument for
judging test outcomes, so test scaffolding typically includes automated test oracles. The
test environment often includes additional support for selecting test cases (e.g., rotating
nightly through portions of a large test suite over the course of a week) and for
summarizing and reporting results.
17.3 Scaffolding
During much of development, only a portion of the full system is available for testing. In
modern development methodologies, the partially developed system is likely to consist
of one or more runnable programs and may even be considered a version or prototype
of the final system from very early in construction, so it is possible at least to execute
each new portion of the software as it is constructed, but the external interfaces of the
evolving system may not be ideal for testing; often additional code must be added. For
example, even if the actual subsystem for placing an order with a supplier is available
and fully operational, it is probably not desirable to place a thousand supply orders each
night as part of an automatic test run. More likely a portion of the order placement
software will be "stubbed out" for most test executions.
Code developed to facilitate testing is called scaffolding, by analogy to the temporary
structures erected around a building during construction or maintenance. Scaffolding
may include test drivers (substituting for a main or calling program), test harnesses
(substituting for parts of the deployment environment), and stubs (substituting for
functionality called or used by the software under test), in addition to program
instrumentation and support for recording and managing test execution. A common
estimate is that half of the code developed in a software project is scaffolding of some
kind, but the amount of scaffolding that must be constructed with a software project can
vary widely, and depends both on the application domain and the architectural design
and build plan, which can reduce cost by exposing appropriate interfaces and providing
necessary functionality in a rational order.
The purposes of scaffolding are to provide controllability to execute test cases and
observability to judge the outcome of test execution. Sometimes scaffolding is required
to simply make a module executable, but even in incremental development with
immediate integration of each module, scaffolding for controllability and observability
may be required because the external interfaces of the system may not provide
sufficient control to drive the module under test through test cases, or sufficient
observability of the effect. It may be desirable to substitute a separate test "driver"
program for the full system, in order to provide more direct control of an interface or to
remove dependence on other subsystems.
Consider, for example, an interactive program that is normally driven through a graphical
user interface. Assume that each night the program goes through a fully automated and
unattended cycle of integration, compilation, and test execution. It is necessary to
perform some testing through the interactive interface, but it is neither necessary nor
efficient to execute all test cases that way. Small driver programs, independent of the
graphical user interface, can drive each module through large test suites in a short time.
When testability is considered in software architectural design, it often happens that
interfaces exposed for use in scaffolding have other uses. For example, the interfaces
needed to drive an interactive program without its graphical user interface are likely to
serve also as the interface for a scripting facility. A similar phenomenon appears at a
finer grain. For example, introducing a Java interface to isolate the public functionality of
a class and hide methods introduced for testing the implementation has a cost, but also
potential side benefits such as making it easier to support multiple implementations of
the interface.
set.add(new Interval('A','Z'));
set.add(new Interval('i','n'));
assertEquals("{ ['A'-'Z']['a'-'z'] }", set.toString());
}
...
}
Figure 17.1: Excerpt of JFlex 1.4.1 source code (a widely used open-source scanner
generator) and accompanying JUnit test cases. JUnit is typical of basic test
scaffolding libraries, providing support for test execution, logging, and simple result
checking (assertEquals in the example). The illustrated version of JUnit uses Java
reflection to find and execute test case methods; later versions of JUnit use Java
annotation (metadata) facilities, and other tools use source code preprocessors or
generators.
Fully generic scaffolding may suffice for small numbers of hand-written test cases. For
larger test suites, and particularly for those that are generated systematically (e.g.,
using the combinatorial techniques described in Chapter 11 or deriving test case
specifications from a model as described in Chapter 14), writing each test case by hand
is impractical. Note, however, that the Java code expressing each test case in Figure
17.1 follows a simple pattern, and it would not be difficult to write a small program to
convert a large collection of (input, output) pairs into procedures following the same
pattern. A large suite of automatically generated test cases and a smaller set of hand-written test cases can share the same underlying generic test scaffolding.
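A sketch of such a generator follows; the emitted method body calls a hypothetical format operation, standing in for whatever pattern the hand-written tests of Figure 17.1 follow.

import java.util.LinkedHashMap;
import java.util.Map;

public class TestMethodGenerator {
    public static void main(String[] args) {
        // (input, expected output) pairs; in practice these would be read from a file
        // or produced by a combinatorial or model-based generator.
        Map<String, String> cases = new LinkedHashMap<>();
        cases.put("A-Z", "{ ['A'-'Z'] }");
        cases.put("a-z", "{ ['a'-'z'] }");

        StringBuilder out = new StringBuilder();
        int n = 0;
        for (Map.Entry<String, String> c : cases.entrySet()) {
            out.append("    @Test public void generatedCase").append(n++).append("() {\n")
               .append("        assertEquals(\"").append(c.getValue()).append("\", ")
               .append("format(\"").append(c.getKey()).append("\"));\n")
               .append("    }\n\n");
        }
        System.out.print(out);   // paste or write into a generated test class
    }
}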
Scaffolding to replace portions of the system is somewhat more demanding, and again
both generic and application-specific approaches are possible. The simplest kind of
stub, sometimes called a mock, can be generated automatically by analysis of the
source code. A mock is limited to checking expected invocations and producing
precomputed results that are part of the test case specification or were recorded in a
prior execution. Depending on system build order and the relation of unit testing to
integration in a particular process, isolating the module under test is sometimes
considered an advantage of creating mocks, as compared to depending on other parts
of the system that have already been constructed.
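A hand-rolled mock can be very small; in the sketch below the interface and its canned result are hypothetical, but the two responsibilities - recording invocations and returning precomputed results - are exactly those described above.

import java.util.ArrayList;
import java.util.List;

interface SupplierGateway {
    String placeOrder(String partNumber, int quantity);
}

class MockSupplierGateway implements SupplierGateway {
    final List<String> invocations = new ArrayList<>();
    private final String cannedConfirmation;

    MockSupplierGateway(String cannedConfirmation) {
        this.cannedConfirmation = cannedConfirmation;
    }

    @Override
    public String placeOrder(String partNumber, int quantity) {
        invocations.add(partNumber + ":" + quantity);   // record the expected invocation
        return cannedConfirmation;                      // result fixed by the test case
    }
}

The module under test is wired to the mock rather than to the real order placement subsystem, so a nightly test run never places actual supply orders, and the test can afterward inspect invocations to check that the expected calls were made.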
The balance of quality, scope, and cost for a substantial piece of scaffolding software - say, a network traffic generator for a distributed system or a test harness for a compiler
- is essentially similar to the development of any other substantial piece of software,
including similar considerations regarding specialization to a single project or investing
more effort to construct a component that can be used in several projects.
The balance is altered in favor of simplicity and quick construction for the many small
pieces of scaffolding that are typically produced during development to support unit and
small-scale integration testing. For example, a database query may be replaced by a
stub that provides only a fixed set of responses to particular query strings.
Comparison-based oracles are useful mainly for small, simple test cases, but sometimes
expected outputs can also be produced for complex test cases and large test suites.
Capture-replay testing, a special case of this in which the predicted output or behavior is
preserved from an earlier execution, is discussed in this chapter. A related approach is
to capture the output of a trusted alternate version of the program under test. For
example, one may produce output from a trusted implementation that is for some reason
unsuited for production use; it may be too slow or may depend on a component that is not
available in the production environment. It is not even necessary that the alternative
implementation be more reliable than the program under test, as long as it is sufficiently
different that the failures of the real and alternate version are likely to be independent,
and both are sufficiently reliable that not too much time is wasted determining which one
has failed a particular test case on which they disagree.
Figure 17.2: A test harness with a comparison-based test oracle processes test cases
consisting of (program input, predicted output) pairs.
A third approach to producing complex (input, output) pairs is sometimes possible: It
may be easier to produce program input corresponding to a given output than vice
versa. For example, it is simpler to scramble a sorted array than to sort a scrambled
array.
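A sketch of this tactic for a sort routine: the predicted output is fixed first, the input is obtained by scrambling it, and a library sort stands in for the program under test.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class SortCaseGenerator {
    public static void main(String[] args) {
        int[] expected = { 1, 2, 3, 5, 8, 13, 21 };   // predicted output, by construction

        List<Integer> scrambled = new ArrayList<>();
        for (int v : expected) scrambled.add(v);
        Collections.shuffle(scrambled, new Random(42));   // fixed seed keeps the case repeatable

        int[] input = new int[scrambled.size()];
        for (int i = 0; i < input.length; i++) input[i] = scrambled.get(i);

        int[] actual = input.clone();
        Arrays.sort(actual);                               // stand-in for the program under test
        System.out.println(Arrays.equals(expected, actual));   // comparison-based oracle
    }
}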
A common misperception is that a test oracle always requires predicted program output
to compare to the output produced in a test execution. In fact, it is often possible to
judge output or behavior without predicting it. For example, if a program is required to
find a bus route from station A to station B, a test oracle need not independently
compute the route to ascertain that it is in fact a valid route that starts at A and ends at
B.
Oracles that check results without reference to a predicted output are often partial, in
the sense that they can detect some violations of the actual specification but not others.
They check necessary but not sufficient conditions for correctness. For example, if the
specification calls for finding the optimum bus route according to some metric, a
validity check is only a partial oracle because it does not check optimality.
Similarly, checking that a sort routine produces sorted output is simple and cheap, but it
is only a partial oracle because the output is also required to be a permutation of the
input. A cheap partial oracle that can be used for a large number of test cases is often
combined with a more expensive comparison-based oracle that can be used with a
smaller set of test cases for which predicted output has been obtained.
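For a sort routine the two checks are easy to write down; the sketch below combines the cheap sortedness check with the permutation check, the latter relying on a trusted library sort.

import java.util.Arrays;

public class SortOracle {

    // Partial oracle: necessary but not sufficient for correctness of a sort.
    static boolean isSorted(int[] output) {
        for (int i = 1; i < output.length; i++) {
            if (output[i - 1] > output[i]) return false;
        }
        return true;
    }

    // Second check: the output must be a permutation of the input
    // (multisets compared via a trusted library sort).
    static boolean isPermutationOf(int[] input, int[] output) {
        int[] x = input.clone(), y = output.clone();
        Arrays.sort(x);
        Arrays.sort(y);
        return Arrays.equals(x, y);
    }

    static boolean accept(int[] input, int[] output) {
        return isSorted(output) && isPermutationOf(input, output);
    }
}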
Ideally, a single expression of a specification would serve both as a work assignment
and as a source from which useful test oracles were automatically derived.
Specifications are often incomplete, and their informality typically makes automatic
derivation of test oracles impossible. The idea is nonetheless a powerful one, and
wherever formal or semiformal specifications (including design models) are available, it
is worthwhile to consider whether test oracles can be derived from them. Some of the
effort of formalization will be incurred either early, in writing specifications, or later when
oracles are derived from them, and earlier is usually preferable. Model-based testing, in
which test cases and test oracles are both derived from design models, is discussed in
Chapter 14.
Figure 17.3: When self-checks are embedded in the program, test cases need not
include predicted outputs.
Self-check assertions may be left in the production version of a system, where they
provide much better diagnostic information than the uncontrolled application crash the
customer may otherwise report. If this is not acceptable - for instance, if the cost of a
runtime assertion check is too high - most tools for assertion processing also provide
controls for activating and deactivating assertions. It is generally considered good design
practice to make assertions and self-checks be free of side-effects on program state.
Side-effect free assertions are essential when assertions may be deactivated, because
otherwise suppressing assertion checking can introduce program failures that appear
only when one is not testing.
Self-checks in the form of assertions embedded in program code are useful primarily for
checking module and subsystem-level specifications, rather than overall program
behavior. Devising program assertions that correspond in a natural way to specifications
(formal or informal) poses two main challenges: bridging the gap between concrete
execution values and abstractions used in specification, and dealing in a reasonable way
with quantification over collections of values.
Test execution necessarily deals with concrete values, while abstract models are
indispensable in both formal and informal specifications. Chapter 7 (page 110) describes
the role of abstraction functions and structural invariants in specifying concrete
operational behavior based on an abstract model of the internal state of a module. The
intended effect of an operation is described in terms of a precondition (state before the
operation) and postcondition (state after the operation), relating the concrete state to
the abstract model. Consider again a specification of the get method of java.util.Map
from Chapter 7, with pre- and postconditions expressed as a Hoare triple. In that triple,
φ is an abstraction function that constructs the abstract model type (sets of (key, value)
pairs) from the concrete data structure. φ is a logical association that need not be
implemented when reasoning about program correctness. To create a test oracle, it is
useful to have an actual implementation of φ. For this example, we might implement a
special observer method that creates a simple textual representation of the set of (key,
value) pairs. Assertions used as test oracles can then correspond directly to the
specification. Besides simplifying implementation of oracles by implementing this
mapping once and using it in several assertions, structuring test oracles to mirror a
correctness argument is rewarded when a later change to the program invalidates some
part of that argument (e.g., by changing the treatment of duplicates or using a different
data structure in the implementation).
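A sketch of such an observer for a dictionary, using java.util.Map as a stand-in for the module under test: the method phi produces a canonical textual form of the abstract value, and assertions compare abstract states rather than concrete representations.

import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class MapAbstraction {

    // phi: concrete map -> canonical textual representation of the set of (key, value) pairs.
    static String phi(Map<String, String> dict) {
        StringBuilder sb = new StringBuilder("{");
        for (Map.Entry<String, String> e : new TreeMap<>(dict).entrySet()) {
            sb.append(" <").append(e.getKey()).append(", ").append(e.getValue()).append(">");
        }
        return sb.append(" }").toString();
    }

    public static void main(String[] args) {
        Map<String, String> dict = new HashMap<>();
        dict.put("k1", "v1");
        String before = phi(dict);

        String got = dict.get("k1");

        // Assertions in the spirit of the postcondition: get returns the associated
        // value and leaves the abstract state unchanged.
        assert "v1".equals(got) : "get should return the value bound to the key";
        assert phi(dict).equals(before) : "get should not change the abstract state";
        System.out.println(phi(dict));
    }
}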
In addition to an abstraction function, reasoning about the correctness of internal
structures usually involves structural invariants, that is, properties of the data structure
that are preserved by all operations. Structural invariants are good candidates for self
checks implemented as assertions. They pertain directly to the concrete data structure
implementation, and can be implemented within the module that encapsulates that data
structure. For example, if a dictionary structure is implemented as a red-black tree or an
AVL tree, the balance property is an invariant of the structure that can be checked by an
assertion within the module. Figure 17.4 illustrates an invariant check found in the source
code of the Eclipse programming environment.
package org.eclipse.jdt.internal.ui.text;
import java.text.CharacterIterator;
import org.eclipse.jface.text.Assert;
/**
 * A <code>CharSequence</code> based implementation of
 * <code>CharacterIterator</code>.
 * @since 3.0
 */
public class SequenceCharacterIterator implements CharacterIterator {
    ...
    private void invariant() {
        Assert.isTrue(fIndex >= fFirst);
        Assert.isTrue(fIndex <= fLast);
    }
    ...
    public SequenceCharacterIterator(CharSequence sequence, int first, int last)
            throws IllegalArgumentException {
        if (sequence == null)
            throw new NullPointerException();
        if (first < 0 || first > last)
            throw new IllegalArgumentException();
        if (last > sequence.length())
            throw new IllegalArgumentException();
        fSequence= sequence;
        fFirst= first;
        fLast= last;
        fIndex= first;
        invariant();
    }
    ...
Figure 17.4: A structural invariant checked by assertions in class
SequenceCharacterIterator of the Eclipse programming environment (excerpt).
The second challenge, quantification over collections of values, can sometimes be addressed
by translating a quantifier over a finite collection
into iteration in a program assertion. In fact, some run-time assertion checking systems
provide quantifiers that are simply interpreted as loops. This approach can work when
collections are small and quantifiers are not too deeply nested, particularly in
combination with facilities for selectively disabling assertion checking so that the
performance cost is incurred only when testing. Treating quantifiers as loops does not
scale well to large collections and cannot be applied at all when a specification quantifies
over an infinite collection.[1] For example, it is perfectly reasonable for a specification to
state that the route found by a trip-planning application is the shortest among all possible
routes between two points, but it is not reasonable for the route planning program to
check its work by iterating through all possible routes.
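When the quantified collection is small and concrete, the loop interpretation is straightforward; a minimal sketch of a loop-implemented quantifier follows (the method and type parameters are illustrative):

import java.util.Map;

public class LoopQuantifier {
    /** "For all keys k in dict, dict.get(k) != null", checked by iteration.
        Reasonable only when the collection is small and the check is enabled
        selectively, for example only under java -ea:
            assert allValuesNonNull(dict);                                  */
    static <K, V> boolean allValuesNonNull(Map<K, V> dict) {
        for (K k : dict.keySet()) {
            if (dict.get(k) == null) return false;
        }
        return true;
    }
}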
The problem of quantification over large sets of values is a variation on the basic
problem of program testing, which is that we cannot exhaustively check all program
behaviors. Instead, we select a tiny fraction of possible program behaviors or inputs as
representatives. The same tactic is applicable to quantification in specifications. If we
cannot fully evaluate the specified property, we can at least select some elements to
check (though at present we know of no program assertion packages that support
sampling of quantifiers). For example, although we cannot afford to enumerate all
possible paths between two points in a large map, we may be able to compare to a
sample of other paths found by the same procedure. As with test design, good samples
require some insight into the problem, such as recognizing that if the shortest path from
A to C passes through B, it should be the concatenation of the shortest path from A to B
and the shortest path from B to C.
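A minimal sketch of such a sampled comparison follows, assuming a hypothetical Planner interface that exposes shortestPathLength(from, to); only the decomposition property described above is checked, and all names are illustrative:

import java.util.List;
import java.util.Random;

public class RouteOracleSketch {
    interface Planner {
        /** Hypothetical API: length of the route found between two points. */
        double shortestPathLength(String from, String to);
    }

    /** Partial oracle: for a sample of intermediate points b, the path a->c
        found by the planner must not be longer than the concatenation of the
        paths a->b and b->c found by the same procedure (with a small
        tolerance for floating-point rounding). */
    static boolean sampledCheck(Planner p, String a, String c,
                                List<String> allPoints, int samples) {
        Random rnd = new Random(42);        // fixed seed for reproducible tests
        double direct = p.shortestPathLength(a, c);
        for (int i = 0; i < samples; i++) {
            String b = allPoints.get(rnd.nextInt(allPoints.size()));
            double viaB = p.shortestPathLength(a, b) + p.shortestPathLength(b, c);
            if (direct > viaB + 1e-9) return false;   // a->c must never exceed a->b->c
        }
        return true;
    }
}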
A final implementation problem for self-checks is that asserted properties sometimes
involve values that are either not kept in the program at all (so-called ghost variables) or
values that have been replaced ("before" values). A specification of noninterference
between threads in a concurrent program may use ghost variables to track entry and
exit of threads from a critical section. The postcondition of an in-place sort operation will
state that the new value is sorted and a permutation of the input value. This permutation
relation refers to both the "before" and "after" values of the object to be sorted. A runtime assertion system must manage ghost variables and retained "before" values and
must ensure that they have no side-effects outside assertion checking.
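A test harness (rather than a full run-time assertion system) can approximate this bookkeeping directly; a minimal sketch for the in-place sort example follows, where the retained "before" copy is used only inside the checks and the class and method names are illustrative:

import java.util.Arrays;

public class SortPostconditionCheck {
    static void checkedSort(int[] a) {
        // Retained "before" value, used only by the postcondition checks.
        int[] before = Arrays.copyOf(a, a.length);
        Arrays.sort(a);
        assert isSorted(a) : "result not sorted";
        assert isPermutation(before, a) : "result not a permutation of the input";
    }

    static boolean isSorted(int[] a) {
        for (int i = 0; i + 1 < a.length; i++)
            if (a[i] > a[i + 1]) return false;
        return true;
    }

    static boolean isPermutation(int[] x, int[] y) {
        int[] xs = Arrays.copyOf(x, x.length);
        int[] ys = Arrays.copyOf(y, y.length);
        Arrays.sort(xs);
        Arrays.sort(ys);
        return Arrays.equals(xs, ys);
    }
}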
Tools for generating test scaffolding from program code have become widely available,
as have tools to generate some kinds of test oracles from design and specification
documents. Fuller support for creating test scaffolding might bring these together,
combining information derivable from program code itself with information from design
and specification to create at least test harnesses and oracles. Program transformation
and program analysis techniques have advanced quickly in the last decade, suggesting
that a higher level of automation than in the past should now be attainable.
Further Reading
Techniques for automatically deriving test oracles from formal specifications have been
described for a wide variety of specification notations. Good starting points in this
literature include Peters and Parnas [PP98] on automatic extraction of test oracles from
a specification structured as tables; Gannon et al. [GMH81] and Bernot et al. [BGM91]
on derivation of test oracles from algebraic specifications; Doong and Frankl [DF94] on
an approach related to algebraic specifications but adapted to object-oriented
programs; Bochmann and Petrenko [vBP94] on derivation of test oracles from finite
state models, particularly (but not only) for communication protocols; and Richardson et
al. [RAO92] on a general approach to deriving test oracles from multiple specification
languages, including a form of temporal logic and the Z modeling language.
Rosenblum [Ros95] describes a system for writing test oracles in the form of program
assertions and assesses their value. Memon and Soffa [MS03] assess the impact of
test oracles and automation for interactive graphical user interface (GUI) programs.
Ostrand et al. [OAFG98] describe capture/replay testing for GUI programs.
Mocks for simulating the environment of a module are described by Saff and Ernst
[SE04]. Husted and Massol [HM03] is a guide to the popular JUnit testing framework.
Documentation for JUnit and several similar frameworks for various languages and
systems is also widely available on the Web.
Related Topics
Readers interested primarily in test automation or in automation of other aspects of
analysis and test may wish to continue reading with Chapter 23.
Exercises
Voluminous output can be a barrier to naive implementations of comparisonbased oracles. For example, sometimes we wish to show that some abstraction
of program behavior is preserved by a software change. The naive approach is to
store a detailed execution log of the original version as predicted output, and
17.1 compare that to a detailed execution log of the modified version. Unfortunately, a
detailed log of a single execution is quite lengthy, and maintaining detailed logs of
many test case executions may be impractical. Suggest more efficient
Required Background
Chapter 2
This chapter discusses complementarities and trade-offs between test and
analysis, and motivates the need for alternatives to testing.
18.1 Overview
Inspection is a low-tech but effective analysis technique that has been extensively used
in industry since the early 1970s. It is incorporated in many standards, including the
Capability Maturity Model (CMM and CMMI) and the ISO 9000 standards, and is a key
element of verification- and test-oriented processes such as the Cleanroom, SRET, and
XP processes.[1]
Inspection is a systematic, detailed review of artifacts to find defects and assess quality.
It can benefit from tool support, but it can also be executed manually. Inspection is most
commonly applied to source code, but can be applied to all kinds of artifacts during the
whole development cycle. It is effective in revealing many defects that testing cannot
reveal or can reveal only later and at higher cost.
Inspection also brings important education and social benefits. Junior developers quickly
learn standards for specification and code while working as inspectors, and expert
developers under pressure are less tempted to ignore standards. The sidebar on page
342 summarizes the chief social and educational benefits of inspection.
Social and Educational Benefits of Inspection
While the direct goal of inspection is to find and remove defects, social and
educational effects may be equally important.
Inspection creates a powerful social incentive to present acceptable work products,
even when there is no direct tie to compensation or performance evaluation. The
classic group inspection process, in which the author of the work under review is
required to be a passive participant, answering questions but not volunteering
explanation or justification for the work until asked, especially magnifies the effect; it
is not easy to listen quietly while one's work is publicly picked apart by peers.
Inspection is also an effective way to form and communicate shared norms in an
organization, not limited to rules that are explicit in checklists. The classic inspection
process prohibits problem solving in the inspection meeting itself, but the necessity of
such a rule to maintain momentum in the inspection meeting is evidence for the
general rule that, given opportunity, developers and other technical professionals are
quick to share experience and knowledge relevant to problems found in a colleague's
work. When a new practice or standard is introduced in an organization, inspection
propagates awareness and shared understanding.
New staff can be almost immediately productive, individually reviewing work products
against checklists, accelerating their familiarization with organization standards and
practices. Group inspection roles require some experience, but can likewise be more
effective than traditional training in integrating new staff.
The social and educational facets of inspection processes should be taken into
account when designing an inspection process or weighing alternatives or variations
to an existing process. If the alternatives are weighed by fault-finding effectiveness
alone, the organization could make choices that appear to be an improvement on that
dimension, but are worse overall.
[1]See Chapter 4 for brief descriptions of Cleanroom, SRET, and XP.
when the inspection requires detailed knowledge that cannot be easily acquired without
being involved in the development. This happens, for example, when inspecting complex
modules in search of semantic or integration problems.
Developers must be motivated to collaborate constructively in inspection, rather than
hiding problems and sabotaging the process. Reward mechanisms can influence the
developers' attitude and must be carefully designed to avoid perverse effects. For
example, fault density is sometimes used as a metric of developer performance. An
assessment of fault density that includes faults revealed by inspection may discourage
developers from constructive engagement in the inspection process and encourage them
to hide faults during inspection instead of highlighting them. At the very least, faults that
escape inspection must carry a higher weight than those found during inspection. Naive
incentives that reward developers for finding faults during inspection are apt to be
counterproductive because they punish the careful developer for bringing high-quality
code to the inspection.
18.4 Checklists
Checklists are a core element of classic inspection. They summarize the experience
accumulated in previous projects, and drive the review sessions. A checklist contains a
set of questions that help identify defects in the inspected artifact, and verify that the
artifact complies with company standards. A good checklist should be updated regularly
to remove obsolete elements and to add new checks suggested by the experience
accumulated in new projects. We can, for example, remove some simple checks about
coding standards after introducing automatic analyzers that enforce the standards, or
we can add specific semantic checks to avoid faults that caused problems in recent
projects.
Checklists may be used to inspect a large variety of artifacts, including requirements and
design specifications, source code, test suites, reports, and manuals. The contents of
checklists may vary greatly to reflect the different properties of the various artifacts, but
all checklists share a common structure that facilitates their use in review sessions.
Review sessions must be completed within a relatively short time (no longer than two
hours) and may require teams of different size and expertise (from a single junior
programmer to teams of senior analysts). Length and complexity of checklists must
reflect their expected use. We may have fairly long checklists with simple questions for
simple syntactic reviews, and short checklists with complex questions for semantic
reviews.
Modern checklists are structured hierarchically and are used incrementally. Checklists
with simple checks are used by individual inspectors in the early stages of inspection,
while checklists with complex checks are used in group reviews in later inspection
phases. The preface of a checklist should indicate the type of artifact and inspection that
can be done with that checklist and the level of expertise required for the inspection.
The sidebar on page 346 shows an excerpt of a checklist for a simple Java code
inspection and the sidebar on page 347 shows an excerpt of a checklist for a more
complex review of Java programs.
A common checklist organization, used in the examples in this chapter, consists of a set
of features to be inspected and a set of items to be checked for each feature.
Organizing the list by features helps direct the reviewers' attention to the appropriate set
of checks during review. For example, the simple checklist on page 346 contains checks
for file headers, file footers, import sections, class declarations, classes, and idiomatic
methods. Inspectors will scan the Java file and select the appropriate checks for each
feature.
The items to be checked ask whether certain properties hold. For example, the file
header should indicate the identity of the author and the current maintainer, a cross
reference to the design entity corresponding to the code in the file, and an overview of
the structure of the package. All checks are expressed so that a positive answer
indicates compliance. This helps the quality manager spot possible problems, which will
correspond to "no" answers in the inspection reports.
Java Checklist: Level 1 inspection (single-pass read-through, context independent)
FEATURES (where to look and how to check) with items (what to check); each item is
answered yes or no, with comments.
FILE HEADER: Are the following items included and consistent?
    - Author and current maintainer identity
    - Cross-reference to design entity
    - Overview of package structure, if the class is the principal entry point of a package
FILE FOOTER: Does it include the following items?
    - Revision log covering a minimum of 1 year, or at least to the most recent point
      release, whichever is longer
IMPORT SECTION: Are the following requirements satisfied?
    - Brief comment on each import, with the exception of the standard set: java.io.*,
      java.util.*
    - Each imported package corresponds to a dependence in the design documentation
CLASS DECLARATION: Are the following requirements satisfied?
    - The visibility marker matches the design document
    - The constructor is explicit (if the class is not static)
    - The visibility of the class is consistent with the design document
CLASS DECLARATION JAVADOC: Does the Javadoc header include:
    - One-sentence summary of class functionality
    - Guaranteed invariants (for data structure classes)
    - Usage instructions
CLASS: Are names compliant with the following rules?
    - Class or interface: CapitalizedWithEachInternalWordCapitalized
    - Special case: If a class and interface have the same base name, distinguish them
      as ClassNameIfc and ClassNameImpl
    - Exception: ClassNameEndsWithException
    - Constants (final): ...
Inspectors check the items, answer "yes" or "no" depending on the status of the
inspected feature, and add comments with detailed information. Comments are common
when the inspectors identify violations, and they help identify and localize the violations.
For example, the inspectors may indicate which file headers do not contain all the
required information and which information is missing. Comments can also be added
when the inspectors do not identify violations, to clarify the performed checks. For
example, the inspectors may indicate that they have not been able to check if the
maintainer indicated in the header is still a member of the staff of that project.
Checklists should not include items that can be more cost-effectively checked with
automated test or analysis techniques. For example, the checklist at page 346 does not
include checks for presence in the file header of file title, control identifier, copyright
statement and list of classes, since such information is added automatically and thus
does not require manual checks. On the other hand, it asks the inspector to verify the
presence of references to the author and maintainer and of cross reference to the
corresponding design entities, since this checklist is used in a context where such
information is not inserted automatically. When adopting an environment that
automatically updates author and maintainer information and checks cross references to
design entities, we may remove the corresponding checks from the checklist, and
increase the amount of code that can be inspected in a session, or add new checks to
address problems experienced in recent projects.
Properties should be as objective and unambiguous as possible. Complete
independence from subjective judgment may not be possible, but it should be pursued. For
example, broad properties like "Comments are complete?" or "Comments are well
written?" ask for a subjective judgment and raise useless and contentious discussions
among inspectors and the authors of an artifact undergoing inspection. Checklist items
like "Brief comment on each import with the exception of standard set: java.io.*,
java.util.*" or "One sentence summary of class functionality" address the same purpose
more effectively.
Items should also be easy to understand. The excerpts in the sidebars on pages 346
and 347 list items to be checked, but for each item, the checklist should provide a
description, motivations, and examples. Figure 18.1 shows a complete description of
one of the items of the sidebars.
Detailed checklist item reference:
ITEM: The visibility of the class is consistent with the design document
Description The fields and methods exported by a class must correspond to those in
the specification, which may be in the form of a UML diagram. If the class specializes
another class, method header comments must specify where superclass methods are
overridden or overloaded. Overloading or overriding methods must be semantically
consistent with ancestor methods. Additional public utility or convenience methods may
be provided if well documented in the implementation.
The class name should be identical to the name of the class in the specifying document,
for ease of reference. Names of methods and fields may differ from those in the
specifying document, provided header comments (class header comments for public
fields, method header comments for public methods) provide an explicit mapping of
implementation names to specification names. Order and grouping of fields and methods
need not follow the order and grouping in the specifying document.
Motivations Clear correspondence of elements of the implementation to elements of the
specification facilitates maintenance and reduces integration faults. If significant
deviations are needed (e.g., renaming a class or omitting or changing a public method
signature), these are design revisions that should be discussed and reflected in the
specifying document.
Examples The code implementing the following UML specification of class
CompositeItem should export fields and methods corresponding to the fields of the
specification of class CompositeItem and its ancestor class LineItem. Implementations
that use different names for some fields or methods or that do not redefine method
getUnitPrice in class CompositeItem are acceptable if properly documented. Similarly,
implementations that export an additional method compare, specializing the default
method equal to aid test oracle generation, are acceptable.
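A minimal Java sketch of the correspondence this item asks inspectors to verify follows, assuming hypothetical classes LineItem and CompositeItem as in the UML specification described above:

/** Corresponds to the abstract class LineItem in the design document. */
abstract class LineItem {
    protected String sku;
    /** Unit price in cents, as specified in the design document. */
    public abstract long getUnitPrice();
}

/** Corresponds to CompositeItem in the design document; the override of
    getUnitPrice is noted here, as the checklist item requires. */
class CompositeItem extends LineItem {
    private final java.util.List<LineItem> parts = new java.util.ArrayList<LineItem>();

    /** Overrides LineItem.getUnitPrice: the price of a composite item is the
        sum of the prices of its parts, consistent with the specifying document. */
    @Override
    public long getUnitPrice() {
        long total = 0;
        for (LineItem item : parts) total += item.getUnitPrice();
        return total;
    }
}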
TEST PLAN CHECKLIST: Comprehensive review in context
FEATURES (where to look and how to check) with items (what to check); each item is
answered yes or no, with comments.
ITEMS TO BE TESTED OR ANALYZED: For each item, does the plan include:
    - A reference to the specification for the item
    - A reference to installation procedures for the item, if any
TEST AND ANALYSIS APPROACH: Are the following ...
A superficial analysis of pair programming would suggest that using two programmers
instead of one should halve productivity. The empirical evidence available so far
suggests compensating effects of better use of time, better design choices and earlier
detection of defects leading to less time lost to rework and overall better quality.
Further Reading
The classic group inspection process is known as Fagan inspections and is described by
Fagan [Fag86]. Industrial experience with software inspections in a large software
development project is described by Russell [Rus91] and by Grady and van Slack
[GS94]. Gilb and Graham [GG93] is a widely used guidebook for applying software
inspection.
Parnas and Weiss [PW85, HW01] describe a variant process designed to ensure that
every participant in the review process is actively engaged; it is a good example of
interplay between the technical and social aspects of a process. Knight and Myers'
phased inspections [KM93] are an attempt to make a more cost-effective deployment of
personnel in inspections, and they also suggest ways in which automation can be
harnessed to improve efficiency. Perpich et al. [PPP+97] describe automation to
facilitate asynchronous inspection, as an approach to reducing the impact of inspection
on time-to-market.
There is a large research literature on empirical evaluation of the classic group
inspection and variations; Dunsmore, Roper, and Wood [DRW03] and Porter and
Johnson [PJ97] are notable examples. While there is a rapidly growing literature on
empirical evaluation of pair programming as a pedagogical method, empirical evaluations
of pair programming in industry are (so far) fewer and decidedly mixed. Hulkko and
Abrahamsson [HA05] found no empirical support for common claims of effectiveness
and efficiency. Sauer et al. [SJLY00] lay out a behavioral science research program for
determining what makes inspection more or less effective and provide an excellent
survey of relevant research results up to 2000, with suggestions for practical
improvements based on those results.
Related Topics
Simple and repetitive checks can sometimes be replaced by automated analyses.
Chapter 19 presents automated analysis techniques, while Chapter 23 discusses
automation of analysis and test.
Exercises
18.1 Your organization, which develops personal training monitors for runners and
cyclists, has a software development team split between offices in Milan, Italy, and
Eugene, Oregon. Team member roles (developers, test designers, technical writers,
etc.) are fairly evenly distributed between locations, but some technical expertise is
concentrated in one location or another. Expertise in mapping and geographic
information systems, for example, is concentrated mainly in Eugene, and expertise in
device communication and GPS hardware mainly in Milan. You are considering whether
to organize inspection of requirements specifications, design documents, and program
code primarily on a local, face-to-face basis, or ...
18.3 Automated analysis should substitute for inspection where it is more cost-effective.
How would you evaluate the cost of inspection and analysis to decide whether to
substitute an analysis tool for a particular set of checklist items?
18.4 Inspection does not require tools but may benefit from tool support. Indicate three
tools that you think can reduce human effort and increase inspectors' productivity. List
the tools in order of importance with respect to effort saved, explain why you ranked
those tools highest, and indicate the conditions under which each tool may be
particularly effective.
18.5 In classic inspection, some inspectors may remain silent and may not actively
participate in the inspection meeting. How would you modify inspection meetings to
foster active participation of all inspectors?
Required Background
Chapter 6
This chapter describes data flow analysis, a basic technique used in many static
program analyses.
Chapter 7
This chapter introduces symbolic execution and describes how it is used for
checking program properties.
Chapter 8
This chapter discusses finite state verification techniques applicable to models of
programs. Static analysis of programs is often concerned with extracting models
to which these techniques can be applied.
19.1 Overview
Automated program analysis techniques complement test and inspection in two ways.
First, automated program analyses can exhaustively check some important properties of
programs, including those for which conventional testing is particularly ill-suited. Second,
program analysis can extract and summarize information for inspection and test design,
replacing or augmenting human effort.
Conventional program testing is weak at detecting program faults that cause failures
only rarely or only under conditions that are difficult to control. For example, conventional
program testing is not an effective way to find race conditions between concurrent
threads that interfere only in small critical sections, or to detect memory access faults
that only occasionally corrupt critical structures.[1] These faults lead to failures that are
sparsely scattered in a large space of possible program behaviors, and are difficult to
detect by sampling, but can be detected by program analyses that fold the enormous
program state space down to a more manageable representation.
Concurrency Faults
Concurrent threads are vulnerable to subtle faults, including potential deadlocks and
data races. Deadlocks occur when each of a set of threads is blocked, waiting for
another thread in the set to release a lock. Data races occur when threads
concurrently access shared resources while at least one is modifying that resource.
Concurrency faults are difficult to reveal and reproduce. Because concurrent programs
are nondeterministic, different runs are not guaranteed to follow the same execution
sequence. Thus a program that fails during one execution may not fail during other
executions with the same input data, due to the different execution orders.
Concurrency faults may be prevented in several ways. Some programming styles
eliminate concurrency faults by restricting program constructs. For example, some
safety critical applications do not allow more than one thread to write to any particular
shared memory item, eliminating the possibility of concurrent writes (write-write
races). Other languages provide concurrent programming constructs that enable
simple static checks. For example, protection of a shared variable in Java
synchronized blocks is easy to check statically. Other constructs are more difficult to
check statically. For example, C and C++ libraries that require individual calls to
obtain and release a lock can be used in ways that resist static verification.
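The contrast is easy to see in a small Java sketch: the synchronized form ties the protected region to a lexical block that a static checker can recognize, while the explicit-lock form (shown here with java.util.concurrent.locks.ReentrantLock standing in for a C-style locking library) separates lock and unlock into calls that nothing forces to bracket every access; the Counter class itself is an illustrative assumption:

import java.util.concurrent.locks.ReentrantLock;

public class Counter {
    private final Object monitor = new Object();
    private final ReentrantLock lock = new ReentrantLock();
    private int count;

    /** Easy to check statically: every access to count in this method is
        lexically enclosed in a synchronized block on the same monitor. */
    public void incrementSynchronized() {
        synchronized (monitor) {
            count++;
        }
    }

    /** Harder to check statically: lock() and unlock() are separate calls,
        and only discipline (here, the try/finally idiom) keeps them paired. */
    public void incrementWithExplicitLock() {
        lock.lock();
        try {
            count++;
        } finally {
            lock.unlock();
        }
    }
}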
Manual program inspection is also effective in finding some classes of faults that are
difficult to detect with testing. However, humans are not good at repetitive and tedious
tasks.
[1]Concurrency and memory faults are further discussed in the sidebars on pages 356
and 357.
} else if (c == '%') {
    /* Case 2: '%xx' is hex for character xx */
    int digit_high = Hex_Values[*(++eptr)];
    int digit_low = Hex_Values[*(++eptr)];
If executed with an input string terminated by %x, where x is a hexadecimal digit, the
program incorrectly scans beyond the end of the input string and can corrupt memory.
However, the failure may occur long after the execution of the faulty statement,
when the corrupted memory is used. Because memory corruption may occur rarely
and lead to failure more rarely still, the fault is hard to detect with traditional testing
techniques.
In languages that require (or allow) a programmer to explicitly control deallocation of
memory, potential faults include deallocating memory that is still accessible through
pointers (making them dangerous dangling pointers to memory that may be recycled
for other uses, with different data types) or failing to deallocate memory that has
become inaccessible. The latter problem is known as a memory leak. Memory leaks
are pernicious because they do not cause immediate failure and may in fact lead to
memory exhaustion only after long periods of execution; for this reason they often
escape unit testing and show up only in integration or system test, or in actual use, as
discussed in the sidebar on page 409. Even when failure is observed, it can be
difficult to trace the failure back to the fault.
Chapter 7 describes how symbolic execution can prove that a program satisfies
specifications expressed in terms of invariants and pre- and postconditions.
Unfortunately, producing complete formal specifications with all the required pre- and
postconditions is rarely cost-effective. Moreover, even when provided with a complete
postconditions is rarely cost-effective. Moreover, even when provided with a complete
formal specification, verification through symbolic execution may require solving
predicates that exceed the capacity of modern constraint solvers.
Symbolic execution techniques find wider application in program analysis tools that aim
at finding particular, limited classes of program faults rather than proving program
correctness. Typical applications include checking for use of uninitialized memory,
memory leaks, null pointer dereference, and vulnerability to certain classes of attack
such as SQL injection or buffer overflow. Tools for statically detecting these faults make
few demands on programmers. In particular, they do not require complete program
specifications or pre- and postcondition assertions, and they range from moderately
expensive (suitable for daily or occasional use) to quite cheap (suitable for instant
feedback in a program editor).
In addition to focusing on particular classes of faults, making a static program analysis
efficient has a cost in accuracy. As discussed in Chapter 2, the two basic ways in which
we can trade efficiency for accuracy are abstracting details of execution to fold the state
space or exploring only a sample of the potential program state space. All symbolic
execution techniques fold the program state space to some extent. Some fold it far
enough that it can be exhaustively explored, incurring some pessimistic inaccuracy but no
optimistic inaccuracy. Others maintain a more detailed representation of program states,
but explore only a portion of the state space. In that way, they resemble conventional
testing.
A symbolic testing tool can simply prune execution paths whose execution conditions
involve many constraints, suggesting a high likelihood of infeasibility, or it may suppress
reports depending on a combination of likelihood and severity. A particularly useful
technique is to order warnings, with those that are almost certainly real program faults
given first. It is then up to the user to decide how far to dig into the warning list.
Figure 19.1: A C program that invokes the C function cgi_decode of Figure 12.1 with
memory for outbuf allocated from the heap.
Memory analysis dynamically traces memory accesses to detect misuse as soon as it
occurs, thus making potentially hidden failures visible and facilitating diagnosis. For
example, Figure 19.2 shows an excerpt of the results of dynamic analysis of program
cgi_decode with the Purify dynamic memory analysis tool. The result is obtained by
executing the program with a test case that produces an output longer than 10 ASCII
characters. Even if the test case execution would not otherwise cause a visible failure,
the dynamic analysis detects an array bounds violation and indicates program locations
related to the fault.
Figure 19.3: States of a memory location for dynamic memory analysis (adapted
from Hastings and Joyce [HJ92]).
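The bookkeeping behind such a tool can be sketched as a per-location state machine; the sketch below uses three illustrative states loosely corresponding to those of Figure 19.3 and a simplified event API, and it is an assumption for illustration rather than the actual Purify implementation:

import java.util.HashMap;
import java.util.Map;

public class MemoryStateTracker {
    enum State { UNALLOCATED, ALLOCATED_UNINITIALIZED, ALLOCATED_INITIALIZED }

    private final Map<Long, State> states = new HashMap<Long, State>();

    void onAllocate(long address) {
        states.put(address, State.ALLOCATED_UNINITIALIZED);
    }

    void onWrite(long address) {
        if (states.getOrDefault(address, State.UNALLOCATED) == State.UNALLOCATED)
            report("write to unallocated memory", address);
        states.put(address, State.ALLOCATED_INITIALIZED);
    }

    void onRead(long address) {
        State s = states.getOrDefault(address, State.UNALLOCATED);
        if (s == State.UNALLOCATED) report("read from unallocated memory", address);
        else if (s == State.ALLOCATED_UNINITIALIZED) report("read of uninitialized memory", address);
    }

    void onFree(long address) {
        if (states.getOrDefault(address, State.UNALLOCATED) == State.UNALLOCATED)
            report("free of unallocated memory", address);
        states.put(address, State.UNALLOCATED);
    }

    private void report(String message, long address) {
        System.err.println(message + " at 0x" + Long.toHexString(address));
    }
}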
Memory leaks can be detected by running a garbage detector, which is the analysis
portion of a garbage collector. Garbage collectors automatically identify unused memory
locations and free them. Garbage detection algorithms implement the identification step
by recursively following potential pointers from the data and stack segments into the
heap, marking all referenced blocks, and thereby identifying allocated blocks that are no
longer referenced by the program. Blocks allocated but no longer directly or transitively
referenced are reported as possible memory leaks.
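A minimal sketch of that identification step over an explicit block graph follows; the Block type and the root list are assumptions for illustration:

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class GarbageDetectorSketch {
    static class Block {
        final List<Block> pointers = new ArrayList<Block>();  // outgoing references
    }

    /** Mark phase: follow pointers from the roots (data and stack segments),
        collecting every reachable block. */
    static Set<Block> mark(List<Block> roots) {
        Set<Block> reached = new HashSet<Block>();
        for (Block root : roots) visit(root, reached);
        return reached;
    }

    private static void visit(Block b, Set<Block> reached) {
        if (b == null || !reached.add(b)) return;   // null or already marked
        for (Block next : b.pointers) visit(next, reached);
    }

    /** Allocated blocks not marked as reachable are reported as possible leaks. */
    static List<Block> possibleLeaks(List<Block> allAllocated, List<Block> roots) {
        Set<Block> reached = mark(roots);
        List<Block> leaks = new ArrayList<Block>();
        for (Block b : allAllocated) if (!reached.contains(b)) leaks.add(b);
        return leaks;
    }
}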
thread A                      thread B
    lock(lck1)                    lock(lck2)
    x = x + 1;                    x = x + 1;
    unlock(lck1)                  unlock(lck2)

The sets shown in the figure trace the locks held by each thread and the candidate
lockset of x: the lockset starts as {lck1, lck2}, is reduced to {lck1} by thread A's access
(made while holding only lck1), and is reduced to the empty set {} by thread B's access
(made while holding only lck2), signaling a potential data race.
Figure 19.4: Threads accessing the same shared variable with different locks.
(Adapted from Savage et al. [SBN+97])
This simple locking discipline is violated by some common programming practices:
Shared variables are frequently initialized without holding a lock; shared variables written
only during initialization can be safely accessed without locks; and multiple readers can
be allowed in mutual exclusion with single writers. Lockset analysis can be extended to
accommodate these idioms.
Initialization can be handled by delaying analysis till after initialization. There is no easy
way of knowing when initialization is complete, but we can consider the initialization
completed when the variable is accessed by a second thread.
Safe simultaneous reads of unprotected shared variables can also be handled very
simply, by reporting lockset violations only when the variable is written by more than one
thread. Figure 19.5 shows the state transition diagram that controls lockset analysis and
determines when races are reported. The initial virgin state indicates that the variable has not been
referenced yet. The first access moves the variable to the exclusive state. Additional
accesses by the same thread do not modify the variable state, since they are
considered part of the initialization procedure. Accesses by other threads move to states
shared and shared-modified that record the type of access. The variable lockset is
updated in both shared and shared-modified states, but violations of the policy are
reported only if they occur in state shared-modified. In this way, read-only concurrent
accesses do not produce warnings.
Figure 19.5: The state transition diagram for lockset analysis with multiple read
accesses.
To allow multiple readers to access a shared variable and still report writers' data races,
we can simply distinguish between the set of locks held in all accesses from the set of
locks held in write accesses.
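A minimal sketch of the core bookkeeping follows, ignoring the initialization and read-sharing refinements of Figure 19.5 and using strings to stand for locks and variables; all names are illustrative. Replaying the accesses of Figure 19.4 (thread A writing x while holding only lck1, then thread B writing x while holding only lck2) drives the candidate set from {lck1, lck2} to {lck1} and then to the empty set, triggering the warning.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Candidate locks for each variable start as "all locks" and are intersected
    with the locks held at each access; an empty candidate set at a write
    access is reported as a potential data race. */
public class LocksetSketch {
    private final Set<String> allLocks;
    private final Map<String, Set<String>> candidates = new HashMap<String, Set<String>>();

    LocksetSketch(Set<String> allLocks) {
        this.allLocks = allLocks;
    }

    void onAccess(String variable, Set<String> locksHeld, boolean isWrite) {
        Set<String> c = candidates.get(variable);
        if (c == null) {
            c = new HashSet<String>(allLocks);   // first access: all locks are candidates
            candidates.put(variable, c);
        }
        c.retainAll(locksHeld);                  // keep only locks held on every access
        if (isWrite && c.isEmpty()) {
            System.err.println("potential data race on " + variable);
        }
    }
}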
/**
* Internal method to insert into a subtree.
*
* @param x the item to insert.
* @param t the node that roots the tree.
* @return the new root.
*/
private AvlNode insert(Comparable x, AvlNode t) {
if (t == null)
t= new AvlNode(x, null, null);
else if (x.compareTo(t.element) < 0) {
t.left = insert(x, t.left);
if (height(t.left) - height(t.right) == 2)
if (x.compareTo(t.left.element) < 0)
t = rotateWithLeftChild(t);
else
t = doubleWithLeftChild(t);
} else if (x.compareTo(t.element) > 0) {
            t.right = insert(x, t.right);
            if (height(t.right) - height(t.left) == 2)
                if (x.compareTo(t.right.element) > 0)
                    t = rotateWithRightChild(t);
                else
                    t = doubleWithRightChild(t);
        }
        // duplicate keys fall through: the tree is left unchanged
        t.height = max(height(t.left), height(t.right)) + 1;
        recordData(t, t.left, t.right);
        return t;
    }
Figure 19.6: A Java method for inserting a node into an AVL tree
[Wei07].[*]
These predicates indicate that, in all observed executions of the insert method, the AVL
tree properties of node ordering and tree balance were maintained.
A model like this helps test designers understand the behavior of the program and the
completeness of the test suite. We can easily see that the test suite produces AVL trees
unbalanced both to the right and to the left, albeit within the AVL allowance. A predicate
like
diffHeight == 0
would indicate the absence of test cases producing unbalanced trees, and thus possibly
incomplete test suites.
Behavior analysis produces a model by refining an initial set of predicates generated
from templates. Figure 19.7 illustrates a sample set of predicate templates. Instantiating
all templates for all variables at all program points would generate an enormous number
of initial predicates, many of which are useless. Behavior analysis can be optimized by
indicating the points in the program at which we would like to extract behavior models
and the variables of interest at those points. The instruction recordData(t, t.left, t.right) in
Figure 19.6 indicates both the point at which variables are monitored (immediately
before returning from the method) and the monitored variables (the fields of the current
node and of its left and right children).
Over any variable x:
    constant                    x = a
    uninitialized               x = uninit
    in a range                  x >= a,  x <= b,  a <= x <= b
    nonzero                     x != 0
    modulus                     x = a (mod b)
    nonmodulus                  x != a (mod b)
Over two numeric variables x and y:
    linear relationship         y = ax + b
    ordering relationship       x <= y,  x < y,  x = y,  x >= y
    functions                   x = fn(y)
Over the sum x + y:
    in a range                  x + y >= a,  x + y <= b,  a <= x + y <= b
    nonzero                     x + y != 0
    modulus                     x + y = a (mod b)
    nonmodulus                  x + y != a (mod b)
Over three numeric variables x, y, and z:
    linear relationship         z = ax + by + c
    functions                   z = fn(x, y)
Over sequence variables x and y:
    element ordering
    linear relationship         y = ax + b (elementwise)
    subsequence relationship    x is a subsequence of y
    reversal                    x is the reverse of y
where a, b, and c denote constants, fn denotes a built-in function, and uninit denotes
an uninitialized value. The name of a variable denotes its value at the considered
point of execution; orig(x) indicates the original value of variable x, that is, the value at
the beginning of the considered execution.
Figure 19.7: A sample set of predicate patterns implemented by the Daikon behavior
analysis tool.
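As an illustration of the instrumentation side, a minimal sketch of what a recordData-style probe might log at the monitored point follows; the scalar parameters and the trace format are assumptions for illustration, not the actual Daikon front end:

import java.io.PrintWriter;

public class BehaviorProbe {
    private final PrintWriter trace;

    public BehaviorProbe(PrintWriter trace) {
        this.trace = trace;
    }

    /** Called immediately before insert returns: record the values over which
        the predicate templates are instantiated and later refined. */
    public void recordData(int father, int left, int right,
                           int fatherHeight, int leftHeight, int rightHeight) {
        int diffHeight = leftHeight - rightHeight;
        trace.printf("father=%d left=%d right=%d fatherHeight=%d leftHeight=%d rightHeight=%d diffHeight=%d%n",
                     father, left, right, fatherHeight, leftHeight, rightHeight, diffHeight);
    }
}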
The initial set of predicates is refined by eliminating those violated during execution.
Figure 19.9 shows two behavior models for the method insert shown in Figure 19.6. The
models were derived by executing the two test cases shown in Figure 19.8. The model
for test testCaseSingleValues shows the limitations of a test case that assigns only
three values, producing a perfectly balanced tree. The predicates correctly characterize
that execution, but represent properties of a small subset of AVL trees. The behavioral
model obtained with test testCaseRandom provides more information about the method.
This test case results in 300 invocations of the method with randomly generated
numbers. The model indicates that the elements are inserted correctly in the AVL tree
(for each node father, left < father < right) and the tree is balanced as expected
(diffHeight one of {-1,0,1}). The models provide additional information about the test
cases. All inserted elements are nonnegative (left >= 0). The model also includes
predicates that are not important or can be deduced from others. For example,
fatherHeight >= 0 can easily be deduced from the code, while father >= 0 is a
consequence of left >= 0 and left < father.
Figure 19.8: Two test cases for method insert of Figure 19.6. testCaseSingleValues
inserts 5, 2, and 7 in this order; testCaseRandom inserts 300 randomly generated
integer values.
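A minimal JUnit-style sketch of two such test cases follows, assuming a hypothetical AvlTree wrapper class around the insert method of Figure 19.6; the wrapper and the value range are assumptions for illustration:

import java.util.Random;
import org.junit.Test;

public class AvlInsertTest {
    // Hypothetical wrapper that manages the root node and delegates to the
    // insert method of Figure 19.6.
    private final AvlTree tree = new AvlTree();

    @Test
    public void testCaseSingleValues() {
        tree.insert(5);
        tree.insert(2);
        tree.insert(7);   // three values, producing a perfectly balanced tree
    }

    @Test
    public void testCaseRandom() {
        Random random = new Random();
        for (int i = 0; i < 300; i++) {
            tree.insert(random.nextInt(1000));   // 300 nonnegative random values
        }
    }
}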
Behavior model for testCaseSingleValues
_________________________________________
father one of {2, 5, 7}
left == 2
right == 7
leftHeight == rightHeight
rightHeight == diffHeight
leftHeight == 0
rightHeight == 0
fatherHeight one of {0, 1}
Behavior model for testCaseRandom
______________________________________
father >= 0
left >= 0
father > left
father < right
left < right
fatherHeight >= 0
leftHeight >= 0
rightHeight >= 0
fatherHeight > leftHeight
fatherHeight > rightHeight
fatherHeight > diffHeight
rightHeight >= diffHeight
diffHeight one of {-1,0,1}
leftHeight - rightHeight + diffHeight == 0
Figure 19.9: The behavioral models for method insert of Figure 19.6. The model was
obtained using Daikon with test cases testCaseSingleValues and testCaseRandom
shown in Figure 19.8.
As illustrated in the example, the behavior model is neither a specification of the program
nor a complete description of the program behavior, but rather a representation of the
behavior experienced so far. Additional executions can further refine the behavior model
by refining or eliminating predicates.
Some conditions may be coincidental; that is, they may happen to be true only of the
small portion of the program state space that has been explored by a particular set of
test cases. We can reduce the effect of coincidental conditions by computing a probability of
coincidence, which can be estimated by counting the number of times the predicate is
tested. Conditions are considered valid if their coincidental probability falls below a
threshold. For example, father >= 0 may occur coincidentally with a probability of 0.5, if
it is verified by a single execution, but the probability decreases to 0.5^n if it is verified by n
executions. With a threshold of 0.05%, after two executions with father = 7, the analysis
will consider valid the predicate father = 7, but not father >= 0 yet, since the latter still
has a high probability of being coincidental. Two additional executions with different
positive outcomes will invalidate predicate father = 7 and will propose father >= 0, since
its probability will be below the current threshold. The predicate father >= 0 appears in
the model obtained from testCaseRandom, but not in the model obtained from
testCaseSingleValues because it occurred 300 times in the execution of
testCaseRandom but only 3 times in the execution of testCaseSingleValues.
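A minimal sketch of this refinement and filtering step follows, with a simple Predicate interface and the threshold value as assumptions; the coincidence estimate 0.5^n for n supporting observations follows the text:

import java.util.ArrayList;
import java.util.List;

public class PredicateRefinement {
    interface Predicate {
        boolean holds(int value);
        String describe();
    }

    static List<Predicate> refine(List<Predicate> candidates, int[] observations,
                                  double threshold) {
        List<Predicate> result = new ArrayList<Predicate>();
        for (Predicate p : candidates) {
            int supporting = 0;
            boolean violated = false;
            for (int v : observations) {
                if (p.holds(v)) supporting++;
                else { violated = true; break; }   // one counterexample eliminates the predicate
            }
            // Keep the predicate only if it was never violated and is unlikely
            // to hold by coincidence: 0.5^n below the chosen threshold.
            if (!violated && Math.pow(0.5, supporting) < threshold) {
                result.add(p);
            }
        }
        return result;
    }
}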
Behavior models may help in many ways: during testing to help validate the
thoroughness of tests, during program analysis to help understand program behavior,
during regression test to compare the behavior of different versions or configurations,
during test of component-based software systems to compare the behavior of
components in different contexts, and during debugging to identify anomalous behavior
and understand its causes.
Dynamic analysis (aside from conventional testing) was once relegated to debugging
and performance analysis, but has recently become an important approach for
constructing and refining models of program behavior. Synergistic combinations of static
program analysis, dynamic analysis, and testing are a promising avenue of further
research.
Further Reading
Readings on some of the underlying techniques in program analysis are suggested in
Chapters 5, 6, 7, and 8. In addition, any good textbook on compiler construction will
provide useful basic background on extracting models from program source code.
A recent application of symbolic testing described by Bush, Pincus, and Sielaff [BPS00]
is a good example of the revival of an approach that found little practical application
when first introduced in the 1970s. Aside from exploiting vastly greater computing
capacity, the modern version of the technique improves on the original in several ways,
most notably better managing communication of analysis results to the user. Coen-Porisini et al. [CPDGP01] describe a modern application of symbolic execution in
constructing a rigorous demonstration of program properties by exploiting limitations of
an application domain.
Savage et al. [SBN+97] introduced the lockset analysis technique, which has influenced a
great deal of subsequent research in both static and dynamic analyses of multi-threaded
software. The Daikon tool and its approach to behavioral model extraction were
introduced by Ernst et al. [ECGN01].
Exercises
19.1 We claimed that Java synchronized(l) { block } is easier to check statically than
separate lock(l) and unlock(l) operations. Give an example of how it could be harder to
verify that lock(l) and unlock(l) operations protect a particular variable access than to
verify that the access is protected by a synchronized(l) { } block.
19.2 Although Java synchronized blocks make analysis of locking easy relative to
individual lock(l) and unlock(l) operations, it is still possible to construct Java programs
for which a static program analysis will not be able to determine whether access at a
particular program location is always protected by the same lock. Give an example of
this, with an explanation. (Hint: Each lock in Java is identified by a corresponding
object.)
19.3 A fundamental facility for symbolic testing and many other static analysis
techniques is to allow the user to note that a particular warning or error report is a false
alarm, and to suppress it in future runs of the analysis tool. However, it is possible that a
report that is a false alarm today might describe a real fault sometime later, due to
program changes. How might you support the "revival" of suppressed error reports at
appropriate times and points? Discuss the advantages and disadvantages of your
approach.
19.4 Suppose we choose to model a program execution state with the program location
(control point) and the states of four Boolean variables w, x, y, and z, and suppose each
of those variables is modeled by a finite state machine (FSM) with three states
representing possible values (uninitialized, true, and false).
If we were modeling just the possible values of w, a natural choice would be to label
each program location with an element from a powerset lattice in which each lattice
element represents a subset of automaton states. If we model w, x, y, and z, there are
at least two different ways we could represent values at each program location: as a
set of tuples of FSM states or as a tuple of sets of FSM states. What are the
advantages and disadvantages of each of these representation choices? How might
your choice depend on the property you were attempting to verify?
[*]Adapted
from source code from the text DATA STRUCTURES & ALGORITHM
ANALYSIS IN JAVA by Weiss, 2007, 1999 Pearson Education, Inc. Reproduced by
permission of Pearson Education, Inc. All rights reserved.
Chapter List
Chapter 20: Planning and Monitoring the Process
Chapter 21: Integration and Component-based Software Testing
Chapter 22: System, Acceptance, and Regression Testing
Chapter 23: Automating Analysis and Test
Chapter 24: Documenting Analysis and Test
Required Background
Chapter 4
Introduction of basic concepts of quality process, goals, and activities provides
useful background for understanding this chapter.
20.1 Overview
Planning involves scheduling activities, allocating resources, and devising observable,
unambiguous milestones against which progress and performance can be monitored.
Monitoring means answering the question, "How are we doing?"
Quality planning is one aspect of project planning, and quality processes must be closely
coordinated with other development processes. Coordination among quality and
development tasks may constrain ordering (e.g., unit tests are executed after creation of
program units). It may shape tasks to facilitate coordination; for example, delivery may
be broken into smaller increments to allow early testing. Some aspects of the project
plan, such as feedback and design for testability, may belong equally to the quality plan
and other aspects of the project plan.
Quality planning begins at the inception of a project and is developed with the overall
project plan, instantiating and building on a quality strategy that spans several projects.
Like the overall project plan, the quality plan is developed incrementally, beginning with
the feasibility study and continuing through development and delivery.
Formulation of the plan involves risk analysis and contingency planning. Execution of the
plan involves monitoring, corrective action, and planning for subsequent releases and
projects.
Allocating responsibility among team members is a crucial and difficult part of planning.
When one person plays multiple roles, explicitly identifying each responsibility is still
essential for ensuring that none are neglected.
might address some of these, and automated analyses might help with completeness
and consistency checking.
The evolving collection of work products can be viewed as a set of descriptions of
different parts and aspects of the software system, at different levels of detail. Portions
of the implementation have the useful property of being executable in a conventional
sense, and are the traditional subject of testing, but every level of specification and
design can be both the subject of verification activities and a source of information for
verifying other artifacts. A typical intermediate artifact - say, a subsystem interface
definition or a database schema - will be subject to the following steps:
Internal consistency check Check the artifact for compliance with structuring rules that
define "well-formed" artifacts of that type. An important point of leverage is defining the
syntactic and semantic rules thoroughly and precisely enough that many common errors
result in detectable violations. This is analogous to syntax and strong-typing rules in
programming languages, which are not enough to guarantee program correctness but
effectively guard against many simple errors.
External consistency check Check the artifact for consistency with related artifacts.
Often this means checking for conformance to a "prior" or "higher-level" specification,
but consistency checking does not depend on sequential, top-down development - all
that is required is that the related information from two or more artifacts be defined
precisely enough to support detection of discrepancies. Consistency usually proceeds
from broad, syntactic checks to more detailed and expensive semantic checks, and a
variety of automated and manual verification techniques may be applied.
Generation of correctness conjectures Correctness conjectures, which can be test
outcomes or other objective criteria, lay the groundwork for external consistency checks
of other work products, particularly those that are yet to be developed or revised.
Generating correctness conjectures for other work products will frequently motivate
refinement of the current product. For example, an interface definition may be
elaborated and made more precise so that implementations can be effectively tested.
In the specification activity, the development team defines the required behavior of
the system, while the quality team defines usage scenarios that are later used for
deriving system test suites. The planning activity identifies incremental development
and certification phases.
After planning, all activities are iterated to produce incremental releases of the
system. Each system increment is fully deployed and certified before the following
step. Design and code undergo formal inspection ("Correctness verification") before
release. One of the key premises underpinning the Cleanroom process model is that
rigorous design and formal inspection produce "nearly fault-free software."
Usage profiles generated during specification are applied in the statistical testing
activity to gauge quality of each release. Another key assumption of the Cleanroom
process model is that usage profiles are sufficiently accurate that statistical testing
will provide an accurate measure of quality as perceived by users.[a] Reliability is
measured in terms of mean time between failures (MTBF) and is constantly controlled
after each release. Failures are reported to the development team for correction, and
if reliability falls below an acceptable range, failure data is used for process
improvement before the next incremental release.
Test cases suitable for batch execution are part of the system code base and are
implemented prior to the implementation of features they check ("test-first").
Developers work in pairs, incrementally developing and testing a module. Pair
programming effectively conflates a review activity with coding. Each release is
checked by running all the tests devised up to that point of development, thus
essentially merging unit testing with integration and system testing. A failed
acceptance test is viewed as an indication that additional unit tests are needed.
Although there are no standard templates for analysis and test strategies, we can
identify a few elements that should be part of almost any good strategy. A strategy
should specify common quality requirements that apply to all or most products,
promoting conventions for unambiguously stating and measuring them, and reducing the
likelihood that they will be overlooked in the quality plan for a particular project. A
strategy should indicate a set of documents that is normally produced during the quality
process, and their contents and relationships. It should indicate the activities that are
prescribed by the overall process organization. Often a set of standard tools and
practices will be prescribed, such as the interplay of a version and configuration control
tool with review and testing procedures. In addition, a strategy includes guidelines for
project staffing and assignment of roles and responsibilities. An excerpt of a sample
strategy document is presented in Chapter 24.
Figure 20.1: Three possible simple schedules with different risks and resource
allocation. The bars indicate the duration of the tasks. Diamonds indicate milestones,
and arrows between bars indicate precedence between tasks.
In the middle schedule, marked as UNLIMITED RESOURCES, the test design and
execution activities are separated into distinct tasks. Test design tasks are scheduled
early, right after analysis and design, and only test execution is scheduled after Code
and integration. In this way the tasks Design subsystem tests and Design system tests
are removed from the critical path, which now spans 16 weeks with a tolerance of 5
weeks with respect to the expected termination of the project. This schedule assumes
enough resources for running Code and integration, Production of user documentation,
Design of subsystem tests, and Design of system tests.
The LIMITED RESOURCES schedule at the bottom of Figure 20.1 rearranges tasks to
meet resource constraints. In this case we assume that test design and execution, and
production of user documentation share the same resources and thus cannot be
executed in parallel. We can see that, despite the limited parallelism, decomposing
testing activities and scheduling test design earlier results in a critical path of 17 weeks,
4 weeks earlier than the expected termination of the project. Notice that in the example,
the critical path is formed by the tasks Analysis and design, Design subsystem tests,
Design system tests, Produce user documentation, Execute subsystem tests, and
Execute system tests. In fact, the limited availability of resources results in
dependencies among Design subsystem tests, Design system tests and Produce user
documentation that last longer than the parallel task Code and integration.
The completed plan must include frequent milestones for assessing progress. A rule of
thumb is that, for projects of a year or more, milestones for assessing progress should
occur at least every three months. For shorter projects, a reasonable maximum interval
Figure 20.2: Initial schedule for quality activities in development of the business logic
subsystem of the Chipmunk Web presence, presented as a GANTT
diagram.
The GANTT diagram shows four main groups of analysis and test activities: design
inspection, code inspection, test design, and test execution. The distribution of activities
over time is constrained by resources and dependence among activities. For example,
system test execution starts after completion of system test design and cannot finish
before system integration (the sync and stabilize elements of the development framework)
is complete. Inspection activities are constrained by specification and design activities.
Test design activities are constrained by limited resources. Late scheduling of the design
of integration tests for the administrative business logic subsystem is necessary to avoid
overlap with design of tests for the shopping functionality subsystem.
The GANTT diagram does not highlight intermediate milestones, but we can easily
identify two in April and July, thus dividing the development into three main phases. The
first phase (January to April) corresponds to requirements analysis and architectural
design activities and terminates with the architectural design baseline. In this phase, the
quality team focuses on design inspection and on the design of acceptance and system
tests. The second phase (May to July) corresponds to subsystem design and to the
implementation of the first complete version of the system. It terminates with the first
stabilization of the administrative business logic subsystem. In this phase, the quality
team completes the design inspection and the design of test cases. In the final stage,
the development team produces the final version, while the quality team focuses on code
inspection and test execution.
Absence of test design activities in the last phase results from careful identification of
activities that allowed early planning of critical tasks.
(Tables of technology risks, schedule risks, execution risks, and requirements risks omitted.)
One key aggregate measure is the number of faults that have been revealed and
removed, which can be compared to data obtained from similar past projects. Fault
detection and removal can be tracked against time and will typically follow a
characteristic distribution similar to that shown in Figure 20.3. The number of faults
detected per time unit tends to grow across several system builds, then to decrease at
a much lower rate (usually half the growth rate) until it stabilizes.
or extraneous, that is, due to something not relevant or pertinent to the document or
code, as in a section of the design document that is not pertinent to the current product
and should be removed. The source of the fault indicates the origin of the faulty
modules: in-house, library, ported from other platforms, or outsourced code.
The age indicates the age of the faulty element - whether the fault was found in new, old (base), rewritten, or re-fixed code.
The detailed information on faults allows for many analyses that can provide information
on the development and the quality process. As in the case of analysis of simple
faultiness data, the interpretation depends on the process and the product, and should
be based on past experience. The taxonomy of faults, as well as the analysis of
faultiness data, should be refined while applying the method.
When we first apply the ODC method, we can perform some preliminary analysis using
only part of the collected information:
Distribution of fault types versus activities Different quality activities target different
classes of faults. For example, algorithmic (that is, local) faults are targeted primarily by
unit testing, and we expect a high proportion of faults detected by unit testing to be in
this class. If the proportion of algorithmic faults found during unit testing is unusually
small, or a larger than normal proportion of algorithmic faults are found during integration
testing, then one may reasonably suspect that unit tests have not been well designed. If
the mix of faults found during integration testing contains an unusually high proportion of
algorithmic faults, it is also possible that integration testing has not focused strongly
enough on interface faults.
Distribution of triggers over time during field test Faults corresponding to simple
usage should arise early during field test, while faults corresponding to complex usage
should arise late. In both cases, the rate of disclosure of new faults should
asymptotically decrease. Unexpected distributions of triggers over time may indicate
poor system or acceptance test. If triggers that correspond to simple usage reveal
many faults late in acceptance testing, we may have chosen a sample that is not
representative of the user population. If faults continue growing during acceptance test,
system testing may have failed, and we may decide to resume it before continuing with
acceptance testing.
Age distribution over target code Most faults should be located in new and rewritten
code, while few faults should be found in base or re-fixed code, since base and re-fixed
code has already been tested and corrected. Moreover, the proportion of faults in new
and rewritten code with respect to base and re-fixed code should gradually increase.
Different patterns may indicate holes in the fault tracking and removal process or may
be a symptom of inadequate test and analysis that failed in revealing faults early (in
previous tests of base or re-fixed code). For example, an increase of faults located in
base code after porting to a new platform may indicate inadequate tests for portability.
Distribution of fault classes over time The proportion of missing code faults should
gradually decrease, while the percentage of extraneous faults may slowly increase,
because missing functionality should be revealed with use and repaired, while
extraneous code or documentation may be produced by updates. An increasing number
of missing faults may be a symptom of instability of the product, while a sudden sharp
increase in extraneous faults may indicate maintenance problems.
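As a concrete illustration of the kind of preliminary analysis described above, the following small C program tallies fault records by fault type and detecting activity, producing the raw counts from which a distribution of fault types versus activities can be read. It is a sketch only: the record layout, the activity names, and the fault-type names are invented for the example and are not part of ODC itself.

#include <stdio.h>

/* Hypothetical, simplified fault record: real ODC data would carry
   many more attributes (trigger, impact, age, source, ...). */
enum activity  { UNIT_TEST, INTEGRATION_TEST, SYSTEM_TEST, N_ACTIVITIES };
enum faulttype { ALGORITHMIC, INTERFACE, TIMING, N_TYPES };

struct fault_record {
    enum activity  found_in;   /* activity that revealed the fault */
    enum faulttype type;       /* ODC-style fault type */
};

static const char *activity_name[] = { "unit", "integration", "system" };
static const char *type_name[]     = { "algorithmic", "interface", "timing" };

int main(void) {
    /* A few invented records standing in for a project's fault database. */
    struct fault_record db[] = {
        { UNIT_TEST, ALGORITHMIC }, { UNIT_TEST, ALGORITHMIC },
        { INTEGRATION_TEST, INTERFACE }, { INTEGRATION_TEST, ALGORITHMIC },
        { SYSTEM_TEST, TIMING },
    };
    int n = sizeof db / sizeof db[0];
    int count[N_ACTIVITIES][N_TYPES] = { { 0 } };

    for (int i = 0; i < n; ++i)               /* tally fault type x activity */
        count[db[i].found_in][db[i].type]++;

    for (int a = 0; a < N_ACTIVITIES; ++a)    /* print the distribution */
        for (int t = 0; t < N_TYPES; ++t)
            printf("%-12s %-12s %d\n", activity_name[a], type_name[t], count[a][t]);
    return 0;
}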
Faults are also classified by severity, with a short description and an example for each level, ranging from critical failures down to cosmetic faults that cause only minor inconvenience.
The RCA approach to categorizing faults, in contrast to ODC, does not use a predefined
set of categories. The objective of RCA is not to compare different classes of faults over
time, or to analyze and eliminate all possible faults, but rather to identify the few most
important classes of faults and remove their causes. Successful application of RCA
progressively eliminates the causes of the currently most important faults, which lose
importance over time, so applying a static predefined classification would be useless.
Moreover, the precision with which we identify faults depends on the specific project and
process and varies over time.
ODC Classification of Triggers Listed by Activity
Design Review and Code Inspection
Design Conformance A discrepancy between the reviewed artifact and a prior-stage
artifact that serves as its specification.
Logic/Flow An algorithmic or logic flaw.
Backward Compatibility A difference between the current and earlier versions of an
artifact that could be perceived by the customer as a failure.
Internal Document An internal inconsistency in the artifact (e.g., inconsistency
between code and comments).
Lateral Compatibility An incompatibility between the artifact and some other system
or module with which it should interoperate.
Concurrency A fault in interaction of concurrent processes or threads.
Blocked Test Failure occurred in setting up the test scenario.
A good RCA classification should follow the uneven distribution of faults across
categories. If, for example, the current process and the programming style and
environment result in many interface faults, we may adopt a finer classification for
interface faults and a coarse-grain classification of other kinds of faults. We may alter
the classification scheme in future projects as a result of having identified and removed
the causes of many interface faults.
Classification of faults should be sufficiently precise to allow identifying one or two most
significant classes of faults considering severity, frequency, and cost of repair. It is
important to keep in mind that severity and repair cost are not directly related. We may
have cosmetic faults that are very expensive to repair, and critical faults that can be
easily repaired. When selecting the target class of faults, we need to consider all the
factors. We might, for example, decide to focus on a class of moderately severe faults
that occur very frequently and are very expensive to remove, investing fewer resources
in preventing a more severe class of faults that occur rarely and are easily repaired.
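The text deliberately gives no formula for combining severity, frequency, and repair cost; the weighting is a judgment call. Purely as an illustration of the trade-off, the sketch below ranks hypothetical fault classes by a naive expected-cost score (frequency times repair cost times a severity weight). The class names, the numbers, and the scoring rule itself are invented for the example and are not prescribed by RCA.

#include <stdio.h>

/* Invented fault classes with per-class statistics. */
struct fault_class {
    const char *name;
    int    faults_per_release;   /* observed frequency             */
    double repair_cost_hours;    /* average cost to repair         */
    double severity_weight;      /* e.g., 1 = cosmetic ... 10 = critical */
};

int main(void) {
    struct fault_class classes[] = {
        { "memory leak",        12, 16.0, 6.0 },
        { "interface mismatch",  5,  8.0, 4.0 },
        { "crash on bad input",  2,  4.0, 9.0 },
    };
    int n = sizeof classes / sizeof classes[0];
    int best = 0;
    double best_score = -1.0;

    for (int i = 0; i < n; ++i) {
        /* Naive score: higher means a more attractive RCA target. */
        double score = classes[i].faults_per_release
                     * classes[i].repair_cost_hours
                     * classes[i].severity_weight;
        printf("%-20s score %.1f\n", classes[i].name, score);
        if (score > best_score) { best_score = score; best = i; }
    }
    printf("candidate RCA target: %s\n", classes[best].name);
    return 0;
}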
When did faults occur, and when were they found? It is typical of mature software
processes to collect fault data sufficient to determine when each fault was detected
(e.g., in integration test or in a design inspection). In addition, for the class of faults
identified in the first step, we attempt to determine when those faults were introduced
(e.g., was a particular fault introduced in coding, or did it result from an error in
architectural design?).
Why did faults occur? In this core RCA step, we attempt to trace representative faults
back to causes, with the objective of identifying a "root" cause associated with many
faults in the class. Analysis proceeds iteratively by attempting to explain the error that
led to the fault, then the cause of that error, the cause of that cause, and so on. The rule
of thumb "ask why six times" does not provide a precise stopping rule for the analysis,
but suggests that several steps may be needed to find a cause in common among a
large fraction of the fault class under consideration.
The 80/20 or Pareto Rule
Fault classification in root cause analysis is justified by the so-called 80/20 or Pareto
rule. The Pareto rule is named for the Italian economist Vilfredo Pareto, who in the
late nineteenth century proposed a mathematical power law formula to describe the
unequal distribution of wealth in his country, observing that 20% of the people owned
80% of the wealth.
Pareto observed that in many populations, a few (20%) are vital and many (80%) are
trivial. In fault analysis, the Pareto rule postulates that 20% of the code is responsible
for 80% of the faults. Although proportions may vary, the rule captures two important
facts:
1. Faults tend to accumulate in a few modules, so identifying potentially faulty
modules can improve the cost effectiveness of fault detection.
2. Some classes of faults predominate, so removing the causes of a
predominant class of faults can have a major impact on the quality of the
process and of the resulting product.
The predominance of a few classes of faults justifies focusing on one class at a time.
Tracing the causes of faults requires experience, judgment, and knowledge of the
development process. We illustrate with a simple example. Imagine that the first RCA
step identified memory leaks as the most significant class of faults, combining a
moderate frequency of occurrence with severe impact and high cost to diagnose and
repair. The group carrying out RCA will try to identify the cause of memory leaks and
may conclude that many of them result from forgetting to release memory in exception
handlers. The RCA group may trace this problem in exception handling to lack of
information: Programmers can't easily determine what needs to be cleaned up in
exception handlers. The RCA group will ask why once more and may go back to a
design error: The resource management scheme assumes normal flow of control and
thus does not provide enough information to guide implementation of exception handlers.
Finally, the RCA group may identify the root problem in an early design problem:
Exceptional conditions were an afterthought dealt with late in design.
Each step requires information about the class of faults and about the development
process that can be acquired through inspection of the documentation and interviews
with developers and testers, but the key to success is curious probing through several
levels of cause and effect.
How could faults be prevented? The final step of RCA is improving the process by
removing root causes or making early detection likely. The measures taken may have a
minor impact on the development process (e.g., adding consideration of exceptional
conditions to a design inspection checklist), or may involve a substantial modification of
the process (e.g., making explicit consideration of exceptional conditions a part of all
requirements analysis and design steps). As in tracing causes, prescribing preventative
or detection measures requires judgment, keeping in mind that the goal is not perfection
but cost-effective improvement.
ODC and RCA are two examples of feedback and improvement, which are an important
dimension of most good software processes. Explicit process improvement steps are,
for example, featured in both SRET (sidebar on page 380) and Cleanroom (sidebar on
page 378).
roles minimizes the risk of conflict between roles played by an individual, and thus makes
most sense for roles in which independence is paramount, such as final system and
acceptance testing. An independent team devoted to quality activities also has an
advantage in building specific expertise, such as test design. The primary risk arising
from separation is in conflict between goals of the independent quality team and the
developers.
When quality tasks are distributed among groups or organizations, the plan should
include specific checks to ensure successful completion of quality activities. For
example, when module testing is performed by developers and integration and system
testing is performed by an independent quality team, the quality team should check the
completeness of module tests performed by developers, for example, by requiring
satisfaction of coverage criteria or inspecting module test suites. If testing is performed
by an independent organization under contract, the contract should carefully describe the
testing process and its results and documentation, and the client organization should
verify satisfactory completion of the contracted tasks.
Existence of a testing team must not be perceived as relieving developers from
responsibility for quality, nor is it healthy for the testing team to be completely oblivious
to other pressures, including schedule pressure. The testing team and development
team, if separate, must at least share the goal of shipping a high-quality product on
schedule.
Independent quality teams require a mature development process to minimize
communication and coordination overhead. Test designers must be able to work on
sufficiently precise specifications and must be able to execute tests in a controllable test
environment. Versions and configurations must be well defined, and failures and faults
must be suitably tracked and monitored across versions.
It may be logistically impossible to maintain an independent quality group, especially in
small projects and organizations, where flexibility in assignments is essential for
resource management. Aside from the logistical issues, division of responsibility creates
additional work in communication and coordination. Finally, quality activities often
demand deep knowledge of the project, particularly at detailed levels (e.g., unit and
early integration test). An outsider will have less insight into how and what to test, and
may be unable to effectively carry out the crucial earlier activities, such as establishing
acceptance criteria and reviewing architectural design for testability. For all these
reasons, even organizations that rely on an independent verification and validation
(IV&V) group for final product qualification allocate other responsibilities to developers
and to quality professionals working more closely with the development team.
At the polar opposite from a completely independent quality team is full integration of
quality activities with development, as in some "agile" processes including XP.
Communication and coordination overhead is minimized this way, and developers take
full responsibility for the quality of their work product. Moreover, technology and
application expertise for quality tasks will match the expertise available for development
tasks, although the developer may have less specific expertise in skills such as test
design.
The more development and quality roles are combined and intermixed, the more
important it is to build into the plan checks and balances to be certain that quality
activities and objective assessment are not easily tossed aside as deadlines loom. For
example, XP practices like "test first" together with pair programming (sidebar on page
381) guard against some of the inherent risks of mixing roles. Separate roles do not
necessarily imply segregation of quality activities to distinct individuals. It is possible to
assign both development and quality responsibility to developers, but assign two
individuals distinct responsibilities for each development work product. Peer review is an
example of mixing roles while maintaining independence on an item-by-item basis. It is
also possible for developers and testers to participate together in some activities.
Many variations and hybrid models of organization can be designed. Some organizations
have obtained a good balance of benefits by rotating responsibilities. For example, a
developer may move into a role primarily responsible for quality in one project and move
back into a regular development role in the next. In organizations large enough to have a
distinct quality or testing group, an appropriate balance between independence and
integration typically varies across levels of project organization. At some levels, an
appropriate balance can be struck by giving responsibility for an activity (e.g., unit
testing) to developers who know the code best, but with a separate oversight
responsibility shared by members of the quality team. For example, unit tests may be
designed and implemented by developers, but reviewed by a member of the quality
team for effective automation (particularly, suitability for automated regression test
execution as the product evolves) as well as thoroughness. The balance tips further
toward independence at higher levels of granularity, such as in system and acceptance
testing, where at least some tests should be designed independently by members of the
quality team.
Outsourcing test and analysis activities is sometimes motivated by the perception that
testing is less technically demanding than development and can be carried out by lower-paid and lower-skilled individuals. This confuses test execution, which should in fact be
straightforward, with analysis and test design, which are as demanding as design and
programming tasks in development. Of course, less skilled individuals can design and
carry out tests, just as less skilled individuals can design and write programs, but in both
cases the results are unlikely to be satisfactory.
Outsourcing can be a reasonable approach when its objectives are not merely
minimizing cost, but maximizing independence. For example, an independent judgment of
quality may be particularly valuable for final system and acceptance testing, and may be
essential for measuring a product against an independent quality standard (e.g.,
qualifying a product for medical or avionic use). Just as an organization with mixed roles
requires special attention to avoid the conflicts between roles played by an individual,
radical separation of responsibility requires special attention to control conflicts between
the quality assessment team and the development team.
The plan must clearly define milestones and delivery for outsourced activities, as well as
checks on the quality of delivery in both directions: Test organizations usually perform
quick checks to verify the consistency of the software to be tested with respect to some
minimal "testability" requirements; clients usually check the completeness and
consistency of test results. For example, test organizations may ask for the results of
inspections on the delivered artifact before they start testing, and may include some
quick tests to verify the installability and testability of the artifact. Clients may check that
tests satisfy specified functional and structural coverage criteria, and may inspect the
test documentation to check its quality. Although the contract should detail the relation
between the development and the testing groups, ultimately, outsourcing relies on mutual
trust between organizations.
Further Reading
IEEE publishes a standard for software quality assurance plans [Ins02], which serves as
a good starting point. The plan outline in this chapter is based loosely on the IEEE
standard. Jaaksi [Jaa03] provides a useful discussion of decision making based on
distribution of fault discovery and resolution over the course of a project, drawn from
experience at Nokia. Chaar et al. [CHBC93] describe the orthogonal defect classification
technique, and Bhandari et al. [BHC+94] provide practical details useful in implementing
it. Leszak et al. [LPS02] describe a retrospective process with root cause analysis,
process compliance analysis, and software complexity analysis. Denaro and Pezzè [DP02] describe fault-proneness models for allocating effort in a test plan. DeMarco and Lister [DL99] is a popular guide to the human dimensions of managing software
teams.
Exercises
20.1
Testing compatibility with a variety of device drivers is a significant cost and schedule factor in some projects. For example, a well-known developer of desktop publishing software maintains a test laboratory containing dozens of current and outdated models of Macintosh computer, running several operating system versions.
Put yourself in the place of the quality manager for a new version of this desktop publishing software, and consider in particular the printing subsystem of the software package. Your goal is to minimize the schedule impact of testing the software against a large number of printers, and in particular to reduce the risk that serious problems in the printing subsystem surface late in the project, or that testing on the actual hardware delays product release.
How can the software architectural design be organized to serve your goals of
reducing cost and risk? Do you expect your needs in this regard will be aligned
with those of the development manager, or in conflict? What other measures
might you take in project planning, and in particular in the project schedule, to
minimize risks of problems arising when the software is tested in an operational
environment? Be as specific as possible, and avoid simply restating the general
strategies presented in this chapter.
20.2
Chipmunk Computers has signed an agreement with a software house for software development under contract. Project leaders are encouraged to take advantage of this agreement to outsource development of some modules and thereby reduce project cost. Your project manager asks you to analyze the risks that may result from this choice and propose approaches to reduce the impact of the identified risks. What would you suggest?
20.3
Suppose a project applied orthogonal defect classification and analyzed correlation between fault types and fault triggers, as well as between fault types and impact. What useful information could be derived from cross-correlating those classifications, beyond the information available from each classification alone?
20.4
ODC attributes have been adapted and extended in several ways, one of which is including fault qualifier, which distinguishes whether the fault is due to missing, incorrect, or extraneous code. What attributes might fault qualifier be correlated with?
Required Background
Chapter 4
Basic concepts of quality process, goals, and activities are important for
understanding this chapter.
Chapter 17
Scaffolding is a key cost element of integration testing. Some knowledge about
scaffolding design and implementation is important to fully understand an
essential dimension of integration testing.
21.1 Overview
The traditional V model introduced in Chapter 2 divides testing into four main levels of
granularity: module, integration, system, and acceptance test. Module or unit test
checks module behavior against specifications or expectations; integration test checks
module compatibility; system and acceptance tests check behavior of the whole system
with respect to specifications and user needs, respectively.
An effective integration test is built on a foundation of thorough module testing and
inspection. Module test maximizes controllability and observability of an individual unit,
and is more effective in exercising the full range of module behaviors, rather than just
those that are easy to trigger and observe in a particular context of other modules.
While integration testing may to some extent act as a process check on module testing
(i.e., faults revealed during integration test can be taken as a signal of unsatisfactory unit
testing), thorough integration testing cannot fully compensate for sloppiness at the
module level. In fact, the quality of a system is limited by the quality of the modules and
components from which it is built, and even apparently noncritical modules can have
widespread effects. For example, in 2004 a buffer overflow vulnerability in a single,
widely used library for reading Portable Network Graphics (PNG) files caused security
vulnerabilities in Windows, Linux, and Mac OS X Web browsers and email clients.
On the other hand, some unintended side-effects of module faults may become apparent
only in integration test (see sidebar on page 409), and even a module that satisfies its
interface specification may be incompatible because of errors introduced in design
decomposition. Integration tests therefore focus on checking compatibility between
module interfaces.
Integration faults are ultimately caused by incomplete specifications or faulty
implementations of interfaces, resource usage, or required properties. Unfortunately, it
may be difficult or not cost-effective to anticipate and completely specify all module
interactions. For example, it may be very difficult to anticipate interactions between
remote and apparently unrelated modules through sharing a temporary hidden file that
just happens to be given the same name by two modules, particularly if the name clash
appears rarely and only in some installation configurations. Some of the possible
manifestations of incomplete specifications and faulty implementations are summarized
in Table 21.1.
Table 21.1: Integration faults.
Integration fault: Inconsistent interpretation of parameters or values. Example: Each module's interpretation may be reasonable, but they are incompatible.
The official investigation of the Ariane 5 accident that led to the loss of the rocket on June 4, 1996, concluded that the accident was caused by incompatibility of a software module
with the Ariane 5 requirements. The software module was in charge of computing the
horizontal bias, a value related to the horizontal velocity sensed by the platform that is
calculated as an indicator of alignment precision. The module had functioned correctly
for Ariane 4 rockets, which were smaller than the Ariane 5, and thus had a substantially
lower horizontal velocity. It produced an overflow when integrated into the Ariane 5
software. The overflow started a series of events that terminated with self-destruction of
the launcher. The problem was not revealed during testing because of incomplete
specifications:
The specification of the inertial reference system and the tests performed at
equipment level did not specifically include the Ariane 5 trajectory data.
Consequently the realignment function was not tested under simulated Ariane 5
flight conditions, and the design error was not discovered. [From the official
investigation report]
As with most software problems, integration problems may be attacked at many levels.
Good design and programming practice and suitable choice of design and programming
environment can reduce or even eliminate some classes of integration problems. For
example, in applications demanding management of complex, shared structures,
choosing a language with automatic storage management and garbage collection greatly
reduces memory disposal errors such as dangling pointers and redundant deallocations
("double frees").
Even if the programming language choice is determined by other factors, many errors
can be avoided by choosing patterns and enforcing coding standards across the entire
code base; the standards can be designed in such a way that violations are easy to
detect manually or with tools. For example, many projects using C or C++ require use of
"safe" alternatives to unchecked procedures, such as requiring strncpy or strlcpy (string
copy procedures less vulnerable to buffer overflow) in place of strcpy. Checking for the
mere presence of strcpy is much easier (and more easily automated) than checking for
its safe use. These measures do not eliminate the possibility of error, but integration
testing is more effective when focused on finding faults that slip through these design
measures.
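As a minimal sketch of such a standard in practice (the helper name safe_copy is invented for the example): the unbounded strcpy call is easy to flag mechanically, while the replacement copies at most the destination size and terminates the string explicitly. strlcpy, where available, does both in one call, but it is a BSD extension rather than standard C.

#include <stdio.h>
#include <string.h>

/* Copy 'src' into a fixed-size buffer without risking overflow.
   A coding standard can simply ban strcpy and require this idiom
   (or strlcpy on platforms that provide it). */
static void safe_copy(char *dst, size_t dst_size, const char *src) {
    /* strcpy(dst, src);  <-- banned: overflows dst if src is too long */
    strncpy(dst, src, dst_size - 1);  /* bounded copy                  */
    dst[dst_size - 1] = '\0';         /* strncpy may not terminate     */
}

int main(void) {
    char name[8];
    safe_copy(name, sizeof name, "a deliberately long input string");
    printf("[%s]\n", name);           /* prints a truncated, safe value */
    return 0;
}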
normal Web page requests that arrived on the secure (https) server port:
static void ssl_io_filter_disable(ap_filter_t *f)
{
    bio_filter_in_ctx_t *inctx = f->ctx;
    inctx->ssl = NULL;
    inctx->filter_ctx->pssl = NULL;
}
This code fails to reclaim some dynamically allocated memory, causing the Web
server to "leak" memory at run-time. Over a long period of use, or over a shorter
period if the fault is exploited in a denial-of-service attack, this version of the Apache
Web server will allocate and fail to reclaim more and more memory, eventually
slowing to the point of unusability or simply crashing.
The fault is nearly impossible to see in this code. The memory that should be deallocated here is part of a structure defined and created elsewhere, in the SSL (secure sockets layer) subsystem, written and maintained by a different set of developers. Even reading the definition of the ap_filter_t structure, which occurs in a different part of the Apache Web server source code, doesn't help, since the ctx field is an opaque pointer (type void * in C). The repair, applied in version 2.0.49 of the server, deallocates the structure that this version merely abandons.
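The actual patch is best read in the Apache source repository; the following self-contained sketch only illustrates its general shape, with stand-in types and a stand-in SSL_free replacing the real Apache and OpenSSL declarations. The point is that the structure reached through the opaque context is released before the pointers to it are overwritten, so the memory is no longer leaked; details of the real 2.0.49 code differ.

#include <stdlib.h>

/* Stand-in declarations: the real code uses Apache and OpenSSL types. */
typedef struct { int placeholder; } SSL;
static void SSL_free(SSL *ssl) { free(ssl); }   /* stand-in for OpenSSL's SSL_free */

typedef struct { SSL *pssl; } filter_ctx_t;
typedef struct {
    SSL          *ssl;
    filter_ctx_t *filter_ctx;
} bio_filter_in_ctx_t;

typedef struct { void *ctx; } ap_filter_t;      /* opaque context, as in Apache */

/* Sketch of the repaired filter-disable logic: the owned SSL structure
   is released before the pointers to it are cleared, so nothing leaks. */
static void ssl_io_filter_disable(ap_filter_t *f)
{
    bio_filter_in_ctx_t *inctx = f->ctx;
    SSL_free(inctx->ssl);           /* reclaim the memory first */
    inctx->ssl = NULL;
    inctx->filter_ctx->pssl = NULL;
}

int main(void)
{
    /* Minimal driver so the sketch compiles and runs standalone. */
    filter_ctx_t fctx = { NULL };
    bio_filter_in_ctx_t inctx = { malloc(sizeof(SSL)), &fctx };
    ap_filter_t f = { &inctx };
    inctx.filter_ctx->pssl = inctx.ssl;
    ssl_io_filter_disable(&f);
    return 0;
}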
distinguish two main classes: structural and feature-oriented. In a structural approach, modules are constructed, assembled, and tested together in an order based on hierarchical structure in the design. Structural approaches include bottom-up, top-down, and a combination sometimes referred to as the sandwich or backbone strategy. Feature-oriented strategies derive the order of integration from characteristics of the application, and include threads and critical modules strategies.
Top-down and bottom-up strategies are classic alternatives in system construction and
incremental integration testing as modules accumulate. They consist in sorting modules
according to the use/include relation (see Chapter 15, page 286), and in starting testing
from the top or from the bottom of the hierarchy, respectively.
A top-down integration strategy begins at the top of the uses hierarchy, including the
interfaces exposed through a user interface or top-level application program interface
(API). The need for drivers is reduced or eliminated while descending the hierarchy,
since at each stage the already tested modules can be used as drivers while testing the
next layer. For example, referring to the excerpt of the Chipmunk Web presence shown
in Figure 21.1, we can start by integrating CustomerCare with Customer, while stubbing
Account and Order. We could then add either Account or Order and Package, stubbing
Model and Component in the last case. We would finally add Model, Slot, and
Component in this order, without needing any driver.
Figure 21.1: An excerpt of the class diagram of the Chipmunk Web presence.
Modules are sorted from the top to the bottom according to the use/include relation.
The topmost modules are not used or included in any other module, while the bottommost modules do not include or use other modules.
Bottom-up integration similarly reduces the need to develop stubs, except for breaking
circular relations. Referring again to the example in Figure 21.1, we can start bottom-up
by integrating Slot with Component, using drivers for Model and Order. We can then
incrementally add Model and Order. We can finally add either Package or Account and
Customer, before integrating CustomerCare, without constructing stubs.
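To make the stub and driver vocabulary concrete, here is a tiny self-contained C sketch; the module names (order_total, shipping_cost) are invented for the example and are not the Chipmunk design. When a pricing module is tested before the real shipping module exists, a stub stands in for the callee, and a small driver plays the role of the missing caller.

#include <assert.h>
#include <stdio.h>

/* Stub: stands in for a lower-level module not yet integrated.
   The real shipping module would look up carrier tariffs; the stub
   returns a fixed, predictable value so the caller can be exercised. */
static double shipping_cost(double weight_kg) {
    (void)weight_kg;
    return 5.00;                       /* canned response */
}

/* Module under test: would normally be called from a higher layer. */
static double order_total(double items_total, double weight_kg) {
    return items_total + shipping_cost(weight_kg);
}

/* Driver: stands in for the higher-level caller that has not been
   integrated yet. */
int main(void) {
    assert(order_total(100.0, 2.0) == 105.0);
    printf("order_total behaves as expected with the shipping stub\n");
    return 0;
}

In a top-down order the stub would later be replaced by the real shipping module; in a bottom-up order the driver would be replaced by the real calling module.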
Top-down and bottom-up approaches to integration testing can be applied early in the
development if paired with similar design strategies: If modules are delivered following
the hierarchy, either top-down or bottom-up, they can be integrated and tested as soon
as they are delivered, thus providing early feedback to the developers. Both approaches
increase controllability and diagnosability, since failures are likely caused by interactions
with the newly integrated modules.
In practice, software systems are rarely developed strictly top-down or bottom-up.
Design and integration strategies are driven by other factors, like reuse of existing
modules or commercial off-the-shelf (COTS) components, or the need to develop early
prototypes for user feedback. Integration may combine elements of the two
approaches, starting from both ends of the hierarchy and proceeding toward the middle.
An early top-down approach may result from developing prototypes for early user
feedback, while existing modules may be integrated bottom-up. This is known as the
sandwich or backbone strategy. For example, referring once more to the small system
of Figure 21.1, let us imagine reusing existing modules for Model, Slot, and Component,
and developing CustomerCare and Customer as part of an early prototype. We can
start integrating CustomerCare and Customer top down, while stubbing Account and
Order. Meanwhile, we can integrate bottom-up Model, Slot, and Component with Order,
using drivers for Customer and Package. We can then integrate Account with Customer,
and Package with Order, before finally integrating the whole prototype system.
The price of flexibility and adaptability in the sandwich strategy is complex planning and
monitoring. While top-down and bottom-up are straightforward to plan and monitor, a
sandwich approach requires extra coordination between development and test.
In contrast to structural integration testing strategies, feature-driven strategies select an
order of integration that depends on the dynamic collaboration patterns among modules
regardless of the static structure of the system. The thread integration testing strategy
integrates modules according to system features. Test designers identify threads of
execution that correspond to system features, and they incrementally test each thread.
The thread integration strategy emphasizes module interplay for specific functionality.
Referring to the Chipmunk Web presence, we can identify feature threads for
assembling models, finalizing orders, completing payments, packaging and shipping, and
so on. Feature thread integration fits well with software processes that emphasize
incremental delivery of user-visible functionality. Even when threads do not correspond
to usable end-user features, ordering integration by functional threads is a useful tactic
to make flaws in integration externally visible.
Incremental delivery of usable features is not the only possible consideration in choosing
the order in which functional threads are integrated and tested. Risk reduction is also a
driving force in many software processes. Critical module integration testing focuses on
modules that pose the greatest risk to the project. Modules are sorted and incrementally
integrated according to the associated risk factor that characterizes the criticality of
each module. Both external risks (such as safety) and project risks (such as schedule)
can be considered.
A risk-based approach is particularly appropriate when the development team does not
have extensive experience with some aspect of the system under development.
Consider once more the Chipmunk Web presence. If Chipmunk has not previously
constructed software that interacts directly with shipping services, those interface
modules will be critical because of the inherent risks of interacting with externally
provided subsystems, which may be inadequately documented or misunderstood and
which may also change.
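The ordering step itself is simple once a risk factor has been assigned to each module. The sketch below, with invented module names and scores, just sorts modules by decreasing risk to obtain an integration and test order for critical module testing.

#include <stdio.h>
#include <stdlib.h>

struct module_risk {
    const char *name;
    double      risk;   /* aggregated external + project risk factor */
};

static int by_decreasing_risk(const void *a, const void *b) {
    const struct module_risk *ma = a, *mb = b;
    if (ma->risk < mb->risk) return  1;
    if (ma->risk > mb->risk) return -1;
    return 0;
}

int main(void) {
    /* Invented scores: the shipping-service interface judged riskiest. */
    struct module_risk modules[] = {
        { "ShippingInterface", 0.9 },
        { "Catalog",           0.3 },
        { "PaymentGateway",    0.7 },
        { "CustomerCare",      0.4 },
    };
    int n = sizeof modules / sizeof modules[0];
    qsort(modules, n, sizeof modules[0], by_decreasing_risk);

    printf("integration and test order (most critical first):\n");
    for (int i = 0; i < n; ++i)
        printf("  %d. %s (risk %.1f)\n", i + 1, modules[i].name, modules[i].risk);
    return 0;
}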
Feature-driven test strategies usually require more complex planning and management
than structural strategies. Thus, we adopt them only when their advantages exceed the
extra management costs. For small systems a structural strategy is usually sufficient,
but for large systems feature-driven strategies are usually preferred. Often large
projects require combinations of development strategies that do not fit any single test integration strategy. In these cases, quality managers would combine different
strategies: top-down, bottom-up, and sandwich strategies for small subsystems, and a
blend of threads and critical module strategies at a higher level.
The interface specification of a component should provide all the information required for
reusing the component, including so-called nonfunctional properties such as performance
or capacity limits, in addition to functional behavior. All dependence of the component on
the environment in which it executes should also be specified. In practice, few
component specifications are complete in every detail, and even details that are
specified precisely can easily be overlooked or misunderstood when embedded in a
complex specification document.
The main problem facing test designers in the organization that produces a component is
lack of information about the ways in which the component will be used. A component
may be reused in many different contexts, including applications for which its functionality
is an imperfect fit. A general component will typically provide many more features and
options than are used by any particular application.
A good deal of functional and structural testing of a component, focused on finding and
removing as many program faults as possible, can be oblivious to the context of actual
use. As with system and acceptance testing of complete applications, it is then
necessary to move to test suites that are more reflective of actual use. Testing with
usage scenarios places a higher priority on finding faults most likely to be encountered in
use and is needed to gain confidence that the component will be perceived by its users
(that is, by developers who employ it as part of larger systems) as sufficiently
dependable.
Test designers cannot anticipate all possible uses of a component under test, but they
can design test suites for classes of use in the form of scenarios. Test scenarios are
closely related to scenarios or use cases in requirements analysis and design.
Sometimes different classes of use are clearly evident in the component specification.
For example, the W3 Document Object Model (DOM) specification has parts that deal
exclusively with HTML markup and parts that deal with XML; these correspond to
different uses to which a component implementing the DOM may be put. The DOM
specification further provides two "views" of the component interface. In the flat view, all
traversal and inspection operations are provided on node objects, without regard to
subclass. In the structured view, each subclass of node offers traversal and inspection
operations specific to that variety of node. For example, an Element node has methods
to get and set attributes, but a Text node (which represents simple textual data within
XML or HTML) does not.
future.
Software design for testability is an important factor in the cost and effectiveness of test
and analysis, particularly for module and component integration. To some extent model-based testing (Chapter 14) is progress toward producing modules and components with
well-specified and testable interfaces, but much remains to be done in characterizing
and supporting testability. Design for testability should be an important factor in the
evolution of architectural design approaches and notations, including architecture design
languages.
Further Reading
The buffer overflow problem in libpng, which caused security vulnerabilities in major
Windows, Linux, and Mac OS X Web browsers and e-mail clients, was discovered in
2004 and documented by the United States Computer Emergency Readiness Team
(CERT) in Vulnerability Note VU#388984 [Uni04]. The full report on the famous Ariane 5
failure [Lio96] is available from several sources on the Web. The NASA report on loss of
the Mars Climate Orbiter [Ste99] is also available on the Web. Leveson [Lev04]
describes the role of software in the Ariane failure, loss of the Mars Climate Orbiter, and
other spacecraft losses. Weyuker [Wey98] describes challenges of testing component-based systems.
Exercises
21.1
When developing a graphical editor, we used a COTS component for saving and
reading files in XML format. During integration testing, the program failed when
reading an empty file and when reading a file containing a syntax error.
Try to classify the corresponding faults according to the taxonomy described in
Table 21.1.
21.2
The Chipmunk quality team decided to use both thread and critical module integration testing strategies for the Chipmunk Web presence. Envisage at least one situation in which thread integration should be preferred over critical module and one in which critical module testing should be preferred over thread, and motivate the choice.
21.3
Can a backbone testing strategy yield savings in the cost of producing test scaffolding, relative to other structural integration testing strategies? If so, how and under what conditions? If not, why not?
[1] The term component is used loosely and often inconsistently in different contexts. Our
working definition and related terms are explained in the sidebar on page 414.
Required Background
Chapter 4
The concepts of dependability, reliability, availability and mean time to failure are
important for understanding the difference between system and acceptance
testing.
Chapter 17
Generating reusable scaffolding and test cases is a foundation for regression
testing. Some knowledge about the scaffolding and test case generation
problem, though not strictly required, may be useful for understanding regression
testing problems.
22.1 Overview
System, acceptance, and regression testing are all concerned with the behavior of a
software system as a whole, but they differ in purpose.
System testing is a check of consistency between the software system and its
specification (it is a verification activity). Like unit and integration testing, system testing
is primarily aimed at uncovering faults, but unlike testing activities at finer granularity
levels, system testing focuses on system-level properties. System testing together with
acceptance testing also serves an important role in assessing whether a product can be
released to customers, which is distinct from its role in exposing faults to be removed to
improve the product.
System, acceptance, and regression testing differ in what they check against and in who performs them; system test, for example, checks the product against its requirements specifications and is performed by the development test group.
critical property with many opportunities for failure, the system architecture and build plan for the Chipmunk Web presence were structured with interfaces that could be
artificially driven through various scenarios early in development, and with several of the
system test scenarios simulated in earlier integration tests.
The appropriate notions of thoroughness in system testing are with respect to the
system specification and potential usage scenarios, rather than code or design. Each
feature or specified behavior of the system should be accounted for in one or several
test cases. In addition to facilitating design for test, designing system test cases
together with the system requirements specification document helps expose ambiguity
and refine specifications.
The set of feature tests passed by the current partial implementation is often used as a
gauge of progress. Interpreting a count of failing feature-based system tests is
discussed in Chapter 20, Section 20.6.
Additional test cases can be devised during development to check for observable
symptoms of failures that were not anticipated in the initial system specification. They
may also be based on failures observed and reported by actual users, either in
acceptance testing or from previous versions of a system. These are in addition to a
thorough specification-based test suite, so they do not compromise independence of the
quality assessment.
Some system properties, including performance properties like latency between an
event and system response and reliability properties like mean time between failures,
are inherently global. While one certainly should aim to provide estimates of these
properties as early as practical, they are vulnerable to unplanned interactions among
parts of a complex system and its environment. The importance of such global
properties is therefore magnified in system testing.
Global properties like performance, security, and safety are difficult to specify precisely
and operationally, and they depend not only on many parts of the system under test, but
also on its environment and use. For example, U.S. HIPAA regulations governing privacy
of medical records require appropriate administrative, technical, and physical safeguards
to protect the privacy of health information, further specified as follows:
Implementation specification: safeguards. A covered entity must reasonably
safeguard protected health information from any intentional or unintentional use or
disclosure that is in violation of the standards, implementation specifications or other
requirements of this subpart. [Uni00, sec. 164.530(c)(2)]
It is unlikely that any precise operational specification can fully capture the HIPAA
requirement as it applies to an automated medical records system. One must consider
the whole context of use, including, for example, which personnel have access to the
system and how unauthorized personnel are prevented from gaining access.
Some global properties may be defined operationally, but parameterized by use. For
example, a hard-real-time system must meet deadlines, but cannot do so in a
completely arbitrary environment; its performance specification is parameterized by
event frequency and minimum inter-arrival times. An e-commerce system may be
expected to provide a certain level of responsiveness up to a certain number of
transactions per second and to degrade gracefully up to a second, higher rate. A key step is
identifying the "operational envelope" of the system, and testing both near the edges of
that envelope (to assess compliance with specified goals) and well beyond it (to ensure
the system degrades or fails gracefully). Defining borderline and extreme cases is
logically part of requirements engineering, but as with precise specification of features,
test design often reveals gaps and ambiguities.
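A sketch of probing the operational envelope: the driver below steps the offered load upward and reports where the measured response time first exceeds the specified limit. Everything specific here is invented for illustration; in particular, mean_latency_ms is a stand-in for actually driving the system under test at the given rate, replaced here by a synthetic model so the sketch runs standalone.

#include <stdio.h>

/* Stand-in for running a timed load experiment against the real system:
   a synthetic model in which latency grows sharply past 800 req/s. */
static double mean_latency_ms(int requests_per_second) {
    double base = 40.0;
    if (requests_per_second <= 800)
        return base + requests_per_second * 0.01;
    return base + 8.0 + (requests_per_second - 800) * 0.5;   /* degradation */
}

int main(void) {
    const double limit_ms = 100.0;    /* specified responsiveness goal       */
    int envelope_edge = -1;           /* highest rate still meeting the goal */

    /* Step the load from well inside the envelope to well beyond it. */
    for (int rate = 100; rate <= 1500; rate += 100) {
        double latency = mean_latency_ms(rate);
        printf("%4d req/s -> %6.1f ms %s\n", rate, latency,
               latency <= limit_ms ? "ok" : "exceeds limit");
        if (latency <= limit_ms) envelope_edge = rate;
    }
    printf("operational envelope edge near %d req/s for a %.0f ms goal\n",
           envelope_edge, limit_ms);
    return 0;
}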
Not all global properties will be amenable to dynamic testing at all, at least in the
conventional sense. One may specify a number of properties that a secure computer
system should have, and some of these may be amenable to testing. Others can be
addressed only through inspection and analysis techniques, and ultimately one does not
trust the security of a system at least until an adversarial team has tried and failed to
subvert it. Similarly, there is no set of test cases that can establish software safety, in
part because safety is a property of a larger system and environment of which the
software is only part. Rather, one must consider the safety of the overall system, and
assess aspects of the software that are critical to that overall assessment. Some but
not all of those claims may be amenable to testing.
Testing global system properties may require extensive simulation of the execution
environment. Creating accurate models of the operational environment requires
substantial human resources, and executing them can require substantial time and
machine resources. Usually this implies that "stress" testing is a separate activity from
frequent repetition of feature tests. For example, a large suite of system test cases
might well run each night or several times a week, but a substantial stress test to
measure robust performance under heavy load might take hours to set up and days or
weeks to run.
A test case that can be run automatically with few human or machine resources should
generally focus on one purpose: to make diagnosis of failed test executions as clear and
simple as possible. Stress testing alters this: If a test case takes an hour to set up and
a day to run, then one had best glean as much information as possible from its results.
This includes monitoring for faults that should, in principle, have been found and
eliminated in unit and integration testing, but which become easier to recognize in a
stress test (and which, for the same reason, are likely to become visible to users). For
example, several embedded system products ranging from laser printers to tablet
computers have been shipped with slow memory leaks that became noticeable only
after hours or days of continuous use. In the case of the tablet PC whose character
recognition module gradually consumed all system memory, one must wonder about the
extent of stress testing the software was subjected to.
Module test, integration test, and system test compared:

Module test - Checks against: module specifications. Scaffolding required: potentially complex, to simulate the activation environment (drivers), the modules called by the module under test (stubs), and test oracles. Focus on: behavior of individual modules.

Integration test - Checks against: architecture and design specifications. Scaffolding required: depends on architecture and integration order; modules and subsystems can be incrementally integrated to reduce the need for drivers and stubs. Focus on: module integration and interaction.

System test - Checks against: requirements specification. Scaffolding required: mostly limited to test oracles, since the whole system should not require additional drivers or stubs to be executed; sometimes includes a simulated execution environment (e.g., for embedded systems). Focus on: system functionality.
22.4 Usability
A usable product is quickly learned, allows users to work efficiently, and is pleasant to
use. Usability involves objective criteria such as the time and number of operations
required to perform tasks and the frequency of user error, in addition to the overall,
subjective satisfaction of users.
For test and analysis, it is useful to distinguish attributes that are uniquely associated
with usability from other aspects of software quality (dependability, performance,
security, etc.). Other software qualities may be necessary for usability; for example, a
program that often fails to satisfy its functional requirements or that presents security
holes is likely to suffer poor usability as a consequence. Distinguishing primary usability
properties from other software qualities allows responsibility for each class of properties
to be allocated to the most appropriate personnel, at the most cost-effective points in
the project schedule.
Even if usability is largely based on user perception and thus is validated based on user
feedback, it can be verified early in the design and through the whole software life cycle.
The process of verifying and validating usability includes the following main steps:
Inspecting specifications with usability checklists. Inspection provides early feedback
on usability.
Testing early prototypes with end users to explore their mental model (exploratory
test), evaluate alternatives (comparison test), and validate software usability. A
prototype for early assessment of usability may not include any functioning software; a
cardboard prototype may be as simple as a sequence of static images presented to
users by the usability tester.
Testing incremental releases with both usability experts and end users to monitor
progress and anticipate usability problems.
System and acceptance testing that includes expert-based inspection and testing,
user-based testing, comparison testing against competitors, and analysis and checks
often done automatically, such as a check of link connectivity and verification of browser
compatibility.
User-based testing (i.e., testing with representatives of the actual end-user population)
is particularly important for validating software usability. It can be applied at different
stages, from early prototyping through incremental releases of the final system, and can
be used with different goals: exploring the mental model of the user, evaluating design
alternatives, and validating against established usability requirements and standards.
The purpose of exploratory testing is to investigate the mental model of end users. It
consists of asking users about their approach to interactions with the system. For
example, during an exploratory test for the Chipmunk Web presence, we may provide
users with a generic interface for choosing the model they would like to buy, in order to
understand how users will interact with the system. A generic interface could present
information about all laptop computer characteristics uniformly to see which are
examined first by the sample users, and thereby to determine the set of characteristics
that should belong to the summary in the menu list of laptops. Exploratory test is usually
performed early in design, especially when designing a system for a new target
population.
The purpose of comparison testing is evaluating options. It consists of observing user
reactions to alternative interaction patterns. During comparison test we can, for
example, provide users with different facilities to assemble the desired Chipmunk laptop
configuration, and to identify patterns that facilitate users' interactions. Comparison test
is usually applied when the general interaction patterns are clear and need to be refined.
It can substitute for exploratory testing if initial knowledge about target users is sufficient
to construct a range of alternatives, or otherwise follows exploratory testing.
The purpose of validation testing is assessing overall usability. It includes identifying
difficulties and obstacles that users encounter while interacting with the system, as well
as measuring characteristics such as error rate and time to perform a task.
A well-executed design and organization of usability testing can produce results that are
objective and accurately predict usability in the target user population. The usability test
design includes selecting suitable representatives of the target users and organizing
sessions that guide the test toward interpretable results. A common approach is divided
into preparation, execution, and analysis phases. During the preparation phase, test
designers define the objectives of the session, identify the items to be tested, select a
representative population of end users, and plan the required actions. During execution,
users are monitored as they execute the planned actions in a controlled environment.
During analysis, results are evaluated, and changes to the software interfaces or new
testing sessions are planned, if required.
Each phase must be carefully executed to ensure success of the testing session. User
time is a valuable and limited resource. Well-focused test objectives should not be too
narrow, to avoid useless waste of resources, nor too wide, to avoid scattering
resources without obtaining useful data. Focusing on specific interactions is usually more
effective than attempting to assess the usability of a whole program at once. For
example, the Chipmunk usability test team independently assesses interactions for
catalog browsing, order definition and purchase, and repair service.
The larger the population sample, the more precise the results, but the cost of very large
samples is prohibitive; selecting a small but representative sample is therefore critical. A
good practice is to identify homogeneous classes of users and select a set of
representatives from each class. Classes of users depend on the kind of application to
be tested and may be categorized by role, social characteristics, age, and so on. A
typical compromise between cost and accuracy for a well-designed test session is five
users from a single class of homogeneous users, four users from each of two classes,
or three users for each of three or more classes. Questionnaires should be prepared for
the selected users to verify their membership in their respective classes. Some
approaches also assign a weight to each class, according to their importance to the
business. For example, Chipmunk can identify three main classes of users: individual,
business, and education customers. Each of the main classes is further divided.
Individual customers are distinguished by education level; business customers by role;
and academic customers by size of the institution. Altogether, six putatively
homogeneous classes are obtained: Individual customers with and without at least a
bachelor degree, managers and staff of commercial customers, and customers at small
and large education institutions.
Users are asked to execute a planned set of actions that are identified as typical uses of
the tested feature. For example, the Chipmunk usability assessment team may ask
users to configure a product, modify the configuration to take advantage of some special
offers, and place an order with overnight delivery.
Users should perform tasks independently, without help or influence from the testing
staff. User actions are recorded, and comments and impressions are collected with a
post-activity questionnaire. Activity monitoring can be very simple, such as recording
sequences of mouse clicks to perform each action. More sophisticated monitoring can
include recording mouse or eye movements. Timing should also be recorded and may
sometimes be used for driving the sessions (e.g., fixing a maximum time for the session
or for each set of actions).
An important aspect of usability is accessibility to all users, including those with
disabilities. Accessibility testing is legally required in some application domains. For
example, some governments impose specific accessibility rules for Web applications of
public institutions. The Web Content Accessibility Guidelines (WCAG) defined by the World Wide Web Consortium are becoming an important standard reference. The
WCAG guidelines are summarized in the sidebar on page 426.
Web Content Accessibility Guidelines (WCAG)[a]
1. Provide equivalent alternatives to auditory and visual content that convey
essentially the same function or purpose.
2. Ensure that text and graphics are understandable when viewed without
color.
3. Mark up documents with the proper structural elements, controlling
presentation with style sheets rather than presentation elements and
attributes.
[a] Excerpted from the W3C Web Content Accessibility Guidelines.
same partition are mutually redundant with respect to functional criteria. Redundant test
cases may be introduced in the test suites due to concurrent work of different test
designers or to changes in the code. Redundant test cases do not reduce the overall
effectiveness of tests, but they affect the cost-benefit trade-off: They are unlikely to
reveal faults, but they augment the costs of test execution and maintenance. Obsolete
test cases are removed because they are no longer useful, while redundant test cases
are kept because they may become helpful in successive versions of the software.
Good test documentation is particularly important. As we will see in Chapter 24, test
specifications define the features to be tested, the corresponding test cases, the inputs
and expected outputs, as well as the execution conditions for all cases, while reporting
documents indicate the results of the test executions, the open faults, and their relation
to the test cases. This information is essential for tracking faults and for identifying test
cases to be reexecuted after fault removal.
program model on which the version comparison is based (e.g., control flow or data flow
graph models).
Control flow graph (CFG) regression techniques are based on the differences between
the CFGs of the new and old versions of the software. Let us consider, for example, the
C function cgi_decode from Chapter 12. Figure 22.1 shows the original function as
presented in Chapter 12, while Figure 22.2 shows a revision of the program. We refer to
these two versions as 1.0 and 2.0, respectively. Version 2.0 adds code to fix a fault in
interpreting hexadecimal sequences %xy. The fault was revealed by testing version 1.0
with input terminated by an erroneous subsequence %x, causing version 1.0 to read
past the end of the input buffer and possibly overflow the output buffer. Version 2.0
contains a new branch to map the unterminated sequence to a question mark.
 1  #include "hex_values.h"
 2  /** Translate a string from the CGI encoding to plain ascii text:
 3   *  '+' becomes space, %xx becomes byte with hex value xx,
 4   *  other alphanumeric characters map to themselves.
 5   *  Returns 0 for success, positive for erroneous input
 6   *    1 = bad hexadecimal digit
 7   */
 8  int cgi_decode(char *encoded, char *decoded) {
 9      char *eptr = encoded;
10      char *dptr = decoded;
11      int ok = 0;
12      while (*eptr) {
13          char c;
14          c = *eptr;
15          if (c == '+') {          /* Case 1: '+' maps to blank */
16              *dptr = ' ';
17          } else if (c == '%') {   /* Case 2: '%xx' is hex for character xx */
18              int digit_high = Hex_Values[*(++eptr)]; /* note: illegal digits map to -1 */
19              int digit_low = Hex_Values[*(++eptr)];
20              if (digit_high == -1 || digit_low == -1) {
21                  /* *dptr = '?'; */
22                  ok = 1;          /* Bad return code */
23              } else {
24                  *dptr = 16 * digit_high + digit_low;
25              }
26          } else {                 /* Case 3: Other characters map to themselves */
27              *dptr = *eptr;
28          }
29          ++dptr;
30          ++eptr;
31      }
32      *dptr = '\0';
33      return ok;
34  }
Figure 22.1: Version 1.0 of the C function cgi_decode, as presented in Chapter 12.
31              *dptr = *eptr;
32          }
33          if (! isascii(*dptr)) {  /* Produce only legal ascii */
34              *dptr = '?';
35              ok = 1;
36          }
37          ++dptr;
38          ++eptr;
39      }
40      *dptr = '\0';                /* Null terminator for string */
41      return ok;
42  }
Figure 22.2: Version 2.0 of the C function cgi_decode adds a control on hexadecimal
escape sequences to reveal incorrect escape sequences at the end of the input string
and a new branch to deal with non-ASCII characters.
Let us consider all structural test cases derived for cgi_decode in Chapter 12, and
assume we have recorded the paths exercised by the different test cases as shown in
Figure 22.3. Recording paths executed by test cases can be done automatically with
modest space and time overhead, since what must be captured is only the set of
program elements exercised rather than the full history.
Id     Test case                         Path
TC1    ""                                A B M
TC2    "test+case%1Dadequacy"            A B C D F L B M
TC3    "adequate+test%0Dexecution%7U"    A B C D F L B M
TC4    "%3D"                             A B C D G H L B M
TC5    "%A"                              A B C D G I L B M
TC6    "a+b"                             A B C D F L B C E L B C D F L B M
TC7    "test"                            A B C D F L B C D F L B C D F L B C D F L B M
TC8    "+%0D+%4J"                        A B C E L B C D G I L B M
TC9    "first+test%9Ktest%K9"            A B C D F L B M
Figure 22.3: Paths covered by the structural test cases derived for version 1.0 of
function cgi_decode. Paths are given referring to the nodes of the control flow graph
of Figure 22.4.
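As a concrete illustration of this recording step, the following minimal sketch (in Java, with invented class and method names) stores only the set of node labels exercised by each test case; instrumentation probes call hit at each CFG node, and the test driver brackets each test case with startTest.

import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Records, for each test case, the set of CFG nodes it exercised
 *  (only the set of elements, not the full execution history). */
public class CoverageRecorder {
    private final Map<String, Set<String>> nodesByTest = new LinkedHashMap<>();
    private String currentTest;

    /** Called by the test driver before each test case. */
    public void startTest(String testId) {
        currentTest = testId;
        nodesByTest.put(testId, new LinkedHashSet<>());
    }

    /** Called by a probe inserted at each CFG node, e.g. hit("D"). */
    public void hit(String nodeLabel) {
        nodesByTest.get(currentTest).add(nodeLabel);
    }

    /** Test case id mapped to its set of exercised node labels, as in Figure 22.3. */
    public Map<String, Set<String>> coverage() {
        return nodesByTest;
    }
}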
CFG regression testing techniques compare the annotated control flow graphs of the
two program versions to identify a subset of test cases that traverse modified parts of
the graphs. The graph nodes are annotated with corresponding program statements, so
that comparison of the annotated CFGs detects not only new or missing nodes and
arcs, but also nodes whose changed annotations correspond to small, but possibly
relevant, changes in statements.
The CFG for version 2.0 of cgi_decode is given in Figure 22.4. Differences between
version 2.0 and 1.0 are indicated in gray. In the example, we have new nodes, arcs and
paths. In general, some nodes or arcs may be missing (e.g., when part of the program
is removed in the new version), and some other nodes may differ only in the annotations
(e.g., when we modify a condition in the new version). CFG criteria select all test cases
that exercise paths through changed portions of the CFG, including CFG structure
changes and node annotations. In the example, we would select all test cases that pass
through node D and proceed toward node G and all test cases that reach node L, that
is, all test cases except TC1. In this example, the criterion is not very effective in
reducing the size of the test suite because modified statements affect almost all paths.
Figure 22.4: The control flow graph of function cgi_decode version 2.0. Gray
background indicates the changes from the former version.
If we consider only the corrective modification (nodes X and Y ), the criterion is more
effective. The modification affects only the paths that traverse the edge between D and
G, so the CFG regression testing criterion would select only test cases traversing those
nodes (i.e., TC2, TC3, TC4, TC5, TC8 and TC9). In this case the size of the test suite
to be reexecuted includes two-thirds of the test cases of the original test suite.
In general, the CFG regression testing criterion is effective only when the changes affect
a relatively small subset of the paths of the original program, as in the latter case. It
becomes almost useless when the changes affect most paths, as in version 2.0.
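Over such recorded coverage, the selection step itself is a simple intersection test, as in the sketch below (hypothetical names; the changed-element set is assumed to be obtained by comparing the annotated CFGs of the two versions and mapping the differences back to elements of the old graph). With a changed set containing nodes D and L it selects every test case of Figure 22.3 except TC1, matching the discussion above.

import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Selects the test cases whose recorded node sets touch at least one
 *  element of the old CFG affected by the change. */
public class CfgRegressionSelector {
    public static Set<String> select(Map<String, Set<String>> coverage,
                                     Set<String> changedElements) {
        Set<String> selected = new LinkedHashSet<>();
        for (Map.Entry<String, Set<String>> entry : coverage.entrySet()) {
            for (String element : entry.getValue()) {
                if (changedElements.contains(element)) {
                    selected.add(entry.getKey());   // re-execute this test case
                    break;
                }
            }
        }
        return selected;
    }
}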
Data flow (DF) regression testing techniques select test cases for new and modified
pairs of definitions with uses (DU pairs, cf. Sections 6.1, page 77 and 13.2, page 236).
DF regression selection techniques reexecute test cases that, when executed on the
original program, exercise DU pairs that were deleted or modified in the revised
program. Test cases that executed a conditional statement whose predicate was altered
are also selected, since the changed predicate could alter some old definition-use
associations. Figure 22.5 shows the new definitions and uses introduced by
modifications to cgi_decode.[1] These new definitions and uses introduce new DU pairs
and remove others.
(Table: the new definitions and uses of the variables *eptr, eptr, dptr, and ok at the nodes W, Y, and Z added in version 2.0.)
Figure 22.5: Definitions and uses introduced by changes in cgi_decode. Labels refer
to the nodes in the control flow graph of Figure 22.4.
In contrast to code-based techniques, specification-based test selection techniques do
not require recording the control flow paths executed by tests. Regression test cases
can be identified from correspondence between test cases and specification items. For
example, when using category partition, test cases correspond to sets of choices, while
in finite state machine model-based approaches, test cases cover states and transitions.
Where test case specifications and test data are generated automatically from a
specification or model, generation can simply be repeated each time the specification or
model changes.
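The same selection pattern carries over when test cases are traced to specification items instead of code elements. The sketch below (illustrative names) assumes each test case records the category/choice pairs it was derived from and selects every test whose choices overlap the changed specification items; for the example developed next, a changed choice such as "Ship where: Int" would select TC-1 and TC-5.

import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Specification-based regression selection: re-run every test case whose
 *  derivation touches a changed specification item (e.g., a category choice). */
public class SpecBasedSelector {
    public static Set<String> select(Map<String, List<String>> choicesByTest,
                                     Set<String> changedChoices) {
        Set<String> selected = new LinkedHashSet<>();
        choicesByTest.forEach((testId, choices) -> {
            if (choices.stream().anyMatch(changedChoices::contains)) {
                selected.add(testId);
            }
        });
        return selected;
    }
}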
Code-based regression test selection criteria can be adapted for model-based
regression test selection. Consider, for example, the control flow graph derived from the
process shipping order specification in Chapter 14. We add the following item to that
specification:
Restricted countries A set of restricted destination countries is maintained, based on
current trade restrictions. If the shipping address contains a restricted destination
country, only credit card payments are accepted for that order, and shipping proceeds
only after approval by a designated company officer responsible for checking that the
goods ordered may be legally exported to that country.
The new requirement can be added to the flow graph model of the specification as
illustrated in Figure 22.6.
Figure 22.6: A flow graph model of the specification of the shipping order
functionality presented in Chapter 14, augmented with the "restricted country"
requirement. The changes in the flow graph are indicated in black.
We can identify regression test cases with the CFG criterion that selects all cases that
correspond to international shipping addresses (i.e., test cases TC-1 and TC-5 from the
following table). The table corresponds to the functional test cases derived using the
method described in Chapter 14 on page 259.
Case   Too small   Ship where   Ship method   Cust type   Pay method   Same addr   CC valid
TC-1   No          Int          Air           Bus         CC           No          Yes
TC-2   No          Dom          Land          -           -            -           -
TC-3   Yes         -            -             -           -            -           -
TC-4   No          Dom          Air           -           -            -           -
TC-5   No          Int          Land          -           -            -           -
TC-6   No          -            -             Edu         Inv          -           -
TC-7   No          -            -             -           CC           Yes         -
TC-8   No          -            -             -           CC           -           No (abort)
TC-9   No          -            -             -           CC           -           No (no abort)
Models derived for testing can be used not only for selecting regression test cases, but
also for generating test cases for the new code. In the preceding example, we can use
the model not only to identify the test cases that should be reused, but also to generate
new test cases for the new functionality, following the combinatorial approaches
described in Chapter 11.
[1] When dealing with arrays, we follow the criteria discussed in Chapter 13: A change of
an array value is a definition of the array and a use of the index. A use of an array value
is a use of both the array and the index.
Regression test selection based on analysis of source code is now well studied. There remains need and opportunity for improvement in techniques that give up
the safety guarantee (selecting all test cases that might be affected by a software
change) to obtain more significant test suite reductions. Specification-based regression
test selection is a promising avenue of research, particularly as more systems
incorporate components without full source code.
Increasingly ubiquitous network access is blurring the once-clear lines between alpha
and beta testing and opening possibilities for gathering much more information from
execution of deployed software. We expect to see advances in approaches to gathering
information (both from failures and from normal execution) as well as exploiting
potentially large amounts of gathered information. Privacy and confidentiality are
important research challenges in postdeployment monitoring.
Further Reading
Musa [Mus04] is a guide to reliability engineering from a pioneer in the field; ongoing
research appears in the International Symposium on Software Reliability Engineering
(ISSRE) conference series. Graves et al. [GHK+98] and Rothermel and Harrold [RH97]
provide useful overviews of selective regression testing. Kim and Porter [KP02] describe
history-based test prioritization. Barnum [Bar01] is a well-regarded text on usability
testing; Nielsen [Nie00] is a broader popular introduction to usability engineering, with a
chapter on usability testing.
Exercises
22.1 Consider the Chipmunk Computer Web presence. Define at least one test case that may serve both during final integration and early system testing, at least one that serves only as an integration test case, and at least one that is more suitable as a system test case than as a final integration test case. Explain your choices.

22.2 When and why should testing responsibilities shift from the development team to an independent quality team? In what circumstances might using an independent quality team be impractical?

22.3 Identify some kinds of properties that cannot be efficiently verified with system testing, and indicate how you would verify them.

22.4 Provide two or more examples of resource limitations that may impact system test more than module and integration test. Explain the difference in impact.

22.5 Consider the following required property of the Chipmunk Computer Web presence:

Customers should perceive that purchasing a computer using the Chipmunk Web presence is at least as convenient, fast, and intuitive as purchasing a computer in an off-line retail store.

Would you check it as part of system or acceptance testing? Reformulate the property to allow test designers to check it in a different testing phase (system testing, if you consider the property checkable as part of acceptance testing, or vice versa).
Required Background
Chapter 20
Some knowledge about planning and monitoring, though not strictly required, can
be useful to understand the need for automated management support.
Chapter 17
Some knowledge about execution and scaffolding is useful to appreciate the
impact of tools for scaffolding generation and test execution.
Chapter 19
Some knowledge about program analysis is useful to understand the need to
automate analysis techniques.
23.1 Overview
A rational approach to automating test and analysis proceeds incrementally, prioritizing
the next steps based on variations in potential impact, variations in the maturity, cost,
and scope of the applicable technology, and fit and impact on the organization and
process. The potential role of automation in test and analysis activities can be
considered along three nonorthogonal dimensions: the value of the activity and its current
cost, the extent to which the activity requires or is made less expensive by automation,
and the cost of obtaining or constructing tool support.
Some test and analysis tasks depend so heavily on automation that a decision to employ
a technique is tantamount to a decision to use tools. For example, employing structural
coverage criteria in program testing necessarily means using coverage measurement
tools. In other cases, an activity may be carried out manually, but automation reduces
cost or improves effectiveness. For example, tools for capturing and replaying
executions reduce the costs of reexecuting test suites and enable testing strategies that
would be otherwise impractical. Even tasks that appear to be inherently manual may be
enhanced with automation. For example, although software inspection is a manual
activity at its core, a variety of tools have been developed to organize and present
information and manage communication for software inspection, improving the efficiency
of inspectors.
The difficulty and cost of automating test and analysis vary enormously, ranging from
tools that are so simple to develop that they are justifiable even if their benefits are
modest to tools that would be enormously valuable but are simply impossible. For
example, if we have specification models structured as finite state machines, automatic
generation of test case specifications from the finite state model is a sufficiently simple
and well-understood technique that obtaining or building suitable tools should not be an
obstacle. At the other extreme, as we have seen in Chapter 2, many important problems
regarding programs are undecidable. For example, no matter how much value we might
derive from a tool that infallibly distinguishes executable from nonexecutable program
paths, no such tool can exist. We must therefore weigh the difficulty or expense of
automation together with potential benefits, including costs of training and integration.
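To make the first of these points concrete, the sketch below derives one test sequence per transition from a finite state model held as a nested map: a breadth-first search finds a shortest input sequence reaching the transition's source state, and the transition's input is appended. The class and representation are invented for illustration, but the simplicity of the algorithm is the point: this is why generation from finite state models is considered well within reach of modest tool-building effort.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Derives one input sequence per transition of a finite state machine,
 *  covering every transition at least once. */
public class FsmTestGenerator {

    /** transitions.get(state).get(input) = next state */
    public static List<List<String>> transitionCoverage(
            String start, Map<String, Map<String, String>> transitions) {
        List<List<String>> suite = new ArrayList<>();
        for (String source : transitions.keySet()) {
            for (String input : transitions.get(source).keySet()) {
                List<String> prefix = shortestPath(start, source, transitions);
                if (prefix == null) continue;          // source state unreachable
                List<String> sequence = new ArrayList<>(prefix);
                sequence.add(input);                   // exercise the transition itself
                suite.add(sequence);
            }
        }
        return suite;
    }

    /** Breadth-first search for a shortest input sequence from one state to another. */
    private static List<String> shortestPath(
            String from, String to, Map<String, Map<String, String>> transitions) {
        Map<String, List<String>> pathTo = new HashMap<>();
        pathTo.put(from, new ArrayList<>());
        ArrayDeque<String> queue = new ArrayDeque<>();
        queue.add(from);
        Set<String> visited = new HashSet<>();
        visited.add(from);
        while (!queue.isEmpty()) {
            String state = queue.remove();
            if (state.equals(to)) return pathTo.get(state);
            Map<String, String> out = transitions.getOrDefault(state, Collections.emptyMap());
            for (Map.Entry<String, String> t : out.entrySet()) {
                if (visited.add(t.getValue())) {
                    List<String> path = new ArrayList<>(pathTo.get(state));
                    path.add(t.getKey());
                    pathTo.put(t.getValue(), path);
                    queue.add(t.getValue());
                }
            }
        }
        return null;                                   // no path found
    }
}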
Difficulty and cost are typically entangled with scope and accuracy. Sometimes a
general-purpose tool (e.g., capture and replay for Windows applications) is only
marginally more difficult to produce than a tool specialized for one project (e.g., capture
and replay for a specific Windows application). Investment in the general-purpose tool,
whether to build it or to buy it, can be amortized across projects. In other cases, it may
be much more cost-effective to create simple, project-specific tools that sidestep the
complexity of more generic tools.
However industrious and well-intentioned, humans are slow and error-prone when
dealing with repetitive tasks. Conversely, simple repetitive tasks are often
straightforward to automate, while judgment and creative problem solving remain outside
the domain of automation. Human beings are very good at identifying the relevant
execution scenarios that correspond to test case specifications (for example, by
specifying the execution space of the program under test with a finite state machine),
but are very inefficient in generating large volumes of test cases (for example, by
clicking combinations of menus in graphic interfaces), or identifying erroneous results
within a large set of outputs produced when executing regression tests. Automating the
repetitive portions of the task not only reduces costs, but improves accuracy as well.
The simplest measures are variants of program size: Lines (all source lines), LOC (lines of code, excluding comments and blank lines), eLOC (effective lines of code), and lLOC (logical lines of code, roughly the number of statements).
Every programmer knows that there are variations in complexity between different
pieces of code and that this complexity may be as important as sheer size. A number of
attempts have been made to quantify aspects of complexity and readability, among them
CDENS (comment density), Blocks (number of basic blocks), AveBlockL (average basic block length), NEST (control structure nesting), Loops (number of loops), LCSAJ (linear code sequences and jumps), BRANCH (number of branches), and cyclomatic complexity.
Size and complexity may also be estimated on a coarser scale, considering only
interfaces between units, with measures such as FRet and IComplex.
All these metrics are proxies for size and complexity. Despite several attempts beginning
in the 1970s, no proposed metric has succeeded in capturing intrinsic complexity in a
manner that robustly correlates with effort or quality. Lines of code, despite its obvious
shortcomings, is not much worse than other measures of size. Among attempts to
measure complexity, only cyclomatic complexity (V(g)) is still commonly collected by
many tools (see sidebar). Cyclomatic complexity is defined as the number of
independent paths through the control flow graph.
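By the usual computation, a control flow graph with e edges, n nodes, and a single connected component has V(g) = e - n + 2; for structured code whose decisions are all binary, this is simply the number of decision points plus one, so a straight-line function has V(g) = 1 and each additional if or loop condition raises it by one.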
Commonly collected object-oriented metrics include:
WMC (weighted methods per class): the sum of the complexities of the methods in all classes, divided by the number of classes. This metric is parametric with respect to a measure of complexity in methods.
DIT (depth of inheritance tree): the depth of a class in the inheritance hierarchy.
NOC (number of children): the number of immediate subclasses of a class.
RFC (response for a class): the number of methods that can be invoked in response to a message to an object of the class.
CBO (coupling between object classes): the number of classes with which the class is coupled through any relation (e.g., containment, method calls, subclassing).
LCOM (lack of cohesion of methods): the degree to which the methods of a class are unrelated to one another.
All metrics discussed so far focus on code structure and can be measured only when the
code is available, often late in the development process. A subset of the object-oriented
metrics can be derived from detailed design, which still may be too late for many
purposes.
Many standards define metrics. The well-known ISO/IEC 9126 standard (sidebar on
page 446) suggests a hierarchy of properties to measure the quality of software. The
six main high-level quality dimensions identified by the ISO/IEC 9126 standard describe
quality properties as perceived by users.
Functionality: Suitability, Accuracy, Interoperability, Security
Reliability: Maturity, Fault Tolerance, Recoverability
Usability: Understandability, Learnability, Operability, Attractiveness
Efficiency: Time Behavior, Resource Utilization
Maintainability: Analyzability, Changeability, Stability, Testability
Portability: Adaptability, Installability, Co-existence, Replaceability
Automated analysis is effective both for quickly and cheaply checking simple properties,
and for more expensive checks that are necessary for critical properties that resist
cheaper forms of verification. For example, simple data flow analyses can almost
instantaneously identify anomalous patterns (e.g., computing a value that is never used)
that are often symptoms of other problems (perhaps using the wrong value at a different
point in a program). At the other extreme, using a finite state verification tool to find
subtle synchronization faults in interface protocols requires a considerable investment in
constructing a model and formalizing the properties to be verified, but this effort is
justified by the cost of failure and the inadequacy of conventional testing to find timing-dependent faults.
It may be practical to verify some critical properties only if the program to be checked
conforms to certain design rules. The problem of verifying critical properties is then
decomposed into a design step and a proof step. In the design step, software engineers
select and enforce design rules to accommodate analysis, encapsulating critical parts of
the code and selecting a well-understood design idiom for which suitable analysis
techniques are known. Test designers can then focus on the encapsulated or simplified
property. For example, it is common practice to encapsulate safety-critical properties
into a safety kernel. In this way, the hard problem of proving the safety-critical
properties of a complex system is decomposed into two simpler problems: Prove safety
properties of the (small) kernel, and check that all safety-related actions are mediated
by the kernel.
Tools for verifying a wide class of properties, like program verifiers based on theorem
proving, require extensive human interaction and guidance. Other tools with a more
restricted focus, including finite state verification tools, typically execute completely
automatically but almost always require several rounds of revision to properly formalize
a model and property to be checked. The least burdensome of tools are restricted to
checking a fixed set of simple properties, which (being fixed) do not require any
additional effort for specification. These featherweight analysis tools include type
checkers, data flow analyzers, and checkers of domain-specific properties, such as Web
site link checkers.
Type-checking techniques are typically applied to properties that are syntactic in the
sense that they enforce a simple well-formedness rule. Violations are easy to diagnose
and repair even if the rules are stricter than one would like. Data flow analyzers, which
are more sensitive to program control and data flow, are often used to identify
anomalies rather than simple, unambiguous faults. For example, assigning a value to a
variable that is not subsequently used suggests that either the wrong variable was set or
an intended subsequent use is missing, but the program must be inspected to determine
whether the anomaly corresponds to a real fault. Approximation in data flow analysis,
resulting from summarization of execution on different control flow paths, can also
necessitate interpretation of results.
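The hypothetical Java fragment below shows the kind of anomaly such an analyzer reports: the first value assigned to discount is dead on every path, which might be harmless or might indicate that a use, or an assignment to a different variable, is missing.

/** Illustrative data flow anomaly: the initial value of discount is
 *  overwritten on every path before it is ever read. */
public class DiscountCalc {
    static double discountedPrice(double price, boolean premium) {
        double discount = 0.05;      // flagged: this value is never used
        if (premium) {
            discount = 0.15;
        } else {
            discount = 0.0;
        }
        return price * (1.0 - discount);
    }
}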
Tools for more sophisticated analysis of programs are, like data flow analyses,
ultimately limited by the undecidability of program properties. Some report false alarms
in addition to real violations of the properties they check; others avoid false alarms but
may also fail to detect all violations. Such "bug finders," though imperfect, may
nonetheless be very cost-effective compared to alternatives that require more
interaction.
Tools that provide strong assurance of important general properties, including model
checkers and theorem provers, are much more "heavyweight" with respect to
requirement for skilled human interaction and guidance. Finite state verification systems
(often called model checkers) can verify conformance between a model of a system and
a specified property, but require construction of the model and careful statement of the
property. Although the verification tool may execute completely automatically, in practice
it is run over and over again between manual revisions of the model and property
specification or, in the case of model checkers for programs, revision of the property
specification and guidance on program abstraction. Direct verification of software has
proved effective, despite this cost, for some critical properties of relatively small
programs such as device drivers. Otherwise, finite state verification technology is best
applied to specification and design models.
The most general (but also the most expensive) static analysis tools execute with
interactive guidance. The symbolic execution techniques described in Chapter 7,
together with sophisticated constraint solvers, can be used to construct formal proofs
that a program satisfies a wide class of formally specified properties. Interactive
theorem proving requires specialists with a strong mathematical background to formulate
the problem and the property and interactively select proof strategies. The cost of
semiautomated formal verification can be justified for a high level algorithm that will be
used in many applications, or at a more detailed level to prove a few crucial properties
of safety-critical applications.
methods, and color indicates lines of code, where white represents the smallest classes,
black represents the largest, and intermediate sizes are represented by shades of gray.
The graphic provides no more information than a table of values, but it facilitates a
quicker and fuller grasp of how those values are distributed.
23.9 Debugging
Detecting the presence of software faults is logically distinct from the subsequent tasks
of locating, diagnosing, and repairing faults. Testing is concerned with fault detection,
while locating and diagnosing faults fall under the rubric of debugging. Responsibility for
testing and debugging typically falls to different individuals. Nonetheless, since the
beginning point for debugging is often a set of test cases, their relation is important, and
good test automation derives as much value as possible for debugging.
A small, simple test case that invariably fails is far more valuable in debugging than a
complex scenario, particularly one that may fail or succeed depending on unspecified
conditions. This is one reason test case generators usually produce larger suites of
single-purpose test cases rather than a smaller number of more comprehensive test
cases.
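For instance, a single-purpose JUnit test might pin down exactly one behavior of a hypothetical Java port of cgi_decode (the class CgiDecoder below is assumed, not part of the original example); when it fails, attention goes straight to the handling of an unterminated escape rather than to a long scenario.

import static org.junit.Assert.assertEquals;
import org.junit.Test;

/** One input, one expected output: a test case small enough to serve
 *  directly as a starting point for debugging. */
public class CgiDecoderTest {
    @Test
    public void unterminatedEscapeMapsToQuestionMark() {
        // Version 2.0 behavior: an unterminated "%x" maps to '?'
        assertEquals("prefix?", CgiDecoder.decode("prefix%x"));
    }
}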
Typical run-time debugging tools allow inspection of program state and controls to pause
execution at selected points (breakpoints), or when certain conditions occur
(watchpoints), or after a fixed number of execution steps. Modern debugging tools
almost always provide display and control at the level of program source code, although
compiler transformations of object code cannot always be hidden (e.g., order of
execution may differ from the order of source code). Specialized debugging support may
include visualization (e.g., of thread and process interactions) and animation of data
structures; some environments permit a running program to be paused, modified, and
continued.
When failures are encountered in stress testing or operational use, the "test case" is
likely to be an unwieldy scenario with many irrelevant details, and possibly without
enough information to reliably trigger failure. Sometimes the scenario can be
automatically reduced to a smaller test case. A test data reduction tool executes many
variations on a test case, omitting portions of the input for each trial, in order to discover
which parts contain the core information that triggers failure. The technique is not
universally applicable, and meaningful subdivisions of input data may be application-specific, but it is an invaluable aid to dealing with large data sets. While the purpose of
test data reduction is to aid debugging, it may also produce a useful regression test
case to guard against reintroduction of the same program fault.
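A minimal sketch of such a reducer, assuming only that the failure can be re-checked by re-running the program on a candidate input (the stillFails predicate below), repeatedly drops a chunk of the input and keeps any smaller input that still fails; systematic tools such as delta debugging refine the same idea.

import java.util.function.Predicate;

/** Greedy test data reduction: try dropping chunks of the failing input,
 *  keeping any smaller input that still triggers the failure. */
public class InputReducer {
    public static String reduce(String failingInput, Predicate<String> stillFails) {
        String current = failingInput;
        int chunk = Math.max(1, current.length() / 2);
        while (chunk >= 1) {
            boolean shrunk = false;
            for (int start = 0; start + chunk <= current.length(); start += chunk) {
                String candidate = current.substring(0, start)
                                 + current.substring(start + chunk);
                if (stillFails.test(candidate)) {   // each trial re-runs the program
                    current = candidate;
                    shrunk = true;
                    break;                          // rescan the smaller input
                }
            }
            if (!shrunk) chunk /= 2;                // refine the granularity
        }
        return current;
    }
}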
Not only the test case or cases that trigger failure but also those that execute correctly
are valuable. Differential debugging compares a set of failing executions to other
executions that do not fail, focusing attention on parts of the program that are always or
often executed on failures and less often executed when the program completes
successfully. Variations on this approach include varying portions of a program (to
determine which of several recent changes is at fault), varying thread schedules (to
isolate which context switch triggers a fatal race condition), and even modifying program
data state during execution.
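A bare-bones version of this comparison can be computed directly from the coverage sets already collected for regression testing: score each program element by the difference between the fraction of failing runs and the fraction of passing runs that execute it (names below are illustrative).

import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Ranks program elements by how strongly their execution correlates with
 *  failure: a score near 1.0 means "executed in failures, rarely in passes". */
public class FailureCorrelation {
    public static Map<String, Double> rank(List<Set<String>> failingCoverage,
                                           List<Set<String>> passingCoverage) {
        Set<String> elements = new LinkedHashSet<>();
        failingCoverage.forEach(elements::addAll);
        Map<String, Double> score = new LinkedHashMap<>();
        for (String element : elements) {
            double failRate = fraction(failingCoverage, element);
            double passRate = fraction(passingCoverage, element);
            score.put(element, failRate - passRate);
        }
        return score;
    }

    private static double fraction(List<Set<String>> runs, String element) {
        if (runs.isEmpty()) return 0.0;
        long hits = runs.stream().filter(coverage -> coverage.contains(element)).count();
        return (double) hits / runs.size();
    }
}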
reported failures from the field and from system testing is easy to justify in most
organizations, as it has immediate visible benefits for everyone who must deal with
failure reports. Collecting additional information to enable fault classification and process
improvement has at least equal benefits in the long term, but is more challenging
because the payoff is not immediate.
Further Reading
Surveys of currently available tools are available commercially, and reviews of many
tools can be found in trade magazines and books. Since tools are constantly evolving,
the research literature and other archival publications are less useful for determining
what is immediately available. The research literature is more useful for understanding
basic problems and approaches in automation to guide the development and use of
tools. Zeller [Zel05] is a good modern reference on program debugging, with an
emphasis on recent advances in automated debugging. A series of books by Tufte
[Tuf01, Tuf97, Tuf90, Tuf06] is useful reading for anyone designing information-dense
displays, and Nielsen [Nie00] is an introduction to usability that, though specialized to
Web applications, describes more generally useful principles. Norman [Nor90] is an
excellent and entertaining introduction to fundamental principles of usability that apply to
software tools as well as many other designed artifacts. The example in Figure 23.1 is
taken from Lanza and Ducasse [LD03], who describe a simple and adaptable approach
to depicting program attributes using multiple graphical dimensions.
Related Topics
Chapter 19 describes program analysis tools in more detail.
Exercises
23.1 Appropriate choice of tools may vary between projects depending, among other factors, on application domain, development language(s), and project size. Describe possible differences in A&T tool choices for the following:

Program analysis tools for a project with Java as the only development language, and for another project with major components in Java, SQL, and Python, and a variety of other scripting and special-purpose languages in other roles.

23.2 Consider the following design rule: All user text (prompts, error messages, etc.) is produced indirectly through tables, so that a table of messages in another language can be substituted at run-time. How would you go about partly or wholly automating a check of this property?

23.3 Suppose two kinds of fault are equally common and equally costly, but one is local (entirely within a module) and the other is inherently nonlocal (e.g., it could involve incompatibility between modules). If your project budget is enough to automate detection of either the local or the nonlocal property, but not both, which will you automate? Why?
Required Background
Chapter 20
This chapter describes test and analysis strategy and plans, which are
intertwined with documentation. Plans and strategy documents are part of
quality documentation, and quality documents are used in process monitoring.
24.1 Overview
Documentation is an important element of the software development process, including
the quality process. Complete and well-structured documents increase the reusability of
test suites within and across projects. Documents are essential for maintaining a body of
knowledge that can be reused across projects. Consistent documents provide a basis
for monitoring and assessing the process, both internally and for external authorities
where certification is desired. Finally, documentation includes summarizing and
presenting data that forms the basis for process improvement. Test and analysis
documentation includes summary documents designed primarily for human
comprehension and details accessible to the human reviewer but designed primarily for
automated analysis.
Documents are divided into three main categories: planning, specification, and reporting.
Planning documents describe the organization of the quality process and include
strategies and plans for the division or the company, and plans for individual projects.
Specification documents describe test suites and test cases. A complete set of analysis
and test specification documents includes test design specifications, test case
specifications, checklists, and analysis procedure specifications. Reporting documents
include details and summary of analysis and test results.
name
signature
date
approved by
name
signature
date
distribution status
Section N
Accessibility: W3C-WAI
Documentation Standards
Project documents must be archived according to the standard Chipmunk archive
[WB06-01.03]
Test logs
[WB10-02.13]
[WB11-01.11]
Inspection reports
[WB12-09.01]
Tools
The following tools are approved and should be used in all development projects.
Exceptions require configuration committee approval and must be documented in the
project plan.
Fault logging Chipmunk BgT [WB10-23.01]
References
[WB12-03.12] Standard
Inspection Procedures
[WB12-23.01] BgT Installation
Manual and User Guide
A test and analysis plan may not address all aspects of software quality and testing
activities. It should indicate the features to be verified and those that are excluded from
consideration (usually because responsibility for them is placed elsewhere). For
example, if the item to be verified includes a graphical user interface, the test and
analysis plan might state that it deals only with functional properties and not with
usability, which is to be verified separately by a usability and human interface design
team.
Explicit indication of features not to be tested, as well as those included in an analysis
and test plan, is important for assessing completeness of the overall set of analysis and
test activities. Assumption that a feature not considered in the current plan is covered at
another point is a major cause of missing verification in large projects.
The quality plan must clearly indicate criteria for deciding the success or failure of each
planned activity, as well as the conditions for suspending and resuming analysis and
test.
Plans define items and documents that must be produced during verification. Test
deliverables are particularly important for regression testing, certification, and process
improvement. We will see the details of analysis and test documentation in the next
section.
The core of an analysis and test plan is a detailed schedule of tasks. The schedule is
usually illustrated with GANTT and PERT diagrams showing the relation among tasks as
well as their relation to other project milestones.[1] The schedule includes the allocation
of limited resources (particularly staff) and indicates responsibility for results.
A quality plan document should also include an explicit risk plan with contingencies. As
far as possible, contingencies should include unambiguous triggers (e.g., a date on
which a contingency is activated if a particular task has not been completed) as well as
recovery procedures.
Finally, the test and analysis plan should indicate scaffolding, oracles, and any other
software or hardware support required for test and analysis activities.
[1]Project
relations between A&T and development tasks, with resource allocation and
constraints. A task schedule usually includes GANTT and PERT diagrams.
Staff and responsibilities:
Staff required for performing analysis and test activities, the required skills,
and the allocation of responsibilities among groups and individuals. Allocation
of resources to tasks is described in the schedule.
Environmental needs:
Hardware and software required to perform analysis or testing activities.
Test design specification documents describe complete test suites (i.e., sets of test
cases that focus on particular aspects, elements, or phases of a software project). They
may be divided into unit, integration, system, and acceptance test suites, if we organize
them by the granularity of the tests, or functional, structural, and performance test
suites, if the primary organization is based on test objectives. A large project may
include many test design specifications for test suites of different kinds and granularity,
and for different versions or configurations of the system and its components. Each
specification should be uniquely identified and related to corresponding project
documents, as illustrated in the sidebar on page 463.
Test design specifications identify the features they are intended to verify and the
approach used to select test cases. Features to be tested should be cross-referenced
to relevant parts of a software specification or design document. The test case selection
approach will typically be one of the test selection techniques described in Chapters 10
through 16 with documentation on how the technique has been applied.
A test design specification also includes description of the testing procedure and
pass/fail criteria. The procedure indicates steps required to set up the testing
environment and perform the tests, and includes references to scaffolding and oracles.
Pass/fail criteria distinguish success from failure of a test suite as a whole. In the
simplest case a test suite execution may be determined to have failed if any individual
test case execution fails, but in system and acceptance testing it is common to set a
tolerance level that may depend on the number and severity of failures.
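Stated operationally, a tolerance-based verdict might look like the sketch below (illustrative names): the suite fails if any test case designated critical fails, or if the overall failure rate exceeds the threshold fixed in the test design specification.

import java.util.Map;
import java.util.Set;

/** Suite-level pass/fail with a tolerance level. */
public class SuiteVerdict {
    public static boolean suitePasses(Map<String, Boolean> resultsByTest,
                                      Set<String> criticalTests,
                                      double maxFailureRate) {
        // A critical test case that is missing or failing fails the whole suite.
        boolean criticalFailure = criticalTests.stream()
                .anyMatch(id -> !Boolean.TRUE.equals(resultsByTest.get(id)));
        long failures = resultsByTest.values().stream().filter(ok -> !ok).count();
        double failureRate = resultsByTest.isEmpty()
                ? 0.0 : (double) failures / resultsByTest.size();
        return !criticalFailure && failureRate <= maxFailureRate;
    }
}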
A test design specification logically includes a list of test cases. Test case specifications
may be physically included in the test design specification document, or the logical
inclusion may be implemented by some form of automated navigation. For example, a
navigational index can be constructed from references in test case specifications.
Individual test case specifications elaborate the test design for each individual test case,
defining test inputs, required environmental conditions and procedures for test execution,
WB0715.01.C02
WB0715.01.C09[d]    valid model number with all legal required slots and some legal optional slots
WB0715.01.C19       empty model DB
WB0715.01.C23
WB0715.01.C24       empty component DB
WB0715.01.C29
[d]See

valid    many    many    complete    all valid    all valid
No. of models in DB: many
No. of components in DB: many

Chipmunk C20
#SMRS: 4
Screen: 13"
Processor: Chipmunk II plus
Hard disk: 30 GB
RAM: 512 MB
DVD player

Output Specification
return value valid
Environment Needs
Execute with ChipmunkDBM v3.4 database initialized from table MDB 15 32 03.
Special Procedural Requirements
none
Intercase Dependencies
none
A prioritized list of open faults is the core of an effective fault handling and repair
procedure. Failure reports must be consolidated and categorized so that repair effort
can be managed systematically, rather than jumping erratically from problem to problem
and wasting time on duplicate reports. They must be prioritized so that effort is not
squandered on faults of relatively minor importance while critical faults are neglected or
even forgotten.
Other reports should be crafted to suit the particular needs of an organization and
project, including process improvement as described in Chapter 23. Summary reports
serve primarily to track progress and status. They may be as simple as confirmation that
the nightly build-and-test cycle ran successfully with no new failures, or they may provide
somewhat more information to guide attention to potential trouble spots. Detailed test
logs are designed for selective reading, and include summary tables that typically list
the test suites executed, the number of failures, and a breakdown of failures into those
repeated from prior test execution, new failures, and test cases that previously failed but
now execute correctly.
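The breakdown itself is a mechanical comparison of the current failure set with the previous one, as in this sketch (illustrative names):

import java.util.LinkedHashSet;
import java.util.Set;

/** Classifies failures for a summary table: repeated (failed before and now),
 *  new (failing for the first time), and fixed (failed before, passing now). */
public class FailureBreakdown {
    public final Set<String> repeated = new LinkedHashSet<>();
    public final Set<String> fresh = new LinkedHashSet<>();
    public final Set<String> fixed = new LinkedHashSet<>();

    public FailureBreakdown(Set<String> previousFailures, Set<String> currentFailures) {
        for (String id : currentFailures) {
            if (previousFailures.contains(id)) repeated.add(id); else fresh.add(id);
        }
        for (String id : previousFailures) {
            if (!currentFailures.contains(id)) fixed.add(id);
        }
    }
}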
In some domains, such as medicine or avionics, the content and form of test logs may
be prescribed by a certifying authority. For example, some certifications require test
execution logs signed by both the person who performed the test and a quality
inspector, who ascertains conformance of the test execution with test specifications.
Further Reading
The guidelines in this chapter are based partly on IEEE Standard 829-1998 [Ins98].
Summary reports must convey information efficiently, managing both overview and
access to details. Tufte's books on information design are useful sources of principles
and examples. The second [Tuf90] and fourth [Tuf06] volumes in the series are
particularly relevant. Experimental hypermedia software documentation systems
[ATWJ00] hint at possible future systems that incorporate test documentation with other
views of an evolving software product.
Exercises
Agile software development methods (XP, Scrum, etc.) typically minimize
documentation written during software development. Referring to the sidebar on
page 381, identify standard analysis and test documents that could be generated
[b]Reproduced
[c] The detailed list of test cases is produced automatically from the test case file, which in turn is generated from the specification of categories and partitions. The test suite is implicitly referenced by individual test case numbers (e.g., WB07-15.01.C09 is a test case in test suite WB07-15.01).
[a] The prefix WB07-15.01 implicitly references a test suite to which this test case directly belongs. That test suite may itself be a component of higher level test suites, so logically the test case also belongs to any of those test suites. Furthermore, some additional test suites may be composed of selections from other test suites.
[2] If you are more familiar with another version control system, such as Subversion or Perforce, you may substitute it for CVS.
Bibliography
[ABC82] Richards W. Adrion, Martha A. Branstad, and John C. Cherniavsky.
Validation, verification, and testing of computer software. ACM Computing
Surveys, 14(2):159-192, June 1982.
[ASU86] Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles,
Techniques, and Tools. Addison-Wesley Longman, Boston, 1986.
[ATWJ00] Kenneth M. Anderson, Richard N. Taylor, and E. James Whitehead Jr.
Chimera: Hypermedia for heterogeneous software development environments.
ACM Transactions on Information Systems, 18(3):211-245, July 2000.
[Bar01] Carol M. Barnum. Usability Testing and Research. Allyn & Bacon,
Needham Heights, MA, 2001.
[Bei95] Boris Beizer. Black-Box Testing: Techniques for Functional Testing of
Software and Systems. John Wiley and Sons, New York, 1995.
[BGM91] Gilles Bernot, Marie Claude Gaudel, and Bruno Marre. Software testing
based on formal specifications: A theory and a tool. Software Engineering
Journal, 6(6):387-405, November 1991.
[BHC+94] Inderpal Bhandari, Michael J. Halliday, Jarir Chaar, Kevin Jones,
Janette S. Atkinson, Clotilde Lepori-Costello, Pamela Y. Jasper, Eric D. Tarver,
Cecilia Carranza Lewis, and Masato Yonezawa. In-process improvement through
defect data interpretation. IBM Systems Journal, 33(1):182-214, 1994.
[BHG87] Philip A. Bernstein, Vassos Hadzilacos, and Nathan Goodman.
Concurrency Control and Recovery in Database Systems. Addison-Wesley,
Boston, 1987.
[Bin00] Robert V. Binder. Testing Object-Oriented Systems, Models, Patterns,
and Tools. Addison-Wesley, Boston, 2000.
[Bis02] Matt Bishop. Computer Security: Art and Science. Addison-Wesley
Professional, Boston, 2002.
[Boe81] Barry W. Boehm. Software Engineering Economics. Prentice Hall,
Englewood Cliffs, NJ, 1981.
[BOP00] Ugo Buy, Alessandro Orso, and Mauro Pezzè. Automated testing of
classes. In Proceedings of the International Symposium on Software Testing and
Analysis (ISSTA), pages 39-48, Portland, OR, 2000.
[DP02] Giovanni Denaro and Mauro Pezzè. An empirical evaluation of fault-proneness models. In Proceedings of the 24th International Conference on
Software Engineering (ICSE), pages 241-251, Orlando, Florida, 2002.
[DRW03] Alastair Dunsmore, Marc Roper, and Murray Wood. The development
and evaluation of three diverse techniques for object-oriented code inspection.
IEEE Transactions on Software Engineering, 29(8):677-686, 2003.
[ECGN01] Michael D. Ernst, Jake Cockrell, William G. Griswold, and David Notkin.
Dynamically discovering likely program invariants to support program evolution.
IEEE Transactions on Software Engineering, 27(2):99-123, February 2001.
[Fag86] Michael E. Fagan. Advances in software inspections. IEEE Transactions
on Software Engineering, 12(7):744-751, 1986.
[FHLS98] Phyllis Frankl, Richard Hamlet, Bev Littlewood, and Lorenzo Strigini.
Evaluating testing methods by delivered reliability. IEEE Transactions on
Software Engineering, 24(8):586-601, 1998.
[FI98] Phyllis G. Frankl and Oleg Iakounenko. Further empirical studies of test
effectiveness. In Proceedings of the ACM SIGSOFT 6th International Symposium
on the Foundations of Software Engineering (FSE), volume 23, 6 of Software
Engineering Notes, pages 153-162, New York, November 3-5, 1998. ACM Press.
[Flo67] Robert W. Floyd. Assigning meanings to programs. In Proceedings of the
Symposium on Applied Mathematics, volume 19, pages 19-32, Providence, RI,
1967. American Mathematical Society.
[FO76] Lloyd D. Fosdick and Leon J. Osterweil. Data flow analysis in software
reliability. ACM Computing Surveys, 8(3):305-330, 1976.
[FvBK+91] Susumu Fujiwara, Gregor von Bochmann, Ferhat Khendek, Mokhtar
Amalou, and Abderrazak Ghedamsi. Test selection based on finite state models.
IEEE Transactions on Software Engineering, 17(6):591-603, June 1991.
[FW93] Phyllis G. Frankl and Elaine G. Weyuker. Provable improvements on
branch testing. IEEE Transactions on Software Engineering, 19(10):962-975,
October 1993.
[GG75] John B. Goodenough and Susan L. Gerhart. Toward a theory of test data
selection. IEEE Transactions on Software Engineering, 1(2):156-173, 1975.
[GG93] Tom Gilb and Dorothy Graham. Software Inspection. Addison-Wesley
Longman, Boston, 1993.
[Mor90] Larry J. Morell. A theory of fault-based testing. IEEE Transactions on
Software Engineering, 16(8):844-857, August 1990.
[MP43] Warren Sturgis McCulloch and Walter Harry Pitts. A logical calculus of the
ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5(115),
1943. Reprinted in Neurocomputing: Foundations of Research, 1988, MIT Press,
Cambridge MA.
[MS03] Atif M. Memon and Mary Lou Soffa. Regression testing of GUIs. In
Proceedings of the 9th European Software Engineering Conference held jointly
with 11th ACM SIGSOFT International Symposium on Foundations of Software
Engineering (ESEC/FSE), pages 118-127, Helsinki, Finland, 2003.
[Mus04] John D. Musa. Software Reliability Engineering: More Reliable Software
Faster And Cheaper. Authorhouse, second edition, 2004.
[Mye79] Glenford Myers. The Art of Software Testing. John Wiley and Sons, New
York, 1979.
[Nie00] Jakob Nielsen. Designing Web Usability: The Practice of Simplicity. New
Riders Publishing, Indianapolis, IN, 2000.
[Nor90] Donald A. Norman. The Design of Everyday Things. Doubleday/Currency
ed., 1990.
[OAFG98] Thomas Ostrand, Aaron Anodide, Herbert Foster, and Tarak Goradia.
A visual test development environment for GUI systems. In Proceedings of the
ACM SIGSOFT International Symposium on Software Testing and Analysis
(ISSTA), volume 23, 2 of ACM Software Engineering Notes, pages 82-92, New
York, March 2-5, 1998. ACM Press.
[OB88] Thomas J. Ostrand and Marc J. Balcer. The category-partition method for
specifying and generating functional tests. Communications of the ACM,
31(6):676-686, June 1988.
[OO90] Kurt M. Olender and Leon J. Osterweil. Cecil: A sequencing constraint
language for automatic static analysis generation. IEEE Transactions on Software
Engineering, 16(3):268-280, 1990.
[OO92] Kurt M. Olender and Leon J. Osterweil. Interprocedural static analysis of
sequencing constraints. ACM Transactions on Software Engineering and
Methodologies, 1(1):21-52, 1992.
[Wei07] Mark Allen Weiss. Data Structures and Algorithm Analysis in Java.
Addison-Wesley, Boston, 2nd edition, 2007.
[Wey98] Elaine J. Weyuker. Testing component-based software: A cautionary
tale. IEEE Software, 15(5):54-59, September/October 1998.
[WHH80] Martin R. Woodward, David Hedley, and Michael A. Hennell. Experience
with path analysis and testing of programs. IEEE Transactions on Software
Engineering, 6(3):278-286, May 1980.
[WO80] Elaine J. Weyuker and Thomas J. Ostrand. Theories of program testing
and the application of revealing subdomains. IEEE Transactions on Software
Engineering, 6(3):236-246, May 1980.
[YT89] Michal Young and Richard N. Taylor. Rethinking the taxonomy of fault
detection techniques. In Proceedings of the International Conference on Software
Engineering (ICSE), pages 53-62, Pittsburgh, May 1989.
[Zel05] Andreas Zeller. Why Programs Fail: A Guide to Systematic Debugging.
Morgan Kaufmann, San Francisco, 2005.
Index
A
A&T plan, 458
abstract classes, 277
testing, 281
abstraction function, 58, 110
abstraction in finite state models, 138
acceptance testing, 421-423
accessibility
W3C Web content accessibility guidelines, 426
adaptive systems, 452
adequacy
fault-based, 319
of test suites, 151
algebraic laws
for data model verification, 144
alias, 94-96
in data flow testing, 241
interprocedural analysis, 97
all definitions adequacy criterion, 240
all definitions coverage, 241
all DU pairs adequacy criterion, 239, 295
all DU pairs coverage, 239
all DU paths adequacy criterion, 240
all DU paths coverage, 240
all-paths analysis, 85
Alloy finite state verification tool, 144
alpha and beta test, 10, 423
alpha test, 10
alternate expression, alternate program in fault-based testing, 315
analysis of dynamic memory use, 360-363
analysis plan, 382-386
AND-states
in statecharts, 286
any-path analysis, 85
API (application program interface), 413
architectural design, 6
impact on static analysis and test, 40
argument
in data flow analysis, 94
Ariane 5 incident, 406
array
in data flow analysis, 94, 236
in data flow testing, 241
assembly testing, 413-415
assertion
in symbolic execution, 105
assumed precondition, 194, 199
atomic blocks
in Promela, 122
atomic operation, 132
atomic transaction
serializability, 34
automating analysis and test, 439
availability, 10, 44
available expressions
data flow analysis, 85
B
Büchi automaton, 125
backbone strategy, 410
Backus Naur Form (BNF), 257
backward data flow analysis, 87
basic block, 60
coverage, 216
basic condition, 219, 251
coverage, 253
basic condition adequacy criterion, 219
basic condition coverage, 219
basis set of a graph, 228
BDD, see ordered binary decision diagram (OBDD)
behavior models
extracting from execution, 365-369
beta test, 10, 423
bias
in test case selection, 164
big bang testing, 408
binary decision diagram, see ordered binary decision diagram (OBDD)
black-box testing, 154, 161, 162
BNF, see Backus Naur Form
Boolean connective, 251
Boolean expression, 251
bottom-up integration testing, 410
boundary condition grammar-based criterion, 262
boundary interior criterion, 223
boundary interior loop coverage, 250
boundary value testing, 185, 194
boundary values, 172
branch adequacy criterion, 217, 257
C
call coverage, 229
call graph, 63-65
analysis of Java throws clause, 65
Capability Maturity Model (CMM), 341
capture and replay, 337
catalog-based testing, 194-204
category-partition testing, 180-188
category, 180, 181
error constraints, 186
property constraints, 186
regression test selection, 432
cause-effect graphs, 253
certification testing
in SRET, 380
CFG, see control flow graph
characteristic function, 135
checklist, 37, 344-348
for Java source code, 346, 347
choice
in category partition, 180
class
reasoning about, 109
classes of values, 180, 185, 194
Cleanroom process, 378, 399
CMM, see Capability Maturity Model
code generation
from finite state models, 130
cognitive burden and aids, 448-449
collaboration diagram, 293
collective code ownership, 351
combinatorial testing, 179
for polymorphism, 302
combining techniques, 7
commercial off-the-shelf components (COTS), 414
in integration testing, 410
communication protocol, 246, 249
comparing testing criteria, 230
comparison-based oracle, 332
competent programmer hypothesis, 314
compiler
grammar-based testing, 257
complete
state transition relation, 70
complete analysis, 21
completeness
structuring quality process for, 40
complex condition, 251
complex document structure
grammar-based testing, 257
component, 414
component and assembly testing, 413-415
component-based system, 414
component-based testing, 405
compositional reasoning, 108
compound condition adequacy, 220
compound condition coverage, 253
conclusion of inference rule, 109
concurrency, 277
concurrency fault, 356
specifying properties, 24
concurrency control protocol, 35
condition testing, 219-222
conformance testing, 116, 130
conservative analysis, 20, 21, 91
consistency checks
internal and external, 377
consistent
self-consistency, 17
constraint
in decision tables, 253
context independence
in interprocedural analysis, 97
context-sensitive analysis, 65, 96
contract, 413, 415
as precondition, postcondition pair, 105
interface of component, 414
monitoring satisfactory completion, 400
of procedure as Hoare triple, 109
control dependence graph, 80
control flow graph, 59-63
control flow testing, 212
model-based, 257
regression testing, 429
controllability, 329
correctness, 43
and operating conditions, 45
correctness conjectures, 377
cost
depends on time between error and fault detection, 29, 376
estimating and controlling, 382
of faults, 49
verification vs. development cost, 4
cost-effectiveness
structuring quality process for, 40
COTS, see commercial off-the-shelf components
counting residual faults, 322
coupling effect hypothesis, 314
critical dependence, 384
critical module, 412
critical paths, 383
critical systems, 44
cross-quality requirements, 6
CSP
D
dangling pointer, 357
data collection, 49
data dependence graph, 80
data flow (DF) regression test, 432
data flow adequacy criterion, 239
data flow analysis, 447
with arrays and pointers, 94, 236
data flow analysis algorithm, 82-84
data flow equation, 83
data flow graph
deriving test cases from, 257
data flow model, 77
data flow testing
for object-oriented programs, 295
data model verification, 140-146
data race, 356
data structure
in data flow testing, 241
reasoning about, 109
database
problem tracking, 11
deadlock, 32, 356
debugging, 449-451
decision structure, 251-255
decision table, 253
defect
repair cost predictor, 40
defensive programming, 33
leads to unreachable code, 230
definition
in a data flow model, 77
in catalog-based testing, 194, 199
definition-clear path, 78
definition-use association, see definition-use pair
definition-use pair (DU pair), 77, 82, 236, 238
definition-use path (DU path), 238
delocalization
inspection techniques to deal with, 344
dependability, 10, 15, 43, 421
measures of, 10
vs. time-to-market, 40
dependable, 16
dependence, 77
design
activities paired with verification, 3
architectural, 6
feasibility study, 5
test design vs. test execution, 5
design for test, 35, 330, 375, 376, 389
risk-driven strategy, 419
system architecture and build plan, 408
design pattern
vs. framework, 414
design rule
to simplify verification, 447
design secret
and modular reasoning, 109
desperate tester strategy, 410
deterministic, 24
state transition relation, 70
development risk, 391
development testing in SRET, 380
diagnosability, 408
digraph, see directed graph
direct data dependence, 78
directed graph, 57
distraction
cognitive burden, 448
distributed
specifying properties, 24
distributed systems
finite state verification applicable to, 121
divide and conquer, 35
document object model (DOM), 415
as a component interface contract, 415
documentation, 455
documentation quality, 458
DOM, see document object model
domain-specific language
grammar-based testing, 257
dominator, 80
double-checked locking idiom, 117, 130
driver, 408
DU pair, see definition-use pair
DU path, see definition-use path
dual
of a directed graph, 68
dynamic analysis, 355
dynamic binding, 277, 301-303
dynamic memory allocation faults, 357
dynamic memory analysis, 360-363
dynamic method dispatch
representation in call graph, 63
E
egoless programming, 351
elementary condition, 219
elementary items
identifying, 194
embedded control system, 246
encapsulation, 272
and test oracles, 300
modular reasoning, 109
entry and exit testing (procedure call testing), 229
environment, 328
environmental needs, 460
equivalent mutant, 319
equivalent scenarios, 300
erroneous condition testing, 185, 194
error
off-by-one, 185
error propagation, 236
error values, 172
estimating population sizes, 322
exception, 277, 308-309
analysis in Java, 97
as implicit control flow, 60, 309
test cases for, 281
executing test cases, 327
execution history priority schema, 436
execution risk, 391
exhaustive testing, 20
explicit model checking
vs. symbolic model checking, 138
explicit requirement
vs. implicit requirement, 43
F
fail-fast, 30
failure
critical vs. noncritical, 44
fairness properties, 125
false alarms, 129
fault
analysis, 12
categorization, 49
distribution, 391
injection, 323
localization, 408
model, 314-316
propagation to failure, 212
revealing priority schema, 436
seeding, 154, 314, 315
fault-based testing, 154, 313
adequacy criterion, 156, 319
of hardware, 323
vs. functional testing, 163
feasibility study, 5, 46, 382
feature-oriented integration strategy, 412
features to be analyzed or tested, 460
feedback, 36-37, 399, 408
in the Cleanroom process, 378
feedback loop, 49
finite models of software, 55
finite state machine (FSM), 65-73, 359
conformance testing, 130
correctness relations, 70
deriving test cases from, 246-250
don't care transition, 250
error transition, 250
G
garbage detector, 363
gen set
in data flow analysis, 85
varying, 92
generating test cases, 327
generating test case specifications, 172
generating test data, 328-329
generics, 306-308
glass-box testing, see structural testing
global property, 420
graceful degradation, 45
grammars
deriving tests from, 257-265
graphical user interface
specifying properties, 24
guarded commands
in Promela, 122
H
halting problem, 18
hazards, 44
hierarchical (compositional) reasoning, 108
hierarchical interprocedural analysis, 97
HIPAA safeguards, 420
Hoare triple, 108
in test oracle, 335
HTML
DOM model, 415
HTTP, see Hypertext Transport Protocol
Hypertext Transport Protocol (HTTP), 35, 36
I
immediate dominator, 80
implementation, 17
implicit control flow, 60, 309
implicit requirement, 43
incident
Apache 2 buffer overflow, 407
Apache 2.0.48 memory leak, 409
Ariane 5 failure, 406
libPNG buffer overflow, 406
loss of Mars Climate Orbiter, 407
incremental development
and scaffolding, 329
independent verification and validation (IV&V), 400, 419
independently testable feature (ITF), 170
indivisible action, 132
inevitability
flow analysis, 90
infeasibility
identifying infeasible paths with symbolic execution, 102
infeasible path, 105
problem in test coverage, 230-232, 243
unsatisfiable test obligations, 156
inference rule, 109
information clutter
cognitive burden, 448
information hiding
and modular reasoning, 109
inheritance, 272
in object-oriented testing, 303-306
representation in call graph, 63
testing inherited and overridden methods, 281
inspection, 37, 341
benefits and bottlenecks, 46
J
Java inspection checklist, 346, 347
JUnit, 330, 331
K
kill, 78
in data flow analysis, 82
mutation analysis, 319
kill set
in data flow analysis, 85
varying, 92
L
lattice, 93
LCSAJ, see linear code sequence and jump
libPNG buffer overflow incident, 406
linear code sequence and jump (LCSAJ), 60, 227
lines of code
static metric, 443
live mutants, 319
live variables
data flow analysis, 85
liveness properties, 125
LOC
source lines of code, 443
lock, 356
lockset analysis, 363-365
loop boundary adequacy criterion, 227
loop invariant, 105
lost update problem, 132
M
may immediately precede (MIP) relation, 139
MC/DC, see modified condition/decision coverage
Mealy machine, 65
mean time between failures (MTBF), 10, 44, 378
memory
analysis, 360-363
fault, 357, 360
leak, 357, 360, 409
metrics, 389
MIP, see may immediately precede relation
missing code fault, 163
missing path faults, 215
misspelled variable, 90
mock, 330
model, 55
correspondence, 129-134
extraction, 129
granularity, 131-134
important attributes of, 55
intensional, 134
refinement, 138-140
model checking, see finite state verification, 447
model-based testing, 154, 171, 245
regression test selection, 432
modified condition/decision coverage (MC/DC), 221, 255
required by RTCA/DO-178B standard, 222, 379
modifier, 297
modular specifications and reasoning, 109
module and unit testing, 405
monitoring and planning, 41
monitoring the quality process, 389-394
MTBF, see mean time between failures
N
necessary condition, 22
nightly build-and-test cycle, 327, 420
node adequacy criterion, 257
nondeterministic, 24
nonfunctional properties
in component interface contract, 415
nonlinearity, 4
nonlocality
cognitive burden, 448
normal conditions
selected in catalog-based testing, 194
O
OBDD, see ordered binary decision diagram
object reference
in data flow analysis, 94
object-oriented method dispatch
representation in call graph, 63
object-oriented software
issues in testing, 272
orthogonal approach to testing, 280
testing, 271
observability, 36, 329, 408
OCL
assertions about data models, 140
ODC, see orthogonal defect classification
operation
in catalog-based testing, 194
operational profile, 422
in SRET, 380
optimistic inaccuracy, 20, 21
OR-state
in statechart, 284
oracle, 8, 328, 332-338
for object-oriented programs, 298-301
from finite state machine, 249
ordered binary decision diagram (OBDD), 135
orthogonal defect classification (ODC), 392
outsourcing, 401
overall quality, 458
P
pair programming, 351, 381, 401
pairwise combination testing, 188-194
parameter characteristic, 180, 181
parameterized type, 306-308
partial functionality, 45
partial oracle, 333
partial order reduction, 134, 138
partition, 35-36
categories into choices, 180, 185
partition testing, 162-167
patch level, 11
patch level release, 11
path adequacy criterion, 222
path coverage, 222
path testing, 222-228
and data interactions, 236
peer review, 401
performance, 419
Perl
taint mode, 91
personnel risk, 386, 390
pessimistic inaccuracy, 20, 21
plan, 41
analysis and test plan, 8
analysis and test plan document, 458-460
monitoring, 8
relation to strategy, 379
selecting analysis and test tools, 441
test and analysis, 382
planning and monitoring, 41
sandwich integration strategy, 412
planning tools, 441-443
point release, 11
pointer
in data flow analysis, 94, 236
in data flow testing, 241
pointer arithmetic
in data flow analysis, 94
polymorphism, 277, 301-303
post-dominator, 81
postcondition, 105, 358
in catalog-based testing, 194, 199
in test oracle, 335
of state transition, 249
powerset lattice, 93
pre-dominator, 81
precondition, 105, 358
in catalog-based testing, 194, 197
in test oracle, 335
of state transition, 249
predicate, 251
premise of inference rule, 109
preserving an invariant, 106
principles of test and analysis, 29
prioritization
of regression test cases, 434-436
probabilistic grammar-based criteria, 265
problem tracking database, 11
procedure call testing, 229-230
procedure entry and exit testing, 229
process
improvement, 12, 49, 394-399
management, 441
monitoring, 389-394
visibility, 36, 41, 383, 389
process qualities
vs. product qualities, 42
production coverage criterion, 262
program
generation, 130
generic term for artifact under test, 161
verifier, 447
program analysis, 355
program dependence
graph representation of, 80
program location
in fault-based testing, 315
Promela (Spin input language), 121, 122, 129
test case generation, 329
propagation from fault to failure, 212
protocol, 246, 249
proxy measure, 41
test coverage, 156
Q
quality
cross-quality requirements, 6
goal, 42
manager, 382
plan, 8, 376, 458
process, 39, 376-377
team, 399-402
quantifier
in assertions, 337
R
race condition, 32, 117, 132
random testing, 162
RCA, see root cause analysis
reaching definition, 82
data flow equation, 83
reading techniques in inspection, 344
redundancy, 32-33
reference
in data flow analysis, 94
refining finite state models, 138-140
region
in control flow analysis, 59
regression test, 11, 418, 427-436
prioritization, 434-436
selection, 428-434
regular expressions
deriving tests from, 257-265
relational algebra, 144
for data model verification, 140
release
point release vs. patch, 11
reliability, 10, 44, 45, 419
report, 462-465
representative value classes, 171
representative values, 180, 185
requirement
engineering, 420
implicit vs. explicit, 43
risk, 391
specification, 16, 162
residual faults
S
safe
analysis, 21
safety, 44, 45, 420
properties, 125
property of system and environment, 420
specification, 45
sandwich or backbone, 410
scaffolding, 8, 328-332, 408
generic vs. specific, 330
scalability
of finite state verification techniques, 114
scenarios, 415
schedule risk, 386, 390
schedule visibility, 36
scripting rule
grammar-based testing, 257
SDL, 246
security, 420
finite state verification applicable to, 121
security hazard
preventing with Perl taint mode, 91
seeded faults, 154, 314, 315
selection
of test cases, 151
selective regression test execution, 434-436
self-monitoring and adaptive systems, 452
semantic constraints
in category-partition method, 180, 186
sensitivity, 29-32
sensitivity testing, 422
sequence diagram, 293
sequencing properties, 125
serializability, 34
severity levels
in fault classification, 392, 397
short-circuit evaluation, 221
Simple Mail Transport Protocol (SMTP), 36
simple transition coverage, 286
single state path coverage, 250
single transition path coverage, 250
singularity in input space, 164
SMTP, see Simple Mail Transport Protocol
software reliability engineered testing (SRET), 380, 399
sound
analysis, 21
special value testing, 164
specification, 17
as precondition and postcondition assertions, 105
correctness relative to, 44
decomposing for category-partition testing, 180, 181
requirement, 16
specification-based testing, 154, 161, 166
regression test selection, 432
Spin finite state verification tool, 121
spiral process
in SRET approach, 380
spiral process model, 376
spurious reports
in finite state verification, 138
SQL
as a component interface contract, 415
SRET, see software reliability engineered testing
state diagram, see statechart
state space, 58
state space exploration, 116-134
state transition table
representation of finite state machine, 70
T
taint mode in Perl, 91
tasks and schedule, 460
technology risk, 386, 390
template, 306-308
temporal logic, 125
TERk coverage, 227
test
adequacy criterion, 153
deliverable, 460
driver, 329
execution, 48, 153
harness, 329
input, 152
obligation, 153, 154
oracle, 332-338
pass/fail criterion, 152
plan, 382-386
scenarios, 415
specification, 153
strategy document, 458
test case, 153
maintenance, 427
test case specification, 172
document, 462
generating, 180, 186
test coverage
as proxy measure for thoroughness, 156
test design
early, 48
specification document, 460-462
test first
in XP, 381, 401
testability
design for, 330, 375, 376, 389
U
UML
data models, 140
sequence and collaboration diagrams, 293
statechart, 284
undecidability, 18, 113
undecidability and unsatisfiable test obligations, 156
unit
work assignment, 170
unit and integration test suites
unsuitable for system testing, 418
unit testing
for object-oriented programs, 282-286
unreachable code, 230
usability, 423-425
specifying and verifying properties, 24
usability testing, 6
assigned to separate team, 460
usage profile, 44, 378
use
in a data flow model, 77
use/include relation, 286
useful
distinct from dependable, 16
useful mutant, 316
usefulness, 43, 418
useless definition, 90
user stories, 381
V
V model, 17, 376, 405
V&V, see verification and validation
valid mutant, 316
validated precondition, 194, 197, 199
validation, 15, 17
acceptance testing as, 418
vs. verification, 7
variable
in catalog-based testing, 194
initialization analysis, 87
verification, 16
of self-consistency and well-formedness, 17
purpose of functional testing, 162
system testing as, 418
vs. validation, 7
verification and validation (V&V), 6
version control, 449
visibility, 36, 41, 383, 389
W
W3C Web content accessibility guidelines, 426
waterfall process model, 376
WCAG
W3C Web content accessibility guidelines, 426
weak mutation analysis, 321
weakening a predicate, 104
well-formedness and self-consistency, 17
white-box testing, see structural testing
X-Z
XML
as a component interface contract, 415
DOM model, 415
XML schema
grammar-based testing, 262
XP, see extreme programming
List of Figures
Preface
Figure 1: Selecting core material by need
Figure 5.9: Finite state machine description of the line-end conversion procedure, depicted as a state transition diagram (top) and as a state transition table (bottom). An omission is obvious in the tabular representation, but easy to overlook in the state transition diagram.
Figure 5.10: Correctness relations for a finite state machine model. Consistency and
completeness are internal properties, independent of the program or a higher-level
specification. If, in addition to these internal properties, a model accurately represents
a program and satisfies a higher-level specification, then by definition the program
itself satisfies the higher-level specification.
Figure 5.11: Procedure to convert among DOS, Unix, and Macintosh line ends.
Figure 5.12: Completed finite state machine (Mealy machine) description of line-end
conversion procedure, depicted as a state-transition table (bottom). The omitted
transition in Figure 5.9 has been added.
variable name in the data validation method will be implicitly declared and will not be
rejected by the Python compiler or interpreter, which could allow invalid data to be
treated as valid. The classic live variables data flow analysis can show that the
assignment to valid is a useless definition, suggesting that the programmer probably
intended to assign the value to a different variable.
Figure 6.12: The powerset lattice of set {a,b,c}. The powerset contains all subsets of
the set and is ordered by set inclusion.
Figure 6.13: Spurious execution paths result when procedure calls and returns are
treated as normal edges in the control flow graph. The path (A,X,Y,D) appears in the
combined graph, but it does not correspond to an actual execution order.
Figure 8.12: Ordered binary decision diagram (OBDD) encoding of the Boolean proposition a ⇒ b ∧ c, which is equivalent to ¬a ∨ (b ∧ c). The formula and OBDD structure can be thought of as a function from the Boolean values of a, b, and c to a single Boolean value True or False.
Figure 8.13: Ordered binary decision diagram (OBDD) representation of a transition relation, in three steps. In part (A), each state and symbol in the state machine is assigned a Boolean label. For example, state s0 is labeled 00. In part (B), transitions are encoded as tuples (sym, from, to) indicating a transition from state from to state to on input symbol sym. In part (C), the transition tuples correspond to paths leading to the True leaf of the OBDD, while all other paths lead to False. The OBDD represents a characteristic function that takes valuations of x0 … x4 and returns True only if the valuation corresponds to a state transition.
Figure 8.14: The data model of a simple Web site.
Figure 8.15: Alloy model of a Web site with different kinds of pages, users, and
access rights (data model part). Continued in Figure 8.16.
Figure 8.16: Alloy model of a Web site with different kinds of pages, users, and
access rights, continued from Figure 8.15.
Figure 8.17: A Web site that violates the "browsability" property, because public page
Page_2 is not reachable from the home page using only unrestricted links. This
diagram was generated by the Alloy tool.
List of Tables
List of Sidebars