Software Testing Automation
Contents
Software Test Automation, Phillip Laplante, Fevzi Belli, Jerry Gao, Greg Kapfhammer, Keith Miller,
W. Eric Wong, and Dianxiang Xu
Volume 2010, Article ID 163746, 2 pages
A Tester-Assisted Methodology for Test Redundancy Detection, Negar Koochakzadeh and Vahid Garousi
Volume 2010, Article ID 932686, 13 pages
A Strategy for Automatic Quality Signing and Verification Processes for Hardware and Software
Testing, Mohammed I. Younis and Kamal Z. Zamli
Volume 2010, Article ID 323429, 7 pages
Automated Test Case Prioritization with Reactive GRASP, Camila Loiola Brito Maia,
Rafael Augusto Ferreira do Carmo, Fabrício Gomes de Freitas, Gustavo Augusto Lima de Campos,
and Jerffeson Teixeira de Souza
Volume 2010, Article ID 428521, 18 pages
A Proposal for Automatic Testing of GUIs Based on Annotated Use Cases, Pedro Luis Mateo Navarro,
Diego Sevilla Ruiz, and Gregorio Martínez Pérez
Volume 2010, Article ID 671284, 8 pages
AnnaBot: A Static Verifier for Java Annotation Usage, Ian Darwin
Volume 2010, Article ID 540547, 7 pages
Software Test Automation in Practice: Empirical Observations, Jussi Kasurinen, Ossi Taipale,
and Kari Smolander
Volume 2010, Article ID 620836, 18 pages
Editorial
Software Test Automation
Phillip Laplante,1 Fevzi Belli,2 Jerry Gao,3 Greg Kapfhammer,4 Keith Miller,5
W. Eric Wong,6 and Dianxiang Xu7
1 Engineering
Research Article
A Tester-Assisted Methodology for Test Redundancy Detection
Negar Koochakzadeh and Vahid Garousi
Software Quality Engineering Research Group (SoftQual), Department of Electrical and Computer Engineering,
Schulich School of Engineering, University of Calgary, Calgary, AB, Canada T2N 1N4
Correspondence should be addressed to Negar Koochakzadeh, nkoochak@ucalgary.ca
Received 15 June 2009; Revised 16 September 2009; Accepted 13 October 2009
Academic Editor: Phillip Laplante
Copyright 2010 N. Koochakzadeh and V. Garousi. This is an open access article distributed under the Creative Commons
Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is
properly cited.
Test redundancy detection reduces test maintenance costs and also ensures the integrity of test suites. One of the most widely used approaches for this purpose is based on coverage information. In a recent work, we have shown that although this information can be useful in detecting redundant tests, it may suffer from a large number of false-positive errors, that is, a test case being identified as redundant while it really is not. In this paper, we propose a semiautomated methodology to derive a reduced test suite from a given test suite, while keeping the fault detection effectiveness unchanged. To evaluate the methodology, we apply the mutation analysis technique to measure the fault detection effectiveness of the reduced test suite of a real Java project. The results confirm that the proposed manual interactive inspection process leads to a reduced test suite with the same fault detection ability as the original test suite.
1. Introduction
In today's large-scale software systems, test (suite) maintenance is an inseparable part of software maintenance. As a software system evolves, its test suites need to be updated (maintained) to verify new or modified functionality of the software. That may cause test code to erode [1, 2]; it may become complex and unmanageable [3] and increase the cost of test maintenance. Decayed parts of a test suite that cause test maintenance problems are referred to as test smells [4].
Redundancy (among test cases) is a discussed but seldom-studied test smell. A redundant test case is one which, if removed, will not affect the fault detection effectiveness of the test suite. Another type of test redundancy discussed in the literature (e.g., [5, 6]) is test code duplication. This type of redundancy is similar to conventional source code duplication and is of a syntactic nature. We refer to the above two types of redundancy as semantic and syntactic test redundancy smells, respectively. In this work, we focus on the semantic redundancy smell, which is known to be more challenging to detect in general than the syntactic one [5].
Redundant test cases can have serious consequences for test maintenance. By modifying a software unit in the
However, test redundancy detection based on coverage information does not guarantee to preserve the fault detection capability of a given test suite. Evaluation results from our previous work [12] showed that although coverage information can be very useful in test redundancy detection, detecting redundancy based only on this information may lead to a test suite which is weaker in detecting faults than the original one.
Considering the fault detection capability of a test case for the purpose of redundancy detection is thus very important. To achieve this purpose, we propose a collaborative process between testers and a proposed redundancy detection engine that guides the tester to use valuable coverage information in a proper and useful way.
The output of the process is a reduced test suite. We claim that if testers play their role carefully in this process, the fault detection effectiveness of this reduced test set will be equal to that of the original set.
A high amount of human effort must be spent to inspect a test suite manually. However, the process proposed in this paper tries to use the coverage information in a constructive fashion to reduce the required tester effort. More automation can be added to this process later to save more cost, and thus the proposed process should be considered a first step toward reducing the human effort required for test redundancy detection.
To evaluate our methodology, we apply the mutation technique in a case study in which common types of faults are injected. The original and reduced test sets are then executed to detect the faulty versions of the system. The results show similar fault detection capability for the two test sets.
The remainder of this paper is structured as follows. We review the related work in Section 2. Our recent previous work [12], which evaluated the precision of test redundancy detection based on coverage information, is summarized in Section 3. The need for knowledge collaboration between human testers and the proposed redundancy detection engine is discussed in Section 4. To leverage and share knowledge between the automated engine and the human tester, we propose a collaborative process for redundancy detection in Section 5. In Section 6, we show the results of our case study and evaluate them using the mutation technique. Efficiency, precision, and a summary of the proposed process are discussed in Section 7. Finally, we conclude the paper in Section 8 and discuss future work.
2. Related Work
We first review the related work on test minimization and test redundancy detection. We then provide a brief overview of the literature on semiautomated processes that collaborate with software engineers to complete tasks in software engineering, and specifically in software testing.
There are numerous techniques that address test suite minimization by considering different types of test coverage criteria (e.g., [6–11]). In all of those works, to achieve the maximum possible test reduction, the smallest test set
[Figure: test artifact granularity in the original test set (test suite, test package, test class, test case).]
(1) Test redundancy detection based on coverage information in all previous works has been done by considering only a limited number of coverage criteria. The fact that two test cases may cover the same part of the SUT according to one coverage criterion but not according to another causes imprecision when redundancy is detected using only one coverage criterion.
(2) In JUnit, each test case contains four phases: setup, exercise, verify, and teardown [4]. In the setup phase, the state of the SUT required for a particular test case is set up. In the exercise phase, the SUT is exercised. In the teardown phase, the SUT state is rolled back to the state before running the test. In these three phases the SUT is covered, while in the verification phase only a comparison between expected and actual outputs is performed and the SUT is not covered. Therefore, there might be test cases that cover the same part of the SUT but have different verifications. In this case, coverage information may lead to detecting a nonredundant test as redundant.
(3) Coverage information is calculated only for the parts of the SUT instrumented for coverage measurement. External resources (e.g., libraries) are not usually instrumented. There are cases in which two test methods cover different libraries. In such cases, the coverage information of the SUT alone is not enough to measure redundancies.
Another reason for imprecision in redundancy detection based on coverage information is mentioned in [12].
[Algorithm 1: Source code of two test methods in the Allelogram test suite; the listing includes the call double a = getDefaultAdjusted(0).]
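As a generic illustration of the scenario described in point (2) above (two JUnit test methods that exercise exactly the same production code but verify different properties), a minimal sketch with hypothetical names, not the actual Allelogram code, could look like this:

```java
import static org.junit.Assert.*;
import org.junit.Test;

public class SameCoverageDifferentVerificationTest {

    // Hypothetical SUT stub standing in for an Allelogram-like class.
    static class Adjuster {
        double getDefaultAdjusted(int index) {
            return index + 0.5;
        }
    }

    // Both tests cover exactly the same SUT code (identical coverage),
    // but they verify different properties, so neither one is redundant.
    @Test
    public void adjustedValueIsNotNaN() {
        double a = new Adjuster().getDefaultAdjusted(0);
        assertFalse(Double.isNaN(a));
    }

    @Test
    public void adjustedValueIsNonNegative() {
        double a = new Adjuster().getDefaultAdjusted(0);
        assertTrue(a >= 0.0);
    }
}
```

A purely coverage-based analysis would report each of these tests as fully redundant with respect to the other, even though they check different expected behaviors.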
Using the three above guidelines helps testers collaborate more effectively in the proposed redundancy detection process by analyzing test code. Testers who have developed the test artifacts are the best source of knowledge to decide about test redundancy by considering the above three lessons. However, other test experts can also use our methodology to find the redundancy of a test suite through manual inspection. For instance, in the experiment of this work, the input test suite was created by the developers of an open source project, while the first author performed the process of test redundancy detection.
[Figure: the collaborative redundancy detection process. The tester sorts and selects test artifacts from a list initially in no particular order, inspects the test source code, and analyzes redundancy; the system removes a test artifact identified as redundant (Step 4) and recalculates the redundancy measures (Step 5); the loop repeats on the reduced test suite until the tester decides that enough investigation has been done, yielding the final reduced test suite.]
coverage criterion i (e.g., statement coverage). CoverageCriteria in these two equations is the set of available coverage criteria used during the redundancy detection process. Based on the design rationale of the above metrics, their values are always a real number in the range [0, 1]. This enables us to measure redundancy in a quantitative domain (i.e., partial redundancy is supported too).
However, the results from [12] show that this type of partial redundancy is not precise and may mislead the tester in detecting the redundancy of a test. For instance, suppose that two JUnit test methods have similar setups with different exercises. If, for example, 90% of the test coverage is in the common setup, the pair redundancy metric would indicate that they are 90% redundant with respect to each other. However, the different exercises in these tests separate their goals, and thus they should not be considered redundant with respect to each other, while the 90% redundancy value can mislead the tester about their redundancy.
Equation (1) defines the Pair Redundancy of a test artifact t_j with respect to another one, t_k, and (2) defines its Suite Redundancy with respect to the rest of the test suite TS:

$$PR(t_j, t_k) = \frac{\sum_{i \in \mathit{CoverageCriteria}} \left|\mathit{CoveredItems}_i(t_j) \cap \mathit{CoveredItems}_i(t_k)\right|}{\sum_{i \in \mathit{CoverageCriteria}} \left|\mathit{CoveredItems}_i(t_j)\right|}, \qquad (1)$$

$$SR(t_j) = \frac{\sum_{i \in \mathit{CoverageCriteria}} \left|\mathit{CoveredItems}_i(t_j) \cap \mathit{CoveredItems}_i(TS \setminus \{t_j\})\right|}{\sum_{i \in \mathit{CoverageCriteria}} \left|\mathit{CoveredItems}_i(t_j)\right|}. \qquad (2)$$
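To make the computation of these metrics concrete, the following is a minimal Java sketch of how PR and SR could be derived from per-criterion coverage sets; the class and method names are illustrative assumptions, not part of the authors' tool.

```java
import java.util.*;

/** Illustrative sketch: pair and suite redundancy computed from per-criterion coverage sets. */
public class RedundancyMetrics {

    /** coverage.get(criterion).get(test) = set of items of that criterion covered by the test. */
    static double pairRedundancy(String tj, String tk,
                                 Map<String, Map<String, Set<String>>> coverage) {
        long common = 0, total = 0;
        for (Map<String, Set<String>> byTest : coverage.values()) {
            Set<String> itemsJ = byTest.getOrDefault(tj, Set.of());
            Set<String> itemsK = byTest.getOrDefault(tk, Set.of());
            Set<String> intersection = new HashSet<>(itemsJ);
            intersection.retainAll(itemsK);                    // items of tj also covered by tk
            common += intersection.size();
            total += itemsJ.size();
        }
        return total == 0 ? 0.0 : (double) common / total;    // equation (1)
    }

    /** Suite redundancy: items of tj also covered by the rest of the suite, per equation (2). */
    static double suiteRedundancy(String tj, Set<String> suite,
                                  Map<String, Map<String, Set<String>>> coverage) {
        long common = 0, total = 0;
        for (Map<String, Set<String>> byTest : coverage.values()) {
            Set<String> itemsJ = byTest.getOrDefault(tj, Set.of());
            Set<String> rest = new HashSet<>();
            for (String t : suite) {
                if (!t.equals(tj)) rest.addAll(byTest.getOrDefault(t, Set.of()));
            }
            Set<String> intersection = new HashSet<>(itemsJ);
            intersection.retainAll(rest);
            common += intersection.size();
            total += itemsJ.size();
        }
        return total == 0 ? 0.0 : (double) common / total;    // equation (2)
    }
}
```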
strategy, test cases which need more time to execute have higher priority as redundancy candidates. However, we believe that at the unit testing level the execution time of test cases is not as important as other smells, such as being eager.
After picking an appropriate test artifact, the tester can use the PR values of that test with respect to the other tests. This information guides the tester to inspect the source code of that test case and compare it with the source code of the tests with higher PR values. Without this information, manual inspection would take much more of the testers' time, since they may not have any idea how to find another test against which to compare the source code.
As discussed in Section 4, the main reason for needing human knowledge is to cover the shortcomings of coverage-based redundancy detection. Therefore, testers should be thoroughly familiar with these shortcomings and attempt to cover them.
After redundancy analysis, the test is identified as redundant or not. If it is detected as redundant by the tester (Step 3), the system removes it from the original test set (Step 4). At this point, the whole collaborative process between the system and the tester should be repeated. Removing one test from the test suite changes the value of CoveredItems_i(TS − {t_j}) in (2). Therefore, the system should recalculate the Suite Redundancy metric for all of the available tests (Step 5). In Section 6 we show how removing a redundant test detected by the tester and recalculating the redundancy information can help the tester not to be misled by the initial redundancy information and can reduce the required effort of the tester.
The stopping condition of this process depends on the tester's discretion. To find this stopping point, the tester needs to compare the cost of the process with the savings in test maintenance costs resulting from test redundancy detection. The process cost at any point can be measured by the time and effort that testers have spent in the process.
Test maintenance tasks have two types of costs which should be estimated: (1) costs incurred by updating (synchronizing) test code and SUT code, and (2) costs due to fixing integrity problems in the test suite (e.g., one of two test cases testing the same SUT feature fails, while the other passes). Having redundant tests can lead testers to update more than one test for each modification. Secondly, as a result of having redundant tests, the test suites would suffer from integrity issues, since the tester might have missed updating all the relevant tests.
To estimate the above two cost factors, one might perform change impact analysis on the SUT, and subsequently effort-prediction analysis (using techniques such as [23]) on SUT versus test code changes.
To decide on the stopping point of the process, a tester would need to measure the process costs spent so far and also to estimate the maintenance costs comprising both of the above-discussed cost factors. By comparing them, he/she may decide either to stop or to continue the proposed process.
At the outset of this work, we have not systematically analyzed the above cost factors. As discussed before, we suggest that testers inspect as many of the tests with the value SR = 1 as possible. However, given the high number
Table 1: Size measures of Allelogram (the SUT).
SLOC | Number of classes | Number of methods
3,296 | 57 | 323

Table 2: Size metrics of the Allelogram test suite.
Number of test packages | Number of test classes | Number of test methods
6 | 21 | 82
6. Case Study
6.1. Performing the Proposed Process. We used Allelogram [24], an open-source SUT developed in Java, as the object of our case study. Allelogram is a program for processing genomes and is used by biological scientists [24]. Table 1 shows the size measures of this system.
The unit test suite of Allelogram is also available through its project website [24] and is developed in JUnit. Table 2 lists the size metrics of its test suite. As the lowest implemented test level in JUnit is the test method, we applied our redundancy detection process at the test method level in this SUT.
As the first step of the proposed redundancy detection process, coverage metrics are measured. For this purpose, we used the CodeCover tool [20] in our experiment. This tool is an open-source coverage tool written in Java supporting
[Figure: the TeReDetect tool, showing sortable lists of test methods with their redundancy ratio, number of covered items, and set sizes, plus a check box to mark a test as redundant; a companion chart plots these measures against the number of test methods inspected (in order).]
Cardinality | Mutation score
82 | 51%
28 | 20%
71 | 51%
7. Discussion
7.1. Effectiveness and Precision. Let us recall the main purpose of reducing the number of test cases in a test suite (Section 1): decreasing the cost of software maintenance. Thus, if the proposed methodology turns out to be very time consuming, it will not be worthwhile to apply.
Although the best way to increase the efficiency of the process is to automate all required tasks, at this step we suppose that it is not practical to automate all of them. Thus, as we discuss next, human knowledge is currently needed in this process.
To perform manual inspection on a test suite with the purpose of finding redundancy, testers need to spend time and effort on each test's source code and compare them together. To decrease the amount of required effort, we have devised the proposed approach in a way that reduces the number of tests that need to be inspected (by using the suite redundancy metric). Our process also provides useful information, such as the pair redundancy metric, to help testers find other appropriate tests to compare with the test under inspection.
We believe that by using the above information, the efficiency of test redundancy detection has been improved. This improvement was seen in our case study: before having our process, we spent on average more than 15 minutes on each test method of the Allelogram test suite, while inspecting them using the proposed process took on average less than 5 minutes per test method (the reason for the time reduction is that in the latter case we knew which other test methods to compare with the current test). Since only one human subject (tester) performed the above two approaches, different parts of the Allelogram test suite were analyzed in each approach to avoid bias (due to learning and gaining familiarity) in the time measurement.
However, the above results are based on our preliminary experiment, and they are thus inadequate to provide a general picture of the efficiency of the process. For a more systematic analysis in that direction, both time and effort should be measured more precisely with more than one
Full automation: low cost; benefit: imprecise reduced set.
Full manual: high cost; benefit: precise reduced set.
Semiautomated: mid cost; benefit: precise reduced set.
test suite and also a subset of the SUT. This functionality of TeReDetect increases the scalability of the tool to a great extent by making it possible to divide the process of redundancy detection into separate parts and assign each part to a tester. However, precise teamwork communication is required to make the whole process successful.
The flexible stopping point of the proposed process is another reason for its scalability. At the tester's discretion, the process of redundancy detection may stop after analyzing a subset of the test cases or continue for all existing tests. For instance, in huge systems, considering the cost of redundancy detection, a project manager may decide to analyze only the critical part of the system.
7.4. Threats to Validity
7.4.1. External Validity. Two issues limit the generalization of our results. The first one is the subject representativeness of our case study. In this paper, the process was carried out by the first author (a graduate student). More than one subject should take part in this process so that their results can be compared with each other. Also, this subject knew the exact objective of the study, which is a threat to the result. The second issue is the object program representativeness. We performed the process and evaluated the results on one SUT (Allelogram). More objects should be used in experiments to improve the result. Also, our SUT is a random project chosen from the open source community. Other industrial programs with different characteristics may exhibit different test redundancy behavior.
7.4.2. Internal Validity. The results regarding the efficiency and precision of the proposed process might stem from other factors that we did not control or measure. For instance, the bias and knowledge of the tester while trying to find redundancy can be such a factor.
Acknowledgments
The authors were supported by the Discovery Grant no.
341511-07 from the Natural Sciences and Engineering
Research Council of Canada (NSERC). V. Garousi was
further supported by the Alberta Ingenuity New Faculty
Award no. 200600673.
References
[1] S. G. Eick, T. L. Graves, A. F. Karr, J. S. Marron, and A. Mockus, "Does code decay? Assessing the evidence from change management data," IEEE Transactions on Software Engineering, vol. 27, no. 1, pp. 1–12, 2001.
[2] D. L. Parnas, "Software aging," in Proceedings of the International Conference on Software Engineering (ICSE '94), pp. 279–287, Sorrento, Italy, May 1994.
[3] B. V. Rompaey, B. D. Bois, and S. Demeyer, "Improving test code reviews with metrics: a pilot study," Tech. Rep., Lab
[4]
[5]
[6]
[7]
[8]
[9]
[10]
[11]
[12]
[13]
[14]
[15]
[16]
[17]
[18]
[19]
[20]
[21]
[22] B. V. Rompaey, B. D. Bois, S. Demeyer, and M. Rieger, "On the detection of test smells: a metrics-based approach for general fixture and eager test," IEEE Transactions on Software Engineering, vol. 33, no. 12, pp. 800–816, 2007.
[23] L. C. Briand and J. Wust, "Modeling development effort in object-oriented systems using design properties," IEEE Transactions on Software Engineering, vol. 27, no. 11, pp. 963–986, 2001.
[24] C. Manaster, Allelogram, August 2008, http://code.google.com/p/allelogram/.
[25] R. A. DeMillo, R. J. Lipton, and F. G. Sayward, "Hints on test data selection: help for the practicing programmer," IEEE Computer, vol. 11, no. 4, pp. 34–41, 1978.
[26] R. G. Hamlet, "Testing programs with the aid of a compiler," IEEE Transactions on Software Engineering, vol. 3, no. 4, pp. 279–290, 1977.
[27] J. H. Andrews, L. C. Briand, Y. Labiche, and A. S. Namin, "Using mutation analysis for assessing and comparing testing coverage criteria," IEEE Transactions on Software Engineering, vol. 32, no. 8, pp. 608–624, 2006.
[28] J. H. Andrews, L. C. Briand, and Y. Labiche, "Is mutation an appropriate tool for testing experiments?" in Proceedings of the 27th International Conference on Software Engineering (ICSE '05), pp. 402–411, 2005.
[29] B. Smith and L. Williams, MuClipse, December 2008, http://muclipse.sourceforge.net/.
[30] J. Offutt, Y. S. Ma, and Y. R. Kwon, MuJava, December 2008, http://cs.gmu.edu/offutt/mujava/.
[31] Y. S. Ma and J. Offutt, "Description of method-level mutation operators for Java," December 2005, http://cs.gmu.edu/offutt/mujava/mutopsMethod.pdf.
[32] Y. S. Ma and J. Offutt, "Description of class mutation operators for Java," December 2005, http://cs.gmu.edu/offutt/mujava/mutopsClass.pdf.
Research Article
A Strategy for Automatic Quality Signing and Verification
Processes for Hardware and Software Testing
Mohammed I. Younis and Kamal Z. Zamli
School of Electrical and Electronics, Universiti Sains Malaysia, 14300 Nibong Tebal, Malaysia
Correspondence should be addressed to Mohammed I. Younis, younismi@gmail.com
Received 14 June 2009; Revised 4 August 2009; Accepted 20 November 2009
Academic Editor: Phillip Laplante
Copyright 2010 M. I. Younis and K. Z. Zamli. This is an open access article distributed under the Creative Commons
Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is
properly cited.
We propose a novel strategy to optimize the test suite required for testing both hardware and software in a production line. The strategy is based on two processes: a Quality Signing Process and a Quality Verification Process. Unlike earlier work, the proposed strategy is based on the integration of black box and white box techniques in order to derive an optimum test suite during the Quality Signing Process. In this case, the generated optimal test suite significantly improves the Quality Verification Process. Considering both processes, the novelty of the proposed strategy is the fact that the optimization and reduction of the test suite is performed by selecting only mutant-killing test cases from cumulative t-way test cases. As such, the proposed strategy can potentially enhance the quality of the product with minimal cost in terms of overall resource usage and execution time. As a case study, this paper describes the step-by-step application of the strategy to testing a 4-bit Magnitude Comparator Integrated Circuit in a production line. Comparatively, our results demonstrate that the proposed strategy outperforms the traditional block partitioning strategy, with a mutant score of 100% versus 90%, respectively, with the same number of test cases.
1. Introduction
In order to ensure acceptable quality and reliability of any embedded engineering product, many input parameters as well as software/hardware configurations need to be tested for conformance. If the number of input combinations is large, exhaustive testing is next to impossible due to the combinatorial explosion problem.
As an illustration, consider the following small-scale product, a 4-bit Magnitude Comparator IC. Here, the Magnitude Comparator IC has 8 bits of input and 3 bits of output. It is clear that each IC requires 256 test cases for exhaustive testing. Assuming that each test case takes one second to run and be observed, the testing time for each IC is 256 seconds. If there is a need to test one million chips, the testing process will take more than 8 years using a single test line.
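For concreteness, the arithmetic behind this estimate (and the parallel-testing figure used below) is:

$$10^6 \text{ chips} \times 256 \text{ s/chip} = 2.56 \times 10^8 \text{ s} \approx 8.1 \text{ years},$$

$$\frac{2.56 \times 10^8 \text{ s}}{14 \text{ days} \times 86{,}400 \text{ s/day}} \approx 211.6, \text{ i.e., about 212 parallel test lines for a two-week delivery}.$$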
Now, let us assume that we received an order to deliver one million qualified (i.e., tested) chips within two weeks. As an option, we can do parallel testing. However, parallel testing can be expensive due to the need for 212 testing lines. Now, what if there are simultaneous multiple orders? Here, as
Parameter 1 | Parameter 2 | Parameter 3 | Parameter 4
Netscape | Windows XP | LAN | Sis
IE | Windows VISTA | PPP | Intel
Firefox | Windows 2008 | ISDN | VIA
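As a generic illustration of pairwise coverage over the four parameters above (not the authors' own test suite), the 3^4 = 81 possible configurations can be covered 2-way by just nine tests, for example the standard L9 orthogonal array:

1. Netscape, Windows XP, LAN, Sis
2. Netscape, Windows VISTA, PPP, Intel
3. Netscape, Windows 2008, ISDN, VIA
4. IE, Windows XP, PPP, VIA
5. IE, Windows VISTA, ISDN, Sis
6. IE, Windows 2008, LAN, Intel
7. Firefox, Windows XP, ISDN, Intel
8. Firefox, Windows VISTA, LAN, VIA
9. Firefox, Windows 2008, PPP, Sis

Every pair of values from any two parameters appears in at least one of these nine tests.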
2. Related Work
Mandl was the first researcher to use pairwise coverage in the software industry. In his work, Mandl adopts orthogonal Latin squares for testing an Ada compiler [11]. Berling and Runeson use interaction testing to identify real and false targets in a target identification system [12]. Lazic and Velasevic employed interaction testing for modeling and simulation of an automated target-tracking radar system [13]. White has also applied the technique to test graphical user interfaces (GUIs) [14]. Other applications of interaction testing include regression testing through the graphical user interface [15] and fault localization [16, 17]. While earlier work has indicated that pairwise testing (i.e., based on 2-way interaction of variables) can be effective in detecting most faults in a typical software system, a counter argument suggests that such a conclusion cannot be generalized to all software system faults. For example, a test set that covers all possible pairs of variable values can typically detect 50% to 75% of the faults in a program [18–20]. In other works it is found that 100% of
3. Proposed Strategy
The proposed strategy consists of two processes, namely, the Test Quality Signing (TQS) process and the Test Verification (TV) process. Briefly, the TQS process deals with optimizing the selection of the test suite for fault injection as well as performing the actual injection, whilst the TV process analyzes for conformance (see Figure 1).
As implied earlier, the TQS process aims to derive an effective and optimum test suite and works as follows.
(1) Start with an empty Optimized Test Suite (OTS) and an empty Signing Vector (SV).
(2) Select the desired software class (for software testing). Alternatively, build an equivalent software class for the Circuit Under Test (CUT) (for hardware testing).
(3) Store these faults in a fault list (FL).
[Figure 1: (a) the quality signing process, involving the system specification, the OTS, and the SV; (b) the quality verification process, involving the OTS, the SUT, the VV, the SV, and a test-failed outcome.]

4. Case Study
t | Live mutants | Killed mutants | Mutant score (%)
1 | 15 | 65 | 81.25
2 | 5 | 75 | 93.75
3 | 2 | 78 | 97.50
4 | 0 | 80 | 100.00
#TC | Input | Output (g1 g2 g3) | Cumulative killed mutants
1 | FFFFFFFF | FTF | 53
2 | TTTTTTTT | FTF | 65
3 | FTTTTTTT | FFT | 68
4 | TTFTFTFT | TFF | 71
5 | TTFFTFTT | TFF | 72
6 | TTTFTTFF | TFF | 75
7 | TTFTTTTF | FFT | 77
8 | FFTTTTTF | FFT | 78
9 | TFTTTFTF | TFF | 80
[Figure: the 4-bit magnitude comparator, with inputs a0–a3 and b0–b3, internal modules m1–m4, and outputs g1 (A > B), g2 (A = B), and g3 (A < B).]
Figure 3: Equivalent class Java program for the 4-bit magnitude comparator.
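As an illustration of what such an equivalent class might look like, here is a minimal Java sketch (the class and method names are assumptions, not the authors' code); faults injected into this class then stand in for faults in the circuit under test:

```java
/** Illustrative sketch of an equivalent class for the 4-bit magnitude comparator. */
public class MagnitudeComparator4Bit {

    /**
     * Compares two 4-bit values given as boolean bits (index 0 = least significant bit)
     * and returns {g1, g2, g3} = {A > B, A == B, A < B}.
     */
    public boolean[] compare(boolean[] a, boolean[] b) {
        int valueA = toInt(a);
        int valueB = toInt(b);
        return new boolean[] { valueA > valueB, valueA == valueB, valueA < valueB };
    }

    // Converts a 4-bit boolean array into its integer value.
    private int toInt(boolean[] bits) {
        int value = 0;
        for (int i = 3; i >= 0; i--) {
            value = (value << 1) | (bits[i] ? 1 : 0);
        }
        return value;
    }
}
```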
5. Comparison
In this section, we demonstrate the possible test reduction using the block partitioning approach [1, 37] for comparison purposes. Here, the partitions could be two 4-bit numbers with block values = 0, 0 < x < 15, and = 15, so that 9 test cases would give all-combination coverage. In this case, we have chosen x = 7 as a representative value. Additionally, we have also run a series of 9 tests where x is chosen at random between 0 and 15. The results of the generated test cases and their corresponding cumulative faults detected are tabulated in Tables 4 and 5, respectively.
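For instance, with x = 7 the nine all-combination block tests are the (A, B) pairs drawn from {0, 7, 15} × {0, 7, 15}: (0, 0), (0, 7), (0, 15), (7, 0), (7, 7), (7, 15), (15, 0), (15, 7), and (15, 15).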
Referring to Tables 4 and 5, we observe that the block partitioning techniques achieved a mutant score of 90%. For comparative purposes, it should be noted that our proposed strategy achieved a mutant score of 100% with the same number of test cases.
6. Conclusion
In this paper, we present a novel strategy for automatic quality signing and verification for both hardware and software testing. Our case study on a hardware production line demonstrated that the proposed strategy could improve
Acknowledgments
The authors acknowledge the help of Jeff Offutt, Jeff Lei, Raghu Kacker, Rick Kuhn, Myra B. Cohen, and Sudipto Ghosh for providing them with useful comments and the background materials. This research is partially funded by the USM Post Graduate Research Grant "T-Way Test Data Generation Strategy Utilizing Multicore System", the USM GRID project "The Development and Integration of Grid Services & Applications", and the fundamental research grant "Investigating Heuristic Algorithm to Address Combinatorial Explosion Problem" from the Ministry of Higher Education (MOHE). The first author, Mohammed I. Younis, is a USM fellowship recipient.
References
[1] M. Grindal, J. Offutt, and S. F. Andler, "Combination testing strategies: a survey," Tech. Rep. ISE-TR-04-05, GMU, July 2004.
[2] M. I. Younis, K. Z. Zamli, and N. A. M. Isa, "Algebraic strategy to generate pairwise test set for prime number parameters and variables," in Proceedings of the International Symposium on Information Technology (ITSim '08), vol. 4, pp. 1662–1666, IEEE Press, Kuala Lumpur, Malaysia, August 2008.
[3] M. I. Younis, K. Z. Zamli, and N. A. M. Isa, "IRPS: an efficient test data generation strategy for pairwise testing," in Proceedings of the 12th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems (KES '08), vol. 5177 of Lecture Notes in Computer Science, pp. 493–500, 2008.
[4] Jenny tool, June 2009, http://www.burtleburtle.net/bob/math/.
[5] TVG tool, June 2009, http://sourceforge.net/projects/tvg/.
[6] Y. Lei, R. Kacker, D. R. Kuhn, V. Okun, and J. Lawrence, "IPOG: a general strategy for T-way software testing," in Proceedings of the International Symposium and Workshop on Engineering of Computer Based Systems, pp. 549–556, Tucson, Ariz, USA, March 2007.
[7] Y. Lei, R. Kacker, D. R. Kuhn, V. Okun, and J. Lawrence, "IPOG/IPOG-D: efficient test generation for multi-way combinatorial testing," Software Testing, Verification and Reliability, vol. 18, no. 3, pp. 125–148, 2008.
[8] M. Forbes, J. Lawrence, Y. Lei, R. N. Kacker, and D. R. Kuhn, "Refining the in-parameter-order strategy for constructing covering arrays," Journal of Research of the National Institute of Standards and Technology, vol. 113, no. 5, pp. 287–297, 2008.
[9] R. C. Bryce and C. J. Colbourn, "A density-based greedy algorithm for higher strength covering arrays," Software Testing, Verification and Reliability, vol. 19, no. 1, pp. 37–53, 2009.
[10] M. I. Younis, K. Z. Zamli, and N. A. M. Isa, "A strategy for grid based T-Way test data generation," in Proceedings of the 1st IEEE International Conference on Distributed Frameworks and Applications (DFmA '08), pp. 73–78, Penang, Malaysia, October 2008.
[11] R. Mandl, "Orthogonal Latin squares: an application of experiment design to compiler testing," Communications of the ACM, vol. 28, no. 10, pp. 1054–1058, 1985.
[12] T. Berling and P. Runeson, "Efficient evaluation of multifactor dependent system performance using fractional factorial design," IEEE Transactions on Software Engineering, vol. 29, no. 9, pp. 769–781, 2003.
[13] L. Lazic and D. Velasevic, "Applying simulation and design of experiments to the embedded software testing process," Software Testing, Verification and Reliability, vol. 14, no. 4, pp. 257–282, 2004.
[14] L. White and H. Almezen, "Generating test cases for GUI responsibilities using complete interaction sequences," in Proceedings of the International Symposium on Software Reliability Engineering (ISSRE '00), pp. 110–121, IEEE Computer Society, San Jose, Calif, USA, 2000.
[15] A. M. Memon and M. L. Soffa, "Regression testing of GUIs," in Proceedings of the 9th Joint European Software Engineering Conference (ESEC) and the 11th SIGSOFT Symposium on the Foundations of Software Engineering (FSE-11), pp. 118–127, ACM, September 2003.
[16] C. Yilmaz, M. B. Cohen, and A. A. Porter, "Covering arrays for efficient fault characterization in complex configuration spaces," IEEE Transactions on Software Engineering, vol. 32, no. 1, pp. 20–34, 2006.
Research Article
Automated Test Case Prioritization with Reactive GRASP
Camila Loiola Brito Maia, Rafael Augusto Ferreira do Carmo, Fabrício Gomes de Freitas,
Gustavo Augusto Lima de Campos, and Jerffeson Teixeira de Souza
Optimization in Software Engineering Group (GOES.UECE), Natural and Intelligent Computing Lab (LACONI),
State University of Ceará (UECE), Avenue Paranjana 1700, Fortaleza, 60740-903 Ceará, Brazil
Correspondence should be addressed to Camila Loiola Brito Maia, camila.maia@gmail.com
Received 15 June 2009; Revised 17 September 2009; Accepted 14 October 2009
Academic Editor: Phillip Laplante
Copyright 2010 Camila Loiola Brito Maia et al. This is an open access article distributed under the Creative Commons
Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is
properly cited.
Modifications in software can affect some functionality that had been working until that point. In order to detect such a problem, the ideal solution would be to test the whole system once again, but there may be insufficient time or resources for this approach. An alternative solution is to order the test cases so that the most beneficial tests are executed first, in such a way that only a subset of the test cases needs to be executed with little loss of effectiveness. Such a technique is known as regression test case prioritization. In this paper, we propose the use of the Reactive GRASP metaheuristic to prioritize test cases. We also compare this metaheuristic with other search-based algorithms previously described in the literature. Five programs were used in the experiments. The experimental results demonstrated good coverage performance with some time overhead for the proposed technique. They also demonstrated a high stability of the results generated by the proposed approach.
1. Introduction
More often than not, when a system is modified, the modifications may affect some functionality that had been working until that point in time. Due to the unpredictability of the effects that such modifications may cause to the system's functionalities, it is recommended to test the system, as a whole or partially, once again every time a modification takes place. This is commonly known as regression testing. Its purpose is to guarantee that the software modifications have not affected the functions that were working previously.
A test case is a set of tests performed in a sequence and related to a test objective [1], and a test suite is a set of test cases that will execute sequentially. There are basically two ways to perform regression tests. The first one is by reexecuting all test cases in order to test the entire system once again. Unfortunately, and usually, there may not be sufficient resources to allow the reexecution of all test cases every time a modification is introduced. Another way to perform regression testing is to order the test cases with respect to their beneficial factor to some attribute, such as coverage, and reexecute the test cases according to that ordering. In doing this, the most beneficial test cases would be executed first,
2. Related Work
This section reports the use of search-based prioritization approaches and metaheuristics. Some algorithms implemented in [6] by Li et al., which will have their performance compared to that of the approach proposed later in this paper, will also be described.
2.1. Search-Based Prioritization Approaches. The works below employed search-based prioritization approaches, such as greedy- and metaheuristic-based solutions.
Elbaum et al. [10] analyze several prioritization techniques and provide responses as to which technique is more suitable for specific test scenarios and their conditions. The APFD metric is calculated through a greedy heuristic. Rothermel et al. [2] describe a technique that incorporates a Greedy algorithm called Optimal Prioritization, which considers the known faults of the program; the test cases are ordered using the fault detection rates. Walcott et al. [8] propose a test case prioritization technique with a genetic algorithm which reorders test suites based on testing time constraints and code coverage. This technique significantly outperformed the other prioritization techniques described in the paper, improving the APFD by 120% on average over the others.
Yoo and Harman [9] describe a Pareto approach to prioritize test case suites based on multiple objectives, such as code coverage, execution cost, and fault-detection history. The objective is to find an array of decision variables (test case ordering) that maximizes an array of objective functions. Three algorithms were compared: a reformulation of a Greedy algorithm (the Additional Greedy algorithm), the Non-Dominated Sorting Genetic Algorithm (NSGA-II) [11], and a variant of NSGA-II, vNSGA-II. For two objective functions, a genetic algorithm outperformed the Additional Greedy algorithm, but for some programs the Additional Greedy algorithm produced the best results. For three objective functions, the Additional Greedy algorithm had reasonable performance.
Li et al. [6] compare five algorithms: the Greedy algorithm, which adds test cases that achieve the maximum value for the coverage criterion; the Additional Greedy algorithm, which adds test cases that achieve the maximum coverage not already consumed by a partial solution; the 2-Optimal algorithm, which selects the two test cases that consume the maximum coverage together; Hill Climbing, which performs local search in a defined neighborhood; and the genetic algorithm, which generates new test cases based on previous ones. The authors separated test suites into 1,000 small suites of size 8–155 and 1,000 large suites of size 228–4,350. Six C programs were used
suites, is randomly generated. The procedure then works, until a stopping criterion is reached, by generating new populations based on the previous one [13]. The evolution from one population to the next is performed via genetic operators, including the selection operation, that is, the biased choice of which individuals of the current population will reproduce to generate individuals for the new population. This selection prioritizes individuals with a high fitness value, which represents how good the solution is. The other two genetic operators are crossover, that is, the combination of individuals to produce the offspring, and mutation, which randomly changes a particular individual.
In the genetic algorithm proposed by Li et al. [6], the initial population is produced by selecting test cases randomly from the test case pool. The fitness function is based on the test case position in the current test suite. The fitness value is calculated as follows:
$$\mathit{fitness}(\mathit{pos}) = \frac{2(\mathit{pos} - 1)}{n - 1}, \qquad (2)$$
where pos is the test case's position in the current test suite and n is the population size.
The crossover algorithm follows the ordering chromosome crossover style adopted by Antoniol [14] and used by Li et al. [6] for the genetic algorithm in their experiments. It works as follows. Let p1 and p2 be the parents, and let o1 and o2 be the offspring. A random position k is selected; the first k elements of p1 become the first k elements of o1, and the last n − k elements of o1 are the n − k elements of p2 which remain when the k elements selected from p1 are removed from p2. In the same way, the first k elements of p2 become the first k elements of o2, and the last n − k elements of o2 are the n − k elements of p1 which remain when the k elements selected from p2 are removed from p1. The mutation is performed by randomly exchanging the positions of two test cases.
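As an illustration of this ordering crossover and swap mutation, a minimal Java sketch is shown below (the class and method names are illustrative, not Li et al.'s implementation):

```java
import java.util.*;

/** Illustrative sketch of the ordering crossover and swap mutation described above. */
public class OrderingOperators {

    /** Produces offspring o1 from parents p1 and p2 using a random cut point k. */
    static List<Integer> crossover(List<Integer> p1, List<Integer> p2, Random rnd) {
        int k = rnd.nextInt(p1.size() + 1);
        List<Integer> o1 = new ArrayList<>(p1.subList(0, k));   // first k elements of p1
        for (int gene : p2) {                                    // remaining elements, in p2's order
            if (!o1.contains(gene)) o1.add(gene);
        }
        return o1;
    }

    /** Swap mutation: randomly exchanges the positions of two test cases. */
    static void mutate(List<Integer> individual, Random rnd) {
        int i = rnd.nextInt(individual.size());
        int j = rnd.nextInt(individual.size());
        Collections.swap(individual, i, j);
    }
}
```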
2.2.4. Simulated Annealing. Simulated annealing is a generalization of a Monte Carlo method. Its name comes from annealing in metallurgy, where a melt, initially disordered at high temperature, is slowly cooled with the purpose of obtaining a more organized system (a local optimum solution). The system approaches a frozen ground state as T = 0 is reached. Each step of the simulated annealing algorithm replaces the current solution by a random solution in its neighborhood, chosen with a probability that depends on the energies of the two solutions.
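The acceptance probability commonly used for this step is the Metropolis criterion; the following minimal Java sketch shows that rule under the usual assumptions (a generic illustration, not the authors' specific parameter choices):

```java
import java.util.Random;

/** Illustrative sketch of the simulated annealing acceptance step (Metropolis criterion). */
public class AnnealingStep {

    /** Returns true if a neighbor with the given energy should replace the current solution. */
    static boolean accept(double currentEnergy, double neighborEnergy,
                          double temperature, Random rnd) {
        if (neighborEnergy <= currentEnergy) {
            return true;                              // better (or equal) solutions are always accepted
        }
        double p = Math.exp((currentEnergy - neighborEnergy) / temperature);
        return rnd.nextDouble() < p;                  // worse solutions accepted with probability e^(-dE/T)
    }
}
```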
variation named Reactive GRASP [15, 16] has been proposed. This approach performs GRASP while varying the values of α according to their previous performance. In practice, Reactive GRASP will initially determine a set of possible values for α. Each value will have a probability of being selected in each iteration.
Initially, all probabilities are assigned to 1/n, where n is the number of α values. For each one of the i values of α, the probabilities p_i are reevaluated at each iteration according to the following equation:

$$p_i = \frac{q_i}{\sum_{j=1}^{n} q_j}. \qquad (3)$$
(1) solution ← ∅;
(2) initialize the candidate set C with random test cases from the pool of test cases;
(3) evaluate the coverage c(e) for all e ∈ C;
(4) while C ≠ ∅ do
(5)     c_min = min{c(e) | e ∈ C};
(6)     c_max = max{c(e) | e ∈ C};
(7)     RCL = {e ∈ C | c(e) ≥ c_min + α(c_max − c_min)};
(8)     s ← a test case from the RCL, chosen at random;
(9)     solution ← solution ∪ {s};
(10)    update C;
(11)    reevaluate c(e) for all e ∈ C;
(12) end;
(13) update Set(solution);
(14) return solution;
Algorithm 2: Selection of α.
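For concreteness, the following minimal Java sketch implements the construction phase above together with a roulette-wheel selection of α following equation (3); the class and method names are illustrative assumptions, not the authors' implementation.

```java
import java.util.*;

/** Illustrative sketch of the GRASP construction phase with reactive selection of alpha. */
public class GraspSketch {

    /** Roulette-wheel selection of alpha: index i is chosen with probability q_i / sum(q). */
    static int selectAlphaIndex(double[] q, Random rnd) {
        double total = 0;
        for (double v : q) total += v;
        double r = rnd.nextDouble() * total;
        for (int i = 0; i < q.length; i++) {
            r -= q[i];
            if (r <= 0) return i;
        }
        return q.length - 1;
    }

    /** Greedy randomized construction: repeatedly picks a random element of the RCL. */
    static List<Integer> construct(List<Integer> pool, Map<Integer, Double> coverage,
                                   double alpha, Random rnd) {
        List<Integer> solution = new ArrayList<>();
        List<Integer> candidates = new ArrayList<>(pool);
        while (!candidates.isEmpty()) {
            double cMin = Double.POSITIVE_INFINITY, cMax = Double.NEGATIVE_INFINITY;
            for (int e : candidates) {
                double c = coverage.get(e);
                cMin = Math.min(cMin, c);
                cMax = Math.max(cMax, c);
            }
            double threshold = cMin + alpha * (cMax - cMin);
            List<Integer> rcl = new ArrayList<>();
            for (int e : candidates) {
                if (coverage.get(e) >= threshold) rcl.add(e);   // restricted candidate list
            }
            Integer chosen = rcl.get(rnd.nextInt(rcl.size()));
            solution.add(chosen);
            candidates.remove(chosen);
            // A full implementation would reevaluate c(e) here to reflect the
            // coverage already achieved by the partial solution.
        }
        return solution;
    }
}
```

In each GRASP iteration, selectAlphaIndex would choose which α to use, and the q values would then be updated according to the quality of the solutions each α has produced, as Reactive GRASP prescribes.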
4. Empirical Evaluation
In order to evaluate the performance of the proposed approach, a series of empirical tests was executed. More specifically, the experiments were designed to answer the following question.
(1) How does the Reactive GRASP approach compare, in terms of coverage and time performance, to other search-based algorithms, including the Greedy algorithm, the Additional Greedy algorithm, the genetic algorithm, and simulated annealing?
In addition to this result, the experiments can confirm results previously described in the literature, including the performance of the Greedy algorithm.
Program | LoC | Blocks | Decisions
Print tokens | 726 | 126 | 123
Print tokens2 | 570 | 103 | 154
Schedule | 412 | 46 | 56
Schedule2 | 374 | 53 | 74
Space | 9,564 | 869 | 1,068
the APDC, a 0 for the first decision means that the first decision is not covered by the test suite, a 1 for the second decision means that the second decision is covered by the test suite, and so on.
All experiments were performed on Ubuntu Linux workstations with kernel 2.6.22-14, a Core Duo processor, and 1 GB of main memory. The programs used in the experiment were implemented using the Java programming language.
4.2. Results. The results are presented in Tables 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, and 18 and Figures 2 to 17, separating the four small programs from the space program. Tables 2, 3, 4, and 5 detail the average over 10 executions of the coverage percentage achieved for each coverage criterion and each algorithm for printtokens, printtokens2, schedule, and schedule2, respectively. Table 12 has this information regarding the space program. The TSSp column is the percentage of test cases selected from the test case pool. The mean differences in execution time in seconds are also presented in Tables 6 and 16, for the small programs and space, respectively.
Tables 7 and 14 show the weighted average for the metrics (APBC, APDC, and APSC) for each algorithm. Figures 2 to 17 provide a comparison among the algorithms for the metrics APBC, APDC, and APSC, for the small programs and the space program.
4.3. Analysis. Analyzing the results obtained from the experiments, which are detailed in Tables 2, 3, 4, 5, and 9 and summarized in Tables 6 and 13, several relevant results can be pointed out. First, the Additional Greedy algorithm had the best performance in effectiveness of all tests. It performed significantly better than the Greedy algorithm, the genetic algorithm, and simulated annealing, both for the four small programs and for the space program. The good performance of the Additional Greedy algorithm had already been demonstrated in several works, including Li et al. [6] and Yoo and Harman [9].
4.3.1. Analysis for the Four Small Programs. The Reactive GRASP algorithm had the second best performance. This approach also significantly outperformed the Greedy algorithm, the genetic algorithm, and simulated annealing, considering the coverage results. When compared to the Additional Greedy algorithm, there were no significant differences in terms of coverage. Comparing the metaheuristic-based approaches, the better performance obtained by the Reactive GRASP algorithm over the genetic algorithm and simulated annealing was clear.
In 168 experiments, the genetic algorithm generated a better coverage only once (block criterion, the schedule program, and 100% of tests being considered). The two algorithms also tied once. For all other tests, the Reactive GRASP outperformed the genetic algorithm. The genetic algorithm approach performed the fourth best in our evaluation. In Li et al. [6], the genetic algorithm was also worse than the Additional Greedy algorithm. The results
Greedy
Additional Greedy
Genetic Algorithm
Simulated Annealing
Reactive GRASP
1%
2%
96.6591
98.3209
98.235
99.2101
96.6893
98.3954
96.1242
98.3113
97.808
99.0552
3%
5%
98.6763
98.5054
99.5519
99.6909
98.5483
98.8988
98.5553
98.9896
99.3612
99.5046
10%
20%
30%
98.2116
98.266
98.3855
99.8527
99.9317
99.9568
99.2378
99.2378
99.6603
99.3898
99.6414
99.6879
99.7659
99.8793
99.9204
40%
50%
60%
98.3948
98.4064
98.4097
99.9675
99.9747
99.979
99.7829
99.8321
99.8666
99.736
99.8213
99.8473
99.9457
99.9627
99.9622
70%
80%
98.4133
98.4145
99.9818
99.9841
99.8538
99.8803
99.8698
99.8657
99.9724
99.9768
90%
100%
98.4169
98.418
99.9859
99.9873
99.9013
99.9001
99.8958
99.8895
99.9783
99.9775
Decision Coverage %
TSSp
Greedy
Additional Greedy
Genetic Algorithm
Simulated Annealing
Reactive GRASP
1%
2%
96.7692
98.0184
98.3836
99.1429
96.9125
97.9792
95.9213
98.2299
98.1204
98.8529
3%
5%
10%
98.5569
98.4898
98.1375
99.4499
99.6971
99.8462
98.3785
98.7105
98.8659
98.0762
98.7513
99.1759
99.2886
99.4631
99.697
20%
30%
98.2486
98.3131
99.928
99.952
99.3886
99.587
99.5111
99.6955
99.8668
99.9061
40%
50%
60%
98.3388
98.3437
98.358
98.3388
99.9712
99.9766
99.7137
99.7305
99.817
99.7505
99.78
99.8235
99.9237
99.9386
99.959
70%
80%
98.3633
98.3651
99.9799
99.9821
99.8109
99.8631
99.7979
99.8447
99.9543
99.9663
90%
100%
98.4169
98.418
99.9859
99.9873
99.9013
99.9001
99.8541
99.869
99.9783
99.9775
Statement Coverage %
TSSp
Greedy
Additional Greedy
Genetic Algorithm
Simulated Annealing
Reactive GRASP
1%
2%
3%
97.2989
97.7834
98.0255
98.3561
99.2557
99.4632
97.0141
98.0175
98.5163
97.251
98.576
98.5633
98.0439
98.9675
99.2356
5%
10%
97.8912
97.8137
99.6826
99.8534
98.5167
99.1497
99.0268
99.3131
99.4431
99.681
20%
30%
40%
98.0009
98.0551
98.0661
99.9264
99.954
99.9656
99.5024
99.6815
99.7342
99.5551
99.7151
99.7677
99.8554
99.9079
99.9296
50%
60%
98.0705
98.0756
99.9724
99.9773
99.8123
99.8348
99.8108
99.8456
99.9464
99.9598
70%
80%
90%
98.0887
98.088
98.0924
99.9805
99.9831
99.985
99.8641
99.89
99.9026
99.8633
99.8649
99.8819
99.9704
99.9682
99.9709
100%
98.0943
99.9865
99.8998
99.8897
99.977
Block Coverage%
TSSp
Greedy
Additional Greedy
Genetic Algorithm
Simulated Annealing
Reactive GRASP
1%
2%
97.233
98.3869
98.3518
99.2665
97.6629
98.6723
98.042
98.8302
98.1576
99.208
3%
5%
97.9525
98.1407
99.5122
99.711
98.8576
99.2379
99.1817
99.3382
99.3274
99.5932
10%
20%
30%
98.131
98.01
98.0309
99.8564
99.9293
99.9535
99.5558
99.7894
99.8269
99.6731
99.8015
99.839
99.7994
99.8689
99.9239
40%
50%
60%
98.0462
98.0569
98.0589
99.9656
99.9727
99.977
99.8602
99.9166
99.9165
99.8957
99.9106
99.9269
99.9495
99.9653
99.9689
70%
80%
98.0611
98.0632
99.9805
99.9828
99.9264
99.9383
99.9236
99.9261
99.9756
99.9778
90%
100%
98.0663
98.0671
99.9849
99.9864
99.9543
99.9562
99.9385
99.9434
99.9796
99.9811
Decision Coverage %
TSSp
Greedy
Additional Greedy
Genetic Algorithm
Simulated Annealing
Reactive GRASP
1%
2%
97.05
98.6637
98.3055
99.2489
97.2108
98.4368
97.6375
98.5244
98.0859
99.0987
3%
5%
10%
98.5798
98.5903
98.5673
99.5496
99.7371
99.8628
98.8814
99.3676
99.5183
99.0411
99.2289
99.6528
99.4344
99.6772
99.8118
20%
30%
98.6351
98.6747
99.9353
99.9593
99.7939
99.8615
99.8084
99.8405
99.913
99.9482
40%
50%
60%
98.6837
96.1134
98.6948
99.9692
99.9552
99.9795
99.8836
99.8269
99.9181
99.8802
99.8992
99.9109
99.9556
99.9318
99.9751
70%
80%
98.6964
98.6985
99.9826
99.9848
99.9358
99.9478
99.9302
99.931
99.9768
99.9788
90%
100%
98.0663
98.0671
99.9849
99.9864
99.9543
99.9562
99.9409
99.9424
99.9796
99.9811
Statement Coverage %
TSSp
Greedy
Additional Greedy
Genetic Algorithm
Simulated Annealing
Reactive GRASP
1%
2%
3%
97.4458
98.7611
98.3634
98.3742
99.2552
99.4745
97.8234
98.755
98.9385
97.453
98.5444
98.9165
98.2804
99.0653
99.3279
5%
10%
97.8694
98.0271
99.6856
99.8494
99.1507
99.5268
99.3858
99.6258
99.5327
99.7906
20%
30%
40%
98.1264
97.9467
97.9653
99.927
99.9518
99.9645
99.7455
99.8533
99.8833
99.7283
99.8297
99.864
99.9086
99.9328
99.9564
50%
60%
97.9762
97.9792
99.9717
99.9768
99.9126
99.9162
99.8891
99.905
99.9584
99.9644
70%
80%
90%
97.9851
97.9854
97.9877
99.9803
99.9827
99.9847
99.9265
99.9399
99.9399
99.9156
99.9187
99.9288
99.9708
99.9759
99.9789
100%
97.9894
99.9863
99.9477
99.9262
99.9791
Greedy
Additional Greedy
Genetic Algorithm
Simulated Annealing
Reactive GRASP
1%
2%
96.6505
96.5053
98.2286
99.0237
98.2286
98.9499
97.9387
98.8596
98.2286
99.0073
3%
5%
96.451
95.6489
99.3315
99.5652
99.2336
99.2481
99.1955
99.4066
99.2445
99.3233
10%
20%
30%
95.2551
95.9548
95.8225
99.767
99.8884
99.9224
99.5586
99.7604
99.8219
99.6455
99.7497
99.8442
99.7013
99.8589
99.8918
40%
50%
60%
96.0783
96.3159
96.9283
99.9429
99.9553
99.9644
99.8995
99.9051
99.918
99.8982
99.899
99.9156
99.9163
99.9396
99.9546
70%
80%
97.0744
97.0955
99.9695
99.9733
99.9235
99.9464
99.9322
99.9411
99.9643
99.9649
90%
100%
97.1171
97.0495
99.9763
99.9786
99.9474
99.9573
99.946
99.9454
99.9704
99.7013
Decision Coverage %
TSSp
Greedy
Additional Greedy
Genetic Algorithm
Simulated Annealing
Reactive GRASP
1%
2%
96.3492
95.9838
98.2671
99.0566
98.0952
98.7129
98.0092
98.7364
98.2407
98.9218
3%
5%
10%
95.933
95.1047
94.8611
99.3303
99.5327
99.7668
98.9107
98.6224
99.2237
98.9575
99.0856
99.3422
99.2589
99.2427
99.6159
20%
30%
94.75
95.3616
99.8739
99.9241
99.4858
99.7047
99.6266
99.7181
99.7749
99.8757
40%
50%
60%
95.3396
96.1134
96.3241
99.9413
99.9552
99.9627
99.7944
99.8269
99.852
99.7871
99.8515
99.8541
99.9144
99.9318
99.9416
70%
80%
96.5465
96.9312
99.968
99.9722
99.8673
99.88
99.8927
99.8868
99.9586
99.9553
90%
100%
97.1171
97.0495
99.9763
99.9786
99.9474
99.9573
99.9077
99.9118
99.9704
99.9701
Statement Coverage %
TSSp
Greedy
Additional Greedy
Genetic Algorithm
Simulated Annealing
Reactive GRASP
1%
2%
3%
96.747
97.0323
96.937
98.1768
99.039
99.3284
98.0792
98.8108
99.1366
98.1911
98.8664
99.1927
98.1596
99.0273
99.2257
5%
10%
96.3181
96.1091
99.5751
99.782
99.2731
99.452
99.4252
99.6635
99.4398
99.6428
20%
30%
40%
96.9909
97.2931
97.0724
99.8945
99.9307
99.9471
99.7965
99.8703
99.9003
99.8168
99.8683
99.8983
99.8693
99.9112
99.9358
50%
60%
97.4288
97.4015
99.9584
99.9653
99.9214
99.932
99.9146
99.9281
99.9445
99.9594
70%
80%
90%
97.6458
97.8832
97.8907
99.9707
99.9748
99.9777
99.9374
99.9399
99.9496
99.931
99.9273
99.9471
99.9653
99.9722
99.9653
100%
97.8901
99.9799
99.9627
99.9494
99.978
Block Coverage %
TSSp
Greedy
Additional Greedy
Genetic Algorithm
Simulated Annealing
Reactive GRASP
1%
2%
97.2708
98.2538
98.1199
99.0566
98.066
98.9605
98.167
99.0325
98.1064
99.036
3%
5%
98.6447
99.4678
99.3764
99.6184
99.3304
99.5851
99.3464
99.5879
99.3534
99.6184
10%
20%
30%
98.2116
99.9056
99.9385
99.8527
99.907
99.9385
99.2378
99.8952
99.9348
99.7869
99.893
99.9267
99.7659
99.907
99.9385
40%
50%
60%
99.9538
99.963
99.9692
99.9538
99.963
99.9692
99.9476
99.9586
99.9676
99.9418
99.9535
99.9612
99.9538
99.963
99.9692
70%
80%
99.9736
99.9769
99.9736
99.9769
99.9702
99.972
99.9584
99.9641
99.9736
99.9769
90%
100%
99.9794
99.9815
99.9794
99.9815
99.9779
99.9796
99.9735
99.9701
99.9794
99.9815
Decision Coverage %
TSSp
Greedy
Additional Greedy
Genetic Algorithm
Simulated Annealing
Reactive GRASP
1%
2%
95.6563
96.1375
98.3687
98.9533
97.9922
98.2113
98.1129
98.5404
98.3301
98.8501
3%
5%
10%
95.5965
97.6887
97.1277
99.3111
99.6164
99.7985
98.4344
99.058
99.4385
98.9122
99.2189
99.4873
99.0936
99.4773
99.7057
20%
30%
97.2249
97.2647
99.9027
99.9352
99.7033
99.8177
99.7575
99.8224
99.8713
99.9126
40%
50%
60%
97.2726
97.2823
97.2869
99.9513
99.9712
99.9676
99.8145
99.8745
99.8827
99.8673
99.8907
99.9143
99.9144
99.9411
99.9584
70%
80%
97.2981
97.3005
99.9722
99.9756
99.915
99.9311
99.9013
99.915
99.9595
99.9695
90%
100%
99.9794
99.9815
99.9794
99.9815
99.9779
99.9796
99.9304
99.9297
99.9794
99.9815
Statement Coverage %
TSSp
Greedy
Additional Greedy
Genetic Algorithm
Simulated Annealing
Reactive GRASP
1%
2%
3%
97.7116
97.4612
97.1499
98.2883
99.1097
99.336
98.1984
98.9346
98.9259
98.0316
98.6235
99.0397
98.2777
99.0208
99.1481
5%
10%
97.7227
98.3422
99.6029
99.8072
99.3066
99.6104
99.428
99.6295
99.5114
99.734
20%
30%
40%
98.4317
98.474
98.4861
99.9014
99.9363
99.9525
99.7765
99.8543
99.8892
99.7866
99.8455
99.8833
99.8815
99.9074
99.9441
50%
60%
98.4988
98.5041
99.962
99.9683
99.9159
99.9251
99.9055
99.9123
99.9568
99.9626
70%
80%
90%
98.5109
98.512
98.5166
99.9728
99.9762
99.9788
99.9345
99.9429
99.9549
99.9278
99.9291
99.9463
99.9663
99.9725
99.9757
100%
98.521
99.981
99.9583
99.9453
99.9783
Algorithm (y)
Greedy Algorithm
Additional Greedy
Algorithm
Genetic Algorithm
Simulated Annealing
Reactive GRASP
Coverage Difference Significance (t-test)
0.0000
0.0000
0.0000
0.0000
0.0000
0.0000
0.0000
0.1876
0.0000
0.0000
0.4918
0.0000
0.0000
0.0000
0.4918
0.0000
0.0000
0.1876
0.0000
0.0000
Weighted average for each algorithm (APBC, APDC, APSC), small programs:
Genetic Algorithm: 99.8825, 99.8406, 99.8743
Simulated Annealing: 99.8863, 99.8417, 99.8706
Reactive GRASP: 99.9335, 99.9368, 99.9417
Table 8: Difference in performance between the best and worst criteria, small programs.
Greedy Algorithm: 0.4739
Additional Greedy Algorithm: 0.0302
Genetic Algorithm: 0.0419
Simulated Annealing: 0.0446
Reactive GRASP: 0.0082
Final average for each algorithm (all metrics), small programs:
Greedy Algorithm: 98.0435
Genetic Algorithm: 99.8658
Simulated Annealing: 99.8662
Reactive GRASP: 99.9373
Table 10: Standard deviation of the effectiveness for the four algorithms, small programs.
Greedy Algorithm: 0.002371
Genetic Algorithm: 0.000222
Simulated Annealing: 0.000226
Reactive GRASP: 0.000041
Summary of coverage performance and execution time:
Greedy Algorithm: the worst coverage performance; fast execution.
Additional Greedy Algorithm: the best coverage performance of all; fast execution.
Genetic Algorithm: fourth best coverage performance; medium execution time.
Simulated Annealing: third best coverage performance; fast execution.
Reactive GRASP: second best coverage performance; slow execution.
Block Coverage % (program space)
TSSp: 1%, 5%, 10%, 20%, 30%, 40%, 50%
Greedy: 87.4115, 85.8751, 85.5473, 86.5724, 86.9639, 87.3629, 87.8269
Additional Greedy: 96.4804, 98.5599, 99.1579, 99.6063, 99.7423, 99.811, 99.842
Genetic Algorithm: 92.6728, 94.8614, 95.9604, 98.0118, 98.5998, 98.9844, 99.1271
Simulated Annealing: 91.4603, 94.9912, 96.7242, 97.991, 98.6937, 98.9004, 99.216
Reactive GRASP: 95.6961, 98.0514, 98.6774, 99.4235, 99.6431, 99.7339, 99.7755

Decision Coverage % (program space)
TSSp: 1%, 5%, 10%, 20%, 30%, 40%, 50%
Greedy: 88.753, 85.5131, 86.9345, 87.9909, 88.4008, 88.6799, 88.6635
Additional Greedy: 96.9865, 98.553, 99.1999, 99.6074, 99.7464, 99.8074, 99.8476
Genetic Algorithm: 91.6811, 93.6639, 95.9172, 98.0217, 98.4662, 98.9283, 99.0786
Simulated Annealing: 92.0529, 94.9256, 96.6152, 97.7348, 98.5373, 98.8599, 98.84
Reactive GRASP: 96.4502, 97.8443, 98.358, 99.2446, 99.3256, 99.7149, 99.7469

Statement Coverage % (program space)
TSSp: 1%, 5%, 10%, 20%, 30%, 40%, 50%
Greedy: 92.8619, 90.9306, 91.3637, 91.7803, 92.1344, 92.1866, 92.2787
Additional Greedy: 97.7642, 99.1171, 99.5086, 99.7598, 99.8473, 99.8859, 99.9117
Genetic Algorithm: 94.3287, 95.7946, 97.5863, 98.6129, 99.0048, 99.3106, 99.4053
Simulated Annealing: 93.5957, 96.4218, 97.7154, 98.6336, 99.2151, 99.2963, 99.4852
Reactive GRASP: 97.0516, 98.4031, 99.3172, 99.6214, 99.6555, 99.8365, 99.8517
Table 13: Coverage Significance and Time Mean Difference, Program Space.
Algorithms compared                                      Mean coverage difference (%)   Significance (t-test)   Time mean difference (s)
Greedy Algorithm vs Additional Greedy Algorithm          10.5391                        0.0000                  16.643
Greedy Algorithm vs Genetic Algorithm                    9.4036                         0.0000                  495.608
Greedy Algorithm vs Simulated Annealing                  9.4459                         0.0000                  5.339
Greedy Algorithm vs Reactive GRASP                       10.3639                        0.0000                  36,939.589
Additional Greedy Algorithm vs Genetic Algorithm         1.1354                         0.0000                  478.965
Additional Greedy Algorithm vs Simulated Annealing       1.0931                         0.0000                  11.303
Additional Greedy Algorithm vs Reactive GRASP            0.1752                         0.0613                  36,922.945
Genetic Algorithm vs Simulated Annealing                 0.0423                         0.4418                  490.268
Genetic Algorithm vs Reactive GRASP                      0.9602                         0.0000                  36,443.980
Simulated Annealing vs Reactive GRASP                    0.9180                         0.0000                  36,934.249
Average coverage per criterion, Program Space: Genetic Algorithm 98.4650, 98.3631, 98.9375; Simulated Annealing 98.53273, 98.33361, 99.02625; Reactive GRASP 99.5424, 99.4221, 99.6819.
simulated annealing, and Reactive GRASP algorithms significantly outperformed the Greedy algorithm. Comparing Reactive GRASP with the other two metaheuristic-based approaches, the better performance obtained by the Reactive GRASP algorithm over the genetic algorithm and simulated annealing is clear. The Reactive GRASP algorithm was followed by the simulated annealing approach, which performed the third best in our evaluation; the fourth best result was obtained by the genetic algorithm.
Figures 10, 11, and 12 present a comparison between the five algorithms used in the experiments for the space program. Based on these figures, it is possible to conclude that the best performance was that of the Additional Greedy algorithm, followed by the Reactive GRASP algorithm. Reactive GRASP surpassed the genetic algorithm, simulated annealing, and the Greedy algorithm. One difference between the results for the space program and the small programs is that the Additional Greedy algorithm was better for all criteria, while, for the small programs, Reactive GRASP had the best results for the APDC criterion. Another difference is the required execution time. As the size of the
Difference in performance between the best and worst criteria, Program Space: Greedy Algorithm 4.8956; Additional Greedy Algorithm 0.1300; Genetic Algorithm 0.5744; Simulated Annealing 0.6926; Reactive GRASP 0.2598.
Table 16: Average for Each Algorithm (All Metrics), Program Space. Greedy Algorithm 89.1849; Genetic Algorithm 98.5885; Simulated Annealing 98.6308; Reactive GRASP 99.5488.
Figure 3: APDC (Average Percentage Decision Coverage), Comparison among Algorithms for Small Programs.
Figure 4: APSC (Average Percentage Statement Coverage), Comparison among Algorithms for Small Programs.
Table 17: Standard Deviation of the Effectiveness for the Four Algorithms, Program Space. Greedy Algorithm 0.025599; Genetic Algorithm 0.003065; Simulated Annealing 0.003566; Reactive GRASP 0.001300.
Execution time (Program Space): Greedy Algorithm, fast; Additional Greedy Algorithm, fast; Genetic Algorithm, medium; Simulated Annealing, fast; Reactive GRASP, slow.
In terms of time, as expected, the use of global approaches, such as both metaheuristic-based algorithms evaluated here, adds an overhead to the process. Considering time efficiency, one can see from Tables 6 and 13 that the Greedy algorithm performed more efficiently than all other algorithms. This algorithm was, on average, 1.491 seconds faster than the Additional Greedy algorithm, 8.436 seconds faster than the genetic algorithm, 0.057 seconds faster than simulated annealing, and almost 50 seconds faster than the Reactive GRASP approach, for the small programs. In terms of relative values, Reactive GRASP was 61.53 times slower than Additional Greedy, 11.68 times slower than the genetic algorithm, 513.87 times slower than simulated annealing, and 730.92 times slower than the Greedy algorithm. This result demonstrates, once again, the great performance obtained by the Additional Greedy algorithm compared to that of the Greedy algorithm, since it was significantly better performance-wise and achieved these results with a very similar execution time. At the other end of the spectrum was the Reactive GRASP algorithm, which performed on average 48.456 seconds slower than the Additional Greedy algorithm and 41.511 seconds slower than the genetic algorithm. In favor of both metaheuristic-based approaches is the fact that one may calibrate the time
Figure 10: APBC (Average Percentage Block Coverage), Comparison among Algorithms for Program Space.
Figure 11: APDC (Average Percentage Decision Coverage), Comparison among Algorithms for Program Space.
Figure 12: APSC (Average Percentage Statement Coverage), Comparison among Algorithms for Program Space.
Research Article
A Proposal for Automatic Testing of GUIs
Based on Annotated Use Cases
Pedro Luis Mateo Navarro,1,2 Diego Sevilla Ruiz,1,2 and Gregorio Martínez Pérez1,2
1 Departamento
2 Departamento
1. Introduction
It is well known that testing the correctness of a Graphical User Interface (GUI) is difficult for several reasons [1]. One of those reasons is that the space of possible interactions with a GUI is enormous, which leads to a large number of GUI states that have to be properly tested (a related problem is to determine the coverage of a set of test cases); the large number of possible GUI states results in a large number of input permutations that have to be considered. Another reason is that validating the GUI state is not straightforward, since it is difficult to define which objects (and what properties of these objects) have to be verified.
This paper describes a new approach that sits between Model-less and Model-Based Testing. The approach describes a GUI test case autogeneration process based on a set of use cases (which are used to describe the GUI behavior) and the annotation (definition of values, validation rules, etc.) of the relevant GUI elements. The process automatically generates all the possible test cases depending on the values defined during the annotation process and incorporates
2. Related Work
Model-Based GUI Testing approaches can be classified
depending on the amount of GUI details that are included
in the model. By GUI details we mean the elements which
are chosen by the Coverage Criteria to faithfully represent the
tested GUI (e.g., window properties, widget information and
properties, GUI metadata, etc.).
Many approaches usually choose all window and widget properties in order to build a highly descriptive model of the GUI. For example, in [1] (Xie and Memon) and in [3, 4] (Memon et al.), a process based on GUI Ripping is described, a method which traverses all the windows of the GUI and analyses all the events and elements that may appear in order to build a model automatically. That model is composed of a set of graphs which represent all the GUI elements (a tree called the GUI Forest), all the GUI events, and their interaction (Event-Flow Graphs (EFG) and Event Interaction Graphs (EIG)). At the end of the model building process, the model has to be verified, fixed, and completed manually by the developers.
Once the model is built, the process automatically explores all the possible test cases. Of those, the developers select the set of test cases identified as meaningful, and the Oracle Generator creates the expected output (a Test Oracle [5] is a mechanism which generates the outputs that a product should have, for determining, after a comparison process, whether the product has passed or failed a test, e.g., a previously stored state that has to be met in future test executions; Test Oracles may also be based on a set of rules, related to the product, that have to be validated during test execution). Finally, test cases are automatically executed and their output is compared with the Oracle's expected results.
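To make the scalability concern discussed next more concrete, the following is a minimal sketch, not taken from the cited tools, of an event-flow-graph style model in Java; the class and event names are hypothetical, and it only illustrates how the number of event sequences explodes with the sequence length.

import java.util.*;

// Minimal sketch of an event-flow graph: nodes are GUI events and an edge
// (a, b) means event b may directly follow event a. Enumerating sequences
// of length k touches roughly b^k paths for branching factor b.
public class EventFlowGraphSketch {

    private final Map<String, Set<String>> follows = new HashMap<>();

    public void addEdge(String from, String to) {
        follows.computeIfAbsent(from, k -> new LinkedHashSet<>()).add(to);
    }

    // Enumerates all event sequences of exactly the given length.
    public List<List<String>> sequences(String start, int length) {
        List<List<String>> result = new ArrayList<>();
        walk(start, new ArrayList<>(List.of(start)), length, result);
        return result;
    }

    private void walk(String current, List<String> path, int length,
                      List<List<String>> result) {
        if (path.size() == length) {
            result.add(new ArrayList<>(path));
            return;
        }
        for (String next : follows.getOrDefault(current, Set.of())) {
            path.add(next);
            walk(next, path, length, result);
            path.remove(path.size() - 1);
        }
    }

    public static void main(String[] args) {
        EventFlowGraphSketch efg = new EventFlowGraphSketch();
        efg.addEdge("openDialog", "typeName");
        efg.addEdge("openDialog", "clickCancel");
        efg.addEdge("typeName", "clickOk");
        efg.addEdge("typeName", "clickCancel");
        System.out.println(efg.sequences("openDialog", 3));
    }
}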
As said in [6], the primary problem with these approaches is that as the number of GUI elements increases, the number of event sequences grows exponentially. Another problem is that the model has to be verified, fixed, and completed manually by the testers, which is itself a tedious and error-prone process. These problems lead to further problems, such as scalability and tolerance to modifications. In these techniques, adding a new GUI element (e.g., a new widget or event) has two worrying side effects. First, it may cause the set of generated test cases to grow exponentially (all paths are explored); second, it forces a GUI Model update (and a manual verification and completion) and the regeneration of all affected test cases.
Other approaches use more restrictive coverage criteria in order to focus the test case autogeneration efforts on only a section of the GUI, which usually includes all the relevant elements to be tested. In [7] Vieira et al. describe a method in which enriched UML Diagrams (UML Use Cases and Activity Diagrams) are used to describe which functionalities should be tested and how to test them. The diagrams are enriched in two ways. First, the UML Activity Diagrams are refined to improve the accuracy; second, these diagrams are annotated by using custom UML Stereotypes representing additional test requirements. Once the model is built, an automated process generates test cases from these enriched UML diagrams. In [8] Paiva et al. also describe a UML Diagrams-based model. In this case, however, the model is translated to a formal specification.
The scalability of this approach is better than that of the previously mentioned ones because it focuses its efforts only on a section of the model. The diagram refinement also helps to reduce the number of generated test cases. On the other hand, some important limitations make this approach less suitable for certain scenarios. The building, refining, and annotation processes require a considerable effort since they
(Figure: annotation process. States and actions include "GUI ready", "widget interacted", "perform widget action", "set value constraints", "set validation rules", "set enabled", "Assert Oracle", "State Oracle", and "store annotations"; the flow also covers the case where the annotation is cancelled.)
4. Annotation Process
The annotation process is the process by which the tester
indicates what GUI elements are important in terms of the
following: First, which values can a GUI element hold (i.e.,
a new set of values or a range), and thus should be tested;
second, what constraints should be met by a GUI element
at a given time (i.e., validation rules), and thus should be
validated. The result of this process is a set of annotated
GUI elements which will be helpful during the test case
autogeneration process in order to identify the elements that
represent a variation point, and the constraints that have to
be met for a particular element or set of elements. From now
on, this set will be called Annotation Test Case.
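As a rough illustration only, and not the authors' implementation, the following Java sketch shows the kind of information an Annotation Test Case could carry for each annotated GUI element; all class, field, and widget names are hypothetical.

import java.util.*;

// Sketch of the data gathered by the annotation process: for every relevant
// widget, the candidate values that represent a variation point and the
// validation rules that must hold for it.
public class AnnotationTestCaseSketch {

    public static final class AnnotatedWidget {
        final String widgetId;
        final List<String> candidateValues;
        final List<String> validationRules;

        AnnotatedWidget(String widgetId, List<String> candidateValues,
                        List<String> validationRules) {
            this.widgetId = widgetId;
            this.candidateValues = List.copyOf(candidateValues);
            this.validationRules = List.copyOf(validationRules);
        }
    }

    private final List<AnnotatedWidget> widgets = new ArrayList<>();

    public void annotate(String widgetId, List<String> values, List<String> rules) {
        widgets.add(new AnnotatedWidget(widgetId, values, rules));
    }

    public List<AnnotatedWidget> widgets() {
        return Collections.unmodifiableList(widgets);
    }

    public static void main(String[] args) {
        AnnotationTestCaseSketch testCase = new AnnotationTestCaseSketch();
        testCase.annotate("interestRateBox", List.of("2", "2.5", "3"),
                List.of("value >= 2 && value <= 3"));
        testCase.annotate("depositAmountBox", List.of("1000", "5000", "10000"),
                List.of("value >= 1000 && value <= 10000"));
        System.out.println(testCase.widgets().size() + " annotated widgets");
    }
}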
This process could be implemented, for example, using a capture and replay (C&R) tool (a Capture and
(Figure: test case autogeneration flow. Starting from the test suite, an annotation test case is generated; if new validation rules, range values, or elements have to be added, the annotation step is revisited; the annotated test suite then feeds the test case auto-generation step, which produces the auto-generated annotated test suite.)
(i) Assert Oracles. These oracles check whether a set of validation rules related to a widget is met or not. Therefore, the tester needs to somehow define a set of validation rules. As said in Section 4, corresponding to the annotation process, defining these rules is not straightforward. Expressive and flexible (e.g., constraint or script) languages are needed to allow the tester to define assert rules for the properties of the annotated widget and, possibly, of other widgets. Another important pitfall is that if the GUI encounters an error, it may reach an unexpected or inconsistent state. Further executing the test case is then useless; therefore it is necessary to have some mechanism to detect these bad states and stop the test case execution (e.g., a special statement which indicates that the execution and validation process have to finish if an error is detected).
(ii) State Oracles. These oracles check whether the state of the widget during the execution process matches the state stored during the annotation process. To implement this functionality, the system needs to know how to extract the state from the widgets, represent it somehow, and be able to check it for validity. In our approach, it could be implemented using widget adapters which, for example, could represent the state of a widget as a string; the validation would then be as simple as a string comparison. A minimal sketch of both oracle kinds is given after this list.
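The following is a minimal sketch of the two oracle kinds just described, assuming a hypothetical Widget abstraction; it is not the implementation used in the paper.

import java.util.function.Predicate;

// Hypothetical widget abstraction: exposes properties and a serialized state.
interface Widget {
    String getProperty(String name);
    String captureState();
}

// Assert oracle: checks a validation rule against the annotated widget and
// signals that further execution is useless when the rule fails.
class AssertOracle {
    private final String description;
    private final Predicate<Widget> rule;

    AssertOracle(String description, Predicate<Widget> rule) {
        this.description = description;
        this.rule = rule;
    }

    boolean check(Widget widget) {
        boolean ok = rule.test(widget);
        if (!ok) {
            System.err.println("Assert oracle failed: " + description);
        }
        return ok;
    }
}

// State oracle: compares the state captured during execution with the state
// stored during the annotation process (here, a plain string comparison).
class StateOracle {
    private final String expectedState;

    StateOracle(String expectedState) {
        this.expectedState = expectedState;
    }

    boolean check(Widget widget) {
        return expectedState.equals(widget.captureState());
    }
}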
The validation process may additionally be completed with Crash Oracles, which perform an application-level validation (as opposed to widget-level validation), as they can detect crashes during test case execution. These oracles are used to signal and identify serious problems in the software; they are very useful in the first steps of the development process.
Finally, it is important to remember that there are two important limitations when using test oracles in GUI testing [5]. First, GUI events have to be deterministic in order to be able to predict their outcome (e.g., it would not make sense if the process validated a property which depends on a random value); second, since the software back-end is not modeled (e.g., data in a database), the GUI may return an unexpected state which would be detected as an error (e.g., if the process is validating the output of a database query application and the content of the database changes during the process).
7. Example
In order to show this process working on a real example, we
have chosen a fixed-term deposit calculator application. This
example application has a GUI (see Figure 4) composed of a
set of widgets: a menu bar, three number boxes (two integer
and one double), two buttons (one to validate the values and
another to operate with them), and a label to output the
obtained result. Obviously, there are other widgets in that
GUI (i.e., a background panel, text labels, a main window,
etc.), but these elements are not of interest for the example.
(Figure: annotated widgets (AW) of the example. One widget has no validation rules and the candidate values {15, 25}; another has the candidate values {1, 2} and the rules "assert(widget.color == red)" and "if (value == 2) assert(widget.text == 2)".)
The valid values for the number boxes are the following.
(i) Interest Rate. Assume that the interest rate imposed by the bank is between 2 and 3 percent (both included).
(ii) Deposit Amount. Assume that the initial deposit amount has to be greater than or equal to 1000 and no more than 10,000.
(Fragment of the generated test cases and their validation points: TC 0 (500); TC 1 (500, 12); TC 2 (500, 24); ...; TC 97 (8000, 12); TC 98 (8000, 24).)
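As a purely illustrative sketch, and not the tool described in the paper, the following Java code generates every combination of the annotated candidate values (the widget names and values below are hypothetical); each resulting combination corresponds to one generated test case such as those listed above.

import java.util.*;

// Builds the cartesian product of the candidate value sets: every
// combination of values becomes one generated test case.
public class TestCaseGeneratorSketch {

    public static List<Map<String, String>> generate(Map<String, List<String>> valuesPerWidget) {
        List<Map<String, String>> cases = new ArrayList<>();
        cases.add(new LinkedHashMap<>());
        for (Map.Entry<String, List<String>> entry : valuesPerWidget.entrySet()) {
            List<Map<String, String>> expanded = new ArrayList<>();
            for (Map<String, String> partial : cases) {
                for (String value : entry.getValue()) {
                    Map<String, String> next = new LinkedHashMap<>(partial);
                    next.put(entry.getKey(), value);
                    expanded.add(next);
                }
            }
            cases = expanded;
        }
        return cases;
    }

    public static void main(String[] args) {
        Map<String, List<String>> annotatedValues = new LinkedHashMap<>();
        annotatedValues.put("depositAmount", List.of("500", "1000", "8000", "10000"));
        annotatedValues.put("interestRate", List.of("2", "2.5", "3"));
        List<Map<String, String>> testCases = generate(annotatedValues);
        for (int i = 0; i < testCases.size(); i++) {
            System.out.println("TC " + i + ": " + testCases.get(i));
        }
    }
}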
Acknowledgments
This paper has been partially funded by the Cátedra SAES of the University of Murcia initiative, a joint effort between Sociedad Anónima de Electrónica Submarina (SAES), http://www.electronica-submarina.com/, and the University of Murcia to work on open-source software, and real-time and critical information systems.
References
[1] Q. Xie and A. M. Memon, "Model-based testing of community-driven open-source GUI applications," in Proceedings of the 22nd IEEE International Conference on Software Maintenance (ICSM '06), pp. 203–212, Los Alamitos, Calif, USA, 2006.
[2] P. Mateo, D. Sevilla, and G. Martínez, "Automated GUI testing validation guided by annotated use cases," in Proceedings of the 4th Workshop on Model-Based Testing (MoTes '09) in Conjunction with the Annual National Conference of German Association for Informatics (GI '09), Lübeck, Germany, September 2009.
[3] A. Memon, I. Banerjee, and A. Nagarajan, "GUI ripping: reverse engineering of graphical user interfaces for testing," in Proceedings of the 10th IEEE Working Conference on Reverse Engineering (WCRE '03), pp. 260–269, Victoria, Canada, November 2003.
[4] A. Memon, I. Banerjee, N. Hashmi, and A. Nagarajan, "DART: a framework for regression testing nightly/daily builds of GUI applications," in Proceedings of the IEEE International Conference on Software Maintenance (ICSM '03), pp. 410–419, 2003.
[5] Q. Xie and A. M. Memon, "Designing and comparing automated test oracles for GUI-based software applications," ACM Transactions on Software Engineering and Methodology, vol. 16, no. 1, p. 4, 2007.
[6] X. Yuan and A. M. Memon, "Using GUI run-time state as feedback to generate test cases," in Proceedings of the 29th International Conference on Software Engineering (ICSE '07), Minneapolis, Minn, USA, May 2007.
[7] M. Vieira, J. Leduc, B. Hasling, R. Subramanyan, and J. Kazmeier, "Automation of GUI testing using a model-driven approach," in Proceedings of the International Workshop on Automation of Software Test, pp. 9–14, Shanghai, China, 2006.
[8] A. Paiva, J. Faria, and R. Vidal, "Towards the integration of visual and formal models for GUI testing," Electronic Notes in Theoretical Computer Science, vol. 190, pp. 99–111, 2007.
[9] A. Memon and Q. Xie, "Studying the fault-detection effectiveness of GUI test cases for rapidly evolving software," IEEE Transactions on Software Engineering, vol. 31, no. 10, pp. 884–896, 2005.
[10] R. S. Zybin, V. V. Kuliamin, A. V. Ponomarenko, V. V. Rubanov, and E. S. Chernov, "Automation of broad sanity test generation," Programming and Computer Software, vol. 34, no. 6, pp. 351–363, 2008.
[11] Object Management Group, "Object Constraint Language (OCL), version 2.0," OMG document formal/2006-05-01, 2006, http://www.omg.org/spec/OCL/2.0/.
[12] Y. Matsumoto, "Ruby Scripting Language," 2009, http://www.ruby-lang.org/en/.
Research Article
AnnaBot: A Static Verifier for Java Annotation Usage
Ian Darwin
8748 10 Sideroad Adjala, RR 1, Palgrave, ON, Canada L0N 1P0
Correspondence should be addressed to Ian Darwin, ian@darwinsys.com
Received 16 June 2009; Accepted 9 November 2009
Academic Editor: Phillip Laplante
Copyright 2010 Ian Darwin. This is an open access article distributed under the Creative Commons Attribution License, which
permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
This paper describes AnnaBot, one of the first tools to verify correct use of Annotation-based metadata in the Java programming
language. These Annotations are a standard Java 5 mechanism used to attach metadata to types, methods, or fields without using
an external configuration file. A binary representation of the Annotation becomes part of the compiled .class file, for inspection
by another component or library at runtime. Java Annotations were introduced into the Java language in 2004 and have become
widely used in recent years due to their introduction in the Java Enterprise Edition 5, the Hibernate object-relational mapping
API, the Spring Framework, and elsewhere. Despite that, mainstream development tools have not yet produced a widely-used
verification tool to confirm correct configuration and placement of annotations external to the particular runtime component.
While most of the examples in this paper use the Java Persistence API, AnnaBot is capable of verifying any annotation-based API for which claims (descriptions of annotation usage) are available. These claims can be written in Java or in a proposed Domain-Specific Language, which has been designed and for which a parser (but not the code generator) has been written.
1. Introduction
1.1. Java Annotations. Java Annotations were introduced into the language in 2004 [1] and have become widely used in recent years, especially since their introduction in the Java Enterprise Edition. Many open source projects, including the Spring [2] and Seam [3] Frameworks and the Hibernate and TopLink ORMs, use annotations. So do many new Sun Java standards, including the Java Standard Edition, the Java Persistence API (an Object-Relational Mapping API), the EJB container, and the Java API for XML Web Services (JAX-WS). Until now there has not been a general-purpose tool for independently verifying correct use of these annotations.
The syntax of Java Annotations is slightly unusual: while they are given class-style names (names begin with a capital letter by convention) and are stored in binary .class files, they may not be instantiated using the new operator. Instead, they are placed by themselves in the source code, preceding the element that is to be annotated (see Figure 1). Annotations may be compile-time or run-time; the latter's binary representation becomes part of the compiled class file, for inspection by another component at run time. Annotations are used by preceding their name with the at sign (@):
@WebService
public class Fred extends Caveman {
    @Override
    public void callHome() {
        // call Wilma here
    }
}
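For readers unfamiliar with run-time annotations, the following self-contained sketch (not from the paper) shows a hypothetical @Audited annotation being declared with run-time retention and then read back through reflection.

import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

public class RuntimeAnnotationSketch {

    // Declared with RUNTIME retention, so the annotation is kept in the
    // compiled .class file and is visible to reflection.
    @Retention(RetentionPolicy.RUNTIME)
    @interface Audited {
        String owner() default "unknown";
    }

    @Audited(owner = "qa-team")
    static class PayrollService { }

    public static void main(String[] args) {
        Audited audited = PayrollService.class.getAnnotation(Audited.class);
        if (audited != null) {
            System.out.println("PayrollService is audited, owner = " + audited.owner());
        }
    }
}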
import javax.persistence.Entity;
import javax.persistence.Id;

claim JPA {
    if (class.annotated(javax.persistence.Entity)) {
        require method.annotated(javax.persistence.Id)
            || field.annotated(javax.persistence.Id);
        atMostOne
            method.annotated(javax.persistence.ANY)
            || field.annotated(javax.persistence.ANY)
        error "The JPA Spec only allows JPA annotations on methods OR fields";
    };
    if (class.annotated(javax.persistence.Embeddable)) {
        noneof method.annotated(javax.persistence.Id)
            || field.annotated(javax.persistence.Id);
    };
}
claim EJB3Type {
    atMostOne
        class.annotated(javax.ejb.Stateless),
        class.annotated(javax.ejb.Stateful)
    error "Class has conflicting top-level EJB annotations";
}
3. Implementation
The basic operation of AnnaBot's use of the Reflection API is shown in class AnnaBot0.java in Figure 6. This demonstration has no configuration input; it simply hard-codes a single claim about the Java Persistence API, namely that only methods or only fields may be JPA-annotated. This version was a small, simple proof of concept and did one thing well.
The Java class under investigation is accessed using Java's built-in Reflection API [9]. There are other reflection-like packages for Java, such as Javassist [10] and the Apache Software Foundation's Byte Code Engineering Language [11]. Use of the standard API avoids dependencies on external APIs and avoids both the original author and potential contributors having to learn an additional API. Figure 6 is a portion of the code from the original AnnaBot0 which determines whether the class under test contains any fields with JPA annotations.
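As a standalone illustration of the kind of reflection-based check just described (using hypothetical names, not AnnaBot's actual API), the following sketch flags a class that carries JPA annotations on both fields and methods.

import java.lang.annotation.Annotation;
import java.lang.reflect.Field;
import java.lang.reflect.Method;

public class JpaPlacementCheckSketch {

    // True if any of the given annotations comes from the javax.persistence package.
    private static boolean hasJpaAnnotation(Annotation[] annotations) {
        for (Annotation annotation : annotations) {
            if (annotation.annotationType().getName().startsWith("javax.persistence.")) {
                return true;
            }
        }
        return false;
    }

    public static void check(Class<?> classUnderTest) {
        boolean fieldHasJpaAnno = false;
        boolean methodHasJpaAnno = false;

        for (Field field : classUnderTest.getDeclaredFields()) {
            if (hasJpaAnnotation(field.getAnnotations())) {
                fieldHasJpaAnno = true;
                break;
            }
        }
        for (Method method : classUnderTest.getDeclaredMethods()) {
            if (hasJpaAnnotation(method.getAnnotations())) {
                methodHasJpaAnno = true;
                break;
            }
        }
        if (fieldHasJpaAnno && methodHasJpaAnno) {
            System.err.println(classUnderTest.getName()
                    + ": JPA annotations found on both fields and methods");
        }
    }
}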
To make the program generally useful, it was necessary to introduce some flexibility into the processing. It was decided to design and implement a Domain-Specific Language [12, 13] to allow declarative rather than procedural specification of additional checking. One will still be able to extend the functionality of AnnaBot using Java, but some will find it more convenient to use the DSL.
The first version of the language uses the Java compiler to convert claim files into a runnable form. Thus, it is slightly
Fragment of AnnaBot0 (Figure 6):

fieldHasJpaAnno = true;
break;

Claim DSL grammar (fragment):

program:
    import stmt*
    CLAIM IDENTIFIER {
        stmt+
    }
    ;
import stmt:
    IMPORT NAMEINPACKAGE ;
    ;
checks: check
    | NOT check
    | ( check OR check )
    | ( check AND check )
    | ( check , check )
    ;
check:
    classAnnotated
    | methodAnnotated
    | fieldAnnotated;
ERROR QSTRING;

Claim written in Java (fragment):

package jpa;
import annabot.Claim;
import tree.*;

public class JPAEntityMethodFieldClaim extends Claim {
(Table fragment: Classes (Entity/Total), Errors, Time (Seconds); recovered values include 7, 22, and 94/156 classes, 0 errors, and times of 0.6, 2.27, and 26.8 seconds.)
5. Future Development
6. Related Research
Relatively little attention has been paid to developing tools that assist in verifying correct use of annotations. Eichberg et al. [21] produced an Eclipse plug-in which uses a different, and I think rather clever, approach: they preprocess the Java class into an XML representation of the byte-code, then use XPath to query for correct use of annotations. This allows them to verify some nonannotation-related attributes of the software's specification conformance. For example, they can check that EJB classes have a no-argument constructor (which AnnaBot can easily do at present). They can also verify that such a class does not create new threads. AnnaBot cannot do this at present, since that analysis requires inspection of the bytecode to check for monitor locking machine instructions. However, this is outside my research's scope of verifying the correct use of annotations. It could be implemented by using one of the non-Sun reflection APIs mentioned in Section 3.
The downside of Eichberg's approach is that all classes in the target system must be preprocessed, whereas AnnaBot simply examines target classes by reflection, making it faster and simpler to use.
Noguera and Pawlak [22] explore an alternate approach. They produce a rather powerful annotation verifier called AVal. However, as they point out, AVal follows the idea that annotations should describe the way in which they should be validated, and that self-validation is expressed by meta-annotations (@Validators). Since my research goal was to explore validation of existing annotation-based APIs provided by Sun, SpringSource, the Hibernate project, and others, I did not pursue investigation of procedures that would have required attempting to convince each API provider to modify their annotations.
JavaCOP by Andreae [23] provides a very comprehensive type checking system for Java programs; it provides several forms of type checking, but goes beyond the use of annotations to provide a complete constraint system.
The work of JSR-305 [24] has been suggested as relevant. JSR-305 is more concerned with specifying new annotations for making assertions about standard Java than with ensuring correct use of annotations in code written to more specialized APIs. As the project describes itself, "This JSR will work to develop standard annotations (such as @NonNull) that can be applied to Java programs to assist tools that detect software defects."
Similarly, the work of JSR-308 [25], an offshoot of JSR-305, has been suggested as relevant, but it is concerned with altering the syntax of Java itself to extend the number of places where annotations are allowed. For example, it would be convenient if annotations could be applied to Java 5+ Type Parameters. This is not allowed by the compilers at present but will be when JSR-308 becomes part of the language, possibly as early as Java 7.
Neither JSR-305 nor JSR-308 provides any support for finding misplaced annotations.
Acknowledgments
Research performed in partial fulfillment of the M.Sc. degree at Staffordshire University. The web site development which led to the idea for AnnaBot was done while under contract to The Toronto Centre for Phenogenomics [17]; however, the software was developed and tested on the author's own time. Ben Rady suggested the use of Javassist in the Compiler. Several anonymous reviewers contributed significantly to the readability and accuracy of this paper.
References
[1] "Java 5 Annotations," November 2009, http://java.sun.com/j2se/1.5.0/docs/guide/language/annotations.html.
[2] "Spring Framework Home Page," October 2009, http://www.springsource.org/.
[3] G. King, "Seam Web/JavaEE Framework," October 2009, http://www.seamframework.org/.
[4] L. Goldschlager, Computer Science: A Modern Introduction, Prentice-Hall, Upper Saddle River, NJ, USA, 1992.
[5] J. Voas et al., "A Testability-based Assertion Placement Tool for Object-Oriented Software," October 1997, http://hissa.nist.gov/latex/htmlver.html.
[6] "Java Programming with Assertions," November 2009, http://java.sun.com/j2se/1.4.2/docs/guide/lang/assert.html.
[7] J. A. Darringer, "The application of program verification techniques to hardware verification," in Proceedings of the Annual ACM IEEE Design Automation Conference, pp. 373–379, ACM, 1988.
[8] "EJB3 and JPA Specifications," November 2009, http://jcp.org/aboutJava/communityprocess/final/jsr220/index.html.
[9] I. Darwin, "The Reflection API," in Java Cookbook, chapter 25, O'Reilly, Sebastopol, Calif, USA, 2004.
[10] "Javassist bytecode manipulation library," November 2009, http://www.csg.is.titech.ac.jp/chiba/javassist/.
[11] "Apache BCEL: Byte Code Engineering Library," November 2009, http://jakarta.apache.org/bcel/.
[12] J. Bentley, "Programming pearls: little languages," Communications of the ACM, vol. 29, no. 8, pp. 711–721, 1986.
[13] I. Darwin, "PageUnit: A Little Language for Testing Web Applications," Staffordshire University report, 2006, http://www.pageunit.org/.
[14] S. Johnson, "YACC: yet another compiler-compiler," Tech. Rep. CSTR-32, Bell Laboratories, Madison, Wis, USA, 1978.
[15] "Open Source Parser Generators in Java," April 2009, http://java-source.net/open-source/parser-generators.
[16] T. Parr, The Definitive ANTLR Reference: Building Domain-Specific Languages, Pragmatic Bookshelf, Raleigh, NC, USA, 2007.
[17] "Toronto Centre for Phenogenomics," April 2009, http://www.phenogenomics.ca/.
[18] Eclipse Foundation, "Eclipse IDE project," November 2009, http://www.eclipse.org/.
Research Article
Software Test Automation in Practice: Empirical Observations
Jussi Kasurinen, Ossi Taipale, and Kari Smolander
Department of Information Technology, Laboratory of Software Engineering, Lappeenranta University of Technology,
P.O. Box 20, 53851 Lappeenranta, Finland
Correspondence should be addressed to Jussi Kasurinen, jussi.kasurinen@lut.fi
Received 10 June 2009; Revised 28 August 2009; Accepted 5 November 2009
Academic Editor: Phillip Laplante
Copyright 2010 Jussi Kasurinen et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
The objective of this industry study is to shed light on the current situation and improvement needs in software test automation. To this end, 55 industry specialists from 31 organizational units were interviewed. In parallel with the survey, a qualitative study was conducted in 12 selected software development organizations. The results indicated that the software testing processes usually follow systematic methods to a large degree, and have only few immediate or critical resource requirements. Based on the results, the testing processes have approximately three fourths of the resources they need, and have access to a limited, but usually sufficient, group of testing tools. As for test automation, the situation is not as straightforward: based on our study, the applicability of test automation is still limited and its adaptation to testing involves practical difficulties in usability. In this study, we analyze and discuss these limitations and difficulties.
1. Introduction
Testing is perhaps the most expensive task of a software project. In one estimate, the testing phase took over 50% of the project resources [1]. Besides causing immediate costs, testing is also closely related to the costs of poor quality, as malfunctioning programs and errors cause large additional expenses to software producers [1, 2]. In one estimate [2], software producers in the United States lose annually 21.2 billion dollars because of inadequate testing and errors found by their customers. By adding the expenses caused by errors to software users, the estimate rises to 59.5 billion dollars, of which 22.2 billion could be saved by making investments in testing infrastructure [2]. Therefore improving the quality of software and the effectiveness of the testing process can be seen as an effective way to reduce software costs in the long run, both for software developers and users.
One solution for improving the effectiveness of software testing has been applying automation to parts of the testing work. In this approach, testers can focus on critical software features or more complex cases, leaving repetitive tasks to the test automation system. This way it may be possible to
tests. For these activities manual testing is more suitable, as building automation is an extensive task and feasible only if the case is repeated several times [7, 8]. However, the division between automated and manual testing is not as straightforward in practice as it seems; a large concern is also the testability of the software [9], because any piece of code can be written poorly enough to make reliable testing impossible, therefore making it ineligible for automation.
Software engineering research has two key objectives: the reduction of costs and the improvement of the quality of products [10]. As software testing represents a significant part of quality costs, the successful introduction of a test automation infrastructure has the potential to combine these two objectives and to improve the software testing processes overall. In a similar prospect, the improvement of software testing processes is also a focus point of the new software testing standard ISO 29119 [11]. The objective of the standard is to offer a company-level model for the test processes, offering control, enhancement, and follow-up methods for testing organizations to develop and streamline the overall process.
In our prior research project [4, 5, 12–14], experts from industry and research institutes prioritized issues of software testing using the Delphi method [15]. The experts concluded that process improvement, test automation with testing tools, and the standardization of testing are the most prominent issues in concurrent cost reduction and quality improvement. Furthermore, the focused study on test automation [4] revealed several test automation enablers and disablers which are further elaborated in this study. Our objective is to observe software test automation in practice, and further discuss the applicability, usability, and maintainability issues found in our prior research. The general software testing concepts are also observed from the viewpoint of the ISO 29119 model, analysing the test process factors that create the testing strategy in organizations. The approach to achieve these objectives is twofold. First, we wish to explore the software testing practices the organizations are applying and clarify the current status of test automation in the software industry. Secondly, our objective is to identify improvement needs and suggest improvements for the development of software testing and test automation in practice. By understanding these needs, we wish to give both researchers and industry practitioners an insight into tackling the most hindering issues while providing solutions and proposals for software testing and automation improvements.
The study is purely empirical and based on observations from practitioner interviews. The interviewees of this study were selected from companies producing software products and applications at an advanced technical level. The study included three rounds of interviews and a questionnaire, which was filled in during the second interview round. We personally visited 31 companies and carried out 55 structured or semi-structured interviews, which were tape-recorded for further analysis. The sample selection aimed to represent different polar points of the software industry; the selection criteria were based on concepts such as operating environments, product and application characteristics (e.g.,
2. Related Research
Besides our prior industry-wide research in testing [4, 5, 12–14], software testing practices and test process improvement have also been studied by others, like Ng et al. [16] in Australia. Their study applied the survey method to establish knowledge on such topics as testing methodologies, tools, metrics, standards, training, and education. The study indicated that the most common barrier to developing testing was the lack of expertise in adopting new testing methods and the costs associated with testing tools. In their study, only 11 organizations reported that they met testing budget estimates, while 27 organizations spent 1.5 times the estimated cost in testing, and 10 organizations even reported a ratio of 2 or above. In a similar vein, Torkar and Mankefors [17] surveyed different types of communities and organizations. They found that 60% of the developers claimed that verification and validation were the first to be neglected in cases of resource shortages during a project, meaning that even though testing is an important part of the project, it is usually also the first part of the project where cutbacks and downscaling are applied.
As for the industry studies, a similar study approach has previously been used in other areas of software engineering. For example, Ferreira and Cohen [18] completed a technically similar study in South Africa, although their study focused on the application of agile development and stakeholder satisfaction. Similarly, Li et al. [19] conducted research on the COTS-based software development process in Norway, and Chen et al. [20] studied the application of open source components in software development in China. Overall, case studies covering entire industry sectors are not particularly uncommon [21, 22]. In the context of test automation, there are several studies and reports on test automation practices (such as [23–26]). However, there seems to be a lack of studies that investigate and compare the practice of software testing automation in different kinds of software development organizations.
In the process of testing software for errors, testing work can be roughly divided into manual and automated testing, which both have individual strengths and weaknesses. For example, Ramler and Wolfmaier [3] summarize the difference between manual and automated testing by suggesting that automation should be used to prevent further errors in working modules, while manual testing is better suited for finding new and unexpected errors. However, how
kind of hindrances to the application of test automation and, based on these findings, offer guidelines on what aspects should be taken into account when implementing test automation in practice.
3. Research Process
3.1. Research Population and Selection of the Sample. The population of the study consisted of organization units (OUs). The standard ISO/IEC 15504-1 [36] specifies an organizational unit (OU) as a part of an organization that is the subject of an assessment. An organizational unit deploys one or more processes that have a coherent process context and operates within a coherent set of business goals. An organizational unit is typically part of a larger organization, although in a small organization, the organizational unit may be the whole organization.
The reason to use an OU as the unit of observation was that we wanted to normalize the effect of the company size to get comparable data. The initial population and population criteria were decided based on the prior research on the subject. The sample for the first interview round consisted of 12 OUs, which were technically high-level units, professionally producing software as their main process. This sample also formed the focus group of our study. Other selection criteria for the sample were based on the polar type selection [37] to cover different types of organizations, for example, different businesses, different sizes of the company, and different kinds of operation. Our objective in using this approach was to gain a deep understanding of the cases and to identify, as broadly as possible, the factors and features that have an effect on software testing automation in practice.
For the second round and the survey, the sample was expanded by adding OUs to the study. Selecting the sample was demanding because comparability is not determined by the company or the organization but by comparable processes in the OUs. With the help of national and local authorities (the network of the Finnish Funding Agency for Technology and Innovation) we collected a population of 85 companies. Only one OU from each company was accepted to avoid the bias of over-weighting large companies. Each OU surveyed was selected from a company according to the population criteria. For this round, the sample size was expanded to 31 OUs, which also included the OUs from the first round. The selection for expansion was based on probability sampling; the additional OUs were randomly entered into our database, and every other one was selected for the survey. In the third round, the same sample as in the first round was interviewed. Table 1 introduces the business domains, company sizes, and operating areas of our focus OUs. The company size classification is taken from [38].
3.2. Interview Rounds. The data collection consisted of
three interview rounds. During the first interview round,
the designers responsible for the overall software structure
and/or module interfaces were interviewed. If the OU did
not have separate designers, then the interviewed person
was selected from the programmers based on their role in
Table 1 (fragment): OU and business domain.
Case A: MES1 producer and electronics manufacturer
Case B: Internet service developer and consultant
Case C: Logistics software developer
Case D: ICT consultant
Case E: Safety and logistics system developer
Case F: Naval software system developer
Case G: Financial software developer
Case H: MES1 producer and logistics service systems provider
Case I: SME2 business and agriculture ICT service provider
Case J: Modeling software developer
Case K: ICT developer and consultant
Case L: Financial software developer
Table 2: Interviewee roles and interview rounds.
(1) Semistructured; 12 focus OUs; interviewee role: Designer or Programmer. The interviewee is responsible for software design or has influence on how software is implemented.
(2) Structured/Semistructured; 31 OUs quantitative, including 12 focus OUs qualitative; interviewee role: Project or testing manager. The interviewee is responsible for software development projects or test processes of software products.
(3) Semistructured; 12 focus OUs; interviewee role: Tester. The interviewee is a dedicated software tester or is responsible for testing the software product.
Interview transcript
Well, I would hope for stricter control or management for implementing our testing strategy, as I
am not sure if our testing covers everything and is it sophisticated enough. On the other hand, we
do have strictly limited resources, so it can be enhanced only to some degree, we cannot test
everything. And perhaps, recently we have had, in the newest versions, some regression testing,
going through all features, seeing if nothing is broken, but in several occasions this has been left
unfinished because time has run out. So there, on that issue we should focus.
(Table 3: survey variables for the OUs: number of employees in the company; number of SW developers and testers in the OU; percentage of automation in testing; percentage of agile (reactive, iterative) versus plan-driven methods in projects; percentage of existing testers versus resource needs; and the percentage of the development effort spent on testing; minimum, maximum, and median values are reported per variable. Business domains include IT development, IT services, finances, telecommunications, industrial automation, logistics, the metal industry, and the public sector. Footnotes: (1) 0 means that all of the OU's developers and testers are acquired from 3rd parties; (2) 0 means that no project time is allocated especially for testing.)
(Figures: survey results on testing practices, including ratings of statements such as "Conformance testing is excellent" and the primary, secondary, and tertiary categories reported for areas such as unit testing, test automation, performance testing, bug reporting, regression testing, functional testing, and testability- and test environment-related issues.)
their relation to test automation. We also concentrated on the OU differences in essential concepts such as automation tools, implementation issues, or development strategies. This conceptualization resulted in the categories listed in Table 5.
The category Automation application describes the areas of software development where test automation was applied successfully. This category describes the testing activities or phases which apply test automation processes. In cases where the test organization did not apply automation, or had so far only tested it for future applications, this category was left empty. The application areas were generally geared towards regression and stress testing, with a few applications of functionality and smoke tests in use.
The category Role in software process is related to the
objective for which test automation was applied in software
development. The role in the software process describes the
objective for the existence of the test automation infrastructure; it could, for example, be in quality control, where
automation is used to secure module interfaces, or in quality
assurance, where the operation of product functionalities is
verified. The usual role for the test automation tools was
in quality control and assurance, the level of application
varying from third party-produced modules to primary
quality assurance operations. On two occasions, the role
of test automation was considered harmful to the overall
testing outcomes, and on one occasion, the test automation
was considered trivial, with no real return on investments
compared to traditional manual testing.
The category Test automation strategy is the approach
to how automated testing is applied in the typical software
processes, that is, the way the automation was used as a
part of the testing work, and how the test cases and overall
test automation strategy were applied in the organization.
The level of commitment to applying automation was the
main dimension of this category, the lowest level being
individual users with sporadic application in the software
projects, and the highest being the application of automation
to the normal, everyday testing infrastructure, where test
automation was used seamlessly with other testing methods
and had specifically assigned test cases and organizational
support.
The category of Automation development is the general
category for OU test automation development. This category
summarizes the ongoing or recent efforts and resource
allocations to the automation infrastructure. The type of new
development, introduction strategies and current development towards test automation are summarized in this category. The most frequently chosen code was general increase
of application, where the organization had committed itself
to test automation, but had no clear idea of how to develop
the automation infrastructure. However, one OU had a
development plan for creating a GUI testing environment,
while two organizations had just recently scaled down the
amount of automation as a result of a pilot project. Two
organizations had only recently introduced test automation
to their testing infrastructure.
The category of Automation tools describes the types
of test automation tools that are in everyday use in the
OU. These tools are divided based on their technological
Table 5: Test automation categories.
Automation application: Areas of application for test automation in the software process.
Role in software process: The observed roles of test automation in the company software process and the effect of this role.
Test automation strategy: The observed method for selecting the test cases where automation is applied and the level of commitment to the application of test automation in the organizations.
Automation development: The areas of active development in which the OU is introducing test automation.
Automation tools: The general types of test automation tools applied.
Automation issues: The items that hinder test automation development in the OU.
However, there seemed to be some contradicting considerations regarding the applicability of test automation. Cases F, J, and K had recently either scaled down their test automation architecture or considered it too expensive or inefficient when compared to manual testing. In some cases, automation was also considered too bothersome to configure for a short-term project, as the system would have required constant upkeep, which was an unnecessary addition to the project workload.
"We really have not been able to identify any major advancements from it [test automation]." (Tester, Case J)
"It [test automation] just kept interfering." (Designer, Case K)
Both these viewpoints indicate that test automation should not be considered a frontline test environment for finding errors, but rather a quality control tool to maintain functionalities. For unique cases or small projects, test automation is too expensive to develop and maintain, and it generally does not support single test cases or explorative testing. However, it seems to be practical in larger projects, where verifying module compatibility or offering legacy support is a major issue.
Hypothesis 2 (Maintenance and development costs are common test automation hindrances that universally affect all test organizations regardless of their business domain or company size). Even though the case organizations were selected to represent different types of organizations, the common theme was that the main obstacles to automation adoption were development expenses and upkeep costs. It seemed to make no difference whether the organization unit belonged to a small or large company, as at the OU level they shared common obstacles. Even despite the maintenance and development hindrances, automation was considered a feasible tool in many organizations. For example, Cases I and L pursued the development of some kind of automation to enhance the testing process. Similarly, Cases E and H, which already had a significant number of test automation cases, were actively pursuing a larger role for automated testing.
"Well, it [automation] creates a sense of security and controllability, and one thing that is easily
(Table 6: test automation categories observed in each case OU. Automation applications ranged from GUI, regression, functionality, unit and module, smoke, and system stress testing to documentation automation, with Case J not using automation at all. The observed roles in the software process were mainly quality control or quality assurance tools, in some cases limited to secondary modules or to verifying module compatibility; in two cases the overall effect was considered harmful and in one no effect was observed. Test automation strategies varied from individual users and project-related cases to automation as part of the normal test infrastructure. Automation development ranged from application pilots, a general increase of application, and upkeep of existing parts to recently scaled-down automation. The tools in use included individual tools, proof-of-concept tools, self-created tools such as drivers and stubs, and in-house developed test suites. The reported automation issues concentrated on the costs of implementing and maintaining automation, the complexity of adapting automation to the test processes or testing strategy, manual testing being seen as more efficient, underestimation of the effect of automated testing on quality, and the lack of a development incentive.)
6. Discussion
An exploratory survey combined with interviews was used
as the research method. The objective of this study was to
shed light on the status of test automation and to identify
improvement needs in, and the current practice of, test automation.
The survey revealed that the total effort spent on testing
(median 25%) was less than expected. The median percentage of testing effort (25%) is smaller than the 50%-60% that is
often mentioned in the literature [38, 39]. The comparatively
low percentage may indicate that the resources needed
for software testing are still underestimated even though
testing efficiency has grown. The survey also indicated that
companies used fewer resources on test automation than
expected: on average, 26% of all test cases apply automation.
However, there seems to be ambiguity as to
which activities organizations consider test automation, and
how automation should be applied in test organizations.
In the survey, several organizations reported that they have
an extensive test automation infrastructure, but this was
not reflected at the practical level, as the figures given in the
interviews, particularly with testers, were considerably different.
This indicates that test automation does not have a
strong strategy in the organization and has yet to reach
maturity in several test organizations. Such concepts as
quality assurance testing and stress testing seem to be
particularly ambiguous application areas, as Cases E and L
demonstrated. In Case E, the management did not consider
stress testing an automation application, whereas testers did.
Moreover, in Case L the large automation infrastructure
was not reflected at the individual project level, meaning
that the automation strategy may vary strongly between
different projects and products even within one organization
unit.
The qualitative study, which was based on interviews,
indicated that some organizations in fact actively avoid
using test automation, as it is considered to be expensive
and to offer only little value for the investment. However,
test automation seems to be generally applicable to the
software process, although for small projects the investment is
clearly oversized. One additional aspect that increases
the investment is the tools, which, unlike in other areas of
software testing, tend to be developed in-house or are
heavily modified to suit specific automation needs. This
development went beyond the localization process which
every new software tool requires, extending even to the
development of new features and operating frameworks. In
this context it also seems plausible that test automation
can be created for several different test activities. Regression
testing, GUI testing, and unit testing, activities which in some
form exist in most development projects, all make it possible
to create successful automation by building suitable tools for
the task, as elements with sufficient stability or unchangeability
can be found in each phase. Therefore it seems that
the decision to apply automation is not only connected to
the enablers and disablers of test automation [4], but rather
rests on a trade-off between the required effort and the acquired
benefits; in small projects, or with a low amount of reuse, the
effort becomes too large for applying automation to be a
feasible investment.
The investment size and the required effort
can also be observed in two other respects. First, test
automation should not be considered an active testing tool
for finding errors, but a tool to guarantee the functionality
of already existing systems. This observation is in line with
those of Ramler and Wolfmaier [3], who discuss the necessity
of a large number of repetitive tasks for automation
to supersede manual testing in cost-effectiveness, and of
Berner et al. [8], who note that automation requires
a sound application plan and well-documented, simulatable,
and testable objects. For both of these requirements, quality
control at module interfaces and quality assurance of system
operability are ideal, and, as it seems, they are the most
commonly used application areas for test automation. In
fact, Kaner [56] states that 60%-80% of the errors found with
test automation are found in the development phase of the
test cases, further supporting the quality control aspect over
error discovery.
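Several case OUs approached this interface-level quality control with self-created drivers and stubs (for example, Case L). As a hedged sketch, again with purely hypothetical names, a driver exercises the module under test through its public interface while a stub stands in for the neighbouring module, so the check stays runnable even when the real counterpart is unavailable:

    import static org.junit.Assert.assertTrue;
    import org.junit.Test;

    public class ReportingModuleInterfaceTest {

        // Hypothetical interface of a neighbouring module.
        interface TelemetryChannel {
            boolean send(String message);
        }

        // Stub standing in for the real telemetry module: it records calls
        // instead of transmitting anything.
        static class TelemetryChannelStub implements TelemetryChannel {
            int sentCount = 0;
            public boolean send(String message) {
                sentCount++;
                return true;
            }
        }

        // Hypothetical module under test, wired to the channel interface.
        static class ReportingModule {
            private final TelemetryChannel channel;
            ReportingModule(TelemetryChannel channel) { this.channel = channel; }
            void publishDailyReport() { channel.send("daily-report"); }
        }

        // Driver: exercises the module through its public interface and
        // checks the interaction with the stubbed neighbour.
        @Test
        public void reportIsPushedThroughTheChannel() {
            TelemetryChannelStub stub = new TelemetryChannelStub();
            new ReportingModule(stub).publishDailyReport();
            assertTrue(stub.sentCount > 0);
        }
    }

The design choice is that the stub isolates the module under test from its environment, which is what makes a repeatable, schedulable interface check possible.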
Other phenomena that increase the investment are the
limited availability and applicability of automation tools.
On several occasions, the development of the automation
tools was an additional task for the automation-building
organization, requiring it to allocate its limited resources
to the implementation of the test automation tools. From
this viewpoint it is easy to understand why
some case organizations thought that manual testing is
sufficient, and even more efficient, when measured in resource
allocation per test case. Another explanation for the
observed resistance to applying or using test
automation was discussed in detail by Berner et al. [8],
who stated that organizations tend to have inappropriate
strategies and overly ambitious objectives for test automation
development, leading to results that do not live up to their
expectations and causing the introduction of automation to
fail. Based on the observations regarding the development
plans beyond piloting, it can also be argued that the lack of
objectives and strategy affects the success of the introduction
process. Similar observations of automation pitfalls were
also discussed by Persson and Yilmazturk [26] and Mosley
and Posey [57].
Overall, it seems that the main disadvantages of test
automation are the costs, which include implementation
costs, maintenance costs, and training costs. Implementation
costs include direct investment costs, time, and human
resources. The correlation between these test automation
costs and the effectiveness of the infrastructure is discussed
by Fewster [24]. If the maintenance of test automation
is ignored, updating an entire automated test suite can cost
as much as, or even more than, performing all
the tests manually, making automation a bad investment
for the organization. We observed this phenomenon in
two case organizations. There is also a connection between
implementation costs and maintenance costs [24]. If the
test automation system is designed with the minimization
of maintenance costs in mind, the implementation costs
increase, and vice versa. We noticed the phenomenon of
costs preventing test automation development in six cases.
The implementation of test automation seems to be possible
with two different approaches: by promoting
either maintainability or easy implementation. If the selected
focus is on maintainability, test automation is expensive, but
if the approach promotes easy implementation, the process
of adopting test automation has a larger possibility of
failure. This may well be due to the higher expectations and
the assumption that automation could yield results faster
when implementation is promoted over maintainability, often
leading to one of the automation pitfalls [26] or at least a low
7. Conclusions
The objective of this study was to observe and identify
factors that affect the state of testing, with automation as the
central aspect, in different types of organizations. Our study
included a survey in 31 organizations and a qualitative study
in 12 focus organizations. We interviewed employees from
different organizational positions in each of the cases.
This study included follow-up research on prior observations [4, 5, 12-14] on testing process difficulties and
enhancement proposals, and on our observations on industrial test automation [4]. In this study we further elaborated
on the test automation phenomena with a larger sample of
polar-type OUs and a more focused approach to acquiring
knowledge on test process-related subjects. The survey
revealed that test organizations use test automation in only
26% of their test cases, which was considerably less than
could be expected based on the literature. However, test
automation tools were the third most common category of
test-related tools, commonly intended to implement unit
and regression testing. The results indicate that adopting test
automation in a software organization is a demanding effort.
The lack of an existing software repertoire, unclear objectives
for overall development, and the demands of resource allocation
for both design and upkeep create a large threshold to
overcome.
Test automation was most commonly used for quality
control and quality assurance. In fact, test automation was
observed to be best suited to tasks where the purpose
is to secure working features, such as checking module
interfaces for backwards compatibility. However, the high
implementation and maintenance requirements were considered the most important issues hindering test automation
development, limiting the application of test automation
in most OUs. Furthermore, the limited availability of test
automation tools and the level of commitment required to
develop a suitable automation infrastructure caused additional expenses. Due to the high maintenance requirements
and the low return on investment in small-scale application,
some organizations had actually discarded their automation
systems or decided not to implement test automation. The
lack of a common strategy for applying automation was also
evident in many interviewed OUs. Automation applications
varied even within the organization, as was observable
in the differences when comparing results from different
stakeholders. In addition, the development strategies were
vague and lacked actual objectives. These observations can
also indicate communication gaps [58] between stakeholders
of the overall testing strategy, especially between developers
and testers.
The data also suggested that the OUs that had successfully
implemented a test automation infrastructure covering the
entire organization seemed to have difficulties in creating
a continuation plan for their test automation development.
After the adoption phase was over, there was ambiguity
about how to continue, even if the organization had decided
Appendix
Case Descriptions
Case A (Manufacturing execution system (MES) producer
and electronics manufacturer). Case A produces software
as a service (SaaS) for their product. The company is a
small-sized, nationally operating company that has mainly
industrial customers. Their software process is a plan-driven
cyclic process, where the testing is embedded in the
development itself, with only a small amount of dedicated
resources. This organization unit applied test automation
as a user interface and regression testing tool, using it for
product quality control. Test automation was seen as a part
of the normal test strategy, universally used in all software
projects. The development plan for automation was to
generally increase its application, although the complexity
of the software and module architecture was considered a
major obstacle in the automation process.
Case B (Internet service developer and consultant). The Case B
organization offers two types of services: development of
Internet service portals for customers such as communities
and the public sector, and consultation in the Internet service
business domain. The originating company is small and
operates on a national level. Their main test automation
resource is in performance testing as a quality control
tool, although the addition of GUI test automation has also been
proposed. The automated tests are part of the normal test
process, and the overall development plan was to increase the
automation level, especially for the GUI test cases. However,
this development has been hindered by the cost of designing
and developing the test automation architecture.
Case C (Logistics software developer). The Case C organization
focuses on creating software and services for its originating
company and its customers. This organization unit is
part of a large-sized, nationally operating company with a
large, highly distributed network and several clients. Test
automation is widely used in several testing phases such as
functionality testing, regression testing, and document generation automation. These investments are used for quality
control to ensure software usability and correctness.
Although the OU is still aiming for a larger test automation
infrastructure, the large number of related systems and the
constant changes in inter-module communication cause
difficulties in the development and maintenance of new
automation cases.
Case D (ICT consultant). The Case D organization is a small,
regional software consultancy whose customers
mainly consist of small businesses and the public
sector. The organization does some software development
projects, in which the company develops services and ICT
products for its customers. Test automation comes
mainly through this channel, as it is mainly
used as a conformance test tool for third-party modules.
This also restricts the amount of test automation to the
projects in which these modules are used. The company
currently does not have development plans for test
automation, as it is considered an unfeasible investment for an
OU of this size, but they do invest in the upkeep of the existing
tools, as these are used as a quality control tool for the
acquired third-party modules.
Case E (Safety and logistics system developer). The Case E
organization is a software system developer for safety and
logistics systems. Their products have a large number of safety-critical
features and several interfaces through which to
communicate. Test automation is used as a major
quality assurance component, as the service stress tests are
automated to a large degree. Test automation is therefore
also a central part of the testing strategy, and each project
has a defined set of automation cases. The organization is
aiming to increase the amount of test automation and
simultaneously develop new test cases and automation
applications for the testing process. The main obstacle to
this development has so far been the cost of creating new
automation tools and extending the existing automation
application areas.
Case F (Naval software system developer). The Case F
organization unit is responsible for developing and testing
naval service software systems. Their product is based on
a common core and has considerable requirements for
compatibility with legacy systems. This OU has tried
test automation in several cases, with application areas
such as unit and module testing, but has recently scaled
test automation down to support aspects only, such as
documentation automation. This decision was based
on the resource requirements for developing and especially
maintaining the automation system, and because manual
testing was in this context considered much more efficient,
as there was too much ambiguity in the automation-based test
results.
Case G (Financial software developer). Case G is part
of a large financial organization, which operates nationally
but has several internationally connected services due to
its business domain. Their software projects are always
aimed at service portals for their own products and have
to pass considerable verification and validation tests before
being introduced to the public. Because of this, the case
organization has a sizable test department compared to
the other case companies in this study, and follows a rigorous test
process plan in all of its projects. Test automation is
used in the regression tests as a quality assurance tool for
user interfaces and interface events, and is therefore embedded
in the testing strategy as a normal testing environment.
The development plan for test automation aims
to generally increase the number of test cases, but even
the existing test automation infrastructure is considered
expensive to upkeep and maintain.
Case H (Manufacturing execution system (MES) producer
and logistics service system provider). The Case H organization
is a medium-sized company whose software development is
a component of the company product. The case organization's
products are used in logistics service systems, usually working as a part of automated processes. The case organization
applies automated testing as a module interface testing tool,
using it as a quality control tool in the test strategy.
The test automation infrastructure relies on an in-house-developed testing suite, which enables the organization to use
test automation to run daily tests to validate module
conformance. Their approach to test automation has
been seen as a positive enabler, and the general trend is
towards increasing the number of automation cases. The main
drawback of test automation is considered to be that the quality control
aspect is not visible when it works correctly, and therefore the
effect of test automation may be underestimated in the wider
organization.
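Purely as an illustration of such a daily conformance run (none of the names below describe Case H's actual suite), a scheduler or continuous integration job can execute a small JUnit 4 driver each night and report failures through its exit code; the test classes reused here are the hypothetical ones from the earlier sketches:

    import org.junit.runner.JUnitCore;
    import org.junit.runner.Result;
    import org.junit.runner.notification.Failure;

    // Illustrative nightly driver: a scheduler or CI job runs this main
    // method; a non-zero exit code signals a module-conformance regression.
    public class NightlyConformanceRun {
        public static void main(String[] args) {
            Result result = JUnitCore.runClasses(
                    // Hypothetical interface-level test classes.
                    ReportingModuleInterfaceTest.class,
                    PricingModuleRegressionTest.class);
            for (Failure failure : result.getFailures()) {
                System.err.println(failure.toString());
            }
            System.exit(result.wasSuccessful() ? 0 : 1);
        }
    }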
Case I (Small and medium-sized enterprise (SME) business
and agriculture ICT-service provider). The Case I organization
is a small, nationally operating software company
which operates in multiple business domains. Their customer
base is heterogeneous, varying from finance to agriculture
and government services. The company is currently
not utilizing test automation in their test process, but
they have development plans for designing quality control
automation. For this development they have had some
Acknowledgment
This study is a part of the ESPA project (http://www.soberit.hut.fi/espa/), funded by the Finnish Funding Agency for
Technology and Innovation (project number 40125/08) and
by the participating companies listed on the project web site.
References
[1] E. Kit, Software Testing in the Real World: Improving the Process,
Addison-Wesley, Reading, Mass, USA, 1995.
[2] G. Tassey, The economic impacts of inadequate infrastructure for software testing, RTI Project 7007.011, U.S. National
Institute of Standards and Technology, Gaithersburg, Md,
USA, 2002.
[3] R. Ramler and K. Wolfmaier, Observations and lessons learned from automated testing, in Proceedings of the International Workshop on Automation of Software Testing (AST '06),
pp. 85-91, Shanghai, China, May 2006.
[36] ISO/IEC 15504-1, Information Technology - Process Assessment - Part 1: Concepts and Vocabulary, 2002.
[37] K. M. Eisenhardt, Building theories from case study research, The Academy of Management Review, vol. 14, no. 4,
pp. 532-550, 1989.
[38] EU and European Commission, The new SME definition:
user guide and model declaration, 2003.
[39] G. Paré and J. J. Elam, Using case study research to build theories of IT implementation, in Proceedings of the IFIP
TC8 WG 8.2 International Conference on Information Systems and Qualitative Research, pp. 542-568, Chapman & Hall,
Philadelphia, Pa, USA, May-June 1997.
[40] A. Strauss and J. Corbin, Basics of Qualitative Research:
Grounded Theory Procedures and Techniques, SAGE, Newbury
Park, Calif, USA, 1990.
[41] ATLAS.ti, The Knowledge Workbench, Scientific Software
Development, 2005.
[42] M. B. Miles and A. M. Huberman, Qualitative Data Analysis,
SAGE, Thousand Oaks, Calif, USA, 1994.
[43] C. B. Seaman, Qualitative methods in empirical studies of software engineering, IEEE Transactions on Software Engineering, vol. 25, no. 4, pp. 557-572, 1999.
[44] C. Robson, Real World Research, Blackwell, Oxford, UK, 2nd
edition, 2002.
[45] N. K. Denzin, The Research Act: A Theoretical Introduction to
Sociological Methods, McGraw-Hill, New York, NY, USA, 1978.
[46] A. Fink and J. Kosecoff, How to Conduct Surveys: A Step-by-Step Guide, SAGE, Beverly Hills, Calif, USA, 1985.
[47] B. A. Kitchenham, S. L. Pfleeger, L. M. Pickard, et al.,
Preliminary guidelines for empirical research in software
engineering, IEEE Transactions on Software Engineering, vol.
28, no. 8, pp. 721-734, 2002.
[48] T. Dybå, An instrument for measuring the key factors of
success in software process improvement, Empirical Software
Engineering, vol. 5, no. 4, pp. 357-390, 2000.
[49] ISO/IEC 25010-2, Software Engineering - Software Product Quality Requirements and Evaluation (SQuaRE) - Quality Model, 2008.
[50] Y. Baruch, Response rate in academic studies: a comparative
analysis, Human Relations, vol. 52, no. 4, pp. 421-438, 1999.
[51] T. Koomen and M. Pol, Test Process Improvement: A Practical
Step-by-Step Guide to Structured Testing, Addison-Wesley,
Reading, Mass, USA, 1999.
[52] P. Kruchten, The Rational Unified Process: An Introduction,
Addison-Wesley, Reading, Mass, USA, 2nd edition, 1998.
[53] K. Schwaber and M. Beedle, Agile Software Development with
Scrum, Prentice-Hall, Upper Saddle River, NJ, USA, 2001.
[54] K. Beck, Extreme Programming Explained: Embrace Change,
Addison-Wesley, Reading, Mass, USA, 2000.
[55] B. Glaser and A. L. Strauss, The Discovery of Grounded Theory:
Strategies for Qualitative Research, Aldine, Chicago, Ill, USA,
1967.
[56] C. Kaner, Improving the maintainability of automated test
suites, Software QA, vol. 4, no. 4, 1997.
[57] D. J. Mosley and B. A. Posey, Just Enough Software Test
Automation, Prentice-Hall, Upper Saddle River, NJ, USA,
2002.
[58] D. Foray, Economics of Knowledge, MIT Press, Cambridge,
Mass, USA, 2004.