Data Mining Applications

for Empowering
Knowledge Societies
Hakikur Rahman
Sustainable Development Networking Foundation (SDNF), Bangladesh

Information Science Reference

Hershey • New York
Director of Editorial Content: Kristin Klinger
Managing Development Editor: Kristin M. Roth
Assistant Managing Development Editor: Jessica Thompson
Assistant Development Editor: Deborah Yahnke
Senior Managing Editor: Jennifer Neidig
Managing Editor: Jamie Snavely
Assistant Managing Editor: Carole Coulson
Copy Editor: Erin Meyer
Typesetter: Sean Woznicki
Cover Design: Lisa Tosheff
Printed at: Yurchak Printing Inc.

Published in the United States of America by

Information Science Reference (an imprint of IGI Global)
701 E. Chocolate Avenue, Suite 200
Hershey PA 17033
Tel: 717-533-8845
Fax: 717-533-8661
E-mail: cust@igi-global.com
Web site: http://www.igi-global.com

and in the United Kingdom by

Information Science Reference (an imprint of IGI Global)
3 Henrietta Street
Covent Garden
London WC2E 8LU
Tel: 44 20 7240 0856
Fax: 44 20 7379 0609
Web site: http://www.eurospanbookstore.com

Copyright © 2009 by IGI Global. All rights reserved. No part of this publication may be reproduced, stored or distributed in any form or by
any means, electronic or mechanical, including photocopying, without written permission from the publisher.
Product or company names used in this set are for identification purposes only. Inclusion of the names of the products or companies does
not indicate a claim of ownership by IGI Global of the trademark or registered trademark.
Library of Congress Cataloging-in-Publication Data

Data mining applications for empowering knowledge societies / Hakikur Rahman, editor.
p. cm.

Summary: “This book presents an overview on the main issues of data mining, including its classification, regression, clustering, and
ethical issues”--Provided by publisher.

Includes bibliographical references and index.

ISBN 978-1-59904-657-0 (hardcover) -- ISBN 978-1-59904-659-4 (ebook)

1. Data mining. 2. Knowledge management. I. Rahman, Hakikur, 1957-
QA76.9.D343D38226 2009
005.74--dc22
2008008466

British Cataloguing in Publication Data

A Cataloguing in Publication record for this book is available from the British Library.

All work contributed to this book set is original material. The views expressed in this book are those of the authors, but not necessarily of
the publisher.

If a library purchased a print copy of this publication, please go to http://www.igi-global.com/agreement for information on activating
the library's complimentary electronic access to this publication.
Table of Contents

Foreword .............................................................................................................................................. xi
Preface ................................................................................................................................................. xii
Acknowledgment .............................................................................................................................. xxii

Section I
Education and Research

Chapter I
Introduction to Data Mining Techniques via Multiple Criteria Optimization
Approaches and Applications ................................................................................................................ 1
Yong Shi, University of the Chinese Academy of Sciences, China
and University of Nebraska at Omaha, USA
Yi Peng, University of Nebraska at Omaha, USA
Gang Kou, University of Nebraska at Omaha, USA
Zhengxin Chen, University of Nebraska at Omaha, USA

Chapter II
Making Decisions with Data: Using Computational Intelligence Within a
Business Environment ......................................................................................................................... 26
Kevin Swingler, University of Stirling, Scotland
David Cairns, University of Stirling, Scotland

Chapter III
Data Mining Association Rules for Making Knowledgeable Decisions ............................................. 43
A.V. Senthil Kumar, CMS College of Science and Commerce, India
R. S. D. Wahidabanu, Govt. College of Engineering, India
Section II
Tools, Techniques, Methods

Chapter IV
Image Mining: Detecting Deforestation Patterns Through Satellites .................................................. 55
Marcelino Pereira dos Santos Silva, Rio Grande do Norte State University, Brazil
Gilberto Câmara, National Institute for Space Research, Brazil
Maria Isabel Sobral Escada, National Institute for Space Research, Brazil

Chapter V
Machine Learning and Web Mining: Methods and Applications in Societal Benefit Areas ................ 76
Georgios Lappas, Technological Educational Institution of Western Macedonia,
Kastoria Campus, Greece

Chapter VI
The Importance of Data Within Contemporary CRM ......................................................................... 96
Diana Luck, London Metropolitan University, UK

Chapter VII
Mining Allocating Patterns in Investment Portfolios ......................................................................... 110
Yanbo J. Wang, University of Liverpool, UK
Xinwei Zheng, University of Durham, UK
Frans Coenen, University of Liverpool, UK

Chapter VIII
Application of Data Mining Algorithms for Measuring Performance Impact
of Social Development Activities ...................................................................................................... 136
Hakikur Rahman, Sustainable Development Networking Foundation (SDNF), Bangladesh

Section III
Applications of Data Mining

Chapter IX
Prospects and Scopes of Data Mining Applications in Society Development Activities .................. 162
Hakikur Rahman, Sustainable Development Networking Foundation, Bangladesh

Chapter X
Business Data Warehouse: The Case of Wal-Mart ............................................................................ 189
Indranil Bose, The University of Hong Kong, Hong Kong
Lam Albert Kar Chun, The University of Hong Kong, Hong Kong
Leung Vivien Wai Yue, The University of Hong Kong, Hong Kong
Li Hoi Wan Ines, The University of Hong Kong, Hong Kong
Wong Oi Ling Helen, The University of Hong Kong, Hong Kong
Chapter XI
Medical Applications of Nanotechnology in the Research Literature ............................................... 199
Ronald N. Kostoff, Office of Naval Research, USA
Raymond G. Koytcheff, Office of Naval Research, USA
Clifford G.Y. Lau, Institute for Defense Analyses, USA

Chapter XII
Early Warning System for SMEs as a Financial Risk Detector ......................................................... 221
Ali Serhan Koyuncugil, Capital Markets Board of Turkey, Turkey
Nermin Ozgulbas, Baskent University, Turkey

Chapter XIII
What Role is “Business Intelligence” Playing in Developing Countries?
A Picture of Brazilian Companies ...................................................................................................... 241
Maira Petrini, Fundação Getulio Vargas, Brazil
Marlei Pozzebon, HEC Montreal, Canada

Chapter XIV
Building an Environmental GIS Knowledge Infrastructure .............................................................. 262
Inya Nlenanya, Center for Transportation Research and Education,
Iowa State University, USA

Chapter XV
The Application of Data Mining for Drought Monitoring and Prediction ......................................... 280
Tsegaye Tadesse, National Drought Mitigation Center, University of Nebraska, USA
Brian Wardlow, National Drought Mitigation Center, University of Nebraska, USA
Michael J. Hayes, National Drought Mitigation Center, University of Nebraska, USA

Compilation of References .............................................................................................................. 292

About the Contributors ................................................................................................................... 325

Index ................................................................................................................................................ 330

Detailed Table of Contents

Section I
Education and Research

This chapter presents an overview of a series of multiple criteria optimization-based data mining
methods that utilize multiple criteria programming to solve various data mining problems, and outlines
some research challenges. It also points out several research opportunities for the
data mining community.

This chapter identifies important barriers to the successful application of computational intelligence
techniques in a commercial environment and suggests a number of ways in which they may be overcome.
It further identifies a few key conceptual, cultural, and technical barriers and describes the different
ways in which they affect business users and computational intelligence practitioners. The chapter
aims to provide insight for its readers through the outcome of a successful computational
intelligence project.
Chapter III
Data Mining Association Rules for Making Knowledgeable Decisions ............................................. 43
A.V. Senthil Kumar, CMS College of Science and Commerce, India
R. S. D. Wahidabanu, Govt. College of Engineering, India

This chapter describes two popular data mining techniques that are used to explore frequent large
itemsets in a database. The first is the closed directed graph approach, in which the algorithm scans
the database once and counts the possible 2-itemsets; only the 2-itemsets with a minimum support
are used to form the closed directed graph, which is then used to explore possible frequent large
itemsets in the database. The second is the dynamic hashing algorithm, in which large 3-itemsets are
generated at an earlier stage; this reduces the size of the transaction database after trimming and
thereby reduces the cost of later iterations. The chapter envisages that these techniques may help
researchers not only to understand how frequent large itemsets are generated, but also to find association
rules among transactions within relational databases and to make knowledgeable decisions.

Section II
Tools, Techniques, Methods

This chapter presents relevant definitions from the remote sensing and image mining domain, refers
to related work in the field, and demonstrates the importance of appropriate tools and techniques
for analyzing satellite images and extracting knowledge from this kind of data. A case study of
deforestation in Amazonia is discussed, and an effort is made to develop a strategy to deal with
challenges involving Earth observation resources. The purpose is to present new approaches and research
directions in remote sensing image mining, and to demonstrate how to increase the analysis potential of
such huge strategic data for the benefit of researchers.

This chapter reviews contemporary research on machine learning and Web mining methods that are
related to areas of social benefit. It further demonstrates that machine learning and Web mining methods
may provide intelligent Web services of social interest. The chapter also discusses the growing
interest of researchers in using advanced computational methods, such as machine learning
and Web mining, to provide better services to the public.
Chapter VI
The Importance of Data Within Contemporary CRM ......................................................................... 96
Diana Luck, London Metropolitan University, UK

This chapter examines the importance of customer relationship management (CRM) in product
development and service elements, as well as in organizational structure and strategies, where data is
the pivotal dimension around which the concept of CRM revolves in contemporary terms. It then
demonstrates how these processes are associated with data management processes, namely data
collection, data collation, data storage, and data mining, which are becoming essential components of CRM
in both theoretical and practical aspects.

This chapter introduces the concept of “one-sum” weighted association rules (WARs) and names
such WARs allocating patterns (ALPs). An algorithm is proposed to extract hidden and
interesting ALPs from data. The chapter further points out that ALPs can be applied in portfolio
management: by modeling a collection of investment portfolios as a one-sum weighted transaction database,
ALPs can be used to guide future investment activities.

This chapter focuses on data mining applications and their use in devising performance-measuring
tools for social development activities. It provides justifications for including data mining algorithms
in establishing specifically derived monitoring and evaluation tools that may be used for various social
development applications. Specifically, the chapter gives in-depth analytical observations for establishing
knowledge centers with a range of approaches and puts forward a few research issues and challenges in
transforming contemporary human society into a knowledge society.

Section III
Applications of Data Mining

Chapter IX
Prospects and Scopes of Data Mining Applications in Society Development Activities .................. 162
Hakikur Rahman, Sustainable Development Networking Foundation, Bangladesh

Chapter IX focuses on a few areas of social development processes and offers hints on the application
of data mining tools through which decision making would be easier. It then puts forward
potential areas of society development initiatives where data mining applications can be incorporated.
The focus areas vary from basic social services, like education, health care, general commodities,
tourism, and ecosystem management, to advanced uses, like database tomography.

This chapter highlights the business data warehouse and discusses the retailing giant Wal-Mart.
The planning and implementation of the Wal-Mart data warehouse are described, and its integration
with the operational systems is discussed. The chapter also highlights some of the problems
encountered during the development of the data warehouse and provides some
recommendations for the future of the Wal-Mart data warehouse.

Chapter XI
Medical Applications of Nanotechnology in the Research Literature ............................................... 199
Ronald N. Kostoff, Office of Naval Research, USA
Raymond G. Koytcheff, Office of Naval Research, USA
Clifford G.Y. Lau, Institute for Defense Analyses, USA

Chapter XI examines the medical applications literature associated with nanoscience and
nanotechnology research. For this research, the authors retrieved about 65,000 nanotechnology records in
2005 from the Science Citation Index/Social Science Citation Index (SCI/SSCI) using a comprehensive
300+ term query, and in this chapter they intend to facilitate the nanotechnology transition process by
identifying the significant application areas. Specifically, the chapter identifies the main nanotechnology
health applications from today’s vantage point, as well as the related science and infrastructure. The medical
applications were ascertained through a fuzzy clustering process, and metrics were generated using text
mining to extract technical intelligence for specific medical applications and application groups.

This chapter introduces an early warning system for SMEs (SEWS) as a financial risk detector that is
based on data mining. In developing the early warning system, a system was compiled in
which qualitative and quantitative data about the requirements of enterprises are taken into consideration.
Moreover, a utilitarian model that is easy to understand, interpret, and apply is targeted
by discovering the implicit relationships in the data and identifying the effect level of every
factor related to the system. The chapter ultimately shows a way of empowering the knowledge society
from the SME point of view by designing an early warning system based on data mining.
Chapter XIII
What Role is “Business Intelligence” Playing in Developing Countries?
A Picture of Brazilian Companies ...................................................................................................... 241
Maira Petrini, Fundação Getulio Vargas, Brazil
Marlei Pozzebon, HEC Montreal, Canada

Chapter XIII focuses on business intelligence (BI) projects in developing countries, and
specifically highlights Brazilian BI projects. Within a broad enquiry into the role BI is playing in
developing countries, two specific research questions are explored in this chapter. The first
seeks to determine whether the approaches, models, or frameworks are tailored to the particularities and the
contextually situated business strategy of each company, or whether they are “standard” and imported from
“developed” contexts. The second analyzes what type of information is being considered for
incorporation by BI systems: whether it is formal or informal in nature; whether it is gathered
from internal or external sources; whether there is a trend that favors some areas, like finance or
marketing, over others, or whether there is a concern with maintaining multiple perspectives; who in the
firms is using BI systems; and so forth.

In Chapter XIV, the author proposes a simple and accessible conceptual knowledge discovery interface,
based on a geographical information system (GIS), that can be used as a decision-making tool. The chapter
also addresses some issues that might make this knowledge infrastructure stimulate sustainable development,
with special emphasis on the sub-Saharan African region.

Chapter XV discusses the application of data mining to develop drought monitoring utilities, which
enable monitoring and prediction of drought’s impact on vegetation conditions. The chapter also
summarizes current research using data mining approaches to build various types of drought monitoring
tools and explains how they are being integrated with decision support systems, focusing specifically on
drought monitoring and prediction in the United States.

Compilation of References .............................................................................................................. 292

About the Contributors ................................................................................................................... 325

Index ................................................................................................................................................ 330

Foreword

Advances in information technology and data collection methods have led to the availability of larger
data sets in government and commercial enterprises, and in a wide variety of scientific and engineering
disciplines. Consequently, researchers and practitioners have an unprecedented opportunity to analyze
this data more rigorously and extract intelligent and useful information from it.
The traditional approach to data analysis for decision making has been to merge business
and scientific expertise with statistical modeling techniques in order to develop experimentally verified
solutions for explicit problems. In recent years, a number of trends have emerged that have started to
challenge this traditional approach. One trend is the increasing accessibility of large volumes of high-
dimensional data, occupying database tables with many millions of rows and many thousands of col-
umns. Another trend is the increasing dynamic demand for rapidly building and deploying data-driven
analytics. A third trend is the increasing necessity to present analysis results to end-users in a form that
can be readily understood and assimilated so that end-users can gain the insights they need to improve
the decisions they make.
Data mining tools sweep through databases and identify previously hidden patterns in one step. An
example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products
that are often purchased together. Other pattern discovery problems include detecting fraudulent credit
card transactions and identifying anomalous data that could represent data entry keying errors. Data
mining algorithms embody techniques that have existed for at least 10 years, but have only recently
been implemented as mature, reliable, understandable tools that consistently outperform older statisti-
cal methods.
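
The retail example above can be illustrated with a minimal sketch. It is not taken from any chapter of this book: the function name `frequent_pairs`, the `min_support` parameter, and the sample baskets are all hypothetical, chosen only to show how pairs of products that are often purchased together might be counted.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return item pairs whose support (the fraction of transactions
    containing both items) is at least min_support."""
    pair_counts = Counter()
    for basket in transactions:
        # Deduplicate and sort so each pair has one canonical form.
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    n = len(transactions)
    return {
        pair: count / n
        for pair, count in pair_counts.items()
        if count / n >= min_support
    }


# Hypothetical baskets, invented purely for illustration.
baskets = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "beer"],
]
pairs = frequent_pairs(baskets, min_support=0.75)
```

With these assumed baskets, only the pair (beer, diapers) appears in at least three of the four transactions. Practical tools rely on more scalable algorithms, but the underlying idea of counting co-occurrences against a support threshold is the same.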
This book focuses specifically on applying data mining techniques to design, develop, and
evaluate social advancement processes that have been applied in several developing economies. It
provides an overview of the main issues of data mining (including classification, regression,
clustering, association rules, trend detection, feature selection, intelligent search, data cleaning, privacy
and security issues, etc.) and knowledge-enhancing processes, as well as a wide spectrum of data mining
applications such as computational natural science, e-commerce, environmental study, financial market
study, network monitoring, social service analysis, and so forth.
This book will be valuable to researchers, academics, and practitioners, including GOs and
NGOs, for further research and study, especially those working on the monitoring and
evaluation of projects and follow-up activities on development projects, and it will be invaluable
scholarly content for development practitioners.

Dr. Abdul Matin Patwari

Vice Chancellor, The University of Asia Pacific
Dhaka, Bangladesh.

Preface

Data mining may be characterized as the process of extracting intelligent information from large amounts
of raw data, and it is day by day becoming a pervasive technology in activities as diverse as using historical
data to predict the success of an awareness-raising campaign by looking into pattern sequence formations,
a promotional operation by looking into pattern sequence transformations, a monitoring tool by looking
into pattern sequence repetitions, or an analysis tool by looking into pattern sequence formations.
Theories and concepts of data mining were only recently added to the database arena, and research in this
area goes back little more than a decade. Very minor research and development activities were
observed in the 1990s, alongside the immense prospects of information and communication technologies
(ICTs). Organized and coordinated research on data mining started in 2001, with the advent of various
workshops, seminars, promotional campaigns, and funded research. International conferences on data
mining organized by the Institute of Electrical and Electronics Engineers, Inc. (since 2001), Wessex Institute
of Technology (since 1999), Society for Industrial and Applied Mathematics (since 2001), Institute of
Computer Vision and applied Computer Sciences (since 1999), and World Academy of Science are among
the leaders in creating awareness of advanced research activities on data mining and its effective
applications. Furthermore, these events reveal that the theme of research has been shifting from fundamental
data mining to information engineering and/or information management over these years.
Data mining is a promising and relatively new area of research and development, which can provide
important advantages to the users. It can yield substantial knowledge from data primarily gathered
through a wide range of applications. Various institutions have derived considerable benefits from its
application, and many other industries and disciplines are now applying the methodology to increasing
effect for their benefit.
Subsequently, collective efforts in machine learning, artificial intelligence, statistics, and database
communities have been reinforcing technologies of knowledge discovery in databases to extract valuable
information from massive amounts of data in support of intelligent decision making. Data mining aims
to develop algorithms for extracting new patterns from the facts recorded in a database, and until now,
data mining tools have adopted techniques from statistics, network modeling, and visualization to classify
data and identify patterns. Ultimately, knowledge recovery aims to enable an information system to transform
information into knowledge through hypothesis, testing and theory formation. It sets new challenges for
database technology: new concepts and methods are needed for basic operations, query languages, and
query processing strategies (Witten & Frank, 2005; Yuan, Buttenfield, Gehagen & Miller, 2004).
However, data mining does not provide any straightforward analysis, nor does it necessarily equate
with machine learning, especially in the case of relatively large databases. Furthermore, an exhaustive
statistical analysis is not possible, though many data mining methods contain a degree of
nondeterminism to enable them to scale to massive datasets.
At the same time, successful applications of data mining are not common, despite the vast literature
now accumulating on the subject. The reason is that, although it is relatively straightforward to find
pattern or structure in data, establishing its relevance and explaining its cause are both very difficult
tasks. In addition, much of what has been discovered so far may well be known to the expert.
Therefore, addressing these problematic issues requires the synthesis of underlying theory from
databases, statistics, algorithms, machine learning, and visualization (Giudici, 2003; Hastie, Tibshirani
& Friedman, 2001; Yuan, Buttenfield, Gehagen & Miller, 2004).
Along these lines, a complete guide is the need of the hour to enable practitioners to improve their
research and participate actively in solving practical problems related to data explosion, optimum
searching, qualitative content management, improved decision making, and intelligent data mining.
A book featuring all these aspects can fill a pressing knowledge gap in the contemporary
world.
Furthermore, data mining is no longer an independent research subject. To understand
its essential insights and effective implementations, one must open the knowledge periphery in
multidimensional aspects. Therefore, in this era of information revolution, data mining should be treated as a
cross-cutting and cross-sectoral feature. At the same time, data mining is becoming an interdisciplinary
field of research driven by a variety of multidimensional applications. On one hand, it entails techniques
from machine learning, pattern recognition, statistics, algorithms, databases, linguistics, and visualization.
On the other hand, one finds applications to understand human behavior, such as that of the end user of
an enterprise. It also helps entrepreneurs to perceive the type of transactions involved, including those
needed to evaluate risks or detect scams.
The reality of data explosion in multidimensional databases is a surprising and widely misunderstood
phenomenon. For those about to use an OLAP (online analytical processing) product, it is critically
important to understand what data explosion is, what causes it, and how it can be avoided, because the
consequences of ignoring data explosion can be very costly, and, in most cases, result in project failure
(Applix, 2003), while enterprise data requirements grow at 50-100% a year, creating a constant storage
infrastructure management challenge (Intransa, 2005).
Concurrently, the database community draws much of its motivation from the vast digital datasets
now available online and the computational problems involved in analyzing them. Almost without
exception, current databases and database management systems are designed without regard to knowledge or
content, so the access methods and query languages they provide are often inefficient or unsuitable for mining
tasks. The functionality of some existing methods can be approximated either by sampling the data or
reexpressing the data in a simpler form. However, algorithms attempt to encapsulate all the important
structure contained in the original data, so that information loss is minimal and mining algorithms can
function more efficiently. Therefore, sampling strategies must try to avoid bias, which is difficult if the
target and its explanation are unknown.
These are related to the core technology aspects of data mining. Apart from the intricate technology
context, the applications of data mining methods lag in the development context. Lack of data has been
found to inhibit the ability of organizations to fully assist clients, and lack of knowledge made the gov-
ernment vulnerable to the influence of outsiders who did have access to data from countries overseas.
Furthermore, disparity in data collection demands coordinated data archiving and data sharing, which
are crucial for developing countries.
The technique of data mining enables governments, enterprises, and private organizations to carry
out mass surveillance and personalized profiling, in most cases without any controls or right of access
to examine this data. However, to raise the human capacity and establish effective knowledge systems
from the applications of data mining, the main focus should be on sustainable use of resources and the
associated systems under specific context (ecological, climatic, social and economic conditions) of
developing countries. Research activities should also focus on sustainable management of vulnerable
resources and apply integrated management techniques, with a view to support the implementation of
the provisions related to research and sustainable use of existing resources (EC, 2005).
To obtain advantages of data mining applications, the scientific issues and aspects of archiving scientific
and technology data can include the discipline specific needs and practices of scientific communities as
well as interdisciplinary assessments and methods. In this context, data archiving can be seen primarily
as a program of practices and procedures that support the collection, long-term preservation, and low-cost
access to, and dissemination of, scientific and technology data. The tasks of data archiving include
digitizing data, gathering digitized data into archive collections, describing the collected data to
support long term preservation, decreasing the risks of losing data, and providing easy ways to make the
data accessible. Hence, data archiving and the associated data centers need to be part of the day-to-day
practice of science. This is particularly important now that much new data is collected and generated
digitally, and regularly (Codata, 2002; Mohammadian, 2004).
So far, data mining has existed in the form of discrete technologies. Recently, its integration into many
other formats of ICTs has become attractive as various organizations possessing huge databases began to
realize the potential of information hidden there (Hernandez, Göhring & Hopmann, 2004). Thereby, the
Internet can be a tremendous tool for the collection and exchange of information, best practices, success
cases and vast quantities of data. But it is also becoming increasingly congested and its popular use raises
issues about authentication and evaluation of information and data. Interoperability is another issue,
which poses significant challenges. The growing number, volume, and complexity of data sources, together
with the high-speed connectivity of the Internet,
are making interoperability and data integration an important research and industry focus. Moreover,
incompatibilities between data formats, software systems, methodologies, and analytical models are
creating barriers to the easy flow and creation of data, information, and knowledge (Carty, 2002). All of these
demand not only a technology revolution, but also a tremendous uplift of human capacity as a whole.
Therefore, the challenge of pursuing human development that takes the social and economic background
into account while protecting the environment confronts decision makers such as national governments,
local communities, and development organizations. The question arises: how can new information and
communication technology be applied to fulfill this task (Hernandez, Göhring & Hopmann, 2004)? This
book reviews data mining and decision support techniques and what they require to achieve sustainable
outcomes. It looks into proven global approaches to data mining and shows its capabilities as an
effective instrument on the basis of its application in real projects in developing countries. The
applications cover the development of algorithms, computer security, open and distance learning, online
analytical processing, scientific modeling, simple warehousing, and social and economic development
processes.
Applying data mining techniques in various aspects of social development processes could thereby
empower society with proper knowledge and would yield economic benefits by raising its economic
capabilities.
In addition, coupled with linguistic techniques, data mining has produced the new field of text
mining. This has considerably widened the application of data mining to extracting ideas and sentiment
from a wide range of sources, and has opened up new possibilities for data mining to act as a bridge
between technology and the physical sciences and the social sciences. Furthermore, data mining today
is recognized as an important tool for analyzing and understanding the information collected by
governments, businesses, and scientific centers. In this context, novel data, text, and Web mining
application areas are emerging fast, and these developments call for new perspectives and approaches
in the form of inclusive research.
Similarly, info-miners in the distance learning community are using one or more info-mining tools.
They offer high-quality open and distance learning (ODL) information retrieval and search services.
Thus, ICT-based info-mining services will likely produce huge digital libraries of e-books,
journals, reports, and databases on DVD and similar high-density information storage media. Most of
these off-line formats are PC-accessible and can store considerably more information per unit than a
CD-ROM (COL, 2003). Hence, knowledge enhancement processes can be significantly improved through
the proper use of data mining techniques.
Thus, data mining techniques are gradually becoming essential components of corporate intelligence
systems and are progressively evolving into a pervasive technology within activities that range from
using historical data to predict the success of an awareness campaign or a promotional operation, to
searching for succession patterns used as monitoring tools, to analyzing genome chains or forming
knowledge banks. In reality, data mining is becoming an interdisciplinary field driven by various
multidimensional applications. On the one hand, it involves machine learning, pattern recognition,
statistics, algorithms, databases, linguistics, and visualization. On the other hand, it is applied to
understand human behavior, to understand the types of transactions involved, or to evaluate risks or
detect fraud in an enterprise. Data mining can yield substantial knowledge from raw data that are
primarily gathered for a wide range of applications. Various institutions have derived significant
benefits from its application, and many other industries and disciplines are now applying the same
approach to increasing effect in their overall management development.
This book examines the meaning and role of data mining in social development initiatives and their
outcomes in developing economies, with a view to upholding knowledge dimensions. At the same time, it
takes an in-depth look at the critical management of information in developed countries from a similar
point of view. Furthermore, this book provides an overview of the main issues of data mining
(including classification, regression, clustering, association rules, trend detection, feature
selection, intelligent search, data cleaning, privacy and security issues, etc.) and of knowledge
enhancing processes, as well as a wide spectrum of data mining applications such as computational
natural science, e-commerce, environmental study, business intelligence, network monitoring, social
service analysis, and so forth, to empower the knowledge society.

Where the Book Stands

In the global context, a combination of continual technological innovation and increasing
competitiveness makes the management of information a huge challenge and requires decision-making
processes built on reliable and timely information gathered from available internal and external
sources. Although the volume of acquired information is increasing immensely, this does not mean that
people are able to derive appropriate value from it (Maira & Marlei, 2003). This calls for rigorous
investigation of information archival strategies and demands years of continuous investment in order
to put in place a technological platform that supports all development processes and strengthens the
efficiency of the operational structure. Most organizations are assumed to have reached a level at
which the implementation of IT solutions for strategic levels becomes achievable and essential. This
context explains the emergence of the domain generally known as “intelligent data mining”, seen as an
answer to the current demand for data and information for decision-making through the intensive use of
information technology.
The objective of the book is to examine the meaning and role of data mining in a particular context
(i.e., in terms of development initiatives and their outcomes), especially in developing countries and
transitional economies. If the management of information is a challenge even to enterprises in developed
countries, what can be said about organizations struggling in unstable contexts such as developing ones?
The book also addresses data mining applications in the context of developed countries.
With the unprecedented rate at which data is being collected today in almost all fields of human
endeavor, there is an emerging demand to extract useful information from it for the economic and
scientific benefit of society. Intelligent data mining enables the community to take advantage of the
gathered data and information by making intelligent decisions. This increases the knowledge content of
each member of the community, provided it can be applied in practical areas. Eventually, a knowledge
base is created and a knowledge-based society can be established.
Data mining involves the automatic discovery of patterns, sequences, transformations, associations,
and anomalies in massive databases, and is an enormously interdisciplinary field representing the
confluence of several disciplines, including database systems, data warehousing, machine learning,
statistics, algorithms, data visualization, and high-performance computing (LCPS, 2001; UN, 2004). A
book of this nature, encompassing such a wide-ranging subject area, has been missing from the
contemporary global market, and this book intends to fill that knowledge gap.
In this context, this book provides an overview of the main issues of data mining (including
classification, regression, clustering, association rules, trend detection, feature selection,
intelligent search, data cleaning, privacy and security issues, etc.) and of knowledge enhancing
processes, as well as a wide spectrum of data mining applications such as computational natural
science, e-commerce, environmental study, financial market study, machine learning, Web mining,
nanotechnology, e-tourism, and social service analysis.
Apart from providing insight into the advanced context of data mining, this book emphasizes:

• Development and availability of shared data, metadata, and products commonly required across
diverse societal benefit areas
• Promoting research efforts that are necessary for the development of tools required in all societal
benefit areas
• Encouraging and facilitating the transition from research to operations of appropriate systems and
techniques
• Facilitating partnerships between operational groups and research groups
• Developing recommended priorities for new or augmented efforts in human capacity building
• Contributing to, accessing, and retrieving data from global data systems and networks
• Encouraging the adoption of existing and new standards to support broader data and information
usability
• Data management approaches that encompass a broad perspective on the observation of data life
cycle, from input through processing, archiving, and dissemination, including reprocessing, analysis
and visualization of large volumes and diverse types of data
• Facilitating recording and storage of data in clearly defined formats, with metadata and quality
indications to enable search, retrieval, and archiving as easily accessible data sets
• Facilitating user involvement and conducting outreach at global, regional, national and local levels
• Complete and open exchange of data, metadata, and products among relevant agencies, in keeping with
national policies and legislation

Organization of Chapters

Altogether, this book has fifteen chapters, divided into three sections: Education and Research;
Tools, Techniques, Methods; and Applications of Data Mining. Section I has three chapters, which
discuss policy and decision-making approaches of data mining for sociodevelopment aspects in
technical and semitechnical contexts. Section II comprises five chapters, which illustrate tools,
techniques, and methods of data mining applications for various human development processes and
scientific research. Section III has seven chapters, which present case studies, practical
applications, and research activities on data mining applications that are being used in social
development processes to empower knowledge societies.
Chapter I provides an overview of a series of multiple criteria optimization-based data mining
methods that use multiple criteria programming (MCP) to solve various data mining problems. The
authors state that data mining is built on many disciplines, such as machine learning, databases,
statistics, computer science, and operations research, and that each field comprehends data mining
from its own perspective and makes distinct contributions. They further state that, because of the
difficulty of assessing the accuracy of hidden data and increasing the prediction rate in a complex
large-scale database, researchers and practitioners have always sought new or alternative data mining
techniques. The chapter closes by outlining a few research challenges and opportunities.
Chapter II identifies some important barriers to the successful application of computational
intelligence (CI) techniques in a commercial environment and suggests various ways in which they may
be overcome. It states that CI offers new opportunities to businesses that wish to improve the
efficiency of their operations. In this context, the chapter identifies a few key conceptual,
cultural, and technical barriers and describes the different ways in which they affect business users
and CI practitioners. The chapter aims to give its readers insight drawn from the outcome of a
successful computational intelligence project, and expects that by enabling both parties to understand
each other’s perspectives, the true potential of CI may be realized.
Chapter III describes two data mining techniques that are used to explore frequent large itemsets
in a database. In the first technique, called the closed directed graph approach, the algorithm scans
the database once, counting the possible 2-itemsets; only the 2-itemsets that meet a minimum support
are used to form the closed directed graph, which is then used to explore frequent large itemsets in
the database. In the second technique, the dynamic hashing algorithm, large 3-itemsets are generated
at an earlier stage, which reduces the size of the transaction database after trimming and thereby
reduces the cost of later iterations. Furthermore, the chapter suggests that these techniques may
help researchers not only to understand how frequent large itemsets are generated but also to find
association rules among transactions within relational databases and make knowledgeable decisions.
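The first step of the graph-based technique described for Chapter III (a single scan that counts
2-itemsets and keeps those meeting a minimum support) can be illustrated with a minimal Python sketch.
This is not the chapter's algorithm: the function names, the sample transactions, and the choice to
direct each edge from the alphabetically smaller item to the larger one are assumptions made here for
illustration only.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Count every 2-itemset in one pass and keep those meeting min_support."""
    counts = Counter()
    for transaction in transactions:
        # Sort and deduplicate so each pair has one canonical form (a, b) with a < b.
        items = sorted(set(transaction))
        counts.update(combinations(items, 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}


def build_graph(pairs):
    """Adjacency sets with an edge a -> b for each frequent pair (a, b).

    The edge direction is an illustrative assumption, not taken from the chapter.
    """
    graph = {}
    for a, b in pairs:
        graph.setdefault(a, set()).add(b)
    return graph


# Hypothetical transaction database.
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
    ["eggs", "milk"],
]

pairs = frequent_pairs(transactions, min_support=2)
graph = build_graph(pairs)
```

With a minimum support of 2, the pair containing "eggs" is dropped because it occurs only once, so
only the three pairs among bread, butter, and milk contribute edges to the graph.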
Every day, different satellites capture data of distinct kinds, among which images are processed and
stored by many institutions. In Chapter IV, the authors present relevant definitions in the remote
sensing and image mining domain, referring to related work in this field and indicating the importance
of appropriate tools and techniques for analyzing satellite images and extracting knowledge from this
kind of data. As a case study, the Amazonia deforestation problem is discussed, as well as INPE’s
effort to develop and spread technology to deal with challenges involving Earth observation resources.
The purpose is to present relevant technologies, new approaches, and research directions in remote
sensing image mining, and to demonstrate how to increase the analysis potential of such huge strategic
data for the benefit of researchers.
Chapter V reviews contemporary research on machine learning and Web mining methods that are
related to areas of social benefit. It demonstrates that machine learning and Web mining methods may
provide intelligent Web services of social interest. The chapter also reveals a growing interest in
using advanced computational methods, such as machine learning and Web mining, to provide better
services to the public, as most research identified in the literature has been conducted in recent
years. The chapter aims to help researchers and academics from different disciplines understand how
Web mining and machine learning methods are applied to Web data. Furthermore, it aims to present the
latest developments in research in this field related to societal benefit areas.
In recent times, customer relationship management (CRM) has come to be related to sales, marketing,
and even services automation. Additionally, the concept of CRM is increasingly associated with cost
savings and streamlined processes as well as with the engendering, nurturing, and tracking of
relationships with customers. Chapter VI seeks to illustrate how, although the product and service
elements as well as organizational structure and strategies are central to CRM, data is the pivotal
dimension around which the concept revolves in contemporary terms. It then demonstrates how these
processes are associated with data management, namely data collection, data collation, data storage,
and data mining, which are becoming essential components of CRM in both theoretical and practical
respects.
In Chapter VII, the authors introduce the concept of “one-sum” weighted association rules (WARs) and
name such WARs allocating patterns (ALPs). An algorithm is also proposed to extract hidden and
interesting ALPs from data. The chapter further points out that ALPs can be applied in portfolio
management: a collection of investment portfolios can be modeled as a one-sum weighted transaction
database that contains hidden ALPs, and those ALPs, mined from the given portfolio data, can
eventually be applied to guide future investment activities.
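To make the portfolio modeling idea concrete, the following Python sketch builds a small weighted
transaction database from investment portfolios. It assumes that “one-sum” means the item weights
within each transaction add up to one; the function name, asset names, and amounts are hypothetical,
and the chapter's ALP mining algorithm itself is not reproduced here.

```python
def to_one_sum(portfolio):
    """Scale a portfolio's invested amounts so that the weights sum to 1."""
    total = sum(portfolio.values())
    if total <= 0:
        raise ValueError("portfolio must have a positive total")
    return {asset: amount / total for asset, amount in portfolio.items()}


# Hypothetical portfolios: asset name -> amount invested.
portfolios = [
    {"bonds": 6000, "stocks": 3000, "cash": 1000},
    {"bonds": 2500, "stocks": 7500},
]

# Each entry is one weighted transaction whose weights add up to one.
database = [to_one_sum(p) for p in portfolios]
```

Normalizing in this way makes portfolios of different sizes comparable, since each transaction then
records how the total was allocated rather than how much was invested.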
Chapter VIII focuses on data mining applications and their use in formulating performance-measuring
tools for social development activities. In this context, the chapter justifies the inclusion of data
mining algorithms in establishing specifically derived monitoring and evaluation tools for various
social development applications. In particular, the chapter gives in-depth analytical observations on
establishing knowledge centers through a range of approaches, and finally it puts forward a few
research issues and challenges in transforming contemporary human society into a knowledge society.
Chapter IX highlights a few areas of development and suggests applications of data mining tools
through which decision-making would be made easier. The chapter then puts forward potential areas of
social development initiatives where data mining applications can be introduced. The focus areas
range from basic education, health care, general commodities, tourism, and ecosystem management to
advanced uses such as database tomography. The chapter also presents some future challenges and
recommendations for using data mining applications to empower the knowledge society.
Chapter X focuses on business data warehousing and discusses the retailing giant Wal-Mart. The
chapter describes the planning and implementation of the Wal-Mart data warehouse and discusses its
integration with the operational systems. It also highlights some of the problems encountered during
the development of the data warehouse and provides some recommendations for the future.
Chapter XI examines the medical applications literature associated with nanoscience and
nanotechnology research. The authors retrieved about 65,000 nanotechnology records in 2005 from the
Science Citation Index/Social Science Citation Index (SCI/SSCI) using a comprehensive 300+ term query.
The chapter intends to facilitate the nanotechnology transition process by identifying the significant
application areas. It also identifies the main nanotechnology health applications from today’s vantage
point, as well as the related science and infrastructure. The medical applications were identified
through a fuzzy clustering process, and metrics were generated using text mining to extract technical
intelligence for specific medical applications and application groups.
Chapter XII introduces an early warning system for SMEs (SEWS), a financial risk detector based on
data mining. Through a study, the chapter builds a system whose development takes into consideration
qualitative and quantitative data about the requirements of enterprises. Moreover, the system aims at
a practical model that is easy to understand, interpret, and apply, obtained by discovering the
implicit relationships in the data and identifying the effect level of every factor related to the
system. The chapter also shows a way of empowering the knowledge society from the SME point of view
by designing an early warning system based on data mining. Using this system, SME managers could
easily gain financial management and risk management knowledge without any prior knowledge or
expertise.
Chapter XIII looks at various business intelligence (BI) projects in developing countries, focusing
specifically on Brazilian BI projects. The authors pose the question: if the management of IT is a
challenge for companies in developed countries, what can be said about organizations struggling in
unstable contexts such as those often prevailing in developing countries? Within this broad enquiry
into the role BI plays in developing countries, two specific research questions are explored in the
chapter. The purpose of the first question is to determine whether the approaches, models, or
frameworks are tailored to the particularities and the contextually situated business strategy of
each company, or whether they are “standard” and imported from “developed” contexts. The purpose of
the second is to analyze what type of information is being considered for incorporation into BI
systems; whether it is formal or informal in nature; whether it is gathered from internal or external
sources; whether there is a trend that favors some areas, like finance or marketing, over others, or
whether there is a concern with maintaining multiple perspectives; who in the firms is using BI
systems; and so forth.
Technologies such as geographic information systems (GIS) enable geo-spatial information to be
gathered, modified, integrated, and mapped easily and cost-effectively. However, these technologies
generate both opportunities and challenges for achieving wider and more effective use of geo-spatial
information in stimulating and maintaining sustainable development through elegant policy making. In
Chapter XIV, the author proposes a simple and accessible conceptual knowledge discovery interface
that can be used as a tool. Moreover, the chapter addresses some issues that might enable this
knowledge infrastructure to stimulate sustainable development, with particular emphasis on the
sub-Saharan African region.
Finally, Chapter XV discusses the application of data mining to develop drought monitoring tools
that enable monitoring and prediction of drought’s impact on vegetation conditions. The chapter also
summarizes current research using data mining approaches (e.g., association rules and decision-tree
methods) to develop various types of drought monitoring tools and briefly explains how they are being
integrated with decision support systems. This chapter also introduces how data mining can be used to
enhance drought monitoring and prediction in the United States, and at the same time, assist others to
understand how similar tools might be developed in other parts of the world.

Conclusion

Data mining is becoming an essential tool in science, engineering, industrial processes, healthcare,
and medicine. The datasets in these fields are large, complex, and often noisy. Extracting knowledge
from raw datasets therefore requires sophisticated, high-performance, and principled analysis
techniques and algorithms based on sound statistical foundations. In turn, these techniques require
powerful visualization technologies, implementations that are carefully tuned for enhanced
performance, and software systems that are usable by scientists, engineers, and physicians as well as
researchers.
Data mining, as stated earlier, is the extraction of hidden predictive information from large
databases, and it is a powerful new technology with great potential to help enterprises focus on the
most important information in their data warehouses. Data mining tools predict future trends and
behaviors, allowing entrepreneurs to make proactive, knowledge-driven decisions. The automated,
prospective analyses offered by data mining move beyond the analyses of past events provided by the
retrospective tools typical of decision support systems. Data mining tools can answer business
questions that traditionally were too time-consuming to resolve. They scour databases for hidden
patterns, finding predictive information that experts may miss because it lies outside their
expectations.
In effect, data mining techniques are the result of a long process of research and product
development. This evolution began when business data was first stored on computers, continued with
improvements in data access, and more recently generated technologies that allow users to navigate
through their data in real time. Data mining takes this evolutionary progression beyond retrospective
data access and navigation to prospective and proactive information delivery. Furthermore, data mining
algorithms allow researchers to devise unique decision-making tools from emancipated data of varying
nature. Above all, by applying data mining techniques, extremely valuable utilities can be devised
that could raise the knowledge content at each tier of society.
However, in terms of accumulated literature and research, not many publications are available on
data mining applications in social development, especially in the form of a book. Taking this as a
baseline, the compiled literature appears to be extremely valuable for using data mining and other
information techniques to improve skills development, knowledge management, and societal benefits.
Similarly, Internet search engines do not return sufficient bibliographies on data mining from a
development perspective. Given the high demand from researchers in the field of ICTD, a book of this
format stands to be unique. Moreover, the use of new ICTs in the form of data mining deserves
appropriate intervention for their diffusion at local, national, regional, and global levels.
It is expected that numerous individuals, academics, researchers, engineers, and professionals from
government and nongovernment security and development organizations will be interested in this
increasingly important topic for carrying out implementation strategies for their national
development. This book will help its readers understand the key practical and research issues related
to applying data mining in development data analysis, cyber acclamations, digital deftness,
contemporary CRM, investment portfolios, early warning systems in SMEs, business intelligence, and
intrinsic nature in the context of social uplift as a whole, as well as the use of data and
information for empowering knowledge societies.
Most books on data mining deal merely with technological aspects, despite the diverse nature of its
applications across many tiers of human endeavor. A few activities in recent years have been
producing high-quality proceedings, but it is felt that a compilation of this nature, drawing on
advanced research outcomes from around the world, may produce a book that is in demand among
researchers.

References

Applix (2003). OLAP data scalability: Ignore the OLAP data explosion at great cost. A White Paper.
Westborough, MA: Applix, Inc.
Carty, A. J. (2002, September 29). Scientific and technical data: Extending the frontiers of research. In Pro-
ceedings of the Opening Address at the 18th International CODATA Conference, Montreal, Quebec.
Codata (2002, May 21-22). In Proceedings of the Workshop on Archiving Scientific and Technical Data,
Committee on Data for Science and Technology (CODATA), Pretoria, South Africa.
COL (2003). Find information faster: COL’s “Info-mining” tools. Vancouver, BC: Clippings, Com-
monwealth of Learning.
EC (2005). Integrating and strengthening the European Research Area, 2005 Work Programme (SP1-
10). European Commission.
Hernandez, V., Göhring, W., & Hopmann, C. (2004, Nov. 30-Dec. 3). Sustainable decision support for
environmental problems in developing countries: Applying multicriteria spatial analysis on the Nicara-
gua Development Gateway niDG. In Proceedings of the Workshop on Binding EU-Latin American IST
Research Initiatives for Enhancing Future Co-Operation. Santo Domingo, Costa Rica.
Giudici, P. (2003). Applied data mining: Statistical methods for business and industry. John Wiley.
Hastie, T., Tibshirani, R., & Friedman, J. (2001) (Eds.). The elements of statistical learning: Data min-
ing, inference, and prediction. Springer Verlag.
Intransa (2005). Managing storage growth with an affordable and flexible IP SAN: A highly cost-effective
storage solution that leverages existing IT resources. San Jose, CA: Intransa, Inc.
LCPS (2001, September 11-12). Draft workshop report. In Proceedings of the International Consulta-
tive Workshop, The Digital Initiative for Development Agency (DID), The Lebanese Center for Policy
Studies (LCPS), Beirut.
Maira, P. & Marlei, P. (2003, June 16-21). The value of “business intelligence” in the context of
developing countries. In Proceedings of the 11th European Conference on Information Systems, ECIS 2003,
Naples, Italy. Retrieved April 6, 2008, from http://is2.lse.ac.uk/asp/aspecis/20030119.pdf
Mohammadian, M. (2004). Intelligent agents for data mining and information retrieval. Hershey, PA:
Idea Group Publishing.
UN (2004, June 16). Draft Sao Paulo Consensus, UNCTAD XI Multi-Stakeholder Partnerships, United
Nations Conference on Trade and Development, TD/L.380/Add.1, Sao Paulo.
Witten, I. H. & Frank, E. (2005). Data mining: Practical machine learning tools and techniques (2nd
ed). Morgan Kaufmann.
Yuan, M., Buttenfield, B., Gehagen, M. & Miller, H. (2004). Geospatial data mining and knowledge
discovery. In R. B. McMaster & E. L. Usery (Eds.), A research agenda for geographic information sci-
ence (pp. 365-388). Boca Raton, FL: CRC Press.

Acknowledgment

The editor would like to acknowledge the assistance of all those involved in the collection of
manuscripts, the painstaking review process, and the methodical revision of the book, without whose
support the project could not have been satisfactorily completed. I am indebted to all the authors,
who provided relentless and generous support; the reviewers who were most helpful and provided
comprehensive, thorough, and creative comments are Ali Serhan Koyuncugil, Georgios Lappas, and Paul
Henman. Thanks go to my close friends at UNDP and colleagues at SDNF and ICMS for their wholehearted
encouragement during the entire process.
Special thanks also go to the dedicated publishing team at IGI Global, particularly to Kristin Roth,
Jessica Thompson, and Jennifer Neidig for their continuous suggestions, support, and feedback via
e-mail in keeping the project on schedule, and to Mehdi Khosrow-Pour and Jan Travers for their
enduring professional support. Finally, I would like to thank all my family members for their love
and support throughout this period.

Hakikur Rahman, Editor

SDNF, Bangladesh
September 2007
Section I
Education and Research

Chapter I
Introduction to Data Mining
Techniques via Multiple Criteria
Optimization Approaches and
Applications
Yong Shi
University of the Chinese Academy of Sciences, China
and University of Nebraska at Omaha, USA

Yi Peng
University of Nebraska at Omaha, USA

Gang Kou
University of Nebraska at Omaha, USA

Zhengxin Chen
University of Nebraska at Omaha, USA

Abstract

This chapter provides an overview of a series of multiple criteria optimization-based data mining meth-
ods, which utilize multiple criteria programming (MCP) to solve data mining problems, and outlines
some research challenges and opportunities for the data mining community. To achieve these goals, this
chapter first introduces the basic notions and mathematical formulations for multiple criteria optimiza-
tion-based classification models, including the multiple criteria linear programming model, multiple
criteria quadratic programming model, and multiple criteria fuzzy linear programming model. Then it
presents the real-life applications of these models in credit card scoring management, HIV-1 associated
dementia (HAD) neuronal damage and dropout, and network intrusion detection. Finally, the chapter
discusses research challenges and opportunities.

Copyright © 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.

Introduction

Data mining has become a powerful information technology tool in today’s competitive business world. As the sizes and varieties of electronic datasets grow, the interest in data mining is increasing rapidly. Data mining is established on the basis of many disciplines, such as machine learning, databases, statistics, computer science, and operations research. Each field comprehends data mining from its own perspective and makes its distinct contributions. It is this multidisciplinary nature that brings vitality to data mining. One of the application roots of data mining can be regarded as statistical data analysis in the pharmaceutical industry. Nowadays the financial industry, including commercial banks, has benefited from the use of data mining. In addition to statistics, decision trees, neural networks, rough sets, fuzzy sets, and support vector machines have gradually become popular data mining methods over the last 10 years. Due to the difficulty of assessing the accuracy of hidden data and increasing the prediction rate in a complex large-scale database, researchers and practitioners have always desired to seek new or alternative data mining techniques. This is a key motivation for the proposed multiple criteria optimization-based data mining methods.

The objective of this chapter is to provide an overview of a series of multiple criteria optimization-based methods, which utilize multiple criteria programming (MCP) to solve classification problems. In addition to giving an overview, this chapter lists some data mining research challenges and opportunities for the data mining community. To achieve these goals, the next section introduces the basic notions and mathematical formulations for three multiple criteria optimization-based classification models: the multiple criteria linear programming model, multiple criteria quadratic programming model, and multiple criteria fuzzy linear programming model. The third section presents some real-life applications of these models, including credit card scoring management, classifications on HIV-1 associated dementia (HAD) neuronal damage and dropout, and network intrusion detection. The chapter then outlines research challenges and opportunities, and the conclusion is presented.

Multiple Criteria Optimization-Based Classification Models

This section explores solving classification problems, one of the major areas of data mining, through the use of multiple criteria mathematical programming-based methods (Shi, Wise, Luo, & Lin, 2001; Shi, Peng, Kou, & Chen, 2005). Such methods have shown strong applicability in solving a variety of classification problems (e.g., Kou et al., 2005; Zheng et al., 2004).

Classification

Although the definition of classification in data mining varies, the basic idea of classification can be generally described as to “predict the most likely state of a categorical variable (the class) given the values of other variables” (Bradley, Fayyad, & Mangasarian, 1999, p. 6). Classification is a two-step process. The first step constructs a predictive model based on a training dataset. The second step applies the predictive model constructed in the first step to a testing dataset. If the classification accuracy on the testing dataset is acceptable, the model can be used to predict unknown data (Han & Kamber, 2000; Olson & Shi, 2005).

Using multiple criteria programming, the classification task can be defined as follows: for a given set of variables in the database, the boundaries between the classes are represented by scalars in the constraint availabilities. Then, the standards of classification are measured by minimizing the total overlapping of data and maximizing the distances of every data record to its class boundary


simultaneously. Through the algorithms of MCP, an “optimal” solution of variables (the so-called classifier) for the data observations is determined for the separation of the given classes. Finally, the resulting classifier can be used to predict the unknown data for discovering the hidden patterns of data as possible knowledge. Note that MCP differs from the known support vector machine (SVM) (e.g., Mangasarian, 2000; Vapnik, 2000). While the former uses multiple measurements to separate each data record from different classes, the latter searches for a minority of the data (the support vectors) to represent the majority in classifying the data. However, both can be generally regarded as belonging to the same category of optimization approaches to data mining.

In the following, we first discuss a generalized multi-criteria programming model formulation, and then explore several variations of the model.

A Generalized Multiple Criteria Programming Model Formulation

This section introduces a generalized multi-criteria programming method for classification. Simply speaking, this method classifies observations into distinct groups based on two criteria for data separation. The following models represent this concept mathematically:

Given an $r$-dimensional attribute vector $a = (a_1, \ldots, a_r)$, let $A_i = (A_{i1}, \ldots, A_{ir}) \in R^r$ be one of the sample records of these attributes, where $i = 1, \ldots, n$; $n$ represents the total number of records in the dataset. Suppose two groups $G_1$ and $G_2$ are predefined. A boundary scalar $b$ can be selected to separate these two groups. A vector $X = (x_1, \ldots, x_r)^T \in R^r$ can be identified to establish the following linear inequalities (Fisher, 1936; Shi et al., 2001):

• $A_i X < b, \ \forall A_i \in G_1$
• $A_i X \ge b, \ \forall A_i \in G_2$

To formulate the criteria and complete constraints for data separation, some variables need to be introduced. In the classification problem, $A_i X$ is the score for the $i$th data record. Let $\alpha_i$ be the overlapping of the two-group boundary for record $A_i$ (external measurement) and $\beta_i$ be the distance of record $A_i$ from its adjusted boundary (internal measurement). The overlapping $\alpha_i$ means the distance of record $A_i$ to the boundary $b$ if $A_i$ is misclassified into another group. For instance, in Figure 1 the “black dot” located to the right of the boundary $b$ belongs to $G_1$, but it was misclassified by the boundary $b$ to $G_2$. Thus, the distance between $b$ and the “dot” equals $\alpha_i$. The adjusted boundary is defined as $b - \alpha^*$ or $b + \alpha^*$, where $\alpha^*$ represents the maximum of overlapping (Freed & Glover, 1981, 1986). Then, a mathematical function $f(\alpha)$ can be used to describe the relation of all overlapping $\alpha_i$, while another mathematical function $g(\beta)$ represents the aggregation of all distances $\beta_i$. The final classification accuracies depend on simultaneously minimizing $f(\alpha)$ and maximizing $g(\beta)$. Thus, a generalized bi-criteria programming method for classification can be formulated as:

(Generalized Model) Minimize $f(\alpha)$ and Maximize $g(\beta)$

Subject to:

$A_i X - \alpha_i + \beta_i - b = 0, \ \forall A_i \in G_1,$
$A_i X + \alpha_i - \beta_i - b = 0, \ \forall A_i \in G_2,$

where $A_i$, $i = 1, \ldots, n$ are given, $X$ and $b$ are unrestricted, and $\alpha = (\alpha_1, \ldots, \alpha_n)^T$, $\beta = (\beta_1, \ldots, \beta_n)^T$; $\alpha_i, \beta_i \ge 0$, $i = 1, \ldots, n$.

All variables and their relationships are represented in Figure 1. There are two groups in Figure 1: “black dots” indicate $G_1$ data objects, and “stars” indicate $G_2$ data objects. There is one misclassified data object from each group if the boundary scalar $b$ is used to classify these two groups, whereas the adjusted boundaries $b - \alpha^*$ and $b + \alpha^*$ separate the two groups without misclassification.


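Figure 1. Two-group classification model

[Figure: groups $G_1$ and $G_2$ separated by the boundary $A_i X = b$, with the adjusted boundaries $A_i X = b - \alpha^*$ and $A_i X = b + \alpha^*$ and the measurements $\alpha_i$ and $\beta_i$.]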
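
The overlap and distance measurements of the generalized model can be illustrated with a minimal Python sketch. It assumes that X and b are already fixed and that each record carries either an overlap or a distance but not both, which is how the two equality constraints are satisfied when a record lies on one side of b. The function name `overlap_and_distance` and the toy data are hypothetical and not taken from the chapter.

```python
import numpy as np

def overlap_and_distance(A, in_g1, X, b):
    """Return (scores, alpha, beta) for fixed X and b.

    in_g1[i] is True when record i belongs to G1 (target: score < b)
    and False when it belongs to G2 (target: score >= b).
    """
    A = np.asarray(A, dtype=float)
    in_g1 = np.asarray(in_g1, dtype=bool)
    scores = A @ np.asarray(X, dtype=float)
    # Signed margin: positive on the correct side of b, negative when misclassified.
    margin = np.where(in_g1, b - scores, scores - b)
    alpha = np.maximum(-margin, 0.0)  # overlap (external measurement)
    beta = np.maximum(margin, 0.0)    # distance from the boundary (internal measurement)
    return scores, alpha, beta

# Hypothetical toy data: three G1 records followed by three G2 records.
A = [[1, 1], [2, 1], [4, 3], [5, 5], [6, 4], [2, 2]]
in_g1 = [True, True, True, False, False, False]
scores, alpha, beta = overlap_and_distance(A, in_g1, X=[1, 1], b=6)
alpha_star = alpha.max()  # largest overlap, used for the adjusted boundaries
print(scores, alpha, beta, alpha_star)
```

In this toy case one record from each group falls on the wrong side of b = 6, and the largest overlap is 2, so every G1 score is at most b + 2 and every G2 score is at least b - 2.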

Based on the above generalized model, the following subsection formulates a multiple criteria linear programming (MCLP) model and a multiple criteria quadratic programming (MCQP) model.

Multiple Criteria Linear and Quadratic Programming Model Formulation

Different forms of $f(\alpha)$ and $g(\beta)$ in the generalized model will affect the classification criteria. Commonly $f(\alpha)$ (or $g(\beta)$) can be component-wise and non-increasing (or non-decreasing) functions. For example, in order to utilize the computational power of some existing mathematical programming software packages, a sub-model can be set up by using the norm to represent $f(\alpha)$ and $g(\beta)$. This means that we can assume $f(\alpha) = \|\alpha\|_p$ and $g(\beta) = \|\beta\|_q$. To transform the bi-criteria problem of the generalized model into a single-criterion problem, we use weights $w_\alpha > 0$ and $w_\beta > 0$ for $\|\alpha\|_p$ and $\|\beta\|_q$, respectively. The values of $w_\alpha$ and $w_\beta$ can be pre-defined in the process of identifying the optimal solution. Thus, the generalized model is converted into a single-criterion mathematical programming model as:

Model 1: Minimize $w_\alpha \|\alpha\|_p - w_\beta \|\beta\|_q$

Subject to:

$A_i X - \alpha_i + \beta_i - b = 0, \ \forall A_i \in G_1,$
$A_i X + \alpha_i - \beta_i - b = 0, \ \forall A_i \in G_2,$

where $A_i$, $i = 1, \ldots, n$ are given, $X$ and $b$ are unrestricted, and $\alpha = (\alpha_1, \ldots, \alpha_n)^T$, $\beta = (\beta_1, \ldots, \beta_n)^T$; $\alpha_i, \beta_i \ge 0$, $i = 1, \ldots, n$.

Based on Model 1, mathematical programming models with any norm can be theoretically defined. This study is interested in formulating a linear and a quadratic programming model. Let $p = q = 1$, then $\|\alpha\|_1 = \sum_{i=1}^{n} \alpha_i$ and $\|\beta\|_1 = \sum_{i=1}^{n} \beta_i$. Let $p = q = 2$, then $\|\alpha\|_2 = \sum_{i=1}^{n} \alpha_i^2$ and $\|\beta\|_2 = \sum_{i=1}^{n} \beta_i^2$. The objective function in Model 1 can now be an MCLP model or MCQP model.

Model 2: MCLP

Minimize $w_\alpha \sum_{i=1}^{n} \alpha_i - w_\beta \sum_{i=1}^{n} \beta_i$

Subject to:

$A_i X - \alpha_i + \beta_i - b = 0, \ \forall A_i \in G_1,$
$A_i X + \alpha_i - \beta_i - b = 0, \ \forall A_i \in G_2,$


where $A_i$, $i = 1, \ldots, n$ are given, $X$ and $b$ are unrestricted, and $\alpha = (\alpha_1, \ldots, \alpha_n)^T$, $\beta = (\beta_1, \ldots, \beta_n)^T$; $\alpha_i, \beta_i \ge 0$, $i = 1, \ldots, n$.

Model 3: MCQP

Minimize $w_\alpha \sum_{i=1}^{n} \alpha_i^2 - w_\beta \sum_{i=1}^{n} \beta_i^2$

Subject to:

$A_i X - \alpha_i + \beta_i - b = 0, \ \forall A_i \in G_1,$
$A_i X + \alpha_i - \beta_i - b = 0, \ \forall A_i \in G_2,$

where $A_i$, $i = 1, \ldots, n$ are given, $X$ and $b$ are unrestricted, and $\alpha = (\alpha_1, \ldots, \alpha_n)^T$, $\beta = (\beta_1, \ldots, \beta_n)^T$; $\alpha_i, \beta_i \ge 0$, $i = 1, \ldots, n$.

Remark

There are some issues related to MCLP and MCQP that can be briefly addressed here:

1. In the process of finding an optimal solution for the MCLP problem, if some $\beta_i$ is too large with given $w_\alpha > 0$ and $w_\beta > 0$ and all $\alpha_i$ relatively small, the problem may have an unbounded solution. In real applications, the data with large $\beta_i$ can be detected as “outlier” or “noisy” in the data preprocessing, and should be removed before classification.
2. Note that although variables $X$ and $b$ are unrestricted in the above models, $X = 0$ is an “insignificant case” in terms of data separation, and therefore it should be ignored in the process of solving the problem. The case $b = 0$, however, may still yield a solution for the data separation, depending on the data structure. From experimental studies, a pre-defined value of $b$ can quickly lead to an optimal solution if the user fully understands the data structure.
3. Some variations of the generalized model, such as MCQP, are NP-hard problems. Developing algorithms directly to solve these models can be a challenge. Although in application we can utilize some existing commercial software, the related theoretical problem will be addressed later in this chapter.

Multiple Criteria Fuzzy Linear Programming Model Formulation

It has been recognized that in many decision-making problems, instead of finding the existing “optimal solution” (a goal value), decision makers often approach a “satisfying solution” between upper and lower aspiration levels that can be represented by the upper and lower bounds of acceptability for objective payoffs, respectively (Charnes & Cooper, 1961; Lee, 1972; Shi & Yu, 1989; Yu, 1985). This idea, which has an important and pervasive impact on human decision making (Lindsay & Norman, 1972), is called the decision makers’ goal-seeking concept. Zimmermann (1978) employed it as the basis of his pioneering work on fuzzy linear programming (FLP). When FLP is adopted to classify the ‘good’ and ‘bad’ data, a fuzzy (satisfying) solution is used to meet a threshold for the accuracy rate of classifications, although the fuzzy solution is a near-optimal solution.

According to Zimmermann (1978), in formulating an FLP problem, the objectives (Minimize $\sum_i \alpha_i$ and Maximize $\sum_i \beta_i$) and constraints ($A_i X = b + \alpha_i - \beta_i$, $A_i \in G$; $A_i X = b - \alpha_i + \beta_i$, $A_i \in B$) of the generalized model are redefined as fuzzy sets $F$ and $X$ with corresponding membership functions $\mu_F(x)$ and $\mu_X(x)$, respectively. In this case the fuzzy decision set $D$ is defined as $D = F \cap X$, and the membership function is defined as $\mu_D(x) = \min\{\mu_F(x), \mu_X(x)\}$. In a maximal problem, $x_1$ is a “better” decision than $x_2$ if $\mu_D(x_1) \ge \mu_D(x_2)$. Thus, it is appropriate to select $x^*$ such that

$$\max_x \mu_D(x) = \max_x \min\{\mu_F(x), \mu_X(x)\} = \min\{\mu_F(x^*), \mu_X(x^*)\}$$

is the maximized solution.
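
The max-min selection rule for the fuzzy decision set can be illustrated with a small Python sketch. It assumes a finite set of candidate decisions and two made-up membership functions; `max_min_decision`, `mu_F`, and `mu_X` are hypothetical names chosen for this illustration and do not come from Zimmermann's formulation or from the chapter.

```python
import numpy as np

def max_min_decision(candidates, mu_F, mu_X):
    """Pick x* that maximizes mu_D(x) = min(mu_F(x), mu_X(x)) over the candidates."""
    candidates = np.asarray(candidates, dtype=float)
    mu_D = np.minimum(mu_F(candidates), mu_X(candidates))
    best = int(np.argmax(mu_D))
    return candidates[best], mu_D[best]

# Hypothetical membership functions on a scalar decision x in [0, 10].
def mu_F(x):
    # Satisfaction of the objectives grows with x.
    return np.clip(x / 10.0, 0.0, 1.0)

def mu_X(x):
    # Satisfaction of the constraints shrinks once x exceeds 4.
    return np.clip((9.0 - x) / 5.0, 0.0, 1.0)

x_star, mu_star = max_min_decision(np.arange(0, 11), mu_F, mu_X)
print(x_star, mu_star)
```

On this grid the two membership curves cross at x = 6, where both equal 0.6, so that candidate is the "better" decision in the sense used above.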


Let $y_{1L}$ be Minimize $\sum_i \alpha_i$ and $y_{2U}$ be Maximize $\sum_i \beta_i$; then one can assume the value of Maximize $\sum_i \alpha_i$ to be $y_{1U}$ and that of Minimize $\sum_i \beta_i$ to be $y_{2L}$. If the “upper bound” $y_{1U}$ and the “lower bound” $y_{2L}$ do not exist for the formulations, they can be estimated. Let $F_1 = \{x: y_{1L} \le \sum_i \alpha_i \le y_{1U}\}$ and $F_2 = \{x: y_{2L} \le \sum_i \beta_i \le y_{2U}\}$; their membership functions can be expressed respectively by:

$$\mu_{F_1}(x) = \begin{cases} 1, & \text{if } \sum_i \alpha_i \ge y_{1U} \\ \dfrac{\sum_i \alpha_i - y_{1L}}{y_{1U} - y_{1L}}, & \text{if } y_{1L} < \sum_i \alpha_i < y_{1U} \\ 0, & \text{if } \sum_i \alpha_i \le y_{1L} \end{cases}$$

and

$$\mu_{F_2}(x) = \begin{cases} 1, & \text{if } \sum_i \beta_i \ge y_{2U} \\ \dfrac{\sum_i \beta_i - y_{2L}}{y_{2U} - y_{2L}}, & \text{if } y_{2L} < \sum_i \beta_i < y_{2U} \\ 0, & \text{if } \sum_i \beta_i \le y_{2L} \end{cases}$$

Then the fuzzy set of the objective functions is $F = F_1 \cap F_2$, and its membership function is $\mu_F(x) = \min\{\mu_{F_1}(x), \mu_{F_2}(x)\}$. Using the crisp constraint set $X = \{x: A_i X = b + \alpha_i - \beta_i, A_i \in G; \ A_i X = b - \alpha_i + \beta_i, A_i \in B\}$, the fuzzy set of the decision problem is $D = F_1 \cap F_2 \cap X$, and its membership function is $\mu_D(x) = \mu_{F_1 \cap F_2 \cap X}(x)$.

Zimmermann (1978) has shown that the “optimal solution” of

$$\max_x \mu_D(x) = \max_x \min\{\mu_{F_1}(x), \mu_{F_2}(x), \mu_X(x)\}$$

is an efficient solution of a variation of the generalized model when $f(\alpha) = \sum_i \alpha_i$ and $g(\beta) = \sum_i \beta_i$. Then, this problem is equivalent to the following linear program (He, Liu, Shi, Xu, & Yan, 2004):

Model 4: FLP

Maximize $\xi$

Subject to:

$\xi \le \dfrac{\sum_i \alpha_i - y_{1L}}{y_{1U} - y_{1L}},$
$\xi \le \dfrac{\sum_i \beta_i - y_{2L}}{y_{2U} - y_{2L}},$
$A_i X = b + \alpha_i - \beta_i, \ A_i \in G,$
$A_i X = b - \alpha_i + \beta_i, \ A_i \in B,$

where $A_i$, $y_{1L}$, $y_{1U}$, $y_{2L}$ and $y_{2U}$ are known, $X$ and $b$ are unrestricted, and $\alpha_i, \beta_i, \xi \ge 0$.

Note that Model 4 will produce a value of $\xi$ with $1 > \xi \ge 0$. To avoid the trivial solution, one can set up $\xi > \varepsilon \ge 0$ for a given $\varepsilon$. Therefore, seeking the maximum $\xi$ in the FLP approach becomes the standard for determining the classifications between ‘good’ and ‘bad’ records in the database. A graphical illustration of this approach can be seen in Figure 2; any point of the hyperplane $0 < \xi < 1$ over the shadow area represents a possible determination of classifications by the FLP method. Whenever Model 4 has been trained to meet the given threshold $\tau$, it is said that the better classifier has been identified.

A procedure for using the FLP method for data classification is captured by the flowchart of Figure 2. Note that although the boundary of two classes $b$ is an unrestricted variable in Model 4, it can be presumed by the analyst according to the structure of a particular database. First, choosing a proper value of $b$ can speed up solving Model 4. Second, given a threshold $\tau$, the best data separation can be selected from a number of results determined by different $b$ values. Therefore, the parameter $b$ plays a key role in this chapter in achieving and guaranteeing the desired accuracy rate $\tau$. For this reason, the FLP classification method uses $b$ as an important control parameter, as shown in Figure 2.

Real-Life Applications Using Multiple Criteria Optimization Approaches

The models of multiple criteria optimization data mining in this chapter have been applied in credit


card portfolio management (He et al., 2004; Kou, Liu, Peng, Shi, Wise, & Xu, 2003; Peng, Kou, Chen, & Shi, 2004; Shi et al., 2001; Shi, Peng, Xu, & Tang, 2002; Shi et al., 2005), HIV-1-mediated neural dendritic and synaptic damage treatment (Zheng et al., 2004), network intrusion detection (Kou et al., 2004a; Kou, Peng, Chen, Shi, & Chen, 2004b), and firm bankruptcy analyses (Kwak, Shi, Eldridge, & Kou, 2006). These approaches are also being applied in other ongoing real-life data mining projects, such as antigen and antibody analyses, petroleum drilling and exploration, fraud management, and financial risk evaluation. To help the reader understand the usefulness of the models, the key experiences in some applications are reported below.

Figure 2. A flowchart of the fuzzy linear programming classification method

Credit Card Portfolio Management

The goal of credit card accounts classification is to produce a “blacklist” of the credit cardholders; this list can help creditors to take proactive steps to minimize charge-off loss. In this study, credit card accounts are classified into two groups: ‘good’ or ‘bad’. From the technical point of view, we first need to construct a number of classifiers and then choose one that can find more bad records. The research procedure consists of five steps. The first step is data cleaning. Within this step, missing data cells and outliers are removed from the dataset. The second step is data transformation. The dataset is transformed in accord with the format requirements of the MCLP software (Kou & Shi, 2002) and LINGO 8.0, which is a software tool for solving nonlinear programming problems (LINDO Systems Inc.). The third step is dataset selection. The training dataset and the testing dataset are selected according to a heuristic process. The fourth step is model formulation and classification. The two-group MCLP and MCQP models are applied to the training dataset to obtain optimal solutions. The solutions are then applied to the testing dataset, within which class labels are removed for validation. Based on these scores, each record is predicted as either bad (bankrupt account) or good (current account). By comparing the predicted labels with the original labels of records, the classification accuracies of the multiple-criteria models can be determined. If the classification accuracy is acceptable to data analysts, this solution will be applied to future unknown credit card records or applications to make predictions. Otherwise, data analysts can modify the boundary and attribute values to get another set of optimal solutions. The fifth step is the presentation of results. The acceptable classification results are summarized in tables or figures and presented to end users.


Credit Card Dataset

The credit card dataset used in this chapter was provided by a major U.S. bank. It contains 5,000 records and 102 variables (38 original variables and 64 derived variables). The data were collected from June 1995 to December 1995, and the cardholders were from 28 states of the United States. Each record has a class label to indicate its credit status: either ‘good’ or ‘bad’. ‘Bad’ indicates a bankruptcy credit card account and ‘good’ indicates a good status account. Among these 5,000 records, 815 are bankruptcy accounts and 4,185 are good status accounts. The 38 original variables can be divided into four categories: balance, purchase, payment, and cash advance. The 64 derived variables are created from the original 38 variables to reinforce the comprehension of cardholders’ behaviors, such as times over-limit in the last two years, calculated interest rate, cash as percentage of balance, purchase as percentage of balance, payment as percentage of balance, and purchase as percentage of payment. For the purpose of credit card classification, the 64 derived variables were chosen to compute the model since they provide more precise information about credit cardholders’ behaviors.

Experimental Results of MCLP

Inspired by the k-fold cross-validation method in classification, this study proposed a heuristic process for training and testing dataset selection. Standard k-fold cross-validation is not used because the majority-vote ensemble method used later in this chapter may need hundreds of voters. If standard k-fold cross-validation were employed, k would have to be in the hundreds. The following paragraph describes the heuristic process.

First, the bankruptcy dataset (815 records) is divided into 100 intervals (each interval has eight records). Within each interval, seven records are randomly selected. The number seven is determined according to empirical results of k-fold cross-validation. Thus 700 ‘bad’ records are obtained. Second, the good-status dataset (4,185 records) is divided into 100 intervals (each interval has 41 records). Within each interval, seven records are randomly selected. Thus a total of 700 ‘good’ records is obtained. Third, the 700 bankruptcy and 700 current records are combined to form a training dataset. Finally, the remaining 115 bankruptcy and 3,485 current accounts become the testing dataset. According to this procedure, the total number of possible combinations of this selection equals $(C_8^7 \times C_{41}^7)^{100}$. Thus, the possibility of getting identical training or testing datasets is approximately zero. The across-the-board thresholds of 65% and 70% are set for the ‘bad’ and ‘good’ class, respectively. The values of the thresholds are determined from previous experience. The classification results whose predictive accuracies are below these thresholds will be filtered out.

The whole research procedure can be summarized using the following algorithm:

Algorithm 1

Input: The data set $A = \{A_1, A_2, A_3, \ldots, A_n\}$, boundary $b$
Output: The optimal solution, $X^* = (x_1^*, x_2^*, x_3^*, \ldots, x_{64}^*)$, the classification score $MCLP_i$
Step 1: Generate the Training set and the Testing set from the credit card data set.
Step 2: Apply the two-group MCLP model to compute the optimal solution $X^* = (x_1^*, x_2^*, \ldots, x_{64}^*)$ as the best weights of all 64 variables with given values of control parameters $(b, \alpha^*, \beta^*)$ in the Training set.
Step 3: The classification score $MCLP_i = A_i X^*$ of each observation in the Training set is calculated and compared against the boundary $b$ to check the performance measures of the classification.


Step 4: If the classification result of Step 3 is acceptable (i.e., the found performance measure is larger than or equal to the given threshold), go to the next step. Otherwise, arbitrarily choose different values of control parameters $(b, \alpha^*, \beta^*)$ and go to Step 1.
Step 5: Use $X^* = (x_1^*, x_2^*, \ldots, x_{64}^*)$ to calculate the MCLP scores for all $A_i$ in the Testing set and conduct the performance analysis. If it produces a satisfying classification result, go to the next step. Otherwise, go back to Step 1 to reformulate the Training Set and Testing Set.
Step 6: Repeat the whole process until a preset number (e.g., 999) of different $X^*$ are generated for the future ensemble method.
End.

Applying Algorithm 1 to the credit card dataset, classification results were obtained and summarized. Due to the space limitation, only a part (10 out of the total 500 cross-validation results) of the results is summarized in Table 1 (Peng et al., 2004). The columns “Bad” and “Good” refer to the number of records that were correctly classified as “bad” and “good,” respectively. The column “Accuracy” was calculated as the number of correctly classified records divided by the total number of records in that class. For instance, the 80.43% accuracy of DataSet 1 for bad records in the training dataset was calculated as 563 divided by 700 and means that 80.43% of bad records were correctly classified. The average predictive accuracies for bad and good groups in the training dataset are 79.79% and 78.97%, and the average predictive accuracies for bad and good groups in the testing dataset are 68% and 74.39%. The results demonstrate that a good separation of bankruptcy and good status credit card accounts is observed with this method.

Improvement of MCLP Experimental Results with Ensemble Method

In credit card bankruptcy predictions, even a small percentage increase in the classification accuracy can save creditors millions of dollars. Thus it is necessary to investigate possible techniques that can improve MCLP classification results. The technique studied in this experiment is the majority-vote ensemble. An ensemble consists of two fundamental elements: a set of trained classifiers and an aggregation mechanism that organizes these classifiers into the output ensemble. The aggregation mechanism can be an average or a

Table 1. MCLP credit card accounts classification

Cross          Training Set (700 Bad + 700 Good)      Testing Set (115 Bad + 3485 Good)
Validation     Bad   Accuracy   Good   Accuracy       Bad   Accuracy   Good   Accuracy
DataSet 1      563   80.43%     557    79.57%         78    67.83%     2575   73.89%
DataSet 2      546   78.00%     546    78.00%         75    65.22%     2653   76.13%
DataSet 3      564   80.57%     560    80.00%         75    65.22%     2550   73.17%
DataSet 4      553   79.00%     553    79.00%         78    67.83%     2651   76.07%
DataSet 5      548   78.29%     540    77.14%         78    67.83%     2630   75.47%
DataSet 6      567   81.00%     561    80.14%         79    68.70%     2576   73.92%
DataSet 7      556   79.43%     548    78.29%         77    66.96%     2557   73.37%
DataSet 8      562   80.29%     552    78.86%         79    68.70%     2557   73.37%
DataSet 9      566   80.86%     557    79.57%         83    72.17%     2588   74.26%
DataSet 10     560   80.00%     554    79.14%         80    69.57%     2589   74.29%


majority vote (Zenobi & Cunningham, 2002). Weingessel, Dimitriadou, and Hornik (2003) have reviewed a series of ensemble-related publications (Dietterich, 2000; Lam, 2000; Parhami, 1994; Bauer & Kohavi, 1999; Kuncheva, 2000). Previous research has shown that an ensemble can help to increase classification accuracy and stability (Opitz & Maclin, 1999). A part of MCLP’s optimal solutions was selected to form ensembles. Each solution has one vote for each credit card record, and the final classification result is determined by the majority of votes. Algorithm 2 describes the ensemble process:

Algorithm 2

Input: The data set $A = \{A_1, A_2, A_3, \ldots, A_n\}$, boundary $b$, a certain number of solutions, $X^* = (x_1^*, x_2^*, x_3^*, \ldots, x_{64}^*)$
Output: The classification score $MCLP_i$ and the prediction $P_i$
Step 1: A committee of a certain odd number of classifiers $X^*$ is formed.
Step 2: The classification score $MCLP_i = A_i X^*$ of each observation is calculated and compared against the boundary $b$ by every member of the committee. The performance measures of the classification will be decided by majorities of the committee. If more than half of the committee members agree on the classification, then the prediction $P_i$ for this observation is successful; otherwise the prediction has failed.
Step 3: The accuracy for each group is computed as the percentage of successful classifications among all observations.
End.

The results of applying Algorithm 2 are summarized in Table 2 (Peng et al., 2004). The average predictive accuracies for bad and good groups in the training dataset are 80.8% and 80.6%, and the average predictive accuracies for bad and good groups in the testing dataset are 72.17% and 76.4%. Compared with previous results, the ensemble technique improves the classification accuracies. In particular, for bad records in the testing set, the average accuracy increased by 4.17 percentage points. Since bankruptcy accounts are the major cause of creditors’ loss, predictive accuracy for bad records is considered to be more important than for good records.

Experimental Results of MCQP

Based on the MCQP model and the research procedure described in previous sections, similar experiments were conducted to get MCQP results. LINGO 8.0 was used to compute the optimal solutions. The whole research procedure for MCQP is summarized in Algorithm 3:

Table 2. MCLP credit card accounts classification with ensemble

Ensemble       Training Set (700 Bad + 700 Good)      Testing Set (115 Bad + 3485 Good)
No. of Voters  Bad   Accuracy   Good   Accuracy       Bad   Accuracy   Good   Accuracy
9              563   80.43%     561    80.14%         81    70.43%     2605   74.75%
99             565   80.71%     563    80.43%         83    72.17%     2665   76.47%
199            565   80.71%     566    80.86%         83    72.17%     2656   76.21%
299            568   81.14%     564    80.57%         84    73.04%     2697   77.39%
399            567   81.00%     567    81.00%         84    73.04%     2689   77.16%

Algorithm 3

Input: The data set $A = \{A_1, A_2, A_3, \ldots, A_n\}$, boundary $b$
Output: The optimal solution, $X^* = (x_1^*, x_2^*, x_3^*, \ldots, x_{64}^*)$, the classification score $MCQP_i$
Step 1: Generate the Training set and Testing set from the credit card data set.
Step 2: Apply the two-group MCQP model to compute the compromise solution $X^* = (x_1^*, x_2^*, \ldots, x_{64}^*)$ as the best weights of all 64 variables with given values of control parameters $(b, \alpha^*, \beta^*)$ using LINGO 8.0 software.
Step 3: The classification score $MCQP_i = A_i X^*$ of each observation is calculated and compared against the boundary $b$ to check the performance measures of the classification.
Step 4: If the classification result of Step 3 is acceptable (i.e., the found performance measure is larger than or equal to the given threshold), go to the next step. Otherwise, choose different values of control parameters $(b, \alpha^*, \beta^*)$ and go to Step 1.
Step 5: Use $X^* = (x_1^*, x_2^*, \ldots, x_{64}^*)$ to calculate the MCQP scores for all $A_i$ in the test set and conduct the performance analysis. If it produces a satisfying classification result, go to the next step. Otherwise, go back to Step 1 to reformulate the Training Set and Testing Set.
Step 6: Repeat the whole process until a preset number of different $X^*$ are generated.
End.

A part (10 out of the total 38 results) of the results is summarized in Table 3. The average predictive accuracies for bad and good groups in the training dataset are 86.61% and 73.29%, and the average predictive accuracies for bad and good groups in the testing dataset are 81.22% and 68.25%. Compared with MCLP, MCQP has lower predictive accuracies for good records. Nevertheless, the bad group classification accuracy on the testing set increased from 68% with MCLP to 81.22% with MCQP, which is a remarkable improvement.

Improvement of MCQP with Ensemble Method

Similar to the MCLP experiment, the majority-vote ensemble discussed previously was applied

Table 3. MCQP credit card accounts classification

Cross          Training Set (700 Bad + 700 Good)      Testing Set (115 Bad + 3485 Good)
Validation     Bad   Accuracy   Good   Accuracy       Bad   Accuracy   Good   Accuracy
DataSet 1      602   86.00%     541    77.29%         96    83.48%     2383   68.38%
DataSet 2      614   87.71%     496    70.86%         93    80.87%     2473   70.96%
DataSet 3      604   86.29%     530    75.71%         95    82.61%     2388   68.52%
DataSet 4      616   88.00%     528    75.43%         95    82.61%     2408   69.10%
DataSet 5      604   86.29%     547    78.14%         90    78.26%     2427   69.64%
DataSet 6      614   87.71%     502    71.71%         94    81.74%     2328   66.80%
DataSet 7      610   87.14%     514    73.43%         95    82.61%     2380   68.29%
DataSet 8      582   83.14%     482    68.86%         93    80.87%     2354   67.55%
DataSet 9      614   87.71%     479    68.43%         90    78.26%     2295   65.85%
DataSet 10     603   86.14%     511    73.00%         93    80.87%     2348   67.37%


to MCQP to examine whether it can make an improvement. The results are presented in Table 4. The average predictive accuracies for bad and good groups in the training dataset are 89.18% and 74.68%, and the average predictive accuracies for bad and good groups in the testing dataset are 85.61% and 68.67%. Compared with the previous MCQP results, the majority-vote ensemble improves the total classification accuracies. In particular, for bad records in the testing set, the average accuracy increased by 4.39 percentage points.

Experimental Results of Fuzzy Linear Programming

Applying the fuzzy linear programming model discussed earlier in this chapter to the same credit card dataset, we obtained some FLP classification results. These results are compared with the decision tree, MCLP, and neural networks (see Tables 5 and 6). The decision tree software is the commercial version called C5.0 (C5.0 2004), while the software for both the neural network and MCLP was developed at the Data Mining Lab, University of Nebraska at Omaha, USA (Kou & Shi, 2002).

Note that in both Table 5 and Table 6, the columns Tg and Tb respectively represent the number of good and bad accounts identified by a method, while the rows of good and bad represent the actual numbers of the accounts.

Classifications on HIV-1 Mediated Neural Dendritic and Synaptic Damage Using MCLP

The ability to identify neuronal damage in the dendritic arbor during HIV-1-associated dementia (HAD) is crucial for designing specific therapies for the treatment of HAD. A two-class model of multiple criteria linear programming (MCLP) was proposed to classify such HIV-1 mediated neuronal dendritic and synaptic damages. Given certain classes, including treatments with brain-derived neurotrophic factor (BDNF), glutamate, gp120, or non-treatment controls from our in vitro experimental systems, we used the two-class MCLP model to determine the data patterns between classes in order to gain insight about neuronal dendritic and synaptic damages under different treatments (Zheng et al., 2004). This knowledge can be applied to the design and study of specific therapies for the prevention or reversal of neuronal damage associated with HAD.

Table 4. MCQP credit card accounts classification with ensemble

Ensemble Results Training Set (700 Bad data+700 Good data) Testing Set (115 Bad data+3485 Good data)
No. of Voters Bad Accuracy Good Accuracy Bad Accuracy Good Accuracy
3 612 87.43% 533 76.14% 98 85.22% 2406 69.04%
5 619 88.43% 525 75.00% 95 82.61% 2422 69.50%
7 620 88.57% 525 75.00% 97 84.35% 2412 69.21%
9 624 89.14% 524 74.86% 100 86.96% 2398 68.81%
11 625 89.29% 525 75.00% 99 86.09% 2389 68.55%
13 629 89.86% 517 73.86% 100 86.96% 2374 68.12%
15 629 89.86% 516 73.71% 98 85.22% 2372 68.06%
17 632 90.29% 520 74.29% 99 86.09% 2379 68.26%
19 628 89.71% 520 74.29% 100 86.96% 2387 68.49%
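
As a rough illustration of the majority-vote ensemble behind Table 4, the sketch below combines an odd number of voters and returns the label chosen by most of them. The linear voters, their weights and boundaries, and the rule that a score at or above the boundary means "bad" are all assumptions made for this example; they are not the chapter's trained MCQP classifiers.

```python
from collections import Counter


def majority_vote(votes):
    """Return the label chosen by most voters (an odd voter count avoids ties)."""
    return Counter(votes).most_common(1)[0][0]


def make_linear_voter(weights, boundary):
    """Build a hypothetical voter that scores a record as a weighted sum."""
    def voter(record):
        score = sum(w * x for w, x in zip(weights, record))
        # Assumption for illustration only: score >= boundary means "bad".
        return "bad" if score >= boundary else "good"
    return voter


def ensemble_predict(voters, record):
    """Collect one vote per voter and return the majority label."""
    return majority_vote([voter(record) for voter in voters])


# Three made-up voters, mirroring the smallest ensemble size in Table 4.
voters = [
    make_linear_voter([1.0, 0.5], 1.0),
    make_linear_voter([0.8, 0.7], 1.2),
    make_linear_voter([0.2, 0.1], 2.0),
]
prediction = ensemble_predict(voters, [1.0, 1.0])
print(prediction)
```

Table 4 uses only odd numbers of voters (3 to 19), which is why the sketch does not define a tie-breaking rule.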

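
The confusion matrices in Tables 5 and 6 can be read as per-class accuracies: correctly identified good accounts over all good accounts, and likewise for bad accounts. A minimal sketch using the Table 6 counts follows; the dictionary layout and the function name are illustrative choices, not part of the chapter.

```python
# Counts copied from Table 6. Each row is an actual class and holds
# (Tg, Tb): the number of records a method labeled good and bad.
TABLE_6 = {
    "Decision Tree":  {"good": (2180, 2005), "bad": (141, 674)},
    "Neural Network": {"good": (2814, 1371), "bad": (176, 639)},
    "MCLP":           {"good": (3160, 1025), "bad": (484, 331)},
    "FLP":            {"good": (2498, 1687), "bad": (113, 702)},
}


def per_class_accuracy(matrix):
    """Return (good accuracy, bad accuracy) for one method's confusion matrix."""
    tg_good, tb_good = matrix["good"]
    tg_bad, tb_bad = matrix["bad"]
    good_accuracy = tg_good / (tg_good + tb_good)
    bad_accuracy = tb_bad / (tg_bad + tb_bad)
    return good_accuracy, bad_accuracy


rates = {name: per_class_accuracy(matrix) for name, matrix in TABLE_6.items()}
for name, (good_accuracy, bad_accuracy) in rates.items():
    print(f"{name:15s} good={good_accuracy:.2%} bad={bad_accuracy:.2%}")
```

On these counts, FLP identifies the largest share of bad accounts (702 of 815), while MCLP identifies the largest share of good accounts (3,160 of 4,185).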

Database 1,000 M). At low concentrations, gluta-

mate acts as a neurotransmitter in the brain.
The data produced by laboratory experimentation However, at high concentrations, it has been
and image analysis was organized into a database shown to be a neurotoxin by over-stimulat-
composed of four classes (G1-G4), each of which ing NMDA receptors. This factor has been
has nine attributes. The four classes are defined shown to be upregulated in HIV-1-infected
as the following: macrophages (Jiang et al., 2001) and thereby
linked to neuronal damage by HIV-1 infected
• G1: Treatment with the neurotrophin BDNF macrophages.
(brain-derived neurotrophic factor, 0.5 • G4: Treatment with gp120 (1 nanoM), an
ng/ml, 5 ng/ml, 10 ng/mL, and 50 ng/ml), HIV-1 envelope protein. This protein could
this factor promotes neuronal cell survival interact with receptors on neurons and inter-
and has been shown to enrich neuronal cell fere with cell signaling leading to neuronal
cultures (Lopez et al., 2001; Shibata et al., damage, or it could also indirectly induce
2003). neuronal injury through the production of
• G2: Non-treatment, neuronal cells are kept other neurotoxins (Hesselgesser et al., 1998;
in their normal media used for culturing Kaul, Garden, & Lipton, 2001; Zheng et al.,
(Neurobasal media with B27, which is a neu- 1999).
ronal cell culture maintenance supplement
from Gibco, with glutamine and penicillin- The nine attributes are defined as:
streptomycin).
• G3: Treatment with glutamate (10, 100, and • x1 = The number of neurites

Table 5. Learning comparisons on balanced 280 Table 6. Comparisons on prediction of 5,000

records records

Decision Tree Tg Tb Total Decision Tree Tg Tb Total

Good 138 2 140 Good 2180 2005 4185
Bad 13 127 140 Bad 141 674 815
Total 151 129 280 Total 2321 2679 5000
Neural Network Tg Tb Total Neural Network Tg Tb Total
Good 116 24 140 Good 2814 1371 4185
Bad 14 126 140 Bad 176 639 815
Total 130 150 280 Total 2990 2010 5000
MCLP Tg Tb Total MCLP Tg Tb Total
Good 134 6 140 Good 3160 1025 4185
Bad 7 133 140 Bad 484 331 815
Total 141 139 280 Total 3644 1356 5000
FLP Tg Tb Total FLP Tg Tb Total
Good 127 13 140 Good 2498 1687 4185
Bad 13 127 140 Bad 113 702 815
Total 140 140 280 Total 2611 2389 5000


• x2 = The number of arbors
• x3 = The number of branch nodes
• x4 = The average length of arbors
• x5 = The ratio of neurite to arbor
• x6 = The area of cell bodies
• x7 = The maximum length of the arbors
• x8 = The culture time (during this time, the neuron grows normally and BDNF, glutamate, or gp120 have not been added to affect growth)
• x9 = The treatment time (during this time, the neuron was growing under the effects of BDNF, glutamate, or gp120)

The database used in this chapter contained 2,112 observations. Among them, 101 are on G1, 1,001 are on G2, 229 are on G3, and 781 are on G4.
Compared with the traditional mathematical tools in classification, such as neural networks, decision trees, and statistics, the two-class MCLP approach is simple and direct, free of statistical assumptions, and flexible in that it allows decision makers to play an active part in the analysis (Shi, 2001).

Results of Empirical Study Using MCLP

By using the two-class model for the classifications on {G1, G2, G3, and G4}, there are six possible pairings: G1 vs. G2; G1 vs. G3; G1 vs. G4; G2 vs. G3; G2 vs. G4; and G3 vs. G4. The combinations G1 vs. G3 and G1 vs. G4 are treated as redundant and are therefore not considered in the pairing groups. G1 through G3 or G4 is a continuum: G1 represents an enrichment of neuronal cultures, G2 is basal or maintenance of neuronal culture, and G3/G4 both represent damage of neuronal cultures. There would never be a jump from G1 to G3/G4 without traveling through G2. So, we used the following four two-class pairs: G1 vs. G2; G2 vs. G3; G2 vs. G4; and G3 vs. G4. The meanings of these two-class pairs are:

• G1 vs. G2 shows that BDNF should enrich the neuronal cell cultures and increase neuronal network complexity, that is, more dendrites and arbors, more length to dendrites, and so forth.
• G2 vs. G3 indicates that glutamate should damage neurons and lead to a decrease in dendrite and arbor number as well as dendrite length.
• G2 vs. G4 should show that gp120 causes neuronal damage leading to a decrease in dendrite and arbor number and dendrite length.
• G3 vs. G4 provides information on the possible difference between glutamate toxicity and gp120-induced neurotoxicity.

Given a threshold for the training process, which can be any performance measure, we have carried out the following steps:

Algorithm 4

Step 1: For each class pair, we used the Linux code of the two-class model to compute the compromise solution X* = (x1*,..., x9*) as the best weights of all nine neuronal variables with given values of control parameters (b, α*, β*).
Step 2: The classification score MCLPi = Ai X* of each observation is calculated and compared against the boundary b to check the performance measures of the classification.
Step 3: If the classification result of Step 2 is acceptable (i.e., the given performance measure is greater than or equal to the given threshold), go to Step 4. Otherwise, choose different values of control parameters (b, α*, β*) and go to Step 1.


Step 4: For each class pair, use X* = (x1*,..., x9*) to calculate the MCLP scores for all Ai in the test set and conduct the performance analysis.

According to the nature of this research, we define the following terms, which have been widely used in performance analysis:

TP (True Positive) = the number of records in the first class that have been classified correctly
FP (False Positive) = the number of records in the second class that have been classified into the first class
TN (True Negative) = the number of records in the second class that have been classified correctly
FN (False Negative) = the number of records in the first class that have been classified into the second class

Then we have four different performance measures:

\[ \text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Positive Predictivity} = \frac{TP}{TP + FP}, \]
\[ \text{False-Positive Rate} = \frac{FP}{TN + FP}, \qquad \text{Negative Predictivity} = \frac{TN}{FN + TN}. \]

The “positive” represents the first-class label while the “negative” represents the second-class label in the same class pair. For example, in the class pair {G1 vs. G2}, the record of G1 is “positive” while that of G2 is “negative.” Among the above four measures, more attention is paid to sensitivity and the false-positive rate because both measure the correctness of classification in class-pair data analyses. Note that in a given class pair, the sensitivity represents the correct rate of the first class, and one minus the false-positive rate is the correct rate of the second class by the above measure definitions.
Considering the limited data availability in this pilot study, we set the across-the-board threshold of 55% for sensitivity [or 55% for (1 - false-positive rate)] to select the experimental results from training and test processes. All 20 of the training and test sets, over the four class pairs, have been computed using the above procedure. The results against the threshold are summarized in Tables 7 to 10. As seen in these tables, the sensitivities for the comparison of all four pairs are at or above 55%, indicating that good separation among individual pairs is observed with this method.

Table 7. Classification results with G1 vs. G2

Training  N1        N2        Sensitivity  Positive Predictivity  False Positive Rate  Negative Predictivity
G1        55 (TP)   34 (FN)   61.80%       61.80%                 38.20%               61.80%
G2        34 (FP)   55 (TN)
Test      N1        N2        Sensitivity  Positive Predictivity  False Positive Rate  Negative Predictivity
G1        11 (TP)   9 (FN)    55.00%       3.78%                  30.70%               98.60%
G2        280 (FP)  632 (TN)

Table 8. Classification results with G2 vs. G3

Training  N2        N3        Sensitivity  Positive Predictivity  False Positive Rate  Negative Predictivity
G2        126 (TP)  57 (FN)   68.85%       68.48%                 31.69%               68.68%
G3        58 (FP)   125 (TN)
Test      N2        N3        Sensitivity  Positive Predictivity  False Positive Rate  Negative Predictivity
G2        594 (TP)  224 (FN)  72.62%       99.32%                 8.70%                15.79%
G3        4 (FP)    42 (TN)

Table 9. Classification results with G2 vs. G4

Training  N2        N4        Sensitivity  Positive Predictivity  False Positive Rate  Negative Predictivity
G2        419 (TP)  206 (FN)  67.04%       65.88%                 34.72%               66.45%
G4        217 (FP)  408 (TN)
Test      N2        N4        Sensitivity  Positive Predictivity  False Positive Rate  Negative Predictivity
G2        216 (TP)  160 (FN)  57.45%       80.90%                 32.90%               39.39%
G4        51 (FP)   104 (TN)

Table 10. Classification results with G3 vs. G4

Training  N3        N4        Sensitivity  Positive Predictivity  False Positive Rate  Negative Predictivity
G3        120 (TP)  40 (FN)   75.00%       75.47%                 24.38%               75.16%
G4        39 (FP)   121 (TN)
Test      N3        N4        Sensitivity  Positive Predictivity  False Positive Rate  Negative Predictivity
G3        50 (TP)   19 (FN)   72.46%       16.78%                 40.00%               95.14%
G4        248 (FP)  372 (TN)

The results are then analyzed in terms of both positive predictivity and negative predictivity for the prediction power of the MCLP method on neuron injuries. In Table 7, G1 is the number of observations predefined as BDNF treatment, G2 is the number of observations predefined as non-treatment, N1 is the number of observations classified as BDNF treatment, and N2 is the number of observations classified as non-treatment. The meanings of other pairs in Tables 8 to 10 can be similarly explained. In Table 7 for {G1 vs. G2}, both positive predictivity and negative predictivity are the same (61.80%) in the training set. However, the negative predictivity of the test set (98.60%) is much higher than the positive predictivity (3.78%). The prediction of G1 in the training set is better than that of the test set, while the prediction of G2 in test outperforms that of training. This is due to the small size of G1. In Table 8 for {G2 vs. G3}, the positive predictivity (68.48%) is almost equal to the negative predictivity (68.68%) of the training set. The positive predictivity (99.32%) is much higher than the negative predictivity (15.79%) of the test set. As a result, the prediction of G2 in


the test set is better than in the training set, but the prediction of G3 in the training set is better than in the test set.
The case of Table 9 for {G2 vs. G4} is similar to that of Table 8 for {G2 vs. G3}. We see that the separation of G2 in test (80.90%) is better than in training (65.88%), while the separation of G4 in training (66.45%) is better than in test (39.39%). In the case of Table 10 for {G3 vs. G4}, the positive predictivity of the training set (120/159 = 75.47%) is slightly higher than the negative predictivity (75.16%). Then, the positive predictivity (16.78%) is much lower than the negative predictivity (95.14%) of the test set. The prediction of G3 in training (75.47%) is better than that of test (16.78%), and the prediction of G4 in test (95.14%) is better than that of training (75.16%).
In summary, we observed that the predictions of G2 in test for {G1 vs. G2}, {G2 vs. G3}, and {G2 vs. G4} are always better than those in training. The prediction of G3 in training for {G2 vs. G3} and {G3 vs. G4} is better than that in test. Finally, the prediction of G4 for {G2 vs. G4} in training reverses that of {G3 vs. G4} in test. If we emphasize the test results, these results are favorable to G2. This may be due to the size of G2 (non-treatment), which is larger than all other classes. The classification results can change if the sizes of G1, G3, and G4 increase significantly.

Network Intrusion Detection

Network intrusions are malicious activities that aim to misuse network resources. Although various approaches have been applied to network intrusion detection, such as statistical analysis, sequence analysis, neural networks, machine learning, and artificial immune systems, this field is far from maturity, and new solutions are worthy of investigation. Since intrusion detection can be treated as a classification problem, it is feasible to apply a multiple-criterion classification model to this type of application. The objective of this experiment is to examine the applicability of MCLP and MCQP models in intrusion detection.

KDD Dataset

The KDD-99 dataset provided by DARPA was used in our intrusion detection test. The KDD-99 dataset includes a wide variety of intrusions simulated in a military network environment. It was used in the 1999 KDD-CUP intrusion detection contest. After the contest, KDD-99 became a de facto standard dataset for intrusion detection experiments. Within the KDD-99 dataset, each connection has 38 numerical variables and is labeled as normal or attack. There are four main categories of attacks: denial-of-service (DOS), unauthorized access from a remote machine (R2L), unauthorized access to local root privileges (U2R), and surveillance and other probing. The training dataset contains a total of 24 attack types, while the testing dataset contains an additional 14 types (Stolfo, Fan, Lee, Prodromidis, & Chan, 2000). Because the number of attacks for R2L, U2R, and probing is relatively small, this experiment focused on DOS.

Experimental Results of MCLP

Following the heuristic process described in this chapter, training and testing datasets were selected as follows. First, the ‘normal’ dataset (812,813 records) was divided into 100 intervals (each interval has 8,128 records). Within each interval, 20 records were randomly selected. Second, the ‘DOS’ dataset (247,267 records) was divided into 100 intervals (each interval has 2,472 records). Within each interval, 20 records were randomly selected. Third, the 2,000 normal and 2,000 DOS records were combined to form a training dataset. Because KDD-99 has over 1 million records, and 4,000 training records represent less than 0.4% of it, the whole KDD-99 dataset is used for testing. Various training and testing datasets can be


obtained by repeating this process. Considering the previous high detection rates of KDD-99 by other methods, the across-the-board threshold of 95% was set for both normal and DOS. Since training dataset classification accuracies are all 100%, only testing dataset results (10 out of the total 300 results) are summarized in Table 11 (Kou et al., 2004a). The average predictive accuracies for normal and DOS groups in the testing dataset are 98.94% and 99.56%.

Improvement of MCLP with Ensemble Method

The majority-vote ensemble method demonstrated its superior performance in credit card accounts classification. Can it improve the classification accuracy of network intrusion detection? To answer this question, the majority-vote ensemble was applied to the KDD-99 dataset. Ensemble results are summarized in Table 12 (Kou et al., 2004a). The average predictive accuracies for normal and DOS groups in the testing dataset are 99.61% and 99.78%. Both normal and DOS predictive accuracies have been slightly improved.

Table 11. MCLP KDD-99 classification results

Cross Validation Testing Set (812813 Normal + 247267 Dos)
Normal Accuracy DOS Accuracy
DataSet 1 804513 98.98% 246254 99.59%
DataSet 2 808016 99.41% 246339 99.62%
DataSet 3 802140 98.69% 245511 99.29%
DataSet 4 805151 99.06% 246058 99.51%
DataSet 5 805308 99.08% 246174 99.56%
DataSet 6 799135 98.32% 246769 99.80%
DataSet 7 805639 99.12% 246070 99.52%
DataSet 8 802938 98.79% 246566 99.72%
DataSet 9 805983 99.16% 245498 99.28%
DataSet 10 802765 98.76% 246641 99.75%
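
The training sets behind Table 11 come from an interval-based sampling step: each class is cut into 100 equal intervals and 20 records are drawn at random from each. The sketch below illustrates that step under some assumptions: records are represented by integer indices, sampling within an interval is without replacement, and the function name and seeds are hypothetical.

```python
import random


def interval_sample(records, n_intervals=100, per_interval=20, seed=0):
    """Draw per_interval random records from each of n_intervals equal slices."""
    rng = random.Random(seed)
    size = len(records) // n_intervals  # 812,813 // 100 = 8,128 for 'normal'
    sample = []
    for k in range(n_intervals):
        interval = records[k * size:(k + 1) * size]
        # Assumption: sampling inside an interval is without replacement.
        sample.extend(rng.sample(interval, per_interval))
    return sample  # records beyond n_intervals * size are never drawn


# Stand-ins for the two KDD-99 classes: indices instead of real connections.
normal_ids = range(812_813)
dos_ids = range(247_267)

training = [(i, "normal") for i in interval_sample(normal_ids, seed=0)]
training += [(i, "dos") for i in interval_sample(dos_ids, seed=1)]
print(len(training))
```

Changing the seeds yields a different 4,000-record training set each time, which matches the remark that various training and testing datasets can be obtained by repeating the process.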

Table 12. MCLP KDD-99 classification results with ensemble

Number of Voters Normal Accuracy DOS Accuracy
3 809567 99.60% 246433 99.66%
5 809197 99.56% 246640 99.75%
7 809284 99.57% 246690 99.77%
9 809287 99.57% 246737 99.79%
11 809412 99.58% 246744 99.79%
13 809863 99.64% 246794 99.81%
15 809994 99.65% 246760 99.79%
17 810089 99.66% 246821 99.82%
19 810263 99.69% 246846 99.83%
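
The averages quoted for Tables 11 and 12 can be reproduced from the counts in the tables. The following sketch is only an arithmetic check; the variable and function names are illustrative, and the counts are copied from the two tables.

```python
N_NORMAL, N_DOS = 812_813, 247_267  # testing-set sizes from the table headers

# (correctly classified normal, correctly classified DOS) per row.
TABLE_11 = [
    (804513, 246254), (808016, 246339), (802140, 245511), (805151, 246058),
    (805308, 246174), (799135, 246769), (805639, 246070), (802938, 246566),
    (805983, 245498), (802765, 246641),
]
TABLE_12 = [
    (809567, 246433), (809197, 246640), (809284, 246690), (809287, 246737),
    (809412, 246744), (809863, 246794), (809994, 246760), (810089, 246821),
    (810263, 246846),
]


def mean_accuracy(rows):
    """Return the average normal and DOS accuracies, in percent."""
    normal = sum(n for n, _ in rows) / (len(rows) * N_NORMAL)
    dos = sum(d for _, d in rows) / (len(rows) * N_DOS)
    return 100 * normal, 100 * dos


single_normal, single_dos = mean_accuracy(TABLE_11)
ensemble_normal, ensemble_dos = mean_accuracy(TABLE_12)
print(f"single MCLP: normal={single_normal:.2f}% DOS={single_dos:.2f}%")
print(f"ensemble:    normal={ensemble_normal:.2f}% DOS={ensemble_dos:.2f}%")
```

Both ensemble averages come out higher than the single-classifier averages, consistent with the slight improvement reported in the text.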


Experimental Results of MCQP

An MCQP procedure similar to the one used in credit card accounts classification was used to classify the KDD-99 dataset. A part of the results is summarized in Table 13 (Kou et al., 2004b). These results are slightly better than those of MCLP.

Table 13. MCQP KDD-99 classification results

Cross Validation  Testing Set (812813 Normal + 247267 DOS)
                  Normal   Accuracy   DOS      Accuracy
DataSet 1         808142   99.43%     245998   99.49%
DataSet 2         810689   99.74%     246902   99.85%
DataSet 3         807597   99.36%     246491   99.69%
DataSet 4         808410   99.46%     246256   99.59%
DataSet 5         810283   99.69%     246090   99.52%
DataSet 6         809272   99.56%     246580   99.72%
DataSet 7         806116   99.18%     246229   99.58%
DataSet 8         808143   99.43%     245998   99.49%
DataSet 9         811806   99.88%     246433   99.66%
DataSet 10        810307   99.69%     246702   99.77%

Improvement of MCQP with Ensemble Method

The majority-vote ensemble was used on MCQP results, and a part of the outputs is summarized in Table 14 (Kou et al., 2004b).

Table 14. MCQP KDD-99 classification results with ensemble

No. of Voters     Normal   Accuracy   DOS      Accuracy
3                 810126   99.67%     246792   99.81%
5                 811419   99.83%     246930   99.86%
7                 811395   99.83%     246830   99.82%
9                 811486   99.84%     246795   99.81%
11                812030   99.90%     246845   99.83%
13                812006   99.90%     246788   99.81%
15                812089   99.91%     246812   99.82%
17                812045   99.91%     246821   99.82%
19                812069   99.91%     246817   99.82%
21                812010   99.90%     246831   99.82%
23                812149   99.92%     246821   99.82%
25                812018   99.90%     246822   99.82%

The average predictive accuracies for normal and DOS groups in the testing dataset are 99.86% and 99.82%. Although the increase in classification accuracy is small,


both normal and DOS predictive accuracies have been improved compared with the previous 99.54% and 99.64%.

Research Challenges and Opportunities

Although the above multiple criteria optimization data mining methods have been applied in real-life applications, there are a number of challenging problems in mathematical modeling. While some of the problems are currently under investigation, others remain to be explored.

Variations and Algorithms of Generalized Models

Given Model 1, if p=2, q=1, it becomes a convex quadratic program, which can be solved by using a known convex quadratic programming algorithm. However, when p=1, q=2, Model 1 is a concave quadratic program; and when p=2, q=2, we have Model 3 (MCQP), which is an indefinite quadratic problem. Since both concave quadratic programming and MCQP are NP-hard problems, it is very difficult to find a global optimal solution. We are working on both cases to develop direct algorithms that can converge to local optima in classification (Zhang, Shi, & Zhang, 2005).

Kernel Functions for Data Observations

The generalized model in the chapter has a natural connection with known support vector machines (SVM) (Mangasarian, 2000; Vapnik, 2000) since they both belong to the category of optimization-based data mining methods. However, they differ in the way they identify the classifiers. As we mentioned before, while the multiple criteria optimization approaches in this chapter use the overlapping and interior distances as two standards to measure the separation of each observation in the dataset, SVM selects a minority of observations (support vectors) to represent the majority of the rest of the observations. Therefore, in experimental studies and real applications, SVM may have a high accuracy in the training set, but a lower accuracy in the testing result. Nevertheless, the use of kernel functions in SVM has shown its efficiency in handling nonlinear datasets. How to adopt kernel functions into the multiple criteria optimization approaches can be an interesting research problem. Kou, Peng, Shi, and Chen (2006) explored some possibilities in this research direction. The basic idea is outlined below.
First, we can rewrite the generalized model (Model 1) similar to the approach of SVM. Suppose the two classes G1 and G2 are under consideration. Then, an n×n diagonal matrix Y, which only contains +1 or -1, indicates the class membership. A -1 in row i of matrix Y indicates the corresponding record Ai ∈ G1, and a +1 in row i of matrix Y indicates the corresponding record Ai ∈ G2. The constraints in Model 1, Ai X = b + αi - βi, ∀ Ai ∈ G1 and Ai X = b - αi + βi, ∀ Ai ∈ G2, are converted as: Y(<A⋅X> - eb) = α - β, where e = (1,1,…,1)T, α = (α1,...,αn)T, and β = (β1,...,βn)T. In order to maximize the distance $\frac{2}{\|X\|_2}$ between the two adjusted bounding hyperplanes, the function $\frac{1}{2}\|X\|_2^2$ should also be minimized. Let s = 2, q = 1, and p = 1; then a simple quadratic programming (SQP) variation of Model 1 can be built as:

Model 5: SQP

\[ \text{Minimize } \frac{1}{2}\|X\|_2^2 + w_\alpha \sum_{i=1}^{n} \alpha_i - w_\beta \sum_{i=1}^{n} \beta_i \]

Subject to Y(<A⋅X> - eb) = α - β, where e = (1,1,…,1)T, α = (α1,...,αn)T and β = (β1,...,βn)T ≥ 0.

Using the Lagrange function to represent Model 5, one can get an equivalent of the Wolfe dual problem of Model 5 expressed as:


Model 6: Dual of SQP

\[ \text{Maximize } -\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \lambda_i y_i \lambda_j y_j (A_i \cdot A_j) + \sum_{i=1}^{n} \lambda_i \]

\[ \text{Subject to } \sum_{i=1}^{n} \lambda_i y_i = 0, \qquad w_\beta \le \lambda_i \le w_\alpha, \]

where w_β < w_α are given, 1 ≤ i ≤ n, λ_i denotes the dual variables (Lagrange multipliers), and y_i is the i-th diagonal entry of Y.
The global optimal solution of the primal problem of Model 5 can be obtained from the solution of the Wolfe dual problem:

\[ X^* = \sum_{i=1}^{n} \lambda_i^* y_i A_i, \qquad b^* = y_j - \sum_{i=1}^{n} y_i \lambda_i^* (A_i \cdot A_j). \]

As a result, the classification decision function becomes:

\[ \operatorname{sgn}\big((X^* \cdot B) - b^*\big) \begin{cases} > 0, & B \in G_1, \\ \le 0, & B \in G_2. \end{cases} \]

We observe that because the form (Ai⋅Aj) in Model 6 is an inner product in the vector space, it can be substituted by a positive semi-definite kernel K(Ai, Aj) without affecting the mathematical modeling process. In general, a kernel function refers to a real-valued function on χ×χ, defined for all Ai, Aj ∈ χ. Thus, Model 6 can be easily transformed into a nonlinear model by replacing (Ai⋅Aj) with some positive semi-definite kernel function K(Ai, Aj). The use of kernel functions in multiple criteria optimization approaches can extend their applicability to linearly inseparable datasets. However, there are some theoretical difficulties in directly introducing a kernel function into Model 5. How to overcome them deserves careful study. Future studies may establish a theoretical guideline for selecting a kernel that is optimal in achieving a satisfactory credit analysis result. Another open problem is reducing computational cost and improving algorithm efficiency for high-dimensional or massive datasets.

Choquet Integrals and Non-Additive Set Function

Considering the r-dimensional attribute vector a = (a1,...,ar) in the classification problem, let P(a) denote the power set of a. We use f(a1),..., f(ar) to denote the values of each attribute in an observation. The procedure of calculating a Choquet integral can be given as (Wang & Wang, 1997):

\[ \int f \, d\mu = \sum_{j=1}^{r} \big[ f(a'_j) - f(a'_{j-1}) \big] \times \mu(\{a'_j, a'_{j+1}, \ldots, a'_r\}), \]

where {a'_1, a'_2,..., a'_r} is a permutation of a = (a1,...,ar) such that f(a'_0) = 0 and f(a'_1),..., f(a'_r) are in non-decreasing order: f(a'_1) ≤...≤ f(a'_r). The non-additive set function is defined as: µ:P(a)→(-∞,+∞), where µ(∅) = 0. We use µi to denote the values of the set function µ, where i = 1,...,2^r.

Introducing the Choquet measure into the generalized model refers to the utilization of the Choquet integral as a representative of the left-hand side of the constraints in Model 1. This variation for the non-additive data mining problem is (Yan, Wang, Shi, & Chen, 2005):

Model 7: Choquet Form

Minimize f(α) and Maximize g(β)

Subject to:

\[ \int f \, d\mu - \alpha_i + \beta_i - b = 0, \quad \forall A_i \in G_1, \]
\[ \int f \, d\mu + \alpha_i - \beta_i - b = 0, \quad \forall A_i \in G_2, \]

where ∫ f dµ denotes the Choquet integral with respect to a signed fuzzy measure to aggregate the attributes of an observation f, b is unrestricted, and α = (α1,...,αn)T, β = (β1,...,βn)T; αi, βi ≥ 0, i = 1,…, n.
Model 7 results in the replacement of a linear combination of all the attributes Ai X in the left-hand side of the constraints with the Choquet integral representation ∫ f dµ. The number of parameters,


denoted by µi, increases from r to 2^r (r is the number of attributes). Determining the parameters through a linear programming framework is not easy. We are still working on this problem and shall report the significant results.

Conclusion

As Usama Fayyad pointed out at the KDD-03 Panel, data mining must attract the participation of the relevant communities to avoid re-inventing wheels and bring the field an auspicious future (Fayyad, Piatetsky-Shapiro, & Uthurusamy, 2003). One relevant field from which data mining has not attracted enough participation is optimization. This chapter summarizes a series of research activities that apply multiple criteria decision-making methods to classification problems in data mining. Specifically, this chapter describes a variation of multiple criteria optimization-based models and applies these models to credit card scoring management, HIV-1 associated dementia (HAD) neuronal damage and dropout, and network intrusion detection as well as the potential in various real-life problems.

Acknowledgment

Since 1998, this research has been partially supported by a number of grants, including First Data Corporation, USA; DUE-9796243, the National Science Foundation of USA; U.S. Air Force Research Laboratory (PR No. E-3-1162); National Excellent Youth Fund #70028101, Key Project #70531040, #70472074, National Natural Science Foundation of China; 973 Project #2004CB720103, Ministry of Science and Technology, China; K.C. Wong Education Foundation (2001, 2003), Chinese Academy of Sciences; and BHP Billiton Co., Australia.

References

Bradley, P.S., Fayyad, U.M., & Mangasarian, O.L. (1999). Mathematical programming for data mining: Formulations and challenges. INFORMS Journal on Computing, 11, 217-238.
Bauer, E., & Kohavi, R. (1999). An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36, 105-139.
C 5.0. (2004). Retrieved from http://www.rulequest.com/see5-info.html
Charnes, A., & Cooper, W.W. (1961). Management models and industrial applications of linear programming (vols. 1 & 2). New York: John Wiley & Sons.
Dietterich, T. (2000). Ensemble methods in machine learning. In Kittler & Roli (Eds.), Multiple classifier systems (pp. 1-15). Berlin: Springer-Verlag (Lecture Notes in Pattern Recognition 1857).
Fayyad, U.M., Piatetsky-Shapiro, G., & Uthurusamy, R. (2003). Summary from the KDD-03 Panel: Data mining: The next 10 years. ACM SIGKDD Explorations Newsletter, 5(2), 191-196.
Fisher, R.A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179-188.
Freed, N., & Glover, F. (1981). Simple but powerful goal programming models for discriminant problems. European Journal of Operational Research, 7, 44-60.
Freed, N., & Glover, F. (1986). Evaluating alternative linear programming models to solve the two-group discriminant problem. Decision Science, 17, 151-162.
Han, J.W., & Kamber, M. (2000). Data mining: Concepts and techniques. San Diego: Academic Press.


He, J., Liu, X., Shi, Y., Xu, W., & Yan, N. (2004). Classifications of credit cardholder behavior by using fuzzy linear programming. International Journal of Information Technology and Decision Making, 3, 633-650.
Hesselgesser, J., Taub, D., Baskar, P., Greenberg, M., Hoxie, J., Kolson, D.L., & Horuk, R. (1998). Neuronal apoptosis induced by HIV-1 gp120 and the Chemokine SDF-1alpha mediated by the Chemokine receptor CXCR4. Curr Biol, 8, 595-598.
Kaul, M., Garden, G.A., & Lipton, S.A. (2001). Pathways to neuronal injury and apoptosis in HIV-associated dementia. Nature, 410, 988-994.
Kou, G., & Shi, Y. (2002). Linux-based Multiple Linear Programming Classification Program: (Version 1.0.) College of Information Science and Technology, University of Nebraska-Omaha, USA.
Kou, G., Liu, X., Peng, Y., Shi, Y., Wise, M., & Xu, W. (2003). Multiple criteria linear programming approach to data mining: Models, algorithm designs and software development. Optimization Methods and Software, 18, 453-473.
Kou, G., Peng, Y., Yan, N., Shi, Y., Chen, Z., Zhu, Q., Huff, J., & McCartney, S. (2004a, July 19-21). Network intrusion detection by using multiple-criteria linear programming. In Proceedings of the International Conference on Service Systems and Service Management, Beijing, China.
Kou, G., Peng, Y., Chen, Z., Shi, Y., & Chen, X. (2004b, July 12-14). A multiple-criteria quadratic programming approach to network intrusion detection. In Proceedings of the Chinese Academy of Sciences Symposium on Data Mining and Knowledge Management, Beijing, China.
Kou, G., Peng, Y., Shi, Y., & Chen, Z. (2006). A new multi-criteria convex quadratic programming model for credit data analysis. Working Paper, University of Nebraska at Omaha, USA.
Kuncheva, L.I. (2000). Clustering-and-selection model for classifier combination. In Proceedings of the 4th International Conference on Knowledge-Based Intelligent Engineering Systems and Allied Technologies (KES’2000).
Kwak, W., Shi, Y., Eldridge, S., & Kou, G. (2006). Bankruptcy prediction for Japanese firms: Using multiple criteria linear programming data mining approach. In Proceedings of the International Journal of Data Mining and Business Intelligence.
Jiang, Z., Piggee, C., Heyes, M.P., Murphy, C., Quearry, B., Bauer, M., Zheng, J., Gendelman, H.E., & Markey, S.P. (2001). Glutamate is a mediator of neurotoxicity in secretions of activated HIV-1-infected macrophages. Journal of Neuroimmunology, 117, 97-107.
Lam, L. (2000). Classifier combinations: Implementations and theoretical issues. In Kittler & Roli (Eds.), Multiple classifier systems (pp. 78-86). Berlin: Springer-Verlag (Lecture Notes in Pattern Recognition 1857).
Lee, S.M. (1972). Goal programming for decision analysis. Auerbach.
Lindsay, P.H., & Norman, D.A. (1972). Human information processing: An introduction to psychology. New York: Academic Press.
LINDO Systems Inc. (2003). An overview of LINGO 8.0. Retrieved from http://www.lindo.com/cgi/frameset.cgi?leftlingo.html;lingof.html
Lopez, A., Bauer, M.A., Erichsen, D.A., Peng, H., Gendelman, L., Shibata, A., Gendelman, H.E., & Zheng, J. (2001). The regulation of neurotrophic factor activities following HIV-1 infection and immune activation of mononuclear phagocytes. In Proceedings of Soc. Neurosci. Abs., San Diego, CA.
Mangasarian, O.L. (2000). Generalized support vector machines. In A. Smola, P. Bartlett, B. Scholkopf, & D. Schuurmans (Eds.), Advances in

Introduction to Data Mining Techniques via Multiple Criteria Optimization Approaches and Applications

large margin classifiers (pp. 135-146). Cambridge, quadratic programming approach. International
MA: MIT Press. Journal of Information Technology and Decision
Making, 4, 581-600.
Olson, D., & Shi, Y. (2005). Introduction to
business data mining. New York: McGraw-Hill/ Shibata, A., Zelivyanskaya, M., Limoges, J., Carl-
Irwin. son, K.A., Gorantla, S., Branecki, C., Bishu, S.,
Xiong, H., & Gendelman, H.E. (2003). Peripheral
Opitz, D., & Maclin, R. (1999). Popular ensemble
nerve induces macrophage neurotrophic activities:
methods: An empirical study. Journal of Artificial
Regulation of neuronal process outgrowth, intra-
Intelligence Research, 11, 169-198.
cellular signaling and synaptic function. Journal
Parhami, B. (1994). Voting algorithms. IEEE of Neuroimmunology, 142, 112-129.
Transactions on Reliability, 43, 617-629.
Stolfo, S.J., Fan, W., Lee, W., Prodromidis, A.,
Peng, Y., Kou, G., Chen, Z., & Shi, Y. (2004). & Chan, P.K. (2000). Cost-based modeling and
Cross-validation and ensemble analyses on mul- evaluation for data mining with application to
tiple-criteria linear programming classification fraud and intrusion detection: Results from the
for credit cardholder behavior. In Proceedings of JAM project. In Proceedings of the DARPA In-
ICCS 2004 (pp. 931-939). Berlin: Springer-Verlage formation Survivability Conference.
(LNCS 2416).
Vapnik, V.N. (2000). The nature of statistical
Shi, Y., & Yu, P.L. (1989). Goal setting and learning theory (2nd ed.). New York: Springer.
compromise solutions. In B. Karpak & S. Zionts
Wang, J., & Wang, Z. (1997). Using neural net-
(Eds.), Multiple criteria decision making and risk
work to determine Sugeno measures by statistics.
analysis using microcomputers (pp. 165-204).
Neural Networks, 10, 183-195.
Berlin: Springer-Verlag.
Weingessel, A., Dimitriadou, E., & Hornik, K.
Shi, Y. (2001). Multiple criteria and multiple
(2003, March 20-22). An ensemble method for
constraint levels linear programming: Con-
clustering. In Proceedings of the 3rd International
cepts, techniques and applications. NJ: World
Workshop on Distributed Statistical Computing,
Scientific.
Vienna, Austria.
Shi, Y., Wise, W., Luo, M., & Lin, Y. (2001).
Yan, N., Wang, Z., Shi, Y., & Chen, Z. (2005).
Multiple criteria decision making in credit card
Classification by linear programming with signed
portfolio management. In M. Koksalan & S.
fuzzy measures. Working Paper, University of
Zionts (Eds.), Multiple criteria decision mak-
Nebraska at Omaha, USA.
ing in new millennium (pp. 427-436). Berlin:
Springer-Verlag. Yu, P.L. (1985). Multiple criteria decision mak-
ing: Concepts, techniques and extensions. New
Shi, Y, Peng, Y., Xu, W., & Tang, X. (2002). Data
York: Plenum Press.
mining via multiple criteria linear programming:
Applications in credit card portfolio management. Zenobi, G., & Cunningham, P. (2002). An ap-
International Journal of Information Technology proach to aggregating ensembles of lazy learn-
and Decision Making, 1, 131-151. ers that supports explanation. Lecture Notes in
Computer Science, 2416, 436-447.
Shi, Y, Peng, Y., Kou, G., & Chen, Z. (2005).
Classifying credit card accounts for business intel- Zhang, J., Shi, Y., & Zhang, P. (2005). Several
ligence and decision making: A multiple-criteria multi-criteria programming methods for clas-

Introduction to Data Mining Techniques via Multiple Criteria Optimization Approaches and Applications

sification. Working Paper, Chinese Academy of Zheng, J., Zhuang, W., Yan, N., Kou, G., Erichsen,
Sciences Research Center on Data Technology & D., McNally, C., Peng, H., Cheloha, A., Shi, C., &
Knowledge Economy and Graduate University of Shi, Y. (2004). Classification of HIV-1-mediated
Chinese Academy of Sciences, China. neuronal dendritic and synaptic damage using
multiple criteria linear programming. Neuroin-
Zheng, J., Thylin, M., Ghorpade, A., Xiong, H.,
formatics, 2, 303-326.
Persidsky, Y., Cotter, R., Niemann, D., Che, M.,
Zeng, Y., Gelbard, H. et al. (1999). Intracellular Zimmermann, H.-J. (1978). Fuzzy programming
CXCR4 signaling, neuronal apoptosis and neu- and linear programming with several objective
ropathogenic mechanisms of HIV-1-associated functions. Fuzzy Sets and Systems, 1, 45-55.
dementia. Journal of Neuroimmunology, 98,
185-200.

This work was previously published in Research and Trends in Data Mining Technologies and Applications, edited by D. Taniar,
pp. 242-275, copyright 2007 by IGI Publishing, formerly known as Idea Group Publishing (an imprint of IGI Global).

Chapter II
Making Decisions with Data:
Using Computational Intelligence Within a
Business Environment

Kevin Swingler
University of Stirling, Scotland

David Cairns
University of Stirling, Scotland

ABSTRACT

This chapter identifies important barriers to the successful application of computational intelligence (CI) techniques in a commercial environment and suggests a number of ways in which they may be overcome. It identifies key conceptual, cultural and technical barriers and describes the different ways in which they affect both the business user and the CI practitioner. The chapter does not provide technical detail on how to implement any given technique; rather, it discusses the practical consequences for the business user of issues such as non-linearity and extrapolation. For the CI practitioner, we discuss several cultural issues that need to be addressed when seeking to find a commercial application for CI techniques. The authors aim to highlight to technical and business readers how their different expectations can affect the successful outcome of a CI project. The authors hope that by enabling both parties to understand each other’s perspective, the true potential of CI can be realized.

INTRODUCTION

Computational intelligence (CI) appears to offer new opportunities to a business that wishes to improve the efficiency of their operations. It appears to provide a view into the future, answering questions such as, “What will my customers buy?”, “Who is most likely to file a claim on an insurance policy?”, and “What increase in demand will follow an advertising campaign?” It can filter good prospects from bad, the fraudulent from the genuine and the profitable from the loss-making.

Copyright © 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.

These abilities should bring many benefits to a business, yet the adoption of these techniques has been slow. Despite the early promise of expert systems and neural networks, the application of computational intelligence has not become mainstream. This might seem all the more odd when one considers the explosion in data warehousing, loyalty card data collection and online data-driven commerce that has accompanied the development of CI techniques (Hoss, 2000).

In this chapter, we discuss some of the reasons why CI has not had the impact on commerce that one might expect, and we offer some recommendations for the reader who is planning to embark on a project that utilizes CI. For the CI practitioner, this chapter should highlight cultural and conceptual business obstacles that they may not have considered. For the business user, this chapter should provide an overview of what a CI system can and cannot do, and in particular the dependence of CI systems on the availability of relevant data.

Given the right environment the technology has been shown to work effectively in a number of fields. These include financial prediction (Kim & Lee, 2004; Trippi & DeSieno, 1992; Tsaih, Hsu, & Lai, 1998), process control (Bhat & McAvoy, 1990; Jazayeri-Rad, 2004; Yu & Gomm, 2002) and bio-informatics (Blazewicz & Kasprzak, 2003). This path to successful application has a number of pitfalls and it is our aim to highlight some of the more common difficulties that occur during the process of applying CI and suggest methods for avoiding them.

BACKGROUND

Computational intelligence is primarily concerned with using an analytical approach to making decisions based on prior data. It normally involves applying one or more computationally intensive techniques to a data set in such a way that meta-information can be extracted from these data. This meta-information is then used to predict or classify the outcome of new situations that were not present in the original data. Effectively, the power of the CI system derives from its ability to generalize from what it has seen in the past to make sensible judgements about new situations.

A typical example of this scenario would be the use of a computational intelligence technique such as a neural network (Bishop, 1995; Hecht-Neilsen, 1990; Hertz, Krogh, & Palmer, 1991) to predict who might buy a product based on prior sales of the product. A neural network application would process the historical data set containing past purchasing behaviour and build up a set of weighted values which correlate observed input patterns with consequent output patterns. If there was a predictable consistency between a buyer’s profile (e.g., age, gender, income) and the products they bought, the neural network would extract the salient aspects of this consistency and store it in the meta-information represented by its internal weights. A prospective customer could then be presented to the neural network which would use these weights to calculate an expected outcome as to whether the prospect is likely to become a customer or not (Law, 1999).

Although neural networks are mentioned above, this process is similar when used with a number of different computational intelligence approaches. Even within the neural network field, there are a large number of different approaches that could be used (Haykin, 1994). The common element in this process is the extraction and use of information from a prior data set. This information extraction process is completely dependent upon the quality and quantity of the available data. Indeed it is not always clear that the available data are actually relevant to the task at hand, a difficult issue within a business environment when a contract has already been signed that promises to deliver a specific result.
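The idea of storing what was learned from past sales in a set of weights, and then applying those weights to a new prospect, can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses a single logistic unit (the simplest stand-in for a neural network, not any particular method from the works cited), and the data, the scaling of age and income to the range 0 to 1, and the function names `train` and `predict` are all hypothetical.

```python
import math

def sigmoid(z):
    """Squash a weighted sum into a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def train(profiles, bought, epochs=2000, rate=0.1):
    """Fit one weight per input plus a bias by gradient descent on log-loss."""
    weights = [0.0] * len(profiles[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(profiles, bought):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y  # gradient of the log-loss with respect to the weighted sum
            weights = [w - rate * err * xi for w, xi in zip(weights, x)]
            bias -= rate * err
    return weights, bias

def predict(weights, bias, profile):
    """Score a new prospect with the weights learned from past data."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, profile)) + bias)

# Hypothetical sales history: (age, income), both scaled to 0-1,
# and whether that person bought the product (1) or not (0).
history = [(0.2, 0.1), (0.3, 0.2), (0.25, 0.3), (0.7, 0.8), (0.8, 0.7), (0.9, 0.9)]
bought = [0, 0, 0, 1, 1, 1]

weights, bias = train(history, bought)
older_wealthier = predict(weights, bias, (0.85, 0.8))
younger_poorer = predict(weights, bias, (0.2, 0.2))
print(older_wealthier, younger_poorer)
```

The output is a score between 0 and 1 rather than a yes/no answer, and it is only as trustworthy as the historical data that produced the weights.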


BEING COMMERCIAL

This chapter makes two assumptions. The first is that the reader is interested in applying CI techniques to commercial problems. The second is that the reader has not yet succeeded in doing so to any great extent. The reader may therefore be a CI practitioner who thoroughly understands the computational aspects and is having difficulties with the business aspects of selling CI, or a business manager who would like to use CI but would like to be more informed about the requirements for applying it. In this chapter we offer some observations we have made when commercializing CI techniques, in the hope that the reader will find a smoother route to market than they might otherwise have taken.

If you are hoping to find commercial application for your expertise in CI, then it is probably for one or more of the following reasons:

• You want to see your work commercially applied.
• Commercialization is stipulated in a grant you have won.
• You want to earn more money.

Many technologists with an entrepreneurial eye will have heard the phrase, “When you have invented a hammer, everything looks like a nail.” Perhaps the most common mistake made by any technologist looking to commercialize their ideas for the first time is to concentrate too much on the technology and insufficiently on the needs of their customers (Moore, 1999). The more tied you are to a specific technique, the easier this mistake is to make. It is easy to concentrate on the technological aspects of an applied project, particularly if that is where your expertise lies.

CONCEPTUAL, CULTURAL, AND TECHNICAL BARRIERS

We believe that computational intelligence faces a number of barriers that impede its general use in business. We have broken these down into three key areas: conceptual, cultural and technical barriers. On the surface, it may appear that technical barriers would present the greatest difficulties; however, it is frequently the conceptual and cultural barriers that stop a project dead in its tracks. The following sections discuss each of these areas in turn. We first discuss some of the main foundations of CI under the heading of “Conceptual Barriers”; this is followed by a discussion of the business issues relating to CI under the topic of “Cultural Barriers,” and we finish off by covering the “nuts and bolts” of a CI project in a section on “Technical Barriers.”

Conceptual Barriers

CI offers a set of methods for making decisions based on calculations made from data. These calculations are normally probabilities of possible outcomes. This is not a concept that many people are familiar with. People are used to the idea of a computer giving definitive answers, such as the value of sales for last year. They are less comfortable with the idea that a computer can make a judgement that may turn out to be wrong.

The end user of a CI system must understand what it means to make a prediction based on data, the effect of errors and non-linearity and the requirements for the right kind of data if a project is to be successful. Analysts will understand these points intuitively, but if managers and end users do not understand them, problems will often arise.

Core Concepts

In this section, we will define and explain some of the mathematical concepts that everybody involved in a CI project will need to understand. If you are reading this as a CI practitioner, it may seem trivial and somewhat obvious. This unfortunately is one of the first traps of applying CI: there will be people who do not understand these concepts or perhaps have an incomplete understanding, which may lead them to expect different outcomes. These differences in understanding must be resolved in order for a project to succeed. We highlight these mathematical concepts because they are what makes CI different from the type of computing many people find familiar. They are conceptual barriers because their consequences have a material impact on the operation of a CI-based system.

Systems, Models, and The Real World

First, let us define some terms in order to simplify the text and enhance clarity. A system is any part of the real world that we can measure or observe. Generally, we will want to predict its future behaviour or categorize its current state. The system will have inputs: values we can observe and often control, that lead to outputs that we cannot directly control. Normally the only method available to us if we want to change the values of the outputs is to modify the inputs. Our goal is usually to do this in a controlled and predictable manner.

In the purchasing example used above, our inputs would be the profile of the buyer (their age, gender, income, etc.) and the outputs would be products that people with a given profile have bought before. We could then run a set of possible customers through the model of the system and record those that are predicted to have the greatest likelihood of buying the product we are trying to sell.

Given that a CI system is generally derived from data collected from a real-world system, it is important to determine what factors or variables affect the system and what can safely be ignored. It is often quite difficult to estimate in advance all the factors or variables that may affect a system, and even when they can be identified, it is not always possible to gather data about those factors.

The usual approach, forced on CI modelers through pragmatism, is to use all the variables that are available and then exclude variables that are subsequently found to be irrelevant. Time constraints frequently do not allow for data on further variables to be collected. It is important to acknowledge that this compromise is present since a model with reduced functionality will almost certainly be produced. From a business point of view, it is essential that a client is made aware that the limitations of the model are attributable to the limitations of their data rather than the CI technique that has been used. This can often be a point of conflict and therefore needs to be clarified at the very outset of any work.

Related to this issue of collecting data for all the variables that could affect a system is the collection of sufficient data that span the range of all the values a variable might take with respect to all the other variables in the system. The goal here is to develop a model that accurately links the patterns in the input data to corresponding output patterns and ideally this model would be an exact match to the real-world system. Unfortunately, this is rarely the case since it is usually not possible to gather sufficient data to cover all the possible intricacies of the real-world system.

The client will frequently have collected the data before engaging the CI expert. They will have done this without a proper knowledge of what is likely to be required. A significant part of the CI practitioner’s expertise is concerned with the correct collection of the right data. This is a complex issue and is discussed in detail in Baum and Haussler (1989).

A simple example of this might be the collection of temperature readings for a chemical process. Within the normal operation of this process, the temperature may remain inside a very stable range, barely moving by a few degrees. If regular recordings of the system state are being made every 5 seconds, then the majority of the data that are collected will record this temperature measurement as being within its stable range. An analyst may, however, be interested in what happens to the system when it is perturbed outside its normal behaviour or perhaps what can be done to make the system optimal. This may involve temperature variations that are relatively high or low compared to the norm. Unless the client is willing to perturb their system such that a large number of measurements of high and low temperatures can be obtained, it will not be possible to make queries about how the system will react to novel situations.

This lack of relevant data over all the “space” that a system might cover will lead to a model that is only an approximation to the real world. The model has regions where it maps very well to the real world and produces accurate predictions, but it will also have regions where data were sparse or noisy and its approximations are consequently very poor.

Inputs and Outputs

Input and output values are characterized by variables: a variable describes a single input or output, for example “temperature” or “gender.” Variables take values: temperature might take values from 0 to 100 and gender would take the values “male” or “female.” Values for a given variable can be numeric like those for a temperature range or symbolic like those of “gender.” It is rare that a variable will have values that are in part numeric and in part symbolic. The general approach in this case is to force the variable to be regarded as symbolic if any of its values are symbolic. Fuzzy systems can impose an order on symbolic data; for example, we can say that “cold” is less than “warm,” which is less than “hot.” This enables us to combine the two concepts.

Numbers have an order and allow distances to be calculated between them; symbolic variables do not, although they may have an implied scale such as “small,” “medium” or “large.” Unless an artificial distance metric is created for symbolic variables, a computational intelligence system cannot know, for example, that blue and purple are closer than blue and yellow. This information may be present in the knowledge of a user, but it is not obvious from just looking at the symbolic values “blue” and “yellow.”

Coincidence and Causation

If two things reliably coincide, it does not necessarily follow that one caused the other. Causation cannot be established from data alone. We can observe that A always occurs when B occurs, but we cannot say for sure that A causes B (or indeed, that B causes A). If we observe that B always follows A, then we can rule out B causing A, but we still cannot conclude that A causes B from the data alone. If A is “rain” and B is “wet streets” then we can infer that there is a causal effect, but if A is “people sending Christmas cards” and B is “snow falling” then we know that A does not cause B nor B cause A, yet the two factors are associated. Generally, however, if A and B reliably occur together, then we can use that fact to predict that B will occur if we have seen A. Spotting such co-occurrences and making proper use of them is at the heart of many CI techniques.

Non-Linearity

Consider any system in which altering an input leads to a change in an output. Take the relationship between the price of a product and the demand for that product. If an increase in price of $1 always leads to a decrease in demand of 50 units regardless of the current price, then the relationship is said to be linear. If, however, the change in demand following a $1 increase varies depending on the current price, then the system is non-linear. This is the standard demand curve and is an example of non-linearity for a single input variable.

Adding further input variables can introduce non-linearity, even when each individual variable produces a linear effect if it alone is changed.
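The distinction drawn above can be made concrete with a small sketch. Everything in it is hypothetical: the three demand functions and their coefficients are invented purely to show the three situations (a linear effect, a non-linear effect of a single input, and two inputs that are each linear on their own but non-linear in combination), and `advertising` is an assumed second input measured in arbitrary units.

```python
def linear_demand(price):
    # Every $1 increase removes 50 units, whatever the current price.
    return 1000 - 50 * price

def curved_demand(price):
    # Hypothetical curve: the effect of $1 depends on the current price.
    return 1000 / price

def demand_with_advertising(price, advertising):
    # Changing either input alone gives a straight-line response, but the
    # product term makes the effect of price depend on advertising.
    return 1000 - 50 * price + 20 * advertising + 4 * price * advertising

def effect_of_one_dollar(demand, price, *other_inputs):
    """Change in demand when the price rises by $1 from its current value."""
    return demand(price + 1, *other_inputs) - demand(price, *other_inputs)

print(effect_of_one_dollar(linear_demand, 5), effect_of_one_dollar(linear_demand, 10))
print(effect_of_one_dollar(curved_demand, 5), effect_of_one_dollar(curved_demand, 10))
print(effect_of_one_dollar(demand_with_advertising, 5, 0),
      effect_of_one_dollar(demand_with_advertising, 5, 10))
```

In the first case the answer to "what does a $1 rise cost us?" is a single number; in the other two it can only be answered once the current price, or the level of the other input, is known.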


Such non-linearity arises when two or more input variables interact within the system such that the effect of one is dependent upon the value of the other (and vice versa). An example of such a situation would be the connection between advertising spend, price of the product and the effect these two input variables might have on the demand for the product. For example, adding $1 to the price of the product during an expensive advertising campaign may cause less of a drop in demand compared with the same increase when little has been spent on advertising.

Non-linearity has a number of major consequences for trying to predict a future outcome from data. Indeed, it is these non-linear effects that drove much of the research into the development of the more sophisticated neural networks. It is also this aspect of computational intelligence that can cause significant problems in understanding how the system works. A client will frequently request a simplified explanation of how a CI system is deriving its answer. If the CI model requires a large number of parameters (e.g., the weights of a neural network) to capture the non-linear effects, then it is usually not possible to provide a simplified explanation of that model. The very act of simplifying it removes the crucial elements that encode the non-linear effects.

This directly relates to one of the more frequently requested requirements of a CI system: the decision-making process should be traceable such that a client can look at a suggested course of action and then examine the rationale behind it. This can frequently lead to simple, linear CI techniques being selected over more complex and effective non-linear approaches because linear processes can be queried and understood more easily.

A further consequence of non-linearity is that it makes it impossible to answer a question such as “How does x affect y?” with a general, all-encompassing answer. The answer would have to become either “It depends on the current value of x,” in the case of x having a simple non-linear relationship with y, or “It depends on z,” in cases where the presence of one or more other variables introduces non-linearity.

Here is an example based on a CI system that calculates the risk of a person making a claim on a motor insurance policy. Let us say we notice that as people grow older, their risk increases, but that it grows more steeply once people are over 60 years of age. That is a non-linearity as growing older by one year will have a varying effect on risk depending on the current age.

Now let us assume that the effect of age is linear, but that for males risk gets lower as they grow older and for females the risk gets higher with age. Now, we cannot know the effect of age without knowing the gender of the person in question. There is a non-linear effect produced by the interaction of the variables “age” and “gender.”

It is possible for several inputs to combine to affect an output in a linear fashion. Therefore, the presence of several inputs is not a sufficient condition for non-linearity.

Classification

A classification system takes the description of an object and assigns it to one class among several alternatives. For example, a classifier of fruit would see the description “yellow, long, hard peel” and classify the fruit as a banana. The output variable is “class of fruit,” the value is “banana.” It is tempting to see classification as a type of prediction. Based on a description of an object, you predict that the object will be a banana. Under normal circumstances, that makes sense, but there are situations where it does not, and they are common in business applications of CI.

A CI classification system is built by presenting many examples of the descriptions of the objects to be classified to the classifier-building algorithm. Some algorithms require the user to specify the classes and their members in this data. Other algorithms (referred to as clustering algorithms) work out suitable classes based on groups of objects that are similar enough to each other but different enough from other things to qualify for a class of their own.

A common application of CI techniques in marketing is the use of an existing customer database to build a CI system capable of classifying new prospects as belonging to either the class “customer” or “non-customer.” Classifying a prospect as somebody who resembles a customer is not the same as predicting that the person will become a customer. Such systems are built by presenting examples of customers and non-customers. When they are being used, they will be presented with prospective customers (i.e., those who do not fall into the class of customer at the moment since they have not bought anything). Those prospects that are classified by the CI system as “customer” are treated as good prospects as they share sufficient characteristics with the existing customers.

It must be remembered, however, that they currently fall into the non-customer category, so the use of the classification to predict that they would become customers if approached is erroneous. What the system will have highlighted is that they have a greater similarity to existing customers than those classified as “non-customer.” It does not indicate that they definitely will become a “customer.”

For example, if such a system were used to generate a mailing list for a direct-mail campaign, you would choose all the current non-customers who were classified as potential customers by the CI system and target them with a mail shot. If a random mailing produced a 1% response rate and you doubled that to 2% with your CI approach, the client should be more than satisfied. However, if you treated your classification of customers as a prediction that those people would respond to the mailing, you would still have been wrong on 98% of your predictions.

Prospect list management is increasingly seen as an important part of Customer Relationship Management (CRM) and it is in that aspect that CI can offer real advantages. Producing a list of 5,000 prospects and predicting that they will all become customers is a sure way of producing scepticism in the client at best, and at worst of failing to deliver.

Dealing with Errors and Uncertainty

Individual predictions from a CI system have a level of error associated with them. The level of error may depend on the values of the inputs for the current situation, with some situations being more predictable than others. This lack of certainty can be caused by noise in the data, inconsistencies in the behaviour of the system under consideration or by the effects of other variables that are not available to the analysis. Dealing with this uncertainty is an important part of any CI project. It is important both technically (measuring and acting on different levels of certainty) and conceptually (ensuring that the client understands that the uncertainty is present). (See Jepson, Collins, & Evans, 1993; Srivastava & Weigend, 1994 for different methods for measuring errors.)

We have stated that a classification can be seen as a label of a class that a new object most closely resembles, as opposed to being a prediction of a class of behaviour. A consequence of this is that a CI system can make a prediction or a classification that turns out to be wrong. In the broadest sense, this would be defined as an error but could also be seen as a consequence of the probabilistic nature of CI systems. For example, if a CI system predicts that an event will occur with a probability of 0.8 and that event does not occur for a given prediction, then the prediction and its associated probability could still be seen as being correct. It is just that in this instance the most probable outcome did not occur. In order to validate the system, you must look at all the results for all the predictions. If a CI model assigns a probability of 0.8 to an event, it should occur 8 times out of 10 for the system to be valid but you should still expect it to misclassify 2 out of 10 events.
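One way to carry out the validation described above is to group predictions by the probability the system stated and compare that figure with how often the event actually happened. The sketch below is a minimal illustration; the function name `calibration_table`, the rounding of probabilities to one decimal place and the validation results are all assumptions made for the example.

```python
from collections import defaultdict

def calibration_table(predictions):
    """Group (probability, happened) pairs by stated probability and
    return {probability: (number of cases, observed frequency)}."""
    buckets = defaultdict(list)
    for probability, happened in predictions:
        buckets[round(probability, 1)].append(happened)
    return {p: (len(seen), sum(seen) / len(seen))
            for p, seen in sorted(buckets.items())}

# Hypothetical validation results: ten events predicted at 0.8, ten at 0.3.
results = ([(0.8, 1)] * 8 + [(0.8, 0)] * 2 +
           [(0.3, 1)] * 3 + [(0.3, 0)] * 7)

table = calibration_table(results)
for stated, (cases, observed) in table.items():
    print(f"stated {stated:.1f}  observed {observed:.2f}  over {cases} cases")
```

A system whose stated and observed figures agree is behaving as claimed even though individual predictions still turn out wrong; a large gap between the two columns suggests the probabilities themselves cannot be trusted.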


For example, if a given insurance claim is assigned a probability of being fraudulent of 0.8, then one would expect 8 out of 10 identical claims to be fraudulent. If this turned out not to be the case, for example only 6 out of 10 turned out to be fraudulent, then the CI system would be considered to be wrong.

Returning to the customer-prospecting example, it is clear that the individual cost of a wrong classification in large campaigns is small. If we have made it clear that the prospects were chosen for looking most like previous customers and that no predictions are made about a prospect actually converting, the job of the CI system becomes to increase the response rate to a campaign.

There are many cases where it is necessary to introduce the concept of the CI system being able to produce an “I don’t know” answer. Such cases are defined as any prediction or classification with a confidence score below a certain threshold. By refusing to make a judgement on such cases, it is possible to reduce the number of errors made in all other cases.

The authors have found that neural-network-based systems are very useful for the detection of fraudulent insurance claims. A system was developed that could detect fraudulent claims with reasonable accuracy. However, the client did not want to investigate customers whose claims looked fraudulent but were not. By introducing the ability of the system to indicate when it was uncertain about a given case, we were able to significantly reduce the number of valid claims that were investigated.

The two aspects that had to be considered when looking at the pattern of errors within the above example were the cost of a false positive and the cost of a false negative. An example of a false positive would be a situation where an insurance fraud detection system classified a claim as “positive” for fraud (i.e., fraudulent) but subsequent investigation indicated the claim to be valid. In the case of a false negative, the insurance fraud system might indicate that a claim is “negative” for fraud when in fact it was actually fraudulent. In the latter case you would not know that you had paid out on a fraudulent claim unless you explicitly investigated every claim while validating the fraud detection system.

False positives and false negatives have a cost associated with them in any specific application. The key to dealing with these errors lies in the cost-benefit ratio for each type of error. A false positive in the above case may cost two days' work for an investigator. A false negative (i.e., paying out on a missed fraudulent claim) may cost many thousands of dollars.

Interpolation vs. Extrapolation

Many users want a model that they can use to make predictions about uncharted territory. This involves either interpolation within the current model or extrapolation into regions outside the data set from which the model was built. This might happen in a case where the user asks the system to make a prediction for the outcome of a chemical process when one of the input variables, such as temperature, is higher than any example provided in the recorded data set.

Without a measure of the non-linearity in the system, it can be difficult to estimate how accurate such predictions are likely to be. For example, interpolation within a data-rich area of the variable space is likely to produce accurate results unless the system is highly non-linear. Conversely, interpolation within a data-poor area is likely to produce almost random answers unless the system is very linear in the region of the interpolation. The problem with many computational systems is that it is often not obvious when the model has strayed outside its “domain knowledge.”

A good example comes from a current application being developed by the authors. We are using a neural network to predict sales levels of newly released products to allow distributors and retailers to choose the right stocking levels. The effect on sales of the factors that we can measure is non-linear, which means that we do not know
how those factors would lead to sales levels that were any higher than those we have seen already. The system is constrained to predicting sales levels up to the maximum that it has already seen. If a new product is released in the future, and sells more than the best selling product that we have currently seen, we will fall short in our prediction.

In the case of interpolation, the simplest method for ensuring that non-linearity is accurately modelled is to gather as much data as possible. This is because the more data we have, the more likely it is that areas of non-linearity within the system will have sample data points indicating the shape of the parameter space. If there were insufficient data in a non-linear part of the system, then a CI method would tend to model the area as though it were linear.

In the extreme, you only need two data points to model a linear relationship. As soon as a line becomes a curve then we need a multitude of data points along the curve to map out its correct shape. Figure 1 (a) shows a simple case of identifying a linear relationship in a system with two variables. With only two points available, the most obvious conclusion to draw would be that the system is linear. Figure 1 (b) highlights what would happen if we were to obtain more data points. Our initial assumptions would be shown to be potentially invalid. We would now have a case for suspecting that the system is non-linear or perhaps very noisy. A CI model would adapt to take account of the new data points and produce an estimate of the likely shape of the curve that would account for the shape of the points (Figure 1 (c)).

It can be seen from Figure 1 (c) that if we had interpolated between the original two points shown in Figure 1 (a) then we would have made an incorrect prediction. By ensuring that we had adequate data, the non-linearity of the system would be revealed and the CI technique would adapt its model accordingly.
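
The effect of sample density on interpolation can be illustrated with a small Python sketch. It makes several assumptions of our own: the "true" system is taken to be the curve y = x², and simple piecewise-linear interpolation stands in for a CI model. Neither is taken from the applications discussed in this chapter.

```python
from bisect import bisect_right

def interpolate(xs, ys, x):
    """Piecewise-linear interpolation through sorted sample points."""
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("x lies outside the sampled range (extrapolation)")
    # Index of the segment [xs[i], xs[i + 1]] that contains x.
    i = min(bisect_right(xs, x) - 1, len(xs) - 2)
    t = (x - xs[i]) / (xs[i + 1] - xs[i])
    return ys[i] + t * (ys[i + 1] - ys[i])

def true_system(x):
    # Hypothetical non-linear system, chosen only for illustration.
    return x ** 2

sparse_x = [0.0, 2.0]                  # two points: looks linear
dense_x = [0.0, 0.5, 1.0, 1.5, 2.0]    # more points reveal the curve
sparse_y = [true_system(x) for x in sparse_x]
dense_y = [true_system(x) for x in dense_x]

query = 0.75
sparse_error = abs(interpolate(sparse_x, sparse_y, query) - true_system(query))
dense_error = abs(interpolate(dense_x, dense_y, query) - true_system(query))
print(f"two points:  error {sparse_error:.4f}")
print(f"five points: error {dense_error:.4f}")
```

With only the two end points, the interpolated value at the query point is far from the curve; with five points the error is much smaller, which is the behaviour described for Figure 1 (a) to (c).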

Figure 1 (a) A simple linear system derived from 2 points. (b) The addition of further data reveals non-linearity. (c) A CI system fits a curve to the available data. (d) The shape of the estimated curve showing how further data produces a new shape; extrapolation would fail in these regions.

Related to this concept is the possibility of extrapolating from our current known position in order to make predictions about areas outside the original data set used to build the model. Extrapolation of the linear system in Figure 1 (a) would be perfectly acceptable if we knew the system was actually linear. However, if we know the system is non-linear, this approach becomes very error prone. An example of the possible shape of the curve is shown in Figure 1 (c); however, we have no guarantee that this is actually where the curve goes. Further data collection in the extremes of the system (shown by black squares) might reveal that the boundaries of the curve are actually quite a different shape to the one we have extrapolated (Figure 1 (d)). While we remained in the data-rich central area of the curve, our prediction would have remained accurate. However, as soon as we went to the extremes, errors would have quickly crept in.

Given that we have the original data set at our disposal, it is possible to determine how well sampled a region is in which we wish to make a prediction. This should enable us to provide a measure of uncertainty about the prediction itself. With regard to extrapolation, we usually know what the upper and lower bounds are for the data used to build the model. We will therefore know when we have set a given input variable to a value outside the range to which the data used to build the model were limited. For anything but the simplest of systems, this should start ringing alarm bells. It is important that a client using a CI system understands the implications of what they are asking for under each of these situations and, where possible, steers away from trying to use such information.

Generalization

This leads us to the concept of generalization, an important issue in the development of an actual CI system. Generalization is concerned with avoiding the construction of a CI system that is very accurate when tested with data that has been used to build it, but performs very poorly when presented with novel data. With regard to the previous section, generalization deals with the ability of a non-linear system to accurately interpolate between points from the data used to construct it.

An idealized goal for a CI system is that it should aim to produce accurate predictions for data that it has not seen before. With a poorly constructed CI system that may have been built with unrepresentative data, the system is likely to perform well when making predictions in the region of this unrepresentative data and very poorly when tested with novel data that is more representative of the typical operating environment. In simple terms, the system attempts to build a predictive system that very closely follows all the observed historical data to the detriment of new data.

If all the data used to build a system completely captured the behaviour of the system then generalization would not be an issue. This is almost never the case, as it is very difficult to capture all the data describing the state of a system and furthermore data usually have some degree of noise associated with them. The CI practitioner will understand these limitations and will attempt to minimize their effects on the performance of the CI system. For the business manager interested in applying CI with the assistance of the practitioner, this will generally present itself as a need for a significant amount of data in an attempt to overcome the noise within it and ensure that a representative sample of the real-world system has been captured.

Cultural Barriers

CI’s apparent power lies in its ability to address issues at the heart of a business: choosing prospects, pricing insurance, or warning that a machine needs servicing. These are high-level decisions that a business trusts experts to perform. Can you go into a business and challenge the expertise
of their marketing team, their underwriters, or their engineers in the same way that production line robots have replaced car assembly workers? We look at these cultural barriers and the ways in which they have been successfully overcome. Whether you are an external consultant selling to a client or an internal manager selling an idea to the board, you will need to understand how to win acceptance for this new and challenging approach if your project is to succeed.

People who are experts at their job do not like to think that a computer can do it better. In general, computers are regarded as dumb tools, there to help the human experts with the tedious aspects of their work. Robots and simple machines have successfully replaced a lot of manual labor. There have been barriers to this replacement (protests from unions and doubts about quality, for example), but automation of manual labor is now an integral aspect of the industrialized world.

Computational intelligence might be vaunted as offering a modern computational revolution where machines are able to replace human decision-making processes. This replacement process should free up people to focus on special cases that require thought and knowledge of context that the computer may be lacking. Given these positive aspects, there are still many reasons why this shift might not come about. In the first instance, there is the position of power held by the people to be replaced. The people who make the decisions are less than happy with the idea that they might be replaceable and that they might be called upon to help build the systems that might replace them. Manual workers have little say in the running of an organization. However, marketing executives and underwriters are higher up a company’s decision-making chain; replacing them with a computer is consequently a more difficult prospect.

Next there is the issue of trust. I might not believe that a machine can build a car, but show me one that does and I have to believe you. If I do not believe that a computer can understand my customers better than me, you can show me an improved response to a mail campaign for a competing company, but I will still believe that my business is different and it will require a lot of evidence before I will change my mind.

Related to the issue of trust is that of understanding. This is a problem on two levels. First, people do not always understand how they themselves do something. For example, we interviewed experts in spotting insurance fraud, who said things like, “You can just tell when a claim is dodgy; it doesn’t look right.” You can call it intuition or experience, but it is hard to persuade somebody that it is the result of a set of non-linear equations served up by their subconscious. The brain is a mysterious thing and people find the idea that in some areas it can be improved upon by a computer very hard to swallow.

The second problem is that people have difficulty believing that a computer can learn. If a person does not understand the concepts of computer learning and how it is possible to use data to make a computer learn, then it is hard for them to make the conceptual leap required to believe that a computer could be good at something that they see as a very human ability.

Here is an example to illustrate the point. A printing company might upgrade from an old optical system to a complete state-of-the-art digital system. In the process they would replace the very core of their business with a new technology, perhaps with the result that their old skills become obsolete. A graphic designer, however, would not want to buy a system that could automatically produce logos from a written brief, no matter how clever the technology.

Our experience has shown that many of these problems can be overcome if the right kind of simplification is applied to the sales pitch. That is not to say that technical details should be avoided or that buyers should be considered stupid. It means choosing the right level of technical
description and, more importantly, setting the strength of claims being made about the technology on offer.

We shall use the task of building a CI system for use in motor insurance as an example. We developed a system that could calculate the risk associated with a new policy better than most underwriters. It could spot fraud more effectively than most claims handlers and it could choose prospective customers for direct marketing better than the marketing department. Insurance could be revolutionized by the use of CI (Viaene, Derrig, Baesens, & Dedene, 2003), but the industry has so far resisted.

An insurance company would never replace its underwriters, so if we are to help them with a CI system, it must be clearly positioned as a tool: something that helps them do their job better without doing it for them. Even though you could train a neural network to predict the probability associated with a new policy leading to a claim better than the underwriter can, it would do too much of his or her job to be acceptable.

Our experience has shown us that approximately 90% of motor insurance policies carry a similar, low probability of leading to a claim. There are 5% that have a high risk associated with them and 5% that have a very low risk. A system for spotting people who fall into the interesting 10% in order to avoid the high risks and increase the low risk policies would leave the underwriters still doing their job on the majority of policies and give them an extra tool to help avoid very high risks. The CI system becomes the basis of a portfolio management system and the sale is then about better portfolio management and not about intelligent computing, a much easier prospect to sell.

Within the context of the insurance fraud example, investigators spent a considerable amount of their time looking at routine cases. Each case took a brief amount of time to review but, due to the large number of them, this took up the majority of their time. If you put forward the argument that the investigators would be better spending their time on the more complex cases where their skills could truly be used, then you can make a case for installing a CI system that does a lot of the routine work for them and only presents the cases that it regards as suspicious.

Another barrier to the successful commercialization of CI techniques is, to put it bluntly, a lack of demand. It is easy to put this lack of demand down to a lack of awareness, but it should be stated with more strength than that: CI is not in the commercial consciousness. Perhaps if prospective customers understood the power of CI techniques, then they would be easily sold on the idea. To an extent, of course, that is true. But to find the true reason behind a lack of demand, we must look at things from a customer’s point of view. Will CI be on the customer’s shopping list? Will there be a budget allocated? Are there pressing reasons for a CI system to be implemented? If the answer to these questions is no, then there is no demand. There is only, at best, the chance to persuade a forward-thinking visionary in the company who has the time, resources and security to risk a CI approach.

To use our e-commerce example again, a company building an online shop will need to worry about secure servers, an e-commerce system, order processing, delivery and promotion of the site. Those things will naturally be on their shopping list. An intelligent shop assistant to help the customer choose what to buy might be the only thing that would make a new e-commerce site stand out. It might be a perfect technical application of CI, and it might double sales, but it will not be planned, nor budgeted for. That makes the difference between you having to sell and the customer wanting to buy.

Technical Barriers

It has been our experience that the most common and fundamental technical barrier to most CI applications concerns access to source data of
the correct type and quality. Obviously, if there is no data available relating to a given application, then no data-driven CI technique will be of use. A number of more common problems arise, however, when a client initially claims to have adequate data.

Is the client able to extract the data in electronic form? Some database systems actually do not have a facility for dumping entire table contents, compelling the user to make selections one at a time. Some companies still maintain paper-only storage systems and some companies have a policy against data leaving their premises.

It is also well worth remembering that the appropriate data will not only need to be available at the time of CI system development, but at run time, too. A typical use of CI in marketing is to make predictions about the buying behaviour of customers. It is easy to append lifestyle data to a customer database off-line ready for analysis, but will that same data be available online when a prediction is required for any given member of the population as a whole?

Does the data reflect the task you intend to perform and does it contain the information required to do so? Ultimately, finding the answer to this question is the job of the CI expert, but this is only true when the data appears to reflect the application well. It can be worth establishing early on that the data at least appears useful.

There are also technical aspects of a CI project that will have an impact on the contractual arrangements between you and the client. These are consequences of the fact that it is not always possible to guarantee the success of a CI project since the outcome depends on the quality of the data.

If the client does not have suitable data but is willing to collect some, it is important to be clear about what is to be collected and when that data will be delivered. If your contract with the client sets out a timetable, be sure that delays in the data collection (which are not uncommon) allow your own milestones to be moved. Be clear that your work cannot start until the data are delivered. You may also want to be clear that the data must meet a certain set of criteria.

You also need to make it very clear what the client is buying. Most clients will be used to the idea that if they have a contract with (say) a software company to develop a bespoke solution, then that solution will be delivered, working as agreed upon in the specification. If it is not, then the contract will usually allow for payment to be reduced or withheld. It should be made clear to the client that their data, and whether or not it contains the information required to allow the CI approach to work, will be the major contributing factor to the success or otherwise of the project. The client must understand that success cannot be guaranteed. It has been our experience that the client often does not see it this way: the failure of the CI model to accurately predict who their customers are is seen as a failure of CI, not of their carefully collected data.

Another consequence of the lack of available data at the start of a project is the difficulty it presents if you plan to demonstrate your approach to a prospective client. You can generate mock data that carries the information you hope to find in your client’s database, but this proves little to the client as it is clearly invented by you. You can talk about (and possibly even demonstrate) what you have done for similar, anonymous clients, but each company’s situation is usually different and CI models are very specific to each customer. The difficulty of needing data therefore remains.

There are many specific technical problems including choosing the right CI technique and using it to produce the best results. Each CI technique has its own particular requirements and issues. It is beyond the scope of this chapter to cover such topics; we have focussed on the elements that occur generally across the diverse set of CI approaches. Further chapters in this book address technique-specific issues.

Future Trends

We believe that CI technology is currently at a stage of development where weaknesses in the techniques are not the major barriers to immediate commercial exploitation. We have identified what we consider to be the main cultural, conceptual and technical barriers to commercialization of CI and the reader may have noticed that the technical barriers did not include any shortcomings of the CI methods themselves.

There is a large gap between the power of the techniques available and the problems that are currently being solved by those techniques. Unusually, however, it is the technology that is ahead. One can easily imagine impressive applications of CI techniques that are yet to be perfected: Web agents that can write you an original essay on any topic you choose, robot cars capable of negotiating the worst rush hour traffic at high speeds, and intelligent CCTV cameras that can recognize that a crime is taking place and alert the police. None of these applications are possible today and they are likely to remain difficult for a long time to come. The small improvements to the techniques that are possible in commercially viable time scales will not bring about a step change in the types of applications to which the techniques may be applied.

Our view of the near future of the commercial exploitation of CI, therefore, is concentrated on the methods of delivery of existing techniques and not the development or improvement of those techniques. Of course, the development of CI techniques is important, but it is the commercialization that must catch up with the technology, and not the converse. The consequence of this observation is, we believe, that the near future of the commercial exploitation of CI techniques requires little further technical research. The current techniques can do far more than they are being asked to do.

We expect to see a shift away from selling the idea of the techniques themselves and towards selling a product or service improved by the techniques without reference to those techniques. The search engine Google is a good example. People do not care about the clever methods behind it. They just know it works as a very good search engine.

Another good example of underplayed technology comes from the world of industrial control. Most industrial control is done using a technique known as PID (proportional-integral-derivative) control. Many university engineering departments have produced improvements to the PID controller and very few of them have found their way into an industrial process. One reason for this is that everybody in the industry understands and trusts PID controllers. Nobody wants to open the Pandora’s box of new and challenging techniques that might fragment the industry and its expertise.

One company developed an improvement to the PID controller and did not even admit to its existence. They simply embedded it in a new product and sold it as a standard PID controller. It worked just that bit better than all the others. Nobody really knew why it was better, but it was. The controller sold very well, nobody was threatened by the new technique, and there was no technical concept to sell. It just worked better.

An alternative and related example is the use of CI systems to spot fraudulent credit card behaviour. It is simply not practical for investigators to analyze every single credit card transaction. A CI system can be used to monitor activity for each user and determine when it has become unusual. At this point an investigator is alerted who can contact the owner of the card to verify their spending behaviour. People are generally not aware that CI systems are behind such applications, and for all practical purposes, this does not matter. The important issue is the benefits they bring rather than their technical sophistication.

The authors have put this approach into practice. Having spent several years selling CI technology to direct marketing agencies with little success, they have recently launched a Web-based direct marketing system that is driven by CI
techniques. The service allows clients to upload their current customer database to a Web server. It then appends lifestyle data to the names in the database, which is then used to generate a new list of prospects for the client to download.

The primary selling point of the service is that it is easy to use and inexpensive (the techniques are automated). These are far easier concepts to sell because they are clearly demonstrable: the client can see our prices and visit our Web site to see how easy the process is to use. Having got a foot in the door of the mailing list market, our system quietly uses some very straightforward CI techniques to produce prospect lists that yield response rates up to four times better than the industry standard.

Our approach is proving successful and it is based on the following points:

• We selected a market where there was clearly money to be made from delivering an improvement to the existing, inefficient norm.
• The main selling point of the product is not technical, thus all problems associated with explaining and selling the CI concept are avoided.
• We deliver a service that the customer needs, already has budgeted for, and understands perfectly.
• The data we receive are always in the same format (names and addresses) and we provide all the additional data required. Consequently, we never have a problem with data quality.

This approach has a number of advantages. It removes the need for the client to worry that they are taking a risk by using a new technology. It removes the need for us to try and sell the idea of the technology, and it allows us to sell to a mature market.

Summary

We have seen that there are a number of barriers to the successful commercialization of CI techniques. There is a lack of awareness and understanding from potential customers. Their mathematical nature and the fact that the success of a project depends on the quality of the data it uses can make the concept hard to sell. The lack of awareness also means that companies are not actively looking for CI solutions and are consequently unlikely to have budgets in place with which to buy them.

CI techniques face cultural barriers to their adoption as they could potentially replace existing human expertise. The existing human experts are often in a powerful position to prevent even the risk of this replacement and their unwillingness to change should not be underestimated. We have also touched upon technical barriers, such as accessing the correct data both at design and run-time, and the problems of specifying, demonstrating and prototyping a system based on data.

We have suggested a number of approaches designed to overcome the barriers discussed in this chapter. These approaches can be summarized by the notion of putting yourself in your prospective customer’s position. Ask yourself what the customer needs, not what you can offer. Think about how much change a customer is likely to accept and whether or not they could cope with that change. Ask yourself whether you are making more work for the customer or making their life easier. Think about whether the customer is likely to have a budget for what you offer. If not, can you present it as something they do have budget for? Find out what level of technical risk the customer is likely to be comfortable with. Are they early adopters or conservative followers?

We believe that the future success of CI will rely on keeping your customer on board and giving them what they want, not impressing them with all the clever tricks that you can perform.

The key element is for both you and the client to maintain the same point of view of the problem you are both trying to solve. This will primarily mean that if you are the provider of the CI solution, you will need to adapt your perspective to fit that of the client. It is, however, important that the client understands the conceptual limits of CI as discussed in the early parts of this chapter. In order to maintain a positive working relationship with a client, it is important that they understand both the benefits and limitations of computational intelligence and therefore know, at least in principle, what can and cannot be done.

References

Baum, E. B., & Haussler, D. (1989). What net size gives valid generalisation? Neural Computation, 1(1), 151-160.

Bhat, N., & McAvoy, T. J. (1990). Use of neural nets for dynamic modelling and control of chemical process systems. Computer Chemical Engineering, 14(4/5), 573-583.

Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford, UK: Oxford University.

Blazewicz, J., & Kasprzak, M. (2003). Determining genome sequences from experimental data using evolutionary computation. In G. G. Fogel & D. W. Corne (Eds.), Evolutionary computation in bioinformatics (pp. 41-58). San Francisco: Morgan Kaufmann.

Haykin, S. (1994). Neural networks, a comprehensive foundation. New York: Macmillan.

Hecht-Nielsen, R. (1990). Neurocomputing. Reading, MA: Addison Wesley.

Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Redwood City, CA: Addison Wesley.

Hoss, D. (2000). The e-business explosion: Strategic data solutions for e-business success. DM Review, 10(8), 24-28.

Jazayeri-Rad, H. (2004). The nonlinear model-predictive control of a chemical plant using multiple neural networks. Neural Computing and Applications, 13(1), 2-15.

Jepson, B., Collins, A., & Evans, A. (1993). Post-neural network procedure to determine expected prediction values and their confidence limits. Neural Computing and Applications, 1(3), 224-228.

Kim, K., & Lee, W. B. (2004). Stock market prediction using artificial neural networks with optimal feature transformation. Neural Computing and Applications, 13(3), 255-260.

Law, R. (1999). Demand for hotel spending by visitors to Hong Kong: A study of various forecasting techniques. Journal of Hospitality and Leisure Marketing, 6(4), 17-29.

Moore, G. (1999). Crossing the chasm: Marketing and selling high-tech products to mainstream customers. Oxford, UK: Capstone.

Srivastava, A. N., & Weigend, A. S. (1994). Computing the probability density in connectionist regression. In M. Marinara & G. Morasso (Eds.), Proceedings ICANN, 1 (pp. 685-688). Berlin: Springer-Verlag.

Trippi, R. R., & DeSieno, D. (1992). Trading equity index futures with a neural-network. Journal of Portfolio Management, 19, 27-33.

Tsaih, R., Hsu, Y., & Lai, C. C. (1998). Forecasting S & P 500 stock index futures with a hybrid AI system. Decision Support Systems, 23(2), 161-174.

Viaene, S., Derrig, R. A., Baesens, B., & Dedene, G. (2003). A comparison of state-of-the-art classification techniques for expert automobile insurance claim fraud detection. Journal of Risk and Insurance, 69(3), 373-421.

Yu, D. L., & Gomm, J. B. (2002). Enhanced neural network modelling for a real multi-variable chemical process. Neural Computing and Applications, 10(4), 289-299.

This work was previously published in Business Applications and Computational Intelligence, edited by K. E. Voges, & N. K.
L. Pope, pp. 19-37, copyright 2006 by IGI Publishing, formerly known as Idea Group Publishing (an imprint of IGI Global).

Chapter III
Data Mining Association Rules
for Making Knowledgeable
Decisions
A.V. Senthil Kumar
CMS College of Science and Commerce, India

R. S. D. Wahidabanu
Govt. College of Engineering, India

ABSTRACT

This chapter describes two techniques used to explore frequent large itemsets in a database. In the first technique, called the “closed directed graph approach,” the algorithm scans the database once, counting the possible 2-itemsets; only the 2-itemsets that meet the minimum support are used to form the closed directed graph, which is then used to explore possible frequent large itemsets in the database. In the second technique, the dynamic hashing algorithm, large 3-itemsets are generated at an earlier stage, which reduces the size of the transaction database after trimming and lowers the cost of later iterations. The authors hope that these techniques will help researchers not only to understand how frequent large itemsets are generated, but also to understand how association rules are found among transactions within relational databases.

Copyright © 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.

INTRODUCTION

With the vast growth in applications of computers, large amounts of transaction data are now stored in databases. This has occurred in all areas of human endeavor, from the mundane (such as supermarket transaction data, credit card usage records, telephone call details, and government statistics) to the more exotic (such as images of astronomical bodies, molecular databases, and medical records). Little wonder, then, that interest has grown in the possibility of tapping these data, of extracting from them information that might be of value to the owner of the database. The discipline concerned with this task has become known as data mining.

Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by retrospective tools typical of decision support systems. Data mining tools can answer business questions that traditionally were too time consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations.

Most companies already collect and refine massive quantities of data. Data mining techniques can be implemented rapidly on existing software and hardware platforms to enhance the value of existing information resources, and can be integrated with new products and systems as they are brought online. When implemented on high performance client/server or parallel processing computers, data mining tools can analyze massive databases to deliver answers to questions such as, “Which clients are most likely to respond to my next promotional mailing, and why?”

Data mining techniques are the result of a long process of research and product development. This evolution began when business data was first stored on computers, continued with improvements in data access, and more recently, generated technologies that allow users to navigate through their data in real time. Data mining takes this evolutionary process beyond retrospective data access and navigation to prospective and proactive information delivery. Data mining is ready for application in the business community because it is supported by three technologies that are now sufficiently mature:

• Massive data collection
• Powerful multiprocessor computers
• Data mining algorithms

Commercial databases are growing at unprecedented rates. A recent research survey conducted by GoldenGate Software in San Francisco shows that almost half of the data warehouses are growing between 10 and 50 percent annually. In some industries, such as retail, these numbers can be much larger. The accompanying need for improved computational engines can now be met in a cost-effective manner with parallel multiprocessor computer technology. Data mining algorithms embody techniques that have existed for at least 10 years, but have only recently been implemented as mature, reliable, understandable tools that consistently outperform older statistical methods.

Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner. The relationships and summaries derived through a data mining exercise are often referred to as models or patterns. Examples include linear equations, rules, clusters, graphs, tree structures, and recurrent patterns in time series. Association rules are among the most popular representations for local patterns in data mining.

BACKGROUND

Quite a few rules are available for analyzing data transformation in order to make intelligent decisions. The association rule is by far the most useful method in this respect and is described next.

Association Rule

An association rule is a simple probabilistic statement about the co-occurrence of certain events in a database, and is particularly applicable to sparse transaction data sets. For the sake of simplicity all variables are assumed to be binary. An association rule takes the following form:

If A=1 AND B=1 THEN C=1 with probability ρ,

where A, B, and C are binary variables and ρ = P(C=1 | A=1, B=1), that is, the conditional probability that C=1 given that A=1 and B=1. The conditional probability ρ is sometimes referred to as the “accuracy” or “confidence” of the rule, and P(A=1, B=1, C=1) is referred to as the “support.” This pattern structure or rule structure is quite simple and interpretable, which helps explain the general appeal of this approach. Typically the goal is to find all rules that satisfy the constraint that the accuracy ρ is greater than some threshold ρa, and the support is greater than some threshold ρs (for example, to find all rules with support greater than 0.05 and accuracy greater than 0.8). Such rules comprise a relatively weak form of knowledge; they are really just summaries of co-occurrence patterns in the observed data, rather than strong statements that characterize the population as a whole. Indeed, in the sense that the term “rule” usually implies a causal interpretation (from the left to the right hand side), the term “association rule” is strictly speaking a misnomer, since these patterns are inherently correlational but need not be causal.

The general idea of finding association rules originated in applications involving “market-basket data.” These data are usually recorded in a database in which each observation consists of an actual basket of items (such as grocery items), and the variables indicate whether or not a particular item was purchased. One can think of this type of data in terms of a data matrix of n rows (corresponding to baskets) and p columns (corresponding to grocery items). Such a matrix can be very large, with n in the millions and p in the tens of thousands, and is generally very sparse, since a typical basket contains only a few items. Association rules were invented as a way to find simple patterns in such data in a relatively efficient computational manner.

Details about who calls whom, how long they are on the phone, and whether a line is used for fax as well as voice can be invaluable in targeting sales of services and equipment to specific customers. But these tidbits are buried in masses of numbers in the database. By delving into its extensive customer-call database to manage its communications network, a regional telephone company identified new types of unmet customer needs. Using its data mining system, it discovered how to pinpoint prospects for additional services by measuring daily household usage for selected periods. For example, households that make many lengthy calls between 3 p.m. and 6 p.m. are likely to include teenagers who are prime candidates for their own phones and lines. When the company used target marketing that emphasized convenience and value for adults (“Is the phone always tied up?”), hidden demand surfaced.

Extensive telephone use between 9 a.m. and 5 p.m. characterized by patterns related to voice, fax, and modem usage suggests a customer has business activity. Target marketing offering those customers “business communications capabilities for small budgets” resulted in sales of additional lines, functions, and equipment.

The ability to accurately gauge customer response to changes in business rules is a powerful competitive advantage. A bank searching for new ways to increase revenues from its credit card operations tested a nonintuitive possibility: Would credit card usage and interest earned increase significantly if the bank halved its minimum required payment? With hundreds of gigabytes of data representing two years of average credit card balances, payment amounts, payment timeliness, credit limit usage, and other key parameters, the bank used a powerful data mining system to model the impact of the proposed policy change
on specific customer categories, such as customers consistently near or at their credit limits who make timely minimum or small payments. The bank discovered that cutting minimum payment requirements for small, targeted customer categories could increase average balances and extend indebtedness periods, generating more than $25 million in additional interest earned.

Merck-Medco Managed Care is a mail-order business which sells drugs to the country’s largest health care providers: Blue Cross and Blue Shield state organizations, large HMOs, U.S. corporations, state governments, and so forth. Merck-Medco is mining its one terabyte data warehouse to uncover hidden links between illnesses and known drug treatments, and to spot trends that help pinpoint which drugs are the most effective for what types of patients. The results are more effective treatments that are also less costly. Merck-Medco’s data mining project has helped customers save an average of 10-15% on prescription costs.

Association Rule Mining to Find Frequent Itemsets

The task in association rule discovery is to find all rules fulfilling given prespecified frequency and accuracy criteria. This task might seem a little daunting, as the number of potential frequent sets is exponential in the number of variables of the data, and that number tends to be quite large in market basket applications. Fortunately, in real data sets there are typically relatively few frequent sets (for example, most customers will buy only a small subset of the overall universe of products).

If the data set is large enough, it will not fit into main memory. Researchers have therefore aimed at methods that read the data as few times as possible. Algorithms to find association rules from data typically divide the problem into two parts: (1) find the frequent itemsets and then (2) form the rules from the frequent sets. If the frequent sets are known, then finding association rules is simple. If a rule X ⇒ B has frequency at least s, then the set X must by definition have frequency at least s. Thus, if all frequent sets are known, researchers can generate all rules of the form X ⇒ B, where X is frequent, and evaluate the accuracy of each of the rules in a single pass through the data.

Studies on mining association rules have evolved from techniques for discovery of functional dependencies (Mannila & Raiha, 1987), strong rules (Agrawal & Srikant, 1994), classification rules (Han, Cai, & Cercone, 1993; Quinlan, 1992), causal rules (Michalski & Tecuci, 1994), clustering (Fisher, 1987), and so forth to disk-based, efficient methods for mining association rules in large sets of transaction data (Agrawal, Imielinski, & Swami, 1993; Agrawal & Srikant, 1994; Agrawal & Srikant, 1995; Mannila, Toivonen, & Verkamo, 1994; Park, Chen, & Yu, 1995). Discovery of association rules is an important class of data mining and aims at deciphering interesting relationships among attributes in the data (Houtsma & Swami, 1995; Michalski & Tecuci, 1994; Quinlan, 1992). To achieve this, efficient algorithms must be implemented to conduct mining on these data. For example, given a database of sales transactions, one would like to discover all associations among items such that the presence of some items in a transaction implies the presence of other items in the same transaction. The problem of mining association rules from a database was first explored in Agrawal et al. (1993). Various algorithms have been proposed to discover the large itemsets (Agrawal & Srikant, 1994; Houtsma & Swami, 1995; Park, Chen, & Yu, 1997).

The main limitation of almost all proposed algorithms (Agrawal et al., 1993; Agrawal & Srikant, 1994; Agrawal & Srikant, 1995; Mannila, Toivonen, & Verkamo, 1994; Park, Chen, & Yu, 1995) is that they make repeated passes over the disk-resident database partition, incurring high I/O overheads. Moreover, these algorithms use complicated hash structures, which entail additional overhead in maintaining and searching them, and they typically suffer from poor cache locality (Zaki, Parthasarathy, & Li, 1997). The problem with Partition, even though it makes only two scans, is that, as the number of partitions increases, the number of locally possible frequent itemsets increases. This can be reduced by randomizing the partition selection, but the results from sampling experiments (Toivonen, 1996; Zaki, Parthasarathy, Li, & Ogihara, 1997) indicate that the randomized partitions will have a large number of possible frequent itemsets in common. Partition can thus spend a lot of time performing redundant computation (Zaki, Parthasarathy, Li, & Ogihara, 1997). Construction of directed graphs with a single pass over the database reduces high I/O overheads (Senthil Kumar & Wahidabanu, 2006). Approaches using only general-purpose DBMS systems and relational algebra operations have been studied (Holsheimer, Kersten, Mannila, & Toivonen, 1995; Houtsma & Swami, 1995), but these do not compare favorably with the specialized approaches. A number of parallel algorithms have also been proposed (Han, Karypis, & Kumar, 1997; Houtsma & Swami, 1995). To obtain efficient algorithms, the authors propose two approaches in the next section that aim to overcome the drawbacks of currently available approaches.

TECHNIQUES USED FOR FINDING FREQUENT ITEMSETS

The goal of the techniques described in this section is to detect relationships or associations between specific values of categorical variables in large data sets. The closed directed graph approach scans a database only once, counting the possible 2-itemsets; only the 2-itemsets with a minimum support are used to form the closed directed graph, which explores possible frequent large itemsets in the database. The dynamic hashing algorithm generates large 3-itemsets at an earlier stage, which reduces the size of the transaction database after trimming and consequently reduces the number of iterations.

Closed Directed Graph Approach

Various algorithms used for discovering large itemsets make multiple passes over the data. During the first pass, a count is made to find the support of individual items. The support of individual items is then used to determine which of them are large, that is, have minimum support. In each subsequent iteration, the set of itemsets found to be large in the previous pass is used as the base for generating new potentially large itemsets, called candidate itemsets. A count is made to find the actual support for these candidate itemsets during the pass over the data. The candidate itemsets that are actually large are identified at the end of the pass and form the base for the next pass. The same process is repeated until no new large itemsets are found (Agrawal & Srikant, 1994; Park et al., 1997).

The construction of the closed directed graph requires a single scan of the transaction database. Our method begins by generating all the 2-subsets in the database and performing a single pass of the database to find the support counts of the 2-itemsets. The frequent 2-itemsets with the minimum support are used to generate a directed graph, from which only the closed portion of the directed graph is used to identify the frequent large itemsets. To illustrate this, consider the transaction database shown in Figure 1.

Figure 1. Database

Transaction ID    Items
100               b e
200               a b c e
300               b c e
400               a c d
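
The single-scan idea can be sketched in Python for the Figure 1 database. This is only a minimal illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, the minimum support count of 2 is the value used in the chapter's worked example, and treating a "closed" portion of the graph as three items whose pairwise edges all exist is an assumed reading of the text. The last step computes the confidence of a rule as the conditional frequency defined in the Association Rule section.

```python
from collections import Counter
from itertools import combinations

# Transaction database from Figure 1 (transaction ID -> set of items).
DATABASE = {
    100: {"b", "e"},
    200: {"a", "b", "c", "e"},
    300: {"b", "c", "e"},
    400: {"a", "c", "d"},
}
MIN_SUPPORT = 2  # absolute count, as in the chapter's example


def count_pairs(database):
    """Single pass over the database, counting every 2-itemset."""
    counts = Counter()
    for items in database.values():
        counts.update(combinations(sorted(items), 2))
    return counts


def build_graph(pair_counts, min_support):
    """Keep frequent pairs as directed edges (u, v) with u < v, labelled by count."""
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}


def closed_triples(edges):
    """Assumed reading of 'closed': three nodes whose three pairwise edges all exist."""
    nodes = sorted({item for pair in edges for item in pair})
    return [t for t in combinations(nodes, 3)
            if all(p in edges for p in combinations(t, 2))]


def support(database, itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for items in database.values() if set(itemset) <= items)


edges = build_graph(count_pairs(DATABASE), MIN_SUPPORT)
candidates = closed_triples(edges)
print(candidates)  # [('b', 'c', 'e')]

for triple in candidates:
    # Confidence of the rule {first two items} => last item.
    confidence = support(DATABASE, triple) / support(DATABASE, triple[:2])
    print(triple, support(DATABASE, triple), confidence)
```

Because the graph only proposes candidates, the sketch recounts the support of each candidate triple against the database before reporting it; whether the original method needs such a check is not stated in the chapter.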

First, every possible 2-itemset of each transaction in the database is generated, as shown in Figure 2. Initially, a counter is set for each identified 2-itemset. During the scan of the entire database, the support of each candidate 2-itemset is counted and stored in its respective counter.

Figure 2. Generation of possible 2-itemsets

Transaction    Possible 2-itemsets
{b e}          {{b e}}
{a b c e}      {{a b},{a c},{a e},{b c},{b e},{c e}}
{b c e}        {{b c},{b e},{c e}}
{a c d}        {{a c},{a d},{c d}}

After the scan of the database is over, the value in each counter gives the total count for that particular 2-itemset. The total support for each possible 2-itemset in the given database is shown in Figure 3.

Figure 3. Support for possible 2-itemsets

2-itemset    Count
{a b}        1
{a c}        2
{a d}        1
{a e}        1
{b c}        2
{b e}        3
{c d}        1
{c e}        2

To discover the possible frequent large itemsets, directed graphs are constructed using the items in the set of possible 2-itemsets with minimum support. A directed graph is one of the prevailing techniques to depict associations. A directed graph G = (V, E) consists of a finite set V, together with a subset E ⊆ V × V. The elements of V are the vertices of the graph, and the elements of E are the edges of the graph. An edge of a directed graph is an ordered pair [u,v] where u and v are vertices of the graph. We say that the vertex v is adjacent to the vertex u, and the vertex u is adjacent to the vertex v. Moreover, it is said that the edge is incident from the vertex u and incident to the vertex v. An association graph can quickly turn into a tangled display with as few as a dozen rules. Suppose that the minimum support is 2; the directed graphs for the possible 2-itemsets with the minimum support will then be as shown in Figure 4. The value on each edge represents the count of that pair among the possible 2-itemsets.

After the directed graphs for the possible frequent 2-itemsets with minimum support have been constructed, all the directed graphs in Figure 4 are combined into a single directed graph, step by step, as shown in Figures 5(i) to 5(iv).

In Figure 5(iv) the nodes of items b, c, and e are closed. Only the node of item a is not closed. The edges of the closed nodes are used for the identification of frequent large itemsets. In Figure 5(iv), itemset {b c e} represents the possible
frequent k-itemset, and itemsets {b c}, {b e}, and {c e} represent the possible frequent 2-itemsets for the given database.

Figure 4. Directed graphs for the possible frequent 2-itemsets with minimum support

Graph    Edge     Count
(i)      a → c    2
(ii)     b → c    2
(iii)    b → e    3
(iv)     c → e    2

Figure 5. Formation of closed directed graph. (i) Directed graph 4(i) is used as the base. (ii) Since there is no node for item b in Figure 5(i), a new node for item b is added and directed graph 4(ii) is added to Figure 5(i). (iii) Since there is no node for item e in Figure 5(ii), a new node for item e is added and directed graph 4(iii) is added to Figure 5(ii). (iv) Since nodes for items c and e already exist in Figure 5(iii), directed graph 4(iv) is added to Figure 5(iii).

Dynamic Hashing Algorithm

For the DHP algorithm, Park et al. (1997) showed that the amount of data that has to be scanned during large itemset discovery is a performance-related issue. Reducing the number of transactions to be scanned and trimming the number of items in each transaction improves the data mining efficiency in later stages.

The proposed algorithm, DHA, uses a technique of dynamic hashing to filter out unnecessary itemsets for the next candidate itemset generation. When a transaction database is considered, frequent 2-itemsets are not very useful for improving sales. Considering this, DHA accumulates information only about 3-itemsets in advance, in such a way that all possible 3-itemsets of each transaction, after some pruning, are hashed to a dynamic hash table.

In Park et al. (1997), DHP generates a much smaller C2, so that the step for determining L2 is less expensive than that spent in the
Apriori algorithm. Since frequent 2-itemsets are not very useful for improving sales, the DHA algorithm generates frequent 3-itemsets in advance, which in turn reduces the cost of generating k-itemsets in subsequent passes. The time and cost spent on generating frequent 2-itemsets are also reduced.

The proposed algorithm uses a dynamic hash technique to filter out unnecessary itemsets for the next candidate itemset generation, and it also reduces the size of the database to a minimum for the next large itemset generation. Initially, during the scan of the entire database, the support of candidate k-itemsets is counted, and DHA accumulates information in advance only about (k+1)-itemsets whose hash value is equal to 0: after some pruning, the efficient (k+1)-itemsets of each transaction are assigned a number starting from 0, this number is incremented by 1 for each further (k+1)-itemset, and the itemsets are hashed to the dynamic hash table. The buckets in the dynamic hash table consist only of (k+1)-itemsets whose hash value is equal to 0. Instead of hashing all the (k+1)-itemsets into hash entries whose value is larger than or equal to the support s, DHA hashes only the efficient (k+1)-itemsets whose hash value is equal to 0 into the dynamic hash table. Since the number of keys, m, is not fixed but varies with time, the DHA algorithm uses a dynamic hash table.

The DHA algorithm is divided into three parts. In Part 1, the algorithm simply counts item occurrences to determine the large 1-itemsets and builds a dynamic hash table (i.e., DH3) in which only the efficient 3-itemsets (whose hash value is equal to 0) are filtered, assigned a value that starts from 0 for the first efficient 3-itemset and increases by 1 for each subsequent efficient 3-itemset, and hashed into the dynamic hash table. In Part 2, based on the dynamic hash table (i.e., DHk) generated in the previous pass, a set of candidate itemsets Ck is generated, which determines the set of large k-itemsets Lk; the size of the database is reduced to a minimum for the next large itemsets, and a dynamic hash table is built for the candidate large (k+1)-itemsets. Part 3 is basically the same as Part 2 except that it does not employ a dynamic hash table. DHA is particularly powerful in determining large itemsets in the early stages, which relieves the performance bottleneck. The size of Ck decreases significantly in later stages, leaving little justification for further filtering. This is the very reason that Part 2 is used for early iterations and Part 3 for later iterations. Note that in Part 3 the procedure apriori_gen, which generates Ck+1 from Ck, is essentially the same as the method used by the Apriori algorithm (Agrawal & Srikant, 1994) to determine candidate itemsets, so the authors have omitted its details. Part 3 is included in DHA only for the completeness of the method.

The algorithm can be explained with the example shown in Figure 6. The transaction database D shown in Figure 6 is used for the discussion of the large itemsets. In the first pass of the algorithm, all the transactions of the database are scanned to count the support of the 1-itemsets and form the candidate set of large 1-itemsets, that is, C1 = {{A}, {B}, {C}, {D}, {E}, {F}, {G}}. For efficient counting, a hash tree for C1 is built on the fly. Each item in the database is checked against the items in the hash table. If the item is already present in the hash table, the corresponding count of the item is incremented by one. If the item is not present in the hash table, the new item is inserted into the hash table and its count is initialized to one. For each transaction, after the occurrences of all the 1-subsets are counted, all the 3-subsets of the transaction are generated and only the efficient 3-subsets are hashed into the dynamic hash table DH3, in such a way that each efficient 3-subset generated using the hash function is assigned a count that starts from 0 for the first 3-subset and is incremented by one for each 3-subset generated thereafter. Given the dynamic hash table, to filter out 3-itemsets from L1 * L1, the numbers of occurrences of the 3-itemsets in the dynamic hash table are compared with a minimum support equal to 2, thus forming C3 = {{C D E}, {D E F}}.

Figure 6. Example of a dynamic hash table and generation of C3 and C4

FUTURE TRENDS

In the short term, the results of data mining will be profitable, even if in mundane, business-related areas. Moreover, micromarketing campaigns will explore new niches and, foremost, advertising will target potential customers with new precision tools.

In the medium term, data mining may be as common and easy to use as e-mail. Users may use these tools to find the best airfare to New York, root out a phone number of a long-lost classmate, or find the best prices on lawn mowers.

However, the long-term prospects are truly exciting. Imagine intelligent agents turned loose on medical research data or on subatomic particle data. Computers may reveal new treatments for diseases or new insights into the nature of the universe. There are potential dangers, though, as discussed next.

What if every telephone call you make, every credit card purchase you make, every flight you take, every visit to the doctor, every warranty card you send in, every employment application you fill out, every school record you have, your credit record details, and every Web page you visit were all collected together? Much would be known about you! This is an all-too-real possibility. This kind of information is already stored in databases. Remember that phone interview you gave to a marketing company last week? All of your replies went into a database. Remember that loan application you filled out? It has been dumped in a database. Is there too much information about too many people for anybody to make sense of? Not with data mining tools running on massively parallel processing computers! Would anyone feel comfortable about someone (or lots of someones) having access to all this data about them? And remember, all this data does not have to reside in one physical location; as the net grows, information of this type becomes more available to more people.

All of this calls for a justified use of data mining, attention to its implications (social, ethical, cultural, and economic) for society and, technically speaking, a more robust set of data mining algorithms that define all these parameters and filter all those preconditions.

CONCLUSION

The closed directed graph approach first constructs directed graphs for the possible 2-itemsets with minimum support, and these directed graphs are used to construct a single directed graph from which only the closed portion is used in identifying the possible frequent large itemsets. This method scans the database only once, so the I/O cost of retrieving the possible frequent large itemsets is lower. Since the number of keys, m, is not fixed and varies with time, a dynamic hash table is used in the DHA algorithm to store the efficient large candidate 3-itemsets. The proposed algorithm also prunes the transactions that do not contain any frequent itemsets, and trims the nonfrequent items from the transactions at the initial stage itself. The expected storage utilization will be greater than that of the DHP algorithm (Park et al., 1997). Since frequent 2-itemsets are not very useful for improving sales, frequent 3-itemsets are generated in the initial stage, which in turn also reduces the cost and time used for generating frequent 2-itemsets. Both of the techniques explained above help in discovering interesting association relationships among huge amounts of data, which in turn helps in marketing, decision making, and business management.

REFERENCES

Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large data bases. In Proceedings of the ACM SIGMOD (pp. 207-216).

Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference Very Large Data Bases (pp. 478-499).

Agrawal, R., & Srikant, R. (1995). Mining sequential patterns. In Proceedings of the 1995 International Conference Data Engineering, Taipei, Taiwan.

Fisher, D. (1987). Improving inference through conceptual clustering. In Proceedings of the 1987 AAAI Conference (pp. 461-465). Seattle, Washington.

Han, J., Cai, Y., & Cercone, N. (1993). Data-driven discovery of quantitative rules in relational databases. IEEE Trans. Knowledge and Data Engineering, 5, 29-40.

Han, E. H., Karypis, G., & Kumar, V. (1997). Scalable parallel data mining for association rules. In Proceedings of the ACM SIGMOD Conference Management of Data.

Holsheimer, M., Kersten, M., Mannila, H., & Toivonen, H. (1995). A perspective on databases and data mining. In Proceedings of the 1st International Conference Knowledge Discovery and Data Mining.

Houtsma, M., & Swami, A. (1995). Set-oriented mining for association rules in relational databases. In Proceedings of the 11th International Conference Data Engineering (pp. 25-33).

Mannila, H., & Raiha, K. J. (1987). Dependency inference. In Proceedings of the 1987 International Conference Very Large Data Bases (pp. 155-158). Brighton, England.

Mannila, H., Toivonen, H., & Verkamo, I. (1994). Efficient algorithms for discovering association rules. In Proceedings of the AAAI Workshop, Knowledge Discovery in Databases.

Park, J. S., Chen, M. S., & Yu, P. S. (1995). An effective hash based algorithm for mining association rules. In Proceedings of the 1995 ACM-SIGMOD International Conference Management of Data, San Jose, California.

Park, J. S., Chen, M. S., & Yu, P. S. (1997). Using a hash-based method with transaction trimming for mining association rules. IEEE Transactions on Knowledge and Data Engineering, 9(5), 813-825.

Parthasarathy, S., Zaki, M. J., & Li, W. (1997). Application driven memory placement for dynamic data structures (Tech. Rep. URCS TR 653). University of Rochester.

Quinlan, J. R. (1992). Programs for machine learning. Morgan Kaufmann.

Senthil Kumar, A. V., & Wahidabanu, R. S. D. (2006). Directed graph approach for association rule mining. In Proceedings of the 2nd International Conference ICTS, Indonesia.

Toivonen, H. (1996). Sampling large databases for association rules. In Proceedings of the 22nd VLDB Conference.

Zaki, M. J., Parthasarathy, S., Ogihara, M., & Li, W. (1997). New algorithms for fast discovery of association rules. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining.

Zaki, M. J., Parthasarathy, S., Li, W., & Ogihara, M. (1997). Evaluation of sampling for data mining of association rules. In Proceedings of the 7th International Workshop Research Issues in Data Engineering.

Zaki, M. J., Parthasarathy, S., & Li, W. (1997). A localized algorithm parallel association mining. In
Michalski, R. S. & Tecuci, G., (1994). Machine Proceedings of the 9th ACM Symposium Parallel
learning, A multistrategy approach (Vol. 4). Algorithms and Architectures.
Morgan Kaufmann.

Section II
Tools, Techniques, Methods

Chapter IV
Image Mining:
Detecting Deforestation Patterns
Through Satellites

Marcelino Pereira dos Santos Silva

Rio Grande do Norte State University, Brazil

Gilberto Câmara
National Institute for Space Research, Brazil

Maria Isabel Sobral Escada

National Institute for Space Research, Brazil

ABSTRACT

Every day, different satellites capture data from distinct contexts, and the resulting images are processed and stored in many institutions. This chapter presents relevant definitions in the remote sensing and image mining domain, refers to related work in this field, and discusses the importance of appropriate tools and techniques to analyze satellite images and extract knowledge from this kind of data. The Amazonia deforestation problem is discussed, as well as INPE’s effort to develop and spread technology to deal with challenges involving Earth observation resources. An image mining approach is presented and applied to a case study, detecting patterns of change in deforested areas of Amazonia. The purpose of the authors is to present relevant technologies, new approaches and research directions in remote sensing image mining, demonstrating how to increase the analysis potential of such huge strategic data.

Copyright © 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.

INTRODUCTION

Motivation

Progress in data acquisition and storage technology has led to a huge amount of data stored in repositories, which grow fast. Among the increasing volume of relevant data acquired and processed, there is a strategic segment: satellite images, also known as remote sensing images.

The search for less expensive and more efficient ways to observe Earth motivated the development of remote sensing satellites. They are currently the most significant source of new data about the planet, and remote sensing image databases are the fastest growing archives of spatial information. The variety of spatial and spectral resolutions for remote sensing images ranges from IKONOS 1-meter panchromatic images to the next generation of polarimetric radar imagery satellites. Given the widespread availability of remotely sensed data, many government and private institutions have built large remote sensing image archives.

The US National Satellite Land Remote Sensing Data Archive, managed by the USGS EROS Data Center, hosts 1,400 terabytes of satellite data gathered over 40 years. Satellites such as Terra and Aqua (NASA) generate 3 terabytes of images every day. Brazil’s National Institute for Space Research (INPE) holds more than 130 terabytes of image data, covering 30 years of remote sensing activities, which are available in a database with free online access.

Current societal problems demand smart exploration of the vast and growing body of remote sensing data. There is a need to understand relevant data and use it effectively and efficiently. Although valuable information is contained in image repositories, the volume and complexity of this data make it difficult (generally impossible) for human beings to extract strategic information (knowledge) without appropriate tools (Piatetsky-Shapiro, Djeraba, Getoor, Grossman, Feldman & Zaki, 2006).

Data mining research has enabled powerful tools, new technologies and challenging techniques for relevant data domains. However, large image datasets need specific analysis resources and smart techniques and methodologies. The availability of huge remote sensing image repositories demands appropriate resources to explore this data.

A vast remote sensing database is a collection of landscape snapshots, which supplies a unique opportunity to understand how, when and where changes occurred in the world. When such rich data is not analyzed, or is analyzed inefficiently, relevant information for understanding complex processes and helping to solve challenging problems is wasted.

General Perspective and Objectives of the Chapter

In this chapter, which extends previous work (Silva, Câmara, Souza, Valeriano, & Escada, 2005), the authors intend to present relevant definitions in the remote sensing and image mining domain, together with related work in this field and the importance of appropriate tools and techniques to explore satellite images and extract strategic knowledge from this kind of data.

They also discuss the Amazonia deforestation problem to demonstrate, through an image mining process, the strength of this approach in identifying patterns and fighting the increase of affected areas in this forest. Technologies developed to support the process will be presented, providing an overview of the methodologies, tools and techniques involved in research efforts.

The future trends and conclusion sections offer elements for reflection on classical and new mining resources to deal with challenging demands, citing limitations and also revealing directions for new research initiatives and relevant problems.

REMOTE SENSING AND IMAGE MINING

Broad Definitions

The first operational remote sensing satellite (LANDSAT-1) was launched in 1972; since then there has been extensive worldwide experience in gathering, processing and analyzing remotely sensed data. According to the Canada Centre for Remote Sensing (2003), remote sensing is the science (and to some extent, art) of
acquiring information about the Earth’s surface without actually being in contact with it. In other words, remote sensing is a field of applied sciences for acquiring information about the Earth’s surface through devices that sense and record reflected or emitted energy, followed by processing, analysis, and application of this information. Such devices are called remote sensors, and they are carried on board remote sensing aircraft or satellites (the latter also called Earth observation satellites). Images obtained through remote sensing acquisition and processing are used in many fields, since information from these images is strongly demanded in many areas, including government, economy, infrastructure, and hydrology (e.g., security and social purposes, crop forecasting, urban planning, water resources monitoring).

In the image acquisition process, four concepts are fundamental: spatial, spectral, radiometric and temporal resolution. The spatial resolution defines the detail level of an image; that is, if a sensor has a spatial resolution of 20 m, then each pixel represents an area of 20 m x 20 m. The spectral resolution determines the sensor’s capability to define short intervals of wavelength; the finer the spectral resolution, the narrower the wavelength range for a particular channel or band. The radiometric resolution of an imaging system describes its ability to discriminate very slight differences in energy; the finer the radiometric resolution of a sensor, the more sensitive it is to small differences in reflected or emitted energy. The temporal resolution determines the time necessary for the sensor to revisit a specific target and image the exact same area, that is, the time required to complete one entire orbit cycle; if a sensor is able to obtain an image of an area every 16 days, then its temporal resolution is this period (Canada Centre for Remote Sensing, 2003).

Before getting into remote sensing image mining, it is necessary to define spatial data mining, which refers to the extraction of knowledge, spatial relationships, or other interesting but not explicit patterns stored in spatial databases. Such mining approaches integrate spatial database and data mining issues, bringing valuable resources for understanding facts and processes represented in spatial data, discovering spatial relationships, building up spatial knowledge bases, and revealing spatial patterns and processes contained in spatial repositories. Applications of the technology include, beyond remote sensing, geographic information systems, medical imaging, geomarketing, navigation, traffic control, environmental studies, and many other areas where spatial data are used (Han & Kamber, 2001).

Remote sensing image mining deals specifically with the challenge of capturing patterns, processes, and agents present in geographic space, in order to extract specific knowledge to understand or to make decisions related to a set of relevant topics, including land change, climate variations and biodiversity studies. Deforestation patterns, weather change correlations and species dynamics are examples of precious knowledge contained in remote sensing image repositories.

The Amazonia forest, located in South America, covers 6,500,000 km² and involves seven frontier countries. Brazil holds 63.4% of South American Amazonia, which extends over the following Brazilian states: Mato Grosso, Tocantins, Maranhão, Amazonas, Pará, Acre, Amapá, Rondônia and Roraima. Since it is the world’s largest tropical forest, deforestation in the Amazonia rainforest is an important contributor to global land change. According to INPE’s estimates, close to 200,000 km² of forest were cut in Brazilian Amazonia in the period from 1995 to 2005 (INPE, 2005). INPE uses LANDSAT and CBERS images to provide yearly assessments of the deforestation in Amazonia. Given the extent of deforestation in tropical forests, figuring out the processes and their agents is an important issue for setting up public policies that can help preserve the environment (Figure 1).


Figure 1. Amazonia deforestation (source: Isabel Escada - INPE)
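
As a rough illustration of the spatial and temporal resolution definitions given earlier in this section, the following Python sketch computes the ground area covered by one pixel, the number of pixels needed to tile a given area, and the approximate number of revisits per year. The function names are hypothetical, and the 20 m and 16-day values are simply the examples used in the text; real sensors have additional geometric effects that this sketch ignores.

```python
def pixel_area_m2(spatial_resolution_m):
    """Ground area (in square meters) covered by one square pixel."""
    return spatial_resolution_m ** 2


def pixels_to_cover(area_km2, spatial_resolution_m):
    """Number of pixels needed to tile an area given in square kilometers."""
    area_m2 = area_km2 * 1_000_000  # 1 km^2 = 1,000,000 m^2
    return area_m2 / pixel_area_m2(spatial_resolution_m)


def revisits_per_year(temporal_resolution_days, days_in_year=365):
    """Whole number of revisits of the same area within one year."""
    return days_in_year // temporal_resolution_days


# Values taken from the examples in the text (20 m pixels, 16-day revisit).
area_per_pixel = pixel_area_m2(20)
pixels_per_km2 = pixels_to_cover(1, 20)
yearly_revisits = revisits_per_year(16)
```

Under these assumptions a 20 m sensor covers 400 m² per pixel, so one square kilometer corresponds to 2,500 pixels, and a 16-day cycle yields 22 complete revisits per year.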

Related Work

Nagao and Matsuyama (1980) developed, at Kyoto University (Japan), the first high-level vision system for aerial image interpretation. The system’s processing modules operated on a common dataset. The analysis process was divided into the following steps: smoothing, when the images were processed to remove noise and spots on boundaries; segmentation, when elementary regions were extracted through a basic region growing algorithm; global examination of the scene, to estimate object domains using image metadata; detailed area analysis, when object detection subsystems analyzed a knowledge base to find specific objects; and communication among object detection subsystems, in order to control the analysis flow by managing the information in databases, resolve conflicts among detection subsystems and correct segmentation problems.

GeoMiner (Han, Koperski & Stefanovic, 1997), developed at Simon Fraser University (Canada), is a prototype spatial data mining system, with resources to characterize spatial data through rules, compare, associate, classify and group datasets, analyze patterns and perform mining tasks at different levels. The prototype has a language for mining tasks on spatial data (GMQL), as well as visualization tools for data and spatial mining results. GeoMiner is integrated with data warehousing technology, and it is able to access different spatial database servers.

The SPIN! project (May & Savinov, 2002), developed by the Fraunhofer Institute for Autonomous Intelligent Systems (Germany), is focused on producing a spatial data mining system that integrates Geographic Information Systems and data mining in an open, extensible and tightly coupled framework. The project prioritizes issues like scalability, security, multiuser access, robustness, and platform independence. Its functionality levels include data access and management, interactive thematic mapping for statistical data visualization, and detection and explanation of spatial clusters and spatial events.

ADaM, a NASA project with the University of Alabama at Huntsville (USA), is a set of scientific data and image mining tools (Rushing, Ramachandran, Nair, Graves, Welch & Lin, 2005). Its resources include pattern recognition, image processing, optimization, and association rule mining, among others. The system is a set of components that may be put together to perform complex tasks. A focus of the project is the efficient implementation of performance-critical components, keeping each component of the system as independent as possible, in order to enable the use of appropriate module subsets for specific applications, including linking to third-party software.

Position about the Technology

Such initiatives, among other important projects led by institutions and researchers of different countries since 1980, demonstrate the relevance of, strength of and demand for efficient and robust approaches, since the mining process on image repositories demands a strong commitment to efficiency and robustness. The huge volume of the datasets needs an efficient hardware and software infrastructure. The relativity of values, the spatial complexity, and the multitude of interpretations require robust implementations, competent domain specialists and experienced data analysts to perform the mining tasks.

However, only a limited capacity is available for extracting information from large remote sensing image databases. Currently, most image processing techniques are designed to operate on a single image, and there are few algorithms and techniques for handling multitemporal images. This situation has led to a “knowledge gap” in the process of deriving information from images and digital maps (MacDonald, 2002). This “knowledge gap” has arisen because there are currently very few techniques for image data mining and information extraction in large image data sets, and thus researchers are failing to exploit the huge remote sensing data archives.

Although there has been a large research effort in content-based image retrieval (CBIR) techniques (Rui, Huang & Chang, 1999; Smeulders, Worring, Santini, Gupta & Jain, 2000; Wang, Khan & Breen, 2002), the specific problem of mining remote sensing image databases has received much less attention. Proposals such as VISIMINE (Aksoy, Koperski, Tusk, & Marchisio, 2004) and KIM (Schröder, Rehrauer, Seidel & Datcu, 2000) are focused on clustering methods that operate on the feature space, the multidimensional space which is created by the different spectral bands of a remote sensing image. These techniques are useful for distinguishing spectral signatures of different land cover types, such as finding areas which are classified as “lakes,” “cities” or “forests.”

Nevertheless, in remote sensing image mining, one of the most important challenges is tracking patterns of land use change. A large remote sensing image database is a collection of snapshots of landscapes, which provide a unique opportunity for understanding how, when, and where changes take place in the world. Extensive fieldwork also indicates that the different actors involved in land cover change (e.g., small-scale farmers, large plantations, cattle ranchers) can be distinguished by their different spatial patterns of land use (Lambin, Geist & Lepers, 2003). Furthermore, these patterns evolve in time; new small farms will be created and large farms will increase their agricultural area at the expense of the forest. In these and related situations, patterns of land use change will have similar spectral signatures, and image mining techniques based on clustering in the feature space will not be able to distinguish between them. Therefore, tracking the temporal evolution of patterns in remote sensing imagery requires methods that are different from standard content-based image retrieval (CBIR) systems. A typical CBIR system uses a query image as the source and images in the database as targets, and query results are a set of images sorted by feature similarities with respect to the source (Chen, Wang & Krovetz, 2003). When searching for patterns in remote sensing image databases, a different approach is necessary. Instead of similarity searches between image pairs, a system for mining remote sensing image databases must be able to do similarity searches between patterns found in different images. Therefore, mining remote sensing image databases is searching for patterns of change, not searching for internal content.
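
The typical CBIR query described in this section, in which database images are sorted by feature similarity with respect to a query image, can be sketched as follows. This is a minimal illustration and not the method of any system cited here: the three-element feature vectors, the scene names, and the choice of Euclidean distance as the similarity measure are all assumptions made for the example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_by_similarity(query_features, database):
    """Return image ids sorted from most to least similar to the query.

    `database` maps an image id to its feature vector (hypothetical layout).
    """
    return sorted(
        database,
        key=lambda image_id: euclidean(query_features, database[image_id]),
    )


# Hypothetical per-image features, e.g. mean values of three spectral bands.
database = {
    "scene_a": [0.10, 0.80, 0.30],
    "scene_b": [0.90, 0.20, 0.40],
    "scene_c": [0.15, 0.75, 0.35],
}

ranking = rank_by_similarity([0.12, 0.78, 0.32], database)
```

A ranking of this kind compares whole images with one another. It does not by itself compare patterns found inside different images, which is the requirement the text identifies for mining patterns of change.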


CHALLENGES AND TECHNOLOGICAL STRATEGIES ON DEFORESTATION ISSUE

Brazil’s Challenge: Monitor and Decrease Amazonia Deforestation

Land cover describes the physical state of the land surface, which may be forest, water, buildings, and so on. Changes in this cover may be caused by climate variations, changes in river courses, and so on. However, most changes in land cover are attributed to human activities. Such modifications imply changes in the extent (area increase or decrease) of a specific type of cover. Land use, influenced by human activities and environmental processes and features, relates to the purpose for which the land is used, such as agriculture, habitation, mining, and leisure, among others. Land use changes occur at several spatial levels and in different periods, characterizing the environmental and human dynamics of territorial segments (Briassoulis, 2004).

Desertification, climate change, and biodiversity loss, among others, can have severe consequences for the environment and consequently for humans. The conversion of forest and crop areas to urban use is an important land change, due to its serious implications. The causes and consequences of land use and cover change, and its social, economic and environmental impacts, have motivated different research projects. One of them (Lambin, 1999) emphasizes that land cover change is an important global change factor, interacting with climate, ecosystem processes, biochemical cycles, biodiversity and even with human activities. The key issues of the project deal with land cover patterns, change processes, human response to changes, integrated global and local models, development of databases about the Earth’s surface, biophysical processes and fundamental factors. This approach aims to increase understanding and obtain new knowledge about interactive land change.

The Amazonia case is characterized by the complexity, dimension, and interests involved in the issues concerning land change (Becker, 1997). Alves (2002) presents an investigation of the spatiotemporal deforestation dynamics of Amazonia, using remote sensing images to analyze deforestation spatial patterns in the 1970s, and between 1991 and 1997. This work brings valuable information: the deforested area increased from 10,000,000 ha (1970s) to 59,000,000 ha in 2000; an intensification of the deforestation rate in the 1970s and 1980s was caused by federal government policies, which included huge highway infrastructures and a roadside colonization of 100 km along the extended highways; and analysis of the images and the patterns makes it clear that, beyond the roadside deforestation along main roads and development areas, small deforested areas continue to merge, giving rise to large ones.

Since the fast deforestation process causes land degradation, social tension and irregular urbanization, the faster areas with these tendencies are precisely identified, the higher the chances of preventing, managing and reducing the consequences of the process. Every day, different satellites capture data belonging to this context, and the resulting images are available to many institutions. Image mining tools can, in fact, increase the analysis potential of such huge strategic data.

Developed Technologies at INPE Concerning Image Analysis and Mining

Researchers of Brazil’s National Institute for Space Research (INPE) have been studying the structural patterns of Amazonia and hold wide know-how on forest issues. Moreover, the historical development process is also a research topic at INPE, which maintains a rich dataset of remote sensing images that provide an extensive spatiotemporal perspective of the Amazonia territory. In addition, the Institute’s experience in image processing and analysis, as well as the development
of methodologies and software tools, supplies important elements to keep building up image analysis and mining technologies. In this context, relevant ones developed at INPE are SPRING, TerraLib, CBERS, PRODES and DETER, which are freely available on the Internet.

SPRING (www.dpi.inpe.br/spring) is a state-of-the-art geographic information system (GIS) and remote sensing image processing system with an object-oriented data model which provides the integration of raster and vector data representations in a single environment (Figure 2). SPRING’s main features include: an integrated GIS for environmental, socioeconomic and urban planning applications; a multiplatform system, including support for Windows and Linux; and widely accessible freeware for the GIS community with a quick learning curve. The software is a mechanism for diffusing the knowledge developed by INPE and its partners through the introduction of new algorithms and methodologies (Câmara, Souza, Freitas & Garrido 1996).

Figure 2. SPRING: image processing and geographic information system

TerraLib (www.dpi.inpe.br/terralib) is a library of GIS classes and functions, available from the Internet as open source, allowing a collaborative environment and its use for the development of multiple GIS tools. Its main objective is to enable the development of a new generation of GIS applications, based on the technological advances in spatial databases. TerraLib is free software developed by INPE and its partners. The main motivation for this project is the current lack of either public or commercial GIS libraries that provide components for the diversity of GIS data and algorithms, especially in view of the latest advances in geographical information science. On the practical side, TerraLib enables quick development of custom-built geographical applications using spatial databases. As a research tool, TerraLib is aimed at providing a rich and powerful environment for the development of GIScience research, enabling the implementation of GIS prototypes that include new concepts such as spatio-temporal data models, geographical ontologies and advanced spatial analysis techniques (Câmara, Vinhas, Souza, Paiva, Monteiro, Carvalho, 2001).

The CBERS program (http://www.cbers.inpe.br/en/index_en.htm), a joint effort of Brazil and China, embodied the development and construction of two remote sensing satellites that carry on-board imaging cameras and additionally a repeater
for the Brazilian System of Environmental Data Collection. CBERS-1 and CBERS-2 are identical in their technical structure, space mission and payload (on-board equipment like cameras, sensors, computers, among other equipment designed for scientific experiments). CBERS-1 was launched by the Chinese Long March 4B launcher from the Taiyuan Launch Base on October 14, 1999. CBERS-2 was launched on October 21, 2003 from the Taiyuan Satellite Launch Center in China (Figure 3). CBERS-2 was integrated and tested in the integration and test laboratory of INPE. The CBERS satellite has a set of sensors, namely WFI (wide field imager), CCD (charge coupled device high resolution imaging camera) and IRMSS (infrared multispectral scanner), with a high potential to meet multiple application requirements including: forestry alteration, signs of recent fires, monitoring of agricultural development, support for crop forecasting, identification of anthropic anomalies, analysis of natural recurrent events, mapping of land use, urban sprawl, identification of water-continent borders, coast studies and management, reservoir monitoring, acquisition of stereoscopic images for proper cartographic analysis, support for soil survey and geology, and generation of support material for educational activities. In 2002, both governments decided to expand the initial agreement by including CBERS-3 and 4, which are to be launched, respectively, in 2008 and 2012. The program objectives are to build a family of remote sensing satellites to support the needs of users in Earth resource applications, and to improve the industrial capabilities of space technology in Brazil.

Figure 3. CBERS-2: Launch and Web interface of image catalog (source: www.cbers.inpe.br)

Since 1988 INPE has been monitoring Brazilian Amazonia using satellite images, producing estimates of the annual deforestation rates of the forest through the PRODES project (Amazônia deforestation calculation program). From 2002 on, these estimates have been generated by digital image classification with the PRODES methodology (www.obt.inpe.br/prodes/). The main advantage of this approach is the precision of the georeferenced deforestation polygons, enabling a multitemporal geographic database. Using deforestation increments identified on each image, the annual rates are estimated for August 1st of the reference year. For the 2003/2004 period, the deforestation rates were obtained from 207 LANDSAT images; INPE estimates that the deforestation from August 2003 to August 2004 was 27,429 km². For the 2004/2005 period, the deforestation rates were obtained from 211 classified LANDSAT images; INPE estimates that the deforestation from August 2004 to August 2005 was 18,793 km². The Institute estimates a deforested area of 13,100 km² for the 2005/2006 period. PRODES digital results of 2000 until 2004
are available on SPRING databases containing LANDSAT satellite images, the thematic map of the deforestation of the year, the thematic map of the accumulated deforestation, and the shapefiles of the year with polygons of the deforestation increment of the year, forest, total accumulated deforestation until the previous year, clouds and non-forest. From 2005 on, the shapefile of the deforestation thematic map of the year for each LANDSAT image and the shapefile of the mosaic of all images are also available.

The DETER system (deforestation detection in real time) uses sensors with high observation frequency to reduce cloud cover limitations during the process of detecting deforestation increments (www.obt.inpe.br/deter). The instruments used are the MODIS sensor, aboard the TERRA and AQUA satellites (NASA), with a spatial resolution of 250 m and a temporal resolution (over Brazil) of three to five days, and the WFI sensor, aboard CBERS-2, with a spatial resolution of 260 m and a temporal resolution of five days. These resolutions enable the detection of recently deforested areas larger than 0.25 km². The results of the methodology, which produces information in almost real time about regions where new deforestation areas occur, allow DETER to supply environmental surveillance institutions with periodic information about deforestation events (Figure 4). The goal of the system is not the estimation of the total deforested area in Amazonia, since estimates obtained through DETER are error-prone due to the spatial resolution of MODIS and WFI. The system is concerned with supplying recent and updated information to support government actions against forest destruction, using the higher temporal resolution of the sensors. The system is an INPE project, part of a federal plan for reducing Amazonia deforestation.

Figure 4. Amazonia deforestation process detected by DETER

INPE’s Effort to Spread Earth Observation Technologies

There is a need to advance global land observation, since the world is changing rapidly. Global land observation is a crucial need for the world, and Earth observation (EO) systems are a public good. INPE’s effort on advanced policies for the development of state-of-the-art software, hardware, methodologies and products rests on the need to build capacity in EO to meet the wide demand in the area.

Building capacity in Earth observation implies removing the barriers that keep all sectors of society from using publicly funded EO data. Three relevant obstacles are: lack of data (much EO data is expensive or unavailable), lack of tools (since
good software is required to explore EO data), and lack of expertise (it is necessary to build capacity at a massive scale). INPE's approach to overcoming these barriers is to make EO data free, produce good open source software for EO data handling, and provide open access to on-line training and to scientific literature.

The Internet has reduced the cost of data distribution to very close to zero, and society responds very quickly to the open availability of free data and goods on the Web. CBERS images received in Brazil are freely available on the Internet for Brazilian and Latin American users, and CBERS images received in China are freely available on the Internet for Chinese users (www.cbers.inpe.br). Free EO data and free EO technology create new users and new applications, increasing the need for other types of EO data. Private companies, for example, report that free CBERS data enables new business development, facilitates trial uses for new clients, creates jobs by reducing the cost of data purchases, increases work quality by adding data previously unavailable, and eases the planning of new applications.

The commercial EO market in many countries does not generate enough income to invest in research and development, since it is still a small market. For it to grow, information extraction must be improved through high-quality software. Concerning the tool challenge, INPE developed GIS and image-processing software (TerraLib and SPRING), available free on the Internet, providing good software for EO data handling (www.dpi.inpe.br/spring; www.dpi.inpe.br/terralib).

The research system on EO in the developed world discourages the production of training material, since academic institutions in the US and Europe graduate qualified personnel and there are good books on GIS and remote sensing (unfortunately, these books are in English and are expensive). Developing countries need innovative responses, especially good training material and on-line books. The Brazilian experience is overcoming the expertise challenge by releasing free books online, a three-volume set: Introduction to GIS, Spatial Analysis, and Spatial Databases (http://www.dpi.inpe.br/gilberto/livros.html).

INPE is focusing on the "white-box" model: results = people + data + software. This means support for people learning by doing and using, timely and free geospatial datasets, and adequate data analysis and integration software. The results: an enormous demand for remote sensing data in developing countries, a significant increase in the number of users of Earth observation data due to free online data access, and the success of the CBERS data policy, which has been extremely well received by government and society in Brazil.

Detecting Deforestation Patterns through Satellites

Patterns of Change in Remote Sensing Image Databases

Given a large remote sensing image database, researchers would like to explore the database with questions such as: What are the different land use patterns present in the database? When did a certain land use pattern emerge? What are the dominant land use patterns for each region? How do patterns emerge and change over time? Answering these and similar questions requires data mining techniques that can search for patterns found in different images. Silva (2006) approached this problem by using spatial patterns as a means of describing relevant semantic features of an image.

The primary consideration is that the instruments onboard remote sensing satellites capture energy at different parts of the electromagnetic spectrum, which is then converted into digital imagery. These instruments are not designed for a specific application, but are a compromise between sensor technology and requirements from


different user communities. As a result, remote sensing images have a structural description which is independent of the application domain that a scientist employs to extract information. The image domain and the application domain are distinguished, as shown in Figure 5:

• Spatial patterns: The geometric structures that can be extracted from the images using techniques for feature extraction, segmentation, and image classification. They must be identified and labeled according to a typology which expresses their semantics. Examples of such patterns include corridor-like regions and regular-shaped polygons representing patterns of the mined data.
• Application concepts: The different classes of spatial objects, which are associated with a specific domain. For example, in deforestation assessments, concepts include large-scale agriculture, small-scale agriculture, cattle ranching, and wood logging.

To associate structures found in the image with concepts in the application, there is a structural classifier, which is able to relate the same structures to different application domains. This strategy differs from most remote sensing image database mining systems, such as KIM (Schröder et al., 2000) and VISIMINE (Aksoy et al., 2004), which implicitly assume that there is one "best fit" for associating semantic concepts in the user domains to image-derived structures. In this approach, different structural classifiers will produce different associations between spatial patterns and the user domain concepts, and each association is valid within a given application context. In other words, there are many ways to bridge the "sensory gap", and a single "best fit" should not be sought. For each type of application, there will be an appropriate structural classifier.

In what follows, the methodology for image mining is described and applied to the problem of mining patterns in INPE's remote sensing image database. In this context, the application domain is concerned with describing land use change in tropical forests using remote sensing satellites.

Methodology for Mining Land Use Patterns on Remote Sensing Images

The methodology for image mining in large remote sensing databases uses the application-dependent structural classifier, as outlined previously. The methodology consists of three steps:

• Definition of a spatial pattern typology according to the user's application domain (Figure 6).
• Building a reference set of spatial patterns. This reference set is built using a prototypical set of images. Landscape objects are identified and labeled: the identification employs image segmentation and the labeling is performed according to the spatial pattern typology (Figure 7).
• Mining the database using a structural classifier (guided by the application concepts of the domain), matching the reference set of spatial patterns to the landscape objects

Figure 5. Overview of pattern mining process


identified in images, thus revealing the spatial configurations present in each image (Figure 9).

Defining a Spatial Pattern Typology

The first phase of the methodology calls for the definition of a spatial pattern typology which is associated with a given application domain (Escada, Monteiro, Aguiar, Carneiro & Câmara, 2005). To illustrate the proposal, a typology defined for mapping different types of land use change in tropical forests will be used.

When using remote sensing images for understanding the forces driving changes in tropical forests, the assumption is that the expression of change is captured by changes in land use. Extensive fieldwork also indicates that the different actors involved in land use change (small-scale farmers, large plantations, cattle ranchers) can be distinguished by their different patterns of land use (Lambin, Geist & Lepers, 2003). Lambin et al. propose a typology of land use patterns in terms of deforestation processes (see Figure 6): corridor (commonly associated with riverside and roadside colonization), diffuse (generally related to smallholder subsistence agriculture), fishbone (typical of planned settlement schemes), and geometric (frequently linked to large-scale clearings for modern sector activities).

Three spatial patterns from Lambin's typology will be used (corridor, diffuse, geometric). They are related to the structures of landscape objects through a cognitive assessment process, in which a human specialist associates landscape objects with elements of the spatial pattern typology.

Building a Reference Dataset of Spatial Patterns

To represent the structures detected in remote sensing images, the concept of a landscape object will be introduced. A landscape object is a structure detected in a remote sensing image by means of an image segmentation algorithm. Landscape objects can be associated with different types of spatial patterns.

To build a reference set of spatial patterns (Figure 7), a set of prototypical landscape objects is extracted from a set of sample images. Segmentation algorithms are used to partition the image into regions which are spatially continuous, disjoint and homogeneous. Recent surveys (Meinel & Neubert, 2004) indicate that region-growing approaches are well suited for producing closed and homogeneous regions. This proposal adopts the region-growing segmentation algorithm developed by INPE (Bins, 1996) and implemented in the SPRING

Figure 6. Spatial patterns of tropical deforestation (from left to right): corridor, diffuse, fishbone, and
geometric (Source: Lambin, Geist & Lepers, 2003)


software system (Câmara, 1996). This algorithm has been extensively validated for extracting land use patterns in tropical forests (Shimabukuro et al., 1998) and has been very favorably reviewed in a survey (Meinel & Neubert, 2004).

SPRING's region growing algorithm works as follows (Figure 8) (Bins, 1996): (a) the image is first segmented into atomic cells of one or a few pixels; (b) each segment is compared with its neighbors to determine if they are similar or not. If similar, they are merged and the mean gray level of the new segment is updated; (c) the segment continues growing by comparing it with all the neighbors until there is no remaining joinable region, at which point the segment is labeled as a completed region; (d) the process moves to the next uncompleted cell, repeating the entire sequence until all cells are labeled. The algorithm requires two parameters: a similarity threshold value, and an area threshold value.

Mining the Database Using a Structural Classifier

Once the reference set of spatial patterns is built, the next phase will use it to mine spatial configura-

Figure 7. Building a reference set of spatial patterns

Figure 8. Example of a segmentation process
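
The region-growing steps (a) to (d) described for SPRING's algorithm can be sketched in a few lines of Python. This is a simplified illustration, not INPE's implementation: the function name region_grow, the 4-neighbor rule, the use of the absolute difference from the running region mean as the similarity test, and the toy image are all assumptions made here, and the area threshold (typically used to absorb regions that are too small) is omitted.

```python
from collections import deque


def region_grow(image, similarity):
    """Label a 2-D grid of gray levels by simple region growing.

    Hypothetical helper: returns (labels, regions), where regions maps
    each label to a (pixel count, mean gray level) pair.
    """
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]  # 0 means "not labeled yet"
    regions = {}
    next_label = 1
    for r0 in range(rows):
        for c0 in range(cols):
            if labels[r0][c0]:
                continue  # cell already belongs to a completed region
            # (a) start from an atomic cell of one pixel
            labels[r0][c0] = next_label
            count, mean = 1, float(image[r0][c0])
            frontier = deque([(r0, c0)])
            # (b)-(c) merge similar 4-neighbors and update the region mean
            while frontier:
                r, c = frontier.popleft()
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    inside = 0 <= nr < rows and 0 <= nc < cols
                    if inside and not labels[nr][nc]:
                        if abs(image[nr][nc] - mean) <= similarity:
                            labels[nr][nc] = next_label
                            mean = (mean * count + image[nr][nc]) / (count + 1)
                            count += 1
                            frontier.append((nr, nc))
            # the region is complete: no joinable neighbor remains
            regions[next_label] = (count, mean)
            # (d) move on to the next unlabeled cell
            next_label += 1
    return labels, regions
```

On a small grid with two gray-level plateaus, a call such as region_grow(image, similarity=5) separates the plateaus into two labeled regions; a lower similarity threshold tends to produce more, smaller regions.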


Figure 9. Obtaining spatial configurations

tions from image databases. The structural classifier enables the association between landscape objects extracted from images and the reference set of spatial patterns (Figure 9).

The structural classifier must be able to distinguish between different spatial patterns. It uses the C4.5 classifier (Quinlan, 1993), a classification method based on a decision tree. It predicts the value of a categorical attribute (Witten & Frank, 1999) based on noncategorical attributes. The categorical attribute is the pattern type and the noncategorical attributes are a set of numerical attributes that characterize each pattern.

To select the attributes that distinguish the different types of land use patterns, the concepts from landscape ecology (Turner, 1989) are used. Landscape ecology is based on the notion that environmental patterns strongly influence ecological processes. One of the key components of landscape ecology theory is the definition of metrics that characterize geometric and spatial properties of categorical map patterns (McGarigal, 2002). The pattern metrics used in landscape ecology include metrics of spatial configuration that operate at the patch level. Patches form the building blocks for categorical maps and within-patch heterogeneity is ignored. Patch metrics refer to the spatial character and arrangement, position, or orientation of patches within the landscape. The pattern metrics proposed by the FRAGSTATS (Spatial Pattern Analysis Program for Categorical Maps) software (McGarigal & Marks, 1995) are used, which include:

• Perimeter (m) and area (ha).
• Para (perimeter-area ratio): A measure of shape complexity.
• Shape (shape index): Patch perimeter divided by the minimum perimeter possible for a maximally compact patch of the corresponding patch area.
• Frac (fractal dimension index): Two times the logarithm of patch perimeter (m) divided by the logarithm of patch area (m²).
• Circle (related circumscribing circle): 1 minus patch area (m²) divided by the area (m²) of the smallest circumscribing circle.


• Contig (contiguity index): Equals the average contiguity value for the cells in a patch.

The landscape ecology metrics are fed into the C4.5 classification algorithm to distinguish the different types of spatial patterns. After this classifier is properly trained, it can be used to label the landscape objects found in other images. Therefore, for each image in the database, this procedure identifies the number and location of the different types of spatial patterns. A specific set of spatial patterns found in an image is referred to as a spatial configuration.

By identifying the spatial configurations of different images, the user will be able to evaluate the emergence and evolution of different types of change. Each spatial pattern is associated with a different type of land use change. Therefore, the comparison between spatial configurations of images in different locations and between spatial configurations of images at the same location at different times will allow new insights into the processes and actors that bring about change.

Table 1. Land use change in tropical forests

Landscape object     Land use change
Corridor pattern     Roadside colonization
                     Riverside deforestation
Diffuse pattern      Smallholder agriculture
                     Small deforestation increments
Geometric pattern    Large farms

Case Study: Image Mining for Deforestation Patterns

Controlling deforestation in the Amazon rain forest is a difficult challenge for Brazil, since the causes of deforestation include economic, social and political factors, and the current pace of land use change is substantial, with a deforested area of about 200,000 km² during the decade 1995-2005. The situation demands fast and effective actions for reducing this pace of devastation. In order to monitor the extremely fast process of land use change in Amazonia, it is very important that INPE be able to use its huge data archive to the maximum extent possible. In this context, the image mining methodology was used to achieve a better understanding of the processes of land use change in Amazonia.

A case study was developed using Landsat TM images (225/64, 226/64, 226/65, 225/65) of 1997, 2000, 2001, 2002 and 2003, which cover the region of São Félix do Xingu in the state of Pará. This is a region with many violent land conflicts and one of the largest annual rates of deforestation in Amazônia (INPE, 2005). The main land use activity in São Félix do Xingu is cattle ranching, which holds around 10% of the cattle of Pará state (Américo, Vieira, Veiga & Araujo, in press). Deforestation in the region has two main agents: migrants, who have settled in small areas, and large cattle ranchers, many of whom have occupied land illegally (Escada, Vieira, Amaral, Araújo, Veiga, Aguiar, & Veiga, 2005). The images

Figure 10. Spatial patterns representing corridor, diffuse, and geometric patterns
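
The patch metrics listed for the structural classifier follow directly from a patch's perimeter and area. The sketch below is illustrative only: the function and argument names are hypothetical, the shape index assumes a square as the maximally compact reference (the raster convention; a circle would be the reference for vector data), and the radius of the circumscribing circle is supplied by the caller rather than computed from the patch geometry. Contig is left out because it requires the cell-level raster.

```python
import math


def patch_metrics(perimeter_m, area_m2, circumscribing_radius_m):
    """Return a dict of patch-level metrics for one landscape object.

    Hypothetical helper; inputs are in meters and square meters.
    """
    # Para: perimeter-area ratio, a simple measure of shape complexity
    para = perimeter_m / area_m2
    # Shape: perimeter over the perimeter of a square of the same area
    # (assumption: square reference; equals 1.0 for a perfect square)
    shape = perimeter_m / (4.0 * math.sqrt(area_m2))
    # Frac: two times log(perimeter) divided by log(area), as in the text
    frac = 2.0 * math.log(perimeter_m) / math.log(area_m2)
    # Circle: 1 minus patch area over the area of the circumscribing circle
    circle = 1.0 - area_m2 / (math.pi * circumscribing_radius_m ** 2)
    return {
        "perimeter_m": perimeter_m,
        "area_ha": area_m2 / 10_000.0,  # 1 ha = 10,000 m²
        "para": para,
        "shape": shape,
        "frac": frac,
        "circle": circle,
    }


# Usage: a 100 m x 100 m square clearing (invented example geometry)
square = patch_metrics(400.0, 10_000.0, 50.0 * math.sqrt(2.0))
```

Elongated objects such as corridors have a larger perimeter for the same area, so their Para, Shape and Circle values are higher than those of compact geometric clearings, which is what allows a classifier to separate the pattern types.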


and deforestation data were provided by the PRODES Project (INPE, 2005). The application concepts for this task are guided by the land use change domain in tropical forests (Table 1).

Building Spatial Patterns

According to the image mining methodology, landscape objects were extracted from prototypical images. Then, a human specialist, through cognitive assessment, obtained spatial patterns based on the spatial pattern typology of tropical deforestation (Figure 6). Spatial patterns are presented in Figure 10.

Obtaining Spatial Configurations

The structural classifier, using the spatial patterns, extracted spatial configurations from the set of images just mentioned. Results are presented below.

In a first case, it is necessary to answer the following question: "What was the behavior of large farmers in São Félix do Xingu during the 1997-2003 period? Is the area of new large farms increasing?"

Figure 11. Large farms dynamic in São Félix do Xingu

Figure 12. Diffuse pattern in São Félix do Xingu 1997-2003
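
C4.5 grows its decision tree by choosing, at each node, the attribute test with the highest gain ratio (information gain divided by split information). A minimal sketch of that criterion for a single numerical attribute follows. The function names are hypothetical, the candidate thresholds are taken as midpoints between consecutive sorted values (a common simplification), and the shape-index values and labels in the usage lines are invented for illustration, not taken from the case study.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gain_ratio) of the best binary split.

    The split sends objects with value <= threshold to the left branch.
    """
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weight = len(left) / len(pairs)
        gain = base - weight * entropy(left) - (1 - weight) * entropy(right)
        # split information: entropy of the branch sizes themselves
        split_info = entropy(["L"] * len(left) + ["R"] * len(right))
        ratio = gain / split_info
        if ratio > best[1]:
            best = (threshold, ratio)
    return best


# Usage with invented shape-index values for six labeled landscape objects
shape_values = [1.1, 1.2, 1.3, 2.4, 2.6, 3.0]
pattern_labels = ["geometric"] * 3 + ["corridor"] * 3
threshold, ratio = best_threshold(shape_values, pattern_labels)
```

A full C4.5 implementation repeats this search over every attribute, recurses on each branch, and prunes the resulting tree; only the split criterion is shown here.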


Observing the evolution of the corresponding spatial configuration (geometric patterns - GEOM) in Figure 11, it was possible to conclude that, "in 2000, this kind of deforestation reached a peak of 55,000 ha, but decreased in the following years. In 2003, the deforestation area associated with large farms decreased to 29,000 ha. This indicates that large farms are reducing their contribution to deforestation."

There is a second question: "What is the distribution of smallholder agriculture and small deforestation increments in the São Félix do Xingu area during the years 1997-2003?" Observing Figure 12, it is possible to conclude: "the distribution of this land use pattern (diffuse) in this period was mainly concentrated in the northeast and southeast of this area."

The next question is: "In the São Félix do Xingu region, is there any dominant land use change pattern?" Observing Figure 13, the conclusion is: "Diffuse pattern represented 61% of total occurrences of land use changes in 2001, indicating an increase in smallholder agriculture / small increments in deforested areas in that year."

Future Trends

A consortium of Earth observation satellites for global land monitoring, a network of cooperating ground stations, free EO data on the Internet with global weekly coverage, satellite sensor resolution improvements, and the availability of web services to perform image mining tasks will provide the necessary resources for new applications and a wide range of demands, especially in developing countries. Moreover, increases in hardware and software performance will support mining processes on huge and improved image datasets, allowing a more intensive and extensive use of satellite image mining in strategic fields like forestry and reservoir monitoring, agricultural expansion, soil survey, analysis of natural phenomena, and urban studies.

Future research directions in remote sensing image mining include tracking individual trajectories of change. Patterns found in one map are linked to those in earlier and later maps, thus enabling a description of the trajectory of change of each landscape object. The current method aggregates landscape objects of the same type. A

Figure 13. Diffuse patterns in São Félix do Xingu
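
Questions such as whether a pattern is dominant, or how much the area of a pattern changed, reduce to simple summaries of a spatial configuration. The sketch below assumes a list of pattern labels per image; the function names are hypothetical and the label counts are invented (chosen so that the diffuse share matches the 61% reported for 2001), while the 55,000 ha and 29,000 ha figures are the geometric-pattern areas quoted in the text for 2000 and 2003.

```python
from collections import Counter


def pattern_shares(labels):
    """Fraction of landscape objects assigned to each spatial pattern."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {pattern: count / total for pattern, count in counts.items()}


def relative_change(old, new):
    """Relative change from old to new (negative means a decrease)."""
    return (new - old) / old


# Hypothetical spatial configuration of one image (counts are invented)
labels_2001 = ["diffuse"] * 61 + ["geometric"] * 25 + ["corridor"] * 14
shares = pattern_shares(labels_2001)
dominant = max(shares, key=shares.get)

# Areas of the geometric pattern quoted in the text (hectares)
large_farm_change = relative_change(55_000, 29_000)
```

With these inputs the dominant pattern is the diffuse one, and the area associated with large farms falls by roughly 47% between 2000 and 2003.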


more sophisticated approach would be to describe the evolution of each landscape object, including operations such as merging of adjacent regions. This description would allow the image-mining tool to describe when two irregular areas of land use (associated with small settlers) were merged. It would also show when the merged region was extended with a regular pattern (suggesting that a large cattle ranch had been established). This description could further increase the ability to understand the land use changes that are detectable in remote sensing image databases.

Conclusion

This chapter presents relevant issues on satellite image mining, describing a method for mining patterns of change that enables extracting spatial arrangements from remote sensing image databases. It addresses the problem of describing land use change. It combines techniques from data mining, digital image processing and landscape ecology to identify patterns in images of distinct dates. The method points out that patch metrics can be used to identify agents of land use change. Images of distinct dates enabled the detection of pattern changes, which are extremely valuable when assessing, managing or preventing deforestation processes.

This methodology enables associating land change objects with causative agents, and it can assist the environmental community to respond to the challenge of understanding and modeling relevant issues in a rapidly changing world. The results from the case study show that image-mining techniques are a step forward in understanding and modeling land use and cover change. The proposed method also enables a more effective use of the large land remote sensing image databases available in agencies such as USGS, ESA and INPE.

The remote sensing image-mining process is an interactive one, since it required sample selection, model building and rating, context evaluation, and returns to specific points of the process, among others. During experiments, the evaluation of results in different phases demonstrated the need for new prototype objects, better model calibration, or even adjustments to the spatial pattern typology. Once these issues were addressed, relevant results were obtained and validated through extensive fieldwork.

Taking into account the heterogeneity of the Amazonia context, a relevant (and expected) issue is that model training and application must be performed in spatially similar regions; that is, training the structural classifier in a specific region and applying it in another region with different spatial features produces inconsistent results. Another limitation of the methodology concerns the quantity and quality of the prototype objects used to generate the model for structural classification. If the number of elements or their ability to distinguish patterns is not appropriate, the generated model (decision tree) will classify many objects inconsistently. The methodology also demands a proper spatial pattern typology, which must characterize the spatial patterns and the semantic aspects that must be detected during the process.

The mining process requires a domain specialist, due to the intense Amazonia dynamics, especially for prototype object selection and during the interpretation of spatial configurations. Further experiments are necessary to improve the method and to test alternatives for image segmentation algorithms and for pattern classifiers. The limitations of the current method are also associated with the two-dimensional nature of land use maps. An extension of the method would combine spatial information (patch metrics) with spectral information (pixel and region trajectories in multitemporal images).

The Uncle Scrooge principle states that "a penny saved is a penny earned." However, the anti-Uncle Scrooge principle reveals that "a pixel saved is a penny wasted." Why is that so? Because "value


comes from use." Coherent EO programs can supply strategic components to meet the enormous demand for remote sensing data, expertise, and analysis tools in developing countries. The resources described in this work may help to leverage the power of detecting, evaluating and reducing the pace of Amazonia deforestation, since INPE holds know-how and a wide spatiotemporal coverage of the forest. Moreover, the present technology can be ported to provide solutions to a broad range of image mining applications.

References

Américo, M. C. S., Vieira, I. C. G., Veiga, J. B. & Araujo, R. (in press). Pecuária e Amazônia: Estratégias sociais e reestruturação do território nas frentes pioneiras: Rodovia PA-279 e região da Terra do Meio no Pará [Cattle ranching and Amazonia: Social strategies and territory reorganization in new frontiers – PA-279 and Terra do Meio region in Pará state]. In R. Araujo & P. Lena (Eds.), Alternativas de desenvolvimento sustentável na Amazônia: Experiências recentes [Alternatives of sustainable development in Amazônia: Recent experiences].

Aksoy, S., Koperski, K., Tusk, C. & Marchisio, G. (2004). Interactive training of advanced classifiers for mining remote sensing image archives. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (pp. 773-782). Seattle, Washington.

Alves, D. S. (2002). Space-time dynamics of deforestation in Brazilian Amazonia. International Journal of Remote Sensing, 23(14).

Bins, L., Fonseca, L. & Erthal, G. (1996). Satellite imagery segmentation: A region growing approach. In Proceedings of the 8th Brazilian Symposium on Remote Sensing (pp. 1-4).

Becker, B. (1997). Amazonia. São Paulo: Atica.

Briassoulis, H. (2004). Analysis of land use change: Theoretical and modeling approaches. Retrieved April 8, 2008, from http://www.rri.wvu.edu/WebBook/Briassoulis

Câmara, G., Souza, R., Freitas, U. & Garrido, J. (1996). SPRING: Integrating Remote Sensing and GIS with object-oriented data modelling. Computers and Graphics, 15(6), 13-22.

Câmara, G., Vinhas, L., Souza, L., Paiva, L., Monteiro, A., Carvalho, M. & Raoult, B. (2001). Design patterns in GIS development: The Terralib experience. In Proceedings of the III Brazilian Symposium in Geoinformatics, GeoInfo 2001, Rio de Janeiro.

Canada Centre for Remote Sensing (2003). Fundamentals of remote sensing. Remote Sensing Tutorial (pp. 5-44). Retrieved April 8, 2008, from www.ccrs.nrcan.gc.ca/ccrs/learn/tutorials/fundam/fundam_e.html

Chen, Y., Wang, J. Z. & Krovetz, R. (2003). CLUE: Cluster-based retrieval of images by unsupervised learning. In K. A. Meraim & I. Bloch (Eds.), Proceedings of the IEEE Seventh International Symposium on Signal Processing and its Applications (pp. 202-231).

Escada, M. I. S., Monteiro, A. M., Aguiar, A. P., Carneiro, T. & Câmara, G. (2005). Análise de padrões e processos de ocupação para a construção de modelos na Amazônia [Analysis of land use patterns and processes for the construction of models in Amazonia]. In Proceedings of the XII Brazilian Symposium on Remote Sensing (pp. 2973-2983), Goiania, Brazil.

Escada, M. I. S., Vieira, I. C. G., Amaral, S., Araújo, R., Veiga, J. B. D., Aguiar, A. P. D., Veiga, I., Oliveira, M., Pereira, J. L. G., Filho, A. C., Fearnside, P. M., Venturieri, A., Carriello, F., Thales, M., Carneiro, T. S., Monteiro, A. M. V., & Câmara, G. (2005). Padrões e processos de ocupação nas novas fronteiras da Amazônia: O interflúvio do Xingu/Iriri [Land use patterns and processes in Amazonian new frontiers: The Xingu/Iriri region]. Estudos Avançados [Advanced Studies], 19, 9-23.

Han, J., Koperski, K. & Stefanovic, N. (1997). GeoMiner: A system prototype for spatial data mining. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 553-556).

Han, J. & Kamber, M. (2001). Data mining - Concepts and techniques. In D. Cerra & H. Severson (Eds.) (pp. 405-412). San Diego, CA: Morgan Kaufmann Publishers.

INPE, National Institute for Space Research (2005). PRODES project - Monitoring the Brazilian Amazon forest using satellites. National Institute for Space Research. Retrieved April 8, 2008, from http://www.obt.inpe.br/prodes

Lambin, E. (1999). Land-use and land-cover change implementation strategy. Retrieved April 8, 2008, from http://www.geo.ucl.ac.be/LUCC/lucc.html

Lambin, E. F., Geist, H. J. & Lepers, E. (2003). Dynamics of land use and land cover change in Tropical Regions. Annual Review of Environment and Resources, 28(1), 205-241.

MacDonald, J. (2002). The Earth observation business and the forces that impact it. In D. Couts (Ed.), Earth observation business network 2002. Vancouver, CA: MacDonald Dettwiler.

May, M. & Savinov, A. (2002). An integrated platform for spatial data mining and interactive visual analysis. In Proceedings of the International Conference on Data Mining Methods and Databases for Engineering (pp. 90-101).

McGarigal, K. & Marks, B. (1995). FRAGSTATS: Spatial pattern analysis program for quantifying landscape structure. USDA Forestry Service Technical Report PNW-351, Washington, DC.

McGarigal, K. (2002). Landscape pattern metrics. In A. H. El-Shaarawi & W. W. Piegorsch (Eds.), Encyclopedia of environmetrics (pp. 1135-1142). Sussex, England: John Wiley & Sons.

Meinel, G. & Neubert, M. (2004). A comparison of segmentation programs for high resolution remote sensing data. International Archives of Photogrammetry and Remote Sensing, 35(1), 1097-1105.

Nagao, M. & Matsuyama, T. (1980). A structural analysis of complex aerial photographs. New York: Plenum Press.

Piatetsky-Shapiro, G., Djeraba, C., Getoor, L., Grossman, R., Feldman, R. & Zaki, M. (2006). What are the grand challenges for data mining? - KDD-2006 Panel Report. SIGKDD Explorations, 8(2), 70-77.

Quinlan, R. (1993). Programs for machine learning. San Francisco: Morgan Kaufmann.

Rui, Y., Huang, T. S. & Chang, S. F. (1999). Image retrieval: Current techniques, promising directions and open issues. Journal of Visual Communication and Image Representation, 10(1), 39-62.

Rushing, J., Ramachandran, R., Nair, U. J., Graves, S. J., Welch, R. & Lin, A. (2005). ADaM: A data mining toolkit for scientists and engineers. Computers and Geosciences, 31(5), 607-618.

Schröder, M., Rehrauer, H., Seidel, K. & Datcu, M. (2000). Interactive learning and probabilistic retrieval in remote sensing image archives. IEEE Transactions on Geoscience and Remote Sensing, 23(1), 2288-2298.

Shimabukuro, Y. et al. (1998). Using shade fraction image segmentation to evaluate deforestation in Landsat thematic mapper images of the Amazon region. International Journal of Remote Sensing, 19(3), 535-541.

Silva, M. P. S., Câmara, G., Souza, R. C. M., Valeriano, D. M. & Escada, M. I. S. (2005). Mining patterns of change in remote sensing image databases. In J. Han & B. Wah (Eds.), Proceedings of the Fifth IEEE International Conference on Data Mining (pp. 362-369).

Silva, M. P. S. (2006). Mineração de Padrões de Mudança em Imagens de Sensoriamento Remoto [Mining patterns of change in remote sensing images] (Unpublished doctoral thesis). São José dos Campos: National Institute for Space Research (INPE).

Smeulders, A. W. M., Worring, M., Santini, S., Gupta, A. & Jain, R. (2000). Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1), 1349-1380.

Turner, M. G. (1989). Landscape ecology: The effect of pattern on process. Annual Review of Ecology and Systematics, 20, 171-197.

Wang, L., Khan, L. & Breen, C. (2002). Object boundary detection for ontology-based image classification. In Proceedings of the Third ACM International Workshop on Multimedia Data Mining (pp. 51-61).

Witten, I. H. & Frank, H. (1999). Data mining: Practical machine learning tools and techniques with Java implementations. San Francisco: Morgan Kaufmann.

Chapter V
Machine Learning
and Web Mining:
Methods and Applications
in Societal Benefit Areas

Georgios Lappas
Technological Educational Institution of Western Macedonia,
Kastoria Campus, Greece

Abstract

This chapter reviews research on machine learning and Web mining methods that are related to areas of social benefit. It shows that machine learning and Web mining methods may provide intelligent Web services of social interest. The chapter reveals a growing interest in using advanced computational methods, such as machine learning and Web mining, to provide better services to the public, as most research identified in the literature has been conducted in recent years. The chapter's objective is to help researchers and academics from different disciplines to understand how Web mining and machine learning methods are applied to Web data. Furthermore, it aims to provide the latest developments in research that is related to societal benefit areas.

Introduction

The Web is constantly becoming a central part of social, cultural, political, educational, academic, and commercial life and contains a wide range of information and applications in areas that are of societal interest. Web mining is the field of data mining that is related to the discovery of knowledge from the Web. The Web can be considered as a tremendously large and rich in content knowledge base of heterogeneous entries without any well specified structure, which proportionally makes the Web at least as complex as any known complex database and perhaps the largest

Copyright © 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Machine Learning and Web Mining

knowledge repository. The vast information that surrounds the Web does not come only from the
content of Websites, but is also related to the usage of Web pages, navigation paths, and the
networking between the links of Web pages. All these properties establish the Web as a very
challenging area for the machine learning community to apply its methods, usually for
extracting new knowledge, discovering interesting patterns, and enhancing the efficiency of
Websites by providing user-demanded content and design.

Web mining is a relatively new and broadly interdisciplinary area, attracting researchers from
computer science fields such as artificial intelligence, machine learning, databases, and
information retrieval; from business studies fields such as marketing, administration, and
e-commerce; and from social and communication studies fields such as social network analysis,
pedagogy, and political science. Herrera-Viedma and Pasi (2006) note that, due to the
complexity of Web research, there is a requirement for interdisciplinary approaches drawing on
statistics, databases, information retrieval, decision theory, artificial intelligence,
cognitive social theory, and behavioral science. Because the area is relatively new, there is
a lot of confusion when comparing research efforts from different points of view (Kosala &
Blockeel, 2000), and therefore there is a need for surveys that record and aggregate efforts
done by independent researchers, provide definitions, and explain structures and taxonomies of
the field from various points of view.

The overall objective of this chapter is to provide a review of different machine learning
approaches to Web mining and draw conclusions on their applicability in societal benefit
areas. The novelty of this review is that it focuses on Web mining in societal benefit areas.
Similar work related to Web mining exists (Baldi, Frasconi, & Smyth, 2003; Chakrabarti, 2003;
Chen & Chau, 2004; Pal, Talwar & Mitra, 2002). Baldi et al. (2003) cover research and theory
on aspects of Internet and Web modeling at the information level based on mathematical,
probabilistic, and graphical treatment. Chakrabarti (2003) focuses on studies that connect
users to the information they seek from the Web, providing many programs with pseudocode. Chen
and Chau (2004) provide an extended review of how machine learning techniques for traditional
information retrieval systems have been improved and adapted for Web mining applications. Pal
et al. (2002) present an overview of machine learning techniques, focusing on a specific Web
mining category, Web content mining, which is described in the next section.

This work differs from the aforementioned related work in that the chapter particularly
focuses on Web mining and machine learning that may help and benefit societal areas by
extracting new knowledge, providing support for decision making, and empowering valuable
management of societal issues. This survey aims to help researchers and academics from
different disciplines to understand Web mining and machine learning methods. Thus, it is aimed
at a relatively broad audience and tries to provide them with a different and more open view
on Web research. Therefore this work addresses researchers from both computer science and
other disciplines with the intention: (a) for computer science researchers, to provide them
with the latest developments on the theory and applications of Web mining, focusing also on
the need for Web mining applications in societal benefit areas, and (b) for researchers from
other disciplines, to draw their attention to existing machine learning methods that may help
them to seek more effective results in their Web research.

Later in the chapter, some background to the different perspectives of Web mining is provided,
with a short review of machine learning methods. Afterwards, a study of related machine
learning methods applied to Web mining is put forward, followed by applications related to
societal benefit areas. Finally it
discusses current trends and future challenges in machine learning and Web mining.

WEB MINING OVERVIEW

The word mining means extracting something useful or valuable, such as mining gold from the
earth. The expectation of discovering useful or valuable information from the Web is captured
in the term “Web mining.” By definition, Web mining refers to the application of data mining
techniques to the World Wide Web (Cooley, Mobasher & Srivastava, 1997), or else is the area of
data mining that refers to the use of algorithms for extracting patterns from resources
distributed in the Web. Over the years, Web mining has been extended to denote the use of data
mining and other similar techniques to discover resources, patterns and knowledge from the Web
and Web-related data (Chen & Chau, 2004).

According to the different sources of data analyzed, Web mining is divided into three mining
categories. Figure 1 shows the Web mining taxonomy and the sources of data that are used in
mining. Moreover, it displays the Web mining categories and the related areas of research
interest:

Figure 1. Taxonomy of Web mining according to source of target data

WEB MINING

  WEB CONTENT MINING
    Data type: Web page content
    Related areas of interest: Search Engines; Intelligent Agents; Personalized Search;
    Structuring Web Content; Semantic Web; Web Data Classification; Meta-data Creation

  WEB USAGE MINING
    Data type: Server logs
    Related areas of interest: Navigational Behavior; Usage Patterns; Web-site
    Personalization; Collaborative Filtering; Content-based Filtering; Rule-Based Filtering;
    Clustering Users/Sessions; Group Targeting; Recommendation Systems; Web Site Evaluation;
    User Satisfaction; Design Efficiency; Site Reconstruction

  WEB STRUCTURE MINING
    Data type: Hyperlinks
    Related areas of interest: Search Engines; Recommendation Systems; Personalization;
    Authority Detection; Community Networks; Social Network Analysis; Community Mining;
    Group Detection

a. Web content mining focuses on the discovery of knowledge from the content of Web
pages, and therefore the target data consist of the multiple types of data contained in a Web
page, such as text, images, multimedia, and so forth.
b. Web usage mining focuses on the discovery of knowledge from user navigation data when
visiting a Website. The target data are requests from users recorded in special files stored
in the Website’s servers called log files.
c. Web structure mining deals with the connectivity of Websites and the extraction of
knowledge from hyperlinks of the Web.

The above taxonomy is now broadly used in Web mining (Scime, 2005) and originates from Cooley
et al. (1997), who introduced Web content mining and Web usage mining, and Kosala and Blockeel
(2000), who added Web structure mining.

A well-known problem related to Web content mining is experienced by any Web user trying to
find relevant Web pages of interest among the huge number of available pages. Current search
tools suffer from low precision due to irrelevant results (Chakrabarti, 2000). Lawrence and
Giles (1999) raise issues related to search engine problems. Search engines are not able to
index all pages, resulting in imprecise and incomplete searches due to information overload.
The overload problem is very difficult to cope with, as information on the Web is immense and
grows dynamically, raising scalability issues.

Moreover, a myriad of text and multimedia data is available on the Web, prompting the need for
intelligent techniques for developing automatic mining using artificial intelligence tools.
Such automatic mining is performed by intelligent systems called “intelligent agents” or
“agents” that search the Web for relevant information using domain characteristics and user
profiles to organize and interpret the discovered information. Agents may be used for
intelligent search, for classification of Web pages, and for personalized search by learning
user preferences and discovering Web sources meeting these preferences.

Web content mining is more than selecting relevant documents on the Web. Web content mining is
related to information extraction and knowledge discovery from analyzing a collection of Web
documents. Related to Web content mining is the effort to organize semistructured Web data
into a structured collection of resources, leading to more efficient querying mechanisms and
more efficient information collection or extraction. This effort is the main characteristic of
the “Semantic Web” (Berners-Lee, Hendler & Lassila, 2001), which is the next Web generation.
The Semantic Web is based on “ontologies,” which are metadata related to the Web page content
that make the site meaningful to search engines. Sebastiani’s study (2002) may be used as a
source for Web content mining.

Web usage mining tries to find patterns of navigational behavior from users visiting a
Website. These patterns of navigational behavior can be valuable when searching for answers to
questions like: How efficient is our Website in delivering information? How do the users
perceive the structure of the Website? Can we predict a user’s next visit? Can we make our
site meet user needs? Can we increase user satisfaction? Can we target specific groups of
users and make Web content personalized to them?

Answers to these questions may come from the analysis of the data from log files stored in Web
servers. A log file is usually a large file that contains all requests of all users to the
Website as they arrive in time. Log files may have various formats according to the
information stored. The most common format uses information about user IP, date and time of
request, type of request (for example, get a Web page), a code denoting whether the request
has been successfully served or failed, and the number of bytes transferred to the user.
However, Web usage mining should not be confused with tools that analyze log files in order to
provide statistics about the site such as:
page hits, times of visits, hits per hour, per day, or per month, and so forth. While this
information might be interesting or valuable for Website owners, such tools offer only shallow
data analysis. Web usage mining is more sophisticated, as it aims to find users’ access
behavior (Levene & Loizou, 1999) and usage patterns (Buchner, Mulvenna, Anand & Hughes, 1999).
It has become a necessary task to provide Web administrators with meaningful information about
users and usage patterns for improving the quality of Web information and service performance
(Eirinaki & Vazirgiannis, 2003; Spiliopoulou & Pohle, 2001; Wang, Abraham & Smith, 2005).
Successful Websites may be those that are customized to meet user preferences both in the
presentation of information and in the relevance of the content that best fits the user.

In this context, Website personalization is the process of customizing the content and
structure of a Website to the specific needs of each user, taking advantage of the user’s
navigational behavior (Eirinaki & Vazirgiannis, 2003). Recommendation systems support Website
personalization by tracking the user’s behavior and recommending items similar to those liked
in the past (content-based learning), or by inviting users to rate objects and state their
preferences and interests so that recommendations can be offered to them based on the ratings
of other users with similar preferences (collaborative filtering), or by asking questions to
the user and providing services tailored to user needs according to the user’s answers
(rule-based filtering).

On the other hand, content-based filtering is the most common method for Web personalization
from server log files and has attracted considerable attention from researchers (Mobasher,
Jain, Han, & Srivastava, 1996; Mobasher, Cooley, & Srivastava, 1999; Ngu & Wu, 1997;
Spiliopoulou, Pohle, & Faulstich, 1999; Srivastava, Cooley, Deshpande, & Tan, 2000; Wolfang &
Lars, 2000) for constructing user models that represent the behavior of users. Such systems
usually apply classification methods or clustering algorithms on Web usage data.

Along this perspective, a common methodology for discovering usage and user behavior patterns
consists of the following steps: reconstructing user sessions, that is, the navigational
sequence of Web pages of a user in the site; comparing them with other users’ sessions; and
clustering or classifying the sessions to extract knowledge of navigational behavior.
Extracted usage and user behavior patterns may be used in targeting specific groups of users;
in various recommendation systems; and in evaluation and reconstruction of the Website to meet
design efficiency issues and user satisfaction requirements. Detailed surveys on Web usage
mining are presented by Faca and Lanzi (2005), and by Srivastava et al. (2000).

Subsequently, Web structure mining is closely related to analyzing hyperlinks and link
structure on the Web for information retrieval and knowledge discovery. Web structure mining
can be used by search engines to rank the relevancy between Websites and to classify them
according to their similarity and relationship (Kosala & Blockeel, 2000). The Google search
engine, for instance, is based on the PageRank algorithm (Brin & Page, 1998), which states
that the relevance of a page increases with the number of hyperlinks to it from other pages,
and in particular from other relevant pages. Personalization and recommendation systems based
on hyperlinks are also studied in Web structure mining.

Web structure mining is used for identifying “authorities,” which are Web pages that are
pointed to by a large set of other Web pages (Desikan, Srivastava, Kumar & Tan, 2002), which
makes them candidates for good sources of information. Web structure mining is also used for
discovering “social networks on the Web” by extracting knowledge from similarity links. The
term is closely related to “link analysis” research, which has been developed in various
fields over the last
decade, such as computer science and mathematics for graph theory, and social and
communication sciences for social network analysis (Foot, Schneider, Dougherty, Xenos &
Larsen, 2003; Park, 2003; Wasserman & Faust, 1994).

This method is based on building a graph out of a set of related data (Badia & Kantardzic,
2005) and applying social network theory (Wasserman & Faust, 1994) to discover similarities.
Thus, a social network is modeled by a graph, where the nodes represent individuals and an
edge between two nodes represents a direct relationship between the individuals. Recently,
Getoor and Diehl (2005) introduced the term “link mining” to put special emphasis on the links
as the main data for analysis and provided an extended survey of the work that is related to
link mining.

A newer term, community mining, denotes a major research area on social networks that
emphasizes discovering groups of individuals who, by sharing the same properties, form a
specific community on the Web. Domain applications related to Web structure mining of social
interest include criminal investigations and security on the Web, and digital libraries, where
authoring, citations, and cross-references form the community of academics and their
publications. Detailed surveys of Web structure mining can be found in Desikan et al. (2002)
and Getoor and Diehl (2005).

The taxonomy previously described is based on the characteristics of the source data. Usually,
when working with one of the three data sources (Web content, log files, hyperlinks),
researchers might think of the corresponding category. However, this is not strict, and
researchers might combine source data and target application; for example, they can use
hyperlinks to predict Web content (Mladenic & Grobelnik, 1999). Another example is “Web
community” (Zhang, Yu & Hou, 2005), a term closer to Web structure mining that is nevertheless
used for the analysis and construction of “Web communities” not only from hyperlinks, but also
from Web document content and user access logs. Mobasher, Dai, Luo, Sung, and Zhu (2000)
combine Web usage mining and Web content mining for creating user content profiles. Web usage
data combined with ontologies and semantics for improving Web personalization are currently
proposed in various systems (Berendt, 2002; Dai & Mobasher, 2003; Oberle, Berendt, Hotho &
Gonzalez, 2003; Spiliopoulou & Pohle, 2001).

MACHINE LEARNING OVERVIEW

Machine learning is the basic method for most data mining approaches and is therefore also an
important method in Web mining research. It is a broad field of artificial intelligence
investigating the use of algorithms acting as intelligent learning methods in computer systems
to gain experience, so that this experience can be used when making decisions based on
previously learned tasks. Machine learning covers a wide range of learning methods, some of
which have been inspired by nature. Neural networks are inspired by the human brain and its
neurons for their learning, information storage, and information retrieval capabilities.
Genetic algorithms and evolutionary algorithms are inspired by Darwin’s theory of the survival
of the fittest in a population that evolves in time. Other machine learning methods are
designed to reach a decision by asking simple yes/no questions following a path in a
tree-based graph (decision trees) or to derive rules that find interesting associations and/or
correlation relationships among large sets of data items (association rules).

Representative machine learning methods are: artificial neural networks (ANN), self-organizing
maps, Hopfield networks, genetic algorithms, evolutionary algorithms, fuzzy systems, rough
sets, rule-based systems, support vector machines, decision trees, and Bayesian and
probabilistic models. Describing each of these methods in detail is beyond the scope of this
chapter. The reader can find many textbooks that describe in detail all the previously
mentioned methods (Bishop,
2003; Duda, Hart & Stork, 2001; Michalski & Tecuci, 1994; Mitchell, 1997).

At the same time, machine learning systems are capable of solving a number of problems related
to pattern classification, data clustering, prediction, and information retrieval. In
traditional data mining, machine learning is used for tackling four types of data mining
problems: classification, clustering, association rules, and prediction.

The task in classification problems is to assign classes to objects according to their
characteristics (features). The central aim in designing a classifier is to train the
classifier with patterns of known labels drawn from the available data, which usually are
labeled as “positive examples” for samples that belong to a known class and “negative
examples” for all those samples that do not belong to the known class. The classifier’s
success is evaluated by its ability to generalize, that is, the ability to predict correctly
the label of novel (unseen to the classifier) patterns that have been left out of the training
process.

The training process uses an adjustment mechanism that iteratively adjusts the parameters of
the classifier in order to get closer to learning the class of the training data. By
evaluating classifier generalization, one may obtain an estimate of the performance of the
classifier in classifying and predicting labels of newly collected data. Since the class of
each data example during the training phase is provided to the classifier, the type of
learning is called “supervised learning,” where the supervision takes place in adjustments of
the classifier parameters so that a misclassified data example in an iterative learning
process is classified correctly.

Classification problems also deal with prediction, as the task in classification is to
minimize the error of misclassified test data, and therefore classifiers may predict classes,
subject to the quality of the collected data and the accuracy of the classifier. In this
respect, prediction is harder when, instead of a discrete class, one tries to find the next
value in the range of the hypothesis after training the model with historical data, as for
example, predicting the closing price of a security in the stock market based on historical
financial data.

By contrast, clustering is a method that uses a machine learning approach called “unsupervised
learning,” where no predefined classes exist and the task is to find groups of similar
objects, creating a cluster for each group. Therefore, a cluster contains data that have
similar features to one another and, at the same time, dissimilar features to the rest of the
data. Association rules aim to find relationships and interesting patterns in large data sets.

Although the overwhelming majority of machine learning research is based on supervised and
unsupervised learning models, there exist two more types of learning: reinforcement learning
and multi-instance learning. Reinforcement learning tries to learn behavior through
trial-and-error interactions with a dynamic environment. The difference from supervised
learning is that correct classes are never presented, nor are suboptimal actions explicitly
corrected. In multi-instance learning (Dietterich, Lathrop & Lozano-Perez, 1997) the training
data consist of “bags,” each containing many instances, while in supervised learning the data
set for training consists of positive and negative examples. A bag is labeled positive if it
contains at least one positive instance. The task is to learn some concept from the training
set of bags for predicting the label of unseen bags. Training bags have known labels; however,
the instances have unknown labels, so the training process comprises labeled data that are
composed of unlabeled instances, and the task is to predict the labels of unseen data.

Normally decision trees and rule-based models are used to solve supervised learning problems;
self-organizing maps (SOM) and clustering models are typically used in unsupervised learning
problems; and genetic and evolutionary algorithms are typically used in reinforcement
learning problems. The rest of the machine learning methods are used in supervised,
unsupervised, reinforcement, and multi-instance learning problems alike.

MACHINE LEARNING APPLIED TO WEB MINING

Machine learning techniques can be very helpful when applied to the process of Web mining.
Although there is a close relation between machine learning and Web mining, one should note
that Web mining is not simply the application of machine learning techniques on the Web
(Kosala & Blockeel, 2000). Other methods for studying interesting patterns on the Web include
statistical analysis (Gibson & Ward, 2000; Sharma & Woodward, 2001; Yannas & Lappas, 2005;
Yannas & Lappas, 2006) and heuristics (Sutcliffe, 2001).

Primitive Web mining attempts to find patterns that explain various Web “phenomena” can also
be found in qualitative Web research methods (Demertzis, Diamantaki, Gazi, & Sartzetakis,
2005; Gillani, 1998; Margolis, Resnick, & Tu, 1997; Maule, 1998; Li, 1998; Reeves & Dehoney,
1998), which usually rely on observations, annotation of online and archived Web objects,
interviews or surveys of Web administrators and users, textual analysis, focus groups, and
social experiments (Schneider & Foot, 2004) to analyze and explain various Web phenomena.

This approach usually originates from social science and communication researchers, for whom
the ability to apply more advanced computerized methods like machine learning is limited;
however, the interesting aspect of such methods is their expressive power to interpret and
explain such phenomena. To the best of the author’s knowledge, no combined machine learning
and qualitative studies have been identified. It would be very interesting to see how such
studies could be empowered by the intelligence and automation of machine learning and by the
interpretive ability of the social sciences.

Compared with data mining, Web mining may share a few common characteristics with regard to
machine learning methods and approaches. However, working with Web data is more difficult due
to the fact that Web data are formed dynamically, change frequently, and their structure
cannot be stored in a fixed-length database with known features and characteristics. By
contrast, most data mining systems are well structured and remain static over time. Moreover,
Web data come in many different data types, such as text, tables, links, sidebars, layouts,
images, audio, video, PDF files, Word files, PostScript files, executable files, animation
files, and so forth. Detection of such data types can be a hard problem that needs
considerable effort to solve, as with table detection in Websites, where support vector
machines and decision trees can be used for attacking this problem (Wang & Hu, 2002). Lastly,
the Web is considerably larger than traditional databases in terms of magnitude due to the
billions of existing Websites. Before going to the next section, the author presents
indicative research that relates machine learning to the three Web mining categories.

Machine Learning and Web Content Mining

Intelligently indexing text on the Web is the primary goal of search engines building their
databases. Machine learning techniques and Web content mining are widely used in this task.
Neural networks are commonly used for Web document classification. They are trained on
existing Web data to learn to correctly classify patterns of Web documents. They produce high
classification accuracy and are very popular among researchers for learning and classifying
Web documents (Cirasa, Pilato, Sorbello, & Vassallo,
2000; Fukuda, Passos, Pacheco, Neto, Valerio, & Roberto, 2000; Pilato, Vitabile, Vassallo,
Conti, & Sorbello, 2003).

Apart from neural network classifiers, systems based on support vector machines for Web
document classification are presented in Sun, Lim, and Ng (2002), and Yu, Han, and Chang
(2002), whereas Esposito, Malerba, Di Pace, and Leo (1999) use three different classification
models (decision trees, centroids, and k-nearest-neighbor) for automated classification of Web
pages. Hybrid systems such as that of Kuo, Liao, and Tu (2005) combine neural networks with
genetic algorithms to analyse Web browsing paths for a recommendation system based on
intelligent agents.

Also, Bayesian classifiers for text categorization are used in Syskill and Webert (Pazzani &
Billsus, 1997), a recommendation system that recommends Web pages, and in Mooney and Roy
(2000) to produce content-based book recommendations. Semeraro, Basile, Degemmis, and Lops
(2006) train a Bayes classifier that infers user profiles as binary text classifiers (likes
and dislikes) in an application that acts like a conference participant advisor that suggests
conference papers to be read and talks to be attended by a conference participant.

Similarly, reinforcement learning and Bayes networks are used as intelligent agents in Rennie
and McCallum (1999) for efficiently learning and classifying Web documents. Stamatakis,
Karkaletsis, Paliouras, Horlock, Grover, and Curran (2003) compare various machine learning
approaches (decision trees, support vector machines, nearest-neighbour classifier, naïve
Bayes) for identifying domain-specific Websites.

Machine Learning and Web Usage Mining

Classifiers and clustering algorithms are usually used for analyzing hyperlinks in Web usage
mining. Pierrakos, Paliouras, Papatheodorou, Karkaletsis, and Dikaiakos (2003) use clustering
applied to Web usage mining for creating community-specific directories to offer users a more
personalized view of the Web according to their preferences. Hu and Meng (2005) present a
system that combines the intelligent agent approach with collaborative filtering using neural
networks and Bayes networks in order to retrieve relevant information. Zhou, Jiang, and Li
(2005) apply multi-instance learning to Web mining by using the browsing history of the user
in the Web index recommendation problem for recommending unseen Web pages.

Yao, Hamilton, and Wang (2002) combine three different machine learning techniques
(association rules, clustering, and decision trees) to help users navigate a Website by
analysing and learning from Web usage mining and user behavior. A hybrid approach that uses
self-organizing maps (SOM) (Kohonen, 1990) and a neuro-fuzzy model is applied on log files by
Wang et al. (2005) for Web traffic mining in order to predict Web server traffic. Genetic
algorithms are used by Tug, Sakiroglu, and Arslan (2006) for the discovery of user sequential
accesses from log files.

Machine Learning and Web Structure Mining

The most famous application of Web structure mining is the Google search engine, based on Brin
and Page’s (1998) PageRank algorithm for ranking page relevance. Mladenic and Grobelnik (1999)
use the k-nearest-neighbor algorithm to train a system for predicting Web content from
hyperlinks. Wu, Gordon, DeMaagd, and Fan (2006) use principal cluster analysis to identify a
small number of major topics from millions of navigational data. Lu and Getoor (2003) apply
classifiers for link-based object classification. Probabilistic models are used in Matsuo,
Ohsawa, and Ishizuka (2001) for Web search and identifying Web communities; in Lempel and
Moran (2001) for Web search; and in Cohn and Chang (2000), Getoor, Segal, Tasker, and Koller
(2001), and
Richardson and Domingos (2002) for Web page classification.

APPLICATIONS OF WEB MINING TO SOCIETAL BENEFIT AREAS

Web mining may benefit those organizations that want to utilize the Web as a knowledge base
for supporting decision making. Pattern discovery, analysis, and interpretation of mined
patterns may lead to better decisions for the organization and for the services provided.
E-commerce and e-business are two fields that may be empowered by Web mining, with many
applications for increasing sales, doing business intelligence, or even crisis management, as
in Tango-Lowy and Lewis (2005), where Web mining and self-organizing maps are used in crisis
scenarios.

Many Web mining applications found in the literature describe the effectiveness of the
application from the Web administration point of view. The target in these applications is to
take advantage of the knowledge mined from the users to increase the benefits of the
organization. In this chapter, the author focuses on areas of societal benefit from Web
mining, and hence the point of view is on Web mining applications that can help users or
groups of users. An obvious societal benefit is that Web mining research efforts lead to user
(or group of users) satisfaction by providing accurate and relevant information retrieval; by
providing customized information; by learning about users’ demands so that services can target
specific groups or even individual users; and by providing personalized services. The author
identified research in the following areas where Web mining offers societal benefits:
helpdesks and recommendation systems; digital libraries; security and crime investigation;
e-learning; e-government services; and e-politics and e-democracy.

Helpdesks and Recommendation Systems

Recommendation systems are based on user models that are mainly derived from content-based
learning or from collaborative filtering (Zukerman & Albrecht, 2001). Content-based learning
uses a user’s past usage behavior as an indicator of his/her future behavior. Collaborative
filtering is based on ratings of user preferences, such as ratings of music or movies, so that
the rating history of a user can be associated with similar preferences of other users. A user
is thus classified into a user model, and recommendations can be addressed to the user
according to the preferences of other people in the same user model. Hybrid recommendation
systems that take advantage of both collaborative filtering and content-based learning have
also been investigated in the literature (Melville, Mooney, & Nagarajan, 2002; Sarwar,
Karypis, Konstan, & Riedl, 2000).

Martin-Guerrero, Palomares, Balaguer-Ballester, Soria-Olivas, Gomez-Sanchis, and
Soriano-Asensi (2006) propose a recommender model for predicting user preferences based on
common clustering algorithms in a citizen Web portal. Clustering and collaborative filtering
are used in Hayes, Avesani, and Veeramachaneni (2006) for a blog recommendation system. A blog
is a journal-style Website usually written by a single user, where entries are presented in
reverse chronological order.

ReferralWeb (Kautz, Selman & Shah, 1997) is a project that mines social networks from the Web
by using collaborative filtering for identifying experts who could answer questions asked by
individuals. Nasraoui and Pavuluri (2004) use neural networks to provide accurate Web
recommendations based on a committee of predictors. Yao et al. (2002) created PagePrompter, an
agent-based recommender that helps users navigate a Website by analysing and learning from Web
usage mining and user behavior. The interesting part of PagePrompter is that it combines three
different machine learning techniques (association rules, clustering, and decision trees) for
achieving its task.

Pierrakos et al. (2003) use clustering applied to Web usage mining for creating
community-specific directories to offer users a more personalized view of the Web according to
their preferences; users may be assisted by using these directories as starting points in
their navigation. Garofalakis, Kappos, and Mourloukos (1999) studied Website optimization
using Web page popularity. Scheffer (2004) created an e-mail answering assistant using
semisupervised text classification.

Fast and accurate Web services are practical implications of improved helpdesks and
recommendation systems.

Digital Libraries

Digital libraries provide precious information

in order to discover hidden communities in heterogeneous social networks. Graph analysis for
discovering Web communities can be modeled by using Bayesian networks, as demonstrated by
Goldenberg and Moore (2005) for identifying coauthorship networks.

A content-based learning book recommendation system is proposed by Mooney and Roy (2000) based
on Web pages of the Amazon online digital store. Large portals with news updated frequently
each day contain rich information and may be considered part of digital libraries, in the way
that newspaper articles are indexed and available to readers in traditional libraries.

At the same time, news sites are large portal sites that increase their content on a daily
basis. For such sites, the interpretation of Web content to meaningful content can be
classified into semantic categories in order to make both information retrieval and
presentation easier for individuals
distributed all around the world without necessar- and group of users is very important (Eirinaki &
ily having the need to be physically present in a Vazirgiannis, 2003). Liu, Yu, and Le (2005) use
traditional library building. In this context, Web fuzzy clustering to identify meaningful news
mining research aiming to offer better services on patterns from Web news stream data. However,
digital libraries have been identified in literature. the wide distribution of knowledge on the one
Adafre and Rijke (2005) use clustering for discov- side and the easiness of access to this knowledge
ering missing hypertext links in Wikipedia, the on the other side from various groups of people
largest encyclopedia on the Web that is created and like researchers, academics, students, pupils,
modified by many volunteer authors. Web min- professionals or independents are the most valu-
ing on Wikipedia is also investigated by Gleim, able practical implications of Web mining to this
Mehler, and Dehmer (2006). Bhattacharya and societal interest area.
Getoor (2004) use clustering for detecting group
of entities, like authors, from links and resolving E-Learning
the coreference problem of multiple references
to the same paper in autonomous citation index- Web mining may be used for enhancing the
ing engines, like CiteSeer (Giles, Bollacker, & learning process in e-learning applications. Bel-
Lawrence, 1998). laachia, Vommina, and Berrada (2006), introduce
CiteSeer is an important resource for computer a framework, where they use log files to analyze
scientists for searching electronic versions of pa- the navigational behavior and the performance
pers. Cai, Shao, He, Yan, and Han (2005) work also of e-learners so that to personalize the learning
is based on another well-known digital library in content of an adaptive learning environment in
computer science community, the DBLP library order to make the learner reach his learning ob-
at http://dblp.uni-trier.de. Their work is related to jective. Zaiane (2001) studies the use of machine
machine learning feature extraction algorithms learning techniques and Web usage mining to

enhance Web-based learning environments, so that educators can better evaluate the learning process and learners are helped in their learning task. Students' Web logs are investigated and analysed by Cohen and Nachmias (2006) in a Web-supported instruction model. Improved e-learning services that accommodate user needs are practical implications of Web mining and machine learning for the e-learning area.

Security and Crime Investigation

Web mining techniques may be used for identifying cybercrime actions like Internet fraud and fraudulent Websites, illegal online gambling, hacking, virus spreading, child pornography distribution, and cyberterrorism. Chen, Qin, Reid, Chung, Zhou, and Xi (2004) note that clustering and classification techniques can reveal identities of cybercriminals, whereas neural networks, decision trees, genetic algorithms, and support vector machines can be applied to crime patterns and network visualization. Chen et al. (2004) provide a detailed study of methods against terrorist groups on the Web, covering predictive modeling, terrorist network analysis, and visualization of terrorists' activities, linkages, and relationships.
Similarly, Wu et al. (2006) apply principal cluster analysis to users' online activities to identify a small number of major topics from millions of navigation records, in an approach that can be useful in security against terrorism. Do, Chang, and Hui (2004) implemented a system that can support safe Internet browsing at school, at home, and in the workplace. The system monitors and filters Web access by applying Web mining to classify Web data into a "white list" of allowed pages or a "black list" of blocked Web pages. Social networks extracted from instant messaging by using clustering are investigated by Resig, Dawara, Homan, and Teredesai (2004). A more secure environment with better online and offline protection is the practical implication of Web mining and machine learning for this area of societal interest.

E-Government Services

The processes through which government organizations interact with citizens to satisfy the preferences of users (or groups of users) lead to better social services. The major characteristics of e-government systems relate to the use of technology to deliver services electronically, focusing on citizens' needs by providing adequate information and enhanced services to citizens in support of the political conduct of government. Empowered by Web mining methods, e-government systems may provide customized services to citizens, resulting in user satisfaction, quality of services, and support in citizens' decision making, and finally leading to social benefits. However, such social benefits mainly rely on the organization's willingness, knowledge, and ability to move up to the level of using Web mining.
The e-government dimension of an institution is usually implemented gradually. E-government maturity models (Irani, Al-Sebie, & Elliman, 2006; Lappas & Yannas, 2006) describe the online stages an organization goes through over time as it becomes more mature in using the Web to provide better services to citizens. The maturity stages start from the organization's first attempt to be online, aiming at publishing information useful to citizens, and move to higher maturity stages of being interactive, making transactions, and finally transforming the functionality of the organization so that it operates its business and services electronically through the Web. However, the maturity stages described in the literature do not have a Web mining dimension, which the author considers should be the climax of the maturity stages.
Riedl (2003) states that, by using interviews and Web mining, the actual access to information by citizens should be tracked, analyzed, and used for the redesign of e-government information services. E-government literature reveals that only
recently has Web mining attracted researchers in e-government applications. Fang and Sheng (2005) present a Web mining approach for designing a better Web portal for e-government. Hong and Lee (2005) propose an intelligent government Web information system based on Web usage mining to help disadvantaged users make good decisions for improving their profits. In the health sector, Mayer, Karkaletsis, Stamatakis, Leis, Villarroel, and Thomeczek (2006) investigate improvements of health services through quality labelling of medical Web content in the recently announced MedIEQ project. In conclusion, e-government aims to improve government services to citizens, and any improvement in this direction leads to valuable implications of Web mining and machine learning for national and local societies.

E-Politics and E-Democracy

E-politics provides political information and politics "on demand" to citizens, improving political transparency and democracy and benefiting parties, candidates, citizens, and society. Election campaigners, parties, members of parliament, and members of local governments on the Web are part of e-politics. Despite the importance of e-politics in democracy, there are few Web mining methods that meet citizens' needs. The only research the author has identified in the literature refers to mining political social networks on the Web. Link analysis has been used to estimate the size of political Web graphs (Ackland, 2005), to map political party networks on the Web (Ackland & Gibson, 2004), and to investigate the U.S. political blogosphere (Ackland, 2005b). Political Web linking is also studied by Foot et al. (2003) during the U.S. congressional election campaign season on the Web. In this respect, expanding the borders of e-democracy will lead to a more transparent and participatory democracy, which is vital to society.

Future Trends

Nowadays the Web is a rich and huge information repository, for which a number of methods and automatic systems have been created for identifying, locating, accessing, and retrieving information. The main open question in Web mining is how to provide information relevant to specific users' needs. The semantic Web (Berners-Lee et al., 2001) works in this direction and is considered the next Web generation. The current Web is based on the hypertext mark-up language (HTML), which specifies how to lay out Web pages so that they are readable by humans; thus it is human-centered. The problem is the retrieval of relevant information by search engines, because machines cannot understand Web content. This is expected to change with semantic Web technologies, as in the semantic Web "information is given well-defined meaning, better enabling computers and people to work in cooperation" (Berners-Lee et al., 2001).
Consequently, machine learning techniques will continue to play the most important role in the semantic Web (Hess & Kushmerick, 2004) for information retrieval and knowledge discovery. Berendt et al. (2002) introduce "semantic Web mining" as the field where the semantic Web meets Web mining. It is expected that machine learning techniques and semantic Web mining will be in the focus of research in the coming years.
In this chapter the author has introduced areas of societal interest that may benefit from Web mining and machine learning. The literature review revealed that most research in these areas has only recently started. The future trend seems to be the convergence of Web mining and machine learning toward practical solutions in the six areas of societal benefit: helpdesks and recommendation systems, digital libraries, e-learning, security and crime investigation, e-government services, and e-politics and e-democracy.
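
The contrast drawn in the Future Trends discussion, between HTML that only describes layout and information that carries well-defined meaning, can be illustrated with a small sketch. It is not taken from any of the cited systems: the page fragment, the people and parties, and the "ex:" vocabulary are all invented for illustration, and the triple store is a plain Python set rather than a real semantic Web toolkit.

```python
from html.parser import HTMLParser

# Hypothetical page fragment: the markup only says how to display the text.
html_page = "<p><b>Jane Doe</b> is a candidate for the <i>Green Party</i>.</p>"


class TextOnly(HTMLParser):
    """Collects the visible text, which is all a layout-oriented page offers."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


parser = TextOnly()
parser.feed(html_page)
parser.close()
visible_text = "".join(parser.chunks)  # a plain string; no machine-usable structure

# The same kind of fact stated as subject-predicate-object triples.
# All identifiers below are invented for this example.
triples = {
    ("ex:JaneDoe", "rdf:type", "ex:Candidate"),
    ("ex:JaneDoe", "ex:memberOf", "ex:GreenParty"),
    ("ex:JohnRoe", "rdf:type", "ex:Candidate"),
    ("ex:JohnRoe", "ex:memberOf", "ex:BlueParty"),
}


def match(pattern, store):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if all(p is None or p == v for p, v in zip(pattern, t))
    )


def candidates_of(party, store):
    """Subjects that are typed as candidates and are members of the party."""
    members = {s for s, _, _ in match((None, "ex:memberOf", party), store)}
    candidates = {s for s, _, _ in match((None, "rdf:type", "ex:Candidate"), store)}
    return sorted(members & candidates)
```

With the triples, a question such as "which candidates belong to a given party" becomes a simple pattern match (`candidates_of("ex:GreenParty", triples)`), whereas the HTML version would first require text mining to recover the same relation.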

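The link analysis mentioned in the E-Politics and E-Democracy discussion can be sketched in the same spirit. The example below is only an assumption about what such an analysis might look like in its simplest form, not the method of the cited studies: the site names and links are invented, and the ranking is a basic PageRank-style power iteration with a conventional damping factor of 0.85.

```python
# Hypothetical hyperlink graph: each site maps to the sites it links to.
links = {
    "party-a.example": ["party-b.example", "news.example"],
    "party-b.example": ["news.example"],
    "party-c.example": ["party-a.example", "news.example"],
    "news.example": ["party-a.example"],
}


def in_degrees(graph):
    """Count incoming links per site, a first rough measure of prominence."""
    counts = {node: 0 for node in graph}
    for targets in graph.values():
        for target in targets:
            counts[target] = counts.get(target, 0) + 1
    return counts


def pagerank(graph, damping=0.85, iterations=50):
    """Basic power iteration; assumes every link target is also a key of graph."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in graph.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A site without outgoing links spreads its rank evenly.
                for other in nodes:
                    new_rank[other] += damping * rank[node] / n
        rank = new_rank
    return rank


degrees = in_degrees(links)
ranks = pagerank(links)
```

In-degree only counts how many sites link to a page, while the iterative score also weighs who is linking; on real political Web graphs both would be computed on crawled data rather than on a hand-written dictionary.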
Conclusion

This chapter has provided a survey of Web mining and machine learning methods, focusing on current Web mining research in areas of societal benefit and identifying that most of this research has been developed recently. One of the current trends in Web mining is therefore toward connecting intelligent Web services with applications of social benefit, which brings scientists from various disciplines to work more closely together. Furthermore, this integrating tendency benefits researchers from various fields.
Social studies of the Web may benefit from machine learning and Web mining methods, which provide tools and methods to better collect, manage, and analyze Web-based phenomena. Moreover, a social interpretation of the meaning of outcomes from computer science Web mining methods is the key question for social and communication studies (Thelwall, 2006). Finally, the Web mining and machine learning community may benefit from social and communication expertise on the Web to better interpret its outcomes: why a pattern occurs; whether mined patterns carry meaningful or useful knowledge; or whether hidden knowledge found by Web mining creates a new view that needs further investigation and explanation.

References

Ackland, R. & Gibson, R. (2004). Mapping political party networks on the WWW. In Proceedings of the Australian Electronic Governance Conference, Melbourne, Australia.
Ackland, R. (2005). Estimating the size of political Web graphs. Revised paper presented to ISA Research Committee on Logic and Methodology Conference. Retrieved April 10, 2008, from http://acsr.anu.edu.au/staff/ackland/papers/political_web_graphs.pdf
Ackland, R. (2005b). Mapping the U.S. political blogosphere: Are conservative bloggers more prominent? Paper presented to BlogTalk Downunder, Sydney. Retrieved April 10, 2008, from http://acsr.anu.edu.au/staff/ackland/papers/polblogs.pdf
Adafre, S. F. & Rijke, M. D. (2005). Discovering missing links in Wikipedia. In Proceedings of the 3rd International Workshop on Link Discovery (pp. 90-97). ACM Press.
Badia, A. & Kantardzik, M. (2005). Graph building as a mining activity: Finding links in the small. In Proceedings of the 3rd International Workshop on Link Discovery (pp. 17-24). ACM Press.
Baldi, P., Frasconi, P., & Smyth, P. (2003). Modeling the Internet and the Web: Probabilistic methods and algorithms. West Sussex, UK: John Wiley.
Bellaachia, A., Vommina, E., & Berrada, B. (2006). Minel: A framework for mining e-learning logs. In Proceedings of the 5th IASTED International Conference on Web-based Education (pp. 259-263). Puerto Vallarta, Mexico.
Berendt, B. (2002). Using site semantics to analyze, visualize and support navigation. Data Mining and Knowledge Discovery, 6, 37-59.
Berendt, B., Hotho, A., & Stumme, G. (2002). Towards semantic Web mining. Lecture notes in computer science (Vol. 2342, pp. 264-278).
Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The semantic Web. Scientific American, 284(5), 34-43.
Bhattacharya, I. & Getoor, L. (2004). Deduplication and group detection using links. In Proceedings of the SIGKDD Workshop on Link Analysis and Group Detection, Seattle, WA.
Bishop, C. M. (2003). Neural networks for pattern recognition. Oxford University Press.

Brin, S. & Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine. In Proceedings of the 7th International World Wide Web Conference, Elsevier Science (pp. 107-117), New York.
Buchner, A. G., Mulvenna, M. D., Anand, S. S., & Hughes, J. G. (1999). Navigation pattern discovery from Internet data. In Proceedings of the Web Usage Analysis and User Profiling Workshop (pp. 25-30), San Diego, CA.
Cai, D., Shao, Z., He, X., Yan, X., & Han, J. (2005). Mining hidden community in heterogeneous social networks. In Proceedings of the 3rd International Workshop on Link Discovery (pp. 58-65). ACM Press.
Chakrabarti, S. (2000). Data mining for hypertext: A tutorial survey. ACM SIGKDD Explorations, 1(2), 1-11.
Chakrabarti, S. (2003). Mining the Web: Discovering knowledge from hypertext data. San Francisco: Morgan Kaufmann Publishers.
Chen, H. & Chau, M. (2004). Web mining: Machine learning for Web applications. Annual Review of Information Science and Technology (ARIST), 38, 289-329.
Chen, H., Chung, W., Xu, J. J., Wang, G., Qin, Y., & Chau, M. (2004). Crime data mining: A general framework and some examples. Computer, 37(4), 50-56.
Chen, H., Qin, J., Reid, E., Chung, W., Zhou, Y., Xi, W., Lai, G., Bonillas, A., & Sageman, M. (2004). The dark Web portal: Collecting and analyzing the presence of domestic and international terrorist groups on the Web. In Proceedings of the 7th International Conference on Intelligent Transportation Systems (ITSC), Washington, D.C.
Cirasa, A., Pilato, G., Sorbello, F., & Vassallo, G. (2000). EaNet: A neural solution for Web pages classification. In Proceedings of the 4th World Multiconference on Systemics, Cybernetics, and Informatics SCI2000, Orlando, Florida.
Cohn, D. & Chang, H. (2000). Learning to probabilistically identify authoritative documents. In Proceedings of the 17th International Conference on Machine Learning (ICML2000) (pp. 167-174), Stanford, California.
Cooley, R., Mobasher, B., & Srivastava, J. (1997). Web mining: Information and pattern discovery on the World Wide Web. In Proceedings of the 9th International Conference on Tools with Artificial Intelligence (ICTAI '97) (pp. 558-567), Newport Beach, CA: IEEE Computer Society.
Cohen, A. & Nachmias, R. (2006). A quantitative cost effectiveness model for Web-supported academic instruction. The Internet and Higher Education, 9(2), 81-90.
Dai, H. & Mobasher, B. (2003). A road map to more effective Web personalization: Integrating domain knowledge with Web usage mining. In Proceedings of the International Conference on Internet Computing (IC 2003), Las Vegas, Nevada.
Demertzis, N., Diamantaki, K., Gazi, A., & Sartzetakis, N. (2005). Greek political marketing on-line: An analysis of parliament members' Web sites. Journal of Political Marketing, 4(1), 51-74.
Desikan, P., Srivastava, J., Kumar, V., & Tan, P. N. (2002). Hyperlink analysis: Techniques and applications (Tech. Rep. TR 2002-0152). Army High Performance Computing Center.
Dietterich, T. G., Lathrop, R. H., & Lozano-Perez, T. (1997). Solving the multiple-instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1-2), 31-71.
Do, T. D., Chang, K., & Hui, S. C. (2004). Web mining for cyber monitoring and filtering. In Proceedings of the 2004 IEEE Conference on Cybernetics and Intelligent Systems Vol. 1 (pp. 399-404). Singapore.

Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification. New York: John Wiley.
Eirinaki, M. & Vazirgiannis, M. (2003). Web mining for Web personalization. ACM Transactions on Internet Technology, 3(1), 1-27.
Esposito, F., Malerba, D., Di Pace, L., & Leo, P. (1999). A learning intermediary for automated classification of Web pages. In Proceedings of the 16th International Workshop on Machine Learning in Text Data Analysis (ICML1999) (pp. 37-46).
Faca, F. M. & Lanzi, P. L. (2005). Mining interesting knowledge from Weblogs: A survey. Data & Knowledge Engineering, 53(3), 225-241.
Fang, X. & Sheng, O. R. L. (2005). Designing a better Web portal for digital government: A Web-mining based approach. In Proceedings of the 2005 National Conference on Digital Government Research (pp. 277-278), Atlanta, Georgia.
Foot, K., Schneider, S., Dougherty, M., Xenos, M., & Larsen, E. (2003). Analyzing linking practices: Candidate sites in the 2002 U.S. electoral Web sphere. Journal of Mediated Communication, 8(4).
Fukuda, H., Passos, E., Pacheco, A. M., Neto, L. B., Valerio, J., Roberto, V. J. D., Antonio, E. R., & Chigener, L. (2000). Web text mining using a hybrid system. In Proceedings of the 6th Brazilian Symposium on Neural Networks (pp. 131-136).
Garofalakis, J., Kappos, P., & Mourloukos, D. (1999). Website optimization using page popularity. IEEE Internet Computing, 3(4), 22-29.
Getoor, L., Segal, E., Tasker, B., & Koller, D. (2001). Probabilistic models of text and link structure for hypertext classification. In Proceedings of the IJCAI Workshop on Text Learning: Beyond Supervision, Seattle, Washington.
Getoor, L. & Diehl, C. P. (2005). Link mining: A survey. ACM SIGKDD Explorations Newsletter, 7(2), 3-12.
Gibson, R. K. & Ward, S. J. (2000). A proposed methodology for studying the functions and effectiveness of party and candidate Web-sites. Social Science Computer Review, 18(3), 301-319.
Giles, C. L., Bollacker, K., & Lawrence, S. (1998). CiteSeer: An automatic citation indexing system. In Proceedings of the 3rd ACM Conference on Digital Libraries, 89-98.
Gillani, B. (1998). The Web as a delivery mechanism to enhance instruction. Educational Media International, 35(3), 197-202.
Gleim, R., Mehler, A., & Dehmer, M. (2006). Web corpus mining by instance of Wikipedia. In Proceedings of the EACL 2006 Workshop on Web as Corpus, Trento, Italy.
Goldenberg, A. & Moore, A. W. (2005). Bayes net graphs to understand co-authorship networks? In Proceedings of the 3rd International Workshop on Link Discovery (pp. 1-8). ACM Press.
Hayes, C., Avesani, P., & Veeramachaneni, S. (2006). An analysis of bloggers and topics for a blog recommender system. In Proceedings of the Workshop on Web Mining, 7th European Conference on Machine Learning and the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD), Berlin, Germany.
Herrera-Viedma, E. & Pasi, G. (2006). Soft approaches to information retrieval and information access on the Web: An introduction to the special topic section. Journal of the American Society for Information Science and Technology, 57(4), 511-514.
Hess, A. & Kushmerick, N. (2004). Machine learning for annotating semantic Web services. In Proceedings of the AAAI Spring Symposium on Semantic Web Services, Palo Alto, California.
Hong, G. H. & Lee, J. H. (2005). Designing an intelligent Web information system of government
based on Web mining. Lecture notes in computer science (Vol. 3614, pp. 1071-1078).
Hu, W. & Meng, B. (2005). Design and implementation of Web mining system based on multi-agent. Lecture notes on artificial intelligence (Vol. 3584, pp. 491-498).
Irani, Z., Al-Sebie, M., & Elliman, T. (2006). Transaction stage of e-government systems: Identification of its location & importance. In Proceedings of the 39th Hawaii International Conference on System Sciences, Hawaii.
Kautz, H., Selman, B., & Shah, M. (1997). Referral Web: Combining social networks and collaborative filtering. Communications of the ACM, 40(3), 63-65.
Kohonen, T. (1990). The self-organizing maps. Proceedings of the IEEE, 78, 1464-1480.
Kosala, R. & Blockeel, H. (2000). Web mining research: A survey. ACM SIGKDD Explorations, 2(1), 1-15.
Kuo, R. J., Liao, J. L., & Tu, C. (2005). Integration of ART2 neural network and genetic k-means algorithm for analyzing Web browsing paths in electronic commerce. Decision Support Systems, 40, 355-374.
Lappas, G. & Yannas, P. (2006). A framework to evaluate political party Websites. In Proceedings of the 4th International Conference on Politics and Information Systems: Technologies and Applications Vol. II (pp. 226-231), Orlando, Florida.
Lawrence, S. & Giles, C. L. (1999). Accessibility of information on the Web. Nature, 400, 107-109.
Lempel, R. & Moran, S. (2001). SALSA: The stochastic approach for link-structure analysis. ACM Transactions on Information Systems, 19(2), 131-160.
Levene, M. & Loizou, G. (1999). Computing the entropy of user navigation in the Web (Tech. Rep. No. RN/99/42), University College London.
Li, X. (1998). Web page design and graphic use of three U.S. newspapers. Journalism and Mass Communication Quarterly, 75(2), 353-365.
Liu, J. W., Yu, S. J., & Le, J. J. (2005). Online mining dynamic Web news patterns using machine learn methods. Lecture notes on artificial intelligence (Vol. 3614, pp. 462-465).
Lu, Q. & Getoor, L. (2003). Link-based text classification. In Proceedings of the 3rd International Workshop on Link Discovery (pp. 1-8). ACM Press.
Margolis, M., Resnick, D., & Tu, C.-C. (1997). Campaigning on the Internet: Parties and candidates on the World Wide Web in the 1996 primary season. Harvard International Journal of Press/Politics, 2(1), 59-78.
Martin-Guerrero, J. D., Palomares, A., Balaguer-Ballester, E., Soria-Olivas, E., Gomez-Sanchis, J., & Soriano-Asensi, A. (2006). Studying the feasibility of a recommender in a citizen Web portal based on user modeling and clustering algorithms. Expert Systems with Applications, 30, 299-312.
Matsuo, Y., Ohsawa, Y., & Ishizuka, M. (2001). Average-clicks: A new measure of distance on the WWW. In Proceedings of First Asia-Pacific Conference, Web Intelligence, Japan.
Maule, R. W. (1998). Content design frameworks for Internet studies curricula and research. Internet Research: Electronic Networking Applications and Policy, 8(2), 174-184.
Mayer, M. A., Karkaletsis, V., Stamatakis, K., Leis, A., Villarroel, D., Thomeczek, C., Labsky, M., Lopez-Ostenero, F., & Honkela, T. (2006). MedIEQ-Quality labelling of medical Web content using multilingual information extraction. Studies in Health Technology and Informatics, 121, 183-190.
Melville, P., Mooney, R. J., & Nagarajan, R. (2002). Content-boosted collaborative filtering for
improved recommendations. In Proceedings of the 18th National Conference on Artificial Intelligence (pp. 187-192).
Michalski, R. S. & Tecuci, G. (1994). Machine learning: A multistrategy approach (Vol. IV). Morgan Kaufmann.
Mitchell, T. (1997). Machine learning. McGraw Hill.
Mladenic, D. & Grobelnik, M. (1999). Predicting content from hyperlinks. In Proceedings of the 16th International ICML99 Workshop on Machine Learning in Text Data Analysis (pp. 109-113).
Mobasher, B., Jain, N., Han, E., & Srivastava, J. (1996). Web mining: Pattern discovery from WWW transaction (Tech. Rep. TR-96050). Department of Computer Science, University of Minnesota, Minneapolis. Retrieved April 12, 2008, from http://citeseer.ist.psu.edu/mobasher96web.html
Mobasher, B., Cooley, R., & Srivastava, J. (1999). Creating adaptive Web sites through usage based clustering of URLs. In Proceedings of the IEEE Knowledge and Data Engineering Exchange Workshop (KDEX99), Chicago, Illinois.
Mobasher, B., Dai, H., Luo, T., Sung, Y., & Zhu, J. (2000). Integrating Web usage and content mining for more effective Web personalization. In Proceedings of the International Conference on E-Commerce and Web Technologies (ECWeb 2000) (pp. 165-176). Greenwich, UK.
Mooney, R. J. & Roy, L. (2000). Content-based book recommending using learning for text categorization. In Proceedings of the 5th ACM Conference on Digital Libraries (pp. 195-204). ACM Press.
Nasraoui, O. & Pavuluri, M. (2004). Complete this puzzle: A connectionist approach to accurate Web recommendations based on a committee of predictors. In Proceedings of the 6th WEBKDD Workshop, Seattle, Washington.
Ngu, D. S. W. & Wu, X. (1997). Sitehelper: A localized agent that helps incremental exploration of the World Wide Web. Computer Networks, 29(8-13), 1249-1255.
Oberle, D., Berendt, B., Hotho, A., & Gonzalez, J. (2003). Conceptual user tracking. Lecture notes on artificial intelligence (Vol. 2663, pp. 155-164).
Pal, S., Talwar, V., & Mitra, P. (2002). Web mining in soft computing framework: Relevance, state of the art and future directions. IEEE Transactions on Neural Networks, 13(5), 1163-1177.
Park, H. W. (2003). Hyperlink network analysis: A new method for the study of social structure on the Web. Connections, 25(1), 49-61.
Pazzani, M. & Billsus, D. (1997). Learning and revising user profiles: The identification of interesting Web sites. Machine Learning, 27(3), 313-331.
Pierrakos, D., Paliouras, G., Papatheodorou, C., Karkaletsis, V., & Dikaiakos, M. (2003). Web community directories: A new approach to Web personalization. Lecture notes on artificial intelligence (Vol. 3209, pp. 113-129).
Pilato, G., Vitabile, S., Vassallo, G., Conti, V., & Sorbello, F. (2003). A concurrent neural classifier for HTML documents retrieval. Lecture notes in computer science (Vol. 2859, pp. 210-217).
Reeves, T. C. & Dehoney, J. (1998). Cognitive and social functions of course Web sites. In H. Maurer & R. G. Olson (Eds.), Proceedings of WebNet World Conference 98 - World Conference of the WWW, Internet & Intranet. Orlando, FL: Association for the Advancement of Computing in Education.
Rennie, J. & McCallum, A. K. (1999). Using reinforcement learning to spider the Web efficiently. In Proceedings of the 16th International ICML99 Workshop on Machine Learning in Text Data Analysis (pp. 335-343).

Resig, J., Dawara, S., Homan, C. M., & Teredesai, A. (2004). Extracting social networks from instant messaging populations. In Proceedings of LinkKDD'04, Seattle, Washington.
Richardson, M. & Domingos, P. (2002). The intelligent surfer: Probabilistic combination of link and content information in PageRank. Advances in Neural Information Processing Systems, 14.
Riedl, R. (2003). Design principles for e-government services. In Proceedings of eGov Day 2003, Vienna, Austria.
Sarwar, B., Karypis, G., Konstan, J., & Riedl, J. (2000). Analysis of recommendation algorithms for e-commerce. In Proceedings of the ACM Conference on Electronic Commerce (pp. 158-162).
Schneider, S. & Foot, K. (2004). The Web as an object of study. New Media & Society, 6(1), 114-122.
Scime, A. (2005). Web mining: Application and techniques. Hershey, PA: Idea Group Inc.
Scheffer, T. (2004). Email answering assistance by semi-supervised text classification. Intelligent Data Analysis, 8(5), 2004.
Sharma, A. & Woodward, R. (2001). Political economy Websites: A researcher's guide. New Political Economy, 6(1), 119-130.
Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1-47.
Semeraro, G., Basile, P., Degemmis, M., & Lops, P. (2006). Discovering user profiles from papers by using word sense disambiguation. In Proceedings of the ECML/PKDD Workshop on Web Mining (pp. 69-79), Berlin, Germany.
Spiliopoulou, M., Pohle, C., & Faulstich, L. (1999). Improving the effectiveness of a Web site with Web usage mining. In Proceedings of WEBKDD99 (pp. 142-162), San Diego, CA.
Spiliopoulou, M. & Pohle, C. (2001). Data mining for measuring and improving the success of Web sites. Data Mining and Knowledge Discovery, 5(1-2), 85-114.
Srivastava, J., Cooley, R., Deshpande, M., & Tan, P. (2000). Web usage mining: Discovery and applications of usage patterns from Web data. SIGKDD Explorations, 1, 12-23.
Stamatakis, K., Karkaletsis, V., Paliouras, G., Horlock, J., Grover, C., Curran, J. R., & Dingare, S. (2003). Domain-specific Web site identification: The CROSSMARC focused Web crawler. In Proceedings of the Second International Workshop on Web Document Analysis (WDA 2003) (pp. 75-78), Edinburgh, UK.
Sun, A., Lim, E. P., & Ng, W. K. (2002). Web classification using support vector machine. In Proceedings of the Fourth ACM CIKM International Workshop on Web Information and Data Management (WIDM'02), McLean, Virginia.
Sutcliffe, A. (2001). Heuristic evaluation of Website attractiveness and Web usability. Lecture notes in computer science (Vol. 2220, pp. 183-198).
Tango-Lowy, R. & Lewis, L. (2005). Situation management in crisis scenarios based on self-organizing neural mapping technology. In Proceedings of the IEEE Military Communications Conference (pp. 1-7), Atlantic City, New Jersey.
Thelwall, M. (2006). Interpreting social science link analysis research: A theoretical framework. Journal of the American Society for Information Science and Technology, 57(1), 60-68.
Tug, E., Sakiroglu, M., & Arslan, A. (2006). Automatic discovery of the sequential accesses from Web log data files via a genetic algorithm. Knowledge-Based Systems, 19(3), 180-186.
Wang, X., Abraham, A., & Smith, K. (2005). Intelligent Web traffic mining and analysis. Journal of Network and Computer Applications, 28, 147-165.

Machine Learning and Web Mining

Wang, Y. & Hu, J. (2002). A machine learning navigation created using data mining techniques.
approach for table detection on the Web. In Pro- Lecture notes in computer science (Vol. 2475,
ceedings of the 11th International World Web pp. 506-513).
Conference, Honolulu, Hawaii.
Yu, H., Han, J., & Chang, K. C. (2002). PEBL:
Wasserman, S. & Faust, K. (1994). Social network Positive example based learning for Web page
analysis: Methods and applications. Cambridge classification using SVM. In Proceedings Of
University Press. The International Conference On Knowledge
Discovery In Databases (KDD02) (pp. 239-248),
Wolfgang, G. & Lars, S. (2000). Mining Web
New York.
navigation path fragments. In Proceedings of
the Workshop on Web Mining for E-Commerce Zaiane, O. R. (2001). Web usage mining for a better
(KDD2000) (pp. 105-110). Boston, MA. Web-based learning environment. In Proceedings
of Conference on Advanced Technology for Edu-
Wu, H., Gordon, M., DeMaagd, K., & Fan, W.
cation (pp. 60-64). Banff, Alberta, Canada.
(2006). Mining Web navigations for intelligence.
Decision Support Systems, 41, 574-591. Zhang, Y., Yu, J. X., & Hou, J. (2005). Web com-
munities: Analysis and construction. Berlin:
Yannas, P. & Lappas, G. (2005). Web campaign
Springer.
in the 2002 Greek municipal elections. Journal
of Political Marketing, 4(1), 33-50. Zhou, Z., Jiang, K., & Li, M. (2005). Multi-instance
learning based Web mining. Applied Intelligence,
Yannas, P. & Lappas, G. (2006). Web candidates
22(2), 135-147.
in the 2002 Greek prefecture elections. Journal
of E-Government, 3(1), 53-67. Zukerman, I. & Albrecht, D. (2001). Predictive
statistical models for user modeling. User Model-
Yao, Y. Y., Hamilton, H. J. & Wang, X. (2002).
ling and User Adapted Interaction, 11, 5-18.
PagePrompter: An intelligent agent for Web

Chapter VI
The Importance of Data Within
Contemporary CRM
Diana Luck
London Metropolitan University, UK

ABSTRACT

In recent times, customer relationship management (CRM) has been defined as relating to sales, marketing, and even services automation. Additionally, the concept is increasingly associated with cost savings and streamlined processes as well as with the engendering, nurturing and tracking of relationships with customers. Far fewer associations appear to be attributed to the creation, storage and mining of data. Although successful CRM is evidently based on a triad combination of technology, people and processes, the importance of data is unquestionable. Accordingly, this chapter seeks to illustrate how, although the product and service elements as well as organizational structure and strategies are central to CRM, data is the pivotal dimension around which the concept revolves in contemporary terms. Consequently, this chapter seeks to illustrate how the processes associated with data management, namely data collection, data collation, data storage and data mining, are essential components of CRM in both theoretical and practical terms.
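
As a loose illustration of the four data management processes named above (collection, collation, storage and mining), the following Python sketch passes a handful of invented hotel records through each step in turn. It is a minimal sketch under stated assumptions, not a description of any system discussed in this chapter: the guest identifiers, amounts, source names and function names are all hypothetical, and a plain dictionary stands in for a database.

```python
from collections import defaultdict

# Hypothetical raw records, as they might arrive from two separate hotel systems.
RESERVATIONS = [
    {"guest": "G001", "amount": 120.0},
    {"guest": "G002", "amount": 80.0},
    {"guest": "G001", "amount": 200.0},
]
RESTAURANT = [
    {"guest": "G002", "amount": 45.0},
    {"guest": "G003", "amount": 30.0},
]


def collect():
    """Data collection: gather raw records from every source system."""
    tagged = [("reservations", rec) for rec in RESERVATIONS]
    tagged += [("restaurant", rec) for rec in RESTAURANT]
    return tagged


def collate(raw):
    """Data collation: bring every record into one common shape."""
    return [
        {"guest": rec["guest"], "source": source, "amount": rec["amount"]}
        for source, rec in raw
    ]


def store(rows):
    """Data storage: group the collated rows by guest (a dict stands in for a database)."""
    database = defaultdict(list)
    for row in rows:
        database[row["guest"]].append(row)
    return dict(database)


def mine(database):
    """Data mining, very loosely: rank guests by their total recorded spend."""
    totals = {
        guest: sum(row["amount"] for row in rows)
        for guest, rows in database.items()
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


ranking = mine(store(collate(collect())))
```

Here `ranking` lists the guests from highest to lowest total spend. A real data mining step would go well beyond a simple ranking; the sketch is only meant to show how the four processes feed one another.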

Copyright © 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.

INTRODUCTION

Throughout the past decade, customer relationship management (CRM) has become such a buzzword that in contemporary terms the concept is used to reflect a number of differing perspectives. In fact, although in essence CRM pivots on the fundamental underpinnings of data mining, the concept has been defined as essentially relating to sales, marketing, and even services automation. Additionally, CRM is increasingly associated with cost savings and streamlined processes as well as with the engendering, nurturing, and tracking of relationships with customers. Far fewer associations appear to be attributed to the creation, storage and mining of data, all essential components of CRM in both theoretical and practical terms.

In support of the close connection of CRM with data mining, it should be emphasized that in contemporary terms, the acronym CRM is used to refer to both customer relationship marketing and customer relationship management. Although customer relationship marketing and customer relationship management are indeed often regarded as specialised fields of study, within the discourse of this chapter it is argued that they are in fact inter-related. Consequently, throughout this chapter, the scope of CRM is intended to span from the development and marketing of relationships between organizations and their customers to the day-to-day management of these relationships. The collation, storage and mining of data are by all means implicitly encompassed within the associated processes conducted as part of CRM.

Throughout the past decade, CRM has been associated with various objectives and differing perspectives. Accordingly, while it is at times referred to as being synonymous with a form of marketing such as database marketing (Khalil & Harcar, 1999), services marketing (Grönroos, 1994), and customer partnering (Kamdampully & Duddy, 1999), at other times it is specified in terms of more specific marketing objectives such as customer retention (Walters & Lancaster, 1999a), customer share (Rich, 2000), and customer loyalty (Reichheld & Schefter, 2000). In fact, as Lindgreen and Crawford (1999, p. 231) succinctly summarise, more often than not the concept seems to be “described with respect to its purposes as opposed to its instruments or defining characteristics”. Meanwhile, the exact nature of the CRM approach remains persistently elusive while the realm of CRM remains unquestionably complex. This blurred outlook is poignantly emphasised in the definition that:

Essentially CRM relates to sales, marketing, services automation, but it is increasingly embracing enterprise-resource planning applications in order to deliver cost savings and more streamlined services within organizations, as well as tracking the relationships organizations have with their customers, and indeed, their suppliers. (Key Note, 2002a, p. 1)

Notwithstanding such complexity, it simply cannot be denied that CRM is intricately connected with data mining.

In line with the wide latitude afforded by its complexity, various themes have been discussed under the title CRM in both trade and academic literature. However, in spite of being extensive, as a whole this coverage still seems to lack coherence. Although in recent times CRM has been described as a triad combination of technology, people, and processes (Chen & Popovich, 2003; Galbreath & Rogers, 1999), the importance of data is unquestionable. Accordingly, this chapter seeks to illustrate how, although the product and service elements as well as organizational structure and strategies are central to CRM, data is the pivotal dimension around which the concept revolves in contemporary terms, in practice as well as in theory.

The technologies associated with data management, namely data collection, data collation, data storage and data mining, have undoubtedly influenced the evolution and implementation of information systems within companies. In fact, the central role of databases and data mining within the context of current CRM practices is so evident that it could even be argued that the concept quintessentially revolves around the collection and usage of data. Accordingly, current and emerging technologies have been associated, and are expected to continue to be associated, with databases. Database technologies indeed appear to have significantly contributed to the evolution of CRM. However, regardless of how theoretically valid the relevance of data may be to the concept of CRM, unless it is adequately implemented within operations, that is to say unless its significance
can be translated to operations, its benefits are unlikely to be fully reaped. Thus it is crucial that all the processes which complement the process of data mining are also focused upon.

In an attempt to illustrate the significance of data mining applications within the context of CRM, this chapter has opted to focus on one industry: the hotel industry. Hence, throughout this chapter, the actions of hotel companies have been used to consolidate the arguments and explicate how businesses within the hotel industry are trying to optimize the range of opportunities which the adequate use of data affords. Accordingly, the types of managerial practices, strategies and tactics being deployed by hotel chains in their attempts to facilitate CRM at the organizational and individual levels have been reviewed.

As a synthesis, the aim of this chapter is two-fold. While on one level it is intended to set a platform for briefly exploring how databases are subject to a range of important influences with respect to the underlying connections with customers, on another level it expects to consolidate how ultimately data, or the search for or even the exploitation of data, needs to be aligned with the individual capabilities and strategies of companies, and indeed with the reality as well as aspirations of organizations. Although the examples used within this chapter focus on the hotel industry, it is suggested that similar opportunities can be extended to other companies. However, the opportunities and benefits will be aligned to the dynamics of the said industries and the forces operating in the specific market within which these companies operate.

PROCESSES: THE KEY TO UNLOCKING THE SECRETS OF DATA

Until fairly recently, efficient facilities, standardized products, and lower costs have arguably been sufficient for companies to be able to satisfy customer needs (Chen & Popovich, 2003). However, with increased competition, mass marketing appears to have lost its glitter. Instead, relationship marketing and CRM have been hailed by organizations and academics as the solution to this change of consumer expectations. Notwithstanding, several academics including Palmer (1996) and Murphy (2001) have argued that if companies intend to optimally embrace CRM, they will need to realign their business offerings.

Furthermore, developments in information technology have dramatically enhanced the scope for the collection, analysis, and exploitation of information on customers (Long et al., 1999). However, these technological developments have also very likely led to an important trend, which has evidently centered itself on database marketing. As a concept, database marketing revolves around the implication that organizations can acquire and maintain extensive files of information on past and current customers as well as on prospects. Although database marketing may be regarded as being traditionally inherent to the specialised field of direct marketing, numerous authors including Moncrief and Cravens (1999) and Long, Hogg, Hartley, and Angold (1999) have acknowledged how its functions are being increasingly applied to enhance and refine relationships with customers in other areas of marketing too. Consequently, the database, the fundamental tool of traditional direct marketing, has become a pivotal instrument within such areas as the CRM arena, not only as far as interaction and the exchange of information between an organization and its customers are concerned, but also in the facilitation of processes such as the segmentation and targeting of these customers. Consequently, when companies engage in CRM, they also clearly have to engage in data collection, data storage and data mining processes. Therefore, strategic and even tactical CRM centers on data mining.

Supporters of the important role of technology within the CRM and the general business arena are numerous. Consequently, databases
and information systems have been increasingly favored. Fraser, Fraser, and McDonald (2000) even advocate that it is only when companies ensure that their organizational and systems changes remain one step ahead of their competitors’ that they can be said to be making the most appropriate use of technology. By contending that technology can set companies ahead of their competitors, Fraser et al. quintessentially appear to equate technology to a competitive advantage. However, failure is expected if companies believe that CRM is only a technology solution. CRM will only be successful if companies learn how to disseminate and exploit the information which they have collected on their customers in their databases. In other words, unless data mining is effectively conducted, CRM is highly unlikely to be utilized optimally and beneficially. In an attempt to explicate how these distinct yet complementary processes can be integrated to operate from a level platform, Overell (2004, p. 1) testifies that although, within the business environment, companies have instead tended to follow a flawed contingency and expected information technology “to solve management problems,” they should learn to “rethink functionally fragmented processes from the customers’ viewpoint.”

A consortium of academics, including Joplin (2001) and Nitsche (2002), consider that CRM is not a technological solution but a strategy. In fact, according to Joplin, CRM is the most important strategy which any company must adopt and develop if it wishes to remain competitive. The evolving properties of CRM as a strategic solution are emphasised by Nitsche when he argues that “technology is not a panacea” (2002, p. 208) and that the people and markets around which CRM revolves are “changing just as the competition is” (2002, p. 207). To be able to embrace the fast occurring changes, Nitsche advises that companies must reorganize themselves. Thus companies arguably need to restructure their strategies and tactics in line with emerging new market forces in order to capture the inherent and changing opportunities afforded by CRM and by databases. This reorganization and restructuring can indeed only be achieved through a review of functional and business processes. Accordingly, CRM should be implicitly linked to the capture, collation, storage, and mining of data. Additionally, it implies that the means and processes through which companies acquire, mine and use data also need to be continuously and consistently monitored.

In this context, the changes of the market environment have a direct impact on relationship marketing. According to Prabhaker (2001, p. 113), within the business environment, “two specific evolving forces have led to organizations having to rethink their business models.” These are the power of customers and the changes in technology. The effect of the dyadic synergy created by these two evolving forces is said to have been two-fold. On one hand, companies appear to have in general attempted to keep up with and adapt to these changes, and on the other hand more proactive companies appear to have learnt how to additionally leverage the advances in technology and computer-integrated control systems to significantly improve their own initial strategic capabilities. This latter contention is arguably aligned to Zahra et al.’s (1999) explanation about how technology can impact a company’s internal and external capabilities. Indeed, advances in database technologies have influenced the way in which companies have used databases and data within their operations as well as the processes they have followed to capture and to mine the said data.

Within the hotel industry, a tiered adaptation to the changing market forces associated with databases seems to prevail. Advances in technology have enabled operational benefits in terms of automation in back-office functions, such as reservations processes and check-in processes, to be generally reaped by hotel chains. Some hotel companies have evidently also attempted to benefit from other opportunities. For example in June 2002, when Travelodge started
to develop a new database of online customers as part of its strategy to double the 5000 reservations which the company took every week, the company’s website was also redesigned in line with up-to-date technologies in order to streamline its reservation process (Key Note, 2002b). During 2003, Corus & Regal Hotels Plc recategorized the profiles of the customers on its existing database in an attempt to engage in precise targeting. As part of their strategy to increase return on investment, all bookings made for any of the hotels within the group were redirected via the central reservations office or to their new marketing database. This new process was implemented in order to update existing records consistently and continuously and automatically create the profiles of new customers (Key Note, 2003). Through the centralization of its customer contacts, this hotel chain arguably also put itself in a better position to offer a more controlled view of the company to its existing and prospective customers. Perhaps even more importantly, such an integrated system arguably enables the hotel company to enhance its data mining opportunities.

CRM implies a detailed examination of the guest (Davies, 2000). As databases are essentially associated with the ability to provide exactly that, it is evident how CRM and databases are intricately linked. CRM systems may include functions relating to customer retention, customer profitability, customer response to marketing campaigns and even more mundane details such as whether customers prefer still to sparkling water. However, in order to achieve such objectives, companies have to adhere to some specific processes. These pivot around the processes associated with the capturing, storing, and mining of data on customers, as well as around the company’s use of the mined data. It is indeed argued that not only are data acquisition and data mining quintessential in the success of CRM, but so also are the ways companies use the mined data with regard to their strategies and even tactics.

THE DATABASE: THE PIVOTAL TOOL OF CRM

The Pareto principle, as applied to business, holds that roughly 80% of a company’s income comes from 20% of its customers. According to Bentley (2005), the ongoing challenge for hotel companies is to determine which specific customers represent that 20%. In an attempt to identify the profitable customers, hotel companies are increasingly investing in database infrastructure. Meanwhile, technological developments have very likely led to an important trend, which has evidently centred itself on database marketing.

As a concept, database marketing revolves around organizations acquiring and maintaining extensive files of information on past and current customers as well as on prospects. Although the objective of databases is to enable a better portrait of customers and their buying habits, ultimately they are intended not only to enable hotel companies to market their products, services and even special offers more effectively, but also to provide an improved personalised service to customers (Bentley, 2005). Although database marketing is traditionally associated with the specialised field of direct marketing, numerous authors including Moncrief and Cravens (1999) and Long et al. (1999) have acknowledged how its functions are being increasingly applied to enhance and refine relationships with customers. Consequently, the database has become a pivotal instrument within the CRM arena, not only as far as interaction and the exchange of information between an organization and its customers are concerned, but also in the facilitation of processes such as the segmentation and targeting of customers. Furthermore, as a result of the mining of the data captured on customers, precise targeting can be achieved.

According to Bradbury (2005), a database is a structured collection of information, which is not only indexed but also searchable. In general, databases are used for business applications such as the storage of customers’ data. Thus, previous
hotel reservations and even restaurant reservations may be held in a hotel chain’s database. In layman’s terms, databases may be compared to an electronic library, which receives fresh data, stores information and makes the latter accessible to an organization, thereby helping maintain a continuous learning loop (McDonald, 1998). In more implicit marketing terms, databases can be extended to form an extensive and multilevel process (Tapp, 2001). Within the CRM arena, it could be argued that databases are used not only to promote and facilitate interaction between an organization and its customers from the time of an initial response, but also to help with the measurement and analysis of such interactions. Simply put, the ongoing relationship between an organization and a customer can be systematically recorded in databases. In this respect, a sophisticated database can not only store data on active, dormant or lapsed customers but may even have the potential to identify prospects (McDonald, 1998; Tapp, 2001). Consequently, the increasingly integral role which databases have come to play in CRM campaigns appears well founded.

As stated earlier, developments in information technology have dramatically enhanced the scope for the collection, analysis and exploitation of information on customers (Long et al., 1999), and for these purposes data warehouses have been increasingly created by businesses. A data warehouse is essentially a giant database, which takes the raw information from the various systems within a hotel, such as central reservations and room service, and converts the data collated from all the sources into one easily accessible and ideally user-friendly set of data (Davies, 2000). When used effectively, data warehouses can not only gather data on a continuous basis but can also allow the precise segmentation of information about customers. Consequently, profitable interaction with customers can be increased and operations such as targeting and even customer service can be improved. As succinctly summarised by Davies (2000), the ultimate aim of data warehouses is by all means to help create customer retention. Large hotel chains have evidently been acquiring and storing customer data in a combined attempt to achieve competitive edge and improve the experience of customers. It even appears that hotel chains have realised the associated benefits of databases. For example, as a consequence of investing in customer relationship management software, Marriott International registered improvements in other areas, such as cross-selling and yield management (Caterer and Hotelkeeper, 2004).

The capability of databases to help track actual purchases of customers and enable inferences to predict future behaviour patterns may undoubtedly encourage the assumption that database marketing is routine within the embracing of CRM. Moncrief and Cravens’ (1999, p. 330) contention that “customer service levels increase when customer information becomes so easy to obtain and disperse” could by all means imply that databases are being efficiently and effectively used to acquire and maintain information on existing and prospective customers. Abbott (2001, p. 182) even advocates that refinements in technology have provided companies with increasing opportunities and well-structured channels not only to collect abundant amounts of data but also to manipulate this data in various ways so as to unravel any unforeseen areas of knowledge. However, several academics have expressed reservations, arguing that databases are not being optimally used.

CONVERTING DATA INTO COMPETITIVE EDGE

CRM is arguably a progression from data warehousing. At present, one of the principal functions of CRM systems is to collect as much data about each customer as is possible. As discussed earlier, this information is then stored, to be used at a later stage to give guests as much of a personalised service as possible when they return
(Davies, 2001). According to Cindy Green, the Senior Vice-President of Pegasus Business Intelligence, this will not only lead to a change in the sales and marketing arena but, even more importantly, will imply that hotel companies will need to become as advanced in the management of their customer relationships as technology will enable them to be (Davies, 2001). This change of perspective is arguably expected to engender a transition from the management of data about the customers to the management of interactive relationships. Accordingly, the data which hotel chains have compiled over the years about their customers would need to be used intelligently in order to enable predictions about consumer behavior as well as the anticipation of needs or even problems. Such data can be used precisely for target marketing campaigns. Indeed, as succinctly summarized by Green, CRM is in actual fact simply about a hotel company being willing and flexible enough to change its behavior in line with what customers are saying and what the data collated reveal about them.

The CRM concept has grown out of companies’ attempts to offer a better service to their customers than their competitors are offering (Gledhill, 2002). Within the hotel industry, as identified earlier in this chapter, one of the major elements appears to be the pursuit of streamlined back-office processes in order to achieve greater operational efficiencies. Technology has revolutionized operations within the hotel industry, as applications have already managed to smoothly link front-office processes such as check-in with back-office functionality such as reservation details. Additionally, in order to enhance their engagement in CRM, many hotel chains have invested in customized systems. Notwithstanding, as Chen and Popovich (2003, p. 682) succinctly remind us, despite the crucial role that technology and people play within the CRM arena, the philosophical bases of CRM (relationship marketing, customer profitability, lifetime value, retention and satisfaction) are in fact created through business process management.

According to Cindy Estis Green, from Driving Revenue, a consultancy that aims to help hotel companies add value to the data they collect from and about the customer, the management of a database involves three crucial stages (Goymour, 2001). Firstly, when all the data collected about a guest is consolidated into a usable set of information, the automated cleaning of data must be conducted. Secondly, the information about the guests must undergo segmentation in order for the hotel company to be able to precisely target the most attractive prospects and discard those suspects who do not meet the profiling criteria. Thirdly, the results of the targeting of specific guests must be tracked in order to determine which guests responded to the campaigns. This step will not only identify the profitable customers, but will ultimately also indicate which promotions are successful. Consequently, the adequacy of campaigns can be evaluated.

The general consensus is that an integrated and centralized database will enable a complete view of the customers within a hotel chain. Such a database is expected to collect ongoing information from all relevant sources and outlets, such as reservations and other point of sale systems located within the various hotels. Information from customer satisfaction questionnaires, surveys or even e-mails can also be fed into the database. The database would ideally be compiled so as to produce an integrated set of information in order to create a unified profile of each customer (Bentley, 2005).

According to Jane Waterworth, the marketing director at Shire Hotels, the standardizing of data is a process which hotel companies should take seriously, as it is vital to ascertain that they are in fact inputting the right data into their CRM system. According to Steve Clarke, the account director at marketing database company CDMS,
companies which are serious about CRM must consolidate their data. Otherwise customers may end up receiving the same information from various sources, thereby diluting marketing initiatives, and, more specifically for the company, no full view of a customer’s behavior would be achievable. Furthermore, as emphasised by Bentley (2005), without all the relevant information about a customer, any attempt to use data in a meaningful and precise way to enhance loyalty schemes or even marketing campaigns will be essentially flawed.

A central data warehouse can by all means combine information from many sources and help consolidate a comprehensive and reliable picture of a hotel’s clients. Although data warehouses can be clear and immediately accessible, Velibor Korolija, the operations director with the software specialist Bromley Group, argues that for business and marketing analysts, data warehouses are by no means enough. In fact, it is data mining, a process which involves the analysis of the data in an attempt to seek meaningful relationships not previously known, which Korolija advocates to be of utmost importance (Davies, 2000).

Data mining refers to the process of retrieving data from a data warehouse for analysis purposes. Data mining tools and technologies have been credited by such academics as Nemati and Barko (2003, p. 282) with having the potential to enhance the decision-making process by transforming data into valuable and actionable knowledge to gain a competitive advantage.

Although many databases may by all means be deemed to be appropriate data warehouses, it has been argued that the data mining process associated with many of these has been consistently flawed. In fact, in spite of several academics acknowledging the technological trend to rely on database marketing to acquire and maintain extensive information on existing and potential customers (Krol, 1999; Long et al., 1999; Moncrief & Cravens, 1999), such academics as Dyer (1998), Rich (2000), Joplin (2001) and Overell (2004) provide evidence to confirm that companies are not adequately using the information at their disposal to build and strengthen relationships with customers.

Moreover, according to Dyer (1998), many practitioners are failing to make optimum use of their client databases because not only is their information not being updated, but the available data is also not even being analysed adequately so as to produce pertinent qualitative and quantitative information from which future strategies and tactics could be drawn. Yet, Murphy (2001) advocates that not only does personalized data have to exist and be correct, but this data should also be correctly updated and be made available to the rest of the organization. Here, the general consensus is that this process should be rigidly adhered to whichever channel of communication the customer uses to interact with an organization (Key Note, 2002b). Although this step may not yet be adhered to across the hotel industry, there is an indication that some hotel chains have integrated this process in their systems. For instance, from 2003 all bookings made for any of the hotels within the Corus & Regal hotel chain have been redirected via the central reservations office or to their new marketing database so that the information on the database can be continuously updated. Accordingly, the records about existing customers are consistently updated while the profiles of new customers are automatically created (Key Note, 2003).

Highlighting a different shortcoming, Rich (2000) argues that companies are failing to use the information stored in their databases to build relationships with their customers, even though the latter could prove vital for marketers in their attempts to outperform their competitors in terms of providing a better service to customers. According to Overell (2004), marketers and companies are not even attempting to adequately analyze the data to an accepted level of depth. In spite of such contentions, Michael Gadbury, the vice-president of Aremissoft, a CRM software company, advocates that while two years ago only ten percent of hotel companies showed interest in making use of the data which they had collected about their customers, this percentage has risen to almost ninety percent in contemporary terms (Davies, 2001). It is anticipated that in recent years, even more companies have shown interest in adequately mining their customer database.

Although the integrated process of capturing, sifting, and interrogating data about customers may have been somehow flawed within some companies, companies have been so eager to capture data about their customers that accord-

experts in line with specifications requested by a hotel company, once unfolded within an organization, such systems tend to be monitored in-house. Luck suggests that internal employees may not have the adequate level of expertise that some of the filtering processes may call for. Furthermore, she also suggests that the high financial, human and technological resources needed to keep a data mining system up to date may also place too high demands on some companies. Arguably in attempts to curtail limitations and perhaps to enhance their CRM opportunities, hotel companies have increasingly entered into partnerships
ing to Overell (2004:1), “many organisations are with specialist agencies. While De Vere Group
sitting on mushrooming stockpiles of data.” This Plc enlisted the GB group to help create more
over zealous attitude towards the collection of data targeted and cost effective database campaign;
seems to have gripped the hotel companies too. Thistle Hotels Ltd worked closely with Arnold
Indeed, as is advocated by Geoffrey Breeze, the Interactive to design, develop and handle its online
Vice-President of marketing and alliance devel- strategy to increase its database from 50,000 to
opment at Hilton International, “hotels have far 500,000 profiles by the end of 2003 and its series
more information about their guests than they can of e-marketing campaigns (Key Note, 2003).
actually use” (Caterer and Hotelkeeper, 2000, p. As identified by Bradbury (2005), CRM is
14). However, Overell (2004) advocates that the meant to not only to help companies collect in-
general consensus among database experts is that formation about guests, but also, and even more
companies do not have much more understanding importantly it is meant to help companies use the
of customers than they did prior to their embrac- information collected about its customers more
ing of CRM. effectively. One of the ultimate steps within the
Nemati and Barko (2003, p. 282) offer a plau- data mining process is undeniably to cluster
sible explanation for the limited benefits reaped customers into segments, which are not only
from data mining when they explain that although meaningful but also reachable by CRM cam-
“management factors affecting the implementa- paigns. According to Korolija, it is by all means
tion of IT projects have been widely studied,” possible to cluster a hotel’s guests into very specific
“there is little empirical research investigating demographic groups (Davies, 2000). In serving
the implementation of organizational data-mining a number of closely-related purposes, customer
projects.” Furthermore, in pointing to a plausible segmentation has been portrayed as a means of
differential level of expertise between the collec- predicting behavior (Clemons & Row, 2000), a
tion of data and the actual mining and usage of method of detecting, evaluating, and selecting
this data, they also shed light on the inadequacy homogeneous groups (Reichheld & Schefter,
of training for the people at the various stages of 2000) and a way of identifying a target market for
the data mining process. For instance, it is notable which a competitive strategy can be formulated
that within the hotel industry, technical systems (Gulati & Garino, 2000). In more general terms,
tend not to be developed in-house (Luck & Lan- customer segmentation is accredited by enabling
caster, 2003) but commissioned through expert the identification of key consumer groups, thereby
agencies. While CRM systems are developed by favoring the effective targeting of such strategic

tools as CRM programmes. It could also be plausible to posit that customer segmentation enables precision targeting. Some hotel chains evidently appreciate the opportunities afforded by customer segmentation. For instance, in an attempt to target its guests precisely and cost-effectively, De Vere Group Plc restructured its customer database in 2003 into a range of customer categories such as debutantes and devoted stayers. This strategy was also intended to enhance cross-selling across the various brands to existing customers. In the same year, Corus & Regal Hotels Plc divided its database, which consisted of 68,000 profiles, into categories. These spanned from cold prospects to loyal customers (Key Note, 2003).

The varied outcomes of customer segmentation have been well documented. Benefits such as added protection against substitution, differentiation, and pricing stability have been cited by several authors including Walters and Lancaster (1999b) and Sinha (2000). Moreover, Ivor Tyndall, the head of customer intelligence at Le Meridien, states that as the company segments its consumer base, it can precisely target different sectors or segments with different offers (Bentley, 2005). Although Botschen, Thelen, and Pieters (1999) support the importance of segmenting customers at the benefit level, Long and Schiffman (2000) offer evidence to suggest that different segments of consumers may perceive benefits differently and consequently have differing degrees of affinity and commitment to CRM programmes and other benefits on offer.

The popularity of databases is increasing and, as is highlighted by Abbott (2001, p. 182), “vast databases holding terabytes of data are becoming commonplace.” However, if companies do not follow the correct processes to tap into the valuable data they have in their databases, new knowledge about customers will remain largely undiscovered (Rich, 2000). Indeed, it is likely that the assiduous collection of information about customers will be largely wasted. Consequently, although in theory borrowing from the arena of direct marketing seems pertinent to CRM strategy, transferring the theoretical advantages into practice appears to be an altogether different matter. Meanwhile, according to Felix Laboy, the chief executive officer of E-Site Marketing, when hotels are able to access more information about a guest and then offer that guest the individual service he or she needs, loyalty will be encouraged (Edlington, 2003). Moreover, such authors as Davies (2001a) and Bentley (2005) argue that when the data is correctly structured and hotel companies can target their marketing more effectively, loyalty schemes are expected to become more effective. To strengthen these arguments, Cindy Estis Green from Driving Revenue states that when a company shows that it cares about its guests through its offering of benefits, it can strongly influence the creation of customer loyalty (Goymour, 2001).

RETAINING OLD CUSTOMERS AND REACHING NEW ONES

It is well documented that retaining customers is more profitable than building new relationships. While Reichheld and Schefter (2000) discuss how customer retention is less costly than initiatives focusing on customer acquisition, Kandampully and Duddy (1999) even state that attracting new customers is five times more costly than retaining an existing customer. Consequently, the retention of existing customers has become a priority for businesses to survive and prosper. In view of their inherent long-term perspective, databases and CRM understandably appear to be ideal platforms for the achievement of this ongoing objective. As pertinently summarized by Chen and Popovich (2003), CRM strategy (and databases) can help to attract new customers but, even more importantly, helps to develop and maintain existing ones.

Efforts to retain customers have led to the refining of processes such as target marketing and segmentation within hotel operations. Furthermore, with direct marketing and database marketing having been repeatedly identified as two of the immediate forebears of relationship marketing in consumer markets (Long et al., 1999), companies have adopted processes initially associated with the specialised field of direct marketing to facilitate their CRM objectives. As such, precise targeting, described by Lester (2004, p. 4) as “the ability to deliver accurate and exact marketing messages to people at a narrow customer segment level”, is almost expected to be routinely applied as part of CRM programs. In fact, such is the commonality between direct marketing and CRM that, despite the criticisms highlighted earlier in this chapter about the current poor level of data mining, it would seem almost imprudent for hotel chains embracing CRM not to focus on this process within their operations.

Through a combination of technology and business processes, which attempts to find out and understand who customers are and what they do, and to identify their likes and dislikes (Couldwell, 1998), CRM and its databases can and do facilitate the understanding of the customer. In turn, an understanding of customers in line with the dynamics of organizations is expected not only to help design systems which meet customers’ needs more effectively, but also to lead to stronger customer loyalty and lasting relationships. As the achievement of loyalty appears to be sought within the hotel industry (Palmer, McMahon-Beattie & Beggs, 2000; Tepeci, 1999), several processes have been integrated into hotel operations to work alongside databases in order to find out more about customers’ needs and wants. Accordingly, customer satisfaction surveys and customer service questionnaires are routinely distributed to guests in an attempt to improve operations and understand customers better, and indeed to gather more information for the databases.

Meanwhile, in describing CRM as “an enterprise-wide customer centric business model that must be built around the customer,” Chen and Popovich (2003, p. 682) arguably imply that when hotel chains embrace CRM, rather than keep customers at the end of the value chain (Jobber, 1998), they should instead put customers at the start of operations. This reversal of the direction of the traditional value chain means that companies will have to shift their CRM endeavours from a mass marketing perspective, where customers are sought for products or services, to one of developing products and services which are actually tailored to fit the needs of the company’s targeted customers. In this perspective, CRM appears to call for a reversal of some traditional processes and the integration of data into all levels of an organization. Thus, it is crucial that hotel companies strive to maintain and enhance the data they collect about their targeted guests. Only when they ensure that processes are being correctly implemented within their operations will hotel companies truly be in a position to assess what their customers actually seek. Only then will a unified and optimum CRM offering be possible. Consequently, the database is indeed the pivotal tool around which contemporary CRM revolves.

FUTURE TRENDS

Databases are subject to a range of important influences in addition to technological advances. Although internal capabilities are certainly a determining factor in how a company uses databases and data, the way in which these are embraced is going to differ from company to company and, perhaps even more importantly, from industry to industry. It is argued that although there is a common platform from which data can be used, the means and usage are going to be dictated by the dynamics of the industry and product or service. As such, future research is recommended into the dynamics of specific industries in order to determine the relevancy of databases and data to that
specific industry. Indeed, this chapter set out to illustrate the applicability of data and databases to the hotel industry, and not only are the individual capabilities and strategies of companies relevant, but so is the reality of organizations operating within a micro and macro business environment. These unquestionably demand intensive research and experimentation through proper feedback and monitoring tools.

CONCLUSION

CRM has been hailed as a powerful tool in the quest for strengthening relationships with customers. A triad of technology, people and processes can arguably enable hotel chains not only to implement CRM within their operations, but also to reap the opportunities which the concept can provide. However, in recent times it appears that most attention has been focused on technology rather than on the capturing and mining of data.

Technology has greatly enhanced the processes associated with the implementation, evaluation and monitoring of CRM. Database technologies have driven CRM into a new era, not only in terms of storing and mining information to help make sales, but also in accessing customers, gathering data and even targeting campaigns. The importance of the database within CRM is in fact so unquestionable that it can be said that the database is now the central tool of CRM.

Although it is not denied that technology is crucial in the facilitation of CRM and as such has attracted much investment, it is emphasised that the optimization of CRM also requires the organization of business processes. Although the data mining processes associated with the hotel industry have been somewhat flawed, hotel chains of all sizes depend on databases and appear to be increasingly developing and implementing database technologies. However, it is argued that the acquisition of a sophisticated database is by no means sufficient to reap the benefits of CRM. Indeed, the effectiveness of data mining procedures is crucial if successful CRM is to be achieved. Consequently, companies are expected not only to continuously view their organizations from the customers’ perspective but also, just as importantly, to gear operations to actively take in customer feedback as well as market and technological changes.

When these processes are consistently and continuously integrated, applied, and monitored, it is expected that companies would be able to gather and disseminate the right type of data to optimally achieve their CRM objectives. Indeed, successful CRM does not just emerge or simply exist. Thus, it is argued that the creation and establishment of successful customer relationships confront companies with a complex range of relationship and network management tasks beyond those inherent to their traditional operations and structures. The principal processes of data management, namely data acquisition, collation and mining, are indeed integral to these business functions.

REFERENCES

Abbott, J. (2001). Data data everywhere – and not a byte of use? Qualitative Market Research: An International Journal, 4(3), 182-192.

Bentley, R. (2005, August 25). Data with destiny. Caterer & Hotelkeeper, 38.

Bradbury, D. (2005, August 31). Technology Jargon Buster. Caterer & Hotelkeeper,

Botschen, G., Thelen, E. M. & Pieters, R. (1999). Using means-end structures for benefit segmentation: An application to services. European Journal of Marketing, 33(1/2).

Caterer & Hotelkeeper (2000, September 7). Hotel groups deny they’re missing Web opportunities, 14.


Caterer & Hotelkeeper (2004, 24 June). Do the knowledge, 34.

Chen, I. J. & Popovich, K. (2003). Understanding customer relationship management (CRM); People, process and technology. Business Process Management Journal, 9(5), 672-688.

Clemons, E. & Row, M. (2000, November 13). Behaviour is key to web retailing strategy. Financial Times.

Couldwell, C. (1998, May 21). A data day battle. Computing, 64-66.

Davies, A. (2000, 29 June). Data’s the way to do it, Caterer & Hotelkeeper, 31-32.

Davies, A. (2001, 26 July). On-line, on course. Caterer & Hotelkeeper, 37-39.

Dyer, N. A. (1998). What’s in a relationship (other than relations)? Insurance Brokers Monthly & Insurance Adviser, 48(7), 16-17.

Edlington, S. (2003, January 20). Future perfect? Caterer & Hotelkeeper, 26.

Fraser, J., Fraser, N., & McDonald, F. (2000). The strategic challenge of electronic commerce. Supply Chain Management: An International Journal, 5(1), 7-14.

Galbreath, J. & Rogers, T. (1999). Customer relationship leadership: A leadership and motivation model for the twenty-first century business. The TQM Magazine, 11(3), 161-171.

Gledhill, B. (2002, February 28). Learning from history. Caterer & Hotelkeeper, 33.

Goymour, A. (2001, 26 July). Host in the machine. Caterer & Hotelkeeper, 43-45.

Grönroos, C. (1994). From scientific management to service management: A management perspective for the age of service competition. International Journal of Service Management, 5(1), 5-20.

Gulati, R. & Garino, J. (2000, May-June). Get the right mix of bricks and mortar. Harvard Business Review, 107-114.

Jobber, D. (1998). Principles of marketing (2nd ed.). McGraw-Hill.

Joplin, B. (2001, March/April). Are we in danger of becoming CRM lemmings? Customer Management, 81-85.

Kandampully, J. & Duddy, R. (1999). Relationship marketing: a concept beyond primary relationship. Marketing Intelligence & Planning, 17(7), 315-323.

Key Note (2002a). Customer Relationship Management.

Key Note (2002b). Hotels.

Key Note (2003). Hotels.

Khalil, O. E. M. & Harcar, T. D. (1999). Relationship marketing and data quality management. SAM Advanced Management Journal, 64(2).

Krol, C. (1999, May). A new age: It’s all about relationships. Advertising Age, 70(21), S1-S4.

Lester, T. (2004, March 31). Pitfalls of precision bombing. FT Management, 4.

Lindgreen, A. & Crawford, I. (1999). Implementing, monitoring and measuring a programme of relationship marketing. Marketing Intelligence & Planning, 17(5), 231-239.

Long, G., Hogg, M. K., Hartley, M. & Angold, S. J. (1999). Relationship marketing and privacy: Exploring the thresholds. Journal of Marketing Practice: Applied Marketing Science, 5(1), 4-20.

Long, M. M. & Schiffman, L. G. (2000). Consumption values and relationships: Segmenting the market for frequency programs. Journal of Consumer Marketing, 17(3).


Luck, D. & Lancaster, G. (2003). E-CRM: Customer relationship marketing in the hotel industry. Managerial Auditing Journal – Accountability and the Internet, 18(3), 213-232.

McDonald, W. J. (1998). Direct marketing: An integrated approach. McGraw-Hill International Editions.

Moncrief, W. C. & Cravens, D. (1999). Technology and the changing marketing world. Marketing Intelligence and Planning, 17(7), 329-332.

Murphy, J. M. (2001, March-April). Customer excellence: From the top down. Customer Management, 36-41.

Nemati, H. R. & Barko, C. D. (2003). Key factors for achieving organizational data-mining success. Industrial Management & Data Systems, 103(4), 282-292.

Nitsche, M. (2002, January-March). Developing a truly customer-centric CRM system: Part One – Strategic and architectural implementation. Interactive Marketing, 3(3), 207-217.

Overell, S. (2004, March 31). Customers are not there to be hunted. FT Management, 2.

Palmer, A. (1996). Relationship marketing: A universal paradigm or management fad? The Learning Organisation, 3(3), 18-25.

Palmer, A., McMahon-Beattie, U. & Beggs, R. (2000). A structural analysis of hotel sector loyalty programmes. International Journal of Contemporary Hospitality Management, 12(1), 54-60.

Prabhaker, P. (2001). Integrated marketing-manufacturing strategies. Journal of Business & Industrial Marketing, 16(2), 113-128.

Reichheld, F. & Schefter, P. (2000, July/August). E-loyalty. Harvard Business Review, 105-113.

Rich, M. K. (2000). The direction of marketing relationships. The Journal of Business & Industrial Marketing, 15(2/3), 170-179.

Sinha, I. (2000, March/April). Cost transparency: The Net’s real threat to prices and brands. Harvard Business Review, 43-55.

Tapp, A. (2001). Principles of direct marketing (2nd ed.). Prentice Hall.

Tepeci, M. (1999). Increasing brand loyalty in the hospitality industry. International Journal of Contemporary Hospitality Management, 11(5).

Walters, D. & Lancaster, G. (1999a). Value and information – Concepts and issues for management. Management Decision, 37(8), 643-656.

Walters, D. & Lancaster, G. (1999b). Using the Internet as a channel for commerce. Management Decision, 37(10), 800-816.

Zahra, S., Sisodia, R., & Matherne, B. (1999, April). Exploiting the dynamic links between competitive and technology strategies. European Management Journal, 17(2), 188-201.


Chapter VII
Mining Allocating Patterns in
Investment Portfolios
Yanbo J. Wang
University of Liverpool, UK

Xinwei Zheng
University of Durham, UK

Frans Coenen
University of Liverpool, UK

ABSTRACT

An association rule (AR) is a common type of mined knowledge in data mining that describes an implicative co-occurring relationship between two sets of binary-valued transaction-database attributes, expressed in the form of an 〈antecedent〉 ⇒ 〈consequent〉 rule. A variation of ARs is the weighted association rule (WAR), which addresses the weighting issue in ARs. In this chapter, the authors introduce the concept of the “one-sum” WAR and name such WARs allocating patterns (ALPs). An algorithm is proposed to extract hidden and interesting ALPs from data. The authors further indicate that ALPs can be applied in portfolio management: first, a collection of investment portfolios is modelled as a one-sum weighted transaction-database that contains hidden ALPs; second, the authors show that ALPs, mined from the given portfolio-data, can be applied to guide future investment activities. The experimental results show good performance, which demonstrates the effectiveness of using ALPs in the proposed application.
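
As a rough illustration of the terms used in the abstract, the following Python sketch models a handful of portfolios as a one-sum weighted transaction-database and computes the classical support and confidence of a rule over the items alone. The item names, the weights, and the helper functions `is_one_sum`, `support`, and `confidence` are hypothetical and chosen only for this example. The sketch is not the ALP mining algorithm proposed in the chapter, and it does not show how item weights would enter a mined rule.

```python
from math import isclose

# Hypothetical portfolios: each maps an investment-item to its share of the funds.
portfolios = [
    {"A": 0.5, "B": 0.3, "C": 0.2},
    {"A": 0.6, "B": 0.4},
    {"B": 0.7, "C": 0.3},
    {"A": 0.25, "B": 0.25, "C": 0.5},
]


def is_one_sum(transaction):
    """True if every weight lies in (0, 1] and the weights sum to 1."""
    weights = list(transaction.values())
    return all(0 < w <= 1 for w in weights) and isclose(sum(weights), 1.0)


def support(itemset, database):
    """Fraction of transactions that contain every item of `itemset`."""
    hits = sum(1 for t in database if itemset.issubset(t))
    return hits / len(database)


def confidence(antecedent, consequent, database):
    """support(X ∪ Y) / support(X) for the rule X ⇒ Y.

    Assumes the antecedent occurs at least once in the database.
    """
    return support(antecedent | consequent, database) / support(antecedent, database)


# Every portfolio allocates exactly the whole of its funds.
assert all(is_one_sum(t) for t in portfolios)

print(support({"A", "B"}, portfolios))       # 0.75
print(confidence({"A"}, {"B"}, portfolios))  # 1.0
```

In this toy database, items A and B appear together in three of the four portfolios, and every portfolio that holds A also holds B, so the rule A ⇒ B has support 0.75 and confidence 1.0.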

Copyright © 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.

INTRODUCTION

Investments (Bodie, Kane, Marcus, & Ryan, 2003; Cuthbertson & Nitzsche, 2001) is one of the major schools of financial research, paralleling corporate finance (Damodaran, 2001), personal financial planning (Ho & Robinson, 2001), financial engineering (Neftci, 2004), and so forth. Portfolio management, which aims to minimize the overall risk while maximizing the total expected return of an investment activity, is perhaps one of the most indispensable tools available in investments. It diversely “allocates” a given amount of assets/funds across a variety of investment-items (e.g., bonds, funds, options, and stocks). In Ho and Robinson (2001), diversification (Farrell, 2006) was introduced as a principle of investments. There are three dimensions of diversification (Ho & Robinson, 2001): (1) diversity across items/assets within the same investment-security, (2) diversity across different securities of investments, and (3) diversity internationally. When addressing diversification in portfolio management, it is recommended to invest in a portfolio that consists of a set of uncorrelated investment-items or negatively correlated investment-item pairs, an approach noted as the correlation coefficient based portfolio theory (Ho & Robinson, 2001).

Data mining (Bramer, 2007; Han & Kamber, 2001; Han & Kamber, 2006; Hand, Mannila & Smyth, 2001; Thuraisingham, 1999) is a promising area of current research and development in computer science, which is attracting more and more attention from a wide range of different groups of people. It aims to extract various types of hidden, interesting, previously unknown and potentially useful knowledge (e.g., rules, patterns, regularities, customs, and trends) from databases, where the volume of a collected database can be measured in GBytes. In data mining, common types of mined knowledge include: association rules (Agrawal & Srikant, 1994), classification rules (Quinlan, 1993), prediction rules (Han & Kamber, 2001), classification association rules (Ali, Manganaris & Srikant, 1997), clustering rules (Mirkin & Mirkin, 2005), emerging patterns (Dong & Li, 1999), sequential patterns (Wang & Yang, 2005), and so forth. In the past decade, data mining techniques have been widely applied in, for example, bioinformatics (Wang, Zaki, Toivonen & Shasha, 2005), e-commerce (Raghavan, 2005), geography (Miller & Han, 2001), and marketing and sales studies (Berry & Linoff, 1997). Kovalerchuk and Vityaev (2000) systematically discussed, within the scope of computational finance, the necessity and/or possibility of employing data mining technologies and/or methodologies in financial research.

Portfolio management, in general terms, refers to the overall process of creating appropriate portfolio strategies that will ensure/almost-ensure profits in future investment activities. The portfolio management process has been analysed extensively in the literature, but a unique scheme has not yet been agreed upon. The stages of the portfolio management process usually include:

1. Investment-item selection: Where a number of investment-items/assets that will be included in a “potential” portfolio are selected.
2. Investment-item return prediction: Where the expected return of each asset, selected in stage 1, is predicted.
3. Investment-item weight determination: Where a candidate portfolio is generated by assigning a suitable weight to each asset, based on the result of stage 2.
4. Portfolio selection: Where the “best” portfolio strategy is selected from a number of alternative candidate portfolios that are generated by iteratively processing stages 1, 2 and 3. Best in this case is defined according to the return and risk of the candidate portfolio.

In the past decade, research in portfolio management has demonstrated an interest in some data mining and/or machine learning concepts (Hung, Liang & Liu, 1996; Lazo, Maria, Vellasco, Aurelio & Pacheco, 2000; Tseng, 2004; Wang & Weigend, 2004; Zhang & Zhou, 2004). A number of approaches in such research are summarized as follows:

• John, Miller, and Kerber (1996) developed a rule induction based stock selection system, namely Recon. This system marks
“stocks with returns in the top 20% in a given quarter as exceptional and the rest as unexceptional” (pp. 52-53); and analyses “a historical database and produce rules that would classify present stocks as exceptional or unexceptional future performers” (p. 53).
• A hybrid approach that generates candidate portfolios by integrating the well-known APT (arbitrage pricing theory) model (Ross, 1976) with neural networks was introduced by Hung, Liang, and Liu (1996). In this approach, “an APT model can be used to determine prices, and then a neural network predicts the trend of each risk factor in the future” (Zhang & Zhou, 2004, p. 517). A portfolio selection mechanism was further developed in Hung, Liang, and Liu (1996) to select the optimal/best portfolio(s) by computing a performance score for each generated candidate portfolio.
• Kohara, Ishikawa, Fukuhara, and Nakamura (1997) incorporated prior knowledge with artificial neural networks “to improve the performance of stock market prediction” (Yu, Wang & Lai, 2005, p. 336).
• Quah and Srinivasan (1999) proposed an artificial neural network stock selection system “to select stocks that are top performers from the market and to avoid selecting under performers” (Yu, Wang & Lai, 2005, p. 336).
• Lazo, Maria, Vellasco, Aurelio, and Pacheco (2000) describe “a hybrid model for portfolio selection and management, which comprised three modules: a genetic algorithm for the selection of the assets that are going to form the investment portfolio, a neural net for the prediction of the returns on the assets in the portfolio, and a genetic algorithm for the determination of the optimal weights for the assets” (Zhang & Zhou, 2004, p. 517).
• Enke and Thawornwong (2005) proposed an approach that utilizes data mining and/or machine learning techniques to forecast stock market returns. In this approach, “an information gain technique used in machine learning for data mining” (p. 927) is introduced to evaluate “the predictive relationships of numerous financial and economic variables” (p. 927); and “neural network models for level estimation and classification are then examined for their ability to provide an effective forecast of future values” (p. 927).

Contribution

In this chapter, the authors introduce a novel type of mined knowledge in data mining, namely allocating patterns (ALPs), which can be recognized as a variation of the traditional association rules (ARs) in a special weighted setting. An ALP is a “one-sum” weighted AR (WAR), where each item involved in an AR is associated with a weighting score between 0 and 1, and the sum of all AR item weights is 1. An ALP not only indicates the implicative co-occurring relationship between two sets of binary-valued transaction-database attributes (items) in a weighted setting, but also conveys the allocating relationship among AR items (e.g., 〈allocating weight/quota a to item X〉 ⇒ 〈the allocation of both quotas b and c to items Y and Z〉, where 0 < a, b, c < 1, and a + b + c = 1). An algorithm is proposed to extract all hidden and interesting ALPs from a one-sum weighted transaction-database (the well-established transaction-database in a one-sum weighting fashion). With regard to portfolio management, the authors model a collection of investment portfolios as a one-sum weighted transaction-database that contains hidden ALPs, and suggest that a set of ALPs, mined from the given portfolio-data, can be treated as the candidate portfolios that can be further applied to guide future investment
activities. It is believed that ALPs will also prove to be useful in several other areas. The experiments are conducted using two sets of possible investment portfolios generated from the CSMAR (China Stock Market & Accounting Research) China Stock Trade and Quote Research Database (CSTQR⋅Database). The experimental results show good performance regarding both the rate of obtaining “qualified” candidate portfolios and the monthly average return of the obtained candidate portfolios (as used in Hung, Liang & Liu, 1996). The results provide evidence of the effectiveness of addressing ALPs in the proposed portfolio management application.

Chapter Organization

The following section describes the related data mining aspects in association rule mining (ARM) and weighted association rule mining (WARM). In the third section the concept of ALP is introduced, based on describing the one-sum weighted transaction-database, one-sum weighted itemsets and the corresponding WARs. An algorithm is proposed in the fourth section that identifies all hidden and interesting ALPs in a given one-sum weighted transaction-database. In the fifth section, the authors further suggest an application of mining a set of ALPs in a collection of investment portfolios. Experiments are presented in the sixth section that demonstrate the effectiveness of using ALPs in the proposed application. Finally the conclusions and a number of open issues for future research are discussed at the end of this chapter.

Related Work

Association Rule Mining

Association rule mining (ARM), first introduced in Agrawal, Imielinski, and Swami (1993), aims to extract a set of ARs from a given transaction-database DT. Let I = {a1, a2, …, an-1, an} be a set of items (database attributes), and Ŧ = {T1, T2, …, Tm-1, Tm} be a set of transactions (database records). DT is described by Ŧ, where each Tj ∈ Ŧ comprises a set of items I’ ⊆ I. In ARM, two threshold values are usually used to determine the significance of an AR:

• Support: The frequency with which the items occur or co-occur in Ŧ. A support threshold σ, defined by the user, is used to distinguish frequent items from the infrequent ones. A set of items S is called an itemset, where S ⊆ I and ∀ai ∈ S co-occur at least once in Ŧ. If the frequency of S in Ŧ exceeds σ, S is defined as a Frequent Itemset (FI).
• Confidence: Represents how “strongly” an itemset X implies another itemset Y, where X, Y ⊆ I and X ∩ Y = ∅. A confidence threshold a, supplied by the user, is used to distinguish high confidence ARs from low confidence ARs.

An AR X 〈antecedent〉 ⇒ Y 〈consequent〉 is said to be valid when the support for the co-occurrence of X and Y exceeds σ, and the confidence of this AR exceeds a. The computation of support is: count(X ∪ Y) / |Ŧ|, where count(X ∪ Y) is the number of transactions in Ŧ that contain X ∪ Y and |Ŧ| is the size of the set Ŧ. The computation of confidence is: support(X ∪ Y) / support(X). Informally, X ⇒ Y can be interpreted as: if X exists, it is likely that Y also exists. With regard to the history of ARM investigation, three major categories of ARM algorithms can be identified: (1) mining ARs from all possible FIs, (2) mining ARs from maximal frequent itemsets (MFIs), and (3) mining ARs from frequent closed itemsets (FCIs).

Mining ARs from FIs

In the past decade, many algorithms have been introduced to mine ARs from identified FIs. These algorithms can be further grouped into different “families”, such as pure-apriori like, semi-apriori like, set enumeration tree like, and so forth.
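The support and confidence measures defined in this subsection can be illustrated with a minimal Python sketch. The function names, the toy transactions, and the convention of returning 0.0 when the antecedent never occurs are assumptions made for illustration only; they are not taken from the cited works.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(x, y, transactions):
    """support(X ∪ Y) / support(X) for the rule X ⇒ Y."""
    sup_x = support(x, transactions)
    if sup_x == 0:
        return 0.0  # assumed convention: an unseen antecedent gives no evidence
    return support(set(x) | set(y), transactions) / sup_x


# Hypothetical toy transaction-database (four records).
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "ham", "milk"}),
    frozenset({"bread", "ham"}),
    frozenset({"milk"}),
]

sup_bread_milk = support({"bread", "milk"}, transactions)
conf_bread_milk = confidence({"bread"}, {"milk"}, transactions)
```

Here {bread, milk} occurs in two of the four records, and bread alone in three, so the rule 〈bread〉 ⇒ 〈milk〉 has support 2/4 and confidence (2/4) / (3/4).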


• Pure-apriori like, where FIs are generated based on the generate-prune level-by-level iteration that was first promulgated in the apriori algorithm (Agrawal & Srikant, 1994). In this family archetypal algorithms include: Apriori, AprioriTid and AprioriHybrid (Agrawal & Srikant, 1994), partition (Savasere, Omiecinski & Navathe, 1995), DHP (direct hashing and pruning) (Park, Chen & Yu, 1995), sampling (Toivonen, 1996), DIC (dynamic itemset counting) (Brin, Motwani, Ullman & Tsur, 1997), CARMA (continuous association rule mining algorithm) (Hidber, 1999), and so forth. It can be remarked that the well-established apriori algorithm has been the basis of many subsequent ARM and/or ARM-related algorithms. The apriori algorithm is sketched in Algorithm 1.
• Semi-apriori like, where FIs are generated by enumerating candidate itemsets but without applying the apriori generate-prune iterative approach founded on (1) the join procedure, and (2) the prune procedure that employs the “closure property” of itemsets: if an itemset is frequent then all its subsets will also be frequent; if an itemset is infrequent then all its supersets will also be infrequent. In this family typical algorithms include: AIS (Agrawal⋅Imielinski⋅Swami) (Agrawal, Imielinski & Swami, 1993), OCD (off-line candidate determination) (Mannila, Toivonen & Verkamo, 1994), SETM (SET oriented mining) (Houtsma & Swami, 1995), and so forth.
• Set enumeration tree like, where FIs are generated through constructing a set enumeration tree structure (Rymon, 1992) from DT, which avoids the need to enumerate a large number of candidate itemsets. In this family a number of approaches can be further divided into two main streams: (1) Apriori-TFP (apriori-total⋅from⋅partial) based (e.g., Coenen & Leng, 2001; Coenen & Leng,

Algorithm 1. The apriori algorithm

Input: (a) A transaction-database DT;
(b) A support threshold σ;
Output: A set of frequent itemsets SFI;
Begin Algorithm:
(1) k := 1;
(2) SFI := prepare an empty set for holding the identified frequent itemsets;
(3) generate all candidate 1-itemsets from DT;
(4) while (candidate k-itemsets exist) do
(5) determine support for candidate k-itemsets from DT;
(6) add frequent k-itemsets into SFI;
(7) remove all candidate k-itemsets that are not sufficiently supported to give
frequent k-itemsets;
(8) generate candidate (k + 1)-itemsets from frequent k-itemsets using “closure
property” (see semi-apriori like);
(9) k := k + 1;
(10) end while
(11) return (SFI);
End Algorithm
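The level-by-level loop of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than the implementation of Agrawal and Srikant (1994): the function name, the dictionary return type, and the choice of "at least σ" as the frequency test are assumptions, and the join and prune steps are written in the simplest (not the most efficient) way.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frequent itemset: support}, generated level by level."""
    db = [frozenset(t) for t in transactions]
    n = len(db)
    frequent = {}
    # Candidate 1-itemsets: every item that occurs in the database.
    candidates = {frozenset([item]) for t in db for item in t}
    k = 1
    while candidates:
        # Determine support and keep the sufficiently supported candidates.
        level = {}
        for c in candidates:
            s = sum(1 for t in db if c <= t) / n
            if s >= min_support:
                level[c] = s
        frequent.update(level)
        k += 1
        # Join frequent (k-1)-itemsets, then prune with the closure property:
        # every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            a | b
            for a, b in combinations(level, 2)
            if len(a | b) == k
            and all(frozenset(s) in level for s in combinations(a | b, k - 1))
        }
    return frequent


# Hypothetical toy database.
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c"},
]
frequent_itemsets = apriori(transactions, min_support=0.5)
```

With this toy data every single item and every pair is frequent, while {a, b, c} occurs in only two of the five records and is therefore discarded at the third level.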


2002; Coenen, Goulbourne & Leng, 2001; Coenen, Leng & Ahmed, 2004; Coenen, Leng & Goulbourne, 2004; etc.), and (2) FP-tree (Frequent⋅Pattern-tree) based (e.g., El-Hajj & Zaiane, 2003; Han, Pei & Yin, 2000; Liu, Pan, Wang & Han, 2002; etc.).

Mining ARs from MFIs

It is apparent that the size of a complete set of FIs can be very large. The concept of MFI (Roberto & Bayardo, 1998) was proposed to find several “long” (super) FIs in DT, which avoids the redundant work required to identify “short” FIs. The concept of vertical mining has also been effectively promoted in this category (Zaki, Parthasarathy, Ogihara & Li, 1997). Vertical mining, first mentioned in Holsheimer, Kersten, Mannila, and Toivonen (1995), deals with a vertical transaction database DTV, where each database record represents an item that is associated with a list of its relative transactions (the transactions in which it is present). MFI algorithms include: MaxEclat/Eclat (Zaki, Parthasarathy, Ogihara & Li, 1997), MaxClique/Clique (Zaki, Parthasarathy, Ogihara & Li, 1997), Max-Miner (Roberto & Bayardo, 1998), Pincer-Search (Lin & Kedem, 1998), MAFIA (MAximal Frequent Itemset Algorithm) (Burdick, Calimlim & Gehrke, 2001), Genmax (Gouda & Zaki, 2001), and so forth.

Mining ARs from FCIs

Algorithms belonging to this category extract ARs through generating a set of FCIs from DT. In fact the support of some sub-itemsets of an MFI might be hard to identify, resulting in a further difficulty in the computation of confidence. The concept of FCI (Pei, Han & Mao, 2000) is proposed to improve this property of MFI, which avoids the difficulty of identifying the support of any sub-itemsets of a relatively long FI. An FCI f is an itemset S ∈ DT, where f is frequent, and ¬∃ itemset f’ ⊃ f such that f’ shares a common support with f. The relationship between FI, MFI and FCI is that MFI ⊆ FCI ⊆ FI (Burdick, Calimlim & Gehrke, 2001). In this category algorithms include: CLOSET (mining CLOsed itemSETs) (Pei, Han & Mao, 2000), CLOSET+ (Wang, Han & Pei, 2003), CHARM (closed association rule mining; the ‘H’ is gratuitous) (Zaki & Hsiao, 2002), MAFIA (Burdick, Calimlim & Gehrke, 2001), and so forth.

Weighted Association Rule Mining

Weighted association rule mining (WARM), first introduced in Cai, Fu, Cheng, and Kwong (1998), aims to address the weighting issue in ARM investigation and extract WARs from a weighted transaction-database. In the past decade, a number of alternative approaches have been subsequently described in WARM (e.g., Lu, Hu & Li, 2001; Tao, Murtagh & Farid, 2003; Wang, Yang & Yu, 2000; etc.). Broadly, WARM approaches can be categorized into three groups: (1) mining horizontal WARs, (2) mining vertical WARs, and (3) mining mixed WARs.

Mining Horizontal WARs

The Traditional Approach
Cai, Fu, Cheng, and Kwong (1998) introduced the concept of weighted item based on a “real-life” marketing experience: not all goods share the same importance in a market. With regard to a retailing business, mining from weighted items/goods enables the generation of ARs with more emphasis on some particular goods (e.g., goods that are under promotion, goods that always make significant profits) and less emphasis on other goods. The idea of mining ARs in a special transaction-database, where each item is assigned a weighting score, directly depicts the problem of mining horizontal WARs. Let IW = {aW1, aW2, …, aWn-1, aWn} be a set of weighted items, where each aWi ∈ IW is an item ai ∈ I labelled with a user-defined weighting score wi (0 ≤ wi ≤ 1). Let


Ŧ = {T1, T2, …, Tm-1, Tm} be a set of transactions. A horizontal weighted transaction-database DWT is described by Ŧ, where each Tj ∈ Ŧ comprises a set of weighted items IW’ ⊆ IW.
To measure the significance of a horizontal WAR, the “weighted-support / weighted-confidence” approach, as an extension of the well-established “support / confidence” framework, was introduced in Cai, Fu, Cheng, and Kwong (1998). A horizontal weighted support threshold σW is supplied by the user that distinguishes frequent horizontal weighted itemsets from the infrequent ones. A horizontal weighted itemset XW ∪ YW is considered to be frequent if (∑aWi ∈ (XW ∪ YW) wi) * support(XW ∪ YW) ≥ σW, where XW, YW ⊆ IW and XW ∩ YW = ∅. Having a set of frequent horizontal weighted itemsets generated from DWT, a set of horizontal WARs can be further obtained. A horizontal WAR XW ⇒ YW is said to be valid when XW ∪ YW is frequent, and ((∑aWi ∈ (XW ∪ YW) wi) * support(XW ∪ YW)) / ((∑aWi ∈ XW wi) * support(XW)) ≥ aW, where aW is a user-defined horizontal weighted confidence threshold.

The Variation Approach
Wang, Yang, and Yu (2000) proposed an alternative approach of mining horizontal WARs by introducing a variational horizontal weighted transaction-database DWT*. With regard to real-life marketing, the newly mined horizontal WARs “can not only improve the confidence in the rules, but also provide a mechanism to do more effective target marketing by identifying or segmenting customers based on their potential degree of loyalty or volume of purchases” (p. 270). Table 1 lists several item weighting score properties that differentiate DWT* from DWT.

Table 1. The difference between DWT and DWT*

Properties of Item Weighting Scores | DWT | DWT*
Single-value like vs. Interval-value like | The weighting score of an item in DWT is given as a single value v. The weighting score is defined as single-value like. | The weighting score of an item in DWT* is given as an interval of two values [v1, v2], where v1 < v2. The weighting score is defined as interval-value like.
Percentage like vs. Positive-integer like | The value of the weighting score for an item in DWT is given as 0 ≤ v ≤ 1. The weighting score is defined as percentage like. | Both lower and upper values of the weighting score interval for an item in DWT* are given as v1, v2 ≥ 1 and v1, v2 ∈ Z (both v1, v2 are positive integers). The weighting score is defined as positive-integer like.
Static like vs. Dynamic like | The weighting score of an item in DWT is given as a fixed value in all transactions. The weighting score is defined as static like. | The weighting score of an item in DWT* can be valued differently in different transactions. The weighting score is defined as dynamic like.

In a marketing context, a typical horizontal WAR mined from DWT* can be exemplified as 〈bread[9, 13]〉 ⇒ 〈milk[1, 3]〉, which can be interpreted as: when bread is purchased in a quantity between 9 and 13, it is likely that milk is also purchased in a quantity between 1 and 3. In Wang, Yang, and Yu (2000) the proposed WAR generation approach comprises two phases: (1) generating a set of frequent itemsets from DWT* regardless of the weighting issue; and (2) extracting hidden and interesting horizontal WARs based on (1). In (2) a set of candidate rules can be enumerated from the result of (1), where the consequent of each candidate rule “only contains one weighted item for the sake of simplicity” (p. 271). A number of “qualified” horizontal WARs can be further identified in the set of candidate rules with respect to the user-specified threshold values of support, confidence and density. Since the proposed approach shows an interest in producing maximum rules only, a set of maximum horizontal WARs is finally obtained, where “a qualified WAR X ⇒ Y is a maximum WAR if for any generalization X’ of X and Y’ of Y where X’ ≠ X and Y’ ≠ Y, neither of X’⇒ Y, X ⇒ Y’, nor X’⇒ Y’ is a qualified WAR” (p. 271). Tao, Murtagh, and Farid (2003) classified the process of mining horizontal WARs from DWT*, proposed in Wang, Yang, and Yu (2000), as a technique of post-processing or maintaining ARs.

The Improved Approach
Tao, Murtagh, and Farid (2003) identified the main challenge of mining horizontal WARs: the downward closure property of itemsets is invalid in the generation of significant/frequent horizontal weighted itemsets. To solve this problem, an improved approach of mining horizontal WARs was proposed in Tao, Murtagh, and Farid (2003), which takes an alternative horizontal weighted transaction-database DWT+ as the input. The only difference between DWT+ and DWT is that the item weighting scores in DWT+ can be valued as any real number. This improved approach automatically assigns a weighting score w_tj to each transaction Tj in DWT+, where the computation of w_tj is: (∑aWi ∈ Tj wi) / |Tj|. Based on the assigned transaction scores, a set of frequent horizontal weighted itemsets SFIW can be generated. A horizontal weighted itemset XW ∪ YW is considered to be frequent if (∑j = 1 … |Ŧ| & (XW ∪ YW) ⊆ Tj w_tj) / (∑j = 1 … |Ŧ| w_tj) ≥ σW, where XW, YW ⊆ IW, XW ∩ YW = ∅, and σW is a user-supplied horizontal weighted support threshold. In the generation of frequent horizontal weighted itemsets, the downward closure property can be proved to work properly. With respect to the idea presented in Agrawal and Srikant (1994), all horizontal WARs can be further mined from SFIW. In this improved approach of mining horizontal WARs, automatically assigning a weighting score to each transaction (in a vertical fashion) signifies the approach of mining vertical WARs.

Mining Vertical WARs

Lu, Hu, and Li (2001) extended the traditional approach of mining horizontal WARs in a vertical manner by introducing the vertical weighted transaction-database DWTV. With regard to real-life marketing, not all transactions share the same importance in a market. For example, transactions that were dealt with long ago may be less important than current transactions; transactions that are processed in a particular region may be more interesting than other transactions; and so forth. Thus assigning non-identical weighting scores to different transactions is suggested.
In Lu, Hu, and Li (2001) the concept of transaction interval was introduced, which allows a number of adjacent transactions to share a common weighting score. In this vertical WARM approach, items are treated uniformly. Let I = {a1, a2, …, an-1, an} be a set of items, Ŧ = {T1, T2, …, Tm-1, Tm} be a set of m-many transactions, and ŦI = {TT1, TT2, …, TTM-1, TTM} be a set of M-many transaction intervals that covers all transactions in Ŧ in a non-overlapping manner, where M ≤ m. A vertical weighted transaction-database DWTV is described by ŦI, where each TTl ∈ ŦI contains a number of Tj, and each Tj ∈ Ŧ comprises a set of items I’ ⊆ I. In DWTV a vertical weighting score w_vl is assigned to each TTl ∈ ŦI, where 0 ≤ w_vl ≤ 1.
The process of mining vertical WARs, described in Lu, Hu, and Li (2001), consists of two stages: (1) generating a set of large vertical weighted itemsets from DWTV; and (2) extracting vertical


WARs based on (1). In (1) a vertical weighted support threshold σWV is supplied by the user that distinguishes large vertical weighted itemsets from the small ones. The weighted support of a vertical weighted itemset X ∪ Y is calculated as: (∑l = 1 … M (w_vl * count((X ∪ Y)l))) / (Nv), where X, Y ⊆ I, X ∩ Y = ∅, count((X ∪ Y)l) is the number of transactions that contain X ∪ Y in the transaction interval TTl, and Nv is the weighted transaction number. The calculation of Nv is: ∑l = 1 … M (w_vl * Nl), where Nl is the number of transactions that are found in the transaction interval TTl. It can be proved that the closure property works properly in this stage. In (2) a vertical WAR generation approach is applied, which is similar to the rule-generation approach provided in Agrawal and Srikant (1994).

Mining Mixed WARs

A further extension in mining WARs was presented in Lu, Hu, and Li (2001), which combines both approaches of mining horizontal and vertical WARs. This hybrid WARM approach takes a mixed weighted transaction-database DWTM as the input. Let IW = {aW1, aW2, …, aWn-1, aWn} be a set of weighted items, where each aWi ∈ IW is an item ai ∈ I labelled with a user-defined weighting score wi (0 ≤ wi ≤ 1). Let Ŧ = {T1, T2, …, Tm-1, Tm} be a set of m-many transactions, and ŦI = {TT1, TT2, …, TTM-1, TTM} be a set of M-many transaction intervals that covers all transactions in Ŧ in a non-overlapping manner, where M ≤ m. A mixed weighted transaction-database DWTM is described by ŦI, where each TTl ∈ ŦI contains a number of Tj, each Tj ∈ Ŧ comprises a set of weighted items IW’ ⊆ IW, and the weighted items in each transaction are ordered in an ascending manner based on their item weights. In DWTM a vertical weighting score w_vl is assigned to each TTl ∈ ŦI, where 0 ≤ w_vl ≤ 1.
The process of mining mixed WARs (Lu, Hu & Li, 2001) is similar to the process of mining vertical WARs. In the stage of generating large mixed weighted itemsets, a weighted support threshold σWM is specified by the user that distinguishes large mixed weighted itemsets from the small ones. The weighted support of a mixed weighted itemset XW ∪ YW is calculated as: (1/k) * (∑aWi ∈ (XW ∪ YW) wi) * ((∑l = 1 … M (w_vl * count((XW ∪ YW)l))) / (Nv)), where XW, YW ⊆ IW, XW ∩ YW = ∅, count((XW ∪ YW)l) is the number of transactions that contain XW ∪ YW in the transaction interval TTl, Nv is the vertical weighted transaction number (see the previous subsection), (1/k) * (∑aWi ∈ (XW ∪ YW) wi) is the horizontal weight of XW ∪ YW, and k is the size of XW ∪ YW. In this stage, the closure property works by checking the lower bound of the vertical weighted support for each candidate itemset, where the calculation of such a lower bound for a mixed weighted k-itemset XW ∪ YW is: (k * σWM) / (∑i = n – k … n (wi)). In the stage of generating mixed WARs, an approach that is similar to the rule-generation approach provided in Agrawal and Srikant (1994) is employed.

Allocating Patterns

A new type of horizontal WAR, namely allocating pattern (ALP), is described in this section. It can not only indicate the implicative co-occurring relationship between two sets of items in a weighting setting, but also inform the allocating relationship among AR items. In a marketing context, an archetypal ALP can be exemplified as 〈bread[0.25] ham[0.35]〉 ⇒ 〈milk[0.40]〉, which can be interpreted as: when people spend 25% and 35% of their money to purchase bread and ham together, it is likely that people also spend 40% of the money to purchase milk. The approach of mining ALPs requires a special horizontal weighted transaction-database DWT-OS as the input.


One-Sum Weighted Transaction Database

In Table 1 three sets of item score properties are defined to analyse different horizontal weighted transaction-databases. These properties are “single-value like vs. interval-value like,” “percentage like vs. positive-integer like,” and “static like vs. dynamic like.” In DWT-OS item weighting scores show an additional property (“one-sum” like) that distinguishes DWT-OS from other horizontal weighted transaction-databases: the sum of all item scores in each transaction is 1. Hence DWT-OS can be named a one-sum weighted transaction-database.
Let IOSW = {aOSW1, aOSW2, …, aOSWn-1, aOSWn} be a set of one-sum weighted items, and Ŧ = {T1, T2, …, Tm-1, Tm} be a set of transactions. Each aOSWi ∈ IOSW represents an item ai ∈ I assigned a set of weighting scores θi = {wi1, wi2, …, wim-1, wim}, where 0 ≤ wij ≤ 1 and |θi| = |Ŧ|, which means: for different transactions Tj ∈ Ŧ, different scores wij ∈ θi can be assigned to a particular item aOSWi ∈ IOSW. A one-sum weighted transaction-database DWT-OS is described by Ŧ, where each Tj ∈ Ŧ comprises a set of one-sum weighted items IOSW’ ⊆ IOSW, and ∑i = 1 … |IOSW’| wij = 1 (where |IOSW’| = |Tj|). An overall comparison, in terms of item weighting score properties, of four different horizontal weighted transaction-databases is provided in Table 2.

One-Sum Weighted Itemsets

An itemset can be recognized in a transaction-database DT if this particular set of items appears as a subset of at least one transaction Tj in DT. A one-sum weighted itemset can be treated as an itemset that is presented in a particular weighting frame, where the item scores are assigned in a one-sum percentage manner. For example, {I1[0.1], I2[0.3], I3[0.3], I5[0.3]} and {I1[0.1], I2[0.3], I3[0.5], I5[0.1]} are two different weighting frames for the itemset {I1, I2, I3, I5}. An itemset can produce infinitely many possible weighting frames. If an itemset weighting frame IWF appears as a subset of at least one transaction Tj in a one-sum weighted transaction-database DWT-OS, this IWF can be identified as a one-sum weighted itemset in DWT-OS.

The Score Transformation Procedure

To determine whether an IWF is a subset of a particular Tj in DWT-OS or not, the actual weighting

Table 2. The comparison of DWT, DWT*, DWT+, and DWT-OS

Properties of Item Weighting Scores | DWT | DWT* | DWT+ | DWT-OS
Single-value like vs. Interval-value like | Single-value like | Interval-value like | Single-value like | Single-value like
Percentage like vs. Positive-integer / Positive-real like | Percentage like | Positive-integer like | Positive-real like | Percentage like
Static like vs. Dynamic like | Static like | Dynamic like | Static like | Dynamic like
One-sum like | No | No | No | Yes
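The “one-sum” property listed in the last row of Table 2 can be made concrete with a short Python sketch. Representing a transaction as a dictionary from item to score, the helper names `to_one_sum` and `is_one_sum`, and the numeric tolerance are all assumptions made for illustration; the chapter does not prescribe a data structure.

```python
import math


def to_one_sum(amounts):
    """Turn absolute amounts per item into scores that sum to 1."""
    total = sum(amounts.values())
    if total <= 0:
        raise ValueError("a transaction needs a positive total amount")
    return {item: value / total for item, value in amounts.items()}


def is_one_sum(transaction, tol=1e-9):
    """Check 0 <= score <= 1 for every item and that the scores sum to 1."""
    in_range = all(0.0 <= w <= 1.0 for w in transaction.values())
    return in_range and math.isclose(sum(transaction.values()), 1.0, abs_tol=tol)


# Hypothetical basket: money spent on each item.
basket = {"bread": 2.5, "ham": 3.5, "milk": 4.0}
one_sum_transaction = to_one_sum(basket)
```

For this basket the scores come out as 0.25, 0.35 and 0.40, which are the allocations used in the bread/ham/milk ALP example earlier in the chapter.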


score wij that is assigned to each item aOSWi ∈ Tj where aOSWi ∈ IWF needs to be transformed as (wij) / (∑q = 1 … |Tj| & (aOSWq ∈ IWF) wqj). The transformed scores clarify the actual allocating relationship among these IWF-related items in Tj. An IWF is defined as a subset of Tj if the score of each item involved in IWF matches the relative item score transformed in Tj. For example, an IWF can be given as {I1[0.2], I2[0.4], I3[0.4]} while a transaction Tj may be {I1[0.1], I2[0.2], I3[0.2], I4[0.25], I5[0.25]}; attention is restricted to the weighting scores of items I1, I2 and I3, since the item intersection IWF ∩ Tj = {I1, I2, I3}; although the actual scores of I1, I2 and I3 are presented differently in IWF (as “0.2”, “0.4” and “0.4”) and Tj (as “0.1,” “0.2” and “0.2”), IWF is still recognized as a subset of Tj because the transformed scores of I1, I2 and I3 ∈ Tj are computed as “0.1 / (0.1 + 0.2 + 0.2) = 0.2”, “0.2 / (0.1 + 0.2 + 0.2) = 0.4” and “0.2 / (0.1 + 0.2 + 0.2) = 0.4,” which match the scores given in IWF. The transformation of transaction item scores enables the one-sum weighted property to be carried over from transactions (a special case of itemsets) to the extracted weighted itemsets.

Frequent One-Sum Weighted Itemsets

A one-sum weighted itemset is considered to be frequent if it can be found as a subset of more than (σWOS * |Ŧ|)-many transactions in DWT-OS, where σWOS is a user-supplied one-sum weighted support threshold. It should be noted that the well-known closure property of itemsets can also be found in one-sum weighted itemsets, so that: (1) if a one-sum weighted itemset is frequent then all its subsets will also be frequent; and (2) if a one-sum weighted itemset is infrequent then all its supersets will also be infrequent.

One-Sum Weighted Association Rules

A frequent one-sum weighted itemset is presented as XOSW ∪ YOSW, where XOSW, YOSW ⊆ IOSW and XOSW ∩ YOSW = ∅. A one-sum WAR in the form of XOSW ⇒ YOSW can be further produced by a rule formalisation procedure, namely rule-formalisation (see Algorithm 2). In rule-formalisation, w(aOSWi) ∈ (XOSW ∪ YOSW) represents the corresponding (actual) weighting score for the item aOSWi in XOSW ∪ YOSW.

Algorithm 2. The rule-formalisation procedure

Input: A frequent one-sum weighted itemset in terms of (XOSW, YOSW);
Output: A formalized one-sum weighted association rule p (as XOSW ⇒ YOSW);
Begin Algorithm:
(1) prepare p to be a formalized one-sum weighted association rule;
(2) formalize “〈” as the first part of p;
(3) for each aOSWi ∈ XOSW do
(4) update p iteratively by formalising “aOSWi ‘[’ w(aOSWi ) ∈ (XOSW ∪ YOSW) ‘]’” as its second part;
(5) end for
(6) update p by formalising “〉 ⇒ 〈” as its third part;
(7) for each aOSWi ∈ YOSW do
(8) update p iteratively by formalising “aOSWi ‘[’ w(aOSWi ) ∈ (XOSW ∪ YOSW) ‘]’” as its fourth part;
(9) end for
(10) update p by formalising “〉” as its last part;
(11) return (p);
End Algorithm

A one-sum WAR XOSW ⇒ YOSW is said to be valid when count((XOSW ∪ YOSW) ⊆ (Tj ∈ Ŧ)) / count(XOSW ⊆ (Tj ∈ Ŧ)) ≥ aWOS, where aWOS is a user-supplied one-sum weighted confidence threshold, count(J) is the count function that returns the number of occurrences of an object J, and the previously described score transformation procedure is employed to verify the “⊆” relationship.

Allocation Pattern Mining

In this section, an allocating pattern mining (ALPM) approach is proposed to extract all hidden and interesting ALPs from a one-sum weighted transaction-database DWT-OS. With respect to the traditional ARM approach presented in Agrawal and Srikant (1994), the proposed ALPM approach consists of two phases: (1) generating a set of frequent one-sum weighted itemsets from DWT-OS; and (2) mining one-sum WARs (noted as ALPs) based on (1).

Generating Frequent One-Sum Weighted Itemsets

An algorithm, namely apriori-ALP, is proposed to generate a set of frequent one-sum weighted itemsets from DWT-OS, which takes the apriori algorithm (see Algorithm 1) as its basis. A one-sum weighted

Algorithm 3. The apriori-ALP algorithm

Input: (a) A one-sum weighted transaction-database DWT-OS;
(b) A one-sum weighted support threshold σWOS;
Output: A set of frequent one-sum weighted itemsets SFIWOS;
Begin Algorithm:
(1) k := 1;
(2) SFIWOS := prepare an empty set for holding the identified frequent one-sum weighted itemsets;
(3) Ck := generate the set of candidate k-itemsets from DWT-OS;
(4) while (Ck ≠ ∅) do
(5) for each element ei ∈ Ck do
(6) generate all itemset weighting frames for ei through scanning all transactions in DWT-OS;
(7) initialize a Boolean variable frequentFlag as false;
(8) for each itemset weighting frame fj of ei do
(9) support := count(fj ⊆ transactions in DWT-OS); // the previously described score transformation
procedure is employed to verify the “⊆” relationship
(10) if ((support / |DWT-OS|) ≥ σWOS) then
(11) add fj into SFIWOS; // fj is stored with its actual support value
(12) set frequentFlag to be true;
(13) end if
(14) end for
(15) if (¬frequentFlag) then
(16) remove ei from Ck;
(17) end if
(18) end for
(19) k := k + 1;
(20) Ck := apriori-gen(Ck – 1); // the apriori-gen function is introduced in (Agrawal & Srikant, 1994)
(21) end while
(22) return (SFIWOS);
End Algorithm
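A compact Python sketch of the idea behind Algorithm 3 is given below, combining the score transformation procedure with level-wise candidate generation. Several choices here are assumptions for illustration and are not part of the published algorithm: transactions are dictionaries from item to score, a weighting frame is a sorted tuple of (item, score) pairs, transformed scores are rounded to `ndigits` decimals so that frames can be compared exactly, and the helper names `project` and `apriori_gen` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def project(transaction, items, ndigits=4):
    """Score transformation: renormalise the scores of `items` inside one
    transaction; returns None if the transaction lacks any of the items."""
    if not all(i in transaction for i in items):
        return None
    total = sum(transaction[i] for i in items)
    if total <= 0:
        return None
    return tuple((i, round(transaction[i] / total, ndigits)) for i in sorted(items))


def apriori_gen(prev):
    """Join k-itemsets into (k+1)-candidates whose k-subsets all survive."""
    prev = set(prev)
    if not prev:
        return set()
    k = len(next(iter(prev))) + 1
    return {
        a | b
        for a, b in combinations(prev, 2)
        if len(a | b) == k
        and all(frozenset(s) in prev for s in combinations(a | b, k - 1))
    }


def apriori_alp(db, min_support, ndigits=4):
    """Return {weighting frame: support} for all frequent one-sum frames."""
    n = len(db)
    frequent = {}
    candidates = {frozenset([i]) for t in db for i in t}
    while candidates:
        survivors = set()
        for itemset in candidates:
            # One scan yields every weighting frame of this itemset together
            # with the number of transactions that contain it.
            frames = Counter(
                f for f in (project(t, itemset, ndigits) for t in db) if f is not None
            )
            for frame, count in frames.items():
                if count / n >= min_support:
                    frequent[frame] = count / n
                    survivors.add(itemset)
        candidates = apriori_gen(survivors)
    return frequent


# Hypothetical one-sum weighted transaction-database.
db = [
    {"I1": 0.1, "I2": 0.2, "I3": 0.2, "I4": 0.25, "I5": 0.25},
    {"I1": 0.2, "I2": 0.4, "I3": 0.4},
    {"I1": 0.5, "I2": 0.5},
    {"I2": 0.5, "I3": 0.5},
]
frames = apriori_alp(db, min_support=0.5)
```

In this toy database the frame {I1[0.2], I2[0.4], I3[0.4]} is supported by the first two transactions, mirroring the worked score transformation example in the text. Rounding is only one possible way to decide when two frames “match”; real portfolio data would probably need a binning or tolerance rule chosen by the analyst.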


support threshold, as a parameter of apriori-ALP, is taken from the user. The apriori-ALP algorithm is described in Algorithm 3.

Mining One-Sum WARs (ALPs)

Given a set of frequent one-sum weighted itemsets SFIWOS that is generated from apriori-ALP, an algorithm, namely ALP-generation, is further proposed to extract ALPs from SFIWOS. A one-sum weighted confidence threshold, as a parameter of ALP-generation, is taken from the user. According to the closure property of one-sum weighted itemsets, all subsets of a frequent one-sum weighted itemset fi are included in SFIWOS, where |fi| ≥ 2. Hence the process of ALP-generation can be designed as shown in Algorithm 4.

Applying ALPs in Portfolio Management

Portfolio management, as the core study in investments research, aims to determine the best portfolio strategy, in terms of total expected return and overall risk, that guides how individuals concurrently invest in a number of investment-items. With respect to the real-life financial market, an investment-item can be any type of security, that is, bonds, cash equivalents, funds, futures, options, stocks, and so forth. The primary goal of a portfolio strategy is “to choose a set of risk assets to create a portfolio in order to maximize the return under certain risk or to minimize the risk for obtaining a specific return” (Zhang & Zhou, 2004, p. 517).

Modeling a Collection of Portfolios

Traditional Transaction-Database Model

A number of “popular” investment-items (e.g., the stocks issued by Microsoft, Royal Bank of Scotland, Wal-Mart, etc.) can always be easily listed. This list of items/assets already depicts an investment portfolio in a particular weighted setting, where items are weighted identically, that is, spending/

Algorithm 4. The ALP-generation algorithm

Input: (a) A set of frequent one-sum weighted itemsets SFIWOS;
(b) A one-sum weighted confidence threshold aWOS;
Output: A set of allocating patterns SALP;
Begin Algorithm:
(1) SALP := prepare an empty set for holding the identified allocating patterns;
(2) for each frequent one-sum weighted itemset fi ∈ SFIWOS do
(3) for each frequent one-sum weighted itemset fj ∈ SFIWOS do
(4) if (fj ⊂ fi) then // the previously described score transformation procedure is employed to
verify the “⊂” relationship
(5) confidence := fi.support / fj.support;
(6) if (confidence ≥ aWOS) then
(7) allocating pattern p := Rule-Formalisation(fj, fi – fj);
(8) add p into SALP;
(9) end if
(10) end if
(11) end for
(12) end for
(13) return (SALP);
End Algorithm
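The rule extraction of Algorithm 4, together with the rule-formalisation step of Algorithm 2, might be sketched in Python as follows. The sketch assumes that frequent frames are stored as a dictionary from a sorted tuple of (item, score) pairs to a support value, and it replaces the inner loop over SFIWOS with a lookup of each renormalised sub-frame, relying on the closure property stated in the text. The function names and the rounding parameter `ndigits` are hypothetical, and rounding a frame twice can in principle shift the last digit, so the lookup is a simplification.

```python
from itertools import combinations


def sub_frame(frame, items, ndigits=4):
    """Renormalise the scores of `items` taken from a larger frame."""
    weights = dict(frame)
    total = sum(weights[i] for i in items)
    return tuple((i, round(weights[i] / total, ndigits)) for i in sorted(items))


def formalise(frame, antecedent):
    """Write a rule as 〈X items〉 ⇒ 〈Y items〉 using the scores of the full frame."""
    weights = dict(frame)
    lhs = " ".join(f"{i}[{weights[i]:g}]" for i in sorted(antecedent))
    rhs = " ".join(
        f"{i}[{weights[i]:g}]" for i in sorted(set(weights) - set(antecedent))
    )
    return f"〈{lhs}〉 ⇒ 〈{rhs}〉"


def alp_generation(frequent, min_confidence, ndigits=4):
    """Return [(rule text, confidence)] for every qualifying allocating pattern."""
    patterns = []
    for frame, sup in frequent.items():
        items = [i for i, _ in frame]
        for r in range(1, len(items)):  # proper, non-empty antecedents only
            for antecedent in combinations(items, r):
                sub_sup = frequent.get(sub_frame(frame, antecedent, ndigits))
                if sub_sup and sup / sub_sup >= min_confidence:
                    patterns.append((formalise(frame, antecedent), sup / sub_sup))
    return patterns


# Hypothetical set of frequent one-sum weighted itemsets with their supports.
frequent = {
    (("bread", 1.0),): 0.75,
    (("milk", 1.0),): 0.75,
    (("bread", 0.4), ("milk", 0.6)): 0.5,
}
alps = alp_generation(frequent, min_confidence=0.6)
```

With these toy supports both directions of the bread/milk frame reach a confidence of 0.5 / 0.75, so two ALPs are produced at a threshold of 0.6 and none at 0.7.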


allocating the same amount of assets/funds to each portfolio-item. A collection of such portfolios can be modelled as a traditional transaction-database P. When a set of ARs is mined from P, each AR illustrates an implicative co-occurring relationship between two sets of investment-items. For example, an AR may be mined as 〈stock_no.1 stock_no.2 bond_no.1〉 ⇒ 〈stock_no.3 fund_no.1〉, which can be interpreted as: when stock_no.1, stock_no.2 and bond_no.1 are invested together, it is likely that both stock_no.3 and fund_no.1 are also invested. In real-life investment activities, ARs are not generally applicable because the amount of assets/funds allocated to each portfolio-asset is usually not identical.

One-Sum Weighted Transaction-Database Model

In this subsection, the authors model a collection of investment portfolios as a one-sum weighted transaction-database P*, where each attribute in a database record represents an investment-item that is assigned a weighting score between 0 and 1 (i.e., the ratio of the amount of assets/funds spent on this portfolio-item to the total amount of assets/funds spent on this portfolio), and the sum of all investment-item scores in a portfolio (database record) is 1. A number of ALPs can be identified in P* that illustrate such implicative allocating relationships between two sets of investment-items. An ALP mined from P* can be exemplified as 〈stock_no.1[0.3] stock_no.2[0.15] bond_no.1[0.2]〉 ⇒ 〈stock_no.3[0.05] fund_no.1[0.3]〉, which can be interpreted as: when people invest 30%, 15% and 20% of their total assets/funds in stock_no.1, stock_no.2 and bond_no.1 together, it is likely that they also invest 5% and 30% of the total assets/funds in stock_no.3 and fund_no.1. Because they carry the additional information of one-sum item/asset weights, ALPs can be generally applied in real-life investment activities.

Collecting “Meaningful” Portfolios

A collection of portfolios is assumed “meaningful” if each portfolio is collected under some prescriptive conditions. One aspect of the conditions can be described as: each collected portfolio must be “successful”, that is, the realized return of a portfolio exceeds a user-defined return threshold δ, where the lower bound of δ is defined as the return of investing the same amount of assets/funds in a risk free investment-item over the same period of time. Other conditions that may be considered/specified by the user include: (1) each collected portfolio must be invested in a particular time interval; (2) each collected portfolio must contain a particular number of investment-items; (3) each collected portfolio must be produced by a particular portfolio selection technique; (4) each collected portfolio must be produced by a particular financial institution; (5) the risk of each collected portfolio must be less than a user-supplied risk threshold; (6) the investment-items contained in each collected portfolio must be traded in a particular stock exchange; and so forth.

Guiding Future Investment Activities

ALPs are expected to prove useful in a range of applications. With respect to portfolio management, mining ALPs from a set of meaningful portfolios can be applied to guide future investment activities. Given a collection of successful portfolios P*S invested in a particular time interval t1 (e.g., all portfolios are purchased on day 1 and sold on day 90), where the return threshold δ is suggested to be a percentage ψ times the average realized return of all investment-items involved in P*S in t1 (note that in this chapter ψ is simply set to 50%), a set of mined ALPs can be treated as a number of candidate portfolios to be applied in future investment activities. The quality of each obtained candidate portfolio can be evaluated using a quality threshold µ, where µ is always chosen to be the yearly return of a risk free investment-item.
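
The one-sum weighted record described in this section can be sketched in a few lines of Python. This is only an illustration: the function names and the invested amounts are hypothetical, chosen so that the resulting weights reproduce the example ALP 〈stock_no.1[0.3] stock_no.2[0.15] bond_no.1[0.2]〉 ⇒ 〈stock_no.3[0.05] fund_no.1[0.3]〉.

```python
def to_one_sum_record(amounts):
    """Turn the amount spent on each investment-item into a one-sum weight."""
    total = sum(amounts.values())
    if total <= 0:
        raise ValueError("total invested amount must be positive")
    return {item: amount / total for item, amount in amounts.items()}


def is_one_sum(record, tol=1e-9):
    """Check that every weight lies in [0, 1] and that the weights sum to 1."""
    in_range = all(0.0 <= w <= 1.0 for w in record.values())
    return in_range and abs(sum(record.values()) - 1.0) <= tol


# Hypothetical amounts for one portfolio (any currency unit).
amounts = {
    "stock_no.1": 3000,
    "stock_no.2": 1500,
    "bond_no.1": 2000,
    "stock_no.3": 500,
    "fund_no.1": 3000,
}
record = to_one_sum_record(amounts)
```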


In Ye, Liu, Yao, Wang, Zhou, and Lu (2002) the yearly returns of two risk free investment-items are determined as: (1) 1.5% for deposits in a bank or money market, and (2) 3% for bonds. In this chapter µ is chosen to be 5% in a conservative fashion. Hence a candidate portfolio is “qualified” if its realized return in a “test” time interval t2 (note that the beginning of t2 is later than the end of t1) is greater than µ prorated to t2 (i.e., if the range of t2 is three months, µ should be calculated as 5% / 12 * 3 = 1.25%). The overall performance of the proposed application can be measured by the rate of obtaining qualified candidate portfolios, which is: count (qualified candidate portfolios) / count (all generated candidate portfolios). In addition, the overall performance of the obtained candidate portfolios can be further evaluated by their monthly average return, as suggested in Hung, Liang, and Liu (1996).

Experimental Results

In this section, the authors aim to evaluate the effectiveness of the proposed ALP application in portfolio management: a set of ALPs mined from a collection of successful investment portfolios can be treated as a number of candidate portfolios used to guide future investment activities. The evaluation is performed regarding both the rate of obtaining qualified candidate portfolios and the monthly average return of the generated candidate portfolios. All evaluations are obtained using the proposed apriori-ALP and ALP-generation algorithms. Experiments are run on a 1.20 GHz Intel Celeron CPU with 256 MByte of RAM running under Windows Command Processor.

The CSMAR-CSTQR⋅Database

The CSMAR Data

The experiments are conducted using two sets of possible investment portfolios generated from the CSMAR (China Stock Market & Accounting Research) China Stock Trade and Quote Research Database (CSTQR⋅Database).¹ CSMAR includes eight databases and one system: China Stock Market Trading Database, China Stock Market Financial Database, China Securities Investment Fund Research Database, China Stock Market Information Disclosure System, China IPO Research Database, China Listed Firm Corporate Governance Research Database, China Listed Firm’s Financial Ratios Research Database, China Stock Market Quarterly Report Database and the CSTQR⋅Database. The CSTQR⋅Database covers all details of every transaction and related information within every working day, providing data by bid and ask record.

Data Related Stock Exchanges

There are two stock markets in China: Shanghai Stock Exchange (SHSE) and Shenzhen Stock Exchange (SZSE). SHSE was opened in December 1990, whereas SZSE was established in July 1991. The CSTQR⋅Database involves two types of shares: A-shares and B-shares. The A-shares are domestic investment shares that are issued by Chinese companies and listed on SHSE and SZSE. The A-shares are denominated in the Chinese currency, that is, the RMB. Foreign individuals or institutions are not allowed to directly buy and sell these shares. The B-shares, on the other hand, were issued to and traded by overseas investors only; domestic investors were not able to purchase B-shares before 2001. Since 2001 the B-shares have been opened to domestic investors. The denomination currencies for the B-shares are the US dollar on SHSE and the Hong Kong dollar on SZSE (Chan, Menkveld & Yang, 2003).

Two Simulated Portfolio Collections

For the sake of simplicity, for each of SHSE and SZSE only the first 50 listed stocks for A-shares are taken in the period between January 2003 and June 2003 from the CSTQR⋅Database. It should be noted that the stocks are listed in the CSTQR⋅Database according to their stock IDs (i.e., “600000” represents Shanghai Pudong Development Bank Co., Ltd., “600001” represents Handan Iron & Steel Co., Ltd., “600002” represents Qilu Petrochemical Company Ltd., etc.), where the stock IDs are issued to each stock without a specific order. For each of SHSE and SZSE, 5,000 successful portfolios are randomly created based on the 50 taken stocks, where each portfolio is limited to contain at least 7 and at most 15 stocks. To decide which stocks should be included in a simulated portfolio, a random procedure is applied. In Algorithm 5, the random

Algorithm 5. The portfolio-simulation procedure

Input: (a) The number of stocks num; // ‘num’ is decided to be 50
(b) The min size of a portfolio s; // ‘s’ is decided to be 7
(c) The max size of a portfolio t; // ‘t’ is decided to be 15
Output: A simulated portfolio Φ;
Begin Algorithm:
(1) Φ := prepare an empty set for holding the selected SHSE/SZSE stocks;
(2) k := 1;
(3) c := 1;
(4) while ((k ≤ num) and (c ≤ t)) do
(5) r1 := generate a random integer under 5; // including 0 and 5
(6) r2 := 0;
(7) if (r1 = 0) then
(8) r2 := generate a random integer under 1;
(9) else if (r1 = 1) then
(10) r2 := generate a random integer under 2;
(11) else if (r1 = 2) then
(12) r2 := generate a random integer under 3;
(13) else if (r1 = 3) then
(14) r2 := generate a random integer under 5;
(15) else if (r1 = 4) then
(16) r2 := generate a random integer under 8;
(17) else
(18) r2 := generate a random integer under 13;
(19) if (r2 = 0) then
(20) Φ := Φ ∪ k;
(21) c := c + 1;
(22) k := k + 1;
(23) end while
(24) if (c < s) then
(25) return Portfolio-Simulation(num, s, t); // recursive procedure
(26) return (Φ);
End Algorithm
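
Algorithm 5 can be approximated in Python as follows. This is a sketch under stated assumptions rather than the authors' code: "a random integer under n" is read as a uniform draw from 0 to n inclusive (the pseudocode only says so explicitly for r1), the recursive call is replaced by a retry loop, and the portfolio size is checked by counting the selected stocks directly instead of through the counter c. The name `simulate_portfolio` and the `rng` parameter are hypothetical.

```python
import random

# Upper bound of r2 for each value of r1 (0..5), following lines (7)-(18).
R2_BOUNDS = [1, 2, 3, 5, 8, 13]


def simulate_portfolio(num=50, s=7, t=15, rng=None):
    """Return a list of stock indices (1..num) holding between s and t stocks."""
    rng = rng or random.Random()
    while True:  # retry instead of the recursive call in line (25)
        portfolio = []
        for k in range(1, num + 1):
            if len(portfolio) >= t:
                break  # maximum portfolio size reached
            r1 = rng.randint(0, 5)               # inclusive on both ends
            r2 = rng.randint(0, R2_BOUNDS[r1])   # assumed inclusive as well
            if r2 == 0:
                portfolio.append(k)              # stock k is selected
        if len(portfolio) >= s:
            return portfolio


example = simulate_portfolio(rng=random.Random(2003))
```

Passing a seeded `random.Random` instance makes a simulated collection reproducible, which is convenient when the same portfolios have to be regenerated for a later test interval.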


procedure that forms a portfolio structure, namely portfolio-simulation, is described.

In portfolio-simulation, the ranges of the random integer variable r2 are designed to increase in a Fibonacci pattern (i.e., 0, 1, 1, 2, 3, 5, 8, 13…). It should be noted that the Fibonacci pattern can be substituted by any other pattern. Once the overall structure of the simulated portfolio collection has been generated by iteratively processing portfolio-simulation, a one-sum weighting score is assigned to each portfolio (transaction) item. Firstly, an integer $\varpi_i$ is assigned to each item $a_i$ in a transaction $T_j$, where $\varpi_i$ is randomly chosen from {1, 2, 3, 4, 5}. Secondly, the one-sum weighting score $w_i$ for $a_i$ is calculated as $w_i = \varpi_i / \sum_{k=1}^{|T_j|} \varpi_k$. The two simulated portfolio collections (one for SHSE and another for SZSE) are named “shse.D50.N5000.Ifibonacci.W5” and “szse.D50.N5000.Ifibonacci.W5,” where “shse”/“szse” specifies the stock exchange, “D” represents the number of stocks taken from a stock exchange, “N” denotes the number of simulated portfolios, “I” indicates the pattern applied in random integer generation in portfolio-simulation, and “W” signifies the size of the random integer set in the process of item weighting.

Both “shse.D50.N5000.Ifibonacci.W5” and “szse.D50.N5000.Ifibonacci.W5” comprise successful portfolios only. It is assumed that all portfolios are invested in the time interval between the first trade in January 2003 and the first trade in April 2003. Hence the return of each stock/item taken from the CSTQR⋅Database can be calculated as $(p_1 - p_0) / p_0$, where $p_0$ represents the purchasing price (the price of the first trade in January 2003), and $p_1$ indicates the selling price (the price of the first trade in April 2003). The overall return of a simulated portfolio (transaction) $T_j$ is then calculated as $\sum_{i=1}^{|T_j|} w_i \cdot ((p_1 - p_0) / p_0)_i$. In the described time interval, the average return of the 50 taken stocks is 13.423% on SHSE and 11.926% on SZSE. Thus the return threshold δ (used to measure whether a simulated portfolio is successful or not) is calculated as: (1) for SHSE, 50% * 13.423% = 6.7115%; and (2) for SZSE, 50% * 11.926% = 5.963% (as stated in the previous section, ψ = 50%). Table 3 and Table 4 illustrate the first 5 simulated portfolios/transactions in “shse.D50.N5000.Ifibonacci.W5” and “szse.D50.N5000.Ifibonacci.W5”. Note that the integers listed before the square brackets are the stock IDs (i.e., “1” represents “600000” in SHSE and “000001” in SZSE), and the real numbers shown in the square brackets are the stock weights.

Mining ALPs from Simulated Portfolios

The evaluation used a one-sum weighted support threshold value of 1% and a one-sum weighted confidence threshold value of 75%, as used in (Coenen & Leng, 2004) to generate an extension of ARs, that is, the classification association rules (CARs), which parallels ALPs. The proposed apriori-ALP and ALP-generation algorithms are implemented and run on both simulated portfolio sets. There are 18 ALPs generated from “shse.D50.N5000.Ifibonacci.W5”, while 16 ALPs are mined from “szse.D50.N5000.Ifibonacci.W5”. Table 5 and Table 6 list the generated ALPs for “shse.D50.N5000.Ifibonacci.W5” and “szse.D50.N5000.Ifibonacci.W5” respectively.

Evaluation of the Mined ALPs

ALPs from “shse.D50.N5000.Ifibonacci.W5”

The 18 ALPs mined from “shse.D50.N5000.Ifibonacci.W5” are treated as the candidate portfolios, and tested by investing them in the test time interval: purchasing at the first trade in May 2003 and selling at the first trade in June 2003. In Table 7, the return of each candidate portfolio for the test time interval is shown.
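
The item weighting and the return calculations described in this section can be sketched as below. The function names are hypothetical and the prices and returns used in the usage lines are made up for illustration; only the 13.423% average return and ψ = 50% come from the text.

```python
import random


def one_sum_weights(items, max_int=5, rng=random):
    """Assign each item a random integer in 1..max_int and normalise to sum 1."""
    raw = {item: rng.randint(1, max_int) for item in items}
    total = sum(raw.values())
    return {item: value / total for item, value in raw.items()}


def stock_return(p0, p1):
    """Return of one stock bought at price p0 and sold at price p1."""
    return (p1 - p0) / p0


def portfolio_return(weights, returns):
    """Weighted sum of the returns of the stocks held in the portfolio."""
    return sum(w * returns[item] for item, w in weights.items())


def return_threshold(average_return, psi=0.5):
    """Threshold delta: psi times the average return of the stocks considered."""
    return psi * average_return


weights = one_sum_weights([1, 3, 13], rng=random.Random(0))
delta_shse = return_threshold(13.423)                  # in percent
demo = portfolio_return({1: 0.5, 2: 0.5}, {1: 0.10, 2: 0.20})
```

A simulated portfolio would then count as successful when `portfolio_return` exceeds the threshold computed for its exchange.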


Table 3. The first five transactions in “shse.D50.N5000.Ifibonacci.W5”

T1: 1[0.1111111111111111] 3[0.027777777777777776] 13[0.08333333333333333]
    14[0.05555555555555555] 15[0.027777777777777776] 17[0.1111111111111111]
    22[0.05555555555555555] 23[0.08333333333333333] 28[0.05555555555555555]
    31[0.08333333333333333] 32[0.027777777777777776] 33[0.08333333333333333]
    38[0.027777777777777776] 39[0.1111111111111111] 44[0.05555555555555555]
T2: 1[0.06666666666666667] 2[0.044444444444444446] 7[0.08888888888888889]
    11[0.06666666666666667] 12[0.022222222222222223] 13[0.08888888888888889]
    15[0.06666666666666667] 16[0.06666666666666667] 8[0.022222222222222223]
    19[0.022222222222222223] 22[0.08888888888888889] 23[0.08888888888888889]
    24[0.08888888888888889] 25[0.08888888888888889] 27[0.08888888888888889]
T3: 10[0.08823529411764706] 11[0.029411764705882353] 2[0.029411764705882353]
    13[0.08823529411764706] 19[0.058823529411764705] 5[0.029411764705882353]
    26[0.11764705882352941] 27[0.029411764705882353] 28[0.11764705882352941]
    30[0.029411764705882353] 32[0.11764705882352941] 35[0.11764705882352941]
    38[0.058823529411764705] 40[0.029411764705882353] 41[0.058823529411764705]
T4: 6[0.05263157894736842] 9[0.10526315789473684] 11[0.05263157894736842]
    12[0.10526315789473684] 20[0.07894736842105263] 21[0.07894736842105263]
    23[0.02631578947368421] 25[0.05263157894736842] 26[0.07894736842105263]
    27[0.10526315789473684] 28[0.05263157894736842] 29[0.05263157894736842]
    32[0.02631578947368421] 33[0.10526315789473684] 34[0.02631578947368421]
T5: 1[0.14285714285714285] 2[0.10714285714285714] 3[0.03571428571428571]
    5[0.03571428571428571] 6[0.03571428571428571] 7[0.03571428571428571]
    10[0.03571428571428571] 11[0.03571428571428571] 12[0.07142857142857142]
    20[0.10714285714285714] 24[0.07142857142857142] 27[0.03571428571428571]
    28[0.03571428571428571] 32[0.07142857142857142] 34[0.14285714285714285]

Table 4. The first five transactions in “szse.D50.N5000.Ifibonacci.W5”

T1: 1[0.0975609756097561] 2[0.07317073170731707] 3[0.04878048780487805]
    4[0.0975609756097561] 7[0.04878048780487805] 8[0.0975609756097561]
    11[0.0975609756097561] 12[0.024390243902439025] 14[0.0975609756097561]
    15[0.04878048780487805] 16[0.04878048780487805] 17[0.07317073170731707]
    19[0.07317073170731707] 26[0.024390243902439025] 27[0.04878048780487805]
T2: 1[0.05] 3[0.075] 4[0.1] 6[0.025] 7[0.025]
    10[0.025] 16[0.075] 17[0.1] 18[0.075]
    19[0.075] 20[0.1] 23[0.075] 24[0.1] 25[0.05]
    34[0.05]
T3: 1[0.08695652173913043] 5[0.06521739130434782] 6[0.06521739130434782]
    7[0.043478260869565216] 15[0.043478260869565216] 18[0.08695652173913043]
    20[0.06521739130434782] 21[0.06521739130434782] 24[0.043478260869565216]
    26[0.06521739130434782] 27[0.08695652173913043] 30[0.08695652173913043]
    31[0.08695652173913043] 32[0.021739130434782608] 33[0.08695652173913043]
T4: 2[0.10256410256410256] 5[0.05128205128205128] 9[0.02564102564102564]
    14[0.10256410256410256] 15[0.07692307692307693] 16[0.05128205128205128]
    17[0.07692307692307693] 19[0.05128205128205128] 22[0.10256410256410256]
    23[0.10256410256410256] 27[0.07692307692307693] 28[0.02564102564102564]
    29[0.05128205128205128] 31[0.05128205128205128] 32[0.05128205128205128]
T5: 13[0.11428571428571428] 14[0.08571428571428572] 16[0.08571428571428572]
    17[0.05714285714285714] 23[0.08571428571428572] 24[0.05714285714285714]
    30[0.08571428571428572] 34[0.11428571428571428] 38[0.02857142857142857]
    40[0.11428571428571428] 42[0.08571428571428572] 48[0.08571428571428572]


Table 5. The 18 ALPs mined from the “shse.D50.N5000.Ifibonacci.W5”

ALP No. 1 〈7[0.333333] 27[0.500003]〉 ⇒ 〈4[0.166663]〉 conf = 0.865954
ALP No. 2 〈8[0.333331] 20[0.222222]〉 ⇒ 〈4[0.444445]〉 conf = 0.846154
ALP No. 3 〈8[0.333331] 20[0.222222]〉 ⇒ 〈6[0.444445]〉 conf = 0.813187
ALP No. 4 〈17[0.333334] 18[0.22222]〉 ⇒ 〈23[0.444444]〉 conf = 0.782609
ALP No. 5 〈7[0.222222] 21[0.333333]〉 ⇒ 〈20[0.444444]〉 conf = 0.77907
ALP No. 6 〈11[0.444444] 14[0.333333]〉 ⇒ 〈7[0.222222]〉 conf = 0.772277
ALP No. 7 〈17[0.333331] 18[0.500002]〉 ⇒ 〈3[0.166665]〉 conf = 0.771739
ALP No. 8 〈26[0.333333] 31[0.444444]〉 ⇒ 〈23[0.222222]〉 conf = 0.769231
ALP No. 9 〈9[0.444445] 21[0.333333]〉 ⇒ 〈14[0.22222]〉 conf = 0.767677
ALP No. 10 〈7[0.222222] 21[0.333333]〉 ⇒ 〈8[0.444444]〉 conf = 0.767442
ALP No. 11 〈10[0.500002] 12[0.166663]〉 ⇒ 〈19[0.333333]〉 conf = 0.761062
ALP No. 12 〈17[0.500002] 18[0.333331]〉 ⇒ 〈14[0.166665]〉 conf = 0.76087
ALP No. 13 〈9[0.444445] 26[0.333333]〉 ⇒ 〈11[0.22222]〉 conf = 0.758621
ALP No. 14 〈8[0.333331] 20[0.222222]〉 ⇒ 〈23[0.444445]〉 conf = 0.758242
ALP No. 15 〈2[0.333333] 18[0.5]〉 ⇒ 〈10[0.166666]〉 conf = 0.757282
ALP No. 16 〈10[0.22222] 25[0.333333]〉 ⇒ 〈8[0.444445]〉 conf = 0.755319
ALP No. 17 〈18[0.500002] 25[0.333331]〉 ⇒ 〈3[0.166665]〉 conf = 0.75
ALP No. 18 〈17[0.333334] 18[0.22222]〉 ⇒ 〈6[0.444444]〉 conf = 0.75

Table 6. The 16 ALPs mined from the “szse.D50.N5000.Ifibonacci.W5”

ALP No. 1 〈3[0.2] 13[0.4] 16[0.3]〉 ⇒ 〈19[0.1]〉 conf = 0.981132
ALP No. 2 〈3[0.2] 13[0.099998] 17[0.299999]〉 ⇒ 〈15[0.400001]〉 conf = 0.877193
ALP No. 3 〈7[0.199999] 10[0.400002] 12[0.299998]〉 ⇒ 〈11[0.099999]〉 conf = 0.847458
ALP No. 4 〈1[0.444445] 28[0.333333]〉 ⇒ 〈4[0.22222]〉 conf = 0.831169
ALP No. 5 〈6[0.222222] 17[0.333331]〉 ⇒ 〈15[0.444445]〉 conf = 0.821053
ALP No. 6 〈17[0.333333] 18[0.5]〉 ⇒ 〈6[0.166666]〉 conf = 0.792079
ALP No. 7 〈5[0.333333] 8[0.444444]〉 ⇒ 〈6[0.222222]〉 conf = 0.785047
ALP No. 8 〈3[0.333333] 28[0.444444]〉 ⇒ 〈16[0.222222]〉 conf = 0.78481
ALP No. 9 〈19[0.444444] 23[0.333333]〉 ⇒ 〈10[0.222222]〉 conf = 0.78125
ALP No. 10 〈13[0.500002] 21[0.333331]〉 ⇒ 〈23[0.166665]〉 conf = 0.77551
ALP No. 11 〈1[0.333333] 10[0.444444]〉 ⇒ 〈6[0.222222]〉 conf = 0.770833
ALP No. 12 〈8[0.444444] 15[0.333334]〉 ⇒ 〈5[0.22222]〉 conf = 0.770833
ALP No. 13 〈4[0.222222] 8[0.333333]〉 ⇒ 〈10[0.444444]〉 conf = 0.769231
ALP No. 14 〈5[0.22222] 9[0.333334]〉 ⇒ 〈8[0.444444]〉 conf = 0.765306
ALP No. 15 〈9[0.333333] 10[0.444444]〉 ⇒ 〈2[0.222222]〉 conf = 0.764706
ALP No. 16 〈1[0.333333] 8[0.444444]〉 ⇒ 〈12[0.222222]〉 conf = 0.761468


Table 7. The returns of the 18 candidate portfolios

No. Candidate Portfolios Return (%)
1 〈7[0.333333] 27[0.500003]〉 ⇒ 〈4[0.166663]〉 4.9111
2 〈8[0.333331] 20[0.222222]〉 ⇒ 〈4[0.444445]〉 4.5220
3 〈8[0.333331] 20[0.222222]〉 ⇒ 〈6[0.444445]〉 1.5153
4 〈17[0.333334] 18[0.22222]〉 ⇒ 〈23[0.444444]〉 1.9325
5 〈7[0.222222] 21[0.333333]〉 ⇒ 〈20[0.444444]〉 0.8196
6 〈11[0.444444] 14[0.333333]〉 ⇒ 〈7[0.222222]〉 2.2271
7 〈17[0.333331] 18[0.500002]〉 ⇒ 〈3[0.166665]〉 1.5136
8 〈26[0.333333] 31[0.444444]〉 ⇒ 〈23[0.222222]〉 6.0828
9 〈9[0.444445] 21[0.333333]〉 ⇒ 〈14[0.22222]〉 1.8606
10 〈7[0.222222] 21[0.333333]〉 ⇒ 〈8[0.444444]〉 2.6530
11 〈10[0.500002] 12[0.166663]〉 ⇒ 〈19[0.333333]〉 3.2090
12 〈17[0.500002] 18[0.333331]〉 ⇒ 〈14[0.166665]〉 0.7036
13 〈9[0.444445] 26[0.333333]〉 ⇒ 〈11[0.22222]〉 3.1105
14 〈8[0.333331] 20[0.222222]〉 ⇒ 〈23[0.444445]〉 3.0224
15 〈2[0.333333] 18[0.5]〉 ⇒ 〈10[0.166666]〉 -0.7885
16 〈10[0.22222] 25[0.333333]〉 ⇒ 〈8[0.444445]〉 2.5576
17 〈18[0.500002] 25[0.333331]〉 ⇒ 〈3[0.166665]〉 1.3371
18 〈17[0.333334] 18[0.22222]〉 ⇒ 〈6[0.444444]〉 0.4271
Average 2.6451
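
The returns listed in Table 7 can be checked against the quality threshold µ with a short script. This is a minimal sketch: the function names are hypothetical, and the yearly threshold of 5% is prorated linearly to the one-month test interval, in the same way the chapter prorates it for a three-month interval.

```python
def prorated_threshold(yearly_pct, months):
    """Prorate a yearly return threshold (in percent) to a number of months."""
    return yearly_pct / 12 * months


def qualification_rate(returns_pct, threshold_pct):
    """Share of candidate portfolios whose return exceeds the threshold."""
    qualified = sum(1 for r in returns_pct if r > threshold_pct)
    return qualified / len(returns_pct)


# Returns (%) of the 18 candidate portfolios, copied from Table 7.
table7_returns = [
    4.9111, 4.5220, 1.5153, 1.9325, 0.8196, 2.2271,
    1.5136, 6.0828, 1.8606, 2.6530, 3.2090, 0.7036,
    3.1105, 3.0224, -0.7885, 2.5576, 1.3371, 0.4271,
]

mu = prorated_threshold(5.0, months=1)
rate = qualification_rate(table7_returns, mu)
```

Only candidate portfolio no. 15 falls below the prorated threshold, so `rate` corresponds to 17 of the 18 portfolios.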

The quality threshold µ (used to determine whether a candidate portfolio is qualified or not) was previously set to 5% (yearly). Since the range of the test time interval is one month, the quality threshold µ should be converted to 5% / 12 = 0.4167%. From Table 7 it can be seen that only candidate portfolio no. 15 shows a return < µ. Thus the overall rate of obtaining qualified candidate portfolios is calculated as 17/18 = 94.44%. This very high rate of obtaining qualified candidate portfolios indicates that a set of mined ALPs can be used to guide future investment activities. It is also worth noting that, for this test time interval of investment (one month), the average return of the 18 candidate portfolios (each comprising only 3 stocks; together involving only 21 SHSE stocks) is 2.6451%, while the average return of the 50 SHSE stocks is 2.785%; the average return of all candidate portfolios is thus as high as 94.98% of the average return of the 50 stocks taken from SHSE. Hence the experimental result can be further interpreted as: almost all ALP-based portfolio strategies (candidate portfolios) can produce a return that is greater than the return of a risk free investment-item, and the average return of these strategies is considered high.

ALPs from “szse.D50.N5000.Ifibonacci.W5”

The 16 ALPs mined from “szse.D50.N5000.Ifibonacci.W5” are treated as the candidate portfolios


Table 8. The returns of the 16 candidate portfolios

No. Candidate Portfolios Return (%)

1 〈3[0.2] 13[0.4] 16[0.3]〉 ⇒ 〈19[0.1]〉 4.9060
2 〈3[0.2] 13[0.099998] 17[0.299999]〉 ⇒ 〈15[0.400001]〉 1.8930
3 〈7[0.199999] 10[0.400002] 12[0.299998]〉 ⇒ 〈11[0.099999]〉 7.1890
4 〈1[0.444445] 28[0.333333]〉 ⇒ 〈4[0.22222]〉 6.1988
5 〈6[0.222222] 17[0.333331]〉 ⇒ 〈15[0.444445]〉 1.4378
6 〈17[0.333333] 18[0.5]〉 ⇒ 〈6[0.166666]〉 2.1480
7 〈5[0.333333] 8[0.444444]〉 ⇒ 〈6[0.222222]〉 2.8499
8 〈3[0.333333] 28[0.444444]〉 ⇒ 〈16[0.222222]〉 5.8392
9 〈19[0.444444] 23[0.333333]〉 ⇒ 〈10[0.222222]〉 27.9634
10 〈13[0.500002] 21[0.333331]〉 ⇒ 〈23[0.166665]〉 8.1319
11 〈1[0.333333] 10[0.444444]〉 ⇒ 〈6[0.222222]〉 2.9985
12 〈8[0.444444] 15[0.333334]〉 ⇒ 〈5[0.22222]〉 2.4376
13 〈4[0.222222] 8[0.333333]〉 ⇒ 〈10[0.444444]〉 4.8600
14 〈5[0.22222] 9[0.333334]〉 ⇒ 〈8[0.444444]〉 1.7432
15 〈9[0.333333] 10[0.444444]〉 ⇒ 〈2[0.222222]〉 -8.7930
16 〈1[0.333333] 8[0.444444]〉 ⇒ 〈12[0.222222]〉 5.5649
Average 4.8355

as well, and tested by investing them in the same test time interval as described in the previous subsection. In Table 8, the return of each candidate portfolio for the test time interval is shown.

The value of µ is still taken as 0.4167%, carried over from the previous subsection. From Table 8 it can be seen that candidate portfolio no. 15 is the only non-qualified candidate portfolio (return < µ). The overall rate of obtaining qualified candidate portfolios is then calculated as 15/16 = 93.75%. This very high rate of obtaining qualified candidate portfolios indicates that a set of mined ALPs can be used to guide future investment activities. It is also worth noting that, for this test time interval of investment (one month), the average return of the 16 candidate portfolios (each comprising only 3 or 4 stocks; together involving only 21 SZSE stocks) is 4.8355%, while the average return of the 50 SZSE stocks is 6.3024%; the average return of all candidate portfolios is thus as high as 76.72% of the average return of the 50 stocks taken from SZSE. Hence the experimental result can be further interpreted as: almost all ALP-based portfolio strategies (candidate portfolios) can produce a return that is greater than the return of a risk free investment-item, and the average return of these strategies is considered relatively high.

Conclusion and Future Research

This chapter is concerned with an investigation of applying a new type of data mining knowledge in portfolio management. This new type of knowledge can be recognized as an extension of the well-established ARs in a one-sum weighting setting. An overview of existing ARM and WARM approaches and/or algorithms was provided in the second section, where three categories of ARM algorithms and three categories of WARM approaches were reviewed. A new type of horizontal WARs, namely the allocating pattern (ALP), was proposed in the third section; it shows a one-sum, percentage-like item weighting property. A novel algorithm, separated into apriori-ALP and ALP-generation, was presented in the fourth section; it effectively extracts ALPs from data. In the fifth section, the introduced ALPs were applied to portfolio management, where the authors described the possibility of utilising ALPs to guide future investment activities. The experiments were performed in the sixth section, where two sets of simulated portfolios generated from the CSMAR-CSTQR⋅Database were taken to mine ALPs. The mined ALPs were then treated as the candidate portfolios, and tested by investing them in a test (later) time interval. The experimental result shows that a very high percentage of the mined ALPs (94.44% for the first set, and 93.75% for the second set) can produce a return that is greater than the return of a risk-free investment-item. This good evaluation performance demonstrates the effectiveness of the proposed ALP application in portfolio management. Furthermore, the average return of all candidate portfolios (each comprising only 3 or 4 stocks), for the test time interval of investment (one month), reaches a relatively high percentage (94.98% for the first set, and 76.72% for the second set) of the average return of all stock-items (taken from the CSTQR⋅Database) comprised in each simulated portfolio set. Therefore the overall experimental result can be further interpreted as: it seems that almost all ALP-based portfolio strategies make a relatively high profit.

Future research is suggested to run more experiments on a wide range of stock market data, and to conclude whether or not ALPs can be widely applied to guide future investment activities. Other obvious directions for future research include: finding other types of data mining knowledge based on ARM/WARM; investigating improved algorithms for mining ALPs from data; applying ALPs in other areas; and so forth.

Acknowledgment

The authors would like to thank Professor Paul Leng, Dr. Robert Sanderson, and Dr. Mark Roberts of the Department of Computer Science at the University of Liverpool for their support with respect to the work described here.

References

Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. In P. Buneman & S. Jajodia (Eds.), Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data (pp. 207-216). New York: ACM Press.

Agrawal, R. & Srikant, R. (1994). Fast algorithms for mining association rules. In J. B. Bocca, M. Jarke & C. Zaniolo (Eds.), Proceedings of the 20th International Conference on Very Large Data Bases (pp. 487-499). San Francisco: Morgan Kaufmann Publishers.

Ali, K., Manganaris, S., & Srikant, R. (1997). Partial classification using association rules. In D. Heckerman, H. Mannila, D. Pregibon & R. Uthurusamy (Eds.), Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (pp. 115-118). Menlo Park, CA: AAAI Press.

Berry, M. J. A. & Linoff, G. (1997). Data mining techniques for marketing, sales, and customer support. New York: John Wiley & Sons, Inc.

Bodie, Z., Kane, A., Marcus, A. J., & Ryan, P. J. (2003). Investments (Fourth Canadian Edition). Toronto, ON (Canada): McGraw-Hill Ryerson Limited.
edge based on ARM/WARM; investigating the


Bramer, M. (2007). Principles of data mining – Undergraduate topics in computer science. London, UK: Springer-Verlag.

Brin, S., Motwani, R., Ullman, J. D., & Tsur, S. (1997). Dynamic itemset counting and implication rules for market basket data. In J. Peckham (Ed.), Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data (pp. 255-264). New York: ACM Press.

Burdick, D., Calimlim, M., & Gehrke, J. (2001). MAFIA: A maximal frequent itemset algorithm for transactional databases. In Proceedings of the 17th International Conference on Data Engineering (pp. 443-452). Los Alamitos, CA: IEEE Computer Society Publications.

Cai, C. H., Fu, A. W. C., Cheng, C. H., & Kwong, W. W. (1998). Mining association rules with weighted items. In B. Eaglestone, B. C. Desai & J. Shao (Eds.), Proceedings of the 1998 International Database Engineering and Application Symposium (pp. 68-77). Los Alamitos, CA: IEEE Computer Society Publications.

Chan, K. A., Menkveld, A. J., & Yang, Z. (2003). Evidence on the foreign share discount puzzle in China: Liquidity or information asymmetry? (Working Paper). Hong Kong, China: University of Science and Technology (HKUST).

Coenen, F. & Leng, P. (2001). Optimising association rule algorithms using itemset ordering. In M. Bramer, F. Coenen & A. Preece (Eds.), Research and Development in Intelligent Systems XVIII – Proceedings of the Twenty-first SGES International Conference on Knowledge Based Systems and Applied Artificial Intelligence (pp. 53-66). London, UK: Springer-Verlag.

Coenen, F. & Leng, P. (2002). Finding association rules with some very frequent attributes. In T. Elomaa, H. Mannila & H. Toivonen (Eds.), Principles of Data Mining and Knowledge Discovery – Proceedings of the 6th European Conference on Principles and Practice of Knowledge Discovery in Databases (pp. 99-111). Berlin Heidelberg, Germany: Springer-Verlag.

Coenen, F. & Leng, P. (2004). An evaluation of approaches to classification rule selection. In Proceedings of the 4th IEEE International Conference on Data Mining (pp. 359-362). Los Alamitos, CA: IEEE Computer Society Publications.

Coenen, F., Goulbourne, G., & Leng, P. (2001). Computing association rules using partial totals. In L. D. Raedt & A. Siebes (Eds.), Principles of Data Mining and Knowledge Discovery – Proceedings of the 5th European Conference on Principles and Practice of Knowledge Discovery in Databases (pp. 54-66). Berlin Heidelberg, Germany: Springer-Verlag.

Coenen, F., Leng, P., & Ahmed, S. (2004). Data structure for association rule mining: T-tree and p-tree. IEEE Transactions on Knowledge and Data Engineering, 16(6), 774-778.

Coenen, F., Leng, P., & Goulbourne, G. (2004). Tree structures for mining association rules. Journal of Data Mining and Knowledge Discovery, 8(1), 25-51.

Cuthbertson, K. & Nitzsche, D. (2001). Investments: Spot and derivatives markets. Chichester, West Sussex, UK: John Wiley & Sons, Ltd.

Damodaran, A. (2001). Corporate finance theory and practice (2nd ed.). New York: John Wiley & Sons, Inc.

Dong, G. & Li, J. (1999). Efficient mining of emerging patterns: Discovering trends and differences. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 43-52). New York: ACM Press.

El-Hajj, M. & Zaiane, O. R. (2003). Inverted matrix: Efficient discovery of frequent items in large datasets in the context of interactive mining. In L. Getoor, T. E. Senator, P. Domingos & C. Faloutsos (Eds.), Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 109-118). New York: ACM Press.

Enke, D. & Thawornwong, S. (2005). The use of data mining and neural networks for forecasting stock market returns. Expert Systems with Applications, 29(2005), 927-940.

Farrell, M. (2006). Create a diversified portfolio. ©2006 Path to Investing – Leading the way to financial knowledge®. New York: Lightbulb Press, Inc.

Gouda, K. & Zaki, M. J. (2001). Efficiently mining maximal frequent itemsets. In N. Cercone, T. Y. Lin & X. Wu (Eds.), Proceedings of the 2001 IEEE International Conference on Data Mining (pp. 163-170). Los Alamitos, CA: IEEE Computer Society Publications.

Han, J. & Kamber, M. (2001). Data mining concepts and techniques. San Francisco: Morgan Kaufmann Publishers.

Han, J. & Kamber, M. (2006). Data mining concepts and techniques (2nd ed.). San Francisco: Morgan Kaufmann Publishers.

Han, J., Pei, J., & Yin, Y. (2000). Mining frequent patterns without candidate generation. In W. Chen, J. F. Naughton & P. A. Bernstein (Eds.), Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (pp. 1-12). New York: ACM Press.

Hand, D., Mannila, H., & Smyth, P. (2001). Principles of data mining. Cambridge: MIT Press.

Hidber, C. (1999). Online association rule mining. In A. Delis, C. Faloutsos & S. Ghandeharizadeh (Eds.), Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (pp. 145-156). New York: ACM Press.

Ho, K. & Robinson, C. (2001). Personal financial planning (3rd ed.). North York, ON (Canada): Captus Press Inc.

Holsheimer, M., Kersten, M. L., Mannila, H., & Toivonen, H. (1995). A perspective on databases and data mining. In U. M. Fayyad & R. Uthurusamy (Eds.), Proceedings of the First International Conference on Knowledge Discovery and Data Mining (pp. 150-155). Menlo Park, CA: AAAI Press.

Houtsma, M. & Swami, A. (1995). Set-oriented mining of association rules in relational databases. In P. S. Yu & A. L. Chen (Eds.), Proceedings of the Eleventh International Conference on Data Engineering (pp. 25-33). Los Alamitos, CA: IEEE Computer Society Publications.

Hung, S.-Y., Liang, T.-P., & Liu, V. W.-C. (1996). Integrating arbitrage pricing theory and artificial neural networks to support portfolio management. Decision Support Systems, 18(1996), 301-316.

John, G. H., Miller, P., & Kerber, R. (1996). Stock selection using rule induction. IEEE Expert, 11(5), 52-58.

Kohara, K., Ishikawa, T., Fukuhara, Y., & Nakamura, Y. (1997). Stock price prediction using prior knowledge and neural networks. International Journal of Intelligent Systems in Accounting, Finance and Management, 6(1), 11-22.

Kovalerchuk, B. & Vityaev, E. (2000). Data mining in finance: Advances in relational and hybrid methods. New York: Kluwer Academic Publisher.

Lazo, J. G., Maria, M., Vellasco, R., Aurelio, M., & Pacheco, C. (2000). A hybrid genetic-neural system for portfolio selection and management. In Proceedings of the 7th International Conference on Engineering Applications of Neural Networks. Kingston Upon Thames, UK: Kingston University.

Lin, D.-I., & Kedem, Z. M. (1998). Pincer search: A new algorithm for discovering the maximum frequent set. In H.-J. Schek, F. Saltor, I. Ramos & G. Alonso (Eds.), Advances in Database Technology – Proceedings of the 6th International Conference on Extending Database Technology (pp. 105-119). Berlin Heidelberg, Germany: Springer-Verlag.

Liu, J., Pan, Y., Wang, K., & Han, J. (2002). Mining frequent item sets by opportunistic projection. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 229-238). New York: ACM Press.

Lu, S., Hu, H., & Li, F. (2001). Mining weighted association rules. Intelligent Data Analysis, 5(2001), 211-255.

Mannila, H., Toivonen, H., & Verkamo, A. I. (1994). Efficient algorithms for discovering association rules. In U. M. Fayyad & R. Uthurusamy (Eds.), Knowledge Discovery in Databases: Papers from the 1994 AAAI Workshop (pp. 181-192). Menlo Park, CA: AAAI Press.

Miller, H. J. & Han, J. (2001). Geographic data mining and knowledge discovery. Bristol, PA: Taylor & Francis, Inc.

Mirkin, B. & Mirkin, B. G. (2005). Clustering for data mining: A data recovery approach. Virginia Beach, VA: Chapman & Hall / CRC.

Neftci, S. N. (2004). Principles of financial engineering. Burlington, MA: Elsevier Academic Press.

Park, J. S., Chen, M.-S., & Yu, P. S. (1995). An effective hash based algorithm for mining association rules. In M. J. Carey & D. A. Schneider (Eds.), Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data (pp. 175-186). New York: ACM Press.

Pei, J., Han, J., & Mao, R. (2000, May). CLOSET: An efficient algorithm for mining frequent closed itemsets. In D. Gunopulos & R. Rastogi (Eds.), Proceedings of the 2000 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (pp. 21-30), Dallas, TX.

Quah, T. S. & Srinivasan, B. (1999). Improving returns on stock investment through neural network selection. Expert Systems with Applications, 17(4), 295-301.

Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Francisco: Morgan Kaufmann Publishers.

Raghavan, S. N. R. (2005). Data mining in e-commerce: A survey. Sadhana, 30(2&3), 275-289.

Roberto, J. & Bayardo, Jr. (1998). Efficiently mining long patterns from databases. In L. M. Hass, & A. Tiwary (Eds.), Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data (pp. 85-93). New York: ACM Press.

Ross, S. (1976). The arbitrage theory of capital asset pricing. Journal of Economic Theory, 13, 341-360.

Rymon, R. (1992). Search through systematic set enumeration. In B. Nebel, C. Rich & W. R. Swartout (Eds.), Proceedings of the 3rd International Conference on Principles of Knowledge Representation and Reasoning (pp. 539-550). San Francisco: Morgan Kaufmann Publishers.

Savasere, A., Omiecinski, E., & Navathe, S. (1995). An efficient algorithm for mining association rules in large databases. In U. Dayal, P. M. D. Gray & S. Nishio (Eds.), Proceedings of the 21st International Conference on Very Large Data Bases (pp. 432-444). San Francisco: Morgan Kaufmann Publishers.

Tao, F., Murtagh, F., & Farid, M. (2003). Weighted association rule mining using weighted support and significance framework. In L. Getoor, T. E. Senator, P. Domingos & C. Faloutsos (Eds.), Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining (pp. 661-666). New York: ACM Press.

Thuraisingham, B. (1999). Data mining: Technologies, techniques, tools, and trends. Boca Raton, FL: CRC Press LLC.

Toivonen, H. (1996). Sampling large databases for association rules. In T. M. Vijayaraman, A. P. Buchmann, C. Mohan & N. L. Sarda (Eds.), Proceedings of the 22nd International Conference on Very Large Data Bases (pp. 134-145). San Francisco: Morgan Kaufmann Publishers.

Tseng, C.-C. (2004). Portfolio management using hybrid recommendation system. In Proceedings of the 2004 IEEE International Conference on e-Technology, e-Commerce, and e-Services (pp. 202-206). Los Alamitos, CA: IEEE Computer Society Publications.

Wang, H. & Weigend, A. S. (2004). Data mining for financial decision making. Decision Support Systems, 37(2004), 457-460.

Wang, W. & Yang, J. (2005). Mining sequential patterns from large data sets. Secaucus, NJ: Springer-Verlag New York, Inc.

Wang, W., Yang, J., & Yu, P. (2000). Efficient mining of weighted association rules (WAR). In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 270-274). New York: ACM Press.

Wang, J., Han, J., & Pei, J. (2003). CLOSET+: Searching for the best strategies for mining frequent closed itemsets. In L. Getoor, T. E. Senator, P. Domingos & C. Faloutsos (Eds.), Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 236-245). New York: ACM Press.

Wang, J. T. L., Zaki, M. J., Toivonen, H. T. T., & Shasha, D. E. (2005). Data mining in bioinformatics. London, UK: Springer-Verlag.

Ye, Z., Liu, X., Yao, Y., Wang, J., Zhou, X., Lu, P., & Yao, J. (2002). An intelligent system for personal and family financial service. In L. Wang, J. C. Rajapakse, K. Fukushima, S.-Y. Lee & X. Yao (Eds.), Proceedings of the 9th International Conference on Neural Information Processing (Vol. 5, pp. 2325-2327). Los Alamitos, CA: IEEE Computer Society Publications.

Yu, L., Wang, S., & Lai, K. K. (2005). Mining stock market tendency using GA-based support vector machines. In X. Deng & Y. Ye (Eds.), Proceedings of the First International Workshop on Internet and Network Economics (pp. 336-345). Berlin Heidelberg, Germany: Springer-Verlag.

Zaki, M. J. & Hsiao, C.-J. (2002). CHARM: An efficient algorithm for closed itemset mining. In R. L. Grossman, J. Han, V. Kumar, H. Mannila & R. Motwani (Eds.), Proceedings of the Second SIAM International Conference on Data Mining (Part IX No. 1). Philadelphia, PA: SIAM.

Zaki, M. J., Parthasarathy, S., Ogihara, M., & Li, W. (1997). New algorithms for fast discovery of association rules. In D. Heckerman, H. Mannila, & D. Pregibon (Eds.), Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (pp. 283-286). Menlo Park, CA: AAAI Press.

Zhang, D. & Zhou, L. (2004). Discovery golden nuggets: Data mining in financial application. IEEE Transactions on Systems, Man, and Cybernetics – Part C: Applications and Reviews, 34(4), 513-522.

ENDNOTE

1 CSMAR-CSTQR Database is provided by GuoTaiAn Information Technology Company, Shenzhen, China (http://www.chinagtait.com).

Chapter VIII
Application of Data Mining
Algorithms for Measuring
Performance Impact of Social
Development Activities
Hakikur Rahman
Sustainable Development Networking Foundation, Bangladesh

ABSTRACT

Social development activities are flourishing in diverse branches of social endeavor, despite numerous cross-sectoral hurdles in their way. They range from providing basic human services, such as education, health, and entrepreneurship, to more advanced undertakings, depending on the demand at the outset. However, when it comes to discovering true success cases around the globe, retracing their paths to accumulate knowledge, and, foremost, using newly emerged information technology methods to archive and disseminate model cases, not many stand on their own. This has happened for many reasons, a few of which are improper program design, inaccurate site selection, incorrect break-even analysis, insufficient funding, unbalanced manpower selection, inappropriate budget allocation, and inadequate feedback and monitoring. Apart from these, there are many hidden parameters that are not even visible. Furthermore, the visible and invisible parameters are so intricately intertwined that a lag in one derails the whole project, and eventually the program fails. Not surprisingly, all of these parameters depend on data and information about implemented programs or projects, which are mostly lacking. Thus, the lack of data and

Copyright © 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Application of Data Mining Algorithms

information on their appropriateness (or inappropriateness) has, in most cases, turned them into failed projects despite the devoted efforts of the implementers. This chapter focuses on data mining applications and their use in formulating performance-measuring tools for social development activities. In this context, it justifies the inclusion of data mining algorithms in monitoring and evaluation tools for various social development applications. Specifically, the chapter offers in-depth analytical observations on establishing knowledge centers through various approaches and, finally, puts forward a few research issues and challenges in transforming contemporary human society into a knowledge society.

INTRODUCTION

All information pertaining to a successful organization is truly its asset. Information such as client lists, vendor lists, product details, employee information, and corporate strategy is invaluable. Without an appropriate supply of information, a business cannot operate properly (Utimaco, 2005). This is potentially true for any sort of venture, whether it provides services to the scientific community, academics, civil society, or individuals. However, to make an intelligent decision, the information needs to be processed and compiled.

Data mining is a method of collecting and processing data that eventually assists in making knowledgeable decisions. In today’s information-based environment, data mining is steadily coming to the fore and attracting more and more attention. Data mining is all about acquisition, assessment, and analysis: by automatic or semiautomatic means, data in any quantity, huge or small, can help to uncover meaningful patterns and rules. These patterns and rules help enterprises improve their marketing, sales, and customer support operations and better understand their end users. Over the years, corporate houses have accumulated very large databases from applications such as enterprise resource planning (ERP), client relationship management (CRM), or other operational systems. People believe that there are untapped values hidden inside these data, and that data mining techniques can help draw these patterns out of the data.¹

Currently, data are being collected and accumulated across a wide variety of fields at a dramatic pace. Data are no longer a static matter for an enterprise or an organization, but have become an intrinsic and highly dynamic part of any management process. For these reasons, data mining algorithms are imperative for research on making intelligent decisions through data mining. To cope with this new arena of research, there is an urgent need for a new generation of computational theories and tools to assist humans in extracting useful information (knowledge) from the rapidly growing volumes of digital data.

At the same time, data mining and knowledge discovery in databases have been attracting a significant amount of research, industry, and media attention (Boulicaut, Esposito, Giannotti & Pedreschi, 2004; Bramer, 1999; Fayyad, Piatetsky-Shapiro & Smyth, 1996; Freitas, 2002; Kargupta & Chen, 2001; Kloesgen & Zythkow, 2002; Larose, 2004; Miller & Han, 2001). This chapter provides a brief overview of this emerging field, clarifying how data mining and knowledge discovery in databases are related to each other, with a particular focus on the application of data mining algorithms in establishing social development management systems. In this respect, the chapter intends to illustrate a few real-world applications, focusing specifically on data mining algorithms, and the challenges involved in those applications of knowledge discovery, including contemporary and future research directions in the arena of establishing knowledge centers that assist society in making intelligent decisions.

Along the way, this chapter offers a few hints on data mining algorithms and puts forward a few illustrations of how they may be applied in building decision support systems. Furthermore, it endeavors to justify several models for the establishment of knowledge centers. The author finds that knowledge centers (information centers, kiosks, community information centers) have been established in many countries during the last decade with the aspiration of assisting grassroots communities. However, until now, not much research has been conducted to measure their impact on society, nor have any cost-benefit analyses been carried out.

In recent years, many countries have seen the evolution of telecenters in various forms, ranging from kiosks, information centers, community information centers, and village information centers to multipurpose village information centers, knowledge centers, and the like. However, due to improper implementation, management, and monitoring, most of them have failed in many countries, and donors have withdrawn their support despite the enormous demand that exists in various parts of the world. From sub-Saharan countries with minimal information access to South Asian countries lacking an information management framework, many efforts to establish these information centers remain exorbitant. Furthermore, most of the telecenters did not maintain any records on their clients or their habits, nor on the reasons for their failures, nor did they keep any analytical studies. Given these perspectives, this chapter tries to devise a few algorithms to formulate measuring criteria for knowledge centers, utilizing data mining. Finally, it discusses a few challenges, with some hints on future research directions, before concluding.

BACKGROUND

In contrast to heuristics (which contain general recommendations based on statistical evidence or theoretical reasoning), algorithms comprise completely defined, finite sets of steps, operations, or procedures that produce a particular outcome. For example, with a few exceptions, all computer programs, mathematical formulas, and (ideally) health prescriptions and food recipes are algorithms.² Algorithms are based on finite patterns and occurrences in any incident, and the outcome can be quantified using mathematical formulations (Abbass, Sarker & Newton, 2002; Adamo, 2001; Kantardzic, 2002; Yoon & Kerschberg, 1993).

Historically, the concept of finding useful patterns in data has been given a variety of names, including data mining, knowledge extraction, information discovery, information harvesting, data archeology, data warehousing, data repository, or data pattern processing. Furthermore, the term data mining has been used mainly by statisticians, data analysts, and management information system (MIS) communities. Although it has also gained popularity in the database field (Chakrabarti, 2002; Fayyad, Piatetsky-Shapiro & Smyth, 1996; Hand, Mannila & Smyth, 2001; Liu & Motoda, 1998a, 1998b; Pal & Mitra, 2004; Perner & Petrou, 1999; Pyle, 1999), development partners and researchers implementing numerous development projects have remained aloof from using data mining techniques to preserve their data or content, as well as from using data mining algorithms to derive their project outcomes. Data remain a critical means of project evaluation, and data processing is treated as a simple means of converting raw data into tables or charts. The hidden pattern
within the data remains hidden, and the transformation of those data into knowledge elements has not gained concrete momentum until now.

Furthermore, no mathematical formulation has been derived that can handle the transformation of data into knowledge and, at the same time, measure its impact on society or quantify the impact of data transformation. The traditional method of turning data into knowledge relies on manual analysis and interpretation. For example, in the health-care industry, it is common for physicians or specialists to periodically analyze current trends and changes in health-care data. The specialists then provide a report detailing the analysis to the authority, and ultimately this report becomes the basis for future decision making and planning for health-care management. In a totally different category of application, planetary geologists sift through remotely sensed images of planets and asteroids, carefully locating and cataloging geologic objects of interest such as impact craters.

Or consider a village information center established in a very remote corner of a geographically dispersed region. Few ready-made formulas, algorithms, hypotheses, or measuring criteria have evolved to recognize such centers' patterns of growth and implementation, nature of operation, sustainability, or the replication of success cases in applicable states or stages.

Be it science, research, marketing, finance, health care, a retail shop, a community center, or any other field, the classical approach to data analysis relies fundamentally on one or more analysts becoming intimately familiar with the data and serving as an interface between the data and the users and end products (Berthold & Hand, 1999; Fayyad, Piatetsky-Shapiro & Smyth, 1996; Maimon & Last, 2000; Mattison, 1997). Nevertheless, in recent years many entrepreneurs have been formulating measuring criteria in areas that include marketing, finance (especially investment), fraud detection, data access, data cleaning, manufacturing, telecommunications, and Internet agents.

In this chapter, a few data mining algorithms based on rough set theory (RS) (Cox, 2004; Curotto & Ebecken, 2005; Kantardzic, 2002; Myatt, 2006; Nanopoulos, Katsaros & Manolopoulos, 2003; Thuraisingham, 1999; Zhou, Li, Meng & Meng, 2004) are included, which are used to extract decision-making rules from a data set. Rough set theory provides a neat methodology to formalize and calculate the results of data mining problems. In the early 1980s, Z. Pawlak, in cooperation with other researchers, developed rough set data analysis (RSDA) (Pawlak, 1982). In line with its main adage, “let the data speak for themselves”, RSDA tries to identify internal characteristics of a data set, such as categorization, dependency, and association rules, without invoking external metrics and judgment (Drewry et al., 2002).

MAIN THRUST

The output of a data mining algorithm is typically a pattern or a set of patterns that are valid in the given data. A pattern is defined as a statement (expression) in a given language, that describes (relationships among) the facts in a subset of the given data, and is in some sense simpler than the enumeration of all the facts in the subset. (Drewry et al., 2002, p. 2)

A given data mining algorithm usually depends on a built-in class of patterns, and the particular language of patterns considered depends on the characteristics of the given data (the attributes and their values).

This section constitutes the main thrust of the chapter and includes a few models/patterns of data mining algorithms that could be used to deduce possible measuring criteria for social development processes. However, to remain within the context of the chapter, specifically the algorithms related
to the establishment of knowledge centers are elaborated, with hints at a few other types of development activities.

Data mining for association rules is a useful method for analyzing data that describe transactions, lists of items, unique phrases (in text mining), and so forth. Generally, association rules take the form If Body then Head, where Body and Head stand for simple codes, text values, items, consumer choices, phrases, and so forth, or for conjunctions of codes, text values, and the like (e.g., if (debt=high and age<35 and repayment=high) then (risk=high and insurance=high); here the logical conjunction before the then is the body, and the logical conjunction following the then is the head of the association rule). Based on some user-defined “threshold” values for the rules (typically minimum support and confidence), the apriori algorithm (Agrawal, Imielinski & Swami, 1993; Agrawal & Srikant, 1994; Pei, Han & Lakshmanan, 2001; Witten & Frank, 2005) is a popular and efficient algorithm for deriving such association rules from large data sets.³

In this context, the decision tree algorithm is probably the most popular technique for predictive modeling. The following example explains some of the basics of decision tree algorithms. Table 1 shows a set of NGO data that could be used to predict credit risk. In this example, fictionalized information was generated on loan seekers, including their debt level, income level, type of employment, and whether they were a good or bad credit risk.

Table 1. Microcredit loan seekers’ information

Loan-seekers ID   Debt level   Income level   Employment status   Credit risk   Remarks
1                 High         High           Self-Employed       Bad
2                 High         High           Salaried            Bad
3                 High         Low            Self-Employed       Bad
4                 High         Low            Salaried            Bad
5                 Low          High           Self-Employed       Bad           Accepted
6                 Low          High           Salaried            Bad           Accepted
7                 Low          Low            Self-Employed       Bad
8                 Low          Low            Salaried            Bad
9                 High         High           Self-Employed       Good          Accepted
10                High         High           Salaried            Good          Accepted
11                High         Low            Self-Employed       Good
12                High         Low            Salaried            Good
13                Low          High           Self-Employed       Good          Accepted
14                Low          High           Salaried            Good          Accepted
15                Low          Low            Self-Employed       Good
16                Low          Low            Salaried            Good          Accepted

In the example illustrated in Figure 1, the decision tree algorithm might determine that the most significant attribute for predicting credit risk is debt level. The first split in the decision tree is, therefore, made on debt level. One of the two new nodes (debt = low) contains two cases with bad credit and three cases with good credit. The other


Figure 1. A partial decision tree that might be derived from Table 1

All: credit risk = good 5, credit risk = bad 2
    Debt = Low: credit risk = good 3, credit risk = bad 2
        Employment = SE: credit risk = good 1, credit risk = bad 1
        Employment = S: credit risk = good 2, credit risk = bad 1
    Debt = High: credit risk = good 2, credit risk = bad 0
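
The counts in Figure 1 can be checked against Table 1 with a short sketch. It is not taken from the chapter and rests on two assumptions: that the tree was built from the seven rows of Table 1 marked Accepted (a reading under which the counts in Figure 1 add up), and that the split is chosen by entropy-based information gain, a common criterion that the chapter itself does not name. The identifiers ACCEPTED, COLUMN, information_gain, and class_counts are illustrative only.

```python
import math
from collections import Counter

# Assumption: the seven rows of Table 1 whose Remarks column reads "Accepted".
# Each row is (debt level, income level, employment status, credit risk).
ACCEPTED = [
    ("Low", "High", "Self-Employed", "Bad"),    # loan seeker 5
    ("Low", "High", "Salaried", "Bad"),         # loan seeker 6
    ("High", "High", "Self-Employed", "Good"),  # loan seeker 9
    ("High", "High", "Salaried", "Good"),       # loan seeker 10
    ("Low", "High", "Self-Employed", "Good"),   # loan seeker 13
    ("Low", "High", "Salaried", "Good"),        # loan seeker 14
    ("Low", "Low", "Salaried", "Good"),         # loan seeker 16
]
COLUMN = {"debt": 0, "income": 1, "employment": 2}
CREDIT_RISK = 3


def class_counts(rows):
    """Number of good and bad credit risks among the rows."""
    counts = Counter(row[CREDIT_RISK] for row in rows)
    return {"good": counts["Good"], "bad": counts["Bad"]}


def entropy(rows):
    """Shannon entropy, in bits, of the credit-risk label."""
    total = len(rows)
    counts = Counter(row[CREDIT_RISK] for row in rows)
    return -sum(n / total * math.log2(n / total) for n in counts.values())


def split(rows, attribute):
    """Group the rows by their value for one attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[COLUMN[attribute]], []).append(row)
    return groups


def information_gain(rows, attribute):
    """Entropy of the rows minus the weighted entropy after the split."""
    groups = split(rows, attribute).values()
    return entropy(rows) - sum(len(g) / len(rows) * entropy(g) for g in groups)


gains = {name: information_gain(ACCEPTED, name) for name in COLUMN}
best_attribute = max(gains, key=gains.get)  # attribute chosen for the first split
by_debt = {v: class_counts(g) for v, g in split(ACCEPTED, "debt").items()}
low_debt = split(ACCEPTED, "debt")["Low"]
by_employment = {v: class_counts(g) for v, g in split(low_debt, "employment").items()}

print(best_attribute)
print(by_debt)
print(by_employment)
```

Under these assumptions, debt level gives the largest information gain of the three attributes, and the class counts for the debt split and for the employment split of the low-debt node agree with Figure 1.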

node (debt = high) contains two cases with good credit and no case with bad credit, so it is a leaf node; in this example, a high debt level is a perfect predictor of a good credit risk. The debt = low node is still mixed and is split further on employment status.

Department stores may use data mining to understand customer behavior, sales trends, and market behavior, and to predict market strategy. This can be done using a table such as Table 2.

Table 2. A combination of case table and nested table

Customer ID   Age group   Marital status   Wealth group   Product purchases
                                                          Product           Quantity
1             c           M                B              Washing machine   1
                                                          TV                1
                                                          Shampoo           2
2             e           S                C              Diet Coke         12
                                                          TV                1
                                                          Jelly             3
                                                          Cake              2
3             b           M                A              Coke              3
                                                          Cake              1
                                                          Jelly             1

Age group: a- below 15, b- 15-20, c- 21-26, d- 27-32, e- 33-38, f- 39-44, g- 45-50, h- 51-56, i- 57-62, j- above 62; Marital status: M- married, S- separated, D- divorced, U- unmarried; Wealth group: A- less than $50,000, B- between $50,000-250,000, C- between $251,000-450,000, D- above $451,000. The divisions are fictitious; the actual divisions depend on the decision of the analyst and implementer.

Table 2 includes two forms of table: a case table and a nested table. A case table contains the case information related to the non-nested part of the data, and a nested table contains information related to the nested part of the data. In Table 2, there are two input tables to the mining model. One table contains information about customer demographics; it is a case table. The other table contains information
about customer purchases; it is a nested table. In database technology, a nested table is similar to a transaction table.

In the example, the age groups may be made broader at the cost of accuracy in the result, whereas narrower age groups lead to more complicated algorithms. This applies to the other parameters too.

As another example of data mining, consider hidden patterns inside data. Data mining finds hidden patterns inside data sets, and these patterns can be used to solve many business problems. Table 3 presents a few business questions that are difficult to answer without data mining; at the same time, answers to these questions are essential for making decisions on predictive marketing (Ville, 2001; Ville, 2006; Weiss & Indurkhya, 1997). Fields for Table 3 could be Cust_ID, Income, Other_Income, Loan, Age_Group, Area_Residence, Home_Years, Value_House, Home_Type, Insured, Type_of_Insurance, Education_Level, Leave_Yes_No, and others.

Table 3. A table showing information for predictive marketing

Question number   Question (data mining algorithm)
1                 Identifying the customers who are most likely to leave, based on customer demographic information (decision tree without nested table)
2                 Grouping heterogeneous customers into subgroups based on customer profile, to generate a mailing list for marketing purposes (clustering without nested table)
3                 Finding other products that the customer may be interested in, based on the products the customer has purchased (cross-selling using decision tree with nested table)
4                 Grouping customers into more or less homogeneous groups based on the customer profile and the list of banking products they have subscribed to (clustering with nested table)

Figure 2. Data mining algorithm for a supermarket

Supermarket “My_Choice_DT_Nonnested” Execute:
CREATE MINING MODEL [My_Choice_DT_Nonnested]
(
    [Cust_ID]            LONG   KEY,
    [Income]             DOUBLE CONTINUOUS,
    [Other_Income]       DOUBLE CONTINUOUS,
    [Loan]               DOUBLE CONTINUOUS,
    [Age_Group]          DOUBLE CONTINUOUS,
    [Area_Residence]     TEXT   DISCRETE,
    [Home_Years]         DOUBLE CONTINUOUS,
    [Value_House]        DOUBLE CONTINUOUS,
    [Home_Type]          TEXT   DISCRETE,
    [Insured]            TEXT   DISCRETE,
    [Type_of_Insurance]  TEXT   DISCRETE,
    [Education_Level]    TEXT   DISCRETE,
    -- [Others]: further columns, depending on the application
    [Leave_Yes_No]       TEXT   DISCRETE PREDICT   -- the column to be predicted
)
USING Any_Decision_Trees   -- placeholder for the provider's decision tree algorithm
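
Question 3 of Table 3 (cross-selling from a nested table) can be illustrated with the purchases listed in Table 2. The sketch below is only an illustration under stated assumptions: it reads each customer's nested table as one transaction, ignores quantities, and swaps in a plain count of single-item If Body then Head rules in place of the decision tree that Table 3 names. It is exhaustive counting, not the apriori algorithm, and the thresholds 0.5 and 0.8 as well as the names PURCHASES, rules, and suggest are hypothetical.

```python
from itertools import permutations

# Nested "Product purchases" part of Table 2: customer ID -> {product: quantity}.
PURCHASES = {
    1: {"Washing machine": 1, "TV": 1, "Shampoo": 2},
    2: {"Diet Coke": 12, "TV": 1, "Jelly": 3, "Cake": 2},
    3: {"Coke": 3, "Cake": 1, "Jelly": 1},
}

# Each nested table is read as one transaction; quantities are ignored.
TRANSACTIONS = [set(products) for products in PURCHASES.values()]


def support(items, transactions):
    """Fraction of transactions that contain every one of the items."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def rules(transactions, min_support, min_confidence):
    """Single-item rules 'if body then head' that pass both thresholds."""
    products = sorted(set().union(*transactions))
    found = []
    for body, head in permutations(products, 2):
        both = support({body, head}, transactions)
        if both < min_support:
            continue
        # Confidence: how often the head appears when the body does.
        conf = both / support({body}, transactions)
        if conf >= min_confidence:
            found.append((body, head, both, conf))
    return found


def suggest(purchased, transactions, min_support=0.5, min_confidence=0.8):
    """Products to offer a customer, given what the customer already bought."""
    purchased = set(purchased)
    return sorted({
        head
        for body, head, _, _ in rules(transactions, min_support, min_confidence)
        if body in purchased and head not in purchased
    })


for body, head, sup, conf in rules(TRANSACTIONS, 0.5, 0.8):
    print(f"if {body} then {head}: support={sup:.2f}, confidence={conf:.2f}")
print(suggest({"Jelly"}, TRANSACTIONS))
```

With only three customers the result is a toy: the single pair that clears both thresholds is cake and jelly, bought together by customers 2 and 3. On a realistic transaction table, the exhaustive loop over item pairs would give way to a candidate-pruning method such as apriori.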

The algorithm could use the CREATE statement for the data mining application shown in Figure 2.

Association rule mining is another fundamental technique in data mining. In some real-life applications, for example, market basket analysis in supermarket chain stores, data sets can be too large for manual analysis, and potentially valuable relations among attributes may not be evident at a glance. An association rule-mining algorithm can find frequent patterns (sets of database attributes) in a given data set and generate association rules among database attributes. For example, some items are frequently sold together, such as milk and cereal, or bread and butter. Such items can be displayed together to improve the convenience of shopping. Association rule mining is generally applicable to those applications in which the data set is large and it is useful to find frequent patterns and their associations, for example, market basket analysis, medical research, and intrusion detection.

Similarly, algorithms may be devised for various other social activities, such as a readymade garments databank (bridging the gap between developed and developing countries), NGO networks engaged in social development work, a skill and capacity development databank (migration of skilled workers), a jobs databank (for youths and the jobless), an online blood bank (during emergencies and disasters), and a microcredit databank for the overall benefit of society. A case study on knowledge centers is discussed in the next subsection.

Case Study: Knowledge Center

In recent years, information centers (designated here as knowledge centers) have been established in many countries with great enthusiasm. From developed to developing countries, they have been highly appreciated not only by development partners, but also by all members of the communities. They have evolved as telecenters, community information centers, cyber centers, village information centers, kiosks, and under other familiar names accepted by the communities. Depending on their connectivity to the Internet, many remain connected while many others remain off-line, providing various kinds of ICT support to the community. Furthermore, depending on the availability of the Internet, they use VSAT (SCPC, MCPC, TDMA, FTDMA), radio (microwave; mostly the free 2.4 GHz and 5.8 GHz frequencies), broadband (DSL, ADSL, SDSL, fiber, ISDN), dial-up (mainly PSTN), and other varieties of LAN/WAN formation. Very recently, with the emergence of cellular phones, GPRS and EDGE have been used extensively to connect to the Internet.

However, as has been observed, most of the knowledge centers have been established through donor funding or subsidies from national governments. While many of them have succeeded in producing the expected outcome, many have failed to produce any visible output or outcome. Nor has any quantified evaluation been conducted by any donor agency to justify their existence, nor have any measurable indicators been developed to measure their performance. Over the years, a few research studies have been conducted (IDRC, World Bank, EU), but the availability of those reports at the end-user level has remained meager. Furthermore, the migration of telecenters into the daily life of the common citizen remains incomplete due to many visible and nonvisible parameters. These need extensive research in an organized fashion. The most recent approach by Telecenter.org could, hopefully, accommodate a separate unit for this purpose, instead of simply building new telecenters without feasibility studies.

Coming to the point of establishing knowledge centers, many of them have emerged as stand-alone units in remote rural setups. These were desired, but without a loopback or feedback on their performance, the operation and maintenance of those centers become more strained day by day. A preferred approach in this direction is to establish
Application of Data Mining Algorithms

them under generic clusters unified as blocks of centers in a region. Though it is extremely difficult to patent this sort of intricate network, in the longer run there is no alternative to building knowledge centers that piggyback on each other and operate within a cluster. This is needed to manage them properly, and at the same time, it is easier to manipulate them through proper data mining and the application of data mining algorithms.

Clustering

Clustering is a mechanism for data analysis that solves classification problems. Its objective is to distribute cases (people, objects, events, dealings, etc.) into groups, so that the degree of association is strong among members of the same cluster and weak among members of different clusters. This way each cluster describes, in terms of the data collected, the class to which its members belong. Clustering is also a discovery tool. It may reveal associations and structure in data that were not previously evident, but become sensible and useful once found. The results of cluster analysis may contribute to the definition of a formal classification scheme, such as a nomenclature for related animals, insects or plants; or suggest statistical models with which to describe populations; or indicate rules for assigning new cases to classes for identification and diagnostic purposes; or provide measures of definition, size, and change in what previously were only broad concepts; or find exemplars to represent classes (Mirkin, 2005). Whatever field an establishment operates in, the chances are that sooner or later it will run into a classification problem. Cluster analysis might provide the methodology to help solve it properly.4 Data clustering methods have proven to be a successful data mining technique in the analysis of discrete data.

The data-clustering algorithm can be used to build a datagram (Alon et al., 1999). Each node, N, may be represented by a vector, v_N = (x1, x2, x3, …, xn), whose components, x1 to xn, correspond to expression levels in each sample cluster. The vectors can then be normalized, so that the sum over their components equals zero, that is, ∑m xm = 0, and the magnitude equals one, |v_N| = 1. The clusters may be split into two groups by first defining two cluster centroids, K_j, where j = 1, 2. A probability of belonging to each cluster can then be determined for each node:

\[ B_j(v_N) = \frac{e^{-a\,|v_N - K_j|^2}}{\sum_j e^{-a\,|v_N - K_j|^2}} \]

where the cluster centroids producing knowledge outcome can be determined by the equation:

\[ K_j = \frac{\sum_N v_N\, B_j(v_N)}{\sum_N B_j(v_N)} \]

which is solved by iteration. For a = 0 there is only one cluster, K_1 = K_2. The parameter a can be increased in small steps until two distinct, converged centroids are formed. Then each node may be assigned to the cluster with the larger B_j(v_N). The process may be repeated to split each one of the new clusters. The algorithm may run against the cluster samples, where each cluster sample, N, may be represented by the vector v_N.

Formation of Wi-Fi Clusters

Wi-Fi clusters are becoming more popular among development agencies establishing knowledge centers, as the majority of places where there is demand for a knowledge center lack robust Internet infrastructure, and these demand expansion of Wi-Fi bases along the geographically


dispersed peripheries. In this context, it is ideal to build Wi-Fi bases that form a symmetric matrix with homogeneous dispersion; a mid-range Canopy/Smartbridge/Bridgeaccess radio (using IEEE 802.11a, 802.11b) can cover an air distance of around 15-25 km, and an average Airpro Gold (using IEEE 802.11g, 802.16) may cover an air distance of around 20-35 km. However, the cost of the radio varies with the nature of the terrain and the data throughput of the knowledge center (e.g., short range, mid-range, long range, very long range; data throughput of 64 Kbps to 10 Mbps).

The cluster affinity search technique (CAST) and enhanced cluster affinity search technique (E_CAST) algorithms take as input an n-by-n similarity matrix M, where M(i, j) ∈ [0, 1], and an affinity threshold AT. AT is used to determine node membership of a cluster. For the sake of calculation, a few definitions are introduced next:

• Definition 1: The affinity of a node x to a cluster K is defined as: a(x) = ∑_{N∈K} M(x, N)

• Definition 2: The connectivity threshold, Χ, of a cluster K is: Χ = AT |K|, where |K| is the cardinality of K.

• Definition 3: A high connectivity node is a node that will be included in a cluster. Its affinity satisfies a(i) ≥ Χ, where a(i) is the affinity of i.

• Definition 4: A low connectivity node is a node that will be removed from a cluster. Its affinity satisfies a(i) < Χ, where a(i) is the affinity of i.

Each cluster is formed by alternating between adding and removing nodes from the current cluster until changes no longer occur or a maximum number of iterations has been executed.

• Node addition: Add nodes with high connectivity to the nodes in the open cluster.
• Node removal: Remove any nodes in the open cluster with low connectivity to the other nodes in the cluster.
• Cluster cleaning: Make sure all nodes are in the clusters with which they have the highest affinity.

However, the CAST algorithm relies on the affinity threshold, AT, which is an input variable defined by the user before initiating the clustering process. This could create a problem because the size and number of the clusters produced by the algorithm depend directly on this parameter (Ben-Or, 1983; Ben-Dor, Shamir & Yakhini, 1999; Diplaris, Tsoumakas, Mitkas & Vlahavas, 2005). For this reason, thorough knowledge of the data set is required before the clustering can be performed. The algorithm may be enhanced to calculate this threshold. The threshold can be calculated dynamically based only on the objects that have yet to be assigned to a cluster, U′ = U \ (C0 ∪ C1 ∪ … ∪ Cn), before each cluster is created, which provides a means of fine-tuning during cluster formation. The threshold parameter, DT, is then calculated based on the similarity values of the nodes left to be clustered.

This dynamic threshold is computed as follows:

\[ DT = \frac{\sum_{i,j \in U,\; M(i,j) \ge 0.5} \left( M(i,j) - 0.5 \right)}{\left| \{ u : u \in U' \text{ and } a(u) \ge 0.5 \} \right|} \]

The pseudo-code in Figure 3 adds 0.5 to this average, that is, DT = (a/count) + 0.5.

After deducing the mathematical model of the dynamic threshold, pseudo-code for the threshold calculation, cluster formation, and remodeling steps is illustrated in Figure 3. The dynamic


Figure 3. Pseudo-codes for dynamic threshold calculation

Dynamic Threshold:
    // DT is an input parameter
    CAST:
        DT = fixed value (ideally, 1)

    // executed before each new Kopen is created
    E_CAST:
        a = 0
        count = 0
        for all u ∈ U such that a(u) ≥ 0.5 {
            a += a(u) - 0.5
            count++
        }
        DT = (a / count) + 0.5

Cluster Formation Algorithm Pseudo-Code:

    while (U ≠ ∅) {
        E_CAST: Calculate Threshold, DT
        for all u ∈ U set a(u) = 0
        create empty cluster Kopen
        Pick an element u ∈ U such that M(u,x) = max{M(w,x) | w and x ∈ U}
        Kopen ← Kopen ∪ {u}
        U ← U \ {u}
        For all x ∈ U set a(x) = a(x) + M(x,u)
        while (changes in Kopen occur) and (iterations < max iterations) {
            // Addition Step
            while max{a(w) | w ∈ U} ≥ Χ {
                Pick an element u ∈ U such that a(u) = max{a(w) | w ∈ U}
                Kopen ← Kopen ∪ {u}
                U ← U \ {u}
                // Update affinity of all nodes
                For all x ∈ U ∪ Kopen set a(x) = a(x) + M(x,u)
            }
            // Removal Step
            while min{a(w) | w ∈ Kopen} < Χ {
                Pick an element u ∈ Kopen such that a(u) = min{a(w) | w ∈ Kopen}
                Kopen ← Kopen \ {u}
                U ← U ∪ {u}
                // Update affinity of all nodes before returning a final value
                For all x ∈ U ∪ Kopen set a(x) = a(x) - M(x,u)
            }
        }
    }

Remodeling Step:
    while (changes in any Ki occur) and (iterations < max iterations) {
        // cleaning step may not converge
        for each c ∈ Ki and Ki ∈ K {
            Compute a normalized affinity of c to each cluster Kj ∈ K such
            that aj(c) = (∑k∈Kj M(c,k)) / |Kj|
            if max{aj(c)} > ai(c), for all Kj ∈ K and i ≠ j {
                Ki ← Ki \ {c}
                Kj ← Kj ∪ {c}
            }
        }
    }
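The threshold calculation and the addition/removal loop of Figure 3 can be sketched in Python as follows. This is only an illustrative reading of the pseudo-code, not the published E_CAST implementation: the function names `dynamic_threshold` and `e_cast` are hypothetical, the threshold is taken over pairwise similarities of unassigned nodes (following the DT formula in the text) with 0.5 added back (following Figure 3), and two assumptions are added for safety, namely a fallback of 0.5 when no pair qualifies and a guard that never empties the open cluster.

```python
from itertools import combinations

def dynamic_threshold(M, unassigned):
    """Mean excess over 0.5 of pairwise similarities >= 0.5, shifted back by 0.5."""
    excess = [M[i][j] - 0.5
              for i, j in combinations(sorted(unassigned), 2) if M[i][j] >= 0.5]
    if not excess:                      # assumption: fall back to 0.5 if no pair qualifies
        return 0.5
    return sum(excess) / len(excess) + 0.5

def e_cast(M, max_iterations=100):
    n = len(M)
    unassigned = set(range(n))
    clusters = []
    while unassigned:
        dt = dynamic_threshold(M, unassigned)
        # Seed the open cluster with a node from the most similar unassigned pair.
        seed = max(sorted(unassigned),
                   key=lambda u: max((M[u][x] for x in unassigned if x != u), default=0.0))
        cluster = {seed}
        unassigned.discard(seed)
        affinity = [M[x][seed] for x in range(n)]   # a(x) = sum of M(x, k) over k in cluster
        for _ in range(max_iterations):
            changed = False
            # Addition step: add the highest-affinity node while it reaches DT * |K|.
            while unassigned:
                u = max(sorted(unassigned), key=lambda x: affinity[x])
                if affinity[u] < dt * len(cluster):
                    break
                cluster.add(u)
                unassigned.discard(u)
                affinity = [affinity[x] + M[x][u] for x in range(n)]
                changed = True
            # Removal step: drop the lowest-affinity member while it is below DT * |K|.
            while len(cluster) > 1:                 # assumption: never empty the cluster
                u = min(sorted(cluster), key=lambda x: affinity[x])
                if affinity[u] >= dt * len(cluster):
                    break
                cluster.discard(u)
                unassigned.add(u)
                affinity = [affinity[x] - M[x][u] for x in range(n)]
                changed = True
            if not changed:
                break
        clusters.append(sorted(cluster))
    return clusters

# Toy similarity matrix: nodes 0-2 are mutually similar, as are nodes 3-4.
M = [
    [1.0,   0.875, 0.875, 0.125, 0.125],
    [0.875, 1.0,   0.875, 0.125, 0.125],
    [0.875, 0.875, 1.0,   0.125, 0.125],
    [0.125, 0.125, 0.125, 1.0,   0.625],
    [0.125, 0.125, 0.125, 0.625, 1.0],
]
clusters = e_cast(M)
print(clusters)
```

Because the threshold is recomputed from the nodes still unassigned, the second cluster in this toy matrix is formed with a lower threshold than the first, which is the fine-tuning effect the dynamic threshold is meant to provide.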

threshold assignment has been shown here to obviate the need for the "cleaning" step as proposed in the original algorithm (Alon et al., 1999). The cleaning step is used to move any vector from its current cluster to one for which it may have a higher affinity, and has a time complexity on the order of O(n²) (Bellaachia, Portnoy, Chen & Elkahloun, 2002).

Implementing Models for Knowledge Centers

Similar to the design of knowledge centers, implementing them in a wholesome form demands extensive study of their patterns, funding conditions, localizations, implementing agencies, and, foremost, the ultimate objectives of the implementers. Without running into complicated materials in this aspect, this subsection will look into three forms of implementing models/patterns in terms of framing a viable algorithm or mathematical formulation. They are the randomized model, the homogeneous model, and the additive model.

Randomized Models

These are the most available and popular forms of implementation model so far, but at the same time the most vulnerable to perishing at an early stage, as the majority of them have been implemented without any study of the sustainability parameters before implementation. Over 50% of them are sure to die. Rigorous observations reveal that, without knowing the baseline, any telecenter has at most a 50% chance of survival. However, a mathematical model may be derived from probability theory.

The modern definition of a discrete probability distribution starts with a set called the sample space, which relates to the set of all possible outcomes in the classical sense, denoted by δ = {x1, x2, …}. It is then assumed that for each element x ∈ δ, an


intrinsic "probability" value ƒ(x) is designated, which satisfies the following properties:

1. ƒ(x) ∈ [0, 1] for all x ∈ δ
2. ∑_{x∈δ} ƒ(x) = 1

That is, the probability function ƒ(x) lies between zero and one for every value of x in the sample space δ, and the sum of ƒ(x) over all values x in the sample space δ is exactly equal to 1. An event may then be defined as any subset E of the sample space δ. The probability of the event E is:

\[ P(E) = \sum_{x \in E} f(x) \]

so that the probability of the entire sample space is unity, and the probability of the null event is zero. Furthermore, the function ƒ(x) mapping a point in the sample space to the "probability" value is called a probability mass function (pmf).

However, the modern definition does not try to answer how probability mass functions are

Figure 4. Algorithm of Ben-Or's consensus protocol (Adapted from Aspnes, 2002).

    Input: Boolean value from input register
    Output: Boolean value stored in output register
    Data: Boolean preference, integer round
    begin
        preference ← input
        round ← 1
        while true do
            send (1, round, preference) to all nodes
            wait to receive n − t (1, round, *) messages
            if received more than n/2 (1, round, v) messages then
                send (2, round, v, ratify) to all nodes
            else
                send (2, round, ?) to all nodes
            end
            wait to receive n − t (2, round, *) messages
            if received a (2, round, v, ratify) message then
                preference ← v
                if received more than t (2, round, v, ratify) messages then
                    output ← v
                end
            else
                preference ← CoinFlip()    // CoinFlip returns 0 or 1 at random
            end
            round ← round + 1
        end
    end
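To make the round structure of Figure 4 concrete, the following Python sketch simulates it under strong simplifying assumptions that are not part of the figure: all nodes run in lockstep, no node fails, and every message is delivered, so every node sees the same messages in each phase. The names `ben_or_round` and `ben_or` are hypothetical, and the real protocol is asynchronous and tolerates up to t failures.

```python
import random

def ben_or_round(prefs, t, rng):
    """One lockstep, failure-free round of Figure 4, applied to all nodes at once."""
    n = len(prefs)
    # Phase 1: every node broadcasts its preference and receives all n messages,
    # so all nodes agree on whether some value v has more than n/2 votes.
    majority = next((v for v in (0, 1) if prefs.count(v) > n / 2), None)
    # Phase 2: every node sends (ratify, v) if it saw such a majority, else "?".
    ratify_count = n if majority is not None else 0
    if ratify_count >= 1:
        new_prefs = [majority] * n
        decided = majority if ratify_count > t else None
    else:
        # No ratify message was received: each node flips a fair local coin.
        new_prefs = [rng.randint(0, 1) for _ in range(n)]
        decided = None
    return new_prefs, decided

def ben_or(inputs, t=0, seed=0, max_rounds=10_000):
    """Return (decided value, number of rounds used)."""
    rng = random.Random(seed)
    prefs = list(inputs)
    for rnd in range(1, max_rounds + 1):
        prefs, decided = ben_or_round(prefs, t, rng)
        if decided is not None:
            return decided, rnd
    raise RuntimeError("no decision within max_rounds")

print(ben_or([1, 1, 0, 1, 1]))   # a clear majority decides in the first round
print(ben_or([0, 1, 0, 1]))      # a tie forces at least one round of coin flips
```

In this degenerate setting the coin flips matter only when the inputs are evenly split; they are what allows the protocol to terminate with probability 1 rather than in every execution.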


obtained; instead it builds a theory that assumes their existence. Observed communities are generally more stable than randomly constructed communities with the same number of species. This greater stability of observed communities is partially due to the low values of both the mean and variance of their alpha distributions. It has also been observed that randomization of consumer resource utilization rates almost always increased the mean but not the variance of the calculated consumer similarities. Therefore, in comparison to randomly constructed communities, the lower similarities and greater stability of the observed communities suggest that competitive processes are important in shaping real communities (Lawlor, 1980).

Moreover, randomized models provide lower probabilities for some transitions, which means that instead of looking at a single worst-case execution, one must consider a probability distribution over bad executions. If the termination requirement is weakened to require termination only with probability 1, the nonterminating executions continue to exist, but they may collectively occur only with probability 0. In this case, there are two ways that randomness can be brought into an acceptable model. One is to assume that the model itself is randomized; instead of allowing arbitrary valid

Figure 5. Algorithm of Bracha and Rachman's voting protocol (Adapted from Aspnes, 2002; Bracha & Rachman, 1992)

    Input: none
    Output: Boolean output
    Local data: Boolean preference n; integer round r; utility variables c, total, and stable
    Shared data: single-writer register r[n] for each node n, each of which holds a pair of integers (flips, stable), initially (0,0)
    begin
        repeat
            for i ← 1 to n/log n do
                c ← CoinFlip()
                r[n] ← (r[n].flips + 1, r[n].stable + c)
            end
            Read all registers r[n]
            total ← ∑n r[n].flips
        until total > n
        Read all registers r[n]
        total ← ∑n r[n].flips
        stable ← ∑n r[n].stable
        if stable/total ≥ 1/2 then
            return 1
        else
            return 0
        end
    end
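The voting loop of Figure 5 can be imitated with a short sequential simulation. The sketch below is illustrative only: `voting_shared_coin` is a hypothetical name, the nodes are scheduled round-robin (in the real protocol they run concurrently against shared registers), "log" is taken to be the natural logarithm, and the result is 1 when at least half of all recorded flips are 1.

```python
import math
import random

def voting_shared_coin(n, seed=0):
    """Sequential simulation of the Figure 5 voting loop for n >= 2 nodes."""
    rng = random.Random(seed)
    flips = [0] * n    # r[node].flips: number of coin flips recorded by each node
    ones = [0] * n     # r[node].stable: number of those flips that came up 1
    batch = max(1, int(n / math.log(n)))    # "n / log n" flips per pass
    while sum(flips) <= n:                  # repeat ... until total > n
        for node in range(n):               # assumption: round-robin scheduling
            for _ in range(batch):
                c = rng.randint(0, 1)
                flips[node] += 1
                ones[node] += c
    # Majority vote over all recorded flips.
    return 1 if sum(ones) / sum(flips) >= 0.5 else 0

outcomes = [voting_shared_coin(8, seed=s) for s in range(200)]
print(outcomes.count(1), outcomes.count(0))
```

Because the outcome is a majority over many independent flips, a few delayed or missing votes change the result only when the tally is close, which is the property the voting protocol relies on.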


operations to occur in each state, particular operations only occur with some probability. Though randomized scheduling allows for very simple algorithms, it depends on assumptions about the behavior of the environment that may not be justified in practice. Thus it has not been as popular as the second approach (homogeneous), in which randomness is located in the processes themselves.

Figures 4 and 5 present algorithms for a consensus protocol (determine whether a message can reach all nodes) and a voting protocol (determine whether a node can be sustainable or not) that use a randomized pattern.

Homogeneous Models

As the voting protocol has been depicted in Figure 5, it has been observed that unweighted and weighted voting are two of the simplest methods for combining not only randomized but also homogeneous models. In voting, each model outputs a class value (or ranking, or probability distribution) and the class with the most votes (or the highest average ranking, or average probability) is the one proposed by the community. Note that this type of voting is in fact called plurality voting, in contrast to the frequently used term majority voting, as the latter implies that at least 50% (the majority) of the votes should belong to the winning class (Diplaris, Tsoumakas, Mitkas & Vlahavas, 2005), and in comparison to the randomized model, there is an additional probability of survival. Therefore, homogeneous models have a better probability of being sustainable.

In the homogeneous pattern, stacking can be introduced, which combines multiple classifiers by learning a meta-level (or level-1) model that predicts the correct class based on the decisions of the base-level (or level-0) classifiers. This model is induced on a set of meta-level data that are typically produced by applying a procedure similar to k-fold cross-validation on the available data. Let T be the level-0 data set. T is randomly split into k disjoint parts T1 … Tk of equal size. For each fold i = 1..k of the process, the base-level classifiers are trained on the set T \ Ti and then applied to the test set Ti. The output of those classifiers for a test instance, along with the true class of that instance, forms a meta-instance. A meta-classifier is then trained on the meta-instances, and the base-level classifiers are trained on all training data T. When a new instance appears for classification, the output of all base-level classifiers is first calculated and then propagated to the meta-level classifier, which outputs the final result. Thus, there are always opportunities of providing a higher value as the output.

These illustrations support that homogeneous patterns are expensive to establish and, at the same time, to maintain, but in the longer run are always competitive in terms of providing better services and stronger existence than the other two models, as discussed in this chapter.

Additive Models

In terms of mathematical formulation, additive models represent a generalization of multiple regression (a special case of general linear models). In linear regression, a linear least-squares fit is computed for a set of predictor or X variables, to predict a dependent Y variable. Thus, to predict a dependent variable Y, the well-known linear regression equation with m predictors can be stated as:

Y = b0 + b1*X1 + … + bm*Xm,

where Y stands for the (predicted values of the) dependent variable, X1 through Xm represent the m values for the predictor variables, and b0 and b1 through bm are the regression coefficients estimated by multiple regression. A generalization of the multiple regression model would be to maintain the additive nature of the model, but to replace the simple terms of the linear equation bi*Xi with fi(Xi), where fi is a nonparametric function of the


predictor Xi. In additive models, to achieve the best prediction of the dependent variable values5 (Hastie & Tibshirani, 1990; Schimek, 2000), an unspecified (non-parametric) function is estimated for each predictor, instead of a single coefficient for each variable (additive term).

In terms of continuous probability distributions, if the sample space is comprised of real numbers (ℝ), then a function called the cumulative distribution function (cdf), F, is assumed to exist, which gives P(X ≤ x) = F(x) for a random variable X. That is, F(x) returns the probability that X will be less than or equal to x.

The cdf satisfies the following properties:

1. F is a monotonically non-decreasing, right-continuous function
2. lim_{x → -∞} F(x) = 0
3. lim_{x → +∞} F(x) = 1

If F is differentiable, then the random variable X is said to have a probability density function (pdf), or simply density, ƒ(x) = dF(x)/dx.

For a set E ⊆ ℝ, the probability of the random variable X being in E is defined as:

\[ P(X \in E) = \int_{x \in E} dF(x) \]

In case the probability density function exists, then it can be written as:

\[ P(X \in E) = \int_{x \in E} f(x)\,dx \]

Whereas the pdf exists only for continuous random variables, the cdf exists for all random variables (including discrete random variables) that take values on ℝ. These concepts can be generalized for multidimensional cases on ℝ^n and other continuous sample spaces.

This theory reveals that additive models are easier to establish, simpler to calculate, and provide a multiplier effect if chosen with better probability values. Additive models can always piggyback on existing successful ones without high implementation costs, costlier maintenance, or unknown experimentation. The proverb of "learning from experience" applies to this type of establishment pattern.

Technical Issues and Recommendations

In terms of evaluating the performance indicator through data mining, the process can be very complicated and time consuming. The execution of knowledge discovery using SQL (KDS) on a real-world large database of 1.6 million records, with 6 independent variables for a total of 4,334 different values, required about 5 hours on a dual Pentium Pro computer with 128 MB of RAM and a 40 GB HD (Giuffrida, Cooper & Chu, 1998). The data that one handles or intends to handle in data mining is usually of such orders of magnitude that a human being cannot in fact comprehend it. In such circumstances, even an algorithm with the simplest complexity can be too expensive in terms of computation. In these contexts, algorithms with linear or log-linear complexity need to be adopted for performing the data mining tasks.

Furthermore, according to information theory, there is a certain limit to which a particular large body of data can be condensed without incurring loss of information. This limit is the entropy or information content of the data. Even if this theoretical limit of compression could be reached in practice, the resulting size of the data would still be far too large for a human being to examine. Hence, the effective mining of large data sets must permit and live with loss of information, and the impact may remain dependent largely on the data mining performance.
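As a small illustration of the entropy limit mentioned above, the following Python sketch estimates the Shannon entropy of a sequence from its symbol frequencies. The function name `shannon_entropy_bits` is hypothetical, and the estimate treats the symbols as independent draws from the empirical distribution, which real data sets only approximate.

```python
import math
from collections import Counter

def shannon_entropy_bits(symbols):
    """Shannon entropy, in bits per symbol, of the empirical symbol distribution."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Two equally likely symbols need 1 bit each; four equally likely symbols need 2.
h_two = shannon_entropy_bits("aabb")
h_four = shannon_entropy_bits("abcd")
h_one = shannon_entropy_bits("aaaa")
print(h_two, h_four, h_one)
```

Under that independence assumption, multiplying the per-symbol entropy by the number of symbols gives a rough lower bound on the lossless compressed size; for databases of millions of records, even that bound remains far beyond what a person can inspect.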


The most commonly used approach to this issue is to set a frequency threshold (benchmarking) and mine only those patterns that have a frequency of occurrence above this threshold (additive modeling). Such an approach arises out of the belief that if one must sacrifice some understanding of the domain, it would be best to sacrifice understanding of the least frequently occurring aspects (losing the unsuccessful ones, and piggybacking on success cases).

This is the very practical approach that data mining was initially tagged with and is increasingly referred to as statistical data mining. However, recently there has been interest in mining some of the less frequent aspects of a data set, because certain things, such as the threat of a terrorist attack or the existence of a rare breed of poisonous mushroom, seem worthy of attention even if they occur only once and are buried inside a large body of data. But mining such infrequent patterns places a large burden on the performance of traditional statistical data mining techniques. To address this issue, a number of data mining algorithms that are not statistical in nature are required. Furthermore, this then brings the question of where and how far to permit the necessary data loss to perform effective data mining if one wants to mine the infrequent rules from a data set (Drewry et al., 2002).

Monitoring the Impact of Knowledge Centers

In terms of developing a mathematical formula for monitoring the impact of knowledge centers, two forms of impact are discussed here.

Abrupt temporary impact: In a time series, the abrupt temporary impact pattern implies an initial abrupt increase or decrease due to the intervention, which then slowly decays without permanently changing the mean of the series. This type of intervention can be summarized by the expressions:

Prior to intervention: Impact_t = 0
At time of intervention: Impact_t = λ
After intervention: Impact_t = θ * Impact_{t-1}

This impact pattern is again defined by the two parameters λ (lambda) and θ (theta). As long as the θ parameter is greater than 0 and less than 1 (the bounds of system stability), the initial abrupt impact will gradually decay. If θ is near 0 (zero), then the decay will be very quick, and the impact will have entirely disappeared after only a few observations. If θ is close to 1, then the decay will be slow, and the intervention will affect the series over many observations. Furthermore, when evaluating a fitted model, it is again important that both parameters are statistically significant; otherwise one could reach paradoxical conclusions. For example, suppose the λ parameter is not statistically significantly different from 0 (zero) but the θ parameter is; this would mean that an intervention did not cause an initial abrupt change, which then showed significant decay.

Abrupt permanent impact: In a time series, a permanent abrupt impact pattern simply implies that the overall mean of the time series shifted after the intervention, and the overall shift is denoted by λ (lambda).6

Measuring Ripple Effect Impact of Knowledge Centers

In practice, when analyzing actual data, it is usually not crucially important to identify exactly the frequencies for particular underlying sine or cosine functions. Rather, because the periodogram values are subject to substantial random fluctuation, one is faced with the problem of very many "chaotic" periodogram spikes. In that case, one would like to find the frequencies with the greatest spectral densities, that is, the frequency regions, consisting of many adjacent frequencies, which contribute most to the overall periodic behavior of the series. This can be accomplished by smoothing


the periodogram values via a weighted moving average transformation.7

In time series, the Hamming window is a weighted moving average transformation used to smooth the periodogram values. In the Hamming (named after R. W. Hamming) window or Tukey-Hamming window (Blackman & Tukey, 1958), for each frequency, the weights for the weighted moving average of the periodogram values are computed as:

w_j = 0.54 + 0.46*cosine(π*j/p) (for j = 0 to p)
w_{-j} = w_j (for j ≠ 0)

where p = (m-1)/2 and it is supposed that the moving average window is of width m (which must be an odd number).

This weight function will assign the greatest weight to the observation being smoothed in the center of the window, and increasingly smaller weights to values that are further away from the center.8 In this way, the ripple effect impact of knowledge centers can be calculated.

Implementing an Ideal Homogeneous Pattern

Considering all the above justifications, a homogeneous pattern is suggested for implementation. However, in the case of an augmented product moment matrix, for a set of p variables, a (p + 1) × (p + 1) square matrix evolves. The first p rows and columns contain the matrix of moments about zero, while the last row and column contain the sample means for the p variables. Ideally, the development matrix can be shown in the following form:

\[ DM = \begin{pmatrix} M & \chi \\ \chi' & 1 \end{pmatrix} \]

where M is a matrix with elements

\[ M_{jk} = \frac{1}{N} \sum_{i=1}^{N} X_{ij} X_{ik} \]

and χ is a vector with the means of the variables.

Another indicator of the relationship among the members of a network can be derived if the edges of the network (Figure 6) can be set in a symmetrical matrix, like:

Figure 6. Person a is virtually linked to persons b, c, d and e (Adapted from Rahman, 2004)

[Diagram: nodes a, b, c, d and e, with a at the center and links labeled 1 to 4]


a | - 1 1 1 1 |
b | 1 - 0 0 0 |
c | 1 0 - 0 0 |
d | 1 0 0 - 0 |
e | 1 0 0 0 - |

while in Figure 6, a knows b, c, d and e. But the relationship between b, c, d and e may not be known (Rahman, 2004). These relationships will establish the ICT matrix, and with point-to-point relationships among the network entities, the ideal relationship value will be 1 (unity). The ICT development matrix evolves from this unity relationship. Even if the network entity follows point-to-multipoint or multipoint-to-point paths, for a development matrix it must be upgraded to provide a unity relationship value (either zero communication or unity communication; in other words, either communication or no communication) (Rahman, 2006).

Implementing a Generic Model

In general, a time series consists of four different components: (a) a seasonal component (denoted as St, where t stands for the particular point in time), (b) a trend component (Tt), (c) a cyclical component (Ct), and (d) a random, error, or irregular component (It). The difference between a cyclical and a seasonal component is that the latter occurs at regular intervals, while cyclical factors usually have a longer duration that varies from cycle to cycle. The trend and cyclical components are customarily combined into a trend-cycle component (TCt). The specific functional relationship between these components can assume different forms.

However, two straightforward possibilities are that they combine in an additive or a multiplicative fashion:

Additive Model: Xt = TCt + St + It
Multiplicative Model: Xt = Tt*Ct*St*It

where Xt represents the observed value of the time series at time t.

Given some a priori knowledge about the cyclical factors affecting the series (e.g., business cycles), the estimates for the different components can be used to compute forecasts for future observations. However, the exponential smoothing method, which can also incorporate seasonality and trend components, can be used as the preferred technique for forecasting purposes (Hale, Threet & Shenoi, 1994; Han, Kamber & Chiang, 1997).

FUTURE ISSUES AND CHALLENGES

Data mining algorithms in future should consider incorporation of larger databases, high dimensionality, overfitting, assessment of statistical significance, dynamic databases, adaptation of knowledge theory, treatment of missing and noisy data, complex relationships between fields, understandability of tattered patterns, user interaction and prior knowledge, and integration and versatility with other systems (Wang, 2003).

While measuring the performance impact of social development activities, future research should formulate a homogeneous pattern of implementation, given that the nature of the environment, economy, culture and other parameters varies at the peripheries. Specifically, in terms of knowledge centers, there should be a symmetric matrix to follow as a guideline, over which each node, subnode, or any discrete existence of a knowledge center could be established. This will reduce the design cost, operating expenditure, and monitoring complexity, and assist in measuring the performance quantitatively.

Given the three patterns of implementation model, numerous debates are still running


Figure 7. Vertical pattern of data transformation (Adapted from Fayyad, Piatetsky-Shapiro & Smyth, 1996)

[Diagram, bottom to top: Data → Selection → Target data → Processing → Processed data → Transformation → Transformed data → Data Mining → Recognized patterns → Interpretation/Evaluation → Knowledge. Label: Procedures/Algorithms]

across the globe about their advantages and disadvantages. A systematic approach, in terms of establishing a mathematical formula and its consequential algorithm, will ease debacles of enormous nature and lead to the deduction of a verified threshold as output. Furthermore, quantification of knowledge development from the immensely discrete activities of a qualitative nature will remain a challenge for future researchers.

Finally, utilizing data mining algorithms for measuring performance impact demands huge storage of data of varying nature; many of them have not been archived during the last decade of implementation phases (collection and archival of existing data), and by far most of them need to be transformed into recognized data sets, so that they can be used by verified data readers (transformation to any recognized database structure).

Now, before concluding, two patterns of data transformation are portrayed here (Figures 7 and 8) that seem relevant to the main theme of the book. If a community would like to synthesize data and transform them into knowledge, two types of transformation pattern are visible. The


Figure 8. Pyramid pattern of data transformation (A proposed pattern)

[Diagram labels: Knowledge; Transformed data; Refined content; Management of content; Transformation of data; Feedback; Processed data; Knowledge accumulation; Monitoring & evaluation; Innovation & research; Processing; Data, content, information; Data management; Content management]

vertical one is more or less thorough and involves several stages of action during the transformation process, though it deserves rigorous study and close observation. The other form of transformation is the pyramid pattern, with processing, innovation, and knowledge at its three edges and data at the core. The author proposes a modified pyramid pattern of data transformation; although it seems difficult to implement, the transformation process would take a shorter period to settle down and would be sustainable in the longer run. Researchers may derive separate algorithms for this transformation process, so that an acceptable measuring indicator may evolve in future.

CONCLUSION

It is well recognized that real-world knowledge-measurement applications vary in terms of underlying data, complexity, the amount of human involvement required, and the degree to which parts of the discovery process can be automated. In most applications, however, an indispensable part of the measurement process is that the analyst explores the data and sifts through the raw data to become familiar with it and to get a feel for what the data may cover. Furthermore, very often an explicit specification of what one is actually looking for only arises during an interactive process of data exploration, analysis, and segmentation (Stumme, Wille & Wille, 1998). Therefore, proper data mining techniques with timely feedback analysis on the executed results deserve immediate attention if results are to be accurate.

It is a difficult task to eliminate theories of probability, redundancies of effort, and abundances of varying data in determining reasonable mathematical formulae to measure the impact of social development processes. Complexity accumulates further when it comes to projects or programmes that are related to newly evolved ICTs. Many developing and transitional economies are entangled with severe social problems within the vicious poverty cycle; thereby the evolution of ICT-emulated performance indicators is extremely difficult to achieve. Such indicators are diverse, tend to diverge, and become vulnerable in the longer run without a verified mathematical model.

Moreover, data mining algorithms should incorporate design, development, implementation, and operational factors, in addition to developing mathematical models for cost-benefit analysis. Foremost, by utilizing data mining, success cases should be brought to the forefront with rigorous analysis, so that they can be easily replicated elsewhere with minimum adjustments.

REFERENCES

Abbass, H. A., Sarker, R. A., & Newton, C. S. (Eds.) (2002). Data mining: A heuristic approach. Hershey, PA: IGI Global.

Adamo, Jean-Marc (2001). Data mining for association rules and sequential patterns: Sequential and parallel algorithms. Springer Verlag.

Agrawal, R. & Srikant, R. (1994). Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Data Bases (pp. 487-499), Santiago, Chile.

Agrawal, R., Imielinski, T. & Swami, A. (1993). Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD Special Interest Group on Management of Data (pp. 207-216), Washington, DC.

Alon et al. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. In Proceedings of the National Academy of Sciences.

Aspnes, J. (2002). Randomized protocols for asynchronous consensus, Ref.

Bellaachia, A., Portnoy, D., Chen, Y. & Elkahloun, A. G. (2002). E-CAST: A data mining algorithm for gene expression data. In Proceedings of the BIOKDD02: Workshop on Data Mining in Bioinformatics (with SIGKDD02 Conference), Edmonton, Alberta, Canada.

Ben-Dor, A., Shamir, R. & Yakhini, Z. (1999). Clustering gene expression patterns. Journal of Computational Biology, 6(3/4), 281–297.

Ben-Or, M. (1983). Another advantage of free choice: Completely asynchronous agreement protocols (extended abstract). In Proceedings of the Second Annual ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing (pp. 27–30), Montreal, Quebec, Canada.

Berthold, M. & Hand, D. J. (1999). Intelligent data analysis: An introduction. Springer Verlag.

Blackman, R. B. & Tukey, J. W. (1958). The measurement of power spectra. New York: Dover Publications, Inc.

Boulicaut, Jean-Francois, Esposito, F., Giannotti, F. & Pedreschi, D. (Eds.) (2004). Knowledge discovery in databases. In Proceedings of the PKDD 2004: 8th European Conference on Principles and Practice of Knowledge Discovery in Databases, Pisa, Italy.


Bramer, M. A. (Ed.) (1999). Knowledge discovery and data mining: Theory and practice. IEE Books.

Bracha, G. & Rachman, O. (1992). Randomized consensus in expected O(n² log n) operations. In S. Toueg, P. G. Spirakis & L. M. Kirousis (Eds.), Lecture notes in computer science (Vol. 579, pp. 143–150). Delphi, Greece: Springer. Retrieved April 12, 2008, from http://www.cs.yale.edu/homes/aspnes/randomized-consensus-survey.pdf

Chakrabarti, S. (2002). Mining the Web: Discovering knowledge from hypertext data. Morgan Kaufmann.

Cox, E. (2004). Fuzzy modeling and genetic algorithms for data mining and exploration. Morgan Kaufmann.

Curotto, C. L. & Ebecken, N. F. F. (2005). Implementing data mining algorithms in Microsoft® SQL Server™. WIT Press.

de Ville, Barry (2001). Microsoft data mining, Integrated business intelligence for e-commerce and knowledge management.

de Ville, Barry (2006). Decision trees for business intelligence and data mining: Using SAS enterprise miner. SAS Press.

Diplaris, S., Tsoumakas, G., Mitkas, P. A., & Vlahavas, I. (2005). Protein classification with multiple algorithms. In Proceedings of 10th Panhellenic Conference in Informatics. Volos, Greece: Springer-Verlag.

Drewry et al. (2002). Current state of data mining. Department of Computer Science, University of Virginia.

Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases (a survey). AI Magazine, 17(3), 37-54.

Freitas, A. A. (2002). Data mining and knowledge discovery with evolutionary algorithms. Springer-Verlag.

Giuffrida, G., Cooper, L. G., & Chu, W. W. (1998). A scalable bottom-up data mining algorithm for relational databases. In Proceedings of the Tenth International Conference on Scientific and Statistical Database Management (pp. 206-209).

Hale, J., Threet, J., & Shenoi, S. (1994). A practical formalism for imprecise inference control. Ifip Trans. A-Computer Science And Technology, 60, 139-156.

Han, J., Kamber, M. & Chiang, J. (1997). Metarule-guided mining of multi-dimensional association rules using data cubes. In Proceedings of international conference on knowledge discovering and data mining (KDD’97), pp. 207-210.

Hand, D. J., Mannila, H., & Smyth, P. (2001). Principles of data mining (Adaptive computation and machine learning). The MIT Press.

Hastie, T.J., & Tibshirani, R.J. (1990). Generalized additive models. New York: Chapman and Hall.

Kantardzic, M. (2002). Data mining: Concepts, models, methods, and algorithms. Wiley-IEEE Press.

Kargupta, H. & Chan, P. (Eds.) (2001). Advances in distributed and parallel knowledge discovery. MIT/AAAI Press.

Kloesgen, W. & Zytkow, J. (Eds.) (2002). Handbook of data mining and knowledge discovery. Oxford University Press.

Larose, D. T. (2004). Discovering knowledge in data: An introduction to data mining. Wiley-Interscience.

Lawlor, L. R. (1980). Structure and stability in natural and randomly constructed competitive communities. The American Naturalist, 116(3), 394-408.


Liu, H. & Motoda, H. (1998a). Feature selection for knowledge discovery and data mining. Kluwer.

Liu, H. & Motoda, H. (1998b). Feature extraction, construction and selection: A data mining perspective. Kluwer.

Maimon, O. & Last, M. (2000). Knowledge discovery and data mining - The Info-Fuzzy Network (IFN) Methodology. Kluwer Publishers, Massive Computing.

Mattison, R. M. (1997). Data warehousing and data mining for telecommunications. Artech House.

Miller, H. & Han, J. (Eds.) (2001). Geographic data mining and knowledge discovery. Research monographs in geographic information systems. Taylor and Francis.

Mirkin, B. (2005). Clustering for data mining: A data recovery approach. CRC Press.

Myatt, G. J. (2006). Making sense of data: A practical guide to exploratory data analysis and data mining. John Wiley.

Nanopoulos, A., Katsaros, D., & Manolopoulos, Y. (2003). A data mining algorithm for generalized Web prefetching. IEEE Transactions on Knowledge and Data Engineering, 15(5), 1155-1169.

Pal, S. K. & Mitra, P. (2004). Pattern recognition algorithms for data mining. Chapman & Hall/CRC.

Pawlak, Z. (1982). Rough sets. Journal of Computer and Information Science, 11(5), 341-356.

Pei, J., Han, J., & Lakshmanan, L. V. S. (2001). Mining frequent itemsets with convertible constraints. Paper presented at the Proceedings of the 17th International Conference on Data Engineering (pp. 433–332), Heidelberg, Germany.

Perner, P. & Petrou, M. (Eds.). Machine learning and data mining in pattern recognition. Springer Verlag.

Pyle, D. (1999). Data preparation for data mining. Morgan Kaufmann.

Rahman, H. (2004). Information dynamics in developing countries. In Proceedings of the 5th International Conference on IT in Regional Areas, Caloundra, Queensland, Australia.

Rahman, H. (2006). Role of ICTs in socio-economic development and poverty reduction. In H. Rahman (Ed.), Information and communication technologies for economic and regional developments

Stumme, G., Wille, R. & Wille, U. (1998). Conceptual knowledge discovery in databases using formal concept analysis methods. Berlin-Heidelberg, Germany: Springer Verlag.

Thuraisingham, B. (1999). Data mining: Technologies, techniques, tools, and trends. CRC Press.

Utimaco (2005). Data encryption: The foundation of enterprise security. Foxboro, MA: Utimaco Safeware, Inc.

Wang, J. (Ed.) (2003). Data mining opportunities and challenges. IRM Press.

Weiss, S. M. & Indurkhya, N. (1997). Predictive data mining: A practical guide. Morgan Kaufmann.

Witten, I. & Frank, E. (2005). Data mining: Practical machine learning tools and techniques (2nd ed.). Morgan Kaufmann.

Yoon, J. P. & Kerschberg, L. (1993). A framework for knowledge discovery and evolution in databases. IEEE Trans. On Knowledge And Data Engineering, 5(6), 973-979.

Zhou, C., Li, Z., Meng, Y. & Meng, Q. (2004). A data mining algorithm based on rough set theory. In Proceedings of International Conference on Information Acquisition 2004 (pp. 413-416).

ENDNOTES

1 http://www.microsoft.com/technet/prodtechnol/sql/2000/maintain/dmperf.mspx accessed on March 24, 2007
2 http://www.statsoft.com/textbook/glosa.html#Algorithm accessed on March 23, 2007
3 http://www.statsoft.com/textbook/glosa.html#Data_mining accessed on May 29, 2007
4 http://www.bandmservices.com/Clustering/Clustering.htm accessed on March 25, 2007
5 http://www.statsoft.com/textbook/glosa.html accessed on March 23, 2007
6 http://www.statsoft.com/textbook/glosa.html accessed on March 23, 2007
7 http://www.statsoft.com/textbook/sttimser.html accessed on March 28, 2007
8 http://www.statsoft.com/textbook/glosh.html#Heuristic accessed on March 28, 2007

Section III
Applications of Data Mining

Chapter IX
Prospects and Scopes of Data
Mining Applications in Society
Development Activities
Hakikur Rahman
Sustainable Development Networking Foundation (SDNF), Bangladesh

ABSTRACT

Society development activities are continuous processes intended to uplift the livelihood of communities and thereby empower their members. In the course of socialization, such activities have become an intrinsic phenomenon of society, and day by day they are adopting increasingly innovative scientific techniques. Innovations and technologies, especially information and communication technologies (ICTs), have provided development actors with dynamically improved tools and techniques to design, develop, and implement diversified activities globally. Rapidly developing ICTs have given development initiators a tremendous boost to undertake many indigenous initiatives that are geographically dispersed yet can easily be monitored. However, many development projects lack proper management, thorough analysis, and appropriate needs assessment, and consequently cannot be sustained. In most cases the development partners (initiators, designers, implementers, or donors) blame each other. Subsequently, in many countries, most innovative success cases never become sustainable because of improper reporting, monitoring, and feedback, and as a consequence projects fail. This chapter tries to establish methodologies for successful development initiatives by synergizing a few success cases. Furthermore, with recently available means such as data mining, projects and activities in every corner of the globe can be easily recorded, adequately analyzed, monitored, and reported, so that they can be replicated successfully in other countries where the necessary favorable conditions exist. The chapter also highlights a few areas of development and hints at applications of data mining tools that would make decision-making easier. Along this perspective, it puts forward a few potential areas of society development initiatives where data mining applications can be engaged. The focus areas range from basic education, health care, general commodities, tourism, and ecosystem management to a few advanced uses, including database tomography. Finally, the chapter presents some future challenges and recommendations on using data mining applications to empower the knowledge society.

Copyright © 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.

INTRODUCTION

Data mining is an interdisciplinary field of study driven by various multidimensional applications. On the one hand, it involves techniques from machine learning, pattern recognition, statistics, algorithms, databases, linguistics, and visualization; on the other hand, it is applied to understand human behavior, such as that of the end user of an enterprise (Ebecken, Brebbia & Weigend, 2000; Han & Kamber, 2000; ICDM, 2003). It also helps entrepreneurs understand the nature of the transactions involved, including those needed to evaluate any risk factor or detect fraud.

Apart from the intricate technology context, data mining methods deserve special attention when applied in the development context. Lack of data has been found to inhibit the ability of organizations to fully assist clients, and lack of knowledge has made governments vulnerable to the influence of outsiders who did have access to data from countries overseas. Furthermore, disparities in data collection call for coordinated data archiving and data sharing, which are extremely crucial for promoting, launching, and sustaining development projects, especially in developing countries (Berry & Linoff, 2000; Codata, 2002; COL, 2003).

At the same time, the technique of data mining enables governments and private organizations to carry out mass surveillance and personalized profiling, in most cases without any controls or right of access to examine this data. However, when data mining applications are used in development contexts, the main focus should be on sustainable use of resources and the associated systems under the specific context (producing ecological, limnological, climatic, social, and economic benefits) of developing countries. Research activities should also focus on sustainable management of vulnerable resources and apply integrated management techniques, with a view to supporting optimization and sustainable use of existing resources.

In addition, the scientific issues and aspects of archiving scientific and technology data include the discipline-specific needs and practices of scientific communities as well as interdisciplinary values and methods. Data archiving is primarily a program of practices and procedures that support the collection, long-term preservation, and low-cost access to, and dissemination of, scientific and technology data. The tasks of data archiving include digitizing data, gathering digitized data into archive collections, describing the collected data to support long-term preservation, decreasing the risks of losing data, and providing easy ways to make the data accessible. Data archiving and the associated data centers need to be part of the day-to-day practice of science. This is particularly important now that much new data is collected
and generated digitally, and regularly (Codata, 2002; Dunham, 2003; Quéau, 2001).

So far, data mining has existed in the form of discrete technologies. Recently, its integration into many other forms of information and communication technologies (ICTs) has become attractive as various organizations possessing huge databases began to realize the potential of the information hidden there (Fayyad, Piatetsky-Shapiro, Smyth & Uthurusamy, 1996; Hernández, Göhring & Hopmann, 2004). The Internet can be a tremendous tool for the collection and exchange of information, best practices, and vast quantities of data. But it is also becoming increasingly congested, and its popular use raises issues about authentication and evaluation of information and data. The growing number and volume of data sources, together with the high-speed connectivity of the Internet and the increasing number and complexity of data sources, are making interoperability and data integration an important research and industry focus. Incompatibilities between data formats, software systems, methodologies, and analytical models are barriers to the easy flow and creation of data, information, and knowledge (Carty, 2002). All these demand not only a technology revolution, but also a tremendous uplift of human capacity as a whole. Therefore, the challenge of human development, taking into account the social and economic background while protecting the environment, confronts decision makers such as governments, local communities, and development organizations. How can new technology for information and communication be applied to fulfill this task (Hernández, Göhring & Hopmann, 2004; Han & Kamber, 2006)?

This chapter focuses on areas and scopes of data mining application in general and on a few decision support techniques for achieving sustainable outcomes for society. It does not go into detail about theoretical issues of data mining, nor does it provide in-depth analysis of data mining techniques; rather, it gives in its introduction and background a brief overview of data mining that is necessary to justify data mining aspects and features, and it focuses deeply on data mining’s contemporary and prospective application areas in society development processes. The author mainly depends on popular data mining books and accessible literature on the WWW, as it has been observed that real-world applications are mostly documented on the numerous Websites of many projects around the globe, including their success cases. Perhaps there is a need for a collective publication of success stories in this respect, so that readers can gather knowledge from a single source of information. However, as far as this chapter’s title is concerned, an extensive literature review has been carried out in this context to enrich the subject matter and theme of the chapter.

Furthermore, the chapter looks into authenticated global approaches and shows the capabilities of data mining as an effective instrument on the basis of its application in real projects in developing countries. The applications could be in the development of algorithms, computer security, open and distance learning, online analytical processing, scientific modeling, simple data warehousing, or interactive collocations. However, this chapter emphasizes effective scopes and prospects of data mining application to improve social instruments, such as community development, environmental improvement, and life-long learning systems; improvement of small and medium entrepreneurships; balancing of ecological patterns; biodiversity equilibrium; spatial database management; disaster management; and some more advanced uses. This chapter also puts forward a few success cases across the globe for its readers and for researchers in this field. Along with these arguments, the author has tried to adopt and derive a hierarchical knowledge pyramid that may evolve from data/content, and finally the chapter recommends a few future research issues with their challenges before concluding.

BACKGROUND

Data mining and data warehousing techniques are becoming indispensable parts of almost all corporate intelligence programs (Berry & Linoff, 1997; Intransa, 2005). Data mining has been loosely defined as the process of extracting information from large amounts of data, and it is becoming a pervasive technology in activities as diverse as using historical data to predict the success of an awareness-raising campaign or a promotional operation, looking for patterns of sequences to act as a monitoring tool, or analyzing genome chains.

Data mining or data discovery is the process of autonomously extracting useful information or knowledge from large data stores or data sets. It can be performed on a variety of data stores, including the World Wide Web, relational databases, transactional databases, internal legacy systems, pdf documents, and data warehouses. Furthermore, data mining is the ability to query very large databases in order to satisfy a hypothesis (top-down data mining), or to interrogate a database in order to generate new hypotheses based on rigorous statistical correlations (bottom-up data mining) (Hand, Mannila & Smyth, 2000; Rud, 2001; Tan, Steinbach & Kumar, 2005; Thearling, 1995).

Data mining is sometimes regarded in a derogatory light, as it involves sifting accurate information out of a huge volume of data, and the extracted decision rules may favor one party or disfavor another without considering any cause-and-effect relationship. It can seem like a technique of betting, or of letting a few monkeys jump on a keyboard in the hope that at some point in the future a sonnet may evolve. But with modern techniques and tools, data mining is no longer so gloomy, and it is becoming an important component of knowledge science through the accumulation of accurate data and the intelligent decisions taken from them.

Furthermore, in many countries, governments’ restrictions on persistent data mining and the protection of privacy in e-transactions without encroaching on the principle of free access pose an important challenge to policy makers. The challenge deepens when legitimate users’ rights to access information must be ensured along with legal rights to privacy. Moreover, policies need to be adopted to ensure protection of sensitive information and law enforcement on the networks. In addition, there are complications in incorporating commercial, ethical, and social versions of data mining when providing technology solutions involving cryptography or digital signatures. Specifically, business enterprise data requirements grow at 50 to 100 percent a year, creating a constant storage infrastructure management challenge (Intransa, 2005). Therefore, formulating a generalized, cross-sectoral code of conduct to provide fairness, equality, justice, and morality in handling the collection, repackaging, modification, and sale of public/private data produces a new kind of challenge. Finally, a collaborative partnership among private, commercial, government, and civil society organizations always remains a far cry.1

However, despite all these, the utilization of data mining tools and techniques to preserve raw data and make them useful, and the utilization of acquired data in building knowledgeable decision support systems, have not been overshadowed. The promising side is that not only researchers and academicians, but also business entrepreneurs are becoming interested in data mining applications. As newer storage techniques are brought into the environment, they are typically added within inflexible single-server “silos.” Moreover, ensuring that data is available to all users usually means moving data from server to server to make it readily accessible. This is time-consuming and results in multiple pools of data that need to be managed, neither of which improves return on investment (ROI). In this context, the best practice to improve the ROI of computing and storage resources is to efficiently consolidate
and share these resources. An IP-SAN (Internet protocol storage area network) provides a highly cost-effective storage solution that leverages an enterprise’s existing IT expertise and resources. In addition, a shared file system eliminates the inflexible “single server data silos” and makes data readily available to those who need it (Intransa, 2005).

In recent days, the Internet has become an increasingly important tool for the collection and exchange of information, best practices, and vast quantities of data. But at the same time, it is also becoming increasingly congested, and its popular use raises issues about authentication and evaluation of information and data. The growing number and volume of data sources, together with the high-speed connectivity (low speed in the case of developing countries) of the Internet and the increasing number and complexity of data sources, are making interoperability and data integration an important research and industry focus. Incompatibilities among data formats, software systems, methodologies, and analytical models create barriers to the easy flow and creation of data, information, and knowledge (Carty, 2002).

From these perspectives, this chapter tries to focus on the best utilization of data mining applications for the benefit of society. In this context, the following section deals with the methodologies of data mining applications and their uses to empower knowledge societies.

MAIN THRUST

Methodologies

ICT tools such as radio, television, telephones, computers, and the Internet can provide access to knowledge in sectors like agriculture, microenterprise, education, and human rights by offering a new realm of choices that enable common people to improve their quality of life. Unfortunately, however, not everyone has so far enjoyed equal access to these technologies, and the resulting digital divide is found not only between industrialized and developing countries, but also within developing countries. Moreover, as the divide grows wider, it aggravates the existing divisions of power and inequities in access to resources, even between men and women, the literate and nonliterate, and urban and rural populations (CIDA, 2005; Witten & Frank, 1999; Witten & Frank, 2005).

To improve the situation, diversified strategies have been adopted in many countries. For the purposes of this strategy, knowledge for development has been integrated into development programs so that the beneficiaries can access, utilize, and disseminate information and knowledge. This is done with a view to promoting socioeconomic development through appropriate ICTs, coupled with the development of the required associated skills. Moreover, ICTs offer new ways of providing access to information and knowledge, and thereby create significant opportunities for learning; networking, social organization, and participation; and improving transparency and accountability. For example, grassroots work by nongovernmental organizations and civil society organizations has greatly benefited from media such as the Internet (CIDA, 2005). However, to keep this chapter focused, only a few methods of ICT for development are described in this section. The author argues that the following methodologies might be adopted at the national level to improve access to, utilization of, and dissemination of information for promoting the knowledge society.

Greater Role of Public Authorities in Access to Information

In many cases, industries and entrepreneurs (telcos, private entrepreneurs, and others) are providing infrastructure support in addition to the government initiatives for access to information resources, as well as contents. In these cases, there is a need to define the concept of public domain
and universal access by promoting common public welfare in a global context, and at the same time to encourage private initiatives by protecting information rights and economic interests. In each case, the balance of information rights leading to intellectual property rights, ethical integrity, cultural diversity, localized discrimination, and return on investment (ROI) has to be taken into consideration during program development.

Broader and More Efficient Provision of Public Contents

Though much of global knowledge is not related to intellectual property rights, efforts should be made to avoid under-provision of this knowledge. A clear understanding should be formulated of the distinction between national and global public goods. If necessary, appropriate ramifications should be adopted on the concept of public domain regarding classical and anonymous works and information produced with public funds. These should be categorized as copy-left information and should be made available free of cost as open content with provisions of open source. Necessary policies need to be adopted comprising economic, political, ethical, social, and educational boundaries, including their sustainability (operational, economical; and/or technical, nontechnical).

Facilitating Improved Access to Networks and Services

In many countries, the most important economic obstacles to accessing information are still telecommunication tariffs, Internet access fees, licensing fees, taxes, duties, and many other factors. These issues need to be resolved as soon as possible. In this respect, there should be a balance between public administration costs (compromise, incentives, or even subsidy for a short period) and regulations (not being regulatory, but being the facilitator), and between commercial interests and national interest, keeping civil and moral obligations intact and promoting equitable access. A competent financial mechanism should be put into place to ensure universal access to information by providing cross subsidies, preferential taxation, and other types of incentives. Telecommunication regulatory and tariff policies should be lenient toward mechanisms that provide Internet access to general communities. Perhaps some tangible and intangible products should be recognized and excluded from tariff enclosure.

Furthermore, preferential rates should be introduced for educational, research, and cultural organizations. All public service institutions should have their own public access center from which the general public can access relevant information (forms, rates, procedures, rules, and others) free of cost. These costs can be recovered during the submission of the forms, rather than during purchase of the forms. Public institutions with regional and local level branches may open similar outlets in phases with the same service provision to the common public. In addition, public institutions can form common networks of infrastructure comprising existing civil society networks and private/commercial networks by forming consortia, community outlets, freenet centers, public access centers, and so forth. Finally, the role of media should be put into the picture; a vibrant media is an essential ingredient for a democratic, responsive, and transparent civil society.2

Technology Standards and Privacy Issues

These two elements are extremely essential in establishing content repositories. As far as data mining is concerned, competing private firms have little interest in preserving the open standards that are really essential to a fully functioning interactive network, or in the formation of open content that would eventually become public goods. Markets may encourage innovation, but they do not necessarily ensure the public interest. In this case, gov-
ernments could decide to encourage and support reference service” industry. They give less prefer-
the developments of public domain content (data, ence to questions like, should personal information
information and software) and freewares (LINUX, copyright belong to the persons concerned or to
Apache, etc.). This goal is becoming absolutely the data miners processing electronic transac-
vital, considering the importance of equipping tions or what level of anonymity and privacy
the schools of the world with basic computer protection is desirable? These are essentially
facilities (OLPC, one laptop per child). Similarly, philosophical and political issues. (Chakrabarti,
privacy issues are also of strategic importance. 2002; MITRE, 2001)
The protection of privacy has become one of the
most important human rights issues of this new Development of Consolidated National
millennium. Many search engines or agencies are E-Strategies and E-Policies
accumulating/ monitoring/ mining their subscrib-
ers’ information with/without the consent of the Developing countries are increasingly in quest of
subscriber. This could be valuable in the sense of designing and implementing national strategies
knowledge gathering and eventually be used in to manage the development of appropriate ICT
intelligent decision making processes, however, regulatory, legislative and policy frameworks
to do this the level of anonymity and privacy pro- (UNCTAD, 2004). In this context, appropriate
tection need to be clearly defined. Furthermore, decision making bodies have to be formulated at
any ethical or political issues are also need to be national level with concrete plan of actions and
covered before taking similar decisions. agendas.

An on-going pact, named Ukusa (binding the UNCTAD is becoming a partner of the Global
United States, Great Britain, Canada, Australia, ePolicy Resource Network (ePol-NET) by pro-
New Zealand) uses the ECHELON network that is viding its expertise in the design of e-strategies,
supervised by the US National Security Agency in e-commerce, legal and regulatory issues, e-mea-
order to monitor and process more than 3 billion surement, e-finance and e-government to enhance
phone calls, faxes, e-mails per day throughout the efficiency and effectiveness of the governance
world. Similarly, on a mere click on a hypertext system. ePol-NET functions as a virtual network,
link, the most casual consultation of a site on and partners of this network include the govern-
the World Wide Web generates cookies that feed ment of Ireland, which is providing the secretariat
uncontrollable databases. The technique of data for the partnership, as well as the governments
mining (exploitation of data) enables govern- of Canada, France, Italy, Japan and the United
ments, other agencies and private organizations Kingdom; ECA; ITU; UNDP; OECD and the
to carry out mass surveillance and personalized Commonwealth Telecommunications Organiza-
profiling, and in most cases without any controls tion. (UNCTAD, 2004)
or right of access to examine this data. They vary
from medical care to transport systems, financial Congenial Atmosphere for E-Business
transfers to commercial and banking transactions, and E-Finance
and thereby enormous quantities of information
are accumulated every day. Moreover, commercial Flourishing of small and medium enterprises
interests want to exploit powerful data-mining re- (SMEs) is another precondition to empower a
sources for marketing research or for information community. They stay between micro- and mac-
reselling to data brokers and to the “individual roeconomics and act as the boosting power for

Prospects and Scopes of Data Mining Applications in Society Development Activities

a nation by raising the economic capacity of the community. ICT, as usual, is there to support, develop and protect their interests. However, in many cases, the lack of adequate information on SMEs and their payment performance at the disposal of financial service providers works against SME financing. Adequate information here means that national policy makers should have sufficient information about the formation of SMEs, their operations, their outcomes and their effective support for the development of an economically sustainable society.

A partnership has been designed to explore the opportunities arising from innovative Internet-based electronic finance methods and their data mining capacities, and to find ways of improving the access of small and medium entrepreneurs (SMEs) to trade-related finance and, especially, e-finance. Leading partners come from international and local financial service providers, enterprise associations, governments and other public entities, international organizations including the World Bank, WTO and ITC, as well as NGOs such as the World Trade Point Federation (UNCTAD, 2004).

E-Measurement and ICT Indicators

Finally, it is essential to measure the achievement of ICT initiatives in each country while taking the local environment into account. Until now, not much progress has been made in developing the measuring tools needed to quantify progress in ICTD initiatives. Nor are there any accepted indicators to quantify the impact of ICT4D implementations at the grass roots. Therefore, e-measurement is essential for assessing the state of advancement in the use and impact of ICTs, especially in developing countries.

The WSIS Plan of Action (Geneva and Tunis phases of the Summit) calls for the development of indicators to monitor progress in the use of ICTs for development. The principal stakeholders have agreed to identify a set of core indicators that could be collected by all countries and harmonized at the international level so as to facilitate the measurement of the achievement of international development goals, including those contained in the Millennium Declaration; to assist developing countries in building capacity to monitor ICT developments at the national level; and to develop a global database on ICT indicators. Partners in these activities include the WSIS member states, OECD, ITU, UNESCO and the UN ICT task force, as well as the UN regional commissions and other relevant regional bodies working on e-measurement issues (UNCTAD, 2004).

One may argue that the inclusion of these methodologies is redundant here, but the author feels that without the necessary preconditions for ICT to flourish at the grass roots, a nation, a society or an enterprise cannot proceed further to reap its ultimate benefits. Only after appropriate preconditions for enlightened ICT prevail can the necessary tools, such as data mining, be applied coherently at various stages to retrieve and store information and content for making intelligent decisions. Furthermore, data mining is a newly evolved process in the ICT sector and needs to be keenly nurtured with proper monitoring utilities. To support these arguments, a few uses of ICT that may be improved further by using data mining are put forward in the next subsection. However, before proceeding to the next subsection on data mining applications, the author would like to draw attention to the transformation of data into knowledge. This includes entrepreneurs' data mining (see Table 1). In addition to this data backup strategy, Tables 2 and 3 are derived from various research works that indicate how stored data can be converted into knowledge. Finally, a knowledge hierarchy diagram is introduced in Figure 1, which shows how data/content ultimately become wisdom.
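
The step from stored data to information and then to knowledge can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions: the payment records, the SME identifiers and the 0.6 threshold are hypothetical and do not come from the cited sources.

```python
from collections import defaultdict

# Hypothetical raw data: (sme_id, invoice_paid_on_time) payment records.
records = [
    ("sme_a", True), ("sme_a", True), ("sme_a", False),
    ("sme_b", False), ("sme_b", False), ("sme_b", True),
    ("sme_c", True), ("sme_c", True),
]

def summarize(records):
    """Data -> information: on-time payment rate per SME."""
    counts = defaultdict(lambda: [0, 0])  # sme_id -> [on_time, total]
    for sme_id, on_time in records:
        counts[sme_id][0] += int(on_time)
        counts[sme_id][1] += 1
    return {sme: on_time / total for sme, (on_time, total) in counts.items()}

def classify(rates, threshold=0.6):
    """Information -> knowledge: apply an assumed decision rule."""
    return {sme: ("eligible" if rate >= threshold else "review")
            for sme, rate in rates.items()}

rates = summarize(records)
decisions = classify(rates)
print(rates)
print(decisions)
```

Here the raw records play the role of data, the per-SME rates are information, and the rule applied to them stands in for knowledge that a financial service provider might act on.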


Table 1. Data backup strategy and low cost operation in SMEs

Small and medium sized businesses' data protection pain points | How a Software Can Relieve the Downtime Pain
Limited IT resources for data backup and recovery | Storage server can provide continuous, automatic backup and incredibly easy recovery
All critical data on one server | Storage server and download servers can be restored in a flash. They are capable to work as very low-cost servers and thus making backup servers affordable to companies of all sizes
Regulatory pressures and complex processes | Storage server employees can find individual backup files quickly, without tying up expensive IT resources
Cash flow disruptions are very damaging | Storage server can provide failover capabilities to a backup server, so that the business can keep running
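
The failover row of Table 1 can be illustrated with a short Python sketch. It assumes two in-memory stand-ins for a primary and a backup storage server; the class, function and file names are hypothetical and chosen only for illustration, and a real deployment would talk to networked storage instead.

```python
class ServerDown(Exception):
    """Raised when a (simulated) storage server is unavailable."""


class StorageServer:
    """Hypothetical in-memory stand-in for a storage server."""

    def __init__(self, name, files, online=True):
        self.name = name
        self.files = dict(files)
        self.online = online

    def read(self, path):
        if not self.online:
            raise ServerDown(self.name)
        return self.files[path]


def read_with_failover(path, servers):
    """Try each server in order; return (server_name, content) from the first that answers."""
    for server in servers:
        try:
            return server.name, server.read(path)
        except ServerDown:
            continue  # fall through to the next (backup) server
    raise ServerDown("all servers unavailable")


data = {"ledger.csv": "2006-04-01,invoice,120.00"}
primary = StorageServer("primary", data, online=False)  # simulate an outage
backup = StorageServer("backup", data)                  # replica of the same files

source, content = read_with_failover("ledger.csv", [primary, backup])
print(source, content)
```

Because the read is served by the backup when the primary is down, the business process that needs the file keeps running, which is the point the table makes.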

Table 2. Physical outcomes of data towards creation of knowledge and their management issues

Physical outcomes (bottom-up preference) | Management issues
Knowledge portals | Regular update, security, sustained support
Innovative techniques on faster data search, mass storage and accurate data analysis | QoS with economic value and proper monitoring
Optimum resource scheduling with dynamic adaptability | Interfaces, Open source techniques
Extended connectivity with possibility to connect home users | Interconnectivity, interoperability

Table 3. Applications of data towards creation of knowledge portals and their management issues

Outcome ⇒ Knowledge portals

Applications | Management issues
Knowledge management and visualization systems | QoS, workflow techniques
Search engines | Service interaction models
Data mining | Collaborative service composition
Fail safe recovery | Visualization
Authorization | Grid-aware simulation
Encryption | Access Interfaces and technologies

In this way, it has been observed that data/content can enlighten human skill to transform them into a knowledge society. According to Ackoff (1989), the content of the human mind can be classified into five categories:

1. Data: Facts or figures
2. Information: Useful data; answers to "who," "what," "where," and "when"
3. Knowledge: Application of information; answers "how"


Figure 1. Knowledge hierarchy (Adapted from Ackoff, 1989)
[Figure labels: Data/Content, Information, Knowledge, Wisdom; Emancipation, Understanding, Management, Policy]

4. Understanding: Appreciation of "why"
5. Wisdom: Evaluated understanding (Markus, 2005).

Data Mining Applications in Social Development Implications

The use of data mining techniques ranges from diversified applications in learning systems, knowledge discovery, web intelligence, entrepreneurship management, data visualization, pattern recognition, statistical analysis, production control and machine learning to collaborative filtering, bioinformatics, ecosystem management, spatial data mining and the knowledge economy (Berry & Linoff, 1999; Berry & Linoff, 2002; Berry & Linoff, 2004; Berson & Smith, 1997; Berson, Smith, & Thearling, 1999; Bozdogan, 2004; Braha, 2001; Bramer, 1999; Cerrito, 2006; Cios, 2000; Cios, Pedrycz, & Swiniarski, 1998; Delmater & Hancock, 2001; Fayyad, Grinstein, & Wierse, 2001; Ville, 2001). However, for the sake of illustrating data mining applications for social development purposes, its use in financial market study, earth science, ecosystem management, health networks, tourism industries, general commodities, small and medium sized enterprises, e-learning, decision support systems, knowledge centers and a few advanced uses is described in this portion. This subsection discusses a few prospective areas where data mining researchers may continue their comprehensive studies to develop justified and proper applications that would assist in establishing sound social development systems. In this context, several case studies are introduced to give readers and researchers a glimpse of those scopes.

Education

Transforming computing knowledge into education is necessary for empowering the knowledge society. Community members should develop computing competencies so that they can help to advance the social progress of the nation by raising their knowledge level. Transformation is needed at the grass roots, among computing professionals and in higher education institutes (HEIs). Many of the disciplines at those outlets are increasingly dependent on computing to manage the data and information necessary to support decision-making. Furthermore, there are advances in computing services that come from R&D in computing-related science, technology, and engineering. Simultaneously, advances are notable in other disciplines with the desire to solve grand challenges (using simulations, supercomputers, data mining, or virtual environments), and to enhance the quality of life for individuals and social groups. Therefore, transforming education in computing-related disciplines with technical innovation is needed to improve the future of a country. In addition, extending knowledge of key computing concepts across the range of core curriculum areas inherent in undergraduate education will prepare a nation to leap into the knowledge society.3,4

ICT is already an imperative factor at all levels of education, though secondary education is at a decisive stage in many developing countries. It is, however, a proven fact that learning and studying at this stage has a potential impact on the new members of the community-of-knowledge society. In implementing educational policies that promote sustainable ICT infrastructures for secondary schools, the first step is the penetration of Internet-connected computers in the schools. The second step is even more crucial: the continuous evolution of didactic methods so that young learners will actually learn to learn within the WWW-based infrastructures. Thirdly, the application of data mining will turn these institutions into learning organizations, and eventually they will become catalysts in the innovative processes of the education system (Kommers, Kinelev & Kotsik, 2003).

Distance Education

Simultaneously, info-miners in the distance education community can use one or more infomining tools to offer high quality information retrieval and search services for open and distance learning. ICT-based infomining services could include producing digital libraries of open content, such as e-books, journals, reports and databases, on DVD and similar high-density information storage media. These could be in online or offline formats, should be accessible from a PC or laptop (perhaps from a One Laptop per Child, OLPC, machine), and can store considerably more information per unit than a CD-ROM. This form of learning is commonly known as e-learning, and relates directly to ICT for development and the formation of the knowledge economy (COL, 2003).

Knowledge Center

It is considered that the interactive relationship between knowledge, attitude and behavior is the basic parameter of social bonding in a community and needs to be specifically investigated when considering issues of sustainable development focused on improving the quality of life of villagers. A group of villages sharing the same geography shares almost similar problems, such as problems related to agriculture (crops, harvesting, food security, supply chain), environment (natural disaster, calamity, drought, rain), health (nutrition, disease, medicine, treatment, physician, hospital), education (school, college, university, student, teacher, support center, tutorial center, learning center) and public services (government and nongovernment agencies along with other development partners). Most of them depend on traditional forms of support and the existing resource infrastructure (land, skill, knowledge, capital, technology, etc.). A baseline field survey of such conditions, followed by a thorough analysis of the relations specific to achieving sustainable development, can generate policy implications and help develop and implement policies towards improving the quality of life of villagers. Multiple data collection techniques implemented simultaneously (documentary techniques such as historical data and observation, and survey techniques such as questionnaires and interviews) can be
used for knowledge and information accumulation through village centers, knowledge centers, kiosks and information centers.

Decision Support System

To handle huge amounts of data for fast and easy decision making, it is not recommended to work with the same structures as for online transaction processing (OLTP), which are traditionally supported by operational databases. In OLTP, tasks are structured as isolated transactions, and transactional databases are designed to reflect the operational semantics of known applications and, in particular, to minimize concurrency conflicts. On the other hand, the summarization and consolidation of data in data warehouses is targeted at decision support. High workloads arise from mostly ad hoc, complex queries over a huge number of records and a large number of operations. This form of query processing, where query performance and response times are more important than transaction throughput, is known as online analytical processing (OLAP). In this context, data mining can perform an automated search for hidden patterns in typically large and multidimensional databases. The results of data mining techniques are abstract behavior models that can be used to explain and predict consequences, for example, to support risk management and the mitigation of natural hazards. So far, data mining and geographical information systems (GIS) have existed as two separate technologies. Recently, their integration has become attractive as various organizations possessing huge databases have begun to realize the potential of the information hidden there (UNDP, 2001).

Once a big data warehouse has been filled, uncovering the concealed causal relationships will help to answer many important questions. For example, in some places in Nicaragua the concept of backyards was introduced in order to provide people with fresh vegetables. This concept failed in many places, while in many others it succeeded. The question then arises: are there any hidden reasons for this paradoxical result? Mining the data in terms of socioeconomic aspects may provide an explanation (UNDP, 2001).

By applying mathematical modeling or data mining to carefully collected data, a system can support intelligent decision making. However, the data characteristics, their physical dimensions, and their chemical, biological, social, or financial implications must be taken into consideration.

Data compilation and acquisition means measurements, experiments, and communication. These require modern technical equipment, a clear understanding of nature, appropriate financial sources and, last but not least, an excellent understanding of human beings, psychological and social sensitivity, experience and devotion. In addition to simple data collection, the basis of mathematical modeling has both its different targets and its various constraints; that is, it implies a number of optimization problems. This thorough process of optimization underlying the data collection is known as experimental design. Once this design has been made and the data obtained, the data have to be well interpreted. Finally, the system should dynamically incorporate data analysis into the whole entity of mathematical modeling and problem solution. This dynamics is termed learning (Gökmen, 2004).

General Commodity

Currently, there is no comprehensive and systematic consultative framework that enables the sharing of information and the use of complementary expertise among representatives of all key actors involved in the review of the general commodity situation and the operation of commodity markets. Most importantly, these phenomena are totally
absent in many developing countries. Moreover, the efforts of interested stakeholders should be put together and directed towards a pragmatic approach aimed at bringing both focus and priority to breaking the cycle of poverty in which many commodity producers and commodity-dependent countries are apparently locked in.

Such a consultative process needs to address commodity informatics in a concerted manner by proposing specific action with respect to the following issues: facilitating collaboration among stakeholders and accomplishing greater coherence in the integration of commodity issues in development portfolios; collecting and sharing best practices and lessons learned; maximizing the mobilization of resource flows; commodity sector vulnerability and risks; mechanisms to facilitate the participation of developing country farmers in international markets; distribution of value-added services in the commodity value chain; promotion of economically, socially and environmentally sustainable approaches in production and trade of individual commodities of interest; promoting business networks within developing countries and between developing and developed country enterprises; and establishing commodity information and knowledge management networks (UNCTAD, 2004). Using data mining techniques, many of these issues can be resolved through simple data algorithms, and cost-effective solutions can easily be devised.

WinIDAMS (1.2a), issued in April 2006, features interactive data import/export and a wide range of data analysis techniques such as table building, regression analysis, one-way analysis of variance, discriminant analysis, cluster analysis, principal components factor analysis and analysis of correspondences, partial order scoring, rank ordering of alternatives, segmentation and iterative typology,5 and can be used in establishing a marketing and monitoring system for general commodities.

Small and Medium Sized Entrepreneurs

Small and medium sized entrepreneurs (SMEs) often face a conundrum in archiving data/information. Although tape backup systems are inexpensive and fairly reliable, they offer poor recovery point objective (RPO)6 and recovery time objective (RTO)7 for critical applications, and they are usually ineffective for remote data backup. Hardware mirroring technology (which uses remote copy technology to provide synchronous mirroring between two sites) offers excellent RPO and RTO, but it is highly expensive for a small or midsize business to buy and manage. Moreover, it is less than ideal for backing up remote locations, which often have low-bandwidth connections.

New solutions based on asynchronous software-based replication can achieve acceptable RTO and RPO objectives for a small or midsize business's critical applications without adding the cost and complexity of the synchronous replication approach. Furthermore, in software-based replication, only the bytes that are actually changed by each write (not the entire block of information or the whole file) are replicated. Therefore, in comparison to synchronous replication solutions, the asynchronous approach offers lower load on the production servers, faster updates, and the ability to send replication updates across low-bandwidth Internet networks (Intransa, 2005).

E-Commerce

Using information and data grids, major elements of e-commerce applications can easily be developed. These involve electronic data interchange (EDI); moreover, by allowing more flexibility and negotiated exchange and by encouraging entrepreneurial activity, the e-marketplace is becoming an enormous repository of data and information. The use of WWW metadata technologies such as RDF (resource description framework) and XML (extensible markup
language) in technical, scientific and engineering applications is becoming paramount, and they are nowadays being used in e-commerce applications too.

It is a fact that the knowledge store of an entrepreneurship is commonly available in gray literature. However, recent research that integrates metadata into the scientific process encourages researchers to assist in the development of wealth creation processes, including technology transfer, and carries over directly into e-commerce. Furthermore, beyond metadata-based cataloguing of knowledge assets, recent data mining techniques make it possible to do text or multimedia data mining to extract further refined knowledge (Jeferry, 2000).

Healthcare Sector

The advantages of ICTs in the complex health-care sector are already well known and well stated. It is, nevertheless, paradoxical that although the medical community has embraced with satisfaction most of the technological findings that improve patient care, health-care informatics has not advanced that much.

An information model for knowledge management (KM) based upon the use of key performance indicators (KPIs) in health-care systems has been developed. Founded on the balanced scorecard (BSC) framework (Kaplan/Norton) and quality assurance techniques in health care (Donabedian), this research follows a patient-journey-centered approach that drives information flow at all levels of the day-to-day process of delivering effective and managed care, toward information assessment and knowledge discovery. Furthermore, in order to persuade health-care decision-makers to assess the added value of KM tools, its methodologies include new performance measurement and performance management techniques at all levels of a health-care system. Thus the KPIs form a complete set of metrics enabling the performance management of a regional health-care system. In addition, the performance framework is being technically applied using state-of-the-art KM tools, such as data warehouses and business intelligence information systems. Hence, in a technological sense, the infrastructure is becoming an important KM tool that enables knowledge sharing amongst various health-care stakeholders and between different health-care groups (Berler, Pavlopoulos & Koutsouris, 2005).

Moreover, health research, including health systems research, has nowadays become an essential component in achieving the health-related MDGs, especially the targets set for reducing maternal and child mortality. In this context, improving universal access to sustainable quality health care implies a coherent approach to the health system, including research into health information systems and the organization and management of functional and cost-effective health services that are socially equitable and financially sustainable. Data mining applications can enrich health-care informatics and improve the necessary support services by increasing the availability of reliable data at minimum effort.

Tourism

Tourism is, without reservation, the single largest economic sector in many countries and is treated as a significant element of economic activity in almost every country. For many developing countries, tourism is of strategic importance and has become a major source of foreign exchange earnings. However, this economic importance is not always effectively reflected in public policy. Good tourism policy must incorporate a dynamic approach to tourism development rather than a passive reaction to externally created prospects.
Thus, given tourism's central role in the economy of many countries (especially Caribbean and South East Asian countries), tourism authorities must study various aspects of the industry from both the demand and supply sides. The information collected from surveys and from desk and online research could be used to customize policies and strategies aimed at redressing any supply problems, enhancing the tourism product and providing a more competitive and attractive destination to the visitor. Research must, however, be based on timely, reliable and accurate information, which is essential for effective policy formulation and decision-making. Thus tourism enterprises have to depend on accurate tourism statistics, which help them to undertake research, prepare business plans and design promotion and marketing strategies and campaigns. Similarly, tourism policy-makers require critical information to develop proper forward planning of the sector. This involves effective anticipation and the necessary management change in order to maximize the economic and financial benefits of tourism. In this respect, the important concern is that critical information can only be obtained through high-quality research, which facilitates proper monitoring of tourism-related policies, evaluation of the effectiveness of specific tourism initiatives, benchmarking of the performance of a particular destination, comparative analyses, and observation of visitor trends (Andrew, 2005).

As tourism is an information-intensive service, the UNCTAD e-tourism initiative has been designed to provide developing countries with the technical means to promote, market and sell their tourism services online. Partners in e-tourism are the UNCTAD member States, the World Tourism Organization, UNESCO, national tourism authorities and many universities. Other potential partners include regional associations of developing countries, transport operators and IT companies (UNCTAD, 2004).

Anti-Drug Network

Accumulating information about drug manufacturers, their traffic routes, usage patterns and supply chains, and at the same time integrating related databases of users, resellers and vendors, can establish a useful drug network (or rather, antidrug network) for detecting illegal drug trafficking. Data mining makes this possible.

Anti Drug Network (ADNET): A middle-aged man in a light blue Mustang is on the way to enter the United States from Mexico through one of the numerous customs checkpoints along the southwest border. He is confident no one will suspect that he is transporting more than 10 pounds of heroin in secret compartments inside his vehicle; he has done it before and he plans to do it again and again.

But a customs system operator at a site near El Paso, Texas, uses the Anti-Drug Network (ADNET) system to access data on the driver and his car via his license plate. It's just a routine check and takes a few moments.

The agent quickly learns, through a system that accesses a large data warehouse of information on crossings, seizures, and motor vehicles, that the driver makes this trip on a regular basis, at a regular time, but this trip is different. She decides it is worth her time and trouble to continue the inspection. Ten minutes later, she finds more than a dozen small packages of white powder inside the vehicle. The drugs were seized and the driver was arrested.

Situations like this occur almost daily across the many ports of entry along the Mexican/U.S. border and other entry points into the United States. The same may happen in many other countries and continents. Sophisticated data-sharing systems developed by the ADNET community (i.e., Department of Defense, U.S. Coast Guard, Department

Prospects and Scopes of Data Mining Applications in Society Development Activities

of Justice, Department of State, Department of Treasury, Federal Communications Commission, and the intelligence community) give U.S. drug and law enforcement officials a cache of information needed to track the flow of illegal narcotics and other dangerous substances into the country. (MITRE, 2001)

Network Business Intelligence

Network business intelligence leverages the traffic management system to mine information about exactly what is flowing on the network, generating reports that combine application, user, and server information and technical views. Network operators can then determine the network's application and protocol mix, how applications are performing, what impact they are having on other traffic streams, and what the network requirements are for all traffic (Allot, 2005).

Deep packet inspection (DPI) identifies application types when the port number alone is not enough, by looking further inside the packet. This is particularly helpful for applications that use dynamic port numbers or share a common port, such as voice over IP (VoIP), hypertext transfer protocol (HTTP), Citrix-based remote-access applications, and the Microsoft NetMeeting conferencing application. HTTP consistently uses port 80, but at the same time many web applications and traffic types use HTTP, so a port number alone is not adequate for identifying specific HTTP applications. Combined with information about user, application, protocol, and machine behavior on the network, this makes it possible to configure the traffic management system to automatically classify and shape all traffic in a way that optimizes network usage and maximizes the return on investment (ROI) (Allot, 2005).

Environment

Information technologies are important not only for their growing use in decision-making processes and knowledge management systems, but also for the diversification of their use. Their use has yielded significant improvements in the efficiency of energy and materials use and contributed to economic expansion without the increases in environmental impacts leading to efficiency improvements. Advances in information technology are likely to continue to provide opportunities for the development of an improved environment and ambient livelihood. Data mining helps to accumulate historical data and change patterns of local environmental parameters in order to make intelligent decisions for protecting the environment from degradation, especially when the degradation is human made (Allenby, Compton & Richards, 2007).

Amongst the variety of datasets that are involved in the knowledge society, spatial (or map) information takes a major place in terms of content. These spatial information sets are essential for making sound decisions in local, regional, and central level planning, implementation of action plans, infrastructure development, disaster management support, and even business development. Natural resources management, flood mitigation, environmental restoration, land use assessments and disaster recovery are just a few examples of areas in which decision-makers are benefiting from mining spatial information.

With the accessibility of satellite-based remote sensing data and the association of spatial datasets around geographical information systems (GIS), united with the global positioning system (GPS), the processes of semantic spatial information systems are now a reality. The advent of GIS technology has transformed spatial data handling capabilities and in many cases made it essential to re-examine the roles of government with respect to the supply and availability of geographic information. Using GIS technology, users are now able to process maps, both individually and along with tabular data, and mix them together to provide a new perception, the spatial visualization of information. (Rao, 2002)
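
The port-versus-payload distinction described under Network Business Intelligence can be illustrated with a small sketch. The Python example below is only a minimal illustration and not the method of any particular traffic management product: the port table, the payload signatures, the host-to-application mapping and all function names are assumptions made for this example. It first tries to classify a packet by destination port, falls back to payload inspection when the port is unknown, and, for HTTP traffic, looks inside the payload to tell apart applications that share the same port.

```python
# Minimal sketch of port-based classification with a payload-inspection
# fallback. All tables and names below are illustrative assumptions.
from dataclasses import dataclass

# Well-known ports that map directly to a protocol.
PORT_TABLE = {25: "smtp", 53: "dns", 80: "http"}

# Hypothetical payload signatures (byte prefixes) used when the port is unknown.
PAYLOAD_SIGNATURES = [
    (b"GET ", "http"),
    (b"POST ", "http"),
    (b"SIP/2.0", "voip-sip"),
    (b"INVITE sip:", "voip-sip"),
]

# Hypothetical mapping from the HTTP Host header to a named web application.
HTTP_APPS = {
    "mail.example.com": "webmail",
    "video.example.com": "video-streaming",
}


@dataclass
class Packet:
    dst_port: int
    payload: bytes


def classify_by_port(packet):
    """Port-only classification: cheap, but blind to what runs on the port."""
    return PORT_TABLE.get(packet.dst_port, "unknown")


def http_application(payload):
    """Look inside an HTTP request for a Host header we recognise."""
    for line in payload.split(b"\r\n")[1:]:
        if line.lower().startswith(b"host:"):
            host = line.split(b":", 1)[1].strip().decode("ascii", "replace")
            return HTTP_APPS.get(host, "http")
    return "http"


def classify(packet):
    """Classify by port first, then by payload when the port is not enough."""
    protocol = classify_by_port(packet)
    if protocol == "unknown":
        for prefix, name in PAYLOAD_SIGNATURES:
            if packet.payload.startswith(prefix):
                protocol = name
                break
    if protocol == "http":
        # Many applications share HTTP, so refine using the payload.
        return http_application(packet.payload)
    return protocol
```

With these assumed tables, a request to port 80 carrying `Host: mail.example.com` is labelled `webmail` rather than plain `http`, and a SIP `INVITE` on an arbitrary high port is still recognised as `voip-sip`. A real classifier would also need to handle encrypted traffic, fragmented payloads and per-flow state, none of which is attempted here.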


Ecosystem Management

Data mining applications with databases on the sustainable use of natural resources and the associated ecosystems (ecological, hydrological, limnological, climatic, social, and economic conditions), especially in developing countries, are critical to ecosystem management, and can be very effective for optimization of resources to improve the ecosystem.

Research activities on sustainable management of the three most vulnerable ecosystems, "humid and semi-humid ecosystems," "coastal zones" and "arid and semi-arid ecosystems," can take care of integrated water management on river basin scale (recommended by the European Union Water Initiative as well as Article 10 of the Convention on Biological Diversity [CBD]) and support the implementation of the provisions related to research and sustainable use included in the CBD work programs on Dry and Subhumid lands, Inland water and Marine and Coastal Biodiversity. The ecosystem approach is a strategy for the management of land, water and livelihood resources that promotes conservation and sustainable use in an equitable way and places people and their natural resource use practices at the core of decision making. (European Commission, 2005)

Management dynamics of humid and semi-humid ecosystems under optimized treatment may lead to sustainable use of renewable natural resources, and to identification of policy options and/or management strategies for harnessing judicious use of such resources, focused on an integrated approach and analysis of natural and agro-resource use at local and/or regional levels (sustainable water management, forest ecosystem reclamation, biodiversity management). However, this sort of longer-term skill development and decision-making process requires intensive implementation of data mining schemes.

Mineral Resources

Mineral resources are important for all the nation's citizens and are essential for individuals, companies, and communities that depend on minerals production for income and broader economic development. Like food, air, and water, minerals are fundamental ingredients of human life. However, science and information on mineral resources underpin private and public decisions that determine whether, under what conditions, and at what costs minerals become available to producers and consumers.

Proper resource management can be performed by using data mining, through adaptive learning, the accumulation over the years of information on the availability, deposits, nature and process of extraction of minerals, and analysis of their pertinent values. The good news is that, with recent technologies, advances in minerals science and improvements in minerals information contribute to greater availability of minerals and to extraction at lower cost and with less environmental damage; help society respond to the depletion of known mineral deposits and contribute to the substitution of relatively abundant minerals for increasingly scarce ones; and help develop alternative sources of supply for minerals subject to unexpected supply disruptions (BESR, 2004).

Earth System

Understanding the Earth system is essential to augment human health, safety, and welfare, alleviating human suffering including poverty by protecting the global environment, reducing disaster losses, and achieving sustainable development.8 In this respect, appropriate data mining applications may be implemented to achieve comprehensive, coordinated and sustained observations of the Earth system, in order to improve monitoring of the state of the earth, boost understanding of earth processes, and improve prediction of the behavior of the earth system. These would eventually accumulate to produce a long-term quality data bank to support intelligent decisions and provide benefit to society by reducing loss of life and property from natural and human-induced disasters; understanding environmental factors affecting human health and well-being; improving management of energy resources; understanding, assessing, predicting, mitigating, and adapting to climate change; improving water resources management through better understanding of the water cycle; improving weather information, forecasting and warning; improving the management and protection of terrestrial, coastal and marine ecosystems; supporting sustainable agriculture; combating desertification; and understanding, monitoring and conserving biodiversity (UNDP, 2001).

A Few Advanced Uses

Apart from the fields that have been described so far, the field of data mining is exploding in many aspects. New techniques in data mining are accelerating research across almost all the scientific disciplines. Data mining techniques are being used to facilitate research and experimentation in the development of advanced materials. Researchers are even using quantum mechanical methods to mine crystal structure and property databases to calculate and predict the properties of new ternary and quaternary materials. In this way, each new experiment expands the database and provides new insights into the laws of physics and chemistry (Carty, 2002; Halpern, 2003; Kargupta, Joshi, Sivakumar & Yesh, 2004). Moreover, new lines of research are addressing the links between natural systems and socioeconomic systems, as well as sustainable consumption and production patterns (Science Blog, 2002; Nwabueze, 2003).

Lixin Zhang, President of the World High Technology Society, stated that science and technology contributed towards economic growth without a great sacrifice of natural resources. Through appropriate data mining, information technology could dig out a lot of information encouraging further penetration of new ICT into developing countries and countries with economies in transition, with the aim of combining traditional knowledge with knowledge provided by the scientific community. (WSSD, 2002)

There are advanced uses of data mining in the field of development, integration and validation of GRID technologies and their applications in research, industry and for addressing societal challenges. Research in this respect ranges from GRID technology building blocks to grid-related middleware and large-scale applications. Research also includes test beds for computational GRIDs that are the basic layer for harnessing processing power by distributing massive computational tasks to numerous resources (compute cycles and data storage) over matching communication links. Consequently, information and knowledge GRIDs allow access to dispersed information, and knowledge discovery and extraction from spread knowledge resources. In this context, they make use of cognitive techniques and tools such as data mining, machine learning, content semantics, ontology engineering, information visualization, and intelligent agents (Alpaydin, 2004; CORDIS, 2006).

Internet database that contains human drug metabolism data and in turn, be made available to users across the globe via a nonprofit basis. Depending on the chemical structures for both the parent drug or xenobiotic and the various metabolic biotransformation products, the Human Drug Metabolism Database (hDMdb) will be extremely useful to both medicinal and toxicological chemistry. The production of open databases with chemical structures connected to biological properties demands new tools of data mining (IUPAC, 2005).

Similarly, database tomography is a textual database analysis system consisting of two major components: (a) algorithms for extracting multiword phrase frequencies and phrase proximities
(physical closeness of the multiword technical phrases) from any type of large textual database, to augment (b) the interpretative capabilities of the expert human analyst; these are mostly dependent on the application of appropriate data mining techniques (Kostoff, Tshiteya, Pfeil & Humenik, 2002).

Other Potential Usage Areas

Managing Arid and Semi-Arid Ecosystems

Research on arid and semiarid ecosystem dynamics under varying degrees of human activity pressure leads to more sustainable use of renewable natural resources in natural, rural and peri-urban areas. Data mining applications can be applied to identify opportunities for enhanced economic and sustainable production by analyzing natural and agro-resource use systems at local, regional and/or international levels through an integrated approach. Here, resource management practices with historical data from indigenous people play a critical role in planning and implementing sustainable management strategies for renewable natural resources. Appropriate tools, including information systems, decision support tools, criteria or indicators of sustainability and rehabilitation, and past and present examples of participatory approaches, based on data mining and existing datasets, could be a cost-effective support for dry land ecosystem management and policies.

Furthermore, it may be added that the longer-term outcome of data mining applications would lead to technological, management and policy research in the following focal areas:

1. Improved agriculture and agroforestry systems: By taking into account traditional knowledge, the database system should also look at opportunities beyond farm boundaries in order to diversify income generation and sustain rural and periurban livelihoods. In addition, land improvement or rehabilitation strategies should look into the broader socioeconomic, inclusive demographic, institutional and political dimensions of desertification or ecosystem degradation.

2. Sustainable, integrated water resource management (IWRM) at river-basin: The data mining system should address such dimensions as: increasing efficiency of use, particularly in irrigated agriculture; agroforestry, increasing recycling and reuse for tree growing (periurban plantations, etc.), including innovative multipurpose utilization requiring integrated management attentive to quantity and quality aspects; control of sediment load, erosion, flash floods, control of private use, pollution and water logging; water supply/resource management at basin level in order to meet competing demands, including up-stream and down-stream effects in relation to peri-urban areas, and groundwater management in terms of quantity, quality and change in water table.

3. Research in forest ecosystem restoration and reclamation techniques: Research in terms of data mining should include afforestation and vegetation rehabilitation techniques, especially using native species of economic value (to mitigate or to halt soil, water and land cover degradation caused by unsustainable forestry and farming practices or unsuitable urban settlements). Research on restoration or enrichment of degraded lands and secondary forests, with a concern for conservation of biodiversity, can tap new market opportunities or mitigate the negative environmental impacts of market systems through proper data analysis. The ecosystem approach, where a huge concentration of data sets is visible, should seek to develop tools to strike an appropriate balance between conservation and the use of biological diversity while taking full account of the cultural, social roles (gender concerns) and the function in biodiversity conservation and land rehabilitation (European Commission, 2005).

4. Biodiverse, biosafe, and value added crops: Research to increase the sustainable use and productivity of annual and perennial under-utilized tropical and subtropical crops and species is important for the livelihoods of local populations. These crops have potential for wider use and could significantly contribute to food security, agricultural diversification and income generation. Innovative tools and data mining techniques for the characterization, development and use of crops with enhanced tolerance to abiotic stress, and in particular:
• Tolerance to drought, salinity, heat, cold
• Enhanced nutrient uptake
• Enhanced tolerance to heavy metals and acid soil;
will enrich the related information management systems.

5. Aquatic farming systems: "Farming down aquatic food webs" with particular attention to economic viability, social acceptability and an enabling institutional environment are being regarded as key dimensions of sustainable aquatic farming systems. This combination is expected to improve the conditions for food-insecure households through new knowledge products, processes and policy relevant dialogue. In this respect, particular emphasis should be given to enhanced participatory approaches with strong possibilities of generating an impact in society and promoting social empowerment through knowledge (European Commission, 2005). Accumulation of intrinsic, but apparently invisible, information and data at local levels is a necessary precondition for developing data mining solutions in this dimension.

Future Issues and Challenges

The issue of scientific data collection and management has traditionally been addressed on an informal basis, initially by individual scientists, who generally felt that they had to collect their own research data.9 Later, groups of scientists in the same or related disciplines began to collaborate on the development of larger databases for general-purpose use. This approach had the added scientific advantage of encouraging research testing hypotheses on the same body of data by using different data mining tools (Earth Institute News, 2005; Grossman, Kamath, Kegelmeyer, Kumar, & Namburu, 2006; Kargupta, Joshi, Sivakumar, & Yesh, 2004).

The reality of data explosion in multidimensional databases is an astounding and, at the same time, an extensively misunderstood phenomenon. To implement a proper data mining utility, it is essential to understand what data explosion is, what causes it, and how it can be avoided, because the consequences of ignoring data explosion can be very costly and, in most cases, result in project failure (Potgieter, 2003). Moreover, the exponential growth of business information generated every day means more and more data has to be backed up. Regardless of the circumstances, customers expect services to resume instantly after any disruption. In addition, the increasing need to access data almost around the clock has dramatically shrunk the time permitted to back up data. Foremost, today's data protection challenges pose substantial risks to companies of all sizes, but they pose the greatest risk to small and midsize businesses (NSI Software, 2004; Ville, 2007).

However, one of the problems of data explosion is that it results in a massive database, and the size of the database in one product can literally be hundreds and even thousands of times greater than the same database in another product. Taking this as an opportunity, rather than admitting to the problems of data explosion, the vendor with
the massive database argues that his database is handling large data sets, and implies that the vendor of the smaller database (a database without data explosion) cannot address large enterprise datasets. This is a wrong concept. The correct analysis would be to compare sizes with equal volumes of base data, but because the sizes of the databases are so profoundly different, prospective clients find it hard to believe that such dramatic differences are possible with similar datasets (Dorian, 1999; Potgieter, 2003; Tan, 2006).

This creates confusion at the client's end through misinterpretation by the vendor. Proper data mining tools can eradicate this problem of data explosion, though many other factors are involved in this. Future research work may be carried out on this facet for a longer period of time. Research work should be continued in the case of massive databases to restrict the data explosion. This will reduce the introduction of expensive hardware in the procurement process, reduce the loading and calculation times, reduce establishment and operational costs, curtail any hidden cost that may arise within the undesired processes, provide intelligent enterprise solutions at reduced cost and effort, and foremost will be able to save projects from being total failures (Potgieter, 2003).

Future research should also focus on advanced techniques of stream data mining and its applications, which include data compression, data visualization, intelligent logging systems, sensor network systems, integrated sensor devices, secure storage and firewalls for cracking and SPAM mails. At the same time, research work should progress in the field of information visualizing tools and advanced applications for spatial data, spatio-temporal data, high dimensional data and graph-structured data.

Research work should also continue to reduce challenges in data interpretation, data integrity, data compartmentalization, and data archival by keeping the data chain transparent. There must be a clear relationship among the data generator, data collector, data processor, data archiver and data disseminator (CORDIS, 2006; Wang & Fu, 2005). Proper data mining methods will be able to take care of these issues.

In the case of health sector data mining, research should focus on diseases having considerable impact on the economic development perspectives of the affected communities, taking into account their socioeconomic status. Research should provide new knowledge on biology, epidemiology and technologies relevant for sustainable surveillance and control of diseases on a regional scale and on innovation and improvement of existing interventions, and help to implement appropriate strategies and policies for prevention, control, and treatment.

Regarding data mining of learning content, social service delivery, e-governance and e-government, challenges will remain, as these processes are very dynamic in nature and mostly dependent on various factors that are interdependent. Similarly, in the case of mining agricultural data, one should first learn about the agricultural issues, including the perspective of food security, in order to establish a data mining algorithm, and perhaps the same condition applies to almost all fields of data mining.10

Most of the time, the cost of professional data management is not appreciated. Furthermore, the diversity of technology makes data archiving practice and deployment more complex, and as more and more people expect access to scientific and technical data, science and data management become more interactive and more complex. To face this, innovation in data archiving and management practice and technology is essential to reduce any inevitable expenses. In addition, to make them sustainable, data archives and data centers need to provide public services that add value to the data collections.

SPAM and neuromarketing form another complicated context in the area of data mining. With extended use of the Internet, the potential for subtle and not so subtle control for gross invasion
of privacy is coming to the fore. However, the best-known weapon against such control, that is, encryption, creates as many problems as it solves, due to its paradoxical allowance of certain factors that further elude the law. It is a newly developed situation in the domain of the Internet and, as days are passing, it is becoming critical, and perhaps sometimes harmful, when networks are threatened by SPAM or undesired emails/contents.

This impact is usually associated with the demand for so-called universal access. But universal access/service alone does not suffice (Servaes, 2004). In order to develop the proper rights and responsibilities in the conditions and complexities of a knowledge society, demand and provision of information need to be reconciled. The situation is aggravated further by the cognitive abilities that are necessary to navigate in such a complex information space. All these problems are compounded in the underprivileged parts of the world, but they need to deal squarely with this challenge. Mining parameters pertaining to web browsing, balancing the demand-supply chain, behavior patterns of the end user, justifying the nature of content and search optimization can ease the problem in a very small sense. However, there are many other complex issues involved in this context that need further research.

Conclusion

Due to the dynamic nature of digital content development and delivery methods, selection of proper data mining methods remains critical, especially when the contents are related to scientific, technical, medical, agricultural, social, music, or online computer and video games. The recently emerged network convergence and rapid diffusion of high-speed broadband have shifted attention towards broadband content and applications that promise new business opportunities, growth and employment. Moreover, the potential for digital content growth is very high and growth is only just the beginning. But technologies to assure the diffusion of content and content products are not increasing at that pace, as the processes still remain R&D-intensive in establishing a common platform for faster networks, new standards, software intensive products, virtual reality applications, database management and others. In addition, mobile content and applications in the field of mobile telecommunication service and content industry are also generating huge contents for innovative data mining.

In these revised situations, relationships between content originators and final users are changing. Further to these changes, certain intermediaries are being created and attitudes to content ownership and acquisition are also changing. Moreover, complete disintermediation and direct contact between content creators and content users has not yet been developed to a significant extent.

A favorable scenario is that with the advent of ICTs, and particularly the Internet, people who can afford the hardware and connectivity can gain access to a wealth of content. Much of it is free, or obtained at nominal cost. But end users in the developing countries in particular cannot even afford to pay the nominal fees. So, there is an appeal for open and free content provision. Those who advocate the creation of these open and free educational (or knowledge) resources believe in the principle that education (or knowledge development) is indeed a basic human right. But a fundamental tension prevails in fostering the development of such resources: how will they ultimately be paid for? At a time when knowledge is increasingly becoming commoditized, and universities are running more like businesses, there is a significant counter-flow of arguments that are limiting the institutional capacity to produce open educational resources (OER) (Unwin, 2006). These features are in a way disrupting R&D in this track of data mining.

It is true that advances in technology have changed data collection methods and popularized large-scale data sets, especially in higher education. But, to turn the abundant raw data into valid knowledge, researchers need to realize that traditional statistical techniques have weaknesses when used to study large volumes of data. In this context, more effective and balanced analytical tools are necessary (Larose, 2004; Smart, 2005; Wang, 2003). The concept of the "learning organization," including "lifelong learning" for staff, is now recognized as a key element in corporate strategies. This will reinforce consistency, common identity, shared corporate culture, common actions, clear responsibilities, coordination and dissemination of good practice (Markus, 2005). In the field of natural science, the efficient handling, updating and maintenance of the spatial data infrastructure need highly qualified, properly trained staff.

Similarly, in the field of agriculture, development of innovative and efficient data mining for environment-friendly post-harvest, storage, processing and marketing methods for products derived from such crops, with the objective of increasing market accessibility and product-added value by promoting the development of niche export markets, remains a challenge. At the same time, development and dissemination of sustainable improved production and management practices, taking into account traditional knowledge and innovative methods for the conservation and use of genetic resources in food and agriculture, calls for pragmatic data mining techniques. Moreover, policy, regulatory and institutional issues related to coexistence of multisource crops in the agricultural and food chain, including trade and food security issues, throw up further challenges in terms of improved and accurate data mining techniques.

In the health sector, challenges for research include the need to develop cross-sectorial policies to ensure sustainable measures to fight diseases with a specific focus on poverty reduction through health improvement. Fundamental issues such as gender, ethics and equity must be taken into account. However, critical research and analysis call for data mining solutions to preserve related and relational data for a longer period of time. Furthermore, collecting the data and offering it up-to-date to a vast number of organizations, public investors, stakeholders and common citizens requires technical infrastructures and organized processes. Though the use of ICT is increasingly acknowledged as a strategy to assist in this respect, without an adequate study of the applicability and the natural and social context of ICT, it may become useless or even counterproductive. As such, installing a telecenter in an urban zone with little or no telecommunication infrastructure may sound like an improvement in the quality of life, but if the increasing cost for electricity and networking cannot be afforded in the long run, the incentive will fail (UNDP, 2001). Therefore, not only are the data mining solutions facing challenges, but the total context of the content accumulation system is also not beyond the challenge barrier.

Finally, through all these success cases, it is observed that partnership among implementing agencies in scientific and technical matters should address key societal issues through interdisciplinary research approaches by combining the natural and social sciences. In this context, the overall objective of the research goal should be to develop equitable and strong scientific partnerships among developing countries in order to contribute to their sustainable development by means of human capital development, mobility and institution building (European Commission, 2005). Therefore, applying data mining tools to empower the knowledge society, or rather to strengthen human capacity, demands intricate methodologies and extensive research through in-depth studies.

References

Ackoff, R. L. (1989). From data to wisdom. Journal of Applied Systems Analysis, 16, 3-9.


Allenby, B. R., Compton, W. D., & Richards, D. J. (2007). Information systems and the environment overview and perspectives. Retrieved April 13, 2008, from http://books.nap.edu/openbook.php?record_id=6322&page=1

Allot (2005). The traffic management handbook. MN: Allot Communications Ltd.

Alpaydin, E. (2004). Introduction to machine learning (adaptive computation and machine learning). The MIT Press.

Andrew, M. (2005). The role of research in sustainable tourism policy-making. Paper presented at the First Regional Sustainable Tourism Policy and Intersectoral Planning Workshop, Grand Barbados Hotel, Barbados, West Indies.

Berler, A., Pavlopoulos, S., & Koutsouris, D. (2005). Using key performance indicators as knowledge-management tools at a regional health-care authority level. IEEE Trans Inf Technol Biomed, 9(2), 184-192.

Berry, M. J. A. & Linoff, G. S. (1997). Data mining techniques for marketing, sales and customer support. John Wiley & Sons.

Berry, M. J. A. & Linoff, G. S. (1999). Mastering data mining: The art and science of customer relationship management. John Wiley & Sons.

Berry, M. J. A. & Linoff, G. S. (2000). Mastering data mining. John Wiley & Sons.

Berson, A., Smith, S. J., & Thearling, K. (1999). Building data mining applications for CRM. McGraw Hill.

BESR (2004). Board on Earth Sciences and Resources (BESR), Future challenges for the U.S. Geological survey's mineral resources program (2004). Washington, D. C.: The National Academies Press.

Bozdogan, H. (Ed.) (2004). Statistical data mining and knowledge discovery. CRC Press.

Braha, D. (Ed.) (2001). Data mining for design and manufacturing: Methods and applications. Kluwer Publishers.

Bramer, M. A. (Ed.) (1999). Knowledge discovery and data mining: Theory and practice. IEE Books.

Carty, A. J. (2002). Scientific and technical data: Extending the frontiers of research. In Proceedings of the Opening Address at CODATA 2002: Frontiers of Scientific and Technical Data, Montréal, Canada.

Cerrito, P. (2006). Introduction to data mining using SAS enterprise miner. SAS Press.

Chakrabarti, S. (2002). Mining the Web: Discovering knowledge from hypertext data (1st ed.). Morgan Kaufmann.

CIDA (2005). CIDA's strategy on knowledge for
development through information and communi-
Berry, M. J. A. & Linoff, G. S. (2002). Mining the cation technologies (ICT). Canadian International
Web: Transforming customer data. John Wiley Development Agency. Retrieved April 13, 2008,
& Sons. from http://www.acdi-cida.gc.ca/ict
Berry, M. J. A. & Linoff, G. S. (2004). Data Cios, K. J. (Ed.) (2000). Medical data min-
mining techniques: For marketing, sales, and ing and knowledge discovery. Physica-Verlag
customer relationship management. Wiley Com- (Springer).
puter Publishing.
Cios, K., Pedrycz, W., & Swiniarski, R. (1998).
Berson, A. & Smith, S. J. (1997). Data warehous- Data mining methods for knowledge discovery.
ing, data mining, and OLAP. McGraw Hill.
Codata (2002). Committee on data for science
and technology (CODATA). In Proceedings of the

Prospects and Scopes of Data Mining Applications in Society Development Activities

Workshop Synthesis on Archiving Scientific and Gökmen, A. et al. (2004). Balaban Valley Proj-
Technical Data, Pretoria, South Africa. Retrieved ect: Improving the quality of life in rural area in
April 13, 2008, from http://www.tgdc-codata.org. Turkey, 7(Dec 2004). Retrieved April 13, 2008,
cn/english/Html/SA-CT.html http://www.geocities.com/doriendetombe/detom-
bevol7menmbalabanabstract.html
COL (2003). Find information faster: COL’s
“Info-mining” tools. Retrieved April 13, 2008, Grossman, R. L., Kamath, C., Kegelmeyer, P.,
from http://www.col.org/colweb/site/pid/2927 Kumar, V., & Namburu, R. (Eds.) (2006). Data
mining for scientific and engineering applications
CORDIS (2006). GRID technologies and appli-
(Massive computing) (1st ed.). Springer.
cations through CORDIS. Community Research
& Development Information Service. Retrieved Halpern, J. Y. (2003). Reasoning about uncer-
April 13, 2008, from http://www.environmonu- tainty. MIT Press.
ment.com/projects.htm
Han, J. & Kamber, M. (2000). Data mining:
Delmater, R. & Hancock, M. (2001). Data mining Concepts and techniques (1st ed.). Morgan
explained: A manager’s guide to customer-centric Kaufmann.
business intelligence. Digital Press.
Han, J. & Kamber, M. (2006). Data mining:
Dorian, P. (1999). Data preparation for data min- Concepts and techniques (2nd ed.). Morgan
ing. Morgan Kaufmann. Kaufmann.
Dunham, M. (2003). Data mining introductory Hand, D. J., Mannila, H., & Smyth, P. (2000).
and advanced topics. Prentice Hall. Principles of data mining. MIT Press.
Earth Institute News (2005). Scientific commu- Hernández, V., Göhring, W., & Hopmann, C.
nity must develop cross-disciplinary standards (2004). Sustainable decision support for envi-
and practices in academia. Retrieved April 13, ronmental problems in developing countries:
2008, from http://www.earthinstitute.columbia. Applying multi-criteria spatial analysis on the
edu/news/2005/story05-01-05c.html Nicaragua Development Gateway niDG. Research
on computing science (Vol. 11, pp.136-150).
Ebecken, N. F. F., Brebbia, C. A., & Weigend, A.
México: Instituto Politécnico Nacional.
(2000). Data mining II (1st ed.). Computational
Mechanics, Inc. ICDM (2003). ICDM 2003 tutorial. In Proceedings
of the Third IEEE International Conference on
European Commission (2005). Specific pro-
Data Mining, Sponsored by the IEEE Computer
gramme for research technological development
Society, Melbourne, Florida. Retrieved April
and demonstration: Integrating and strength-
13, 2008, from http://www.cs.sfu.ca/~ester/
ening the European research area, 2005 Work
ICDM2003/Lazarevic.abstract.htm
Programme (SP1-10).
Intransa (2005). Managing storage growth with
Fayyad, U., Piatetsky-Shapiro, G., Smyth, P.,
an affordable and flexible IP SAN: A highly cost-
& Uthurusamy, R. (Eds) (1996). Advances in
effective storage solution that leverages existing
knowledge discovery and data mining. AAAI/
IT resources. CA: Intransa, Inc.
MIT Press.
IUPAC (2005). Chemistry and human health
Fayyad, U., Grinstein, G. & Wierse, A. (2001).
council report: 2003-2005. International Union
Information visualization in data mining and
knowledge discovery. Morgan Kaufmann.

Prospects and Scopes of Data Mining Applications in Society Development Activities

of Pure and Applied Chemistry, IUPAC Division Quéau, P. (2001). The information society and
VII. Retrieved April 13, 2008, from http://www. the global good. Retrieved April 13, 2008, from
iupac.org/news/archives/2005/43rd_council/ http://goanna.cs.rmit.edu.au/~aym/rinseap/bali/
Item_09_Div_VII.pdf QueauTalk.html
Jeffery, K. G. (2000). The grid for e-science: Rao, M. (2002). Systems design of a national
E-commerce benefits, information technology spatial data. Bangalore: Indian Space Research
department. CLRC, ITD. Organisation Headquarters.
Kargupta, H., Joshi, A., Sivakumar, K., & Yesh, Rud, O. P. (2001). Data mining cookbook: Model-
Y. (Eds) (2004). Data mining: Next generation ing data for marketing, risk, and CRM. Wiley.
challenges and future directions. AAAI Press.
Science Blog (2002). Partnerships, finance, sus-
Kommers, P., Kinelev, V., & Kotsik, B. (2003). tainable production and consumption patterns.
ICT in secondary education for the knowledge Press Release: United Nations. Retrieved April
society. In T. Varis, T. Utsumi & W. R. Klemm 12, 2008, from http://www.scienceblog.com/com-
(Eds), Global peace through the global university munity/older/archives/L/2002/A/un020319.html
system. The Finnish National Commission for
Servaes, J. E. J. (2004). Knowledge is power (re-
UNESCO, University of Tampere, Hämeenlinna,
visited): Internet and democracy. In P. Lee (Ed.),
Finland.
Proceedings of the International Conference on
Kostoff, R. N., Tshiteya, R., Pfeil, K. M., & Hume- Internet Communication in Intelligent Societies
nik, J. A. (2002). Power source text mining using (pp. 1 – 16). Chinese University of Hong Kong,
bibliometrics and database tomography. Hong Kong. Retrieved April 13, 2008, http://www.
com.cuhk.edu.hk/conference/2004/)
Larose, D. T. (2004). Discovering knowledge
in data: An introduction to data mining. Wiley- Smart, J. C. (Ed.) (2005). Higher education: Hand-
Interscience. book of theory and research (Vol. 20). Virginia
Tech: Springer.
Markus, B. (2005). Building spatial knowledge
infrastructure. Paper presented at the ISPRS Tan, Pang-Ning, Steinbach, M., & Kumar, V.
Workshop on Service and Application of Spatial (2005). Introduction to data mining. Pearson
Data Infrastructure, XXXVI, Hangzhou, China. Addison Wesley.
Retrieved April 13, 2008, from http://www.
Tan, Pang-Ning (2006). Introduction to data min-
commission4.isprs.org/workshop_hangzhou/pa-
ing. Addison Wesley Publication.
pers/65-70%20Bela%20markus-A103.pdf
Thearling, K. (1995). From data mining to da-
MITRE (2001). Stopping traffic: Anti drug
tabase marketing. DIG White Paper 95/02. Re-
network (ADNET). MITRE Digest Archives.
trieved April 13, 2008, from http://www.cs.uvm.
Retrieved April 13, 2008, from http://www.mitre.
edu/~xwu/icdm/cfp-03.shtml
org/news/digest/archives/2001/adnet.html
Nwabueze, K. (November 30, 2003). A case study:
NSI Software (2004). Six tips small and midsize
Role of technology venture capitalist market in de-
businesses can use to protect their critical data.
veloping countries, data mining, integration, and
NJ: NSI Software.
analysis. Timbuktu Chronicles. Retrieved April
Potgieter, J. (2003). OLAP data scalability: Ig- 13, 2008, http://timbuktuchronicles.blogspot.
nore OLAP data explosion at great cost. NSW com/2003_11_01_archive.html
Australia: SPF Pty Ltd.

Prospects and Scopes of Data Mining Applications in Society Development Activities

UNCTAD (2004). UNCTAD XI multi-stake- endnoteS

holder partnerships, information and communi-
cation technologies for development (ICTfD). In 1
http://webworld.unesco.org/infoethics2000/
Proceedings of the United Nations Conference themes.html accessed on June 14, 2006
on Trade and Development. Retrieved April 2
http://webworld.unesco.org/infoethics2000/
13, 2008, http://www.unctad.org/en/docs//td- themes.html accessed on June 14, 2006
l380add1_en.pdf 3
Retrieved March 06, 2007 from Retrieved
April 20, 2006 from http://www.cathalac.
UNDP (2001). United Nations Development Pro-
org/index.php?option=com_content&task
gram: Making new technologies work for human
=view&id=173&Itemid=256
development. Oxford: Oxford University Press. 4
Retrieved May 06, 2007 from http://curric.
Unwin, T. (2006). Facing the challenges, dgCom- dlib.vt.edu/DLcurric/proposalsummary04.
munities: Open educational resources. Retrieved pdf
April 13, 2008, from http://topics.development- 5
http://portal.unesco.org/ci/en/ev.php-URL_
gateway.org/openeducaion ID=2070&URL_DO=DO_TOPIC&URL_
SECTION=201.html accessed on June 15,
Ville, Barry de (2001). Microsoft data mining:
2006
Integrated business intelligence for e-commerce 6
RPO: It is the point in time to which data
and knowledge management.
must be restored in order to resume process-
Ville, Barry de (2007). Microsoft data mining: ing transactions. RPO is the basis on which
Integrated business intelligence for e-commerce a data projection strategy is developed.
7
and knowledge management. Digital Press. RTO: It is a disaster recovery concept in
information technology. The RTO is deter-
Wang, J. (2003). Data mining: Opportunities and
mined based on the acceptable down time
challenges. IRM Press.
in case of a disruption of operations.
8
Wang, L. & Fu, X. (2005). Data mining with com- Retrieved March 06, 2007 from http://curric.
putational intelligence (advanced information and dlib.vt.edu/DLcurric/proposalsummary04.
knowledge processing) (1st ed.). Springer. pdf
9
Retrieved April 21, 2006 from http://www.
Witten, I. & Frank, E. (1999). Data mining, Prac-
codata.org/archives/2002/ArchivingWG-
tical machine learning tools and techniques with
PretoriaRpt.pdf
Java implementations. Morgan Kaufman. 10
The Club of Amsterdam Journal, August
Witten, I. & Frank, E. (2005). Data mining, prac- 2005, Issue 51, available at http://www.
tical machine learning tools and techniques (2nd clubofamsterdam.com/press.asp?contentid
ed.). Morgan Kaufman. =494&catid=85
WSSD (2002). Press release for fifth partnership
plenary, world summit on sustainable develop-
ment. Johannesburg, South Africa.

Chapter X
Business Data Warehouse:
The Case of Wal-Mart

Indranil Bose
The University of Hong Kong, Hong Kong

Lam Albert Kar Chun

The University of Hong Kong, Hong Kong

Leung Vivien Wai Yue

The University of Hong Kong, Hong Kong

Li Hoi Wan Ines

The University of Hong Kong, Hong Kong

Wong Oi Ling Helen

The University of Hong Kong, Hong Kong

ABSTRACT

The retailing giant Wal-Mart owes its success to the efficient use of information technology in its operations. One of the noteworthy advances made by Wal-Mart is the development of the data warehouse, which gives the company a strategic advantage over its competitors. In this chapter, the planning and implementation of the Wal-Mart data warehouse are described and its integration with the operational systems is discussed. The chapter also highlights some of the problems encountered during the development of the data warehouse. The implications of recent advances in technologies such as RFID, which are likely to play an important role in the Wal-Mart data warehouse in the future, are also detailed in this chapter.

Copyright © 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Business Data Warehouse

INTRODUCTION

Data warehousing has become an important technology for integrating data sources in recent decades, enabling knowledge workers (executives, managers, and analysts) to make better and faster decisions (SCN Education, 2001). From a technological perspective, Wal-Mart, a pioneer in data warehousing, has consistently adopted new technology quickly and successfully. This chapter studies the applications and issues of data warehousing in the retailing industry, using Wal-Mart as the case. By investigating the Wal-Mart data warehouse from various perspectives, we review some of the areas that are critical to the implementation of a data warehouse. The development, implementation, and evaluation of the Wal-Mart data warehouse are described, together with an assessment of the factors responsible for the deployment of a successful data warehouse.

Data Warehousing

A data warehouse is a subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision-making (Agosta, 2000). According to Anahory and Murray (1997), "a data warehouse is the data (meta/fact/dimension/aggregation) and the process managers (load/warehouse/query) that make information available, enabling people to make informed decisions." Before data warehouses came into use, companies stored data in separate databases, each of which was meant for a different function. Useful information could be extracted from these databases, but no analyses were carried out with the data. Since company databases held large volumes of data, queries often returned long listings, making manual data analysis hard to carry out. The technique of data warehousing was invented to resolve this problem.

The concept of data warehousing is simple. Data from several existing systems is extracted at periodic intervals, translated into the format required by the data warehouse, and loaded into the data warehouse. Data in the warehouse may take three forms: detailed information (fact tables), summarized information, and metadata (i.e., description of the data). Data is constantly transformed from one form to another in the data warehouse. A dedicated decision support system is connected to the data warehouse and can retrieve the data required for analysis. Summarized data are presented to managers, helping them to make strategic decisions. For example, graphs showing sales volumes of different products over a particular period can be generated by the decision support system. Based on those graphs, managers may ask several questions. To answer these questions, it may be necessary to query the data warehouse and obtain supporting detailed information. Based on the summarized and detailed information, the managers can decide to alter the production volume of different products to meet expected demand. The major processes that control the data flow and the types of data in the data warehouse are depicted in Figure 1. For a more detailed description of the architecture and functionalities of a data warehouse, the interested reader may refer to Inmon and Inmon (2002) and Kimball and Ross (2002).

BACKGROUND

Wal-Mart is one of the most effective users of technology (Kalakota & Robinson, 2003). Wal-Mart was always among the front-runners in employing information technology (IT) to manage its supply chain processes (Prashanth, 2004). Wal-Mart started using IT to facilitate cross docking in the 1970s. The company later installed bar codes for inventory tracking, and a satellite communication system (SCS) for coordinating the activities of its supply chain. Wal-Mart also set up electronic data interchange (EDI) and a computer terminal


Figure 1. Process diagram of a data warehouse (adapted from Anahory and Murray [1997])
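As a rough illustration of the flow summarized in Figure 1 (extract from existing systems, translate into the warehouse format, load, then summarize for decision support), a minimal Python sketch follows. All record layouts, field names, and function names are hypothetical and are not taken from Wal-Mart's systems.

```python
from collections import defaultdict

# Hypothetical records extracted from two separate operational systems.
pos_system = [
    {"store": 1, "item": "A100", "qty": 3},
    {"store": 1, "item": "A100", "qty": 2},
]
catalog_system = [
    {"STORE_NO": "2", "SKU": "a100", "UNITS": "4"},
]

def translate_catalog(row):
    """Translate a catalog-system row into the (assumed) warehouse format."""
    return {
        "store": int(row["STORE_NO"]),
        "item": row["SKU"].upper(),
        "qty": int(row["UNITS"]),
    }

def load(warehouse, rows):
    """Append detailed (fact) rows to the warehouse."""
    warehouse["fact"].extend(rows)

def summarize(warehouse):
    """Derive summarized information: total quantity per item."""
    totals = defaultdict(int)
    for row in warehouse["fact"]:
        totals[row["item"]] += row["qty"]
    warehouse["summary"] = dict(totals)

# The three forms of data named in the text: detail, summary, metadata.
warehouse = {
    "fact": [],
    "summary": {},
    "metadata": {"fact": "one row per store and item sale"},
}
load(warehouse, pos_system)
load(warehouse, [translate_catalog(r) for r in catalog_system])
summarize(warehouse)
print(warehouse["summary"])
```

A decision support front end would read `warehouse["summary"]` for graphs and fall back to `warehouse["fact"]` when a manager asks for the supporting detail.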

network (CTN), which enabled it to place orders electronically with its suppliers and to plan the dispatch of goods to the stores appropriately. An advanced conveyor system was installed in 1978. The point of sale (POS) scanning system made its appearance in 1983, when Wal-Mart's key suppliers placed bar codes on every item and universal product code (UPC) scanners were installed in Wal-Mart stores. Later on, the electronic purchase order management system was introduced, with associates equipped with handheld terminals to scan the shelf labels. As a result of the adoption of these technologies, inventory management became much more efficient for Wal-Mart. In the early 1990s, Wal-Mart information was kept in many different databases. As its competitors, such as Kmart, started building integrated databases that could keep sales information down to the article level, Wal-Mart's IT department felt that a data warehouse was needed to maintain its competitive edge in the retailing industry.

Since the idea of a data warehouse was still new to the IT staff, Wal-Mart needed a technology partner. There are three important criteria for data warehouse selection: compatibility, maintenance, and linear growth. In the early 1990s, Teradata Corporation, now a division of NCR, was the only choice for Wal-Mart, as Teradata was the only merchant database that fulfilled these three criteria. Data warehouse compatibility ensured that the data warehouse worked with the front-end application, and that data could be transferred from the old systems. The first task for Teradata Corporation was to build a prototype of the data warehouse system. Based on this prototype system, a business case study related to the communication between the IT department and the merchandising organizations was constructed. The case study and the prototype system were used together to convince Wal-Mart executives to invest in data warehouse technology.

Once the project was approved, the IT department began the task of building the data warehouse. First, information-based analyses were carried out on all of the historical merchandising data. Since the IT department did not understand at first what needed to be done, time was wasted. About a month later, there was a shakedown, and the IT department focused on the point-of-sale (POS) data. Four teams were formed: a database team, an application team, a GUI team, and a Teradata team. The Teradata team provided training and oversaw everything. The remaining teams held different responsibilities: the database team designed, created, and maintained the data warehouse; the application team was responsible for loading, maintaining, and extracting the data; and the GUI team concentrated on building the interface for the data warehouse. While working on different parts of the data warehouse, the teams supported each other's operations.

Hardware was a limitation in the data warehouse implementation at Wal-Mart. Since all data


needed to fit in a 600 GB machine, data modeling had to be carried out. To save storage space, a technique called "compressing on zero" was used (Westerman, 2001). This technique was created by the prototype teams. It assumed that the default value in the data warehouse was zero; when a value was zero, there was no need to store it or to allocate physical space on the disk drive for it. This was quite important, since a zero would otherwise take up as much space as any large value. The technique resulted in great disk space savings in the initial stages of the database design. Data modeling was an important step in the Wal-Mart data warehouse implementation. Not only did it save storage, it was also responsible for efficient maintenance of the data warehouse in the future. As Westerman (2001) states, "If you logically design the database first, the physical implementation will be much easier to maintain in the longer term."

After the first implementation, the Wal-Mart data warehouse consisted of the POS structure (Figure 2). The structure was formed with a large fact-base table (POS) surrounded by a number of support tables.

The initial schema was a star schema, with the central fact table (POS) linked to the other six support tables. However, the star schema was soon modified to a snowflake schema, in which the large fact table (POS) was surrounded by several smaller support tables (such as store, article, and date), which in turn were surrounded by yet smaller support tables (such as region, district, supplier, and week). An important element of the POS table was the activity sequence number, which acted as a foreign key to the selling activity table. The selling activity table led to performance problems after two years, and Wal-Mart decided to merge this table with the POS table. The next major change, which took place several years later, was the addition of the selling time attribute to the POS table. A detailed description of the summary and fact tables can be obtained from Westerman (2001).

MAIN THRUST

Approximately one year after implementation of the data warehouse in Wal-Mart, a return on

Figure 2. Star schema for Wal-Mart data warehouse (Source: Westerman, 2001)
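To make the star schema of Figure 2 more concrete, the following sketch builds a tiny POS fact table linked to two support tables and runs a typical summary query. The table and column names are illustrative assumptions (the chapter does not list the actual columns), and SQLite merely stands in for the Teradata database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Support tables; names and columns are hypothetical.
cur.execute("CREATE TABLE store (store_id INTEGER PRIMARY KEY, region TEXT)")
cur.execute("CREATE TABLE article (article_id INTEGER PRIMARY KEY, name TEXT)")

# Central fact table: one row per store, article, and date.
cur.execute(
    """CREATE TABLE pos (
           store_id INTEGER REFERENCES store(store_id),
           article_id INTEGER REFERENCES article(article_id),
           sale_date TEXT,
           units_sold INTEGER
       )"""
)

cur.executemany("INSERT INTO store VALUES (?, ?)", [(1, "North"), (2, "South")])
cur.executemany("INSERT INTO article VALUES (?, ?)", [(10, "Soap"), (11, "Towel")])
cur.executemany(
    "INSERT INTO pos VALUES (?, ?, ?, ?)",
    [
        (1, 10, "1992-03-01", 5),
        (1, 11, "1992-03-01", 2),
        (2, 10, "1992-03-01", 7),
        (2, 10, "1992-03-02", 1),
    ],
)

# Join the fact table to its support tables and aggregate
# units sold per region and article.
rows = cur.execute(
    """SELECT s.region, a.name, SUM(p.units_sold)
       FROM pos AS p
       JOIN store AS s ON s.store_id = p.store_id
       JOIN article AS a ON a.article_id = p.article_id
       GROUP BY s.region, a.name
       ORDER BY s.region, a.name"""
).fetchall()
print(rows)
```

In a snowflake variant, `region` would move out of `store` into its own support table referenced by a key, at the cost of one more join per query.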


investment (ROI) analysis was conducted. At Wal-Mart, the executives viewed investment in advanced data warehousing technology as a strategic advantage over their competitors, and this resulted in a favorable ROI analysis. However, the implementation of the data warehouse was marked by several problems.

Problems in Using the Buyer Decision Support Systems (BDSS)

The first graphical user interface (GUI) application based on the Wal-Mart data warehouse was called the BDSS. This was a Windows-based application created to allow buyers to run queries based on stores, articles, and specific weeks. The queries were run and the results were generated in a spreadsheet format. It allowed users to conduct store profitability analysis for a specific article by running queries. A major problem associated with the BDSS was that queries run through it would not always execute properly. The success rate of query execution was quite low at the beginning (i.e., 60%). BDSS was rewritten several times and was in a process of continual improvement. Initially, the system could only access POS data, but within a short period of time, access was also provided to data related to warehouse shipments, purchase orders, and store receipts. BDSS proved to be a phenomenal success for Wal-Mart, and it gave the buyers tremendous power in their negotiations with the suppliers, since they could check the inventory in the stores very easily and order accordingly.

Problems in Tracking Users with Query Statistics

Query Statistics was a useful application for Wal-Mart that defined critical factors in the query execution process and built a system to track the queries. Tracking under this query statistics application revealed some problems with the warehouse. All users were using the same user-ID and password to log on and run queries, and there was no way to track who was actually running a given query. Wal-Mart fixed the tracking problem by issuing different user-IDs, all with the same password, "walmart". But this in turn led to security problems, as Wal-Mart's buyers, merchandisers, logistics, and forecasting associates, as well as 3,500 of Wal-Mart's vendor partners, were able to access the same data in the data warehouse. This problem was solved in the second year of operation of the data warehouse by requiring all users to change their passwords.

Performance Problems of Queries

Users had to stay connected to Wal-Mart's bouncing network and database, which spanned its entire 4,000-plus store chain, and this was cost-ineffective and time-consuming when running queries. The users reported a high failure rate when they stayed connected to the network for the duration of the query run time. The solution to this problem was deferred queries, which were added to provide a more stable environment for users. The deferred queries application ran the query and saved the results in the database in an off-line mode. The users could see the status of the query and retrieve the results after the query had completed. With the introduction of deferred queries, the performance problems were solved and user confidence was restored. Deferring a query was optional, however: users who did not face any network-related problems could still run queries online while remaining connected to Wal-Mart's database.

Problems in Supporting Wal-Mart's Suppliers

Wal-Mart's suppliers often remained dissatisfied because they did not have access to the Wal-Mart data warehouse. Wal-Mart barred its suppliers


from viewing its data warehouse since it did not want suppliers to look into the Wal-Mart inventory warehouse. The executives feared that, if given access to the inventory warehouse, suppliers would lower the price of goods as much as they could, and this in turn would force Wal-Mart to purchase at a low price, resulting in overstocked inventory. Later on, Wal-Mart realized that since the goals of the supplier and the buyer are the same (i.e., to sell more merchandise), it is not beneficial to keep this information away from the suppliers. In fact, the information should be shared so that the suppliers could come prepared. In order to sustain its bargaining power over its suppliers and yet satisfy them, Wal-Mart built Retail Link, a decision support system that served as a bridge between Wal-Mart and its suppliers. It was essentially the same data warehouse application as the BDSS, but without the competitors' product cost information. With this, the suppliers were able to view almost everything in the data warehouse, perform the same analyses, and exchange ideas for improving their business. Previously, the suppliers used to feel quite disheartened when the buyers surprised them with their up-to-date analyses using the BDSS. The suppliers often complained that they could not see what the buyers were seeing. With the availability of Retail Link, the suppliers also began to feel that Wal-Mart cared about its partners, and this improved the relationship between the suppliers and the buyers.

Once the initial problems were overcome, emphasis was placed on integration of the data warehouse with several of the existing operational applications.

Integration of the Data Warehouse with Operational Applications

When it came to integration, the main driving force for Wal-Mart was the ability to get the information into the hands of decision makers. Therefore, many of the applications were integrated into the data warehouse (Whiting, 2004). As a result, the systems were able to feed data into the data warehouse seamlessly. There were also technical reasons driving integration. It was easier to get data out of the integrated data warehouse, which made it a transportation vehicle for data into the different computers throughout the company. This was especially important because it allowed each store to pull new information from the data warehouse through its replenishment system. It was also very effective since the warehouse was designed to run in parallel, thus allowing hundreds of stores to pull data at the same time. The following is a brief description of Wal-Mart's applications and how they were integrated into the enterprise data warehouse.

Replenishment System

The process of automatic replenishment was critically important for Wal-Mart since it was able to deliver the biggest ROI after the implementation of the data warehouse. Since the replenishment application was already established, the system was quite mature for integration. The replenishment system was responsible for online transaction processing (OLTP) and online analytical processing (OLAP). It reviewed articles for orders. The system then determined whether an order was needed and suggested an order record for the article. Next, these order records were loaded into the data warehouse and transmitted from the home office to the store. The store manager then reviewed the suggested orders, changed prices, counted inventory, and so on. Before the order was placed, the store managers also reviewed the flow of goods by inquiring about article sales trends, order trends, article profiles, corporate information, and so on. These were examples of OLAP activities. This meant that the order was not automatically placed for any item. Only after the store manager had a chance to review the order and perform some analyses using the data warehouse was it decided whether the order was


Table 1. An example store trait table

Store  Pharmacy  Fresh  Bakery  Beach  Retirement  University  <60K     >120K    Kmart  Target  Real  etc.
ID               deli                                          sq. ft.  sq. ft.  comp   comp    comp
0      N         N      N       N      Y           N           N        Y        Y      N       N     …
2106   Y         Y      Y       N      N           Y           N        Y        N      Y       N     …
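The trait table and the Boolean distribution formula discussed under "Distribution via Traits" can be sketched in Python as follows. This is only an illustration under stated assumptions: the dictionary layout and function names are invented for the example, the Y/N values are those of the two rows of Table 1, store ID 9999 is a made-up placeholder for the first row, and the size condition is written as the negation of the "<60K sq. ft." trait to match the text's description of a store larger than 60,000 sq. ft.

```python
# Trait names in the column order of Table 1 (identifiers are hypothetical).
TRAITS = [
    "pharmacy", "fresh_deli", "bakery", "beach", "retirement",
    "university", "lt_60k_sqft", "gt_120k_sqft",
    "kmart_comp", "target_comp", "real_comp",
]

def trait_row(flags):
    """Turn a string of Y/N flags into a dictionary of Boolean traits."""
    return {name: flag == "Y" for name, flag in zip(TRAITS, flags)}

stores = {
    9999: trait_row("NNNNYNNYYNN"),  # placeholder ID for the first row
    2106: trait_row("YYYNNYNYNYN"),  # the store named in the text
}

def receives_article_x(t):
    """pharmacy AND fresh deli AND bakery AND NOT (< 60K sq. ft.)."""
    return (t["pharmacy"] and t["fresh_deli"] and t["bakery"]
            and not t["lt_60k_sqft"])

recipients = [sid for sid, t in stores.items() if receives_article_x(t)]
print(recipients)
```

Because every trait is a plain TRUE/FALSE value, adjusting the distribution of an article only requires editing its formula, not the store data.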

going to be placed or not. The order could either be filled from one of the Wal-Mart warehouses or be directed to the supplier via electronic data interchange (EDI). In either case, the order would be placed in the order systems and recorded in the data warehouse.

Distribution via Traits

The traiting concept was developed as an essential element of the replenishment system. The main idea was to determine the distribution of an article to the stores. Traits were used to classify stores into manageable units and could include any characteristic, as long as it was somewhat permanent. Furthermore, these traits could have only two values: TRUE and FALSE. Table 1 is an example of what a store trait table might look like.

Traits could also be applied to articles in a store, in which case a separate trait table could be created for them. These different trait tables were used as part of the replenishment system. The most powerful aspect of the traiting concept was the use of a replenishment formula based on these traits. The formula was a Boolean formula whose outcome was one of two values: if the result was true, the store would receive the article; if it was false, the store would not. This concept was very important for a large centrally managed retail company like Wal-Mart, since the right distribution of goods to the right stores affected sales and hence the image of the company. A distribution formula might look like this:

Store distribution for Article X = (pharmacy * fresh deli * bakery * > 60K sq. ft.).

This formula indicated that a store which had a pharmacy, a fresh deli, a bakery, and a size of more than 60,000 sq. ft. should receive the order. From Table 1, we can see that store 2106 satisfies all these conditions and hence should receive article X. In this manner, each article had its own unique formula, helping Wal-Mart distribute its articles most effectively amongst its stores.

All this information was very valuable for determining the allocation of merchandise to stores. A data warehouse would provide a good estimate for a product based on another, similar product that had the same distribution. A new product would be distributed to a test market using the traiting concept, and then the entire performance tracking would be done by the data warehouse. Depending on the success or failure of the initial trial run, the traits would be adjusted based on performance tracking in the data warehouse, and this would be continued until the distribution formula was perfected. These traiting methods were replicated throughout Wal-Mart using the data warehouse, helping Wal-Mart institute a comprehensive distribution technique.

Perpetual Inventory (PI) System

The PI system was used for maintenance of inventory of all articles, not just the articles appearing in the automatic replenishment. Like the replenishment system, it was also an example of an OLAP and OLTP system. It could help managers see the entire flow of goods for all articles,
agers see the entire flow of goods for all articles,


including replenishment articles. This data was available in the store and at the home office. Thus, with the use of the replenishment and PI systems, managers could maintain all information related to the inventory in their store electronically. With all this information in the data warehouse, there were numerous information analyses that could be conducted. These included:

• The analysis of the sequence of events related to the movement of an article
• Determination of operational cost
• Creation of “plan-o-grams” for each store for making planning more precise. This could allow buyers and suppliers to measure the best selling locations without physically going to the store.

The PI system using the enterprise data warehouse could also provide benefits to the customer service department. Managers could help customers locate certain products with certain characteristics. The system could locate the product in the store, identify whether there were any in storage, determine whether the product was in transit and when it could arrive, or even check whether the product was available in any nearby stores. This was feasible due to the data provided by the PI system and the information generated by the data warehouse.

FUTURE TRENDS

Today, Wal-Mart continues to employ the most advanced IT in all its supply chain functions. One current technology adoption at Wal-Mart is very tightly linked with Wal-Mart’s data warehouse, that is, the implementation of Radio Frequency Identification (RFID). In its efforts to implement new technologies to reduce costs and enhance the efficiency of its supply chain, in July 2003, Wal-Mart asked all its suppliers to place RFID tags on the goods, packed in pallets and crates, shipped to Wal-Mart (Prashanth, 2004). Wal-Mart announced that its top 100 suppliers must equip their pallets and crates with RFID tags by January 2005. The deadline is now 2006 and the list now includes all suppliers, not just the top 100 (Hardfield, 2004). Even though it is expensive and impractical (Greenburg, 2004), the suppliers have no choice but to adopt this technology.

The RFID technology consists of RFID tags and readers. In logistical planning and operation of supply chain processes, RFID tags, each consisting of a microchip and an antenna, would be attached to the products. Throughout the distribution centers, RFID readers would be placed at different dock doors. As a product passed a reader at a particular location, a signal would be triggered and the computer system would update the location status of the associated product.

Figure 3. RFID label for Wal-Mart (Source: E-Technology Institution (ETI) of the University of Hong Kong [HKU])

According to Peak Technologies (http://www.peaktech.com), Wal-Mart is applying SAMSys MP9320 UHF portal readers with Moore Wallance RFID labels using Alien Class 1 passive tags. Each tag would store an Electronic Product Code (EPC), a bar code successor that would be used to track products as they entered Wal-Mart’s distribution centers and shipped to


individual stores (Williams, 2004). Figure 3 is an example of the label. The data stored in the RFID chip and a bar code are printed on the label, so that we know what is stored in the chip and so that the bar code can be scanned if it becomes impossible to read the RFID tag. According to Sullivan (2004, 2005), RFID is already installed in 104 Wal-Mart stores, 36 Sam’s Clubs, and three distribution centers, and Wal-Mart plans to have RFID in 600 stores and 12 distribution centers by the end of 2005.

The implementation of RFID at Wal-Mart is closely related to Wal-Mart’s data warehouse, as the volume of data available will increase significantly. The industry has been surprised by estimates of greater than 7 terabytes of item-level data per day at Wal-Mart stores (Alvarez, 2004). Such a large amount of data can severely reduce the long-term success of a company’s RFID initiative. Hence, there is an increasing need to integrate the available RFID data with the Wal-Mart data warehouse. Fortunately, Wal-Mart’s data warehouse team is aware of the situation and is standing by to enhance the data warehouse if required.

CONCLUSION

In this chapter we have outlined the historical development of a business data warehouse by the retailing giant Wal-Mart. As a leader in adopting cutting-edge IT, Wal-Mart demonstrated great strategic vision by investing in the design, development and implementation of a business data warehouse. Since this was an extremely challenging project, it encountered numerous problems from the beginning. These problems arose due to the inexperience of the development team, instability of networks, and also the inability of Wal-Mart management to forecast possible uses and limitations of systems. However, Wal-Mart was able to address all these problems successfully and was able to create a data warehouse system that gave it a phenomenal strategic advantage over its competitors. It created the BDSS and the Retail Link, which allowed easy exchange of information between the buyers and the suppliers and involved both parties in improving sales of items. Another key achievement of the Wal-Mart data warehouse was the Replenishment system and the Perpetual Inventory system, which acted as efficient decision support systems and helped store managers throughout the world to reduce inventory, order items appropriately, and also to perform ad-hoc queries about the status of orders. Using novel concepts such as traiting, Wal-Mart was able to develop a successful strategy for efficient distribution of products to stores. As can be expected, Wal-Mart is also a first mover in the adoption of the RFID technology, which is likely to change the retailing industry in the next few years. The use of this technology will lead to the generation of enormous amounts of data for tracking of items in the Wal-Mart system. It remains to be seen how effectively Wal-Mart integrates the RFID technology with its state-of-the-art business data warehouse to its own advantage.

REFERENCES

Agosta, L. (2000). The essential guide to data warehousing. Upper Saddle River, NJ: Prentice Hall.

Alvarez, G. (2004). What’s missing from RFID tests. Information Week. Retrieved November 20, 2004, from http://www.informationweek.com/story/showArticle.jhtml?articleID=52500193

Anahory, S., & Murray, D. (1997). Data warehousing in the real world: A practical guide for building decision support systems. Harlow, UK: Addison-Wesley.

Greenburg, E. F. (2004). Who turns on the RFID faucet, and does it matter? Packaging Digest, 22.


Retrieved January 24, 2005, from http://www.packagingdigest.com/articles/200408/22.php

Hardfield, R. (2004). The RFID power play. Supply Chain Resource Consortium. Retrieved October 23, 2004, from http://scrc.ncsu.edu/public/APICS/APICSjan04.html

Inmon, W. H. (2002). Building the data warehouse (3rd ed.). New York: John Wiley & Sons.

Kalakota, R., & Robinson, M. (2003). From e-business to services: Why and why now? Addison-Wesley. Retrieved January 24, 2005, from http://www.awprofessional.com/articles/article.asp?p=99978&seqNum=5

Kimball, R., & Ross, M. (2002). The data warehouse toolkit: The complete guide to dimensional modeling (2nd ed.). New York: John Wiley & Sons.

Prashanth, K. (2004). Wal-Mart’s supply chain management practices (B): Using IT/Internet to manage the supply chain. Hyderabad, India: ICFAI Center for Management Research.

SCN Education B. V. (2001). Data warehousing: The ultimate guide to building corporate business intelligence (1st ed.). Vieweg & Sohn Verlagsgesellschaft mbH.

Sullivan, L. (2004). Wal-Mart’s way. Information Week. Retrieved March 31, 2005, from http://www.informationweek.com/story/showArticle.jhtml?articleID=47902662&pgno=3

Sullivan, L. (2005). Wal-Mart assesses new uses for RFID. Information Week. Retrieved March 31, 2005, from http://www.informationweek.com/showArticle.jhtml?articleID=159906172

Westerman, P. (2001). Data warehousing: Using the Wal-Mart model. San Francisco: Academic Press.

Whiting, R. (2004). Vertical thinking. Information Week. Retrieved March 31, 2005, from http://www.informationweek.com/showArticle.jhtml?articleID=18201987

Williams, D. (2004). The strategic implications of Wal-Mart’s RFID mandate. Directions Magazine. Retrieved October 23, 2004, from http://www.directionsmag.com/article.php?article_id=629

This work was previously published in Database Modeling for Industrial Data Management: Emerging Technologies and Applications, edited by Z.M. Ma, pp. 244-257, copyright 2006 by Information Science Publishing (an imprint of IGI Global).

Chapter XI
Medical Applications of
Nanotechnology in the
Research Literature1
Ronald N. Kostoff
Office of Naval Research, USA

Raymond G. Koytcheff
Office of Naval Research, USA

Clifford G.Y. Lau

Institute for Defense Analyses, USA

ABSTRACT

The medical applications literature associated with nanoscience and nanotechnology research was examined. About 65,000 nanotechnology records for 2005 were retrieved from the Science Citation Index/Social Science Citation Index (SCI/SSCI) using a comprehensive 300+ term query. The medical applications were identified through a fuzzy clustering process. Metrics associated with research literatures for specific medical applications and application groups were generated.

INTRODUCTION

During 2003–2005, a comprehensive text mining study was performed to overview the technical structure and infrastructure of the global nanotechnology research literature, as well as the seminal nanotechnology literature (Kostoff, Stump, Johnson, Murday, Lau & Tolles, 2005a; Kostoff, Murday, Lau & Tolles, 2005b; Kostoff, Stump, Johnson, Murday, Lau & Tolles, 2006a; Kostoff, Murday, Lau & Tolles, 2006b). Based on the global interest generated by these reports, it was decided to update and expand the study using more recent data, a much more comprehensive query, and more sophisticated analytical tools. A

Copyright © 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Medical Applications of Nanotechnology in the Research Literature

detailed report from the updated study is contained in Kostoff, Koytcheff, and Lau (2007).

In the updated study, text mining was used to extract technical intelligence from the open source global nanotechnology and nanoscience research literature (Science Citation Index/Social Science Citation Index (SCI/SSCI) databases (SCI, 2006)). Identified were: (1) the nanotechnology/nanoscience research literature infrastructure (prolific authors, key journals/institutions/countries, most cited authors/journals/documents); (2) the technical structure (pervasive technical thrusts and their interrelationships); (3) nanotechnology instruments and their relationships; (4) potential nonmedical nanotechnology applications; and (5) potential health applications.

Most importantly, in the updated study, all of the technical structural analyses of the total nanotechnology database show medical and nonmedical applications to be key drivers of nanoscience and nanotechnology research. The objectives of this paper are to examine the nanotechnology medical applications literature in depth, and especially to show the relationships of medical applications to each other and to the underlying science disciplines.

In order to place the nanotechnology medical applications analyses and findings in their proper context, the overall nanotechnology study will first be summarized.

SUMMARY OF OVERALL NANOTECHNOLOGY STUDY

Bibliometrics

Global nanotechnology research article production has exhibited exponential growth for more than a decade. The most rapid growth over that time period has come from East Asian nations, notably China and South Korea. While the U.S. remains the leader in aggregate nanotechnology research article production, in some selected nanotechnology subareas China has achieved parity or taken the lead in research article production.

The main institutional copublishing groups are East Asian: one each from China, Japan, and South Korea. However, publication connectivity among institutions is much weaker than common interest or citation connectivity. Correlation of institutions by the journals they cite reveals four nationality-based (or locality-based) clusters: Chinese, Japanese, American, and European. Institutions from the same nationality group cite the same focused journals (primarily, but not exclusively, domestic). Correlation of institutions by documents they cite reveals that only the Chinese institutions constitute a strongly connected network.

The dominant country copublishing network is a complex web of mainly European nations roughly following geographic lines: Nordic, Central Europe, Eastern Europe, and a Western Europe/Latin American group of Romance language nations. There is also a UK component country network, but it is not linked to the interconnected continental members of the European Union.

Correlation of countries by common thematic interest shows two major poles: U.S. and China. The U.S. pole is strongly connected thematically to a densely connected network of English-speaking North American representatives, Western/Central European nations, and most of the East Asian allies. China is relatively isolated except for India, and the Eastern European and Latin American representatives are outside the main network as well.

There is a clear distinction between the publication practices of the three most prolific Western nations and the three most prolific East Asian nations. The Western nations publish in journals with almost twice the weighted average impact factors (Journal Impact Factor is a metric that reflects the average citations received by papers published recently in the journal) of the East Asian nations. However, much of the difference stems from the East Asian nations publishing a


nonnegligible amount in domestic low impact factor journals, while the Western nations publish in higher impact factor international journals. Additionally, some of the Asian countries (e.g., China) are publishing in journals whose initial access date in the SCI/SSCI is relatively recent. If China is publishing a nonnegligible fraction of its research output in newly accessed, relatively low impact factor journals, then some of its apparently rapid growth will not be growth in the traditional sense of increased sponsorship or productivity, but will rather be due to the SCI/SSCI’s decision to access existing journals’ articles. From another perspective, China’s research article production may have been somewhat more competitive for decades, but was artificially suppressed by many of its journals’ noninclusion in the SCI/SSCI until only recently.

Of the 30 institutions publishing large numbers of nanotechnology papers, only 4 are from the U.S., whereas of the 25 institutions producing highly cited nanotechnology papers, an astonishing 21 are from the U.S. The two journals that contain the most cited nanotechnology papers since 1991 are Science and Nature, and the two countries that lead in production of the most cited papers are the U.S. and Germany, with the U.S. having four times the number of most cited nanotechnology papers as Germany. The U.S. and Germany account for 40% of the most cited nanotechnology papers, while the high paper volume production East Asian countries of China and South Korea account for only two percent of the most cited papers.

Computational Linguistics

The retrieved records for 2005 can be divided into two main categories with similar numbers of records (first level). One category focuses on phenomena and thin films (32,983 records), and the other category focuses on materials/structures (31,742 records). The “phenomena” component of the first category is roughly four times the size of the “films” component, whereas the “nanotubes” component of the materials/structures category is about one ninth the size of the nonnanotubes component.

Furthermore, maps were constructed to show groupings of related nonmedical applications into broader thematic areas. An autocorrelation map of the most widely referenced nonmedical applications showed five weakly connected subnetworks:

• Electronic devices and components
• Optical switching
• Tribology and corrosion
• Optoelectronic sensors
• Electrochemical conversion and catalysis

In addition to the maps, factor analyses were performed to show nonmedical thematic areas from a slightly different perspective. A six-factor analysis showed the following themes:

• Factor 1: Optoelectronics
• Factor 2: Tribology
• Factor 3: Lithography
• Factor 4: Control systems
• Factor 5: Devices
• Factor 6: Microsystems

Finally, for the most frequently mentioned nonmedical applications, the following observations were made:

• TiO2, Pt, Si, gold, and polymers tend to stand out as the most pervasive associated material types.
• Morphology, thickness/diameter/particle size, optical properties, catalytic performance, and electrochemical properties tend to stand out as the most pervasive associated material properties.
• Deposition, absorption, oxidation, immobilization, catalysis, degradation, and self-assembly tend to stand out as the most


pervasive associated nanoscale phenomena.
• Thin films, nanowires, nanotubes (especially carbon), and self-assembled monolayers tend to stand out as the most pervasive associated nanostructures.

BACKGROUND

The two main components of the present nanotechnology medical applications study are the text mining analytical procedure and the nanotechnology topical literature. The text mining background will be summarized briefly. The nanotechnology background has been described in detail (Kostoff et al., 2005b, 2006b), and will not be repeated here. The nanotechnology background is updated and expanded in Kostoff et al. (2007). Finally, the relevant technology transfer background issues will be summarized.

Text Mining

A typical text mining study of the published literature develops a query for comprehensive information retrieval, processes the retrieved database using computational linguistics and bibliometrics, and integrates the processed information. In this section, the computational linguistics and bibliometrics are overviewed.

Science and technology (S&T) computational linguistics (Hearst, 1999; Kostoff, 2003a; Losiewicz, Oard & Kostoff, 2000; Zhu & Porter, 2002) identifies pervasive technical themes in large databases from technical phrases that occur frequently. It also identifies relationships among these themes by grouping (clustering) these phrases (or their parent documents) on the basis of similarity. Computational linguistics can be used for:

• Enhancing information retrieval and increasing awareness of the global technical literature (Greengrass, 1997; Kostoff, Eberhart & Toothman, 1997a; TREC, 2004)
• Potential discovery and innovation based on merging common linkages among very disparate literatures (Gordon & Dumais, 1998; Kostoff, 2003b, 2006c; Swanson, 1986; Swanson & Smalheiser, 1997)
• Uncovering unexpected asymmetries from the technical literature (Kostoff, 2003c; Goldman, Chu, Parker & Goldman, 1999). For example, Kostoff (2003c) predicted asymmetries in recorded bilateral organ (lungs, kidneys, testes, ovaries) cancer incidence rates from the asymmetric occurrence of lateral word frequencies (left, right) in Medline case study articles.
• Estimating global levels of effort in S&T subdisciplines (Kostoff, Green, Toothman & Humenik, 2000; Kostoff, Shlesinger & Tshiteya, 2004a; Viator & Pastorius, 2001)
• Helping authors potentially increase their citation statistics by improving access to their published papers, and thereby potentially helping journals to increase their impact factors (Kostoff et al., 2004a; Kostoff, Shlesinger & Malpohl, 2004b)
• Tracking myriad research impacts across time and applications areas (Kostoff, Del Rio, García, Ramírez & Humenik, 2001; Davidse & VanRaan, 1997)

Evaluative bibliometrics (Garfield, 1985; Narin, 1976; Schubert, Glanzel & Braun, 1987) uses counts of publications, patents, citations and other potentially informative items to develop science and technology performance indicators. Its validity is based on the premises that (1) counts of patents and papers provide valid indicators of R&D activity in the subject areas of those patents or papers, (2) the number of times those patents or papers are cited in subsequent patents or papers provides valid indicators of the impact or importance of the cited patents and papers, and (3) the citations from papers to papers, from patents to patents and from patents to papers provide indicators of intellectual linkages between the organizations that are producing the patents and papers, and knowledge linkage between their subject areas (Narin, Olivastro & Stevens, 1994). Evaluative bibliometrics can be used to:

• Identify the infrastructure (authors, journals, institutions) of a technical domain
• Identify experts for innovation-enhancing technical workshops and review panels
• Develop site visitation strategies for assessment of prolific organizations globally
• Identify impacts (literature citations) of individuals, research units, organizations, and countries

Technology Transfer

In its modern form, nanotechnology has been around for about 15 years. It has the status of an emerging technology, and many papers/books have been written promoting its applications potential in many areas. One goal of this study was to document the medical applications potential. A second goal was to identify some of the science and infrastructure markers of potential nanotechnology medical applications, so that the science of nanotechnology/nanoscience could be accelerated to advanced levels of development. This chapter is intended to facilitate the nanotechnology transition process by identifying the significant application areas.

In 1997, a special issue of the Journal of Technology Transfer (edited by the first author) addressed accelerated conversion of science to technology (Kostoff, 1997b). Its articles emphasized the importance of potential downstream users of science becoming involved with the science development as early and broadly as possible, in order to direct the science toward potential user needs and smooth the eventual transition to actual applications. The important first step in this conversion process is to identify the science relevant to specific desired applications, and identify the associated infrastructure. Once contact is made between the on-going science and the potential user, the full technology transfer process can be initiated.

This chapter will provide such information of importance to the nanotechnology technology transfer community. It will identify the main nanotechnology health applications from today’s vantage point, as well as the related science and infrastructure.

APPROACH

The following approach describes how the main nanotechnology medical applications were identified, as well as their direct and indirect relationships.

A document fuzzy clustering analysis (Karypis, 2006), where documents are divided into groups based on their text similarities and where documents can be assigned to more than one group, was performed on the ~65,000 total nanotechnology records retrieved for the overall nanotechnology study. The resulting hierarchical taxonomy was inspected visually, and the largest subnetwork that included all medical applications (hereafter called the “health subnetwork”) was identified. A metalevel taxonomy of the health subnetwork (the highest two hierarchical levels) was generated, then a taxonomy of the elemental (lowest level) clusters was generated. These clusters were analyzed for infrastructure and technical content. In the remainder of this chapter, the medical applications will be referred to as “health.”

RESULTS

Nanotechnology Health Types

The document clustering approach used to identify the health types was a recent algorithmic upgrade of the CLUTO software package (Karypis, 2006) called fuzzy clustering, where a record could be assigned to multiple clusters. Fuzzy clustering, compared to nonfuzzy clustering where a document is assigned to one cluster only, is important for articles that have multiple thrusts, such as health applications articles in a research database.

There were 256 elemental clusters specified for the algorithm. Of these, 22 were in the health subnetwork. Of these 22 elemental clusters, 19 related directly to health. The resultant 19 clusters are of different types. Some address specific health problems (e.g., tumor treatment, sentinel lymph node cancer), some address health treatment mechanisms (e.g., drug release, drug delivery), some address biomaterial types (e.g., cells, DNA, biofilms, virus proteins, amyloid fibrils), but most are health-related phenomena and processes (e.g., peptide sequences, binding and affinity, detection, sensing). The higher level taxonomy categories will now be discussed, followed by a discussion of the elemental clusters.

Higher Level Taxonomy Categories

Highest Level Category

Table 1 contains a summary of the infrastructure, pervasive thrusts, and related science for the 19 elemental clusters. Characteristics of the highest level category (node) in the health subnetwork are summarized in the last row of Table 1. Because about 15% of the elemental clusters in the 22 cluster health subnetwork were not strictly health-related, the results on this row should be considered a good approximation. In addition, the numbers of records listed for the highest level node (and all nodes in Table 1) include counts of records from different elemental clusters (due to the fuzzy nature of the clustering), and therefore have intrinsic multiple counts.

Country Productivity

In this highest-level category in the health subnetwork, the USA appears to have a commanding lead (a ratio of about 3 to 1) over its nearest competitor (China). However, these results must be considered in context. First, in total SCI/SSCI articles, the USA had about four times as many records as China when these data were obtained. Second, for overall nanotechnology, the USA had about 25% more records than China for 2005. Third, for nanotechnology instrumentation, China actually had 25% more records than the USA. Fourth, relative to China, the USA had a commanding lead in overall biomedical articles, as our recent text mining study on China showed (Kostoff et al., 2006). When all these facts are integrated, it appears that China is placing substantial emphasis on its nanotechnology medical research relative to its overall medical research.

A more interesting comparison is between the top Asian producers (China, Japan, South Korea) and the top European producers (Germany, England, France). The Asian group produced 1,756 papers, while their European counterparts produced 1,326 papers, a 32% difference. In aggregate, the Asian group has a population of 1.48 billion and a GDP of $14 trillion-PPP/$7.7 trillion-OER, while the European group has a population of 0.19 billion and a GDP of $5.78 trillion-PPP/$6.67 trillion-OER. Thus, on a per capita population basis, the European group is almost an order of magnitude more productive of nanotechnology medical applications papers, while on a PPP GDP basis, the productivity advantage shrinks to a factor of two.

However, a brief time trend analysis shows the extent of the challenge for Europe. The following short query, which represents some of the key medical terms from Table 1 and their relation to nanotechnology, was entered into the SCI/SSCI search engine for two years: 1995 and 2006 (December 2006):
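
The Asian/European productivity comparison given under Country Productivity can be checked with simple arithmetic. The sketch below uses only the paper counts, populations, and PPP GDP figures quoted in the text; the dictionary layout and function names are illustrative assumptions. With these inputs the European per capita advantage comes out at roughly 5.9, and the PPP GDP advantage at roughly 1.8.

```python
# Figures quoted in the text for the two country groups.
GROUPS = {
    "asia": {"papers": 1756, "population_billion": 1.48, "gdp_ppp_trillion": 14.0},
    "europe": {"papers": 1326, "population_billion": 0.19, "gdp_ppp_trillion": 5.78},
}


def papers_per_billion_people(group: dict) -> float:
    """Papers produced per billion inhabitants."""
    return group["papers"] / group["population_billion"]


def papers_per_trillion_gdp(group: dict) -> float:
    """Papers produced per trillion dollars of PPP GDP."""
    return group["papers"] / group["gdp_ppp_trillion"]


asia, europe = GROUPS["asia"], GROUPS["europe"]

# Raw output: the Asian group produced about 32% more papers.
paper_ratio = asia["papers"] / europe["papers"]

# Normalized output: how many times more productive the European group is.
per_capita_ratio = papers_per_billion_people(europe) / papers_per_billion_people(asia)
per_gdp_ratio = papers_per_trillion_gdp(europe) / papers_per_trillion_gdp(asia)

print(f"Asia/Europe paper ratio:        {paper_ratio:.2f}")
print(f"Europe/Asia per capita ratio:   {per_capita_ratio:.2f}")
print(f"Europe/Asia per PPP GDP ratio:  {per_gdp_ratio:.2f}")
```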


Table 1. Central health themes and infastructure

THEME/ #RECORDS/ INSTITUTIONS JOURNALS RELATED SCIENCE
COUNTRIES
DRUG RELEASE (235 Records)
Countries: USA 58; Peoples R China 55; India 37; South Korea 31; Japan 24; Germany 23
Institutions: Natl Univ Singapore 11; Zhejiang Univ 9; Korea Res Inst Chem Technol 8; Chonbuk Natl Univ 6
Journals: Journal of Controlled Release 30; International Journal of Pharmaceutics 28; Drug Development and Industrial Pharmacy 9; Journal of Microencapsulation 8; European Journal of Pharmaceutics and Biopharmaceutics 8
Related science: polymer, hydrogel, nanoparticles, chitosan, microsphere, molecular weight, particle size, water soluble, light scattering, ethylene glycol, cross linking, differential scanning calorimetry, scanning electron microscopy, poly lactic acid, atomic force microscopy, transmission electron microscopy, dynamic light scattering, fourier transform infrared, bovine serum albumin, poly ethylene glycol, poly lactide glycolide

DRUG DELIVERY (197 Records)
Countries: USA 79; Peoples R China 36; India 31; Germany 26; Japan 23; Italy 21; France 19; South Korea 18; England 18
Institutions: Natl Univ Singapore 10; Univ Michigan 8; Zhejiang Univ 7; Postgrad Inst Med Educ & Res 7
Journals: Journal of Controlled Release 36; International Journal of Pharmaceutics 23; Journal of Drug Delivery Science and Technology 11; Biomaterials 10
Related science: nanoparticles, cancer, cancer cells, cellular uptake, size distribution, tumor cells, scanning electron microscopy, poly lactide glycolide, solid lipid nanoparticles, poly ethylene glycol, blood brain barrier, transmission electron microscopy, bovine serum albumin, confocal laser scanning microscopy

TUMOR TREATMENT (208 Records)
Countries: USA 107; Japan 24; Germany 22; Peoples R China 19; France 19; South Korea 18
Institutions: Univ Texas 7; Univ Michigan 7; Chinese Acad Sci 7; Washington Univ 6; Ohio State Univ 5
Journals: Journal of Controlled Release 11; Journal of Magnetism and Magnetic Materials 10; Pharmaceutical Research 8; Magnetic Resonance in Medicine 6; Biomaterials 6
Related science: liposomes, mice, cells, nanoparticles, tumor cells, tumor growth, contrast agents, endothelial cells, flow cytometry, cell lines, magnetic resonance imaging, scanning electron microscopy, transmission electron microscopy, blood brain barrier, superparamagnetic iron oxide nanoparticles, surface plasmon resonance, tumor bearing mice, central nervous system, tumor necrosis factor, atomic force microscopy


SENTINEL LYMPH NODE CANCER (112 Records)
Countries: USA 50; England 12; Netherlands 9; Italy 8; Germany 6; Japan 5; France 5
Institutions: Massachusetts Gen Hosp 5; Harvard Univ 4; Univ Barcelona 3; MIT 3; Hosp Clin Barcelona 3; Brigham & Womens Hosp 3; Beth Israel Deaconess Med Ctr 3; Amer Biosci Inc 3
Journals: European Journal of Nuclear Medicine and Molecular Imaging 7; Urology 4; Journal of Clinical Oncology 4
Related science: lymphoscintigraphy, metastases, lymph node, risk factors, breast cancer, sentinel node, magnetic resonance imaging, squamous cell carcinoma, scanning electron microscopy, von willebrand factor, lymph node biopsy, low density lipoprotein, high density lipoprotein, intercellular adhesion molecule

TISSUE CELLS (269 Records)
Countries: USA 92; Peoples R China 36; Japan 36; Singapore 30; Germany 26; South Korea 23; England 23
Institutions: Natl Univ Singapore 24; Tsing Hua Univ 7; MIT 7; Johns Hopkins Univ 7
Journals: Biomaterials 45; Tissue Engineering 22; Journal of Biomedical Materials Research Part A 19; ASBM6: Advanced Biomaterials VI 9
Related science: cells, tissues, collagen, scaffold, bone, osteoblast, extracellular matrix, cell adhesion, cell culture, endothelial cells, cell proliferation, cell attachment, cell morphology, calcium phosphate, osteoblast cells, bone tissue, self assembly, tissue culture, phosphatase activity, cell growth, scanning electron microscopy, atomic force microscopy, transmission electron microscopy, x-ray photoelectron spectroscopy, alkaline phosphatase activity, polymerase chain reaction, mesenchymal stem cells, polylactic glycolic acid, bone marrow stromal cells

CELLS, EMPHASIZING ADHESION (605 Records)
Countries: USA 254; Germany 86; Japan 82; Peoples R China 52; South Korea 46; Canada 30; England 28; France 27
Institutions: Univ Calif 28; Harvard Univ 20; Johns Hopkins Univ 16; Univ Tokyo 11; Natl Univ Singapore 11; CNRS 11; Chinese Acad Sci 11
Journals: Biomaterials 25; Journal of Biomedical Materials Research Part A 16; Langmuir 14; Biophysical Journal 12
Related science: cells, adhesion, apoptosis, endothelial cells, cell lines, cell surface, cell adhesion, cancer cells, epithelial cells, cell proliferation, cell growth, cell death, extracellular matrix, stem cells, tumor cells, flow cytometry, atomic force microscopy, transmission electron microscopy, scanning electron microscopy, surface plasmon resonance, smooth muscle cells, green fluorescent protein, human umbilical vein, magnetic resonance imaging, superparamagnetic iron oxide nanoparticles

BIOFILMS (83 Records)
Countries: USA 33; Japan 9; Germany 8; England 8; South Korea 6; Peoples R China 6; Canada 6
Institutions: Montana State Univ 4; Chinese Acad Sci 3; Univ Calif 3
Journals: Water Science and Technology 4; On the Convergence of Bio-Information-, et al. 4
Related science: biofilm, muscles, bacteria, biofilm formation, infection, colon, pathogen, tissue, strain, epithelial cells, pseudomonas aeruginosa, staphylococcus epidermidis, escherichia coli, scanning electron microscopy, transmission electron microscopy, atomic force microscopy, extracellular polymeric substances, confocal laser scanning, polymerase chain reaction

VIRUS PROTEINS (205 Records)
Countries: USA 228; Germany 70; Japan 66; Peoples R China 34; Italy 34; France 32; England 32
Institutions: Univ Calif 30; Osaka Univ 12; Univ Texas 11; Univ Illinois 8; Linkoping Univ 8; Chinese Acad Sci 8
Journals: Langmuir 29; Journal of Biological Chemistry 27; Biochemical and Biophysical Research Communications 14; Journal of Virology 13; Journal of Molecular Biology 12; Biochemistry 12
Related science: protein, virus, capsid, gene, sequence, escherichia coli, wild type, virus particles, capsid assembly, capsid protein, self assembly, atomic force microscopy, transmission electron microscopy, surface plasmon resonance, amino acid sequence, green fluorescent protein, tobacco mosaic virus, open reading frame, density gradient centrifugation, amino acid

PROTEIN INTERACTIONS (641 Records)
Countries: USA 247; Germany 85; Japan 65; Peoples R China 55; Italy 50; England 44; France 35
Institutions: Univ Calif 26; Chinese Acad Sci 19; Univ Illinois 12; Univ Texas 10; Max Planck Inst 9; Univ Washington 8; Tokyo Inst Technol 8; Osaka Univ 8; Linkoping Univ 8
Journals: Langmuir 37; Analytical Chemistry 17; Proc NAS-USA 16; Biomacromolecules 16
Related science: protein, binding, surface, membranes, unfolding, fluorescence, protein adsorption, mass spectrometry, protein surface, x-ray diffraction, atomic force microscopy, surface plasmon resonance, bovine serum albumin, scanning electron microscopy, transmission electron microscopy, differential scanning calorimetry, human serum albumin, green fluorescent protein, polyacrylamide gel electrophoresis, protein protein interactions, quartz crystal microbalance, fourier transform infrared, self assembled monolayer, poly ethylene glycol, tandem mass spectrometry

AMYLOID FIBRILS (114 Records)
Countries: USA 50; England 15; Japan 14; Italy 9; Germany 8; Sweden 7
Institutions: Univ Cambridge 8; Osaka Univ 6; Niddkd 5; Japan Sci & Technol Agcy 5; Fukui Univ 4; Univ Calif 4
Journals: Biochemistry 16; Biophysical Journal 8; Journal of Molecular Biology 7; Journal of Biological Chemistry 7
Related science: amyloid fibrils, protein, peptide, alzheimers disease, collagen, protofibril, prion, beta sheet structure, fibril formation, self assembly, amyloid beta, neurodegenerative diseases, collagen fibrils, amyloid deposits, thioflavin fluorescence, atomic force microscopy, transmission electron microscopy, paired helical filaments

PEPTIDE SEQUENCES (187 Records)
Countries: USA 86; Japan 28; Israel 14; Germany 14; Australia 14; Canada 12
Institutions: MIT 8; Univ Calif 5
Journals: Langmuir 13; Analytical Chemistry 10; Journal of the American Chemical Society 9; Journal of Biological Chemistry 9; Biochemistry 7; Biophysical Journal 6
Related science: peptide, binding, sequences, amino acids, peptide nanotubes, neuropeptides, structure, protein, circular dichroism, antimicrobial peptides, peptide sequence, alpha helix, molecular dynamics, model peptide, surface plasmon resonance, atomic force microscopy, amino acid residues, transmission electron microscopy, amino acid sequence, matrix laser desorption, quartz crystal microbalance, tandem mass spectrometry, differential scanning calorimetry, self assembled monolayers, solid phase peptide

BINDING AND AFFINITY (415 Records)
Countries: USA 211; Japan 51; Germany 47; England 45; France 34
Institutions: Chinese Acad Sci 13; CNRS 12; Lund Univ 11; Univ Calif 10; Univ Penn 9; Univ Oxford 9; NCI 9; Scripps Res Inst 8
Journals: Journal of Biological Chemistry 44; Biochemistry 35; Biochemical and Biophysical Research Communications 19; Journal of the American Chemical Society 16
Related science: binding, receptors, affinity, protein, interaction, surface plasmon resonance, ligand, high affinity, binding affinity, binding sites, amino acid, active site, ligand binding, binding protein, cell surface, dissociation rate, atomic force microscopy, site directed mutagenesis, amino acid residues, human immunodeficiency virus, high affinity binding, isothermal titration calorimetry, low density lipoprotein, equilibrium dissociation constants, immobilized sensor chip, expressed escherichia coli, human serum albumin, transmission electron microscopy, quartz crystal microbalance, molecular dynamics simulations, epidermal growth factor, fluorescence resonance energy

IMMUNOSENSORS (248 Records)
Countries: USA 74; Peoples R China 54; Japan 30; Germany 16; England 16; South Korea 15
Institutions: Hunan Univ 10; Univ Turku 7; Kyushu Univ 7; Sw China Normal Univ 6; Sogang Univ 6
Journals: Analytical Chemistry 22; Biosensors & Bioelectronics 17; Analytica Chimica Acta 13; Sensors and Actuators B-Chemical 12; Langmuir 12
Related science: antibodies, antigens, assays, detection, igg, immobilization, immunoassays, binding, protein, immunosensor, gold, monoclonal antibody, immunosorbent assay, antigen antibody, assay elisa, antigen binding, gold surface, gold nanoparticles, escherichia coli, antibody binding, surface plasmon resonance, enzyme linked immunosorbent assay, atomic force microscopy, quartz crystal microbalance, self assembled monolayer, bovine serum albumin, electrochemical impedance spectroscopy, transmission electron microscopy

DETECTION, EMPHASIZING SURFACE PLASMON RESONANCE (162 Records)
Countries: USA 93; Japan 47; Peoples R China 44; Germany 40; South Korea 34
Institutions: Tsing Hua Univ 11; Arizona State Univ 10; Kyushu Univ 9; Chinese Acad Sci 9; Univ Calif 8; Max Planck Inst Polymer Res 7; CNR 7; Acad Sci Czech Republ 7
Journals: Sensors and Actuators B-Chemical 31; Analytical Chemistry 28; Biosensors & Bioelectronics 18; Analytical Biochemistry 10; Analytica Chimica Acta 9
Related science: detection, sensor, chip, biosensor, mass spectrometry, liquid chromatography, real time, sensor chip, refractive index, sensor surface, gold surface, self assembled, gold nanoparticles, metal ions, surface plasmon resonance, bovine serum albumin, laser desorption ionization

BIOSENSORS (92 Records)
Countries: USA 38; Peoples R China 28; South Korea 9; Japan 9; Germany 9
Institutions: Univ Calif 6; Chinese Acad Sci 5; Univ Twente 4; Pacific Nw Natl Lab 4; Louisiana Tech Univ 4; CSIC 4
Journals: Biosensors & Bioelectronics 14; Analytical Biochemistry 6; Langmuir 5; Electroanalysis 5; Chemical Communications 5; Analytical Chemistry 5
Related science: enzymes, immobilization, glucose oxidase, enzyme activity, enzyme loading, glucose biosensor, immobilized enzyme, electrode surface, catalytic activity, free enzyme, glassy carbon electrode, steady state current, scanning electron microscopy, direct electron transfer, multi wall carbon nanotubes, surface plasmon resonance

DNA DETECTION (282 Records)
Countries: USA 166; Peoples R China 81; Japan 67; Germany 54; France 27; England 27
Institutions: Chinese Acad Sci 18; Univ Calif 17; Purdue Univ 8
Journals: Analytical Chemistry 21; Nucleic Acids Research 20; Langmuir 20; Nano Letters 16; Journal of Nanoscience and Nanotechnology 16; Biosensors & Bioelectronics 14
Related science: DNA, oligonucleotide, target DNA, DNA hybridization, gold nanoparticles, nucleic acids, single stranded DNA, surface plasmon resonance, double stranded DNA, polymerase chain reaction, atomic force microscopy, x-ray photoelectron spectroscopy, peptide nucleic acid, self assembled monolayers, quartz crystal microbalance

DNA MOLECULES (411 Records)
Countries: USA 149; Japan 66; Peoples R China 64; Germany 42; France 26; England 26
Institutions: Chinese Acad Sci 19; Russian Acad Sci 12; Univ Calif 10; Univ Tokyo 9; Osaka Univ 9; Delft Univ Technol 8
Journals: Nano Letters 20; Langmuir 18; Nucleic Acids Research 16; Proc NAS-USA 12
Related science: DNA molecules, DNA binding, DNA fragments, self assembly, bound DNA, DNA protein, DNA sequence, DNA complexes, DNA hybridization, target DNA, atomic force microscopy, double stranded DNA, surface plasmon resonance, single stranded DNA, transmission electron microscopy, calf thymus DNA, x-ray photoelectron spectroscopy, scanning electron microscopy

DNA, EMPHASIZING GENE DELIVERY AND TRANSFECTION (110 Records)
Countries: USA 66; Peoples R China 37; Japan 23; South Korea 15; Germany 15; England 13; France 10
Institutions: Chinese Acad Sci 7; Univ Calif 5; Kyoto Univ 5; Delft Univ Technol 5
Journals: Journal of Controlled Release 11; Langmuir 9; Bioconjugate Chemistry 9; Nucleic Acids Research 7
Related science: DNA, gene, transfection, chitosan, plasmid DNA, gene delivery, transfection efficiency, DNA complexes, DNA nanoparticles, gene transfer, gene therapy, gene expression, surface charge, particle size, atomic force microscopy, transmission electron microscopy, poly ethylene glycol, gene delivery systems, polymerase chain reaction, nonviral gene delivery, green fluorescent protein, plasmid DNA encoding, dynamic light scattering

CELLS, EMPHASIZING MEMBRANES AND BACTERIA (348 Records)
Countries: USA 416; Germany 128; Japan 111; Peoples R China 97; England 66; France 56
Institutions: Univ Calif 37; Harvard Univ 27; Univ Tokyo 15; Johns Hopkins Univ 14; Univ Penn 13; Natl Univ Singapore 13; Chinese Acad Sci 13
Journals: Biomaterials 39; Langmuir 25; Biophysical Journal 18; Journal of Membrane Science 17; Journal of Biomedical Materials Research Part A 17
Related science: cells, membranes, bacteria, vesicles, cytoplasm, cell wall, transmission electron microscopy, scanning electron microscopy, atomic force microscopy, green fluorescent protein, human immunodeficiency virus, confocal laser scanning microscopy, whole cell patch clamp, gram negative bacteria, surface plasmon resonance, cell wall, quantum dots, fourier transform infrared, single particle tracking, bacterial cell surface, plasma membrane, escherichia coli, bacterial cells, epithelial cells

TOT HEALTH+ (6512)
Countries: USA 2106; Peoples R China 735; Japan 696; Germany 625; England 364; France 337; South Korea 325; Italy 262; Canada 217; India 170
Institutions: Univ Calif 205; Chinese Acad Sci 153; Natl Univ Singapore 94; Osaka Univ 78; Univ Texas 68; Harvard Univ 68; Univ Illinois 62; Natl Inst Adv Ind Sci & Technol 57; Russian Acad Sci 56; Tsing Hua Univ 55; Univ Tokyo 54; CNRS 54
Journals: Langmuir 213; Analytical Chemistry 127; Biomaterials 126; Journal of Physical Chemistry B 120; Biophysical Journal 104; Journal of Biological Chemistry 102; Journal of the American Chemical Society 97; Journal of Controlled Release 96; Biochemistry 88; Proc NAS-USA 82
Related science: cells, protein, DNA, membrane, binding, drugs, fluorescence, peptides, surface, nanoparticles, detection, interaction, surface plasmon resonance, atomic force microscopy, scanning electron microscopy, transmission electron microscopy, differential scanning calorimetry, x-ray photoelectron spectroscopy, bovine serum albumin, poly ethylene glycol, single stranded DNA, double stranded DNA, green fluorescent protein, fourier transform infrared, quartz crystal microbalance, polymerase chain reaction, self assembled monolayer, drug delivery systems, magnetic resonance imaging, confocal laser scanning, dynamic light scattering, enzyme linked immunosorbent assay, resonance energy transfer, cell surface, x-ray diffraction, escherichia coli, amino acid, particle size, drug release, cell line, cell adhesion, DNA molecules, mass spectrometry, endothelial cells
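
The Asia/Europe comparisons quoted for the individual clusters can be reproduced, at least approximately, from the country counts in Table 1. The following Python sketch is illustrative only: the country counts are copied from the DRUG RELEASE and TISSUE CELLS rows, while the region sets `ASIA` and `EUROPE` and the helper names are assumptions introduced here, since the chapter does not list the exact membership it used when "aggregating top producers only in Europe and Asia".

```python
# Country counts copied from two rows of Table 1.
DRUG_RELEASE = {
    "USA": 58, "Peoples R China": 55, "India": 37,
    "South Korea": 31, "Japan": 24, "Germany": 23,
}
TISSUE_CELLS = {
    "USA": 92, "Peoples R China": 36, "Japan": 36, "Singapore": 30,
    "Germany": 26, "South Korea": 23, "England": 23,
}

# ASSUMPTION: region membership is not spelled out in the text;
# these sets are hypothetical and only cover countries listed in Table 1.
ASIA = {"Peoples R China", "India", "South Korea", "Japan", "Singapore"}
EUROPE = {"Germany", "England", "France", "Italy", "Netherlands", "Sweden"}


def region_total(counts, region):
    """Sum the record counts of the listed countries that belong to a region."""
    return sum(n for country, n in counts.items() if country in region)


def asia_europe_ratio(counts):
    """Ratio of Asian to European record counts among the listed top producers."""
    return region_total(counts, ASIA) / region_total(counts, EUROPE)


if __name__ == "__main__":
    for name, counts in [("Drug release", DRUG_RELEASE),
                         ("Tissue cells", TISSUE_CELLS)]:
        print(f"{name}: Asia/Europe = {asia_europe_ratio(counts):.2f}")
```

Under these assumptions the drug release row gives a ratio above six and the tissue cells row a ratio of roughly 2.5, in line with the cluster discussions. Because the table counts include intrinsic multiple counts, such ratios are best read as rough indicators rather than exact figures.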

(protein* or peptide* or DNA or drug*) AND (nano* NOT (NaNO3 or NaNO2 or nanomolar* or nanosecond* or nanogram* or nanomole*))

In 1995, there were 10 records from the Asia group in total, and 70 records from the Europe group in total. In 2006 so far (probably 80% of the 2006 records have been entered into the database), there were 1,094 records for the Asia group, and 750 records for the Europe group, as shown in Figure 1.

Given this growth differential in medically-related records in just a decade, the difference in median age of the population (Germany, 42.6 years; England, 39.3 years; France, 39.1 years; China, 32.7 years; South Korea, 35.2 years; Japan, 42.9 years), and the high growth rate in graduation of scientists and engineers in China and South Korea (over 50% combined from 1995-2002), the research paper production differential between Asia and Europe can only be expected to increase, probably substantially.

The next largest Asian producer of nanotechnology medical applications papers, India, should not be neglected in this equation. From about 1980

Figure 1. Comparison of medical nanotechnology publications between Europe and Asia (series: Asia, Europe)
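
The query and the record counts behind Figure 1 can be made concrete with a short Python sketch. It is illustrative only and not the SCI/SSCI search engine's implementation: the regular expressions approximate the wildcard query, the NOT clause is assumed to apply at record level (a record is dropped if it contains any excluded term), and the names `matches_query`, `COUNTS`, and `growth_factor` are hypothetical.

```python
import re

# Approximation of: (protein* or peptide* or DNA or drug*)
BIO = re.compile(r"\b(protein\w*|peptide\w*|dna|drug\w*)\b", re.IGNORECASE)
# Approximation of: nano*
NANO = re.compile(r"\bnano\w*", re.IGNORECASE)
# Approximation of the excluded terms (sodium nitrate/nitrite and unit prefixes).
EXCLUDED = re.compile(
    r"\b(nano3|nano2|nanomolar\w*|nanosecond\w*|nanogram\w*|nanomole\w*)\b",
    re.IGNORECASE,
)


def matches_query(text):
    """ASSUMPTION: record-level NOT; any excluded term rejects the record."""
    return (bool(BIO.search(text))
            and bool(NANO.search(text))
            and not EXCLUDED.search(text))


# Record counts quoted in the text (the 2006 figure is stated to be incomplete).
COUNTS = {
    "Asia": {1995: 10, 2006: 1094},
    "Europe": {1995: 70, 2006: 750},
}


def growth_factor(region):
    """Ratio of 2006 records to 1995 records for one region."""
    return COUNTS[region][2006] / COUNTS[region][1995]


if __name__ == "__main__":
    print(matches_query("Gold nanoparticles for DNA detection"))
    print(matches_query("Protein binding at nanomolar concentrations"))
    for region in COUNTS:
        print(region, round(growth_factor(region), 1))
```

With the quoted counts, the Asia group grows by a factor of about 109 between 1995 and 2006 and the Europe group by a factor of about 11, which is the growth differential the surrounding text refers to.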

to almost 2000, India’s growth rate of research publications stagnated (Kostoff, Johnson, Bowles & Dodbele, 2006d), while China’s mushroomed. In the last decade, India has started to experience a resurgence of research paper productivity. Coupled with recent statements of strong support by the Indian Prime Minister for increasing research output production (Mukhergee, 2006), India could be the “dark horse” in this race. Its population (1.1 Billion) rivals China’s, and its relatively low median age (24.9 years) reflects a large labor pool potentially available for increased emphasis on research production.

Institutional Productivity

The USA has substantial institutional representation in the top ten (California, Texas, Harvard, Illinois). These university publication numbers include all the state campuses. Thus, the University of California system includes University of California Berkeley (UCB), University of California Santa Barbara (UCSB), University of California San Francisco (UCSF), and so forth.

Leading Journals

While the leading journals have a strong chemistry component, a number of them cross disciplines among physics, chemistry, biology, and materials.

Associated Science

The science associated with the total health-type applications in the highest-level category can be divided into four major categories: instrumentation, materials, structures, phenomena. The key elements of each of these categories are as follows:

• Instrumentation: Surface plasmon resonance, atomic force microscopy, scanning electron microscopy, transmission electron microscopy, differential scanning calorimetry, x-ray photoelectron spectroscopy, Fourier transform infrared spectroscopy, quartz crystal microbalance, magnetic resonance imaging, confocal laser scanning, enzyme linked immunosorbent assay, laser scanning microscopy, x-ray diffraction, mass spectrometry.
• Materials: Protein, DNA, peptides, drugs, bovine serum albumin, poly ethylene glycol, single stranded DNA, double stranded DNA, green fluorescent protein, lipids, human serum albumin, Escherichia coli, antibodies, tissues, enzymes, genes, oligonucleotides, gold, nucleic acid.
• Structures: Cells, membranes, surfaces, nanoparticles, self-assembled monolayers, cell surfaces, endothelial cells, receptors.

• Phenomena: Fluorescence, interaction, polymerase chain reaction, dynamic light scattering, resonance energy transfer, particle size, drug release, cell adhesion, binding, affinity, gene expression, transfection efficiency.

Second Highest Level Categories

The highest level category is divided by the fuzzy clustering algorithm into two categories, with the category centered around cells, proteins, and membranes being about seven times the size (number of records) of the category centered around DNA. The larger category’s main journals (Langmuir 185, Biomaterials 120, Journal of Physical Chemistry B 112, Analytical Chemistry 108, Journal of Biological Chemistry 97, Biophysical Journal 95) focus on chemistry, physics, biology, and materials, while the smaller category’s main journals (Langmuir 30, Nano Letters 29, Nucleic Acids Research 27, Analytical Chemistry 27, Journal of the American Chemical Society 21, Journal of Nanoscience and Nanotechnology 21) focus on chemistry and nanotechnology. The only journal in common at the top is Langmuir.

The larger category’s main country performers (USA 1867, Peoples R China 620, Japan 608, Germany 561, England 323, France 301, South Korea 299) are remarkably similar to the smaller category’s main country performers (USA 273, Peoples R China 123, Japan 106, Germany 78, England 46, France 43, South Korea 33). In aggregate of these main performers, Asia outproduces Europe by about 30%.

Lower Level Taxonomy Categories

Characteristics of the lower level taxonomy categories (elemental clusters) are summarized in the rows of Table 1. There are five main groupings: cancer treatment (drug release, drug delivery, tumor treatment, sentinel lymph node cancer), cells (tissue cells, cells (emphasizing adhesion), cells (emphasizing membranes and bacteria), biofilms), proteins (protein interactions, amyloid fibrils, peptide sequences, binding and affinity), sensing and detection (immunosensors, detection (emphasizing surface plasmon resonance), biosensors), and DNA (DNA detection, DNA emphasizing gene delivery and transfection). Only one group deals with a specific disease (cancer treatment), one is functional (sensing and detection), and the other three are based on fundamental biological materials at different aggregation levels (cells, proteins, DNA).

Because of the large number of elemental clusters, only the highlights or unusual features of each will be discussed, starting from the top row. Following each discussion are representative article titles from the cluster in bold italics, to illustrate the theme more concretely.

1. Drug release: USA, China dominant. India ranks much higher in this cluster relative to its overall health types ranking. Aggregating top producers only in Europe and Asia, Asia outproduces Europe by more than a factor of six. Even though Singapore is not listed as a leading country, the National University of Singapore stands out as the institutional leader. No USA presence in leading institutions. The journals appear rather applied and focused. Materials and structures appear to be the science emphasis.
   • Physical characterization of controlled release of paclitaxel from the TAXUS(TM) Express(2TM) drug-eluting stent.
   • Potential of guar gum microspheres for target specific drug release to colon.
2. Drug delivery: USA dominant. Again, India ranks high, and National University of Singapore leads. Asia outproduces Europe by 20%. Journals are again pharmaceutical oriented, and very applied. Again, no USA presence in leading institutions. Strong cancer focus in the science.

   • Highly specific HER2-mediated cellular uptake of antibody-modified nanoparticles in tumour cells.
   • Development and characterization of biodegradable nanospheres as delivery systems of anti-ischemic adenosine derivatives.
3. Tumor treatment: USA has commanding lead. Asia outproduces Europe by fifty percent. American institutions dominate. Some physics journals along with pharmaceuticals. Laboratory research at cellular level, with magnetic physics emphasis, seems to dominate science.
   • Enhanced tumour uptake of Doxorubicin loaded poly(butyl cyanoacrylate) nanoparticles in mice bearing Dalton’s lymphoma tumour.
   • MRI after magnetic drug targeting in patients with advanced solid malignant tumors.
4. Sentinel lymph node cancer: Again, USA and USA institutions dominant. Europe outproduces Asia by an order of magnitude. Many hospitals represented. Journals applied, and clinically oriented. Cancer detection focus in science.
   • SPECT-CT for topographic mapping of sentinel lymph nodes prior to gamma probe-guided biopsy in head and neck squamous cell carcinoma.
   • Diagnostic performance of nanoparticle-enhanced magnetic resonance imaging in the diagnosis of lymph node metastases in patients with endometrial and cervical cancer.
5. Tissue cells: USA has commanding lead. Singapore surprisingly high. National University of Singapore again leader, by a wide margin. Asia outproduces Europe by factor of 2.5. Journals strongly biomaterials oriented. Science strongly focused on structure: cells, tissues, and bones.
   • Nano-fibrous scaffolds for tissue engineering.
   • Self-organization of rat cardiac cells into contractile 3-D cardiac tissue.
6. Cells, emphasizing adhesion: USA with commanding lead. Asia outproduces Europe by 30%. Strong USA university participation; also from National University of Singapore. Journals have strong biomaterials/biophysics orientation. Science strongly focused on cell growth, interactions, and death.
   • Development of a rare cell fractionation device: Application for cancer detection.
   • Nanostructured designs of biomedical materials: Applications of cell sheet engineering to functional regenerative tissues and organs.
7. Biofilms: USA dominant. Asia outproduces Europe by thirty percent. Montana State University not seen before. Very applied journals. Science strongly focused on films and infection.
   • Tooth development in a scincid lizard, Chalcides viridanus (Squamata), with particular attention to enamel formation.
   • Adherence and biofilm formation of Staphylococcus epidermidis and Mycobacterium tuberculosis on various spinal implants.
8. Virus proteins: USA with commanding lead. Europe outproduces Asia by 70%. Strong USA university representation, with University of California system dominant. Strong biochemistry journal emphasis. Strong virus research.
   • Identification of a region in the herpes simplex virus scaffolding protein required for interaction with the portal.
   • Mass spectroscopic characterization of the coronavirus infectious bronchitis virus nucleoprotein and elucidation of the role of phosphorylation in RNA binding by using surface plasmon resonance.
   • Expression of human papillomavirus type 16 L1 protein in transgenic tobacco plants.
9. Protein interactions: USA with commanding lead. Europe outproduces Asia by eighty percent. USA institutions strong. Science focused on protein binding, other surface phenomena.
   • Analysis of protein interactions on protein arrays by a wavelength interrogation-based surface plasmon resonance biosensor.
   • Biosensors: basic features and application for fatty acid-binding protein, an early plasma marker of myocardial injury.
   • A central role for protein aggregation in neurodegenerative disease; Mechanistic and structural studies of human stefins.
10. Amyloid fibrils: USA with commanding lead. Europe outproduces Asia by factor of three. Except for University of California system, USA universities not among most prolific. Biochemical/biophysical journals. Science linked to Alzheimer’s Disease and other neurodegenerative diseases.
   • Structure and function of amyloid in Alzheimer’s disease.
   • Surface plasmon resonance for the analysis of beta-amyloid interactions and fibril formation in Alzheimer’s disease research.
   • Structure and morphology of the Alzheimer’s amyloid fibril.
11. Peptide sequences: USA with commanding lead. Israel, Australia surprisingly high. Asia outproduces Europe by factor of two. MIT major institutional player, followed by University of California system. Science focused on binding, sequencing.
   • Novel electrochemical biosensing platform using self-assembled peptide nanotubes.
   • Plasma levels of AGE peptides in type 1 diabetic patients are associated with serum creatinine and not with albumin excretion rate: Possible role of AGE peptide-associated endothelial dysfunction.
   • Interactions of primary amphipathic cell penetrating peptides with model membranes: Consequences on the mechanisms of intracellular delivery of therapeutics.
12. Binding and affinity: USA with overwhelming lead, solid institutional representation. Europe outproduces Asia by factor of 2.5. Biochemistry focus. Science focused on binding, reception, and affinity.
   • Biomacromolecule surface recognition using nanoparticles.
   • Two-step mechanism of binding of apolipoprotein E to heparin.
   • Formation of viscoelastic protein layers on polymeric surfaces relevant to platelet adhesion.
13. Immunosensors: No infrastructure element dominant, as in previous cases. Asia outproduces Europe by factor of three. No USA institutional representation in upper tier. Strong use of immune system components in science.
   • Enhancement of the sensitivity of surface plasmon resonance (SPR) immunosensor for the detection of anti-GAD antibody by changing the pH for streptavidin immobilization.
   • Development of functionalized terbium fluorescent nanoparticles for antibody labeling and time-resolved fluoroimmunoassay application.

14. Detection, emphasizing surface plasmon resonance: USA with strong lead. Asia outproduces Europe by factor of three. Strong chemistry focus; some electronics. Science focused on sensors, use of gold.
   • The fabrication of protein chip based on surface plasmon resonance for detection of pathogens.
   • Intracellular monitoring of superoxide dismutase expression in an Escherichia coli fed-batch cultivation using on-line disruption with at-line surface plasmon resonance detection.
   • Surface plasmon resonance detection of endocrine disruptors using immunoprobes based on self-assembled monolayers.
15. Biosensors: USA lead; China strong second. Asia outproduces Europe by factor of five. Research focus on enzyme-based biosensors that involve enzyme immobilization.
   • A novel glucose biosensor based on the nanoscaled cobalt phthalocyanine-glucose oxidase biocomposite.
   • Multiwall carbon nanotube (MWCNT) based electrochemical biosensors for mediatorless detection of putrescine.
   • Biosensors in drug discovery and drug analysis.
16. DNA detection: USA with commanding lead. Asia outproduces Europe by forty percent. Strong USA institutional representation. Science focus is on DNA at surfaces for use in DNA biosensors.
   • A biosensor monitoring DNA hybridization based on polyaniline intercalated graphite oxide nanocomposite
   • Detection of DNA and protein molecules using an FET-type biosensor with gold as a gate metal
17. DNA molecules: USA with commanding lead. Asia outproduces Europe by thirty percent. Chinese Academy of Sciences institutional leader. Russian Academy of Sciences strong institutional presence, even though Russia not major player. Science focuses on DNA binding and DNA networks.
   • Atomic force microscopy study of the structural effects induced by echinomycin binding to DNA.
   • Impedance sensing of DNA binding drugs using gold substrates modified with gold nanoparticles.
18. DNA, emphasizing gene delivery and transfection: USA has strong lead. Asia outproduces Europe by factor of two. University of California system only USA presence in institutional leaders. Science focus on gene delivery and transfection efficiency.
   • Optical tracking of organically modified silica nanoparticles as DNA carriers: A nonviral, nanomedicine approach for gene delivery
   • Nanoparticle based systemic gene therapy for lung cancer: Molecular mechanisms and strategies to suppress nanoparticle-mediated inflammatory response
   • Calcium phosphate nanoparticles as a novel nonviral vector for efficient trasfection of DNA in cancer gene therapy
19. Cells, emphasizing membranes and bacteria: Commanding USA lead. Europe outproduces Asia by 25%. Commanding USA organizational representation, with University of California system at forefront. Biomaterials literature emphasis. Science focuses on cell membranes and bacterial adhesion.
   • Microtubule-dependent matrix metalloproteinase-2/matrix metalloproteinase-9 exocytosis: Prerequisite in human melanoma cell invasion
   • Long-term effects of HIV-1 protease inhibitors on insulin secretion and

Medical Applications of Nanotechnology in the Research Literature

insulin signaling in INS-1 beta cells
   • Early stages of HIV replication: How to hijack cellular functions for a successful infection
   • Membrane-based on-line optical analysis system for rapid detection of bacteria and spores

The USA is the leader in all 19 clusters. China took second place in seven clusters, Japan in six, Germany in four, and England in two. In terms of main institutions, the University of California system led in five clusters, the Chinese Academy of Science led in four, and, surprisingly, the National University of Singapore led in three. The National University of Singapore has a strong presence in pharmaceuticals and biomaterials, the Chinese Academy of Science has a strong presence in DNA and binding, and the University of California system has a strong presence in cells and protein interactions.

These results require further context. The four major institutions discussed are of different size, have different funding levels, and have different manpower and other resources. For example, in 2005, there were 3,399 articles and reviews in the SCI/SSCI that contained at least one author with a National University of Singapore address, and there were a total of 6,622 authors listed on these records. The corresponding numbers for the other major institutions are: Chinese Academy of Science, 14,347 records, 19,089 authors; Russian Academy of Science, 11,216 papers, 30,137 authors; University of California system, 27,954 records, 84,667 authors.

Thus, for the National University of Singapore to be the publication leader in three thrust areas requires a considerable concentration of its modest resources relative to the other major institutions.

How do Asia and Europe differ in their research thrust areas? In the nineteen clusters listed in Table 1, the leading Asian countries in aggregate led by a factor of two or more (in publications) in the following thematic areas:

• Drug release
• Tissue cells
• Peptide sequences
• Immunosensors
• Detection, emphasizing surface plasmon resonance
• Biosensors
• DNA, emphasizing gene delivery and transfection

By contrast, the European countries led in the following areas:

• Sentinel lymph node cancer
• Amyloid fibrils
• Binding and affinity

The Asian countries appear to emphasize biomaterials and detection/sensing, while the European countries appear to emphasize treatment of specific diseases.

For the technology transfer community, these results contain some important messages. First, while there are some pervasive infrastructure results throughout the elemental clusters (e.g., USA is always most productive; China, Japan, Germany, and England typically rank high), there are many individual differences. To understand the specific research infrastructure related to specific health applications, disaggregated evaluations are necessary. While the present analysis had a reasonable level of disaggregation, users interested in very specific medical applications will want to conduct much more disaggregated analyses. There are substantial differences between the overall nanotechnology health results and very specific health applications results.

Additionally, while there are some instruments that pervade the different elemental clusters, there are substantial instrumentation, material, nanostructure, and phenomenological differences among the clusters. Again, the individual cluster research can differ substantially from the overall nanotechnology health applications
average. Readers who are interested in tracking the nanotechnology health-related research for technology transfer purposes are well advised to conduct specific analyses of the above type for each application. For investors, identifying which research areas pervade multiple applications would be extremely valuable, and the same recommendations are made as for technology transfer application.

Future Research Issues

The underlying science areas (such as cells, proteins, DNA, and membranes) have always been the mainstay of biomedical research. The explosive growth in nanotechnology has enabled nanotechnology research instruments to be used in medical applications. A natural question is whether the growth in medical applications is proportional to the growth in nanotechnology research. To address that question, it is necessary to study the growth in papers on nanotechnology medical applications relative to the total publications in nanotechnology. The USA continues to be the leading country in publications. However, substantial numbers of papers are from the developing Asian countries. It would be good to study the rate at which these other countries are developing and the possibility they may catch up with the US.

Additionally, the present study focused on output quantity, and was conducted at a reasonable level of disaggregation. Future studies need to address quality and resource issues as well, using much more detailed levels of disaggregation. Larger numbers of clusters should be run, and the citation impact of institutions associated with each cluster should be obtained along with numbers of publications. To understand the production efficiency better, the resources required to generate these publications should be obtained.

Summary and Conclusion

The study has identified the main nanotechnology medical applications as well as the related science and infrastructure. These relationships will allow the potential user communities to become involved with the medical applications-related science and performers at the earliest stages, to help guide the science conversion towards specific user needs more efficiently.

The pervasive instrumentation, materials, structures, and phenomena related to the most frequently mentioned nanotechnology health applications were identified, as follows:

• Instrumentation: Surface plasmon resonance, atomic force microscopy, scanning electron microscopy, transmission electron microscopy, differential scanning calorimetry, x-ray photoelectron spectroscopy, Fourier transform infrared spectroscopy, quartz crystal microbalance, magnetic resonance imaging, confocal laser scanning, enzyme linked immunosorbent assay, laser scanning microscopy, x-ray diffraction, mass spectrometry.
• Materials: Protein, DNA, peptides, drugs, bovine serum albumin, poly ethylene glycol, single stranded DNA, double stranded DNA, green fluorescent protein, lipids, human serum albumin, Escherichia coli, antibodies, tissues, enzymes, genes, oligonucleotides, gold, nucleic acid.
• Structures: Cells, membranes, surfaces, nanoparticles, self-assembled monolayers, cell surfaces, endothelial cells, receptors.
• Phenomena: Fluorescence, interaction, polymerase chain reaction, dynamic light scattering, resonance energy transfer, particle size, drug release, cell adhesion, binding, affinity, gene expression, transfection efficiency.
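
The 2005 record counts and cluster leads reported in this chapter can be combined into a rough indicator of how concentrated each institution's output is. The sketch below is only illustrative: the ratio "clusters led per 10,000 records" is an assumption introduced here, not a metric used in the study, and the record counts cover all SCI/SSCI articles and reviews for each institution rather than nanotechnology papers alone. The Russian Academy of Science is omitted because the text gives no cluster lead count for it.

```python
# Figures quoted in this chapter for 2005: total SCI/SSCI records and
# number of the 19 thematic clusters in which each institution led.
institutions = {
    "National University of Singapore": {"records": 3399, "clusters_led": 3},
    "Chinese Academy of Science": {"records": 14347, "clusters_led": 4},
    "University of California system": {"records": 27954, "clusters_led": 5},
}


def leads_per_10k_records(records, clusters_led):
    """Clusters led per 10,000 records (illustrative ratio, not from the study)."""
    return 10_000 * clusters_led / records


concentration = {
    name: leads_per_10k_records(**figures)
    for name, figures in institutions.items()
}

# Print institutions from most to least concentrated.
for name, value in sorted(concentration.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.2f} cluster leads per 10,000 records")
```

Under this assumed ratio, the National University of Singapore comes out at roughly 8.8, against roughly 2.8 for the Chinese Academy of Science and 1.8 for the University of California system, which is consistent with the chapter's remark about concentration of modest resources.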

Moreover, a medical applications categorization constructed from visual inspection of the fuzzy clustering categories showed five thematic categories:

• Cancer treatment
• Sensing and detection
• Cells
• Proteins
• DNA

In summary, for medical applications, analysis of nineteen thematic categories obtained from fuzzy clustering of the total 2005 nanotechnology database revealed the following:

• The USA is the publication leader in total health types, and in all the thematic areas as well, most by a wide margin. China was the second most prolific in seven thematic areas, Japan in six, Germany in four, and England in two.
• The University of California system led in five clusters, the Chinese Academy of Science led in four, and the National University of Singapore led in three. The University of California and the Chinese Academy of Science were the most prolific in the nonmedical applications as well, but their orders were reversed. The National University of Singapore is a prolific contributor, especially in pharmaceuticals and biomaterials.
• The journal Langmuir contains the most articles in total health, and is in the top layer of 10 of 19 themes. The only journals in common in the top layers of nonmedical and medical applications and health are Langmuir and Journal of Physical Chemistry B.
• For total health, the key underlying science areas include cells, proteins, DNA, membranes, binding, drugs, fluorescence, peptides, nanoparticles, detection, lipids, antibodies, immobilization, tissues, receptors, enzymes, genes, drug delivery, self assembly, cell surface, detection limit, Escherichia coli, amino acid, molecular weight, particle size, real time, serum albumin, drug release, cell line, cell adhesion, DNA molecules, endothelial cells, surface plasmon resonance, atomic force microscopy, scanning electron microscopy, transmission electron microscopy, differential scanning calorimetry, x-ray photoelectron spectroscopy, bovine serum albumin, poly ethylene glycol, single stranded DNA, double stranded DNA, green fluorescent protein, Fourier transform infrared spectroscopy, quartz crystal microbalance, polymerase chain reaction, self assembled monolayer, magnetic resonance imaging, confocal laser scanning, dynamic light scattering, enzyme linked immunosorbent assay, resonance energy transfer, extracellular matrix, laser scanning microscopy, human serum albumin, and poly lactic acid.

References

Davidse, R. J. & Van Raan, A. F. J. (1997). Out of particles: Impact of CERN, DESY, and SLAC research to fields other than physics. Scientometrics, 40(2), 171-193.

Garfield, E. (1985). History of citation indexes for chemistry - A brief review. JCICS, 25(3), 170-174.

Goldman, J. A., Chu, W. W., Parker, D. S., & Goldman, R. M. (1999). Term domain distribution analysis: A data mining tool for text databases. Methods of Information in Medicine, 38, 96-101.

Gordon, M. D. & Dumais, S. (1998). Using latent semantic indexing for literature based discovery. Journal of the American Society for Information Science, 49(8), 674-685.

Greengrass, E. (1997). Information retrieval: An overview. National Security Agency. TR-R52-02-96.

Hearst, M. A. (1999). Untangling text data mining. In Proceedings of ACL 99, the 37th Annual Meeting of the Association for Computational Linguistics, University of Maryland.

Karypis, G. (2006). CLUTO - A clustering toolkit. Retrieved April 13, 2008, from http://www.cs.umn.edu/~cluto.

Kostoff, R. N., Eberhart, H. J., & Toothman, D. R. (1997a). Database tomography for information retrieval. Journal of Information Science, 23(4), 301-311.

Kostoff, R. N. (1997b). Accelerating the conversion of science to technology: Introduction and overview. Journal of Technology Transfer [Special Issue on Accelerating the Conversion of Science to Technology], 22(3).

Kostoff, R. N., Green, K. A., Toothman, D. R., & Humenik, J. A. (2000). Database tomography applied to an aircraft science and technology investment strategy. Journal of Aircraft, 37(4), 727-730.

Kostoff, R. N., Del Rio, J. A., García, E. O., Ramírez, A. M., & Humenik, J. A. (2001). Citation mining: Integrating text mining and bibliometrics for research user profiling. Journal of the American Society for Information Science and Technology, 52(13), 1148-1156.

Kostoff, R. N. (2003a). Text mining for global technology watch. In M. Drake (Ed.), Encyclopedia of library and information science (2nd ed) (Vol. 4, pp. 2789-2799). New York: Marcel Dekker, Inc.

Kostoff, R. N. (2003b). Stimulating innovation. In L. V. Shavinina (Ed.), International handbook of innovation (pp. 388-400). Oxford, UK: Elsevier Social and Behavioral Sciences.

Kostoff, R. N. (2003c). Bilateral asymmetry prediction. Medical Hypotheses, 61(2), 265-266.

Kostoff, R. N., Shlesinger, M., & Tshiteya, R. (2004a). Nonlinear dynamics roadmaps using bibliometrics and database tomography. International Journal of Bifurcation and Chaos, 14(1), 61-92.

Kostoff, R. N., Shlesinger, M., & Malpohl, G. (2004b). Fractals roadmaps using bibliometrics and database tomography. Fractals, 12(1), 1-16.

Kostoff, R. N., Stump, J. A., Johnson, D., Murday, J., Lau, C., & Tolles, W. (2005a). The structure and infrastructure of the global nanotechnology literature (DTIC Tech. Rep. No. ADA435984), Defense Technical Information Center, Fort Belvoir, VA. Retrieved April 13, 2008, from http://www.dtic.mil/

Kostoff, R. N., Murday, J., Lau, C., & Tolles, W. (2005b). The seminal literature of global nanotechnology research (DTIC Tech. Rep. No. ADA435986), Defense Technical Information Center, Fort Belvoir, VA. Retrieved April 13, 2008, from http://www.dtic.mil/

Kostoff, R. N., Stump, J. A., Johnson, D., Murday, J., Lau, C., & Tolles, W. (2006a). The structure and infrastructure of the global nanotechnology literature. Journal of Nanoparticle Research, 8(3-4), 301-321.

Kostoff, R. N., Murday, J., Lau, C., & Tolles, W. (2006b). The seminal literature of global nanotechnology research. Journal of Nanoparticle Research, 8(2), 193-213.

Kostoff, R. N. (2006c). Systematic acceleration of radical discovery and innovation in science and technology. Technological Forecasting and Social Change, 73(8), 923-936.

Kostoff, R. N., Johnson, D., Bowles, C. A., & Dodbele, S. (2006d). Assessment of India's research literature (DTIC Tech. Rep. No. ADA444625), Defense Technical Information Center, Fort Belvoir, VA. Retrieved April 13, 2008, from http://www.dtic.mil/

Kostoff, R. N., Koytcheff, R., & Lau, C. G. Y. (2007). Structure of the global nanoscience and nanotechnology research literature (DTIC Tech. Rep. No. ADA461930), Defense Technical Information Center, Fort Belvoir, VA. Retrieved April 13, 2008, from http://www.dtic.mil/

Losiewicz, P., Oard, D., & Kostoff, R. N. (2000). Textual data mining to support science and technology management. Journal of Intelligent Information Systems, 15, 99-119.

Mukherjee, D. (2006). Promote scientific research. Central Chronicle.

Narin, F. (1976). Evaluative bibliometrics: the use of publication and citation analysis in the evaluation of scientific activity (monograph). NSF C-637. National Science Foundation. Contract NSF C-627. NTIS Accession No. PB252339/AS.

Narin, F., Olivastro, D., & Stevens, K. A. (1994). Bibliometrics theory, practice and problems. Evaluation Review, 18(1), 65-76.

Schubert, A., Glanzel, W., & Braun, T. (1987). Subject field characteristic citation scores and scales for assessing research performance. Scientometrics, 12(5-6), 267-291.

SCI (2006). Certain data included herein are derived from the Science Citation Index/Social Science Citation Index prepared by the THOMSON SCIENTIFIC ®, Inc. (Thomson®), Philadelphia, Pennsylvania, USA: © Copyright THOMSON SCIENTIFIC ® 2006. All rights reserved.

SEARCH (2006). TechOasis. Norcross, GA: Search Technology Inc.

Swanson, D. R. (1986). Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspect Biol Med, 30(1), 7-18.

Swanson, D. R. & Smalheiser, N. R. (1997). An interactive system for finding complementary literatures: A stimulus to scientific discovery. Artificial Intelligence, 91(2), 183-203.

TREC (Text Retrieval Conference) (2004). Retrieved April 13, 2008, from http://trec.nist.gov/.

Viator, J. A. & Pestorius, F. M. (2001). Investigating trends in acoustics research from 1970-1999. Journal of the Acoustical Society of America, 109(5), 1779-1783 Part 1.

Zhao, Y. & Karypis, G. (2004). Empirical and theoretical comparisons of selected criterion functions for document clustering. Machine Learning, 55(3), 311-331.

Zhu, D. H. & Porter, A. L. (2002). Automated extraction and visualization of information for technological intelligence and forecasting. Technological Forecasting and Social Change, 69(5), 495-506.

Endnote

1 The views in this paper are solely those of the authors, and do not represent the views of the Department of the Navy or any of its components, or the Institute for Defense Analyses.

Chapter XII
Early Warning System for SMEs
as a Financial Risk Detector
Ali Serhan Koyuncugil
Capital Markets Board of Turkey, Turkey

Nermin Ozgulbas
Baskent University, Turkey

Abstract

This chapter introduces an early warning system for SMEs (SEWS), based on data mining, that serves as a financial risk detector. The objective of the study is to build an early warning system that takes into account both qualitative and quantitative data about the requirements of enterprises. The system is also intended to provide a practical model that is easy to understand, interpret, and apply, and that requires no theoretical background; it does so by discovering the implicit relationships in the data and identifying the effect level of every factor. Using the system, SME managers can readily obtain financial management and risk management knowledge without any prior knowledge or expertise. In other words, experts share their knowledge through an automated, data mining-based EWS.

Introduction

The enormous computers of the 1950s are now small enough to fit in your hand, and are able to assist with the organization of work and daily activities. Since the beginning of the 1980s, great amounts of data have been accumulated through the widespread use of computer databases. Information grows when it is shared; therefore, researchers mentioned information particles before the 1980s, but nowadays they talk about information dews. In other words, the most important contribution of information technology (IT) can be summarized as information accessibility. On the other hand, solving the accessibility problem caused another problem: information accuracy. Therefore, the actual problem is accurately reading information from large amounts of information.

In addition, one of the basic concerns of IT is the concept of time. In the past, time cost meant almost nothing, but today time is one of the most important factors, mostly because of multispeed processors. Consequently, the time cost of accessing accurate information has become an important factor, because data and information must be current.

Errors, subjectivity, and uncertainty in performance arising from human factors, combined with the acceleration arising from IT, have almost taken the human factor out of business processes. Intelligent systems began to take part in procedures, transactions, and processes in place of the human factor. As a result, computations once done by humans have turned into IT-based automated systems.

IT improved rapidly in the 1990s and had removed almost all borders and distances on the globe by the early 2000s. The concept of “technology” became insufficient to describe that situation. Therefore, the term “knowledge age” was used to describe it. Another concept associated with the knowledge age is the “knowledge society.” Reaching accurate, objective, and useful knowledge in an easy way became the basic requirement of the knowledge society.

Another phenomenon associated with the knowledge age is data mining. Towards the end of the 1990s, the idea of using great amounts of data strategically led to the rapid success and popularity of data mining in every area in which computers are used. Data mining is the core of the knowledge discovery process, which is mainly based on statistics, machine learning, and artificial intelligence. Generally, data mining discovers hidden and useful patterns in very large amounts of data. But it is difficult to make definitive statements about an evolving area, and data mining is surely an area that is evolving very quickly. Therefore, there is no single definition of data mining that would meet with universal approval. On the other hand, the following definition is generally acceptable: Data mining is the process of extracting previously unknown, valid and actionable information from large databases and then using the information to make crucial business decisions (Cabena, Hadjinian, Stadler, Verhees & Zanasi, 1997, p. 12).

Data mining is the most realistic method for responding to the basic requirements of the knowledge society, which are to reach accurate, objective, and useful knowledge in a simplified way. Another concept that is necessary for providing accurate, objective, and useful knowledge is “expertise.” It is impossible to provide enough expertise for the entire society, but it is possible to provide “expert knowledge” via IT.

It is possible to provide expert knowledge to nonexperts in every field, including business management and economics. However, from the business point of view, the firms that most need the information are small and medium industrial enterprises (SMEs), which are of great importance to the economy. Although SMEs have made an important contribution to the world’s rapid economic growth and fast industrialization, it is critically important to inform SME managers so that they can overcome difficulties and improve their strategies. These reasons motivated the authors of this chapter to select SMEs as an application area.

SMEs are thrown into financial distress and bankruptcy risk by financial issues. Many SMEs close because of this financial distress. These issues arise from a lack of information and from an inability to use information in the decision-making process. SMEs therefore need an early warning system that gives decision support that is easy to understand, easy to interpret, and easy to apply for the decision makers of SMEs. Consequently, the structure of the early warning system:

• Does not require expertise for the calculation and interpretation of the financial and administrative indicators
• Can realize the necessary analysis automatically
• Does not need analytical depth

Therefore, the intention is to bring the relationship between the financial and administrative variables into the open, to identify the criteria of risk, and to use the risk models for decision support. Identifying the criteria of risk by clarifying the relationship between the variables amounts to discovering knowledge from the financial and administrative variables. In this context, an automatic and estimation-oriented information discovery process coincides with the definition of data mining.

This chapter discusses ways of empowering the knowledge society from the SME viewpoint by designing an early warning system based on data mining. Using the system, SME managers can readily obtain financial management and risk management knowledge without any prior knowledge or expertise. They can become part of the knowledge society through the knowledge provided, without any complexity, by an early warning system (EWS) based on data mining. The people who work for SMEs reach expert knowledge by using the EWS. In other words, experts share their knowledge with the help of an IT-based and automated EWS. Therefore, SMEs can be part of the knowledge society.

The following section provides background about the financial problems of SMEs in some countries and early warning system applications by data mining and other methods.

Background

If one looks at the developed countries of the world, such as the USA, Japan, and Germany, or developing countries such as Thailand, Malaysia, and China, it can be seen that a dynamic and vibrant SME sector is playing a key role in the successful economic growth of these countries.

SMEs are defined as nonsubsidiary, independent firms that employ fewer than a given number of employees. This number varies across national statistical systems. The most frequent upper limit is 250 employees, as in the European Union. However, some countries set the limit at 200 employees, while the United States considers SMEs to include firms with fewer than 500 employees (OECD, http://www.oecd.org). SMEs, which are the vital drivers of the economy, are a topic of significant research interest for academics and an issue of great importance for policy makers around the globe. Governments in developing countries, as well as developed countries, have started to realize the important role played by SMEs (WIPO, 2002).

Today, globalization is an important factor that has an impact on SMEs. SMEs have to be prepared to meet the challenges of opening markets and the risks associated with them. The opening or internationalization of markets provides new opportunities for expansion and growth. On the other hand, it means intensive competition with foreign enterprises, therefore bringing threats and challenges. Particularly since the effects of globalization became visible, the financial problems of SMEs have become the subject of several studies and reports.

A review of the last five years, from the Far East countries to the European countries, identified several studies on the financial problems of SMEs.

Bukvik and Bartlett (2003) aimed to identify financial problems that prevented the expansion and development of SMEs in Slovenia, Bosnia, and Macedonia. In their study, they surveyed 200 SMEs that were active in Slovenia between 2000 and 2001. According to the study, the major financial problems of SMEs in these countries are the high cost of capital, insufficient financial cooperation, the bureaucratic processes of banks, SMEs’ lack of information on financial subjects, and delays in the collection of payments.
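
The headcount limits quoted in the Background section can be written down as a small lookup. This is a minimal sketch, assuming only the two limits the text attributes to named jurisdictions (250 for the European Union, 500 for the United States); the dictionary keys and the function name `is_sme` are hypothetical labels chosen for illustration, and the 200-employee limit is left out because the text does not name the countries that use it.

```python
# Upper employee limits quoted in the text (OECD); keys are illustrative labels.
SME_EMPLOYEE_LIMITS = {
    "EU": 250,  # most frequent upper limit, as in the European Union
    "US": 500,  # United States: firms with fewer than 500 employees
}


def is_sme(employees: int, jurisdiction: str, independent: bool = True) -> bool:
    """Return True if a nonsubsidiary, independent firm falls under the headcount limit."""
    limit = SME_EMPLOYEE_LIMITS[jurisdiction]
    return independent and employees < limit


# The same firm can be an SME under one definition and not under another.
print(is_sme(300, "EU"), is_sme(300, "US"))
```

A headcount test of this kind only captures the size criterion described in the text; any real classification would depend on the full definition used by the relevant national statistical system.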

Sormani (2005) researched the financial problems of small businesses in the UK. The study emphasized the cash flow and unpaid invoices of SMEs. According to this study, there were two main suggestions for SMEs in the UK. The first was managing cash flow for the financial health of SMEs, and the other was taking into account the time between issuing an invoice and receiving payment in order to run efficiently.

Bitszenis and Nito (2005) determined the financial problems of SMEs while evaluating the obstacles and problems encountered by entrepreneurs in Albania. The most important financial problems faced by SMEs in Albania were found to be lack of financial resources and taxation.

Sanchez and Marin (2005) analyzed the management characteristics of Spanish SMEs according to their strategic orientation and the consequences in terms of firm performance and business efficiency. The study was conducted on 1,351 Spanish SMEs. The results confirmed the expected relationship between management characteristics and performance of SMEs in Spain.

Inegbenebor (2006) focused on the role of entrepreneurs and the capacity of SMEs in Nigeria to access and utilize funds. The study sample consisted of 1,255 firms selected to represent 13 identified industrial subsectors. The results showed that the capacity of SMEs to access and utilize funds was weak in Nigeria.

Kang (2006) analyzed the role of SMEs in the Korean economy. According to the study, SMEs have been hit hard by the economic slowdown and also face some deep structural problems. The main financial problem of SMEs in Korea was heavy debt financing. Many SMEs are overburdened with debt. SMEs are also saddled with excess capacity, and they suffer from growing overseas competition. All these factors have affected the profitability of SMEs in Korea.

Nowadays, many enterprises, including SMEs, encounter financial distress, and this has motivated enterprises to monitor their financial condition periodically and regularly. Efforts to learn about firms’ financial conditions have a long history. Efforts to separate financially distressed from nondistressed enterprises started with z-scores based on the use of ratios, in the single discriminant analysis of Beaver (1996) and the multiple discriminant analysis of Altman (1968). Other important studies that used multivariable statistical models include Deakin (1972), Altman, Haldeman, and Narayanan (1977), and Taffler and Tisshaw (1977), which used multiple discriminant models; Zmijewski (1984), Zavgren (1985), Jones (1987), and Pantalone and Platt (1987), which used logit and probit models; and Meyer and Pifer (1970), which used a multiple regression model (Koyuncugil & Ozgulbas, 2006c).

Artificial neural networks were used to identify problems including financial failure and bankruptcy from the 1980s onward, and researchers such as Hamer (1983), Coats and Fant (1992), Coats and Fant (1993), Chin-Sheng et al. (1994), Klersey and Dugan (1995), Boritz et al. (1995), Tan and Dihardjo (2001), and Anandarajan et al. (2001) dealt with artificial neural networks in their research (Koyuncugil & Ozgulbas, 2006c).

The basic areas for data mining in financial studies are equities, exchange rates, estimation of enterprise bankruptcy, identification and management of financial risk, management of loans, identification of customer profiles, and the analysis of money laundering (Kovalerchuk & Vityaev, 2000). In addition, data mining was used for financial performance analysis in the studies of Eklund, Back, Vanharanta, and Visa (2003), Hoppszallern (2003), Derby (2003), Chang, Chang, Lin, and Kao (2003), Lansiluoto, Eklund, Barbro, Vanharanta, and Visa (2004), Kloptchenko, Eklund, Karlsson, Back, Vanharanta, and Visa (2004), and Magnusson, Arppe, Eklund, and Back (2005).

Koyuncugil and Ozgulbas (2006a) emphasized the financial problems of SMEs in Turkey by
identifying their financial profiles via data mining. They put forward suggestions on solutions in addition to the stock markets of SMEs, and stated that the first step in solving those problems was to identify the financial profiles of SMEs. The researchers identified the role of SMEs through data mining, using 2004 data on the operations of 135 SMEs listed on the Istanbul Stock Exchange (ISE). The study determined that the most important factor affecting the financial performance of SMEs is their financing strategy. Its basic suggestion is that SMEs that concentrate on debt financing can increase their financial performance.

Another study by Koyuncugil and Ozgulbas (2006b) on the operations of SMEs listed on the Istanbul Stock Exchange (ISE) identified a criterion of financial performance for SMEs, based on the factors that affect their financial risk and performance.

In a further study of SMEs operating in the ISE between 2000 and 2005, Koyuncugil and Ozgulbas (2006c) aimed to identify the financial factors that affected the financial failures of SMEs by using the CHAID (chi-square automatic interaction detector) decision tree algorithm. Studies by Ozgulbas and Koyuncugil (2006) and Ozgulbas, Koyuncugil, and Yilmaz (2006) on the same set of data showed that the success of the firms depended not only on the financial strength of the SMEs, but also on their scale.

However, many companies and their managers have not recognized the symptoms of oncoming financial failure and risk in their business. When symptoms or signals start occurring, managers do not know what type of action to take first or how to manage the situation. By recognizing some early warning signs of financial trouble, managers may eliminate, overcome, or at the very least sidestep those troubles and risks. Some studies about early warning systems that used data mining and other analytical methods are presented next.

Early warning systems used to examine financial failure and risk were investigated for the banking sector by Gaytan and Johnson (2002). Collard (2002) emphasized the importance of early warning systems and presented ten early warning signs that point out business failure and risk to a firm's managers. Mena (2003) discussed credit card fraud detection via data mining. Gunther and Moore (2003) aimed to develop an early warning model for monitoring the financial condition of banks. They used a statistical model that included 12 financial ratios covering the major categories of financial factors considered under the CAMELS rating system.

Jacops and Kuper (2004) presented an early warning system for six countries in Asia. They used a binomial multivariate qualitative response approach and constructed a model that calculates the probability of a financial crisis.

Apoteker and Barthelemy (2005) focused on financial crises in emerging markets. They used a newly developed nonparametric methodology for country risk signaling and constructed nine early warning signals to predict financial crises in emerging markets.

Ko and Lin (2005) introduced a modularized financial distress forecasting mechanism based on data mining.

Canbas, Onal, Duzakin, and Kilic (2006) investigated whether firms taken into the surveillance market of the Istanbul Stock Exchange were experiencing financial distress. They developed an integrated early warning model for financial distress prediction by combining principal component analysis and discriminant analysis.

Liu and Lindholm (2006) focused on financial crises that occurred around the world. They showed how the fuzzy C-means method can help to identify economic or financial crises as an early warning system.

A novel anomaly detection scheme that uses data mining to handle computer network security problems was proposed by Shyu, Chen, Sarinnapakorn, and Chang (2006).

Chan and Wong (2007) attempted to find financial stresses and to predict future financial crises for all possible scenarios. To reach this objective, they used an early warning system based on data mining to measure resilience.

Kamin, Schindler, and Samuel (2007) used early warning systems to identify the roles of domestic and external factors in emerging market crises. Several probit models of currency crises were estimated for 26 emerging market countries. These models were used to identify the separate contributions of domestic and external variables to the probabilities of crisis.

Tan and Quektuan (2007) attempted to use genetic complementary learning (GCL) as a stock market predictor and as a bank failure early warning system. Their experimental results show that GCL is a competent computational finance tool for both stock market prediction and bank failure early warning.

Securities exchange markets have early warning or surveillance systems such as Stock Watch, ASAM, and ATOMS. Stock Watch is the New York Stock Exchange's state-of-the-art computer surveillance unit, which monitors the market in NYSE-listed stocks for aberrant price and volume activity that may indicate illegal transactions. In addition, automated search and match (ASAM) is a system that researches and cross-references publicly available information on individuals, corporations, and service organizations possibly connected to a particular trading situation (www.nyse.com, 2007).

The Stock Exchange of Thailand (SET) has employed a computerized system for market surveillance. The main tool that handles all market surveillance tasks is called "automated tools for market surveillance" or ATOMS. ATOMS is aimed at monitoring securities trading activities in the SET, analyzing any unusual trading, facilitating any investigation of suspicious cases, and documenting all the tasks in the database (The Stock Exchange of Thailand, 2007).

The study most relevant to the purpose of this chapter is by Koyuncugil (2006), who designed an early warning system based on data mining for the examination of market abuse (manipulation and insider trading) in the stock market. Koyuncugil determined the success of the designed early warning system by testing it with actual data.

MAIN THRUST

The common finding of these studies is that SMEs have financial problems. The basic reasons for these problems are:

• The economic condition of the country
• Underdevelopment of money and capital markets that can provide financial sources to SMEs
• Insufficiency of financial administration and administrators

Over time, these problems cause failures and low performance, and SMEs then withdraw from the economic environment. As a result, only a portion of SMEs can continue their economic activities under difficult conditions. The important role of SMEs in economic development requires predicting their financial success, especially in order to prevent financial risk under risky conditions. Studies on predicting the financial failures of enterprises and finding the possible reasons for these failures attract the attention of administrators, investors, creditors, inspectors, partners of enterprises, academicians, and especially financial administrators.

Actually, it is impossible to solve the problems related to:

• The economic condition of the country
• Underdevelopment of money and capital markets

without country-level economic and political policies. It is possible, however, to solve the problems related to:

• Insufficiency of financial administration and administrators

The basic solution for this insufficiency is expert support. However, it is neither possible nor necessary to provide adequate experts for all SMEs. The most rational approach to solving the problem is to automate the expert knowledge in financial management.

Another difficulty that SMEs are attempting to overcome is obtaining financial resources. They would soon have to comply with, and make the necessary provisions for, the requirements set out in the Basel II standard, which is supposed to become effective from 2007.

The Basel II framework describes a more comprehensive measure and minimum standard for capital adequacy that national supervisory authorities are now working to implement through domestic rule-making and adoption procedures. Basel II is fully based on risk measures, and it seeks to improve the existing rules by aligning regulatory capital requirements more closely with the underlying risks that banks face. As a result, it is intended to be more flexible and better able to evolve with advances in markets and risk management practices (http://www.bis.org/publ/bcbsca.htm).

The required support is to provide a tool, which will:

• Provide guidance in financial management
• Provide expert knowledge in a more accurate and speedy way without experts
• Discover probable risks and solutions
• Give early warning signs
• Provide roadmaps for the prevention of financial crises and distress
• Be easy to understand, easy to implement, and easy to interpret, according to the picture given

From an information technologies viewpoint, the tool defined above can be termed an early warning system (EWS). The basic motivation behind the development of an EWS for SMEs is to solve the financial problems of SMEs in all developed and developing countries. The EWS seeks to identify the risks SMEs may face, today and in the future, and to develop or improve their ability to manage those risks. As a result, it is intended to be more flexible and better able to evolve with advances in markets and risk management practices. SMEs can therefore reach better governance and financial performance.

Furthermore, the operational logic of early warning systems is mainly based on finding unexpected and extraordinary behaviors. In this respect, Cabena et al. (1997) define data mining as the process of extracting previously unknown, valid, and actionable information from large databases and then using the information to make crucial business decisions. From that viewpoint, the definitions of EWS and data mining show an interesting similarity. Therefore, data mining can be treated as the best analytical approach for early warning systems.

EARLY WARNING SYSTEM FOR SMEs BASED ON DATA MINING

Financial Early Warning Systems

An early warning system (EWS) is a technique of analysis that is used to predict the achievement condition of enterprises and to decrease the risk of financial crisis. Applying this technique of
analysis, the condition and possible risks of an enterprise can be identified quantitatively.

The objectives of an EWS are:

• Identification of changes in the environment before they become apparent
• Identification of the speed and direction of change for projecting the future
• Identification of the importance of the proportion of change
• Determination of deviations and taking signals
• Determination of possible reactions in the direction of priority deviations
• Investigation of the factors that cause change and the interaction between these factors

However, there is no specific method for totally preventing a financial crisis in enterprises. The important point is to determine calmly the factors that cause the condition, to take corrective precautions for the long term, and to make a flexible emergency plan for a potential future crisis. In essence, the early warning system is a financial analysis technique that assesses the achievement of an enterprise relative to its industry with the help of financial ratios.

Financial early warning systems are grouped under three main categories in the literature (Kutman, 2001):

• Models for the prediction of the profits of an enterprise
• Ratio based models for the prediction of the bankruptcy/crisis of an enterprise
• Economic trend based models for the prediction of the bankruptcy/crisis of an enterprise

The models that enterprises use for early warning are mostly based on ratio analysis. Some examples of these models are given below (Oksay, 2006):

• IRIS (insurance regulatory information system)
• FAST (financial analysis tracking system)
• Neural network systems
• Discriminant models
• Rating systems
• Event history analysis
• Recursive partitioning algorithm

Definition of the Early Warning System for SMEs Based On Data Mining

Nearly all financial early warning systems are based on ratio analysis, in other words, on financial statements. Financial statements are the data sources that reflect the financial truth for an early warning system. However, ratio based models ignore administrative and structural factors, which means that the human factor is not evaluated in the decision mechanism. Therefore, adding the human factor in every part of the decision process, together with the characteristics of the decision environment, will bring the model in line with real life and provide an applicable system.

In this study, the objective is to compose an early warning system in which both qualitative and quantitative data about the requirements of enterprises are taken into consideration. Furthermore, the system targets a practical model that is easy to understand, easy to interpret, and easy to apply, and that does not require a theoretical background, by discovering the implicit relationships in the data and identifying the effect level of every factor. For this reason, the ideal method for reaching this objective is data mining, which is now frequently used in financial studies.

It is expected that the system will provide benefits to SMEs, which carry a higher proportion of financial risk than larger enterprises, and contributions to the economy and science of the country. Some of the expected contributions of the SME early warning system (SEWS) can be summarized as:

• The financial requirements and weaknesses of SMEs will be revealed
• Identification of the SMEs' financial strategy will be possible with minimum expertise in financial administration
• The financial risk levels of SMEs will be clarified
• The probability of a financial crisis will decrease
• Financial resources will be used efficiently
• Loss and gain analysis will be made
• The competitive capacity of SMEs against financial crises will increase
• Financial comfort will provide opportunities for investments, especially technological investments
• The financial improvement of SMEs will create new potential for export
• The decrease in bankruptcies and the contribution of employment to the economy will create a positive effect
• The number of new enterprises and the tax support for government will increase
• Risk factors will be identified and risk reducing strategies will be applied by SMEs

With the help of a scientifically sound study, concrete outputs will be offered to sectoral users, and ultimately the output of the study will lead to several new research studies.

Method

The main approach of the SME early warning system (SEWS) is to discover different risk levels and to identify the factors that affect those risk levels. Therefore, the SEWS should focus on segmentation methods. Within the scope of data mining methods:

• Logistic regression
• Discriminant analysis
• Cluster analysis
• Hierarchical cluster analysis
• Self organizing maps (SOM)
• Classification and regression trees (C&RT)
• Chi-square automatic interaction detector (CHAID)

can be the principal methods, in addition to other classification/segmentation methods. However, one of the basic objectives in preparing an early warning system for SMEs is to help SME administrators and decision makers, who do not have financial expertise, knowledge of data mining, or an analytic perspective, to reach results about the risk condition of their enterprises that are easy to understand, easy to interpret, and easy to apply. Therefore, decision tree algorithms, which are one of the segmentation methods, can be used because their visualization is easy to understand and easy to apply. Although several decision tree algorithms are in widespread use today, the chi-square automatic interaction detector (CHAID) differs from other decision tree algorithms in the number of branches it produces. Other widely used decision tree algorithms, such as C&RT, branch in binary, whereas CHAID reveals all the different structures in the data with its multibranched characteristic. Hence the CHAID method is used within the scope of this study.

Steps of the SEWS

The SEWS is designed similarly to the knowledge discovery in databases (KDD) process and has 6 main steps (shown in Figure 1), which are:

1. Preparation of data collection
2. Organization of data collection
3. Implementation of DM method
4. Determination of risk profiles
5. Identification of the current situation of the SME from risk profiles
6. Description of roadmap for SME

Figure 1. Data flow diagram of the SEWS

Preparation of Data Collection

Two data sets can be taken as the foundation for the SEWS:

1. Financial data obtained from balance sheets: Balance sheet items will be entered as financial data and will be used to calculate the financial indicators of the system.
2. Non-financial data: managerial, demographic, private, and structural data obtained from SMEs by way of surveys, which include the following parameters:
   • Sector
   • Legal status
   • Number of partners
   • Number of employees
   • Annual turnover
   • Annual balance sheet
   • Financing model
   • Usage of alternative financing
   • Technological infrastructure
   • Literacy level of employees
   • Literacy level of managers
   • Financial literacy level of employees
   • Financial literacy level of managers
   • Financial training need of employees
   • Financial training need of managers
   • Knowledge and ability levels of workers on financial administration
   • Financial problem domains
   • Current financial risk position of SMEs

Organization of Data Collection

1. Arrangement of balance sheet data.
   a. Calculation of the financial indicators that are shown in Table 1

   b. Reduction of variables that are repeated in different indicators, to solve the problem of collinearity/multicollinearity
   c. Imputation of missing data
   d. Solution of the outlier and extreme value problem
2. Arrangement of survey data.
   a. Imputation of missing data
   b. Solution of the outlier and extreme value problem

Implementation of DM Method

Assume that X1, X2, ..., XN-1, XN denote discrete or continuous independent (predictor) variables and that Y denotes the dependent (target) variable in the CHAID algorithm, where X1 ∈ [a1,b1], X2 ∈ [a2,b2], ..., XN ∈ [aN,bN] and Y ∈ {Poor, Good}. In the CHAID decision tree in Figure 2, "Poor" indicates poor financial performance (red bar) and "Good" indicates good financial performance (green bar).

In Figure 2 we can see that only 3 of the N variables have a statistically significant relationship with the target Y:

• X1 has the most statistically significant relation with the target Y.
• X2 has a statistically significant relation within the branch of X1 where X1 ≤ b11.
• X3 has a statistically significant relation within the branch of X1 where b11 < X1 ≤ b12.

Determination of Risk Profiles

The CHAID algorithm applies the chi-square test of independence between the target variable and the predictor variables. It starts branching from the variable that has the strongest relationship with the target, and arranges the statistically significant variables on the branches of the tree in order of the strength of their relationships. An example of a CHAID decision tree is shown in Figure 2. As can be observed from Figure 2, CHAID has multiple branches, while other decision trees branch in binary. Thus, all of the important relationships in the data can be investigated down to their subtle details. In essence, the study identifies all the different risk profiles. Here the term risk means the risk caused by the financial failures of enterprises.
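The choice of the first branching variable described above can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not the CHAID implementation used in the study: predictors are assumed to be already binned into categories, the variable names and sample values are hypothetical, and raw Pearson chi-square statistics are compared directly. A full CHAID procedure would also merge similar categories and compare adjusted p-values, so ranking by the raw statistic is only meaningful when the predictors have the same number of categories.

```python
from collections import Counter

def chi_square_statistic(xs, ys):
    """Pearson chi-square statistic of independence for two category lists."""
    n = len(xs)
    row_totals = Counter(xs)            # counts per predictor category
    col_totals = Counter(ys)            # counts per target category
    observed = Counter(zip(xs, ys))     # joint counts (missing cells count as 0)
    stat = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n
            stat += (observed[(r, c)] - expected) ** 2 / expected
    return stat

def strongest_predictor(predictors, target):
    """Return the predictor with the largest statistic, plus all scores."""
    scores = {name: chi_square_statistic(values, target)
              for name, values in predictors.items()}
    return max(scores, key=scores.get), scores

# Hypothetical, already binned sample of eight firms.
target = ["Poor", "Poor", "Poor", "Good", "Good", "Good", "Poor", "Good"]
predictors = {
    "X1": ["low", "low", "low", "high", "high", "high", "low", "high"],
    "X2": ["a", "b", "a", "b", "a", "b", "a", "b"],
}
best, scores = strongest_predictor(predictors, target)
print(best, scores)
```

In this toy sample X1 separates "Poor" from "Good" completely, so it would be chosen as the root of the tree; the same step would then be repeated inside each branch of X1.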

Table 1. Variables and their definitions

Financial Variables                          Definitions
Current Ratio                                Current Assets / Current Liabilities
Quick Ratio (Liquidity Ratio)                (Cash + Marketable Securities + Accounts Receivable) / Current Liabilities
Absolute Liquidity                           (Cash + Banks + Marketable Securities + Accounts Receivable) / Current Liabilities
Inventories to Current Assets                Total Inventories / Current Assets
Current Liabilities to Total Assets          Current Liabilities / Total Assets
Debt Ratio                                   Total Debt / Total Assets
Current Liabilities to Total Liabilities     Current Liabilities / Total Liabilities
Long Term Liabilities to Total Liabilities   Long Term Liabilities / Total Liabilities
Equity to Assets Ratio                       Total Equity / Total Assets
Current Assets Turnover Rate                 Net Revenues / Current Assets
Fixed Assets Turnover Rate                   Net Revenues / Fixed Assets
Days in Accounts Receivable                  Net Accounts Receivable / (Net Revenues / 365)
Inventories Turnover Rate                    Net Revenues / Average Inventories
Assets Turnover Rate                         Net Revenues / Assets
Equity Turnover Rate                         Net Revenues / Equity
Profit Margin                                Net Income / Total Margin
Return on Equity                             Net Income / Total Equity
Return on Assets                             Net Income / Total Assets
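
As an illustration of how the indicators in Table 1 could be calculated from balance sheet items, the following Python sketch computes a few of them. The dictionary keys and the sample figures are hypothetical and only serve the example; the formulas follow the definitions in Table 1. A ratio whose denominator is zero is returned as None so that it can be handled later together with the other missing data.

```python
def safe_div(numerator, denominator):
    """Divide, returning None when the denominator is zero."""
    return numerator / denominator if denominator else None

def financial_ratios(items):
    """Compute a subset of the Table 1 indicators from balance sheet items."""
    return {
        "current_ratio": safe_div(items["current_assets"],
                                  items["current_liabilities"]),
        "debt_ratio": safe_div(items["total_debt"], items["total_assets"]),
        "equity_to_assets": safe_div(items["total_equity"],
                                     items["total_assets"]),
        "days_in_accounts_receivable": safe_div(
            items["net_accounts_receivable"], items["net_revenues"] / 365),
        "return_on_equity": safe_div(items["net_income"],
                                     items["total_equity"]),
        "return_on_assets": safe_div(items["net_income"],
                                     items["total_assets"]),
    }

# Hypothetical figures for one SME (any currency unit).
sme = {
    "current_assets": 500.0,
    "current_liabilities": 250.0,
    "total_debt": 400.0,
    "total_assets": 1000.0,
    "total_equity": 600.0,
    "net_accounts_receivable": 73.0,
    "net_revenues": 730.0,
    "net_income": 60.0,
}
for name, value in financial_ratios(sme).items():
    print(name, value)
```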

Figure 2. CHAID decision tree

Figure 2 shows that there are six risk profiles:

• Profile B1 shows that:
   • There are n11 samples where X1 ≤ b11 and X2 ≤ b21
   • m111% have poor financial performance
   • m211% have good financial performance
• Profile B2 shows that:
   • There are n12 samples where X1 ≤ b11 and X2 > b21
   • m112% have poor financial performance
   • m212% have good financial performance
• Profile C1 shows that:
   • There are n21 samples where b11 < X1 ≤ b12 and X3 ≤ b31
   • m121% have poor financial performance
   • m221% have good financial performance
• Profile C2 shows that:
   • There are n22 samples where b11 < X1 ≤ b12 and X3 > b31
   • m122% have poor financial performance
   • m222% have good financial performance
• Profile D shows that:
   • There are n3 samples where b12 < X1 ≤ b13
   • m13% have poor financial performance
   • m23% have good financial performance
• Profile E shows that:
   • There are n4 samples where X1 > b13
   • m14% have poor financial performance
   • m24% have good financial performance

If all of the profiles are investigated separately:

• Profile B1 shows that if a firm's variables X1 and X2 have values where X1 ≤ b11 and X2 ≤ b21, the poor financial performance rate, in other words the risk rate of the firm, will be RB1 = m111.
• Profile B2 shows that if a firm's variables X1 and X2 have values where X1 ≤ b11 and X2 > b21, the poor financial performance rate, in other words the risk rate of the firm, will be RB2 = m112.
• Profile C1 shows that if a firm's variables X1 and X3 have values where b11 < X1 ≤ b12 and X3 ≤ b31, the poor financial performance rate, in other words the risk rate of the firm, will be RC1 = m121.
• Profile C2 shows that if a firm's variables X1 and X3 have values where b11 < X1 ≤ b12 and X3 > b31, the poor financial performance rate, in other words the risk rate of the firm, will be RC2 = m122.
• Profile D shows that if a firm's variable X1 has a value where b12 < X1 ≤ b13, the poor financial performance rate, in other words the risk rate of the firm, will be RD = m13.
• Profile E shows that if a firm's variable X1 has a value where X1 > b13, the poor financial performance rate, in other words the risk rate of the firm, will be RE = m14.

Identification for Current Situation of SME According to Risk Profiles

Up to this point, the study is based on the identification of risk profiles from all of the data. Using data about the past of SMEs, this part of the study defines the relationships between financial risk and the variables, and also the risk profiles.

At this stage, the risk profile to which each firm belongs is identified. This identification is realized by taking the group of variables in the risk profiles into consideration.

Every firm looks at the values of its own enterprise in the light of the statistically significant variables in the decision tree. According to Figure 2, these variables are X1, X2, and X3. The firm compares its own values of X1, X2, and X3 with those in the decision tree and can then identify its risk profile. For example, if a firm has X1 > b13, its risk profile must be Profile E.

Description of Roadmap for SME

According to Figure 2, the risk grades of the firms can easily be determined. Assume that the risk rates of the profiles are in the order E > D > C2 > C1 > B2 > B1. The best risk profile is then B1, and every firm tries to be in Profile B1. Two variables, X1 and X2, are related to Profile B1. If a firm wants to be in Profile B1, it must make arrangements so that X1 ≤ b11 and X2 ≤ b21.

An enterprise will identify a suitable roadmap after defining its risk profile. It can identify the path to a better (lower risk) profile, and the indicators that require priority improvement, in the light of the priorities of the variables in the roadmap. Furthermore, the enterprise can move to better risk profiles step by step, and at the same time can reach a targeted risk profile by improving the indicators related to that target. For example, a firm in Profile E has the biggest risk rate. The firm must first improve the variable X1 by decreasing it into the interval (b12, b13]. The firm will then be in Profile D, and so on.
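
The profile lookup and the step-by-step roadmap described above can be sketched in Python as follows. This is only an illustrative sketch: the threshold values, the example risk rates, and the function names are hypothetical placeholders, while the branching rules mirror the six profiles of Figure 2. The risk rate of a profile is taken here as the share of "Poor" firms that fall into it.

```python
# Hypothetical cut points read off a fitted CHAID tree (placeholders only).
THRESHOLDS = {"b11": 1.0, "b12": 2.0, "b13": 3.0, "b21": 0.5, "b31": 0.4}

def risk_profile(x1, x2, x3, t=THRESHOLDS):
    """Map a firm's X1, X2, X3 values to one of the six profiles of Figure 2."""
    if x1 <= t["b11"]:
        return "B1" if x2 <= t["b21"] else "B2"
    if x1 <= t["b12"]:
        return "C1" if x3 <= t["b31"] else "C2"
    if x1 <= t["b13"]:
        return "D"
    return "E"

def risk_rates(firms, t=THRESHOLDS):
    """Share of 'Poor' firms per profile, i.e. the risk rate of each profile."""
    totals, poor = {}, {}
    for firm in firms:
        p = risk_profile(firm["x1"], firm["x2"], firm["x3"], t)
        totals[p] = totals.get(p, 0) + 1
        poor[p] = poor.get(p, 0) + (firm["y"] == "Poor")
    return {p: poor[p] / totals[p] for p in totals}

def roadmap(profile, rates):
    """Profiles with a lower risk rate than `profile`, nearest step first."""
    better = [p for p in rates if rates[p] < rates[profile]]
    return sorted(better, key=lambda p: rates[p], reverse=True)

# Hypothetical risk rates in the assumed order E > D > C2 > C1 > B2 > B1.
example_rates = {"B1": 0.1, "B2": 0.2, "C1": 0.3, "C2": 0.4, "D": 0.6, "E": 0.8}
current = risk_profile(3.5, 0.2, 0.1)    # a firm with X1 > b13
print(current, roadmap(current, example_rates))
```

For a firm in Profile E the roadmap lists D first and B1 last, which corresponds to improving X1 step by step as described in the text.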

THE FUTURE VISION OF EARLY WARNING SYSTEMS BASED ON DATA MINING

Management

To provide information relating to the actions of individual officers, supervisors, and specific units or divisions, early warning management systems should be developed and implemented in every business. In deciding what information to include in its early warning system, a business should balance the need for sufficient information for the system to be comprehensive with the need for a system that is not too cumbersome to be used effectively. The system should provide supervisors and managers with both statistical and descriptive information about the functioning of the business.

Marketing

Marketing is one of the foremost areas where data mining techniques can be applied. Data mining enables an organization to sort through vast amounts of customer data to target the right customers. This is of vital importance to the marketing department of any organization. Substantial amounts of time and money can be saved if an organization knows who its customers are and is able to predict what their spending patterns will be. Potential uses of data mining in the area of marketing include:

• Customer acquisition: Marketers use data mining methods to discover attributes that can predict customer responses to offers and communications programs. The attributes of customers found to be most likely to respond are then matched to corresponding attributes appended to rented lists of noncustomers. The objective is to select only the noncustomer households most likely to respond to a new offer.
• Customer retention: Data mining helps to identify customers who contribute to the company's bottom line, but who may be likely to leave and go to a competitor. The company can then target these customers for special offers and other inducements.
• Customer abandonment: Customers who cost more than they contribute should be encouraged to take their business elsewhere. Data mining can be used to reveal whether a customer has a negative impact on the company's bottom line.
• Market basket analysis: Retailers and direct marketers can spot product affinities and develop focused promotion strategies by identifying the associations between product purchases in point-of-sale transactions.

In this context, early warning systems based on data mining have also been used for identifying dissatisfied customers, customer retention, and quality.

Fraud Detection

Fraud detection studies have now become widespread. Fraud must be dealt with across all industries, especially in sectors that are more vulnerable because many transactions are made, such as health care, retail, credit card services, and telecommunications. The pioneers in the use of data mining techniques to prevent fraud were the telephone companies and insurance companies, with banks following close behind. Fraud can result in a business losing substantial amounts of money. Being able to protect a business from the chance of fraud is an important concern for an organization, and data mining can help.

Furthermore, early warning systems for detecting fraudulent actions can build models on fraudulent (or potentially fraudulent) behavior observed in the past and then use data mining to identify behavior of a similar nature.

Manipulation and Insider Trading Detection

Manipulation is intrinsically about making market prices move away from their fair values; manipulators reduce market efficiency. Insider trading is any form of trading based on information that is relevant to the fundamental value of a company, but that is not publicly available. Insider trading will therefore by definition decrease market efficiency. Insider trading is often equated with market manipulation. Market manipulation, by contrast, takes place whenever nonpublic information is used to push the price of a stock away from its fundamental value. Again, by definition, market manipulation will decrease market efficiency.

Detection of insider trading and manipulation is widely treated as an important function of securities regulation. Early warning systems based on data mining will detect situations that could pose a threat of manipulation or abusive practices.

Health

Today, nearly all processes concerning patients and hospitals are carried out with computers. Unfortunately, data mining applications have not become widespread in the health sector, because most health and hospital data are not stored according to data warehouse logic. Potential uses of data mining in the health sector include the following components:

• Provider profiling: Analyze physician practice patterns by measuring clinical, quality, customer satisfaction, and economic indicators. Conduct comparative analysis to identify performance best practices.
• Clinical decision support: Measure and view clinical performance across multiple perspectives to optimize resource utilization, cost effectiveness, pathway development, and evidence-based decision-making.
• Disease/condition management: Use predictive modeling techniques to identify high-risk patients and to proactively intervene and optimize care across populations.
• Benchmarking/quality reporting: Perform the necessary data management and analysis to support internal and external comparisons and reporting requirements.
• Clinical research analysis: Support the conduct of clinical research and outcomes analysis to generate new knowledge and to optimize clinical care.
• Patient safety/error reduction: Utilize data mining approaches to uncover trends and patterns in clinical errors; identify and investigate key drivers of variation across care settings.

At the same time, early warning systems based on data mining are widely used in fighting communicable disease and in determining pioneer variables and clinical evidence. Early warning systems have been used in the event of epidemics such as the global spread of the SARS virus or malaria, and for the early detection of chronic diseases. Early warning systems could also alert health care officials to possible bioterrorist attacks.

Risk Management

Risk management covers not only risks involving insurance, but also business risks from competitive threat, poor product quality, and customer attrition. Customer attrition, the loss of customers, is an increasing problem, and data mining is used in the finance, retail, and telecommunications industries to help predict the possible loss of customers.

The key for an early warning system is to identify and manage strategic risk and not to ignore it. From this perspective, an early warning system may involve three components:

• Risk identification: What are the potential market and industry developments to which a company would be vulnerable?
• Risk monitoring: What movement exists from competitors or in the business landscape that might indicate these factors are (or will soon be) in play?
• Management action: Are executives kept aware of risk dynamics, and are they equipped to launch a swift and aggressive response before their organization is harmed?

Economic Crisis

The risks of financial turmoil and economic instability associated with currency crises have called attention to the importance of monitoring fragility in the foreign exchange market or detecting signs of weakness in the market that may develop into crises. Typically, decision-makers would like to detect the symptoms of a crisis at an early stage so as to adopt preemptive measures. While forecasting the timing of currency crises with a high degree of accuracy remains a difficult task, decision-makers need to develop and improve upon an early warning system that monitors leading indicators of whether the economy is heading toward a crisis situation.

Social Risk Detection

Managing social risk means extending the traditional framework of social policy to non-market-based social protection, whose three primary strategies are prevention, mitigation, and coping. Nowadays, it is well understood that social unrest rises in parallel with poverty, and that assisting individuals, households and communities in raising their living standards above the poverty level will harmonize the global economy and strengthen social security.

According to the World Bank, the degree of social risk usually varies from idiosyncratic (micro) or regional covariant (meso) to nation-wide covariant (macro). An early warning system based on data mining should give signs about natural risks (rainfall, landslides, volcanic eruptions, earthquakes, floods, drought, tornadoes); lifecycle risks (illness, injury, disability, hunger, food poisoning, pandemics, old age and death); social risks (crime, domestic violence, drug addiction, terrorism, gangs, civil strife, war, social upheaval, child abuse); economic risks (unemployment, harvest failure, resettlement, financial or currency crisis, market trading shocks); administrative and political risks (ethnic discrimination, ethnic conflict, riots, chemical and biological mass destruction, administratively induced accidents and disasters, politically induced malfunction of social programs, coups); and environmental risks (pollution, deforestation, nuclear disasters, soil salinity, acid rain, global warming).

CONCLUSION

Competitive conditions are an increasing threat to enterprises day by day. Given the importance of SMEs for national economies, the economic fragility of SMEs means, in other words, the economic fragility of countries. Investigation of SMEs shows that financial administration is one of their biggest problems. Finding practical solutions to these problems will help not only SMEs but also the economies of countries. Therefore, having information about their financial risk, monitoring it, and knowing the roadmap required to reduce it are very important if SMEs are to take the required precautions.

However, the main obstacle that keeps SMEs from taking the required precautions is the inadequacy of their administrators, which stems mostly from insufficient financial knowledge. The most practical way of closing this knowledge gap is to design a tool which will
present expert knowledge to nonexperts. One of the most important contributions of the knowledge age is the chance to share knowledge in different ways via IT.

There are many ways of empowering the knowledge society. One of them is to make knowledge products a part of society's daily or working life. Introducing such facilities into working life speeds up the transition to a knowledge society. Showing society the benefits of the Information Age makes people aware of it and finally makes them part of the knowledge society.

The main motivation of this chapter was shaped in light of the picture drawn in the earlier sections: designing a useful tool for SMEs that makes their administrators aware of the benefits of becoming part of the knowledge society and, in addition, replaces their accustomed working processes with new habits. The best way to introduce new elements into their usual working process is to show them the way and to strengthen them against their weaknesses. In this context, when the lack of financial management knowledge is taken into consideration, the most appropriate tool is one that acts as a financial advisor. Moreover, the power of knowledge should be felt to be greater than that of financial advice alone. Therefore, the designed system warns SMEs by taking the differences among SMEs into consideration. In a way, this process is treated as knowledge discovery by data mining.

Furthermore, data mining is the reflection of information technologies in the area of strategic decision support systems. A system based on data mining can be developed to find solutions for financial administration, one of the most suitable application areas for SMEs, which are a vital part of the economy.

In this chapter, an easy-to-understand and easy-to-use system is designed to observe the financial risk condition of SMEs and to provide a financial risk reduction system for them. The system that determines financial risk also provides a roadmap for risk reduction and gives SMEs the opportunity to be proactive.

SEWS is an early warning system that puts the administration of financial risk, a subject that requires expertise, in the hands of individuals who do not have expert knowledge. SEWS offers a prototype for a data mining based early warning system for every area that requires strategic decision making and proactive action.

It is further expected that administrators of SMEs will easily be able to use the most up-to-date IT methods, the most complicated algorithms and the deepest expert knowledge via SEWS. In addition, this tool will provide sustainability for their work, and ultimately they will become efficient members of the emerging knowledge society.
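
The idea of a tool that turns financial data into warning signals a non-expert can read may be illustrated with a short sketch. The code below is not the SEWS model, which is not reproduced in this section. It is a minimal, rule-based illustration written under stated assumptions: the class FirmFinancials, the three ratios, the THRESHOLDS table and the risk levels are all hypothetical, and a data mining based system would derive its risk profiles from data rather than from hand-set limits.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FirmFinancials:
    """Balance sheet and income figures for one SME (hypothetical fields)."""
    current_assets: float
    current_liabilities: float
    total_debt: float
    total_assets: float
    net_profit: float
    net_sales: float


# Hypothetical thresholds, chosen only for illustration. A data mining based
# system would learn its risk profiles from data instead of fixing them by hand.
THRESHOLDS = {
    "current_ratio": ("below", 1.0),   # liquidity warning
    "debt_ratio": ("above", 0.7),      # leverage warning
    "profit_margin": ("below", 0.0),   # profitability warning
}


def financial_ratios(firm: FirmFinancials) -> Dict[str, float]:
    """Compute three common financial ratios (denominators must be non-zero)."""
    return {
        "current_ratio": firm.current_assets / firm.current_liabilities,
        "debt_ratio": firm.total_debt / firm.total_assets,
        "profit_margin": firm.net_profit / firm.net_sales,
    }


def warning_signals(firm: FirmFinancials) -> List[str]:
    """Return the names of the ratios that breach their threshold."""
    signals = []
    for name, value in financial_ratios(firm).items():
        direction, limit = THRESHOLDS[name]
        breached = value < limit if direction == "below" else value > limit
        if breached:
            signals.append(name)
    return signals


def risk_level(firm: FirmFinancials) -> str:
    """Translate the number of warning signals into a plain-language level."""
    count = len(warning_signals(firm))
    if count == 0:
        return "low"
    if count == 1:
        return "moderate"
    return "high"


if __name__ == "__main__":
    distressed = FirmFinancials(50, 100, 80, 100, -5, 200)
    print(warning_signals(distressed), risk_level(distressed))
```

The point of the design is the last step: the administrator sees a list of named signals and a single risk level instead of raw ratios, which is the kind of translation from expert to non-expert knowledge that the conclusion describes.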

Chapter XIII
What Role is “Business Intelligence” Playing in Developing Countries?
A Picture of Brazilian Companies

Maira Petrini
Fundação Getulio Vargas, Brazil

Marlei Pozzebon
HEC Montreal, Canada

ABSTRACT

Constant technological innovation and increasing competitiveness make the management of information
a considerable challenge, requiring decision-making processes built on reliable and timely information
from internal and external sources. Although available information increases, this does not mean that
people automatically derive value from it. After years of significant investment to establish a technological
platform that supports all business processes and strengthens the operational structure’s efficiency, most
organizations are supposed to have reached a point where the implementation of information technol-
ogy (IT) solutions for strategic purposes becomes possible and necessary. This explains the emergence
of “business intelligence” (BI), a response to information needs for decision-making through intensive
IT use. This chapter looks at BI projects in developing countries—specifically, in Brazil. If the manage-
ment of IT is a challenge for companies in developed countries, what can be said about organizations
struggling in unstable contexts such as those often prevailing in developing countries?

INTRODUCTION

The final decades of the 20th century and the beginning of the 21st have been marked by a staggering proliferation of information and communication technologies throughout the industrialized world (Steinmueller, 2001). Not only do globalization trends bring a turbulent and most often unequal competitive environment, they also propagate waves of “managerial imperatives” (such as total quality, reengineering and integrated systems) that exert tremendous pressure on organizations wanting not only to survive, but to succeed. In addition to performance and effectiveness, global organizations are asked to display ethical, social and environmental responsibility. This entire context makes the task of managing information a formidable challenge.

Information management is seen as one of the biggest challenges characterizing today’s corporate context. A combination of constant technological innovation and increasing competitiveness makes the management of information a difficult task, one which requires decision-making processes that are built on reliable and timely information gathered from internal and external sources. Although the volume of information available is increasing, this does not automatically mean that people are able to derive value from it (Burn & Loch, 2001). In the IT field, after years of significant investments to create technological platforms that support all business processes (processes that are “reengineered” and “integrated”) and that strengthen the efficiency of the operational structure (after undergoing “quality” programs), organizations are supposed to have reached a point where the implementation of IT solutions for strategic decision-making processes becomes possible and necessary. This context explains the emergence of the area generally known as “business intelligence” (BI), seen as an answer to current needs in terms of information for strategic decision-making through intensive use of information technology (IT).

This perception of IT as a strategic resource is not exclusive to developed countries. IT is expected to play a key role in developing countries as well. Because IT offers significant potential benefits for socioeconomic development, the likely gains in efficiency of production and services are at least as relevant in developing countries as in advanced economies (Avgerou, 2002). The possibility of technology transfer is seen as an opportunity for organizations in developing countries to bypass stages of growth in their programs for industrialization and advancement (Steinmueller, 2001). However, very often the resulting IT-based solutions these companies deploy have had little impact in terms of the goals they were intended to reach (Sahay & Avgerou, 2002). One can argue that this is partly due to the fact that IT solutions developed in certain contexts (the “developed” world, for instance) are not necessarily translated beneficially into other contexts, such as those of countries considered “in development.” Such considerations have motivated the authors to put forward a research project aiming to investigate the status and role, if any, of BI projects in the context of developing countries, more specifically, in Brazil.

The conventional definition of BI refers to the consolidation and analysis of internal data (e.g., transactional POS (point of sales system) data) and/or external data (e.g., purchased consumer demographics) for the purpose of effective decision-making. Several reasons for BI’s relevance are generally put forward. Historically (and continuing today in the vast majority of firms), companies have spent too much time closing their books and preparing data and financial reports, and too little time on analysis and review. This causes a gap between analysis and action (decision-making) (Rasmussen, Goldy & Solli, 2002). In addition, a BI initiative includes objectives like creating a vision for the organization, coaching the organization to set realistic goals, and supporting optimal decision-making. Although the current push, promoted by IT vendors arguing
for the importance of BI applications, may be seen as simply one more IT-driven management fashion, it is difficult to deny the benefits of deep and meaningful insights that easy and rapid access to relevant and consolidated data can provide to all organizational members, particularly decision makers.

Within a broad enquiry, “What role is BI playing in developing countries?”, two specific research questions are explored in this chapter. First, what approaches, models or frameworks have been adopted to implement BI projects in Brazilian companies? The purpose is to determine whether those approaches, models or frameworks are tailored for particularities and the contextually situated business strategy of each company, or if they are “standard” and imported from “developed” contexts. In addition, the authors want to verify whether the temporal dimension affects the degree of sophistication of a firm’s approach, that is, if more mature projects have improved their methodologies and use of performance indicators.

Second, what is the perceived “value” of BI to strategic management of Brazilian companies? The purpose here is to analyze: what type of information is being considered for incorporation by BI systems; whether they are formal or informal in nature; whether they are gathered from internal or external sources; whether there is a trend that favors some areas, like finance or marketing, over others, or if there is a concern with maintaining multiple perspectives; who in the firms is using BI systems, and so forth.

Considering that information technology use takes place within a context of “globalization,” and being aware that companies that participate in such a globalized process do not compete under equal conditions, the hypothesis is that BI applications can help firms in developing countries to improve their competitive advantage. Exploring these questions and discussing them with Brazilian entrepreneurs, the purpose is two-pronged: to sketch the nature and quantity of BI projects implemented and used by Brazilian companies, and to indicate their perceived “value” in terms of gaining competitive advantage in a globalized world.

BACKGROUND

IT and Developing Countries

It is often assumed that the impact and implementation of IT will be uniform, with little regard to particular social or cultural contexts. Drawing on experience and research in different parts of the world, including Europe and Latin America, Avgerou (2002) holds a different view. She developed a conceptual approach to account for the organizational diversity in which IT innovation takes place, showing how the processes of IT innovation and organizational change reflect local aspirations, concerns and action, as well as the multiple institutional influences of globalization.

Such a perspective raises issues about whether IT implementation can be handled similarly in developing and industrialized countries. For example, there are special requirements that should be taken into consideration by ISD (information systems development) methodologies in Africa (Mursu, Soriyan, Olufokunbi & Korpela, 2000). These special requirements are based on the local socioeconomic conditions as well as on wider sociopolitical issues including sustainability, affordability and community identity. Although these issues are also relevant to industrialized countries, they are more critical for developing countries and are not sufficiently addressed by existing ISD methodologies.

Indeed, a number of estimates suggest that a significant majority of IS projects in developing countries fail in some way. Why should this be? According to Heeks (2002), central to developing countries’ IS success and failure is the amount of change between “where we are now” and
“where the information system wants to get us.” The former will be represented by the current reality of the particular context (part of which may encompass subjective perceptions of reality). The latter will be represented by the model or conceptions, requirements and assumptions that have been incorporated into the new information system’s design. “Design conceptions” derive largely from the worldview of the stakeholders who dominate the IS design process. Putting this a little more precisely, it can be said that the likelihood of success or failure depends on the size of the gap that exists between “current realities” and “design conceptions” of the information system (Heeks, 2002).

Those gaps arise especially when designs and dominant design stakeholders are remote (physically or symbolically) from the context of IS implementation and use. This can occur in a number of ways, but approaches to IS projects in developing countries are particularly dominated by the mechanistic transfer of Northern designs to Southern realities (Heeks, 2002). An example of country-context gaps can be drawn from an experience in the Philippines, where an aid-funded project to introduce a field health information system was designed according to a Northern model that assumed the presence of “skilled” programmers, “skilled” project managers, a “sound” technological infrastructure, and a need for information outputs like those used in an American health care organization (Heeks, 2002). In reality, none of these elements was present in the Philippine context, and the IS project failed.

Globalization is a contradictory process, implying increased interconnectedness of local actors along with “globalism” in the form of increased trans-national uniformity (Beck, 2000). IT is a powerful tool which, if well adapted, can help countries promote their development (Meier, 2000). Implementing technologies across locations represents a huge challenge as “global” principles and multiple choices have to be translated into “local” contexts and requirements (Williams, 1997). People need to improve their capacity to address contextual characteristics and particular requirements in order to better implement and manage an IT application conceived elsewhere (Avgerou, 2002).

Recent studies in IS show the importance of the local context, particularly the importance of adapting global practices based on IT when implementing them in developing countries (Pozzebon, 2003). However, the nature of these adaptations and the factors that create them are poorly understood (O’Bada, 2002). It is believed that an emergent and important role for IT research is to study particular individuals, groups, organizations or societies in detail, and in context. In this way, studies of IS projects from all parts of the world might form the basis for comparisons and inferences from a global viewpoint (Avgerou, 2002). In this vein, this work on BI in developing countries seeks to contribute to advancing such knowledge of local/global gaps in IS implementation and use. By outlining the use of a particular type of IT application (namely, BI) by particular firms (Brazilian companies) and the particularities of these Brazilian projects, the authors hope to improve knowledge of this important area.

Business Intelligence

The literature review of BI reveals few studies. Most of the articles are conceptual. What’s more, throughout the literature one encounters the traditional “separation” between technical and managerial aspects, outlining two broad patterns (Table 1).

Table 1. Two approaches to BI

Managerial Approach
    Main focus: Focus on the process of gathering data from internal and external sources and of analyzing it in order to generate relevant information
    References: (Kalakota & Robinson, 2001; Liautaud, 2000; Schonberg, Cofino, Hoch, Podlaseck & Spraragen, 2000; Vitt, Luckevich & Misner, 2002)

Technological Approach
    Main focus: Focus on the technological tools that support the process
    References: (Dhar & Stein, 2006; Giovinazzo, 2002; Hackathorn, 1998; Kudyba & Hoptroff, 2001; Scoggins, 1999; Watson, Goodhue, & Wixon, 2002)

The technological approach, which prevails in most studies, presents BI as a set of tools that support the storage and analysis of information. This encompasses a broad category of applications and technologies for gathering, storing, analyzing and providing access to data to help users in the enterprise make better business decisions. Those BI tools include decision support systems, query and reporting, online analytical processing (OLAP), statistical analysis, forecasting and data mining.

The focus is not on the process itself but on the technologies that allow the recording, recovery, manipulation and analysis of information. For instance, Kudyba and Hoptroff (2001) conceive of BI as data warehousing: technology allows users to extract data (demographic and transactional) in structured reports that can be distributed within companies through Intranets. Having determined that some organizations get greater returns on implementation of data warehousing than others, Watson, Goodhue, and Wixon (2002) developed research showing how data warehousing can change an organization, what its impact on the organization is, and how such impacts can be quantified and measured. Sophisticated use of warehoused data occurs when advanced data mining techniques are applied to change data into information (Scoggins, 1999).

Data mining is the utilization of mathematical and statistical applications that process and analyze data. Mathematics refers to equations or algorithms that process data to discover patterns and relationships among variables. Statistics generally shed light on the robustness and validity of the relationships that exist in the data-mining model. Leading methods of data mining include regression, segmentation classification, neural networks, clustering, and affinity analysis.

The synergy created between data warehousing and data mining allows knowledge seekers to leverage their massive data assets, thus improving the quality and effectiveness of their decisions. The growing requirements for data mining and real time analysis of information will be a driving force in the development of new data warehouse architectures and methods and, conversely, the development of new data mining methods and applications (Kudyba & Hoptroff, 2001). In this vein, Hackathorn (1998) approaches the convergence of technologies of data warehousing, data mining, hypertext analysis and Web information resources as a major challenge in creating architecture for all these technologies in an organizational BI platform.

In short, BI is a wide set of tools and applications for collection, consolidation, analysis and dissemination aiming to improve the decision-making process. The components of BI that focus on collection and consolidation can involve data management software to access data variables, and extract, transform, and load (ETL) tools that also enhance data access and storage in a data warehouse or data mart. In the analysis and distribution phases, more and more different products are launched and integrated with attention paid to the different uses of the information. These products can include: creation of reports, fine-tuned dashboards containing customized performance indicators, visually rich presentations that use gauges, maps, charts and other graphical elements to juxtapose multiple results; generation of OLAP cubes; and data mining software that reveals information
hidden within valuable data assets through use of advanced mathematical and statistical techniques, making it possible to uncover veins of surprising, golden insights in a mountain of factual data. Figure 1 proposes an overview of BI architecture, distributing each different technology and application in terms of its main contribution in each step in the BI process.

The managerial approach sees BI as a process in which data from inside and outside the company are integrated in order to generate information relevant to the decision-making process. The role of BI here is related to the whole informational environment and process by which operational data gathered from transactional systems and external sources can be analyzed to reveal the “strategic” business dimensions.

From this perspective emerge concepts such as the “intelligent company”: one that uses BI to make faster and smarter decisions than its competitors (Liautaud, 2000). Put simply, “intelligence” entails the distillation of a huge volume of data into knowledge through a process of filtering, analyzing and reporting information (Kalakota & Robinson, 2001). The explanation of how companies acquire “intelligence” would lie in the data-information-intelligence transformation. Traditional wisdom emerges here: data is raw and mirrors the operations and daily transactions of a company; information is the data that has passed through filtering and aggregation processes and acquired a certain level of contextual meaning; intelligence elevates the information to the highest level, as the result of a complete understanding of actions, contexts and choices.

Both approaches, technical and managerial, rely on an objectivist and positivist view that “strategic decisions based on accurate and usable information lead to an intelligent company.” All the subjectivism inherent in social interactions is set aside, and cultural and political issues are not raised. In addition, this literature requires some facility with managerial and IT-driven “language games,” where the use of buzzwords (like data warehousing, data mining and OLAP) and jargon (like “intelligent enterprise” and “strategic dimensions”) is ubiquitous.

Whether the reviewed studies are managerial or technological, they share two common ideas: (1) the core of BI (process or tool) is information gathering, analysis and use, and (2) the goal is to support the strategic decision-making process. Taking into account the relative scarcity of literature, the authors looked for other areas that could help in reaching a more comprehensive understanding of BI. They revisited three distinct but interrelated areas: information planning, balanced scorecard and competitive intelligence.

Figure 1. BI architecture (authors’ proposal)
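
As a toy illustration of the first step of the data-information-intelligence transformation described in this section, the sketch below filters and aggregates a handful of raw records. The records, field names and the function name are hypothetical and do not come from the studies reviewed; the sketch covers only the mechanical "data to information" step. The further step to "intelligence" is deliberately left out, since the literature describes it as depending on an understanding of actions, contexts and choices.

```python
from collections import defaultdict

# Hypothetical raw transactional records ("data"): one entry per sale.
transactions = [
    {"region": "South", "amount": 120.0, "valid": True},
    {"region": "South", "amount": 80.0, "valid": True},
    {"region": "North", "amount": 50.0, "valid": False},  # e.g., a cancelled sale
    {"region": "North", "amount": 200.0, "valid": True},
]


def to_information(records):
    """Filter out invalid records, then aggregate amounts by region."""
    totals = defaultdict(float)
    for record in records:
        if record["valid"]:  # filtering step
            totals[record["region"]] += record["amount"]  # aggregation step
    return dict(totals)


if __name__ == "__main__":
    for region, total in sorted(to_information(transactions).items()):
        print(f"{region}: {total:.2f}")
```

The aggregated totals carry more contextual meaning than the individual records, but interpreting them still requires knowledge of the business setting.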

The Contribution of Information Planning Literature: Limited and Strategic Information, Collectively Identified

An important question that has been widely debated and must be considered in an extended definition of BI involves the relationship between strategic planning and IT. The harmony or alignment of organizational strategy and IT strategy seems to be increasingly identified as a key factor in the success or failure of IT implementations, especially for BI projects. A survey of 67 IT executives on three different continents has shown that they perceive the alignment between IT and corporate objectives as the most important task (Van Der Zee & De Jong, 1999).

This means that the chances of success in the application of any technology are directly related to how it is articulated in terms of organizational strategy and of the characteristics of each industry. Reich and Benbasat (2000) define the alignment of IT and organizational strategy as “the level in which IT mission, objectives and plans support and are supported by the mission, objectives and business plans” (p. 82). When talking about the information planning phase, they emphasize the importance of identifying limited but strategic information (Reich & Benbasat, 2000). Eisenhardt and Sull (2001) also reinforce the need for only limited but direct information and rules for establishing organizational strategies. Such information and rules should clearly indicate how processes are executed and what the business focus is.

According to Connelly, McNeill, and Mosimann (1998), relevant information for decision-making is likely to already exist within the company or to be clearly defined in managers’ minds. They also consider that the information most valuable for decision-making in a company is likely to be concentrated in a relatively small number of points (sweet spots) that already exist in the information that flows along the value chain of organizations, categorized according to the area to which these spots refer, and which should be monitored. In order to make the decision-making process more effective, they suggest that the information should be presented in a manner compatible with the managers’ way of thinking, respecting the corollary that different managers analyze information from different points of view.

In short, authors concerned with information planning assume that information relevant to decision-making is limited but strategic and that it should be collectively identified, but that often it is already identified within the company or clearly defined in managers’ minds. This sets up a paradox: the need for limited, thus standardized, information is set against respect for individual decision-making styles and points of view, thus personalized information. Achieving such a balance between standardization and customization is one of the biggest challenges in contemporary IS projects, particularly for BI.

The Contribution of the Balanced Scorecard Approach: Multidimensional Information

Additional support for this study of BI has been found in the balanced scorecard approach, as it associates indicators and measures with the monitoring of the company’s strategic objectives (Kaplan, 1996). The concept of balanced scorecard encompasses a set of measurements that provide high-level executives with a quick and understandable view of the business (Kaplan & Norton, 1992). Its development was motivated by dissatisfaction with traditional performance measurements that were concerned only with financial metrics and that focused on the past rather than the future. As a result, financial measures are complemented with measurements related to internal process, clients and organizational learning perspectives. Niven (2002) explored the limitations of exclusive reliance on financial

measures of performance and explained how the balanced scorecard could overcome them.

Balanced scorecard, like value-based management, is a component of a broader concept: strategic performance measurement systems (SPMS). SPMS are described as powerful tools for executing strategy. Value-based management is the term given to a process used to determine the drivers of a particular strategy, to understand how those drivers link to value creation, and then to break those drivers down into steps for action and activities that can be pushed throughout an organization all the way to the shop floor. Value-based management should not be confused with the actual design of strategy; it represents a vehicle and process for strategy execution translated into the specific value drivers of that particular organization (Frigo, 2002).

The distinctive feature of balanced scorecards is that they are designed to present managers with financial and nonfinancial measures spanning different perspectives which, in combination, provide a way of translating strategy into a coherent set of performance measures (Chenhall, 2005). Chenhall (2005) defined a key dimension of balanced scorecards: integrative information. Three interrelated dimensions of integrative information were identified in his study. The first, strategic and operational linkages, is a generic factor that captures the overall extent to which the systems provide for integration across elements of the value chain. The second, customer orientation, focuses on customer linkages and includes financial and customer measures. The third, supplier orientation, is based on linkages to suppliers and includes business process and innovation measures.

The balanced scorecard’s main contribution to this study is its idea of multiple perspectives. In a world that is continually becoming more and more globalized, and in which management strategies are constantly being revised, the idea of taking into account multiple perspectives when analyzing firms’ performance seems worthy of attention. For instance, the “internal processes” dimension or the “learning and growing” dimension invites the firm to give greater consideration to its employees’ motivation and training. BSC can lead to learning organizations. In addition to the four classic perspectives proposed by Kaplan and Norton (1992), namely financial, customer, internal processes, and learning and growing, others can be conceived. The firm’s social and environmental roles, regarding its local community and country, are also increasing in visibility and importance. Interorganizational processes and social responsibility could be considered examples of new perspectives that should be articulated by balanced scorecard approaches.

It is our view that such multidimensionality should be integrated into BI processes.

The Contribution of the Competitive Intelligence Area: Contextualized Information

Finally, the authors have borrowed insights from an adjacent discipline, competitive intelligence. The Society of Competitive Intelligence Professionals (SCIP) defines intelligence as the process of collection, analysis and dissemination of accurate, relevant, specific, current and visionary intelligence related to the company, to the business environment, and to competitors (Miller, 2002). Gilad and Gilad (1988) defined competitive intelligence as the activity of monitoring the firm’s external environment to gather information that is relevant to the decision-making process. Indeed, some authors apply competitive intelligence as a kind of synonym for BI. Despite their common concern with data collection and analysis, and with the conceptual distinction between information and intelligence, the focus of competitive intelligence is external information about competitors and markets, and it deals with information that is essentially qualitative and textual, informal, and ambiguous.

The objective of competitive intelligence is not to “steal” competitors’ trade secrets or other private information, but rather to gather, in a systematic and open manner (i.e., legally), a wide range of information that, once put together and analyzed (i.e., in context), provides a fuller understanding of a competing firm’s structure, culture, behavior, capabilities, and weaknesses (Sammon, Kurland, & Spitalnic, 1984). CI uses public sources to locate and develop information on competition and competitors (McGonagle & Vella, 1990).

The main contribution of competitive intelligence is the focus on contextualized information. For example, in a manufacturing business, the level of “waste” constitutes information that can be analyzed over time and according to product line (the context). These analyses might indicate, for example, that waste levels are higher during a specific period, and in one specific product line. Deeper analyses, including external information, may show that such increases coincide with an increase in air humidity. The intelligence which can be acted upon is the fact that the materials used in that specific product line are more sensitive to air humidity than other materials. Although information is factual and intelligence is something that can be more purposively acted upon, both are contextual. Competitive intelligence makes us more conscious of the contextual nature of information, and of the value of aggregating more external information that is qualitative and informal in nature.

The Meaning of Business Intelligence: Proposing an Extended Definition

All the approaches reviewed are clearly dominated by an objectivist mindset that disregards the socially constructed and political process of information production in any organization. Aiming to develop a more critical appreciation, the authors put forward a distinct definition of BI. Table 2 summarizes the contribution of each area to building an extended definition.

First of all, the authors have favored the notion of process over that of tool. The “tool” view of technology is still predominant in IT literature, with its assumption that technology is an engineering artifact, expected to do what its designers intend it to do (Orlikowski & Iacono, 2001). The “tool” view tends to see IT independently of the social or organizational arrangements within which ICT is developed and used. It black-boxes technologies and assumes that they are stable, settled artifacts that can be passed from hand to hand and used as is, by anyone, anytime, anywhere. In contrast, BI can be seen as an organizational process, composed of information-related activities carried out by people in particular settings, whose perceptions and cultural background will influence the way they interpret and use information and technology.

Second, they outlined the collective and socially constructed nature of the process of defining, collecting, transforming, analyzing and sharing relevant information for decision-making purposes. It is collective because these activities depend on people interacting within the firm. People should acknowledge their active role as information producers and consumers, and the role of their company (political and institutional constraints), in that context.

The process is socially constructed because information is usually understood as a type of commodity that can be unambiguously defined and mechanistically treated and transferred, eliminating the subjectivity inherent in any information-related process and, most importantly, the intersubjectivity inherent in human interactions when dealing with “information” (Easterby-Smith, Araujo & Burgoyne, 1999). If information is a collective social construction, one must focus on the particular activities people engage in when producing and managing it, and these are likely to change across different cultures, industries and even organizations. The

Table 2. The contribution of each area in building an extended definition of BI

Literature revisited | Main concepts retained | Our elaboration from these concepts
“Pure” BI literature: managerial and technical approaches (see also Table 1) | The core of BI (process or tool) is information gathering, analysis and use, and the goal is to support strategic decision-making. | BI is an organizational process consisting of information-related activities.
Information planning literature (Connelly, McNeill, & Mosimann, 1998; Eisenhardt & Sull, 2001; Reich & Benbasat, 2000; Van Der Zee & De Jong, 1999) | Relevant information for decision-making can be collectively identified in the company while respecting individual decision-making styles and points of view. | BI is a collective process; information-related activities (definition, collection, transformation, analysis and distribution of few but strategic indicators) are inherently collective.
Balanced scorecard approach (Chenhall, 2005; Frigo, 2002; Kaplan & Norton, 1992; Kaplan, 1996; Niven, 2002) | The need for multiple perspectives, from traditional “BSC” dimensions (learning and growing, internal process, customers, financial) to emergent ones like inter-organizational processes and social responsibility. | BI is a multidimensional process; information-related activities require multiple perspectives.
Competitive intelligence literature (Gilad & Gilad, 1988; McGonagle & Vella, 1990; Miller, 2002; Sammon, Kurland, & Spitalnic, 1984) | Valuable information, both internal and external, is always contextual. | BI is a contextual and culturally situated process; information-related activities are essentially contextual and culturally situated.
An extended definition of BI | BI is an organizational process consisting of a range of activities and interactions wherein organizational members define, collect, transform, analyze, and share information for decision-making purposes. This process is likely to be collective, socially constructed, multidimensional, contextual and culturally situated.

contextual and culturally situated character of IT projects is supported by abundant literature on IT and development, including efforts at learning more about the history, culture, social relations and local competencies (Avgerou, 2002).

Finally, the authors outlined the multidimensional nature of information and BI processes. They believe that relevant information and decision-making processes go beyond the financial dimension and that all the proactive and creative aspects of information production depend on multiple perspectives. In a nutshell, BI is seen as an organizational process composed of a range of activities and interactions wherein organizational members define, collect, transform, analyze, and share information for decision-making purposes.

This process is likely to be collective, socially constructed, multidimensional, contextual and culturally situated. The implications of this extended definition for the present study are that, regarding firms located in Brazil, this chapter focuses on the particularities of BI projects’ adoption and use, and examines the consequences of these particularities in terms of their “impact” on perceived organizational benefits. In addition, this view goes beyond a “technical” appreciation of functionalities and technological features. In the rest of this chapter, although how BI is referred to may vary, the greatest importance is placed on the processual notion (BI is a process), and

expressions such as BI applications, BI tools, BI solutions and BI systems are merely used to refer to the software components of BI projects.

MAIN THRUST

Research Methods

The present research has been conceived as a qualitative study aimed at describing and understanding complex phenomena whose contextual factors must be deeply analyzed (Stake, 1998). To date, no academic study aimed at researching BI projects from a social or “developmental” viewpoint has been carried out in Brazil. Therefore, this research fills this gap by investigating the implementation and use of Brazilian BI projects, that is, what approaches to BI implementation are being applied by Brazilian companies and what the perceived “value” or benefits of these projects are.

Sample and Data Collection

This research employs criterion sampling, a strategy that includes cases meeting a set criterion and that is useful for quality assurance (Miles & Huberman, 1990, p. 28). The logic of criterion sampling is to review and study cases that meet a predetermined criterion of importance. In this study, this means that all selected cases should meet the same criterion: medium-to-large companies that have had a BI project in place for more than one year and are currently, and effectively, using the BI system. This strategy can add an important qualitative component to a quantitative analysis of an information system: all cases that exhibit certain predetermined criterion characteristics are routinely identified for in-depth qualitative analysis. Criterion sampling can also be used to identify cases from standardized questionnaires for in-depth follow-up. This strategy can only be used where respondents have willingly supplied contact information (Dubé & Paré, 2003).

The unit of analysis is BI projects and the first endeavor was to canvass all firms belonging to the FIESP (Federation of Industries of the State of São Paulo). This entity’s goal is to lead Brazilian industry to a high rank among the world’s most industrially advanced countries, and to support its associated companies and trade unions. One hundred and twenty-nine (129) trade unions are represented, and FIESP serves as a reference in the search for solutions, helping businesspeople manage their firms through strategies, orientation and information.

However, after dozens of calls that yielded no distinctly BI projects, the authors decided to change their strategy. In order to identify BI projects implemented by Brazilian firms, the four major vendors of BI tools in Brazil (Business Objects, Cognos, Hyperion and Oracle) were contacted and asked to provide a list of clients with qualifications potentially meeting the sample selection criteria. In this way, 30 companies were initially contacted, of which only 15 agreed to participate.

Those companies that declined to participate cited commercial confidentiality or lack of interest as reasons for their refusal. This lack of interest is not related to the role of BI in developing economies per se. Rather, it reflects a common picture in Brazil: in general, companies do not see value in academic research. Although the final sample, 15 firms, might seem extremely small, it actually is not since, at the time of this research, the number of Brazilian companies using BI systems was limited.

Research by International Data Center (“Business Intelligence: Aspectos,” n.d.) about BI scenarios and tendencies in Brazilian companies surveyed 250 Brazilian firms and showed that only 12% of them (30 firms) had already invested in some BI project (which does not imply that the implementation had succeeded). The main barrier to Brazilian firms’ adoption of a BI application is the classic decisional criterion of

real return on investment (ROI) (“Ferramentas de Business,” n.d.). The authors believe that their 15-firm sample allows them to sketch an initial picture of the situation. Table 3 presents a list of the companies studied, their industry, and which BI vendor had provided the software application. These companies operate in different industries: manufacturing, financial, insurance, consumer goods, chemical, health, and technology. In size, they range from medium to large, having between 500 and 1,000 employees.

The methodological strategy is based on recent studies showing the value of telephone interviews when the main purpose is to outline an initial picture of a given phenomenon (Hannula & Pirttimaki, 2003; Harvey, 1988; Miller, 1995; Robey, Ross, & Boudreau, 2002). In order to cover a large number of Brazilian companies and to assure a comprehensive view, semi-structured phone interviews seemed the best strategy. Sturges and Hanrahan (2004) compared face-to-face interviewing and telephone interviewing in a qualitative study and concluded that telephone interviews can be used productively in qualitative research.

Data collection consisted of telephone interviews with one person at each company. The organizational member identified as being in charge of BI projects was contacted and interviewed by phone. The authors used an interview protocol in Portuguese (which is available upon request) to conduct the interviews. They started each interview by explaining the traditional concept of BI and verifying whether it corresponded to the interviewee’s conception. Concrete examples of

Table 3. Summary of data collection

Company* | Industry | Type | Interviews | Vendor
CoServ | Service | Multinational | 1 | Cognos
CoChem | Chemical | Multinational | 1 | Business Objects
CoFo | Food and Drugs | Multinational | 1 | Cognos
CoBank | Bank | Multinational | 1 | Oracle
CoBanka | Bank | National | 1 | Hyperion
CoInsur | Insurance | National | 1 | Business Objects
CoPaper | Paper and Cellulose | Multinational | 1 | Cognos
CoHosp | Hospital | National | 1 | Cognos
CoSid | Siderurgy | Multinational | 1 | Oracle
CoTele | Telecommunications | Multinational | 1 | Business Objects
CoBankb | Bank | Multinational | 1 | Hyperion
CoPapera | Paper and Cellulose | Multinational | 1 | Cognos
CoChema | Chemical | Multinational | 1 | Cognos
CoFoa | Food | Multinational | 1 | Business Objects
CoInsrua | Insurance | National | 1 | Hyperion
Total | | | 15 |

*All company names are pseudonyms

the implementation and use of BI activities and applications were requested.

The interviews, although semistructured, allowed respondents to present particular aspects of their BI projects, to describe, in varying degrees of detail, the nature of their decision-making processes, and to report events, constraints, interpretations, and insights that could be seen as unique to each organizational experience. The interviews took place between January and March 2003. Each interview lasted from 30 to 60 minutes and was not tape-recorded. Immediately following each interview, the interviewer wrote a detailed summary from notes taken during the interview.

Data Analysis

The analysis was conducted in two main ways. First, descriptive analysis was applied in examining the answers to interview questions. Because the purpose of this chapter is not to test a theory, but to draw an initial picture in terms of the nature and number of BI projects implemented and used by Brazilian companies, the authors believe that statistical description is an appropriate method of data analysis in an initial phase.

However, in addition to simple descriptive analysis, they coded the interviews according to different categories, like “IT jargon,” local or cultural expressions, and particular interpretations and insights that respondents expressed during the interviews. Although these grounding categories were not sufficient for building initial concepts or relationships between concepts, they were helpful for developing an insightful discussion concerning the perceived value of BI projects from the perspective of Brazilian managers.

RESULTS

This section presents the main results of the interviews, according to the questions included in the interview protocol. Figure 2 shows these results graphically, in a presentation of the seven most important questions. These questions are closely related to the first research question (What approaches, models or frameworks have been adopted to implement BI projects in Brazilian companies?) but only tangential to the second research question (What is the perceived “value” of BI to strategic management of Brazilian companies?). As previously described, the data collected were not restricted to the interview protocol questions. Owing to the semistructured character of the interviews, additional comments and questions were posed, depending on the course of each conversation, so that the two research questions could be explored in different ways.

For How Many Years Have These Brazilian Companies Been Engaged in a BI Process?

BI projects are relatively recent arrivals to Brazilian business. Of the companies interviewed, 73% had begun the operation of a BI system within the last three years (Chart 1). The oldest BI application was six years old. Because BI projects were called EIS (Executive Information Systems) in the past, the companies were asked whether they had had an EIS before implementing their BI project, and the answer was no.

Did These Brazilian Companies Use a Specific Methodology to Identify Their BI Indicators?

One of the main concerns relates to what approaches to BI implementation (in terms of identification of performance indicators) are being applied by Brazilian companies. Surprisingly, 73% of the interviewed companies were not using a specific methodology for developing their BI (Chart 2). Instead of following a given methodology, they had simply replicated the existing indicators or

Figure 2. A picture of the use of BI projects by Brazilian companies

measures already being used in traditional MIS reports and spreadsheets.

Actually, they seemed to pay much greater attention to building and managing the data warehouse from a technical perspective than to thinking about its content. In sum, there was no collective process for identifying key indicators that could effectively help the decision-making process. Among the remaining 27% of companies actually using a specific BI methodology, the preferred method was the balanced scorecard. It was observed that firms that had started their BI projects during the previous two years were those using specific methodologies to identify their BI indicators, but insufficient information was gathered to conclude that there is an association between these two facts.

Were Standard Indicators Adopted or Were Specific Indicators Drawn from Local Settings?

The question of how indicators were identified concerns whether the collection of indicators integrated into the BI system reflects local aspirations, concerns and actions, and respects the idea that different countries have different requirements. Results show that 87% of the companies defined their indicators from a Brazilian context while 13% used indicators suggested by international headquarters or by external consultants (and even in those cases, they tried to take advantage of and add indicators from the local context) (Chart 3).
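
The percentages reported so far can be read back as firm counts. The sketch below is only a reader's aid: it assumes that each percentage is a share of the 15 interviewed firms listed in Table 3 (the chapter does not state the denominators explicitly), and the function names are illustrative. It also rechecks the survey figure quoted earlier (12% of 250 firms).

```python
from fractions import Fraction

SAMPLE_SIZE = 15  # assumption: every percentage refers to the 15 interviewed firms


def firms_for(percent, n=SAMPLE_SIZE):
    """Whole number of firms closest to `percent` percent of `n`."""
    return round(Fraction(percent, 100) * n)


def percent_for(count, n=SAMPLE_SIZE):
    """Share count/n expressed as a percentage, rounded to the nearest integer."""
    return round(Fraction(count, n) * 100)


reported = {
    "BI system started within the last three years": 73,
    "no specific methodology": 73,
    "specific methodology": 27,
    "indicators defined from a Brazilian context": 87,
    "indicators suggested by headquarters or consultants": 13,
}

if __name__ == "__main__":
    for label, pct in reported.items():
        count = firms_for(pct)
        print(f"{pct}% ~ {count} of {SAMPLE_SIZE} firms ({label}); "
              f"back-check: {percent_for(count)}%")
    # Survey figure quoted in the sample description: 12% of 250 firms.
    print(f"12% of 250 firms = {firms_for(12, 250)}")
```

Under this assumption the complementary shares add up to the full sample (11 + 4 and 13 + 2 firms), which is consistent with the rounded percentages given in the text.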

The implications of these findings are discussed in the next section.

Are External Sources of Information Combined with Internal Sources?

Regarding sources of information, results show that the focus is on information produced from operational or transactional systems. Few companies were concerned with external information. Only 27% of the companies had external information in their BI. In these cases, external information amounted to around 10-25% of the total information used. The main sources of external information identified were market institutes (e.g., market share), governmental institutes (e.g., demographic information), and market research customized for a specific purpose (Chart 4).

Is a Particular Type of Indicator Dominant?

Recent approaches, like the balanced scorecard, warn of the danger of performance measurement that essentially reflects a financial dimension and a type of monitoring which is perceived as reactive, rather than proactive. Beyond seeking to avoid dominance by a financial perspective, these approaches argue that indicators should be balanced, that is, with no single dominant perspective but rather a mix of several perspectives. This study reveals that in 80% of the companies, some kind of indicator was predominant. In 58% of them, sales indicators were the most powerful while financial indicators dominated in 42%. This is understandable because, in general, BI systems had first been implemented in the commercial or financial areas of the companies and had been restricted to those areas (Chart 5).

while 33% organized them by product, such as credit card, leasing, investments, car insurance and life insurance (Chart 6). (Actually, all the companies that presented indicators organized by product were in the banking and insurance industries). Companies using a balanced scorecard approach based their indicators on the four well-known perspectives.

Do Users from One Area Have Access to Information from Other Areas?

Finally, questions were asked concerning organization and access to information. The users were managers (87%), top managers (73%), superintendents (33%), and specialists (33%) from different areas. On the one hand, all the specialists (33%) used data mining tools; on the other hand, neither managers and top managers nor superintendents used these tools. The authors believe that this finding reflects the fact that the users who use data mining tools have specific skills, with expertise in math, statistics, and analysis, and also speak the business language.

In 67% of these companies, users from one area could access information from other areas, but no company shared any information with its suppliers or clients (Chart 7). Access control was based either on hierarchical levels (that is, top managers and managers could access information from all areas, but superintendents or specialists could only access their area) or on specific roles (that is, users from one branch could access all information, but only in their branch).

RECOMMENDATIONS AND FUTURE DIRECTIONS

How Are the Indicators Organized? In assembling this first collection of semi-struc-
tured interviews, authors observed an interesting
67% of the firms organized the indicators by area, picture of BI use in Brazil. Some of the results
such as financial, sales, supply, human resources, were expected but many brought surprises, such

What Role is “Business Intelligence” Playing in Developing Countries?

as the discrepancy between what was seen in advertising and consulting claims and what was found in practice. IT consultants and vendors try to convince public opinion that BI is already a “reality” for most companies, especially among the leaders in each segment or industry. But this field study shows that although many firms “intend” to start a BI process, few have already embarked on such projects. Is this gap between advertising and practice specific to the Brazilian context, or can a similar scenario be found worldwide?

A very similar situation prevails in Canada, based on the experience of one of the researchers who teaches at a Canadian university: there is a lot of interest in BI as a new IT trend, but there are relatively few projects that have reached maturity. For these reasons, the actual benefit of investing in BI processes still remains to be proven. The authors believe that the scenario is slightly different in the US, if one takes into account Kaplan and Norton’s claims (1992) that 90% of the Fortune 1000 firms are using a balanced scorecard to monitor their business (usually, a balanced scorecard is the interface of a BI system). Aside from the US, the supposed absence of mature BI projects worldwide is something that Brazilian companies can take advantage of, by starting to use BI systems before or at the same moment as most of their competitors elsewhere in the world.

A second interesting insight concerns the methodological approach adopted by BI users. This answers this study’s first research question: What approaches, models or frameworks have been adopted in a Brazilian context in the implementation of BI projects? Surprisingly, the majority of companies did not use any “standard” model or methodology imported from “developed” contexts, which could be seen as a positive sign in itself. The results show that although Brazilian firms have adopted “globalized” products (like Cognos and Business Objects), they have not adopted imposed or imported methodologies for implementing their BI applications.

In other words, the collection of indicators integrated into the BI system reflects local aspirations, concerns and actions, and respects the idea that different countries have different requirements. It suggests that these Brazilian firms have resisted the imposition of key performance indicators pushed by “globalized” companies, especially when dealing with global corporations such as the major IT vendors. The experience of these firms provides a portrait of local and cultural factors influencing the adoption and impact of BI projects.

The Brazilian situation combines an inward-oriented economy with strong linkages to international sources of technology. Part of the reason for Brazil’s inward orientation is its size and its distance from major markets and global production networks. While Brazil’s economy has become somewhat more globally oriented during the last 10 years, local rather than global forces still drive IT adoption in this country (Tigre & Dedrick, 2004). This national feature helps explain the Brazilian pattern of importing technologies but resisting imported performance indicators when using these technologies.

It is important to recognize the difference between importing key performance indicators that reflect contexts other than local, and adopting approaches to BI implementation that help in the identification of key performance indicators. Nonetheless, this study’s findings suggest that the absence of any specific or well-defined approach when engaging in a BI project is dangerous. The lack of any methodology to improve the capacity to identify performance indicators can jeopardize the very existence of a BI process. In addition, although they paid attention to local and contextualized information, BI systems in the companies studied tended to favor formal and internal information over that which is informal and external (as suggested by “competitive intelligence”). What’s more, the focus on financial and commercial areas is also problematic because important areas like
innovation, employee motivation and collective learning may be neglected, and these represent some of the strongest sources of competitive advantage (Kaplan & Norton, 1992).

The previous consideration is directly related to the second research question: What is the perceived “value” of BI to strategic management of Brazilian companies? Considering that IT use takes place within a context of “globalization,” and being aware that companies participating in such a globalized process do not compete under equal conditions, the authors expected that BI processes would help companies in developing countries to find competitive advantage. As previously discussed, this research suggests that few Brazilian companies are, in fact, using BI processes, and even among those using them, there is an absence of well-defined methodologies. This tendency must be revisited. Seeking the reasons for such a situation, an attempt was made to determine whether it represented a well-known phenomenon in IT: the adoption of an IT innovation without really understanding its nature or value.

In the Brazilian context, IT consultants and vendors have exercised a strong influence and have pushed new IT solutions on companies, even when the nature or value of such innovations is not really understood (Pozzebon, 2003). This means that a BI system can be adopted as a “strategic” project, but end up being used as a technical solution for operational and tactical problems. This belief is reinforced by Brazilian managers’ current emphasis on “data warehousing” in BI projects.

However, a BI process needs alignment with organizational strategy in order to produce the expected benefits, and the lack of understanding of such a strategic role for BI should be overcome. The word “strategic” is often used to increase the perceived value of a BI project or a vendor’s offerings, but this element is not always integrated into the business process during implementation. It is worth recalling Buytendijk’s words (2001) regarding the meaning of “strategic,” one of the most abused terms in BI projects. What happens when BI is applied from a tactical/operational but not a strategic point of view?

A strategic deployment of BI means the BI application becomes embedded in the systems and processes of the business to build a more agile enterprise that can anticipate and react faster than its competitors to changing business conditions and new profit opportunities. On the other hand, a tactical deployment of BI aims at making a current process more efficient, usually the existing management reporting process.

A strategic use of information focuses on how well the organization is meeting predefined goals and objectives. Furthermore, this use provides perspective on, and direct support for, how the organization is able to change its ways, going beyond improving current operations. When a tactical use of information prevails, it provides insight into the status of current and day-to-day processes, or insight into how to improve current processes.

Failure to understand these differences usually leads to the BI/data warehousing “graveyard.” This can entail overspending on a data warehouse infrastructure that serves only a few BI applications or underspending that causes huge project delays and failure because the infrastructure components have been underestimated. Similarly, the use of BI may be tactical rather than strategic, in that the preoccupation is not with defining strategic information aligned to strategic objectives, but with recovering indicators or measures that already exist in spreadsheets and are already being used in traditional managerial reports. The BI ends up being used as a traditional MIS that is simply more flexible and has graphical functionalities, but not as a “business intelligence process” for strategic decision-making.

A BI system’s worth lies in the value of the indicators and information dealt with by the people that are interacting. If there is no awareness of how to conceptualize, produce, analyze and share such information, and of what strategic insights are
likely to be triggered, the benefit from a BI process is likely to decrease or disappear. This research suggests that the strategic and social role of IT is not always perceived. Behind any IT application there lie social and political choices.

To trigger a BI process is much more of an organizational or management issue than a technological one. Much of the potential benefit of a BI project disappears when firms pay more attention to how to technically build and effectively manage a centralized data repository/data warehousing than to how to collectively and socially build a mechanism that produces and disseminates useful and timely information for decision-making.

This encourages the promotion of a less technical view of BI, as reflected in the authors’ extended definition. They propose that BI should be seen as organizational processes instead of as simple BI tools or applications. These processes are likely to be collective, socially constructed, multidimensional, contextual and culturally situated. When a technical approach dominates, the potential benefits of BI processes tend to disappear.

The exceptions to this pessimistic scenario are those Brazilian firms which have decided to use a balanced scorecard as their methodological approach. By its very nature, the balanced scorecard approach calls for rethinking strategic goals, requiring the alignment of key indicators with top goals and functional objectives. For these reasons, the authors suggest that firms using adapted balanced scorecards are perhaps the only ones using their BI processes for truly strategic purposes.

On the other hand, precisely because Brazilian firms have not mechanistically “imported” models and approaches from other contexts, the authors find here an opportunity to stimulate these companies to adopt a framework that is intrinsic to Brazilian social, economic and cultural contexts. The indicators have been identified according to the Brazilian context, and this fact can represent an important element to be explored by researchers and practitioners aiming to use IT as a vector for development.

CONCLUSION

Given the exploratory character of this study (it is, in fact, the first phase of a research project of this nature), the authors think that they have shed some light on a subject not yet investigated which could be further explored through additional theoretical lenses. Most people managing BI systems in the Brazilian companies investigated were more concerned with technology than with business. In other words, the companies implemented their systems with a technological focus, that is, how to structure data warehousing, which technology vendor is better, and so forth.

Furthermore, there is a lack of attention to determining what information is most relevant to business, or aligning indicators with strategic objectives. In the companies where balanced scorecards were used to drive the development of the BI system, a greater alignment between indicators and strategic objectives was found. That most of the Brazilian companies investigated have paid close attention to the Brazilian context in defining their indicators seems a good sign, as recent IT studies suggest the danger of mechanistically transposing global principles to local contexts, especially in developing countries.

However, the fact that most of the companies did not employ any specific methodology seems to interfere with the creation of value or competitive advantage from their BI projects. The lack of methodology is a weakness and invites future research into locally conceived approaches to BI. The authors believe that their extended definition of BI, although provisional and in progress, could be helpful in developing a framework that adheres to the company’s particular and contextually situated business strategy, with a greater likelihood of obtaining “value” and benefit from these projects.


This research has revisited existing views of BI and proposes a reformulated definition. Its authors believe that a collective, contextualized and critical process of information management may help companies in developing countries derive value from their BI projects. IT can be a powerful tool that helps countries promote their own development and emphasize the local context within which the IT-based solutions are implemented. However, the nature of these “adaptations” and the factors that influence them are poorly understood, and this chapter’s main contribution has been to shed some light on these elements regarding BI processes.

REFERENCES

Avgerou, C. (2002). Information systems and global diversity. London, UK: Oxford University Press.

Beck, U. (2000). What is globalization? Cambridge, UK: Polity Press.

Burn, J. M. & Loch, K. D. (2001). The societal impact of the World Wide Web—Key challenges for the 21st century. Information Resources Management Journal, 14(4), 4-14.

Business intelligence: Aspectos e tendências do uso de ferramentas de análise corporativa. Retrieved March 12, 2003, from www.idcbrasil.com.br.

Buytendijk, F. (2001). Strategic BI: Its definition and effect on infrastructure. Gartner Group.

Chenhall, R. H. (2005). Integrative strategic performance measurement systems, strategic alignment of manufacturing, learning and strategic outcomes: An exploratory study. Accounting, Organizations and Society, 30(5), 395-423.

Connelly, R., McNeil, R., & Mosimann, R. (1998). The multidimensional manager - 24 ways to impact your bottom line in 90 days. Ottawa, ON: Cognos Incorporated.

Dhar, V. & Stein, R. (1996). Seven methods for transforming corporate data into business intelligence. Upper Saddle River, NJ: Prentice Hall.

Dubé, L. & Paré, G. (2003). Rigor in information systems positivist case research: Current practices, trends and recommendations. MIS Quarterly, 27(4), 597-635.

Easterby-Smith, M., Araujo, L., & Burgoyne, J. (1999). Organizational learning and the learning organization: Developments in theory and practice. London, UK: Sage Publications.

Eisenhardt, K. M. & Sull, D. N. (2001). Strategy as simple rules. Harvard Business Review, 79(1), 106-117.

Ferramentas de business intelligence no Brasil (2003). Retrieved March 12, 2003, from www.idcbrasil.com.br.

Frigo, M. L. (2002). Strategy-focused performance measures. Strategic Finance, 84(3), 10-13.

Gilad, B. & Gilad, T. (1988). The business intelligence system: A new tool for competitive advantage. New York: Amacom.

Giovinazzo, W. A. (2002). Internet-enabled business intelligence. Upper Saddle River, NJ: Prentice Hall.

Hackathorn, R. D. (1998). Web farming for the data warehouse: Exploiting business intelligence and knowledge management. San Francisco: Morgan Kaufmann Publishers.

Hannula, M. & Pirttimaki, V. (2003). Business intelligence empirical study on the top 50 Finnish companies. Journal of American Academy of Business, 2(2), 593-599.

Harvey, C. D. (1988). Telephone survey techniques. Canadian Home Economics Journal, 38(1), 30-35.


Heeks, R. (2002). Information systems and developing countries: Failure, success and local improvisation. Information Society, 18(2), 101-112.

Kalakota, R. & Robinson, M. (2001). E-business 2.0—Roadmap for success. New York: Addison-Wesley.

Kaplan, R. & Norton, D. (1992). The balanced scorecard—Measures that drive performance. Harvard Business Review, 70(1), 71-79.

Kaplan, R. (1996). The balanced scorecard: Translating strategy into action. Boston: Harvard Business School Press.

Kudyba, S. & Hoptroff, R. (2001). Data mining and business intelligence: A guide to productivity. Hershey, PA: Idea Group Publishing.

Liautaud, B. (2000). E-business intelligence: turning information into knowledge into profit. New York: McGraw-Hill.

McGonagle, J. J. & Vella, C. M. (1990). Outsmarting the competition. Naperville, IL: Sourcebooks.

Meier, R. L. (2000). Late-blooming societies can be stimulated by information technology. Futures, 32(2), 163.

Miles, M. B. & Huberman, A. M. (1990). Qualitative data analysis. London: Sage Publications.

Miller, C. (1995). In-depth interviewing by telephone: Some practical considerations. Evaluation and Research in Education, 9(1), 29-38.

Miller, J. (2002). O milênio da inteligência competitiva, Brazil: Bookman.

Mursu, A., Soriyan, H. A., Olufokunbi, K., & Korpela, M. (2000). Information systems development in a developing country: Theoretical analysis of special requirements in Nigeria and Africa. In Proceedings of the 33rd Hawaii International Conference on System Sciences. Maui, Hawaii: IEEE.

Niven, P. R. (2002). Balanced scorecard step-by-step: Maximizing performance and maintaining results. New York: J. Wiley & Sons.

O’Bada, A. (2002). Local adaptations to global trends: A study of an IT-based organizational change program in a Nigerian bank. Information Society, 18(2), 77.

Orlikowski, W. J. & Iacono, C. S. (2001). Research commentary: Desperately seeking “IT” in IT research—A call to theorizing the IT artifact. Information Systems Research, 12(2), 121-156.

Pozzebon, M. (2003). The implementation of configurable technologies: Negotiations between global principles and local contexts. Unpublished doctoral dissertation, McGill University, Montreal, Canada.

Rasmussen, N., Goldy, P. S., & Solli, P. O. (2002). Financial business intelligence—Trends, technology, software selection, and implementation. New York: John Wiley and Sons.

Reich, B. & Benbasat, I. (2000). Factors that influence the social dimension of alignment between business and information technology objectives. MIS Quarterly, 24(1), 81-113.

Robey, D., Ross, J., & Boudreau, M. (2002). Learning to implement enterprise systems: An exploratory study of the dialectics of change. Journal of Management Information Systems, 19(1), 17.

Sahay, S. & Avgerou, C. (2002). Information and communication technologies in developing countries. Information Society, 18(2), 1-5.

Sammon, W. L., Kurland, M. A., & Spitalnic, R. (1984). Business competitor intelligence: Methods for collecting, organizing, and using information. New York: John Wiley & Sons.

Schonberg, E., Cofino, T., Hoch, R., Podlaseck, M., & Spraragen, S. (2000). Measuring success. Communications of the ACM, 43(8), 53-57.


Scoggins, J. (1999). A practitioner’s view of techniques used in data warehousing for sifting through data to provide information. In Proceedings of The Eighth International Conference on Information and Knowledge Management, Kansas City, MO.

Stake, R. E. (1998). Case studies. In N. K. Denzin & Y. S. Lincoln (Eds.), Strategies of qualitative inquiry (pp. 86-109). Thousand Oaks, CA: Sage Publications.

Steinmueller, W. E. (2001). ICTs and the possibilities for leapfrogging by developing countries. International Labour Review, 140(2), 193-210.

Sturges, J. & Hanrahan, K. (2004). Comparing telephone and face-to-face qualitative interviewing: a research note. Qualitative Research, 4(1), 107-118.

Tigre, P. B. & Dedrick, J. (2004). E-commerce in Brazil: local adaptation of a global technology. Electronic Markets, 14(1), 36-40.

Van Der Zee, J. T. M. & De Jong, B. (1999). Alignment is not enough: Integrating business and information technology management with the balanced score card. Journal of Management Information Systems, 16(2), 137-158.

Vitt, E., Luckevich, M., & Misner, S. (2002). Business intelligence. Microsoft Press.

Watson, H., Goodhue, D., & Wixon, B. (2002). The benefits of data warehousing: Why some organizations realize exceptional payoffs. Information & Management, 39(6), 491-502.

Williams, R. (1997). Universal solutions or local contingencies? Tensions and contradictions in the mutual shaping of technology and work organization. In I. McLoughlin & M. Harris (Eds.), Innovation, organizational change and technology. London, UK: International Thomson Business Press.

Chapter XIV
Building an Environmental GIS
Knowledge Infrastructure
Inya Nlenanya
Center for Transportation Research and Education,
Iowa State University, USA

ABSTRACT

Technologies such as geographic information systems (GIS) enable geospatial information to be captured,
updated, integrated, and mapped easily and economically. These technologies create both opportunities
and challenges for achieving wider and more effective use of geospatial information in stimulating and
maintaining sustainable development through smart policy making. This chapter proposes a simple and
accessible conceptual knowledge discovery interface that can be used as a tool to that end. In
addition, it addresses some issues that bear on whether this knowledge infrastructure can stimulate
sustainable development, with emphasis on sub-Saharan Africa.

INTRODUCTION

Technologies such as geographic information systems (GIS) enable geographic information to be captured, updated, integrated, and mapped easily and economically. These technologies create both opportunities and challenges for achieving wider and more effective use of geoinformation in stimulating and maintaining sustainable development through smart policy making. With the start of a new millennium, humankind faces environmental changes greater in magnitude than ever before as the scale of the problem shifts from local to regional and to global. Environmental problems such as global climate change and unsustainable developments in many parts of the world are evolving as major issues for the future of the planet and of mankind. Acidification of lakes and rivers, destruction of vital natural wetlands, loss of biotic integrity and habitat fragmentation, eutrophication of surface waters, bioaccumulation of toxic pollutants in the food web, and degradation of air quality constitute some of the many examples of how human-induced changes have impacted the Earth system. These human-induced changes are stressing natural systems and reducing biological diversity at a rate and magnitude not experienced for millions of years (Speth, 2004). Also, anthropogenic stresses such as those associated with population growth, dwindling resources, and chemical and biological pollution of water resources are expected to become more acute and costly.

Dealing with these environmental issues requires a balanced response in the form of an environmental management strategy. Such a response must utilize the best available scientific understanding and data, in addition to an infrastructure that combines both, in order to deliver sound science-based solutions to the myriad of environmental problems. In the Fall 2003 edition of the Battelle Environmental Updates, it was argued that such a response would result in a complex decision network. This argument may well have inspired the National Science Foundation (NSF) in 2004 to propose a network of infrastructure called the National Ecological Observatory Network (NEON). NEON supports continental-scale research targeted to address the environmental challenges by facilitating the study of common themes and the transfer of data and information across research sites (NAS, 2004). This creates a platform that enables easy and quick access to the environmental data needed to tackle the environmental challenges.

NEON is based on the same concept as grid computing. Grid computing eliminates the need to have all data in one place through on-demand aggregation of resources at multiple sites (Chetty & Buyya, 2002). This creates an enabling platform for the collection of more specialized data with the hope of integrating them with data from other related areas. This has particular benefit in environmental data management and analysis since both data and specific processing methods are frequently exchanged and used within various organizations (Vckovski & Bucher, 1996). Together, NEON and grid computing form the enabler for the construction of an environmental cyberinfrastructure that will permit the transfer of data, the specific processing methods and the interoperability of these methods so as to reduce the time wasted in duplication of resources. This infrastructure is necessary especially in the face of unprecedented data availability.

During the last decade, society has witnessed a tremendous development in the capabilities to generate and acquire environmental data to support policy and decision-making. Furthermore, the rapid and exploding growth of online environmental data due to the Internet and the widespread use of ecological and biological databases has created an immense need for intuitive knowledge discovery and data mining methodologies and tools.

However, Africa deserves special attention when establishing such networks. According to Song (2005), an average university there has the same aggregate bandwidth as a single home user in North America or Europe and pays more than 50 times as much for this bandwidth as its counterparts in Europe or North America. These statistics come from a continent where the major issues include hunger, poverty, AIDS, and political instability, and they summarize why sub-Saharan Africa in this knowledge age is still undeveloped and unable to tackle her own environmental problems. Clearly, a survey of the wealthiest nations in the world would quickly reveal that GDP is directly proportional to the volume of digital information exchange. Technology transfer has not been able to make a mark in Africa simply because the proponents ignored the social and economic questions of access to markets, fair wages, water, land rights, and so forth, in favor of purely technical questions, rejecting the indigenous knowledge in the process. Hence, with all the progress made in cutting-edge technology for data acquisition, there is still a dearth of geographic information exchange in sub-Saharan Africa.

Sobeih (2005) argues that, “GIS is considered to be one of the most important tools for increased public participation and development that offers
new socio-economic development opportunities. It can encourage human resource development within the country, facilitate the participation of youth in public life, help provide an analytic and scientific understanding of development issues, and much more” (p. 190). Evidently, in a region of the world marked by political instability, the role of the private sector and the ordinary citizen has become elevated. Hence the need to increase capacity for handling GIS tools in environmental policy making. All the more important is this environmental GIS knowledge interface, as the wealth of the continent lies in the environment. The participants at the AFRICAGIS 2005 Conference, held in South Africa, concluded deliberations by recognizing the opportunity provided by geospatial information for use in the development of Africa. Consequently, the specific objectives of this chapter are:

1. To develop a simple and accessible conceptual knowledge infrastructure that can be used as a tool to introduce GIS into the education curriculum in sub-Saharan Africa
2. To adapt (1) to the current context of sub-Saharan Africa, taking into account the prevailing social and economic questions
3. To proffer policies for development in sub-Saharan Africa

BACKGROUND

From the history of GIS, it is without doubt that environmental application has been one of the motivating factors that led to the development of GIS in the mid-1960s (Longley, Goodchild, Maguire & Rhind, 2001). This is due to the fact that environmental issues arise as a result of human activities and almost all human activities involve a geographic component (Blaschke, 2001; Longley et al., 2001; Rautenstrauch & Page, 2001). From infancy in land use applications in Canada, GIS has evolved into an all-enveloping technology that has found useful applications in every facet of human enterprise. Technologies such as global positioning systems (GPSs) and remote sensing satellites have been largely responsible for the GIS evolution, complemented by reductions in the cost of computer hardware, electronic storage media, etc. (Chainey & Ratcliffe, 2005; Longley et al., 2001). Ratcliffe (2004) believes that in addition to the technology aspect of GIS evolution, the discipline has also benefited immensely from what he refers to as the scientific development of the discipline, an angle developed by Goodchild (1991) and Longley et al. (2001). As a result, GIS has seen the adaptation of analytical methods, techniques and processes to problems with a spatial component. Because every human activity has a spatial axis, this makes GIS omnipresent in modern life (Chainey & Ratcliffe, 2005) and a partner in development. As a partner in development, there is a need to leverage all the utility of GIS to increase the environmental knowledge base in sub-Saharan Africa.

In the global economy, knowledge is everything, which is one thing that industrialized countries have in common (Mokyr, 2002). But before one gets to knowledge, data is needed. There is a dearth of environmental data in developing countries (Kufoniyi, Huurneman & Horn, 2005; Rüther, 2001). And where such data are available, they are not in digital format (Dunn, Atkins & Townsend, 1997). Organizations such as Environmental Information Systems-Africa (EIS-Africa), USAID and other notable international organizations have been in the forefront of the campaign to bridge the environmental knowledge gap by concentrating on human and institutional capacity building in the GIS sector and in encouraging the integration of GIS into policy making. As a way of strengthening these efforts, this chapter proposes a knowledge discovery interface.

Building an Environmental GIS Knowledge Infrastructure

KNOWLEDGE DISCOVERY INTERFACE

A knowledge discovery interface (KDI) is a type of interface that provides the means by which users can connect the suite of data mining tools so that the tools communicate with each other, irrespective of their implementation, and at the same time communicate with the data. A KDI defines the range of permissible inputs, outputs and controls between the elements that make up the knowledge discovery process, in order to encourage more participation from fields of study that may not be part of the traditional data mining research catchment area.

The knowledge discovery process is a computationally intensive task consisting of complex interactions between a human and a large database, supported by a heterogeneous suite of tools (Brachman & Anand, 1996). Consequently, a knowledge discovery interface defines the rules for the complex interactions not just between the user and a large database but, most importantly, between the heterogeneous suite of tools and a large assortment of databases. It is very important that this suite of data mining tools sees the assortment of databases as a whole and not just as a sum of its parts, since the best picture is being sought. In this case, the best picture is one that takes from all sources and presents an output that is unique to all its sources. This is very significant because in knowledge discovery the object is not to look for the obvious but for some interesting pattern (Fayyad, Piatetsky-Shapiro & Smyth, 1996) that can be used for decision making.

To further understand the concept of the KDI, it helps to look at some definitions of knowledge discovery in databases. Koua and Kraak (2004) define knowledge discovery as a higher level process that uses information from the data mining process and turns it into knowledge or integrates it with prior knowledge. They went on to present a more general definition, borrowing from Miller and Han (2001) and Fayyad et al. (1996), as “discovering and visualizing the regularities, structures and rules from data, discovering useful knowledge from data and for finding new knowledge.” This definition takes into account the research area of data visualization, which hitherto has been largely ignored in knowledge discovery research (Lee, Ong & Quek, 1995). From this definition, a KDI provides the means of integrating the discovery of information from a database via statistical techniques and machine learning with visualization techniques, so that the two work seamlessly to extract new knowledge or add to existing knowledge.

A KDI provides the means for controlling the complex processes of extraction, organization and presentation of discovered information (Brachman & Anand, 1996) from a database. This definition encompasses the various steps in the knowledge discovery process. In Miller (in press), the knowledge discovery process is grouped into the following steps, as shown in Figure 1:

• Data preprocessing (data selection, data cleaning and data reduction)
• Data mining (choosing the data mining task, choosing the data mining technique and data mining)
• Knowledge construction (interpreting the mined patterns and consolidating the discovered knowledge)

The steps are independent and, therefore, a KDI provides the protocol that connects the steps. Mitra and Acharya (2003) add a new dimension in their assessment of the knowledge discovery process as involving all that has been mentioned above plus the modeling and support of the overall human-machine interaction. A KDI, in short, is the connection between the user, the knowledge discovery tools and the data.

Figure 1. Overview of the knowledge discovery process
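
The three steps of the knowledge discovery process, and the idea that a KDI supplies the protocol connecting them, can be pictured with a short sketch. It is illustrative only: it is written in Python for brevity, and every name in it (`Step`, `run_pipeline`, the toy rainfall records and the threshold rule) is an assumption made for the example rather than part of the KDI described in this chapter.

```python
from typing import Any, Callable, Dict, List

# A "step" is any callable that takes the previous step's output and
# returns its own. This shared signature stands in for the protocol
# that connects otherwise independent steps (hypothetical design).
Step = Callable[[Any], Any]


def preprocess(records: List[Dict[str, Any]]) -> List[float]:
    """Data preprocessing: select one field and drop missing values."""
    return [r["rain_mm"] for r in records if r.get("rain_mm") is not None]


def mine(values: List[float]) -> Dict[str, float]:
    """Data mining: a deliberately trivial 'pattern' (summary statistics)."""
    return {"mean": sum(values) / len(values), "max": max(values)}


def construct_knowledge(pattern: Dict[str, float]) -> str:
    """Knowledge construction: interpret the mined pattern."""
    # The threshold of twice the mean is an arbitrary illustrative rule.
    if pattern["max"] > 2 * pattern["mean"]:
        return "extreme rainfall event present"
    return "no extreme rainfall event"


def run_pipeline(data: Any, steps: List[Step]) -> Any:
    """Pass the data through each step in order."""
    for step in steps:
        data = step(data)
    return data


# Toy, made-up records; None marks a missing reading.
records = [
    {"station": "A", "rain_mm": 10.0},
    {"station": "B", "rain_mm": None},
    {"station": "C", "rain_mm": 12.0},
    {"station": "D", "rain_mm": 80.0},
]
result = run_pipeline(records, [preprocess, mine, construct_knowledge])
print(result)
```

Because each step depends only on the shared signature, any one of them could be replaced (for example, by a different mining technique) without touching the others.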

Why a Knowledge Discovery Interface is Needed

A KDI simplifies the process of knowledge discovery by making it easy for the user to interact with the wealth of environmental data and the suite of data mining tools available. This has the potential to encourage more participants and to expand the knowledge input into the research field. It provides the key to a human-centered knowledge discovery process that Brachman and Anand (1996) emphasize, since it gives the user control over the tools. This control is very important, as advances in knowledge discovery technologies are yielding more tools than the user can grasp without the help of a KDI.

As a result of the breakthroughs in data storage and data collection technologies, datasets for environmental studies now occupy gigabytes or terabytes of storage. This factor is responsible for influencing the advances in artificial neural networks that have enhanced the analysis and visual mining of large volumes of data. Keim (2002), in his assessment of visual data mining, argued that it gives an overview of the data and is particularly good for noisy, inhomogeneous and large datasets, which are a few of the characteristics of the data available for environmental modeling. He went on to argue that visual data mining can be seen as a hypothesis generation process. It can also be used for hypothesis verification, but in union with automatic techniques from statistics or machine learning (Lee, Ong, Toh & Chan, 1996). Additionally, visual data mining can help the user to determine whether a certain dataset is the best choice for the targeted learning process. This makes visual data mining an important member of the suite of knowledge discovery tools. Hence the need for a KDI to integrate the visual data mining tools with nonvisual data mining tools.

With a repertoire of already existing tools for handling spatial data, spatiotemporal data, and nonspatial data, a knowledge discovery interface will eliminate the need to create from scratch a holistic system that handles all the forms of data. Instead, through a well-developed interface, these existing tools can be integrated for the benefit of knowledge extraction from all kinds of data models available for environmental applications. Examples of these existing tools include ArcGIS and spatial OLAP or SOLAP, which is an integration of geographic information system (GIS) and OLAP (Bedard, Gosselin, Rivest, Proulx, Nadeau & Lebel, 2003). Accordingly, the KDI reduces the time required for the deployment of a state-of-the-art knowledge discovery infrastructure.
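
One way to picture how an interface lets existing tools be reused rather than rebuilt is an adapter registry: each existing tool is wrapped once and then reached through the same call. The following is a minimal sketch under stated assumptions; the class name, the `kind` labels and the two stand-in tools are hypothetical and do not correspond to ArcGIS, SOLAP or any real product API.

```python
from typing import Any, Callable, Dict


class ToolRegistry:
    """Routes a dataset to whichever wrapped tool handles its kind."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[[Any], Any]] = {}

    def register(self, kind: str, tool: Callable[[Any], Any]) -> None:
        self._tools[kind] = tool

    def handle(self, kind: str, data: Any) -> Any:
        if kind not in self._tools:
            raise KeyError(f"no tool registered for {kind!r} data")
        return self._tools[kind](data)


# Stand-ins for pre-existing tools (hypothetical, not real product APIs).
def spatial_tool(points):
    """Return the bounding box (min_x, min_y, max_x, max_y) of 2-D points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


def nonspatial_tool(rows):
    """Count occurrences of each value in a list of attribute values."""
    counts: Dict[Any, int] = {}
    for value in rows:
        counts[value] = counts.get(value, 0) + 1
    return counts


registry = ToolRegistry()
registry.register("spatial", spatial_tool)
registry.register("nonspatial", nonspatial_tool)

bbox = registry.handle("spatial", [(1, 5), (3, 2), (0, 4)])
land_use = registry.handle("nonspatial", ["farm", "forest", "farm"])
print(bbox, land_use)
```

Adding a further tool, for instance one for spatiotemporal data, would then be a single `register` call rather than a change to the existing tools.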

The iterative nature of the knowledge discovery process, which is highlighted in Fayyad et al. (1996), Han (1999), NAS (2003), and Mitra et al. (2003), suggests that the process of applying tools and transformations in the task of knowledge discovery is repeated until the analyst discovers some striking regularities that were not previously known. This iterative character has the advantage of allowing the entire process to be broken into modules. A KDI is very useful where modules exist because it defines the rules for inter-modular interaction. As a result, a KDI enables a platform that leads to specialized stand-alone applications, such that modifications can be made to one part without affecting the entire system. This is a view that Thuraisingham (1999) shares by recommending the development of data mining modules as a collection of generic reusable tools.

The contribution of grid computing to the knowledge discovery process comes with its own attendant problem. With the availability of data in intranet repositories and geodata on the Internet, the problem arises of what kind of data would be best for a particular learning process. Albertoni, Bertone, and De Martino (2003) capture this by acknowledging the urgent need for methods and tools to support the user in data exploration. They proposed a solution based on the integration of different techniques, including data mining, visualization and graphical interaction techniques. Their approach aims to aid the user in making the right choice of data by offering both an automated presentation of data to dynamically visualize the metadata and interactive functionalities to discover the relationships among the different metadata attributes. This approach hints at creating a common control platform in which these interactive functionalities are integrated so that the user can manage them. A KDI provides that common control. Metadata is mentioned here to underscore its prime place in data mining (Thuraisingham, 1999).

The bulk of the knowledge discovery process is in the data preprocessing stage. Miller and Han (2001) describe the preprocessing of data, which is partly accomplished in data warehouses, as fundamental to the knowledge discovery process because it integrates and represents data in a manner that supports very efficient querying and processing. Zaiane, Han, Li, and Hou (1998) highlight its importance by observing that most of the studies done in knowledge discovery are confined to the data filtering step, which is part of the data preprocessing stage. This presumes that the success of the overall process centers on how well the data is prepared before mining, since the data preparation process has the power to bias the knowledge that can be extracted. Thuraisingham (1999) makes the case for the importance of data warehousing in these words: “good data is the key to good mining.” As a result, advances in data warehousing and database integration would play a very important role in enhancing the knowledge discovery process. Database integration plays a role here because it provides the input to the data warehousing stage. For environmental applications, the data of choice is geospatial. Currently, conventional conceptual database models do not provide a straightforward mechanism to explicitly capture the semantics related to space (Khatri, Ram & Snodgrass, 2004). However, research is underway to develop tools for automatic integration of geospatial data (from well-structured vector and raster models to unstructured models such as georeferenced multimedia data) from heterogeneous sources into one coherent source (NAS, 2003). This will enable applications to be designed that integrate geospatial data from different sources. The next logical step would be to provide a KDI that will integrate these applications into the overall knowledge discovery process.

Another argument for the need for a KDI is the fact that data mining, and consequently the overall process of knowledge discovery, is a relatively young and interdisciplinary field, drawing from such areas as database management systems, data warehousing, data visualization, information retrieval, performance computing, and so
forth. It needs the integration of approaches from multiple disciplines (Han, 1999). The fields that support knowledge discovery all have their own standards, which creates the need for integration. In addition, research is currently underway to address the development of formalized platforms to enhance multidisciplinary research investments (NAS, 2003). A KDI would be advantageous to fully utilize the results of this research.

KDI System Conceptualization

The KDI system consists of GIS components, data mining components and the interactions between the two. The degree of integration, which is a measure of the interaction between the components, would be loose coupling as opposed to tight coupling. In loose coupling, the interaction is limited in time and space, that is, control and data can be transferred directly between the components. Nevertheless, the interaction is always explicitly initiated by one of the components or by an external agent. In tight coupling, knowledge and data are not only transferred, they can be shared by the components via common internal structures (Alexandre, 1997). A comparison of the two degrees of integration shows that tightly coupled systems would be difficult to upgrade without tearing down everything. Scalability and reusability problems would also arise. It would be difficult to integrate such a system outside of the application domain that warranted its design. Longley et al. (2001) believe that as standards for software development become more widely adopted, software developers or users would prefer software systems whose components are reusable. This would give them the choice of building from scratch or building from components (Longley et al., 2001). From a purely financial standpoint, choice is everything. Consequently, the three main components of the KDI are the geospatial component, the spatial analysis component, and the knowledge component, as shown in Figure 2.

Figure 2. KDI architecture
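
The difference between the two degrees of integration can be made concrete with a small sketch. It assumes a hypothetical `GeospatialComponent` holding one made-up elevation layer and a stand-in analysis routine; none of these names come from the KDI itself. In the loosely coupled call the analysis receives explicitly transferred data, whereas in the tightly coupled call it works on the component's own internal structure.

```python
import copy
from typing import Dict, List


class GeospatialComponent:
    """Holds map layers; hands out data only when explicitly asked."""

    def __init__(self) -> None:
        # Invented sample values, used only for illustration.
        self._layers: Dict[str, List[float]] = {"elevation": [12.0, 15.5, 9.0]}

    def export_layer(self, name: str) -> List[float]:
        # Loose coupling: the caller receives a copy, not the internal list.
        return copy.deepcopy(self._layers[name])

    def internal_layer(self, name: str) -> List[float]:
        # Tight coupling: the caller shares the component's own structure.
        return self._layers[name]


def normalise_in_place(values: List[float]) -> None:
    """A stand-in analysis step that rescales values to the range 0..1."""
    low, high = min(values), max(values)
    for i, v in enumerate(values):
        values[i] = (v - low) / (high - low)


gis = GeospatialComponent()

# Loosely coupled: the analysis works on transferred data only.
transferred = gis.export_layer("elevation")
normalise_in_place(transferred)
loose_unchanged = gis.export_layer("elevation") == [12.0, 15.5, 9.0]

# Tightly coupled: the analysis mutates the GIS component's own data.
shared = gis.internal_layer("elevation")
normalise_in_place(shared)
tight_changed = gis.export_layer("elevation") != [12.0, 15.5, 9.0]

print(loose_unchanged, tight_changed)
```

In the loosely coupled case the analysis routine could be swapped for another without any risk to the data held by the GIS component, which is the upgrade and reuse argument made above.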

The Geospatial Component

The geospatial component is purely a GIS-based tool or assortment of tools. Blaschke (2001) differentiates GIS from other spatial environmental information systems on the basis of its data linkages. Rautenstrauch and Page (2001), in making a case for environmental informatics, argue that environmental studies should not be limited to just ecological data, a view that Groot and McLaughlin (2000) had already anticipated by opining that the pendulum is moving in favor of a geospatial data infrastructure (GDI). GDI has been described by Coleman and McLaughlin (1997) as an information system linking environmental, socioeconomic and institutional databases. A key characteristic of geospatial data is its potential for multiple applications (Groot & McLaughlin, 2000), which is a reflection of the technologies, including GPS and remote sensing, that are used to collect such data. In other words, a better understanding of environmental issues lies in integrating purely environmental subjects like ecology, land use, and so forth, with other factors that influence them, for example, the economy. A typical example of this application would be in the area of sustainable development, which is the ability to simultaneously tackle the economic and environmental proportions of resource distribution and administration (Groot & McLaughlin, 2000).

GIS plays the role of presenting these linkages in such a way as to force an environmental view of reality. The idea of the geospatial component is to provide a toolbox in the KDI that handles data in a way that bridges the gap between data on paper and reality on the ground. According to Thurston, Poiker, and Moore (2003), the accuracy of a model is in its ability to recreate reality accurately. They emphasize that accuracy is a function of the quality of information included in the model. Looking from another perspective, given that the quality of the information (data) is acceptable both in content and in how well the contents are integrated, the value of the knowledge extracted from this information would be a function of how the data is encapsulated and manipulated.

Figure 3. Schematics of the interactions within the geospatial component

Figure 3 shows the processing that goes on within the geospatial component. The query connector acts as a kind of filter for selecting the data specified by the user. The selected data is promptly represented using either of the GIS data models, vector or raster. This ensures consistency between how the features are stored in the database and how they are represented in visual form. The vector/raster block represents the encapsulation of the data for presentation in a visual product or map. The cyclic route linking the components shows a tightly coupled connection, which ensures that at each point on the route the data will remain the same, giving no loophole for data corruption. The two-sided arrow connecting the query to the visual product represents the connection between the data and visual product
which makes it easy to query the database via the visual product.

Against this background, the geospatial component provides the tools that enable the user to edit, display, customize, and output spatial data. It acts as the display unit of the KDI, making use of the geovisualization functionality of the GIS. As a result, researchers are adopting the geovisualization view of GIS in the conceptualization of the geospatial component. This geovisualization view basically sees GIS as a set of intelligent maps or views that show features and feature relationships on the earth (Zeiler, 1999). In this setup, the map acts as a window into the geodatabase (ESRI, 2004), which is at the heart of GIS architecture. Also, from a knowledge discovery point of view, the geospatial component acts as the exploratory data analysis tool that gives the user a summary of the problem at hand, ensuring a greater grasp of environmental issues.

The Spatial Component

Vital to understanding the need for a spatial component is the fact that the map or any other visual product is merely a representation of information stored in a database. The database, and not the map, is the depository of spatial information (Thurston et al., 2003). Kraak (2000) argues that a map has three major functions in the manipulation of geospatial data:

1. It can function as a catalog of the data available on the database.
2. It can be used to preview available data.
3. It can form part of a database search engine.

In a sense, the map is a guide to other information and as a result can be used to direct the extraction of information from the database. The information so extracted is then subjected to spatial analysis for the purpose of extracting knowledge, which can form the basis for updating the database in terms of reorganizing the way data is integrated or linked.

Spatial analysis refers to the ability to manipulate spatial data into different forms and extract additional information as a result (Bailey, 1994). Combining spatial analysis and GIS has been a study area of interest to many researchers. Wise and Haining (1991) identified the three categories of spatial analysis as statistical spatial data analysis (SDA), map-based analysis and mathematical modeling. Haining (1994) believes that for GIS to attain its full measure, it needs to incorporate SDA techniques.

The nature of this link between spatial analysis and GIS is the subject of the spatial component implementation. Based on the study of the linkage between GIS and spatial analysis, Goodwill et al. (1991) distinguished between four scenarios:

1. Free standing spatial analysis software
2. Loose coupling of proprietary GIS software with statistical software
3. Close coupling of GIS and statistical software
4. Complete integration of statistical spatial analysis in GIS

Of the four, most attention is on close coupling or loose coupling (Gatrell & Rowlingson, 1994), mostly because both options give developers and users freedom to implement the linkage in the way that will best accomplish their task. They also make it easy to integrate other components as the need arises.

The spatial component is the integration of GIS tools and statistical tools. While the geospatial component seeks to encapsulate the data in a way that will enhance knowledge discovery, the spatial component deals with manipulating the raw data in a way that will enhance application of the appropriate levels of theory and modeling capability in real problem solving situations
(O’Kelly, 1994). To this end, the spatial component provides the tools for analysis and transformation of spatial data for environmental studies. To be able to perform the analysis, the spatial component must be able to extract the data first. To extract the data, it needs access to the GIS tools for data integration, filtering, cleaning and all the necessary data preprocessing tasks. The spatial component must also possess tools that allow the results of the spatial analysis to be used to update the database, in addition to the ability to view the results. Figure 4 provides the schematics of how the spatial component works.

Figure 4. Schematics of the interactions within the spatial component

Knowledge Component

Spatial decision support systems (SDSS) are very important tools for planning and decision making for environmental management. Normally, SDSS combine spatially explicit observational data and simulation of physical processes with a representation that is suited for nonspecialist decision makers and other stakeholders (Taylor, Walker & Abel, 1998). They also provide users and decision makers with the tools for dealing with ill-structured or semistructured spatial problems, in addition to providing an adequate level of performance (Abiteboul, 1997; Hopkins, 1984; Stefanakis et al., 1998; Taylor et al., 1998). According to Ting (2003), sustainable development demands complex decision making that combines the environmental, social and economic consequences of the choices made with regard to resource management. Such decision making, she continues, requires ready access to current, relevant and accurate spatial information by decision makers and stakeholders. Feeney (2003) argues that spatial information is one of the most critical elements underpinning decision making for many disciplines. She went on to define decision support as the automation, modeling and/or analysis that enables information to be shaped from data. The task of the knowledge component is to transform the information extracted into knowledge, thereby improving the quality of the decision making process. It accomplishes this task by providing the necessary input for creating new environmental models or validating and updating existing ones.

As a result, the knowledge component is made up of a collection of learning algorithms. The knowledge component acts as the decision support of the entire system. As the decision support component, it can be used to structure, filter and integrate information, model information where gaps occur in data, produce alternative solution scenarios as well as weight these according to priorities, and most importantly facilitate group as well as distributed participation in decision making (Feeney, 2003). The interactions taking place in the component are shown diagrammatically in Figure 5. The interactions form the basis of the implementation. The figure shows a tightly coupled
integration of the components, ensuring that the data and knowledge extracted are tied together.

Figure 5. Interactions within the knowledge component

POTENTIAL APPLICATIONS OF THE KDI

The possibility of packaging data mining methods as reusable software application objects has opened up the whole realm of knowledge discovery to people outside the traditional usage base.

The KDI enables a knowledge discovery platform optimized for environmental applications by integrating state-of-the-art standalone GIS applications and data mining functionality in a closely coupled, open, and extensible system architecture. The data mining functionality can be used to validate models for decision support systems used in generating environmental policies. For example, Mishra, Ray, and Kolpin (in press) conducted research on the neural network analysis of agrichemical occurrence in drinking water wells to predict the vulnerability of rural domestic wells in several Midwestern states of the USA to agrichemical contamination. The research objectives included studying the correctness of results from the neural network analysis in estimating the level of contamination with a known set of data, and showing the impact of input parameters and methods to interpret the results. Also, Hsu et al. (1995), Shamseldin (1997), Shukla, Kok, Prasher, Clark, and Lacroix (1996), Kao (1996), Yang, Prasher, and Lacroix (1996), Maier and Dandy (1996), Schaap and Bouten (1996), Schaap and Linhart (1998), and Basheer, Reddi, and Najjar (1996) have all applied neural network analysis to various problems in the agriculture, water resources and environmental domains. The KDI is a great resource for applying data mining applications to water and environment-related problems. It also provides the platform for performing machine learning analysis on the magnitude of environmental data collected, in order to keep up with the pace at which the data are being collected.

The KDI can be used as a teaching tool in an introductory course for students interested in the partnership between GIS, data mining and environmental management, without overloading them with advanced GIS. It will also widen the scope of environmental students in the area of programming languages by challenging them to design models that are portable and reusable. It can also serve as a teaching aid in creating the need for more public participation in environmental resource management within the framework of the long distance education paradigm.

The KDI is a prototype for using an object-oriented programming platform to enable the design of environmental modeling systems that are reusable, with well-defined inputs, outputs and controls for easy integration.
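
The use of data mining functionality to check a model against a known set of data can be illustrated with a small validation loop. This is a sketch under loud assumptions: the well records are invented, and `stand_in_model` is a simple placeholder rule, not the neural network used in the studies cited above. Only the comparison of predictions with known outcomes is the point of the example.

```python
from typing import Callable, List, Tuple

# Each record: (nitrate_mg_per_l, well_depth_m, contaminated).
# These values are invented toy data, not measurements.
KNOWN_WELLS: List[Tuple[float, float, bool]] = [
    (12.0, 10.0, True),
    (2.0, 60.0, False),
    (9.5, 15.0, True),
    (1.0, 80.0, False),
    (6.0, 25.0, False),
]


def stand_in_model(nitrate: float, depth: float) -> bool:
    """Placeholder for a trained model: flags shallow, high-nitrate wells."""
    return nitrate > 5.0 and depth < 30.0


def validate(model: Callable[[float, float], bool],
             known: List[Tuple[float, float, bool]]) -> dict:
    """Compare model predictions with known outcomes."""
    correct = 0
    false_alarms = 0
    misses = 0
    for nitrate, depth, truth in known:
        predicted = model(nitrate, depth)
        if predicted == truth:
            correct += 1
        elif predicted and not truth:
            false_alarms += 1
        else:
            misses += 1
    return {
        "accuracy": correct / len(known),
        "false_alarms": false_alarms,
        "misses": misses,
    }


report = validate(stand_in_model, KNOWN_WELLS)
print(report)
```

Since `validate` accepts any callable with the same inputs and output, a trained network could be passed in place of the placeholder rule without changing the validation code.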

KDI IMPLEMENTATION

In the implementation, a GIS-based system that integrates logic programming and relational database techniques has been adopted. It is well documented that geographic analysis and spatial visualization improve operational efficiency, decision making, and problem solving. Software developers need the flexibility to build domain-specific, easy-to-use applications that incorporate the power of GIS technology into a focused, user-friendly application (ESRI, 2004). The KDI implementation consists of a GIS component and Java Foundation Classes, JFC (see www.java.sun.com). JFC encompasses a group of features for building graphical user interfaces (GUIs) and adding rich graphics functionality and interactivity to applications. The KDI application consists of a GIS component, the data mining application, and JFC, which provides the interface for connecting them (Onwu, 2005). The GIS component is implemented using the ArcGIS® engine, which is an integrated family of GIS software products from ESRI® that delivers complete, scalable GIS at the project, work group, and enterprise levels (ESRI, 2004). It contains developer application programming interfaces (APIs) that can be used to embed GIS logic in non-GIS-centric applications and to efficiently build and deploy custom ArcGIS applications on the desktop. The data mining application can be implemented with any third party data mining application. Basically, it provides the collection of machine learning algorithms for data mining tasks. This includes algorithms for classification and regression, dependency modeling/link analysis and clustering. The algorithms can either be applied directly to a dataset or called from user-defined Java code.

The choice of ESRI® products is informed by the fact that ESRI is one of the few GIS companies with a commitment to human and institutional capacity building in Africa. JFC is open-source and will not cost the users anything to implement. The portability of the application ensures that it can be used in any operating system environment, which is an important consideration in an environment where there are not a lot of choices. In addition, JFC ensures that the implementation can be done in any other language besides English. This makes it possible for the end users to take ownership of the application.

FUTURE ISSUES

This section discusses some of the emerging technologies and how they are redefining the need for KDI and the future of the synergy of GIS and data mining.

GeoSensors

GIS has been evolving over the years in response to the advances in database management technologies. Currently, advances in sensor technology and deployment strategies are transforming the way geospatial data is collected and analyzed and the quality with which it is delivered (NAS, 2003). This presupposes that the current methods of storing geospatial information are bound to change. The change is predicated on the fact that homogeneous collection of data is now being replaced by heterogeneous collection of data for an area of interest, for example, video and temperature feeds. The nature of these feeds or data will mean that pieces of information vary in content, resolution and accuracy, in addition to having a spatiotemporal component (Nittel & Stefanidis, 2005). The current trend of geosensor technology is also going to affect the time for data analysis. Usually, data is analyzed after it has been downloaded, but with regard to energy considerations for the sensors, there might be the need to do real-time analysis of data being collected so that the sensor can discard unnecessary data and transmit useful data in accordance with the study requirements. This implies having an on-chip data mining capability. As a result, the
geospatial nature of the data being collected would necessitate GIS functionality tightly integrated with the on-chip data mining application. This is targeted at minimizing the energy consumption of the sensors by reducing the amount of data to transmit. Nittel and Stefanidis (2005) suggested a minimization of data acquisition time as a solution to energy optimization.

In response to the ongoing development of sensor technology, Tao, Liang, Croitoru, Haider, and Wang (2005) propose the Sensor Web, which would be the sensor equivalent of the Internet. According to them, the Sensor Web would be a global sensor that connects all sensors or sensor databases. The Sensor Web would be interoperable, intelligent, dynamic, scalable, and mobile. This is certainly going to revolutionize the concept of GPS systems. With these sensors connecting wirelessly via the Internet and possibly by satellite linkages, the possibility increases of having a live feed for each location on the surface of the earth, complete with video, audio and other parameters of interest at a particular time. It is going to be more like having a live Webcam with the added benefit of knowing the current wind speed, temperature, humidity, etc., all wrapped up within the framework of a GIS so that the geospatial component is not lost. This will obviously warrant a multimedia data mining application to tap into the vast knowledge trapped in the video images. The knowledge from the video feeds is then integrated with the knowledge from the nonspatial data in order to get a perfect or approximate picture. The emphasis, however, would be on approximate rather than perfect because, as Evan Vlachos framed it in his opening address to GIS 1994, “it is better to be approximately right rather than precisely wrong” (Vlachos, 1994).

The success of the scenario painted in the foregoing paragraphs can only be accomplished in a closely coupled, working multidisciplinary partnership. All the stakeholders involved must be accommodated at the outset to offset the possibility of creating integration problems later down the road. With object-oriented programming platforms, each solution would be implemented as a reusable software application object.

Geographic Data Mining

The new ArcGIS 9 from ESRI is revolutionizing the concept of a geodatabase. The new ArcSDE has the capability for storing and managing vector, raster and survey datasets within the framework of the relational database management system (RDBMS) (ESRI, 2004b). This implies that the geospatial data is linked not only to nonspatial datasets but also to images. Hence the user has the choice of what kind of map to view: vector maps or satellite imagery. This creates an enabling environment for geographic data mining (GDM).

Geographic data mining is best described as a knowledge discovery process within the context of a map instead of a database. Miller and Han (2001) define it as the application of computational tools to reveal interesting patterns in objects or events distributed in geographic space and across time. The time component specifically refers to the satellite images, which represent pictures taken over time. GDM is closely related to geographic visualization (GV), which is the integration of cartography, GIS, and scientific visualization for the purpose of exploring geographic data in order to communicate geographic information to end-users (MacEachren & Kraak, 1997). With these developments, the possibility of performing machine learning analysis on a map object will greatly increase the knowledge available for environmental management, as this will reduce the level of abstraction of spatial data and prevent the loss of spatial information. Also, GDM would be the best way to capture the contribution of the time component in the knowledge extracted. There is still the problem of how to incorporate a time component in the RDBMS (NAS, 2003). But GDM of satellite maps would remove the need to abstract the time component, making satellite
Building an Environmental GIS Knowledge Infrastructure

images a repository of spatial-temporal data. The next task will be to encapsulate these developments in reusable application objects with well-defined user interfaces in order to make them accessible to the managers of environmental resources.

Conclusion

GIS started as a technology for data creation and has now evolved into one for data management. This research focused on the development and implementation of a prototype KDI for environmental science applications. This was predicated on the need to help policy makers come to grips with the current environmental challenges. The extensible nature of the KDI makes it a dynamic tool since it allows for integration with other tools. The challenge now is to package this concept in a cost-effective way as a tool to introduce GIS in the educational curriculum.

In all, what GIS does is very simple. It makes a point aware of its position vis-à-vis other points. Stretching this understanding, the concept of a network becomes obvious. The challenge before sub-Saharan African countries becomes how to create a social infrastructure that will connect these points so that they can work for a common goal and avoid duplication of resources. That is the first step in taking the initiative to bridge the knowledge gap with the rest of the world. In 2003, USAID Success Stories captured the current state of Africa’s efforts in bridging this gap in the following lines:

We find currently that a chasm exists, separating the users of environmental information, policymakers, and scientists from one another. We often think of this as a divide between continents, but more importantly, it is also a divide between islands of expertise. There is a divide between highly dedicated and competent analysts in Africa from the state of the art in the rest of the world, but also between the analysts and decision-makers, and between the scientific expertise of metropolitan centers and the innate local knowledge of the environment in rural areas (USAID/AFR, 2003).

Juma (2006) believes that African universities should take the initiative in community development by developing an educational curriculum that addresses the needs of the community. African universities should re-align themselves so that they become active participants with the international organizations in institutional capacity building. African universities should provide the leverage needed to bring the expertise together.

The road to sustainable development in sub-Saharan Africa will not be complete without addressing the role of governments. Juma (2006) proposes that governments act as facilitators. With government as facilitator, a level playing field is created for public-private partnerships, in the form of nongovernmental organizations, to step in and get the knowledge to the rural communities by creating urban-rural partnerships and investing in youths as the harbingers of rural development.

As facilitators, African governments should be committed to the fact that knowledge is the currency of development. If the developing countries are to join their developed counterparts in providing basic services to their citizens, there is a need to create a unified system for tracking the vast potential in Africa and organizing it in such a way that it can provide insights that would produce policies that bring about development in Africa. The main benefit of establishing a GIS-based system is to stimulate and assist development activities in the region. One way of doing this is by creating a commission tasked with the creation of baseline geographic data at the local government level and with converting existing data into digital format. The funding for this commission can be sourced from private companies or from international agencies and foreign aid. The availability of baseline data makes it easy for international development agencies to track the progress of development in a region.

In the face of the failure of technology transfer in the developing countries, there is a need for a GIS system that answers the more fundamental social and economic questions as well as the technical ones, an opinion exemplified by Ficenec (2003). GIS is very important for stimulating community development by providing a way for policy makers to match resources with the potential available in a community. This leads to grassroots development, poverty reduction, job opportunities, and, overall, an economically viable state.

References

Abiteboul, S. (1997). Querying semi-structured data. In Proceedings of the International Conference on Database Theory, Delphi, Greece.

AfricaGIS (2005). Conference resolutions draft. Retrieved April 13, 2008, from http://www.africagis2005.org.za/agp/africagispapers/AfricaGIS 2005 Resolutions draft 041105.doc

Albertoni, R., Bertone, A., & De Martino, M. A. (2003). Visualization-based approach to explore geographic metadata. In Proceedings of the 11th International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision, WSCG 2003, Plzen-Bory, Czech Republic.

Alexandre, F. (1997). Connectionist-symbolic integration: From unified to hybrid approaches. Mahwah, NJ: Lawrence Erlbaum Associates.

Bailey, T. C. (1994). A review of statistical spatial analysis in geographical information systems. In A. S. Fotheringham & P. A. Rogerson (Eds.), Spatial analysis and GIS (pp. 14-44). London, UK: Taylor and Francis.

Basheer, I. A., Reddi, L. N., & Najjar, Y. M. (1996). Site characterization by NeuroNets: An application to the landfill siting problem. Ground Water, 34, 610-617.

Bedard, Y., Gosselin, P., Rivest, S., Proulx, M., Nadeau, M., Lebel, G., & Gagnon, M. (2003). Integrating GIS components with knowledge discovery technology for environmental health decision support. International Journal of Medical Informatics, 70, 79-94.

Blaschke, A. (2001). Environmental monitoring and management of protected areas through integrated ecological information systems: An EU perspective. In C. Rautenstrauch & S. Patig (Eds.), Environmental information systems in industry and public administration (pp. 75-100). Hershey, PA: Idea Group Publishing.

Brachman, R. J. & Anand, T. (1996). The process of knowledge discovery in databases. In U. Fayyad, G. Piatetsky-Shapiro, P. Smyth & R. Uthurusamy (Eds.), Advances in knowledge discovery and data mining (pp. 37-57). Cambridge, MA: AAAI/MIT Press.

Chainey, S. & Ratcliffe, J. (2005). GIS and crime mapping. Chichester, West Sussex: John Wiley and Sons.

Chetty, M. & Buyya, R. (2002). Weaving computational grids: How analogous are they with electrical grids? IEEE Computing in Science and Engineering, July/August, 61-71.

Coleman, D. J. & McLaughlin, J. D. (1997). Information access and network usage in the emerging spatial information marketplace. Journal of Urban and Regional Information Systems Association, 9, 8-19.

Dunn, C. E., Atkins, P. J., & Townsend, J. G. (1997). GIS for development: A contradiction in terms? Area, 29(2), 151-159.

ESRI (2004a). ArcGIS 9: What is ArcGIS? A White Paper. Redlands, CA: Environmental Systems Research Institute.

ESRI (2004b). ArcSDE: Advanced spatial data server. White Paper. Retrieved May 8, 2008, from http://esri.com/library/whitepapers/pdfs/arcgis_spatial_analyst.pdf

Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI Magazine, 17, 37-54.

Feeney, M. F. (2003). SDIs and decision support. In I. Williamson, A. Rajabifard, & M. F. Feeney (Eds.), Developing spatial data infrastructures: From concept to reality (pp. 195-210). London, UK: Taylor & Francis.

Ficenec, C. (2003, June). Explorations of participatory GIS in three Andean watersheds. Paper presented at the University Consortium of Geographic Information Science (UCGIS) Summer Assembly 2003, Pacific Grove, CA.

Gatrell, A. & Rowlingson, B. (1994). Spatial point process modeling in a GIS environment. In A. S. Fotheringham & P. A. Rogerson (Eds.), Spatial analysis and GIS (pp. 148-163). London, UK: Taylor and Francis.

Goodchild, M. F., Haining, R., & Wise, S. M. (1991). Integrating GIS and spatial data analysis: Problems and possibilities. International Journal of Geographic Information Systems, 6, 407-423.

Groot, R. & McLaughlin, J. (2000). Introduction. In R. Groot & J. McLaughlin (Eds.), Geospatial data infrastructure: Concepts, cases and good practice (pp. 1-12). Oxford, UK: Oxford University Press.

Haining, R. (1994). Designing spatial data analysis modules for GIS. In A. S. Fotheringham & P. A. Rogerson (Eds.), Spatial analysis and GIS (pp. 46-63). London, UK: Taylor and Francis.

Han, J. (1999). Data mining. In J. Urban & P. Dasgupta (Eds.), Encyclopedia of distributed computing. Kluwer Academic Publishers.

Hopkins, L. D. (1984). Evaluation of methods for exploring ill-defined problems. Environment and Planning B: Planning and Design, 11, 339-348.

Hsu, K-L., Gupta, H. V., & Sorooshian, S. (1995). Artificial neural network modeling of the rainfall-runoff process. Water Resour. Res., 31, 2517-2530.

Juma, C. (2006, April). Reinventing African economies: Technological innovation and the sustainability transition. Paper presented at The John Pesek Colloquium on Sustainable Agriculture, Ames, Iowa.

Kao, J-J. (1996). Neural net for determining DEM-based model drainage pattern. Journal of Irrigation and Drainage Engineering, 122, 112-121.

Keim, D. A. (2002). Information visualization and visual data mining. IEEE Transactions on Visualization and Computer Graphics, 7, 100-107.

Khatri, V., Ram, S., & Snodgrass, R. T. (2004). Augmenting a conceptual model with geospatiotemporal annotations. IEEE Transactions on Knowledge and Data Engineering, 16, 1324-1338.

Koua, E. L. & Kraak, M. J. (2004). Geovisualization to support the exploration of large health and demographic survey data. International Journal of Health Geographics, 3, 12.

Kraak, M.-J. (2000). Access to GDI and the function of visualization tools. In R. Groot & J. McLaughlin (Eds.), Geospatial data infrastructure: Concepts, cases and good practice (pp. 217-321). Oxford, UK: Oxford University Press.

Kufoniyi, O., Huurneman, G., & Horn, J. (2005, April). Human and institutional capacity building in geoinformatics through educational networking. Paper presented at the International Federation of Surveyors Working Week 2005, Cairo, Egypt.

Lee, H. Y., Ong, H. L., & Quek, L. H. (1995). Exploiting visualization in knowledge discovery. In Proceedings of the 1st International Conference on Knowledge Discovery and Data Mining (pp. 198-201), Montreal, Canada.

Lee, H. Y., Ong, H. L., Toh, E. W., & Chan, S. K. (1996). A multi-dimensional data visualization tool for knowledge discovery in databases. In Proceedings of IEEE Conference on Visualization, pp. 26-31.

Longley, P. A., Goodchild, M. F., Maguire, D. J., & Rhind, D. W. (2001). Geographic information systems and science. West Sussex, England: John Wiley & Sons, Ltd.

Maier, H. R. & Dandy, G. C. (1996). The use of artificial neural networks for the prediction of water quality parameters. Water Resour. Res., 32, 1013-1022.

MacEachren, A. M. & Kraak, M.-J. (1997). Exploratory cartographic visualization: Advancing the agenda. Computers & Geosciences, 23, 335-343.

Miller, H. J. (in press). Geographic data mining and knowledge discovery. In J. P. Wilson & A. S. Fotheringham (Eds.), Handbook of geographic information science. Blackwell.

Miller, H. J. & Han, J. (2001). Geographic data mining and knowledge discovery. London: Taylor and Francis.

Mishra, A., Ray, C., & Kolpin, D. W. (in press). Use of qualitative and quantitative information in neural networks for assessing agricultural chemical contamination of domestic wells. Journal of Hydrological Engineering.

Mitra, S. & Acharya, T. (2003). Data mining: Multimedia, soft computing and bioinformatics. Hoboken, NJ: John Wiley and Sons, Inc.

Mokyr, J. (2002). The gifts of Athena: Historical origins of the knowledge economy. Princeton, NJ: Princeton University Press.

National Academy of Sciences (NAS) (2003). IT roadmap to a geospatial future. Washington, D.C.: The National Academies Press.

Nittel, S. & Stefanidis, A. (2005). GeoSensor networks and virtual GeoReality. In S. Nittel & A. Stefanidis (Eds.), GeoSensor networks (pp. 1-9). Boca Raton, FL: CRC Press.

O’Kelly, M. E. (1994). Spatial analysis and GIS. In A. S. Fotheringham & P. A. Rogerson (Eds.), Spatial analysis and GIS (pp. 66-79). London, UK: Taylor and Francis.

Onwu, I. (2005). Knowledge discovery interface for environmental applications. Unpublished master’s thesis, Iowa State University, Ames.

Ratcliffe, J. (2004). Strategic thinking in criminal intelligence. Sydney: Federation Press.

Rautenstrauch, C. & Page, B. (2001). Environmental informatics: Methods, tools and applications in environmental information processing. In C. Rautenstrauch & S. Patig (Eds.), Environmental information systems in industry and public administration (pp. 2-11). Hershey, PA: Idea Group Publishing.

Rüther, H. (2001, October). EIS education in Africa – The geomatics perspective. Paper presented at the International Conference on Spatial Information for Sustainable Development, Nairobi, Kenya.

Schaap, B. D. & Linhart, S. M. (1998). Quality of ground water used for selected municipal water supplies in Iowa, 1982-96 water years (p. 67). Iowa City, IA: U.S. Geological Survey Open File Report 98-3.

Schaap, M. G. & Bouten, W. (1996). Modeling water retention curves of sandy soils using neural networks. Water Resour. Res., 32, 3033-3040.

Shamseldin, A. Y. (1997). Application of a neural network technique to rainfall-runoff modeling. Journal of Hydrology, 199, 272-294.

Shukla, M. B., Kok, R., Prasher, S. O., Clark, G., & Lacroix, R. (1996). Use of artificial neural networks in transient drainage design. Transactions of the ASAE, 39, 119-124.

Sobeih, A. (2005). Supporting natural resource management and local development in a developing connection: Bridging the policy gap between the information society and sustainable development. A publication of the International Institute for Sustainable Development (IISD), pp. 186-210.

Song, S. (2005). Viewpoint: Bandwidth can bring African universities up to speed. Science in Africa, September 2005. Retrieved April 13, 2008, from http://www.scienceinafrica.co.za/2005/september/bandwidth.htm

Speth, J. G. (2004). Red sky at morning: America and the crisis of the global environment. Yale University Press.

Stefanakis, E., Vazirgiannis, M., & Sellis, T. (1999). Incorporating fuzzy set methodologies in a DBMS repository for the application domain of GIS. International Journal of Geographic Information Science, 13, 657-675.

Taylor, K., Walker, G., & Abel, D. (1999). A framework for model integration in spatial decision support systems. International Journal of Geographic Information Science, 13, 533-555.

Tao, V., Liang, S., Croitoru, A., Haider, Z. M., & Wang, C. (2005). GeoSwift: Open geospatial sensing services for sensor web. In S. Nittel & A. Stefanidis (Eds.), GeoSensor networks (pp. 267-274). Boca Raton, FL: CRC Press.

Thuraisingham, B. M. (1999). Data mining: Technologies, techniques, tools and trends. Boca Raton, FL: CRC Press.

Thurston, J., Poiker, T. K., & Moore, J. P. (2003). Integrated geospatial technologies: A guide to GPS, GIS, and data logging. Hoboken, NJ: John Wiley & Sons.

Ting, L. (2003). Sustainable development, the place for SDIs, and the potential of e-governance. In I. Williamson, A. Rajabifard & M. F. Feeney (Eds.), Developing spatial data infrastructures: From concept to reality (pp. 183-194). London, UK: Taylor & Francis.

USAID (2003). USAID Africa success stories. Retrieved April 13, 2008, from http://africastories.usaid.gov:80/print_story.cfm?storyID=23

Vckovski, A. & Bucher, F. (1996). Virtual data sets: Smart data for environmental applications. In Proceedings of the Third International Conference/Workshop on Integrating GIS and Environmental Modeling, Santa Fe, NM.

Vlachos, E. (1994). GIS, DSS and the future. In Proceedings of the 8th Annual Symposium on Geographic Information Systems in Forestry, Environmental and Natural Resources Management, Vancouver, Canada.

Wise, S. M. & Haining, R. P. (1991). The role of spatial analysis in geographical information systems. Westrade Fairs, 3, 1-8.

Yang, C.-C., Prasher, S. O., & Lacroix, R. (1996). Application of artificial neural networks to land drainage engineering. Trans. ASAE, 39, 525-533.

Zaïane, O. R., Han, J., Li, Z.-N., & Hou, J. (1998). Mining multimedia data. In Proceedings of the CASCON’98: Meeting of Minds (pp. 83-96), Toronto, Canada.

Zeiler, M. (1999). Modeling our world: The ESRI guide to Geodatabase design. Redlands, CA: ESRI Press.

Chapter XV
The Application of Data Mining for Drought Monitoring and Prediction
Tsegaye Tadesse
National Drought Mitigation Center,
University of Nebraska, USA

Brian Wardlow
National Drought Mitigation Center,
University of Nebraska, USA

Michael J. Hayes
National Drought Mitigation Center,
University of Nebraska, USA

Abstract

This chapter discusses the application of data mining to develop drought monitoring tools that enable
monitoring and prediction of drought’s impact on vegetation conditions. These monitoring tools help
decision makers to assess the current levels of drought-related vegetation stress and provide insight into
the possible future trends in vegetation conditions at local and regional scales, which can be used to make
knowledge-based decisions. The chapter summarizes current research using data mining approaches
(e.g., association rules and decision-tree methods) to develop these types of drought monitoring tools
and briefly explains how they are being integrated with decision support systems. Future directions in
data mining techniques and drought research are also discussed. This chapter is intended to introduce
how data mining is being used to enhance drought monitoring and prediction in the United States and to
help others understand how similar tools might be developed in other parts of the world.

Introduction

Over the past few decades, many parts of the world have experienced devastating impacts from the frequent occurrences of both short- and long-term droughts, and decision makers such as policy makers and farmers are faced with the difficult challenge of dealing with these natural disasters. Although drought characteristics are complex and the prediction of such events is difficult, decisions must still be made to manage and mitigate drought impacts whenever this natural disaster occurs. With an increase in population growth and the resultant demand for natural resources (e.g., food and water), the vulnerability of people to natural disasters such as drought has dramatically increased. As a result, droughts of identical magnitude and spatial coverage will incur more damage and greater impacts today than they would have a few decades ago.

There is a growing need for improved drought monitoring tools to assist people in making more informed drought risk management decisions. Such tools would help decision makers to implement effective responses (crisis management) that include technical, financial, and humanitarian assistance to drought-affected areas. Improved drought-related information is needed to make more efficient and effective planning and mitigation decisions. This requires new tools that can deliver more accurate and detailed drought information in a timely and reliable fashion.

Many studies have focused on developing improved drought monitoring tools that can assist in the decision-making process (Goddard, Harms, Reichenbach, Tadesse & Waltman, 2003). Most of these studies have relied on traditional statistical methods to build models based on the relationships of atmospheric, climatic, and oceanic variables to drought events. However, traditional statistical techniques are often insufficient for identifying drought and its characteristics (e.g., intensity) because of the complex interplay of these variables, which affect the occurrence, geographic extent, intensity, and duration of drought. As a result, researchers are focusing on developing drought monitoring tools using new analytical techniques that can explore these complex relationships. Recently, data mining techniques were used to develop improved drought monitoring tools and a better understanding of drought characteristics (Harms, Deogun & Tadesse, 2002; Tadesse, Wilhite, Harms, Hayes & Goddard, 2004).

The primary strength of data mining techniques is their capability to search databases for hidden patterns and find predictive information that experts may miss because it lies outside their expectations (Berry & Linoff, 2000; Cabena, Stadler, Verhees & Zanasi, 1998; Groth, 1998). In addition, data mining can be used to answer difficult questions or problems that would be too time-consuming and/or complex to resolve using traditional methods. The automated, prospective analyses offered by data mining move beyond the analyses of past climatic events commonly used for drought monitoring and allow complex relationships between many diverse variables (or indicators) to be explored for this application. Data mining tools also have the potential to predict future trends and behaviors, and this information could allow decision makers to make proactive, knowledge-driven decisions (Tadesse, Brown & Hayes, 2005a).

This chapter reviews the use of data mining techniques for drought monitoring in the United States and highlights the challenges facing this application. The chapter briefly explains the potential of data mining techniques for drought monitoring and the current research activities in developing drought monitoring tools and integration systems to enhance drought assessment and prediction for the continental United States. The chapter also presents examples of the results of this ongoing collaborative research by computer scientists, remote sensing specialists, water resources specialists, and climatologists in the central United States.
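
As a rough illustration of what searching a database for hidden patterns can mean in practice, the sketch below scores a single candidate association rule of the kind this chapter refers to (an oceanic condition followed by a drought episode) using the standard support, confidence, and lift measures. It is a minimal sketch under stated assumptions, not the method used in the studies cited here: the `Month` record, its two fields, the `rule_stats` helper, and all of the data values are hypothetical and were made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Month:
    """One hypothetical monthly record; all values below are invented."""
    ocean_index_low: bool  # assumed flag: an oceanic index fell below a threshold
    drought: bool          # assumed flag: a drought episode was observed


def rule_stats(records, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(records)
    n_a = sum(1 for r in records if antecedent(r))
    n_c = sum(1 for r in records if consequent(r))
    n_ac = sum(1 for r in records if antecedent(r) and consequent(r))
    support = n_ac / n                           # share of records matching both sides
    confidence = n_ac / n_a if n_a else 0.0      # P(consequent | antecedent)
    lift = confidence / (n_c / n) if n_c else 0.0  # confidence relative to base rate
    return support, confidence, lift


# Hypothetical 20-month history (illustrative only).
records = (
    [Month(True, True)] * 6
    + [Month(True, False)] * 2
    + [Month(False, True)] * 2
    + [Month(False, False)] * 10
)

support, confidence, lift = rule_stats(
    records,
    antecedent=lambda r: r.ocean_index_low,
    consequent=lambda r: r.drought,
)
print(f"support={support:.3f} confidence={confidence:.3f} lift={lift:.3f}")
```

A lift above 1 would suggest that drought months follow the assumed oceanic condition more often than the overall drought frequency alone would predict. A full association-rule miner would enumerate and rank many such candidate rules rather than score one chosen by hand.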

Background

Drought Monitoring

Drought is characterized by its intensity, spatial extent, and duration (Wilhite, 2000). The determination of these characteristics in real time is often complicated (Kottegoda, Natale & Raiteri, 2004; Svoboda, LeComte, Hayes, Heim, Gleason, Angel, 2002; Wilhelmi & Wilhite, 2002). The capability to characterize these different dimensions of drought is important because effective drought planning and mitigation actions require drought indicators based on sound science that provide useful information about a drought event and its impacts. Drought indicators can be based on atmospheric, hydrologic, and/or satellite observations that either directly or indirectly influence the occurrence of drought in a specific area or region. Modern technical capabilities, which include the development of computer algorithms to identify hidden patterns within multiple datasets, could help improve drought indices that are often used to make planning decisions and trigger mitigation actions.

Because of the varied and potentially catastrophic losses resulting from drought in many parts of the world, both governmental and nongovernmental decision makers need improved access to accurate and timely monitoring and prediction tools to assist them in dealing with drought more effectively and efficiently. Better early warning and prediction of drought is the foundation of the new paradigm for risk-based drought management. Technological advances (e.g., computing capabilities, algorithms, and Web-based services) will allow improved and enhanced drought monitoring tools to be developed, which will improve the ability to more effectively manage water and other shared natural resources during periods of drought.

Data Mining to Identify Drought

Large historical data sets are needed to identify relationships between different climatic, oceanic, and biophysical parameters and to distinguish patterns that may be used to predict drought. In light of this, it is essential to have an efficient way to extract information from large databases and to deliver relevant and actionable information for drought mitigation. One of the recently developed techniques relevant for such purposes is data mining.

Data mining is a process that uses a variety of data analysis tools to discover patterns and relationships among a number of variables in different data sets. This approach integrates techniques from machine learning, pattern recognition, statistics, databases, and visualization and has been used by numerous disciplines (e.g., science, business, and medicine) to address the issue of information extraction from large databases.

The data mining approach is commonly used in the commercial sector by companies to design strategies to increase profitability. For example, data mining is used to predict (or identify) consumers that are the most likely to buy certain products. Based on this information, companies can effectively identify the market demand. Data mining can also be used by businesses to understand trends in the marketplace to reduce costs and improve the timeliness of products reaching the market. Recent studies found this method to be one of the best tools to identify the patterns of supply and demand for specific products; this type of information is necessary to be profitable in a competitive market (Berry & Linoff, 2000; Cabena et al., 1998; Larose, 2005).

Data mining is also being increasingly used for environmental applications (De’ath & Fabricius, 2000; White, Kumar, & Tcheng, 2005) and holds considerable potential for identifying complex relationships among atmospheric, oceanic, and other environmental variables as they relate to droughts. The most common data mining algorithms and models, which include decision trees, associations, clustering, classification, multiple linear regression, sequential patterns, and time-series forecasting, have the potential to identify
drought patterns and characteristics. Association, clustering, and sequence discovery approaches may be useful tools for investigating and describing the occurrence and intensity of drought, while classification, regression, and time-series analyses may be appropriate for mapping and monitoring drought patterns.

One of the main challenges of data mining in drought research is interpreting model results. Models from decision trees are easier to interpret because their classification and rules structure is transparent to the user, while the results from neural networks are the least comprehensible and most difficult to understand because of the nonlinear combination of many parameters in their models and their relatively ‘black box’ modeling environment. Using a combination of different models and comparing the model outputs may provide a better understanding of the results.

Data mining techniques can identify “local” patterns better than traditional time-series analysis techniques, which largely focus on global models such as statistical correlations. The infrequent and complex nature of drought requires alternative analysis techniques that emphasize the discovery of local patterns in climate and oceanic data. For example, one may consider the occurrence of drought and its association with climatic and oceanic parameters instead of all precipitation patterns that include both dry and wet periods. In other words, since drought monitoring is particularly concerned with dry episodes, a data mining algorithm is needed that discovers the associations between oceanic and/or atmospheric conditions/patterns and the resulting drought event(s).

Other techniques can be utilized to maximize the results acquired from the data mining algorithms to identify drought characteristics. Among these techniques are geographic information systems (GIS), which have the capability to integrate geospatial data sets of different types and spatial scales, analyze this information, and present the results of the analysis in a geospatial format (Aguilar, 2002). The integration of data mining and GIS techniques into a drought monitoring tool is useful for fully exploring large, diverse databases; developing predictive drought models based on historical climate-ocean-biophysical relationships and occurrences; and applying the models in a geospatial environment to map and monitor drought patterns.

Drought characteristics can be better monitored after their patterns have been recognized using data mining algorithms. Models that are based on historical data and their patterns can be applied to near-real-time geospatial data. This capability allows more informed decisions to be made at the earliest stages of drought onset and intensification. Also, some studies have recently used data mining to build predictive models that provide outlooks of drought conditions and patterns (Tadesse et al., 2004). This information can be used for proactive drought management.

Current Research Using Data Mining

In the United States, recent research has developed drought assessment and predictive models using data mining techniques that include association rules, regression trees, and neural networks (Brown, Tadesse & Reed, 2002; Goddard et al., 2003; Harms et al., 2002; Tadesse et al., 2004; Tadesse, Wilhite, Hayes, Harms & Goddard, 2005b). To enhance the efficiency and accuracy of drought monitoring for agricultural and water resources management, these models and algorithms used databases containing oceanic, climate, biophysical (e.g., land cover, soil, and irrigation data), and satellite-based vegetation condition information. These databases can be efficiently accessed, manipulated, and integrated with data mining techniques to develop improved drought monitoring tools. In the following sections, some examples of these current research activities are presented to demonstrate the utility of data mining techniques for identifying and monitoring drought.

Drought Monitoring Tool Integrating Climate and Satellite Data

A collaborative research effort between the National Drought Mitigation Center (NDMC) and the U.S. Geological Survey’s (USGS) National Center for Earth Resources Observation Science (EROS) was recently undertaken in the United States to improve the country’s national drought monitoring capabilities. The objective of this research was to develop and implement a new drought monitoring indicator called the Vegetation Drought Response Index (VegDRI) across the conterminous United States. The VegDRI integrates climate, satellite, and other biophysical information (e.g., land cover,

late green-up. The integration of the satellite and climate data allows the VegDRI model to utilize the climate information to identify the vegetation condition anomalies related to drought.

The VegDRI model is built using a regression-tree data mining technique that incorporates information from climate-based drought indices (i.e., Standardized Precipitation Index [SPI] and Palmer Drought Severity Index [PDSI]), 1-kilometer satellite-based observations of general vegetation conditions (derived from a time series of normalized difference vegetation index [NDVI] data from the advanced very high resolution radiometer [AVHRR]), and other environmental data sets that summarize land use/land cover (LULC), soil characteristics, and the ecological setting (Tadesse et al., 2005a). Figure 1 summarizes the specific inputs and processing steps for the de-
percentage of irrigated agriculture, soil available velopment of the VegDRI and the dissemination
water capacity, and ecosystem type) in a data of the VegDRI information.
mining environment to produce a 1-kilometer Currently, a semioperational, biweekly
(km) resolution drought indicator. VegDRI product is being generated for the U.S.
The VegDRI is expected to provide improved, Northern Great Plains with plans to further ex-
more spatially precise information regarding pand coverage to the conterminous United States
drought-induced vegetation stress than tradi- (http://edc2.usgs.gov/phenological/drought/index.
tional drought indicators that are based solely html). A 1-km VegDRI map is produced at 2-week
on climate indices or satellite-based vegetation intervals to provide timely information of drought
condition observations. Drought indices based on effects on vegetation during the growing season.
meteorological station observations have a coarse Figure 2 shows an example of the VegDRI map
spatial resolution. These stations are not uniformly over fifteen states in the central and mid-west-
distributed and, as a result, the extrapolated data ern United States. The VegDRI map in Figure 2
for areas between the stations does not neces- shows the large areas of Wyoming, South Dakota,
sarily represent accurate spatial information of Colorado, Nebraska, and New Mexico that expe-
drought patterns. In contrast, the satellite data rienced severe to extreme drought in 2002. The
have uniform and continuous coverage over large high 1-km resolution of the VegDRI allows users
areas and contain spatially detailed information. to zoom in on the map to a more localized level
However, the satellite data require ground-truth (e.g., county) and identify more specific areas
(e.g., climate and biophysical evidence) to validate that are affected by drought. Moreover, overlay-
the environmental factors influencing changes ing the land cover map on the VegDRI map, it
in the vegetation condition observations over is possible to calculate the percentage of a land
time (Ji & Peters, 2003; McVicar & Bierwirth, cover type (e.g., grassland or cropland) affected
2001). Satellite observations highlight areas with during a drought event. Such information can
anomalous vegetation conditions that might be be utilized by agricultural producers and other
caused by drought, flooding, pest infestations, or


Figure 1. This flow chart shows the data inputs and model building processes for VegDRI and VegOut,
as well as the dissemination mechanism for the information.
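The regression-tree step summarized in Figure 1 can be illustrated with a small sketch. The code below is not the VegDRI implementation: it fits a generic CART-style regression tree (each split is chosen to minimize the summed squared error of the two child nodes) to a handful of invented records. The feature names (`spi`, `ssg`, `irrigated_pct`), the target values, and all function names are assumptions introduced only for illustration.

```python
"""Toy CART-style regression tree; an illustrative sketch, not the VegDRI model."""

# Hypothetical training records: (spi, ssg, irrigated_pct) -> drought index.
# Every value below is invented for illustration.
X = [
    (-2.0, -1.5, 5.0),
    (-1.5, -1.0, 10.0),
    (-1.0, -0.8, 60.0),
    (0.0, 0.1, 20.0),
    (0.5, 0.4, 15.0),
    (1.0, 0.9, 30.0),
    (1.5, 1.2, 25.0),
    (2.0, 1.6, 40.0),
]
y = [-3.5, -2.8, -1.0, 0.0, 0.6, 1.5, 2.2, 3.0]


def mean(values):
    return sum(values) / len(values)


def sse(values):
    """Sum of squared errors around the mean of the values."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(rows, targets):
    """Return (feature_index, threshold) giving the lowest total SSE, or None."""
    best, best_err = None, sse(targets)
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows}):
            left = [yi for r, yi in zip(rows, targets) if r[j] <= t]
            right = [yi for r, yi in zip(rows, targets) if r[j] > t]
            if not left or not right:
                continue  # a split must put records on both sides
            err = sse(left) + sse(right)
            if err < best_err - 1e-12:
                best, best_err = (j, t), err
    return best


def build_tree(rows, targets, depth=0, max_depth=2):
    """Grow the tree recursively; each leaf stores the mean target value."""
    split = best_split(rows, targets) if depth < max_depth and len(rows) >= 2 else None
    if split is None:
        return {"leaf": mean(targets)}
    j, t = split
    left = [(r, yi) for r, yi in zip(rows, targets) if r[j] <= t]
    right = [(r, yi) for r, yi in zip(rows, targets) if r[j] > t]
    return {
        "feature": j,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [v for _, v in left], depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [v for _, v in right], depth + 1, max_depth),
    }


def predict(tree, row):
    """Follow the splits from the root until a leaf is reached."""
    while "leaf" not in tree:
        tree = tree["left"] if row[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["leaf"]


tree = build_tree(X, y)
dry_pixel = predict(tree, (-1.8, -1.2, 8.0))
wet_pixel = predict(tree, (1.8, 1.4, 35.0))
print(f"dry pixel: {dry_pixel:.2f}, wet pixel: {wet_pixel:.2f}")
```

In an operational setting, the same idea would be applied per 1-km pixel with the full set of climate, satellite, and biophysical inputs; the depth limit here simply keeps the toy tree readable.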

decision makers for a variety of drought planning and mitigation activities.

Vegetation Outlook: Integrating Climate and Satellite Data Using Data Mining

Research has also been undertaken to use the predictive capability of data mining techniques to develop a tool called the Vegetation Outlook (VegOut), which provides an outlook of the vegetation conditions a few weeks in advance during the growing season. The VegOut map is similar to the VegDRI map but shows a general outlook of the vegetation conditions in the upcoming weeks. Inputs, processing steps, and dissemination mechanisms similar to those shown in Figure 1 are also used for VegOut. The main objective of the VegOut research is to develop a tool through the use of data mining and knowledge discovery techniques that will enable the anticipation of drought conditions and the assessment of consequential landscape and vegetation response at local scales based on ocean-atmosphere-land interactions and their relationships with drought. Figure 3(a) shows an experimental VegOut map that provides a 2-week outlook of vegetation conditions expressed as Standardized Seasonal Greenness (SSG) for the central United States for July 25, 2006. The actual SSG observed by satellite on July 25 is shown in Figure 3(b). The SSG patterns predicted 2 weeks in advance in the VegOut map are in strong agreement with the SSG patterns observed by satellite on July 25


Figure 2. (a) Vegetation Drought Response Index (VegDRI) map for July 25, 2002. VegDRI shows the
severe drought conditions that plagued most of Colorado, New Mexico, and Wyoming and the western
parts of Kansas, Nebraska, and South Dakota in 2002. The favorable vegetation conditions that oc-
curred in the eastern part of this area (Illinois, Iowa, Minnesota, Missouri, and Wisconsin) during that
same time were also represented in the VegDRI map. (b) The 15-state study area within the United
States is highlighted in grey.
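The chapter notes that overlaying a land cover map on the VegDRI map makes it possible to calculate the percentage of a land cover type affected during a drought event. A minimal sketch of that overlay follows, assuming two co-registered grids of equal shape. The grid values, the class names, and the choice to count only "severe" and "extreme" cells as affected are all hypothetical and are not taken from the VegDRI product.

```python
"""Percentage of each land cover class under drought; an illustrative sketch."""
from collections import Counter

# Hypothetical co-registered grids (one value per 1-km cell); values are invented.
vegdri = [
    ["extreme", "severe", "normal", "normal"],
    ["moderate", "moderate", "normal", "normal"],
    ["severe", "severe", "moderate", "normal"],
]
land_cover = [
    ["grassland", "grassland", "cropland", "cropland"],
    ["grassland", "cropland", "cropland", "forest"],
    ["grassland", "cropland", "cropland", "forest"],
]

# Assumption: a cell counts as drought-affected if it falls in one of these classes.
AFFECTED = {"severe", "extreme"}


def percent_affected(drought_grid, cover_grid, affected=AFFECTED):
    """Return {land cover class: percent of its cells that are drought-affected}."""
    totals, hits = Counter(), Counter()
    for drought_row, cover_row in zip(drought_grid, cover_grid):
        for drought_class, cover_class in zip(drought_row, cover_row):
            totals[cover_class] += 1
            if drought_class in affected:
                hits[cover_class] += 1
    return {cover: 100.0 * hits[cover] / totals[cover] for cover in totals}


result = percent_affected(vegdri, land_cover)
for cover, pct in sorted(result.items()):
    print(f"{cover}: {pct:.1f}% of cells affected")
```

With real data the two grids would come from raster files on a common projection and resolution; the per-class counting logic stays the same.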

for most areas in the central United States. The correlation (r²) between the actual SSG and the predicted SSG from this experimental VegOut model was 0.98.

In the VegOut research, Tadesse et al. (2005a) showed that future outlooks of drought-induced vegetation stress patterns could be delivered up to six weeks in advance and at a high spatial resolution (1-km²). According to Tadesse et al. (2005a), the accuracy of the prediction is lower (r² = 0.44) for the early phenological phases because of instability of vegetation condition close to the start of the growing season. However, the correlation (r²) between the predicted VegOut model results and the actual satellite-observed SSG values ranged from 0.67 to 0.85 during the remainder of the growing season. VegOut’s predictive capability improved during this period because the vegetation activity is usually more stable during the maturity and senescence (i.e., desiccation and leaf-drop) phases of the seasonal growth cycle.

VegOut is designed to complement the “current” drought condition information represented in the VegDRI products by projecting drought conditions and patterns into the future, which could assist decision makers in both their short- and long-term planning.

Research is currently underway to build on the initial VegOut research by Tadesse et al. (2005a) and test the hypothesis that global oceanic conditions (e.g., El Niño or La Niña) are an important precursor to drought conditions over the continental United States. A better understanding and


Figure 3. (a) The Vegetation Outlook (VegOut) map that predicted general vegetation conditions (ex-
pressed as standardized seasonal greenness – SSG) 2 weeks prior to the biweekly period ending on July
25, 2006. (b) The observed SSG for the biweekly period ending 25 July 2006 that was derived from
the satellite data. The maps show general agreement in the intensity and pattern of vegetation condi-
tions except for some localized areas of severe to extreme vegetation stress over northeastern Montana,
southern North Dakota, western South Dakota, and central Nebraska.
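The agreement between predicted and observed SSG is reported in the chapter as a correlation (r²). The chapter does not state the exact formula, so the sketch below assumes the squared Pearson correlation coefficient computed over paired pixel values. The sample values and the function name `r_squared` are hypothetical.

```python
"""Squared Pearson correlation between predicted and observed SSG; a sketch."""


def r_squared(predicted, observed):
    """Return the squared Pearson correlation of two equal-length sequences."""
    n = len(predicted)
    if n != len(observed) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_p == 0 or var_o == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov * cov / (var_p * var_o)


# Invented SSG values for six pixels: the 2-week outlook and the later observation.
predicted_ssg = [-1.2, -0.5, 0.0, 0.4, 1.1, 1.8]
observed_ssg = [-1.0, -0.6, 0.1, 0.3, 1.3, 1.6]

score = r_squared(predicted_ssg, observed_ssg)
print(f"r^2 between predicted and observed SSG: {score:.3f}")
```

A value near 1 indicates that the outlook reproduces the spatial pattern of the observation; localized disagreements such as those noted in the Figure 3 caption lower the score only slightly when they cover few pixels.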

representation of the oceanic-climate-drought relationship in the VegOut model will improve our predictive capabilities for providing useful short-term outlooks (i.e., 2-, 4-, and 6-week) of drought-induced vegetation stress. Different approaches and techniques, including association rules, regression-tree methods, and case-based reasoning (CBR), are being tested to build better predictive models. To enhance the existing VegOut model, a regression-tree method is being tested with scenarios that are based on the probability of occurrence of precipitation, and with El Niño/La Niña conditions (teleconnection) that are based on changes in ocean-atmosphere dynamics of the Pacific and Atlantic oceans. For this reason, oceanic variables based on Pacific and Atlantic observations are also being tested to identify patterns that may be used for predicting drought.

This research is built on previous studies (Brown et al., 2002; Goddard et al., 2003; Harms et al., 2002; Tadesse et al., 2004; Tadesse et al., 2005a) that have shown the importance of data mining in exploring and discovering the relationships between ground-based climatic observations and oceanic observations to enable a better understanding of their causes and effects as they pertain to drought.

This current VegOut research is expected to lead to a greater understanding of the drought-related relationships among remote sensing, climatic, oceanic, and biophysical data, which can be used to improve the spatial and temporal resolutions of our drought monitoring and prediction capabilities.

Identifying Drought Using Association Rules

Another approach uses association rules to identify drought at a specific weather station


or geographic location. Tadesse et al. (2005b) developed a new data mining algorithm called the minimal occurrences with constraints and time lags (MOWCATL) to identify relationships between oceanic parameters and drought indices at a specific location. Rather than using traditional global statistical associations, the MOWCATL algorithm separates historical drought episodes from periods of normal and wet conditions and then uses these drought episodes to find time-lagged relationships with oceanic parameters. As with all association-based data mining algorithms, MOWCATL is used to find existing relationships between historical drought events and the corresponding oceanic signals in the data, and is not by itself a prediction tool.

Using the MOWCATL algorithm (Tadesse et al., 2005b), the analyses of the rules generated for selected stations and state-averaged precipitation and temperature data for Nebraska from 1950 to 1999 indicated that most occurrences of drought were preceded by positive values of the Southern Oscillation Index (SOI), negative values of the Multivariate ENSO Index (MEI), negative values of the Pacific/North American (PNA) Index, negative values of the Pacific Decadal Oscillation (PDO), and negative values of the North Atlantic Oscillation (NAO). The frequency and confidence of the time-lagged relationships found between these five oceanic indices and drought at the selected stations in Nebraska indicate that oceanic parameters can be used as indicators of drought in the central United States.

Future Trends

As has been shown, identifying patterns of drought characteristics using climatic, oceanic, and satellite data and finding their associations with vegetation conditions is of great importance for drought monitoring. New drought monitoring tools that integrate these diverse types of information should also be utilized in early warning systems (EWS) in an effort to more effectively and efficiently use available resources to reduce the impacts of drought.

In an effort to help build an integrated EWS, researchers are continuing to: (1) investigate the relationships between drought and oceanic parameters over the continental United States and use this information to identify triggering mechanisms for the onset, continuance, and end of drought at regional, state, and local levels; (2) build predictive models to assess and predict drought using an integrated database of satellite, climate, oceanic, and biophysical information; and (3) evaluate the results from these predictive models using crop yield data in an effort to identify and predict the impacts of drought on agriculture. These types of studies will continue to improve drought risk analysis through the creation of knowledge-based decision-making tools that can be integrated with other drought risk management systems (both implemented and in development).

Progress in solving these drought risk management challenges has reached a stage where the value of data mining approaches has been established, but further research is needed to fully explore the utility of these methods. The computer science community is also challenged to develop new, alternative data mining techniques that can be used to better characterize the complex environmental relationships associated with drought and produce improved drought indicators. In the future, researchers need to continue to work on bringing data mining and GIS together for meteorological and environmental data exploration, analysis, and visualization. This combination will allow a much higher level of interaction between users and database systems. As the integration of data mining and GIS technologies improves, so will our capabilities to solve more complex environmental problems and provide more effective management and monitoring tools.

At present, the VegDRI and VegOut models are experimental and have only been implemented over the central United States. Research is currently underway to expand the models to other parts of the United States. These models may also be tested and applied internationally with considerable potential to improve researchers’ monitoring capabilities in developing parts of the world. However, international expansion of both models will be heavily dependent on the availability of similar climatic, satellite, oceanic, and biophysical data that are needed as input variables into both models. The models’ input variables may need to be modified or substituted by other variables depending on data availability and their relationship(s) with vegetation conditions in a specific country or region in the world.

Conclusion

Accurate and timely assessment of the onset, spatial extent, and severity of drought is critical for responding to a multitude of environmental and socio-economic impacts. Identifying the triggering mechanisms and associations of various parameters related to drought occurrences is an important task in improving drought monitoring and prediction and the resultant tools and information that are available to decision makers. Currently, existing tools are limited in their ability to predict drought, and development of such tools is a research priority.

Increased understanding of drought patterns and characteristics can improve the development and implementation of planning and mitigation actions. One of the challenges in understanding drought is the difficulty of extracting meaningful information from large volumes of data for numerous climate and hydrologic variables and indices, which are produced at a variety of spatial (e.g., station, climate division, or regional) and temporal (e.g., weekly, biweekly, or monthly) scales. As mentioned in this chapter, data mining has proven useful in analyzing these large collections of diverse data sets and gleaning improved drought-related information, as compared to traditional drought indicators and tools.

Data mining techniques have demonstrated great potential in identifying drought characteristics and their spatial and temporal patterns as well as finding the association of these characteristics with oceanic processes, which can be used to improve both drought monitoring and prediction. The use of such techniques will help drought researchers and policy makers to: (1) develop effective drought monitoring capabilities, (2) improve water management based on area-specific drought predictions, (3) improve the allocation of human and financial resources during drought events, (4) improve financial protection strategies such as crop insurance, (5) implement effective drought policies to reduce vulnerability to drought, and (6) develop alternative food supply options in relation to drought hazards.

The VegDRI is an example of a current drought monitoring approach that uses data mining techniques. By collectively considering a number of climate and biophysical variables, this tool identifies drought-induced vegetation stress that cannot be identified solely by traditional climate drought indices and satellite vegetation indices. Data mining has proven to be an effective and efficient means of integrating and analyzing this diverse collection of variables for drought characterization. Data mining techniques also offer the potential to move beyond monitoring and begin to predict drought conditions, as was demonstrated in this chapter by the VegOut and MOWCATL tools that have been recently developed. The availability of these advanced analytical tools allows researchers to enhance the drought characterization capabilities by exploring and analyzing diverse, and often complex, sets of information in ways that were not possible with more traditional analysis techniques.

These new and improved drought monitoring tools can be used to provide improved information for more effective drought planning, management, and risk analysis. They can also be used
ment, and risk analysis. They can also be used


to develop a better understanding of the impacts of drought on available resources, which can assist decision makers in taking appropriate and timely mitigation actions. Lastly, these tools can be utilized for policy decisions related to sustainable development and preparation for future challenges in resource-limited areas.

References

Aguilar, A. M. (2002). Integrating GIS, circular statistics and KDSD for modelling spatial data: A case study. Geographical and Environmental Modelling, 6(1), 5-25.

Berry, J. A. & Linoff, G. (2000). Mastering data mining: The art and science of customer relationship management. New York: John Wiley & Sons, Inc.

Brown, J. F., Tadesse, T., & Reed, B. C. (2002). Integrating satellite data and climate data for US drought mapping and monitoring. In Proceedings of the 15th Conference on Biometeorology and Aerobiology joint with 16th International Congress on Biometeorology (pp. 147-150). Kansas City, Missouri.

Cabena, P. H., Stadler, R., Verhees, J., & Zanasi, A. (1998). Discovering data mining: From concept to implementation. New Jersey: IBM.

De’ath, G. & Fabricius, K. E. (2000). Classification and regression trees: A powerful yet simple technique for ecological data analysis. Ecology, 81(11), 3178-3192.

Goddard, S., Harms, S. K., Reichenbach, S. E., Tadesse, T., & Waltman, W. J. (2003). Geospatial decision support for drought risk management. Communications of the ACM, 46(1), 35-37.

Groth, R. (1998). Data mining: A hands-on approach for business professionals. New Jersey: Prentice Hall.

Harms, S. K., Deogun, J., & Tadesse, T. (2002). Discovering sequential association rules with constraints and time lags in multiple sequences. Lecture notes in artificial intelligence 2366: Foundations of intelligent systems. In Proceedings of the 13th International Symposium on Methodologies for Intelligent Systems (pp. 432-441). Lyon, France.

Ji, L. & Peters, A. J. (2003). Assessing vegetation response to drought in the northern Great Plains using vegetation and drought indices. Remote Sensing of Environment, 87, 85-98.

Kottegoda, N. T., Natale, L., & Raiteri, E. (2004). Some considerations of periodicity and persistence in daily rainfalls. Journal of Hydrology, 296(1-4), 23-37.

Larose, D. T. (2005). Discovering knowledge in data: An introduction to data mining. New Jersey: John Wiley & Sons, Inc.

McVicar, T. R. & Bierwirth, P. N. (2001). Rapidly assessing the 1997 drought in Papua New Guinea using composite AVHRR imagery. International Journal of Remote Sensing, 22, 2109-2128.

Svoboda, M., LeComte, D., Hayes, M., Heim, R., Gleason, K., Angel, J., Rippey, B., Tinker, R., Palecki, M., Stooksbury, D., Miskus, D., & Stephens, S. (2002). The drought monitor. Bulletin of the American Meteorological Society, 83(8), 1181-1190.

Tadesse, T., Brown, J. F., & Hayes, M. J. (2005a). A new approach for predicting drought-related vegetation stress: Integrating satellite, climate, and biophysical data over the U.S. central plains. ISPRS Journal of Photogrammetry and Remote Sensing, 59(4), 244-253.

Tadesse, T., Wilhite, D. A., Hayes, M. J., Harms, S. K., & Goddard, S. (2005b). Discovering associations between climatic and oceanic parameters to monitor drought in Nebraska using data-mining techniques. Journal of Climate, 18(10), 1541-1550.

Tadesse, T., Wilhite, D. A., Harms, S. K., Hayes, M. J., & Goddard, S. (2004). Drought monitoring using data mining techniques: A case study for Nebraska, USA. Natural Hazards, 33(1), 137-159.

White, A. B., Kumar, P., & Tcheng, D. (2005). A data mining approach for understanding topographic control on climate-induced inter-annual vegetation variability over the United States. Remote Sensing of Environment, 98, 1-20.

Wilhite, D. A. (2000). Drought as a natural hazard: Concepts and definitions. In D. A. Wilhite (Ed.), Drought: A global assessment (Vol. 1, pp. 3-18). London: Routledge Publishers.

Wilhelmi, O. V. & Wilhite, D. A. (2002). Assessing vulnerability to agricultural drought: A Nebraska case study. Natural Hazards, 25(1), 37-58.

Compilation of References

Abbass, H. A., Sarker, R. A., & Newton, C. S. (Eds.) Adamo, Jean-Marc (2001). Data mining for association
(2002). Data mining: A heuristic approach. Hershey, rules and sequential patterns: Sequential and parallel
PA: IGI Global. algorithms. Springer Verlag

Abbott, J. (2001). Data data everywhere – and not a byte AfricaGIS (2005). Conference resolutions draft. Re-
of use? Qualitative Market Research: An International trieved April 13, 2008, from http://www.africagis2005.
Journal, 4(3), 182-192. org.za/agp/africagispapers/AfricaGIS 2005 Resolutions
draft 041105.doc
Abiteboul, S. (1997). Querying semi-structured data. In
Proceedings of the International Conference on Database Agosta, L. (2000). The essential guide to data warehous-
Theory, Delphi, Greece. ing. Upper Saddle River, NJ: Prentice Hall.

Ackland, R. & Gibson, R. (2004). Mapping political Agrawal, R. & Srikant, R. (1994). Fast algorithms for
party networks on the WWW. In Proceedings of the mining association rules. In J. B. Bocca, M. Jarke & C.
Australian Electronic Governance Conference, Mel- Zaniolo (Eds.), Proceedings of the 20th International
bourne, Australia. Conference on Very Large Data Bases (pp. 487-499).
San Francisco: Morgan Kaufmann Publishers.
Ackland, R. (2005). Estimating the size of political Web
graphs. Revised paper presented to ISA Research Com- Agrawal, R. & Srikant, R. (1995). Mining sequential
mittee on Logic and Methodology Conference. Retrieved patterns. In Proceedings of the 1995 International Con-
April 10, 2008, from http://acsr.anu.edu.au/staff/ackland/ ference Data Engineering, Taipei, Taiwan.
papers/political_ web_graphs. pdf
Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining
Ackland, R. (2005). Mapping the U.S. political blogo- association rules between sets of items in large databases.
sphere: Are conservative bloggers more prominent? In P. Buneman & S. Jajdia (Eds.), Proceedings of the 1993
Paper presented to BlogTalk Downunder, Sydney. ACM SIGMOD International Conference on Manage-
Retrieved April 10, 2008, from http://acsr.anu.edu.au/ ment of Data (pp. 207-216). New York: ACM Press.
staff/ackland/papers/polblogs.pdf
Aguilar, A. M. (2002). Integrating GIS, circular statistics
Ackoff, R. L. (1989). From data to wisdom. Journal of and KDSD for modelling spatial data: A case study. Geo-
Applied Systems Analysis, 16, 3-9. graphical and Environmental Modelling, 6(1), 5-25.

Adafre, S. F. & Rijke, M. D. (2005). Discovering missing Aksoy, S., Koperski, K., Tusk, C. & Marchisio, G. (2004).
links in Wikipedia. In Proceedings of the 3rd International Interactive training of advanced classifiers for mining
Workshop on Link Discovery (pp. 90-97). ACM Press. remote sensing image archives. In Proceedings of the
ACM International Conference on Knowledge Discovery
and Data Mining (pp. 773-782). Seattle, Washington.

Albertoni, R., Bertone, A., & De Martino, M. A. (2003). Américo, M. C. S., Vieira, I. C. G., Veiga, J. B. & Araujo,
Visualization-based approach to explore geographic R. (in press) Pecuária e Amazônia: Estratégias sociais e
metadata. In Proceedings of the 11th International reestruturação do território nas frentes pioneiras: Rodo-
Conference in Central Europe on Computer Graphics, via PA-279 e região da Terra do Meio no Pará [Cattle
Visualization and Computer Vision, WSCG 2003, Plzen- ranching and Amazonia: Social strategies and territory
Bory, Czech Republic. reorganization in new frontiers – PA-279 and Terra do
Meio region in Pará state]. In R. Araujo & P. Lena (Eds.),
Alexandre, F. (1997). Connectionist-symbolic integra-
Alternativas de desenvolvimento sustentável na Amazô-
tion: From unified to hybrid approaches. Mahwah, NJ:
nia: Experiências recentes [Alternatives of sustainable
Lawrence Erlbaum Associates.
development in Amazônia: Recent experiences].
Ali, K., Manganaris, S., & Srikant, R. (1997). Partial
Anahory, S., & Murray, D. (1997). Data warehousing in
classification using association rules. In D. Hecker-
the real world: A practical guide for building decision
man, H. Mannila, D. Pregibon & R. Uthurusamy (Eds.),
support systems. Harlow, UK: Addison-Wesley.
Proceedings of the Third International Conference on
Knowledge Discovery and Data Mining (pp. 115-118). Anandarajan, M., Picheng, L. & Anandarajan, M. (2001).
Menlo Park, CA: AAAI Press. Bankruptcy prediction of financially stressed firms: An
examination of the predictive accuracy of artificial neural
Allenby, B. R., Compton, W. D., & Richards, D. J. (2007).
networks. International Journal of Intelligent Systems in
Information systems and the environment overview and
Accounting, Finance and Management, 10, 69-81.
perspectives. Retrieved April 13, 2008, from http://books.
nap.edu/openbook.php?record_id=6322&page=1 Andrew, M. (2005). The role of research in sustainable
tourism policy-making. Paper presented at the First
Allot (2005). The traffic management handbook. MN:
Regional Sustainable Tourism Policy and Intersectoral
Allot Communications Ltd.
Planning Workshop Grand Barbados Hotel, Barbados,
Alon et al. (1999). Broad patterns of gene expression West Indies.
revealed by clustering analysis of tumor and normal colon
Apoteker, T. & Barthelemythierry, S. (2005). Predicting
tissues probed by oligonucleotide arrays. In Proceedings
financial crises in emerging markets using a composite
of the National Academy of Sciences.
non-parametric model. Emerging Markets Review, 6(4),
Alpaydin, E. (2004). Introduction to machine learning 363-375.
(adaptive computation and machine learning). The
Aspncs, J. (2002). Randomized protocols for asynchro-
MIT Press.
nous consensus, Ref.
Altman, E., Haldeman, G., & Narayanan, P. (1977).
Avgerou, C. (2002). Information systems and global
Zeta analysis: A new model to identify bankruptcy
diversity. London, UK: Oxford University Press.
risk of corporations. Journal of Banking and Finance,
June, 29-54. Badia, A. & Kantardzik, M. (2005). Graph building as a
mining activity: Finding links in the small. In Proceed-
Alvarez, G. (2004). What’s missing from RFID tests.
ings of the 3rd International Workshop on Link Discovery
Information Week. Retrieved November 20, 2004, from
(pp. 17-24). ACM Press.
http://www.informationweek.com/story/showArticle.
jhtml?articleID=52500193 Bailey, T. C. (1994). A review of statistical spatial analysis
in geographical information systems. In A. S. Fothering-
Alves, D. S. (2002). Space-time dynamics of deforesta-
ham & P. A. Rogerson (Ed.), Spatial analysis and GIS
tion in Brazilian Amazonia. International Journal of
(pp. 14-44). London, UK: Taylor and Francis.
Remote Sensing, 23(14).

Compilation of References

Baldi, P., Frasconi, P., & Smyth, P. (2003). Modeling ACM SIGACT-SIGOPS Symposium on Principles of
the Internet and the Web: Probabilistic methods and algorithms. West Sussex, UK: John Wiley.

Basheer, I. A., Reddi, L. N., & Najjar, Y. M. (1996). Site characterization by NeuroNets: An application to the landfill siting problem. Ground Water, 34, 610-617.

Bauer, E., & Kohavi, R. (1999). An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36, 105-139.

Baum, E. B., & Haussler, D. (1989). What net size gives valid generalisation? Neural Computation, 1(1), 151-160.

Beaver, W. (1966). Financial ratios as predictors of failure. Journal of Accounting Research, pp. 71-111.

Beck, U. (2000). What is globalization? Cambridge, UK: Polity Press.

Becker, B. (1997). Amazonia. São Paulo: Atica.

Bedard, Y., Gosselin, P., Rivest, S., Proulx, M., Nadeau, M., Lebel, G., & Gagnon, M. (2003). Integrating GIS components with knowledge discovery technology for environmental health decision support. International Journal of Medical Informatics, 70, 79-94.

Bellaachia, A., Portnoy, D., Chen, Y. & Elkahloun, A. G. (2002). E-CAST: A data mining algorithm for gene expression data. In Proceedings of the BIOKDD02: Workshop on Data Mining in Bioinformatics (with SIGKDD02 Conference), Edmonton, Alberta, Canada.

Bellaachia, A., Vommina, E., & Berrada, B. (2006). Minel: A framework for mining e-learning logs. In Proceedings of the 5th IASTED International Conference on Web-based Education (pp. 259-263). Puerto Vallarta, Mexico.

Ben-Dor, A., Shamir, R. & Yakhini, Z. (1999). Clustering gene expression patterns. Journal of Computational Biology, 6(3/4), 281–297.

Ben-Or, M. (1983). Another advantage of free choice: Completely asynchronous agreement protocols (extended abstract). In Proceedings of the Second Annual ACM Symposium on Principles of Distributed Computing (pp. 27–30), Montreal, Quebec, Canada.

Bentley, R. (2005, August 25). Data with destiny. Caterer & Hotelkeeper, 38.

Berendt, B. (2002). Using site semantics to analyze, visualize and support navigation. Data Mining and Knowledge Discovery, 6, 37-59.

Berendt, B., Hotho, A., & Stumme, G. (2002). Towards semantic web mining. Lecture Notes in Computer Science (vol. 2342, pp. 264-278).

Berler, A., Pavlopoulos, S., & Koutsouris, D. (2005). Using key performance indicators as knowledge-management tools at a regional health-care authority level. IEEE Trans Inf Technol Biomed, 9(2), 184-192.

Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The Semantic Web. Scientific American, 284(5), 34-43.

Berry, J. A. & Linoff, G. (2000). Mastering data mining: The art and science of customer relationship management. New York: John Wiley & Sons, Inc.

Berry, M. J. A. & Linoff, G. S. (1997). Data mining techniques for marketing, sales and customer support. John Wiley & Sons.

Berry, M. J. A. & Linoff, G. S. (1999). Mastering data mining: The art and science of customer relationship management. John Wiley & Sons.

Berry, M. J. A. & Linoff, G. S. (2000). Mastering data mining. John Wiley & Sons.

Berry, M. J. A. & Linoff, G. S. (2002). Mining the Web: Transforming customer data. John Wiley & Sons.

Berry, M. J. A. & Linoff, G. S. (2004). Data mining techniques: For marketing, sales, and customer relationship management. Wiley Computer Publishing.

Berson, A., Smith, S. J., & Thearling, K. (1999). Building data mining applications for CRM. McGraw Hill.

Berson, A. & Smith, S. J. (1997). Data warehousing, data mining, and OLAP. McGraw Hill.

Compilation of References

Berthold, M. & Hand, D. J. (1999). Intelligent data analysis: An introduction. Springer Verlag.

BESR (2004). Board on Earth Sciences and Resources (BESR), Future challenges for the U.S. Geological survey’s mineral resources program (2004). Washington, D.C.: The National Academies Press.

Bhat, N., & McAvoy, T. J. (1990). Use of neural nets for dynamic modelling and control of chemical process systems. Computer Chemical Engineering, 14(4/5), 573-583.

Bhattacharya, I. & Getoor, L. (2004). Deduplication and group detection using links. In Proceedings of the SIGKDD Workshop on Link Analysis and Group Detection, Seattle, WA.

Bins, L., Fonseca, L. & Erthal, G. (1996). Satellite imagery segmentation: A region growing approach. In Proceedings of the 8th Brazilian Symposium on Remote Sensing (pp. 1-4).

BIS, The Bank for International Settlements. (2006). Basel II: Revised international capital framework. Retrieved April 13, 2008, from http://www.bis.org/publ/bcbsca.htm.

Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford, UK: Oxford University.

Bishop, C. M. (2003). Neural networks for pattern recognition. Oxford University Press.

Bitzenis, A. & Nito, E. (2005). Obstacles to entrepreneurship in a transition business environment: The case of Albania. Journal of Small Business and Enterprise Development, 12(4), 564-578.

Blackman, R. B. and Tukey, J. W. (1958). The measurements of power spectra. New York: Dover Publications, Inc.

Blaschke, A. (2001). Environmental monitoring and management of protected areas through integrated ecological information systems: An EU perspective. In C. Rautenstrauch & S. Patig (Eds.), Environmental information systems in industry and public administration (pp. 75-100). Hershey, PA: Idea Group Publishing.

Blazewicz, J., & Kasprzak, M. (2003). Determining genome sequences from experimental data using evolutionary computation. In G. G. Fogel & D. W. Corne (Eds.), Evolutionary computation in bioinformatics (pp. 41-58). San Francisco: Morgan Kaufmann.

Bodie, Z., Kane, A., Marcus, A. J., & Ryan, P. J. (2003). Investments (Fourth Canadian Edition). Toronto, ON (Canada): McGraw-Hill Ryerson Limited.

Boritz, E. J., & Kennedy, D. (1995). Effectiveness of neural network types for prediction of business failure. Expert Systems with Applications, 9, 503-512.

Botschen, G., Thelen, E. M. & Pieters, R. (1999). Using means-end structures for benefit segmentation: An application to services. European Journal of Marketing, 33(1/2).

Boulicaut, J.-F., Esposito, F., Giannotti, F. & Pedreschi, D. (Eds.) (2004). Knowledge discovery in databases. In Proceedings of the PKDD 2004: 8th European Conference on Principles and Practice of Knowledge Discovery in Databases, Pisa, Italy.

Bozdogan, H. (Ed.) (2004). Statistical data mining and knowledge discovery. CRC Press.

Bracha, G. & Rachman, O. (1992). Randomized consensus in expected O(n² log n) operations. In S. Toueg, P. G. Spirakis & L. M. Kirousis (Eds.), Lecture notes in computer science (Vol. 579, pp. 143–150). Delphi, Greece: Springer. Retrieved April 12, 2008, from http://www.cs.yale.edu/homes/aspnes/randomized-consensus-survey.pdf

Brachman, R. J. & Anand, T. (1996). The process of knowledge discovery in databases. In U. Fayyad, G. Piatetsky-Shapiro, P. Smyth & R. Uthurusamy (Eds.), Advances in knowledge discovery and data mining (pp. 37-57). Cambridge, MA: AAAI/MIT Press.

Bradbury, D. (2005, August 31). Technology Jargon Buster. Caterer & Hotelkeeper.

Bradley, P. S., Fayyad, U. M., & Mangasarian, O. L. (1999). Mathematical programming for data mining: Formulations and challenges. INFORMS Journal on Computing, 11, 217-238.


Braha, D. (Ed.) (2001). Data mining for design and manufacturing: Methods and applications. Kluwer Publishers.

Bramer, M. (2007). Principles of data mining: Undergraduate topics in computer science. London, UK: Springer-Verlag.

Bramer, M. A. (Ed.) (1999). Knowledge discovery and data mining: Theory and practice. IEE Books.

Briassoulis, H. (2004). Analysis of land use change: Theoretical and modeling approaches. Retrieved April 8, 2008, from http://www.rri.wvu.edu/WebBook/Briassoulis

Brin, S. & Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine. In Proceedings of the 7th International World Wide Web Conference, Elsevier Science (pp. 107-117), New York.

Brin, S., Motwani, R., Ullman, J. D., & Tsur, S. (1997). Dynamic itemset counting and implication rules for market basket data. In J. Peckham (Ed.), Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data (pp. 255-264). New York: ACM Press.

Brown, J. F., Tadesse, T., & Reed, B. C. (2002). Integrating satellite data and climate data for US drought mapping and monitoring. In Proceedings of the 15th Conference on Biometeorology and Aerobiology joint with 16th International Congress on Biometeorology (pp. 147-150). Kansas City, Missouri.

Buchner, A. G., Mulvenna, M. D., Anand, S. S. & Hughes, J. G. (1999). Navigation pattern discovery from Internet data. In Proceedings of the Web Usage Analysis and User Profiling Workshop (pp. 25-30), San Diego, CA.

Bukvic, V. & Bartlett, W. (2003). Financial barriers to SME growth in Slovenia. Economic and Business Review, 5(3), 161-181.

Burdick, D., Calimlim, M., & Gehrke, J. (2001). MAFIA: A maximal frequent itemset algorithm for transactional databases. In Proceedings of the 17th International Conference on Data Engineering (pp. 443-452). Los Alamitos, CA: IEEE Computer Society Publications.

Burn, J. M. & Loch, K. D. (2001). The societal impact of the World Wide Web: Key challenges for the 21st century. Information Resources Management Journal, 14(4), 4-14.

Business intelligence: Aspectos e tendências do uso de ferramentas de análise corporativa [Business intelligence: Aspects and trends in the use of corporate analysis tools]. Retrieved March 12, 2003, from www.idcbrasil.com.br.

Buytendijk, F. (2001). Strategic BI: Its definition and effect on infrastructure. Gartner Group.

C 5.0. (2004). Retrieved from http://www.rulequest.com/see5-info.html

Cabena, P. H., Stadler, R., Verhees, J., & Zanasi, A. (1998). Discovering data mining: From concept to implementation. New Jersey: IBM.

Cabena, P., Hadjinian, P., Stadler, R., Verhees, J., & Zanasi, A. (1997). Discovering data mining: From concept to implementation. Upper Saddle River, NJ: Prentice Hall PTR.

Cai, C. H., Fu, A. W. C., Cheng, C. H., & Kwong, W. W. (1998). Mining association rules with weighted items. In B. Eaglestone, B. C. Desai & J. Shao (Eds.), Proceedings of the 1998 International Database Engineering and Application Symposium (pp. 68-77). Los Alamitos, CA: IEEE Computer Society Publications.

Cai, D., Shao, Z., He, X., Yan, X., & Han, J. (2005). Mining hidden community in heterogeneous social networks. In Proceedings of the 3rd International Workshop on Link Discovery (pp. 58-65). ACM Press.

Câmara, G., Souza, R., Freitas, U. & Garrido, J. (1996). SPRING: Integrating Remote Sensing and GIS with object-oriented data modelling. Computers and Graphics, 15(6), 13-22.

Câmara, G., Vinhas, L., Souza, L., Paiva, L., Monteiro, A., Carvalho, M. & Raoult, B. (2001). Design patterns in GIS development: The Terralib experience. In Proceedings of the III Brazilian Symposium in Geoinformatics, GeoInfo 2001, Rio de Janeiro.


Canada Centre for Remote Sensing (2003). Fundamentals of remote sensing. Remote Sensing Tutorial (pp. 5-44). Retrieved April 8, 2008, from www.ccrs.nrcan.gc.ca/ccrs/learn/tutorials/fundam/fundam_e.html

Canbas, S., Onal, B. Y., Duzakin, H. G., & Kilic, S. B. (2006). Prediction of financial distress by multivariate statistical analysis: The case of firms taken into the surveillance market in the Istanbul Stock Exchange. International Journal of Theoretical & Applied Finance, 9(1), 133.

Carty, A. J. (2002). Scientific and technical data: Extending the frontiers of research. In Proceedings of the Opening Address at CODATA 2002: Frontiers of Scientific and Technical Data, Montréal, Canada.

Caterer & Hotelkeeper (2000, September 7). Hotel groups deny they’re missing Web opportunities, 14.

Caterer & Hotelkeeper (2004, 24 June). Do the knowledge, 34.

Cerrito, P. (2006). Introduction to data mining using SAS enterprise miner. SAS Press.

Chainey, S. & Ratcliffe, J. (2005). GIS and crime mapping. Chichester, West Sussex: John Wiley and Sons.

Chakrabarti, S. (2003). Mining the Web: Discovering knowledge from hypertext data. San Francisco: Morgan Kaufmann Publishers.

Chakrabarti, S. (2000). Data mining for hypertext: A tutorial survey. ACM SIGKDD Explorations, 1(2), 1-11.

Chan, K. A., Menkveld, A. J., & Yang, Z. (2003). Evidence on the foreign share discount puzzle in China: Liquidity or information asymmetry? (Working Paper). Hong Kong, China: University of Science and Technology (HKUST).

Chan, N. H. & Wong, H. Y. (2007). Data mining of resilience indicators. IIE Transactions, 39, 617–627.

Chang, S., Chang, H., Lin, C., & Kao, S. (2003). The effect of organizational attributes on the adoption of data mining techniques in the financial service industry: An empirical study in Taiwan. International Journal of Management, 20, 497-503.

Charnes, A., & Cooper, W. W. (1961). Management models and industrial applications of linear programming (vols. 1 & 2). New York: John Wiley & Sons.

Chen, H. & Chau, M. (2004). Web mining: Machine learning for Web applications. Annual Review of Information Science and Technology (ARIST), 38, 289-329.

Chen, H., Chung, W., Xu, J. J., Wang, G., Qin, Y., & Chau, M. (2004). Crime data mining: A general framework and some examples. Computer, 37(4), 50-56.

Chen, H., Qin, J., Reid, E., Chung, W., Zhou, Y., Xi, W., Lai, G., Bonillas, A., & Sageman, M. (2004). The dark Web portal: Collecting and analyzing the presence of domestic and international terrorist groups on the Web. In Proceedings of the 7th International Conference on Intelligent Transportation Systems (ITSC), Washington D.C.

Chen, I. J. & Popovich, K. (2003). Understanding customer relationship management (CRM): People, process and technology. Business Process Management Journal, 9(5), 672-688.

Chen, Y., Wang, J. Z. & Krovetz, R. (2003). CLUE: Cluster-based retrieval of images by unsupervised learning. In K. A. Meraim & I. Bloch (Eds.), Proceedings of the IEEE Seventh International Symposium on Signal Processing and its Applications (pp. 202-231).

Chenhall, R. H. (2005). Integrative strategic performance measurement systems, strategic alignment of manufacturing, learning and strategic outcomes: An exploratory study. Accounting, Organizations and Society, 30(5), 395-423.

Chetty, M. & Buyya, R. (2002). Weaving computational grids: How analogous are they with electrical grids? IEEE Computing in Science and Engineering, July/August, 61-71.

Chin-Sheng, H., Dorsey, R. E., & Boose, M. A. (1994). Life insurer financial distress prediction: A neural network model. Journal of Insurance Regulation, 13(2), 131-168.


CIDA (2005). CIDA’s strategy on knowledge for development through information and communication technologies (ICT). Canadian International Development Agency. Retrieved April 13, 2008, from http://www.acdi-cida.gc.ca/ict

Cios, K. J. (Ed.) (2000). Medical data mining and knowledge discovery. Physica-Verlag (Springer).

Cios, K., Pedrycz, W., & Swiniarski, R. (1998). Data mining methods for knowledge discovery.

Cirasa, A., Pilato, G., Sorbello, F., & Vassallo, G. (2000). EαNet: A neural solution for Web pages classification. In Proceedings of the 4th World Multiconference on Systemics, Cybernetics, and Informatics SCI2000, Orlando, Florida.

Clemons, E. & Row, M. (2000, November 13). Behaviour is key to web retailing strategy. Financial Times.

Coats, P. K., & Fant, L. F. (1992). A neural network approach to forecasting financial distress. The Journal of Business Forecasting Methods & Systems, 10, 9-12.

Coats, P. K., & Fant, L. F. (1993). Recognizing financial distress patterns using a neural network tool. Financial Management, 22(3), 142-155.

Codata (2002). Committee on data for science and technology (CODATA). In Proceedings of the Workshop Synthesis on Archiving Scientific and Technical Data, Pretoria, South Africa. Retrieved April 13, 2008, from http://www.tgdc-codata.org.cn/english/Html/SA-CT.html

Coenen, F. & Leng, P. (2001). Optimising association rule algorithms using itemset ordering. In M. Bramer, F. Coenen & A. Preece (Eds.), Research and Development in Intelligent Systems XVIII – Proceedings of the Twenty-first SGES International Conference on Knowledge Based Systems and Applied Artificial Intelligence (pp. 53-66). London, UK: Springer-Verlag.

Coenen, F. & Leng, P. (2002). Finding association rules with some very frequent attributes. In T. Elomaa, H. Mannila & H. Toivonen (Eds.), Principles of Data Mining and Knowledge Discovery – Proceedings of the 6th European Conference on Principles and Practice of Knowledge Discovery in Databases (pp. 99-111). Berlin Heidelberg, Germany: Springer-Verlag.

Coenen, F. & Leng, P. (2004). An evaluation of approaches to classification rule selection. In Proceedings of the 4th IEEE International Conference on Data Mining (pp. 359-362). Los Alamitos, CA: IEEE Computer Society Publications.

Coenen, F., Goulbourne, G., & Leng, P. (2001). Computing association rules using partial totals. In L. D. Raedt & A. Siebes (Eds.), Principles of Data Mining and Knowledge Discovery – Proceedings of the 5th European Conference on Principles and Practice of Knowledge Discovery in Databases (pp. 54-66). Berlin Heidelberg, Germany: Springer-Verlag.

Coenen, F., Leng, P., & Ahmed, S. (2004). Data structure for association rule mining: T-tree and p-tree. IEEE Transactions on Knowledge and Data Engineering, 16(6), 774-778.

Coenen, F., Leng, P., & Goulbourne, G. (2004). Tree structures for mining association rules. Journal of Data Mining and Knowledge Discovery, 8(1), 25-51.

Cohen, A. & Nachmias, R. (2006). A quantitative cost effectiveness model for Web-supported academic instruction. The Internet and Higher Education, 9(2), 81-90.

Cohn, D. & Chang, H. (2000). Learning to probabilistically identify authoritative documents. In Proceedings of the 17th International Conference on Machine Learning (ICML2000) (pp. 167-174), Stanford, California.

COL (2003). Find information faster: COL’s “Info-mining” tools. Retrieved April 13, 2008, from http://www.col.org/colweb/site/pid/2927

Coleman, D. J. & McLaughlin, J. D. (1997). Information access and network usage in the emerging spatial information marketplace. Journal of Urban and Regional Information Systems Association, 9, 8-19.

Collard, J. M. (2002). Is your company at risk? Strategic Finance, 84(1), 37-39.

Connelly, R., McNeil, R., & Mosimann, R. (1998). The multidimensional manager - 24 ways to impact your


bottom line in 90 days. Ottawa, ON: Cognos Incorporated.

Cooley, R., Mobasher, B., & Srivastava, J. (1997). Web mining: Information and pattern discovery on the World Wide Web. In Proceedings of the 9th International Conference on Tools with Artificial Intelligence (ICTAI ’97) (pp. 558-567), Newport Beach, CA: IEEE Computer Society.

CORDIS (2006). GRID technologies and applications through CORDIS. Community Research & Development Information Service. Retrieved April 13, 2008, from http://www.environmonument.com/projects.htm

Couldwell, C. (1998, May 21). A data day battle. Computing, 64-66.

Cox, E. (2004). Fuzzy modeling and genetic algorithms for data mining and exploration. Morgan Kaufmann.

Curotto, C. L. & Ebecken, N. F. F. (2005). Implementing data mining algorithms in Microsoft® SQL Server™. WIT Press.

Cuthbertson, K. & Nitzsche, D. (2001). Investments: Spot and derivatives markets. Chichester, West Sussex, UK: John Wiley & Sons, Ltd.

Dai, H. & Mobasher, B. (2003). A road map to more effective Web personalization: Integrating domain knowledge with Web usage mining. In Proceedings of the International Conference on Internet Computing (IC 2003), Las Vegas, Nevada.

Damodaran, A. (2001). Corporate finance theory and practice (2nd ed.). New York: John Wiley & Sons, Inc.

Davidse, R. J. & Van Raan, A. F. J. (1997). Out of particles: Impact of CERN, DESY, and SLAC research to fields other than physics. Scientometrics, 40(2), 171-193.

Davies, A. (2000, 29 June). Data’s the way to do it. Caterer & Hotelkeeper, 31-32.

Davies, A. (2001, 26 July). On-line, on course. Caterer & Hotelkeeper, 37-39.

de Ville, B. (2006). Decision trees for business intelligence and data mining: Using SAS enterprise miner. SAS Press.

de Ville, B. (2001). Microsoft data mining: Integrated business intelligence for e-commerce and knowledge management.

De’ath, G. & Fabricius, K. E. (2000). Classification and regression trees – A powerful yet simple technique for ecological data analysis. Ecology, 81(11), 3178-3192.

Deakin, E. B. (1972). A discriminant analysis of predictors of business failure. Journal of Accounting Research, 10(1), 167-179.

Delmater, R. & Hancock, M. (2001). Data mining explained: A manager’s guide to customer-centric business intelligence. Digital Press.

Demertzis, N., Diamantaki, K., Gazi, A., & Sartzetakis, N. (2005). Greek political marketing on-line: An analysis of parliament members’ Web sites. Journal of Political Marketing, 4(1), 51-74.

Derby, B. L. (2003). Data mining for improper payments. The Journal of Government Financial Management, 52, 10-13.

Desikan, P., Srivastava, J., Kumar, V., & Tan, P. N. (2002). Hyperlink analysis: Techniques and applications (Tech. Rep. TR 2002-0152). Army High Performance Computing Center.

Dhar, V. & Stein, R. (1996). Seven methods for transforming corporate data into business intelligence. Upper Saddle River, NJ: Prentice Hall.

Dietterich, T. (2000). Ensemble methods in machine learning. In Kittler & Roli (Eds.), Multiple classifier systems (pp. 1-15). Berlin: Springer-Verlag (Lecture Notes in Computer Science 1857).

Dietterich, T. G., Lathrop, R. H., & Lozano-Perez, T. (1997). Solving the multiple-instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1-2), 31-71.

Diplaris, S., Tsoumakas, G., Mitkas, P. A., & Vlahavas, I. (2005). Protein classification with multiple algorithms. In Proceedings of 10th Panhellenic Conference in Informatics. Volos, Greece: Springer-Verlag.


Do, T. D., Chang, K., & Hui, S. C. (2004). Web mining for cyber monitoring and filtering. In Proceedings of the 2004 IEEE Conference on Cybernetics and Intelligent Systems Vol. 1 (pp. 399-404). Singapore.

Dong, G. & Li, J. (1999). Efficient mining of emerging patterns: Discovering trends and differences. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 43-52). New York: ACM Press.

Dorian, P. (1999). Data preparation for data mining. Morgan Kaufmann.

Drewry et al. (2002). Current state of data mining. Department of Computer Science, University of Virginia.

Dubé, L. & Paré, G. (2003). Rigor in information systems positivist case research: Current practices, trends and recommendations. MIS Quarterly, 27(4), 597-635.

Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification. New York: John Wiley.

Dunham, M. (2003). Data mining: Introductory and advanced topics. Prentice Hall.

Dunn, C. E., Atkins, P. J., & Townsend, J. G. (1997). GIS for development: A contradiction in terms? Area, 29(2), 151-159.

Dyer, N. A. (1998). What’s in a relationship (other than relations)? Insurance Brokers Monthly & Insurance Adviser, 48(7), 16-17.

Earth Institute News (2005). Scientific community must develop cross-disciplinary standards and practices in academia. Retrieved April 13, 2008, from http://www.earthinstitute.columbia.edu/news/2005/story05-01-05c.html

Easterby-Smith, M., Araujo, L., & Burgoyne, J. (1999). Organizational learning and the learning organization: Developments in theory and practice. London, UK: Sage Publications.

Ebecken, N. F. F., Brebbia, C. A., & Weigend, A. (2000). Data mining II (1st ed.). Computational Mechanics, Inc.

Edlington, S. (2003, January 20). Future perfect? Caterer & Hotelkeeper, 26.

Eirinaki, M. & Vazirgiannis, M. (2003). Web mining for Web personalization. ACM Transactions on Internet Technology, 3(1), 1-27.

Eisenhardt, K. M. & Sull, D. N. (2001). Strategy as simple rules. Harvard Business Review, 79(1), 106-117.

Eklund, T., Back, B., Vanharanta, H., & Visa, A. (2003). Using the self-organizing map as a visualization tool in financial benchmarking. Information Visualization, 2, 171-181.

El-Hajj, M. & Zaiane, O. R. (2003). Inverted matrix: Efficient discovery of frequent items in large datasets in the context of interactive mining. In L. Getoor, T. E. Senator, P. Domingos & C. Faloutsos (Eds.), Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 109-118). New York: ACM Press.

Enke, D. & Thawornwong, S. (2005). The use of data mining and neural networks for forecasting stock market returns. Expert Systems with Applications, 29, 927-940.

Escada, M. I. S., Monteiro, A. M., Aguiar, A. P., Carneiro, T. & Câmara, G. (2005). Análise de padrões e processos de ocupação para a construção de modelos na Amazônia [Analysis of land use patterns and processes for the construction of models in Amazonia]. In Proceedings of the XII Brazilian Symposium on Remote Sensing (pp. 2973-2983), Goiania, Brazil.

Escada, M. I. S., Vieira, I. C. G., Amaral, S., Araújo, R., Veiga, J. B. D., Aguiar, A. P. D., Veiga, I., Oliveira, M., Pereira, J. L. G., Filho, A. C., Fearnside, P. M., Venturieri, A., Carriello, F., Thales, M., Carneiro, T. S., Monteiro, A. M. V., & Câmara, G. (2005). Padrões e processos de ocupação nas novas fronteiras da Amazônia: O interflúvio do Xingu/Iriri [Land use patterns and processes in Amazonian new frontiers: The Xingu/Iriri region]. Estudos Avançados [Advanced Studies], 19, 9-23.

Esposito, F., Malerba, D., Di Pace, L., & Leo, P. (1999). A learning intermediary for automated classification


of Web pages. In Proceedings of the 16th International Workshop on Machine Learning in Text Data Analysis (ICML1999) (pp. 37-46).

ESRI (2004). ArcGIS 9: What is ArcGIS? A White Paper. Redlands, CA: Environmental Systems Research Institute.

ESRI (2004). ArcSDE: Advanced spatial data server. White Paper. Retrieved May 8, 2008, from http://esri.com/library/whitepapers/pdfs/arcgis_spatial_analyst.pdf

European Commission (2003). 2003 observatory of European SMEs: SMEs in Europe (Tech. Rep. No. 7). European Commission.

European Commission (2005). Specific programme for research technological development and demonstration: Integrating and strengthening the European research area, 2005 Work Programme (SP1-10).

Facca, F. M. & Lanzi, P. L. (2005). Mining interesting knowledge from Weblogs: A survey. Data & Knowledge Engineering, 53(3), 225-241.

Fang, X. & Sheng, O. R. L. (2005). Designing a better Web portal for digital government: A Web-mining based approach. In Proceedings of the 2005 National Conference on Digital Government Research (pp. 277-278), Atlanta, Georgia.

Farrell, M. (2006). Create a diversified portfolio. Path to Investing: Leading the way to financial knowledge. New York: Lightbulb Press, Inc.

Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases (a survey). AI Magazine, 17(3), 37-54.

Fayyad, U., Grinstein, G. & Wierse, A. (2001). Information visualization in data mining and knowledge discovery. Morgan Kaufmann.

Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI Magazine, 17, 37-54.

Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (Eds.) (1996). Advances in knowledge discovery and data mining. AAAI/MIT Press.

Fayyad, U. M., Piatetsky-Shapiro, G., & Uthurusamy, R. (2003). Summary from the KDD-03 Panel: Data mining: The next 10 years. ACM SIGKDD Explorations Newsletter, 5(2), 191-196.

Feeney, M. F. (2003). SDIs and decision support. In I. Williamson, A. Rajabifard, & M. F. Feeney (Eds.), Developing spatial data infrastructures: From concept to reality (pp. 195-210). London, UK: Taylor & Francis.

Ferramentas de business intelligence no Brasil [Business intelligence tools in Brazil] (2003). Retrieved March 12, 2003, from www.idcbrasil.com.br.

Ficenec, C. (2003, June). Explorations of participatory GIS in three Andean watersheds. Paper presented at the University Consortium of Geographic Information Science (UCGIS) Summer Assembly 2003, Pacific Grove, CA.

Fisher, D. (1987). Improving inference through conceptual clustering. In Proceedings of the 1987 AAAI Conference (pp. 461-465). Seattle, Washington.

Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179-188.

Foot, K., Schneider, S., Dougherty, M., Xenos, M., & Larsen, E. (2003). Analyzing linking practices: Candidate sites in the 2002 U.S. electoral Web sphere. Journal of Computer-Mediated Communication, 8(4).

Fraser, J., Fraser, N., & McDonald, F. (2000). The strategic challenge of electronic commerce. Supply Chain Management: An International Journal, 5(1), 7-14.

Freed, N., & Glover, F. (1981). Simple but powerful goal programming models for discriminant problems. European Journal of Operational Research, 7, 44-60.

Freed, N., & Glover, F. (1986). Evaluating alternative linear programming models to solve the two-group discriminant problem. Decision Sciences, 17, 151-162.

Freitas, A. A. (2002). Data mining and knowledge discovery with evolutionary algorithms. Springer-Verlag.

Frigo, M. L. (2002). Strategy-focused performance measures. Strategic Finance, 84(3), 10-13.


Fukuda, H., Passos, E., Pacheco, A. M., Neto, L. B., Valerio, J., Roberto, V. J. D., Antonio, E. R., & Chigener, L. (2000). Web text mining using a hybrid system. In Proceedings of the 6th Brazilian Symposium on Neural Networks (pp. 131–136).

Galbreath, J. & Rogers, T. (1999). Customer relationship leadership: A leadership and motivation model for the twenty-first century business. The TQM Magazine, 11(3), 161-171.

Garfield, E. (1985). History of citation indexes for chemistry: A brief review. JCICS, 25(3), 170-174.

Garofalakis, J., Kappos, P., & Mourloukos, D. (1999). Website optimization using page popularity. IEEE Internet Computing, 3(4), 22-29.

Gatrell, A. & Rowlingson, B. (1994). Spatial point process modeling in a GIS environment. In A. S. Fotheringham & P. A. Rogerson (Eds.), Spatial analysis and GIS (pp. 148-163). London, UK: Taylor and Francis.

Gaytan, A. & Johnson, A. J. (2002). A review of the literature on early warning systems for banking crises (Working papers No: 183). Central Bank of Chile.

Getoor, L. & Diehl, C. P. (2005). Link mining: A survey. ACM SIGKDD Explorations Newsletter, 7(2), 3-12.

Getoor, L., Segal, E., Taskar, B., & Koller, D. (2001). Probabilistic models of text and link structure for hypertext classification. In Proceedings of the IJCAI Workshop on Text Learning: Beyond Supervision, Seattle, Washington.

Gibson, R. K. & Ward, S. J. (2000). A proposed methodology for studying the functions and effectiveness of party and candidate Web-sites. Social Science Computer Review, 18(3), 301-319.

Gilad, B. & Gilad, T. (1988). The business intelligence system: A new tool for competitive advantage. New York: Amacom.

Giles, C. L., Bollacker, K., & Lawrence, S. (1998). CiteSeer: An automatic citation indexing system. In Proceedings of the 3rd ACM Conference on Digital Libraries (pp. 89-98).

Gillani, B. (1998). The Web as a delivery mechanism to enhance instruction. Educational Media International, 35(3), 197-202.

Giovinazzo, W. A. (2002). Internet-enabled business intelligence. Upper Saddle River, NJ: Prentice Hall.

Giuffrida, G., Cooper, L. G., & Chu, W. W. (1998). A scalable bottom-up data mining algorithm for relational databases. In Proceedings of the Tenth International Conference on Scientific and Statistical Database Management (pp. 206-209).

Gledhill, B. (2002, February 28). Learning from history. Caterer & Hotelkeeper, 33.

Gleim, R., Mehler, A., & Dehmer, M. (2006). Web corpus mining by instance of Wikipedia. In Proceedings of the EACL 2006 Workshop on Web as Corpus, Trento, Italy.

Goddard, S., Harms, S. K., Reichenbach, S. E., Tadesse, T., & Waltman, W. J. (2003). Geospatial decision support for drought risk management. Communications of the ACM, 46(1), 35-37.

Gökmen, A. et al. (2004). Balaban Valley Project: Improving the quality of life in rural area in Turkey, 7(Dec 2004). Retrieved April 13, 2008, from http://www.geocities.com/doriendetombe/detombevol7menmbalabanabstract.html

Goldenberg, A. & Moore, A. W. (2005). Bayes net graphs to understand co-authorship networks? In Proceedings of the 3rd International Workshop on Link Discovery (pp. 1-8). ACM Press.

Goldman, J. A., Chu, W. W., Parker, D. S., & Goldman, R. M. (1999). Term domain distribution analysis: A data mining tool for text databases. Methods of Information in Medicine, 38, 96-101.

Goodchild, M. F., Haining, R., & Wise, S. M. (1991). Integrating GIS and spatial data analysis: Problems and possibilities. International Journal of Geographic Information Systems, 6, 407-423.

Gordon, M. D. & Dumais, S. (1998). Using latent semantic indexing for literature based discovery. Journal


of the American Society for Information Science, 49(8), 674-685.

Gouda, K. & Zaki, M. J. (2001). Efficiently mining maximal frequent itemsets. In N. Cercone, T. Y. Lin & X. Wu (Eds.), Proceedings of the 2001 IEEE International Conference on Data Mining (pp. 163-170). Los Alamitos, CA: IEEE Computer Society Publications.

Goymour, A. (2001, 26 July). Host in the machine. Caterer & Hotelkeeper, 43-45.

Greenburg, E. F. (2004). Who turns on the RFID faucet, and does it matter? Packaging Digest, 22. Retrieved January 24, 2005, from http://www.packagingdigest.com/articles/200408/22.php

Greengrass, E. (1997). Information retrieval: An overview. National Security Agency. TR-R52-02-96.

Grönroos, C. (1994). From scientific management to service management: A management perspective for the age of service competition. International Journal of Service Management, 5(1), 5-20.

Groot, R. & McLaughlin, J. (2000). Introduction. In R. Groot & J. McLaughlin (Eds.), Geospatial data infrastructure: Concepts, cases and good practice (pp. 1-12). Oxford, UK: Oxford University Press.

Grossman, R. L., Kamath, C., Kegelmeyer, P., Kumar, V., & Namburu, R. (Eds.) (2006). Data mining for scientific and engineering applications (Massive computing) (1st ed.). Springer.

Haining, R. (1994). Designing spatial data analysis modules for GIS. In A. S. Fotheringham & P. A. Rogerson (Eds.), Spatial analysis and GIS (pp. 46-63). London, UK: Taylor and Francis.

Hale, J., Threet, J., & Shenoi, S. (1994). A practical formalism for imprecise inference control. IFIP Transactions A: Computer Science and Technology, 60, 139-156.

Halpern, J. Y. (2003). Reasoning about uncertainty. MIT Press.

Hamer, M. (1983). Failure prediction: Sensitivity of classification accuracy to alternative statistical method and variable sets. Journal of Accounting and Public Policy, 2, 289-307.

Han, E. H., Karypis, G., & Kumar, V. (1997). Scalable parallel data mining for association rules. In Proceedings of the ACM SIGMOD Conference on Management of Data.

Han, J. & Kamber, M. (2001). Data mining: Concepts and techniques. San Francisco: Morgan Kaufmann Publishers.

Han, J. & Kamber, M. (2006). Data mining: Concepts and techniques (2nd ed.). San Francisco: Morgan Kaufmann Publishers.

Han, J. (1999). Data mining. In J. Urban & P. Dasgupta (Eds.), Encyclopedia of distributed computing. Kluwer Academic Publishers.
Han, J., Cai, Y., & Cercone, N., (1993). Data-driven dis-
Groth, R. (1998). Data mining: A hands-on approach for covery of quantitative rules in relational databases. IEEE
business professionals. New Jersey: Prentice Hall. Trans. Knowledge and Data Engineering, 5, 29-40.

Gulati, R. & Garino, J. (2000, May-June). Get the right Han, J., Kamber, M. & Chiang, J. (1997). Metarule-gui-
mix of bricks and mortar. Harvard Business Review, ded mining of multi-dimensional association rules using
107-114. data cubes. In Proceedings of international conference
on knowledge discovering and data mining (KDD’97),
Gunther, J. W. & Moore, R. R. (2003). Early warning
pp. 207-210.
models in real time. Journal of Banking, 27(10), 1979-
2001. Han, J., Koperski, K. & Stefanovic, N. (1997). GeoMiner:
A system prototype for spatial data mining. In Proceed-
Hackathorn, R. D. (1998). Web farming for the data
ings of the ACM SIGMOD International Conference on
warehouse: Exploiting business intelligence and
Management of Data (pp. 553-556).
knowledge management. San Francisco: Morgan
Kaufmann Publishers.

Han, J., Pei, J., & Yin, Y. (2000). Mining frequent patterns without candidate generation. In W. Chen, J. F. Naughton & P. A. Bernstein (Eds.), Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (pp. 1-12). New York: ACM Press.

Han, J.W., & Kamber, M. (2000). Data mining: Concepts and techniques. San Diego: Academic Press.

Hand, D. J., Mannila, H., & Smyth, P. (2000). Principles of data mining. MIT Press.

Hand, D., Mannila, H., & Smyth, P. (2001). Principles of data mining. Cambridge: MIT Press.

Hannula, M. & Pirttimaki, V. (2003). Business intelligence empirical study on the top 50 Finnish companies. Journal of American Academy of Business, 2(2), 593-599.

Hardfield, R. (2004). The RFID power play. Supply Chain Resource Consortium. Retrieved October 23, 2004, from http://scrc.ncsu.edu/public/APICS/APICSjan04.html

Harms, S. K., Deogun, J., & Tadesse, T. (2002). Discovering sequential association rules with constraints and time lags in multiple sequences. Lecture notes in artificial intelligence 2366: Foundations of intelligent systems. In Proceedings of the 13th International Symposium on Methodologies for Intelligent Systems (pp. 432-441). Lyon, France.

Harvey, C. D. (1988). Telephone survey techniques. Canadian Home Economics Journal, 38(1), 30-35.

Hastie, T.J., & Tibshirani, R.J. (1990). Generalized additive models. New York: Chapman and Hall.

Hayes, C., Avesani, P., & Veeramachaneni, S. (2006). An analysis of bloggers and topics for a blog recommender system. In Proceedings of the Workshop on Web Mining, 7th European Conference on Machine Learning and the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD), Berlin, Germany.

Haykin, S. (1994). Neural networks, a comprehensive foundation. New York: Macmillan.

He, J., Liu, X., Shi, Y., Xu, W., & Yan, N. (2004). Classifications of credit cardholder behavior by using fuzzy linear programming. International Journal of Information Technology and Decision Making, 3, 633-650.

Hearst, M. A. (1999). Untangling text data mining. In Proceedings of ACL 99, the 37th Annual Meeting of the Association for Computational Linguistics, University of Maryland.

Hecht-Nielsen, R. (1990). Neurocomputing. Reading, MA: Addison Wesley.

Heeks, R. (2002). Information systems and developing countries: Failure, success and local improvisation. Information Society, 18(2), 101-112.

Hernández, V., Göhring, W., & Hopmann, C. (2004). Sustainable decision support for environmental problems in developing countries: Applying multi-criteria spatial analysis on the Nicaragua Development Gateway niDG. Research on computing science (Vol. 11, pp. 136-150). México: Instituto Politécnico Nacional.

Herrera-Viedma, E. & Pasi, G. (2006). Soft approaches to information retrieval and information access on the Web: An introduction to the special topic section. Journal of the American Society for Information Science and Technology, 57(4), 511-514.

Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Redwood City, CA: Addison Wesley.

Hess, A. & Kushmerick, N. (2004). Machine learning for annotating semantic Web services. In Proceedings of the AAAI Spring Symposium on Semantic Web Services, Palo Alto, California.

Hesselgesser, J., Taub, D., Baskar, P., Greenberg, M., Hoxie, J., Kolson, D.L., & Horuk, R. (1998). Neuronal apoptosis induced by HIV-1 gp120 and the Chemokine SDF-1alpha mediated by the Chemokine receptor CXCR4. Curr Biol, 8, 595-598.

Hidber, C. (1999). Online association rule mining. In A. Delis, C. Faloutsos & S. Ghandeharizadeh (Eds.), Proceedings of the 1999 ACM SIGMOD International
Conference on Management of Data (pp. 145-156). New York: ACM Press.

Ho, K. & Robinson, C. (2001). Personal financial planning (3rd ed.). North York, ON (Canada): Captus Press Inc.

Holsheimer, M., Kersten, M. L., Mannila, H., & Toivonen, H. (1995). A perspective on databases and data mining. In U. M. Fayyad & R. Uthurusamy (Eds.), Proceedings of the First International Conference on Knowledge Discovery and Data Mining (pp. 150-155). Menlo Park, CA: AAAI Press.

Hong, G. H. & Lee, J. H. (2005). Designing an intelligent Web information system of government based on Web mining. Lecture notes in computer science (Vol. 3614, pp. 1071-1078).

Hopkins, L. D. (1984). Evaluation of methods for exploring ill-defined problems. Environmental Planning B: Planning and Design, 11, 339-348.

Hoppszallern, S. (2003). Healthcare benchmarking. Hospitals & Health Networks, 77, 37-44.

Hoss, D. (2000). The e-business explosion: Strategic data solutions for e-business success. DM Review, 10(8), 24-28.

Houtsma, M. & Swami, A. (1995). Set-oriented mining of association rules in relational databases. In P. S. Yu & A. L. Chen (Eds.), Proceedings of the Eleventh International Conference on Data Engineering (pp. 25-33). Los Alamitos, CA: IEEE Computer Society Publications.

Hsu, K-L., Gupta, H. V., & Soroosian, S. (1995). Artificial neural network modeling of the rainfall-runoff process. Water Resour. Res., 31, 2517-2530.

Hu, W. & Meng, B. (2005). Design and implementation of Web mining system based on multi-agent. Lecture notes on artificial intelligence (Vol. 3584, pp. 491-498).

Hung, S.-Y., Liang, T.-P., & Liu, V. W.-C. (1996). Integrating arbitrage pricing theory and artificial neural networks to support portfolio management. Decision Support Systems, 18(1996), 301-316.

ICDM (2003). ICDM 2003 tutorial. In Proceedings of the Third IEEE International Conference on Data Mining, Sponsored by the IEEE Computer Society, Melbourne, Florida. Retrieved April 13, 2008, from http://www.cs.sfu.ca/~ester/ICDM2003/Lazarevic.abstract.htm

Inegbenebor, A. U. (2006). Financing small and medium industries in Nigeria-case study of the small and medium industries equity investment scheme: Empirical research finding. Journal of Financial Management & Analysis, 19(1), 71-80.

Inmon, W. H., & Inmon, W. H. (2002). Building the data warehouse (3rd ed.). New York: John Wiley & Sons.

INPE, National Institute for Space Research (2005). PRODES project - Monitoring the Brazilian Amazon forest using satellites. National Institute for Space Research. Retrieved April 8, 2008, from http://www.obt.inpe.br/prodes

Intransa (2005). Managing storage growth with an affordable and flexible IP SAN: A highly cost-effective storage solution that leverages existing IT resources. CA: Intransa, Inc.

Irani, Z., Al-Sebie, M., & Elliman, T. (2006). Transaction stage of e-government systems: Identification of its location & importance. In Proceedings of the 39th Hawaii International Conference on System Sciences, Hawaii.

IUPAC (2005). Chemistry and human health council report: 2003-2005. International Union of Pure and Applied Chemistry, IUPAC Division VII. Retrieved April 13, 2008, from http://www.iupac.org/news/archives/2005/43rd_council/Item_09_Div_VII.pdf

Jacobs, L. J. & Kuper, G. H. (2004). Indicators of financial crises do work! An early-warning system for six Asian countries. International Finance, 0409001, 39.

Jazayeri-Rad, H. (2004). The nonlinear model-predictive control of a chemical plant using multiple neural networks. Neural Computing and Applications, 13(1), 2-15.

Jeffery, K. G. (2000). The grid for e-science: E-commerce benefits, information technology department. CLRC, ITD.

Jepson, B., Collins, A., & Evans, A. (1993). Post-neural network procedure to determine expected prediction values and their confidence limits. Neural Computing and Applications, 1(3), 224-228.

Ji, L. & Peters, A. J. (2003). Assessing vegetation response to drought in the northern Great Plains using vegetation and drought indices. Remote Sensing of Environment, 87, 85-98.

Jiang, Z., Piggee, C., Heyes, M.P., Murphy, C., Quearry, B., Bauer, M., Zheng, J., Gendelman, H.E., & Markey, S.P. (2001). Glutamate is a mediator of neurotoxicity in secretions of activated HIV-1-infected macrophages. Journal of Neuroimmunology, 117, 97-107.

Jobber, D. (1998). Principles of marketing (2nd ed.). McGraw-Hill.

John, G. H., Miller, P., & Kerber, R. (1996). Stock selection using rule induction. IEEE Expert, 11(5), 52-58.

Jones, F. (1987). Current techniques in bankruptcy prediction. Journal of Accounting Literature, 6, 131-164.

Jong Soo Park, Ming-Syan Chen, & Philip S. Yu. (1997). Using a hash-based method with transaction trimming for mining association rules. IEEE Transactions on Knowledge and Data Engineering, 9(5), 813-825.

Joplin, B. (2001, March/April). Are we in danger of becoming CRM lemmings? Customer Management, 81-85.

Juma, C. (2006, April). Reinventing African economies: Technological innovation and the sustainability transition. Paper presented at The John Pesek Colloquium on Sustainable Agriculture, Ames, Iowa.

Kalakota, R. & Robinson, M. (2001). E-business 2.0: Roadmap for success. New York: Addison-Wesley.

Kalakota, R., & Robinson, M. (2003). From e-business to services: Why and why now? Addison-Wesley. Retrieved January 24, 2005, from http://www.awprofessional.com/articles/article.asp?p=99978&seqNum=5

Kamin, S. B., Schindler, J., & Samuel, S. (2007). The contribution of domestic and external factors to emerging market currency crises: An early warning system. International Journal of Finance and Economics, 12(3), 317-322.

Kandampully, J. & Duddy, R. (1999). Relationship marketing: a concept beyond primary relationship. Marketing Intelligence & Planning, 17(7), 315-323.

Kang, K. (2006). Outlook and reforms for the Korean economy in 2006. Retrieved April 13, 2008, from http://www.keia.org/2-Publications/2-2-Economy/Economy2006/01cover.pdf

Kantardzic, M. (2002). Data mining: Concepts, models, methods, and algorithms. Wiley-IEEE Press.

Kao, J-J. (1996). Neural net for determining DEM-based model drainage pattern. Journal of Irrigation and Drainage Engineering, 122, 112-121.

Kaplan, R. & Norton, D. (1992). The balanced scorecard: Measures that drive performance. Harvard Business Review, 70(1), 71-79.

Kaplan, R. (1996). The balanced scorecard: Translating strategy into action. Boston: Harvard Business School Press.

Kargupta, H. & Chan, P. (Eds.) (2001). Advances in distributed and parallel knowledge discovery. MIT/AAAI Press.

Kargupta, H., Joshi, A., Sivakumar, K., & Yesh, Y. (Eds.) (2004). Data mining: Next generation challenges and future directions. AAAI Press.

Karypis, G. (2006). CLUTO: A clustering toolkit. Retrieved April 13, 2008, from http://www.cs.umn.edu/~cluto.

Kaul, M., Garden, G.A., & Lipton, S.A. (2001). Pathways to neuronal injury and apoptosis in HIV-associated dementia. Nature, 410, 988-994.

Kautz, H., Selman, B., & Shah, M. (1997). Referral Web: Combining social networks and collaborative filtering. Communications of the ACM, 40(3), 63-65.

Keim, D. A. (2002). Information visualization and visual data mining. IEEE Transactions on Visualization and Computer Graphics, 7, 100-107.

Key Note (2002). Customer Relationship Management.

Key Note (2002). Hotels.

Key Note (2003). Hotels.

Khalil, O. E. M. & Harcar, T. D. (1999). Relationship marketing and data quality management. SAM Advanced Management Journal, 64(2).

Khatri, V., Ram, S., & Snodgrass, R. T. (2004). Augmenting a conceptual model with geospatiotemporal annotations. IEEE Transactions on Knowledge and Data Engineering, 16, 1324-1338.

Kim, K., & Lee, W. B. (2004). Stock market prediction using artificial neural networks with optimal feature transformation. Neural Computing and Applications, 13(3), 255-260.

Kimball, R., & Ross, M. (2002). The data warehouse toolkit: The complete guide to dimensional modeling (2nd ed.). New York: John Wiley & Sons.

Klersey, G. F. & Dugan, M.T. (1995). Substantial doubts: Using artificial neural networks to evaluate going concern. In Advances in Accounting Information Systems. Greenwich: JAI Press.

Kloesgen, W. & Zytkow, J. (Eds.) (2002). Handbook of data mining and knowledge discovery. Oxford University Press.

Kloptchenko, A., Eklund, T., Karlsson, J., Back, B., Vanharanta, H., & Visa, A. (2004). Combining data and text mining techniques for analyzing financial reports. Intelligent Systems in Accounting Finance and Management, 12, 29-41.

Ko, P. C. & Lin, P. C. (2005). An evolutionary modularized data mining mechanism for financial distress forecasts. In A. Ghosh, & L.C. Jain (Eds.), Evolutionary Computation in Data Mining (pp. 249-263). Berlin Heidelberg, Germany: Springer-Verlag.

Kohara, K., Ishikawa, T., Fukuhara, Y., & Nakamura, Y. (1997). Stock price prediction using prior knowledge and neural networks. International Journal of Intelligent Systems in Accounting, Finance and Management, 6(1), 11-22.

Kohonen, T. (1990). The self-organizing maps. Proceedings of the IEEE, 78, 1464-1480.

Kommers, P., Kinelev, V., & Kotsik, B. (2003). ICT in secondary education for the knowledge society. In T. Varis, T. Utsumi & W. R. Klemm (Eds.), Global peace through the global university system. The Finnish National Commission for UNESCO, University of Tampere, Hämeenlinna, Finland.

Kosala, R. & Blockeel, H. (2000). Web mining research: A survey. ACM, 2(1), 1-15.

Kostoff, R. N., Koytcheff, R., & Lau, C. G. Y. (2007). Structure of the global nanoscience and nanotechnology research literature (DTIC Tech. Rep. No. ADA461930), Defense Technical Information Center, Fort Belvoir, VA. Retrieved April 13, 2008, from http://www.dtic.mil/

Kostoff, R. N. (1997). Accelerating the conversion of science to technology: Introduction and overview. Journal of Technology Transfer [Special Issue on Accelerating the Conversion of Science to Technology], 22(3).

Kostoff, R. N. (2003). Text mining for global technology watch. In M. Drake (Ed.), Encyclopedia of library and information science (2nd ed.) (Vol. 4, pp. 2789-2799). New York: Marcel Dekker, Inc.

Kostoff, R. N. (2003). Stimulating innovation. In L. V. Shavinina (Ed.), International handbook of innovation (pp. 388-400). Oxford, UK: Elsevier Social and Behavioral Sciences.

Kostoff, R. N. (2003). Bilateral asymmetry prediction. Medical Hypotheses, 61(2), 265-266.

Kostoff, R. N. (2006). Systematic acceleration of radical discovery and innovation in science and technology. Technological Forecasting and Social Change, 73(8), 923-936.

Kostoff, R. N., Del Rio, J. A., García, E. O., Ramírez, A. M., & Humenik, J. A. (2001). Citation mining: Integrating text mining and bibliometrics for research user profiling. Journal of the American Society for Information Science and Technology, 52(13), 1148-1156.

Kostoff, R. N., Eberhart, H. J., & Toothman, D. R. (1997). Database tomography for information retrieval. Journal of Information Science, 23(4), 301-311.

Kostoff, R. N., Green, K. A., Toothman, D. R., & Humenik, J. A. (2000). Database tomography applied to an aircraft science and technology investment strategy. Journal of Aircraft, 37(4), 727-730.

Kostoff, R. N., Johnson, D., Bowles, C. A., & Dodbele, S. (2006). Assessment of India’s research literature (DTIC Tech. Rep. No. ADA444625), Defense Technical Information Center, Fort Belvoir, VA. Retrieved April 13, 2008, from http://www.dtic.mil/

Kostoff, R. N., Murday, J., Lau, C., & Tolles, W. (2005). The seminal literature of global nanotechnology research (DTIC Tech. Rep. No. ADA435986), Defense Technical Information Center, Fort Belvoir, VA. Retrieved April 13, 2008, from http://www.dtic.mil/

Kostoff, R. N., Murday, J., Lau, C., & Tolles, W. (2006). The seminal literature of global nanotechnology research. Journal of Nanoparticle Research, 8(2), 193-213.

Kostoff, R. N., Shlesinger, M., & Malpohl, G. (2004). Fractals roadmaps using bibliometrics and database tomography. Fractals, 12(1), 1-16.

Kostoff, R. N., Shlesinger, M., & Tshiteya, R. (2004). Nonlinear dynamics roadmaps using bibliometrics and database tomography. International Journal of Bifurcation and Chaos, 14(1), 61-92.

Kostoff, R. N., Stump, J.A., Johnson, D., Murday, J., Lau, C., & Tolles, W. (2006). The structure and infrastructure of the global nanotechnology literature. Journal of Nanoparticle Research, 8(3-4), 301-321.

Kostoff, R. N., Tshiteya, R., Pfeil, K. M., & Humenik, J. A. (2002). Power source text mining using bibliometrics and database tomography.

Kottegoda, N. T., Natale, L., & Raiteri, E. (2004). Some considerations of periodicity and persistence in daily rainfalls. Journal of Hydrology, 296(1-4), 23-37.

Kou, G., & Shi, Y. (2002). Linux-based Multiple Linear Programming Classification Program (Version 1.0). College of Information Science and Technology, University of Nebraska-Omaha, USA.

Kou, G., Liu, X., Peng, Y., Shi, Y., Wise, M., & Xu, W. (2003). Multiple criteria linear programming approach to data mining: Models, algorithm designs and software development. Optimization Methods and Software, 18, 453-473.

Kou, G., Peng, Y., Chen, Z., Shi, Y., & Chen, X. (2004, July 12-14). A multiple-criteria quadratic programming approach to network intrusion detection. In Proceedings of the Chinese Academy of Sciences Symposium on Data Mining and Knowledge Management, Beijing, China.

Kou, G., Peng, Y., Shi, Y., & Chen, Z. (2006). A new multi-criteria convex quadratic programming model for credit data analysis. Working Paper, University of Nebraska at Omaha, USA.

Kou, G., Peng, Y., Yan, N., Shi, Y., Chen, Z., Zhu, Q., Huff, J., & McCartney, S. (2004, July 19-21). Network intrusion detection by using multiple-criteria linear programming. In Proceedings of the International Conference on Service Systems and Service Management, Beijing, China.

Koua, E. L. & Kraak, M. J. (2004). Geovisualization to support the exploration of large health and demographic survey data. International Journal of Health Geographics, 3, 12.

Kovalerchuk, B. & Vityaev, E. (2000). Data mining in finance: Advances in relational and hybrid methods. New York: Kluwer Academic Publisher.

Koyuncugil, A. S. & Ozgulbas, N. (2006). Financial profiling of SMEs: An application by data mining. The European Applied Business Research (EABR) Conference, Clute Institute for Academic Research.

Koyuncugil, A. S. & Ozgulbas, N. (2006). Is there a specific measure for financial performance of SMEs? The Business Review, 5(2), 314-319.

Koyuncugil, A. S. & Ozgulbas, N. (2006). Determination of factors affected financial distress of SMEs listed in ISE by data mining. In Proceedings of the 3rd Congress
of SMEs and Productivity, KOSGEB and Istanbul Kultur University, Istanbul.

Koyuncugil, A. S. (2006). Fuzzy data mining and its application to capital markets. Unpublished doctoral dissertation, Ankara University, Ankara.

Kraak, M.-J. (2000). Access to GDI and the function of visualization tools. In R. Groot & J. McLaughlin (Eds.), Geospatial data infrastructure: Concepts, cases and good practice (pp. 217-321). Oxford, UK: Oxford University Press.

Krol, C. (1999, May). A new age: It’s all about relationships. Advertising Age, 70(21), S1-S4.

Kudyba, S. & Hoptroff, R. (2001). Data mining and business intelligence: A guide to productivity. Hershey, PA: Idea Group Publishing.

Kufoniyi, O., Huurneman, G., & Horn, J. (2005, April). Human and institutional capacity building in geoinformatics through educational networking. Paper presented at the International Federation of Surveyors Working Week 2005, Cairo, Egypt.

Kuncheva, L.I. (2000). Clustering-and-selection model for classifier combination. In Proceedings of the 4th International Conference on Knowledge-Based Intelligent Engineering Systems and Allied Technologies (KES’2000).

Kuo, R. J., Liao, J. L., & Tu, C. (2005). Integration of ART2 neural network and genetic k-means algorithm for analyzing Web browsing paths in electronic commerce. Decision Support Systems, 40, 355-374.

Kutman, O. (2001). Researching the early warning signals for the enterprises in Turkey. Journal of Dogus University, 4, 59-70.

Kwak, W., Shi, Y., Eldridge, S., & Kou, G. (2006). Bankruptcy prediction for Japanese firms: Using multiple criteria linear programming data mining approach. In Proceedings of the International Journal of Data Mining and Business Intelligence.

Lam, L. (2000). Classifier combinations: Implementations and theoretical issues. In Kittler & Roli (Eds.), Multiple classifier systems (pp. 78-86). Berlin: Springer-Verlag (Lecture Notes in Pattern Recognition 1857).

Lambin, E. (1999). Land-use and land-cover change implementation strategy. Retrieved April 8, 2008, from http://www.geo.ucl.ac.be/LUCC/lucc.html

Lambin, E. F., Geist, H. J. & Lepers, E. (2003). Dynamics of land use and land cover change in Tropical Regions. Annual Review of Environment and Resources, 28(1), 205-241.

Lansiluoto, A., Eklund, T., Barbro, B., Vanharanta, H., & Visa, A. (2004). Industry-specific cycles and companies’ financial performance comparison using self-organising maps. Benchmarking, 11, 267-286.

Lappas, G. & Yannas, P. (2006). A framework to evaluate political party Websites. In Proceedings of the 4th International Conference on Politics and Information Systems: Technologies and Applications Vol. II (pp. 226-231), Orlando, Florida.

Larose, D. T. (2004). Discovering knowledge in data: An introduction to data mining. Wiley-Interscience.

Law, R. (1999). Demand for hotel spending by visitors to Hong Kong: A study of various forecasting techniques. Journal of Hospitality and Leisure Marketing, 6(4), 17-29.

Lawlor, L. R. (1980). Structure and stability in natural and randomly constructed competitive communities. The American Naturalist, 116(3), 394-408.

Lawrence, S. & Giles, C. L. (1999). Accessibility of information on the Web. Nature, 400, 107-109.

Lazo, J. G., Maria, M., Vellasco, R., Aurelio, M., & Pacheco, C. (2000). A hybrid genetic-neural system for portfolio selection and management. In Proceedings of the 7th International Conference on Engineering Applications of Neural Networks. Kingston Upon Thames, UK: Kingston University.

Lee, H. Y., Ong, H. L., & Quek, L. H. (1995). Exploiting visualization in knowledge discovery. In Proceedings of the 1st International Conference on Knowledge Discovery and Data Mining (pp. 198-201), Montreal, Canada.

Lee, H. Y., Ong, H. L., Toh, E. W., & Chan, S. K. (1996). A multi-dimensional data visualization tool for knowledge discovery in databases. In Proceedings of IEEE Conference on Visualization, pp. 26-31.

Lee, S.M. (1972). Goal programming for decision analysis. Auerbach.

Lempel, R. & Moran, S. (2001). SALSA: The stochastic approach for link-structure analysis. ACM Transactions on Information Systems, 19(2), 131-160.

Lester, T. (2004, March 31). Pitfalls of precision bombing. FT Management, 4.

Levene, M. & Loizou, G. (1999). Computing the entropy of user navigation in the Web (Tech. Rep. No. RN/99/42), University College London.

Li, X. (1998). Web page design and graphic use of three U.S. newspapers. Journalism and Mass Communication Quarterly, 75(2), 353-365.

Liautaud, B. (2000). E-business intelligence: turning information into knowledge into profit. New York: McGraw-Hill.

Lin, D.-I., & Kedem, Z. M. (1998). Pincer search: A new algorithm for discovering the maximum frequent set. In H.-J. Schek, F. Saltor, I. Ramos & G. Alonso (Eds.), Advances in Database Technology – Proceedings of the 6th International Conference on Extending Database Technology (pp. 105-119). Berlin Heidelberg, Germany: Springer-Verlag.

Lindgreen, A. & Crawford, I. (1999). Implementing, monitoring and measuring a programme of relationship marketing. Marketing Intelligence & Planning, 17(5), 231-239.

LINDO Systems Inc. (2003). An overview of LINGO 8.0. Retrieved from http://www.lindo.com/cgi/frameset.cgi?leftlingo.html;lingof.html

Lindsay, P.H., & Norman, D.A. (1972). Human information processing: An introduction to psychology. New York: Academic Press.

Liu, H. & Motoda, H. (1998). Feature selection for knowledge discovery and data mining. Kluwer.

Liu, H. & Motoda, H. (1998). Feature extraction, construction and selection: A data mining perspective. Kluwer.

Liu, J. W., Yu, S. J., & Le, J. J. (2005). Online mining dynamic Web news patterns using machine learn methods. Lecture notes on artificial intelligence (Vol. 3614, pp. 462-465).

Liu, J., Pan, Y., Wang, K., & Han, J. (2002). Mining frequent item sets by opportunistic projection. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 229-238). New York: ACM Press.

Liu, S. & Lindholm, C. K. (2006). Assessing early warning signals of currency crises: A fuzzy clustering approach. Intelligent Systems in Accounting, Finance and Management, 14(4), 179-184.

Long, G., Hogg, M. K., Hartley, M. & Angold, S. J. (1999). Relationship marketing and privacy: Exploring the thresholds. Journal of Marketing Practice: Applied Marketing Science, 5(1), 4-20.

Long, M. M. & Schiffman, L. G. (2000). Consumption values and relationships: Segmenting the market for frequency programs. Journal of Consumer Marketing, 17(3).

Longley, P. A., Goodchild, M. F., Maguire, D.J., & Rhind, D. W. (2001). Geographic information systems and science. West Sussex, England: John Wiley and Son, Ltd.

Lopez, A., Bauer, M.A., Erichsen, D.A., Peng, H., Gendelman, L., Shibata, A., Gendelman, H.E., & Zheng, J. (2001). The regulation of neurotrophic factor activities following HIV-1 infection and immune activation of mononuclear phagocytes. In Proceedings of Soc. Neurosci. Abs., San Diego, CA.

Losiewicz, P., Oard, D., & Kostoff, R. N. (2000). Textual data mining to support science and technology management. Journal of Intelligent Information Systems, 15, 99-119.

Lu, Q. & Getoor, L. (2003). Link-based text classification. In Proceedings of the 3rd International Workshop on Link Discovery (pp. 1-8). ACM Press.

Lu, S., Hu, H., & Li, F. (2001). Mining weighted association rules. Intelligent Data Analysis, 5(2001), 211-255.

Luck, D. & Lancaster, G. (2003). E-CRM: Customer relationship marketing in the hotel industry. Managerial Auditing Journal – Accountability and the Internet, 18(3), 213-232.

MacDonald, J. (2002). The Earth observation business and the forces that impact it. In D. Couts (Ed.), Earth observation business network 2002. Vancouver, CA: MacDonald Dettwiler.

MacEachren, A. M. & Kraak, M.-J. (1997). Exploratory cartographic visualization: Advancing the agenda. Computer and Geosciences, 23, 335-343.

Magnusson, C., Arppe, A., Eklund, T., & Back, B. (2005). The language of quarterly reports as an indicator of change in the company’s financial status. Information & Management, 42, 561-570.

Maier, H. R. & Dandy, G. C. (1996). The use of artificial neural networks for the prediction of water quality parameters. Water Resour. Res., 32, 1013-1022.

Maimon, O. & Last, M. (2000). Knowledge discovery and data mining - The Info-Fuzzy Network (IFN) Methodology. Kluwer Publishers, Massive Computing.

Mangasarian, O.L. (2000). Generalized support vector machines. In A. Smola, P. Bartlett, B. Scholkopf, & D. Schuurmans (Eds.), Advances in large margin classifiers (pp. 135-146). Cambridge, MA: MIT Press.

Mannila, H. & Raiha, K. J. (1987). Dependency inference. In Proceedings of the 1987 International Conference Very Large Data Bases (pp. 155-158). Brighton, England.

Mannila, H., Toivonen, H., & Verkamo, A. I. (1994). Efficient algorithms for discovering association rules. In U. M. Fayyad & R. Uthurusamy (Eds.), Knowledge Discovery in Databases: Papers from the 1994 AAAI Workshop (pp. 181-192). Menlo Park, CA: AAAI Press.

Mannila, H., Toivonen, H., & Verkamo, I. (1994). Efficient algorithms for discovering association rules. In Proceedings of the AAAI Workshop, Knowledge Discovery in Databases.

Margolis, M., Resnick, D., & Tu, C.-C. (1997). Campaigning on the Internet: Parties and candidates on the World Wide Web in the 1996 primary season. Harvard International Journal of Press/Politics, 2(1), 59-78.

Markus, B. (2005). Building spatial knowledge infrastructure. Paper presented at the ISPRS Workshop on Service and Application of Spatial Data Infrastructure, XXXVI, Hangzhou, China. Retrieved April 13, 2008, from http://www.commission4.isprs.org/workshop_hangzhou/papers/65-70%20Bela%20markus-A103.pdf

Martin-Guerrero, J. D., Palomares, A., Balaguer-Ballester, E., Soria-Olivas, E., Gomez-Sanchis, J., & Soriano-Asensi, A. (2006). Studying the feasibility of a recommender in a citizen Web portal based on user modeling and clustering algorithms. Expert Systems with Applications, 30, 299-312.

Matsuo, Y., Ohsawa, Y., & Ishizuka, M. (2001). Average-clicks: A new measure of distance on the WWW. In Proceedings of First Asia-Pacific Conference, Web Intelligence, Japan.

Mattison, R. M. (1997). Data warehousing and data mining, for telecommunications. Artech House.

Maule, R. W. (1998). Content design frameworks for Internet studies curricula and research. Internet Research: Electronic Networking Applications and Policy, 8(2), 174-184.

May, M. & Savinov, A. (2002). An integrated platform for spatial data mining and interactive visual analysis. In Proceedings of the International Conference on Data Mining Methods and Databases for Engineering (pp. 90-101).

Mayer, M. A., Karkaletsis, V., Stamatakis, K., Leis, A., Villarroel, D., Thomeczek, C., Labsky, M., Lopez-Ostenero, F., & Honkela, T. (2006). MedIQ-Quality labelling of medical Web content using multilingual information
extraction. Studies in Health Technology and Informatics, 121, 183-190.

McDonald, W. J. (1998). Direct marketing: An integrated approach. McGraw-Hill International Editions.

McGarigal, K. & Marks, B. (1995). FRAGSTATS: Spatial pattern analysis program for quantifying landscape structure. USDA Forestry Service Technical Report PNW-351, Washington, DC.

McGarigal, K. (2002). Landscape pattern metrics. In A.H. El-Shaarawi & W.W. Piegorsch (Eds.), Encyclopedia of environmetrics (pp. 1135-1142). Sussex, England: John Wiley & Sons.

McGonagle, J. J. & Vella, C. M. (1990). Outsmarting the competition. Naperville, IL: Sourcebooks.

McVicar, T. R. & Bierwirth, P. N. (2001). Rapidly assessing the 1997 drought in Papua New Guinea using composite AVHRR imagery. International Journal of Remote Sensing, 22, 2109-2128.

Meier, R. L. (2000). Late-blooming societies can be stimulated by information technology. Futures, 32(2), 163.

Meinel, G. & Neubert, M. (2004). A comparison of segmentation programs for high resolution remote sensing data. International Archives of Photogrammetry and Remote Sensing, 35(1), 1097-1105.

Melville, P., Mooney, R. J., & Nagarajan, R. (2002). Content-boosted collaborative filtering for improved recommendations. In Proceedings of the 18th National Conference on Artificial Intelligence (pp. 187-192).

Mena, J. (2003). Investigative data mining for security and criminal detection. USA: Elsevier Science.

Meyer, P. A., & Pifer, W. H. (1970). Prediction of bank failures. The Journal of Finance, 25(4), 853-868.

Michalski, R. S. & Tecuci, G. (1994). Machine learning: A multistrategy approach (Vol. IV). Morgan Kaufmann.

Miles, M. B. & Huberman, A. M. (1990). Qualitative data analysis. London: Sage Publications.

Miller, C. (1995). In-depth interviewing by telephone: Some practical considerations. Evaluation and Research in Education, 9(1), 29-38.

Miller, H. J. & Han, J. (2001). Geographic data mining and knowledge discovery. London: Taylor and Francis.

Miller, J. (2002). O milênio da inteligência competitiva [The millennium of competitive intelligence]. Brazil: Bookman.

Mirkin, B. & Mirkin, B. G. (2005). Clustering for data mining: A data recovery approach. Virginia Beach, VA: Chapman & Hall / CRC.

Mishra, A., Ray, C., & Kolpin, D. W. (in press). Use of qualitative and quantitative information in neural networks for assessing agricultural chemical contamination of domestic wells. Journal of Hydrological Engineering.

Mitchell, T. (1997). Machine learning. McGraw Hill.

Mitra, S. & Acharya, T. (2003). Data mining: Multimedia, soft computing and bioinformatics. Hoboken, NJ: John Wiley and Sons, Inc.

MITRE (2001). Stopping traffic: Anti drug network (ADNET). MITRE Digest Archives. Retrieved April 13, 2008, from http://www.mitre.org/news/digest/archives/2001/adnet.html

Mladenic, D. & Grobelnik, M. (1999). Predicting content from hyperlinks. In Proceedings of the 16th International ICML99 Workshop on Machine Learning in Text Data Analysis (pp. 109-113).

Mobasher, B., Cooley, R., & Srivastava, J. (1999). Creating adaptive Web sites through usage based clustering of URLs. In Proceedings of the IEEE Knowledge and Data Engineering Exchange Workshop (KDEX99), Chicago, Illinois.

Mobasher, B., Dai, H., Luo, T., Sung, Y., & Zhu, J. (2000). Integrating Web usage and content mining for more effective Web personalization. In Proceedings of the International Conference on E-Commerce and Web Technologies (ECWeb 2000) (pp. 165-176). Greenwich, UK.

Mobasher, B., Jain, N., Han, E., & Srivastava, J. (1996). Web mining: Pattern discovery from WWW transactions (Tech. Rep. TR-96050). Department of Computer Science, University of Minnesota, Minneapolis. Retrieved April 12, 2008, from http://citeseer.ist.psu.edu/mobasher96web.html

Mokyr, J. (2002). The gifts of Athena: Historical origins of the knowledge economy. Princeton, NJ: Princeton University Press.

Moncrief, W. C. & Cravens, D. (1999). Technology and the changing marketing world. Marketing Intelligence and Planning, 17(7), 329-332.

Mooney, R. J. & Roy, L. (2000). Content-based book recommending using learning for text categorization. In Proceedings of the 5th ACM Conference on Digital Libraries (pp. 195-204). ACM Press.

Moore, G. (1999). Crossing the chasm: Marketing and selling high-tech products to mainstream customers. Oxford, UK: Capstone.

Mukherjee, D. (2006). Promote scientific research. Central Chronicle.

Murphy, J. M. (2001, March-April). Customer excellence: From the top down. Customer Management, 36-41.

Mursu, A., Soriyan, H. A., Olufokunbi, K., & Korpela, M. (2000). Information systems development in a developing country: Theoretical analysis of special requirements in Nigeria and Africa. In Proceedings of the 33rd Hawaii International Conference on System Sciences. Maui, Hawaii: IEEE.

Myatt, G. J. (2006). Making sense of data: A practical guide to exploratory data analysis and data mining. John Wiley.

Nagao, M. & Matsuyama, T. (1980). A structural analysis of complex aerial photographs. New York: Plenum Press.

Nanopoulos, A., Katsaros, D., & Manolopoulos, Y. (2003). A data mining algorithm for generalized Web prefetching. IEEE Transactions on Knowledge and Data Engineering, 15(5), 1155-1169.

Narin, F. (1976). Evaluative bibliometrics: The use of publication and citation analysis in the evaluation of scientific activity (monograph). NSF C-637. National Science Foundation. Contract NSF C-627. NTIS Accession No. PB252339/AS.

Narin, F., Olivastro, D., & Stevens, K. A. (1994). Bibliometrics theory, practice and problems. Evaluation Review, 18(1), 65-76.

Nasraoui, O. & Pavuluri, M. (2004). Complete this puzzle: A connectionist approach to accurate Web recommendations based on a committee of predictors. In Proceedings of the 6th WEBKDD Workshop, Seattle, Washington.

National Academy of Sciences (NAS) (2003). IT roadmap to a geospatial future. Washington, D.C.: The National Academies Press.

Neftci, S. N. (2004). Principles of financial engineering. Burlington, MA: Elsevier Academic Press.

Nemati, H. R. & Barko, C. D. (2003). Key factors for achieving organizational data-mining success. Industrial Management & Data Systems, 103(4), 282-292.

Ngu, D. S. W. & Wu, X. (1997). Sitehelper: A localized agent that helps incremental exploration of the World Wide Web. Computer Networks, 29(8-13), 1249-1255.

Nitsche, M. (2002, January-March). Developing a truly customer-centric CRM system: Part One – Strategic and architectural implementation. Interactive Marketing, 3(3), 207-217.

Nittel, S. & Stefanidis, A. (2005). GeoSensor networks and virtual GeoReality. In S. Nittel & A. Stefanidis (Eds.), GeoSensor networks (pp. 1-9). Boca Raton, FL: CRC Press.

Niven, P. R. (2002). Balanced scorecard step-by-step: Maximizing performance and maintaining results. New York: J. Wiley & Sons.

NSI Software (2004). Six tips small and midsize businesses can use to protect their critical data. NJ: NSI Software.

Nwabueze, K. (November 30, 2003). A case study: Role of technology venture capitalist market in developing countries, data mining, integration, and analysis. Timbuktu Chronicles. Retrieved April 13, 2008, from http://timbuktuchronicles.blogspot.com/2003_11_01_archive.html

O’Bada, A. (2002). Local adaptations to global trends: A study of an IT-based organizational change program in a Nigerian bank. Information Society, 18(2), 77.

O’Kelly, M. E. (1994). Spatial analysis and GIS. In A.S. Fotheringham & P.A. Rogerson (Eds.), Spatial analysis and GIS (pp. 66-79). London, UK: Taylor and Francis.

Oberle, D., Berendt, B., Hotho, A., & Gonzalez, J. (2003). Conceptual user tracking. Lecture notes on artificial intelligence (Vol. 2663, pp. 155-164).

OECD (2000). Policy briefs small and medium-sized enterprises: Local strength, global reach. Retrieved May 9, 2008, from www.oecd.org/dataoecd/3/30/1918307.pdf

Oksay, S. (2006). Publication of insurance research and analysis. Turkey: TSRSB.

Olson, D., & Shi, Y. (2005). Introduction to business data mining. New York: McGraw-Hill/Irwin.

Onwu, I. (2005). Knowledge discovery interface for environmental applications. Unpublished master’s thesis, Iowa State University, Ames.

Opitz, D., & Maclin, R. (1999). Popular ensemble methods: An empirical study. Journal of Artificial Intelligence Research, 11, 169-198.

Orlikowski, W. J. & Iacono, C. S. (2001). Research commentary: Desperately seeking “IT” in IT research - A call to theorizing the IT artifact. Information Systems Research, 12(2), 121-156.

Overell, S. (2004, March 31). Customers are not there to be hunted. FT Management, 2.

Ozgulbas, N. & Koyuncugil, A. S. (2006). Profiling and determining the strengths and weaknesses of SMEs listed in ISE by the data mining decision trees algorithm CHAID. In Proceedings of the 10th National Finance Symposium, Izmir.

Ozgulbas, N., Koyuncugil, A. S., & Yılmaz, F. (2006). Identifying the effect of firm size on financial performance of SMEs. The Business Review, 5(2), 162-167.

Pal, S. K. & Mitra, P. (2004). Pattern recognition algorithms for data mining. Chapman & Hall/CRC.

Pal, S., Talwar, V., & Mitra, P. (2002). Web mining in soft computing framework: Relevance, state of the art and future directions. IEEE Transactions on Neural Networks, 13(5), 1163-1177.

Palmer, A. (1996). Relationship marketing: A universal paradigm or management fad? The Learning Organisation, 3(3), 18-25.

Palmer, A., McMahon-Beattie, U. & Beggs, R. (2000). A structural analysis of hotel sector loyalty programmes. International Journal of Contemporary Hospitality Management, 12(1), 54-60.

Pantalone, C., & Platt, M. (1987). Predicting failures of savings and loan associations. AREUEA Journal, 15, 46-64.

Parhami, B. (1994). Voting algorithms. IEEE Transactions on Reliability, 43, 617-629.

Park, H. W. (2003). Hyperlink network analysis: A new method for the study of social structure on the Web. Connections, 25(1), 49-61.

Park, J. S., Chen, M.-S., & Yu, P. S. (1995). An effective hash based algorithm for mining association rules. In M. J. Carey & D. A. Schneider (Eds.), Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data (pp. 175-186). New York: ACM Press.

Parthasarathy, S., Zaki, M. J., & Li, W. (1997). Application driven memory placement for dynamic data structures (Tech. Rep. URCS TR 653). University of Rochester.

Pawlak, Z. (1982). Rough sets. Journal of Computer and Information Science, 11(5), 341-356.

Pazzani, M. & Billsus, D. (1997). Learning and revising user profiles: The identification of interesting Web sites. Machine Learning, 27(3), 313-331.

Pei, J., Han, J., & Lakshmanan, L. V. S. (2001). Mining frequent itemsets with convertible constraints. Paper presented at the Proceedings of the 17th International Conference on Data Engineering (pp. 433-442), Heidelberg, Germany.

Pei, J., Han, J., & Mao, R. (2000, May). CLOSET: An efficient algorithm for mining frequent closed itemsets. In D. Gunopulos & R. Rastogi (Eds.), Proceedings of the 2000 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (pp. 21-30), Dallas, TX.

Peng, Y., Kou, G., Chen, Z., & Shi, Y. (2004). Cross-validation and ensemble analyses on multiple-criteria linear programming classification for credit cardholder behavior. In Proceedings of ICCS 2004 (pp. 931-939). Berlin: Springer-Verlag (LNCS 2416).

Perner, P. & Petrou, M. (Eds.). Machine learning and data mining in pattern recognition. Springer Verlag.

Piatetsky-Shapiro, G., Djeraba, C., Getoor, L., Grossman, R., Feldman, R. & Zaki, M. (2006). What are the grand challenges for data mining? - KDD-2006 Panel Report. SIGKDD Explorations, 8(2), 70-77.

Pierrakos, D., Paliouras, G., Papatheodorou, C., Karkaletsis, V., & Dikaiakos, M. (2003). Web community directories: A new approach to Web personalization. Lecture notes on artificial intelligence (Vol. 3209, pp. 113-129).

Pilato, G., Vitabile, S., Vassallo, G., Conti, V., & Sorbello, F. (2003). A concurrent neural classifier for HTML documents retrieval. Lecture notes in computer science (Vol. 2859, pp. 210-217).

Potgieter, J. (2003). OLAP data scalability: Ignore OLAP data explosion at great cost. NSW, Australia: SPF Pty Ltd.

Pozzebon, M. (2003). The implementation of configurable technologies: Negotiations between global principles and local contexts. Unpublished doctoral dissertation, McGill University, Montreal, Canada.

Prabhaker, P. (2001). Integrated marketing-manufacturing strategies. Journal of Business & Industrial Marketing, 16(2), 113-128.

Prashanth, K. (2004). Wal-Mart’s supply chain management practices (B): Using IT/Internet to manage the supply chain. Hyderabad, India: ICFAI Center for Management Research.

Pyle, D. (1999). Data preparation for data mining. Morgan Kaufmann.

Quah, T. S. & Srinivasan, B. (1999). Improving returns on stock investment through neural network selection. Expert Systems with Applications, 17(4), 295-301.

Quéau, P. (2001). The information society and the global good. Retrieved April 13, 2008, from http://goanna.cs.rmit.edu.au/~aym/rinseap/bali/QueauTalk.html

Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Francisco: Morgan Kaufmann Publishers.

Quinlan, R. (1993). Programs for machine learning. San Francisco: Morgan Kaufmann.

Raghavan, S. N. R. (2005). Data mining in e-commerce: A survey. Sadhana, 30(2&3), 275-289.

Rahman, H. (2004). Information dynamics in developing countries. In Proceedings of the 5th International Conference on IT in Regional Areas, Caloundra, Queensland, Australia.

Rahman, H. (2006). Role of ICTs in socio-economic development and poverty reduction. In H. Rahman (Ed.), Information and communication technologies for economic and regional developments.

Rao, M. (2002). Systems design of a national spatial data. Bangalore: Indian Space Research Organisation Headquarters.

Rasmussen, N., Goldy, P. S., & Solli, P. O. (2002). Financial business intelligence - Trends, technology, software selection, and implementation. New York: John Wiley and Sons.

Ratcliffe, J. (2004). Strategic thinking in criminal intelligence. Sydney: Federation Press.

Rautenstrauch, C. & Page, B. (2001). Environmental informatics-methods, tools and applications in environmental information processing. In C. Rautenstrauch & S. Patig (Eds.), Environmental information systems in industry and public administration (pp. 2-11). Hershey, PA: Idea Group Publishing.

Reeves, T. C. & Dehoney, J. (1998). Cognitive and social functions of course Web sites. In H. Maurer & R.G. Olson (Eds.), Proceedings of WebNet World Conference 98 - World Conference of the WWW, Internet & Intranet. Orlando, FL: Association for the Advancement of Computing in Education.

Reich, B. & Benbasat, I. (2000). Factors that influence the social dimension of alignment between business and information technology objectives. MIS Quarterly, 24(1), 81-113.

Reichheld, F. & Schefter, P. (2000, July/August). E-loyalty. Harvard Business Review, 105-113.

Rennie, J. & McCallum, A. K. (1999). Using reinforcement learning to spider the Web efficiently. In Proceedings of the 16th International ICML99 Workshop on Machine Learning in Text Data Analysis (pp. 335-343).

Resig, J., Dawara, S., Homan, C. M., & Teredesai, A. (2004). Extracting social networks from instant messaging populations. In Proceedings of LinkKDD’04, Seattle, Washington.

Rich, M. K. (2000). The direction of marketing relationships. The Journal of Business & Industrial Marketing, 15(2/3), 170-179.

Richardson, M. & Domingos, P. (2002). The intelligent surfer: Probabilistic combination of link and content information in PageRank. Advances in Neural Information Processing Systems, 14.

Riedl, R. (2003). Design principles for E-government services. In Proceedings of eGov Day 2003, Vienna, Austria.

Roberto, J. & Bayardo, Jr. (1998). Efficiently mining long patterns from databases. In L. M. Haas & A. Tiwary (Eds.), Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data (pp. 85-93). New York: ACM Press.

Robey, D., Ross, J., & Boudreau, M. (2002). Learning to implement enterprise systems: An exploratory study of the dialectics of change. Journal of Management Information Systems, 19(1), 17.

Ross, S. (1976). The arbitrage theory of capital asset pricing. Journal of Economic Theory, 13, 341-360.

Rud, O. P. (2001). Data mining cookbook: Modeling data for marketing, risk, and CRM. Wiley.

Rui, Y., Huang, T. S. & Chang, S. F. (1999). Image retrieval: Current techniques, promising directions and open issues. Journal of Visual Communication and Image Representation, 10(1), 39-62.

Rushing, J., Ramachandran, R., Nair, U. J., Graves, S. J., Welch, R. & Lin, A. (2005). ADaM: A data mining toolkit for scientists and engineers. Computers and Geosciences, 31(5), 607-618.

Rüther, H. (2001, October). EIS education in Africa – The geomatics perspective. Paper presented at the International Conference on Spatial Information for Sustainable Development, Nairobi, Kenya.

Rymon, R. (1992). Search through systematic set enumeration. In B. Nebel, C. Rich & W. R. Swartout (Eds.), Proceedings of the 3rd International Conference on Principles of Knowledge Representation and Reasoning (pp. 539-550). San Francisco: Morgan Kaufmann Publishers.

Sahay, S. & Avgerou, C. (2002). Information and communication technologies in developing countries. Information Society, 18(2), 1-5.

Sammon, W. L., Kurland, M. A., & Spitalnic, R. (1984). Business competitor intelligence: Methods for collecting, organizing, and using information. New York: John Wiley & Sons.

Sanchez, A. & Marin, G. S. (2005). Strategic orientation, management characteristics, and performance: A study of Spanish SMEs. Journal of Small Business Management, 43(3), 287-309.

Sarwar, B., Karypis, G., Konstan, J., & Riedl, J. (2000). Analysis of recommendation algorithms for e-commerce. In Proceedings of the ACM Conference on Electronic Commerce (pp. 158-162).

Savasere, A., Omiecinski, E., & Navathe, S. (1995). An efficient algorithm for mining association rules in large databases. In U. Dayal, P. M. D. Gray & S. Nishio (Eds.), Proceedings of the 21st International Conference on Very Large Data Bases (pp. 432-444). San Francisco: Morgan Kaufmann Publishers.

Schaap, B. D. & Linhart, S.M. (1998). Quality of ground water used for selected municipal water supplies in Iowa, 1982-96 water years (p. 67). Iowa City, IA: U.S. Geological Survey Open File Report 98-3.

Schaap, M. G. & Bouten, W. (1996). Modeling water retention curves of sandy soils using neural networks. Water Resour. Res., 32, 3033-3040.

Scheffer, T. (2004). Email answering assistance by semi-supervised text classification. Intelligent Data Analysis, 8(5).

Schneider, S. & Foot, K. (2004). The Web as an object of study. New Media & Society, 6(1), 114-122.

Schonberg, E., Cofino, T., Hoch, R., Podlaseck, M., & Spraragen, S. (2000). Measuring success. Communications of the ACM, 43(8), 53-57.

Schröder, M., Rehrauer, H., Seidel, K. & Datcu, M. (2000). Interactive learning and probabilistic retrieval in remote sensing image archives. IEEE Transactions on Geoscience and Remote Sensing, 23(1), 2288-2298.

Schubert, A., Glanzel, W., & Braun, T. (1987). Subject field characteristic citation scores and scales for assessing research performance. Scientometrics, 12(5-6), 267-291.

SCI (2006). Certain data included herein are derived from the Science Citation Index/Social Science Citation Index prepared by the THOMSON SCIENTIFIC ®, Inc. (Thomson®), Philadelphia, Pennsylvania, USA: © Copyright THOMSON SCIENTIFIC ® 2006. All rights reserved.

Science Blog (2002). Partnerships, finance, sustainable production and consumption patterns. Press Release: United Nations. Retrieved April 12, 2008, from http://www.scienceblog.com/community/older/archives/L/2002/A/un020319.html

Scime, A. (2005). Web mining: Application and techniques. Hershey, PA: Idea Group Inc.

SCN Education B. V. (2001). Data warehousing - The ultimate guide to building corporate business intelligence (1st ed.). Vieweg & Sohn Verlagsgesellschaft mBH.

Scoggins, J. (1999). A practitioner’s view of techniques used in data warehousing for sifting through data to provide information. In Proceedings of the Eighth International Conference on Information and Knowledge Management, Kansas City, MO.

SEARCH (2006). TechOasis. Norcross, GA: Search Technology Inc.

Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1–47.

Semeraro, G., Basile, P., Degemmis, M., & Lops, P. (2006). Discovering user profiles from papers by using word sense disambiguation. In Proceedings of the ECML/PKDD Workshop on Web Mining (pp. 69-79), Berlin, Germany.

Senthil Kumar, A. V. & Wahidabanu, R. S. D. (2006). Directed graph approach for association rule mining. In Proceedings of the 2nd International Conference ICTS, Indonesia.

Servaes, J. E. J. (2004). Knowledge is power (revisited): Internet and democracy. In P. Lee (Ed.), Proceedings of the International Conference on Internet Communication in Intelligent Societies (pp. 1-16). Chinese University of Hong Kong, Hong Kong. Retrieved April 13, 2008, from http://www.com.cuhk.edu.hk/conference/2004/

Shamseldin, A. Y. (1997). Application of a neural network technique to rainfall-runoff modeling. Journal of Hydrology, 199, 272-294.

Sharma, A. & Woodward, R. (2001). Political economy Websites: A researcher’s guide. New Political Economy, 6(1), 119-130.

Shi, Y., Peng, Y., Kou, G., & Chen, Z. (2005). Classifying credit card accounts for business intelligence and decision making: A multiple-criteria quadratic programming approach. International Journal of Information Technology and Decision Making, 4, 581-600.

Shi, Y., Peng, Y., Xu, W., & Tang, X. (2002). Data mining via multiple criteria linear programming: Applications in credit card portfolio management. International Journal of Information Technology and Decision Making, 1, 131-151.

Shi, Y. (2001). Multiple criteria and multiple constraint levels linear programming: Concepts, techniques and applications. NJ: World Scientific.

Shi, Y., & Yu, P.L. (1989). Goal setting and compromise solutions. In B. Karpak & S. Zionts (Eds.), Multiple criteria decision making and risk analysis using microcomputers (pp. 165-204). Berlin: Springer-Verlag.

Shi, Y., Wise, W., Luo, M., & Lin, Y. (2001). Multiple criteria decision making in credit card portfolio management. In M. Koksalan & S. Zionts (Eds.), Multiple criteria decision making in new millennium (pp. 427-436). Berlin: Springer-Verlag.

Shibata, A., Zelivyanskaya, M., Limoges, J., Carlson, K.A., Gorantla, S., Branecki, C., Bishu, S., Xiong, H., & Gendelman, H.E. (2003). Peripheral nerve induces macrophage neurotrophic activities: Regulation of neuronal process outgrowth, intracellular signaling and synaptic function. Journal of Neuroimmunology, 142, 112-129.

Shimabukuro, Y. et al. (1998). Using shade fraction image segmentation to evaluate deforestation in Landsat thematic mapper images of the Amazon region. International Journal of Remote Sensing, 19(3), 535-541.

Shukla, M. B., Kok, R., Prasher, S. O., Clark, G., & Lacroix, R. (1996). Use of artificial neural networks in transient drainage design. Transactions of the ASAE, 39, 119-124.

Shyu, M. L., Chen, S.C., Sarinnapakorn, K., & Chang, L. (2006). Principal component-based anomaly detection scheme. In T.S. Lin, S. Ohsuga, J. Liau, & X. Hu (Eds.), Foundations and Novel Approaches in Data Mining (pp. 311-329). Springer-Verlag.

Silva, M. P. S. (2006). Mineração de Padrões de Mudança em Imagens de Sensoriamento Remoto [Mining patterns of change in remote sensing images] (Unpublished doctoral thesis). São José dos Campos: National Institute for Space Research (INPE).

Silva, M. P. S., Câmara, G., Souza, R. C. M., Valeriano, D. M. & Escada, M. I. S. (2005). Mining patterns of change in remote sensing image databases. In J. Han & B. Wah (Eds.), Proceedings of the Fifth IEEE International Conference on Data Mining (pp. 362-369).

Sinha, I. (2000, March/April). Cost transparency: The Net’s real threat to prices and brands. Harvard Business Review, 43-55.

Smart, J. C. (Ed.) (2005). Higher education: Handbook of theory and research (Vol. 20). Virginia Tech: Springer.

Smeulders, A.W.M., Worring, M., Santini, S., Gupta, A. & Jain, R. (2000). Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12), 1349-1380.

Sobeih, A. (2005). Supporting natural resource management and local development in a developing connection: Bridging the policy gap between the information society and sustainable development. A publication of the International Institute for Sustainable Development (IISD), pp. 186-210.

Song, S. (2005). Viewpoint: Bandwidth can bring African universities up to speed. Science in Africa, September 2005. Retrieved April 13, 2008, from http://www.scienceinafrica.co.za/2005/september/bandwidth.htm

Sormani, A. (2005). Debt causes problems for SMEs. European Venture Capital & Capital Equity Journal, 1, 1.

Speth, J. G. (2004). Red sky at morning: America and the crisis of the global environment. Yale University Press.

Spiliopoulou, M. & Pohle, C. (2001). Data mining for measuring and improving the success of Web sites. Data Mining and Knowledge Discovery, 5(1-2), 85-114.

Spiliopoulou, M., Pohle, C., & Faulstich, L. (1999). Improving the effectiveness of a Web site with Web usage mining. In Proceedings of WEBKDD99 (pp. 142-162), San Diego, CA.

Srivastava, A. N., & Weigend, A. S. (1994). Computing the probability density in connectionist regression. In M. Marinara & G. Morasso (Eds.), Proceedings ICANN, 1 (pp. 685-688). Berlin: Springer-Verlag.

Srivastava, J., Cooley, R., Deshpande, M., & Tan, P. (2000). Web usage mining: Discovery and applications of usage patterns from Web data. SIGKDD Explorations, 1, 12-23.

Stake, R. E. (1998). Case studies. In N. K. Denzin & Y. S. Lincoln (Eds.), Strategies of qualitative inquiry (pp. 86-109). Thousand Oaks, CA: Sage Publications.

Stamatakis, K., Karkaletsis, V., Paliouras, G., Horlock, J., Grover, C., Curran, J. R. & Dingare, S. (2003). Domain-specific Web site identification: The CROSSMARC focused Web crawler. In Proceedings of the Second International Workshop on Web Document Analysis (WDA 2003) (pp. 75-78), Edinburgh, UK.

Stefanakis, E., Vazirgiannis, M., & Sellis, T. (1999). Incorporating fuzzy set methodologies in a DBMS repository for the application domain of GIS. International Journal of Geographic Information Science, 13, 657-675.

Steinmueller, W. E. (2001). ICTs and the possibilities for leapfrogging by developing countries. International Labour Review, 140(2), 193-210.

Stolfo, S.J., Fan, W., Lee, W., Prodromidis, A., & Chan, P.K. (2000). Cost-based modeling and evaluation for data mining with application to fraud and intrusion detection: Results from the JAM project. In Proceedings of the DARPA Information Survivability Conference.

Stumme, G., Wille, R. & Wille, U. (1998). Conceptual knowledge discovery in databases using formal concept analysis methods. Berlin-Heidelberg, Germany: Springer-Verlag.

Sturges, J. & Hanrahan, K. (2004). Comparing telephone and face-to-face qualitative interviewing: A research note. Qualitative Research, 4(1), 107-118.

Sullivan, L. (2004). Wal-Mart’s way. Information Week. Retrieved March 31, 2005, from http://www.informationweek.com/story/showArticle.jhtml?articleID=47902662&pgno=3

Sullivan, L. (2005). Wal-Mart assesses new uses for RFID. Information Week. Retrieved March 31, 2005, from http://www.informationweek.com/showArticle.jhtml?articleID=159906172

Sun, A., Lim, E. P. & Ng, W.K. (2002). Web classification using support vector machine. In Proceedings of the Fourth ACM CIKM International Workshop on Web Information and Data Management (WIDM’02), McLean, Virginia.

Sutcliffe, A. (2001). Heuristic evaluation of Website attractiveness and Web usability. Lecture notes in computer science (Vol. 2220, pp. 183-198).

Svoboda, M., LeComte, D., Hayes, M., Heim, R., Gleason, K., Angel, J., Rippey, B., Thinker, R., Palecki, M., Stooksbury, D., Miskus, D., & Stephens, S. (2002). The drought monitor. Bulletin of the American Meteorological Society, 83(8), 1181-1190.

Swanson, D. R. & Smalheiser, N. R. (1997). An interactive system for finding complementary literatures: A stimulus to scientific discovery. Artificial Intelligence, 91(2), 183-203.

Swanson, D. R. (1986). Fish oil, Raynaud’s syndrome, and undiscovered public knowledge. Perspect Biol Med, 30(1), 7-18.

Tadesse, T., Brown, J. F., & Hayes, M. J. (2005). A new approach for predicting drought-related vegetation stress: Integrating satellite, climate, and biophysical data over the U.S. central plains. ISPRS Journal of Photogrammetry and Remote Sensing, 59(4), 244-253.

Tadesse, T., Wilhite, D. A., Harms, S. K., Hayes, M. J., & Goddard, S. (2004). Drought monitoring using data mining techniques: A case study for Nebraska, USA. Natural Hazards, 33(1), 137-159.

Tadesse, T., Wilhite, D. A., Hayes, M. J., Harms, S. K., & Goddard, S. (2005). Discovering associations between
climatic and oceanic parameters to monitor drought in Nebraska using data-mining techniques. Journal of Climate, 18(10), 1541-1550.

Taffler, R. & Tisshaw, H. (1977). Going, going, gone - four factors which predict. Accountancy, March, 50-54.

Tan, C. N., & Dihardjo, H. (2001). A study on using artificial neural networks to develop an early warning predictor for credit union financial distress with comparison to the probit model. Managerial Finance, 27(4), 56-78.

Tan, Pang-Ning, Steinbach, M., & Kumar, V. (2005). Introduction to data mining. Pearson Addison Wesley.

Tan, Z. & Quektuan, C. (2007). Biological brain-inspired genetic complementary learning for stock market and bank failure prediction. Computational Intelligence, 23(2), 236-242.

Tango-Lowy, R. & Lewis, L. (2005). Situation management in crisis scenarios based on self-organizing neural mapping technology. In Proceedings of the IEEE Military Communications Conference (pp. 1-7), Atlantic City, New Jersey.

Tao, F., Murtagh, F., & Farid, M. (2003). Weighted association rule mining using weighted support and significance framework. In L. Getoor, T. E. Senator, P. Domingos & C. Faloutsos (Eds.), Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 661-666). New York: ACM Press.

The New York Stock Exchange (2007). Retrieved April 13, 2008, from www.nyse.com.

The Stock Exchange of Thailand (2007). Retrieved April 13, 2008, from www.set.or.th/en/index.html.

Thearling, K. (1995). From data mining to database marketing. DIG White Paper 95/02. Retrieved April 13, 2008, from http://www.cs.uvm.edu/~xwu/icdm/cfp-03.shtml

Thelwall, M. (2006). Interpreting social science link analysis research: A theoretical framework. Journal of the American Society for Information Science and Technology, 57(1), 60-68.

Thuraisingham, B. (1999). Data mining: Technologies, techniques, tools, and trends. Boca Raton, FL: CRC Press LLC.

Thurston, J., Poiker, T. K., & Moore, J. P. (2003). Integrated geospatial technologies: A guide to GPS, GIS, and data logging. Hoboken, NJ: John Wiley & Sons.

Tigre, P. B. & Dedrick, J. (2004). E-commerce in Brazil: Local adaptation of a global technology. Electronic Markets, 14(1), 36-40.

Ting, L. (2003). Sustainable development, the place for SDIs, and the potential of e-governance. In I. Williamson, A. Rajabifard & M. F. Feeney (Eds.), Developing spatial data infrastructures: From concept to reality (pp. 183-194). London, UK: Taylor & Francis.
Toivonen, H. (1996). Sampling large databases for as-
Tao, V., Liang, S., Croitoru, A., Haider, Z. M., & Wang,
sociation rules. In T. M. Vijayaraman, A. P. Buchmann,
C. (2005). GeoSwift: Open geospatial sensing services for
C. Mohan & N. L. Sarda (Eds.), Proceedings of the
sensor web. In S. Nittel & A. Stefanidis (Eds.), GeoSensor
22nd International Conference on Very Large Data
Networks (pp. 267-274). Boca Raton, FL: CRC Press.
Bases (pp. 134-145). San Francisco: Morgan Kaufmann
Tapp, A. (2001). Principles of direct marketing (2nd ed). Publishers.
Prentice Hall.
TREC (Text Retrieval Conference) (2004). Retrieved
Taylor, K., Walker, G., & Abel, D. (1999). A framework April 13, 2008, from http://trec.nist.gov/.
for model integration in spatial decision support systems.
Trippi, R. R., & DeSieno, D. (1992). Trading equity in-
International Journal of Geographic Information Sci-
dex futures with a neural-network. Journal of Portfolio
ence, 13, 533-555.
Management, 19, 27-33.
Tepeci, M. (1999). Increasing brand loyalty in the hospi-
tality industry. International Journal of Contemporary
Hospitality Management, 11(5).

0
Compilation of References

Tsaih, R., Hsu, Y., & Lai, C. C. (1998). Forecasting S Vapnik, V.N. (2000). The nature of statistical learning
& P 500 stock index futures with a hybrid AI system. theory (2nd ed.). New York: Springer.
Decision Support Systems, 23(2), 161-174.
Vckovski, A. & Bucher, F. (1996). Virtual data sets - Smart
Tseng, C.-C. (2004). Portfolio management using hybrid data for environmental applications. In Proceedings of the
recommendation system. In Proceedings of the 2004 IEEE Third International Conference/Workshop on Integrating
International Conference on e-Technology, e-Commerce, GIS and Environmental Modeling, Santa Fe, NM.
and e-Services (pp. 202-206). Los Alamitos, CA: IEEE
Viaene, S., Derrig, R. A., Baesens, B., & Dedene, G.
Computer Society Publications.
(2003). A comparison of state-of-the-art classifica-
Tug, E., Sakiroglu, M., & Arslan, A. (2006). Automatic tion techniques for expert automobile insurance claim
discovery of the sequential accesses from Web log data fraud detection. Journal of Risk and Insurance, 69(3),
files via a genetic algorithm. Knowledge-Based Systems, 373-421.
19(3), 180-186.
Viator, J. A. & Pestorius, F. M. (2001). Investigating
Turner, M. G. (1989). Landscape ecology: The effect trends in acoustics research from 1970-1999. Journal
of pattern on process. Annual Review of Ecology and of the Acoustical Society of America, 109(5), 1779-1783
Systematics, 20, 171-197. Part 1.

UNCTAD (2004). UNCTAD XI multi-stakeholder Ville, Barry de (2001). Microsoft data mining: Integrated
partnerships, information and communication technolo- business intelligence for e-commerce and knowledge
gies for development (ICTfD). In Proceedings of the management.
United Nations Conference on Trade and Development.
Ville, Barry de (2007). Microsoft data mining: Integrated
Retrieved April 13, 2008, http://www.unctad.org/en/
business intelligence for e-commerce and knowledge
docs//tdl380add1_en.pdf
management. Digital Press.
UNDP (2001). United Nations Development Program:
Vitt, E., Luckevich, M., & Misner, S. (2002). Business
Making new technologies work for human development.
intelligence. Microsoft Press.
Oxford: Oxford University Press.
Vlachos, E. (1994). GIS, DSS and the future. In Pro-
Unwin, T. (2006). Facing the challenges, dgCommuni-
ceedings of the 8th Annual Symposium on Geographic
ties: Open educational resources. Retrieved April 13,
Information Systems in Forestry, Environmental and
2008, from http://topics.developmentgateway.org/open-
Natural Resources Management, Vancouver, Canada.
educaion
Walters, D. & Lancaster, G. (1999). Value and information
USAID (2003). USAID Africa success stories. Retrieved
– Concepts and issues for management. Management
April 13, 2008, from http://africastories.usaid.gov:80/
Decision, 37(8), 643-656.
print_story.cfm?storyID=23
Walters, D. & Lancaster, G. (1999). Using the Internet
Utimaco (2005). Data encryption: The foundation of
as a channel for commerce. Management Decision,
enterprise security. Foxboro, MA: Utimaco Safeware,
37(10), 800-816.
Inc.
Wang, H. & Weigend, A. S. (2004). Data mining for
Van Der Zee, J. T. M. & De Jong, B. (1999). Alignment
financial decision making. Decision Support Systems,
is not enough: Integrating business and information
37(2004), 457-460.
technology management with the balanced score card.
Journal of Management Information Systems, 16(2), Wang, J. (Ed.) (2003). Data mining opportunities and
137-158. challenges. IRM Press.

Compilation of References

Wang, J. T. L., Zaki, M. J., Toivonen, H. T. T., & Shasha, realize exceptional payoffs. Information & Management,
D. E. (2005). Data mining in bioinformatics. London, 39(6), 491-502.
UK: Springer-Verlag.
Weingessel, A., Dimitriadou, E., & Hornik, K. (2003,
Wang, J., & Wang, Z. (1997). Using neural network to March 20-22). An ensemble method for clustering. In
determine Sugeno measures by statistics. Neural Net- Proceedings of the 3rd International Workshop on Dis-
works, 10, 183-195. tributed Statistical Computing, Vienna, Austria.

Wang, J., Han, J., & Pei, J. (2003). CLOSET+: Searching Weiss, S. M. & Indurkhya, N. (1997). Predictive data
for the best strategies for mining frequent closed itemsets. mining: A practical guide. Morgan Kaufmann.
In L. Getoor, T. E. Senator, P. Domingos & C. Faloutsos
Westerman, P. (2001). Data warehousing: Using the
(Eds.), Proceedings of the Ninth ACM SIGKDD Interna-
Wal-Mart model. San Francisco: Academic Press.
tional Conference on Knowledge Discovery and Data
Mining (pp. 236-245). New York: ACM Press. White, A. B., Kumar, P., & Tcheng, D. (2005). A data
mining approach for understanding topographic control
Wang, L. & Fu, X. (2005). Data mining with computa-
on climate-induced inter-annual vegetation variability
tional intelligence (advanced information and knowledge
over the United States. Remote Sensing of Environment,
processing) (1st ed.). Springer.
98, 1-20.
Wang, L., Khan, L. & Breen, C. (2002). Object bound-
Whiting, R. (2004). Vertical thinking. Information Week.
ary detection for ontology-based image classification. In
Retrieved March 31, 2005, from http://www.information-
Proceedings of the Third ACM International Workshop
week.com/showArticle.jhtml?articleID=18201987
on Multimedia Data Mining (pp. 51-61).
Wilhelmi, O. V. & Wilhite, D. A. (2002). Assessing
Wang, W. & Yang, J. (2005). Mining sequential patterns
vulnerability to agricultural drought: a Nebraska case
from large data sets. Secaucus, NJ: Springer-Verlag
study. Natural Hazards, 25(1), 37-58.
New York, Inc.
Wilhite, D. A. (2000): Drought as a natural hazard: con-
Wang, W., Yang, J., & Yu, P. (2000). Efficient mining
cepts and definitions. In D. A. Wilhite (Ed.), Drought: A
of weighted association rules (WAR). In Proceedings of
global assessment (Vol. 1, pp. 3-18). London: Routledge
the Sixth ACM SIGKDD International Conference on
Publishers.
Knowledge Discovery and Data Mining (pp. 270-274).
New York: ACM Press. Williams, D. (2004). The strategic implications of Wal-
Mart’s RFID mandate. Directions Magazine. Retrieved
Wang, X., Abraham, A., & Smith, K. (2005). Intelligent
October 23, 2004, from http://www.directionsmag.
Web traffic mining and analysis. Journal of Network and
com/article.php?article_id=629
Computer Applications, 28, 147-165.
Williams, R. (1997). Universal solutions or local
Wang, Y. & Hu, J. (2002). A machine learning approach
contingencies? Tensions and contradictions in the mu-
for table detection on the Web. In Proceedings of the
tual shaping of technology and work organization. In I.
11th International World Web Conference, Honolulu,
McLoughlin & M. Harris (Eds), Innovation, organiza-
Hawaii.
tional change and technology. London, UK: International
Wasserman, S. & Faust, K. (1994). Social network Thomson Business Press.
analysis: Methods and applications. Cambridge Uni-
WIPO, World Intellectual Property Organization (2002).
versity Press.
Interregional forum on small and medium-sized enter-
Watson, H., Goodhue, D., & Wixon, B. (2002). The prises (SMEs) and intellectual property (Tech. Rep. No.
benefits of data warehousing: Why some organizations 02/01). Moscow: Document of WIPO.

Compilation of References

Wise, S. M. & Haining, R. P. (1991). The role of spatial Processing (Vol. 5, pp. 2325-2327). Los Alamitos, CA:
analysis in geographical information systems. Westrade IEEE Computer Society Publications.
Fairs, 3, 1-8.
Yoon, J. P. & Kerschberg, L. (1993). A framework for
Witten, I. & Frank, E. (1999). Data mining, Practical knowledge discovery and evolution in databases. IEEE
machine learning tools and techniques with Java imple- Trans. On Knowledge And Data Engineering, 5(6),
mentations. Morgan Kaufman. 973-979.

Witten, I. & Frank, E. (2005). Data mining, practical Yu, D. L., & Gomm, J. B. (2002). Enhanced neural network
machine learning tools and techniques (2nd ed.). Morgan modelling for a real multi-variable chemical process.
Kaufman. Neural Computing and Applications, 10(4), 289-299.

Wolfgang, G. & Lars, S. (2000). Mining Web naviga- Yu, H., Han, J., & Chang, K. C. (2002). PEBL: Positive
tion path fragments. In Proceedings of the Workshop on example based learning for Web page classification using
Web Mining for E-Commerce (KDD2000) (pp. 105-110). SVM. In Proceedings Of The International Conference
Boston, MA. On Knowledge Discovery In Databases (KDD02) (pp.
239-248), New York.
WSSD (2002). Press release for fifth partnership plenary,
world summit on sustainable development. Johannesburg, Yu, L., Wang, S., & Lai, K. K. (2005). Mining stock
South Africa. market tendency using GA-based support vector ma-
chines. In X. Deng & Y. Ye (Eds.), Proceedings of the
Wu, H., Gordon, M., DeMaagd, K., & Fan, W. (2006).
First International Workshop on Internet and Network
Mining Web navigations for intelligence. Decision Sup-
Economics (pp. 336-345). Berlin Heidelberg, Germany:
port Systems, 41, 574-591.
Springer-Verlag.
Yan, N., Wang, Z., Shi, Y., & Chen, Z. (2005). Clas-
Yu, P.L. (1985). Multiple criteria decision making:
sification by linear programming with signed fuzzy
Concepts, techniques and extensions. New York: Ple-
measures. Working Paper, University of Nebraska at
num Press.
Omaha, USA.
Zahra, S., Sisodia, R., & Matherne, B. (1999, April).
Yang, C.-C., Prasher, S. O., & Lacroix, R. (1996). Ap-
Exploiting the dynamic links between competitive and
plication of artificial neural networks to land drainage
technology strategies. European Management Journal,
engineering. Trans. ASAE, 39, 525-533.
17(2), 188-201.
Yannas, P. & Lappas, G. (2005). Web campaign in the
Zaiane, O. R. (2001). Web usage mining for a better
2002 Greek municipal elections. Journal of Political
Web-based learning environment. In Proceedings of
Marketing, 4(1), 33-50.
Conference on Advanced Technology for Education (pp.
Yao, Y. Y., Hamilton, H. J. & Wang, X. (2002). Page- 60-64). Banff, Alberta, Canada.
Prompter: An intelligent agent for Web navigation created
Zaïane, O. R., Han, J., Li, Z.-N., & Hou, J. (1998). Mining
using data mining techniques. Lecture notes in computer
Multimedia Data. In Proceedings of the CASCON’98:
science (Vol. 2475, pp. 506-513).
Meeting of Minds (pp. 83-96), Toronto, Canada.
Ye, Z., Liu, X., Yao, Y., Wang, J., Zhou, X., Lu, P., &
Zaki, M. J. & Hsiao, C.-J. (2002) CHARM: An efficient
Yao, J. (2002). An intelligent system for personal and
algorithm for closed itemset mining. In R. L. Grossman,
family financial service. In L. Wang, J. C. Rajapakse, K.
J. Han, V. Kumar, H. Mannila & R. Motwani (Eds.),
Fukushima, S.-Y. Lee & X. Yao (Eds.), Proceedings of
Proceedings of the Second SIAM International Confer-
the 9th International Conference on Neural Information
ence on Data Mining (Part IX No. 1). Philadelphia, PA:
SIAM.

Compilation of References

Zaki, M. J., Parthasarathy, S., Li, W., & Ogihara, W. Zhao, Y. & Karypis, G. (2004). Empirical and theoretical
(1997). Evaluation of sampling for data mining of as- comparisons of selected criterion functions for document
sociation rules. In Proceedings of the 7th International. clustering. Machine Learning, 55(3), 311-331.
Workshop Research Issues in Data Engineering.
Zheng, J., Thylin, M., Ghorpade, A., Xiong, H., Per-
Zaki, M. J., Parthasarathy, S., Ogihara, M., & Li, W. sidsky, Y., Cotter, R., Niemann, D., Che, M., Zeng, Y.,
(1997). New algorithms for fast discovery of association Gelbard, H. et al. (1999). Intracellular CXCR4 signaling,
rules. In D. Heckerman, H. Mannila, & D. Pregibon (Eds.), neuronal apoptosis and neuropathogenic mechanisms of
Proceedings of the Third International Conference on HIV-1-associated dementia. Journal of Neuroimmunol-
Knowledge Discovery and Data Mining (pp. 283-286). ogy, 98, 185-200.
Menlo Park, CA: AAAI Press.
Zheng, J., Zhuang, W., Yan, N., Kou, G., Erichsen, D.,
Zaki, M. J.,Parthasarathy, S., & Li, W. (1997). A localized McNally, C., Peng, H., Cheloha, A., Shi, C., & Shi, Y.
algorithm parallel association mining. In Proceedings (2004). Classification of HIV-1-mediated neuronal den-
of the 9th ACM Symposium Parallel Algorithms and dritic and synaptic damage using multiple criteria linear
Architectures. programming. Neuroinformatics, 2, 303-326.

Zavgren, C. (1985). Assessing the vulnerability to failure Zhou, C., Li, Z., Meng, Y. & Meng, Q. (2004). A data min-
of American industrial firms: A logistics analysis. Journal ing algorithm based on rough set theory. In Proceedings
of Accounting Research, 22, 59-82. of International Conference on Information Acquisition
2004 (pp. 413-416).
Zeiler, M. (1999). Modeling our world: The ESRI guide
to Geodatabase design. Redlands, CA: ESRI Press. Zhou, Z., Jiang, K., & Li, M. (2005). Multi-instance
learning based Web mining. Applied Intelligence, 22(2),
Zenobi, G., & Cunningham, P. (2002). An approach to
135-147.
aggregating ensembles of lazy learners that supports
explanation. Lecture Notes in Computer Science, 2416, Zhu, D. H. & Porter, A. L. (2002). Automated extraction
436-447. and visualization of information for technological intel-
ligence and forecasting. Technological Forecasting and
Zhang, D. & Zhou, L. (2004). Discovery golden nuggets:
Social Change, 69(5), 495-506.
Data mining in financial application. IEEE Transactions
on Systems, Man, and Cybernetics – Part C: Applications Zimmermann, H.-J. (1978). Fuzzy programming and
and Reviews, 34(4), 513 –522. linear programming with several objective functions.
Fuzzy Sets and Systems, 1, 45-55.
Zhang, J., Shi, Y., & Zhang, P. (2005). Several multi-cri-
teria programming methods for classification. Working Zmijewski, M. E. (1984). Methodological issues related
Paper, Chinese Academy of Sciences Research Center on to the estimation of financial distress prediction models.
Data Technology & Knowledge Economy and Graduate Journal of Accounting Research, (Supplement), 59-82.
University of Chinese Academy of Sciences, China.
Zukerman, I. & Albrecht, D. (2001). Predictive statisti-
Zhang, Y., Yu, J. X., & Hou, J. (2005). Web communities: cal models for user modeling. User Modelling and User
Analysis and construction. Berlin: Springer. Adapted Interaction, 11, 5-18. tecedents of executive
information system success: A path analytic approach.
Decision Support System, 22(1), 31-43.

About the Contributors

Hakikur Rahman, PhD is the executive director and CEO of Sustainable Development Networking
Foundation (SDNF), the transformed entity of the Sustainable Development Networking Programme
(SDNP) in Bangladesh, where he had worked as the national project coordinator since December 1999.
SDNP is a global initiative of UNDP and it completed its activity in Bangladesh on December 31, 2006.
He is also acting as the secretary of South Asia Foundation Bangladesh Chapter. Before joining SDNP
he worked as the director, Computer Division, Bangladesh Open University. He has written several
books and many articles/papers on computer education for the informal sector and distance education.
He is the founder-chairperson of Internet Society Bangladesh Chapter; editor, the Monthly Computer
Bichitra; founder-principal and member secretary, ICMS College; head examiner (Computer), Bangladesh
Technical Education Board; executive director, BAERIN (Bangladesh Advanced Education Research
and Information Network) Foundation; and is involved in establishing an ICT-based distance education
university in Bangladesh. After graduating from the Bangladesh University of Engineering and Technology
in 1981, he earned his Master’s of engineering from the American University of Beirut in 1986 and
completed his PhD in computer engineering at Ansted University, BVI, UK in 2001.

***

Gilberto Câmara is general director of Brazil’s National Institute for Space Research (INPE) for the
period 2006 to 2010. INPE works in space science, space engineering, Earth observation and weather
and climate studies. Previously, he was head of INPE’s Image Processing Division from 1991 to 1996
and director for Earth observation from 2001 to 2005. His research interests include geographical
information science and engineering, spatial databases, spatial analysis and environmental modeling. He
has published more than 150 full papers in refereed journals and at scientific conferences. He has also
been the leader in the development of GIS technology in Brazil.

Frans Coenen has a general background in AI and has been working in the field of data mining and
knowledge discovery in data (KDD) for some ten years. He is a member of the IFIP WG12.2 Machine
Learning and Data Mining group and the British Computer Society’s specialist group in AI. He
has some 140 refereed publications on KDD and AI related research. Frans Coenen is currently a senior
lecturer within the Department of Computer Science at the University of Liverpool.

Maria Isabel Sobral Escada graduated in ecology and holds a doctorate in remote sensing from the
National Institute for Space Research (INPE). She works in the Image Processing Division (DPI) at
INPE and is vice-coordinator of GEOMA, an Amazonia modeling network composed of several institutes
of the Brazilian Ministry of Science and Technology (MCT). Her research interests include Amazonia
land use and land cover change, pattern analysis, models and their connection with social, economic,
territorial planning, and public policy issues.

Michael J. Hayes is the director for the National Drought Mitigation Center and an associate professor
in the School of Natural Resources at the University of Nebraska-Lincoln. His interests include the
economic, environmental, and social impacts of drought; developing drought monitoring and impact
assessment methodologies; and assisting states and Native American tribes with the development of
drought plans. Dr. Hayes received a Bachelor’s degree in meteorology from the University of
Wisconsin-Madison, and his Master’s and Doctoral degrees in atmospheric sciences from the University of
Missouri-Columbia.

Ronald Neil Kostoff received a PhD in aerospace and mechanical sciences from Princeton University
in 1967. He has worked for Bell Laboratories, Department of Energy, and Office of Naval Research
(ONR). He has authored over 100 technical papers, served as guest editor of three journal special issues,
obtained two text mining system patents, and presently manages a text mining pilot program at ONR.

Raymond George Koytcheff is a recent graduate of Columbia University, where he majored in
biophysics and economics-mathematics. At the Office of Naval Research, he performed text mining of
nanotechnology research. At the Naval Research Laboratory (NRL), he worked on remote sensing and
tribology research.

Ali Serhan Koyuncugil, MSc, PhD works as a statistician for the Capital Markets Board of Turkey.
He received his licence, MSc, and PhD degrees in statistics from the Department of Statistics at Ankara
University. His current research interests are the design and development of fraud detection, risk
management, early warning, surveillance, information, decision-support and classification systems; the
design and development of data warehouses and statistical databases; the development of indicators,
models and algorithms; and conducting analysis on capital markets, finance, health, SMEs, large scale
statistical research (e.g., census), population and development, and socioeconomic and demographic
affairs based on data mining, statistics, quantitative decision making, operational research, optimization,
mathematical programming, fuzzy set, and technical demography theory and applications. He has taken
part in many international and national projects (UN, IBRD, EU, etc.) and in many international and
national conferences as an organizer, reviewer and advisor. He is a member of the IASC and IASS
sections of ISI, the Turkish Statistical Association, and the Turkish Informatics Society, and is a former
vice head of the Turkish Statisticians Association.

A. V. Senthil Kumar is presently working as a senior lecturer in the Department of MCA, CMS College
of Science and Commerce, Coimbatore, Tamilnadu, India. He has more than 11 years of teaching
and 5 years of industrial experience. His research areas include data mining and image processing.

Georgios Lappas, PhD, is Lecturer of Informatics in the Department of Public Relations and
Communication in the Technological Educational Institution (TEI) of Western Macedonia, Kastoria,
Greece. He holds a BSc in physics from the University of Crete, Greece (1990) and an MSc in applied
artificial intelligence from the University of Aberdeen, UK (1993), and he received his PhD from the
University of Hertfordshire, UK for his work on “Combinatorial Optimization Algorithms Applied to
Pattern Classification.” His research interests include: pattern classification, machine learning, neural networks,
web mining, multimedia mining, the use of the Internet in politics (e-politics), in public administration
(e-Government), and in campaigning (e-campaigns).

Clifford GY Lau is a research staff member with the Institute for Defense Analyses’ Information
Technology and Systems Division. Prior to joining IDA, he worked at the Office of Naval Research
(ONR). He received a PhD in electrical engineering and computer science in 1978 from the University
of California at Santa Barbara. He has published over 40 papers and served as guest editor for the IEEE
Proceedings, and is a fellow of the IEEE.

Diana Luck, PhD, lectures in general and specialist marketing modules as well as in project management
at the London Metropolitan University. Her interest in the interdisciplinary aspects of management
research stems from her past experience in a variety of business environments. She considers business
processes to be part of a gestalt rather than a set of disjointed disciplines. Her research interests revolve
around CRM and corporate social responsibility. She would consider her main contribution to her field
of study to be the broadening of marketing into the social arena and a focus upon accountability.

Inya Nlenanya is a program coordinator with the Iowa Resource for International Service, a nonprofit
organization based in Ames, Iowa whose mission is to promote international education, development,
and peace through rural initiatives. Mr. Nlenanya obtained his Bachelor’s degree in electronic engineering
from the University of Nigeria, Nsukka. He also has a Master’s degree in agricultural engineering
from Iowa State University. He currently resides in Ames, Iowa.

Nermin Ozgulbas, MSc, PhD is an associate professor of finance at Baskent University in Turkey. She
has taught financial management, financial analysis and cost accounting at the Department of Health Care
Management at Baskent University. She has also taught financial management and cost analysis in the
distance education program in Administration of Health Care Organizations at Anadolu University. Her
research and publication activities include finance, accounting, cost accounting and cost effectiveness in
health care organizations, capital markets and SMEs. She has publications, presentations and projects in
many subject areas, including the topics mentioned earlier. Journals that have published her articles
include: The International Journal of Health Planning and Management, The Business Review Cambridge,
Journal of Economy, Business and Finance, Journal of Productivity, The Health Care Manager, Journal of
Accounting and Finance, World of the Accounting and Finance Journal, and Journal of Health and Society.

Abdul Matin Patwari is the vice chancellor of the University of Asia Pacific, Dhaka, Bangladesh.
After obtaining his PhD in electrical engineering from the University of Sheffield, UK in 1967, he held
the position of head of the Department of Electrical and Electronics Engineering (EEE) and dean of the
faculty, Bangladesh University of Engineering & Technology (BUET). He was the vice chancellor of
BUET and director general of the Islamic Institute of Technology, and also chaired many national
committees. He has served several universities as visiting professor, including Purdue University,
Indiana; California State University, Pomona; and the University of Newcastle upon Tyne. Dr. Patwari
has over 75 publications in the field of engineering science and has visited a large number of countries
as delegation head or team member.

Maira Petrini has been a professor at the Fundação Getulio Vargas-EAESP, in Brazil, since 2000.
Her research interests include business intelligence and corporate strategic planning. Professor Petrini has
also worked as an IT consultant since 2001. Her work has been published in major Brazilian journals.

Marlei Pozzebon is an associate professor at HEC Montréal, in Canada. She has been affiliated
with this institution since 2002. Her research interests include the political and cultural aspects of
information technology implementation, the use of structuration theory and critical discourse analysis
in the information systems field, business intelligence and the role of information technology in local
development, and corporate social responsibility. Before joining HEC Montréal, Professor Pozzebon had
worked at three Brazilian universities. She has also been an IT consultant since 1995. Prior to this, for
at least 10 years, she was a systems analyst. Her research has been published in, among others, the Journal
of Management Studies, Organization Studies, the Journal of Strategic Information Systems, and the
Journal of Information Technology.

Marcelino Pereira dos Santos Silva is director of the Post Graduate Department and coordinator
of the Master Program in Computer Science of the Rio Grande do Norte State University (UERN). As
professor of computer science, he has been at UERN since 1996. Born January 16, 1970 in São Paulo,
Brazil, he earned his Bachelor’s degree in computer science from the Federal University of Campina
Grande in 1992 and his PhD from the National Institute for Space Research in 2006. He is a member of
the Brazilian Computer Society, and his research interests include data mining, geographical information
science and artificial intelligence.

Tsegaye Tadesse received his BS degree in physics from Addis Ababa University, Ethiopia (1982), his
MSc in space studies from the International Space University, France (1998), and his PhD in
agrometeorology from the University of Nebraska-Lincoln, USA (2002). Dr. Tadesse is currently a
climatologist/assistant geoscientist with the National Drought Mitigation Center at the University of Nebraska-Lincoln.
His current research is on the development of new drought monitoring and prediction tools that utilize
remote sensing, GIS and data mining techniques. His other research includes data mining application
in identifying drought characteristics and their association with satellite and oceanic indices.

R.S.D. Wahidabanu is presently head, Department of CSE, Government College of Engineering,
Salem, Tamilnadu, India. She has 25 years of teaching experience. Her research areas include pattern
recognition, artificial intelligence, and data mining.

Yanbo J. Wang is currently a fourth year doctoral student in the Department of Computer Science
at the University of Liverpool, UK. He was awarded a Bachelor of administrative studies with honours,
in information technology, by York University, Canada. Yanbo’s main current research is in data mining
and text mining, especially approaches for classification association rule mining, weighted association
rule mining, and their applications.

Brian D. Wardlow received his BS degree in geography and geology from Northwest Missouri
State University (1994), the MA degree in geography from Kansas State University (1996), and the PhD
degree in geography from the University of Kansas (2005). He is currently an assistant professor with
the National Drought Mitigation Center at the University of Nebraska-Lincoln. His current research is
on the development of new drought monitoring and prediction tools that utilize remote sensing, GIS
and data mining techniques. Dr. Wardlow’s other research includes the application of remote sensing
for land cover characterization/change detection, environmental monitoring, and natural resource
management.

Xinwei Zheng is a fourth year PhD student in finance at Durham Business School, UK. He received his
MSc in accounting and finance from the University of Edinburgh, UK, and his Bachelor of economics
from Dongbei University of Finance and Economics, China. His major research interests are market
microstructure, asset pricing and investment, macroeconomics and the stock market, Chinese economics,
and data mining. He is also interested in programming with the PcGive and EViews econometrics
software, and in high frequency data analysis with Visual FoxPro.

Index

A

additive models 150, 151
allocating pattern (ALP) 112, 113, 118, 121–131
Amazonia 55–60–69, 72, 73, 293, 294, 300
Amazonia forest 57
apriori algorithm 114, 121, 140
artificial neural network (ANN) 81, 112, 224
association rule (AR) 43, 45, 46, 52, 53, 81, 82, 84, 86, 111, 112, 126, 131–143, 157, 280, 283, 287, 290–298, 303–306, 311, 314, 317, 320, 322, 324
association rule mining (ARM) 53, 58, 113, 114, 115, 132, 133, 134, 298, 304, 317, 320

B

Basel-II 227
Bayesian classifiers 84
bibliometrics 187, 202, 203, 219, 220, 307, 308, 313
bioinformatics 41, 111, 135, 171, 278, 295, 312, 322
biophysical 214, 282, 283, 284, 287, 288, 289, 290, 319
brain-derived neurotrophic factor (BDNF) 12
business intelligence (BI) 24, 158, 175, 177, 186, 188, 198, 241, 242, 257, 259, 260, 299, 301–321

C

causation 30
Chi-square Automatic Interaction Detector (CHAID) 225, 229, 231, 232, 240, 314
close coupling 270
closed directed graph approach 43
cluster affinity search technique (CAST) 145
cluster cleaning 145
collaborative filtering 80, 84, 85, 92, 171, 312
competitive intelligence 246, 248, 249, 256
computational intelligence iii, vi, 26
computer terminal network (CTN) 190
content-based image retrieval (CBIR) 59
customer relationship management (CRM) 32, 96–109, 137, 185, 187, 294, 297, 306, 311, 313, 316

D

data archeology 138
data cleaning 7
data pattern processing 138
data transformation 7
decision tree 12, 14, 68, 72, 83, 140, 141, 142, 225, 229, 231, 232, 233
deep packet inspection (DPI) 177
domain knowledge 33
dynamic hashing algorithm (DHA) 49, 50, 52

E

early warning system 221–229, 234–239, 306
Earth observation (EO) systems 63, 64, 71, 73
electronic data interchange (EDI) 174, 190, 195
environmental informatics 269
extrapolation 33

F

false negatives 33
false positives 33
FLP classification 12
FLP method 6
fuzzy clustering analysis 203
fuzzy system 81

G

genetic algorithms 81, 84, 87, 158, 299
geographical information system (GIS) 61, 64, 73, 173, 177, 262–279, 283, 288, 290–303, 314, 319, 320, 321, 323
geographic data mining 274
geospatial component 268, 269, 270, 274
geovisualization 270
global positioning systems (GPS) 264
graphical user interface (GUI) 193
grid computing 263, 267

H

HIV-1 associated dementia (HAD) 1, 2, 22
Human Drug Metabolism Database (hDMdb) 179

I

image domain 65
image mining 55–75
information discovery 78, 138, 223
intelligent agents 52, 79, 84, 179
interpolation 33

J

Java Foundation Classes (JFC) 273

K

knowledge center 143, 144, 145, 154
knowledge extraction 138, 266
knowledge society 137, 163, 166, 170–177, 183, 184, 187, 222, 223, 237, 307

L

land use change 59, 65, 66, 69, 70–73, 296
limnological 163, 178
LINDO Systems Inc 7
loose coupling 268, 270

M

machine learning 2, 17, 22, 53, 74–95, 111, 112, 134, 158, 159, 163, 171, 179, 185, 188, 222, 265, 266, 272–274, 282, 293, 299, 315–323
multiple criteria linear programming (MCLP) 4, 12
multiple criteria programming (MCP) 1, 2, 3
multiple criteria quadratic programming (MCQP) 4

N

nanoscience 199, 200, 203, 220, 307
nanotechnology 199–212, 216–220, 307, 308
National Ecological Observatory Network (NEON) 263
National Institute for Space Research 55, 56, 60, 74, 75, 305, 318
node addition 145
node removal 145
non-linearity 30

O

online analytical processing (OLAP) 194
online transaction processing (OLTP) 194
open and distance learning xiv, 164, 172

P

Pareto principle 100
Perpetual Inventory (PI) system 195
point-of-sales (POS) data 191

Q

Query Statistics 193

R

radio frequency identification (RFID) 189, 196
recommendation systems 80, 85
Remote sensing image mining 57
return on investment (ROI) analysis 192
rough set theory (RS) 139, 159, 324

S

satellite data 56, 284, 287, 288, 290, 296
self-organizing map 81
self-organizing maps (SOM) 82
Semantic Web 79, 88, 91, 304
simple quadratic programming (SQP) 20
spatial analysis 61, 186, 268, 270, 271, 276, 279, 293, 304, 323
spatial data 57, 58, 74, 171, 177, 182, 184, 187, 266, 267, 270–279, 290, 292, 301–320
spatial data mining 57
spatial decision support system (SDSS) 271
spatial patterns 57, 59, 60–72
spatial resolution 57
spectral resolution 57
Standardized Seasonal Greenness (SSG) 285, 286, 287
structural classifier 65, 68, 70, 72
support vector machine (SVM) 3, 20, 84, 94, 319
sustainable development 73, 172, 178, 184, 188, 262, 269, 271, 275, 279, 290, 293, 318, 323

T

Teradata Corporation 191
tight coupling 268

V

visual data mining 266, 277, 306

W

Wal-Mart 189
Web content mining 77, 78, 79, 81, 83
Web mining 76–79, 81–95, 297, 299, 300, 305, 307, 314, 317, 324
Web structure mining 79, 80, 81, 84
Web usage mining 79–95, 299, 319, 323
weighted association rule (WAR) 110, 134, 135, 311, 322
weighted association rule mining (WARM) 113, 115