Engineering Applications of Neural Networks
Volume Editors
Dominic Palmer-Brown
London Metropolitan University
Faculty of Computing
London, UK
E-mail: d.palmer-brown@londonmet.ac.uk
Chrisina Draganova
University of East London
School of Computing, IT and Engineering
London, UK
E-mail: c.draganova@uel.ac.uk
Elias Pimenidis
University of East London
School of Computing, IT and Engineering
London, UK
E-mail: e.pimenidis@uel.ac.uk
Haris Mouratidis
University of East London
School of Computing, IT and Engineering
London, UK
E-mail: h.mouratidis@uel.ac.uk
CR Subject Classification (1998): F.1, I.2, H.5, I.7, I.5, I.2.7, I.4.8
ISSN 1865-0929
ISBN-10 3-642-03968-5 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-03968-3 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer. Violations are liable
to prosecution under the German Copyright Law.
springer.com
© Springer-Verlag Berlin Heidelberg 2009
Printed in Germany
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper SPIN: 12744978 06/3180 543210
Preface
A cursory glance at the table of contents of EANN 2009 reveals the amazing range of neural network and related applications. A random but revealing sample includes: reducing urban concentration, entropy topography in epileptic electroencephalography, phytoplanktonic species recognition, revealing the structure of childhood abdominal pain data, robot control, discriminating angry and happy facial expressions, flood forecasting, and assessing credit worthiness. The diverse nature of applications demonstrates the vitality of neural computing and related soft computing approaches, and their relevance to many key contemporary technological challenges. It also illustrates the value of EANN in bringing together a broad spectrum of delegates from across the world to learn from each other's related methods. Variations and extensions of many methods are well represented in the proceedings, ranging from support vector machines, fuzzy reasoning, and Bayesian methods to snap-drift and spiking neurons.
This year EANN accepted approximately 40% of submitted papers for full-length presentation at the conference. All members of the Program Committee were asked to participate in the reviewing process. The standard of submissions was high, according to the reviewers, who did an excellent job. The Program and Organizing Committees thank them. Approximately 20% of submitted papers will be chosen, the best according to the reviews, to be extended and reviewed again for inclusion in a special issue of the journal Neural Computing and Applications.
We hope that these proceedings will help to stimulate further research and
development of new applications and modes of neural computing.
Organizing Committee
Chair: Dominic Palmer-Brown London Metropolitan University, UK
Chrisina Draganova University of East London, UK
Elias Pimenidis University of East London, UK
Haris Mouratidis University of East London, UK
Sin Wee Lee University of East London, UK
Manolis Christodoulakis University of East London, UK
Miao Kang University of East London, UK
Frank Ekpenyong University of East London, UK
Terry Walcott University of East London, UK
Administrative Support:
Linda Day University of East London, UK
Farhanaz Begum University of East London, UK
Carole Cole University of East London, UK
Program Committee
Chair: Dominic Palmer-Brown London Metropolitan University, UK
Graham Ball Nottingham Trent University, UK
Chris Christodoulou University of Cyprus, Cyprus
Manolis Christodoulakis University of East London, UK
Chrisina Draganova University of East London, UK
Frank Ekpenyong University of East London, UK
Alexander Gegov University of Portsmouth, UK
Lazaros Iliadis Democritus University of Thrace, Greece
Miao Kang University of East London, UK
Sin Wee Lee University of East London, UK
James Allen Long London South Bank University, UK
Liam McDaid University of Ulster, UK
Haralambos Mouratidis University of East London, UK
Elias Pimenidis University of East London, UK
Sponsoring Institutions
University of East London, UK
London Metropolitan University, UK
International Neural Network Society INNS, USA
Table of Contents
New Aspects of the Elastic Net Algorithm for Cluster Analysis . . . . . . . . 281
Marcos Lévano and Hans Nowak
Intelligent Agents Networks Employing Hybrid Reasoning
L.S. Iliadis and A. Papaleonidas
Democritus University of Thrace, 193 Padazidou st., 68200, Nea Orestiada, Greece
liliadis@fmenr.duth.gr
Abstract. This paper presents the design and development of an agent-based intelligent hybrid system. The system consists of a network of interacting intelligent agents aiming not only at real-time air pollution monitoring but also at proposing proper corrective actions. In this manner, the concentration of air pollutants is managed on a real-time scale and, as the system is informed continuously on the situation, an iterative process is initiated. Four distinct types of intelligent agents are utilized: Sensor, Evaluation, Decision and Actuator. There are also several types of Decision agents, depending on the air pollution factor examined. The whole project has a hybrid nature, since it utilizes fuzzy logic and fuzzy algebra concepts together with crisp values and a rule-based inference mechanism. The system has been tested by the application of actual air pollution data related to four years of measurements in the area of Athens.
1 Introduction
1.1 Aim of This Project
The present work aims firstly to design and develop a network of intelligent distributed agents that employ hybrid inference techniques in order to offer real-time monitoring and to address the air pollution problem by suggesting corrective actions mainly related to human activities. Intelligent agents are software programs with an attitude. An agent is autonomous because it operates without the direct intervention of humans or others and has control over its actions and internal state. An agent is social because it cooperates with humans and with other agents in order to achieve its tasks. An agent is reactive since it perceives the environment and responds to changes that occur in it [3]. A multiagent system can model complex cases and introduce the possibility of agents having common or conflicting goals [3]. Hybrid intelligent systems are becoming popular due to their capabilities in handling many real-world complex problems. They provide us with the opportunity to use both our knowledge and raw data to solve problems in a more interesting and promising way. This multidisciplinary research field is in continuous expansion in the artificial intelligence research community [11].
The research effort described here is characterized as a hybrid one. This is due to the fact that it utilizes fuzzy logic and fuzzy sets on the one hand, but also crisp data on the other: fuzzy "IF-THEN" rules where both the antecedents and the consequents take linguistic values, but also rule sets with crisp numerical values; at the same time it utilizes the basic properties of a distributed network of interacting intelligent agents. Several intelligent agent systems based on this state-of-the-art technology have been developed, with very useful application domains [14], [10].
The multiagent network is informed on the air pollution situation on a scale of seconds and operates in an iterative process. The whole model is based on a simple and very crucial principle: the received data and the inference engine determine the corrective actions and vice versa. The proposed and implemented system is realized by employing a distributed architecture, since the network comprises several interacting independent agents of four distinct types. This offers the advantage of parallel execution of several instances of the same agent type. The system has been tested with actual air pollution data gathered from the Patission Street site, which is located right in the center of Athens at an altitude of 105 meters.
There are several air pollutants that have harmful effects on the health of the citizens of major cities. Some of them do not have a direct effect but initiate secondary harmful processes. This research considers carbon monoxide (CO), carbon dioxide (CO2), nitric oxide (NO), nitrogen dioxide (NO2), particulate matter (PM) and ozone (O3), which is one of the most pervasive and potentially harmful air pollutants, especially in major metropolitan centers like Athens, Greece. It is a critical atmospheric species which drives much of the tropospheric photochemistry. It is also considered responsible for regulating the tropospheric oxidation capacity and is the main ingredient of photochemical smog. It is formed when volatile organic compounds (VOCs), nitric oxide and nitrogen dioxide react chemically under the influence of heat and sunlight [17]. Various medical studies have revealed that ozone can be blamed for inflammation and irritation of the respiratory tract, particularly during heavy physical activity, as well as for ocular diseases [6], [7], [19]. Various artificial neural networks have been developed in order to model ground-level O3 concentrations [2], [6], [16], [20].
Meteorological data are also used as input to the system, namely: temperature, relative humidity, the NW-SE direction wind component (u') and the SW-NE direction wind component (v'). The selection of the u' and v' components instead of the conventional ones, u (W-E) and v (S-N), was considered necessary as u' is almost parallel to the Saronic Gulf coast and v' to the direction of the sea breeze circulation and the main axis of the Athens Basin.
2 Theoretical Background
Fuzzy logic is a "real world" approximator and it has been used on a wide scale for risk modeling, especially in the case of environmental or natural hazards [13].
Human inference is very approximate. Our statements depend on the context and we describe our physical world in rather vague terms. Imprecisely defined "classes" are an important part of human thinking [13]. The term "risky" due to an involved parameter is both imprecise and subjective, and it is determined by a membership function that might have a dissimilar shape in each case. The choice of a shape for each particular linguistic variable is both subjective and problem-dependent [13]. Any function μ_s(X) → [0,1] describes a membership function associated with some fuzzy set. The trapezoidal and the triangular membership functions are given by functions (2) and (3), respectively [13].
$$
\mu_s(X) =
\begin{cases}
0 & \text{if } X < a \\
\dfrac{X-a}{m-a} & \text{if } X \in [a, m) \\
1 & \text{if } X \in [m, n] \\
\dfrac{b-X}{b-n} & \text{if } X \in (n, b] \\
0 & \text{if } X > b
\end{cases}
\qquad (2)
$$

$$
\mu_s(X) =
\begin{cases}
0 & \text{if } X < a \\
\dfrac{X-a}{c-a} & \text{if } X \in [a, c) \\
\dfrac{b-X}{b-c} & \text{if } X \in [c, b) \\
0 & \text{if } X > b
\end{cases}
\qquad (3)
$$
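To make the two membership shapes concrete, the following is a minimal sketch of functions (2) and (3) in Java (the language the system's agents are written in). It is only an illustration of the formulas above, not code from the authors' system; the class, method and parameter names are ours.

```java
/** Illustrative implementations of the trapezoidal (2) and triangular (3)
 *  membership functions; parameter names follow the formulas above. */
public final class Membership {

    /** Trapezoidal membership, eq. (2): rises on [a,m), equals 1 on [m,n], falls on (n,b]. */
    public static double trapezoidal(double x, double a, double m, double n, double b) {
        if (x < a || x > b) return 0.0;
        if (x < m) return (x - a) / (m - a);
        if (x <= n) return 1.0;
        return (b - x) / (b - n);
    }

    /** Triangular membership, eq. (3): rises on [a,c), falls on [c,b). */
    public static double triangular(double x, double a, double c, double b) {
        if (x < a || x > b) return 0.0;
        if (x < c) return (x - a) / (c - a);
        return (b - x) / (b - c);
    }

    public static void main(String[] args) {
        // Example: degree of membership of a crisp reading x = 3.0
        // to a triangular set defined on [1, 4] with its peak at 2.5.
        System.out.println(Membership.triangular(3.0, 1.0, 2.5, 4.0)); // prints 0.666...
    }
}
```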
For example, in the case of air pollution due to CO, at least three fuzzy sets can be formed, which could have the following form:

FS̃1 = {Risky_momentum_due_to_CO}
FS̃2 = {Very_Risky_momentum_due_to_CO}
FS̃3 = {Extremely_Risky_momentum_due_to_CO}

Every temporal instance under examination belongs to all three of the above fuzzy sets with a different degree of membership (in the closed interval [0,1]), which is determined by the triangular membership function. So each fuzzy set is actually a vector of ordered pairs of the form (Ci, μi), where Ci is the crisp value and μi is the degree of membership of the value Ci to the corresponding fuzzy set. The fuzzy set with the highest degree of membership indicates the actual "linguistic" value that can be assigned to a temporal instance. For example, if μi(FS̃3) = 0.9 then the situation is rather critical, since the temporal instance is characterized as extremely risky due to CO. In this case, special actions have to be taken. Another advantage of this approach is that an "extremely risky" linguistic might trigger a different batch of actions compared to another one if the degrees of membership differ significantly.
A major key issue is the determination of the overall degree of risk when two or more parameters are involved. It can be faced by the use of the fuzzy conjunction operators, which are called T-norms. So let us suppose that we need to estimate the unified degree of membership to a fuzzy set FS̃4 that is the fuzzy conjunction between two other fuzzy sets:

FS̃4 = {Extremely_Risky_momentum_due_to_CO} AND {Risky_momentum_due_to_NO}
In this case, μi(FS̃4) will be determined by the application of one of the T-norm functions described in [13]; the simplest is the minimum approach:

MIN(μ_A(X), μ_B(X))

Such fuzzy intelligent systems have been developed by Iliadis et al. [12]. When a weighted conjunction is required according to the importance of each parameter (which is estimated empirically), the following function (8) can be employed. The aggregation function Agg can be any of the above mentioned T-norms [12].
$$
\mu_{\widetilde{S}}(x_i) = \mathrm{Agg}\Big( f\big(\mu_{\widetilde{A}}(x_i), w_1\big),\; f\big(\mu_{\widetilde{A}}(x_i), w_2\big),\; \ldots,\; f\big(\mu_{\widetilde{A}}(x_i), w_n\big) \Big) \qquad (8)
$$
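As an illustration of how such a weighted conjunction could be computed, the sketch below combines the minimum T-norm with a simple exponent-based weighting f(μ, w) = μ^w. The paper does not specify the weighting function f, so this particular choice, like the class and variable names and the numbers used, is only an assumption for demonstration purposes.

```java
import java.util.Arrays;

/** Illustrative weighted fuzzy conjunction in the spirit of eq. (8):
 *  each membership degree is first weighted, then aggregated with a T-norm. */
public final class WeightedConjunction {

    /** Hypothetical weighting function f(mu, w); the paper leaves f unspecified. */
    static double weight(double mu, double w) {
        return Math.pow(mu, w); // w > 1 strengthens, 0 < w < 1 relaxes the requirement
    }

    /** Minimum T-norm used as the aggregation operator Agg. */
    static double aggregateMin(double[] values) {
        return Arrays.stream(values).min().orElse(0.0);
    }

    public static void main(String[] args) {
        // Degrees of membership of one temporal instance to the risk sets
        // of, say, CO, NO2 and O3 (illustrative values only).
        double[] memberships = {0.9, 0.6, 0.75};
        double[] importance  = {1.0, 2.0, 1.5};   // empirically chosen weights

        double[] weighted = new double[memberships.length];
        for (int i = 0; i < memberships.length; i++) {
            weighted[i] = weight(memberships[i], importance[i]);
        }
        System.out.println("Unified degree of risk: " + aggregateMin(weighted));
    }
}
```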
The system is hardware independent, since both the multiagent development platform used (Jade) and the implemented agents have been programmed in Java. Jade is a software platform providing middleware-layer functionalities independent of the specific application. It was released as open source in 2000 by Telecom Italia under the LGPL (GNU Lesser General Public License) [3]. Though there are also other agent communication languages, such as the ARPA Knowledge Query and Manipulation Language (KQML) [1], Jade was chosen due to the fact that it is open source and Java based.
The developed system consists of four types of intelligent agents, namely the Sensor agents, the Evaluation agents, the Decision making agents and finally the Actuators. Before describing the system's architecture, it should be specified that the messages exchanged when two intelligent agents interact are specified and encoded according to the FIPA (Foundation for Intelligent Physical Agents) standardization. FIPA is dominated by computer and telecommunications companies and is focused on agent-level issues [4]. Thus INFORM, CONFIRM, AGREE, ACCEPT_PROPOSAL, REFUSE, REQUEST and SUBSCRIBE are some characteristic types of exchanged messages. The following figure 2 depicts the general architecture of the system that has been developed in this study.
interface and changeability of the temporal interval). It also checks for the existence of the Directory Facilitator (DF), which (under the JADE environment) stores several details regarding each intelligent agent (e.g. name, owner, PC where the agent is connected, ontologies, the type of the agent). The setup ensures that there is no identical agent in the system and it registers the agent in the DF. After the registration, the other two behaviors are activated.
The "Read Data behavior" is executed when a specific amount of time passes (the interval time) and it is responsible for the input of the parameters' values. During the execution of each iteration, several checks are made for potential errors in the input file. If errors are encountered, the agent is marked as "non operational".
The third behavior is the "Interchange of messages", which undertakes the communication of the agent with the rest of the system (the other agents). It checks the message queue of the agent for incoming messages. The following message types are handled. "Subscribe" messages aim at checking and updating the system's status; to each subscribe request the agent responds with a confirm or failure subscribe message, depending on the functionality of the agent as defined by the "Read Data behavior". "Request" messages aim at updating the system with the last crisp measurement that the agent holds; the sensor agent sends a reply of type inform, refuse or failure depending on the time elapsed since the previous request and, of course, on its status. In messages of "Propose" type the agent is asked to change its time interval; depending on its setup arguments it answers the proposal with agree or with cancel. The last type of message it accepts is the "Request whenever", where other agents require the last data input to a sensor agent regardless of its operational status; the answer is of type inform.
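To make the sensor agent's behaviors more concrete, here is a minimal JADE-style sketch in Java: registration with the Directory Facilitator in setup(), a periodic read behavior, and a cyclic behavior that answers REQUEST messages with INFORM or REFUSE. This is not the authors' code; the class name, service type and content format are illustrative assumptions, and error handling is reduced to a minimum.

```java
import jade.core.Agent;
import jade.core.behaviours.CyclicBehaviour;
import jade.core.behaviours.TickerBehaviour;
import jade.domain.DFService;
import jade.domain.FIPAException;
import jade.domain.FIPAAgentManagement.DFAgentDescription;
import jade.domain.FIPAAgentManagement.ServiceDescription;
import jade.lang.acl.ACLMessage;

/** Illustrative sensor agent: reads a measurement periodically and
 *  replies to REQUEST messages with the last crisp value. */
public class SensorAgent extends Agent {

    private double lastValue = Double.NaN;
    private boolean operational = false;

    @Override
    protected void setup() {
        // Register the agent with the Directory Facilitator (DF).
        DFAgentDescription dfd = new DFAgentDescription();
        dfd.setName(getAID());
        ServiceDescription sd = new ServiceDescription();
        sd.setType("sensor");                 // illustrative service type
        sd.setName("co-sensor");              // illustrative service name
        dfd.addServices(sd);
        try {
            DFService.register(this, dfd);
        } catch (FIPAException e) {
            e.printStackTrace();
        }

        // "Read Data" behavior: executed every interval (here 10 s).
        addBehaviour(new TickerBehaviour(this, 10000) {
            @Override
            protected void onTick() {
                lastValue = readSensor();     // hypothetical data source
                operational = !Double.isNaN(lastValue);
            }
        });

        // "Interchange of messages" behavior: answer incoming requests.
        addBehaviour(new CyclicBehaviour(this) {
            @Override
            public void action() {
                ACLMessage msg = receive();
                if (msg == null) { block(); return; }
                if (msg.getPerformative() == ACLMessage.REQUEST) {
                    ACLMessage reply = msg.createReply();
                    reply.setPerformative(operational ? ACLMessage.INFORM
                                                      : ACLMessage.REFUSE);
                    reply.setContent(Double.toString(lastValue));
                    send(reply);
                }
            }
        });
    }

    /** Placeholder for the actual measurement input (e.g. from a file). */
    private double readSensor() {
        return Math.random(); // stand-in value for the sketch
    }
}
```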
The risk estimation model described above has been used in the development of the intelligent agent network in order to determine the seriousness of the situation. More specifically, the following two semi-trapezoidal functions (9) and (10) and the triangular membership function (11) are applied. The boundaries of the membership functions for the air pollutants were estimated according to the European Union directives 1999/30/EC [18] and 2000/69/EC [5] and COM(2000)613 [9], and for the meteorological factors they were specified according to the Heat Index and the thermal comfort model of ISO Standard 7730/1984 [15]. They are all presented in the following table 1.
$$
\mu_s(x; a, b) =
\begin{cases}
1 & x \le a \\
\dfrac{b - x}{b - a} & a < x < b \\
0 & b \le x
\end{cases}
\qquad (9)
$$

$$
\mu_s(x; f, g) =
\begin{cases}
0 & x \le f \\
\dfrac{x - f}{g - f} & f < x < g \\
1 & g \le x
\end{cases}
\qquad (10)
$$

$$
\mu_s(x; c, d, e) =
\begin{cases}
0 & x \le c \\
\dfrac{x - c}{d - c} & c < x \le d \\
\dfrac{e - x}{e - d} & d < x < e \\
0 & e \le x
\end{cases}
\qquad (11)
$$
Table 1. Boundaries of the membership functions for each parameter

| Parameter       | a    | b    | c    | d    | e    | f    | g    | Unit  |
|-----------------|------|------|------|------|------|------|------|-------|
| Temperature     | 20   | 30   | 20   | 30   | 38   | 30   | 38   | °C    |
| CO              | 1    | 2.5  | 1    | 2.5  | 4    | 2.5  | 4    | mg/m³ |
| NO              | 18   | 83   | 18   | 83   | 148  | 83   | 148  | μg/m³ |
| NO2             | 50   | 130  | 50   | 130  | 200  | 130  | 250  | μg/m³ |
| O3              | 80   | 120  | 80   | 120  | 210  | 120  | 280  | μg/m³ |
| PM10            | 30   | 67   | 30   | 67   | 79   | 67   | 90   | μg/m³ |
| Air pressure    | 950  | 1025 | 950  | 1025 | 1200 | 1026 | 1200 | hPa   |
| Rel. humidity   | 30   | 50   | 30   | 50   | 70   | 50   | 70   | %     |
| Solar radiation | 250  | 530  | 250  | 530  | 810  | 530  | 810  | W/m²  |
| Wind speed      | 1.8  | 3.6  | 1.8  | 3.6  | 5.4  | 3.6  | 5.4  | m/s   |
| Wind u'         | 20   | 28   | 20   | 28   | 36   | 28   | 36   | m/s   |
| Wind v'         | 20   | 28   | 20   | 28   | 36   | 28   | 36   | m/s   |
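As a worked illustration of how a crisp reading is turned into the three memberships of functions (9)–(11), the sketch below applies the CO boundaries from Table 1 (a = 1, b = 2.5; c = 1, d = 2.5, e = 4; f = 2.5, g = 4) and reports the linguistic with the highest degree of membership. The labels "normal", "high" and "critical" and the mapping of the three functions onto them are our reading of the text, not the authors' exact naming.

```java
/** Illustrative evaluation of a crisp CO value against the three
 *  membership functions (9), (10) and (11) with boundaries from Table 1. */
public final class CoRiskEvaluation {

    // Left shoulder (9): fully "normal" below a, falling to 0 at b.
    static double normal(double x, double a, double b) {
        if (x <= a) return 1.0;
        if (x >= b) return 0.0;
        return (b - x) / (b - a);
    }

    // Triangle (11): "high" rises from c to d, falls from d to e.
    static double high(double x, double c, double d, double e) {
        if (x <= c || x >= e) return 0.0;
        if (x <= d) return (x - c) / (d - c);
        return (e - x) / (e - d);
    }

    // Right shoulder (10): "critical" rises from f, fully 1 above g.
    static double critical(double x, double f, double g) {
        if (x <= f) return 0.0;
        if (x >= g) return 1.0;
        return (x - f) / (g - f);
    }

    public static void main(String[] args) {
        double co = 3.1; // crisp CO reading in mg/m3 (illustrative)
        double muNormal   = normal(co, 1.0, 2.5);
        double muHigh     = high(co, 1.0, 2.5, 4.0);
        double muCritical = critical(co, 2.5, 4.0);

        System.out.printf("normal=%.2f high=%.2f critical=%.2f%n",
                          muNormal, muHigh, muCritical);
        // The linguistic with the highest membership characterizes the instance.
        String linguistic = muNormal >= muHigh && muNormal >= muCritical ? "normal"
                          : muHigh >= muCritical ? "high" : "critical";
        System.out.println("Assigned linguistic: " + linguistic);
    }
}
```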
The following figure 4 presents the structure of the employed three functions.
Fig. 4. The three functions applied for the determination of the three potential situations
The Evaluation agents are responsible for the assessment of the input values that they accept from the sensor agents, by applying their internal logic. They feed the system with fuzzy risk values according to their fuzzy perspective. From the architectural point of view, each evaluation intelligent agent is logically connected to a sensor intelligent agent, whereas a sensor agent can feed several evaluation agents. There are three behaviors related to an evaluation agent. The "setup behavior" acts in a similar manner to the corresponding behavior of the sensor agent. It is executed once when the agent is created and it is responsible for the registration of the agent in the DF and for the synchronization of the time interval of an evaluation intelligent agent with the time interval of its connected sensor agent. The "Read Data behavior" is executed periodically after an interval of time passes and it is responsible for reading the value of the connected sensor agent, evaluating it and providing the system with the produced output. In every iteration, it sends a request type message to the connected sensor agent. Depending on the answer it receives to the request, it updates the system. If it receives an "inform" type of answer, it performs the fuzzification process and it updates the system with the degree of membership of the input value to each predefined fuzzy set. If a critical situation is encountered, or if the situation is getting worse from one measurement to another, the evaluation agent might send a proposal message to the sensor agent, proposing the intensification of the time interval, to which the sensor agent will respond based on its own rules. If a "refuse" type answer is received, this means that the value of the sensor agent has not changed because the interval has not yet passed; this puts the evaluation agent into a "hold on" status. A "failure" answer means that the sensor agent is not working properly; in this case a time interval reset command is sent to the sensor agent and executed.
The last behavior is the "Reply to messages", which handles the messages that the agent receives from its environment. It is executed periodically. For every message in the agent's queue, its type is checked (based on the FIPA ACL message type). The first message type that can be accepted is the subscribe message: for each subscribe request accepted by the evaluation agent, it creates and sends a subscribe message to the connected sensor agent. Depending on the operational status of the sensor agent, as defined by the "Read Data behavior", it answers the system with a "confirm" or with a "failure subscribe" message. In this way the system is notified of the statuses of both the evaluation and the sensor agents. Messages of type "inform" aim at the time synchronization between the agents: in every "inform" message the system's timestamp is compared to the time interval of the agent and, if it is higher, the agent warns of a potential malfunction of the system. Finally, messages of the type "Request whenever" are employed to ask for the update of the Decision making agents with the actual crisp values, which are used together with the fuzzy values by the decision making process of the system. This is the hybrid part of the system's inference engine. The answer to this type of message is of the type "inform". The following figure 5 is a detailed description of an evaluation agent and its basic behaviors.
The System's intelligent agent is responsible for the proper synchronization of the other agents and for the storage of the fuzzy output of the evaluation agents. The system's agent is the most complex of all and it comprises five behaviors. The first is the "setup behavior", which is similar to the corresponding behavior of the other agent types. The "Time synchronize behavior" is executed periodically at random and short time intervals. In every execution it checks the DF for connected agents and it sends to all of the connected agents messages of the "inform" type with the current timestamp of the system, in order to enable its comparison to the timestamp of each agent. Another periodically executed behavior is the "Discover Agents", which operates in a "ping-pong" fashion. It checks the connection status of the agents to the system and it removes from the DF all agents that are disconnected. It sends "subscribe" messages to all agents that were expected to be connected but are not. If it receives a "confirm" message, it keeps the agent in the DF and in the list of active agents. If it does not receive an answer for more than a certain amount of time (e.g. 2 sec), it considers that the answer is "failure". The "Reply to requests" behavior answers registered and authorized agents in the DF and informs them on the situation of the environment, sending actual fuzzy data from the databank with a message of the type "inform", or "failure" otherwise. The "Accept Value Inform" (periodically executed) behavior reads messages sent by the evaluation agents regarding the situation of the environment and it informs the databank of the system. In each of its iterations it takes messages of the "inform" type from the message queue and it checks whether the message sender (agent) is registered in the DF and authorized. In this case it informs the databank with the new values.
The following figure 6 presents the actual structure and the basic operations performed in a System's intelligent agent.
The "Decision" type intelligent agents perform (each one based on its reasoning and on its knowledge base) an assessment of the environmental condition, regarding the parameters examined by each one, and they decide on the direction of changing the environmental conditions by proposing interventions and actions towards the improvement of the situation. Each of the Decision type intelligent agents is characterized by its own distinct properties which support the decision making process; they vary in the degree of significance of each one and in the number of checks required. These particularities, combined with the need for the construction of an open system in which user groups will be able to add custom Decision type agents, led to various forms of Decision agents that use the platform to obtain data vectors and then act independently to output proper actions and results. All of the Decision agents are compatible with the FIPA protocol of communication and they contact the system's agent or the evaluation agents using the messages described above. They have three behaviors. The following description is related to the Temperature and Relative Humidity Decision agents. The first is the "setup behavior". The "keep connected to Evaluation Agents" behavior allows the decision agents to check (at random time intervals) whether the evaluation agents are connected to the system and whether they operate properly. In each iteration the operational statuses of the evaluation agents related to Temperature and to Relative Humidity are checked. A main behavior is the (periodic) "Decide and Evaluate", which performs the assessment of the data and of the examined case and leads to the determination of the proper actions. It reads from the system agent the fuzzy degrees of membership for the Temperature and the Relative Humidity of the environment.
When both parameters belong to the fuzzy set normal, then the Heat Index is in the scale "Comfortable", where the agent sets the inform groups and the proposed actions to zero, informing the actuator agent.
If the system's agent sends values higher than Normal (high or critical), then the agent sends a message of type "request whenever" to the corresponding evaluation agents in order to obtain the actual crisp Temperature and Relative Humidity values. Based on these values it estimates the Heat Index, which is then translated to proper fuzzy linguistic values. The linguistics used are namely: slightly uncomfortable, uncomfortable, very uncomfortable, intolerable and death danger [21].
Depending on the characterization of the case, the vulnerable groups of the population are picked and all the necessary measures are proposed and sent to the connected actuator agent.
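The decision flow just described can be sketched as follows. The Heat Index computation itself and the band thresholds are not given in the paper (they come from Steadman [21]), so computeHeatIndex() is left as a hypothetical stand-in and the numeric thresholds and the 0.5 membership cut-off below are purely illustrative placeholders; only the overall flow (check memberships, request crisp values, map the index to a linguistic, forward it) follows the text.

```java
/** Illustrative flow of the Temperature / Relative Humidity Decision agent. */
public final class HeatIndexDecision {

    /** Hypothetical stand-in; the real Heat Index formula is taken from [21]
     *  and is not reproduced here. */
    static double computeHeatIndex(double temperatureC, double relHumidityPct) {
        return temperatureC; // placeholder only, so the sketch stays runnable
    }

    /** Maps a heat index value to one of the linguistics used by the agent.
     *  The threshold values are placeholders, not the authors' calibration. */
    static String toLinguistic(double heatIndex) {
        if (heatIndex < 30) return "comfortable";
        if (heatIndex < 34) return "slightly uncomfortable";
        if (heatIndex < 38) return "uncomfortable";
        if (heatIndex < 44) return "very uncomfortable";
        if (heatIndex < 52) return "intolerable";
        return "death danger";
    }

    /** Decision step: if both parameters are "normal", report comfortable;
     *  otherwise use the crisp values obtained via "request whenever" messages. */
    static String decide(double muTempNormal, double muRhNormal,
                         double crispTemp, double crispRh) {
        if (muTempNormal >= 0.5 && muRhNormal >= 0.5) {   // both belong to "normal"
            return "comfortable"; // no groups to inform, no actions proposed
        }
        double hi = computeHeatIndex(crispTemp, crispRh);
        return toLinguistic(hi); // forwarded to the actuator agent
    }

    public static void main(String[] args) {
        System.out.println(decide(0.1, 0.3, 36.0, 55.0));
    }
}
```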
3.6 Actuators
The Actuators receive messages from the connected Decision agents and, if there is no reason for blocking their execution (any kind of contradiction with other actions proposed by other agents), they send the decisions to the policy makers in order to be executed. For example, if none of the evaluation agents reports a critical situation for any of the air pollution factors, then the actuators propose to the decision makers the opening of the city centre ring to cars. The opposite action is taken if the situation is characterized as critical for at least one air pollution parameter.
The agent for the estimation and evaluation of the heat index checked the situation of the environment 4233 times, running in an iterative mode. Out of the 4233 iterations, the agent calculated the heat index in 4123 cases, since in the remaining 109 the two determining factors (temperature and relative humidity) were characterized as being in a normal state.
In 1958 cases the heat index proved to be higher than the actual temperature, whereas in 2615 instances it was lower. This can easily be explained by the dry climate of the Attica area. In total, in 2748 cases the heat index was below the temperature of 30 degrees C and thus no action was required. In 1256 instances the heat index was characterized as slightly uncomfortable, whereas 119 times it was characterized as uncomfortable, with an immediate need for actions directed especially towards the protection of the most vulnerable population groups.
The agent network (based on the existing data) also checked the situation of the air quality in terms of air pollutants' concentration. Five times the situation became "very risky" for the NO2 concentration, with durations of 2, 1, 2, 4 and 3 hours respectively (the official limit is 250mg for 3 hours). In four cases there was a "critical situation" regarding the concentration of NO2 (the official limit is 360mg for 3 hours); the temporal duration of the critical condition was 1, 4, 2 and 4 hours respectively. Of course the critical incidents would have been many more if we also had night and winter data at our disposal; unfortunately the data that we have gathered are daily and related only to summer months. For the particulate matter PM10 the situation became "very risky" (the official limit is 90mg for 5 continuous days), whereas in 3 cases the situation became "critical" for 7 continuous days (the official limit is 110mg for 7 continuous days). Once we had a red alert for the PM10 and for the NO2 simultaneously. In every risky situation the map of the city center ring turns orange, whereas in the case of critical conditions it turns red.
6 Conclusions
As has already been mentioned, the system is hardware independent. Due to JADE's built-in support of various Java versions such as Java 2 Micro Edition, Personal Java and MIDP (Mobile Information Device Profile) Java, beyond the J2SE version, and also due to the distribution of the agents and their division into smaller units, they can be executed on small mobile devices such as cell phones, personal digital assistants (PDAs) and palmtops. This is a great advantage and it shows the potential portability of the system.
Its hybrid nature, employing fuzzy algebra towards the flexible evaluation (with proper linguistics) of the air quality and of the heat index situation, enhances the intelligent and rapid inference capabilities of the agent network.
So it can be characterized as a real-time innovation that can not only monitor the air quality and the heat index for a major urban centre like Athens, but can moreover interact with environmental policy makers towards the deployment of corrective actions. Its portability to any urban center all over the globe without any adjustments is a very serious advantage. The only thing remaining after the first pilot application in the Athens city center is its experimental adoption by the civil protection services towards a better quality of life for all citizens.
References
1. Alonso, F., Fernández, R., Frutos, S., Soriano, F.J.: Engineering agent conversations with
the DIALOG framework. In: Fischer, K., Timm, I.J., André, E., Zhong, N. (eds.) MATES
2006. LNCS, vol. 4196, pp. 24–36. Springer, Heidelberg (2006)
2. Balaguer Ballester, E., Camps Valls, G., Carrasco Rodriguez, J.L., Soria Olivas, E., Del
Valle, T.S.: Effective 1-day ahead prediction of hourly surface ozone concentrations in
eastern Spain using linear models and neural networks. Ecological Modelling 156, 27–41
(2002)
3. Bellifemine, F., Caire, G., Greenwood, D.: Developing multi-agent systems with JADE. John Wiley and Sons, USA (2007)
4. Bigus, J., Bigus, J.: Constructing Intelligent Agents using Java, 2nd edn. John Wiley and Sons, Chichester (2001)
5. Borrego, C., Tchepel, O., Costa, A.M., Amorim, J.H., Miranda, A.I.: Emission and dispersion modelling of Lisbon air quality at local scale. Atmospheric Environment 37(3) (2003)
6. Brauer, M., Brook, J.R.: Ozone personal exposures and health effects for selected groups
residing in the Fraser Valley. Atmospheric Environment 31, 2113–2121 (1997)
7. Burnett, R.T., Brook, J.R., Yung, W.T., Dales, R.E., Krewski, D.: Association between
Ozone and Hospitalization for Respiratory Diseases in 16 Canadian Cities. Environmental
Research 72, 24–31 (1997)
8. Chaloulakou, A., Saisana, M., Spyrellis, N.: Comparative assessment of neural networks
and regression models for forecasting summertime ozone in Athens. The Science of the
Total Environment 313, 1–13 (2003)
9. Commission of the European Communities, Directive of the european parliament and of
the council relating to ozone in ambient air (2000),
http://aix.meng.auth.gr/AIR-EIA/EU/en_500PC0613.pdf
10. Content, J.M., Gechter, F., Gruer, P., Koukam, A.: Application of reactive multiagent system to linear vehicle platoon. In: Proceedings of the IEEE ICTAI 2007 International Conference, Patras, vol. 2, pp. 67–71 (2007)
11. Corchado, E., Corchado, J.M., Abraham, A.: Hybrid Intelligence for bio Medical Informat-
ics S.I. In: 2nd International workshop on Hybrid Artificial Intelligence Systems, Spain
(2007)
12. Iliadis, L.: Intelligent Information systems and Applications in risk estimation. Stamoulis
A., publishing, Thessaloniki, Greece (2007)
13. Kecman: Learning and Soft Computing. MIT Press, Cambridge (2001)
14. Kokawa, T., Takeuchi, Y., Sakamoto, R., Ogawa, H., Kryssanov, V.: An Agent Based Sys-
tem for the Prevention of Earthquake-Induced Disasters. In: Proceedings of the IEEE IC-
TAI 2007 International Conference, Patras, vol. 2, pp. 55–62 (2007)
15. Labaki, L., Barbosa, M.P.: Thermal comfort evaluation in workplaces in Brazil: the case of
furniture industry. In: Proceedings of Clima 2007 WellBeing Indoors (2007)
16. Melas, D., Kioutsioukis, I., Ziomas, I.: Neural Network Model for predicting peak photo-
chemical pollutant levels. Journal of the Air and Waste Management Association 50, 495–
501 (2000)
17. Paschalidou, A.K., Kassomenos, P.A.: Comparison of air pollutant concentrations between
weekdays and weekends in Athens, Greece for various meteorological conditions. Envi-
ronmental Technology 25, 1241–1255 (2004)
18. Sixth Environment Action Programme (6th EAP) scoreboard (Date of validation:
26/10/2005),
http://ec.europa.eu/environment/newprg/pdf/
6eap_scoreboard_oct2005.pdf (data until 30/09/2005)
19. Smith-Doron, M., Stieb, D., Raizenne, M., Brook, J., Dales, R., Leech, J., Cakmak, S.,
Krewski, D.: Association between ozone and hospitalisation for acute respiratory diseases
in children less than 2 years of age. American Journal of Epidemiology 153, 444–452
(2000)
20. Soja, G., Soja, A.M.: Ozone indices based on simple meteorological parameters: potential
and limitations of regression and neural network models. Atmospheric Environment 33,
4229–4307 (1999)
21. Steadman, R.G.: The Assessment of Sultriness. Part I: A Temperature-Humidity Index
Based on Human Physiology and Clothing Science. Journal of Applied Meteorology 18(7),
861–873 (1979)
Neural Network Based Damage Detection of Dynamically Loaded Structures
D. Lehký and D. Novák
1 Introduction
Civil engineering structures such as bridges must be periodically inspected to ensure structural integrity, since many of these structures have reached their service life and some damage may have occurred. Visual inspection and local non-destructive evaluation, the conventional approaches, are expensive, subjective, inconsistent, labor and time intensive, and need easy access to the damage zone. That is the reason for the recently extended research and development of structural health monitoring (SHM) techniques [1]. Among these, modal based techniques have been extensively investigated due to their global nature and simplicity. They can be used for automated damage localization and result in consistent damage assessment. From the practical point of view, damage assessment can be categorized into four levels: 1) detecting if the structure is damaged; 2) finding the location of damage; 3) estimating the magnitude of damage; and 4) evaluating the remaining service life of the structure.
Modal based techniques use ambient vibration measurements and can be used for structures in service. A typical result of the experimental measurements is the dynamic response of the structure in the form of time series (accelerations, velocities). Consequently, modal properties (mode shapes and corresponding eigenfrequencies – the "characteristic" frequencies at which a system vibrates), damping characteristics and assurance criteria such as MAC, COMAC and DLAC are evaluated (e.g. [2], [3]). The utilization of this kind of structural response information for damage localization and structural health assessment has been the subject of research of both academic and industrial research groups during the last decade. The task is based on the fact that a damaged structure has smaller stiffness in some parts – and this difference affects the vibration and modal properties. The comparison of the vibration of the virgin (undamaged) structure and the damaged structure can be used for the detection of damaged parts (localization of damage).
The efficiency of the identification procedure increases with proper sensor placement on the structure: sensors placed closer to the damaged part of the structure show higher sensitivity [4]. Besides the influence of structural damage, so-called operative conditions (e.g. temperature change) should also be taken into account during inverse analysis [5].
"Model updating method" is the term frequently used in connection with SHM and damage detection [6], [7], [8], [9], [10], [1]. "Updating" means that individual parameters of the FEM model are iteratively changed in order to minimize the difference between the experimentally measured and the calculated response. The sensitivity of the response to the model parameters is frequently used and can be directly utilized for efficient identification [11], [12].
The aim of the research is to work out a methodology of dynamic damage detection based on the coupling of Monte Carlo type simulation and artificial neural networks (ANN). It extends a methodology of inverse analysis developed and applied for fracture/mechanical parameter identification [13], [14], [15], [16].
An important part of the damage detection procedure is the proper selection of input information. For that purpose the authors carried out several modal properties studies (numerical and experimental) using simple laboratory beams as well as real bridge structures [17], [18]. The aim of those studies was to find out which eigenfrequencies, mode shapes or assurance criteria are affected by a change of stiffness at a certain position on the structure. The results show that if the damage is reasonably large, eigenfrequencies are sufficient for damage detection, as is seen later in this paper. Lower frequencies are affected more than higher ones, but not uniformly along the structure; their shift corresponds to the mode shapes. That is the reason why, for the detection of damage at some positions on the structure, higher eigenfrequencies must be used.
If mode shapes are available, their utilization can be helpful for identification – either the mode shapes themselves or the modal assurance criterion (MAC) [2]. It is essential that, in comparison with frequencies, higher mode shapes are more affected by a stiffness change than lower ones. Unfortunately, it is difficult or even impossible to obtain higher mode shapes from ambient vibration measurement.
Extensive research has been done by other authors to suggest other quantities which can be used for damage detection using various methods. Among them, other assurance criteria (COMAC [2], DLAC, MDLAC [3]), rank ordering of eigenfrequency shifts [19], the damage index (DI) [2] and so on can be mentioned.
for the following inverse task: what damage has caused the given change of structural response? The whole procedure of inverse analysis can be itemized as follows.
In structural mechanics the classical problem is defined in the way that, for given structural, material, loading, environmental, etc. data (vector y), the corresponding structural response is obtained – experimentally or numerically (vector p):

p = f(y).  (1)

In the case of damage detection of dynamically loaded structures, we can simplify this: for a stiffness distribution along the structure, the structural response in the form of modal properties is obtained. But the task which is being solved here is the opposite of the classical one: what damage has caused the given change of structural response? This inverse task is defined as (see Fig. 1):

y = f^{-1}(p) = T(p).  (2)

Instead of finding the inversion of the function f in analytical form, the transformation T using an artificial neural network is used.
The final step is to calculate the structural response using the identified parameters y_opt, obtained from the active phase of the artificial neural network:

p = f(y_opt).  (3)
A suitable type of neural network for this task is the feed-forward multi-layer perceptron. Such a network consists of a set of neurons arranged in several layers – an input layer (sometimes called the zero layer), a number of hidden layers and an output layer. The number of hidden layers and of neurons in them is driven by the complexity of the problem which is being solved. Considering Kolmogorov's theorem related to Hilbert's thirteenth problem (e.g. [20]), the maximum necessary number of hidden layers is two, with a sufficient number of neurons in each of them. For damage detection, one hidden layer proved to be sufficient (see the application further on).
Determining the proper number of neurons in the hidden layers is not a simple task. As a recommendation for a first rough estimate, a formula for a network with one hidden layer can be used [21], where N_inp is the number of inputs and N_out is the number of neurons in the output layer. As the transfer (activation) function of the neurons in the hidden layer the hyperbolic tangent function is used; the output neurons have a linear transfer function.
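For readers who want to see this network type in code, the following is a minimal sketch of the forward (active) phase of such a perceptron: one hidden layer with hyperbolic tangent neurons and a linear output layer, as described above. It is a generic illustration, not the DLNNET implementation; the weights would come from the training described in the next section.

```java
/** Minimal forward pass of a one-hidden-layer perceptron:
 *  tanh neurons in the hidden layer, linear neurons in the output layer. */
public final class FeedForwardMlp {

    private final double[][] wHidden;  // [nHidden][nInputs]
    private final double[]   bHidden;  // [nHidden]
    private final double[][] wOutput;  // [nOutputs][nHidden]
    private final double[]   bOutput;  // [nOutputs]

    public FeedForwardMlp(double[][] wHidden, double[] bHidden,
                          double[][] wOutput, double[] bOutput) {
        this.wHidden = wHidden;
        this.bHidden = bHidden;
        this.wOutput = wOutput;
        this.bOutput = bOutput;
    }

    /** Active phase: maps modal data (inputs) to stiffness values (outputs). */
    public double[] evaluate(double[] input) {
        double[] hidden = new double[wHidden.length];
        for (int j = 0; j < hidden.length; j++) {
            double sum = bHidden[j];
            for (int i = 0; i < input.length; i++) {
                sum += wHidden[j][i] * input[i];
            }
            hidden[j] = Math.tanh(sum);          // nonlinear hidden neuron
        }
        double[] output = new double[wOutput.length];
        for (int k = 0; k < output.length; k++) {
            double sum = bOutput[k];
            for (int j = 0; j < hidden.length; j++) {
                sum += wOutput[k][j] * hidden[j];
            }
            output[k] = sum;                     // linear output neuron
        }
        return output;
    }
}
```

For the application described later in this paper, the dimensions would be 5 inputs (eigenfrequencies), 10 hidden neurons and 15 outputs (stiffness values).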
The ANN must first be trained, using an appropriate training set, for the correct approximation of the inverse task (2). Such a training set consists of ordered pairs input–output [p_i, y_i] (see Fig. 1), where the inputs are modal data (eigenfrequencies, mode shapes) and the outputs are stiffness distributions along the structure for various damage situations. It is impossible to obtain an appropriate training set for the real structure in service (we are not allowed to damage the structure), therefore stochastic analysis using a numerical model is carried out to obtain the training set virtually. For that purpose an appropriate numerical FEM model has to be developed first and checked to be in good agreement with the real structural behavior. The parameters of the model are then randomized (the stiffness in certain parts is reduced) and stochastic analysis is performed.
Numerical FEM analyses can be very time consuming, especially in non-linear cases, therefore an efficient simulation technique should be used. In the proposed methodology of damage detection the simulation method Latin Hypercube Sampling (LHS) is used. It is a Monte Carlo type simulation method with a stratified sampling scheme. The whole multi-dimensional space of input parameters is covered well by a relatively small number of simulations [22], [23]. The efficiency of this method for training the ANN was proved by the authors in [13].
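The stratified sampling idea behind LHS is easy to state in code: each input variable's range is divided into N equally probable strata, one value is drawn from every stratum, and the strata are combined across dimensions through random permutations. The sketch below generates such a design on the unit hypercube (samples would then be mapped through the inverse distribution functions of the model parameters); it is a generic illustration, not the FReET implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Generic Latin Hypercube Sampling on the unit hypercube. */
public final class LatinHypercube {

    /** Returns an n-by-d matrix; column j holds one value from each of the
     *  n equally probable strata of variable j, in random order. */
    public static double[][] sample(int n, int d, Random rnd) {
        double[][] plan = new double[n][d];
        for (int j = 0; j < d; j++) {
            // Random permutation of the stratum indices 0..n-1 for variable j.
            List<Integer> strata = new ArrayList<>();
            for (int i = 0; i < n; i++) strata.add(i);
            Collections.shuffle(strata, rnd);
            for (int i = 0; i < n; i++) {
                // One random point inside the selected stratum.
                plan[i][j] = (strata.get(i) + rnd.nextDouble()) / n;
            }
        }
        return plan;
    }

    public static void main(String[] args) {
        // e.g. 120 simulations of 15 randomized stiffness reduction factors,
        // matching the training set size used in the application below.
        double[][] design = sample(120, 15, new Random(1));
        System.out.println("First sample, first variable: " + design[0][0]);
    }
}
```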
An important task in the inverse analysis is to determine the significance of the parameters which are subject to identification. With respect to the small-sample simulation techniques described above, a straightforward and simple approach can be used, based on the non-parametric rank-order statistical correlation between the basic random variables and the structural response variables by means of the Spearman correlation coefficient or Kendall's tau [23]. The sensitivity analysis is obtained as an additional result of LHS, and no additional computational effort is necessary.
Training the ANN means setting up the synaptic weights and biases of the neurons. That is an optimization problem in which the following error of the network is minimized:
$$
E = \frac{1}{2} \sum_{i=1}^{N} \sum_{k=1}^{K} \left( y_{ik}^{v} - y_{ik}^{*} \right)^{2}, \qquad (5)
$$
where N is the number of ordered input–output pairs in the training set, y_{ik}^{*} is the required output value of the k-th output neuron for the i-th input and y_{ik}^{v} is the actual output value (for the same input). In the application described below, the Levenberg–Marquardt optimization method [24] and the evolution strategy method [25] were used.
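Evaluating the error (5) for a candidate set of weights is straightforward; the short sketch below does it using the forward-pass class from the earlier sketch. Names and the surrounding setup are illustrative only.

```java
/** Sum-of-squares network error of eq. (5) over a training set. */
public final class NetworkError {

    /** targets[i][k] is the required output y*_{ik}; net produces y^v_{ik}. */
    public static double error(FeedForwardMlp net, double[][] inputs, double[][] targets) {
        double e = 0.0;
        for (int i = 0; i < inputs.length; i++) {
            double[] out = net.evaluate(inputs[i]);
            for (int k = 0; k < out.length; k++) {
                double diff = out[k] - targets[i][k];
                e += 0.5 * diff * diff;
            }
        }
        return e;
    }
}
```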
3 Software Tools
The authors combined efficient techniques of statistical simulation of Monte Carlo type with artificial neural networks and FEM structural dynamics analysis to offer an advanced tool for the damage identification of concrete structures. The combination of all parts (structural analysis, statistical simulation, inverse analysis and reliability assessment) is presented together in a package as the Relid software system (see Fig. 2). It includes:
• FReET [26] – the probabilistic engine based on LHS simulation. First, it is used for the preparation of the training set for the artificial neural network. Second, when damage is identified, it can consequently be used for reliability assessment of the given problem using the results from the inverse analysis.
• DLNNET [27] – artificial neural network software; the keystone of the inverse analysis, which has to communicate with FReET and the SOFiSTiK finite element software.
• SOFiSTiK [28] – commercial FEM software which performs the structural dynamics analysis of a particular problem.
• Relid [29] – a software shell which manages flexible communication among the programs mentioned above in a user-friendly environment.
Fig. 4. a) Scheme of cantilever beam – positions of accelerometers and cuts for all five specimens; b) numerical model divided into 15 parts
Table 1. Comparison of results of dynamic measurements for undamaged and damaged beams
Fig. 5. Stiffness distribution after damage detection for specimens c12, c13, c14 and c15: relative stiffness [-] plotted over beam parts 1–15 (the crosshatched element corresponds to the part where the beam was cut)
After the measurements on the undamaged beams were carried out, all five specimens were cut at a certain part (see the scheme in Fig. 4). The damaged beams were then tested again and their modal properties were evaluated. A comparison of the results of the dynamic measurements for the undamaged and damaged beams is given in Table 1.
The relative changes of all five eigenfrequencies were then used for damage detection using the proposed ANN based method. For that purpose a numerical dynamic model was created in the SOFiSTiK software [28]. The beam was divided into 15 parts with independent bending stiffnesses EI1, …, EI15 (see Fig. 4).
The ANN consists of one hidden layer with 10 nonlinear neurons and an output layer with 15 linear neurons (15 stiffness values). The network has 5 inputs (the 5 eigenfrequencies). The training set contains 120 samples and was divided into two parts: 115 samples were used directly for training the network, while 5 samples served for testing for network overfitting. After the network was trained, the eigenfrequencies from the experimental testing of the damaged state were used as inputs to the trained ANN. The output of the ANN is a spatial distribution of stiffness along the girder (15 values, see Fig. 5). Because of an inconsistency between the experimental and numerical results for specimen c11, this beam was omitted from damage detection. The inconsistency was probably caused by testing errors, along with the high sensitivity of the vibration to damage located close to the fixed end of the cantilever beam.
5 Conclusions
Efficient techniques of stochastic simulation were combined with artificial neural networks in order to offer an advanced tool for the damage identification of dynamically loaded structures. The methodology was tested using wooden laboratory beams that were artificially damaged, and its efficiency was verified using this artificial laboratory experiment. A multipurpose software tool for damage identification was developed. The methodology has already been applied to real structures (bridges) [18], [30] with satisfactory results. These applications are not included here, as the main target of this paper was to present the theory of the proposed damage identification and its verification using a "clear experiment" without any significant "contamination", which is usual in the case of real structures.
References
1. Wenzel, H., Pichler, D.: Ambient vibration monitoring. John Wiley & Sons Ltd, West Sus-
sex (2005)
2. Salgado, R., Cruz, P.J.S., Ramos, L.F., Lourenço, P.B.: Comparison between damage de-
tection methods applied to beam structures. In: Third International Conference on Bridge
Maintenance Safety and Management, Porto, Portugal, CD-ROM (2006)
3. Koh, B.H., Dyke, S.J.: Structural health monitoring for flexible bridge structures using cor-
relation and sensitivity of modal data. Computers and Structures 85, 117–130 (2007)
4. Spencer Jr., B.F., Gao, Y., Yang, G.: Distributed computing strategy for damage monitor-
ing employing smart sensors. In: Shenzhen, C., Ou, L., Duan (eds.) Structural Health
Monitoring and Intelligent Infrastructure, pp. 35–47. Taylor & Francis Group, London
(2005)
5. Feltrin, G.: Temperature and damage effects on modal parameters of a reinforced concrete
bridge. In: 5th European Conference on Structural Dynamics (EURODYN), Munich, Ger-
many (2002)
6. Link, M.: Updating of analytical models – basic procedures and extensions. In: Silva,
J.M.M., Maia, N.M.M. (eds.) Modal Analysis and Testing. NATO Science Series. Kluwer
Academic Publ, Dordrecht (1999)
7. Teughels, A., Maeck, J., De Roeck, G.: Damage assessment by FE model updating using
damage functions. Computers and Structures 80(25), 1869–1879 (2002)
8. Deix, S., Geier, R.: Updating FE-models using experimental modal analysis for damage
detection and system identification in civil structures. In: Third European Conference on
Structural Control (3ECSC). Vienna University of Technology, Vienna (2004)
9. Fang, X., Luo, H., Tang, J.: Structural damage detection using neural network with learn-
ing rate improvement. Computers and Structures 83, 2150–2161 (2005)
10. Huth, O., Feltrin, G., Maeck, J., Kilic, N., Motavalli, M.: Damage identification using mo-
dal data: Experiences on a prestressed concrete bridge. Journal of Structural Engineering,
ASCE 131(12), 1898–1910 (2005)
11. Strauss, A., Lehký, D., Novák, D., Bergmeister, K., Santa, U.: Probabilistic response iden-
tification and monitoring of concrete structures. In: Third European Conference on Struc-
tural Control (3ECSC). Vienna University of Technology, Vienna (2004)
12. Strauss, A., Bergmeister, K., Lehký, D., Novák, D.: Inverse statistical FEM analysis – vi-
bration based damage identification of concrete structures. In: International Conference on
Bridges, Dubrovnik, Croatia, pp. 461–470 (2006)
13. Novák, D., Lehký, D.: ANN Inverse Analysis Based on Stochastic Small-Sample Training
Set Simulation. Engineering Application of Artificial Intelligence 19, 731–740 (2006)
14. Novák, D., Lehký, D.: Inverse analysis based on small-sample stochastic training of neural
network. In: 9th International Conference on Engineering Applications of Neural Net-
works (EAAN2005), Lille, France, pp. 155–162 (2005)
15. Lehký, D., Novák, D.: Probabilistic inverse analysis: Random material parameters of rein-
forced concrete frame. In: 9th International Conference on Engineering Applications of
Neural Networks (EAAN2005), Lille, France, pp. 147–154 (2005)
16. Strauss, A., Bergmeister, K., Novák, D., Lehký, D.: Stochastische Parameteridentifikation
bei Konstruktionsbeton für die Betonerhaltung. Beton und Stahlbetonbau 99(12), 967–974
(2004)
17. Frantík, P., Lehký, D., Novák, D.: Modal properties study for damage identification of dy-
namically loaded structures. In: The Third International Conference on Structural Engi-
neering, Mechanics and Computation, Cape Town, South Africa, pp. 703–704 (2007)
18. Lehký, D., Novák, D., Frantík, P., Strauss, A., Bergmeister, K.: Dynamic damage identifi-
cation of Colle Isarco viaduct. In: 4th International Conference on Bridge Maintenance,
Safety and Management (IABMAS 2008), Seoul, Korea, pp. 2549–2556 (2008)
19. Armon, D., Ben-Haim, Y., Braun, S.: Crack detection in beams by rank-ordering of eigen-
frequencies shifts. Mechanical Systems and Signal Processing 8(1), 81–91 (1994)
20. Kůrková, V.: Kolmogorov’s theorem and multilayer neural networks. Neural Net-
works 5(3), 501–506 (1992)
21. Šnorek, M.: Neuronové sítě a neuropočítače (Neural networks and neurocomputers). Vy-
davatelství ČVUT, Prague, Czech Republic (2002) (in Czech)
22. McKay, M.D., Conover, W.J., Beckman, R.J.: A Comparison of Three Methods for Select-
ing Values of Input Variables in the Analysis of Output from a Computer Code. Tech-
nometrics 21, 239–245 (1979)
23. Novák, D., Teplý, B., Keršner, Z.: The role of Latin Hypercube Sampling method in reli-
ability engineering. In: Proc. of ICOSSAR 1997, Kyoto, Japan, pp. 403–409 (1998)
24. Singh, V., Gupta, I., Gupta, H.O.: ANN-based estimator for distillation using Levenberg-
Marquardt approach. Engineering Applications of Artificial Intelligence 20, 249–259
(2007)
25. Schwefel, H.P.: Numerical optimization for computer models. Wiley, Chichester (1991)
26. Novák, D., Vořechovský, M., Rusina, R.: FReET v.1.5 – program documentation. User's and Theory Guides. Brno/Červenka Consulting, Czech Republic (2008), http://www.freet.cz
27. Lehký, D.: DLNNET – program documentation. Theory and User’s Guides, Brno, Czech
Republic (in preparation, 2009)
28. Sofistik, A.G.: SOFiSTiK Analysis Programs, version 21.0, Oberschleissheim, Germany
(2004), http://www.sofistik.com
29. Lehký, D.: Relid – program documentation. User’s Guide, Brno, Czech Republic (in
preparation, 2009)
30. Lehký, D., Novák, D., Frantík, P., Strauss, A., Bergmeister, K.: Dynamic damage identifi-
cation based on artificial neural networks, SARA – part IV. In: The 3rd International Con-
ference on Structural Health Monitoring of Intelligent Infrastructure, Vancouver, British
Columbia, Canada, vol. 183 (2007)
Reconstruction of Cross-Sectional Missing Data
Using Neural Networks
1 Introduction
Real-world datasets are characterized by the very large dimensionality of the
input (explanatory) space often comprising thousands of attributes: in addition,
it is very often the case that, for some cases, the values of one or more ex-
planatory variables are missing. These are incomplete datasets: datasets with
missing values. Most data mining algorithms cannot work directly with incom-
plete datasets. If missing data are randomly distributed across cases, we could
even end up with no valid cases in the dataset, because each of them will have
at least one missing data element. Missing value imputation is widely used, by
necessity, for the treatment of missing values. A major focus of research is to de-
velop an imputation algorithm that preserves the multivariate joint distribution
of input and output variables. Much of the information in these joint distribu-
tions can be described in terms of means, variances and covariances. If the joint
distributions of the variables are multivariate normal, then the first and second
moments completely determine the distributions.
The practice of filling in a missing value with a single replacement is called
single imputation (SI). A major problem with SI is that this approach can-
not reflect sampling and imputation uncertainty about the actual value. Rubin
[1] proposed multiple-imputation (MI) to solve this problem. MI replaces each
missing value in a dataset with m > 1 (where m is typically small, e.g. 3-10)
statistically plausible values. A detailed summary of MI is given in [1,3,4].
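For illustration only, a minimal, hedged sketch of the MI idea follows (it is not the imputation model proposed in this paper, and all function and variable names are hypothetical): each missing entry of a variable is replaced m times by a draw from a simple model fitted to the observed values, and the per-copy estimates are then pooled.

import numpy as np

def multiply_impute(x, m=5, rng=None):
    """Toy multiple imputation of a single variable x (1-D array with NaNs
    marking the missing entries). Each missing entry is replaced m times by a
    draw from a normal distribution fitted to the observed values, producing
    m completed copies of the variable."""
    rng = np.random.default_rng(rng)
    observed = x[~np.isnan(x)]
    mu, sd = observed.mean(), observed.std(ddof=1)
    copies = []
    for _ in range(m):
        xi = x.copy()
        missing = np.isnan(xi)
        xi[missing] = rng.normal(mu, sd, size=missing.sum())
        copies.append(xi)
    return copies

def pool_point_estimates(copies):
    """Pool the m per-copy estimates of a quantity (here the mean): the point
    estimate is their average; full Rubin's rules would additionally add the
    average within-copy variance to the between-copy variance returned here."""
    estimates = np.array([c.mean() for c in copies])
    return estimates.mean(), estimates.var(ddof=1)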
Little and Rubin [2] and Schafer [4] classify missing data into three cate-
gories: Missing Completely at Random (MCAR), Missing at Random (MAR),
and Missing Not At Random (MNAR). MCAR and MAR data are recoverable, whereas MNAR data are not. Various methods are available for handling MCAR and
MAR missing data. The most common imputation procedure is mean substitu-
tion (MS), replacing missing values with the mean of the variable. The major
advantage of the method is its simplicity. However, this method yields biased
estimates of variances and covariances.
The most sophisticated techniques for the treatment of missing values are
model based. A key advantage of these methods is that they consider interre-
lations among variables. Model-based methods can be classified into two cat-
egories: explicit model based algorithms and implicit model based algorithms.
Explicit model based algorithms (such as least squares imputation, expectation
maximization and Markov Chain Monte Carlo) are based on a number of as-
sumptions [7,8,9]: however, if the assumptions are violated, the validity of the
imputed values derived from applying these techniques may be in question.
Implicit model based algorithms are usually semi-parametric or non-parametric.
These methods make few or no distributional assumptions about the underlying
phenomenon that produced the data. The most popular implicit model based
algorithm is hot deck imputation. This procedure replaces missing values in
incomplete records using values from similar, but complete records of the dataset.
Past studies suggest that this is promising [7]. A limitation of this method is the
difficulty in defining similarity [8]. In addition, it treats all K neighbours in a
similar way without consideration of the different distances between the query
instance and its neighbours [9]. Recently, a number of studies applied multilayer
perceptron (MLP) and radial basis function (RBF) neural networks to impute
missing values [10]. However, creation of MLP or RBF networks is complex and
requires many parameters. In this paper, we present a novel algorithm for the
imputation of missing values. The remainder of this paper is organized as follows:
the new algorithm in section 2 (with an overview of GRNN in section 2.1 and
details of the proposed algorithm in section 2.2), details of how we assess the new
technique in section 3, results and discussions in section 4, followed by summary
and conclusions in section 5.
The proposed algorithm (GMI) has several advantages. Firstly, it can handle data from different distributions appropriately. Secondly, only one parameter (the smoothing factor) needs to be adjusted; our empirical observations indicate that the performance of the algorithm is not very sensitive to the exact setting of this parameter and that its default value is almost always a good choice. The inherent model-free characteristics avoid the problems of model mis-specification and parameter estimation errors. Thirdly, although the proposed imputation algorithm closely resembles the hot deck imputation scheme, wherein the donor is selected from a neighbourhood of similar records, it does not suffer from the limitations associated with hot-deck imputation, because GMI assigns different weights to the k nearest example neighbours according to Gaussian kernel functions and, in this algorithm, all observations participate (according to their Euclidean-distance weights) in the estimation of a missing value.
h_i = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{d^2}{2\sigma^2}\right) \qquad (1)
If the output variable is binary, the GRNN calculates the probability of event of
interest. If the output variable is continuous, then it estimates the value of the
variable.
The pseudo code of our proposed algorithm is as follows: we write D0 for the
original dataset. It has Nr rows and Nc columns.
We now know the mean and variance of the missing value d_ij. Exactly the same procedure is performed to estimate the conditional means and variances of the other missing values. We then replicate each record of the dataset a number of times (here 5), and impute the missing values by setting them to Q̄ + √T·R, where Q̄ and T are the estimated conditional mean and variance and R is a random number between −1 and +1.
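The full pseudo code is not reproduced above; as a hedged sketch of our reading of the procedure (a GRNN-style, Gaussian-kernel-weighted estimate of the conditional mean and variance of a missing attribute, followed by a stochastic imputation), with all names illustrative and the record-replication step omitted:

import numpy as np

def gaussian_weights(query, donors, sigma=1.0):
    """Gaussian kernel of equation (1) applied to the Euclidean distances
    between a query record and every donor record (known attributes only)."""
    d2 = np.sum((donors - query) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * sigma ** 2)) / np.sqrt(2.0 * np.pi * sigma ** 2)

def grnn_impute(query_known, donors_known, donors_target, sigma=1.0, rng=None):
    """Kernel-weighted (GRNN-style) estimate of one missing attribute:
    every donor contributes according to its Gaussian weight, giving a
    conditional mean Q_bar and variance T; a stochastic imputation is then
    drawn as Q_bar + sqrt(T) * R with R uniform in [-1, 1]."""
    rng = np.random.default_rng(rng)
    h = gaussian_weights(query_known, donors_known, sigma)
    w = h / h.sum()
    q_bar = float(np.sum(w * donors_target))
    t_var = float(np.sum(w * (donors_target - q_bar) ** 2))
    r = rng.uniform(-1.0, 1.0)
    return q_bar, t_var, q_bar + np.sqrt(t_var) * r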
p(x_{ki}) = \frac{1}{1 + \exp\left(-(\beta_0 + \beta_m x_{mi} - \beta_n x_{ni})\right)} \qquad (3)
random number is less than or equal to 0.2. In subset 2, we remove the value of
xk if the corresponding random number is less than or equal to 0.8.
References
1. Rubin, D.B.: Multiple imputation for nonresponse in surveys. Wiley, New York
(1987)
2. Little, R.J.A., Rubin, D.B.: Statistical Analysis with missing data. Wiley, New
York (1987)
3. Rubin, D.B., Schenker, N.: Multiple imputation for interval estimation from simple
random samples with ignorable nonresponse. Journal of the American Statistical
Association 81(394), 366–374 (1986)
4. Schafer, J.: Analysis of incomplete multivariate data. Chapman and Hall, London
(1997)
5. Bo, T.H., Dysvik, B., Jonassen, I.: LSimpute: accurate estimation of missing values
in microarray data with least squares method. Nucleic Acids Research 32(3) (2004)
6. Carlo, G., Yao, J.: A multiple-imputation metropolis version of the EM algorithm.
Biometrika 90(3), 643–654 (2003)
7. Lokupitiya, R.S., Lokupitiya, E., Paustian, K.: Comparison of missing value impu-
tation methods for crop yield data. Environmetrics 17(4), 339–349 (2006)
8. Iannacchione, V.: Weighted sequential hot deck imputation macros. In: Proceedings
of the Seventh Annual SAS Users Group International Conference, San Francisco,
pp. 759–763 (1982)
9. Dan, L., Deogun, J.S., Wang, K.: Gene function classification using fuzzy k-nearest
neighbour approach. In: IEEE International Conference on Granular Computing,
November 2-4, pp. 644–644 (2007)
10. Schioler, H., Hartmann, U.: Mapping neural network derived from the Parzen win-
dow estimator. Neural Networks 2(6), 903–909 (1992)
11. Specht, D.F.: A General Regression Neural Network. IEEE Transactions on Neural
Networks 2(6), 568–576 (1991)
12. Kuncheva, L.I.: A stability index for feature selection. In: Proceedings of the 25th
IASTED International Multi-Conference: artificial intelligence and applications,
pp. 390–395. ACTA Press, Anaheim (2007)
13. UCI Machine Learning Repository: Centre for Machine Learning and Intelligent
Systems, http://archive.ics.uci.edu/ml/
14. Siegel, S., Castellan Jr, N.J.: Nonparametric statistics: for the behavioural sciences,
2nd edn. McGraw-Hill, New York (1988)
Municipal Creditworthiness Modelling by
Kernel-Based Approaches with Supervised and
Semi-supervised Learning
1 Introduction
Parameters
Economic:
  x1 = PO_r, where PO_r is the population in the r-th year.
  x2 = PO_r / PO_{r-s}, where PO_{r-s} is the population in the year r−s and s is the selected time period.
  x3 = U, where U is the unemployment rate in the municipality.
  x4 = Σ_{i=1}^{k} (EPO_i / EIN)^2, where EPO_i is the employed population of the municipality in the i-th economic sector, i = 1, 2, ..., k, EIN is the total number of employed inhabitants, and k is the number of economic sectors.
Debt:
  x5 = DS / PR, x5 ∈ ⟨0,1⟩, where DS is debt service and PR are periodical revenues.
  x6 = TD / PO, where TD is total debt.
  x7 = SD / TD, x7 ∈ ⟨0,1⟩, where SD is short-term debt.
Financial:
  x8 = PR / CE, x8 ∈ R+, where CE are current expenditures.
  x9 = OR / TR, x9 ∈ ⟨0,1⟩, where OR are own revenues and TR are total revenues.
  x10 = CAE / TE, x10 ∈ ⟨0,1⟩, where CAE are capital expenditures and TE are total expenditures.
  x11 = CAR / TR, x11 ∈ ⟨0,1⟩, where CAR are capital revenues.
  x12 = LA / PO [Czech Crowns], where LA is the size of the municipal liquid assets.
Czech Republic, n=452) by local experts. Based on the presented facts, the
following data matrix X can be designed
X = \begin{pmatrix}
 & x_1 & \dots & x_k & \dots & x_m & \\
o_1 & x_{1,1} & \dots & x_{1,k} & \dots & x_{1,m} & \omega_{1,j} \\
\dots & \dots & \dots & \dots & \dots & \dots & \dots \\
o_i & x_{i,1} & \dots & x_{i,k} & \dots & x_{i,m} & \omega_{i,j} \\
\dots & \dots & \dots & \dots & \dots & \dots & \dots \\
o_n & x_{n,1} & \dots & x_{n,k} & \dots & x_{n,m} & \omega_{n,j}
\end{pmatrix} \qquad (1)
where oi ∈O, O={o1,o2 , . . . ,oi , . . . ,on } are objects (municipalities), xk is the k-th
parameter, xi,k is the value of the parameter xk for the i-th object oi ∈O, ωi,j ∈Ω,
Ω={ω1,j,ω2,j , . . . ,ωi,j , . . . ,ωn,j } is the j-th class assigned to the i-th object oi ∈O,
xi =(xi,1 ,xi,2 , . . . ,xi,k , . . . ,xi,m ) is the i-th pattern, and x=(x1 ,x2 , . . . ,xk , . . . ,xm )
is the parameters vector.
The design of SVMs [5], [6], [7], [8] depends on the non-linear projection of
the input space Ξ into multidimensional space Λ, and on the construction of
an optimal hyperplane. This operation is dependent on the estimation of inner
product kernel referred to as kernel function.
Let x be a vector from the input space Ξ of dimension m. Next, let g_j(x), j = 1, 2, ..., m, be a set of non-linear transformations from the input space Ξ into the space Λ.
Description of classes ω_{i,j}:
ω_{i,1}: High ability of a municipality to meet its financial obligations. Very favourable economic conditions, low debt and excellent budget implementation.
ω_{i,2}: Very good ability of a municipality to meet its financial obligations.
ω_{i,3}: Good ability of a municipality to meet its financial obligations.
ω_{i,4}: A municipality with a stable economy, medium debt and good budget implementation.
ω_{i,5}: The municipality meets its financial obligations only under favourable economic conditions.
ω_{i,6}: The municipality meets its financial obligations with difficulty; the municipality is highly indebted.
ω_{i,7}: Inability of a municipality to meet its financial obligations.
A separating hyperplane acting as the decision surface in Λ can then be written as

\sum_{j=1}^{m} w_j g_j(\mathbf{x}) + b = 0, \qquad (2)

and the weight vector is expanded over the support vectors as

\mathbf{w} = \sum_{i=1}^{N} \alpha_i d_i \, \mathbf{g}(\mathbf{v}_i), \qquad (3)
where the vector g(vi ) from the q-dimensional space Λ corresponds to the i-th
support vector vi , αi are Lagrange multipliers determined in the optimization
process, and di represents the shortest distance separating hyperplane from the
nearest positive or negative patterns. The support vectors v_i form a small subset of the training data extracted by the algorithm. Equation (3) is then substituted into equation (2).
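The result of that substitution is not reproduced in the text above; as a hedged reminder of the standard kernel form (see, e.g., [7], [8]), the decision surface can then be expressed through the inner-product kernel:

\sum_{i=1}^{N} \alpha_i d_i \, K(\mathbf{x}, \mathbf{v}_i) + b = 0, \qquad K(\mathbf{x}, \mathbf{v}_i) = \mathbf{g}(\mathbf{x})^{T}\mathbf{g}(\mathbf{v}_i).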
[Figure: frequency f of municipalities in each class ω_{i,j}, j = 1, 2, ..., 7]
Table 3. Optimal values of SVMs parameters obtained by grid search (pattern search)
where the ranges for the parameters are as follows: 0.01≤αi ≤10000, 0.01≤ri ≤20,
0≤c≤100
Parameters                  Linear          Polynomial      Sigmoidal          RBF
α_i                         16.86 (48.72)   0.36 (28.27)    2782.56 (133.24)   16.68 (93.24)
Number of support vectors   168 (141)       151 (159)       140 (174)          184 (143)
Radius r_i of RBF           −(−)            2.21 (0.29)     0.03 (0.14)        2.21 (0.38)
Shifting parameter c        −(−)            1.67 (0.62)     0.60 (0.00)        −(−)
Polynomial degree           −(−)            3 (3)           −(−)               −(−)
ξ [%]                       88.05 (88.50)   90.04 (90.27)   88.50 (88.27)      89.83 (91.15)
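As an illustration only, a hedged sketch of this kind of kernel and parameter grid search using scikit-learn follows; the library, the grids and the data names are assumptions and not the authors' implementation. The penalty constant C plays a role comparable to the searched bound on α_i, gamma is related to the RBF radius r_i, and coef0 to the shifting parameter c.

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hypothetical feature matrix X (municipal parameters x1..x12) and labels y
# (classes omega 1..7); both would come from the data matrix described above.
param_grid = [
    {"kernel": ["linear"], "C": [0.01, 0.1, 1, 10, 100, 1000, 10000]},
    {"kernel": ["rbf"], "C": [0.01, 0.1, 1, 10, 100],
     "gamma": [0.05, 0.1, 0.5, 1.0, 2.0]},
    {"kernel": ["poly"], "C": [0.01, 0.1, 1, 10], "degree": [2, 3, 4],
     "gamma": [0.1, 0.5, 1.0], "coef0": [0.0, 1.0, 2.0]},
    {"kernel": ["sigmoid"], "C": [1, 100, 1000, 3000],
     "gamma": [0.03, 0.1, 0.5], "coef0": [0.0, 0.6, 1.0]},
]
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
# search.fit(X, y)   # X: n x 12 array of parameters, y: class labels
# print(search.best_params_, search.best_score_)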
As stated by [13], from the practical point of view, the main motivation for semi-supervised learning algorithms is, of course, their capability to work in cases where there are many more unlabelled than labelled examples. Previous experiments have mostly been realized on benchmark datasets [13], as there is a lack of complex real-world problems. The field of municipal creditworthiness modelling is suitable for the use of semi-supervised learning, as the process of creditworthiness evaluation (the credit rating process) is costly and time-consuming. Therefore, only a low proportion of municipalities has so far been labelled with classes giving information about their creditworthiness.
The Harmonic Gaussian Model (HGM) and the Global Consistency Model (GCM) defined by [9] have been used as representatives of kernel-based approaches with semi-supervised learning.
Table 4. Classification accuracy ξtest [%] (standard deviation σ [%] in parentheses) for different proportions s [%] of labelled data

s[%]      2              3              4              5              10
HGM(c)    47.82 (13.96)  50.90 (11.79)  61.77 (8.20)   66.33 (7.71)   77.64 (4.10)
HGM(e)    44.05 (11.69)  54.12 (11.65)  57.15 (11.67)  66.60 (6.86)   73.90 (4.42)
GCM(c)    61.00 (7.93)   64.14 (6.58)   70.21 (5.54)   72.48 (4.28)   77.53 (2.59)
GCM(e)    60.93 (7.18)   67.94 (5.80)   68.58 (4.74)   73.22 (4.18)   77.69 (2.74)

s[%]      15             20             25             30             35
HGM(c)    81.75 (2.72)   83.59 (2.08)   83.72 (2.08)   84.65 (2.13)   85.62 (1.81)
HGM(e)    76.07 (3.75)   79.35 (3.26)   81.06 (2.75)   81.95 (2.00)   83.22 (2.17)
GCM(c)    79.49 (2.88)   81.26 (2.63)   82.71 (1.99)   82.82 (2.12)   83.12 (2.16)
GCM(e)    79.90 (2.28)   81.34 (2.30)   81.71 (2.79)   81.91 (2.83)   82.37 (2.30)

s[%]      40             45             50             55             60
HGM(c)    86.40 (1.59)   86.79 (2.04)   86.95 (1.78)   86.89 (2.00)   87.15 (2.28)
HGM(e)    83.59 (2.19)   84.60 (2.41)   85.41 (2.16)   86.25 (2.18)   85.23 (2.51)
GCM(c)    83.00 (2.04)   82.24 (2.87)   82.53 (2.73)   79.99 (3.80)   76.87 (3.09)
GCM(e)    82.82 (2.58)   81.61 (2.36)   80.63 (2.70)   78.19 (3.40)   73.27 (4.51)
Again, an extensive number of experiments on the municipal creditworthiness data set has been realized for different values of the input parameters. As a result, HGM and GCM have been trained with the following values of parameters: radius of RBF r_i = 10, Epochs = 50, degree of the graph h = 10, Distance = Cosine (c) / Euclidean (e). The experiments were designed for different proportions s [%] of the labelled data O_trainS. Classification accuracy ξtest [%] and standard deviation σ [%] have been observed for both kernel-based approaches with semi-supervised learning (Table 4).
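HGM and GCM themselves follow the algorithms in [9] and are not re-implemented here; as a loose, hedged illustration of the same graph-based semi-supervised idea, a generic label-spreading classifier can be trained with only a fraction s of the labels revealed (scikit-learn is an assumption, and the function below is illustrative only).

import numpy as np
from sklearn.semi_supervised import LabelSpreading

def accuracy_for_labelled_fraction(X, y, s, rng=None):
    """Train a graph-based semi-supervised classifier with only a fraction s
    of the labels revealed (unlabelled samples are marked with -1) and report
    accuracy on the samples whose labels were hidden."""
    rng = np.random.default_rng(rng)
    y_partial = np.full_like(y, -1)
    labelled = rng.choice(len(y), size=max(1, int(s * len(y))), replace=False)
    y_partial[labelled] = y[labelled]
    model = LabelSpreading(kernel="rbf", gamma=0.1)
    model.fit(X, y_partial)
    hidden = y_partial == -1
    return np.mean(model.transduction_[hidden] == y[hidden])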
Fig. 2. Classification accuracy ξtest [%] for different s [%] of the labelled data O_trainS (curves for HGM, GCM, FFNN, RBF neural networks and SVM)

The results are in line with those observed by [9], which means that both HGM and GCM behave similarly, i.e. with an increase in the proportion s [%] of
labelled data, the overall models' classification accuracies ξtest [%] improve too (Fig. 2). Moreover, the results show that higher classification accuracy ξtest [%] has been obtained by GCM for s < 15%, while HGM gives better results as the proportion s [%] of labelled data increases. This effect was expected, as the proportion s [%] of the labelled data is unbalanced (the number of labelled data in each class ω_{i,j} ∈ Ω differs considerably), see [9]. The kernel-based approaches with semi-supervised learning, HGM and GCM, have been compared to supervised methods: feed-forward neural networks (FFNNs), RBF neural networks and SVMs. In all models, according to expectations, the classification accuracies ξtest [%] improve with an increase in the proportion s [%] of labelled data O_trainS. The Global Consistency Model performed much better for an extremely small proportion s < 10% of the labelled data O_trainS, whereas HGM gives good results for a higher proportion s [%] of the labelled data. The other methods (FFNN, RBF neural networks, kernel-based approaches with supervised learning) outperform the semi-supervised methods as the proportion s [%] of the labelled data goes over 50%; in this case, SVMs with supervised learning give the best results.
5 Conclusion
The paper presents the design of parameters for municipal creditworthiness evaluation. The evaluation process is presented as a classification problem. A model is proposed in which the data matrix X is used as the input of SVMs with supervised learning and of kernel-based approaches with semi-supervised learning. The SVMs with supervised learning were designed and studied for the classification of municipalities o_i ∈ O into classes ω_{i,j} ∈ Ω due to their high classification accuracy ξtest [%] on testing data with a low standard deviation σ [%]. The best results have been obtained for the RBF kernel, with the optimal values of the input parameters found by a pattern search method.
In economic practice it is difficult to realize creditworthiness evaluation for each of the municipalities; this process is usually costly and time-consuming. Therefore, the kernel-based approaches with semi-supervised learning, HGM and GCM, were studied for different proportions s [%] of the labelled data O_trainS. The results of the semi-supervised methods HGM and GCM are compared to FFNNs, RBF neural networks and SVMs with supervised learning, considering the classification accuracy ξtest [%] for different proportions s [%] of the labelled data O_trainS. The results of the designed model for the classification of municipalities o_i ∈ O into classes ω_{i,j} ∈ Ω show the possibility of evaluating municipal creditworthiness even if only a low proportion of municipalities o_i ∈ O is labelled with classes ω_{i,j} ∈ Ω. Thereby, the model offers public administration managers, banks, investors and rating agencies a simpler conception of municipal creditworthiness evaluation.
Acknowledgements
This work was supported by the Czech Science Foundation under Grant No. 402/09/P090, Modelling of Municipal Finance by Computational Intelligence Methods, and Grant No. 402/08/0849, Model of Sustainable Regional Development Management.
References
[1] Olej, V., Hajek, P.: Hierarchical Structure of Fuzzy Inference Systems Design
for Municipal Creditworthiness Modelling. WSEAS Transactions on Systems and
Control 2, 162–169 (2007)
[2] Olej, V., Hajek, P.: Modelling of Municipal Rating by Unsupervised Methods.
WSEAS Transactions on Systems 7, 1679–1686 (2006)
[3] Hajek, P., Olej, V.: Municipal Creditworthiness Modelling by Kohonen’s Self-
organizing Feature Maps and LVQ Neural Networks. In: Rutkowski, L.,
Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M. (eds.) ICAISC 2008. LNCS (LNAI),
vol. 5097, pp. 52–61. Springer, Heidelberg (2008)
[4] Hajek, P., Olej, V.: Municipal Creditworthiness Modelling by Kohonen’s Self-
organizing Feature Maps and Fuzzy Logic Neural Networks. In: Kůrková, V.,
Neruda, R., Koutnı́k, J. (eds.) ICANN 2008, Part I. LNCS (LNAI), vol. 5163, pp.
533–542. Springer, Heidelberg (2008)
[5] Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines
and other Kernel-based Learning Methods. Cambridge University Press, Cam-
bridge (2000)
[6] Abe, S.: Support Vector Machines for Pattern Classification. Springer, London
(2005)
[7] Haykin, S.: Neural Networks: A Comprehensive Foundation. Prentice-Hall Inc.,
New Jersey (1999)
[8] Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer, New York
(1995)
[9] Huang, T.M., Kecman, V., Kopriva, I.: Kernel Based Algorithms for Mining Huge
Data Sets. In: Supervised, Semi-supervised, and Unsupervised Learning. Studies
in Computational Intelligence. Springer, Heidelberg (2006)
[10] Bennett, K.P., Demiriz, A.: Semi-supervised Support Vector Machines. In: Int.
Conf. on Advances in Neural Information Processing Systems, vol. 2. MIT Press,
Cambridge (1999)
[11] Chapelle, O., Scholkopf, B., Zien, A.: Semi-Supervised Learning. MIT Press, Cam-
bridge (2006)
[12] Zhu, X.: Semi-Supervised Learning Literature Survey (2005),
http://www.cs.wisc.edu/~jerryzhu/pub/ssl_survey
[13] Klose, A.: Extracting Fuzzy Classification Rules from Partially Labelled Data.
Soft Computing 8, 417–427 (2004)
Clustering of Pressure Fluctuation Data Using
Self-Organizing Map
1 Introduction
In recent years many process engineers have become interested in compact, safe, energy-efficient and environmentally friendly sustainable processes. The spread of the concept of "process intensification" has led to the innovation of new chemical devices and methods [1]. The microreactor, which consists of microfabricated channels (microchannels), is one of these new chemical devices. Microfabrication technology has produced devices for microanalysis. Microchannels with a size of several hundred micrometers have been demonstrated to allow rapid and precise control of temperature in an exothermic reaction process, because of the large ratio of surface area to reaction volume [2]. Therefore, microchemical technology has the advantages of reduced solvent use and the safe handling of dangerous substances.
Processes based on multiphase reactions in microchemical systems, especially liquid-liquid two-phase reactions, occur in a broad range of application areas, such as nitration, extraction, emulsification, and so on [3]. For example, a three-phase flow, water/n-heptane/water, was constructed in a microchannel (100-μm width, 25-μm depth) and was used for the separation of metal ions (Y3+, Zn2+) [4]. In previous studies of extraction systems on a microchip, the difficulty in system operation is how to stabilize the interface of two fluids with a high specific interfacial area. Hibara et al. proposed a method of modifying the wall inside the microchannel by using an octadecylsilane group [5]. When the surface of the channel was chemically modified to be
hydrophobic, organic solvent and water flowed stably in the confluence. It was reported that two-phase flow could be operated stably by applying a microchip with a patterned surface featuring deep and shallow microchannel areas [6].
While many methods for the stabilization of two-phase flow have been investigated from the viewpoint of microchannel design, there are few studies on online monitoring of the flow pattern during the start-up or long-term operation of a microchemical system. The flow pattern inside a microchannel cannot be monitored by optical methods if a transparent device such as a glass microchip [5] cannot be used. Pressure sensors are considered to provide useful information about the two-phase flow inside an opaque device, and thus integrated systems of microchannels and pressure sensors have been developed in recent years [7].
In our previous studies, a Y-shaped microchannel with built-in pressure sensors was developed. When the pressure fluctuation data were processed by spectrum analysis, it was seen that the dynamic behavior of the interface between the aqueous phase and the organic phase could be distinguished in the spectrogram. In order to control the dynamic behavior of the interface based on the spectrogram, it is necessary to analyze the nonlinear relationships between the waveform of the pressure fluctuation and the operational conditions. Therefore, in the present paper, the waveform of the pressure fluctuation is classified using the self-organizing map (SOM), and the clustering by the SOM is related to the classification of operational conditions.
The purpose of the present paper is to propose an application method of the batch SOM for the classification of operational conditions based on the dynamic behavior of the interface between immiscible liquid-liquid flows. A method of combining the clustering of pressure fluctuation data with the clustering of operational conditions is discussed below.
In the present paper, we used SOMine ver. 4.0 (Viscovery Software GmbH), which is based on the concept and algorithm of the "batch" SOM introduced by T. Kohonen [8]. In our previous studies, SOMine was demonstrated to be effective in monitoring the small displacement of an impeller shaft inside an agitation vessel [9]. Although two-dimensional Kohonen nets are used in SOMine, its training mode is not the sequential mode that is used conventionally.
In the training mode, the batch SOM algorithm first processes all data vectors and then updates the map once. The batch SOM algorithm is faster and more robust than the original Kohonen algorithm. According to the user's manual of SOMine, the initial vectors of the map are determined by a principal component analysis (PCA) of the input data that are prepared for training. That is to say, in the present paper, training starts from a map representing a linearized space for the multi-dimensional input data. The topological
connection between two arbitrary nodes is defined by the Gaussian function. The
radius of the Gaussian function is referred to as tension in the SOMine, which is used
for determining the degree of smoothing of the map.
In the training of the map, the SOMine updates a node vector by setting it to the
mean value of all weighted data vectors that match that node and its neighboring
nodes, in a manner similar to the K-means method. During the training process,
the number of nodes in a map is not fixed but grows from a fairly small number to the
desired number of nodes, in order to implement more efficient training. This scheme
is shown in Fig. 3. Each map is trained for a certain number of batches using a decreasing tension. When the number of nodes is increased, the growth of the map is compensated by a corresponding decrease in the tension.
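A hedged, minimal sketch of one epoch of such a batch update follows (it is not the SOMine implementation; the growing-map scheme and the tension schedule are omitted, and all names are illustrative).

import numpy as np

def batch_som_epoch(data, codebook, grid, tension):
    """One batch-SOM epoch: every node vector is set to the neighbourhood-
    weighted mean of all data vectors, where the weights come from a Gaussian
    over grid distances between each data vector's best-matching node and the
    node being updated.

    data     : (n_samples, dim) array of input vectors
    codebook : (n_nodes, dim) array of node vectors
    grid     : (n_nodes, 2) array of node coordinates on the 2-D map
    tension  : radius of the Gaussian neighbourhood
    """
    # best-matching node (BMU) for every data vector
    d = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=2)
    bmu = np.argmin(d, axis=1)
    # Gaussian neighbourhood weight between every node and every BMU
    g2 = np.sum((grid[:, None, :] - grid[None, :, :]) ** 2, axis=2)
    h = np.exp(-g2 / (2.0 * tension ** 2))          # (n_nodes, n_nodes)
    w = h[:, bmu]                                    # (n_nodes, n_samples)
    return (w @ data) / w.sum(axis=1, keepdims=True)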
In the clustering of the generated map after training, SOM-Ward clusters were used. This clustering method combines the local ordering information of the map with the classical hierarchical clustering algorithm of Ward.
As the first step of the clustering, fifteen characteristic waveforms were selected from the time-series pressure data that were acquired under seventeen operational conditions. Fig. 5 shows four examples of the selected waveforms of pressure fluctuation.
In the next step (step 2), the provisional clustering map of operational conditions was obtained as shown in Fig. 6. The generated map was seen to be divided into four clusters (A-D). "Q2R0.5 (org)" in Fig. 6, which is indicated by a bullet, represents the location of the node recalled for the operational condition in which the total flow rate Q is 2.00 ml/min and the flow-rate ratio R is 0.50. "org" and "aq" in the parentheses represent the measurement points on the organic-phase side and the aqueous-phase side, which are CH1 and CH2 in Fig. 1.
In the third step of the clustering, we selected the three clusters A, B and C in Fig. 6, which were considered to be groups of operational conditions in which pressure fluctuations with large amplitude were seen frequently. By inputting all waveforms classified into these three clusters, a new clustering map of the pressure fluctuation, separated into six clusters (a-f), was computed as shown in Fig. 7. The six waveforms around the map in Fig. 7 show the representative pressure fluctuation data in the clusters. Cluster "a" was considered to be a group of waveforms with small fluctuation, whereas cluster "f" was a group of waveforms with high frequency and large amplitude. The other clusters were considered to be groups of waveforms whose frequency was 0.2-0.5 Hz.
The clustering map of operational conditions (Fig. 8) was computed in the last step (step 4). It was seen that the map could be separated into three clusters (A-C). As shown in Fig. 8, the operational conditions in which Q was high (4.0-6.0 ml/min) were classified into cluster C. On the other hand, the operating conditions in which Q was 2.0 ml/min or less were classified into clusters A and B. It was considered that cluster A could be distinguished from cluster B by the value of R.
In the previous study [10], we analyzed the dynamic behavior of the interface between the aqueous and organic phases by using two indices, W* and α. As shown in Fig. 9, the angle between the segment OM and the base line l was defined as the index α. The point M is the contact point of the interface with the wall in the confluence of the microchannels. The other index, W*, is the ratio of the width of the organic-phase flow (W') to the width of the channel (W), measured at a position separated from the confluence by 18 mm.
Fig. 10. Three examples of the representative form of interface in two-phase flow
5 Conclusions
The batch SOM was applied to the classification of operational conditions based on
the dynamic behavior of the interface between immiscible liquid-liquid flows in the microchannel. Pressure fluctuation data were used for monitoring the dynamic behavior of the interface.
References
1. Stankiewicz, A.I., Moulijn, A.: Process Intensification: Transforming Chemical Engineer-
ing. Chem. Eng. Progress. 96, 22–34 (2000)
2. Hessel, V., Hardt, S., Lowe, H.: Chemical Micro Process Engineering. Wiley-VCH, Wein-
heim (2004)
3. Zhao, Y., Chen, G., Yuan, Q.: Liquid-Liquid Two-Phase Flow Patterns in a Rectangular
Microchannel. AIChE J. 52, 4052–4060 (2006)
4. Maruyama, T., Matsushita, H., Uchida, J., Kubota, F., Goto, M.: Liquid Membrane Opera-
tions in a Microfluidic Device for Selective Separation of Metal Ions. Anal. Chem. 76,
4495–4500 (2004)
5. Hibara, A., Nonaka, M., Hisamoto, H., Uchiyama, K., Kikutani, Y., Tokeshi, M., Ki-
tamori, T.: Stabilization of Liquid Interface and Control of Two-Phase Confluence and
Separation in Glass Microchips by Utilizing Octadecylsilane Modification of Microchan-
nels. Anal. Chem. 74, 1724–1728 (2002)
6. Aota, A., Hibara, A., Kitamori, T.: Pressure Balance at the Liquid-Liquid Interface of Mi-
cro Countercurrent Flows in Microchips. Anal. Chem. 79, 3919–3924 (2007)
7. Kohl, M.J., Abdel-Khalik, S.I., Jetter, S.M., Sadowski, D.L.: An Experimental Investiga-
tion of Microchannel Flow with Internal Pressure Measurements. Int. J. Heat Mass Trans-
fer 48, 1518–1533 (2005)
8. Kohonen, T.: Self-Organizing Maps (the Japanese language edition). Springer, Tokyo
(2001)
9. Matsumoto, H., Masumoto, R., Kuroda, C.: Feature Extraction of Time-Series Process Im-
ages in an Aerated Agitation Vessel using Self Organizing Map. Neurocomputing (in
press)
10. Marumo, T., Matsumoto, H., Kuroda, C.: Measurement of Pressure Fluctuation in Liquid-
Liquid Two-Phase Flow in a Microchannel and Analysis of Stabilization of Liquid Inter-
face. In: Proceeding of the 39th Autumn Meeting of the SCEJ(Japanese). S119 (2007)
Intelligent Fuzzy Reasoning for Flood Risk Estimation
in River Evros
Abstract. This paper presents the design of a fuzzy algebra model and the implementation of its corresponding Intelligent System (IS). The system is capable of estimating the risk due to extreme disaster phenomena and especially due to natural hazards. Based on the considered risk parameters, an equal number of fuzzy sets is defined. For all of the defined fuzzy sets, trapezoidal membership functions are used for the production of the partial risk indices. The fuzzy sets are aggregated into a single one that encapsulates the overall degree of risk. The aggregation operation is performed in several different ways by using various fuzzy relations. The degree of membership of each case in the aggregated fuzzy set is the final overall degree of risk. The IS has been applied to the problem of torrential risk estimation, with data from river Evros. The compatibility of the system with existing models has been tested, and the results obtained by two distinct fuzzy approaches have been compared.
1 Introduction
Risk evaluation is a very important, complex and crucial task for scientists in our
days. Especially, vulnerability estimation caused by natural disasters and extreme
weather phenomena is of main concern. The term natural disasters includes many
very serious problems of our times and it becomes worse every year, due to the cli-
mate changes and due to the deforestation.
This paper deals with the development of a Risk Approximation Model (RAM) in-
corporated by a prototype Intelligent System. The IS has been applied as a pilot ap-
proach in a real world environmental risk assessment problem, using actual data.
According to Turban and Aronson [1] decision making comprises of three phases:
a) The intelligence (searching for conditions that call for decisions) b) The design
(inventing, developing and analyzing possible courses of action) c) The choice (Se-
lecting a course of action from the available ones). An unstructured problem is the
one in which none of the three above phases has a predefined and routine solution.
Risk estimation is exactly the case of an unstructured problem. It is a fact that fuzzy
reasoning is widely used as a modern tool for decision making and it is often met in
the literature for environmental risk problems [2].
There is no widely accepted definition of risk. Four items are found for risk in Webster's Dictionary: (1) possibility of loss or injury; (2) someone or something that creates or suggests a hazard. According to Huang et al. [3], risk is a scene in the future associated with some adverse incident. Scene means something seen by a viewer, a view or prospect; adverse means contrary to one's interests or welfare, harmful or unfavorable [3]. A scene must be described with a system consisting of time, site and objects [3]. The association would be measured with a metric space, whereas the incident would be scaled with a magnitude [3].
Several successful efforts at risk estimation using fuzzy sets have also been made [3], [4], [5], [6]. The developed RAM uses specific fuzzy algebra concepts that include membership functions (MF) and distinct fuzzy conjunction operators (FCO). This is the second approach that has been performed on river Evros by our research team. The initial one [7] had encouraging results, and its compatibility with the existing methods of Gavrilovic and Stiny [8], [9] was as high as 75%. It used triangular membership functions and four FCO [7], and all of the parameters had an equal contribution in the determination of the overall risk. Apart from the different membership function and the additional FCO, this paper presents a comparative study between the initial and the later approach in an effort to determine which MF fits better in this specific case. It also enhances the initial model significantly by adding the option of performing scenarios with uneven contribution weights for the parameters involved. This research effort therefore offers a significant step forward.
The developed IS has been tested and validated on the actual problem of floods and
torrential disasters in a specific area of Southern Balkans. Floods and erosion cause
immense problems for settlements and infrastructure and intense feelings of insecurity among citizens worldwide. As a consequence, mitigation measures must be taken
based on the design of an effective protection and prevention policy. The design of
such a policy requires the correct estimation of the torrential risk for each river or
stream watershed in the wider area of interest. Every year, the most significant phe-
nomena of floods and erosion in the southern Balkans are observed in the mountain-
ous part of the watershed of river Evros that belongs to Greece, Turkey and Bulgaria.
Due to the lack of data from Turkey and Bulgaria the testing has been performed only
in the watersheds that belong to Greece. This is obviously a limitation of the testing
process. However its solution requires years of interstate cooperation. This is one of
the reasons that make this effort a pilot one.
The most widely used existing approaches are those of Gavrilovic and Stiny [8], [9]. The equation of Gavrilovic considers the average annual production of sediments, the average annual temperature, the average annual rain height of the watershed, its area, the kind of its geodeposition, the vegetation and the erosion of the watershed; the calculation of the special degradation is particularly important. The method of Stiny determines the torrential risk based on the sediment production with a periodicity of 100 years, using the equation of Stiny-Herheulidze [9].
The main drawback of the existing modeling approaches is that they consider only
partial risk indices (due to independent parameters) and that they apply crisp sets
methods with specific boundaries, not capable of describing proper risk linguistics.
The basic innovative characteristic of the RAM is the production of an integrated risk index that gives an overall consideration of all involved factors. The natural disasters risk estimation problem is a composite one, and it should be seen from different perspectives before the design of a prevention and protection policy. For example, some areas will be at highest risk under extreme weather phenomena, whereas other areas will be at highest risk in average situations. The older methods do not make such distinctions. The IS performs partial risk index estimation (for each factor affecting the problem) and also calculates a unified risk index (URI), which can be considered an estimation of the overall risk for each area. Finally, a very important advantage of the proposed IS is its ability to assign different weights to the attributes involved. In this way it can either perform scenarios or consider the attributes as having equal importance. The Decision Support System (DSS) has been named TORRISDESSYS (Torrential Risk Decision Support System) and it can be applied to any area of the world as long as a sufficient amount of data is available. Its outcome has been compared to the outcome of other established methodologies, and the results have shown a high degree of compatibility between them. The results of this study can become a basis for the long-term planning of mitigation measures against floods and erosion.
Two different algebraic approaches can generally be considered for the estimation of risk. One applies crisp sets to existing data and the other uses fuzzy sets on data and on meta-data [10]. In this modelling approach the initial primitive data come from actual measurements. Then, as the system learns, this piece of information is transformed into degrees of membership (DOM) in the closed interval [0,1]. The DOMs are used as new sources of data for the production of the overall degree of membership, which is the final risk measurement. If it is required by the case study, this overall risk index can also be transformed from a pure dimensionless number into a parameter with a physical meaning by applying known de-fuzzification equations and processes. From this point of view, the term meta-data is used here to indicate this continuous transformation in the nature of the data. Fuzzy sets (FS) can be used to produce a rational and sensible representation of real-world concepts [11]. For each FS there exists a degree of membership function μs(X) that is mapped on the closed interval of values [0,1] [12]. Figure 1 graphically presents the fuzzy sets Tiny and Small using a semi-trapezoidal and a triangular membership function, respectively.
for the corresponding torrential factor. This means that PDR_i = μ_i(FS_i), where μ_i is the function estimating the degree of membership (DOM) of area i in the FS_i. Trapezoidal membership functions (TRAMF) have been applied to the primitive data vectors (PDV) for the estimation of the DOM of each area in the corresponding fuzzy set. The following equation 1 describes the TRAMF [13], [17], [18], [19].
\mu_s(X) = \begin{cases} 0, & \text{if } X \le a \\ (X-a)/(m-a), & \text{if } X \in (a, m) \\ 1, & \text{if } X \in [m, n] \\ (b-X)/(b-n), & \text{if } X \in (n, b) \\ 0, & \text{if } X \ge b \end{cases} \qquad (1)
The TRAMF has a tolerance interval for the watersheds with the highest degree of risk. Other membership functions (MF), such as the triangular one (having only one peak), are also common in fuzzy algebra. In this research effort the TRAMF has been applied to the PDV due to the importance of the tolerance interval. The following equation 2 presents a triangular membership function [13], [17], [19].
\mu_s(X) = \begin{cases} 0, & \text{if } X < a \\ (X-a)/(c-a), & \text{if } X \in [a, c) \\ (b-X)/(b-c), & \text{if } X \in [c, b) \\ 0, & \text{if } X \ge b \end{cases} \qquad (2)
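A minimal sketch of equations (1) and (2) follows (parameter names a, m, n, b and a, c, b follow the equations; this is an illustration, not code taken from TORRISDESSYS).

def trapezoidal_mf(x, a, m, n, b):
    """Trapezoidal membership function of equation (1): rises on (a, m),
    equals 1 on the tolerance interval [m, n], and falls on (n, b)."""
    if x <= a or x >= b:
        return 0.0
    if x < m:
        return (x - a) / (m - a)
    if x <= n:
        return 1.0
    return (b - x) / (b - n)

def triangular_mf(x, a, c, b):
    """Triangular membership function of equation (2) with a single peak at c."""
    if x < a or x >= b:
        return 0.0
    if x < c:
        return (x - a) / (c - a)
    return (b - x) / (b - c)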
The final target is the aggregation of all the partial DOM and the estimation of the
degree of membership of each area to the final fuzzy set “Torrentially risky Area”.
Various special types of fuzzy relations (the fuzzy T-Norms) have been applied in
order to unify the partial risk indices. In this way meta-data vectors (MDV) that cor-
respond to the unified risk indices are extracted from the PDV. Of course this is
automatically performed by an intelligent prototype Information system. The follow-
ing equations 3 and 4 offer a good definition of a fuzzy relation. Let A be an input fuzzy region with element x using a membership function μ_A, and let B be an output fuzzy region with element y and membership function μ_B. The fuzzy relation on the fuzzy Cartesian product A × B is a mapping from A × B to the closed interval [0,1] (equations 3 and 4).
A T-Norm performs the logical AND operation between the fuzzy sets. The result of
each T-Norm is presented in the following equation 5.
T_Norm_i = PDR_1 ∧ PDR_2 ∧ ... ∧ PDR_i = μ_1(FS_1) ∧ μ_2(FS_2) ∧ ... ∧ μ_i(FS_i)   (5)
It should be noted that i is the number of independent parameters and, consequently, the number of FS used. The following Table 1 contains the five different cases of T-Norms that have been used for the production of the unified risk index (URI) [20], [13], [21].
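Table 1 itself is not reproduced above. Based on the five T-Norms named later in the text (the minimum, the algebraic product, the drastic product, the Einstein product and the Hamacher product), the following hedged sketch uses their standard textbook definitions, which may differ in detail from the paper's Table 1; all names are illustrative.

def t_norm(a, b, kind="min"):
    """Pairwise T-Norms in their standard textbook forms; a and b are degrees
    of membership in [0, 1]."""
    if kind == "min":
        return min(a, b)
    if kind == "algebraic":
        return a * b
    if kind == "drastic":
        return min(a, b) if max(a, b) == 1.0 else 0.0
    if kind == "einstein":
        return (a * b) / (2.0 - (a + b - a * b))
    if kind == "hamacher":
        return 0.0 if a == b == 0.0 else (a * b) / (a + b - a * b)
    raise ValueError("unknown T-Norm: " + kind)

def unified_risk_index(partial_risks, kind="min"):
    """Equation (5): fold the chosen T-Norm over all partial risk indices."""
    uri = partial_risks[0]
    for pdr in partial_risks[1:]:
        uri = t_norm(uri, pdr, kind)
    return uri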
After the application of the decision support system, each area is assigned an MDV, which is a vector of unified risk indices due to the various T-Norms applied. Each element of each meta-data vector corresponds to the overall degree of risk of the examined case from a different angle; consequently, each element of the MDV offers a different perspective on the risk problem. The interpretation of all the unified risk indices has equal importance towards risk estimation.
Since the T-Norms perform conjunction, they are aggregation functions Agg(x). In this case we can have multiple attribute decision making with unequal weights assigned to the attributes [22], [23].
\mu_{\tilde{A}}^{S}(x_i) = Agg\left( f(\mu_{\tilde{A}}(x_i), w_1),\; f(\mu_{\tilde{A}}(x_i), w_2),\; \dots,\; f(\mu_{\tilde{A}}(x_i), w_n) \right) \qquad (10)

where i = 1, 2, ..., k, k is the number of cases and n is the number of attributes. The following equation 11 represents the function f(μ, w):

f(\mu_{\tilde{A}}, w) = \mu_{\tilde{A}}^{1/w} \qquad (11)
The above equations 10 and 11 offer a very significant approach to the consideration of the potentially uneven contribution of each parameter to the final torrential risk. The TORRISDESSYS offers the option of uneven weights; however, in this paper we have performed the scenario of even contribution. This was done in order to have a similar and comparable approach to the one of [7]. According to [24], the T-Norms can be considered as optimistic conjunction operators. There are also other types of aggregation operators, called S-Norms, that assign higher degrees of membership to risky areas and view the problem from exactly the reverse angle compared to the T-Norms. This research faces the problem from an optimistic point of view; of course, a future research effort should involve the application of S-Norms as well. The universal nature of the model is due to the fact that, in a different application, the only thing that has to change is the nature of the parameters involved in the estimation of risk. The whole algebraic framework remains exactly the same.
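A short, hedged sketch of how equations (10) and (11) can be combined with a T-Norm aggregation follows; the exponent 1/w follows the reconstruction of equation (11) above, and the function name and example values are illustrative only.

from functools import reduce

def weighted_uri(partial_risks, weights, t_norm=min):
    """Equations (10)-(11): each partial degree of membership is raised to
    1/w before the T-Norm aggregation, so that parameters can contribute
    unevenly; with all weights equal to 1 this reduces to the unweighted URI."""
    weighted = (mu ** (1.0 / w) for mu, w in zip(partial_risks, weights))
    return reduce(t_norm, weighted)

# Example: three partial risk indices, with the third parameter given twice
# the influence of the others.
# weighted_uri([0.7, 0.4, 0.9], [1.0, 1.0, 2.0])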
The IS has been developed in MS Access. It uses a visual environment and applies a modern graphical user interface. It has been designed and implemented to store the data in a relational Access database, and the retrieval mechanism of the DSS reads the required data from this database. Data are stored in several tables according to the relational philosophy. The main table contains six fields: the primary key and five fields corresponding to the torrential risk factors. It has been designed to follow the first and second normal forms [25]. The DOM in the five fuzzy sets (partial risks) and the URI were calculated by performing SQL (Structured Query Language) operations on the database. The following statement is an example of a DOM estimation using SQL.
SELECT fuzzy_help.perioxi, fuzzy_help!help_drastic AS drastic_product
FROM fuzzy_help
ORDER BY fuzzy_help!help_drastic DESC;
Table 3 shows the influence of extreme values on the produced risk index. All areas that are characterized by the complete lack of compact geological forms have a significant overall degree of risk. The drastic product T-Norm offers a good approach for the characterization of areas that have extreme values affecting positively the torrential phenomena in a watershed area.
Table 3. Torrential risk of Northern Evros using the Drastic product: the influence of the low level of compact geological forms on the TRMW.
In the above case the overall DOM was estimated by co-evaluating the partial DOM in the FS F̃ = {Area with a low percentage of Compact Geological Forms}. Due to the fact that, for all areas presented in the above table, the DOM in the FS F̃ equals 1, the Drastic Product produces an overall DOM equal to the minimum of all. The following Tables 4 and 5 clearly show the most risky areas for Central Evros and Southern Evros, respectively.
Table 4. Torrential risk of Central Evros by the TRAMF (10 most risky areas)
In both cases the TRAMF has been applied for the estimation of the partial degrees of risk, and the Einstein product, the algebraic product, the Hamacher product and the Drastic product T-Norms for the final TRMW calculation.
Table 5. Torrential risk of Southern Evros by the TRAMF (10 most risky areas)
Einstein product / Area    Algebraic product / Area    Drastic product / Area
this is a serious indication of its suitability for torrential risk estimation. As was mentioned in the early sections of this paper, this research is a pilot one. The results of the system have been compared to the outcomes given by the methods of Gavrilovic1 (G1) and Gavrilovic2 (G2) [8] and the method of Stiny (St) [9], which were described briefly in the previous sections and are widely used. The comparative study between the system and the other existing methods has shown that there exists a very high level of compatibility in most of the cases. In the case of Central Evros, for the TRAMF the compatibility between the TORRISDESSYS and the Gavrilovic2 method is stable and equals 80% in the cases of the algebraic product, the min, the drastic product and the Hamacher product; only in the case of the Einstein product is it equal to 75%. The use of the Einstein T-Norm gives an average compatibility of 62.5% in Northern Evros, 63.3% in Central Evros and 52.77% in Southern Evros. The use of the algebraic product T-Norm gives an average compatibility of 62.5% in Northern Evros, 61.6% in Central Evros and 52.77% in Southern Evros.
The use of the Minimum T-Norm gives an average compatibility of 62.5% in Northern Evros, 61.66% in Central Evros and 52.77% in Southern Evros. The use of the Hamacher product T-Norm gives an average compatibility of 62.5% in Northern Evros, 61.66% in Central Evros and 52.77% in Southern Evros. In the case of the Drastic Product, the compatibility between the TORRISDESSYS and the Gavrilovic2 method for Central Evros is 80% (agreement in 16 out of 20 cases). Generally, the testing has shown that the DSS estimates the most risky torrential streams in the area of river Evros with an agreement higher than 50% with the methods of G1, G2 and Stiny, which can reach up to 80%. According to the above results and to [7], the TRAMF has a much higher percentage of agreement with the established common methods than the TRIMF. An achievement of this research effort is the development of an effective and flexible risk estimation model that can be used in various risk situations and problems. The application of the system has been quite successful. A drawback of the system's testing process is the lack of torrential data from the Turkish and Bulgarian sides of river Evros. It would be essential for the Greek side to cooperate with the other two sides and to input torrential data from both countries into the IS, due to the fact that natural disasters do not have any borders. This should be materialized in the future in a much wider effort.
References
1. Turban, E., Aronson, J.: Decision support systems and Intelligent systems, 5th edn. Pren-
tice Hall, New Jersey (1998)
2. Carlsson, C., Fuller, R.: Fuzzy Reasoning in Decision-Making and Optimization, 1st edn.
Physica-Verlag, Heidelberg (2001) (Studies in Fuzziness and soft computing)
3. Huang, C.F., Moraga, C.: A fuzzy risk model and its matrix algorithm. International Jour-
nal of Uncertainty, Fuzziness and Knowledge-Based Systems 10(4), 347–362 (2002)
4. Iliadis, L., Spartalis, S.: Fundamental fuzzy Relation Concepts of a D.S.S. for the estima-
tion of Natural Disasters risk (The case of a trapezoidal membership function). Journal of
Mathematical and Computer modelling 42, 747–758 (2005)
5. Kaloudis, S., Tocatlidou, A., Lorentzos, N., Sideridis, A., Karteris, M.: Assessing Wildfire
Destruction Danger: a Decision Support System incorporating uncertainty. Journal Eco-
logical Modelling 181(1), 25–38 (2005)
6. Loboda, T.V., Csiszar, I.: University of Maryland USA. Assessing the risk of ignition in
the russian far east within a modeling framework of fire threat. Ecological Applica-
tions 17(3), 791–805 (2007)
7. Iliadis, L., Maris, F., Marinos, D.: A decision support system using fuzzy relations for the
estimation of long-term torrential risk of mountainous watersheds: The case of river Evros.
In: Proceedings of the 5th International Symposium on Eastern Mediterranean Geology,
Thessaloniki, Greece (2004)
8. Gavrilovic, S.: Inzenjering o bujicnim tovoklima i eroziji. Beograd (1972)
9. Kotoulas, D.: Management of Torrents I. Publications of the University of Thessaloniki,
Greece (1997)
10. Leondes, C.T.: fuzzy logic and Expert systems Applications. Academic Press, California
(1998)
11. Kandel, A.: Fuzzy Expert systems. CRC Press, USA (1992)
12. Zadeh, L.A.: Fuzzy logic Computing with words. IEEE Trans. fuzzy systems 4(2), 103–
111 (1996)
13. Kecman, V.: Learning and soft computing. MIT Press. London (2001)
14. Kotoulas, D.: Research on the characteristics of torrential streams in Greece, as a causal
factor for the decline of mountainous watersheds and flooding, Thessaloniki, Greece
(1987)
15. Stefanidis, P.: The torrent problems in Mediterranean Areas (example from Greece). In:
Proc. XXIUFRO Congress, Finland (1995)
16. Viessman, J.W., Levis, G.L., Knappt, J.W.: Introduction to Hydrology. Harper and Raw
Publishers, New York (1989)
17. Cox, E.: The fuzzy systems Handbook, 2nd edn. Academic Press, New York (1999)
18. Dubois, D., Prade, H., Yager, R.: Fuzzy Information Engineering. John Wiley and sons,
New York (1996)
19. Nguyen, H.E., Walker, E.: A First Course in fuzzy logic. Chapman and Hall, Library of the
Congress, USA (2000)
20. Cox, E.: Fuzzy modeling and Genetic Algorithms for data Mining and Exploration. El-
sevier, USA (2005)
21. De Cock, M.: Representing the Adverb Very in fuzzy set Theory. In: Proceedings of the
ESSLLI Student Session, Ch.19 (1999)
22. Calvo, T., Mayor, G., Mesiar, R.: Aggregation Operators: New Trends and Applications
(Studies in Fuzziness and Soft Computing). Physica-Verlag, Heidelberg (2002)
23. Fan, Z., Ma, P., Zhang, J.Q.: An approach to multiple attribute decision-making based on
fuzzy preference information on alternatives. Fuzzy sets and systems 131(1), 101–106
(2002)
24. Iliadis, L.: Intelligent Systems and applications in risk estimation (Book in Greek). Sta-
moulh publishing co, Thessaloniki (2007)
25. Date, C.J.: An Introduction to database systems. Addison-Wesley, New York (2007)
26. Coppin, B.: Artificial Intelligence Illuminated. Jones and Bartlett Publishers, USA (2004)
Fuzzy Logic and Artificial Neural Networks
for Advanced Authentication Using Soft Biometric Data
Mario Malcangi
1 Introduction
Biometrics uses physiological and behavioral characteristics possessed exclusively by one individual to attest to her/his identity. These are claimed to be better than current and established authentication methods, such as personal identification numbers (PINs), passwords, smart cards, etc., because they offer key advantages such as:
• Availability (always)
• Uniqueness (to each person)
• Not transferable (to other parties)
• Not forgettable
• Not subject to theft
• Not guessable
As a result of these advantages, biometrics offers very high-level application security compared to the security level offered by traditional identification methods based on what a person knows or possesses.
2 System Architecture
The biometric authentication system (Fig. 1) consists of three processing layers: the feature-extraction layer, the matching layer, and the fuzzy logic-based decision layer. The feature-extraction layer uses signal processing-based algorithms for hard and soft feature extraction. The matching layer uses hard computing to identify hard features
and soft computing (ANN) to identify soft features. The decision layer uses a fuzzy
logic engine to fuse the identification scores and the crisp soft features.
A floating-point digital signal processor (DSP) executes the feature-extraction
algorithms, the artificial neural network, and the fuzzy logic engine. This kind of
processing device is the best choice to efficiently run both the hard computing and the
soft computing engines.
D_i(\mathbf{x}) = (\mathbf{x} - \bar{\mathbf{x}})^{T} W^{-1} (\mathbf{x} - \bar{\mathbf{x}}) \qquad (5)

where W is the covariance matrix computed using the average and the standard deviation features of the utterance. The input pattern x is processed with reference to the utterance-averaged feature vector x̄ that represents the person to be identified. The distance D_i(x) is a score for the authorized user.
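A minimal numpy sketch of the score in equation (5) follows (illustrative only; W and x_mean would come from the enrolment utterances, and the names are hypothetical).

import numpy as np

def mahalanobis_score(x, x_mean, W):
    """Equation (5): squared Mahalanobis distance between the input feature
    vector x and the enrolled speaker's utterance-averaged feature vector
    x_mean, using the covariance matrix W estimated at enrolment."""
    diff = np.asarray(x, dtype=float) - np.asarray(x_mean, dtype=float)
    return float(diff @ np.linalg.inv(W) @ diff)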
The DTW-KNN method combines the dynamic-time-warping measurement with the k-nearest neighbor (KNN) decision algorithm. The DTW first clusters similar elements that
refer to a feature into classes. The cost function is computed using the Euclidean distance, with a granularity of one frame. The KNN algorithm is then applied to select the k minimal-distance matches and to choose the most recurring person among them. This results in lower false-positive and false-negative rates during identification, compared to the original DTW algorithm.
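A hedged sketch of the frame-level dynamic-time-warping cost described above follows; the frame features, the path constraints actually used, and the subsequent KNN vote are not specified here and are treated as assumptions.

import numpy as np

def dtw_distance(seq_a, seq_b):
    """Dynamic-time-warping cost between two sequences of feature frames,
    with a Euclidean cost per frame pair (granularity of one frame)."""
    a = [np.asarray(f, dtype=float) for f in seq_a]
    b = [np.asarray(f, dtype=float) for f in seq_b]
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],       # insertion
                                 cost[i, j - 1],       # deletion
                                 cost[i - 1, j - 1])   # match
    return float(cost[n, m])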
This is a very fast algorithm for mapping fingerprint minutiae. The most critical step in this procedure is to avoid computing false minutiae caused by noise in the scanned fingerprint image. To overcome this problem, a backtrack control is executed on each feature pattern before it is validated, to check that each of the three branches of the bifurcation is significantly long.
The scanned fingerprint image is transformed into a set of coordinates (x, y) and directions. Pattern matching consists of a procedure that first tries to align the template pattern and the input pattern, and then computes an overlapping score. This score is a measurement of the authenticity of the person who created the input.
• speed
• stress
Speed is measured as the total duration of the speech utterance. Stress is measured as
the ratio between the peak amplitude of the stressed vowel and the average amplitude
of the whole utterance. Both these voiceprint features are related to the way the per-
son is used to speaking a requested word.
The following soft features are extracted from fingerprints:
• total area
• mean intensity
Total area is measured as the ratio between the total pixels available on the finger-
print-scanning device and the total pixels of the captured fingerprint image that have a
value higher than the estimated peak noise level. Mean intensity is measured as the
sum of the intensity of all the pixels with a value higher than the estimated peak level.
Both these fingerprint features are related to the way the person approaches contact
with the fingerprint sensor.
This FFBP-ANN is trained using data collected during the tuning and testing of the
hard-biometric features of the person authorized for access. This data is then mixed
with the biometric data of unauthorized persons and used to build the training pat-
terns. The purpose is to train the ANN so it can learn to recognize behavior specific to
the authorized individual.
Rules were derived from feature distribution. Each rule was manually tuned using
a fuzzy-logic development environment specially adapted to this purpose. The pri-
mary purpose of this tuning action is to integrate single biometric matchers into a sin-
gle smart-decision system.
For fine tuning, many other rules can be generated to take additional soft-biometric
measurements into account. Using more rules leads monotonically toward greater
reliability in the authentication process.
Triangular membership functions are used to process inputs. The inference rule set
is then applied. The result of all the rules is evaluated using the centroid method [8],
so a crisp output value can be calculated by finding the variable value of the center of
gravity. A singleton membership function was used to defuzzify the final decision.
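A minimal sketch of centroid (centre-of-gravity) defuzzification over a sampled output membership function follows; it is illustrative only, and the actual rule evaluation and the singleton output used here are not reproduced.

import numpy as np

def centroid_defuzzify(y, mu):
    """Centroid (centre-of-gravity) defuzzification: y is a sampled grid of
    the output variable and mu the aggregated membership degree at each grid
    point; the crisp output is the membership-weighted average of y."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    return float(np.sum(y * mu) / np.sum(mu)) if mu.sum() > 0 else 0.0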
8 Performance Evaluation
9 Embedded Implementation
10 Conclusions
Preliminary results of this research demonstrate that a soft-multibiometric approach to
personal authentication on embedded systems can require fewer system resources if
smart logic is used to implement the decision process. Artificial neural networks can
be used to make inferences about soft-biometric features. Fuzzy logic proves a very
practical solution for implementing smart decision logic.
The integration of multiple biometric identification systems through a fuzzy logic
inference engine is a good strategy for keeping system complexity low while increas-
ing performance. A positive side effect of this approach is that embedded information
such as behavior and emotion can be included in the final decision. Artificial neural
networks prove to work well in mapping such information, enabling an effective
multimodal approach to personal biometric authentication.
References
1. Bishop, C.: Neural Networks for Pattern Recognition. Oxford University Press, Oxford
(1995)
2. Ciota, Z.: Improvement of Speech Processing Using Fuzzy Logic Approach. In: Proceed-
ings of the IFSA World Congress and 20th NAFIPS International Conference (2001)
3. Bosteels, R.T.K., Kerre, E.E.: Fuzzy Audio Similarity Measures Based on Spectrum His-
togram and Fluctuation Patterns. In: Proceedings of the International Conference Multime-
dia and Ubiquitous Engineering 2007, Seoul, Korea, April 27–28 (2007)
4. Malcangi, M.: Soft-computing Approach to Fit a Speech Recognition System on a Single-
chip. In: Proceedings of 2002 International Workshop System-On-Chip For Real-Time
Applications, Banff, Canada, July 6-7 (2002)
5. Malcangi, M.: Improving Speech Endpoint Detection Using Fuzzy Logic-based Method-
ologies. In: Proceedings of the Thirteenth Turkish Symposium on Artificial Intelligence
and Neural Networks, Izmir, Turkey, June 10-11 (2004)
6. O’Shaugnessy, D.: Speech Communication – Human and Machine. Addison-Wesley,
Reading (1987)
7. Pak-Sum Hui, H., Meng, H.M., Mak, M.: Adaptive Weight Estimation in Multi-biometric
Verification Using Fuzzy Logic Decision Fusion. In: Proceedings of IEEE International
Conference on Acoustic, Speech, and Signal Processing (2007)
8. Runkler, T.A.: Selection of Appropriate Defuzzification Methods Using Application Spe-
cific Properties. IEEE Transactions on Fuzzy Systems 5(1) (February 1997)
9. Wahab, A., Ng, G.S., Dickiyanto, R.: Speaker Authentication System Using Soft Computing Approaches. Neurocomputing 68, 13–17 (2005)
10. Jain, A.K., Dass, S.C., Nandakumar, K.: Soft Biometric Traits for Personal Recognition
Systems. In: Zhang, D., Jain, A.K. (eds.) ICBA 2004. LNCS, vol. 3072, pp. 731–738.
Springer, Heidelberg (2004)
11. Jain, A.K., Nandakumar, K., Lu, X., Park, U.: Integrating faces, fingerprints, and soft bio-
metric traits for user recognition. In: Maltoni, D., Jain, A.K. (eds.) BioAW 2004. LNCS,
vol. 3087, pp. 259–269. Springer, Heidelberg (2004)
12. Hong, A., Jain, S., Pankanti, S.: Can Multibiometrics improve performance? In: Proceed-
ings of IEEE Workshop on Automatic Identification of Advanced Technologies (1999)
13. Malcangi, M.: Robust Speaker Authentication Based on Combined Speech and Voiceprint
Recognition, proposed to IECCS (2008)
Study of Alpha Peak Fitting by Techniques
Based on Neural Networks
Abstract. There have been many studies which analyze complex alpha spectra
based on numerically fitting the peaks to calculate the activity level of the sam-
ple. In the present work we propose a different approach – the application of
neural network techniques to fit the peaks in alpha spectra. Instead of using a
mathematical function to fit the peak, the fitting is done by a neural network
trained with experimental data corresponding to peaks of different characteris-
tics. We have designed a feed-forward (FF) multi-layer perceptron (MLP)
artificial neural network (ANN), with supervised training based on a back-
propagation (BP) algorithm, trained on the peaks of Polonium, extracted from
many spectra of real samples analyzed in the laboratory. With this method, we
have achieved a fitting procedure that does not introduce any error greater than
the error of measurement, evaluated to be 10%.
1 Introduction
In nature, there are unstable atomic nuclei which disintegrate emitting ionizing radia-
tion, commonly known as radioactivity. One of these types of radiation is the emis-
sion of alpha particles. It is observed in elements heavier than lead in the periodic
table, including uranium, thorium, and polonium. This emission consists of helium
nuclei (two protons and two neutrons) expelled from the parent nuclei with high ki-
netic energies, in the order of MeV (mega-electron-volt). These energies are discrete
and characteristic of the emitting radionuclide. The emission consists therefore basi-
cally of charged particles with a significantly greater mass than electrons. This im-
plies that they can only travel a very short distance until they completely deplete their
energy. Indeed, even a simple sheet of paper can stop them. Even though they are
short-range particles, their determination is frequently necessary to characterize a
material radiologically.
The alpha particle detection process is usually carried out using PIPS (Passivated
Implanted Planar Silicon) detectors in which the energy of the alpha particle is depos-
ited. Due to the strong interaction of alpha particles with matter, to avoid losing energy
in the detection the distance between sample and detector must be as short as possible,
typically a few millimeters, and a certain degree of vacuum is also needed in the detec-
tion chamber. Figure 1 shows a typical polonium alpha spectrum. One observes that
each alpha-emitting radionuclide has a discrete and characteristic emission energy,
which allows its complete identification. However, despite the aforementioned precau-
tions, it is not possible for all the alpha particles to deposit their complete energy in the
detector, and low-energy tails are commonly observed in the spectrum of an alpha
analysis (see Fig. 1). An alpha spectrum consists of a plot of the number of alpha parti-
cles detected, called counts, at defined energy intervals, called channels.
The resolution of the detectors used is in the order of 25 keV. So, the low-energy
tails usually pose no problem if the alpha peaks are well separated, i.e., their energies
are so different that they do not overlap. Unfortunately, that is not always possible.
The content of an alpha-emitting radionuclide in a sample is measured by analysing
the spectrum, specifically by determining the total number of counts (integral) in
several Regions Of Interest (ROIs), defined as intervals of energy in which an alpha
particle peak is detected. These ROIs should be wide enough to include the aforemen-
tioned low-energy tail.
The analysis of alpha spectra is usually carried out using models that represent the
shape of a single-energy alpha peak. In this sense, there are numerous works dealing
with the generation of a function that takes the shape of a theoretical peak and then
fits a combination of them to the real peaks present in the alpha spectrum. Bortels and
Collaers [2] proposed a function based on the convolution of several components, as
also did Westmeier and Van Aarle [7]. Most of these models depend on a number of
parameters that define the function. The fitting procedure optimizes these parameters
to best approximate the real spectrum. The more parameters the function contains, the
better the results obtained, although the optimization process is more complicated and
several types of ambiguities appear.
There are many programs that cover all types of needs in the analysis of alpha
spectra from low to high statistics. Most focus on guaranteeing software performance
that is very stable and reliable (Blaauw et al. [1]). There are also non-commercial
programs whose focus is more on high performance optimization algorithms. Some
examples are FITBOR (Martín Sánchez et al. [4]), ALPHA (Pomme and Sibbens [6]),
and ALFIT (Lozano and Fernandez [3]).
2 Method
First, we selected the set of isolated peaks to form part of the overall training of the
network. These singlet peaks are obtained from the measurement of different real
samples. In a pre-selection process, alpha peaks with too low integrals or a visually
unrepresentative shape, such as double peaks, were discarded. If this information
were included, it would impede the convergence of the network in minimizing the
fitting error, and lead to poor generalization of the shape of an alpha peak.
The spectra selected were obtained from sources of polonium. These spectra con-
tain alpha peaks of different polonium isotopes – 208Po, 209Po, and 210Po (see Fig. 1).
This type of source was chosen because the peaks observed in their alpha spectra are
sufficiently separated as not to overlap, and consequently each one is due to a single
energy emission. They can thus be considered as representative of the shape of a
singlet alpha peak.
We grouped these peaks based on their integrals. The total number of counts of a
peak was therefore the unit of measurement to establish the working range of the
neural network. This range was set between 50 and 5000 counts. The criteria for es-
tablishing the limits of this range were based on experience with the analysis of alpha
spectra. A peak with an integral below 50 counts is considered to have too low statis-
tics for the analysis to be reliable because it implies an uncertainty of about 14% in
the determination of the activity of the radionuclide. Alpha peaks with integrals above
5000 counts are rarely obtained in the measurement of environmental radioactive
samples. The training set must cover the entire working range. Figure 2 shows the
percentage of peaks within each integral interval used to train the network.
In order to extract these peaks from the data files of spectra of the samples as they
were obtained in the laboratory, we modified the software we used for alpha spectra
analysis (AlphaSpec [5]) to allow the export of the data corresponding to selected
peaks to a file in a format compatible with the tool constructed to test and experiment
with the neural networks of this work.
The inputs to the network are a set of numerical values which identify the shape of the
alpha peak obtained from real samples, and are sufficiently descriptive to allow the
network to determine the shape on their basis (see Fig. 3). In particular, we considered
the following 10 characteristics of a peak:
values will be in the range 0 to 1. The output values of the training set peaks had
therefore to be scaled to this range for the correct calculation of the MSE.
3 Results
Multiple training cycles were tested by varying the parameters of the network, using a
tool specifically designed to experiment with different configurations. The network
parameters used were: learning rate (0.009), momentum (0.8), and error goal (0.001). The
network was trained with these parameters using a cross-validation method on 102
peaks of different integrals. The algorithm takes 60% of peaks for the training set,
20% for the validation set, and the rest for the test set.
Table 1 presents the detailed results of the test set, with a 22-neuron hidden layer
configuration, 8395 epochs, and 2 minutes of training (Intel Core 2 Duo, 2.6 GHz).
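A sketch of a comparable training setup, assuming the peak descriptors and scaled integrals have been exported to files (the file names below are hypothetical). Scikit-learn's MLPRegressor with SGD, momentum, and a logistic hidden layer stands in for the FFBP-ANN, and the error parameter is mapped to the optimizer tolerance, which only approximates the original tool.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# X: one row per peak (shape descriptors); y: peak integral scaled to [0, 1].
X = np.load("peak_features.npy")    # hypothetical export from the analysis tool
y = np.load("peak_integrals.npy")

# 60 % training, 20 % validation, 20 % test, as described in the text.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=0.6, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Feed-forward MLP with one 22-neuron sigmoid hidden layer, trained by
# back-propagation (SGD with momentum), using the parameter values quoted above.
net = MLPRegressor(hidden_layer_sizes=(22,), activation="logistic", solver="sgd",
                   learning_rate_init=0.009, momentum=0.8,
                   max_iter=10000, tol=1e-3, random_state=0)
net.fit(X_train, y_train)
print("validation score:", net.score(X_val, y_val))
print("test score:", net.score(X_test, y_test))
```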
Table 1. MAX is the maximum counts in a channel, IEXPECTED is the expected integral, IANN is
the ANN output, and ERROR is the difference between expected integral and the output,
expressed in %
4 Conclusions
The neural network that we have described here is capable of reproducing the shape
of a single alpha peak from real samples without needing the generation of a theoreti-
cal function. The goodness of the result provided by the neural network used was
compared directly with the real value of the integral of these alpha peaks. For 90.4% of
the test set of 21 peaks, disjoint from the training and validation sets, the errors were
below 10%, with a mean error of 5.98%.
The technique is therefore currently being extended, studying its application to the
analysis of multi-emission peaks, and to radionuclides other than polonium.
Acknowledgment
This work has been financed by the Spanish Ministry of Science and Education under
the project number CTM2006-11105/TECNO, entitled “Characterization of the time
evolution of radioactivity in aerosols in a location exempt of a source term”. Also we
are grateful to the Autonomous Government of Extremadura for the “Studentship for
the pre-doctoral formation for researchers (resolved in D.O.E. 130/2007)”, and for the
financial support to the LARUEX research group (GRU09041).
References
1. Blaauw, M., García-Toraño, E., Woods, S., et al.: The 1997 IAEA Intercomparison of
Commercially Available PC-Based Software for Alpha-Particle Spectrometry. Nuclear In-
struments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detec-
tors and Associated Equipment 428, 317–329 (1999)
2. Bortels, G., Collaers, P.: Analytical Function for Fitting Peaks in Alpha-Particle Spectra
from Si Detectors. International Journal of Radiation Applications and Instrumentation. Part
A. Applied Radiation and Isotopes 38, 831–837 (1987)
3. Lozano, J.C., Fernández, F.: ALFIT: A Code for the Analysis of Low Statistic Alpha-
Particle Spectra from Silicon Semiconductor Detectors. Nuclear Instruments and Methods
in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated
Equipment 413, 357–366 (1998)
4. Martín Sánchez, A., Rubio Montero, P., Vera Tomé, F.: FITBOR: A New Program for the
Analysis of Complex Alpha Spectra. Nuclear Instruments and Methods in Physics Re-
search, Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 369,
593–596 (1996)
5. Miranda Carpintero, J., Pérez Utrero, R.: Universidad de Extremadura. Escuela Politécnica:
Desarrollo del software básico de análisis en espectrometría alfa (2006)
6. Pommé, S., Sibbens, G.: A New Off-Line Gain Stabilisation Method Applied to Alpha-
Particle Spectrometry. Advanced Mathematical and Computational Tools in Metrology VI,
327–329 (2004)
7. Westmeier, W., Van Aarle, J.: PC-Based High-Precision Nuclear Spectrometry. Nuclear
Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, De-
tectors and Associated Equipment 286, 439–442 (1990)
Information Enhancement Learning: Local
Enhanced Information to Detect the Importance
of Input Variables in Competitive Learning
Ryotaro Kamimura
1 Introduction
In this paper, we propose a new information-theoretic method in which a network
is enhanced to respond more explicitly to input patterns [1]. We propose here
three types of enhancement: self-enhancement, collective enhancement and local
enhancement. With self-enhancement, a state of a network is self-divided into
an enhanced and relaxed state, and the relaxed state tries to reach the enhanced
state. With collective enhancement, all neurons in a network are collectively
enhanced and respond to input patterns. With self- and collective enhancement,
we can realize self-organizing maps similar to those obtained by the conventional
SOM [2], [3]. Finally, local information is used to detect the importance of some
components, such as input units, variables, competitive units and so on. In this
paper, we focus in particular upon the local enhanced information to detect
important input variables.
Input variable selection or the detection of the importance of variables has
been considered to be one of the most important problems in neural networks as
well as machine learning. In particular, study on feature and variable selection
Fig. 1. Enhancement (b) and relaxation (c) of competitive units for a network itself
in which all competitive units respond uniformly to input units. Because com-
petitive units cannot differentiate between input patterns, no information on
input patterns is stored.
As shown in Figure 1(a), a network is composed of connection weights wjk
from the kth input unit to the jth competitive unit. The kth element of the
sth input pattern is represented by xsk . A neuron output can be defined by the
Gaussian-like function:
$$v_j^s \propto \exp\left( -\frac{1}{2}(x^s - w_j)^T \Sigma^{-1} (x^s - w_j) \right), \qquad (1)$$
where Σ denotes the scaling matrix1 . The enhanced scaling matrix can be defined
by
$$\Sigma = \Sigma_{enh}^{(\alpha)} = (1/\alpha)^2 I, \qquad (2)$$
where I is an identity matrix. When the relaxation is applied, we have
$$\Sigma = \Sigma_{rel}^{(\alpha)} = \alpha^2 I. \qquad (3)$$
2.2 Self-enhancing
We have shown that an initial state can be split into an enhanced and a re-
laxed state. Then, we must decrease the gap between the two states as much as
possible. In the self-enhancing process, the difference between two probabilities
should be as small as possible.
We now present update rules for self-enhancement learning for a general case.
At an enhanced state, competitive units can be computed by
$$v_{j,enh}^s \propto \exp\left( -\frac{1}{2}(x^s - w_j)^T (\Sigma_{enh}^{(\alpha)})^{-1} (x^s - w_j) \right). \qquad (4)$$
Normalizing this output, we have enhanced firing probabilities
$$p_{enh}(j \mid s) = \frac{\exp\left( -\frac{1}{2}(x^s - w_j)^T (\Sigma_{enh}^{(\alpha)})^{-1} (x^s - w_j) \right)}{\sum_{m=1}^{M} \exp\left( -\frac{1}{2}(x^s - w_m)^T (\Sigma_{enh}^{(\alpha)})^{-1} (x^s - w_m) \right)}. \qquad (5)$$
Suppose that p(j|s) denotes the probability of the jth neuron’s firing in a relaxed
state. Then, we should make these probabilities as close as possible to the prob-
abilities of enhanced neurons’ firing. Thus, we have an objective cross entropy
function defined by
$$I = \sum_{s=1}^{S} p(s) \sum_{j=1}^{M} p(j \mid s) \log \frac{p(j \mid s)}{p_{enh}(j \mid s)}, \qquad (6)$$
¹ We used the scaling matrix instead of the ordinary covariance matrix, because the output does not exactly follow the Gaussian function.
where M and S denote the number of competitive units and input patterns,
respectively. In addition, we should decrease quantization errors, defined by
$$E = \sum_{s=1}^{S} p(s) \sum_{j=1}^{M} p(j \mid s)\, \| x^s - w_j \|^2 . \qquad (7)$$
It is possible to differentiate this cross entropy and the quantization error func-
tion to obtain update rules [9], [10]. However, the update rules become compli-
cated, with heavy computation required for computing conditional probabilities.
Fortunately, we can skip the complicated computation of conditional entropy
by introducing the free energy used in statistical mechanics [11], [12], [13], [14],
[15], [16], [17]. Borrowing the concept from statistical mechanics, let us introduce
free energy, or a free energy-like function, defined by
$$F = -2\alpha^2 \sum_{s=1}^{S} p(s) \log \sum_{j=1}^{M} p_{enh}(j \mid s) \exp\left( -\frac{1}{2}(x^s - w_j)^T (\Sigma_{rel}^{(\alpha)})^{-1} (x^s - w_j) \right). \qquad (8)$$
Thus, to decrease the free energy, we must decrease the cross entropy and the
corresponding error function. By differentiating the free energy, we have
$$w_{jk} = \frac{\sum_{s=1}^{S} p(j \mid s)\, x_k^s}{\sum_{s=1}^{S} p(j \mid s)}, \qquad (10)$$
where p(s) is set to 1/S. We should note that, thanks to the excellent work
of Heskes [18], the free energy can be interpreted in the framework of an EM
algorithm.
In the above formulation, we have dealt with a general case of self-enhancement
learning. However, this is a self-enhancement learning version of competitive
learning. In application to competitive learning, an enhanced state is one where
a winner takes all in the extreme case. This is realized by the enhancement
parameter 1/α. On the other hand, a relaxed state is one where competitive
units respond to input patterns almost equally, which is realized by setting the
enhancement parameter to α. The self-enhancement learning, in terms of com-
petitive learning, tries to attain a state where the winner-take-all is predominant.
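A small numerical sketch of the enhanced and relaxed firing probabilities of Eqs. (1)-(5), assuming the diagonal scaling matrices defined above; the random data and the layer sizes are only placeholders.

```python
import numpy as np

def firing_probabilities(X, W, scale):
    """Normalized Gaussian-like competitive-unit outputs (Eq. 5).
    X: (S, d) input patterns, W: (M, d) weight vectors,
    scale: common diagonal value of the scaling matrix
    ((1/alpha)**2 for the enhanced state, alpha**2 for the relaxed one)."""
    sq_dist = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)   # (S, M)
    log_v = -0.5 * sq_dist / scale
    log_v -= log_v.max(axis=1, keepdims=True)        # numerical stability only
    v = np.exp(log_v)
    return v / v.sum(axis=1, keepdims=True)

alpha = 5.0
X = np.random.rand(8, 8)       # 8 input patterns of 8 variables (placeholder)
W = np.random.rand(10, 8)      # 10 competitive units (placeholder)
p_enh = firing_probabilities(X, W, (1.0 / alpha) ** 2)   # near winner-take-all
p_rel = firing_probabilities(X, W, alpha ** 2)           # nearly uniform
```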
(Figure omitted: competitive units in a competitive layer feed a collective layer that produces the collective outputs V_j^s.)
For local enhancement, we focus upon individual components; here, we enhance input
units, and we can then estimate the importance of input variables. Now, suppose that
the tth input unit is a target for enhancement. For this, we use a scaling matrix Σ^(t,α),
meaning that the tth unit is a target with the enhancement parameter α. The scaling
matrix Σ^(t,α) is defined by
$$\Sigma^{(t,\alpha)} = I_t \Sigma_{enh}^{(\alpha)} + (I - I_t) \Sigma_{rel}^{(\alpha)}, \qquad (16)$$
Thus, when the tth input unit is a target for enhancement, we have competitive
unit outputs v_{jt}^s computed by
$$v_{jt}^s \propto \exp\left( -\frac{1}{2}(x^s - w_j)^T (\Sigma^{(t,\alpha)})^{-1} (x^s - w_j) \right). \qquad (18)$$
This means that, when the tth input unit is a target for enhancement, the
tth element of the diagonal scaling matrix is set to (1/α)2 , while all the other
elements in the diagonal are set to α2 . In this case, we can detect the importance
of input variables. We can normalize these activations for probabilities,
$$p^t(j \mid s) = \frac{\exp\left( -\frac{1}{2}(x^s - w_j)^T (\Sigma^{(t,\alpha)})^{-1} (x^s - w_j) \right)}{\sum_{m=1}^{M} \exp\left( -\frac{1}{2}(x^s - w_m)^T (\Sigma^{(t,\alpha)})^{-1} (x^s - w_m) \right)}. \qquad (19)$$
And we have
$$p^t(j) = \sum_{s=1}^{S} p(s)\, p^t(j \mid s). \qquad (20)$$
By using these probabilities, we have local enhanced information for the tth
input unit
$$I_t = \sum_{s=1}^{S} p(s) \sum_{j=1}^{M} p^t(j \mid s) \log \frac{p^t(j \mid s)}{p^t(j)}. \qquad (21)$$
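A sketch of how the local enhanced information of Eqs. (18)-(21) could be computed for each input variable, with p(s) = 1/S as in the text; data and weights are placeholders, and the shift inside the exponential is only for numerical stability.

```python
import numpy as np

def local_enhanced_information(X, W, alpha, t):
    """Local enhanced information I_t for input variable t (Eqs. 18-21):
    the t-th diagonal element of the scaling matrix is (1/alpha)**2,
    all other diagonal elements are alpha**2."""
    S, d = X.shape
    diag = np.full(d, alpha ** 2)
    diag[t] = (1.0 / alpha) ** 2
    diff = X[:, None, :] - W[None, :, :]                  # (S, M, d)
    dist = (diff ** 2 / diag).sum(axis=2)                 # scaled squared distances
    v = np.exp(-0.5 * (dist - dist.min(axis=1, keepdims=True)))
    p_ts = v / v.sum(axis=1, keepdims=True)               # p^t(j|s), Eq. (19)
    p_t = p_ts.mean(axis=0)                               # p^t(j), Eq. (20) with p(s)=1/S
    return float((p_ts * np.log(p_ts / p_t)).sum() / S)   # Eq. (21) with p(s)=1/S

# Rank input variables by their local enhanced information.
alpha = 5.0
X = np.random.rand(8, 8)
W = np.random.rand(10, 8)
info = [local_enhanced_information(X, W, alpha, t) for t in range(X.shape[1])]
```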
Fig. 3. Input patterns (a) and quantization errors (b) by enhancement learning in blue
and by SOM in red. Black and white squares represent one and zero, respectively.
(Final quantization errors: 0.445 for SOM and 0.428 for enhancement learning.)
Fig. 4. U-matrix and labels obtained by SOM (a) and enhancement learning (b)
(Figures omitted: enhanced information plotted against the enhancement parameter α, for α from 2 to 10, and the corresponding information profiles.)
the magnitude of the ratio is significantly increased. On the other hand, even if
the enhancement parameter α is increased from four to seven, little difference
can be seen, and all we can see is the increase of the magnitude of enhanced
information.
Figures 7(a) and (b) show enhanced information and quantization errors, re-
spectively. As the input unit moves to the center of the input pattern, the en-
hanced information is gradually increased. Then, Figure 7(b) shows quantization
Fig. 7. Local enhanced information (a) and quantization errors (b). The correlation coefficient
is -0.999.
Fig. 8. U-matrices and labels for connection weights with only a unit with maximum
information (No. 4) (a) and with minimum information (No. 1) (b)
errors; we can see that errors are decreased as the input units move to the cen-
ter. These results are natural, because the input patterns are symmetrical to the
center. The correlation between enhanced information and quantization errors
becomes -0.999, and this means that enhanced information is closely correlated
with quantization errors.
Figures 8 (a) and (b) show examples of U-matrices and labels for input No.
4 (maximum variance) and for input unit No. 1 (minimum variance). As can
be seen in Figure 8(a), input patterns in the U-matrix are clearly divided into
two parts by the strongly warmer-colored boundaries in the middle. On the other
hand, for the input unit with the minimum variance, the U-matrix is significantly
different from the U-matrix obtained by SOM and information enhancement. In
addition, the labels in Figure 8(b) show a map with a smaller number of labels,
meaning that the number of empty cells is significantly increased.
4 Conclusion
In this paper, we have proposed information enhancement learning to realize self-
organizing maps, detect the importance of input variables and find the optimal
input variables. In our information enhancement learning, there are three types
of information, namely, self-enhancement, collective enhancement and local en-
hancement. With self-enhancement and collective enhancement, we can realize
self-organizing maps. In addition, we use local enhanced information to detect
the importance of input units or input variables. An input variable is considered
to be optimal when the variance of local information is maximized. We applied
the method to an artificial data problem. In the problem, information enhance-
ment learning was able to produce self-organizing maps close to those produced
by the conventional SOM in terms of quantization errors and U-matrices. In ad-
dition, the importance of input variables detected by local enhanced information
corresponded to the importance obtained by directly computing errors.
One of the problems of this method is that of how to find optimal local
information. We have proposed the ratio of the variance to the average as a
criterion. However, we have found that this ratio is not necessarily valid for all
cases. Thus, a more general approach to this optimality problem is necessary.
In addition, to evaluate the performance of the method, we should apply it to
larger and more practical problems.
Acknowledgment
The author is very grateful to two reviewers and Mitali Das for her valuable
comments.
References
1. Kamimura, R.: Feature discovery by enhancement and relaxation of competi-
tive units. In: Fyfe, C., Kim, D., Lee, S.-Y., Yin, H. (eds.) IDEAL 2008. LNCS,
vol. 5326, pp. 148–155. Springer, Heidelberg (2008)
2. Kohonen, T.: The self-organizing maps. Proceedings of the IEEE 78(9), 1464–1480
(1990)
3. Kohonen, T.: Self-Organizing Maps. Springer, Heidelberg (1995)
4. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. Journal
of Machine Learning Research 3, 1157–1182 (2003)
5. Rakotomamonjy, A.: Variable selection using svm-based criteria. Journal of Ma-
chine Learning Research 3, 1357–1370 (2003)
6. Perkins, S., Lacker, K., Theiler, J.: Grafting: Fast, incremental feature selection
by gradient descent in function space. Journal of Machine Learning Research 3,
1333–1356 (2003)
7. Reunanen, J.: Overfitting in making comparison between variable selection meth-
ods. Journal of Machine Learning Research 3, 1371–1382 (2003)
8. Castellano, G., Fanelli, A.M.: Variable selection using neural-network models. Neu-
rocomputing 31, 1–13 (1999)
9. Kamimura, R.: Cooperative information maximization with gaussian activation
functions for self-organizing maps. IEEE Transactions on Neural Networks 17(4),
909–919 (2006)
10. Linsker, R.: How to generate ordered maps by maximizing the mutual information
between input and output. Neural Computation 1, 402–411 (1989)
11. Ueda, N., Nakano, R.: Deterministic annealing variant of the em algorithm. In:
Advances in Neural Information Processing Systems, pp. 545–552 (1995)
12. Rose, K., Gurewitz, E., Fox, G.C.: Statistical mechanics and phase transition in
clustering. Physical review letters 65(8), 945–948 (1990)
13. Martinez, T.M., Berkovich, S.G., Schulten, K.J.: Neural-gas network for vector
quantization and its application to time-series prediction. IEEE transactions on
neural networks 4(4), 558–569 (1993)
14. Erdogmus, D., Principe, J.: Lower and upper bounds for misclassification probabil-
ity based on renyi’s information. Journal of VLSI signal processing systems 37(2/3),
305–317 (2004)
15. Torkkola, K.: Feature extraction by non-parametric mutual information maximiza-
tion. Journal of Machine Learning Research 3, 1415–1438 (2003)
16. Kamimura, R.: Free energy-based competitive learning for mutual information
maximization. In: Proceedings of IEEE Conference on Systems, Man, and Cy-
bernetics, pp. 223–227 (2008)
17. Kamimura, R.: Free energy-based competitive learning for self-organizing maps.
In: Proceedings of Artificial Intelligence and Applications, pp. 414–419 (2008)
18. Heskes, T.: Self-organizing maps, vector quantization, and mixture modeling. IEEE
Transactions on Neural Networks 12(6), 1299–1305 (2001)
19. Vesanto, J., Himberg, J., Alhoniemi, E., Parhankangas, J.: SOM Toolbox for Matlab
5. Tech. Rep. A57, Helsinki University of Technology (2000)
Flash Flood Forecasting by Statistical Learning in the
Absence of Rainfall Forecast: A Case Study
Abstract. The feasibility of flash flood forecasting without making use of rain-
fall predictions is investigated. After a presentation of the “Cévenol” flash
floods, which caused 1.2 billion Euros of economic damage and 22 fatalities
in 2002, the difficulties incurred in the forecasting of such events are analyzed,
with emphasis on the nature of the database and the origins of measurement
noise. The high level of noise in water level measurements raises a real chal-
lenge. For this reason, two regularization methods have been investigated and
compared: early stopping and weight decay. It appears that regularization by
early stopping provides networks with lower complexity and more accurate
predicted hydrographs than regularization by weight decay. Satisfactory results
can thus be obtained up to a forecasting horizon of three hours, thereby allow-
ing an early warning of the populations.
1 Introduction
The need for accurate predictions of flash floods has been highlighted by the recent
occurrences of catastrophic floods such as in Vaison-la-Romaine (1991), Nîmes
(1988), Gardons (2002), Arles (2003), to name only a few, located in the south of
France. These disasters result from intense rainfalls on small (some hundreds of km2),
high-slope watersheds, resulting in flows of thousands of m3/s with times of concen-
tration of a few hours only. The death toll (over 100) in these circumstances in the
southeast of France, and the cost of more than 1.2 billion Euros in 2002, showed that
the design of a reliable tool to forecast such phenomena is mandatory.
Faced with this major risk, the French Ministry in charge of Sustainable Develop-
ment (currently MEEDADT) created in 2003 the national center for flood forecasting
and warning SCHAPI (Service Central d’Hydrométéorologie et d’Appui à la Prévi-
sion des Inondations), which is in charge of the “vigicrue” surveillance service. The
Gardon d’Anduze, in the South-East of France, has been chosen by this Center as a
pilot site to compare the flash floods (concentration time of 2h-4h) forecasting mod-
els. In this context, this paper describes the study of neural network models to build
an efficient real time flash flood forecaster.
Real time flash flood forecasting is usually addressed by coupling complex atmos-
pheric and hydrologic models. The complexity generated by this coupling is huge, and
the performances of the present models are limited by several factors: the observations
may not be accurate enough for these models to produce useful predictions, the models
may be biased by a lack of observations on the ground at an appropriate scale, and the
models themselves do not take into account the whole complexity of the phenomena.
An alternative approach consists of capitalizing on the available data in order to
build models by statistical machine learning. This will reduce the computational bur-
den and free the model designers from the limitations of physical modeling when the
phenomena are too complex, or when the estimation of physical parameters is difficult.
Due (i) to the lack of accurate estimations of rainfalls, and (ii) to the high noise level
in water level measurements, and in order to guarantee the best possible generalization
capabilities, complexity control is a particularly critical issue. Two traditional regulari-
zation methods have been investigated: early stopping and weight decay. After careful
variable and model selection, the ability of models, obtained by either regularization
method, to predict the most dramatic event of the database (September 2002) is as-
sessed. Hydrographs are displayed and comparisons between the results of both meth-
ods are performed. Finally, we conclude that, in the present case, training with early
stopping provides networks with lower complexity, longer training but more satisfactory
predictions.
2 Problem Statement
2.1 Flash Flood Forecasting
forecasting operates on Grid technologies [6]. From the viewpoint of the end users,
the very short computation times involved in the execution of neural network algo-
rithms once training has been performed make them very attractive as components of
a warning system, without having to resort to grid computing. Another advantage is
that any nonlinear, dynamical behavior may be modeled by neural networks, particu-
larly the relation between the rainfall up to time t and the discharge at time t+f. Fore-
cast is thus possible without estimating future rainfalls. Although neural networks
were applied previously to the forecasting of outflows at several forecasting horizons
[7] [8], or for water supply management in mountainous areas [9], they were never
applied to events of such speed and intensity.
In the present paper, we show that, despite the difficulty of the task, the evolution
of the water level at Anduze can be forecast up to 3 hours ahead of time, without any
assumption about the evolution of future rainfall, for the catastrophic, most intense
event of the database, namely the event of September 2002.
In the present section, we focus on the nature and quality of the information available
in the database.
Rainfall measurements are performed with rain gauges. These are very accurate
sensors, which broadcast the water level every five minutes; however, they provide
very local information, so that the heterogeneity of the rainfalls is an important source
of inaccuracy: for example, for the event of September 2002, the cumulated rainfalls
were 3 times as large in Anduze as in Saint-Jean-du-Gard, which is only ten kilome-
ters away. Therefore, the most important rainfall may be located between rain gauges,
thereby causing inaccurate estimates due to the too large mesh of the rain gauge net-
work. For this reason, radar acquisition of rainfalls with a resolution of 1 km2 has been
performed since 2002, but complete, homogeneous sequences are not yet available for
all events.
Water level measurements are available with several sampling periods: 1 hour from
1994 to 2002, and 5 minutes after this date. However, because of real time con-
straints, the sampling period used in this work is T = 30 min, although variance analy-
sis has shown that 15 minutes would be more appropriate. Thus for events recorded
before 2002, the peak value is probably underestimated, possibly by 10% to 30%. For
the event of 2002, the error results from an accident: instrumentation was damaged
during the event, and the water level was estimated a posteriori.
Therefore, the unreliability of the available data makes the forecasting of such
catastrophic events a challenging task.
3 Model Design
3.1 Definition of the Model
Given a forecasting horizon f, the model is intended to forecast, at discrete time kT
(k ∈ N+), the water level at Anduze at time (k + f)T (f ∈ N+).
The available information for the Anduze catchment is the water level at the An-
duze station, the rainfalls at 6 rain gauges delivering the cumulated rainfalls over the
sampling period (30 min), and the soil moisture (Soil Water Index, given by the ISBA
(Interactions between Soil, Biosphere, and Atmosphere) model [10]).
The 6 rain gauges: Barre-des-Cévennes, Saint-Roman de Tousques, Saumane, Mi-
alet, Soudorgues and Anduze are spatially well distributed and one can consider that
each of them is important. The information about rainfalls is conveyed to the network
as sliding windows. All sliding windows have equal width w, whose optimal value is
chosen as described in section 3.2. Different values were found, depending on the
forecasting horizon (Table 2). Similarly, the information about past water levels is
conveyed as sliding windows, whose optimal width was found to be r = 2, irrespec-
tive of the forecasting horizon.
Table 2. Rainfall window width w for each forecasting horizon f
Forecasting horizon (f)   0.5 hour   1 hour   2 hours   3 hours   4 hours   5 hours
w                         2.5        3        3         2         0.5       0.5
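A sketch of how the model inputs described above can be assembled from the raw series, treating w and r as numbers of sampling periods; the array layout and variable names are assumptions, not the authors' code.

```python
import numpy as np

def build_input(rain, level, k, w, r):
    """Input vector at discrete time k*T: the last w cumulated-rainfall values
    of each of the 6 rain gauges and the last r water-level values at Anduze.
    rain: array (n_steps, 6); level: array (n_steps,)."""
    gauges = [rain[k - w + 1:k + 1, i] for i in range(rain.shape[1])]
    levels = level[k - r + 1:k + 1]
    return np.concatenate(gauges + [levels])

# Example with placeholder data: a 3-sample rainfall window and r = 2.
rain = np.random.rand(100, 6)
level = np.random.rand(100)
x_k = build_input(rain, level, k=50, w=3, r=2)   # length 6*3 + 2 = 20
```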
Because of the non-linearity of the physical process, we take advantage of the uni-
versal approximation property of a neural network with one hidden layer of sigmoid
neurons and a linear output neuron [11]. The water level at time t+f is forecast from
(i) the measured rainfalls in a sliding window of width w, and (ii) from the measured
water levels in a sliding window of width r. The training data is chosen (see section
3.2) in the set of flood sequences recorded over several years (1994-2007), described
in Table 1.
Since the model takes into account measured past values of the water level, during
the same flood, the available information about soil moisture is not explicitly con-
veyed to the model since it is implicitly present in the input data.
One of the events was set apart for use as a test set (see section 3.3); another event
was selected for use either as an early stopping set when the latter regularization tech-
nique was used, or as an additional test set when regularization was performed by
weight decay (see section 4 for more details on regularization). In the latter case, it
was also set apart and used neither for training nor for model selection.
Fig. 1. The model is fed by cumulated rainfall measurements provided by the 6 rain gauges
over a temporal window of width w: ui(kT) is the vector of the w rainfall level measurements
provided by rain gauge i (i = 1... 6) at times kT, (k−1)T, ..., (k−w+1)T. Water level measure-
ments, over a sliding window of width r, are also input to the model. The output is the forecast
water level, f sampling periods ahead.
where MSEi is the mean squared forecasting error of the model, for the time sequence
recorded during event i of the validation set.
In the present case, K = 4 events were chosen: the 1995, 1997, and 2006 very in-
tense events reported in Table 1. The 2002 event was selected as a test set because it
is typical of events whose forecasting is crucial for early warning. In addition, this
event is also more intense than those used for training and validation: it is a difficult
test for the models. The other nine floods were always present in the training set.
The above procedure was used for complexity selection, spanning the space of
rainfall window width w, water level window width r and number of hidden neurons
NC. For each model, 100 different parameter initializations were performed. Complex-
ity selection was performed separately for each forecasting horizon.
After selecting the appropriate complexity (i.e. after selecting the appropriate val-
ues of w, r and NC) for a given forecasting horizon f, a final model was trained for that
horizon, from thirteen sequences: all floods except the test sequence and the early
stopping sequence, or all floods except the two test sequences when weight decay
regularization was performed. Its performance was assessed on the test sequence(s).
3.3 Training
The usual squared error cost function was minimized by the Levenberg-Marquardt
algorithm [13] during training, after computation of the gradient of the cost function
by backpropagation.
4 Regularization
In addition to performing input and model selection by cross-validation, regulariza-
tion methods were applied during training. Two different methods were assessed in
this study: weight decay and early stopping.
Weight decay prevents parameters from taking excessive values (resulting in overfit-
ting), by introducing a term in the cost function that penalizes large parameter values;
this idea is implemented in a systematic fashion in Support Vector Machines. In the
present case, the new cost function is expressed as:
$$J = \gamma \, MSE + (1 - \gamma)\, \|\theta\|^2 . \qquad (2)$$
where MSE is the usual mean squared prediction error, θ is the vector of parameters and
γ is the hyperparameter that controls the balance between the terms of the cost function.
Similarly to model selection, the hyperparameter γ was selected by cross validation,
for each forecasting horizon, for γ varying from 0.5 to 0.95 with an increment step of
0.05. Table 3 shows the optimal value of γ obtained for each forecasting horizon.
$$P = 1 - \frac{\sum_{\text{test sequence}} \left( y(t+f) - \hat{y}(t+f) \right)^2}{\sum_{\text{test sequence}} \left( y(t) - y(t+f) \right)^2}. \qquad (3)$$
where y is the observed water level and ŷ is the estimated water level. P is equal to 0 if
the predictor is perfectly dumb, i.e. it always predicts that the future value is equal to the
present one, and it is equal to 1 if the predictor provides perfectly accurate forecasts.
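A small sketch of the persistency criterion of Eq. (3), assuming y_hat[t] holds the forecast issued for time t; the series-alignment convention is an assumption.

```python
import numpy as np

def persistency(y, y_hat, f):
    """Persistency criterion of Eq. (3): 1 minus the ratio of the model's
    squared forecasting error to the error of the naive predictor that
    forecasts y(t + f) = y(t) over the test sequence."""
    num = np.sum((y[f:] - y_hat[f:]) ** 2)    # model error at horizon f
    den = np.sum((y[:-f] - y[f:]) ** 2)       # error of the "dumb" predictor
    return 1.0 - num / den
```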
Tables 4 and 5 describe the models and the accuracy of their predictions on the
2002 flood.
In the present case, early stopping provides consistently more parsimonious models
than weight decay, with consistently higher values of the determination criterion.
However, the persistency criterion is larger for models obtained with weight decay.
Figures 2 and 3 show the predicted and observed curves for the test sequence (2002
event). Given the difficulty of the task, these results are extremely encouraging, since
they show that the model would have allowed the public services to issue early warn-
ings to the population if it had been available during that event.
6 Conclusion
Flash flood forecasting is a very challenging task due to high variability and noise in
the data, especially when no rainfall forecast is available. In the present study, we
have shown the feasibility of forecasting the catastrophic event of September 2002 in
Anduze with an accuracy and forecasting horizon that are compatible with an early
warning of the populations.
This requires a careful methodology for model selection and regularization; it is
shown that early stopping and weight decay result in different generalization capabili-
ties, and that, in this specific case, early stopping provides more satisfactory results on
the test set. This is not claimed to be a general result, but it shows that a variety of
methods must be used in order to solve such difficult problems satisfactorily.
From the viewpoint of hydrology, this methodology should easily be applied to
small (less than 1000 km2), fast (concentration time less than 10 h) basins providing
only rainfalls and water level. Because no exogenous data is necessary, the method
should be applicable to many European mountainous watersheds.
Acknowledgments. The authors acknowledge the SCHAPI and its regional service
SPC of Gard for their collaboration on this exciting application.
References
1. European Flood Forecasting System (2003), http://effs.wldelft.nl/
2. PREVIEW (2008), http://www.preview-risk.com
3. Le Lay, M., Saulnier, G.-M.: Exploring the Signature of Climate and Landscape Spatial
Variabilities in Flash Flood Events: Case of the 8–9 September 2002 Cévennes-Vivarais
Catastrophic Event. Geophysical Research Letters 34, 5 page (2007)
4. Taramasso, A.C., Gabellani, S., Marsigli, C., Montani, A., Paccagnella, T., Parodi, A.: Op-
erational flash-flood forecasting chain: an application to the Hydroptimet test cases. Geo-
physical Research Abstracts 7, 9–14 (2005)
5. Jasper, K., Gurtz, J., Lang, H.: Advanced flood forecasting in Alpine watersheds by cou-
pling meteorological observations and forecasts with a distributed hydrological model.
Journal of Hydrology 267, 40–52 (2002)
6. CrossGrid (2005), http://www.eu-crossgrid.org
7. Zealand, C.M., Burn, D.H., Simonovic, S.P.: Short term streamflow forecasting using arti-
ficial neural networks. Journal of Hydrology 214, 32–48 (1999)
8. Schmitz, G.H., Cullmann, J.: PAI-OFF: A new proposal for online flood forecasting in
flash flood prone catchments. Journal of Hydrology 1, 1–14 (2008)
9. Iliadis, S.L., Maris, F.: An artificial neural networks model for mountainous water-
resources management: the case of Cyprus mountainous watersheds. Environmental Mod-
elling & Software 22, 1066–1072 (2007)
10. Noilhan, J., Mahfouf, J.F.: The ISBA land surface parameterization scheme. Global and
Planetary Change 13, 145–159 (1996)
11. Hornik, K., Stinchcombe, M., White, H.: Multilayer Feedforward Networks Are Universal
Approximators. Neural Networks 2, 359–366 (1989)
12. Dreyfus, G.: Neural Networks, Methodology and Applications. Springer, Heidelberg
(2005)
13. Hagan, M.-T., Menhaj, M.-B.: Training feedforward networks with the Marquardt Algo-
rithm. IEEE Transaction on Neural Networks 5(6), 989–993 (1994)
14. Sjöberg, J., Ljung, L.: Overtraining, regularization, and searching for a minimum, with ap-
plication to neural networks. International Journal of Control 62(6), 1391–1407 (1995)
15. Kitanidis, P.K., Bras, R.L.: Real-time forecasting with a conceptual hydrologic model: 2
applications and results. Water Resour. Res. 16, 1034–1044 (1980)
An Improved Algorithm for SVMs Classification
of Imbalanced Data Sets
Cristiano Leite Castro, Mateus Araujo Carvalho, and Antônio Padua Braga
1 Introduction
Since their introduction by V. Vapnik and coworkers [1], [2], [3], Support Vector
Machines (SVMs) have been successfully applied to many pattern recognition
real world problems. SVMs are based on Vapnik-Chervonenkis’ theory and the
structural risk minimization principle (SRM) [2], [4] which aims to obtain a clas-
sifier with high generalization performance through minimization of the global
training error and the complexity of the learning machine. However, it is well
established that in applications with imbalanced data sets [5], where the training
examples of the target class (minority) are outnumbered by the training exam-
ples of the non-target class (majority), the SVM classifier performance becomes
limited. This probably occurs because the global training error considers differ-
ent errors as equally important assuming that the class prior distributions are
relatively balanced [6]. In the case of most real world problems, when the im-
balanced class ratio is huge, one can observe that the separation surface learned
by SVMs is skewed toward the minority class [5]. Consequently, test examples
belonging to the small class are more often misclassified than those belonging to
the prevalent class [7].
In order to improve SVMs performance on applications with imbalanced class
distributions, we designed the Boundary Elimination and Domination Algorithm
(BED). It eliminates outliers and noisy examples of the majority class in lower
density areas and generates new synthetic examples of the minority class by
considering both the analyzed example, and its k minority nearest neighbors. By
increasing the number of minority examples in areas near the decision boundary,
the representativity of support vectors of this class is improved. Moreover, the
BED has parameters that allow the control of the resampling process intensity
and can be adjusted according to the level of imbalance and overlapping of the
problem.
In our experiments, we used several real world data sets extracted from UCI
Repository [8] with different degrees of class imbalance. We compared the BED
algorithm with Synthetic Minority Oversampling Technique (SMOTE) [9], a
popular oversampling strategy. In both algorithms, we used SVMs as base classi-
fiers. The performances were evaluated using appropriate metrics for imbalanced
classification such as F-measure [10], G-mean [11] and ROC (Receiver Operating
Characteristics) curves [12].
This paper is organized as follows: Section 2 reviews the SVMs learning algo-
rithm and previous solutions to the learning problem with imbalanced data sets,
specially in the context of SVMs. Section 3, presents our approach to the prob-
lem and describes the BED algorithm. Section 4 describes how the experiments
were performed and the results obtained. Finally, Section 5 is the conclusion.
2 Background
2.1 Support Vector Machines
In their original formulation [1], [2], SVMs were designed to estimate a linear
function f(x) = sgn(w · x + b) of parameters w ∈ ℝ^d and b ∈ ℝ, using only
a training set drawn i.i.d. according to an unknown probability distribution
P(x, y). This training set is a finite set of samples,
$$(x_1, y_1), \cdots, (x_n, y_n), \qquad (1)$$
where x_i ∈ ℝ^d and y_i ∈ {−1, 1}. The SVMs learning aims to find the hyperplane
which gives the largest separating margin between the two classes. For a linearly
separable training set, the margin ρ is defined as euclidean distance between
the separating hyperplane and the closest training examples. Thus, the learning
problem can be stated as follows: find w and b that maximize the margin while
ensuring that all the training samples are correctly classified,
$$\min_{(w,b)} \ \frac{1}{2}\|w\|^2 \qquad (2)$$
$$\text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1, \quad i = 1, \ldots, n. \qquad (3)$$
For the non-linearly separable case, slack variables εi are introduced to allow
for some classification errors (soft-margin hyperplane) [3]. If a training example is
located inside the margin or on the wrong side of the hyperplane, its corresponding
ε_i is greater than 0. The sum $\sum_{i=1}^{n} \varepsilon_i$ corresponds to an upper bound on the number of
training errors. Thus, the optimal hyperplane is obtained by solving the following
constrained (primal) optimization problem,
$$\min_{(w,b,\varepsilon_i)} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \varepsilon_i \qquad (4)$$
$$\text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1 - \varepsilon_i, \quad i = 1, \ldots, n, \qquad (5)$$
$$\varepsilon_i \ge 0, \quad i = 1, \ldots, n, \qquad (6)$$
where the constant C > 0, controls the trade-off between the margin size and
the misclassified examples. Instead of solving the primal problem directly, one
considers the following dual formulation,
$$\max_{\alpha} \ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} y_i y_j \alpha_i \alpha_j \, x_i \cdot x_j \qquad (7)$$
$$\text{s.t.} \quad 0 \le \alpha_i \le C, \quad i = 1, \ldots, n, \qquad (8)$$
$$\sum_{i=1}^{n} \alpha_i y_i = 0. \qquad (9)$$
Solving this dual problem yields the Lagrange multipliers α_i, whose sizes are limited
by the box constraints (α_i ≤ C); the parameter b can be obtained from any training
example (support vector) with a non-zero corresponding α_i.
This leads to the following decision function,
n
f (xj ) = sgn yi αi xi · xj + b . (10)
i=1
The SVM formulation presented so far is limited to linear decision surfaces in
input space, which are definitely not appropriate for many classification tasks.
The extension to more complex decision surfaces is conceptually quite simple
and, is done by mapping the data into a higher dimensional feature space F ,
where the problem becomes linear. More precisely, a non-linear SVM first maps
the input vectors by Φ : x → Φ (x), and then estimates a separating hyperplane
in F ,
$$f(x) = \mathrm{sgn}\left( \Phi(x) \cdot w + b \right). \qquad (11)$$
It can be observed, in (7) and (10), that the input vectors are only involved
through their inner product x_i · x_j. Thus, to map the data it is not necessary to
consider the non-linear function Φ in explicit form; only inner products in the feature
space F need to be computed. In this context, a kernel is defined as a way to directly
compute this product [4]. A kernel is a function K such that, for every pair x, x′ in
input space, K(x, x′) = Φ(x) · Φ(x′).
Two main approaches have been designed to address the learning problem with
imbalanced data sets: algorithmic and data preprocessing [13]. In the algorith-
mic approach, learning algorithms are adapted to improve performance of the
minority (positive) class. In the context of SVMs, [14] and [15] proposed tech-
niques to modify the threshold (parameter b) in the decision function given by
the equation (10). In [16] and [17], the error for positive examples was distin-
guished from the error for negative examples by using different constants C +
and C − . The ratio C + /C − is used to control the trade-off between the number
of false negatives and false positives. This technique is known by Asymmetric
Misclassification Costs SVMs [17].
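The same asymmetric-cost idea can be reproduced with off-the-shelf SVM implementations that weight the penalty C per class. The sketch below uses scikit-learn's SVC, which is not the implementation evaluated in this paper, and the imbalance ratio is a made-up value.

```python
from sklearn.svm import SVC

# Penalize errors on the positive (minority) class more heavily:
# effectively C+ = C * w_pos and C- = C * w_neg.
imbalance_ratio = 10            # assumed: negatives are 10x more frequent
clf = SVC(kernel="rbf", C=1.0, gamma="scale",
          class_weight={1: imbalance_ratio, -1: 1})
# clf.fit(X_train, y_train); clf.predict(X_test)
```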
Another algorithmic approach to improve SVMs on imbalanced classification
is to modify the employed kernel. Thus, based on kernel-target alignment al-
gorithms [18], Kandola and Taylor [19] assigned different alignment targets to
positive and negative examples. In the same direction, [5] proposed the kernel
boundary alignment algorithm (KBA) which adapts the decision function toward
the majority class by modifying the kernel matrix.
In the data preprocessing approach, the objective is to balance the class dis-
tribution by resampling the data in input space, including oversampling exam-
ples of the minority class and undersampling examples of the majority class.
Oversampling works by duplicating pre-existing examples (oversampling with
replacement) or generating new synthetic data which is usually obtained by
interpolating. For instance, in the SMOTE algorithm [9], for each minority ex-
ample, its nearest minority neighbors are identified and new minority examples
are created and placed randomly between the example and the neighbors.
Undersampling involves the elimination of class majority examples. The ex-
amples to be eliminated can be selected randomly (random undersampling)
or through some prior information (informative undersampling). The one-sided
¹ Kernel functions used in this work:
Linear kernel: K(x, x′) = x · x′
RBF kernel: K(x, x′) = exp(−‖x − x′‖² / (2r²)).
selection proposed by Kubat and Matwin [11], for instance, is an informative un-
dersampling approach which removes noisy, borderline, and redundant majority
examples.
Previous data preprocessing strategies that aim to improve the SVMs learn-
ing on imbalanced data sets include the following: [20] combined the SMOTE
algorithm with Asymmetric Misclassification Costs SVMs mentioned earlier. [21]
used SMOTE and also random undersampling for SVM learning on intestinal-
contraction-detection task. Recently, [22] proposed the Granular SVM - Repet-
itive Undersampling algorithm (GSVM-RU) whose objective is to minimize the
negative effect of information loss in the undersampling process.
where x_i^c corresponds to the ith training example of class C^c. The symbol
c = {+, −} indicates whether the example belongs to the positive (minority) class or
the negative (majority) class, respectively. The n_C+ value corresponds to the number
of positive neighbors of x_i^c. For a given positive example, cs(x_i^+) evaluates the
proportion of positive examples among the k nearest neighbors of x_i^+. Therefore, if
cs(x_i^+) ≈ 1, one can state that x_i^+ lies in a region with a high density of positive
examples. Equivalent definitions hold for the negative class.
For majority class examples (x_i^-), the credibility score cs(x_i^-) is used to find
noisy examples that occupy class-overlapping areas and also isolated examples lying
in minority-class regions. Thus, BED establishes a rule to detect and eliminate these
examples. This rule depends on the maxtol parameter and is given by:
if cs(x_i^-) < maxtol → x_i^- is eliminated,
if cs(x_i^-) ≥ maxtol → x_i^- is not eliminated.
Otherwise, if cs(x_i^+) is not valid, x_i^+ is considered an isolated or a noisy example
and a new example should not be created around it. The rule which evaluates the
positive example is based on the mintol parameter and is given by:
if cs(x_i^+) < mintol → no new example is created around x_i^+,
if cs(x_i^+) ≥ mintol → a new synthetic example x̂^+ is created.
The parameter mintol defines the validity degree of positive examples in the input
space and, like the maxtol parameter, can take values from 0 to 1.
The lower the mintol value the higher the probability of a positive example on
the class overlapping area to be considered valid and used to generate a new
synthetic example x̂+ . mintol values however, should not be very close to 0 in
order to ensure that isolated examples belonging to the C + do not generate new
examples. The mintol adjustment should also be done by the user according to
the level of class imbalance and overlapping.
At each iteration of the BED algorithm, a new training example is analyzed. The algorithm ends when the classes become balanced. The effect caused by the BED algorithm on an imbalanced training set is illustrated in Fig. 1: majority examples in regions where minority examples prevail are eliminated, while new minority examples improve the representation of these regions. It is expected, therefore, that our resampling strategy allows an SVM classifier to obtain a separation surface that maximizes the number of correct positive classifications. Moreover, it is important to notice that BED can be used with any classification algorithm.
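The rules above can be condensed into a short sketch. The following code is our own illustrative reading of BED, not the authors' implementation: the credibility score is taken as the fraction of same-class examples among the k nearest neighbors, and the way synthetic positives are generated (interpolation towards another positive example) is an assumption.

import numpy as np

def credibility_scores(X, y, k=5):
    """cs(x_i): fraction of the k nearest neighbors sharing x_i's class."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]
    return np.array([(y[nn[i]] == y[i]).mean() for i in range(len(X))])

def bed_resample(X, y, maxtol=0.6, mintol=0.4, k=5, rng=None):
    """One pass of BED-style resampling: y == +1 minority, y == -1 majority."""
    rng = np.random.default_rng(rng)
    cs = credibility_scores(X, y, k)
    keep = (y == 1) | (cs >= maxtol)          # eliminate low-credibility negatives
    X_new, y_new = list(X[keep]), list(y[keep])
    pos = np.where(y == 1)[0]
    for i in pos:
        if cs[i] >= mintol:                   # valid positive -> create a synthetic one
            j = pos[rng.integers(len(pos))]
            gap = rng.random()
            X_new.append(X[i] + gap * (X[j] - X[i]))
            y_new.append(1)
    return np.array(X_new), np.array(y_new)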
Table 1. Characteristics of the six data sets used in experiments: number of attributes
and number of positive and negative examples. For some data sets, the class label in
the parentheses indicates the target class we chose. Moreover, this table shows the
optimal choice of parameters for SMOTE (% oversampling) and BED (k, maxtol and
mintol).
4.2 Results
Table 2 illustrates the results using the G-mean metric, defined as √(tpr · tnr) [11], which corresponds to the geometric mean between the correct classification rates for positive (sensitivity) and negative (specificity) examples, respectively.
Note that, in five out of the six data sets evaluated, BED achieved better results than SMOTE and the original SVM (the best results are marked in bold). The G-mean values in Table 2 indicate that BED achieved a better balance between sensitivity and specificity. It is worth noting that, in the case of the Abalone data set, characterized by a huge imbalance degree, the BED algorithm worked well even though both the original SVM and SMOTE were unable to give satisfactory G-mean values.
In Table 3, we evaluated the algorithms using the F-measure metric [10], which considers only the performance for the positive class. F-measure is calculated from two important measures: Recall and Precision. Recall (R) is equivalent to sensitivity and denotes the ratio between the number of positive examples correctly classified and the total number of original positive examples. Precision (P), in turn, corresponds to the ratio between the number of positive examples correctly classified and the total number of examples identified as positive by the classifier. Thus, F-measure is defined as 2·R·P/(R + P) and represents the harmonic mean between Recall and Precision. As shown in Table 3, the BED algorithm produced better results. Compared to the original SVM and SMOTE algorithms,
Table 2. This table compares G-mean values on UCI data sets. The first column lists the data sets used. The following columns show the results achieved by the algorithms: Support Vector Machines (SVMs) (column 2), Synthetic Minority Oversampling Technique (SMOTE) (column 3) and Boundary Elimination and Domination (BED) (column 4). Mean and standard deviation values for each data set were calculated for 7 runs with different test subsets obtained from stratified 7-fold cross-validation.
Table 3. This table compares F-measure values on UCI data sets. The first column lists the data sets used. The following columns show the results achieved by the algorithms: Support Vector Machines (SVMs) (column 2), Synthetic Minority Oversampling Technique (SMOTE) (column 3) and Boundary Elimination and Domination (BED) (column 4). Mean and standard deviation values for each data set were calculated for 7 runs with different test subsets obtained from stratified 7-fold cross-validation.
BED performance was superior, especially for data sets with a higher imbalance degree, which is the case for most real-world problems. The F-measure values reported in Table 3 show that BED improves the classifier performance for the positive class.
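For reference, both evaluation metrics can be computed directly from the confusion-matrix counts; the short sketch below (our own, with illustrative names) simply follows the definitions given above.

import math

def gmean_fmeasure(tp, fp, tn, fn):
    """G-mean and F-measure from binary confusion-matrix counts
    (positive = minority class)."""
    sensitivity = tp / (tp + fn)          # true positive rate (Recall)
    specificity = tn / (tn + fp)          # true negative rate
    precision = tp / (tp + fp)
    g_mean = math.sqrt(sensitivity * specificity)
    f_measure = 2 * sensitivity * precision / (sensitivity + precision)
    return g_mean, f_measure

# example: 45 of 50 positives and 900 of 950 negatives correctly classified
print(gmean_fmeasure(tp=45, fp=50, tn=900, fn=5))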
Average ROC curves (Receiver Operating Characteristics) [12] were plotted
for all data sets and gave similar results. The ROC curve for a binary classifier
shows graphically the true positive rate as a function of the false positive rate
when the decision threshold varies. To illustrate, Fig. 2 shows the example for the
Diabetes data set. Note that BED generates a better ROC curve than SMOTE
and original SVM.
[Plot: average ROC curves (true positive rate versus false positive rate) for the original SVM, SMOTE and BED on the Diabetes data set.]
Fig. 2. Average ROC curves for test set obtained from Diabetes data set
5 Conclusions
References
1. Boser, B.E., Guyon, I.M., Vapnik, V.: A training algorithm for optimal margin
classifiers. In: Proceedings of the fifth annual workshop on Computational learning
theory, pp. 144–152. ACM Press, New York (1992)
2. Vapnik, V.N.: The nature of statistical learning theory. Springer, New York (1995)
3. Cortes, C., Vapnik, V.: Support-Vector Networks. Mach. Learn. 20, 273–297 (1995)
4. Cristianini, N., Shawe-Taylor, J.: An introduction to support vector machines and
other kernel-based learning methods. Cambridge University Press, London (2000)
5. Wu, G., Chang, E.Y.: KBA: kernel boundary alignment considering imbalanced
data distribution. IEEE Trans. Knowl. Data Eng. 17, 786–795 (2005)
6. Provost, F., Fawcett, T.: Robust classification for imprecise environments. Mach.
Learn. 42, 203–231 (2001)
7. Sun, Y., Kamel, M.S., Wong, A.K.C., Wang, Y.: Cost-sensitive boosting for clas-
sification of imbalanced data. Pattern Recognition 40, 3358–3378 (2007)
8. Asuncion, A., Newman, D.J.: UCI Machine Learning Repository. University of
California, Irvine, School of Information and Computer Sciences,
http://www.ics.uci.edu/mlearn/MLRepository.html
9. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic
minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)
10. Tan, P., Steinbach, M.: Introduction to Data Mining. Addison Wesley, Reading
(2006)
11. Kubat, M., Matwin, S.: Addressing the curse of imbalanced training sets: one-sided
selection. In: Proceedings of 14th International Conference on Machine Learning,
pp. 179–186. Morgan Kaufmann, San Francisco (1997)
12. Egan, J.P.: Signal detection theory and ROC analysis. Academic Press, London
(1975)
118 C.L. Castro, M.A. Carvalho, and A.P. Braga
13. Weiss, G.M.: Mining with rarity: a unifying framework. SIGKDD Explor. Newsl. 6,
7–19 (2004)
14. Karakoulas, G., Shawe-Taylor, J.: Optimizing classifiers for imbalanced training
sets. In: Proceedings of Conference on Advances in Neural Information Processing
Systems II, pp. 253–259. MIT Press, Cambridge (1999)
15. Li, Y., Shawe-Taylor, J.: The SVM with uneven margins and Chinese document
categorization. In: Proceedings of the 17th Pacific Asia Conference on Language,
Information and Computation, pp. 216–227 (2003)
16. Veropoulos, K., Campbell, C., Cristianini, N.: Controlling the sensitivity of support
vector machines. In: Proceedings of the International Joint Conference on Artificial
Intelligence, pp. 55–60 (1999)
17. Joachims, T.: Learning to classify text using support vector machines: methods,
theory and algorithms. Kluwer Academic Publishers, Norwell (2002)
18. Cristianini, N., Shawe-Taylor, J., Kandola, J.: On kernel target alignment. In: Pro-
ceedings of the Neural Information Processing Systems NIPS 2001, pp. 367–373.
MIT Press, Cambridge (2002)
19. Kandola, J., Shawe-Taylor, J.: Refining kernels for regression and uneven classifica-
tion problems. In: Proceedings of International Conference on Artificial Intelligence
and Statistics. Springer, Heidelberg (2003)
20. Akbani, R., Kwek, S., Japkowicz, N.: Applying support vector machines to imbal-
anced datasets. In: Proceedings of European Conference on Machine Learning, pp.
39–50 (2004)
21. Vilariño, F., Spyridonos, P., Vitri, J., Radeva, P.: Experiments with SVM and
stratified sampling with an imbalanced problem: detection of intestinal contrac-
tions. In: Proceedings of International Workshop on Pattern Recognition for Crime
Prevention, Security and Surveillance, pp. 783–791 (2005)
22. Tang, Y., Zhang, Y.Q., Chawla, N.V., Krasser, S.: SVMs modeling for highly im-
balanced classification. IEEE Trans. Syst., Man, Cybern. B 39, 281–288 (2009)
Visualization of MIMO Process Dynamics Using
Local Dynamic Modelling with Self Organizing
Maps
1 Introduction
Obtaining a good model of the process dynamics is a key part in the control
system design workflow, as well as for process understanding towards its opti-
mization and supervision. Many real industrial processes, however, are tightly
coupled nonlinear MIMO systems whose dynamics depend heavily on the work-
ing point. Moreover, we often lack prior knowledge about their behaviour, such as theoretical (first-principles) models, having only large amounts of input-output data, which makes us rely on system identification techniques. There exists a large body of work in the literature on identifying nonlinear systems –see [1,2] for comprehensive overviews and [3] for an exhaustive survey of the field– and
particularly for MIMO systems [4,5,6]. However, most of them produce black
box models which are able to accurately reproduce the system behaviour, but
do not leave room for interpretation of the dynamic nature of the process.
In this paper, we propose a MIMO system identification method inspired by the SOM-based local dynamic modelling approach described in [7,8] that, in addition, allows visualizing multivariable dynamic features such as singular gains and directions of transfer function matrices. To show the possibilities of the proposed method, we apply it to an industrial-scale version of the well-known 4-tank process [9]. The paper is organized as follows. In Sec. 2 a generalization of the SOM local dynamic modelling approach to the MIMO case is described, showing also how this technique can be further exploited as a powerful way to explore the system dynamics. In Sec. 3 experimental results using an industrial-scale 4-tank model are described, comparing the accuracy of the estimator with other approaches and showing some of its possibilities for visual exploration of the process dynamics. Finally, Sec. 4 provides a general discussion and concludes the paper.
where ϕ(k) = [y(k − 1), · · · , y(k − n_y), u(k), · · · , u(k − n_u)]^T is the data vector that contains past inputs and outputs which are known at sample k, p(k) = [p_1(k), · · · , p_p(k)]^T is a vector of parameters, and f(·, ·) is a given functional relationship that may be linear or nonlinear, being u(k) = [u_1(k), · · · , u_M(k)]^T and y(k) = [y_1(k), · · · , y_L(k)]^T the input and output vectors at sample k. Model (1) expresses the present output as a function of past outputs and inputs, as well as of a set of model parameters p(k) that determine a dynamic relationship between the sequences {u(k)} and {y(k)}.
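A small illustration of how such a regressor vector can be assembled from recorded data is given below (our own sketch; the array layout and names are assumptions, not part of the paper):

import numpy as np

def build_regressor(y_hist, u_hist, k, ny, nu):
    """phi(k) = [y(k-1),...,y(k-ny), u(k),...,u(k-nu)] stacked as one vector.

    y_hist: array of shape (T, L) with outputs, u_hist: (T, M) with inputs.
    """
    past_y = [y_hist[k - j] for j in range(1, ny + 1)]
    past_u = [u_hist[k - j] for j in range(0, nu + 1)]
    return np.concatenate(past_y + past_u)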
where
that is, those pairs whose corresponding selectors s(k) are mapped onto a neigh-
borhood of unit i of width σloc . For each SOM unit i, a particular case of the
parametric model (1) can be the following local MIMO ARX(ny ,nu ) model
y_l(k) = Σ_{j=1}^{n_y} a_{lj}^i y_l(k − j) + Σ_{m=1}^{M} Σ_{j=0}^{n_u} b_{lmj}^i u_m(k − j) + B_l^i + ε(k)    (6)
2.3 Retrieval
Once the model is trained, a local transfer function matrix model G_i(z) is assigned to each neuron i. The problem of retrieval, at sample k, is that of obtaining
y(k) given the data vector ϕ(k) and the dynamic selectors s(k). This is accomplished in two steps: 1) obtain the best matching unit, c = arg min_i ‖s(k) − m_i‖, and 2) apply the local model G_c(z), equivalent to (6), whose parameters p_c = {a_{lj}^c, b_{lmj}^c, B_l^c} were previously estimated. Note that, depending on the problem (one- or multiple-step-ahead prediction), the data vector ϕ(k) may contain real or estimated past outputs.
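A minimal sketch of this retrieval step is given below (our own illustration, assuming the codebook vectors, the per-unit ARX parameters and the data layout shown; it is not the authors' implementation):

import numpy as np

def retrieve_output(s_k, codebook, params, y_hist, u_hist, k):
    """Two-step retrieval: 1) best matching unit for selector s(k),
    2) evaluate that unit's local ARX model (6)."""
    c = np.argmin(np.linalg.norm(codebook - s_k, axis=1))   # best matching unit
    a, b, B = params[c]["a"], params[c]["b"], params[c]["B"]
    L, ny = a.shape            # a: (L, ny), b: (L, M, nu+1), B: (L,)
    _, M, nu1 = b.shape
    y_k = np.zeros(L)
    for l in range(L):
        y_k[l] = B[l]
        y_k[l] += sum(a[l, j - 1] * y_hist[k - j][l] for j in range(1, ny + 1))
        y_k[l] += sum(b[l, m, j] * u_hist[k - j][m]
                      for m in range(M) for j in range(nu1))
    return y_k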
3 Experimental Results
3.1 Industrial-Scale 4-Tank Model
The proposed analysis of the dynamic behaviour of multivariable systems has
been tested using a quadruple-tank industrial-scale model. This equipment was
developed by the group of Automatic Control of the University of León [14].
Figure 1 shows a picture of the industrial-scale model. It keeps the original
structure of the quadruple-tank process proposed by [9] but it is built using
common industrial instrumentation:
– Grundfos UPE 25-40 flow pumps equipped with Grundfos MC 40/60 expansion modules that control them by means of an analog signal.
A block diagram of the 4-tank, including the variables and conventions used in this paper, is shown in Fig. 2. As seen in Fig. 3, the system is modelled as a MIMO process with two inputs and two outputs, having external parameters γ1, γ2.
Fig. 1. Industrial-scale model of the quadruple-tank process used to test the proposed
method
and using an MLP nonlinear mapping to approximate the function f(·, ·). A battery of trainings using an MLP with two hidden layers was carried out, with three trainings for different combinations of the n_y, n_u orders and the number of neurons in both hidden layers; the best structure turned out to be a NARX(4,4) model with 8 neurons in both hidden layers.
The accuracy of the three estimators was measured for multiple-step ahead
predictions, feeding the estimators only with the input data, that is, the power
of both pumps u1 , u2 and the positions of the valves γ1 , γ2 , while past values
of the outputs needed to build the data vector ϕ were obtained from previous
estimations, except for the initial value. The results are summarized in Table 2.
Table 2. Comparison of training, validation and test errors for the local models
One interesting kind of visualization is frequency response maps [11]. These maps show how the system's pumps-to-levels gains at a given frequency vary with respect to γ1 and γ2. Since, after training, a transfer function matrix is defined for every unit, two new component planes can be defined for any normalized frequency θ ∈ (0, π) as the absolute values of the singular values of G_i(z)
These maps were obtained for θ = 0.5 in Fig. 5. It can be seen how the principal system gains are strongly correlated to γ1 + γ2, showing that the system dynamics is strongly influenced by changes in the valve positions, producing larger variations in the levels for large values of the γ's when the powers of the pumps vary at frequency θ.
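The computation behind such maps can be sketched as follows: for each SOM unit the local ARX coefficients are converted to a frequency response matrix at z = e^{jθ}, and its singular value decomposition gives the gains and directions. The code below is our own illustration under assumed parameter layouts, not the authors' implementation.

import numpy as np

def local_frequency_gains(a, b, theta):
    """Singular values/vectors of the local transfer matrix G(e^{j*theta}).

    a: (L, ny) AR coefficients, b: (L, M, nu+1) input coefficients (model (6))."""
    L, ny = a.shape
    _, M, nu1 = b.shape
    z = np.exp(1j * theta)
    G = np.zeros((L, M), dtype=complex)
    for l in range(L):
        den = 1.0 - sum(a[l, j - 1] * z ** (-j) for j in range(1, ny + 1))
        for m in range(M):
            num = sum(b[l, m, j] * z ** (-j) for j in range(nu1))
            G[l, m] = num / den
    U, s, Vh = np.linalg.svd(G)
    return s, U, Vh.conj().T     # gains, output directions, input directions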
Also shown in these maps are the singular input and output vectors V1 , U1 for
singular value σ1 and the singular input and output vectors V2 , U2 for singular
value σ2 . Looking at the maps of γ1 + γ2 of Fig. 5, it can be observed that for low
values of γ1 + γ2 , where pump 1 feeds mostly tank 4 and pump 2 feeds mostly
tank 3, the singular input vector points in the direction of input u1 and its
corresponding output vector points in the direction of output y2 (tank 2, which
is just below tank 4). Conversely, as expected, when the second singular input
vector points to input u2 the second output singular vector points to output y1 .
On the other hand, for high values of γ1 + γ2 , where pumps 1 and 2 feed mostly
tanks 1 and 2 respectively, the singular input vector has the same components
for u1 and u2 , and points to the same direction as the corresponding output
vector, showing that the highest gain is achieved if the powers of both pumps
are equal, producing also equal changes in the levels of both tanks.
Another parameter representing a measure of interaction in multivariable control systems is the relative gain array (RGA) [9]. The RGA can be defined for discrete systems as Λ = G(1) ∗ G^(−T)(1). Since the elements of each row and
Fig. 5. Frequency maps showing the singular values (gains) |σ1 | and |σ2 | as well as
the singular input and output directions at frequency θ = 0.5. The gains (in dB) are
represented by colors and the directions are represented by two arrows (input in grey
and output in black) at each SOM unit.
Fig. 6. Empirical RGA obtained from the data (left), and theoretical RGA (center)
obtained from relationship (12). Also the component plane of γ1 + γ2 is shown (right)
for comparison.
column sum up to one, the element λ = Λ_{11} uniquely determines the RGA. In [9], the author shows that for the 4-tank process this parameter is given by

λ = γ1 γ2 / (γ1 + γ2 − 1).    (12)
To show the consistency of the local model approach, we obtained two RGA component planes: 1) computing the RGA for every G_i(z) obtained empirically from the data after the training process, and 2) applying the theoretical relationship (12) with the γ1 and γ2 of every SOM unit i. Both maps, along with the map of γ1 + γ2, are shown in Fig. 6.
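Both planes can be computed with a few lines per SOM unit, for example as in the sketch below (our own illustration; G1 stands for the local steady-state gain matrix G(1) of a unit):

import numpy as np

def empirical_lambda(G1):
    """RGA element Lambda_11 from a 2x2 steady-state gain matrix G(1):
    Lambda = G(1) .* inv(G(1)).T  (element-wise product)."""
    rga = G1 * np.linalg.inv(G1).T
    return rga[0, 0]

def theoretical_lambda(gamma1, gamma2):
    """Relationship (12) for the 4-tank process."""
    return gamma1 * gamma2 / (gamma1 + gamma2 - 1)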
The region on the right map of Fig. 6 where γ1 + γ2 is approximately 2 is in agreement with the same region on the left and center maps, where the theoretical and empirical values of the RGA are about 1. That reveals a high correlation between input u1 and output y1, and between input u2 and output y2. This conclusion admits a highly intuitive physical explanation, because pump 1 sends most of the flow to tank 1 and pump 2 sends most of the flow to tank 2 in these conditions.
4 Conclusion
In this work we have proposed an extension to MIMO systems of a SOM-based
local linear modelling algorithm and shown its performance on an industrial-
scale prototype of the well-known 4-tank plant system. While its performance in terms of estimation accuracy is slightly lower than that of other state-of-the-art nonlinear identification methods, such as MLP-NARX models, it is much superior to any single global linear MIMO model (i.e. transfer function matrix). Moreover, unlike the MLP-NARX model, the nature of the proposed approach, based on local transfer function matrix models organized in an ordered fashion by the SOM algorithm, makes it possible to explore the dynamic features visually using maps, allowing the discovery of subtle relationships in the process behaviour.
While the results presented in this paper are still at a preliminary stage, they are encouraging and reveal the potential of the proposed approach to improve supervision, control and optimization of industrial processes, based on a deeper understanding of their dynamic behaviour.
Acknowledgements
This work was supported by the Spanish Ministerio de Educación y Ciencia and FEDER funds, under grants DPI-2006-13477-C01 and DPI-2006-13477-C02.
References
1. Sjoberg, J., Zhang, Q., Ljung, L., Benveniste, A., Delyon, B., Glorennec, P., Hjal-
marsson, H., Juditsky, A.: Nonlinear black-box modeling in system identification:
a unified overview. Automatica 31(12), 1691–1724 (1995)
2. Chen, S., Billings, S., Grant, P.: Non-linear system identification using neural net-
works. International Journal of Control 51(6), 1191–1214 (1990)
3. Giannakis, G., Serpedin, E.: A bibliography on nonlinear system identification.
Signal Processing 81(3), 533–580 (2001)
4. Dayal, B., MacGregor, J.: Multi-output process identification. Journal of Process
Control 7, 269–282 (1997)
5. Peng, H., Wu, J., Inoussa, G., Deng, Q., Nakano, K.: Nonlinear system modeling
and predictive control using the RBF nets-based quasi-linear ARX model. Control
Engineering Practice (2008)
6. Yu, D., Gomm, J., Williams, D.: Neural model input selection for a MIMO chemical
process. Engineering Applications of Artificial Intelligence 13(1), 15–23 (2000)
7. Principe, J.C., Wang, L., Motter, M.A.: Local dynamic modeling with self-
organizing maps and applications to nonlinear system identification and control.
Proceedings of the IEEE 86(11), 2240–2258 (1998)
130 I. Dı́az et al.
8. Cho, J., Principe, J.C., Erdogmus, D., Motter, M.A.: Modeling and inverse con-
troller design for an unmanned aerial vehicle based on the self-organizing map.
IEEE Transactions on Neural Networks 17(2), 445–460 (2006)
9. Johansson, K.: The quadruple tank process: A multivariable laboratory process
with an adjustable zero. IEEE Transactions on Control Systems Technology 8(3),
456–465 (2000)
10. Dı́az Blanco, I., Cuadrado Vega, A.A., Diez González, A.B., Fuertes Martı́nez, J.J.,
Domı́nguez González, M., Reguera, P.: Visualization of dynamics using local dy-
namic modelling with self organizing maps. In: de Sá, J.M., Alexandre, L.A., Duch,
W., Mandic, D.P. (eds.) ICANN 2007. LNCS, vol. 4668, pp. 609–617. Springer, Hei-
delberg (2007)
11. Dı́az Blanco, I., Domı́nguez González, M., Cuadrado, A.A., Fuertes Martı́nez, J.J.:
A new approach to exploratory analysis of system dynamics using SOM. Applications to industrial processes. Expert Systems With Applications 34(4), 2953–2965 (2008)
12. Kohonen, T., Oja, E., Simula, O., Visa, A., Kangas, J.: Engineering applications
of the self-organizing map. Proceedings of the IEEE 84(10), 1358–1384 (1996)
13. Kohonen, T.: Self-Organizing Maps. Springer, Heidelberg (1995)
14. Domı́nguez González, M., Reguera, P., Fuertes, J.: Laboratorio Remoto para
la Enseñanza de la Automática en la Universidad de León (España). Revista
Iberoamericana de Automática e Informática Industrial (RIAI) 2(2), 36–45 (2005)
Data Visualisation and Exploration with Prior
Knowledge
1 Introduction
Data visualisation is widely recognised as a key task in exploring and under-
standing data sets. Including prior knowledge from experts into probabilistic
models for data exploration is important since it constrains models, which usu-
ally leads to more interpretable results and greater accuracy. As measurement
becomes cheaper, datasets are becoming steadily higher dimensional. These high-
dimensional data sets pose a great challenge when using probabilistic models
since the training time and the generalisation performance of these models de-
pends on the number of free parameters.
A common fix for Gaussian models is to reduce the number of parameters
and to ensure sparsity in the model by constraining the covariance matrix to
be either diagonal, or spherical in the most restricted case. These constraints
exclude valuable information about the data structure, especially in cases where
there is some understanding of the structure of the covariance matrix.
A good example of this is data from chemical analysis like Gas Chromatography-
Mass Spectrometry (GC-MS). When one examines the results of GC-MS runs over
different samples, one knows that certain compounds are highly correlated with
each other. This information can be incorporated into the model with a block-
matrix covariance structure. This will help to reduce the number of free parameters
without losing too much valuable information. In this paper we will look at a com-
mon probabilistic model for data exploration called the Generative Topographic
Mapping (GTM) [2]. The standard GTM uses a spherical covariance matrix and
we will modify this algorithm to work with an informative block covariance matrix.
The paper has the following structure. First we briefly review models for data
visualisation and put them into context. Then we introduce the standard GTM
model, extend it to the case of a block covariance matrix and describe how GTM
can deal with missing data. Then we present some experiments on artificial data
and real data, where we compare the block version of GTM against spherical
and full covariance versions and show where the advantages of the models lie.
Finally we conclude the paper and point out further areas of research.
2 Data Exploration
A fundamental requirement for visualisation of high-dimensional data is to be
able to map, or project, the high-dimensional data onto a low-dimensional representation (a 'latent' space) while preserving as much information about the original structure in the high-dimensional space as possible.
There are many possible ways to obtain such a low-dimensional representation.
Context will often guide the approach, together with the manner in which the
latent space representation will be employed. Some methods such as PCA and
factor analysis [5] linearly transform the data space and project the data onto
the lower-dimensional space while retaining the maximum information. Other
methods like the Kohonen, or Self Organising, Maps [11] and the related Gen-
erative Topographic Mapping (GTM) [2,1] try to capture the topology of the
data. Another recent topology-preserving method, the Gaussian Process Latent
Variable Model (GP-LVM) [12] utilises a Gaussian Process prior over the map-
ping function. Instead of optimising the mapping function one considers a space
of functions given by the Gaussian Process. One then fits the model by directly
optimising the positions of the points in the latent space. Geometry-preserving methods like multi-dimensional scaling [3] and Neuroscale [13] try to find a representation in latent space which preserves the geometric distances between the data points. The latter approach can even be extended through a technique called Locally Linear Embedding [15,9], which defines another metric to calculate the geometric distances, one used to optimise the mapping function.
In this paper we will focus on the classical Generative Topographic Map-
ping (GTM) and an extension which we will call Block Generative Topographic
Mapping (B-GTM).
Fig. 1. The non-linear function f (x, W) defines a manifold S embedded in the data
space given by the image of the latent variable space under the mapping x → t
p(x) = (1/K) Σ_{i=1}^{K} δ(x − x_i)    (2)
in which case the integral in (1) can be evaluated and the corresponding log likelihood becomes

L(W, β) = Σ_{n=1}^{N} ln { (1/K) Σ_{i=1}^{K} p(t_n | x_i, W, β) }.    (3)
Maximising the log likelihood, one gets the updates for W and β in the M-step:

Φ^T G_old Φ W_new^T = Φ^T R T,    (6)

with Φ being a K × M matrix with elements Φ_ij = Φ_j(x_i), T being an N × D matrix with elements t_nk, R being a K × N matrix with elements R_in and G being a K × K diagonal matrix with elements

G_ii = Σ_{n=1}^{N} R_in(W_old, β_old).    (7)
and for β:

1/β_new = (1/(N D)) Σ_{n=1}^{N} Σ_{i=1}^{K} R_in(W_old, β_old) ‖W_new Φ(x_i) − t_n‖².    (8)
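A compact sketch of these M-step updates in code form is given below (our own illustration of equations (6)–(8), assuming the responsibility matrix R has already been computed in the E-step; all names are ours):

import numpy as np

def gtm_m_step(Phi, T, R):
    """M-step of GTM: solve (6) for W and update beta via (8).

    Phi: (K, M) basis activations, T: (N, D) data, R: (K, N) responsibilities."""
    K, M = Phi.shape
    N, D = T.shape
    G = np.diag(R.sum(axis=1))                      # (7): G_ii = sum_n R_in
    # (6): (Phi^T G Phi) W_new^T = Phi^T R T
    W_new_T = np.linalg.solve(Phi.T @ G @ Phi, Phi.T @ R @ T)   # (M, D)
    W_new = W_new_T.T                               # (D, M)
    # (8): 1/beta_new = (1/ND) * sum_{n,i} R_in ||W_new Phi(x_i) - t_n||^2
    Y = Phi @ W_new.T                               # (K, D) projected centres
    sq_dist = ((Y[:, None, :] - T[None, :, :]) ** 2).sum(axis=-1)   # (K, N)
    beta_new = N * D / (R * sq_dist).sum()
    return W_new, beta_new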
Σ_b = (1/(N D)) Σ_{n=1}^{N} Σ_{k=1}^{K} R_kn a_kn^b (a_kn^b)^T,    (10)

where a_kn^b = (f(x_k, W)^b − t_n^b), with t_n^b being the point t_n restricted to the dimensions of block b.
best in the sense of telling us the most about a certain dataset. In the simple
case of artificial data one can use prior knowledge about the structure of the
data in the original space to quantify the error on the projection. For the more
complex case of real data there are various approaches to this problem rang-
ing from different resampling methods [19] to a Bayesian approach using the
GP-LVM [8].
In this paper we are going to focus mainly on the following measures of the
quality of a projection:
Fig. 2. The nearest neighbour label error on the artificial test data with high (STD=20) and low (STD=2) structure for the GTM model with different covariance structures. PCA = (blue, dotted line with big dot), S-GTM = (green, solid line with X), B-GTM = (red, dashed line with diamond), F-GTM = (black, dash-dotted line).
Fig. 3. The root mean square error for imputation on the artificial test data with high (STD=20) and low (STD=2) structure for the GTM model with different covariance structures. PCA = (blue, dotted line with big dot), S-GTM = (green, solid line with X), B-GTM = (red, dashed line with diamond), F-GTM = (black, dash-dotted line).
structure in the data the B-GTM performs as well as, or only slightly worse than, S-GTM, while F-GTM clearly struggles with increasing dimensionality. However, once more structure is present, B-GTM clearly outperforms S-GTM, albeit the performance difference narrows as dimensionality increases. The number of blocks is significant as well, since more blocks mean fewer parameters for the B-GTM model, which also results in improved performance.
The RMSE was calculated on the same 20 projections. The results for this
experiment, shown in Figure 3, indicate that the block as well as full version of
GTM always outperform the S-GTM regardless of the amount of block structure
in the data. However the amount by which the spherical GTM is outperformed
% Missing 5 10 20 30 40 50 60 70
RMSE Improvement in % 0.04 0.13 0.03 0.04 0.13 0.10 0.10 0.04
Fig. 4. The overall improvement of performance in the RMSE from B-GTM to S-GTM
Fig. 5. The imputation root mean square error on North Sea Oil data. S=spherical,
B=Block, F=Full.
5 Conclusions
6 Future Work
Future work in this area will be aimed at assessing the possibility of including
methods like Bayesian Correlation Estimation [10] into the algorithm in order
to learn the correlation structure rather than rely on it being imposed a priori.
Another approach might be the variational formulation of the GTM algorithm
which includes the estimate of the correlation structure as well.
Acknowledgements
MS would like to thank the EPSRC and IGI Ltd. for funding his studentship under the CASE scheme, as well as the geochemical experts from IGI Ltd. (Chris Cornford, Paul Farrimond, Andy Mort, Matthias Keym) for their help and patience.
References
16. Schroeder, M., Cornford, D., Farrimond, P., Cornford, C.: Addressing missing
data in geochemistry: A non-linear approach. Organic Geochemistry 39, 1162–1169
(2008)
17. Schroeder, M., Nabney, I.T., Cornford, D.: Block GTM: Incorporating prior knowledge of covariance structure in data visualisation. Technical report, NCRG, Aston
University, Birmingham (2008)
18. Sun, Y.: Non-linear Hierarchical Visualisation. PhD thesis, Aston University (2002)
19. Yu, C.H.: Resampling methods: concepts, applications, and justification. Practical
Assessment, Research and Evaluation 8 (2003)
Reducing Urban Concentration Using a Neural
Network Model
1 Introduction
Urban design is concerned primarily with the design and management of public space in towns and cities, and with the way public places are experienced and used. References [2] and [10] introduce general ideas and concepts related to urban development.
Among the more relevant topics that urban design considers are urban topology, density and sustainability. We want to highlight the aspect of density. The geographical area where the authors of this paper live presents a very high level of urban density. We are convinced that a change in the pattern of urban development is absolutely desirable. In this sense, what is proposed here is a reduction in the concentration of buildings using a neural network model.
See, for example, [6] for a detailed description of surface simplification methods. Several geometric problems for sets of n points require triangulating planar point sets [8]. This has motivated increased attention to problems related to the triangulation of planar point sets and 2D triangle meshes, as can be seen in recent works [13], [4], [3], [5], and [12].
A recent topic in mesh simplification is the preservation of topological char-
acteristics of the mesh and of data attached to the mesh [14].
K = {k1 , k2 , . . . , kM },
where ki , for i = 1, 2, . . . , M are the new nodes in the simplified triangle mesh.
Note that always M < N and M is fixed. The set of vertices K must be computed
by means of the following algorithm.
Phase 1: the self-organizing algorithm.
INIT: Start with M points k1, k2, . . . , kM at random positions wk1, wk2, . . . , wkM in R². Initialize a local counter to zero for every point.
1. Generate an input signal ξ that will be a random point ni ∈ A from the original mesh.
2. Find the nearest node s1.
3. Find the second and third nearest nodes, s2 and s3, to the input signal.
4. Increment the local counter of the winner node s1.
5. Move s1 towards ξ by a fraction εwin of the total distance:
   Δws1 = εwin (ξ − ws1).
6. Move s2 and s3 towards ξ by a fraction εn of the total distance:
   Δws2 = εn (ξ − ws2),
   Δws3 = εn (ξ − ws3).
7. Repeat steps 1 to 6 λ times, with λ an integer.
   – If the local counter for any ki, i = 1, 2, . . . , M, is zero, then remove this node and add a new node between the node with the highest local counter value and any node adjacent to it.
   – If the local counter for every ki, i = 1, 2, . . . , M, is greater than zero, then continue and repeat steps 1 to 6.
8. Stop when the maximum number of iterations has been reached.
Phase 2: the triangulation process.
INIT: Consider the set A of the original nodes, T the set of triangles of the original 2D triangle mesh, and K the set of nodes obtained by the above self-organizing algorithm.
1. Associate each node of the original mesh with a node of the set K.
   For every ni, i = 1, 2, . . . , N, find j ∈ {1, 2, . . . , M} such that
   |wni − wkj| < |wni − wkl|, for all l ∈ {1, 2, . . . , M}, l ≠ j,
   where wni represents the position of the node ni.
   Save (ni, kj). We say that kj is the node associated to ni.
2. Replace the nodes of the original triangles by their associated nodes.
   For every ti = {ni1, ni2, ni3} ∈ T, substitute
   {ni1, ni2, ni3} −→ {kj1, kj2, kj3},
   where kj1, kj2, kj3 are the associated nodes of ni1, ni2, ni3, with j1, j2, j3 ∈ {1, 2, . . . , M}.
   – If kj1, kj2 and kj3 are all distinct, then save the new triangle {kj1, kj2, kj3}.
   – If kj1 = kj2, or kj1 = kj3, or kj2 = kj3, then continue (the triangle degenerates and is discarded).
3. Graph the set of saved triangles.
4. If some node is isolated, we add a new triangle linking this node with its adjacent nodes.
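The two phases can be sketched compactly in code. The following is our own illustrative reading of the algorithm above, not the authors' implementation; the array names, the learning rates eps_win and eps_n, and the interval at which the local counters are checked are assumptions.

import numpy as np

def simplify_mesh(A, T, M, iters=750_000, eps_win=0.8, eps_n=0.1, rng=None):
    """Phase 1: self-organizing placement of M new nodes over original nodes A.
    Phase 2: re-triangulation by nearest-representative association.

    A: (N, 2) original node positions, T: list of index triples (triangles)."""
    rng = np.random.default_rng(rng)
    K = rng.uniform(A.min(0), A.max(0), size=(M, 2))   # random initial positions
    counters = np.zeros(M, dtype=int)
    lam = max(1, len(A) // 3)                          # counter-check interval (assumed)
    for t in range(1, iters + 1):
        xi = A[rng.integers(len(A))]                   # input signal: random original node
        order = np.argsort(np.linalg.norm(K - xi, axis=1))
        s1, s2, s3 = order[:3]
        counters[s1] += 1
        K[s1] += eps_win * (xi - K[s1])                # move winner
        K[s2] += eps_n * (xi - K[s2])                  # move two nearest neighbours
        K[s3] += eps_n * (xi - K[s3])
        if t % lam == 0:                               # relocate never-winning nodes
            for i in np.where(counters == 0)[0]:
                K[i] = K[counters.argmax()] + 0.01 * rng.standard_normal(2)
    # Phase 2: associate each original node with its nearest new node
    assoc = np.argmin(np.linalg.norm(A[:, None, :] - K[None, :, :], axis=-1), axis=1)
    new_T = {tuple(sorted(assoc[list(tri)])) for tri in T
             if len(set(assoc[list(tri)])) == 3}       # keep non-degenerate triangles
    return K, sorted(new_T)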
This first phase of the model can be seen as a training process, where a set of
nodes representing the new vertices of a planar mesh is obtained. So far, nothing
about the triangles or faces of the original planar mesh has been mentioned. The second phase of the model is a triangulation process developed with the aim of reconstructing the new mesh. Our proposal constitutes a post-process which uses the information provided by the self-organizing algorithm together with the information about the nodes and triangles of the original mesh.
With respect to the self-organizing algorithm presented in Phase 1 of the model, although it shares the central idea of self-organizing models, namely the development of an unsupervised incremental clustering algorithm, some basic considerations about this algorithm must be remarked. Some clarification is necessary regarding the parameter λ and point 7 of the self-organizing algorithm. As we choose at random the initial geometric positions of the M initial nodes (point 1), it is possible that some of these initial points are generated in positions far away from the area covered by the original mesh. Accordingly, no signal will be near these points and, therefore, these nodes will never be winners in the training process. Consequently, their local counters will remain zero as the number of iterations increases. This is how we
detect when a node is outside the area of the initial triangle mesh. This is the
reason to introduce the condition in the point 7 of the self-organizing algorithm.
In the self-organizing algorithm the positions of the initial vertices are modified with the purpose of learning the shape of the original mesh. After this process we have no information about the triangles that make up the simplified mesh. We only have the geometric positions of the new vertices; therefore, an efficient algorithm may be implemented to reconstruct the 2D mesh preserving the original shape.
The most popular triangulation method is the Delaunay triangulation [7], which basically finds the triangulation that maximizes the minimum angle of all triangles, among all triangulations of a given point set. We follow, as described in the algorithm, a rather different approach. The idea underlying our approach is to establish a special equivalence relation on the original set of vertices. The equivalence relation consists of assigning to every node of the set A the node of the set K which is nearest to it. After this association, we will have some equivalence classes, because several nodes of the original set A will have as a representative the same node of the set K (remember that the cardinality of K is much smaller than the cardinality of A). After this concordance process, we only have to determine the triangles with different representatives in order to determine the nodes of the set K that must be linked to graph the final simplified mesh.
We have added a condition at the end of the triangulation algorithm. The primary objective of this condition is to avoid the existence of holes or unconnected regions in the final mesh. When the shapes of the original meshes are not too complicated, we have no problems with holes or isolated regions in the final result; however, as has been observed in the examples performed, when the shape of the original mesh is really complex, we can have problems with the appearance of unconnected regions or isolated vertices, especially when M ≪ N.
4 A Real Example
To assess the performance of the model proposed in this paper, several exper-
iments were conducted using some original 2D triangle meshes. In this section
we want to analyze in detail the urban reduction and simplification of an area
representing a real residential area with geographical coordinates (38.25, −0.7)
(latitude and longitude). The residential area to study is shown in Figure 1.
In the residential area shown in Figure 1, we identify each of the houses or
buildings with a node in the mesh and perform a triangulation process with
these nodes, obtaining a two-dimensional grid made up of planar triangles, as
we can see in Figure 2. This initial mesh has 293 vertices or nodes (houses) and
352 triangles.
Note that our goal is to achieve, while maintaining the same topology of the
mesh, an optimal distribution of the new nodes (houses) with the aim of creating more space between them to establish new services or enhance green spaces. So, we consider, as an initial parameter, that the final mesh will be composed of 125 nodes, that is, we have N = 293 and M = 125. Therefore, the new
design of the residential area will have 125 houses, with the same shape as the
original one.
To reach this objective we apply our simplification model to the mesh in
Figure 2.
The starting point of the self-organizing algorithm is to place K points,
k1 , k2 , . . . , kK , at random positions in the graphic. We take, for this example,
K = 125 points because it is the number of nodes (houses) that we want for the
simplified network. After generating, at random, the initial positions of the 125
nodes of the simplified mesh, we continue running the training process in the
self-organizing algorithm. In this case, after 5000 iterations, we check the local
counters of the n1 , . . . , n293 nodes of the original network; then, we proceed to
remove the vertices that have not been winners in any iteration. After this step, we continue with the iterations until the stopping criterion is reached.
Figure 3 shows the positions of the initial nodes after the process of removing isolated nodes is completed. It may be remarked that, thanks to this removal process, all the initial nodes end up close to the area covered by the original network. This is very important for the further development of the algorithm.
Now, we continue the training process with the rest of the iterations. After
performing 750000 iterations, we stop the process, obtaining the final positions
of the vertices for the simplified network. The experimental result obtained is
shown in Figure 4.
At this moment it is possible to carry out the triangulation process from the information provided by the self-organizing algorithm. The basis of the triangulation process is the comparison between the original nodes of the network and the new nodes. The reconstruction of the simplified mesh is shown in Fig. 5.
Fig. 3. The original 2D triangle mesh with the initial 125 nodes placed at random
Fig. 4. The original 2D triangle mesh with the final 125 nodes placed after running
the self-organizing algorithm
The simplified mesh of Fig. 5 has been drawn to show how the mesh conforms to the final topology of the original network. In this example we see the efficiency of the model in approximating the shape of the original mesh. But, what is more relevant, Fig. 5 shows the new design that the studied residential area could have. By means of this figure we visualize a new distribution of the nodes (houses) with
the property that the nodes are now located in the best places to approximate
the shape of the original area, maximizing the space between them, with the
advantages that this gives us regarding the creation of green spaces, bicycle
tours, sporting facilities, ...
To carry out this example, it has been necessary to specify a set of parame-
ters to run the self-organizing algorithm. The parameters involved in it are the
following:
– εwin is related to the displacement of the winner node.
– εn is related to the displacement of the neighbor nodes.
– λ was introduced in order to make sure that no node of the initial set of random points stays isolated during the whole execution of the algorithm.
There is no theoretical result that provides us with a parameter set that always produces the best results for the simplified mesh. Consequently, the determination
of the set of parameters to run the algorithm is obtained experimentally. The
values of these parameters used in our examples have been εwin = 0.8 and εn = 0.1. The parameter λ is not fixed, as it depends on the number of nodes N; we used λ = 13 · N.
In all the original 2D meshes studied, with different sizes and shapes, the results have been equally successful in the sense that we simplify the original mesh while preserving the topology of the initial network. No matter how irregular the shape of the original mesh is, the self-organizing algorithm always learns the shape and places the initial nodes within the original mesh.
5 Conclusion
A new 2D triangle mesh simplification model has been introduced with the central property of preserving the shape of the original mesh. The model presented consists of a self-organizing algorithm whose objective is to generate the positions of the nodes of the simplified mesh; afterwards, a triangulation algorithm is carried out to reconstruct the triangles of the new simplified mesh.
Among the various applications of the model we have chosen one related to
urban density. An example using a real geographical area is shown to demon-
strate that the model is able to reduce urban concentration, keeping the exact
topology of the terrain and creating more space between buildings.
Some experimental results using original meshes with irregular shapes show that the final simplified meshes are generated very fast and with a high level of approximation with respect to the original ones, no matter how irregular their shapes are.
References
1. Alvarez, R., Noguera, J., Tortosa, L., Zamora, A.: A mesh optimization algorithm
based on neural networks. Information Sciences 177, 5347–5364 (2007)
2. Barnett, J.: An introduction to urban design. Harper and Row, New York (1982)
1 Introduction
The monitoring of marine ecosystems is a major current concern in our society, since its impact is essential for many domains: ecology (biodiversity, production, survey), climate, economy (tourism, resource control, transport). In 2000, the DCE Directive adopted by the European Parliament [1] defined Phytoplankton as a biological factor for marine quality assessment. In this context, we propose an automatic dissimilarity-based classification method designed to assess marine water quality by Phytoplankton species classification and counting.
The available signals are fluorescence and scatter parameter scans of each particle detected by a flow cytometer in a marine sample. So our problem comes down to the classification of multidimensional signals, whose shape profile is class-specific. To date, this classification has been made by visual comparison of the obtained profiles with reference ones, or by the inverted microscope method [2][3]. The major difficulty of the discrimination task is that Phytoplankton is a live vegetal species, so its internal structure (pigments, size, nucleus position) varies according to its belonging group but also
and above all according to its physiological condition (life cycle, cell or colony) and its environment [4][5]. To make a Phytoplankton classifier system robust to these variabilities, it appears relevant to use an elastic measure to compare two signal profiles. Our approach is therefore based on the classical elastic matching of Sakoe and Chiba (Dynamic Time Warping, DTW [6]). This method has been widely tested, initially for speech recognition (comparison of 1D time-frequency amplitude patterns) and for handwritten pattern recognition (1D spatial matching) [7].
In order to get a more understandable qualitative measure, simple and comprehensible for any biologist, we adapt the matching cost of this algorithm so as to get a [0, 1]-normalized dissimilarity degree that deals with multidimensional time signals.
Figure 1 presents the global scheme of the recognition system based on a dissimilarity measure applied to Phytoplankton characterization. It is composed of two classical parts:
1. a feature-like extraction module which computes, for the unknown cell, a dissimilarity vector with respect to some reference cells;
2. a standard classifier which takes this vector as input and outputs the recognized species name.
The next section describes the proposed adaptation of Sakoe and Chiba's algorithm to get a [0,1]-dissimilarity degree between nD signals. Section 3 presents the experimentation protocol and results that show the efficiency of the system with different classifiers. Different variants are tested and compared, first with feature-based classifiers, then with the classical DTW algorithm, in terms of recognition rate.
measure represents the quantity of geometrical distortion needed to match both curves, regardless of some time distortions.
More precisely, the method matches the points of both signals, and defines their matching cost as the mean distance between the paired points. For example, in the ideal case where all paired points are identical, the matching is perfect, and the cost is consequently zero. The softness of the algorithm comes from its ability to pair time-shifted points with a cost equal to zero.
Let X = {(xi ), i = 1, . . . , nx } and Y = {(y j ), j = 1, . . . , ny } be the two signals to be
compared, with i and j their time index, and nx and ny their respective length. We first
consider monodimensional signals: each value xi or y j belongs to ℜ.
The algorithm builds a matching P = {(ik , jk ), k = 1, . . . , nk } between the points of
signals X and Y , according to some time conditions. The resulting matching is defined
as the one minimizing the following weighted mean distance C between paired points,
based upon some distance d and a weight vector W :
C(X, Y, P, W) = ( Σ_{k=1}^{n_k} d(x_{i_k}, y_{j_k}) · w(k) ) / ( Σ_{k=1}^{n_k} w(k) ) = Dist(X, Y, P, W) / Σ_{k=1}^{n_k} w(k).    (1)
Selected conditions of the pairing in this DTW variant are the following:
1. End-Points conditions: first (and last) points of both signals are paired: (1, 1) ∈ P
(and (nx , ny ) ∈ P);
2. Continuity conditions: all points are matched;
3. Monotonicity conditions: pairs are time ordered: ik−1 ≤ ik and jk−1 ≤ jk .
the way the least cost was computed, from the last pair (nx, ny) to the first pair (1, 1). This is generally used to normalize the final cost Dist(nx, ny), so as to get the mean distortion measure per matched pair.
In order to make the accumulated distance Dist minimization equivalent to the cost C minimization, the weights are defined according to the symmetric solution proposed by Sakoe and Chiba. Let (ik, jk) be the k-th pair of matching P:
n
Then ∑k=1
k
w(k) = nx + ny does not depend on P, and optimizing Dist(nx , ny ) makes C
optimized too.
Algorithm 1. DTW algorithm computing the accumulated distance of the best matching
Dist(1, 1) = 2·d(x1, y1)
for all i = 2, . . . , nx do
  Dist(i, 1) = Dist(i − 1, 1) + d(xi, y1)
end for
for all j = 2, . . . , ny do
  Dist(1, j) = Dist(1, j − 1) + d(x1, yj)
end for
for all i = 2, . . . , nx do
  for all j = 2, . . . , ny do
    Dist(i, j) = min( Dist(i − 1, j) + d(xi, yj), Dist(i, j − 1) + d(xi, yj), Dist(i − 1, j − 1) + 2·d(xi, yj) )
  end for
end for
return Dist(nx, ny)
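A direct, runnable transcription of Algorithm 1 is given below (our own sketch; taking d as the absolute difference is an assumption for the monodimensional case):

import numpy as np

def dtw_accumulated_distance(x, y, d=lambda a, b: abs(a - b)):
    """Accumulated distance Dist(nx, ny) of the best matching (Algorithm 1)."""
    nx, ny = len(x), len(y)
    Dist = np.empty((nx, ny))
    Dist[0, 0] = 2 * d(x[0], y[0])
    for i in range(1, nx):
        Dist[i, 0] = Dist[i - 1, 0] + d(x[i], y[0])
    for j in range(1, ny):
        Dist[0, j] = Dist[0, j - 1] + d(x[0], y[j])
    for i in range(1, nx):
        for j in range(1, ny):
            Dist[i, j] = min(Dist[i - 1, j] + d(x[i], y[j]),
                             Dist[i, j - 1] + d(x[i], y[j]),
                             Dist[i - 1, j - 1] + 2 * d(x[i], y[j]))
    return Dist[-1, -1]

# normalization by nx + ny gives the mean distortion per matched pair
x, y = [0, 1, 3, 2, 0], [0, 0, 1, 3, 2, 0]
print(dtw_accumulated_distance(x, y) / (len(x) + len(y)))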
where j*_i denotes the index defined in the previous linear matching.
In order to make this ratio consistent, we suppose the signals to be positive; otherwise the dissimilarity degree could exceed 1.
Using this measure, the cost function C becomes a mean dissimilarity between paired points, and consequently a global dissimilarity measure for signals.
with dL1 the L1 -distance. The choice consists in accumulating the distortion measures
over the nc curves.
In our dissimilarity version of DTW, we accumulate dissimilarity measures instead
of L1 -distances for the nc positive curves, and we then normalize the result:
1 nc
s(x̄i , ȳ j ) = ∑ s(xic , yic ).
nc c=1
(6)
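To illustrate how the per-dimension degrees are combined, the sketch below averages a bounded per-point dissimilarity over the nc curves. The per-point formula used here (|x − y| / (x + y) for positive signals) is purely an assumption for illustration, since the exact definition appears earlier in the paper and is not reproduced above.

def point_dissimilarity(x, y):
    """Assumed bounded [0,1] dissimilarity between two positive scalar samples."""
    return abs(x - y) / (x + y) if (x + y) > 0 else 0.0

def multidim_dissimilarity(xi, yj):
    """Equation (6): average the per-dimension dissimilarities of two nD samples."""
    nc = len(xi)
    return sum(point_dissimilarity(xi[c], yj[c]) for c in range(nc)) / nc

print(multidim_dissimilarity([100.0, 20.0], [120.0, 10.0]))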
point x̂k (or ŷk). The set of pairs is totally ordered, so the X-axis values of the points x̂k and ŷk are set to k, following this time order. The Y-axis values are simply the 1D measures x_{i_k} and y_{j_k}.
Plotting points this way makes the perception of the distances or dissimilarities accumulated over the pairs of P easier, because they are directly measured along the Y-axis. Moreover, time distortions are visualized. Note that we used dotted lines to differentiate the time distortions from any initial constant part of the curves.
[Figure 2 plots: matchings of two artificial 1D curves for the three DTW variants, shown with the original cost (left column) and the bounded dissimilarity (right column). Linear: cost = 474.234, dissimilarity = 0.52; Restricted: cost = 218.97, dissimilarity = 0.30; Not-restricted: cost = 69.994, dissimilarity = 0.17.]
Figure 2 shows the comparison of two artificial 1D curves according to the variants of the DTW algorithm previously described; results are presented as follows:
– two columns: left, Sakoe and Chiba's original algorithm; right, our variant with a bounded dissimilarity degree;
– three pairs of rows, for the different ways of limiting the neighborhood: first the "linear", then the "p-restricted", and finally the "not-restricted" variant. Results are presented in the classical way, then in the way we propose.
The matchings visualized in Figure 2 show how similar both types of algorithms are, whether distance or dissimilarity oriented: the only significant difference appears in the costs, normalized in [0,1]. Furthermore, this example attests to the role of the neighborhood restriction, which allows limiting the time distortions.
ND signals acquisition. In this study, nD signals were gathered in the LOG laboratory1 from different phytoplanktonic species living in the Eastern Channel, with a CytoSense flow cytometer (CytoBuoy2), and labelled by biologists [3] once they had been isolated from the natural environment.
Flow cytometry is a technique used to characterize individual particles (cells or bacteria) driven by a liquid flow at high speed in front of a laser light (cf. Figure 3). Different signals, either optical or physical, are provided: forward scatter measures (which reflect the particle length), sideward scatter measures (which are more dependent on the particle internal structure) and fluorescence measures at several wavelengths (which depend upon the type of its photosynthetic pigments).
More precisely, in the signal library used, each detected particle is described by 8 monodimensional raw signals issued from the flow cytometer under identical experimental conditions (same sampling rates, same detection threshold, etc.). These signals are composed of voltage measures (mV), and their sampling period was chosen here to correspond to a 0.5 µm displacement of the water flow. Consequently,
1 Laboratoire d’Océanologie et de Géosciences, UMR 8187: http://log.univ-littoral.fr
2 Cytobuoy system: http://www.cytobuoy.com
Fig. 3. Signals acquisition with a flow cytometer, image extracted from CytoBuoy’s site
the longer the cell is, the higher the number of sampled measures is, and the time axis
can be interpreted as a spatial length axis.
Phytoplanktonic species identification is a hard task, which is why all these signals are used to make the particle characterization as complete as possible. Each particle in our experiment is consequently characterized by an 8D signal.
Description of the studied Phytoplankton cells. The dataset comes from a single cell culture sample, whose particles belong to 7 distinct Phytoplanktonic species: Chaetoceros socialis, Emiliania huxleyi, Lauderia annulata, Leptocylindrus minimus, Phaeocystis globosa, Skeletonema costatum and Thalassiosira rotula.
Each species is equally represented by 100 Phytoplanktonic cells, which were la-
belled by biologists using a microscope [3].
Figures in Table 1 show some signal samples of species Lauderia annulata and Emil-
iania huxleyi. For the first species, three individuals are selected: two very close, and an
outlier.
Despite a high similarity between the profiles, intra-species variability can be quite
important. In particular rising and falling edges of Lauderia annulata signals are not
exactly synchronous. The curves SWS HS (the highest ones) of L. annulata species
show a size variability (L. annulata 10: 45µm, L. annulata 11: 55µm and L. annulata
5: 90µm), as well as a variability of the nucleus position (at the center of the cell for L.
annulata 10-11, but clearly left shifted for L. annulata 5). In the case of FL RED HS
signals (the second highest ones), we can also see differences in spatial shifts and in in-
tensity levels between L. annulata 5 and the two others: this is due to different positions
and different numbers of chloroplasts in cells (cf. Fig. 3).
[Table 1 figures: cytometric profiles of the selected cells, plotted as voltage level (mV) versus length (µm) or time (×0.5 µs), with one curve per signal (FWS, SWS_LS, FL_YELLOW_LS, FL_ORANGE_LS, ...).]
The last example, E. huxleyi, is an extreme case showing how similar the cytometric curves of distinct species can be. However, the length of this particle is clearly smaller in this particular case.
4 Conclusion
In this paper, we proposed a conjoint dissimilarity [0,1]-measure for signals, based upon
their shape. Such a bounded measure makes the interpretation by human users easier,
and it can also be more relevant than a simple distance in some applications like the
one presented. This dissimilarity measure was adapted to multidimensional signals, by
equally weighting each dimension.
The proposed measure was applied to the automatic classification of Phytoplanktonic cells, which appears to be an innovative application: only a few automatic species recognition methods have been proposed so far. The experiment was performed on a labelled set of 700
Phytoplankton cells, with 100 cells per species. The quality of the obtained rates (which
reach 97.2%) tends to show the relevance of the proposed dissimilarity measure, first in
comparison with more classical distortion measures, then in comparison with a feature-
based characterization.
These promising results encourage future work, such as the use of other distances (for instance, in order to weight the distinct signal dimensions), or the fusion of this distortion dissimilarity with other dissimilarity measures (for instance, a duration dissimilarity).
Acknowledgements. The authors thank all the members of the LOG laboratory who
took part in the data acquisition of the nD signals data, in particular L. Felipe Artigas,
Natacha Guiselin, Elsa Breton, and Xavier Mériaux. The data were collected thanks
to project CPER ”Phaeocystis Bloom” funded by Région Nord-Pas-de-Calais, Europe
(FEDER) and Université du Littoral Côte d’Opale.
They also thank Prof. Denis Hamad and our institute, ULCO, for coordinating and obtaining the BQR project ”PhytoClas”.
References
1. The European Parliament, The European Council: Directive 2000/60/EC of the European Parliament and of the Council of 23 October 2000 establishing a framework for Community action in the field of water policy. Official Journal of the European Communities EN 2000/60/EC (2000)
2. Lund, J., Kipling, G., Cren, E.L.: The inverted microscope method of estimating algal numbers and the statistical basis of estimation by counting. Hydrobiologia 11, 143–170 (1958)
3. Guiselin, N., Courcot, L., Artigas, L.F., Jéloux, A.L., Brylinski, J.M.: An optimised protocol to prepare Phaeocystis globosa morphotypes for scanning electron microscopy observation. Journal of Microbiological Methods 77(1), 119–123 (2009)
4. Cloern, J.E.: Phytoplankton bloom dynamics in coastal ecosystems: a review with some general lessons from sustained investigation of San Francisco Bay, California. Reviews of Geophysics 34, 127–168 (1996)
5. Takabayashi, M., Lew, K., Johnson, A., Marchi, A., Dugdale, R., Wilkerson, F.P.: The effect of nutrient availability and temperature on chain length of the diatom, Skeletonema costatum. Journal of Plankton Research 28(9), 831–840 (2006)
6. Sakoe, H., Chiba, S.: Dynamic programming algorithm optimization for spoken word recog-
nition. IEEE Transactions on Acoustics, Speech, and Signal Processing ASSP-26(1), 43–49
(1978)
7. Niels, R., Vuurpijl, L.: Introducing trigraph - trimodal writer identification. In: Proc. European
Network of Forensic Handwr. Experts (2005)
Revealing the Structure of Childhood Abdominal Pain
Data and Supporting Diagnostic Decision Making
1 Introduction
Medical diagnosis by physicians, based on medical protocols and observation of the patient, often entails risks, which can be effectively mitigated by intelligent computational techniques.
Intelligent computational techniques are able to make a reliable suggestion about a particular instance of the problem [20], thereby facilitating accurate diagnosis. It has been shown that the use of such methods during medical diagnosis not only increases the overall diagnostic accuracy but also markedly improves the physician’s diagnostic performance [13,14,11,12,10], helping focus his effort on the selection of relevant knowledge and the proper use of the selected facts [21].
Machine learning is one of the major branches of artificial intelligence, and indeed machine learning algorithms were from the very beginning designed and used to analyse medical data sets [18]. Consequently, an abundance of applications has been developed for solving medical diagnosis problems: Artificial Neural Networks (ANN), Genetic Algorithms (GA) and Symbolic Learning are some of these approaches. A very incomplete and only indicative list of applications of machine learning in medical diagnosis includes applications in oncology [39], rheumatology [37], neuropsychology [38], gynaecology [40], urology and cardiology [18].
One of the most common chronic pain syndromes in children is recurrent abdominal
pain, affecting 10-30% of all school-aged children [2,3]. Although many cases of acute
abdominal pain are benign, some require rapid diagnosis and herald a surgical or medi-
cal emergency. Rapid and certain diagnosis in these cases is essential to avoid undesir-
able complications, such as peritonitis, hemorrhage, infertility, unnecessarily performed
laparotomy and chronic abdominal pain or anxiety disorders in adult life [6,7].
Abdominal pain often presents a diagnostic problem to the clinician, because the disease is too protean to admit of a certain methodology or to be diagnosed by clinical instinct. Hospitalization, followed by active clinical observation, has for many years been the most widely used method of clinical management of abdominal pain. However, this method entails negative consequences for young patients: children miss, on average, 21 more days of school per year [29], as the median length of stay in hospital for appendicitis ranges from 2 days (non-ruptured appendicitis) to 11 days (ruptured appendicitis), and days on antibiotics vary from 4.6 to 7.9 [1]. Furthermore, the predictive value of clinical diagnosis reached with this method has been estimated at between 68% and 92% [8,9], while in general the diagnostic inaccuracy rate has ranged from 15% to 20% in recent years [11]. This raises the need to adopt computer-aided methods in medical diagnosis, in order to improve clinical outcomes.
In this study, the application of a genetic clustering algorithm and a random forest classifier to appendicitis diagnosis is explored, using 15 clinical and laboratory factors that are commonly used in clinical practice for the diagnosis of appendicitis in children [22,23].
The paper is organized as follows: Initially, in Section 2 the Random Forests classi-
fier is introduced. Then, Section 2.1 presents the Experiment Setup and Section 2.2
discusses the results of the application of the RF classifier to real data from the Pediatric
Surgery Department of the University Hospital of Alexandroupolis, Hellas. Section 3
deals with the Genetic Clustering and Feature Selection Approach. Section 3.1 presents
the Experiment Setup and Experimental Results are given in Section 3.2. Finally,
Section 4 summarizes and concludes the paper.
The samples were classified, as presented in Tables 1-3, into either two categories (normal vs operative treatment category), into three categories (normal category, operative treatment category or emergency operative treatment category), or into seven categories (discharge, observation, no-findings, focal, phlegmonous, gangrenous, appendicitis or peritonitis category).
RF experiments were performed using the Matlab interface of Ting Wang on top of the Random Forest implementation (version 3.3) written by Leo Breiman and Adele Cutler. During the RF experiments we varied two parameters: the number N of trees that form the forest, and the number m of variables selected at random out of the totality of variables to be used to split the internal nodes of each tree. Specifically, we increased the number of trees from 50 to 200 with a step of 50, and the number of variables m from 2 to 8 with a step of 1.
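As an illustration of this parameter sweep (a minimal sketch only, using scikit-learn rather than the Breiman–Cutler code and its Matlab interface; the names X and y for the clinical feature matrix and the label vector are assumptions):
```python
# Sketch: sweep the number of trees N and the number m of randomly selected
# variables per split, scoring each pair with 5-fold cross-validation.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def sweep_random_forest(X, y, n_trees=(50, 100, 150, 200), m_values=range(2, 9)):
    """Return the mean cross-validated accuracy for each (N, m) pair."""
    results = {}
    for n in n_trees:
        for m in m_values:
            clf = RandomForestClassifier(n_estimators=n, max_features=m, random_state=0)
            results[(n, m)] = cross_val_score(clf, X, y, cv=5).mean()
    return results
```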
For comparison with the RF application, CART decision trees were implemented using the Statistics Toolbox, Version 2, of MATLAB. Throughout the experiments two parameters were adjusted using an internal cross-validation procedure: the criterion for choosing a split and the number of observations per node needed for a split to be made.
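A rough counterpart of this tuning, again sketched with scikit-learn rather than the Matlab Statistics Toolbox, could look as follows; the candidate values are illustrative:
```python
# Sketch: tune the split criterion and the minimum number of observations
# required to split a node via internal cross-validation.
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    "criterion": ["gini", "entropy"],          # criterion for choosing a split
    "min_samples_split": [2, 5, 10, 20, 50],   # observations needed for a split
}
cart_search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
# cart_search.fit(X_train, y_train); cart_search.best_params_
```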
Due to the limited amount of data cases, the 7-category RF classification produced poor results and was not considered further. The random forest classifier performed well over the data using a 3-category classification. Interestingly, the 2-category classification also resulted in a very high error rate, suggesting that the Operative Treatment and the Emergency Operative Treatment cases could have different diagnostic patterns which should not be grouped together. To investigate this further, we employed the genetic clustering method described in Section 3.
The results of the Random Forest application for the 3-category classification, averaged over several experiments and for different pairs of parameters, are shown in Table 6. The test set error rate ranged from 3% to 4.90% using all the features (input with Subset 1) and was limited to 3%–4.30% when using part of the features (input with Subsets 2 and 3). As expected, the performance was relatively insensitive to the parameter values.
Observing the indicative confusion matrices in Table 7, we can see in each column how the samples were distributed among the different categories according to the RF simulation. The main diagonal denotes the performance. As shown, the highest percentage of misclassification is observed in the operative treatment category, while there is no divergence in the emergency operative treatment category. This could suggest that only the emergency operative treatment cases have a diagnostic pattern that differs significantly from the normal cases, explaining the difficulties in the diagnosis of abdominal pain symptoms.
In Fig. 1 the average Variable Importance Graphs are displayed, as output by Breiman’s RF implementation. The important features, as depicted in the graph in Fig. 1(a), are related to the diagnostic factors of tenderness, rebound, leucocytosis, neutrophilia and urinalysis (diagnostic features #9, #10, #11, #12 and #13, according to Table 4). This result motivated the formation of input subset 3, presented in Table 5, which consisted of only these factors. Experimenting only with subset 3 inputs resulted in the graphs shown in Fig. 1(b). Again the diagnostic factors relating to leucocytosis and neutrophilia (diagnostic features #11 and #12) achieved a very high score, which appears very rational from a medical point of view.
Fig. 1. Average Variable Importance Graphs for (a) subset 1 (all diagnostic features), and (b)
subset 3 (diagnostic features #9, #10, #11, #12 and #13)
Table 8. Average confusion matrices (%) for Decision Trees implementations for 7 categories
of classification. Results on the testing set.
Coding            -2      -1      0      1     2     3     4
 -2            23.53   58.62  55.88  81.82    75    50  33.33
 -1            47.06   34.48  29.41   9.09     0    50  66.67
  0                0       0      0      0     0     0      0
  1                0       0  11.77      0    25     0      0
  2            29.41       0   2.94      0     0     0      0
  3                0    3.45      0   9.09     0     0      0
  4                0    3.45      0      0     0     0      0
Number of cases   17      29     34     11     4     2      3
Table 9. Average confusion matrices (%) for Decision Trees implementations for 3 categories
of classification. Results on the testing set.
Early works using GAs for clustering, feature selection and classification can be found in [32, 33]. In [34] a centroid-based real encoding of the chromosomes was proposed. According to this technique, each chromosome is a sequence of real numbers representing the centers of the K clusters. For a data set with N features, the length of each chromosome of the GA is N·K. For each individual of the GA population, the K cluster centers are initialized to randomly chosen points from the data set. If $x_i$, $i = 1, 2, \ldots, n$, are the data points in the N-dimensional space, each point is assigned to the cluster $C_j$ with center $z_j$ if
$$\| x_i - z_j \| \le \| x_i - z_p \|, \qquad p = 1, 2, \ldots, K. \qquad (1)$$
After the clustering is done, the cluster centers are replaced by the mean points of the respective clusters. The task of the GA is to minimize the distance of the data points from the corresponding cluster centers, and thus to search for the optimal assignment of the data points to the corresponding clusters.
After the formation of the clusters, each cluster is assigned a class label according to a majority vote of the examples that populate it (i.e., each example votes for its known true class). The classification performance depends on the number of examples whose class labels differ from the corresponding cluster class label. The goal of using a clustering algorithm is not only to measure the classification performance but, more importantly, to study whether the selected features are inherently able to discover the different classes and thereby prove their effectiveness.
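A minimal sketch of one such evaluation step (the Eq. (1) assignment, the update of the centers by cluster means, and the majority-vote scoring) is given below; it is illustrative rather than the authors' exact implementation, and it assumes the class labels are coded as non-negative integers:
```python
# Sketch: assign points to their nearest center (Eq. 1), recompute centers as
# cluster means, and score the partition by majority-vote class labels.
import numpy as np

def evaluate_centres(X, y, centres):
    """X: (n, N) data, y: (n,) non-negative integer labels, centres: (K, N)."""
    dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    assign = dists.argmin(axis=1)                     # nearest-centre assignment
    new_centres = np.array([X[assign == k].mean(axis=0) if np.any(assign == k)
                            else centres[k] for k in range(len(centres))])
    correct = 0
    for k in range(len(centres)):                     # majority vote per cluster
        members = y[assign == k]
        if members.size:
            correct += np.max(np.bincount(members))
    return correct / len(y), new_centres
```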
Moreover, in the present work, GAs were used for an additional task: the genetic selection of the features to be considered for the classification of the clinical cases into clusters. Thus, the genome of the GA individuals is composed of two parts. The first part is a substring of 15 bits related to the corresponding 15 features of the clinical data set presented in Table 4. In this binary substring a value of 0 denotes that the corresponding feature is ignored and therefore not considered during the clustering procedure of the clinical cases. Conversely, a value of 1 in this binary substring denotes that the corresponding feature is considered during the clustering procedure. In addition, the genome contains, in binary form, the coordinates of the centers of the clusters. In order to further investigate the hypothesis outlined in Section 2.2, two clusters are considered: the normal cluster and the pathological (operative treatment) cluster. Again, in accordance with Table 3, the normal cluster groups the cases that need no additional medical treatment (coded by -2, -1 and 0 in the data set). All remaining cases (coded by 1, 2, 3 or 4 in
the data set) are grouped in the pathological cluster. Therefore, the GA deals with two objectives: (a) to search for the most suitable subset of features to be considered for the classification of the cases into clusters, and (b) to find the coordinates of the cluster centers that minimize the distance of the points corresponding to the clinical cases from those cluster centers. Regarding the fitness function of the GA, the number of correctly classified cases gives the fitness of the GA individuals. Since the GA genome is coded in binary form, all the well-known genetic operators for binary crossover and binary mutation can be applied. For the experiments performed for the purposes of the present work, the GA was implemented in the Matlab® programming environment, using the GA and Direct Search tool incorporated there. Mostly, the scattered crossover operator was used for crossover and the binary flip mutation operator was used for the generation of mutants. For purposes of comparison, the genetic clustering results are compared to the corresponding ones obtained using the K-means clustering algorithm [35, 36].
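The genome handling could be sketched as follows, reusing the evaluate_centres helper from the earlier sketch; the bit width per coordinate, the choice of K = 2, and the scaling of the data to [0, 1] are assumptions made for illustration, not details given in the paper:
```python
# Sketch: decode a binary genome consisting of a 15-bit feature mask followed
# by binary-coded cluster centres, then score it by the number of correctly
# classified cases (here reported as a fraction).
import numpy as np

N_FEATURES, K, BITS = 15, 2, 8   # BITS per coordinate is an illustrative choice

def decode(genome):
    """genome: binary sequence of length N_FEATURES + K * N_FEATURES * BITS."""
    mask = np.array(genome[:N_FEATURES], dtype=bool)
    centre_bits = np.array(genome[N_FEATURES:]).reshape(K, N_FEATURES, BITS)
    weights = 2.0 ** np.arange(BITS - 1, -1, -1)
    centres = (centre_bits * weights).sum(axis=2) / (2 ** BITS - 1)  # in [0, 1]
    return mask, centres

def fitness(genome, X, y):
    """X is assumed scaled to [0, 1]; y holds non-negative integer class codes."""
    mask, centres = decode(genome)
    if not mask.any():
        return 0.0
    acc, _ = evaluate_centres(X[:, mask], y, centres[:, mask])
    return acc
```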
Table 10. Error rates (%) of the K-means clustering algorithm (testing set)
                      Error (%) per subset of features
Number of clusters        1         2         3
        2               4.07     48.26      4.26
        3              45.74     49.42     21.90
        7              83.72     82.95     87.98
Table 11. Error rates (%) of the Genetic Algorithm clustering algorithm and Genetic Feature
Selection
The results obtained in several experiments of genetic clustering and feature selection are presented in Table 11. The implemented GA managed to cluster all 516 cases into two clusters (the “normal” and the “pathological” ones) in a very satisfactory fashion, resulting in a classification error ranging from 1.94% (10 misclassified cases out of 516 in total) to 3.88% (20 misclassified cases out of 516 in total). Most significantly, the features used by the GA are limited to between 2 and 4 features. According to Table 4, the features considered by the GA for the clustering of the cases refer to the religion of the patients (feature #2), the leucocytosis and neutrophilia levels (features #11 and #12), and the body temperature (feature #14). As far as
the clustering classification itself is concerned, the GA clustering outperforms the results obtained by the K-means clustering algorithm (presented in Table 10).
4 Conclusions
In this paper, a Random Forest approach and a Genetic Clustering Algorithm were employed for the prediction of abdominal pain in children. To the best of our knowledge, no similar analysis strategy for abdominal pain data has been reported in the literature previously. Random Forests were able to efficiently tackle the diagnosis prediction problem of the 3-category classification approach, achieving performance of up to 97%. Error rates fluctuated from 3% to 4.9% using all three types of input subsets. A crucial point is that the RF classifiers performed well over the emergency operative treatment cases (100% performance). However, as far as the operative treatment cases are concerned, the percentage of misclassification was quite high (up to 33% of the cases were misclassified to the normal category), which indicates that RF fails to recognize the seriousness of the corresponding symptoms. Decision tree classifiers, in contrast, failed even to distinguish the necessity of operative treatment, misclassifying 86% of the dataset.
The genetic clustering and feature selection approach revealed the underlying data structure and succeeded in clustering the 2-category data, for which the RF approach failed, with an error rate ranging from 1.94% to 3.38%. Additionally, it clearly outperformed the K-means clustering approach, which resulted in an error rate above 4%. The important features automatically identified by both approaches (RF and Genetic Clustering) were limited in number and in agreement with medical knowledge. Interestingly, the Genetic Algorithm selected religion among the important features. Although at first glance this selection seems inappropriate, it reflects the way religion influences different dietary habits, which are ultimately associated with abdominal pain symptoms.
References
1. Newman, K., Ponsky, T., Kittle, K., Dyk, L., Throop, C., Gieseker, K., Sills, M., Gilbert,
J.: Appendicitis 2000: Variability in practice, outcomes, and resource utilization at thirty
pediatric hospitals. In: 33rd Annual Meeting of the American Pediatric Surgical Associa-
tion, Phoenix, Arizona, May 19-23 (2002)
2. Apley, J., Naish, N.: Recurrent abdominal pains: A field survey of 1000 school children.
Archives of Disease in Childhood 33, 165–170 (1958)
3. Hyams, J.S., Burke, G., Davis, P.M., Rzepski, B., Andrulonis, P.A.: Abdominal pain and
irritable bowel syndrome in adolescents: A community-based study. J Pediatr 129, 220–
226 (1996)
4. Campo, J.V., DiLorenzo, C., Chiappetta, L., Bridge, J., Colborn, D.K., Gartner Jr., J.C.,
Gaffney, P., Kocoshis, S., Brent, D.: Adult outcomes of pediatric recurrent abdominal
pain: do they grow out of it? Pediatrics 108(e1) (2001) doi: 10.1542/peds.108.1.e1
5. Berry Jr., J., Malt, R.A.: Appendicitis near its centenary. Massachussetts General Hospital
and the Department of Surgery, Harvard Medical School
6. Paterson-Brown, S.: Emergency laparoscopy surgery. Br. J. Surg. 80, 279–283 (1993)
7. Olsen, J.B., Myrén, C.J., Haahr, P.E.: Randomized study of the value of laparoscopy be-
fore appendectomy. Br. J. Surg. 80, 922–923 (1993)
8. Raheja, S.K., McDonald, P., Taylor, I.: Non specific abdominal pain an expensive mys-
tery. J. R. Soc. Med. 88, 10–11 (1990)
9. Hawthorn, I.E.: Abdominal pain as a cause of acute admission to hospital. J. R. Coll. Surg.
Edinb. 37, 389–393 (1992)
10. McAdam, W.A., Brock, B.M., Armitage, T., Davenport, P., Chan, M., de Dombal, F.T.:
Twelve years experience of computer-aided diagnosis in a district general hospital. Aire-
dale District General Hospital, West Yorkshire
11. Sim, K.T., Picone, S., Crade, M., Sweeney, J.P.: Ultrasound with graded compression in
the evaluation of acute appendicitis. J. Natl. Med. Assoc. 81(9), 954–957 (1989)
12. Graham, D.F.: Computer-aided prediction of gangrenous and perforating appendicitis. Br. Med. J. 2(6099), 1375–1377 (1977)
13. de Dombal, F.T., Leaper, D.J., Horrocks, J.C., Staniland, J.R., McCann, A.P.: Human and
Computer-aided Diagnosis of Abdominal Pain: Further Report with Emphasis on Perform-
ance of Clinicians. Br. Med. J. 1(5904), 376–380 (1974)
14. de Dombal, F.T., Leaper, D.J., Staniland, J.R., McCann, A.P., Horrocks, J.C.: Computer-aided Diagnosis of Acute Abdominal Pain. Br. Med. J. 2(5804), 9–13 (1972)
15. Weydert, J.A., Shapiro, D.B., Acra, S.A., Monheim, C.J., Chambers, A.S., Ball, T.M.:
Evaluation of guided imagery as treatment for recurrent abdominal pain in children: a ran-
domized controlled trial. BMC Pediatr. 6, 29 (2006)
16. Gardikis, S., Touloupidis, S., Dimitriadis, G., Limas, C., Antypas, S., Dolatzas, T., Poly-
chronidis, A., Simopoulos, C.: Urological Symptoms of Acute Appendicitis in Childhood
and Early Adolescence. Int. Urol. Nephrol. 34, 189–192 (2002)
17. Mantzaris, D., Anastassopoulos, G., Adamopoulos, A., Gardikis, S.: A non-symbolic im-
plementation for abdominal pain estimation in childhood. Information Sciences 178(20),
3860–3866 (2008)
18. Kononenko, I.: Machine Learning for Medical Diagnosis: History, State of the Art and
Perspective. University of Ljubljana Faculty of Computer and Information Science
19. Boinee, P., de Angelis, A., Foresti, G.L.: Meta Random Forests. International Journal of
Computational Intelligence 2, 3 (2006)
20. Tsirogiannis, G.L., Frossyniotis, D., Stoitsis, J., Golemati, S., Stafylopatis, A., Nikita,
A.S.: Classification of Medical Data with a Robust Multi-Level Combination Scheme.
School of Electrical and Computer Engineering National Technical University of Athens
21. Rubin, A.D.: Artificial Intelligence approaches to medical diagnosis. MIT Press, Cam-
bridge
22. McCollough, M., Sharieff, G.Q.: Abdominal Pain in Children. University of California,
San Diego, Pediatr Clin. N Am. 53, 107–137 (2006)
23. de Edelenyi, F.S., Goumidi, L., Bertrans, S., Phillips Ross McManus, C., Roche, H.,
Planells, R., Lairon, D. Springer, Heidelberg (2008)
24. Bailey, S.: Lawrence Berkeley National Laboratory (University of California), Paper
LBNL-696E (2008)
25. Breiman, L.: Technical Report 670, Statistics Department University of California, Berke-
ley, September 9 (2004)
26. Liaw, A., Wiener, M.: Classification and Regression by Random Forest. The Newsletter of
the R Project, vol. 2/3 (December 2002)
27. Holland, J.H.: Adaptation in Natural and Artificial Systems. Univ. Michigan Press, Ann
Arbor (1975)
28. Goldberg, D.: Genetic Algorithms in Search, Optimization, and Machine Learning. Addi-
son-Wesley, Reading (1989)
29. Li, B.U.: Recurrent abdominal pain in childhood: an approach to common disorders. Com-
prehensive Therapy 13, 46–53 (1987)
30. Gislason, P.A., Benediktsson, J.A., Sveinsson, J.R.: Random Forests for land cover classi-
fication. Pattern Recognition Letters 27, 294–300 (2006)
31. Bandyopadhyay, S., Pal, S.K.: Classification and learning using genetic algorithms.
Springer, Heidelberg (2007)
32. Tseng, L., Yang, S.: Genetic Algorithms for clustering, feature selection, and classifica-
tion. In: Proceedings of the IEEE International Conference on Neural Networks, Houston,
pp. 1612–1616 (1997)
33. Bhuyan, N.J., Raghavan, V.V., Venkatesh, K.E.: Genetic algorithms for clustering with an
ordered representation. In: Proceedings of the Fourth International Conference Genetic Al-
gorithms, pp. 408–415 (1991)
34. Maulik, U., Bandyopadhyay, S.: Genetic algorithm based clustering technique. Pattern
Recognition 33, 1455–1465 (2000)
35. Jain, A.K., Dubes, R.C.: Algorithms for clustering data. Prentice-Hall, Englewood Cliffs
(1988)
36. Tou, J.T., Gonzalez, R.C.: Pattern Recognition Principles. Addison-Wesley, Reading
(1974)
37. Karalic, A., Pirnat, V.: Significance Level Based Classification with Multiple Trees. In-
formatica 15(1), 54–58 (1991)
38. Muggleton, S.: Inductive Acquisition of Expert Knowledge. Turing Institute Press & Addison-Wesley (1990)
39. Bratko, I., Kononenko, I.: Learning Rules from Incomplete and Noisy Data. In: Phelps, B.
(ed.) Interactions in Artificial Intelligence and Statistical Methods. Technical Press, Hamp-
shire (1987)
40. Nunez, M.: Decision Tree Induction Using Domain Knowledge. In: Wielinga, B., et al.
(eds.) Current Trends in Knowledge Acquisition. IOS Press, Amsterdam (1990)
41. Prasad, A.M., Iverson, L.R., Liaw, A.: Newer Classification and Regression Tree Tech-
niques: Bagging and Random Forests for Ecological Prediction. Ecosystems 9, 181–199
(2006)
Relating Halftone Dot Quality to Paper Surface
Topography
1 Introduction
The structure of the paper surface is one of the key issues that determine the attainable print quality. Overall roughness and surface compressibility measurements have been shown to be important in predicting the quality of gravure print [3, 9]. More recent studies have shown, using aligned 2D measurements, that topography can explain part of the missing ink in fulltone printing [13, 1].
The quality of halftone print is important since most printed material is produced by printing dot patterns (a.k.a. raster dots). This introduces a new challenge for the image-based analysis of the small-scale quality properties of the print: the regular dot pattern area must be separated from the void areas between the dots, so that the analysis of the 2D data can focus on those points that were supposed to be covered by ink. This ensures the correct detection of missing ink and poor quality dots. A robust
and accurate method for detecting the raster dot grid is presented in [14]. We use it to
extract each individual raster dot area from the 2D measurements of print reflectance
and surface topography.
In this paper we analyze the individual raster dots of two paper samples. First, in the explorative analysis part, we verify the hypothesis that there is a relation between the print quality and the unprinted topography. The qualities of the printed dots are investigated and described using a Self-Organizing Map (SOM) [12]. We then calculate descriptive features of the printed dots to train another SOM, which is clustered [18] to further emphasize the most significant properties of the dots. We compare the results with the properties of the corresponding areas in the unprinted surface topography. Finally, for the second part of the study, we select one paper sample and divide the printed dots into high and low print quality. We use the features from the unprinted paper surface topography to classify the corresponding print areas using Support Vector Machine classification [5].
In the following section we briefly introduce the paper samples under study and the process required to acquire the data for our analysis. In the next sections we describe the methods applied to the data and present the results of the analysis of the print quality and of the classification of the dots based on the topography. Concluding remarks are given in the last section.
2 Data Acquisition
We examine supercalendered (SC) paper samples printed with an IGT gravure test printer [10]. The test set contains paper samples of three roughness levels, each one printed with three printing nip pressure levels. In this work we study the two extreme cases: the one with the highest roughness and the lowest force in the printing nip, and the one with the lowest roughness and the highest force. We refer to these cases as sample A and sample B, respectively. We focus on the conventional screening area of the Heliotest strip, which consists of a dense regular raster dot pattern produced by the engraved cells on the printing cylinder. Examples of printed paper strips, with an arrow pointing to the area under study, are shown in Fig. 1. The upper corners are marked with pencil for the alignment of the measurements before and after printing.
The selected areas have been imaged with a photometric stereo device that applies the principles described in [8]. It is based on photographic imaging with slanting illumination and it provides both reflectance and surface topography maps of the paper sample. The image size is 22.5 x 15 mm and contains 2268 x 1512 pixels, thus
the pixel size is approximately 10 x 10 μm. The same area of each paper strip is imaged before and after printing, and these images are aligned at subpixel accuracy [15]. The photometric stereo images are in RGB colors, but the topography map is computed from the mean of the color channels. In the analysis of the reflectance of the printed paper, we use only the green channel, which is the most sensitive to the red color used in printing.
The basic regular structure of the printed dots is detected using a 2D FFT, and the final positions of the dots are estimated at subpixel resolution by maximizing the local spatial correlation between the print reflectance and a raster pattern model [14]. The latter step is essential to allow the detected point grid to deviate slightly from a perfectly regular pattern. The raster dots are extracted from the print reflectance map by selecting a square around each grid point and interpolating the pixel values inside each square so that the subpixel grid point becomes the geometrical centre of the square. Each of the final interpolated dots is represented by a matrix of 13 by 13 pixels. Finally, certain erroneous measurement values are eliminated from the data. They occur at transparent fibers on the paper surface, which appear as depressions in the topography map although they are really elevations. They are detected from the measurements of the unprinted paper by principal component analysis [11] of the color channels. After removing the areas that contain fibers or other artifacts we have 12628 individual raster dots remaining in sample A and 10822 dots in sample B.
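The patch-extraction step described above could be sketched as follows; the grid detection itself follows [14] and is not reproduced here, and the SciPy-based interpolation below is a generic stand-in rather than the authors' implementation:
```python
# Sketch: cut a 13x13 patch whose geometrical centre is a subpixel grid point,
# using bilinear interpolation of the reflectance or topography map.
import numpy as np
from scipy.ndimage import map_coordinates

def extract_dot(image, cx, cy, size=13):
    """image: 2D map; (cx, cy): subpixel column/row coordinates of the dot centre."""
    offsets = np.arange(size) - (size - 1) / 2.0
    cols_grid, rows_grid = np.meshgrid(cx + offsets, cy + offsets)
    # order=1 gives bilinear interpolation; edge values are repeated at the border.
    return map_coordinates(image, [rows_grid, cols_grid], order=1, mode="nearest")
```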
3 Analysis Methods
In Section 4 we use Self-Organizing Maps (SOM) to present the main characteristics of the print quality. We use clustering in Section 5 to further emphasize the most significant properties of the printed dots. Support Vector Machines (SVM) are used in Section 6 to study whether the surface topography of unprinted paper is sufficient to classify the paper areas into high and low final print quality.
All the methods we use have several parameters which affect the results. They are all also influenced by the preprocessing of the data, especially the scaling and weighting of the variables [6, 7]. We show the parameter optimization computations for the SVM classification only. The details of the other decisions are discussed in the appropriate sections.
node. Belonging to a node means that the code vector of that node has the minimum
distance of all nodes to the data point. The standard deviations of the data are pre-
sented in a similar manner. We also use the hit histogram, which shows the number of
the data points that belong to each of the nodes.
3.2 Clustering
Clustering is a term for methods that discover groups of similar objects in multivariate data where no predefined classes exist, and thus there are no known right or wrong results [6]. In this study we cluster the code vectors of the SOM [18], which adds another condensed view of the data. Using the SOM as an intermediate step reduces the computational load. Even though these data do not have evident clusters, hierarchical clustering with complete linkage [11] reveals informative groups in the data.
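A stand-in sketch of this two-stage procedure is given below; the authors used the Matlab SOM Toolbox [19], whereas the MiniSom package and SciPy are assumed here as substitutes, and the map size and iteration count are illustrative:
```python
# Sketch: train a SOM on the dot feature vectors, then group its code vectors
# by complete-linkage hierarchical clustering.
from minisom import MiniSom
from scipy.cluster.hierarchy import fcluster, linkage

def som_then_cluster(features, map_x=20, map_y=10, n_clusters=10):
    som = MiniSom(map_x, map_y, features.shape[1], sigma=1.0, learning_rate=0.5)
    som.pca_weights_init(features)              # initialize along principal components
    som.train_random(features, num_iteration=10000)
    code_vectors = som.get_weights().reshape(-1, features.shape[1])
    labels = fcluster(linkage(code_vectors, method="complete"),
                      n_clusters, criterion="maxclust")
    return som, code_vectors, labels
```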
Support Vector Machines (SVM) are learning systems that use a hypothesis space of linear functions in a high-dimensional kernel-induced feature space [5]. In Section 6 we use SVM classifiers on the features of the topography measurement before printing to classify the printed reflectance into good and poor quality.
In the first subsection we use C-Support Vector Classification (C-SVC) [2, 4]. Given $l$ training vectors $x_i$ in two classes, labelled by $y_i \in \{-1, 1\}$, C-SVC solves the primal problem
$$\min_{w, b, \xi} \; \tfrac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i \quad \text{s.t.} \quad y_i \big( w^T \phi(x_i) + b \big) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \; i = 1, \ldots, l. \qquad (1)$$
In the second subsection we use ν-SVM [16] to find out whether sparser solutions are available. The training errors and the number of support vectors can be controlled by an additional parameter ν. The primal form is
$$\min_{w, b, \xi, \rho} \; \tfrac{1}{2} w^T w - \nu \rho + \tfrac{1}{l} \sum_{i=1}^{l} \xi_i \quad \text{s.t.} \quad y_i \big( w^T \phi(x_i) + b \big) \ge \rho - \xi_i, \;\; \xi_i \ge 0, \; i = 1, \ldots, l, \;\; \rho \ge 0. \qquad (2)$$
The dual of the scaled version used in [4] is
$$\min_{\alpha} \; \tfrac{1}{2} \alpha^T Q \alpha \quad \text{s.t.} \quad 0 \le \alpha_i \le 1, \; i = 1, \ldots, l, \quad e^T \alpha = \nu l, \quad y^T \alpha = 0, \qquad (3)$$
where $Q$ is a positive semidefinite matrix with $Q_{ij} \equiv y_i y_j K(x_i, x_j)$ and the kernel is $K(x_i, x_j) \equiv \phi(x_i)^T \phi(x_j)$.
The resulting decision function is
$$\operatorname{sgn} \Big( \sum_{i=1}^{l} y_i \tfrac{\alpha_i}{\rho} K(x_i, x) + \tfrac{b}{\rho} \Big). \qquad (4)$$
Fig. 2. Typical raster dots in sample A. Average of the raster dots in each node (left), standard
deviation (middle) and the number of dots in each node (right).
The map size is 20 x 10 nodes and it is initialized along the first two principal components of the data space. Results for paper sample A are depicted in Fig. 2. Darker gray levels in the figure represent more ink in the print, while lighter levels and white depict points where ink is missing. There are a lot of poor quality dots. The highest values in the hit histogram are in the lower corners, which represent poor quality.
The corresponding results for paper sample B are shown in Fig. 3. Most of the map is covered with high quality dots. The highest values in the hit histogram are also at the upper side, which represents good quality.
Fig. 3. Typical raster dots in sample B. Average of the raster dots in each node (left), standard
deviation (middle) and the number of dots in each node (right).
These results give a general description of the variety of the printed dots. The fundamental properties of the dots seem to be related to the overall level of darkness and its variance, as well as the level and variance in the middle and at the edges of each dot. These observations are utilized in the following section.
The values of the overall mean, which represents the total amount of ink in a dot, rise from left to right in both samples. This clearly indicates that the good quality dots are represented by the nodes at the left end and the poor quality dots at the right end. This is also confirmed by the mean values in the inner and outer circles.
Fig. 4. Component lines of the 1-dimensional SOM in the 6-dimensional feature space
For a more condensed presentation we divide the data into ten clusters. The cluster borders are marked with vertical lines in Fig. 4.
The code vectors lie in a lower-dimensional feature space and cannot be used directly to visualize the data. Therefore we calculate the pixel-wise mean and standard deviation in each cluster. We also calculate the mean and standard deviation of the topography at the locations determined by the clustering of the reflectance.
The results for paper sample A are shown in Fig. 5. The numbers at the top refer to the percentage of dots contained in each cluster. The following line gives the cumulative percentage.
Fig. 5. Mean and standard deviation of the printed reflectance in paper sample A in clusters 1 to
10 (top). Results from the corresponding areas in unprinted topography (bottom).
On average, the high and low quality dots are organized at opposite ends of the line. Less than half of the dots are of proper quality; the rest, towards the right end, are more or less defective. The variation within the clusters also increases from left to right. The mean of the topography is lower in the clusters on the right, which means that there are deeper depressions in the paper in these clusters.
Fig. 6. Mean and standard deviation of the printed reflectance in paper sample B in clusters 1 to
10 (top). Results from the corresponding areas in unprinted topography (bottom).
6 SVM Classification
We select the highest and lowest quality raster dots from paper sample A for the classification study. We form two classes using only the data in the two clusters at each end of the 1-D SOM line. Clusters 1 and 2 form one class, representing high print quality, and clusters 9 and 10 form the class for poor quality. We use LIBSVM, a library for support vector machines [4], for the classification.
We balance the data set so that each class has the same number of dots, 1023 in both classes. 10% of these data are set aside as test data; thus there are 1840 dots in the training data and 206 in the test data. The same six features that were used in the SOM are now calculated from the measurement of the topography before printing. We scale the topography features to the range [0, 1] and give them the same weights that were used for the reflectance features: 0.9 and 0.4 for the circular means and standard deviations, respectively.
We use C-SVC classification with Radial Basis Function (RBF) kernels: $K(x_i, x_j) = \exp(-\gamma \| x_i - x_j \|^2)$, $\gamma > 0$. The classification accuracy from 5-fold cross-validation on the training data is depicted in Fig. 7 as a function of the coefficient C in Eq. (1) and the parameter γ of the RBF kernel.
The maximum value, 64.8%, highlighted by a star, was achieved at C = 0.5 and γ = 2^-3. The total number of support vectors in this model is 1530, so it is not a very sparse solution.
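A sketch of such a grid search, using scikit-learn's SVC (which itself wraps LIBSVM) rather than the Matlab interface, is shown below; the grid of powers of two is illustrative, since the exact grid is not stated:
```python
# Sketch: 5-fold cross-validated grid search over C and the RBF parameter gamma.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {"C": 2.0 ** np.arange(-5, 6),
              "gamma": 2.0 ** np.arange(-7, 4)}
svc_search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
# svc_search.fit(X_train, y_train)
# svc_search.best_params_, svc_search.best_score_
```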
In this subsection we study whether it is possible to obtain sparser solutions and reduce the number of support vectors with ν-SVM. The parameter C of the C-SVC is now replaced by ν, which acts as an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors. A 5-fold cross-validation procedure similar to the one above is conducted on the training data set.
Fig. 8. Accuracy of the ν-SVM classification (left) and the number of support vectors (right) as
functions of the parameters ν and γ
The results are presented in Fig. 8. The best classification accuracy, 64.5%, is achieved at γ = 2^2 and ν = 0.8. This model has 1485 support vectors, slightly fewer than in C-SVC.
In this final example we select ν = 0.8 and γ = 2^-4 from the previous experiments and calculate the probabilities for each data vector. The differences of the probabilities of the two classes are shown in Fig. 9. A negative value stands for class -1 and a positive one for class 1. The data are arranged so that the negative class is on the left and the positive one on the right, separated by a line.
Table 1 shows the proportion of the raster dots covered and the classification accuracy when only dots with high probability are considered. The probability threshold in this example is 0.7. The large variance in both reflectance and topography at the low print quality end, as observed in Figs. 5 and 6, indicates that factors other than topography depressions also cause printing defects. Thus the accuracy of classification based only on the relation of reflectance and topography remains low.
Table 1. Classification results including only the data where the probability p > 0.7
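The probability-filtered evaluation could be sketched as follows; this is a scikit-learn stand-in (the helper name and the use of predict_proba are assumptions for illustration, not the authors' code):
```python
# Sketch: fit a nu-SVM with probability estimates, keep only the test dots whose
# winning-class probability exceeds 0.7, and report coverage and accuracy.
import numpy as np
from sklearn.svm import NuSVC

def filtered_accuracy(X_train, y_train, X_test, y_test,
                      nu=0.8, gamma=2**-4, p_min=0.7):
    clf = NuSVC(nu=nu, gamma=gamma, kernel="rbf", probability=True).fit(X_train, y_train)
    proba = clf.predict_proba(X_test)
    confident = proba.max(axis=1) > p_min          # dots with high probability
    coverage = confident.mean()                    # proportion of dots covered
    pred = clf.classes_[proba.argmax(axis=1)]
    accuracy = (pred[confident] == np.asarray(y_test)[confident]).mean()
    return coverage, accuracy
```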
7 Conclusion
We have examined the quality of raster dots printed by an IGT gravure test printer. Two samples were studied: sample A with a high roughness level and low printing pressure, and sample B with a low roughness level and high printing pressure. The overall goals were to describe the quality of the raster dots and to find out whether the topography measurement before printing can be used to distinguish high and low quality dots, thus predicting the final print quality.
Visualizing the dots by SOM reveals several kinds of defects in the dots, missing ink in the middle being the most severe. The overall quality in sample B is higher than in sample A, as expected considering the smoothness of the paper surface and the
printing pressure. The best SVM classification results were below 65%. Although topography partly determines the contact between the paper surface and the printing roll, it does not classify the print quality very well. This is in line with prior studies, which also recognize that several other factors, such as surface compressibility and porosity, have a large impact on the printing result. However, the current work has shown the efficiency of SOM and SVC in the analysis and visualization of large data sets such as thousands of small images of printed dots, and the connections found between print quality and surface topography are plausible from the paper physics point of view.
In future work we will apply these methods to a wider range of print samples. Utilizing the classification probabilities seems promising for finding the common features of the surface topography that most likely affect the print quality. In this study we have only used the areas covered by the printed raster dots. The defects in print quality can also be caused by properties of a larger neighborhood in the unprinted topography, which will be studied further.
References
1. Barros, G.G.: Influence of substrate topography on ink distribution in flexography. Ph.D.
thesis, Karlstad University (2006)
2. Boser, B.E., Guyon, I., Vapnik, V.: A training algorithm for optimal margin classifiers. In:
Fifth Annual Workshop on Computational Learning Theory, pp. 144–152. ACM Press,
New York (1992)
3. Bristow, J.A., Ekman, H.: Paper properties affecting gravure print quality. TAPPI Jour-
nal 64(10), 115–118 (1981)
4. Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines (2001), http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf (last updated February 27, 2009); software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
5. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines and Other
Kernel-based Learning Methods. Cambridge University Press, Cambridge (2000)
6. Everitt, B., Landau, S., Leese, M.: Cluster analysis, 4th edn., Arnold (2001)
7. Gnanadesikan, R., Kettenring, J., Tsao, S.: Weighting and selection of variables for cluster
analysis. Journal of Classification 12(1), 113–136 (1995)
8. Hansson, P., Johansson, P.-Å.: Topography and reflectance analysis of paper surfaces us-
ing a photometric stereo method. Optical Engineering 39(9), 2555–2561 (2000)
9. Heintze, H.U., Gordon, R.W.: Tuning of the GRI proof press as a predictor of rotonews
print quality in the pressroom. TAPPI Journal 62(11), 97–101 (1979)
10. IGT Testing Systems: IGT Information leaflet W41 (2003)
11. Johnson, R.A., Wichern, D.W.: Applied multivariate statistical analysis, 4th edn. Prentice-
Hall, Englewood Cliffs (1998)
12. Kohonen, T.: Self-Organizing Maps. Series in Information Sciences, vol. 30. Springer,
Heidelberg (1995)
13. Mettänen, M., Hirn, U., Lauri, M., Ritala, R.: Probabilistic analysis of small-scale print de-
fects with aligned 2D measurements. In: Transactions of the 14th Fundamental Research
Symposium, Oxford, UK (Accepted for publication, 2009)
14. Mettänen, M., Lauri, M., Ihalainen, H., Kumpulainen, P., Ritala, R.: Aligned analysis of
surface topography and printed dot pattern maps. In: Proceedings of Papermaking Re-
search Symposium 2009, Kuopio, Finland (2009)
15. Mettänen, M., Ihalainen, H., Ritala, R.: Alignment and statistical analysis of 2D small-
scale paper property maps. Appita Journal 61(4), 323–330 (2008)
16. Schölkopf, B., Smola, A., Williamson, R.C., Bartlett, P.L.: New support vector algorithms.
Neural Computation 12, 1207–1245 (2000)
17. Vesanto, J.: SOM-based data visualization methods. Intell. Data Anal. 3(2), 111–126
(1999)
18. Vesanto, J., Alhoniemi, E.: Clustering of the self-organizing map. IEEE Transactions on
Neural Networks 11(3), 586–600 (2000)
19. Vesanto, J., Himberg, J., Alhoniemi, E., Parhankangas, J.: Self-organizing map in Matlab:
the SOM toolbox. In: Proceedings of the Matlab DSP Conference 1999, Espoo, Finland,
pp. 35–40 (1999)
20. Wu, T.-F., Lin, C.-J., Weng, R.C.: Probability estimates for multi-class classification by
pairwise coupling. Journal of Machine Learning Research 5, 975–1005 (2004)
Combining GRN Modeling and Demonstration-Based
Programming for Robot Control
Keywords: Recurrent neural network, gene regulation, time series data, bio-
inspired robot control, learning by demonstration.
1 Introduction
Gene regulatory networks (GRNs) are essential in cellular metabolism during the development of living organisms. They dynamically orchestrate the level of expression of each gene in the genome by controlling whether and how the gene will be transcribed into RNA [1]. In a GRN, the network structure is an abstraction of the chemical dynamics of a system, and the network nodes are genes that can be regarded as functions obtained by combining basic functions of the inputs. These functions have been interpreted as performing a kind of information processing within the cell, which determines cellular behaviors. GRNs thus act as analog biochemical computers that specify the identity and the level of expression of groups of target genes. Such systems often include dynamic and interlocking feedback loops for further regulation of the network architecture and outputs. With these unique characteristics, GRNs can be modeled as reliable and robust control mechanisms for robots.
The first important step in applying GRNs to controlling robots is to develop a framework for GRN modeling. In the field of GRN modeling, many regulation models have been proposed [2][3]; they range from very abstract models (involving Boolean values only) to very concrete ones (including fully biochemical interactions with
stochastic kinetics). The former approach is mathematically tractable and offers the possibility of examining large systems, but it cannot infer networks with feedback loops. The latter approach is more suitable for simulating the biochemical reality and more realistic to experimental biologists. To construct a network from experimental data, an automated procedure (i.e., reverse engineering) is advocated, in which the GRN model is first determined, and then different computational methods are developed for the chosen model to reconstruct networks from the time-series data [2][3]. As can be observed from the literature, works on modeling GRNs share similar ideas in principle. However, depending on the research motivations behind such works, different researchers have explored the same topic from different points of view; thus the implementation details of individual works also differ. This is especially apparent in the application of GRNs to robot control. For example, in the work by Quick et al. [4] and Stewart et al. [5], the authors have emphasized describing the operational details of their artificial genes, enzymes, and proteins to show how close their GRN-based control systems are to the biological process, rather than how their models can be used to control robots in practice. Different from their work, we focus on investigating whether the presented approach can be used to model GRNs in practice and, in particular, on how to exploit this approach to construct robot controllers.
As recurrent neural networks (RNNs) take feedback loops and internal states into account, they are able to capture the dynamics of the network over time. With these characteristics, this kind of network is especially suitable for control systems. Therefore, in this work we develop an RNN-based regulatory model for robot control. To simulate the regulatory effects and make our model inferable from time-series expression data, we also implement an enhanced RNN learning algorithm, coupled with some heuristic data-processing techniques to improve the learning performance. After establishing a framework for GRN modeling, we develop a method of programming by demonstration to collect behavior sequence data of a robot as the time-series profiles, and then employ our framework to infer behavior controllers automatically. To verify the presented approach, two series of experiments have been conducted to demonstrate how it operates and how it can be used to construct controllers for robots successfully.
In a fully recurrent network, each node has a link to every node of the net, including itself. Using such a model to represent a GRN is based on the assumption that the regulatory effect on the expression of a particular gene can be expressed as a neural network in which each node represents a particular gene and the wiring between the nodes defines the regulatory interactions. When activity flows through the network in response to an input, each node influences the states of all nodes in the same net. Since all activity changes are determined by these influences, each input can be seen as setting constraints on the final state into which the system can settle. When the system operates, the activities of individual nodes change in order to increase the number of constraints satisfied. The ideal final state would be a set of activities of the individual nodes in which all the constraints are satisfied. The network would then be stable, because no node would be trying to change the state of any of the nodes to which it is connected. Similarly, in a GRN, the level of expression of the genes at time t can be measured from the gene nodes, and the output of a node at time t+Δt can be derived from the expression levels and connection weights of all genes connected to the given gene at time t. That is, the regulatory effect on a certain gene can be regarded as a weighted sum of all other genes that regulate this gene. The regulatory effect is then transformed by a sigmoidal transfer function into a value between 0 and 1 for normalization.
The same set of transformation rules is applied to the system output in a cyclic fashion until the input does not change any further. As in [8], we use here the basic ingredient to increase the power of empirical correlations in signaling constitutive regulatory circuits: a network is generated with nodes and edges corresponding to the levels of gene expression measured in microarray experiments, and correlation coefficients between genes are derived. To calculate the expression rate of a gene, the following transformation rules are used:
$$\frac{dy_i}{dt} = k_{1,i}\, G_i - k_{2,i}\, y_i, \qquad G_i = \Big\{ 1 + e^{-\left( \sum_j w_{i,j} y_j + b_i \right)} \Big\}^{-1}$$
where $y_i$ is the actual concentration of the i-th gene product; $k_{1,i}$ and $k_{2,i}$ are the accumulation and degradation rate constants of the gene product, respectively; and $G_i$ is the regulatory effect on each gene, defined by a set of weights $w_{i,j}$ estimating the regulatory influence of gene j on gene i, and an external input $b_i$ representing the reaction delay parameter.
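A minimal sketch of simulating these dynamics with a forward-Euler discretisation is given below; the step size dt and the way the external sensor input is injected are illustrative choices, not details taken from the paper:
```python
# Sketch: integrate the node dynamics dy_i/dt = k1_i * G_i - k2_i * y_i, where
# G_i is the sigmoidal regulatory effect of the weighted inputs.
import numpy as np

def simulate_grn(w, b, k1, k2, y0, steps, dt=0.1, external_input=None):
    """w: (n, n) weights, b: (n,) delay parameters, k1, k2, y0: (n,) vectors."""
    y = np.array(y0, dtype=float)
    trajectory = [y.copy()]
    for _ in range(steps):
        drive = w @ y + b
        if external_input is not None:
            drive = drive + external_input     # e.g. sensor information
        G = 1.0 / (1.0 + np.exp(-drive))       # regulatory effect (sigmoid)
        y = y + dt * (k1 * G - k2 * y)
        trajectory.append(y.copy())
    return np.array(trajectory)
```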
When the above GRN model is used to control a robot, each gene node corresponds in principle to an actuator of the robot. Two extra nodes are added to serve as inter-genes, and their roles are not specified in advance. This redundancy makes the controllers easier to infer from data. Fig. 1 illustrates the architecture of our GRN controller. In this architecture, the sensor information received from the environment is continuously sent to all nodes of the fully interconnected network, and the outputs of the actuator nodes (i.e., a_i) are interpreted as motor commands to control the robot. For control tasks in which the sensor information is not required, for example the locomotion task, the perception part in the figure is simply disabled.
[Fig. 1: architecture of the GRN controller — the sensor information is fed to every node of the fully interconnected network, and the actuator nodes a1, a2, …, an provide the motor commands.]
After the network model is decided, the next phase is to find the settings of the thresholds and time constants of each neuron, as well as the weights of the connections between the neurons, so that the network produces the system behavior that best approximates the measured expression data. By introducing a scoring function for network performance evaluation, the above task can be regarded as a parameter estimation problem with the goal of maximizing the network performance (or minimizing an equivalent error measure). To achieve this goal, here we use the backpropagation through time (BPTT) [9] learning algorithm to update the relevant parameters of the recurrent network in discrete time steps.
Instead of mapping a static input to a static output as in a feedforward network, BPTT maps a series of inputs to a series of outputs. The central idea is the “unfolding” of the discrete-time recurrent neural network (DTRNN) into a multilayer feedforward neural network as a sequence is processed. Once a DTRNN has been transformed into an equivalent feedforward network, the resulting feedforward network can be trained using the standard backpropagation algorithm.
The goal of BPTT is to compute the gradient over the trajectory and update the network weights accordingly. As mentioned above, the gradient decomposes over time: it can be obtained by calculating the instantaneous gradients and accumulating their effect over time. In BPTT, weights can only be updated after a complete forward pass, during which the activation is sent through the network and each processing element stores its activation locally for the entire length of the trajectory. More details on BPTT can be found in Werbos’ work [9].
In the above learning procedure, the learning rate is an important parameter. Yet it is difficult to choose an appropriate value that achieves efficient training, because the cost surface of multi-layer networks can be complicated, and what works in one location of the cost surface may not work well in another. Delta-bar-delta is a heuristic algorithm for modifying the learning rate during the training procedure [10]. It is inspired by the observation that the error surface may have a different gradient along each weight direction, so that each weight should have its own learning rate. In our modeling work, to save the effort of choosing an appropriate learning rate, this algorithm is implemented for automatic parameter adjustment.
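A sketch of the per-weight delta-bar-delta rule is given below; the constants kappa, phi and theta are illustrative, since the paper does not give the values used:
```python
# Sketch: each weight keeps its own learning rate, which grows additively by
# kappa when the current gradient agrees in sign with its exponential average
# (the "delta-bar"), and shrinks multiplicatively by phi when they disagree.
import numpy as np

class DeltaBarDelta:
    def __init__(self, shape, lr=0.01, kappa=0.01, phi=0.5, theta=0.7):
        self.rates = np.full(shape, lr)
        self.avg_grad = np.zeros(shape)    # exponentially smoothed gradient
        self.kappa, self.phi, self.theta = kappa, phi, theta

    def step(self, weights, grad):
        agree = self.avg_grad * grad
        self.rates = np.where(agree > 0, self.rates + self.kappa,
                     np.where(agree < 0, self.rates * self.phi, self.rates))
        self.avg_grad = (1 - self.theta) * grad + self.theta * self.avg_grad
        return weights - self.rates * grad
```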
After establishing a framework for GRN modeling, we then use it to construct behavior controllers for robots. In this work, we develop an imitation-based method to collect behavior sequence data as gene expression profiles, and employ the presented methodology to infer behavior controllers automatically.
The imitation mechanism has two parts: one is an active process for acquiring new behaviors; the other is a passive process for imitating known behaviors. For a robot, the former means employing a certain learning strategy to produce a behavior that is currently being shown but was not known previously. The latter means recognizing what kind of behavior a demonstrator is performing and retrieving the same kind of behavior that has been previously developed and recorded in memory. As the passive process can be achieved in a straightforward way (i.e., building a mapping table that links the extracted behavior trajectory to the most similar one recorded previously), in this work we concentrate on the active process (i.e., learning new behaviors). For active imitation, we take an engineering point of view and consider imitation as a vehicle for learning new behaviors. It can be considered a method of programming by demonstration [11][12]. In this method, the robot is first shown how to perform the desired behavior: it is driven manually to achieve the target task. In this stage, the robot can be regarded as a teacher showing the correct behavior. During the period of human-driven demonstration, at each time step the relevant information received from the robot’s sensors and actuators is recorded to form a behavior data set for later training. In other words, the time-series expression profiles of sensors and actuators are derived from the qualitative behavior demonstrated by the robot.
After the behavior data are obtained, in the second stage the robot plays the role of a
learner that is trained to achieve the target task. As described in the sections above,
an RNN-based GRN model is adopted as the behavior controller for the learner, and the
corresponding learning algorithm is used to train the controller. To cope with different
environmental situations, the robot is operated to achieve the target behavior a few
times so that a reliable and robust controller can be obtained. All expression data from
the different behavior trials are collected and arranged in a single training set. By
minimizing the accumulated action error over the entire training set, the robot can
improve its behavior and finally achieve the task. If the robot cannot produce a behavior
similar to the demonstration, the user can modify the training set by driving the robot
to repeat the sub-behaviors that it failed to achieve, and then add the newly obtained
patterns to the data set to start the re-learning procedure.
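For illustration, the quantity minimized over the collected demonstrations can be sketched as follows; the controller's calling convention and the layout of the recorded trials are assumptions made only for this sketch.

import numpy as np

def accumulated_action_error(controller, trials):
    # trials: list of (sensor_series, action_series) pairs recorded from the
    # human-driven demonstrations; controller maps one sensor reading to the
    # predicted actuator values (signature assumed for illustration).
    total = 0.0
    for sensors, actions in trials:
        predicted = np.array([controller(s) for s in sensors])
        total += np.sum((predicted - np.asarray(actions)) ** 2)
    return total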
In the experiments on GRN modeling, we first used the GRN simulation software Genexp
(reported in [8]) to produce expression data. Different gene networks were defined, in
which the accumulation and degradation rate constants of the gene products were chosen
from preliminary tests. Due to space limitations, only one set of experiments is
reported here.
In this experiment, a four-gene network was defined. The simulation was run for 30 time
steps for data collection. The proposed approach was then employed to infer the above
network. The result is shown in Fig. 2, which compares the system behaviors of the
original and reconstructed networks; the x-axis represents the time step and the y-axis
the concentrations of the different gene components. As can be observed, the behaviors
of the two systems are nearly identical and the accumulated error over the nodes is very
small. This shows that the network can be reconstructed from the expression data by the
modeling framework presented.
Fig. 2. The desired and actual (reconstructed) expression profiles of the gene components (including node G3) over 30 time steps
The second experiment for evaluating our GRN modeling method is to infer an S-system,
which has a power-law form and has been proposed as a precise model for regulatory
networks. Here, we chose the same regulatory network reported in [13] as the target
network. It consists of five nodes, and their relationships can be described as follows:
dX_1/dt = 15.0 X_3 X_5^{-0.1} − 10.0 X_1^{2.0}
dX_2/dt = 10.0 X_1^{2.0} − 10.0 X_2^{2.0}
dX_3/dt = 10.0 X_2^{-0.1} − 10.0 X_2^{-0.1} X_3^{2.0}
Again, the proposed approach was employed to infer the above network. The upper part of
Fig. 3 shows the original (desired) time-series data for the five nodes, and the lower
part the expressions of the synthesized system. As can be observed, the behaviors of the
two systems are nearly identical and the accumulated error for the five nodes is very
small. This shows that an S-system can also be modeled by our RNN-based network, and
that the network can be reconstructed from the expression data by the learning mechanism
presented.
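As an illustration of the power-law form, a generic S-system with one production and one degradation term per node can be simulated with a simple Euler scheme; the rate constants and kinetic orders below are placeholders, not the values of the target network above.

import numpy as np

def s_system_step(X, alpha, beta, G, H, dt=0.01):
    # Generic S-system dynamics: dX_i/dt = alpha_i * prod_j X_j^G_ij - beta_i * prod_j X_j^H_ij
    production = alpha * np.prod(X[None, :] ** G, axis=1)
    degradation = beta * np.prod(X[None, :] ** H, axis=1)
    return X + dt * (production - degradation)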
Fig. 3. Time-series expression data of the five nodes X1–X5 for the original (upper) and synthesized (lower) S-systems
gripper (to pick up/release the object), and two nodes as inter-genes. The perception
information from the infrared sensors mounted on the front side of the robot and the
sensor equipped on the inner side of the gripper (to detect the object) was used as extra
input for each node. The simulated robot was driven manually to perform the target task,
and the relevant information was recorded to train the robot. With the quantitative
behavior data set, the learning algorithm associated with the GRN model was employed to
derive a controller that satisfied the collected data by minimizing the accumulated motor
error. Fig. 4 illustrates the robot and the typical behavior produced by the controller
that was successfully inferred by the proposed approach.
Fig. 4. The simulated robot with its gripper (open/close, up/down) and the typical behavior produced by the inferred controller
Different from the above reactive control task, the second task involves internal states:
developing a locomotion controller for a walking robot. It has been shown that animals
walk in a stereotyped way by adopting rhythmic patterns of movement, and that the neural
control of these stereotyped movements is hierarchically organized. It has also become
clear that these kinds of motion patterns are controlled by a rhythm-generating mechanism
called the central pattern generator (CPG), which provides the feedforward signals needed
for locomotion even in the absence of sensory feedback and high-level control [15]. In
this experiment, the proposed GRN model plays the role of the CPG to control the motors
of the robot's legs.
As the main goal here is to examine whether the locomotion task can be achieved by the
presented approach, we did not build a walking robot but simply took the activation data
of the successful controller developed in [16] as the time-series data for the inference
of a GRN model. The robot used in their work has six legs that may be either up or down,
and the robot can swing a leg forward or backward when it is up. Each leg has a limited
angular motion that can be used as sensory information for the leg controller. To
simplify the control architecture for a six-legged insect, they assumed that the
locomotion controller exhibits left-right and front-back symmetries. Therefore, only one
set of leg controller parameters needs to be determined, and these parameters can then be
copied to each of the six legs. More details of the robot can be found in [16].
The left part of Fig. 5 shows the activation data produced by a typical leg controller
reported in [16], in which the x-axis represents the time steps and the y-axis the
normalized activations of the actuators (from top to bottom: foot, swing forward, and
swing backward). To obtain a controller with the same behavior, in our experiments a GRN
model with five gene nodes was used to control a leg: one node for foot control, two
nodes for forward and backward swing control, and two nodes as inter-genes. As in [16],
two sets of experiments, with (a) and without (b) utilizing sensory information, have
been conducted for the locomotion task. The proposed approach was employed again to
derive controllers by minimizing the accumulated motor error. The results are presented
in the right part of Fig. 5, which shows that the controllers can be inferred
successfully.
Fig. 5. The activation data of the desired (left) and inferred (right) leg controllers
References
1. Davidson, E.H., Rast, J.P., Oliveri, P., et al.: A genomic regulatory network for develop-
ment. Science 295, 1669–1678 (2002)
2. de Jong, H.: Modeling and simulation of genetic regulatory systems: a literature review.
Journal of Computational Biology 9, 67–103 (2002)
3. Cho, K.H., Choo, S.M., Jung, S.H., et al.: Reverse engineering of gene regulatory net-
works. IET Systems Biology 1, 149–163 (2007)
4. Quick, T., Nehaniv, C.L., Dautenhahn, K., et al.: Evolving embodied genetic regulatory
network-driven control systems. In: Proceedings of the Seventh European Conference on
Artificial Life, pp. 266–277 (2003)
5. Stewart, F., Taylor, T., Konidaris, G.: METAMorph: Experimenting with genetic regula-
tory networks for artificial development. In: Capcarrère, M.S., Freitas, A.A., Bentley, P.J.,
Johnson, C.G., Timmis, J. (eds.) ECAL 2005. LNCS, vol. 3630, pp. 108–117. Springer,
Heidelberg (2005)
6. Blasi, M.F., Casorelli, I., Colosimo, A., et al.: A recursive network approach can identify
constitutive regulatory circuits in gene expression data. Physica A 348, 349–377 (2005)
7. Beer, R.D.: Dynamical approaches to cognitive science. Trends in Cognitive Sciences 4,
91–99 (2000)
8. Vohradsky, J.: Neural network model of gene expression. The FASEB Journal 15, 846–
854 (2001)
9. Werbos, P.J.: Backpropagation through time: what it does and how to do it. Proceedings of
the IEEE 78, 1550–1560 (1990)
10. Jacobs, R.A.: Increased rates of convergence through learning rate adaptation. Neural Net-
works 1, 295–307 (1988)
11. Dillmann, R.: Teaching and learning of robot tasks via observation of human performance.
Robotics and Autonomous Systems 47, 109–116 (2004)
12. Nakaoka, S., Nakazawa, A., Kanehiro, F., et al.: Learning from observation paradigm: leg
task models for enabling a biped humanoid robot to imitate human dance. International
Journal of Robotics Research 26, 829–844 (2007)
13. Ando, S., Sakamoto, E., Iba, H.: Evolutionary modeling and inference of gene network. In-
formation Sciences 145, 237–259 (2002)
14. Nolfi, S., Floreano, D.: Evolutionary Robotics: The Biology, Intelligence, and Technology
of Self-Organizing Machines. MIT Press, MA (2000)
15. Ijspeert, A.J.: Central pattern generators for locomotion control in animals and robots: a
review. Neural Networks 21, 642–653 (2008)
16. Beer, R.D., Gallagher, J.C.: Evolving dynamical neural networks for adaptive behavior.
Adaptive Behavior 1, 91–122 (1992)
Discriminating Angry, Happy and Neutral Facial
Expression: A Comparison of Computational Models
1 Introduction
According to Ekman and Friesen [1] there are six easily discernible facial expressions:
anger, happiness, fear, surprise, disgust and sadness, apart from neutral. Moreover,
these are readily and consistently recognized across different cultures [2]. In the work
reported here we show how a computational model can identify facial expressions from
simple facial images. In particular, we show how happy faces can be differentiated from
neutral faces, and angry faces from neutral faces.
Data representation plays an important role in any type of recognition. High-dimensional
data is normally reduced to a manageable low-dimensional data set. We perform
dimensionality reduction using Principal Component Analysis (PCA) and Curvilinear
Component Analysis (CCA). PCA is a linear projection technique, and it may be more
appropriate to use the non-linear CCA [3]. The Intrinsic Dimension (ID) [4], which is the
true dimension of the data, is often much less than the original dimension of the data.
To exploit this efficiently, the actual dimension of the data must be estimated. We use
the Correlation Dimension to estimate the Intrinsic
Dimension. We compare the classification results of these methods on raw face images and
on Gabor pre-processed images [5, 6]. The features of a face (or any object, for that
matter) may be aligned at any angle. Using a suitable Gabor filter at the required
orientation, certain features can be given high importance and other features less
importance. Usually, a bank of such filters with different parameters is used, and the
resultant image is an L2-max superposition of the outputs of the filter bank (at every
pixel, the maximum of the feature vector obtained from the filter bank).
2 Background
We perform feature extraction with Gabor filters and then use dimensionality reduction
techniques such as Principal Component Analysis (PCA) and Curvilinear Component Analysis
(CCA), followed by a Support Vector Machine (SVM) [7] based classification technique;
these are described below.
A Gabor filter can be applied to images to extract features aligned at particular
orientations. Gabor filters possess optimal localization properties in both the spatial
and frequency domains, and they have been used successfully in many applications [8]. A
Gabor filter is a function obtained by modulating a sinusoid with a Gaussian function.
The useful parameters of a Gabor filter are orientation and frequency. The Gabor filter
is thought to mimic the simple cells in the visual cortex: the various 2D receptive field
profiles encountered in populations of simple cells in the visual cortex are well
described by an optimal family of 2D filters [9]. In our case a Gabor filter bank is
applied to face images with 8 different orientations and 5 different frequencies.
Recent studies on the modeling of visual cortical cells [10] suggest a tuned band-pass
filter bank structure. Formally, the Gabor filter is a Gaussian (with variances S_x and
S_y along the x and y axes respectively) modulated by a complex sinusoid. The variance
terms S_x and S_y dictate the spread of the band-pass filter centred at the frequencies
U and V in the frequency domain. This filter has real and imaginary parts.
A Gabor filter can thus be described by the following parameters: the S_x and S_y of
the Gaussian, which determine the shape of the base (circle or ellipse), the frequency (f) of the
sinusoid, orientation (ϴ) of the applied sinusoid. Figure 1 shows examples of various
Gabor filters. Figure 2 b) shows the effect of applying a variety of Gabor filters
shown in Figure 1 to the sample image shown in Figure 2 a). Note how the features at
particular orientations are exaggerated.
Fig. 1. Gabor filters - Real part of the Gabor kernels at five scales and eight orientations
An augmented Gabor feature vector is created, of a size far greater than that of the
original image data. Every pixel is then represented by a vector of size 40, which
demands dimensionality reduction before further processing: a 63 × 63 image is
transformed to size 63 × 63 × 5 × 8. The feature vector thus consists of all the useful
information extracted at different frequencies and orientations and from all locations,
and hence is very useful for expression recognition.
Fig. 2. a) A sample face image b) the effect of applying the Gabor filters of Fig. 1 to the sample image
Once the feature vector is obtained, it can be handled in various ways. We simply take
the L2-max norm for each pixel of the feature vector, so that the final value of a pixel
is the maximum value found by any of the filters for that pixel. This L2-max
superposition principle is applied to the outputs of the filter bank, and Figure 3 b)
shows the output for the original image of Figure 3 a).
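A minimal sketch of this step is given below, with hand-rolled Gabor kernels and an illustrative frequency spacing; the exact filter parameters used in our experiments are not reproduced here.

import numpy as np
from scipy.signal import convolve2d

def gabor_kernel(sigma_x, sigma_y, theta, freq, size=21):
    # Real part of a Gabor kernel: a Gaussian envelope modulating a sinusoid
    # aligned at orientation theta with spatial frequency freq.
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-0.5 * ((xr / sigma_x) ** 2 + (yr / sigma_y) ** 2))
    return envelope * np.cos(2.0 * np.pi * freq * xr)

def gabor_l2max(image, n_orient=8, n_freq=5):
    # Apply the 5 x 8 filter bank and keep, at every pixel, the maximum
    # response over all filters (the L2-max style superposition).
    responses = []
    for k in range(n_freq):
        freq = 0.05 * (2 ** k)                    # illustrative frequency spacing
        for o in range(n_orient):
            theta = o * np.pi / n_orient
            kern = gabor_kernel(4.0, 4.0, theta, freq)
            responses.append(np.abs(convolve2d(image, kern, mode="same")))
    return np.max(np.stack(responses), axis=0)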
Fig. 3. a) Original Image used for the Filter bank b) Superposition output (L2 max norm)
In CCA, for each point x_i in the input space a corresponding point y_i in the output
space is calculated such that the distance relationship between the data points is
maintained.
Fig. 4. (a) 3D horse shoe dataset (b) 2D CCA projection (c) dy − dx plot
CCA puts more emphasis on maintaining the short distances than the longer ones.
Formally, this reasoning leads to the following error function:
E = (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} ( d^X_{i,j} − d^Y_{i,j} )² F_λ( d^Y_{i,j} ),  ∀ j ≠ i   (2)
where d^X_{i,j} and d^Y_{i,j} are the Euclidean distances between points i and j in the
input space X and the projected output space Y respectively, N is the number of data
points, and F_λ(d^Y_{i,j}) is the neighbourhood function, a monotonically decreasing
function of distance. In order to check that the relationship is maintained, a plot of
the distances in the input space against those in the output space (the dy − dx plot) is
produced. For a well maintained topology, dy should be proportional to dx, at least for
small values of dy. Figure 4 shows the CCA projection of the 3D horse shoe data. The
dy − dx plot shown is good in the sense that the smaller distances are very well
matched [3].
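A direct evaluation of the cost (2) might look like the following sketch, with a simple step function standing in for the neighbourhood function F_λ.

import numpy as np
from scipy.spatial.distance import pdist

def cca_stress(X, Y, lam=1.0):
    # Pairwise Euclidean distances in the input space X and the output space Y.
    dX, dY = pdist(X), pdist(Y)
    F = (dY < lam).astype(float)   # step neighbourhood function: emphasise short output-space distances
    return 0.5 * np.sum((dX - dY) ** 2 * F)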
One problem with CCA is deciding how many dimensions the projected space should occupy,
and one way of determining this is to use the intrinsic dimension of the data manifold.
The Intrinsic Dimension (ID) can be defined as the minimum number of free variables
required to define the data without any significant information loss. Due to the
possibility of correlations among the data, both linear and nonlinear, a D-dimensional
dataset may actually lie on a d-dimensional manifold (D ≥ d); the ID of such data is then
said to be d. There are various methods of calculating the ID; here we use the
Correlation Dimension [8] to calculate the ID of the face image dataset.
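One common way to estimate it, assumed here for illustration, is the Grassberger-Procaccia style procedure: count the fraction of point pairs closer than a radius r and take the slope of log C(r) against log r over an illustrative grid of radii.

import numpy as np
from scipy.spatial.distance import pdist

def correlation_dimension(X, n_radii=20):
    d = pdist(X)                                                   # all pairwise distances
    radii = np.logspace(np.log10(d.min() + 1e-12), np.log10(d.max()), n_radii)
    C = np.array([np.mean(d < r) for r in radii])                  # correlation sum C(r)
    mask = C > 0
    slope, _ = np.polyfit(np.log(radii[mask]), np.log(C[mask]), 1)
    return slope                                                   # estimated intrinsic dimension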
A number of classifiers can be used in the final stage for classification. We have
concentrated on the Support Vector Machine. Support Vector Machines (SVMs) are a set of
related supervised learning methods used for classification and regression. SVMs are
used extensively for many classification tasks, such as handwritten digit recognition
[11] and object recognition [12]. An SVM implicitly transforms the data into a
higher-dimensional space (determined by the kernel), which allows the classification to
be accomplished more easily. We have used the LIBSVM tool [7] for SVM classification.
Fig. 5. Example BINGHAMTON images used in our experiments, cropped to a size of
128 × 128 to extract the facial region and reduced to 63 × 63 for all experiments. The
first row has examples of angry expressions, the middle row happy expressions, and the
last row neutral expressions.
For PCA reduction we always use the first principal components that account for 95% of
the total variance of the data, and project the data onto these principal components; we
call this our standard PCA reduction. With the Neutral and Happy faces, this resulted in
using 100 components of the raw dataset and 23 components of the Gabor pre-processed
dataset. With the Neutral and Angry faces, this resulted in using 97 components of the
raw dataset and 22 components of the Gabor pre-processed dataset. As CCA is a highly
non-linear dimensionality reduction technique, we use the intrinsic
dimensionality estimate and reduce the data to its Intrinsic Dimension. The Intrinsic
Dimension of the raw faces for Neutral and Happy was approximated as 6, and that of the
Gabor pre-processed images as 5. Likewise, the Intrinsic Dimension of the raw faces for
Neutral and Angry was approximated as 5, and that of the Gabor pre-processed images as 6.
Figure 6 shows the eigenfaces obtained by the PCA technique with raw faces (Happy with
Neutral set and Angry with Neutral set).
Fig. 6. a) The first 5 eigenfaces of the neutral and happy data set. b) The first 5 eigenfaces of
the neutral and angry data set.
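The standard PCA reduction can be sketched as follows: keep the leading components whose cumulative variance reaches 95% and project onto them (an illustrative implementation, not the exact code used in the experiments).

import numpy as np

def standard_pca_reduction(X, var_kept=0.95):
    Xc = X - X.mean(axis=0)                                # centre the data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = (s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(np.cumsum(explained), var_kept)) + 1
    return Xc @ Vt[:k].T, Vt[:k]                           # projected data and the retained components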
The results of the SVM classification for Neutral and Happy are given in Table 1, and for
Neutral and Angry in Table 2. PCA, being a linear dimensionality reduction technique, did
not do quite as well as CCA on the happy and neutral data set; however, there was no
difference on the angry and neutral dataset. With CCA there was good generalization, but
the key point to note here is the number of components used for the classification. On
the neutral and happy dataset, CCA uses just 6 components with raw faces to obtain a good
classification result, and 5 components with the Gabor pre-processed images. On the angry
and neutral dataset, CCA uses 5 components with raw faces and 6 components with Gabor
pre-processed faces, with results comparable to those of the raw faces.
Table 1. SVM Classification accuracy of raw faces and Gabor pre-processed images with PCA
and CCA dimensionality reduction techniques for Neutral and Happy dataset
Table 2. SVM Classification accuracy of raw faces and Gabor pre-processed images with PCA
and CCA dimensionality reduction techniques for Neutral and Angry dataset
Fig. 7. Examples of the most often misclassified set of faces. (a) Top row shows happy faces
wrongly classified as neutral. Bottom row shows neutral faces wrongly classified as happy. (b)
Top row shows angry faces wrongly classified as neutral. Bottom row shows neutral faces
wrongly classified as angry.
The classification results for the Neutral and Happy face images, shown in Table 1,
indicate that the best classification is obtained using raw faces. The intrinsic
dimensionality of the raw images is found to be just 6, and the CCA projection therefore
reduces the images to 6 components. It should be noted that even with just these 6
components, the SVM gives very good classification. The standard PCA-reduced raw images
did not give good classification. However, Gabor pre-processed faces followed by standard
PCA reduction gave much better results. Interestingly, Gabor pre-processing does not help
the non-linear CCA method.
The classification results for the Neutral and Angry face images, shown in Table 2,
indicate that the overall classification accuracy is not as good as for the happy versus
neutral dataset. Classifying angry faces is a difficult task for computational models, as
can be seen from these results. Nevertheless, the SVM performs well, with 79.54% accuracy
on raw faces. There is not much difference in the classification accuracy between raw
faces reduced in dimensionality with PCA and with CCA.
The results with both sets of data suggest that the raw face images give the best
classification results. From the examples of misclassifications shown in Figure 7, it is
not clear which feature has caused the misclassification. Hence, we are currently
undertaking further experiments with human subjects, attempting to see whether there are
any associations between the computational model and human performance.
4 Conclusions
Identifying facial expressions is a challenging and interesting task. Our experiments
show that identification from raw images can be performed very well for happy and angry
faces. However, with a larger data set it may be computationally intractable to use the
raw images, so it is important to reduce the dimensionality of the data. The
dimensionality reduction methods do fairly well. A linear method such as PCA does not
appear to be sufficiently tunable to identify features that are relevant for facial
expression characterization. However, performing Gabor pre-processing on the images
increases the classification accuracy of the PCA-reduced data in the case of happy and
neutral face images. This, however, does not apply to images that are subjected to
dimensionality reduction with CCA. Gabor pre-processed PCA data with just 23 components
performs well in comparison with the raw images reduced with PCA. The Gabor pre-processed
CCA images, however, with just 5 components, do not yield comparable results. With the
second model, classifying angry against neutral faces, the raw faces manage to classify
just 35 out of 44 faces correctly (79.54%), indicating the difficulty of classifying
angry faces. Although the classification results for PCA- and CCA-processed raw images
are comparable, it can be noted that Gabor pre-processing has provided good
classification with PCA-reduced data and with CCA using just 23 and 6 components
respectively. Future work will include extending the experiments to the other four
expressions and comparing the performance of the computational model with the
performance of human subjects.
References
1. Ekman, P., Friesen, W.V.: Constants across cultures in the face and emotion. Journal of
Personality and Social Psychology 17 (1971)
2. Batty, M., Taylor, M.J.: Early processing of the six basic facial emotional expressions.
Cognitive Brain Research 17 (2003)
Modeling and Forecasting CAT and HDD Indices for Weather Derivative Pricing
1 Introduction
Recently a new class of financial instruments, weather derivatives, has been introduced.
The purpose of weather derivatives is to allow businesses to insure themselves against
fluctuations in the weather. According to [1, 2], nearly $1 trillion of the US economy is
directly exposed to weather risk. Just as traditional contingent claims have payoffs that
depend upon the price of some fundamental, a weather derivative has an underlying measure
such as rainfall, temperature, humidity or snowfall. Weather derivatives are used to
hedge volume risk, rather than price risk.
The Chicago Mercantile Exchange (CME) reports that the estimated value of its weather
products reached $22 billion through September 2005, with more than 600,000 contracts
traded. This represents a sharp rise compared with 2004, in which the notional value was
$2.2 billion [3]. Moreover, it is anticipated that the weather market will continue to
develop, broadening its scope in terms of geography, client base and inter-relationship
with other financial and insurance markets. In order to fully exploit all the advantages
that this market offers, an adequate pricing approach is required.
Weather risk is unique in that it is highly localized and, despite great advances in
meteorological science, still cannot be predicted precisely and consistently. Weather
derivatives also differ from other financial derivatives in that the underlying weather
index, such as Heating Degree Days (HDD), Cooling Degree Days (CDD) or Cumulative Average
Temperature (CAT), cannot be traded. Mathematical definitions of these indices are given
in section 4. Furthermore, the corresponding market is relatively illiquid. Consequently,
since weather derivatives cannot be cost-efficiently replicated with other weather
derivatives, arbitrage pricing cannot be applied to them directly. The weather
derivatives market is a classic incomplete market, because the underlying weather
variables are not tradable.
The first and simplest method that has been used in weather derivative pricing is burn
analysis. Burn analysis is simply a calculation of how a weather derivative would have
performed in past years; taking the average of these values gives an estimate of the
price of the derivative. Burn analysis is computationally very simple, since there is no
need to fit a distribution to the temperature or to solve any stochastic differential
equation. Moreover, burn analysis rests on very few assumptions. First, we have to assume
that the temperature time series is stationary. Next, we have to assume that the data for
different years are independent and identically distributed. For a detailed explanation
of burn analysis we refer to [4].
A closer inspection of a temperature time series shows that neither of these assumptions
is correct. It is clear that the temperature time series is not stationary, since it
contains seasonalities, jumps and trends [5, 6]. The independence of the years is also
questionable. [4] show that these assumptions can be used if the data are cleaned and
detrended, although their results show that the pricing still remains inaccurate. Other
methods, such as index and daily modeling, are more accurate, but burn analysis is
usually still a good first approximation of the derivative's price.
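As a minimal illustration, burn analysis amounts to averaging the payoffs the contract would have produced historically; the base temperature and tick size used below are illustrative, not contract specifications taken from the text.

import numpy as np

def hdd_index(daily_temps, base=65.0):
    # Cumulative HDD over the contract period (temperatures in degrees Fahrenheit).
    return np.maximum(base - np.asarray(daily_temps, float), 0.0).sum()

def burn_analysis_price(yearly_temp_series, base=65.0, tick=20.0):
    # Average the historical payoffs (tick value times the HDD index of each past year).
    payoffs = [tick * hdd_index(temps, base) for temps in yearly_temp_series]
    return float(np.mean(payoffs))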
In contrast to the previous methods, a dynamic model can be used that directly simulates
the future behavior of temperature. Using models for daily temperatures can, in
principle, lead to more accurate pricing than modeling temperature indices. On the other
hand, deriving an accurate model for the daily temperature is not a straightforward
process. Observed temperatures show seasonality in the mean, variance, distribution and
autocorrelations, as well as long memory in the autocorrelations. The risk with daily
modeling is that small misspecifications in the models can lead to large mispricing of
the contracts.
The continuous-time processes used for modeling daily temperatures usually take a
mean-reverting form, which has to be discretized in order to estimate the various
parameters. Previous works suggest that the speed of mean reversion, α, is constant; the
work of [5] indicates exactly the opposite. In addition, α is important for the correct
and accurate pricing of temperature derivatives [7]. In [5] the parameter α was modeled
by a Neural Network (NN). More precisely, wavelet analysis (WA) was used to identify the
trend and the seasonal part of the temperature signal, and then a NN was applied to the
detrended and deseasonalized series. However, WA is limited to applications of small
input dimension, since constructing a wavelet basis of large input dimension is
computationally expensive [8]. In the NN framework the initial values of the weights are
chosen randomly, which usually leads to long training times and to convergence to a local
minimum of the specified loss function. Finally, the use of sigmoid NNs does not provide
any information about the network construction. In this study we expand the work of [5]
by combining these two steps. To overcome these problems we use networks with wavelets as
activation functions, namely wavelet networks (WN). More precisely, we use truncated
Fourier series to
remove the seasonality and the seasonal volatility of the temperature at various
locations, as in [9]. Next, a wavelet network is constructed in order to fit the daily
average temperature in 13 cities and to forecast daily temperature up to two months
ahead, the expectation being that the waveform of the activation functions of the
feedforward network will fit the temperature process much better than classical sigmoid
functions. For a concise treatment of wavelet analysis we refer to [10-12], while for
wavelet networks we refer to the works of [8, 13, 14]. The forecasting accuracy of the
proposed methodology is validated in a two-month-ahead out-of-sample window. More
precisely, the proposed methodology is compared against historical burn analysis and the
Benth and Saltyte-Benth model in forecasting CAT and HDD indices. Finally, we extend the
work of [5] by presenting the pricing equations for HDD futures contracts when the speed
of mean reversion is not constant.
The rest of the paper is organized as follows. In section 2 we describe the process used
to model the average daily temperature. In section 3 a brief introduction to wavelet
networks is given. In section 4 we describe our data and apply our model to real data. In
section 5 we discuss CAT and HDD derivative pricing, and finally, in section 6 we
conclude.
where T(t) is the daily average temperature, B(t) is a standard Brownian motion, S(t) is
a deterministic function modelling the trend and seasonality of the average temperature,
and σ(t) is the daily volatility of temperature variations. In [9] both S(t) and σ²(t)
were modeled as truncated Fourier series:
S(t) = a + bt + a_0 + Σ_{i=1}^{I1} a_i sin(2iπ(t − f_i)/365) + Σ_{j=1}^{J1} b_j cos(2jπ(t − g_j)/365)   (2)

σ²(t) = c + Σ_{i=1}^{I2} c_i sin(2iπt/365) + Σ_{j=1}^{J2} d_j cos(2jπt/365)   (3)
From the Ito formula an explicit solution for (1) can be derived:
T(t) = s(t) + (T(t−1) − s(t−1)) e^{−κ} + ∫_{t−1}^{t} σ(u) e^{−κ(t−u)} dB(u)   (4)
T̃(t+1) = a T̃(t) + σ(t) ε(t)   (6)
where
T̃(t) = T(t) − S(t)   (7)
a = e^{−κ}   (8)
In order to estimate model (6) we first need to remove the trend and seasonality
components from the average temperature series. The trend and seasonality of the daily
average temperatures are modeled and removed as in [9]. Next, a wavelet neural network is
used to model and forecast the daily detrended and deseasonalized temperatures. Hence,
equation (6) reduces to:
T̃(t) = φ( T̃(t−1) ) + e_t   (9)
The analytic expression for the wavelet network derivative dφ/dT̃ can be found in [14].
Due to space limitations, we refer to [5, 9] for the estimation of the parameters in
equations (2), (3), (6) and (8).
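For illustration, the seasonal component of equation (2) can be estimated by ordinary least squares on sinusoidal regressors; the sketch below absorbs the phase shifts f_i, g_j into paired sine and cosine terms, an equivalent reparameterization rather than the authors' exact estimation procedure.

import numpy as np

def fit_seasonality(T, I=1):
    # T: one daily average temperature per day.  Build regressors 1, t,
    # sin(2*i*pi*t/365) and cos(2*i*pi*t/365), and solve by least squares.
    t = np.arange(len(T), dtype=float)
    cols = [np.ones_like(t), t]
    for i in range(1, I + 1):
        cols += [np.sin(2 * i * np.pi * t / 365.0), np.cos(2 * i * np.pi * t / 365.0)]
    A = np.column_stack(cols)
    coeffs, *_ = np.linalg.lstsq(A, np.asarray(T, float), rcond=None)
    return A @ coeffs, coeffs          # fitted seasonal component S(t) and its parameters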
y(x) = w^{[2]}_{λ+1} + Σ_{j=1}^{λ} w^{[2]}_j · Ψ_j(x) + Σ_{i=1}^{m} w^{[0]}_i · x_i   (11)
λ is the number of hidden units and w stands for a network weight. Following [23], we use
the Mexican Hat function as the mother wavelet. The multidimensional wavelets are
computed as follows:
Ψ_j(x) = Π_{i=1}^{m} ψ( z_{ij} )   (12)
w = ( w^{[0]}_i , w^{[2]}_j , w^{[2]}_{λ+1} , w^{[1]}_{(ξ)ij} , w^{[1]}_{(ζ)ij} )   (14)
There are several approaches to training a WN. In our implementation we have used
ordinary back-propagation, which is slower but less sensitive to the initial conditions
than higher-order alternatives. The weights w^{[0]}_i, w^{[2]}_j and the parameters
w^{[1]}_{(ξ)ij} and w^{[1]}_{(ζ)ij} are trained to approximate the target function.
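A forward pass through such a network can be sketched as follows, assuming the usual form z_ij = (x_i − ξ_ij)/ζ_ij for the translated and dilated inputs (this parameterization is an assumption of the sketch, as it is not reproduced above).

import numpy as np

def mexican_hat(z):
    # Mexican Hat mother wavelet.
    return (1.0 - z ** 2) * np.exp(-0.5 * z ** 2)

def wn_forward(x, w0, w2, w2_bias, xi, zeta):
    # x: (m,) input; w0: (m,) direct linear weights; w2: (lam,) wavelet weights;
    # xi, zeta: (lam, m) translation and dilation parameters.
    z = (x[None, :] - xi) / zeta                 # z_ij = (x_i - xi_ij) / zeta_ij (assumed form)
    Psi = np.prod(mexican_hat(z), axis=1)        # eq. (12): product wavelet per hidden unit
    return w2_bias + w2 @ Psi + w0 @ x           # eq. (11): bias + wavelets + direct connections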
In a WN, in contrast to NNs that use sigmoid functions, selecting the initial values of
the dilation and translation parameters randomly may not be suitable [22]. A wavelet is a
waveform of effectively limited duration that has an average value of zero and localized
properties; hence a random initialization may lead to wavelons with a value of zero.
Random initialization also affects the speed of training and may lead to a local minimum
of the loss function [24]. In the literature more sophisticated initialization methods
have been proposed [13, 23, 25]. All of these methods can be summarized in the following
three steps:
1. Construct a library W of wavelets.
2. Remove the wavelets whose support does not contain any sample points of the training
data.
3. Rank the remaining wavelets and select the best regressors.
The wavelet library can be constructed either from an orthogonal wavelet or from a
wavelet frame. However, orthogonal wavelets cannot be expressed in closed form, and it
has been shown that a family of compactly supported non-orthogonal wavelets is more
appropriate for function approximation [26]. The wavelet library may contain a large
number of wavelets; in practice it is impossible to use an infinite number of frame or
basis terms, yet arbitrary truncations may lead to large errors [27].
In [23] three alternative methods were proposed to reduce and rank the wavelets in the
wavelet library, namely the Residual Based Selection (RBS), the Stepwise Selection by
Orthogonalization (SSO) and the Backward Elimination (BE) algorithms. In this study we
use the BE initialization method, which proved in previous studies to outperform the
other two methods [14, 23].
All of the above methods are used only for the initialization of the dilation and
translation parameters. The network is then further trained in order to obtain the vector
of parameters w = w₀ that minimizes the cost function.
where the temperature is measured in degrees Celsius. In the USA, CME weather derivatives
are based on the HDD or CDD index. A HDD is the number of degrees by which the daily
temperature is below a base temperature, while a CDD is the number of degrees by which
the daily temperature is above the base temperature, i.e.,
Daily HDD = max(0, base temperature − daily average temperature),
Daily CDD = max(0, daily average temperature − base temperature).
1 Atlanta, Detroit, New York, Baltimore, Houston, Philadelphia, Boston, Jacksonville,
Portland, Chicago, Kansas City, Raleigh, Cincinnati, Las Vegas, Sacramento, Colorado
Springs, Little Rock, Salt Lake City, Dallas, Los Angeles, Tucson, Des Moines,
Minneapolis-St. Paul, Washington, D.C.
2 Amsterdam, Barcelona, Berlin, Essen, London, Madrid, Paris, Rome, Stockholm, Oslo.
3 Tokyo, Osaka.
4 Calgary, Montreal, Vancouver, Edmonton, Toronto, Winnipeg.
Table 1 shows the descriptive statistics of the temperature in each city for the past
11 years. The mean and standard deviation of the HDD represent the mean and standard
deviation of the HDD index over the past 11 years for a two-month period, January and
February. For consistency, all values are presented in degrees Fahrenheit. It is clear
that the HDD index exhibits large variability. Similarly, the difference between the
maximum and minimum temperature is close to 70 degrees Fahrenheit on average for all
cities, while the standard deviation of temperature is close to 15 degrees Fahrenheit.
Also, for all cities the kurtosis is significantly smaller than 3, and, with the
exceptions of Barcelona, Madrid and London, the skewness is negative.
Table 1. Descriptive statistics of the daily average temperature and the HDD index in each city
City  Mean  St.Dev  Max  Min  Skewness  Kurtosis  Mean HDD  Std. HDD
Paris 54.38 12.10 89.90 13.80 -0.04 2.50 1368.40 134.95
Rome 60.20 11.37 85.80 31.10 -0.04 1.96 1075.00 121.52
Stockholm 45.51 14.96 79.20 -5.00 -0.09 2.33 2114.36 197.93
Amsterdam 51.00 11.00 79.90 12.20 -0.18 2.54 1512.07 189.05
Barcelona 61.56 10.59 85.70 32.60 0.09 2.03 899.23 102.75
Madrid 58.61 13.84 89.80 24.90 0.17 1.94 1262.56 128.53
New York 55.61 16.93 93.70 8.50 -0.15 2.08 1783.44 207.13
London 52.87 10.03 83.00 26.70 0.02 2.36 1307.11 98.80
Oslo 41.47 15.65 74.60 -8.70 -0.31 2.50 2404.53 236.19
Atlanta 62.18 14.52 89.60 13.70 -0.45 2.25 1130.75 121.19
Chicago 50.61 19.40 91.40 -12.90 -0.25 2.17 2221.33 211.89
Portland 46.80 17.36 83.20 -3.70 -0.22 2.22 2382.04 192.47
Philadelphia 56.02 17.13 90.50 9.50 -0.19 2.03 1766.98 211.65
Next we produce two-month (59 days) ahead out-of-sample forecasts of the CAT and
cumulative HDD indices. Our method is validated and compared against two forecasting
methods proposed in prior studies: historical burn analysis (HBA) and the Benth and
Saltyte-Benth (B-B) model, which is the starting point for our methodology.
Table 2 shows the relative (percentage) errors of each method for the CAT index. It is
clear that the proposed method using WN outperforms both HBA and B-B. More precisely, the
WN gives smaller out-of-sample errors than HBA in 9 out of 13 cases, and it outperforms
B-B in 11 out of 13 cases. The WN can be used with great success in the European cities,
where it produces significantly smaller errors than the alternative methods. Only in Oslo
and Amsterdam does the WN perform worse than the HBA method, but its forecasts are still
better than those of B-B. In the US cities the WN produces the smallest out-of-sample
error in three cases, while HBA and B-B produce the smallest out-of-sample error in one
and two cases respectively. Observing Table 1 again, we
notice that when the temperature shows large negative skewness, with the exceptions of
New York, Portland and Philadelphia, the proposed method is outperformed either by HBA or
by B-B. On the other hand, in the cases of Barcelona and Madrid, where the skewness is
positive, the errors of the wavelet network method are only 0.03% and 0.74%,
significantly smaller than the errors produced by the other two methods. Table 3 shows
the relative (percentage) errors of each method for the HDD index; the results are
similar.
Finally, we examine the fitted residuals of model (6). Note that the B-B model, in
contrast to the wavelet network model, is based on the hypothesis that the remaining
residuals follow the normal distribution. It is clear from Table 4 that only in Paris is
the normality hypothesis marginally accepted, with a p-value slightly above 0.05. In
every other case the normality hypothesis is rejected: the Jarque-Bera statistics are
very large and the p-values are close to zero. Hence, alternative methods such as wavelet
analysis must be used to capture the seasonal part of the data [5].
Table 2. Relative errors for the three forecasting models. CAT index.
Table 3. Relative errors for the three forecasting models. HDD index.
Table 4. Jarque-Bera statistics and p-values of the fitted residuals for each city
City  Jarque-Bera  P-Value
Paris 5.7762 0.054958
Rome 170.12339 0.001
Stockholm 60.355 0.001
Amsterdam 44.6404 0.001
Barcelona 685.835 0.001
Madrid 69.52 0.001
New York 53.91428 0.001
London 11.66163 0.003947
Oslo 37.272738 0.001
Atlanta 403.0617 0.001
Chicago 44.329798 0.001
Portland 21.91905 0.001
Philadelphia 89.54923 0.001
CDD = ∫_{τ1}^{τ2} max( T(s) − c, 0 ) ds   (17)
T(t) = S(t) + e^{−∫_0^t κ(u)du} ( T(0) − S(0) ) + ∫_0^t σ(s) e^{−∫_s^t κ(u)du} dB(s)   (19)
Our aim is to give a mathematical expression for the HDD futures price. It is clear that
the weather derivative market is an incomplete market: cumulative average temperature
contracts are written on a temperature index, which is not a tradable or storable asset.
In order to derive the pricing formula, we must first find a risk-neutral probability
measure Q ~ P under which all assets are martingales after discounting. In the case of
weather derivatives any equivalent measure Q is a risk-neutral probability. If Q is the
risk-neutral probability and r is the constant compounding interest rate, then the
arbitrage-free futures price of a HDD contract at time t ≤ τ1 ≤ τ2 is given by:
e^{−r(τ2 − t)} E_Q[ ∫_{τ1}^{τ2} max( 0, c − T(τ) ) dτ − F_HDD(t, τ1, τ2) | F_t ] = 0   (20)
where θ(t) is a real-valued, measurable and bounded function denoting the market price of
risk. The market price of risk can be calculated from historical data; more specifically,
θ(t) can be calibrated by looking at the market prices of traded contracts, taking the
value that makes the model price fit the market price. Using the Ito formula, the
solution of equation (23) is:
T(t) = S(t) + e^{−∫_0^t κ(u)du} ( T(0) − S(0) ) + ∫_0^t σ(s)θ(s) e^{−∫_s^t κ(u)du} ds + ∫_0^t σ(s) e^{−∫_s^t κ(u)du} dB(s)
By substituting this expression into (21) we find the price of a futures contract on the
HDD index at time t, where 0 ≤ t ≤ τ1 ≤ τ2. Following the notation of [28] we have the
following proposition.
Proposition 1. The HDD futures price for 0 ≤ t ≤ τ1 ≤ τ2 is given by
F_HDD(t, τ1, τ2) = E_Q[ ∫_{τ1}^{τ2} max( c − T(s), 0 ) ds | F_t ] = ∫_{τ1}^{τ2} v(t, s) Ψ( m(t, s) / v(t, s) ) ds   (25)
where
m(t, s) = c − S(s) − e^{−∫_t^s κ(z)dz} T̃(t) − ∫_t^s σ(u)θ(u) e^{−∫_u^s κ(z)dz} du   (26)

v²(t, s) = ∫_t^s σ²(u) e^{−2∫_u^s κ(z)dz} du   (27)
We have
F_HDD(t, τ1, τ2) = E_Q[ ∫_{τ1}^{τ2} max( c − T(s), 0 ) ds | F_t ]
and, using Fubini's theorem, we can interchange the expectation and the integral:
E_Q[ ∫_{τ1}^{τ2} max( c − T(s), 0 ) ds | F_t ] = ∫_{τ1}^{τ2} E_Q[ max( c − T(s), 0 ) | F_t ] ds
T(s) is normally distributed under the probability measure Q with mean and variance
given by:
E_Q[ T(s) | F_t ] = S(s) + e^{−∫_t^s κ(z)dz} T̃(t) + ∫_t^s σ(u)θ(u) e^{−∫_u^s κ(z)dz} du

Var_Q[ T(s) | F_t ] = ∫_t^s σ²(u) e^{−2∫_u^s κ(z)dz} du

Hence, c − T(s) is normally distributed with mean m(t, s) and variance v²(t, s), and the
proposition follows by standard calculations using the properties of the normal
distribution.
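Numerically, Proposition 1 can be evaluated on a grid of s, assuming the standard function Ψ(x) = xΦ(x) + φ(x) for the expectation of the positive part of a normal variable (this form of Ψ is an assumption of the sketch, consistent with c − T(s) being normal with mean m and variance v²).

import numpy as np
from scipy.stats import norm

def hdd_future_price(m, v, ds):
    # m, v: arrays of m(t, s) and v(t, s) evaluated on a grid of s in [tau1, tau2];
    # ds: the grid spacing.  Riemann-sum approximation of eq. (25).
    m, v = np.asarray(m, float), np.asarray(v, float)
    x = m / v
    Psi = x * norm.cdf(x) + norm.pdf(x)
    return float(np.sum(v * Psi) * ds)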
6 Conclusions
This paper proposes and implements a modeling and forecasting framework for
temperature-based weather derivatives. The proposed method is an extension of the works
of [5] and [9]. Here the speed of mean reversion is considered to be a time-varying
parameter, and it is modeled by a wavelet neural network. It is shown that the waveform
of the activation functions of the proposed network provides a better fit to the data.
Our method is validated over a two-month-ahead out-of-sample forecast period. Moreover,
the relative errors produced by the wavelet network are compared against the original B-B
model and historical burn analysis. The results show that the wavelet network outperforms
the other methods; more precisely, its forecasting ability is better than that of B-B and
HBA in 11 out of 13 cases. Testing the fitted residuals of B-B, we observe that the
normality hypothesis is rejected in almost every case; hence B-B cannot be relied upon
for forecasts. Finally, we provide the pricing equations for temperature futures on a HDD
index when α is time dependent.
The results in this study are preliminary and can be improved. More precisely, the
numbers of sinusoids in equations (2) and (3) of the B-B framework, representing the
seasonal part of the temperature and the variance of the residuals, are chosen according
to [9]. Alternative methods could improve the fit to the original data; a better training
set would then be expected for the wavelet network, and more accurate forecasts.
Another important aspect is to test the largest forecasting window of each method.
Meteorological forecasts with a window larger than 10 days are considered inaccurate, so
it is important to develop a model that can accurately predict daily average temperatures
over larger windows. This analysis will also allow us to choose the best model according
to the desired forecasting interval.
References
1. Challis, S.: Bright Forecast for Profits, Reactions. June edn. (1999)
2. Hanley, M.: Hedging the Force of Nature. Risk Professional 1, 21–25 (1999)
3. Ceniceros, R.: Weather derivatives running hot. Business Insurance 40 (2006)
4. Jewson, S., Brix, A., Ziehmann, C.: Weather Derivative Valuation: The Meteorological,
Statistical, Financial and Mathematical Foundations. Cambridge University Press, Cam-
bridge (2005)
5. Zapranis, A., Alexandridis, A.: Modelling Temperature Time Dependent Speed of Mean
Reversion in the Context of Weather Derivative Pricing. Applied Mathematical Fi-
nance 15, 355–386 (2008)
6. Zapranis, A., Alexandridis, A.: Weather Derivatives Pricing: Modelling the Seasonal Re-
siduals Variance of an Ornstein-Uhlenbeck Temperature Process With Neural Networks.
Neurocomputing (accepted, to appear)
7. Alaton, P., Djehiche, B., Stillberger, D.: On Modelling and Pricing Weather Derivatives.
Applied Mathematical Finance 9, 1–20 (2000)
8. Zhang, Q., Benveniste, A.: Wavelet Networks. IEEE Trans. Neural Networks 3, 889–898
(1992)
9. Benth, F.E., Saltyte-Benth, J.: The volatility of temperature and pricing of weather deriva-
tives. Quantitative Finance 7, 553–561 (2007)
10. Daubechies, I.: Ten Lectures on Wavelets. SIAM, Philadelphia (1992)
11. Mallat, S.G.: A Wavelet Tour of Signal Processing. Academic Press, San Diego (1999)
12. Zapranis, A., Alexandridis, A.: Wavelet analysis and weather derivatives pricing. HFFA,
Thessaloniki (2006)
13. Oussar, Y., Dreyfus, G.: Initialization by Selection for Wavelet Network Training. Neuro-
computing 34, 131–143 (2000)
14. Zapranis, A., Alexandridis, A.: Model Identification in Wavelet Neural Networks Frame-
work. In: Iliadis, L., Vlahavas, I., Bramer, M. (eds.) Artificial Intelligence Applications
and Innovations III. IFIP, vol. 296, pp. 267–277. Springer, New York (2009)
15. Cao, M., Wei, J.: Pricing the weather. In: Risk Weather Risk Special Report, Energy And
Power Risk Management, pp. 67–70 (2000)
16. Davis, M.: Pricing weather derivatives by marginal value. Quantitative Finance 1, 1–4
(2001)
17. Dornier, F., Queruel, M.: Caution to the wind. Weather risk special report. In: Energy
Power Risk Management, pp. 30–32 (2000)
18. Moreno, M.: Riding the temp. Weather Derivatives. FOW Special Support (2000)
19. Caballero, R., Jewson, S., Brix, A.: Long Memory in Surface Air Temperature: Detection
Modelling and Application to Weather Derivative Valuation. Climate Research 21, 127–
140 (2002)
222 A. Zapranis and A. Alexandridis
20. Brody, D.C., Syroka, J., Zervos, M.: Dynamical Pricing of Weather Derivatives. Quanti-
tative Finance 2, 189–198 (2002)
21. Benth, F.E., Saltyte-Benth, J.: Stochastic Modelling of Temperature Variations With a
View Towards Weather Derivatives. Applied Mathematical Finance 12, 53–85 (2005)
22. Oussar, Y., Rivals, I., Personnaz, L., Dreyfus, G.: Training Wavelet Networks for
Nonlinear Dynamic Input-Output Modelling. Neurocomputing 20, 173–188 (1998)
23. Zhang, Q.: Using Wavelet Network in Nonparametric Estimation. IEEE Trans. Neural
Networks 8, 227–236 (1997)
24. Postalcioglu, S., Becerikli, Y.: Wavelet Networks for Nonlinear System Modelling. Neural
Computing & Applications 16, 434–441 (2007)
25. Xu, J., Ho, D.W.C.: A Basis Selection Algorithm for Wavelet Neural Networks. Neuro-
computing 48, 681–689 (2002)
26. Gao, R., Tsoukalas, H.I.: Neural-wavelet Methodology for Load Forecasting. Journal of
Intelligent & Robotic Systems 31, 149–157 (2001)
27. Xu, J., Ho, D.W.C.: A constructive algorithm for wavelet neural networks. In: Wang, L.,
Chen, K., S. Ong, Y. (eds.) ICNC 2005. LNCS, vol. 3610, pp. 730–739. Springer, Heidel-
berg (2005)
28. Benth, F.E., Saltyte-Benth, J., Koekebakker, S.: Putting a price on temperature. Scandina-
vian Journal of Statistics 34, 746–767 (2007)
Using the Support Vector Machine as a
Classification Method for Software Defect
Prediction with Static Code Metrics
David Gray, David Bowes, Neil Davey, Yi Sun, and Bruce Christianson
1 Introduction
Software defect prediction is the process of locating defective modules in soft-
ware and is currently a very active area of research within the software engineer-
ing community. This is understandable as “Faulty software costs businesses $78
billion per year” ([1], published in 2001), therefore any attempt to reduce the
number of latent defects that remain inside a deployed system is a worthwhile
endeavour.
Thus the aim of this study is to observe the classification performance of the
Support Vector Machine (SVM) for defect prediction in the context of eleven
data sets from the NASA Metrics Data Program (MDP) repository; a collection
of data sets generated from NASA software systems and intended for defect
prediction research. Although defect prediction studies have been carried out
with these data sets and various classifiers (including an SVM) in the past, this
study is novel in that thorough data cleansing methods are used explicitly.
The main purpose of static code metrics (examples of which include the num-
ber of: lines of code, operators (as proposed in [2]) and linearly independent
paths (as proposed in [3]) in a module) is to give software project managers
an indication toward the quality of a software system. Although the individual
worth of such metrics has been questioned by many authors within the software
engineering community (see [4], [5], [6]), they still continue to be used.
Data mining techniques from the field of artificial intelligence now make it
possible to predict software defects; undesired outputs or effects produced by
software, from static code metrics. Views toward the worth of using such metrics
for defect prediction are as varied within the software engineering community
as those toward the worth of static code metrics. However, the findings within
this study suggest that such predictors are useful, as on the data used here they
correctly classify modules with an average accuracy of 70%.
2 Background
2.1 Static Code Metrics
Static code metrics are measurements of software features that may potentially
relate to quality. Examples of such features and how they are often measured
include: size, via lines of code (LOC) counts; readability, via operand and oper-
ator counts (as proposed by [2]) and complexity, via linearly independent path
counts (as proposed by [3]).
Consider the C program shown in Figure 1. Here there is a single function
called main. The number of lines of code this function contains (from opening
to closing bracket) is 11, the number of arguments it takes is 2, the number of
linearly independent paths through the function (also known as the cyclomatic
complexity [3]) is 3. These are just a few examples of the many metrics that can
be statically computed from source code.
Fig. 1. A simple C program containing a single function, main (listing beginning with #include <stdio.h>)
Because static code metrics are calculated by parsing source code, their collection can
be automated, and it is thus computationally feasible to calculate the metrics of entire
software systems, irrespective of their size. [7] points out several contexts in which
such collections of metrics can be used.
2.3 Data
The data used within this study was obtained from the NASA Metrics Data
Program (MDP) repository1. This repository currently contains thirteen data
1 http://mdp.ivv.nasa.gov/
Fig. 2. The importance of optimal parameter selection. The solid and hollow dots
represent the training data for two classes. The hollow dot with a dot inside is the test
data. Observe that the test dot will be misclassified if too simple (underfitting, the
straight line) or too complex (overfitting, the jagged line) a hyperplane is chosen. The
optimum hyperplane is shown by the oval line.
Table 1. The eleven NASA MDP data sets that were used in this study. Note that
KLOC refers to thousand lines of code.
sets, each of which represents a NASA software system or subsystem and contains the
static code metrics and corresponding fault data for each comprising module. Note that a
module in this domain can refer to a function, procedure or method. Eleven of these
thirteen data sets were used in this study; brief details of each are shown in Table 1. A
total of 42 metrics and a unique module identifier comprise each data set (see Table 5,
located in the appendix), with the exception of MC1 and PC5, which do not contain the
decision density metric.
All the metrics shown within Table 5, with the exception of error count and error
density, were generated using McCabeIQ 7.1, a commercial tool for the automated
collection of static code metrics. The error count metric was calculated from the number
of error reports that were issued for each module via a bug tracking system; each error
report increments the value by one. Error density is derived from error count and LOC
total, and describes the number of errors per thousand lines of code (KLOC).
3 Method
3.1 Data Pre-processing
The process for cleansing each of the data sets used in this study is as follows:
Initial Data Set Modifications. Each of the data sets initially had its module identifier
and error density attributes removed, as these are not required for classification. The
error count attribute was then converted into a binary target attribute for each instance
by assigning all values greater than zero to defective, and non-defective otherwise.
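This step can be sketched as follows; the column names are illustrative stand-ins for the attribute names in the MDP files.

import pandas as pd

def initial_modifications(df):
    # Drop attributes not needed for classification and turn the error count
    # into a binary defective / non-defective target.
    df = df.drop(columns=["MODULE_ID", "ERROR_DENSITY"], errors="ignore")
    df["DEFECTIVE"] = (df.pop("ERROR_COUNT") > 0).astype(int)
    return df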
Table 2. The result of removing all repeated and inconsistent instances from the data
Name  Original Instances  Instances Removed  % Removed
CM1 505 51 10
KC3 458 134 29
KC4 125 12 10
MC1 9466 7470 79
MC2 161 5 3
MW1 403 27 7
PC1 1107 158 14
PC2 5589 4187 75
PC3 1563 130 8
PC4 1458 116 8
PC5 17186 15382 90
Total 38021 27672 73
Missing Values. Missing values are those that are unintentionally or otherwise
absent for a particular attribute in a particular instance of a data set. The only
missing values within the data sets used in this study were within the decision
density attribute of data sets CM1, KC3, MC2, MW1, PC1, PC2, and PC3.
Manual inspection of these missing values indicated that they were almost
certainly supposed to be representing zero, and were replaced accordingly.
Balancing the Data. All the data sets used within this study, with the exception of KC4,
contain a much larger number of one class (namely, non-defective) than the other. When
such imbalanced data is used with a supervised classification algorithm such as an SVM,
the classifier can be expected to over-predict the majority class [10], as this will
produce lower error rates in the test set.
There are various techniques that can be used to balance data (see [13]). The approach
taken here is the simplest, however, and involves randomly undersampling the majority
class until it becomes equal in size to the minority class. The number of instances
removed during this undersampling process, along with the final number of instances
contained within each data set, are shown in Table 3.
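A sketch of this random undersampling is shown below (illustrative only; the study repeats the whole experiment to average out the sampling bias that this step introduces).

import numpy as np

def balance_by_undersampling(X, y, seed=0):
    # Keep every minority-class instance plus an equally sized random sample
    # of the majority class.
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    minority_idx = np.where(y == minority)[0]
    majority_idx = np.where(y != minority)[0]
    keep = np.concatenate([minority_idx,
                           rng.choice(majority_idx, size=minority_idx.size, replace=False)])
    return X[keep], y[keep]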
Normalisation. All values within the data sets used in this study are numeric, so to
prevent attributes with a large range from dominating the classification model, all
values were normalised to between -1 and +1. Note that this pre-processing stage was
performed just prior to training for each training and testing set, and that each
training / testing set pair was scaled in the same manner [11].
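The scaling can be sketched as follows, using the training-set range for both members of each training/testing pair.

import numpy as np

def scale_pair(train, test):
    # Map every attribute to [-1, +1] using the training-set minimum and maximum,
    # then apply the identical transformation to the test set.
    lo, hi = train.min(axis=0), train.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)        # guard against constant attributes
    transform = lambda A: 2.0 * (A - lo) / span - 1.0
    return transform(train), transform(test)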
Randomising Instance Order. The order of the instances within each data set was
randomised to defend against order effects, where the performance of a predictor
fluctuates due to certain orderings within the data [14].
Due to the high percentage of information lost when balancing each data set
(with the exception of KC4), the experiment is repeated fifty times. This is in
order to further minimise the effects of sampling bias introduced by the random
undersampling that takes place during balancing.
Pseudocode for the full experiment carried out in this study is shown in Fig. 3. Our
chosen SVM environment is LIBSVM [15], an open source library for SVM experimentation.
DATASETS = ( CM1, KC3, KC4, MC1, MC2, MW1, PC1, PC2, PC3, PC4, PC5 )
repeat M times:                              # outer repetitions of the experiment
    for dataSet in DATASETS:
        dataSet = pre_process(dataSet)       # As described in Section 3.1
        repeat N times:                      # inner repetitions of V-fold cross-validation
            for i in 1 to V:                 # V cross-validation folds
                testerSet = dataSet[i]
                trainingSet = dataSet - testerSet
                params = gridSearch(trainingSet)        # select SVM parameters on the training folds
                model = svm_train(params, trainingSet)
                results += svm_predict(model, testerSet)
FinalResults = avg(results)
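For illustration, one fold of the inner loop could be implemented as below with scikit-learn's SVC (which wraps LIBSVM); the exponential parameter grid is a common default, not necessarily the grid used in this study.

from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

def train_and_score(X_train, y_train, X_test, y_test):
    # Grid-search C and gamma for an RBF-kernel SVM on the training folds,
    # then report accuracy on the held-out test fold.
    grid = GridSearchCV(
        SVC(kernel="rbf"),
        {"C": [2.0 ** k for k in range(-5, 16, 2)],
         "gamma": [2.0 ** k for k in range(-15, 4, 2)]},
        cv=StratifiedKFold(n_splits=5),
    )
    grid.fit(X_train, y_train)
    return grid.score(X_test, y_test)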
4 Assessing Performance
The measure used to assess predictor performance in this study is accuracy. Ac-
curacy is defined as the ratio of instances correctly classified out of the total
number of instances. Although simple, accuracy is a suitable performance mea-
sure for this study as each test set is balanced. For imbalanced test sets more
complicated measures are required.
5 Results
The average results for each data set are shown in Table 4. The results show an average
accuracy of 70% across all 11 data sets, with a range of 64% to 82%. Notice that there is
a fairly high deviation within the results. This is to be expected, due to the large
amount of data lost during balancing, and supports the decision to repeat the experiment
fifty times (see Fig. 3). It is notable that the accuracy for some data sets is very
high; for example, with data set PC4, four out of every five modules were correctly
classified.
The results show that all data sets, with the exception of PC2, have a mean accuracy
greater than two standard deviations away from 50%. This shows the statistical
significance of the classification results when compared with a dumb classifier that
predicts all one class (and therefore scores an accuracy of 50%).
Table 4. Mean % accuracy and standard deviation for each data set

Name     % Mean Accuracy   Std.
CM1      68                5.57
KC3      66                6.56
KC4      71                4.93
MC1      65                6.74
MC2      64                5.68
MW1      71                7.3
PC1      71                5.15
PC2      64                9.17
PC3      76                2.15
PC4      82                2.11
PC5      69                1.41
Total    70                5.16
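As a quick check of the two-standard-deviation criterion using the values in Table 4 (the figures below are simply recomputed from the table): for CM1 we have 68 - 2 x 5.57 = 56.86, which is above 50%, whereas for PC2 we have 64 - 2 x 9.17 = 45.66, which is below 50%; PC2 is therefore the only data set that fails this criterion.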
6 Analysis
Previous studies ([16], [17], [18]) have also used data from the NASA MDP
repository and an SVM classifier. Some of these studies briefly mention data
pre-processing; however, we believe that it is important to explicitly carry out all
of the data cleansing stages described here. This is especially true with regard to
the removal of repeating instances, ensuring that all classifiers are being tested
against previously unseen data.
The high number of repeating instances found within the MDP data sets was
surprising. Brief analysis of other defect prediction data sets showed an average
repetition rate of just 1.4%. We are therefore suspicious of the suitability of the data
held within the MDP repository for defect prediction and believe that previous
studies which have used this data and not carried out appropriate data cleansing
methods may be reporting inflated performance values.
An example of such a study is [18], where the authors use an SVM and four of
the NASA data sets, three of which were used in this study (namely CM1, PC1
and KC3). The authors make no mention of data pre-processing other than the
use of an attribute selection algorithm. They then go on to report a minimum
average precision, the ratio of correctly predicted defective modules to the total
number of modules predicted as defective, of 84.95% and a minimum average
recall, the ratio of defective modules detected as such, of 99.4%. We believe that
such high classification rates are highly unlikely in this problem domain due to
the limitations of static code metrics and that not carrying out appropriate data
cleansing methods may have been a factor in these high results.
7 Conclusion
This study has shown that, on the data studied here, the Support Vector Machine
can be used successfully as a classification method for defect prediction. We hope
to improve upon these results in the near future, however, via the use of a one-class
SVM (an extension of the original SVM algorithm that trains upon only
defective examples) or a more sophisticated balancing technique such as SMOTE
(Synthetic Minority Over-sampling Technique).
Our results also show that previous studies which have used the NASA data
may have exaggerated the predictive power of static code metrics. If this is not
the case then we would recommend the explicit documentation of what data
pre-processing methods have been applied. Static code metrics can only be used
as probabilistic statements toward the quality of a module and further research
may need to be undertaken to define a new set of metrics specifically designed
for defect prediction.
The importance of data analysis and data quality has been highlighted in this
study, especially with regard to the high quantity of repeated instances found
within a number of the data sets. The issue of data quality is very important
within any data mining experiment as poor quality data can threaten the validity
of both the results and the conclusions drawn from them [19].
References
1. Levinson, M.: Let's stop wasting $78 billion per year. CIO Magazine (2001)
2. Halstead, M.H.: Elements of Software Science (Operating and programming sys-
tems series). Elsevier Science Inc., New York (1977)
3. McCabe, T.J.: A complexity measure. In: ICSE 1976: Proceedings of the 2nd in-
ternational conference on Software engineering, p. 407. IEEE Computer Society
Press, Los Alamitos (1976)
4. Hamer, P.G., Frewin, G.D.: M.H. Halstead’s Software Science - a critical examina-
tion. In: ICSE 1982: Proceedings of the 6th international conference on Software
engineering, pp. 197–206. IEEE Computer Society Press, Los Alamitos (1982)
5. Shen, V.Y., Conte, S.D., Dunsmore, H.E.: Software Science Revisited: A critical
analysis of the theory and its empirical support. IEEE Trans. Softw. Eng. 9(2),
155–165 (1983)
6. Shepperd, M.: A critique of cyclomatic complexity as a software metric. Softw.
Eng. J. 3(2), 30–36 (1988)
7. Sommerville, I.: Software Engineering, 8th edn. International Computer Science
Series. Addison Wesley, Reading (2006)
8. Menzies, T., Greenwald, J., Frank, A.: Data mining static code attributes to learn
defect predictors. IEEE Transactions on Software Engineering 33(1), 2–13 (2007)
9. Schölkopf, B., Smola, A.J.: Learning with Kernels: Support Vector Machines, Reg-
ularization, Optimization, and Beyond. In: Adaptive Computation and Machine
Learning. The MIT Press, Cambridge (2001)
10. Sun, Y., Robinson, M., Adams, R., Boekhorst, R.T., Rust, A.G., Davey, N.: Using
sampling methods to improve binding site predictions. In: Proceedings of ESANN
(2006)
11. Hsu, C.W., Chang, C.C., Lin, C.J.: A practical guide to support vector classifica-
tion. Technical report, Taipei (2003)
12. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Tech-
niques, 2nd edn. Morgan Kaufmann Series in Data Management Systems. Morgan
Kaufmann, San Francisco (2005)
13. Wu, G., Chang, E.Y.: Class-boundary alignment for imbalanced dataset learning.
In: ICML 2003 Workshop on Learning from Imbalanced Data Sets, pp. 49–56 (2003)
14. Fisher, D.: Ordering effects in incremental learning. In: Proc. of the 1993 AAAI
Spring Symposium on Training Issues in Incremental Learning, Stanford, Califor-
nia, pp. 34–41 (1993)
15. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines (2001),
http://www.csie.ntu.edu.tw/~ cjlin/libsvm
16. Li, Z., Reformat, M.: A practical method for the software fault-prediction. In: IEEE
International Conference on Information Reuse and Integration, 2007. IRI 2007,
pp. 659–666 (2007)
17. Lessmann, S., Baesens, B., Mues, C., Pietsch, S.: Benchmarking classification mod-
els for software defect prediction: A proposed framework and novel findings. IEEE
Transactions on Software Engineering 34(4), 485–496 (2008)
18. Elish, K.O., Elish, M.O.: Predicting defect-prone software modules using support
vector machines. J. Syst. Softw. 81(5), 649–660 (2008)
19. Liebchen, G.A., Shepperd, M.: Data sets and data quality in software engineering.
In: PROMISE 2008: Proceedings of the 4th international workshop on Predictor
models in software engineering, pp. 39–44. ACM, New York (2008)
Appendix
Table 5. The 42 metrics originally found within each data set

Metric Type         Metric Name
McCabe              01. Cyclomatic Complexity
                    02. Cyclomatic Density
                    03. Decision Density
                    04. Design Density
                    05. Essential Complexity
                    06. Essential Density
                    07. Global Data Density
                    08. Global Data Complexity
                    09. Maintenance Severity
                    10. Module Design Complexity
                    11. Pathological Complexity
                    12. Normalised Cyclomatic Complexity
Raw Halstead        13. Number of Operators
                    14. Number of Operands
                    15. Number of Unique Operators
                    16. Number of Unique Operands
Derived Halstead    17. Length (N)
                    18. Volume (V)
                    19. Level (L)
                    20. Difficulty (D)
                    21. Intelligent Content (I)
                    22. Programming Effort (E)
                    23. Error Estimate (B)
                    24. Programming Time (T)
LOC Counts          25. LOC Total
                    26. LOC Executable
                    27. LOC Comments
                    28. LOC Code and Comments
                    29. LOC Blank
                    30. Number of Lines (opening to closing bracket)
Misc.               31. Node Count
                    32. Edge Count
                    33. Branch Count
                    34. Condition Count
                    35. Decision Count
                    36. Formal Parameter Count
                    37. Modified Condition Count
                    38. Multiple Condition Count
                    39. Call Pairs
                    40. Percent Comments
Error               41. Error Count
                    42. Error Density
Adaptive Electrical Signal Post-processing with
Varying Representations in Optical
Communication Systems
1 Introduction
distorted, and its own characteristic pattern of errors introduced into the digital
data stream.
There is great value in a signal post-processing system that can undo some of
these signal distortions, or that can separate line-specific distortions from non-
recoverable errors. Signal post-processing in optical data communication can
offer new margins in system performance in addition to other enabling tech-
niques. A variety of post-processing techniques have already been used to im-
prove overall system performance, such as tunable dispersion compensation and
electronic equalization (see e.g. [3], [4], [9], [11], and references therein). Note
that post-processing can be applied both in the optical and electrical domain
(after conversion of the optical field into electrical current). Application of elec-
tronic signal processing for compensation of transmission impairments is an at-
tractive technique that has become quite popular thanks to recent advances in
high-speed electronics. An adaptive system, as proposed here, is of even greater
value, because it may be tuned to the specific characteristics of each data trans-
mission link, and re-tuned as the characteristics of the link change, which they
inevitably will over time.
In this work we apply machine learning techniques to adaptive signal post-
processing in optical communication systems. We adopt several different rep-
resentations of signal waveforms, including the discrete wavelet transform, and
independent components from independent components analysis. To the best of
our knowledge this is the first time that such techniques have been applied in
this area. One key feature of this problem domain is that the trainable classi-
fier must perform at an extremely high speed, because optical communication
systems typically operate at bit rates of around 40GHz. We demonstrate the
feasibility of bit-error-rate improvement by adaptive post-processing of received
electrical signals.
The data represents the received signal taken in the electrical domain after con-
version of the optical signal into an electrical current. The data consists of a
large number of received bits with the waveforms represented by 32 floating
point numbers corresponding to values of electrical current at each of 32 equally
spaced sample points within a bit time slot. A sequence of 5 consecutive bits is
shown in Figure 1. As already explained the pulse can be classified according to
the current integrated over the width of a single bit. For each of the time slots
in our data we have the original bit that was transmitted. Therefore the data
consists of 32-ary vectors each with a corresponding binary label.
In all we have a stream of 65536 bits to classify. Categorising the vast majority
of these bits is straightforward. In fact, with an optimally set threshold on the electrical
current integrated over the whole time slot (the energy threshold) we can correctly classify
all but 1842 bits. We can therefore correctly classify 97.19% of the data,
an error rate of 2.81%. This error rate is, however, far too high. The
target error rate is less than one bit in a thousand, or 0.1%. Figure 2 (a) gives
an example of a misclassification. The middle bit of the sequence is a 0 but is
identified from its energy as a 1. This is due to the presence of two 1s on either
side and to distortion of the transmitted signal. It would be difficult for any
classifier to rectify this error.
However other cases can be readily identified by the human eye and therefore
could be amenable to automatic identification. Figure 2 (b) shows an example
(Figure 1: received electrical current waveform for the five-bit sequence 1 0 1 0 1, sampled at 32 points per bit slot.)
(Figure 2: received waveforms for the bit sequences (a) 0 1 0 1 1 and (b) 1 0 1 0 0.)
Fig. 2. (a) An example of a difficult error to identify. The middle bit is meant to be
a 0, but jitter has rendered it very hard to see; (b) The central bit has been dragged
down by the two 0s surrounding it and is classified as a 0 from its energy. However to
the human eye the presence of a 1 is obvious.
where the bit pattern is obvious to the eye but where a misclassification actually
occurs. The central bit is a 1 but is misclassified as a 0 from its energy alone.
Table 1. Nine sequences for which difficulties are most likely to occur
0 0 1 0 0
0 0 1 0 1
0 1 0 1 0
0 1 0 1 1
1 0 0 1 1
1 0 1 0 0
1 0 1 0 1
1 1 0 1 0
1 1 0 1 1
Wavelet transforms are similar to Fourier transforms, and are particularly useful
for deconstructing non-periodic and/or non-stationary signals. Wavelet analysis
has been applied successfully to many areas of signal processing. The goal of the wavelet
transform is to turn the information contained in a signal into a set of coefficients
which can then be analysed.
An efficient way to implement the discrete wavelet transform using filters
was developed by Mallat [6]. In wavelet analysis, a signal is decomposed into
approximation and detail components. The approximations are the high-scale,
low frequency components of the signal, which are also called smoothed signals.
The details are the low-scale, high-frequency components.
Denote g as a high pass filter, and h a low pass filter. The original signal
c0 of length N is passed through the two complementary filters. The transform
consists of a convolution of c0 with each of the filters, where every other element
is discarded (a process known as dyadic decimation), and produces the lower
resolution approximation coefficients CA1 and detail coefficients CD1 at level 1.
and

CD_{j-1,k} = Σ_n g_{n-2k} CA_{j,n} .   (2)
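As an illustration of this filter-bank step (a minimal sketch under assumptions: the Haar filters are used purely as an example, since the wavelet family is not restated at this point, and the function names are not from the original work):

import numpy as np

def dwt_level(c0, h, g):
    # Convolve the signal with the low-pass (h) and high-pass (g) filters,
    # then discard every other sample (dyadic decimation).
    approx = np.convolve(c0, h, mode="full")[1::2]   # CA: smoothed signal
    detail = np.convolve(c0, g, mode="full")[1::2]   # CD: high-frequency part
    return approx, detail

h = np.array([1.0, 1.0]) / np.sqrt(2.0)    # Haar low-pass filter (assumed)
g = np.array([1.0, -1.0]) / np.sqrt(2.0)   # Haar high-pass filter (assumed)
c0 = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
CA1, CD1 = dwt_level(c0, h, g)             # level-1 coefficients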
4 Method
One difficulty for the trainable classifier is that in this dataset the vast majority
of examples are straightforward to classify. The hard cases are very sparsely
represented, so that, in an unusual sense, the data is imbalanced. Figure 3
is a diagram of error rates of 0 and 1 as a function of the energy thresh-
old. It shows that if the energy threshold is set to roughly 2.5, then all bits
with energy less than this threshold are correctly classified into the 0 class;
on the other hand, if the energy threshold is set to about 11, then all bits
with energy greater than this are correctly classified into the 1 class. The
optimal energy threshold to separate the two classes is 5.01, in which case only
1842 of the 65532 bits are incorrectly classified - a bit error rate of 2.81%.
Using this threshold we divide the data into easy and hard cases: those
classified correctly by this method are the easy cases, and the remainder are the
hard cases.
Fig. 4. (a) Projection of the easy set using PCA, where the hard patterns are also
projected into the easy ones’ first two principal components space; (b) Eigenwave of
the first principal component
5 Performance Measures
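One of the measures used to assess the trainable classifiers is the information gain (IG) of their predictions. In the standard formulation of [8], the information gain of an attribute A relative to a collection of examples S is

IG(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} ( |S_v| / |S| ) · Entropy(S_v) ,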
where Values(A) is the set of all possible values for attribute A, and S_v is the
subset of S for which attribute A has value v. The value of IG is the number
of bits saved when encoding the target value of an arbitrary member of S,
by knowing the value of attribute A [8]. In this work, one can consider the set
of labels of the waveform signals as S, and the predictions as A, to obtain a value of
IG.
6 Experiments
The first experiment was carried out with Energy and Waveform representations.
The main results are given in Table 3.
Table 3. The results of classifying the different test sets for the different data repre-
sentations. We also give the standard deviation for the test sets.
The SLN classifier does give an improvement over the optimal energy thresh-
old method (Energy-1), with the Waveform-3 and E-W-E datasets giving the
best mean accuracy. There is 10% more information gained with E-W-E when
compared to the optimal threshold method. Interestingly, the very simple SLN
classifier using Energy-3 decreased the error rate on the test sets by nearly 42%
when compared to the optimal threshold method. This classifier is simply a single
unit with 3 weighted inputs.
Table 4. The results of classifying the different test sets for the wavelet representations
there is no big difference between them. In fact, the information gained with
SLN/CA3 is only 0.0012 more than that obtained with SLN/Waveform-3.
However, we must point out that the number of inputs to the SLN with CA3 (18)
is much smaller than with Waveform-3 (96).
Table 5. The results of classifying the different test sets for the ICA representations
The result of SLN/ICA in Table 5 is much the same as that of
SLN/Waveform-3 in Table 3. However, again, the number of inputs to the SLN
with ICA (19) is much smaller than with Waveform-3 (96).
7 Discussion
The fast decoding of a stream of data represented as pulses of light is a com-
mercially important and challenging problem. Computationally, the challenge is
in the need for a classifier that is highly accurate, yet is sufficiently simple that
it can be made to operate extremely quickly. We have therefore restricted our
investigation, for the most part, to SLNs, and used data is either a sampled ver-
sion of the light waveform or just the energy of the pulse. Experiment 1 showed
that an SLN trained with the 96-ary representation of the waveform (using 1
bit either side of the target) gave the best performance, reducing the bit error
rate from 2.81% to 1.33%. This figure is still quite high and we hypothesised
that this could be explained by the fact that despite the data set being very
large (65532 items), the number of difficult examples (those misclassified by the
threshold method) was very small and dominated by the number of straightfor-
ward examples. We undertook experiments 2 and 3 to see if we could construct
a classifier that can correctly identify a significant number of these infrequent
but difficult examples with a much smaller number of feature representations. Al-
though the improvement obtained by using wavelet coefficients and independent
components is minor, the number of inputs to the SLN is much smaller than
that required by Waveform-3. One of the most interesting features of these results
is that they suggest all the wavelet coefficients capture the same information
with respect to this classification task. Further work is needed on much larger
datasets.
This is early work and much of interest is still to be investigated, such as
threshold band sizes, and other methods to identify difficult cases.
References
1. Bell, A.J., Sejnowski, T.J.: An information-maximization approach to blind sepa-
ration and blind deconvolution. Neural Computation 7, 1129–1159 (1995)
2. Bishop, C.M.: Neural Networks for Pattern Recognition. Oxford University Press,
New York (1995)
3. Bulow, H.: Electronic equalization of transmission impairments. In: OFC, Anaheim,
CA, Paper TuE4 (2002)
4. Haunstein, H.F., Urbansky, R.: Application of Electronic Equalization and Error
Correction in Lightwave Systems. In: Proceedings of the 30th European Conference
on Optical Communications (ECOC), Stockholm, Sweden (2004)
5. Hyvärinen, A.: Fast and robust fixed-point algorithms for independent component
analysis. IEEE Trans. on Neural Networks 10(3), 626–634 (1999)
6. Mallat, S.: A theory for multiresolution signal decomposition: the wavelet repre-
sentation. In: IEEE Transactions of Pattern Analysis and Machine Intelligence, pp.
674–693 (1989)
7. Mackay, D.: Information Theory, Inference, and Learning Algorithms. Cambridge
University Press, Cambridge (2003)
8. Mitchell, T.M.: Machine Learning. The McGraw-Hill Companies, Inc., New York
(1997)
9. Rosenkranz, W., Xia, C.: Electrical equalization for advanced optical commu-
nication systems. AEU - International Journal of Electronics and Communica-
tions 61(3), 153–157 (2007)
10. Vidakovic, B.: Statistical Modelling by Wavelets. John Wiley & Sons, Inc., Chich-
ester (1999)
11. Watts, P.M., Mikhailov, V., Savory, S., Bayvel, P., Glick, M., Lobel, M., Chris-
tensen, B., Kirkpatrick, P., Shang, S., Killey, R.I.: Performance of single-mode
fiber links using electronic feed-forward and decision feedback equalizers. IEEE
Photon. Technol. Lett. 17(10), 2206–2208 (2005)
Using of Artificial Neural Networks (ANN) for Aircraft
Motion Parameters Identification
1 Introduction
Calculations and mathematical modeling are essential for aircraft development
and for determining an aircraft's operating limitations, including the estimation of its behavior
safety limits. The recent development of aircraft modeling applied to real-time
analysis of flight data makes it possible to prevent accidents [1,2], but demands a reliable
mathematical representation of the behavior of an aircraft and its systems. However, the applica-
tion of computational methods requires agreement between the computation (math
modeling) results and the experimental data, i.e. it is necessary to identify the math
model parameters from experimental data on the real object's behavior.
The experience gained in developing and applying a procedure for estimating flight dy-
namics math model parameters from flight test data [3, 4] has
shown that the most complicated element of practical identification tasks is the rec-
onciliation of identification results obtained from different samples of initial data. In [3]
an effort was made to solve this problem by identifying correc-
tions to the aerodynamic coefficients with "parallel" optimization of disagreement crite-
ria over two (or several) flight test data fragments obtained in similar conditions at similar
speeds and altitudes (for example: maneuvers with the stick "to the left" and "to the right",
"forwards" and "backwards"). However, when solving the problem in [4], the main effort
went into reconciling the corrections determined by the identification procedure on differ-
ent data samples.
Artificial neural networks (ANN) allow the required relations between the input and
output parameters of an object to be determined. Moreover, unlike traditional
identification methods, neural networks have a memory: the results
can be verified and accumulated over repeated "training" cycles (during the proc-
essing of new samples of initial data). Thus, neural networks make it possible to obtain the re-
quired relations for a wide range of conditions at once and, as a result, to average out the random
factors that are unavoidable in experimental data.
This paper presents the use of a neural network to solve the problem of identifying
the rolling resistance and wheel braking coefficients during the aircraft takeoff or landing run
on a concrete runway. In recent years the international aviation community has
paid much attention to the analysis of aircraft behavior during motion on the runway,
in particular during take-off and landing on a precipitation-covered runway. For in-
stance, since 1996 NASA, the FAA and Transport Canada have been carrying out the
JW/RFMP program (The Joint Winter Runway Friction Measurement Program).
Present-day knowledge about aircraft behavior on contaminated runways is sum-
marized in the amendment to the European certification requirements (NPA No.
14/2004) developed by the JAA, but it concluded that further investiga-
tion of this problem is necessary. Reports on flight accidents related to aircraft overrun-
ning the runway indicate the need for such research.
At this stage our aim was to estimate the efficiency of neural networks as a
means of identifying the parameters of the aircraft motion math model. For this pur-
pose a simple but practically important task was selected: evaluating the influence of
wheel compression and wheel rotational speed on the friction coefficient
value (for a dry concrete runway). This task is presented here in detail.
A more complicated procedure based on ANN application, whose main results are also presented
here, is intended for the identification of dependencies describing wheel resis-
tance and braking performance on a precipitation-covered runway (see [5]).
G – aircraft weight; i – runway slope angle (uphill > 0); P – engines thrust; X_aero = C_D · q · S; q – dynamic
pressure; S – wing area; F_R = µ_R · F_y; F_y = G · cos(i) − Y_aero; Y_aero = C_L · q · S.
Nature of the rolling friction coefficient µ_R is shown in Fig. 1, taken from [6]. In this
figure, parameter "a" determines the displacement of the resultant normal force from the
wheel axis. In compliance with this figure, the force resisting rolling is
equal to F_y · (a/r), i.e. µ_R = a/r (see a and r in Fig. 1). It is clear that on a non-
contaminated flat surface, when other conditions are the same, the value "a/r" de-
pends on wheel compression and wheel rotational speed. For dry, rigid horizontal
surfaces, according to [6], µ_R = 0.005 to 0.03.
Fig. 1.
3 Problem Definition
A neural network was applied as a tool for identification of the aircraft motion math
model parameters. The dependency of the friction coefficient on the ground speed (V_travel)
and the load on the wheels (F_y) must be obtained during the "training" process of the neural
network (minimization of the mismatch between empirical and calculated accelerations).
To solve this problem a simplified math model of aircraft motion along the runway
was developed. For identification purposes, the math model is implemented as the computation
of longitudinal accelerations while reproducing the real aircraft run conditions:

wx_calc = F_x / m,

where F_x refers to formula (1) and m is the aircraft mass.
4 Identification Procedure
DCSL (Dynamic Cell Structure) neural network from “Adaptive Neural Network
Library” [7] was selected as a tool of identification. The units from this library have a
call interface format which seems to be the most suitable for our problem of identifi-
cation. The DCSL unit scheme of data exchange in Matlab Simulink environment is
shown in Fig. 2.
Fig. 2.
Input 'x’ – arguments (in our task, these are Fy and Vtravel) of the relation being deter-
mined by the neural network.
Output 'ys’ – a function value, which is determined by neural network.
Input 'e’ – a quality criterion (misalignment between the current calculated accelera-
tion and the empirical acceleration).
Input 'LE’ – an on-off switch (training process activation).
Output 'X’ – internal parameters of neural network (data array). Training procedure
changes this data array content.
The library includes several versions of neural networks, and other variants were
tested as well during this exercise. The trial runs showed that the DCSL neural network is the
most suitable for our task, as it requires a relatively small number of repeated "train-
ing" cycles to reach high convergence on the different samples of data. It is also very
important that this neural network converges well when returning to the initial
sample after training with the other samples. The main elements of the identification algo-
rithm are shown in Fig. 3. The DCSL unit parameters used for the given task
are shown in Fig. 4.
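The following sketch illustrates the flavour of this identification loop (it is not the authors' DCSL/Simulink setup: the grid of basis functions, the widths, the learning rate and the simplified force balance standing in for formula (1) are all assumptions made purely for illustration):

import numpy as np

def make_centres(fy_range, v_range, n=10):
    # Grid of basis-function centres over wheel load and ground speed.
    return np.array([(f, v) for f in np.linspace(*fy_range, n)
                            for v in np.linspace(*v_range, n)])

def predict_mu(weights, centres, fy, v, width=(20.0, 10.0)):
    d = (np.array([fy, v]) - centres) / np.array(width)
    phi = np.exp(-0.5 * (d ** 2).sum(axis=1))
    return weights @ phi / (phi.sum() + 1e-9), phi

def train_step(weights, centres, sample, mass, lr=0.05):
    fy, v, thrust, drag, wx_emp = sample         # one point of a take-off run record
    mu, phi = predict_mu(weights, centres, fy, v)
    wx_calc = (thrust - drag - mu * fy) / mass   # simplified stand-in for formula (1)
    err = wx_emp - wx_calc                       # the 'e' input of the unit
    # A larger mu lowers wx_calc, so mu is reduced where wx_calc is too small.
    weights = weights - lr * err * phi / (phi.sum() + 1e-9)
    return weights, err

centres = make_centres((100.0, 200.0), (0.0, 80.0))   # illustrative load/speed ranges
weights = np.full(len(centres), 0.02)                 # start from a typical dry-runway mu_R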
5 Identification Results
The identification has been performed using four samples of aircraft take-off runs in
actual conditions. Samples describing the takeoff run for different take-off
weights were selected. About 300 training cycles (50 to 100 for every sample of take-
off run) were done.
Fig. 3. The identification algorithm scheme (the shaped arrows indicate the data that are taken
from the flight experiment records)
The problem of describing aircraft behavior is one of the most challenging due to the
complexity of the object. A recent scheme describing an aircraft as a tuple of three mu-
tually dependent matrices <Dd, Ds, De>, where Dd describes data dependency, Ds state
dependency and De element dependency, proved to be efficient for analyzing both various
segments of the aircraft construction and the aircraft as a whole [1]. Using this approach,
an aircraft's air pressure system was simulated and analyzed in [2]. Applying such an ap-
proach, it is possible to assume that the dependency matrix is simulated using a neural net-
work. But at this stage of the research we require a visual presentation of the µ_R(F_y, V_travel)
dependence obtained. To present the identification results visually, a neural network
"probing" was performed. The results obtained by "probing" are given in graphic form
and corrected manually (smoothed). The obtained dependence is shown in Fig. 5.
Fig. 4.
It is worth mentioning that not all of the "load-speed" combinations shown in Fig. 5
were available in the flight test data. At low speeds (about zero),
aerodynamic unloading is not present, and the loads were not lower than 140 tons.
At high speeds, the loads grew smaller (high loads were absent).
6 Check Modeling
To analyze the validity of the obtained rolling friction coefficient estimations, a check
modeling task has been developed. In this task, the friction coeffi-
cient is determined using three methods:
– by direct usage of the neural network;
– by use of the µ_R(F_y, V_travel) dependence obtained by neural network "probing";
– by usage of a constant value µ_R = 0.02.
Fig. 5. The µ R (Fy , Vtravel) dependency obtained by a) neural network “probing” and b)
smoothed relationship
Fig. 6.
Fig. 7.
F_R + F_D = F_{R+D} – rolling resistance plus contamination drag force;
F_{R+D} = µ_{R+D}(δ_MLG/r_MLG, V, dc) · F_y,MLG + µ_{R+D}(δ_NLG/r_NLG, V, dc) · F_y,NLG;
r_MLG and r_NLG – MLG wheel and NLG wheel radius; δ_MLG and δ_NLG – MLG wheel and NLG wheel inflation;
dc – contaminant thickness; F_B = µ_B(V) · F_y,MLG – braking force; F_y,MLG and F_y,NLG – load on MLG wheels and on NLG wheels;
µ_B = min( µ_B,MAX , µ_t/g(V) )
Fig. 8 shows the final stage of the identification process: how the neural network changes the
braking coefficient (left) and how this influences the ground speed time histories. Fig. 9
shows the identification results: a) for dry and compacted-snow-covered runways (no
contamination drag exists: dc = 0), and b) for a runway covered with slush (dc ≈ 0.02 m). The de-
pendencies are defined by "probing" of the neural networks (ANN) and further smooth-
ing (sm).
Fig. 8.
Fig. 9.
About 200 training cycles were done for every sample of aircraft run. Data samples
were alternated during the training process – about 20 cycles for one sample, then 20
cycles for another, and so on.
In Fig. 10 the obtained braking coefficient dependency for a slush-covered runway is
compared with the dependency recommended in NPA-14 [8]. Unfortunately, so far we
have a single example of empirical data for such conditions, and those data are not quite
suitable for identification purposes – the pilot turned the brakes on and off very frequently
and never used full braking pressure. So, additional empirical data sets must be processed
before more detailed conclusions can be made.
Fig. 10.
References
1. Bukov, V., Kirk, B., Schagaev, I.: Applying the Principle of Active Safety to Aviation. In:
2nd European Conference for Aerospace Sciences EUCASS, Brussels (July 2007)
2. Bukov, V., Schagaev, I., Kirk, B.: Analytical synthesis of aircraft control law. In: 2nd Euro-
pean Conference for Aerospace Sciences EUCASS, Brussels (July 2007)
3. Bondarets, A.Ya.: The flight dynamics math model refinement procedure based on flight
tests data. In: The reports of the 3rd Scientific Conference on Hydroaviation "Gidroaviasalon
2000" (2000) (in Russian)
4. Bondarets, A.Y., Ogolev, Y.A.: The results of Be-200 amphibian aircraft flight dynamics
math model refinement based on flight tests data. In: The reports of the 4th Scientific Confer-
ence on Hydroaviation "Gidroaviasalon-2002" (2002) (in Russian)
5. Bondarets, A.Y., Kreerenko, O.D.: The neural networks application for estimation of
wheels braking actual parameters for an airplane on the runway covered with precipitations.
In: The reports of the 5th scientific conference Control and Information Technologies, Saint Pe-
tersburg (2008) (in Russian)
6. Andresen, A., Wambold, J.C.: Friction Fundamentals, Concepts and Methodology, TP
13837E Prepared for Transportation Development Centre Transport Canada (October 1999)
7. Campa, G., Fravolini, M.L.: Adaptive Neural Network Library, Version 3.1 (Matlab R11.1
through R13), West Virginia University (July 2003)
8. JAA NPA. No 14/2004 on certification specifications for large airplanes (CS-25) Operation
on Contaminated Runways
Ellipse Support Vector Data Description*
1 Introduction
The one-class classification problem is an interesting field in pattern recognition and
machine learning research. In this kind of classification, we take one class of
data as the target class and the rest of the data are classified as outliers. One-class clas-
sification is particularly significant in applications where only a single class of data
objects is available and easy to obtain, while objects from the other classes may be too
difficult or expensive to obtain. So we describe only the target class
in order to separate it from the outlier class. Three general approaches have been proposed to
resolve one-class classification problems [1]:
*
This work has been partially supported by Iran Telecommunication Research Center (ITRC),
Tehran, Iran. Contract No: T/500/1640.
1) In density methods, the density of the target class is estimated directly; the three most
popular density models are the Gaussian model, the mixture of Gaussians and
the Parzen density [2, 3].
2) In the second approach a closed boundary around the target set is optimized. K-
centers, the nearest neighborhood method and support vector data description (SVDD)
are examples of the boundary methods [4, 5].
3) Reconstruction methods are another kind of one-class classification method; they have
not been primarily constructed for one-class classification, but rather to model the
data. By using prior knowledge about the data and making assumptions about the
generating process, a model is chosen and fitted to the data. Some types of recon-
struction methods are: the k-means clustering, learning vector quantization, self-
organizing maps, PCA, a mixture of PCAs, diabolo networks, and auto-encoder
networks.
The SVDD method finds a hypersphere with minimal volume, described by a center a and radius R, that contains all (or most of) the target objects:

F(R, a) = R² ,   (1)

s.t.  ||x_i − a||² ≤ R² ,  ∀i.   (2)

In order to allow for outliers in the training data set, the distance of each training
sample x_i to the center of the sphere should not be strictly smaller than R²; however,
large distances should be penalized. Therefore, after introducing slack variables
ξ_i ≥ 0, the minimization problem becomes:

F(R, a) = R² + C Σ_i ξ_i ,   (3)

s.t.  ||x_i − a||² ≤ R² + ξ_i ,  ∀i.   (4)

The parameter C gives the tradeoff between the volume of the description and the
errors. The constraints can be incorporated into the error function by introducing
Lagrange multipliers and constructing the Lagrangian:

L(R, a, α_i, γ_i, ξ_i) = R² + C Σ_i ξ_i − Σ_i α_i [ R² + ξ_i − (x_i·x_i − 2 a·x_i + a·a) ] − Σ_i γ_i ξ_i .   (5)

Setting the partial derivatives of L to zero gives:

∂L/∂R = 0:  Σ_i α_i = 1,   (6)

∂L/∂a = 0:  a = Σ_i α_i x_i ,   (7)

∂L/∂ξ_i = 0:  C − α_i − γ_i = 0.   (8)

From the above equations and the fact that the Lagrange multipliers are not negative,
when we add the condition 0 ≤ α_i ≤ C, the Lagrange multipliers γ_i can be safely
removed. So the problem can be transformed into maximizing the following function
L with respect to the Lagrange multipliers α_i:

L = Σ_i α_i (x_i·x_i) − Σ_{i,j} α_i α_j (x_i·x_j) ,   (9)

s.t.  0 ≤ α_i ≤ C.   (10)

Note that from Eq. (7), the center of the sphere is a linear combination of the training
samples. Only those training samples x_i which satisfy Eq. (4) by equality are needed
to generate the description, since their coefficients are not zero; these samples are
therefore called support vectors. The radius can be computed using any of the support
vectors x_k:

R² = (x_k·x_k) − 2 Σ_i α_i (x_i·x_k) + Σ_{i,j} α_i α_j (x_i·x_j) .   (11)

To judge whether a test sample z is in the target class, its distance to the center of the
sphere is computed and compared with R; if it satisfies Eq. (12) it is accepted,
and otherwise rejected:

||z − a||² = (z·z) − 2 Σ_i α_i (z·x_i) + Σ_{i,j} α_i α_j (x_i·x_j) ≤ R² .   (12)

SVDD is stated entirely in terms of inner products. For more flexible boundaries, therefore,
the inner products of samples (x_i·x_j) can be replaced by a kernel function K(x_i, x_j), where
K(x_i, x_j) satisfies Mercer's theorem [8]. This implicitly maps the samples into a nonlinear
feature space to obtain a tighter and nonlinear boundary. In this context, the
SVDD problem of Eq. (9) can be expressed as:

L = Σ_i α_i K(x_i, x_i) − Σ_{i,j} α_i α_j K(x_i, x_j) .   (13)

Several kernel functions have been proposed for the SV classifier, but not all kernel func-
tions are equally useful for the SVDD. It has been demonstrated that using the Gaus-
sian kernel

K(x, y) = exp( −||x − y||² / S² ) ,   (14)

results in a tighter description. By changing the value of S in the Gaussian kernel, the
description transforms from a solid hypersphere to a Parzen density estimator.
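As a small numerical sketch of this kernelized SVDD (Eqs. (10)-(14)): the snippet below solves the dual with scipy's general-purpose SLSQP solver rather than a dedicated QP solver, and the values of C and S are arbitrary illustrative choices, not those used in the paper.

import numpy as np
from scipy.optimize import minimize

def gaussian_kernel(X, Y, S=2.0):
    # K(x, y) = exp(-||x - y||^2 / S^2), Eq. (14).
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / S ** 2)

def svdd_fit(X, C=0.5, S=2.0):
    # Maximise Eq. (13) subject to 0 <= alpha_i <= C and sum_i alpha_i = 1.
    K = gaussian_kernel(X, X, S)

    def neg_dual(a):
        return -(a @ np.diag(K) - a @ K @ a)

    res = minimize(neg_dual, np.full(len(X), 1.0 / len(X)),
                   bounds=[(0.0, C)] * len(X),
                   constraints={"type": "eq", "fun": lambda a: a.sum() - 1.0})
    alpha = res.x
    # Radius from an unbounded support vector, Eq. (11) in kernel form.
    sv = int(np.argmax((alpha > 1e-6) & (alpha < C - 1e-6)))
    R2 = K[sv, sv] - 2 * alpha @ K[:, sv] + alpha @ K @ alpha
    return alpha, R2

def svdd_accept(z, X, alpha, R2, S=2.0):
    # Eq. (12): accept z if its kernel distance to the centre is at most R^2.
    K = gaussian_kernel(X, X, S)
    kz = gaussian_kernel(z[None, :], X, S)[0]   # note K(z, z) = 1 for the Gaussian kernel
    return 1.0 - 2 * alpha @ kz + alpha @ K @ alpha <= R2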
Therefore we want to define a new method which generates a better decision boun-
dary around the target class in the input space. In this space a more precise boun-
dary is obtained if we use an ellipse instead of the sphere used in the
SVDD method. In a high-dimensional input space, we can likewise use a hyperellipse as
a substitute for a hypersphere.
Although this technique is most useful for elliptical datasets (Fig. 1 (a)), it also
generates noticeable results on other datasets such as the banana dataset (Fig. 1 (c)).
Moreover, an ellipse is a generalization of a sphere, so in the worst case it
reduces to a sphere and we obtain the same decision boundary as the SVDD me-
thod (Fig. 1 (b)). Here we try to find a hyperellipse with minimum volume which
encloses all or most of the target objects. We demonstrate that this technique results in
a better decision boundary in the feature space as well as in the input space. The problem
of finding the minimum hyperellipse around n samples with d dimensions, represented
by a center a and radii R_j, can be formulated as:
min_{R_j, a}  Σ_{j=1}^{d} R_j² ,   (15)

s.t.  Σ_{j=1}^{d} (x_{ij} − a_j)² / R_j² ≤ 1 ,   i = 1, …, n,  j = 1, …, d.   (16)
Corresponding to the presented SVDD, to allow for outliers in the training data set
each training sample x_i need not lie strictly inside the hyperellipse; however, large
distances should be penalized. After introducing slack variables ξ_i ≥ 0, the minimization
problem becomes:

min_{R_j, a}  Σ_{j=1}^{d} R_j² + C Σ_i ξ_i ,   (17)

s.t.  Σ_{j=1}^{d} (x_{ij} − a_j)² / R_j² ≤ 1 + ξ_i ,  ξ_i ≥ 0,   i = 1, …, n,  j = 1, …, d.   (18)
where C controls the trade-off between the hyperellipse volume and the description
error. In order to solve the minimization problem in Eq. (17), the constraints of
Eq. (18) are introduced into the error function using Lagrange multipliers:

L(R, a, α, γ, ξ) = Σ_{j=1}^{d} R_j² + C Σ_i ξ_i − Σ_i α_i [ 1 + ξ_i − Σ_{j=1}^{d} (x_{ij} − a_j)² / R_j² ] − Σ_i γ_i ξ_i ,   (19)

where α ≥ 0 and γ ≥ 0 are Lagrange multipliers. Note that for each object x_i in
dimension j a corresponding α and γ are defined. L has to be minimized with respect
to R_j, ξ and maximized with respect to α and γ. A test object z is accepted when it
satisfies the following inequality:

Σ_{j=1}^{d} (z_j − a_j)² / R_j² ≤ 1.   (20)
Analogous to SVDD, we might obtain a better fit between the actual data boundary
and the hyperellipse model. Assume we are given a mapping Φ of the data which
improves this fit. We can apply this mapping to Eq. (17) and we obtain:

L(R, a, α, γ, ξ) = Σ_{j} R_j² + C Σ_i ξ_i − Σ_i α_i [ 1 + ξ_i − Σ_{j} (Φ(x_i)_j − a_j)² / R_j² ] − Σ_i γ_i ξ_i .   (21)
According to [9] we can get these Φ functions from the standard kernels which have
been proposed for the support vector classifier. For example, to find the Φ function for
the polynomial kernel with d = 2 in the 2-dimensional space, we carry out the following
procedure:

K(x, y) = (1 + ⟨x, y⟩)²
        = 1 + 2 x₁y₁ + 2 x₂y₂ + x₁²y₁² + 2 x₁x₂ y₁y₂ + x₂²y₂²
        = ⟨Φ(x), Φ(y)⟩ ,  with

Φ(x) = ( 1, x₁², √2 x₁x₂, x₂², √2 x₁, √2 x₂ ).
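A quick numerical check of this mapping (the names are illustrative and the component ordering of Φ is a free choice):

import numpy as np

def phi_poly2(x):
    # Feature map for the degree-2 polynomial kernel in two dimensions.
    x1, x2 = x
    return np.array([1.0, x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

x, y = np.array([0.3, -1.2]), np.array([2.0, 0.5])
lhs = (1.0 + x @ y) ** 2            # kernel value K(x, y)
rhs = phi_poly2(x) @ phi_poly2(y)   # inner product in the feature space
assert np.isclose(lhs, rhs)         # the two agree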
Here we run SVDD and ESVDD on a simple ellipsoid dataset using the above
polynomial kernel. Fig. 2 shows that a better boundary is obtained when the
ESVDD algorithm is used with the same parameters. In other words, the boundaries of
the proposed algorithm, in contrast to the SVDD boundaries, are less influenced by
changing the user-defined parameters.
Using the Gaussian kernel instead of the polynomial kernel results in tighter de-
scriptions in the SVDD method, so we use this kernel for a random ellipsoid dataset.
First we extract the Φ function from the Gaussian kernel. For this reason we expand the
kernel as follows:

K(x, y) = exp( −(x − y)² / s² )
        = exp( −x²/s² ) exp( −y²/s² ) exp( 2xy/s² )
        = exp( −x²/s² ) exp( −y²/s² ) ( 1 + (2xy/s²) + (2xy/s²)²/2! + (2xy/s²)³/3! + … ).

By using proper substitutions we can get the desired Φ function. For example, in the 1-
dimensional input space and using four terms to compute the Taylor formula, the
following Φ function is obtained:

exp( 2xy/s² ) ≈ 1 + 2xy/s² + 2x²y²/s⁴ + (4/3) x³y³/s⁶ ,

Φ(x) = exp( −x²/s² ) ( 1, √2 x/s, √2 x²/s², (2/√3) x³/s³ ).   (22)

For the Gaussian kernel no finite mapping Φ(x) of an object x can be given, but we
can get an approximation of the Φ functions by using the Taylor formula. So we can use
this function for mapping the input space into the feature space.
Fig. 3 compares the SVDD boundary with the ESVDD results. The proposed method
generates the better boundary when the same parameters are used. In this experiment
the Taylor formula is expanded to the first four terms.
Fig. 3. SVDD and ESVDD boundaries with the Gaussian kernel (using ten terms of the Taylor
formula)
We now confront a new difficulty concerning the Φ functions. As mentioned in the pre-
vious section, the Gaussian kernel has no finite mapping Φ(x) of an object x, so we use
an approximation of the Φ functions. If we consider these functions more closely,
interesting results are obtained.
For example, in the 4-dimensional space we get a Φ function which maps this
space into a 10000-dimensional space. However, many of these dimensions have
very small coefficients and can be neglected, so eliminating these dimen-
sions does not introduce a critical error in the final results. Fig. 4 shows the logarithm of
the coefficient for each dimension. Just a few dimensions have considerable coefficients
which are effective in the space transformation.
Therefore we can map the 4-dimensional space into a smaller feature space with fewer
dimensions. These dimensions are selected from the 10000 dimensions as those with con-
siderable coefficients (bigger than 10⁻⁶); the many remaining dimensions are useless
and can be eliminated. In the next section we carry out simple experiments to support this
claim.
4 Experiments
We compare the performances of the SVDD and ESVDD methods with a synthetic
dataset and some datasets taken from the UCI Machine Learning Dataset Repository
[10]. Table 1 provides details about the datasets used here.
Table 1. UCI datasets used for the evaluation of the outlier detection methods
The Iris and balance-scale datasets contain three classes with four features. To use
them for outlier detection, two of the classes are used as the target class while
the remaining class is considered as the outlier. Haberman's Survival dataset has two
classes and uses three features; it contains 225 samples in one class and
81 samples in the other, so it can easily be used for a one-class classification prob-
lem. In this situation, the class with more samples is taken as the target class.
In the first step we create the Φ functions for mapping the various input
spaces into a high-dimensional feature space.
(a)                                  (b)
Input Space       Feature Space      Input Space       Feature Space
Dimensions        Dimensions         Dimensions        Dimensions
2                 100                2                 6
3                 1000               3                 10
4                 10000              4                 15
Table 3. Recognition Rate in SVDD and ESVDD methods for the Iris, balance-scale and Haberman datasets
Table 4. Recognition Rate in SVDD and ESVDD methods for an elliptical synthetic dataset
5 Conclusion
In this paper, we propose a new approach to make the SVDD boundary fit the
contour of the target data more closely. The SVDD method uses a hypersphere, which cannot be a
good decision boundary for the target data in the input space. So we define a hyperel-
lipse instead of a hypersphere and re-derive the equations under this alteration.
On the other hand, the SVDD tries to improve its results by using kernel func-
tions; in this case, the data are mapped to a high-dimensional space and a
hypersphere is then placed around the data in the new space. Experiments show that using a hyperel-
lipse leads to better results in the feature space as well as in the input space. Furthermore, as
an important benefit, it is less influenced by changes in the user-defined parameters, and
we even obtained acceptable results with inappropriate parameters.
References
1. Tax, D.M.J.: One-class classification concept learning in the absence of counter-examples.
Technische Universiteit Delft, Netherlands 65 (2001)
2. Parzen, E.: On estimation of a probability density function and mode. Annals of Mathemat-
ical Statistics 33, 1065–1076 (1962)
3. Bishop, C.: Neural Networks for Pattern Recognition. Oxford University Press, Oxford
(1995)
4. Ypma, A., Duin, R.: Support objects for domain approximation. In: Proceedings of Int.
Conf. on Artificial Neural Networks (ICANN 1998), Skovde, Sweden (1998)
5. Tax, D.M.J., Duin, R.P.W.: Support Vector Data Description. Machine Learning 54, 45–66
(2004)
6. Tax, D.M.J., Duin, R.P.W.: Support vector domain description. Pattern Recognition Let-
ters 20, 1191–1199 (1999)
7. Guo, S.M., Chen, L.C., Tsai, J.S.H.: A boundary method for outlier detection based on
support vector domain description. Pattern Recognition 42, 77–83 (2009)
8. Scholkopf, B., Smola, A.J., Muller, K.: Nonlinear component analysis as a kernel eigenva-
lue problem. Neural Computation 10, 1299–1319 (1999)
1 Introduction
Radial Basis Function Networks (RBFNs) have shown their capability to solve a
wide range of problems, such as classification, function approximation, and time
series prediction, among others.
RBFNs represent a special kind of net since, once their structure has been fixed,
the optimal set of weights linking hidden to output neurons can be analytically
computed. For this reason, many data mining algorithms have been developed to
fully configure RBFNs, and researchers have applied data mining techniques to the
task of finding the optimal RBFN that solves a given problem. Very often, these
algorithms need to be given a set of parameters for every problem they face, thus
methods to automatically find these parameters are required.
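A minimal sketch of that analytical step (assuming the centers and widths have already been fixed by some other procedure, e.g. an evolutionary algorithm; all names here are illustrative):

import numpy as np

def rbf_design_matrix(X, centers, widths):
    # Gaussian activations of every hidden neuron for every input sample.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * widths ** 2))

def fit_output_weights(X, y, centers, widths):
    # With the structure fixed, the hidden-to-output weights are the
    # least-squares solution of Phi @ w = y (pseudo-inverse solution).
    phi = rbf_design_matrix(X, centers, widths)
    w, *_ = np.linalg.lstsq(phi, y, rcond=None)
    return w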
One of the methods present in the literature is Symbiotic CHC RBF [11],[12], a co-
evolutionary algorithm which tries to automatically design RBF neural networks,
establishing an optimal set of parameters for the method EvRBF [14], [15] (an
evolutionary algorithm developed to automatically design asymmetric RBFNs).
Harpham et al. reviewed in [5] some of the best known methods that apply
evolutionary algorithms to RBFN design. As for other kinds of nets, they con-
cluded that methods tend to concentrate on only one aspect when designing
RBFNs. Nevertheless, there also exist methods intended to optimize the whole
net, such as [9] for Learning Vector Quantization (LVQ) nets or [3] for multilayer
perceptrons.
Recent applications try to overcome the disadvantages of the preceding me-
thods. Rivera [16] made many neurons compete, modifying them by means of
fuzzy evolution. Ros et al.'s method [17] automatically initializes RBFNs, finding
a good set of initial neurons, but relies on some other method to improve the
net and to set its widths.
On the other hand, EvRBF [14], [15] is an evolutionary algorithm designed
to fully configure an RBFN, since it searches for the optimal number of hidden
neurons as well as their internal parameters (centers and widths). EvRBF is a
steady-state evolutionary algorithm that includes elitism; it follows a Pittsburgh
scheme, in which each individual is a full RBFN whose size can change, while
the population size remains constant.
Different kinds of co-evolution have also been used to design ANN, as can
be found in literature. Paredis [10] proposed a general framework for the use of
3 Method Overview
This section firstly describes EvRBF, and after this describes Symbio-
tic CHC RBF and SymbPar. These last two evolutionary algorithms have been
developed to find a suitable configuration of parameters necessary for EvRBF.
Like this algorithm, SymbPar's main goal is to automatically design RBFNs by finding
a suitable configuration of parameters for the method EvRBF.
The task of parallelizing an algorithm involves finding the critical points of
the sequential evolutionary algorithm in order to carry out a successful para-
llelization. There are two important properties of the sequential algorithm which
determine its speed. The first is the population size: this parameter is often set high
to obtain suitable diversity, which slows the algorithm down. The other is the fitness
function of the individuals of the population: with complex evaluation functions the
execution time can rocket.
Therefore, sequential evolutionary algorithms sometimes need a huge amount
of physical memory, to keep big populations, and an elevated processing time.
Both the memory requirement and the computation time are good reasons to
attempt to speed up the execution of an evolutionary algorithm by means of pa-
rallelization. In Symbiotic CHC RBF, good solutions are achieved with relatively
small populations (around 50 individuals), so the reason for its parallelization is
the total execution time of the algorithm.
Candidate operations to be parallelized are those that can be applied independently
to every individual, or to a small set of individuals, for example mutation or
evaluation. However, operations which need the whole population, like selection,
cannot be parallelized using this technique. The operation most commonly parallelized is
the individual evaluation, when it is independent of the rest of the individuals [20],
since it is the most costly operation of the whole sequential evolutionary algorithm.
SymbPar tries to solve the problem that Symbiotic CHC RBF has when it
works with large databases and the computation time becomes huge, owing to each
individual evaluation being made sequentially. The evolution schemes of
Symbiotic CHC RBF and SymbPar are the same; nevertheless, there is an im-
portant difference between them: the execution scheme of the evaluations in Symbio-
tic CHC RBF is sequential. This means that the evaluation of the algorithm
EvRBF with the second individual does not begin until the evaluation with the first
individual has finished, and so on. SymbPar, on the other hand, carries out every individual
evaluation in parallel, following a master-worker process scheme [2].
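A minimal sketch of this master-worker evaluation scheme (illustrative only, using Python's multiprocessing rather than the JEO/DRM machinery used by the authors; the dummy fitness below merely stands in for a full EvRBF run):

from multiprocessing import Pool

def evaluate(individual):
    # Placeholder for the expensive fitness evaluation of one individual
    # (in SymbPar this would be a complete EvRBF run).
    return sum(individual)

def evaluate_sequential(population):
    # Symbiotic CHC RBF style: one evaluation finishes before the next starts.
    return [evaluate(ind) for ind in population]

def evaluate_parallel(population, workers=4):
    # SymbPar style: the master hands individuals to worker processes.
    with Pool(processes=workers) as pool:
        return pool.map(evaluate, population)

if __name__ == "__main__":
    population = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    print(evaluate_parallel(population))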
Fig. 3. Differences in the evaluation of individuals when (a) Symbiotic CHC RBF is
used and (b) SymbPar is used
In order to test the performance of Symbiotic CHC RBF and SymbPar, the
algorithms have been evaluated with the following data sets: Flag, German,
Glass, Haberman, Ionosphere, New-thyroid, Pima, Postoperative, Sonar, Vehicle,
and WDBC from UCI data set repository1 .
In addition, for the Ionosphere, Sonar, and WDBC databases, 7 different data
sets generated using various feature selection methods have been used, in order
to test Symbiotic CHC RBF and SymbPar with a higher number of datasets.
Thus the methods have been tested over 29 different classification data sets.
Table 1 shows the features, instances and model parameters of the data sets.
Then, a 10-fold cross-validation method has been used for every data set, so
that every one has been divided into 10 different sets of training-test patterns.
1
http://www.ics.uci.edu/∼mlearn/MLRepository.html
Fig. 4. Execution scheme of Symbiotic CHC RBF for every generation. SymbPar
executes the evaluation steps (underlined in this figure) in parallel, while Symbio-
tic CHC RBF executes them sequentially.
For every training-test couple, the algorithms have been executed three times,
and the results show the average over these 30 executions. So, since we have 29
data sets, a total of 870 executions have been performed.
The set of experiments has been executed using the Scientific Computing
Center of Andalucía's cluster2. In the case of SymbPar, a single machine with four
cores has been used; the work was randomly distributed among the cores by the
operating system itself (Ubuntu 7.10 with a Linux 2.6.22-14-generic SMP kernel).
Moreover, the function library JEO [1], which supports the distribution of
tasks over a distributed virtual machine named DRM [6], has been used.
Table 2 shows the results obtained for the two methods. The table is
organized by columns, showing for both algorithms, from left to right, the
number of neurons (Nodes), the percentage of correctly classified patterns (Test),
and the computation time in minutes (T). For every measure, the
average and standard deviation are shown. The best results with respect to execution
time for every database, that is, the faster of the two algorithms, are marked
in bold.
As can be observed in the table, the quality of the solutions yielded by Symbio-
tic CHC RBF is neither improved nor worsened by SymbPar. The sizes of the nets
found by both methods are quite similar, and so are the classification percentages.
On the other hand, there are obvious differences between the sequential and paral-
lel algorithms regarding execution time, since SymbPar drastically reduces the
computation time on almost every data set. Figure 5 shows the ratio between the results
yielded by Symbiotic CHC RBF and SymbPar for both classification percent-
ages and execution time. Every ratio is calculated by dividing the result of Symbiotic CHC RBF
by that of SymbPar.
2
CICA: http://www.cica.es/condor.html
Fig. 5. Ratio of classification percentages versus ratio of execution times. Each ratio is calculated by dividing the results yielded by Symbio-
tic CHC RBF by the ones yielded by SymbPar.
Bars in the figure show that the ratio for classification percentages keeps close to 1
in all the considered problems. Lines in the figure show the ratio for execution times,
showing that Symbiotic CHC RBF can be up to 37 times slower than SymbPar.
The above observations are corroborated by the Wilcoxon test. For
both the nodes and test measures, the statistical p-value shows that there are no
differences between the algorithms. In the case of time, the Wilcoxon test shows that
SymbPar is definitively better than Symbiotic CHC RBF, since the p-value is
2.625e−06.
Acknowledgment
This work has been partially supported by the Spanish project MYCYT
NoHNES (Ministry of Education and Science - TIN2007-6083), the excellent
project of Junta de Andalucia PC06-TIC-02025, and the project UJA 08 16 30
of the University of Jaen.
References
1. Arenas, M.G., Dolin, B., Merelo, J.J., Castillo, P.A., Fernandez, I., Schoenauer,
M.: JEO: Java evolving objects. In: GECCO2: Proceedings of the Genetic and
Evolutionary Computation Conference (2002)
2. Bethke, A.D.: Comparison of genetic algorithms and gradient-based optimizers on
parallel processors: Efficiency of use of processing capacity. Tech. Rep., University
of Michigan, Ann Arbor, Logic of Computers Group (1976)
3. Castillo, P.A., et al.: G-Prop: Global optimization of multilayer perceptrons using
GAs. Neurocomputing 35, 149–163 (2000)
4. Eshelman, L.J.: The CHC adaptive search algorithm: How to have safe search when
engaging in nontraditional genetic recombination. In: First Workshop on Foun-
dations of Genetic Algorithms, pp. 265–283. Morgan Kaufmann, San Francisco
(1991)
5. Harpham, C., et al.: A review of genetic algorithms applied to training radial basis
function networks. Neural Computing & Applications 13, 193–201 (2004)
6. Jelasity, M., Preub, M., Paechter, B.: A scalable and robust framework for dis-
tributed application. In: Proc. on Evolutionary Computation, pp. 1540–1545 (2002)
7. Kriegel, H., Borgwardt, K., Kroger, P., Pryakhin, A., Schubert, M., Zimek, A.:
Future trends in data mining. Data Mining and Knowledge Discovery: An Inter-
national Journal 15(1), 87–97 (2007)
8. Mayer, A.H.: Symbiotic Coevolution of Artificial Neural Networks and Training
Data Sets. LNCS, pp. 511–520. Springer, Heidelberg (1998)
9. Merelo, J., Prieto, A.: G-LVQ, a combination of genetic algorithms and LVQ. In:
Artificial Neural Nets and Genetic Algorithms, pp. 92–95. Springer, Heidelberg
(1995)
10. Paredis, J.: Coevolutionary Computation. Artificial Life, 355–375 (1995)
11. Parras-Gutierrez, E., Rivas, V.M., Merelo, J.J., del Jesus, M.J.: Parameters estima-
tion for Radial Basis Function Neural Network design by means of two Symbiotic
algorithms. In: ADVCOMP 2008, pp. 164–169. IEEE computer society, Los Alami-
tos (2008)
12. Parras-Gutierrez, E., Rivas, V.M., Merelo, J.J., del Jesus, M.J.: A Symbiotic CHC
Co-evolutionary algorithm for automatic RBF neural networks design. In: DCAI
2008, Advances in Softcomputing, Salamanca, pp. 663–671 (2008) ISSN: 1615-3871
13. Potter, M.A., De Jong, K.A.: Evolving Neural Networks with Collaborative
Species. In: Proc. of the Computer Simulation Conference (1995)
14. Rivas, V.M., Merelo, J.J., Castillo, P.A., Arenas, M.G., Castellanos, J.G.: Evolv-
ing RBF neural networks for time-series forecasting with EvRBF. Information Sci-
ences 165(3-4), 207–220 (2004)
15. Rivas, V.M., Garcia-Arenas, I., Merelo, J.J., Prieto, A.: EvRBF: Evolving RBF
Neural Networks for Classification Problems. In: Proceedings of the International
Conference on Applied Informatics and Communications, pp. 100–106 (2007)
16. Rivera Rivas, A.J., Rojas Ruiz, I., Ortega Lopera, J., del Jesus, M.J.: Co-
evolutionary Algorithm for RBF by Self-Organizing Population of Neurons. In:
Mira, J., Álvarez, J.R. (eds.) IWANN 2003. LNCS, vol. 2686, pp. 470–477. Springer,
Heidelberg (2003)
17. Ros, F., Pintore, M., Deman, A., Chrétien, J.R.: Automatic initialization of RBF
neural networks. In: Chemometrics and intelligent laboratory systems, vol. 87, pp.
26–32. Elsevier, Amsterdam (2007)
18. Schwaiger, R., Mayer, H.A.: Genetic algorithms to create training data sets for
artificial neural networks. In: Proc. of the 3NWGA, Helsinki, Finland (1997)
19. Thompson, J.N.: The Geographic Mosaic of Coevolution. University of Chicago
Press, Chicago (2005)
20. Tomassini, M.: Parallel and distributed evolutionary algorithms: A review. In: Mi-
ettinen, K., et al. (eds.) Evolutionary Algorithms in Engineering and Computer
Science, pp. 113–133. J. Wiley and Sons, Chichester (1999)
New Aspects of the Elastic Net Algorithm for Cluster
Analysis
1 Introduction
Problems of the classification of groups of similar objects exist in many areas of science.
The classification of gene expression patterns in genetics and botanical taxonomy are
two out of many examples. The objects are described as data points in a multivariate
space, and similar objects form clusters in this space. In order to find clusters without
knowledge of a prior distribution function, one uses an arbitrary partition of the data
points into clusters and defines a cost functional which is then minimized. The
minimization yields a distribution function which gives the optimal clusters and their
positions.
One of the most successful methods which uses this procedure is the Elastic Net Algorithm
(ENA) of Durbin and Willshaw [1], which was initially formulated as a heuristic
method and applied to the travelling salesman problem. Later it was given a statistical
mechanics foundation [2], [3] and used for diverse problems in pattern recognition [4],
[5], [6]. In this formulation a chain of nodes interacts with a given set of data points via a
squared-distance cost or energy function. For the distribution of the data points one uses
the principle of maximum entropy, which gives the probability of finding a given distance
between data points and nodes as a function of a parameter which, in a physical analogy,
corresponds to an inverse temperature.
One works with the unknown probability distribution P_{ij} of finding the distance |x_i − y_j| between data point i and node j. Putting the
system in contact with a heat reservoir and using the maximum entropy principle, one
finds with the constraint (2) a Gaussian probability distribution [7],[2]:
P_{ij} = \frac{e^{-\beta E_{ij}}}{Z_i} \qquad (3)

with the partition function for data point i

Z_i = \sum_j e^{-\beta E_{ij}}. \qquad (4)
in form of a chain of springs, with the spring constant λ. Minimization of the free
energy with respect to the node positions gives, for every value of β, a non-linear coupled
equation for the optimal node positions:

\sum_i P_{ij} (x_i - y_j) + \lambda (y_{j+1} - 2 y_j + y_{j-1}) = 0, \quad \forall j. \qquad (8)

We solve the equation for different β values iteratively with the steepest descent algorithm [4]:

\Delta y_j = -\tau \frac{\partial F}{\partial y_j} = \tau \sum_i P_{ij} (x_i - y_j) + \tau \lambda (y_{j+1} - 2 y_j + y_{j-1}), \quad \forall j, \qquad (9)
we note that for \lambda / \sum_i P_{ij} \ll 1 the second term in equation (10) is negligible. This is
the case if there are sufficient data points around the node y_j which contribute with a large
probability. If only few data points contribute, then some next-neighbor
nodes must be present at a small distance, and again the second term is negligible because
the difference between the nodes is small. That means that, in a good approximation, equation
(10) reduces to

y_j = \frac{\sum_i P_{ij} x_i}{\sum_i P_{ij}}, \quad \forall j, \qquad (11)

which represents the optimal centroids of fuzzy clusters around every node [4].
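The update above is straightforward to implement. Below is a minimal sketch of one annealing step implementing (3), (4) and (9), under the assumption that E_ij is the squared distance |x_i − y_j|² (consistent with the squared-distance cost function of the introduction); parameter values are illustrative.

```python
import numpy as np

def ena_step(x, y, beta, lam=0.1, tau=0.001):
    """One steepest-descent update of the node positions (Eq. 9).

    x -- (N, d) array of data points; y -- (M, d) array of chain nodes.
    """
    # Energies and Gaussian probabilities, Eqs. (3) and (4),
    # assuming E_ij = |x_i - y_j|^2 (squared-distance cost function).
    e = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)
    w = np.exp(-beta * e)
    p = w / w.sum(axis=1, keepdims=True)          # P_ij, normalized by Z_i

    # Data term: sum_i P_ij (x_i - y_j) for every node j.
    data_term = p.T @ x - p.sum(axis=0)[:, None] * y

    # Spring term lambda (y_{j+1} - 2 y_j + y_{j-1}); chain ends are left free here.
    spring = np.zeros_like(y)
    spring[1:-1] = y[2:] - 2 * y[1:-1] + y[:-2]

    return y + tau * data_term + tau * lam * spring
```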
In the deterministic annealing method one solves equation (8) first for a low
value of β (high temperature). In this case one should find only one cluster, the whole
cluster; equation (11) gives its centroid, and the free energy is in its global minimum.
Then one solves equation (8) successively for increasing β, tracing the global minimum.
Rose et al. [2] have shown that for the case where no node-node interaction (Enod =
0) is taken into account, the whole cluster splits into smaller clusters and one finds some
of the nodes y_j as optimal centroids of the clusters. They interpret the forming of
different numbers of clusters as a phase transition. In every phase one has a well-defined
number of fuzzy clusters and one can estimate their mean size, using the Gaussian form
of the probability distribution given by equations (3), (4) and (1). One expects that only
those data points for which on the average \frac{1}{2}\beta\sigma_c^2 \approx 1 contribute to the clusters, with σ_c
the average value of the standard deviation. This gives an average value for the size of
the fuzzy clusters:

\sigma_c \approx \sqrt{2/\beta}. \qquad (12)
The same situation is found when one takes the node-node interaction into account as a
perturbation. If there are more nodes than clusters, the surplus nodes will be found
near the nodes which correspond to the clusters. If there are more clusters than nodes, the
additional clusters can no longer be detected. There exists a typical value of β at which the
number of nodes equals the number of clusters that have formed. These clusters are optimal,
i.e., their nodes are the optimal centroids of the corresponding clusters. One would expect
that these clusters are well defined in the sense that their probability distribution has large
values for the data points of the clusters and goes quickly to zero for other data points.
In order to establish a criterion for finding the β value for this situation and stop the
annealing process, we define a hard cluster for every node, determined by the nearest
neighbor data points of the node. The centroid of the hard cluster, the hard centroid, is
given by

\hat{y}_j = \frac{1}{n_j} \sum_{i \in C_j} x_i, \qquad (13)
where C_j is the cluster of nearest-neighbor data points of node y_j, with n_j members. The
hard clusters can be compared with the fuzzy clusters of the optimal situation, and one
expects that at this point the fuzzy clusters and the hard clusters are in closest agreement.
For this stopping criterion we use the centroids and the standard deviations,
where σ_j and σ̂_j denote the standard deviations of the fuzzy clusters and the hard clusters, respectively.
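The hard clusters and their centroids of Eq. (13) can be computed directly from a nearest-node assignment. A minimal sketch follows, with array shapes as in the previous sketch; it also returns the hard-cluster standard deviations used in the comparison.

```python
import numpy as np

def hard_centroids(x, y):
    """Hard clusters C_j = nearest-neighbour data points of node y_j
    and their centroids (Eq. 13), plus the hard-cluster standard deviations."""
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)                 # index of the nearest node per data point
    centroids = np.zeros_like(y)
    std = np.zeros(len(y))
    for j in range(len(y)):
        members = x[nearest == j]
        if len(members):
            centroids[j] = members.mean(axis=0)
            # spread of the hard cluster around its centroid
            std[j] = np.sqrt(((members - centroids[j]) ** 2).sum(axis=1).mean())
    return centroids, std
```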
During the annealing process one passes a series of phase transitions which are characterized
by a certain number of clusters and a deformation of the chain of nodes. As
the energy of the chain Enod is generally much smaller than the interaction energy of
the nodes with the data points, the change of the deformation should be of the order of
the energy of the chain, and Enod clearly shows these phase changes.
As an illustration of the algorithm and the stopping criterion, we apply the ENA to two
artificially created clusters. In the first example, cluster C1, we create a two-dimensional
rectangular homogeneous cluster, and in the second example, cluster C2, we create a
two-dimensional rectangular nonhomogeneous cluster with regions of different density.
Fig. 1 shows the two clusters.
We begin with a chain of 20 nodes which crosses the cluster diagonally. The β-
dependent node positions and energy Enod of the chain are then calculated with the
steepest descent algorithm (9) with a time step τ = 0.001. Table 1 shows the relevant
data for the two clusters.
Fig. 2 shows the node interaction energy Enod for the cluster C1 as a function of β.
[Fig. 1: scatter plots of the two artificial clusters C1 and C2; both axes range from −1.5 to 1.5.]
Fig. 2. Node Interaction energy Enod as a function of β for cluster C1. The arrows indicate the
phases a,b,c,d for data registrations.
Table 1. Number of data points and nodes, initial value β_i, final value β_f, iteration step Δβ and
spring constant λ for the cluster calculations
The arrows indicate the phases for which we registered the node positions, the hard
centroids and the standard deviations. Fig. 3 shows data points, the chain of nodes and the hard
centroids for the phases a: β = 3.8, b: β = 10.3, c: β = 25 and d: β = 35. Fig. 4
shows the standard deviations of the fuzzy clusters, the hard clusters and the mean size σ_c
of estimation (12) for the different phases of C1.
For values of β < 3 in fig. 2, Enod ≈ 0, because the initial chain reduces to nearly a
point. We see a first strong transition at point a, where the nodes extend to a first configuration
of the chain. This corresponds to a connection between the few nodes which form
at this value of β. Fig. 3a shows nodes and hard centroids for this phase and fig. 4a their
standard deviations. The nodes accumulate in 4 points which should correspond to the
optimal centroids of 4 fuzzy clusters. The chain is large and has a high Enod, because
it connects the different nodes in the 4 clusters multiple times. We have not connected
the hard centroids, because they show large dispersions. The standard deviations of
the fuzzy clusters are much larger than the standard deviations of the hard clusters.
The estimation of the mean value σ_c of formula (12) for the fuzzy clusters fits the
average fuzzy cluster size remarkably well.
For increasing β the nodes separate and the chain contracts slightly. In phase b,
fig. 3b, the nodes accumulate in 9 clusters, lowering in this way the interaction energy
between nodes and data points. The hard centroids are now nearer to the nodes, and
fig. 4b shows that the difference between the standard deviations gets smaller. In phase
c the 20 nodes are nearly equidistant and the chain covers the
whole cluster. The hard centroids are approximately in the same positions as the nodes
and one expects that 20 fuzzy clusters exist. They are not yet well defined. This is
seen in fig. 4c, where the difference between the standard deviations is around 0.1. The
stopping criterion (14) is nearly fulfilled.
There is another marked change in Enod between phase c and phase d. In fig. 3d
one notes that the nodes and hard centroids coincide. The standard deviations in fig. 4d
differ by around 0.05. One node and its hard centroid have changed position such that the
nodes cover the area of the data points better. The criterion (14) is much better fulfilled
in phase d and one could stop the annealing process.
Finally we note that the estimates of the mean size σ_c of the clusters in phases a, b, c
and d are good approximations of the mean standard deviation of the fuzzy clusters.
For the second example, cluster C2, we see in fig. 5 that the node interaction energy
Enod shows a much clearer distinction for the different phases. This is a consequence
Fig. 3. Data points (·), nodes and hard centroids for phases a: β = 3.8, b: β = 10.3, c: β
= 25, d: β = 35 of cluster C1
Fig. 4. Standard deviations for fuzzy clusters (−), hard clusters (−•) and the estimation σc (− −)
of the mean size for the phases a, b, c, d of cluster C1
Fig. 5. Node Interaction energy Enod as a function of β for cluster C2. The arrows indicate the
phases a, b, c, d for data registrations.
of the three regions of higher data point density in the inhomogeneous cluster, which
attract more nodes. Fig. 6 shows nodes and hard centroids for the 4 phases a: β = 7.0,
b: β = 15, c: β = 23 and d: β = 70, and fig. 7 the standard deviations and the estimation
(12) of the mean fuzzy cluster size σ_c for these phases.
As in the homogeneous cluster C1, one notes in fig. 6a that in cluster C2, after the
first transition, all nodes are located at nearly the same positions of 4 subclusters, which should
correspond to the 4 fuzzy clusters formed in this phase, with the difference that three of
the positions are now near the high density regions, HDR. Hard centroids and nodes
are different, and in fig. 7a one notes a large difference between the standard deviations.
The standard deviation of the fuzzy clusters is constant for 3 groups of nodes, which
Fig. 6. Data points (·), nodes and hard centroids for phases a: β = 7, b: β = 15, c: β =
23, d: β = 70 of cluster C2
correspond to the nodes in the HDR. The estimation of the mean cluster size gives a
larger value than the standard deviation of the fuzzy clusters.
Fig. 6b shows the next phase, which corresponds approximately to phase b of cluster
C1. One notes, again, 9 positions for the 20 nodes, which should correspond to 9 fuzzy
clusters, but in this case some of them group around the HDR. The hard centroids of
the nodes which are not near the HDR are slightly different from their nodes, but the
hard centroids of the nodes near or inside the HDR show larger deviations from the
corresponding nodes. This seems to be a consequence of the close proximity of the nodes
in this region. Their fuzzy clusters are not well defined because their members come from
low and high density regions. The problem may be solved by using a larger number of
nodes for the chain. This difference is also seen in the standard deviations in figure 7b
for the nodes near or in the HDR. The mean cluster size approaches the fuzzy cluster
standard deviation.
Phase c shows an intermediate situation with 3 groups of nodes in the HDR and a
smaller difference between the nodes and hard centroids and between the standard
deviations (fig. 6c, fig. 7c). The mean cluster size is well approximated by the mean
value of the fuzzy cluster standard deviation.
In phase d (fig. 6d) one finds a larger number of separate nodes and a node concentration
in the HDR. Nodes and hard centroids are now in the same positions. The standard
deviations of fig. 7d are the same for the nodes outside the HDR and get smaller for
nodes in the HDR, in comparison with phases a, b and c. The mean cluster size is a little
smaller than the fuzzy cluster standard deviation. For this case one can determine
Fig. 7. Standard deviations for fuzzy clusters (−), hard clusters (−•) and the estimation σc (− −)
of the mean size for the phases a, b, c, d of cluster C2
the data points of the hard clusters and analyze their properties, which should be in good
accordance with the optimal fuzzy clusters for the given number of nodes.
4 Conclusions
We use a statistical mechanics formulation of the elastic net algorithm of Durbin-
Willshaw [1], [2], with the interaction between the nodes treated as a perturbation, in order
to obtain in an n-dimensional space a variable number of fuzzy clusters which depends on
a temperature parameter, or its inverse β, of the annealing process. For a
given number of nodes of the elastic chain we find a value of β for which the number
of optimal fuzzy clusters is the same as the number of nodes; any further annealing
would produce more and smaller optimal fuzzy clusters, which can no longer be detected.
In order to determine the value of this β, at which one can stop the annealing process,
we define a hard centroid as the centroid of the cluster of nearest neighbors of
every node, the hard cluster, and find as the stopping condition that the hard centroid
associated with every node must coincide with the node position, and the same holds for
their standard deviations. This means that the fuzzy clusters for this number of nodes
are best approximated by these hard clusters, whose known members can be
analyzed for their properties. An estimation of the mean cluster size as a function of β
is given and shows a good approximation to the fuzzy cluster standard deviation. This
means that for a given value of β a mean cluster size is determined, but we do not know
the minimal number of nodes which are necessary to give clusters of this size at the
stopping condition.
Acknowledgements
This work was supported by the project DGIPUCT No. 2008-2-01 of the "Dirección
General de Investigación y Postgrado de la Universidad Católica de Temuco, Chile".
References
1. Durbin, R., Willshaw, D.: An Analogue Approach To The Traveling Salesman Problem Using
An Elastic Net Method. Nature 326, 689–691 (1987)
2. Rose, K., Gurewitz, E., Fox, G.: Statistical Mechanics and Phase Transitions in Clustering.
Physical Review Letters 65, 945–948 (1990)
3. Yuille, A.L.: Generalized Deformable Models, Statistical Physics, and Matching Problems.
Neural Computation 2, 1–24 (1990)
4. Duda, R., Hart, P., Stork, D.: Pattern Classification. John Wiley and Sons Inc., Chichester
(2001)
5. Gorbunov, S., Kisel, I.: Elastic net for standalone RICH ring finding. Proceedings - published
in NIM A559, 139–142 (2006)
6. Salvini, R.L., Van de Carvalho, L.A.: Neural net algorithm for cluster analysis. In: Neural
Networks, Proceedings, pp. 191–195 (2000)
7. Reichl, L.E.: A Modern Course In Statistical Physics. John Wiley and Sons Inc., Chichester
(1998)
8. Ball, K., Erman, B., Dill, K.: The Elastic Net Algorithm and Protein Structure Prediction. J.
Computational Chemistry 23, 77–83 (2002)
Neural Networks for Forecasting in
a Multi-skill Call Centre
1 Introduction
Almost all large companies use call centres (CC) to assist with everything from customer
service to the selling of products and services. Even though CCs have been
widely studied, forecasting remains a weak point, and poor forecasts may imply huge
losses of money and client dissatisfaction due to excessive delays.
In a CC [1], the flow of calls is often divided into outbound and inbound traffic.
Our main concern in this paper is the prediction of inbound traffic. Inbound calls are
those that go from the client to the CC to contract a service, ask for information or
report a problem. This kind of call is significantly different from outbound calls, in
which the agents make calls to potential clients, mainly for commercial purposes.
Inbound calls are modelled and classified into several call groups, CGs, according to
their nature. Once these CGs have been modelled, each call is assigned to a CG.
We assume that there are k types of calls, n customer calls and m agents that may
have up to i skills (i ≤ k). This implies that an agent can handle different types of calls
and, given a type of call, it can be answered by several agents that have that skill.
As can be expected, the mean arrival rate for each call type is not the same, and
these calls have different handling times. The load of the k call types is the total amount
of time required for agent service. Note that the inbound flow in CCs is usually not a
stationary Poisson process [2], and the service times are not exponentially distributed.
Since calls arrive randomly according to a stochastic process, it would be desirable to
have a very balanced distribution of the agents, who may be available or not, in order to
handle the calls as soon as possible. Figure 1 illustrates the relationship among clients'
calls, queues and agents.
This paper starts by explaining the classical problem of forecasting in CCs in Sec-
tion 2. In Section 3, we give some guidelines to better adapt an improved backpropaga-
tion neural network (IBNN) to a multi-skill CC. Section 4 briefly describes different
forecasting techniques which are compared and analysed. The comparative study is
precisely the main contribution of this paper. Section 5 points out some overall ideas
and prospects for future work on CC forecasting.
The number of call arrivals in a given time follows a Poisson distribution, P(a) = μ^a e^{−μ} / a!, where a is
the number of call arrivals in an interval T, and μ is the mean number of call arrivals in time T.
For this reason, pure-chance traffic is also known as Poisson traffic. However, as men-
tioned previously, the prediction of call arrivals in a CC does not often follow a Pois-
son distribution with a deterministic rate [2]. In all studies, the arrival process agrees
with a Poisson process only if the arrival rate of the Poisson process is itself a stochas-
tic process. Characteristically, the variance of the incoming calls in a given interval is
much larger than the mean. However, it should be equal to the mean for Poisson distri-
butions. The mean arrival rate also depends strongly on the day time and often on the
week day. Finally, there is positive stochastic dependence between arrival rates in
successive periods within a day and arrival volumes of successive days. Taking into
account all these premises, we realise the need for a more effective forecasting
method which does not rely on the hypothesis of a simple Poisson arrival
distribution. Section 3 explains, step by step, the procedure to model an NN that forecasts
unknown variables taking into account the nature of a real multi-skill CC environment,
rather than assuming a Poisson distribution when predicting.
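The overdispersion just described (interval variance much larger than the mean) is easy to reproduce in a small simulation. The sketch below, with purely illustrative parameters, draws the arrival rate itself from a Gamma distribution before sampling the Poisson counts, which is one simple way to obtain a stochastic arrival rate; it is not the paper's traffic model.

```python
import numpy as np

rng = np.random.default_rng(42)
n_intervals = 10_000
mean_rate = 30.0  # hypothetical mean number of calls per 5-minute interval

# Plain Poisson arrivals: variance is approximately equal to the mean.
plain = rng.poisson(mean_rate, size=n_intervals)

# Doubly stochastic arrivals: the rate itself fluctuates (Gamma-distributed),
# which resembles the behaviour described for real call-centre traffic.
rates = rng.gamma(shape=4.0, scale=mean_rate / 4.0, size=n_intervals)
doubly = rng.poisson(rates)

print("plain Poisson  : mean=%.1f var=%.1f" % (plain.mean(), plain.var()))
print("stochastic rate: mean=%.1f var=%.1f" % (doubly.mean(), doubly.var()))
# The second variance is clearly larger than the mean (overdispersion).
```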
3.1 Background
In this work, an Improved Backpropagation Neural Network (IBNN) [6] with a single
hidden layer has been developed in order to reduce complexity and computing
times, with the standard parameters made configurable. This type of ANN learns in a
supervised way; in other words, it is necessary to provide a training set of
input examples for which every output is known. The classical Backpropagation
algorithm allows the network to adapt itself to the environment thanks to these example
cases, using gradient-based learning. Our IBNN also implements the momentum
algorithm [3]. The weights in the interval t+1 are determined by the previous weights,
a percentage (η = learning rate) of the change in the gradient in this interval and a
percentage (μ = momentum) of the change in the previous interval, as follows (2):
w(t+1) = w(t) + \eta \frac{\partial E_{t+1}}{\partial w} + \mu \cdot \Delta w(t) \qquad (2)
In this implementation, all the parameters (number of neurons in the hidden layer,
initial weights, learning rate and momentum) have been empirically set to their
optimum values and used in the rest of this study. The implementation also
allows choosing between batch learning (weights are updated at the end of each epoch)
and online learning (weights are updated after each pattern of the dataset).
Although batch learning has its advantages (see [3]), online or stochastic learning
has improved the learning speed, the results and the computing times in our CC forecasting [3].
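As a minimal illustration of the update rule (2) in online mode, the sketch below applies the momentum step to a toy quadratic error; the error function, its gradient and the parameter values are illustrative and not part of the paper's implementation.

```python
import numpy as np

def momentum_step(w, delta_w_prev, grad_e, eta=0.1, mu=0.9):
    """One online update following the sign convention of Eq. (2):
    w(t+1) = w(t) + eta * dE/dw + mu * delta_w(t)."""
    delta_w = eta * grad_e + mu * delta_w_prev
    return w + delta_w, delta_w

# Toy usage with a hypothetical quadratic error E = 0.5 * ||w - target||^2;
# its negated gradient is passed in so the step actually descends.
target = np.array([1.0, -2.0])
w, dw = np.zeros(2), np.zeros(2)
for _ in range(50):
    grad = -(w - target)                     # -dE/dw for this toy error
    w, dw = momentum_step(w, dw, grad, eta=0.1, mu=0.5)
print(w)  # approaches the target vector
```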
The main problem with gradient-descent methods is the convergence to local optima
instead of global optima. Some local optima can provide acceptable
solutions, although these often offer poor performance. This problem can be overcome
by using global optimization techniques, but these techniques require high computing
times. Other improved gradient-based algorithms with more global information, such
as Rprop, iRprop or ARProp [10], were not appropriate because the training set was
too large for them to be applied effectively (see Figure 2). Even if these algorithms could be
applied correctly, the purpose of this paper is to demonstrate, by means of a comparative
study, that even a “classical” NN can outperform other forecasting techniques by
including minor changes in the learning rate.
Once the basic aspects have been covered, how do we start the development of an
NN? Firstly, it is essential to find a suitable dataset, and the best way to obtain this is
to strike a fair balance between the amount of data and a representative period of time
measured in days. The number of days chosen must be a multiple of seven,
because the day of the week has an important influence on the training and validation of the
NN. Moreover, the number of days must be large enough to represent every possible
situation, but not too large, because this makes the training slower.
Our implementation is completely configurable, and this enables us to determine
that the number of days to take into account should be at least 91 in order to
cover all possible patterns under the considerations previously explained.
The dataset has been split into subsets, which correspond to the CGs' data, to build
a different model for each one. Each subset has been randomly divided into three sets,
following the cross-validation structure [4]: training (55%), generalization (20%) and
validation (25%). In this strategy, the training dataset is used to make the NN learn,
the generalization dataset is used to prevent overtraining [7], and the validation dataset,
which has not previously been shown to the NN, is used to obtain a measure of quality
for the NN. The validation set covers the last two weeks, as can be seen in Figure 3
and Figure 4.
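A minimal sketch of this split is given below; it assumes that each CG's patterns arrive as one chronologically ordered array, takes the most recent 25% (roughly the last two weeks of a 91-day set) as the validation set and shuffles the remainder into training and generalization sets. Names and details are illustrative.

```python
import numpy as np

def split_cg_dataset(patterns, rng=np.random.default_rng(0)):
    """Split one CG's chronologically ordered patterns into
    training (55%), generalization (20%) and validation (25%) sets,
    keeping the most recent 25% for validation."""
    n = len(patterns)
    n_val = int(round(0.25 * n))
    validation = patterns[n - n_val:]          # most recent data, never used in training
    rest = patterns[:n - n_val].copy()
    rng.shuffle(rest)                          # random split of the remaining 75%
    n_train = int(round(0.55 * n))
    training, generalization = rest[:n_train], rest[n_train:]
    return training, generalization, validation
```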
3.3 Variables
Choosing the right inputs from all the information we have is not trivial, and it is very
important for obtaining higher performance. Since the variables cannot be defined ad
hoc, the Mann–Whitney–Wilcoxon (MWW) test has been used to obtain a metric of
the relevance of the variables (see Table 1). The MWW test is a non-parametric test for
assessing whether two independent samples of observations come from the same distribution.
This test has been chosen because it is one of the best-known non-parametric significance tests.
Table 1. Relevance of the input variables according to the MWW test

Variable                             Relevance
# Calls in Previous 0-5 Minutes      41.171 %
# Calls in Previous 5-10 Minutes     17.857 %
Night Shift Timetable                11.741 %
Week day                              8.41 %
# Calls in Previous 10-15 Minutes     6.307 %
# Calls in Previous 15-20 Minutes     4.873 %
# Calls in Previous 20-25 Minutes     4.790 %
Minutes of the Day                    3.433 %
Peak Time                             1.415 %
Second Peak Time                      1.398 %
Among all the variables, the volume of incoming calls in previous intervals, the night shift
timetable (NST), the week of the month, the time, the intervals of hours (2, 4 or 8 hours) and
the intervals of peak hours must be highlighted. For almost all CGs, the optimum number
of previous intervals required is usually around 5-6. The NST improves
the results for every CG. When splitting days up into intervals of hours,
predictions also improve. This division into intervals might, however, lead to a wrong
conclusion: these variables are merely correlated with the target, while the causation
comes from the night shift timetable and peak time variables (see Table 1), since
correlation does not imply causality. The improvement is obtained only because these
variables are correlated; only the peak time intervals and the night shift are actually useful
when forecasting. Intervals of peak hours are interesting because these divisions clearly
improve the results for almost all CGs.
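The paper does not spell out how the relevance percentages of Table 1 are computed from the MWW test. A plausible sketch, under the assumption that each candidate variable is scored by how strongly its distribution differs between high-load and low-load intervals and that the scores are then normalized to percentages, could look as follows; this scoring scheme is an assumption, not the authors' procedure.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def mww_relevance(X, y):
    """Rank candidate input variables with the Mann-Whitney-Wilcoxon test.

    X -- (n_samples, n_variables) matrix of candidate inputs
    y -- target (number of incoming calls per interval)
    """
    high = y > np.median(y)                 # split intervals into high-load / low-load
    scores = []
    for j in range(X.shape[1]):
        u, _ = mannwhitneyu(X[high, j], X[~high, j], alternative="two-sided")
        n1, n2 = high.sum(), (~high).sum()
        # distance of U from its null expectation, as a dimensionless effect size
        scores.append(abs(u - n1 * n2 / 2.0) / (n1 * n2 / 2.0))
    scores = np.asarray(scores)
    return 100.0 * scores / scores.sum()    # relevance expressed in percent
```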
3.4 Metrics
To measure the quality of each result, several metrics have been selected. In order to
make the process more understandable, the error is defined as the difference between the
real result obtained in an interval and the result predicted by our NN. Some metrics are
defined as the percentage of predictions whose error lies inside a given interval, which can be
determined by an absolute value or by a percentage of incoming calls. These metrics
are called right rates (RR), specifically the absolute right rate (ARR) when X is defined as
the absolute value used to define the interval (X-ARR), and the percentage right rate
(PRR) when the percentage Y% is used (Y-PRR). To evaluate the model for each pattern,
the mean absolute error (MAE), mean squared error (MSE), sum squared error
(SSE), mean absolute percentage error (MAPE), ARR and PRR have been considered.
In the same way, the error variance, the maximum error and the minimum error
have been measured with the aim of having a metric of the dispersion of the error.
Although MAPE, as well as PRR, is measured, it is not a high-quality metric for those
groups with a reduced volume of calls, because the real values are very small.
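A minimal sketch of these metrics, assuming y_true and y_pred are per-interval call counts for one CG and that X and Y are the user-chosen tolerances mentioned above (here, illustratively, 5 calls and 10%):

```python
import numpy as np

def forecast_metrics(y_true, y_pred, x_abs=5, y_pct=10):
    """MAE, MSE, SSE, MAPE, X-ARR, Y-PRR and error-dispersion measures."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    err = y_pred - y_true
    return {
        "MAE": np.mean(np.abs(err)),
        "MSE": np.mean(err ** 2),
        "SSE": np.sum(err ** 2),
        # MAPE is unreliable when the real value is close to zero (see text)
        "MAPE": 100.0 * np.mean(np.abs(err) / np.maximum(y_true, 1.0)),
        # share of predictions whose absolute error is within x_abs calls
        f"{x_abs}-ARR": 100.0 * np.mean(np.abs(err) <= x_abs),
        # share of predictions whose error is within y_pct % of the real value
        f"{y_pct}-PRR": 100.0 * np.mean(np.abs(err) <= (y_pct / 100.0) * np.maximum(y_true, 1.0)),
        "error variance": np.var(err),
        "max error": np.max(err),
        "min error": np.min(err),
    }
```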
3.5 Adaptations
The number of groups to forecast (320 CGs) and their individual behaviour make it necessary
to determine the initial parameters of the models taking the behaviour of each one into
consideration. To mitigate this problem, the CGs have been divided into sets according
to the mean number of calls per day. This criterion has been chosen because CGs with a
similar volume of incoming calls show similar behaviour. A different
learning rate has been assigned to each set, because this improvement makes the system
more adaptive to the behaviour of each CG.
CGs with fewer than 40 calls per day and non-stationary behaviour are very difficult
to predict. When this happens, the error is minimized in order to avoid moving agents
to these CGs without calls. Of course, this action improves the service rates but, obviously,
a client cannot wait indefinitely. To solve this handicap, the NN returns the
number of agents required in the last interval if the NN has predicted zero calls at the
instant t−5. With this modification, a client always waits less than the time interval
(5 minutes) and the results remain of high quality.
w(t+1) = w(t) + \eta_t \frac{\partial E_{t+1}}{\partial w} + \mu_t \cdot \Delta w(t)

\text{if } (\Lambda\varepsilon_t > \varepsilon_{t+1}) \text{ or } (\varepsilon_{t+1} > 2\Lambda\varepsilon_t): \qquad (3)
\quad \eta_{t+1} = \eta_t \, \Delta\eta \big/ \left(1 + \max(\varepsilon_{t+1} - 2\Lambda\varepsilon_t,\, 0) / (\varepsilon_{t+1} - 2\Lambda\varepsilon_t)\right)
\quad \mu_{t+1} = \mu_t \, \Delta\mu \big/ \left(1 + \max(\varepsilon_{t+1} - 2\Lambda\varepsilon_t,\, 0) / (\varepsilon_{t+1} - 2\Lambda\varepsilon_t)\right)
\text{else: } \eta_{t+1} = \eta_t, \quad \mu_{t+1} = \mu_t, \quad \Lambda\varepsilon_{t+1} = \Lambda\varepsilon_t
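A minimal sketch of the adaptive mechanism in (3) follows; err_new denotes the epoch error ε, accepted_err the accepted error Λε, and d_eta, d_mu the configured change factors (illustrative values). The update of the accepted error inside the if-branch is not legible in the original, so that line is an assumption.

```python
def adapt_rates(eta, mu, accepted_err, err_new, d_eta=1.05, d_mu=1.05):
    """Adaptive learning-rate / momentum update following Eq. (3)."""
    if accepted_err > err_new or err_new > 2.0 * accepted_err:
        # The denominator is 1 when the error improved (rates grow by d_eta, d_mu)
        # and 2 when the error exceeded twice the accepted error (rates shrink).
        factor = 1.0 + (max(err_new - 2.0 * accepted_err, 0.0)
                        / (err_new - 2.0 * accepted_err))
        eta = eta * d_eta / factor
        mu = mu * d_mu / factor
        accepted_err = err_new          # assumption: accepted error tracks the new error
    # otherwise all three quantities stay unchanged
    return eta, mu, accepted_err
```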
The initial accepted error controls the initial interval that decides when to
increase or decrease the other parameters, and it therefore has a significant effect on the
learning speed. If the initial accepted error had a big value, the risk of overtraining would
be very low, but the learning speed would be too slow, making the results
worse after several epochs. If computing time were infinite, the results would
be better with a large value rather than a low one.
The learning-rate change percentage guides the increment or decrement
of the learning rate. If it has a low value, the changes of the learning rate will
be more gradual. It also prevents the NN from overtraining and keeps the learning
speed acceptable when the other parameters are well configured. In other words, a
big value produces abrupt and unpredictable changes in the behaviour of the NN. It is
important to correctly initialise the learning rates for each set and the initial accepted error
in order to be able to configure this parameter.
The accepted-error change percentage controls the variance of the accepted error.
The accepted error should increase when the learning rate increases and decrease
when the learning rate decreases, since the errors obtained are larger when the learning
rate is bigger and the learning rate is decreased when they are lower. A small value
keeps the accepted interval constant; the system then learns faster at the beginning but
reaches overtraining at the end. To prevent the NN from overtraining, the value should be
positive and bigger than the learning-rate change percentage. The momentum change
percentage controls the variance of the momentum. When the momentum is bigger, the
learning speed increases, although it might guide the NN to worse predictions because
changes are propagated faster. This parameter is connected with all the parameters of the
algorithm, but not linearly, which makes the configuration rather difficult.
As mentioned at the beginning of Section 3, the most important problem with gradient
descent methods is the premature convergence to local optima which might
be far from the global optimum. This problem can be solved by using global optimization
techniques. However, these techniques require high computing times. As other
improved gradient-based algorithms with more global information, such as Rprop,
iRprop or ARProp [10], were not appropriate because the training set was too large
for them to be applied effectively (see Figure 2), an adaptive learning rate algorithm
has been proposed. Our modification allows us to quickly analyse our CGs and
outperforms other forecasting techniques.
4 Results
In this section, our results are analyzed and then compared with algorithms from Weka [8]
and the R forecasting package [9] in order to quantify the efficiency of our IBNN and the
convenience of its adaptive learning rate mechanism. Weka and the R forecasting package
have been chosen because they are well-implemented open-source tools. Other
data mining tools such as SPSS or SAS also have powerful algorithms, but we would not
have any insight into the algorithms behind them.
4.1 Analysis
The results of our network present stable behaviour for all CGs, as Figure 3 demonstrates.
CGs with a lower volume of calls, e.g. CG5, present better performance when the
Fig. 3. (a) 5-ARR vs. incoming calls; (b) 10-PRR vs. incoming calls; (c) MSE vs. epochs
4.2 Comparative
Algorithms from Weka [8] and the R forecasting package [9] have been selected and
adapted to be analysed and compared with our implementation. The same parameters,
number of epochs, and values for these parameters have been considered in order to
carry out a fair analysis. In Figure 4, we compare regression models (RM), exponential
time series (ETS), ARIMA models (AM) and NNs with our implementation, highlighting
the performance of our model. RMs present excellent results for some groups
but cannot handle others well enough, e.g. CG1 and CG4. In the same way, AM
and ETS present accurate results for CGs with stable behaviour and/or low incoming
flow but, compared to the ANN, the results are poor for CGs with complex behaviour, e.g.
CG2 and CG3. The ANN models usually present better results than ETS for CGs with
a large volume of calls, e.g. CG1 and CG2. Our adaptive learning rate algorithm
makes the IBNN more adaptable than a simple BNN for those CGs with low incoming
flow, e.g. CG4 and CG5, bringing the prediction error closer to the ETS error.
Another consideration about the adaptive learning rate mechanism has to be made.
The BNN parameters used by the Weka algorithms have been fixed to obtain an optimal
model for each CG but, in practice, the model has to adapt its learning rate to obtain
an optimal value, because CGs can change dramatically in a few months and new ones
may appear. In real time, the optimal parameter for each CG cannot be chosen by hand
every time the model is recalculated, which makes an algorithm that does it necessary.
[Fig. 4. MAE obtained by each technique (Regression, ETS, ARIMA, BNN, IBNN) for the five call groups CG1–CG5.]
5 Conclusions
We have presented the problem of forecasting in a CC and how NNs can help. This
paper also describes an efficient IBNN model and some upgrading adaptations. Our
implementation is an efficient approach for real-time production environments such as a
multi-skill CC. This solution is not universal and might offer worse results than others
for environments in which time is not critical or conditions are more stable. Sometimes,
a more sophisticated learning rate algorithm entails extra computing time that
cannot be afforded in a production environment. In our production environment, there
are 320 CGs that are recalculated in less than two days for a 91-day data set running
under a single processor, making the system truly adaptable to changes. Finally, we
have analyzed our results and compared them to Weka's and R's algorithms, which are
outperformed. Future work should focus on improving the adaptive algorithm
to better control the learning process. In addition, other variables, which have not
been considered, should be collected and meticulously analysed.
Acknowledgment
The authors would like to thank Severino F. Galán for his very useful comments.
References
1. Bhulai, S., Koole, G., Pot, A.: Simple Methods for Shift Scheduling in Multiskill Call
Centers. M&SOM 10(3), 411–420 (Summer 2008)
2. Ahrens, J.H., Dieter, U.: Computer Methods for Sampling from Gamma, Beta, Poisson
and Binomial Distributions. Computing 12(3), 223–246 (1974)
3. Lecun, Y., Bottou, L., Orr, G., Müller, K.: Efficient BackProp. In: Orr, G., Müller, K. (eds.) Neural
Networks: Tricks of the trade, p. 44 (1998)
4. Black, J., Benke, G., Smith, K., Fritschi, L.: Artificial Neural Networks and Job-specific
Modules to Assess Occupational Exposure. British Occupational Hygiene Society 48(7),
595–600 (2004)
5. Mandic, D., Chambers, J.: Recurrent Neural Networks for Prediction: Learning Algorithms,
Architectures and Stability. Wiley, Chichester (2001)
6. Bishop, C.: Neural Networks for Pattern Recognition, pp. 116–149. Oxford University
Press, Oxford (1995)
7. Müller, K.R., Mika, S., Rätsch, G., Tsuda, K., Schölkopf, B.: An Introduction to Kernel-
Based Learning Algorithms. IEEE Transactions on Neural Networks 12(2) (March 2001)
8. Weka project website: http://www.cs.waikato.ac.nz/ml/weka/
9. R Forecasting package website:
http://cran.r-project.org/web/packages/forecasting/
index.html
10. Igel, C., Hüsken, M.: Empirical Evaluation of the Improved Rprop Learning Algorithm.
Neurocomputing 50(C), 105–123 (2003)
Relational Reinforcement Learning Applied to
Appearance-Based Object Recognition
1 Introduction
Relational reinforcement learning (RRL) is an attempt to extend the applicabil-
ity of reinforcement learning (RL) by combining it with the relational description
of states and actions. Disadvantages of RL are, firstly, its inability to handle
large state spaces unless a regression technique is applied to approximate the
Q-function. Secondly, the learned Q-function is neither easily interpretable nor
easily extendable by a human, making it difficult to introduce a priori knowledge
to support the learning process. And last, a generalization from problems the
system was trained on to different but similar problems can hardly be done using
classical RL. In [1], the relational representation of the Q-function was proposed
and studied to overcome these issues. In this paper, we examine the applica-
bility of the relational approach to a computer vision problem. The problem
we consider is common in many computer vision applications (e.g., in manufac-
turing) and consists in the discrimination between similar objects which differ
slightly, e.g., in salient features visible from distinct views only. In our system
a camera is rotated around an object (i.e., it is moved on an imaginary sphere
around the object) until the system is able to reliably distinguish two similar
objects. We want an agent to learn autonomously how to scan an object in the
most efficient way, i.e., to find the shortest scan path which is sufficient to decide
which object is presented. We want our approach to be general enough to extend
from our simulated problem to a real world problem. Therefore we avoid using
information that is uncertain or completely lacking in a real world application.
This includes the camera parameters. Thus, in the state representation neither
the camera’s position nor its viewing direction are encoded, rather the design
is purely appearance-based. The agent receives no information on the current
3D position of the camera. The only utilizable information is the appearance of
the current view of the object in form of features visible in the view at hand.
This will lead to learned rules such as: “seeing those features, an advantageous
next action is to move the camera in that direction”. This direction is encoded
relative to the current view of the agent.
2 Reinforcement Learning
RL [2] is a computational technique that allows an autonomous agent to learn a
behavior via trial and error. A RL-problem is modeled by the following compo-
nents: a set of states S, a set of actions A, a transition function δ : S × A → S,
and a reward function r : S × A → R. The reward function is unknown to the
agent, whose goal is to maximize the cumulated reward. In Q-learning, the agent
attaches a Q(uality)-value to encountered state-action-pairs. After each transi-
tion, this Q-value of the state-action-pair is updated according to the update
equation:
Qt+1 (s, a) = r(s, a) + γ max
Qt (s , a ).
a
In this equation, s' = δ(s, a), while γ is a discount factor ensuring that the Q-
values always stay finite even if the number of states is unbounded. As explained
later, in our application it will also be helpful in search of a short scan path.
During exploration, in each state the next action has to be chosen. This is done
by a policy π. In Q-learning, its decision is based on the Q-values. We use the
ε-greedy policy, which chooses with equal probability one of those actions that
share the highest Q-value, except for a fraction of choices controlled by the
parameter ε, in which the next action is drawn randomly from all actions.
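A minimal tabular sketch of the Q-update and the ε-greedy policy described above, with hypothetical state/action types and illustrative parameter values; note that the system described in this paper approximates Q with stored examples rather than a table.

```python
import random
from collections import defaultdict

GAMMA, EPSILON = 0.8, 0.1       # illustrative values
Q = defaultdict(float)          # Q[(state, action)] -> value

def q_update(s, a, r, s_next, actions):
    """Q_{t+1}(s, a) = r(s, a) + gamma * max_a' Q_t(s', a')."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = r + GAMMA * best_next

def epsilon_greedy(s, actions):
    """With probability epsilon explore; otherwise pick one of the greedy actions."""
    if random.random() < EPSILON:
        return random.choice(actions)
    best = max(Q[(s, a)] for a in actions)
    return random.choice([a for a in actions if Q[(s, a)] == best])
```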
which is simply the weighted arithmetical mean of the nearest Q-values. The
weight is given by a reciprocal distance measure dist_{i,j}, which depends on the
logical modeling of the state-action-pairs and is defined in Sec. 4. We use a value
of n = 30. The strategy to decide whether an example gets included into the
Q-function is basically to test if this example contributes a significant amount
of new information, the Q-function needs denser sampling, or none of these. The
measure of information novelty is based on the local standard deviation σl of
the Q-values in the vicinity of the current state-action-pair. If the difference of
the Q-value qi of the current example to its predicted value q̂i exceeds σl , it is
added to the database, which means that a function quite new takes the value
true:
quite\_new(q_i, c_1) = \begin{cases} \text{true}, & |\hat{q}_i - q_i| > c_1 \sigma_l \\ \text{false}, & \text{else} \end{cases}
for a constant c1 . To decide whether a denser sampling is needed, we relate σl
to the global standard deviation σg of the Q-values of all examples constituting
the Q-function, which means that we consider a function quite sparse:
quite\_sparse(c_2) = \begin{cases} \text{true}, & \sigma_l > c_2 \sigma_g \\ \text{false}, & \text{else} \end{cases}
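A minimal sketch of this inclusion test, assuming an example is stored whenever either condition holds, the local statistics are taken over the n = 30 nearest stored examples, and c1, c2 carry illustrative default values.

```python
import numpy as np

def should_store(q_new, q_predicted, local_qs, global_qs, c1=1.0, c2=0.5):
    """Decide whether a new (state, action, q) example enters the Q-database.

    local_qs  -- Q-values of the n nearest stored examples (n = 30 in the paper)
    global_qs -- Q-values of all stored examples
    """
    sigma_l = np.std(local_qs)
    sigma_g = np.std(global_qs)
    quite_new = abs(q_predicted - q_new) > c1 * sigma_l     # novel information
    quite_sparse = sigma_l > c2 * sigma_g                   # region needs denser sampling
    return quite_new or quite_sparse
```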
4 Appearance-Based Modeling
In this section we first describe how we represent the states and actions in our
application. Afterwards we explain the measure of the distance dist_{i,j} between
two state-action-pairs i and j.
positions on a sphere [5]. In contrast, our approach will allow arbitrary camera
positions, because the positions will be encoded only implicitly by the perceived
visual appearance and not by camera parameters. This appearance is captured
in a set of features (e.g. interest points, but not necessarily). Each feature fi
is attached to the view in which it is detected. For each view we describe the
visibility of a feature by the expression visibility(f_i). A state
can now be defined by a list of visible features. The notation f_i stands for a
feature, and the index identifies this feature globally, i.e., across views. Since it
is only necessary to identify whether or not two image features represent the
same global feature when computing the similarity between states, a pair-wise
feature matching can be used to implement this global assignment. An encoding
of a state can, for example, look like this: visibility(f_1) ∧ visibility(f_7) ∧
visibility(f_{14}). This means that in this state the globally numbered features
f_1, f_7, and f_{14} are visible. Thus, the current state of the RL system is defined
by the features which are visible in the current view.
Fig. 1. Left: Derivation of actual camera movements in 3-space from their directions
in 2-space of image planes. Right: Behavior of the state similarity for two examples.
of the state similarity function s_{i,j,visibility}. For each of the recorded views
the state similarity to a reference view is calculated. The red curve represents
the similarity values for the object ”dreezle”, the reference view of which is de-
picted in the upper left, the green curve does the same for the reference view of
the object ”world”, shown in the upper right. Both reference views have been
recorded at the position of zero degrees. The similarity function provides reliable
values, even for the object ”world”, where the reference view contains very few
features only. In addition, the inflection point p we use for the threshold function
is drawn as a dotted line at the similarity value of 0.4.
Distance Measure for State-Action-Pairs. We now examine the distance function
dist_{i,j} mentioned in Sec. 3. It is based on a similarity measure sim_{i,j} : (i, j) → [0; 1]:

dist_{i,j} = 1 − sim_{i,j}.

This similarity measure is defined as the product of the state similarity and the
action similarity, sim_{i,j} = sim^S_{i,j} · sim^A_{i,j}. Both sim^S and sim^A are functions
with range [0; 1]. They are defined as sim^S_{i,j} = t(s_{i,j,visibility}) and
sim^A_{i,j} = t(s_{i,j,to}), where t is a sigmoidal function that maps its argument to
[0; 1]. It imposes a soft threshold to punish poor values quickly while at the same
time allowing small deviations from the optimum: t(x) = 1 / (1 + e^{−μ(x−p)}), where μ is
the steepness of the slope (set to 20 in our tests) and p is the inflection point
(set to 0.4 for sim^S_{i,j} and to 0.8 for sim^A_{i,j}). The unthresholded state similarity
s_{i,j,visibility} is based on the amount of overlap between two views.
A larger overlap leads to a larger similarity. The overlap is measured by
counting the features that reference the same global feature. This is measured
Fig. 2. Left: Appearance-based choice of actions for two objects. Red lines indicate
possible directions for the next camera movement. Features are indicated by red dots
on the object (close to the direction lines). The directions are computed as local maxima
of the variation of the feature density. Right: Short scan paths are enforced by using a
discount factor γ ∈]0; 1[ and giving zero rewards in all states except for the goal state.
A shorter path leads to a less discounted reward. The exponents of γ are derived from
the iterated application of the update equation.
by actually carrying out a feature matching between both views. The relation of
the amount of same features to the amount of all features is used as the similarity
measure. Let visible(x) be the set of all features visible in state x. Then
s_{i,j,visibility} = \frac{|visible(i) \cap visible(j)|}{|visible(i) \cup visible(j)|}. \qquad (1)
The right diagram of Fig. 1 illustrates the property of this similarity measure by
means of two examples. The term s_{i,j,to} expresses the unthresholded similarity
between two actions. It depends on the to-expressions introduced in Sec. 4. Two
of these expressions point to the same state if the cameras capture the same
features after moving in those directions. We use the distance of the camera
centers after moving them along the directions the actions point to. The approach
is also depicted in the left diagram of Fig. 1.
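A minimal sketch of these similarity and distance computations, representing a state by its set of global feature IDs; the action-similarity input s_to is passed in precomputed because it depends on geometric information not detailed here.

```python
import math

def sigmoid_threshold(x, p, mu=20.0):
    """Soft threshold t(x) = 1 / (1 + exp(-mu * (x - p)))."""
    return 1.0 / (1.0 + math.exp(-mu * (x - p)))

def state_similarity(features_i, features_j):
    """Unthresholded state similarity (Eq. 1): overlap of global feature IDs."""
    features_i, features_j = set(features_i), set(features_j)
    if not features_i and not features_j:
        return 0.0
    return len(features_i & features_j) / len(features_i | features_j)

def distance(features_i, features_j, s_to):
    """dist = 1 - sim, with sim = t(state similarity) * t(action similarity)."""
    sim_s = sigmoid_threshold(state_similarity(features_i, features_j), p=0.4)
    sim_a = sigmoid_threshold(s_to, p=0.8)
    return 1.0 - sim_s * sim_a
```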
5 Application
The previously described framework has been used in an application that learns
scan paths which are short but nevertheless allow the discrimination of two
very similar objects. These objects look essentially the same except for a minor
difference, for example in texture as seen in the left column of Fig. 3. This figure
illustrates the following. Left: example objects and their differences. Upper row:
The two versions of object ”dreezle” can be distinguished by the starlike figure
on its belly. Lower row: The two versions of object ”world” can be distinguished
by the yellow scribbling. Right: Phase 1 of the identification of the goal state
(determining the most similar view for each object). Each row shows one example
of the image retrieval process. The left column shows the current view of the
agent. For each object in the database we determine the most similar view to the
one in the left column. The right column shows the determined most similar view
of only one special object, namely of that object which together with the object
RRL Applied to Appearance-Based Object Recognition 307
Fig. 3. Left: Example objects and their differences. Right: Phase 1 of the identification
of the goal state (determining the most similar view for each object).
of the left column makes up the pair of similar objects. To set up the learning
task for each pair of objects, both objects are scanned and their features are
stored in a database. This database is used to generate the reward for the agent.
One of these objects is then presented to the agent. The agent moves around the
object, records images, and stops when she receives a reward of 1. Then the next
episode starts. As a result, the agent learns to scan an object in such a way that
she is able to distinguish it from a very similar object stored in the database.
Feature Detection and Calculation of Descriptors. To recognize and match fea-
ture points between images of an object, which have been taken from different
camera positions, they are required to be reasonably stable. To comply with this
requirement, we use a scale space Harris detector, combining the real time scale
space of Lindeberg [6] with the Harris detector [7], while modifying the latter to
make it work in a scale space pyramid [8]. To attach descriptors to these feature
points we use Lowe’s SIFT descriptors [9]. While he only uses those feature coordinates
that are maxima in scale space, we take all feature points into account
as long as they are a local maximum in their own scale. This overcomes some
stability deficiencies with the original approach that we experienced and that
have been reported in [10] as well. As a result we get up to 500 features in each
image. To reduce this amount, each camera takes a second image from a slightly
different view point. Then, we apply a feature matching using a kd-tree [11]. The
resulting correspondences are filtered by applying the epipolar constraint [12].
Only those feature points that survive this procedure are stored in our database.
We aim at about 150 features for each view.
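A rough sketch of such a feature pipeline using OpenCV is shown below. It is an approximation, not the authors' implementation: OpenCV's SIFT detector, a FLANN kd-tree matcher and RANSAC-based epipolar filtering stand in for the scale-space Harris detector, kd-tree matching and epipolar constraint described above.

```python
import cv2
import numpy as np

def stable_features(img_a, img_b):
    """Detect SIFT features in two nearby views and keep only matches
    that survive the epipolar constraint (the stored-feature filtering step)."""
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(img_a, None)
    kp_b, des_b = sift.detectAndCompute(img_b, None)

    # kd-tree based matching (FLANN) with a ratio test
    flann = cv2.FlannBasedMatcher({"algorithm": 1, "trees": 5}, {"checks": 50})
    matches = flann.knnMatch(des_a, des_b, k=2)
    good = []
    for pair in matches:
        if len(pair) == 2 and pair[0].distance < 0.8 * pair[1].distance:
            good.append(pair[0])

    pts_a = np.float32([kp_a[m.queryIdx].pt for m in good])
    pts_b = np.float32([kp_b[m.trainIdx].pt for m in good])

    # epipolar constraint via a RANSAC-estimated fundamental matrix
    F, mask = cv2.findFundamentalMat(pts_a, pts_b, cv2.FM_RANSAC, 3.0, 0.99)
    keep = mask.ravel().astype(bool)
    return [(kp_a[m.queryIdx], des_a[m.queryIdx]) for m, k in zip(good, keep) if k]
```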
Rewards. We aim at learning the shortest scan path around the sphere of an
object to a view that allows for the reliable discrimination from another similar
object. As the agent aims to maximize the sum of all rewards, positive rewards
in each step will keep the agent away from the goal state as long as possible. For
this reason we do not reward a simple movement at all. The only action that
receives a reward is a movement to the goal state. A shortest path can be made
attractive to the agent by (ab)using the discount factor γ. If we set γ := 1, all
Q-values will approach 1, since r is always 0. But with γ ∈]0; 1[ we will find a
greater Q-value attached to states closer to the goal than to those farther away.
This is illustrated in the right of Fig. 2 and leads to the desired preference of
short paths towards the goal state.
Identification of the Goal State. As noted in the last subsection the reward is
zero for all actions except those reaching the goal state. In fact, in our setup the
goal state is defined as the state where the reward is one. In each step an image
I is captured. After computing features and their descriptors of this image, they
are used to identify the goal state. This identification proceeds in two phases.
Phase 1 iterates through all objects inside the database (which consists of several
pairs of pairwise similar objects) and identifies for each object the most similar
view. This is basically an image retrieval task. Given one object O, our approach
begins with building a kd-tree KO with all descriptors belonging to O. For each
descriptor DI,i of I, the most similar descriptor DO,j in KO is identified and
its Euclidean distance d(D_{I,i}, D_{O,j}) is computed. This distance d is used to vote
for the views of object O that D_{O,j} belongs to: score(d) = 1/d (taking the prevention
of division by zero into account). These scores are accumulated over all descriptors
of I for each view separately. The one which receives the largest sum is taken
as the most similar view of object O to I. This is done for all objects O in the
database. The right column of Fig. 3 shows resulting pairs of images. Phase 2
aims at the decision whether the most similar images, e.g., IO1 and IO2 , of two
candidate objects O1 and O2 show a significant difference. If so, we have reached
our goal of finding the most discriminative views of two similar objects. Then
we can mark the current state as a goal state by giving a reward of one. Finally
we can find out which of the similar objects we have currently at hand. To do
this, we reuse the similarity measure si,j,visibility of Sec. 4. We compute the
similarity of image I (corresponding to state i in (1)) to both candidates and
take a normalized difference:
|sI,IO1 ,visibility − sI,IO2 ,visibility|
g= (2)
max(sI,IO1 ,visibility, sI,IO2 ,visibility)
If this value g exceeds a certain threshold, the most discriminative view between
the two similar objects is considered identified and the current episode ends.
(Once this discriminative view is identified it is simple to determine which of
both objects has been scanned.) For our learning scheme a threshold of 0.15
suffices. Fig. 4 shows results of this approach. Phase 2 of the identification of the
goal state (finding the discriminative view) is illustrated here. On the abscissa the
angles in the range of [0; 2π] of a camera are logged, which is rotated around an
object with a fixed axis through the center of the object. The ordinate represents
the values of the normalized difference g (cf. (2)). Each value of the red curve
is computed from three images: the test image I and the two best candidate
images IO1 and IO2 . For the red curve image I shows the ”dreezle” figure at a
zero rotation angle, depicted on the left in the upper row of Fig. 3. One example
Fig. 4. Phase 2 of the identification of the goal state (finding the discriminative view)
for IO1 is the left image in the bottom row, which shows one view of the ”dreezle”
with the star on its belly; IO2 (which is not shown) is the same view of the object
without the star. The resulting difference value g for this example is larger than
the threshold of 0.15 marked by the dotted line. Thus this view is discriminative
enough to tell both objects apart. In contrast, the view in the middle of the
bottom row is not discriminative enough, as it does not reveal if the object has a
star on its belly. This fact is represented well by the low value of the normalized
difference function g. The upper row and green curve show more examples for the
object ”world”. Here the reference view I has also been recorded at the position
of zero degrees. This diagram illustrates that the objects can be distinguished
more reliably with a larger visibility of their discriminative part.
Results. The left column of Fig. 5 illustrates the development of the set of sam-
ples of state-action-pairs constituting the Q-function. After the first episode only
two samples are found inside the sample set. We will briefly examine these two
"points.dat" "points.dat"
1 1
40 0.8 40 0.8
20 0.6 20 0.6
0 0.4 0 0.4
-20 0.2 -20 0.2
-40 -40
0 0
40 40
0 20 0 20
40 20 0 -20 -40 -40 -20 40 20 0 -20 -40 -40 -20
"points.dat" "points.dat"
1 1
40 0.8 40 0.8
20 0.6 20 0.6
0 0.4 0 0.4
-20 0.2 -20 0.2
-40 -40
0 0
40 40
0 20 0 20
40 20 0 -20 -40 -40 -20 40 20 0 -20 -40 -40 -20
Fig. 5. Left: Learned rules in the form of state-action-pairs used to approximate the
Q-function. Right: Application of the learned rules in a discrimination task.
samples to comprehend the constitution of the Q-function. The first reward the
agent gets after the first episode is zero. Because nothing is known about the Q-
function, this state-action-pair is immediately added to the set of samples. This
sample predicts the reward of all following state-action-pairs also as zero. This
holds true until the first step encounters a reward of 1. Then one additional sam-
ple is added to the Q-function: the state-action-pair that led to the goal state.
The state-action-pair with the goal state is not inserted because it does not meet
the conditions presented in Sec. 3. This is the end of the first episode. Further
episodes insert samples according to the rules presented in the end of Sec. 3.
Basically, it is tested if the Q-function needs denser sampling or not. The right
column of Fig. 5 shows the paths an agent takes when she uses the Q-functions
depicted in the left column of Fig. 5. It is obvious that the paths get shorter the
more episodes have been completed. The examples clearly indicate the applica-
bility of the RRL approach to computer vision learning tasks, especially because
of its capability of handling continuous domains. Additionally, the Q-function
consists of comprehensive state-action-pairs, where each pair encodes a rule that
indicates to the agent the merit of moving the camera in a certain direction with a
given view at hand. This way, a human trainer can easily add information to the
Q-function, e.g., by simply presenting a view of the object and a corresponding
preferable direction. In addition, the relational encoding removes dependencies
on coordinate systems, which may arise when using traditional RL approaches
that use camera positions as their basis of a state’s encoding. Fig. 5 in detail:
6 Conclusion
7 Future Research
This work leads to a number of possible future extensions. The image retrieval
using a simple kd-tree is quite slow and can be accelerated. Using the best-bin-first
technique of [11] accelerates the process slightly, but the matching reliability
deteriorates quickly, and values below 20,000 comparisons are strongly discouraged.
aged. An integration of the generation of the object database into the learning
algorithm would result in a system enabling the agent to explore her environ-
ment and constantly add features into the object database. Similar objects can
share some of the feature sets. However, to generate representations of distinct
objects, a criterion will have to be developed that groups feature sets according
to their belonging to the same unique object type.
References
1. Dzeroski, S., De Raedt, L., Driessens, K.: Relational reinforcement learning. In:
Machine Learning, vol. 43, pp. 7–52 (2001)
2. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press,
Cambridge (1998)
3. Gartner, T., Driessens, K., Ramon, J.: Graph kernels and Gaussian processes for
relational reinforcement learning. In: Inductive Logic Programming, 13th Interna-
tional Conference, ILP (2003)
4. Driessens, K., Ramon, J.: Relational instance based regression for relational rein-
forcement learning. In: Proceedings of the Twentieth International Conference on
Machine Learning, pp. 123–130 (2003)
5. Peters, G.: A Vision System for Interactive Object Learning. In: IEEE International
Conference on Computer Vision Systems (ICVS 2006), New York, USA, January
5-7 (2006)
6. Lindeberg, T., Bretzner, L.: Real-time scale selection in hybrid multi-scale repre-
sentations. In: Griffin, L.D., Lillholm, M. (eds.) Scale-Space 2003. LNCS, vol. 2695,
pp. 148–163. Springer, Heidelberg (2003)
7. Harris, C., Stephens, M.: A Combined Corner and Edge Detector. In: 4th ALVEY
Vision Conference, pp. 147–151 (1988)
8. Mikolajczyk, K., Schmid, C.: Scale and affine invariant interest point detectors.
International Journal of Computer Vision 60(1), 63–86 (2004)
9. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Com-
put. Vision 60(2), 91–110 (2004)
10. Baumberg, A.: Reliable feature matching across widely separated views. In:
CVPR 2001, p. 1774 (2000)
11. Beis, J., Lowe, D.: Shape indexing using approximate nearest-neighbor search in
high-dimensional spaces (1997)
12. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd
edn. Cambridge University Press, Cambridge (2004)
Sensitivity Analysis of Forest Fire Risk Factors and
Development of a Corresponding
Fuzzy Inference System:
The Case of Greece
Abstract. This research effort has two main orientations. The first is a sensitivity
analysis of the parameters that are considered to influence the problem of forest
fires, conducted with Pearson's correlation analysis for each factor separately. The
second target is the development of an intelligent fuzzy (rule-based) inference
system that ranks the Greek forest departments according to their degree of forest
fire risk. The system uses fuzzy algebra in order to categorize each forest
department as “risky” or “non-risky”. The rule-based system was built in the
MATLAB integrated fuzzy environment and the sensitivity analysis was conducted
using SPSS.
1 Introduction
Forest fires have been a major issue for several countries of the world over the last
century. Many countries have developed their own models of forecasting the location
of the most severe forest fire incidents for the following fire season and the extent of
the damages. Our research team has designed and implemented the FFR-FIS (Forest
Fire Risk Inference System), which provides a ranking of the Greek forest depart-
ments according to their degree of forest fire risk. The system uses fuzzy logic and it
can provide several results obtained by different scenarios, according to the optimism
and to the perspective of its user. It is adjustable to any terrain and can be used under
different circumstances. The first attempt to characterize the Greek forest departments
as risky or not risky gave promising results that have proven to be a close approxima-
tion of reality. Initially, the first version of the system considered four factors that
influence the degree of forest fire risk whereas this research effort proposes a restruc-
tured and adjusted system to use seven major factors. The determination of the
optimal vector of factors that influence the overall degree of risk for every forest de-
partment is a difficult task and the gathering of data is a tedious procedure, expensive
and time consuming. This study is a first attempt towards the development of an ef-
fective and simple Fuzzy Inference System (FIS) capable of considering the factors
that are of great importance towards forest fire risk modeling in Greece.
2 Theoretical Framework
The FFR-FIS was designed and developed under the Matlab platform. It employs the
concepts and the principles of fuzzy logic and fuzzy sets. The system assigns a degree
of long term forest fire risk (LTDFFR) to each area by using Matlab's fuzzy toolbox
and its integrated triangular membership function, shown in formula (1) below.
μ_s(X) = { 0,                 if X < a
           (X − a)/(c − a),   if X ∈ [a, c]
           (b − X)/(b − c),   if X ∈ [c, b]
           0,                 if X > b }   (1)
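A minimal sketch of formula (1), assuming a < c < b, may make the fuzzification step concrete:

def triangular_mf(x, a, c, b):
    # Triangular membership function with support [a, b] and peak at c (formula (1)).
    if x < a or x > b:
        return 0.0
    if x <= c:
        return (x - a) / (c - a)
    return (b - x) / (b - c)

# e.g. triangular_mf(27.0, 15.0, 30.0, 45.0) -> 0.8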
The first basic operation of the system is the determination of the main n risk fac-
tors (RF) affecting the specific risk problem. Three vectors of fuzzy sets (FS) can be
formed for each RF:
1. S̃_1i = {(μ_j(A_j), X_i) (forest departments A_j of low risk) / j = 1…N, i = 1…M}
2. S̃_2i = {(κ_j(A_j), X_i) (forest departments A_j of medium risk) / j = 1…N, i = 1…M}
3. S̃_3i = {(λ_j(A_j), X_i) (forest departments A_j of high risk) / j = 1…N, i = 1…M}
where N is the number of forest departments under evaluation and M is the number of
involved parameters. It should also be specified that Xi is a crisp value corresponding
to a specific risk factor.
The long term forest fire risk factors are distinguished into two basic categories:
human factors and natural ones (Kailidis, 1990). Each one of these general risk types
consists of several sub-factors that influence in their own way the overall risk degree
(ORD). The factors under consideration are presented in Table 1.
Some of the above parameters are structural (they do not change over time) and
some are dynamic. The singleton values (minimum and maximum) that are needed in
order to apply the fuzzification functions are calculated from the data inventories
obtained by the Greek secretariat of forests and other public services. There might
also be other factors with a potential influence on the annual number of fire incidents,
but in this pilot attempt, the system has to be simple enough to provide clear results so
that it becomes easier in the future to analyze all the data available. The FFR-FIS
focuses on providing a LTDFFR index for the Greek forest departments related to their
annual number of forest fires.
A critical step is the design and construction of the main rule set that will provide
the reasoning of the system. Factors a, b, c, d and e in the above Table 1 are the “natural
factors” whereas the remaining two are the “human-related” ones. More specifically, the
average annual temperature, humidity and wind speed belong to the “meteorological
parameters” data group whereas the average height and the percentage of forest cover
belong to the “landscape parameters” data group. This categorization and grouping of
risk factors, enables the construction of a much smaller rule set for the estimation of
the final ORD.
For example, in order to combine the Population Density (Pop) and the Tourism
(Tour) factors in one subgroup named “measurable human factors” (MHF) the fol-
lowing 9 rules need to be applied:
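Since the rule table itself is not reproduced here, the sketch below only illustrates the shape of such a 3 x 3 rule block; the consequents chosen are assumptions for illustration, not the rules actually used by the FFR-FIS.

MHF_RULES = {
    ("L", "L"): "L", ("L", "M"): "L", ("L", "H"): "M",
    ("M", "L"): "L", ("M", "M"): "M", ("M", "H"): "H",
    ("H", "L"): "M", ("H", "M"): "H", ("H", "H"): "H",
}

def mhf_degree(pop_level, tour_level):
    # Combine the Population density and Tourism linguistic levels into the
    # "measurable human factors" (MHF) level via the rule table above.
    return MHF_RULES[(pop_level, tour_level)]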
This makes it easier to combine all the factors with the use of proper fuzzy linguis-
tics in order to produce one final degree of risk. The factors are combined in the way
described in the following figure 2. The number of rules becomes even smaller when
proper decision tables are used (Table 2).
Temperature L L L L L L L L L M M M M M M M M M H H H H H H H H H
Wind L L L M M M H H H L L L M M M H H H L L L M M M H H H
Humidity L M H L M H L M H L M H L M H L M H L M H L M H L M H
Low Danger X X X X X X X
Medium Danger X X X X X
X X X X X
High Danger X X X X X X X X X X
Final Table Group 1 Group 2 Group 3
Temperature L L L L L L L M M M M M M M H H H H H Legend
Wind L M M M H H H L M M M H H H L L L M H L = Low
Humidity - L M H L M H - L M H L M H L M H - - M = Medium
H = High
Low Danger X X X X X
Medium Danger X X X X X X X X
High Danger X X X X X X
The ORD is produced by applying various types of fuzzy relations, in order to per-
form fuzzy conjunction operations between the fuzzy sets (and consequently between
partial risk indices due to separate risk parameters). The functions for the conjunction
are called T-norms and for the disjunction T-conorms or S-norms (Kandel A., 1992).
This project employs Matlab's integrated Mamdani inference method, which
operates in a forward chaining mode (Cox, 2005). The Mamdani inference system
comprises five parts:
f^{-1}(COG(Ã)) = ∫_χ μ_Ã(χ) χ dχ / ∫_χ μ_Ã(χ) dχ   (5)

r = Σ z_x z_y / N   (6)
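A minimal sketch of formula (6), computing Pearson's r as the mean product of z-scores (population standard deviations assumed):

import numpy as np

def pearson_r(x, y):
    # Formula (6): r = sum(z_x * z_y) / N, with z-scores of the two samples.
    zx = (x - x.mean()) / x.std()
    zy = (y - y.mean()) / y.std()
    return float(np.sum(zx * zy) / len(x))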
4 Sensitivity Analysis
A more detailed study of the results show which factors are necessary and which are
not. In this first approach, six of the factors used as inputs, have not proven to be sig-
nificantly correlated to the number of forest fires according to Pearson’s method as seen
in table 4 below. These factors have very weak correlations to the annual number of
fires that vary from -0.074 to just 0.171, and their significance scores are all high above
the 0.05 level of confidence (Urdan 2005). The only significantly correlated factor is the
population, with a moderate relationship to the annual number of fires. Population has
an actual significant correlation due to its p value of 0.004 (Urdan 2005).
Despite the fact that most of the factors separately are not correlated to the final
number of fires, when all of them are combined with the proper rule set, they provide
an ORD that has statistically a moderate correlation degree of 0.347 which is statisti-
cally significant with a p value of 0.007. Table 3 below presents the results of Pear-
son’s correlation coefficients of the seven factors considered by the FFR-FIS
Pearson correlation with the number of fires: 0.171, 0.060, -0.072, -0.074, 0.129
Sig. (1-tailed): 0.118, 0.340, 0.310, 0.304, 0.186
Pearson correlation with the number of fires: 0.373(**), 0.108, 0.311(*), 0.010, 0.347(**)
Sig. (1-tailed): 0.004, 0.227, 0.014, 0.472, 0.007
The terms “human-caused risk” and “natural risk” refer to unified risk indices that
derive from human and natural related factors respectively. It is clear that the human
degree of risk with a correlation degree of 0.311 and a p value of 0.014 is more sig-
nificantly correlated to the number of fires than the natural degree of risk which has a
0.01 value for correlation degree and its p value is 0.472. This result is another indica-
tion of the major importance of the human factor in the determination of the number
of forest fire instances.
The next step of this analysis is to find out which of the above factors are abso-
lutely necessary in order to produce a good overall degree of risk and which of them
can be ignored. The basic hypothesis made in this step is to assume that data related to
a factor is not known and cannot be inferred. Starting from the first factor, its data is
deleted and the average value of this factor for Greece is assigned to each forest de-
partment instead. The overall degree of risk is then recalculated and the correlations
matrix is reproduced, in order to compare the new overall degree with the initial one.
The same procedure is repeated iteratively for every factor that the system utilized.
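The following sketch outlines this leave-one-factor-out procedure. The functions compute_ord (the FFR-FIS evaluation) and pearson_r (formula (6)) are assumed to exist, and the variable names are illustrative.

import numpy as np

def sensitivity_analysis(factors, fires, compute_ord, pearson_r):
    # factors: dict mapping factor name -> array of crisp values per forest department.
    results = {"All factors known": pearson_r(compute_ord(factors), fires)}
    for name in factors:
        reduced = dict(factors)
        # pretend the factor is unknown: replace it by its national mean value
        reduced[name] = np.full_like(factors[name], factors[name].mean())
        results[name + " unknown"] = pearson_r(compute_ord(reduced), fires)
    return results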
The results are shown in the table 4 below.
Table 4. Results with Pearson correlation coefficients, between the Overall degree of risk and
the number of fire incidents, when each one of the factors is not known
1. All factors known: r = 0.347(**), p = 0.007
2. Temperature unknown: r = 0.0926, p = 0.2613
3. Humidity unknown: r = 0.299(*), p = 0.0175
4. Altitude unknown: r = 0.249(*), p = 0.0409
5. Forest cover density unknown: r = 0.304(*), p = 0.0159
6. Wind speed unknown: r = 0.2305, p = 0.0537
7. Population unknown: r = -0.1072, p = 0.2293
8. Tourism development index unknown: r = 0.349(**), p = 0.0064
These results show that the system provides an overall degree of risk which is in
fact correlated to the number of fire instances, when all the seven factors are known,
as explained above. Further analysis shows that even if there is no data about humid-
ity, altitude, forest cover or tourism development index, the correlation degrees still
are all above 0.249 and all have p values lower than 0.0409. This shows that the
results still remain significantly associated to the number of forest fire instances. On
the other hand, it is not possible to provide an overall degree of risk when the
temperature, the wind and the population of a forest department are not known. As
shown in Table 4 above, when the wind factor is missing, the correlation degree of the
ORD with the annual number of forest fires is 0.23 and p is above 0.053. This is an
“almost” acceptable value, whereas if the temperature or the population density indices
are not known, the correlation degrees drop below 0.107 in absolute value and p rises
above 0.22. These values are not acceptable due to the low degree of correlation,
which means that there is no relation between the results and reality, and the fact that
p is well above the 0.05 threshold, which makes the results statistically insignificant.
Of course one can argue that there are several other factors that might have a potential
influence in the forest fire degree of risk. According to (Kailidis 1990), there are two
different kinds of forest fire degrees of risk. More detailed analysis of forest fire risk
needs the distinction between these two different types of forest fire degrees of risk as
shown below:
• Forest fire degree of risk due to the total number of forest fire instances
• Forest fire degree of risk due to the total burned area
The factors that affect the first degree of risk have a different effect to the second
degree. For example, Tourism is a great risk factor when examining the number of
fire instances, but it does not influence the “acceleration risk degree” in a great way.
Meteorological risk factors are those that mostly affect the second degree of danger,
while human caused risk factors affect mostly the first risk degree. (Kailidis 1990)
Thus, in order to produce a risk index due to the extent of the damages of each forest
fire incident, different factors and a different rule set are required. Despite this fact,
the overall risk index provided by the FFR-FIS is significantly associated to the sec-
ond kind of risk degree. The following table 5 presents the correlation coefficients
between the ORD and the degree of risk due to the extent of the burned area.
As seen in table 5 when all the factors are known, the correlation is significant at
the 0.05 alpha level due to its low value of p=0.0467, whereas it has a moderate cor-
relation degree of 0.24. By applying the previous method, assuming that every time
one factor is not known, the results for temperature, humidity, altitude, forest cover
density, wind and population produce correlation degrees between 0.052 and 0.221.
Even though the value 0.221 (when the forest cover density is unknown) shows a
moderate correlation, it cannot be taken under consideration due to its p value which
is 0.061. The correlation degrees in every other case drop below 0.2 and p values are
all above 0.17. On the contrary, when tourism development index is treated as not
known, the correlation becomes greater (0.316) and the p value drops at 0.013. This
proves that in order to produce significant results, all the factors but “tourism” are
necessary. Thus, in order to produce a risk index that will try to evaluate the burned
area, a different approach is needed.
Table 5. Results with Pearson correlation coefficients, between the Overall degree of risk and
the extent of each fire incident, when each one of the factors is not known
1. All factors known: r = 0.240(*), p = 0.0467
2. Temperature unknown: r = 0.0719, p = 0.3100
3. Humidity unknown: r = 0.0908, p = 0.2653
4. Altitude unknown: r = -0.0522, p = 0.3594
5. Forest cover density unknown: r = 0.2212, p = 0.0613
6. Wind unknown: r = 0.1320, p = 0.1804
7. Population unknown: r = 0.1376, p = 0.1703
8. Tourism unknown: r = 0.316(*), p = 0.0127
The factors have to be studied carefully, in order to use the most appropriate of all
and the rule set needs to be rebuilt so that the correlations become stronger and have a
greater significance. More detailed analysis reveals that in order to produce better
results, the method should be focused on “natural factors”. As seen in the following
table 6, the extent of forest fires is significantly correlated to the mean forest cover
density (correlation degree=-0.307 and p=0.015) and to the population density (corre-
lation degree=0.401 and p=0.02) of each forest department.
As shown in Table 6 above, the overall “natural risk” is even more highly correlated
to the extent of each forest fire (correlation degree=0.430 and p=0.001). On the other
hand, the human caused unified degree of risk has almost no relation to the extent of
fire incidents (correlation degree=0.007 and p=0.482). This means either that a totally
different data index needs to be used in order to produce a reliable ORD or a different
rule set has to be applied. Different human data is necessary and more natural factors
have to be added, in order to use this system effectively towards ranking the forest
departments according to their forest fire degree of risk due to the burned area.
This study concerns Greece, which is divided into smaller parts called forest departments.
Data from 1983 to 2000 has been gathered. The first ten years of data have been
used to produce the decade’s mean values. These mean values have been cross-
checked with the eleventh year’s actual ranking of forest departments due to their
forest fire frequencies. The next step was to estimate the mean values for a period of
eleven years and to compare the results to the actual situation of the twelfth year and
so on. The compatibility of the ORD output of the FFR-FIS to the actual annual forest
fire situation varied from 52% to 70% as shown in table 7 below (based on the fre-
quencies of forest fire incidents).
Table 7. Compatibility analysis according to the following year's total number of forest fire
incidents
Forest fire degree of risk due to the total number of fire instances: Temperature, Wind, Population
Forest fire degree of risk due to the burned area: Temperature, Humidity, Altitude, Forest cover density, Wind, Population
References
1. Kandel, A.: Fuzzy Expert Systems. CRC Press, USA (1992)
2. Leondes, C.T.: Fuzzy Logic and Expert Systems Applications. Academic Press, California
(1998)
3. Christopher, F.H., Patil Sumeet, R.: Identification and review of sensitivity analysis meth-
ods. Blackwell, Malden (2002)
4. Cox, E.: Fuzzy modeling and genetic algorithms for data mining and exploration. Aca-
demic Press, London (2005)
5. Johnson, E.A., Miyanishi, K.: Forest Fire: Behavior and Ecological Effects (2001)
6. Nurcahyo, G.W., Shamsuddin, S.M., Alias, R.A., Sap, M.N.M.: Selection of Defuzzifica-
tion Method to Obtain Crisp Value for Representing Uncertain Data in a Modified Sweep
Algorithm. JCS&T 3(2) (2003)
7. Iliadis, L., Spartalis, S., Maris, F., Marinos, D.: A Decision Support System Unifying
Trapezoidal Function Membership Values using T-Norms. In: ICNAAM (International
Conference in Numerical Analysis and Applied Mathematics). J. Wiley-VCH Verlag
GmbH Publishing co., Weinheim (2004)
8. Kahlert, J., Frank, H.: Fuzzy-Logik und Fuzzy-Control (1994)
9. Kailidis, D.: Forest Fires (1990)
10. Mamdani, E.H., Assilian, S.: An experiment in linguistic synthesis with a fuzzy logic con-
troller. Man-Machine Studies, 1–13 (1975)
11. Nguyen, H., Walker, E.: A First Course in Fuzzy Logic. Chapman and Hall, Boca Raton
(2000)
12. Tsataltzinos, T.: A fuzzy decision support system evaluating qualitative attributes towards
forest fire risk estimation. In: 10th International Conference on Engineering Applications
of Neural Networks, Thessaloniki, August 29-31 (2007)
13. Urdan, T.C.: Statistics in plain English. Routledge, New York (2005)
14. Kecman, V.: Learning and Soft Computing. MIT Press, London (2001)
15. Zhang, J.X., Huang, C.F.: Cartographic Representation of the Uncertainty related to natu-
ral disaster risk: overview and state of the art. Artificial Intelligence, 213–220 (2005)
Nonmonotone Learning of Recurrent Neural Networks in
Symbolic Sequence Processing Applications
1 Introduction
Sequence processing involves several tasks such as clustering, classification, predic-
tion, and transduction of sequential data which can be symbolic, non-symbolic or
mixed. Examples of symbolic data patterns occur in modelling natural (human) lan-
guage, while the prediction of water level of River Thames is an example of process-
ing non-symbolic data. On the other hand, if the content of a sequence varies across
time steps, the sequence is called temporal or time-series.
In general, a temporal sequence consists of nominal symbols from a particular alpha-
bet, while a time-series sequence deals with continuous, real-valued elements [1]. Proc-
essing both these sequences mainly consists of applying the current known patterns to
produce or predict the future ones, while a major difficulty is that the range of data
dependencies is usually unknown. Therefore, an intelligent system or approach with
memorising and learning capabilities for previous information is crucial for effective
and efficient sequence processing and modelling. In this work, the main focus is the
problem of temporal sequence processing using Recurrent Neural Networks (RNNs).
Training an RNN can be considered in the framework of unconstrained optimisa-
tion as the following minimisation problem
min E(w), w ∈ R^n,   (1)

w_{k+1} = w_k + α_k d_k.   (2)
Traditional optimisation strategies for RNNs are monotone ones, i.e. these strate-
gies compute a step length that reduces the error function value at each iteration, as:
E_{k+1} ≤ E_k,   (3)
which is the most straight-forward way to minimise the objective function. Unfortu-
nately, even when an algorithm is proved to be globally convergent, there is no guar-
antee that the method will efficiently explore the search space in the sense that it may
be trapped in a local minimum point early on and never jump out to a global one un-
der ill conditions [2], such as poorly initialised weights.
Inspired by the nonmonotone way learning occurs in cognitive development
[3], we propose here a nonmonotone formulation of the learning problem for RNNs
developed in a deterministic framework of analysis. In nonmonotone learning, the
error value of the k-th epoch is allowed to be larger than or equal to that of the
previous epoch. More precisely, the monotone constraint shown in Eq. (3) is relaxed
and replaced by a nonmonotone constraint, such as the one in Eq. (4).
From deterministic optimisation perspective, algorithms with nonmonotone behav-
iour have been proposed in an attempt to better explore a search space and in certain
cases accelerate the convergence rate [4]-[8]. They have been proved to have several
advantages, such as global and superlinear convergence, requiring fewer line
searches and function evaluations, and have demonstrated effectiveness for large-
scale unconstrained problems.
We adopt here a nonmonotone strategy, such as the one introduced in [4], which takes
the M previous error function values into consideration:

E(w_k + α_k d_k) ≤ max_{0 ≤ j ≤ M} {E(w_{k−j})} + δ α_k g_k^T d_k.   (4)
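A minimal sketch of the acceptance test in Eq. (4); E_history holds the most recent error values (newest last), and grad_dot_dir is g_k^T d_k, which is negative for a descent direction:

def accepts_nonmonotone(E_new, E_history, alpha, grad_dot_dir, M, delta=1e-4):
    # Accept the step if the new error does not exceed the maximum of the last
    # M+1 stored error values plus the sufficient-decrease term of Eq. (4).
    reference = max(E_history[-(M + 1):])
    return E_new <= reference + delta * alpha * grad_dot_dir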
It is worth mentioning that learning algorithms with momentum terms, such as the
well known momentum backpropagation, do not belong by default to the class of
nonmonotone algorithms discussed here. Although momentum backpropagation may
occasionally exhibit nonmonotone behaviour, it does not formally apply a non-
monotone strategy, such as the one derived in Eq. (4).
The rest of this paper is organised as follows. Section 2 provides a brief over-
view of four recently proposed nonmonotone training algorithms for RNNs, while
Section 3 presents simulation results using symbolic sequences. Section 4 concludes
the paper.
Loop of ANM-JRprop
Algorithm: A2NM-CG
STEP 0. Initialise w_0, k = 0, M_0 = 0, M^max (an upper bound for the learning
        horizon M_k), l_0 = 0, a_0, σ, δ ∈ (0, 1) and d_0 = −g_0;
STEP 1. If g_k = 0, then stop;
STEP 2. If k ≥ 1, calculate a local Lipschitz estimate as
        Λ_k = ‖g_k − g_{k−1}‖ / ‖w_k − w_{k−1}‖, and adapt M_k:
        M_k = M_{k−1} + 1, if Λ_k < Λ_{k−1} < Λ_{k−2};
        M_k = M_{k−1} − 1, if Λ_k > Λ_{k−1} > Λ_{k−2};
        M_k = M_{k−1}, otherwise;
        where M_k = min{M_k, M^max};
        where l_k = l_k + 1;
STEP 4. Generate a new point by w_{k+1} = w_k + α_k d_k;
STEP 5. Update the search direction d_k = −g_k + β_{k−1} d_{k−1}, where β_k = β_k^PR
        or β_k^FR is greater than zero and can be calculated using the
        Polak-Ribière or Fletcher-Reeves rule respectively;
STEP 6. Let k = k + 1, go to STEP 1.
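A sketch of STEP 2 under the stated rules; the local Lipschitz estimate and the horizon update follow the algorithm above, while the lower bound of zero on M_k is an added guard:

import numpy as np

def local_lipschitz(g, g_prev, w, w_prev):
    # Lambda_k = ||g_k - g_{k-1}|| / ||w_k - w_{k-1}||
    return np.linalg.norm(g - g_prev) / np.linalg.norm(w - w_prev)

def adapt_horizon(M_prev, lam, lam_1, lam_2, M_max):
    # Lengthen the horizon when the error surface gets smoother, shorten it
    # when it gets rougher, and keep it within [0, M_max].
    if lam < lam_1 < lam_2:
        M = M_prev + 1
    elif lam > lam_1 > lam_2:
        M = M_prev - 1
    else:
        M = M_prev
    return max(min(M, M_max), 0)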
Fig. 1. Convergence behaviours of P5 (left) and P10 (right): NARX networks, trained with the
A2NM-CG algorithm
training method are progressing. This is achieved by using the objective function
value and its gradient to estimate the Hessian matrix B_k. The main step of our
nonmonotone BFGS approach, which incorporates a self-scaling factor, is in Table 3;
further details can be found in [16]. Figures 2a-2b provide examples of behaviour
from training NARX networks using the parity-5 and parity-10 sequences respectively.
Algorithm: ASCNM-BFGS
If k ≥ 1,
  Calculate a local approximation of the Lipschitz constant and adapt M_k as in A2NM-CG;
  Check that stepsize α_k satisfies the nonmonotone condition;
  otherwise, find a stepsize that satisfies it;
  Generate a new weight vector w_{k+1} = w_k + α_k d_k;
  Update the search direction d_k = −B_k^{−1} g_k, using
    B_{k+1} = ρ_k [ B_k − (B_k s_k s_k^T B_k^T) / (s_k^T B_k s_k) ] + (y_k y_k^T) / (y_k^T s_k),
  where s_k = w_{k+1} − w_k, y_k = g_{k+1} − g_k and ρ_k = (y_k^T s_k) / (s_k^T B_k s_k);
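A sketch of this self-scaling update; B is the current Hessian approximation and the vectors are the weight and gradient differences defined above:

import numpy as np

def self_scaling_bfgs_update(B, w_new, w_old, g_new, g_old):
    s = (w_new - w_old).reshape(-1, 1)
    y = (g_new - g_old).reshape(-1, 1)
    Bs = B @ s
    sBs = float(s.T @ Bs)
    ys = float(y.T @ s)
    rho = ys / sBs                                  # self-scaling factor rho_k
    return rho * (B - (Bs @ Bs.T) / sBs) + (y @ y.T) / ys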
to be more effective when compared with mainstream methods used in the literature
for the tested applications [22]-[23]. For example, the best result reported in the litera-
ture for the SL application is an MSE of 25% in training and 22% in testing using
LRN with 10 hidden nodes. In the RA application, RNNs with 100 hidden nodes
which should be trained for 1900 epochs are needed to achieve results similar to ours.
Fig. 2. (a) Convergence behaviours of P5: NARX networks, trained with the ASCNM-BFGS.
(b) Convergence behaviours of P10: NARX networks, trained with the ASCNM-BFGS.
Algorithm: ANM-LMAM
STEP 0. Initialize τ > 1, w_0, D_0, M^max, Φ, δ ∈ (0, 1), Θ ∈ (0, 1), c, and k = 0;
STEP 1. If g_k = J_k e_k ≠ 0, calculate H_k = J_k^T J_k; otherwise, stop;
STEP 2. Compute β_k by β_k = −(ψ_1 / (2ψ_2)) (H_k + α_k D_k)^{−1} g_k + (1 / (2ψ_2)) β_{k−1}, where
        ψ_1 = (−2ψ_2 ε + g_k^T β_{k−1}) / (g_k^T H_k^{−1} g_k),
        ψ_2 = (1/2) [ (β_{k−1}^T H_k^{−1} β_{k−1} ⋅ g_k^T H_k^{−1} g_k − (g_k^T β_{k−1})²) / (g_k^T H_k^{−1} g_k (Θ² − ε²)) ]^{1/2}, and
        ε = −cΘ (g_k^T H_k^{−1} g_k)^{1/2};
Table 5. Total number of free parameter for the architectures used in the simulations
MSE (%), Train / Test, per number of hidden nodes (#hid):
JRprop:      #hid 1: 35.077 / 36.660;  #hid 2: 26.814 / 28.495;  #hid 5: 18.274 / 21.897;  #hid 10: 14.030 / 17.839
ANM-JRprop:  #hid 1: 33.932 / 35.493;  #hid 2: 25.431 / 27.345;  #hid 5: 16.809 / 20.733;  #hid 10: 12.027 / 16.160
MSE (%), Train / Test, per number of hidden nodes (#hid):
JRprop:      #hid 5: 12.608 / 24.215;  #hid 10: 13.843 / 22.800
ANM-JRprop:  #hid 5: 3.750 / 17.663;   #hid 10: 3.009 / 18.055
Table 10. Results for conjugate gradient-trained NARX networks in the SL problem
MSE (%), Training / Testing, per number of hidden nodes (#hid):
CG:       #hid 2: 16.534 / 16.761;  #hid 5: 16.921 / 17.025;  #hid 10: 18.321 / 18.146
A2NM-CG:  #hid 2: 21.951 / 22.689;  #hid 5: 13.449 / 14.206;  #hid 10: 10.859 / 11.819
Table 11. Results for conjugate gradient-trained NARX networks in the RA problem
MSE (%), Training / Testing, per number of hidden nodes (#hid):
CG:       #hid 5: 8.294 / 7.849;  #hid 10: 4.796 / 5.824
A2NM-CG:  #hid 5: 8.942 / 4.610;  #hid 10: 7.879 / 2.182
Table 12. Average performance for BFGS-trained NARX networks in the SC problem
Table 13. Average performance for BFGS-trained NARX networks in the SL problem
MSE (%), Training / Testing, per number of hidden nodes (#hid):
BFGS:        #hid 2: 20.196 / 20.824;  #hid 5: 11.029 / 12.119;  #hid 10: 8.991 / 9.846
ASCNM-BFGS:  #hid 2: 19.972 / 20.792;  #hid 5: 9.740 / 10.360;   #hid 10: 7.584 / 8.313
Table 14. Average performance for BFGS-trained NARX networks in the RA problem
MSE (%), Training / Testing, per number of hidden nodes (#hid):
BFGS:        #hid 5: 8.5899 / 16.6255;  #hid 10: 6.2481 / 15.4072
ASCNM-BFGS:  #hid 5: 7.0804 / 16.1067;  #hid 10: 5.1843 / 14.5477
Table 15. Average improvement of ANM-LMAM trained RNNs compared with LMAM
trained ones in the SC problem
Table 16. Average improvement of ANM-LMAM trained RNNs compared with LMAM
trained ones in the SL problem
MSE (%), Train / Test, per RNN architecture:
FFTD: 1.306 / 1.286;  LRN: 3.594 / 3.549;  NARX: 1.172 / 1.201
4 Conclusions
Sequence processing applications involve several tasks, such as clustering, classifica-
tion, prediction and transduction. One of the major challenges is that the data depend-
encies are usually unknown, and in order to verify them a trial-and-error approach is
usually adopted, i.e. changing the number of delays and/or the number of
hidden nodes in an RNN. Effective training algorithms can facilitate this task, espe-
cially when training high-dimensional networks.
In this paper, we provided an overview of approaches that employ nonmonotone
learning for recurrent neural networks. These consist of nonmonotone first-order
(JRprop and conjugate gradient) and second-order (BFGS and Levenberg-Marquardt)
algorithms, which were tested in symbolic sequence processing applications. One of
the features of our algorithms is that they incorporate an adaptive schedule for deter-
mining the length of the nonmonotone learning horizon and the stepsize. As a result,
the influence of application-dependent settings can be reduced to some extent. Our
simulation results, briefly reported here, show that introducing the nonmonotone
strategy could generally improve the performances of training algorithms in terms of
smaller training and classification errors, namely MSE and CE. Nonmonotone meth-
ods appear to outperform previously reported results in the sequence processing ap-
plications tested and are able to train RNNs of various architectures effectively using
a smaller number of hidden nodes than the original methods.
References
1. Antunes, C.M., Oliveira, A.L.: Temporal data mining: an overview. In: Proc. KDD Work-
shop on Temporal Data Mining, San Francisco, CA, August 26, 2001, pp. 1–13 (2001)
2. Gill, P.E., Murray, W., Wright, M.H.: Practical Optimization. Academic Press, London
(1981)
3. Elman, J.L., Bates, E.A., Johnson, M.H., Karmiloff-Smith, A., Parisi, D., Plunkett, K.: The
shape of change. In: Rethinking Innateness: A Connectionist Perspective on Development,
ch. 6. MIT Press, Cambridge (1997)
4. Grippo, L., Lampariello, F., Lucidi, S.: A nonmonotone line search technique for Newton’s
method. SIAM J. Numerical Analysis 23, 707–716 (1986)
5. Grippo, L., Lampariello, F., Lucidi, S.: A quasi-discrete Newton algorithm with a non-
monotone stabilization technique. J. Optimization Theory and Applications 64(3), 495–510
(1990)
6. Grippo, L., Lampariello, F., Lucidi, S.: A class of nonmonotone stabilization methods in
unconstrained optimization. Numerische Mathematik 59, 779–805 (1991)
7. Grippo, L., Sciandrone, M.: Nonmonotone globalization techniques for the Barzilai-
Borwein gradient method. Computational Optimization and Applications 23, 143–169
(2002)
8. Fasano, G., Lampariello, F., Sciandrone, M.: A truncated nonmonotone Gauss-Newton
method for large-scale nonlinear least-squares problems. Computational Optimization and
Applications 34, 343–358 (2006)
9. Plagianakos, V.P., Magoulas, G.D., Vrahatis, M.N.: Deterministic nonmonotone strategies
for effective training of multi-layer perceptrons. IEEE Trans. Neural Networks 13(6),
1268–1284 (2002)
10. Medsker, L.R., Jain, L.C.: Recurrent neural networks: design and applications. CRC Press,
Boca Raton (2000)
11. Nelles, O.: Nonlinear System Identification. Springer, Berlin (2000)
12. Riedmiller, M., Braun, H.: Rprop – a fast adaptive learning algorithm. In: Proc. Int’l Sym-
posium on Computer and Information Sciences, Antalya, Turkey, pp. 279–285 (1992)
13. Igel, C., Hüsken, M.: Empirical evaluation of the improved Rprop learning algorithms.
Neurocomputing 50, 105–123 (2003)
14. Anastasiadis, A., Magoulas, G.D., Vrahatis, M.N.: Sign-based Learning Schemes for Pat-
tern Classification. Pattern Recognition Letters 26, 1926–1936 (2005)
15. Peng, C.-C., Magoulas, G.D.: Advanced Adaptive Nonmonotone Conjugate Gradient
Training Algorithm for Recurrent Neural Networks. Int’l J. Artificial Intelligence Tools
(IJAIT) 17(5), 963–984 (2008)
16. Peng, C.-C., Magoulas, G.D.: Adaptive Self-scaling Non-monotone BFGS Training Algo-
rithm for Recurrent Neural Networks. In: de Sá, J.M., Alexandre, L.A., Duch, W., Mandic,
D.P. (eds.) ICANN 2007. LNCS, vol. 4668, pp. 259–268. Springer, Heidelberg (2007)
17. Levenberg, K.: A method for the solution of certain problems in least squares. Quart. Ap-
plied Mathematics 5, 164–168 (1944)
18. Marquardt, D.: An algorithm for least squares estimation of nonlinear parameters. J. Soci-
ety for Industrial and Applied Mathematics 11(2), 431–441 (1963)
19. Hagan, M.T., Menhaj, M.B.: Training feedforward networks with the Marquardt algo-
rithm. IEEE Trans. Neural Networks 5, 989–993 (1994)
20. Ampazis, N., Perantonis, S.J.: Two highly efficient second-order algorithms for training
feedforward networks. IEEE Trans. Neural Networks 13, 1064–1074 (2002)
21. Magoulas, G.D., Chen, S.Y., Dimakopoulos, D.: A personalised interface for web directo-
ries based on cognitive styles. In: Stary, C., Stephanidis, C. (eds.) UI4ALL 2004. LNCS,
vol. 3196, pp. 159–166. Springer, Heidelberg (2004)
22. McLeod, P., Plunkett, K., Rolls, E.T.: Introduction to connectionist modelling of cognitive
processes, pp. 148–151. Oxford University Press, Oxford (1998)
23. Plaut, D., McClelland, J., Seidenberg, M., Patterson, K.: Understanding normal and im-
paired reading: computational principles in quasi-regular domains. Psychological Re-
view 103, 56–115 (1996)
24. Waibel, A.: Modular construction of time-delay neural networks for speech recognition.
Neural Computation 1(1), 39–46 (1989)
25. Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., Lang, K.J.: Phoneme recognition using
time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Process-
ing 37, 328–339 (1989)
26. Elman, J.L.: Finding structure in time. Cognitive Science 14, 179–211 (1990)
27. Hagan, M.T., Demuth, H.B., Beale, M.H.: Neural Network Design. PWS Publishing, Bos-
ton (1996)
Indirect Adaptive Control Using Hopfield-Based
Dynamic Neural Network for SISO Nonlinear Systems
1 Introduction
Recently, static neural networks (SNNs) and dynamic neural networks (DNNs) are
widely applied to solve the control problems of nonlinear systems. Some static neural
networks, such as the feedforward fuzzy neural network (FNN) or the feedforward radial
basis function network (RBFN), are frequently used as a powerful tool for modeling
the ideal control input or nonlinear functions of systems [1]-[2]. However, the com-
plex structures of FNNs and RBFNs make the practical implementation of the control
schemes infeasible, and the numbers of the hidden neurons in the NNs’ hidden layers
(in general more than the dimension of the controlled system) are hard to be deter-
mined. Another well-known disadvantage is that SNNs are quite sensitive to the major
change which has never been learned in the training phase. On the other hand, DNNs
have the advantage that a smaller DNN can provide the functionality of a
much larger SNN [3]. In addition, SNNs are unable to represent dynamic system
mapping without the aid of tapped delay, which results in long computation time, high
sensitivity to external noise, and a large number of neurons when high dimensional
systems are considered [4]. This drawback severely affects the applicability of SNNs
to system identification, which is the central part in some control techniques for
nonlinear systems. On the other hand, owing to their dynamic memory, DNNs have
good performance on identification, state estimation, trajectory tracking, etc., even
with the unmodeled dynamics. In [5]-[7], researchers first identify the nonlinear sys-
tem according to the measured input and output, and then calculate the control law
based on the NN model. The output of the nonlinear system is forced by the control
law to track either a given trajectory or the output of a reference model. However,
there are still some drawbacks. In [5], although both identification and tracking errors
are bounded, the control performance shown in the simulations is not satisfactory. In
[6], two DNNs are utilized in the iterative learning control system to approximate the
nonlinear system and mimic the desired system output and thus increase the complex-
ity of the control scheme and computation loading. The work in [7] requires a prior
knowledge of the strong relative degree of the controlled nonlinear system. Besides,
an additional filter is needed to obtain higher derivatives of the system output. These
drawbacks restrict the applicability of the above works to practical implementation.
Hence, we try to fix the above drawbacks by an indirect adaptive control scheme
using Hopfield-based DNNs. The Hopfield model was first proposed by J.J. Hopfield in
1982 and 1984 [8]-[9]. Because a Hopfield circuit is quite easy to realize and has
the property of decreasing energy within a finite number of node-updating steps, it has
many applications in different fields.
trol scheme using Hopfield-based dynamic neural network (IACHDNN) for SISO
nonlinear systems is proposed. The Hopfield-based DNN can be viewed as a special
kind of DNNs. The control objective is to force the system output to track a given
reference signal. The uncertain parameters of the controlled plant are approximated by the
internal states of Hopfield-based DNNs, and a compensation controller is used to
dispel the effect of the approximation error and bounded external disturbance. The
synaptic weights of the Hopfield-based DNNs are on-line tuned by adaptive laws
derived in the Lyapunov sense. The control law and adaptive laws provide stability for
the closed-loop system with external disturbance. Furthermore, the tracking error can
be attenuated to a desired level provided that the parameters of the control law are
chosen adequately. The main contributions of this paper are summarized as follows.
1) The structure of the used Hopfield-based DNN contains only one neuron, which is
much less than the number of neurons in SNNs or other DNNs for nonlinear system
control. 2) The simple Hopfield circuit greatly improves the applicability of the con-
trol scheme to practical implementation. 3) No strong assumptions or prior knowledge of
the controlled plant are needed in the development of IACHDNN.
χ is the state of the DNN, W and Ψ are the weight matrices describing output layer
connections, V1 and V2 are the weight matrices describing the hidden layer connec-
tions, σ (⋅) is a sigmoid vector function responsible for nonlinear state feedbacks, and
γ (⋅) is a differentiable input function. A DNN in (1) satisfying
is the simplest DNN without any hidden layers and can be expressed as
dχ/dt = Aχ + BWσ(χ) + BΘu   (4)

where W_i^T = [w_i1 w_i2 … w_in] and Θ_i^T = [θ_i1 θ_i2 … θ_im] are the ith rows of W and Θ,
respectively. Solving the differential equation (5), we obtain

χ_i = b_i (W_i^T ξ_{W,i} + Θ_i^T ξ_{Θ,i}) + e^{−a_i t} χ_i(0) − e^{−a_i t} b_i [W_i^T ξ_{W,i}(0) + Θ_i^T ξ_{Θ,i}(0)],  i = 1, 2, …, n,   (6)

where

dξ_{W,i}/dt = −a_i ξ_{W,i} + σ(χ)   (7)

and

dξ_{Θ,i}/dt = −a_i ξ_{Θ,i} + u.   (8)

Replacing the unknown parameters by their estimates, the state estimate is

χ_i = b_i (Ŵ_i^T ξ_{W,i} + Θ̂_i^T ξ_{Θ,i}) + e^{−a_i t} χ_i(0) − e^{−a_i t} b_i [Ŵ_i^T ξ_{W,i}(0) + Θ̂_i^T ξ_{Θ,i}(0)],  i = 1, 2, …, n,   (9)
where Ŵ_i and Θ̂_i are the estimations of W_i and Θ_i, respectively. For a continuous vector
function Φ = [Φ_1 Φ_2 … Φ_n]^T ∈ R^n, we first define optimal vectors W_i^* and Θ_i^* as

(W_i^*, Θ_i^*) = arg min_{Ŵ_i ∈ Ω_{W_i}, Θ̂_i ∈ Ω_{Θ_i}} { sup_{χ ∈ D_χ, u ∈ D_U} | Φ_i − b_i (Ŵ_i^T ξ_{W,i} + Θ̂_i^T ξ_{Θ,i}) − e^{−a_i t} χ_i(0) + e^{−a_i t} b_i [Ŵ_i^T ξ_{W,i}(0) + Θ̂_i^T ξ_{Θ,i}(0)] | }   (10)

where Ω_{W_i} = {Ŵ_i : ‖Ŵ_i‖ ≤ M_{W_i}} and Ω_{Θ_i} = {Θ̂_i : ‖Θ̂_i‖ ≤ M_{Θ_i}} are constraint sets
for Ŵ_i and Θ̂_i. Then, Φ can be expressed as

Φ_i = b_i (W_i^{*T} ξ_{W,i} + Θ_i^{*T} ξ_{Θ,i}) + e^{−a_i t} χ_i(0) − e^{−a_i t} b_i [W_i^{*T} ξ_{W,i}(0) + Θ_i^{*T} ξ_{Θ,i}(0)] + Δ_i,  i = 1, 2, …, n   (11)

where Δ_i is the approximation error. Note that the optimal vectors W_i^* and Θ_i^* are difficult
to determine and might not be unique. The modeling error vector χ̃ = [χ̃_1 χ̃_2 … χ̃_n]^T can be
defined from (9) and (11) as

χ̃_i = Φ_i − χ_i = b_i (W̃_i^T ξ_{W,i} + Θ̃_i^T ξ_{Θ,i}) − e^{−a_i t} b_i [W̃_i^T ξ_{W,i}(0) + Θ̃_i^T ξ_{Θ,i}(0)] + Δ_i,  i = 1, 2, …, n   (12)

where W̃_i = W_i^* − Ŵ_i and Θ̃_i = Θ_i^* − Θ̂_i.
where κ_i is the slope of tanh(·) at the origin. It is known that the hyperbolic tangent
function is bounded: −1 < tanh(·) < 1.
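A minimal numerical sketch of the single-neuron estimator of Eqs. (7)-(9), with a_i = 1/(R C) and b_i = 1/C as in the circuit of Fig. 1, using forward-Euler integration of the regressor filters; the function and variable names are illustrative, not the authors' code:

import numpy as np

def hopfield_estimate(sigma_chi, u, W_hat, Theta_hat, R, C, f0, dt):
    # sigma_chi: (T,) nonlinear feedback signal; u: (T, m) input trajectory.
    a, b = 1.0 / (R * C), 1.0 / C
    xi_W, xi_Th = 0.0, np.zeros(u.shape[1])
    xi_W0, xi_Th0 = xi_W, xi_Th.copy()              # regressor initial values
    out = np.zeros(len(u))
    for k in range(len(u)):
        decay = np.exp(-a * k * dt)
        out[k] = (b * (W_hat * xi_W + Theta_hat @ xi_Th)
                  + decay * f0
                  - decay * b * (W_hat * xi_W0 + Theta_hat @ xi_Th0))
        xi_W += dt * (-a * xi_W + sigma_chi[k])     # Eq. (7)
        xi_Th += dt * (-a * xi_Th + u[k])           # Eq. (8)
    return out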
3 Problem Formulation
with e = y r − x = y r − y .
If f(x) and g are known and the system is free of external disturbance, the ideal
controller can be designed as
u_ideal = (1/g(x)) [−f(x) + y_r^(n) + k_c^T e]   (16)

where k_c = [k_n k_{n−1} … k_1]^T. Applying (16) to (14), we have the following error dynamics

e^(n) + k_1 e^(n−1) + … + k_n e = 0.   (17)
Since f(x) and g may be unknown, the ideal feedback controller u_ideal in (16) cannot be implemented.
4 Design of IACHDNN
To solve this problem, a new indirect adaptive control scheme using Hopfield-based
dynamic neural network (IACHDNN) for SISO nonlinear systems is proposed. Two
Hopfield-based DNNs are used to estimate the uncertain continuous functions f and g,
respectively. The indirect adaptive controller u id takes the following form
u_id = (1/ĝ) [−f̂ + y_r^(n) + k_c^T e − u_c]   (18)

where

A_c = [  0        1        0     …     0
         ⋮        ⋱        ⋱     ⋱     0
         0        …        …     0     1
        −k_n   −k_{n−1}     …     …   −k_1 ] ∈ R^{n×n},   B_c = [0 0 … 0 1]^T ∈ R^n;
Fig. 1. Electric circuit of the Hopfield-based DNN containing only a single neuron
f̃ = f − f̂; g̃ = g − ĝ. According to the discussion in Sec. 2.2, the Hopfield-based DNNs used
to approximate f and g contain only a single neuron each and can be expressed as

f̂ = (1/C_f)(Ŵ_f ξ_{W_f} + Θ̂_f^T ξ_{Θ_f}) + e^{−t/(R_f C_f)} f̂(0) − e^{−t/(R_f C_f)} (1/C_f)[Ŵ_f ξ_{W_f}(0) + Θ̂_f^T ξ_{Θ_f}(0)]   (20)

and

ĝ = (1/C_g)(Ŵ_g ξ_{W_g} + Θ̂_g^T ξ_{Θ_g}) + e^{−t/(R_g C_g)} ĝ(0) − e^{−t/(R_g C_g)} (1/C_g)[Ŵ_g ξ_{W_g}(0) + Θ̂_g^T ξ_{Θ_g}(0)]   (21)

where f̂(0) and ĝ(0) are the initial values of f̂ and ĝ; the subscripts (and the secondary
subscripts) f and g indicate the variables corresponding to the estimations f̂ and ĝ in this
paper. Note that Ŵ_f, Ŵ_g, ξ_{W_f}, and ξ_{W_g} are scalars, and the input signals of the
Hopfield-based DNNs are u = [x dx/dt … x^(n−1)]^T. Fig. 1 shows the electric circuit of the
Hopfield-based DNN containing only a single neuron. Substituting (20) and (21) into (19) yields

de/dt = A_c e − B_c (f̄ + ḡ u_id) + B_c u_c − B_c ε   (22)
where

f̄ = (1/C_f) W̃_f [ξ_{W_f} − e^{−t/(R_f C_f)} ξ_{W_f}(0)] + (1/C_f) Θ̃_f^T [ξ_{Θ_f} − e^{−t/(R_f C_f)} ξ_{Θ_f}(0)],

ḡ = (1/C_g) W̃_g [ξ_{W_g} − e^{−t/(R_g C_g)} ξ_{W_g}(0)] + (1/C_g) Θ̃_g^T [ξ_{Θ_g} − e^{−t/(R_g C_g)} ξ_{Θ_g}(0)],

and ε = Δ_f + Δ_g u_id + d, where Δ_f and Δ_g are the approximation errors defined in the same
way as in (11). In order to derive the main theorem in this paper, the following assumption and
lemma are required.
Assumption: Assume that there exists a finite constant μ so that

∫_0^t ε² dτ ≤ μ,  0 ≤ t < ∞.   (23)
where Q = Q^T > 0, and ρ > 0 and δ > 0 satisfy 1/ρ² − 1/δ ≤ 0. Let Ŵ_f(0) ∈ Ω_{W_f},
Ŵ_g(0) ∈ Ω_{W_g}, Θ̂_f(0) ∈ Ω_{Θ_f}, and Θ̂_g(0) ∈ Ω_{Θ_g}, where Ŵ_f(0), Ŵ_g(0), Θ̂_f(0) and
Θ̂_g(0) are the initial values of Ŵ_f, Ŵ_g, Θ̂_f, and Θ̂_g, respectively. For simplifying the
mathematical expressions in the rest of the paper, here we define eight conditions as follows.
Condition A: |Ŵ_f| < M_{W_f} or ( |Ŵ_f| = M_{W_f} and e^T P B_c Ŵ_f [ξ_{W_f} − e^{−t/(R_f C_f)} ξ_{W_f}(0)] ≥ 0 ).

Condition B: |Ŵ_f| = M_{W_f} and e^T P B_c Ŵ_f [ξ_{W_f} − e^{−t/(R_f C_f)} ξ_{W_f}(0)] < 0.

Condition C: |Ŵ_g| < M_{W_g} or ( |Ŵ_g| = M_{W_g} and e^T P B_c Ŵ_g [ξ_{W_g} − e^{−t/(R_g C_g)} ξ_{W_g}(0)] ≥ 0 ).

Condition D: |Ŵ_g| = M_{W_g} and e^T P B_c Ŵ_g [ξ_{W_g} − e^{−t/(R_g C_g)} ξ_{W_g}(0)] < 0.

Condition E: ‖Θ̂_f‖ < M_{Θ_f} or ( ‖Θ̂_f‖ = M_{Θ_f} and e^T P B_c Θ̂_f^T [ξ_{Θ_f} − e^{−t/(R_f C_f)} ξ_{Θ_f}(0)] u_id ≥ 0 ).

Condition F: ‖Θ̂_f‖ = M_{Θ_f} and e^T P B_c Θ̂_f^T [ξ_{Θ_f} − e^{−t/(R_f C_f)} ξ_{Θ_f}(0)] u_id < 0.

Condition G: ‖Θ̂_g‖ < M_{Θ_g} or ( ‖Θ̂_g‖ = M_{Θ_g} and e^T P B_c Θ̂_g^T [ξ_{Θ_g} − e^{−t/(R_g C_g)} ξ_{Θ_g}(0)] u_id ≥ 0 ).

Condition H: ‖Θ̂_g‖ = M_{Θ_g} and e^T P B_c Θ̂_g^T [ξ_{Θ_g} − e^{−t/(R_g C_g)} ξ_{Θ_g}(0)] u_id < 0.
The adaptive laws for the weights are chosen as

dŴ_f/dt = −dW̃_f/dt =
  −(β_{W_f}/C_f) e^T P B_c [ξ_{W_f} − e^{−t/(R_f C_f)} ξ_{W_f}(0)]          if Condition A,
  Pr{ −(β_{W_f}/C_f) e^T P B_c [ξ_{W_f} − e^{−t/(R_f C_f)} ξ_{W_f}(0)] }    if Condition B,   (25)

dŴ_g/dt = −dW̃_g/dt =
  −(β_{W_g}/C_g) e^T P B_c [ξ_{W_g} − e^{−t/(R_g C_g)} ξ_{W_g}(0)]          if Condition C,
  Pr{ −(β_{W_g}/C_g) e^T P B_c [ξ_{W_g} − e^{−t/(R_g C_g)} ξ_{W_g}(0)] }    if Condition D,   (26)

dΘ̂_f/dt = −dΘ̃_f/dt =
  −(β_{Θ_f}/C_f) e^T P B_c [ξ_{Θ_f} − e^{−t/(R_f C_f)} ξ_{Θ_f}(0)] u_id          if Condition E,
  Pr{ −(β_{Θ_f}/C_f) e^T P B_c [ξ_{Θ_f} − e^{−t/(R_f C_f)} ξ_{Θ_f}(0)] u_id }    if Condition F,   (27)

dΘ̂_g/dt = −dΘ̃_g/dt =
  −(β_{Θ_g}/C_g) e^T P B_c [ξ_{Θ_g} − e^{−t/(R_g C_g)} ξ_{Θ_g}(0)] u_id          if Condition G,
  Pr{ −(β_{Θ_g}/C_g) e^T P B_c [ξ_{Θ_g} − e^{−t/(R_g C_g)} ξ_{Θ_g}(0)] u_id }    if Condition H,   (28)

where the projection operator Pr{·} is given by

Pr{ −(β_{W_f}/C_f) e^T P B_c [ξ_{W_f} − e^{−t/(R_f C_f)} ξ_{W_f}(0)] }
  = (β_{W_f}/C_f) e^T P B_c { −[ξ_{W_f} − e^{−t/(R_f C_f)} ξ_{W_f}(0)] + (Ŵ_f [ξ_{W_f} − e^{−t/(R_f C_f)} ξ_{W_f}(0)] / |Ŵ_f|²) Ŵ_f },

Pr{ −(β_{W_g}/C_g) e^T P B_c [ξ_{W_g} − e^{−t/(R_g C_g)} ξ_{W_g}(0)] }
  = (β_{W_g}/C_g) e^T P B_c { −[ξ_{W_g} − e^{−t/(R_g C_g)} ξ_{W_g}(0)] + (Ŵ_g [ξ_{W_g} − e^{−t/(R_g C_g)} ξ_{W_g}(0)] / |Ŵ_g|²) Ŵ_g },

Pr{ −(β_{Θ_f}/C_f) e^T P B_c [ξ_{Θ_f} − e^{−t/(R_f C_f)} ξ_{Θ_f}(0)] u_id }
  = (β_{Θ_f}/C_f) e^T P B_c { −[ξ_{Θ_f} − e^{−t/(R_f C_f)} ξ_{Θ_f}(0)] u_id + (Θ̂_f^T [ξ_{Θ_f} − e^{−t/(R_f C_f)} ξ_{Θ_f}(0)] u_id / ‖Θ̂_f‖²) Θ̂_f },

Pr{ −(β_{Θ_g}/C_g) e^T P B_c [ξ_{Θ_g} − e^{−t/(R_g C_g)} ξ_{Θ_g}(0)] u_id }
  = (β_{Θ_g}/C_g) e^T P B_c { −[ξ_{Θ_g} − e^{−t/(R_g C_g)} ξ_{Θ_g}(0)] u_id + (Θ̂_g^T [ξ_{Θ_g} − e^{−t/(R_g C_g)} ξ_{Θ_g}(0)] u_id / ‖Θ̂_g‖²) Θ̂_g }.
then Ŵ_f, Ŵ_g, Θ̂_f and Θ̂_g are bounded by |Ŵ_f| ≤ M_{W_f}, |Ŵ_g| ≤ M_{W_g}, ‖Θ̂_f‖ ≤ M_{Θ_f},
and ‖Θ̂_g‖ ≤ M_{Θ_g} for all t ≥ 0 [10]. Now we are prepared to state the main theorem of
this paper.
Theorem: Suppose the Assumption (23) holds. Consider the plant (14) with the control law (18).
The function estimations f̂ and ĝ are given by (20) and (21) with the adaptive laws (25)-(28).
The compensation controller u_c is given as

u_c = −(1/(2δ)) B_c^T P e.   (29)

Then the following properties hold.

i) The tracking performance satisfies

(1/2) ∫_0^t e^T Q e dτ ≤ (1/2) e^T(0) P e(0) + W̃_f²(0)/(2β_{W_f}) + W̃_g²(0)/(2β_{W_g})
    + Θ̃_f^T(0) Θ̃_f(0)/(2β_{Θ_f}) + Θ̃_g^T(0) Θ̃_g(0)/(2β_{Θ_g}) + (ρ²/2) ∫_0^t ε² dτ   (30)

for 0 ≤ t < ∞, where e(0), W̃_f(0), W̃_g(0) and Θ̃_f(0), Θ̃_g(0) are the initial values of e, W̃_f,
W̃_g, Θ̃_f, and Θ̃_g, respectively.

ii) The tracking error e can be expressed in terms of the lumped uncertainty as

‖e‖ ≤ sqrt( (2V(0) + ρ² μ) / λ_min(P) )   (31)

where V(0) is the initial value of a Lyapunov function candidate defined later and λ_min(P) is
the minimum eigenvalue of P.

Proof
i) Define a Lyapunov function candidate as

V = (1/2) e^T P e + W̃_f²/(2β_{W_f}) + W̃_g²/(2β_{W_g}) + Θ̃_f^T Θ̃_f/(2β_{Θ_f}) + Θ̃_g^T Θ̃_g/(2β_{Θ_g})   (32)
Differentiating (32) along (22) gives

dV/dt = (1/2) e^T (A_c^T P + P A_c) e − e^T P B_c (f̄ + ḡ u_id) + e^T P B_c u_c − e^T P B_c ε
        + (1/β_{W_f}) W̃_f dW̃_f/dt + (1/β_{W_g}) W̃_g dW̃_g/dt + (1/β_{Θ_f}) Θ̃_f^T dΘ̃_f/dt + (1/β_{Θ_g}) Θ̃_g^T dΘ̃_g/dt
      = (1/2) e^T (A_c^T P + P A_c) e + e^T P B_c u_c − e^T P B_c ε + V_{W_f} + V_{W_g} + V_{Θ_f} + V_{Θ_g}   (33)

where

V_{W_f} = W̃_f { −(1/C_f) e^T P B_c [ξ_{W_f} − e^{−t/(R_f C_f)} ξ_{W_f}(0)] + (1/β_{W_f}) dW̃_f/dt },
V_{W_g} = W̃_g { −(1/C_g) e^T P B_c [ξ_{W_g} − e^{−t/(R_g C_g)} ξ_{W_g}(0)] + (1/β_{W_g}) dW̃_g/dt },
V_{Θ_f} = Θ̃_f^T { −(1/C_f) e^T P B_c [ξ_{Θ_f} − e^{−t/(R_f C_f)} ξ_{Θ_f}(0)] u_id + (1/β_{Θ_f}) dΘ̃_f/dt },
V_{Θ_g} = Θ̃_g^T { −(1/C_g) e^T P B_c [ξ_{Θ_g} − e^{−t/(R_g C_g)} ξ_{Θ_g}(0)] u_id + (1/β_{Θ_g}) dΘ̃_g/dt }.

Substituting the compensation controller (29) and the Riccati-like equation (24) into (33) yields

dV/dt = (1/2) e^T (A_c^T P + P A_c − (1/δ) P B_c B_c^T P) e − e^T P B_c ε + V_{W_f} + V_{W_g} + V_{Θ_f} + V_{Θ_g}
      = (1/2) e^T (−Q − (1/ρ²) P B_c B_c^T P) e − e^T P B_c ε + V_{W_f} + V_{W_g} + V_{Θ_f} + V_{Θ_g}
      = −(1/2) e^T Q e − (1/2) [ (1/ρ) B_c^T P e + ρ ε ]² + (ρ²/2) ε² + V_{W_f} + V_{W_g} + V_{Θ_f} + V_{Θ_g}.   (34)
V_{W_f} = 0 if Condition A holds, and

V_{W_f} = −(1/C_f) e^T P B_c ( Ŵ_f [ξ_{W_f} − e^{−t/(R_f C_f)} ξ_{W_f}(0)] / |Ŵ_f|² ) W̃_f Ŵ_f if Condition B holds.   (35)

Under Condition B, |Ŵ_f| = M_{W_f} and the optimal value W_f^* lies inside
the constraint set Ω_{W_f}. Using this fact, we obtain W̃_f Ŵ_f = (1/2)(W_f^{*2} − Ŵ_f^2 − W̃_f^2) ≤ 0.
Thus, the second line of (35) can be rewritten as

V_{W_f} = −(1/(2C_f)) e^T P B_c ( Ŵ_f [ξ_{W_f} − e^{−t/(R_f C_f)} ξ_{W_f}(0)] / |Ŵ_f|² ) (W_f^{*2} − Ŵ_f^2 − W̃_f^2) ≤ 0.   (36)

Similar results can be obtained for V_{W_g}, V_{Θ_f}, and V_{Θ_g}. Therefore, from (34) we have

dV/dt ≤ −(1/2) e^T Q e + (1/2) ρ² ε².   (37)

Integrating both sides of the inequality (37) yields

V(t) − V(0) ≤ −(1/2) ∫_0^t e^T Q e dτ + (ρ²/2) ∫_0^t ε² dτ,   (38)

so that

(1/2) ∫_0^t e^T Q e dτ ≤ V(0) + (ρ²/2) ∫_0^t ε² dτ.   (39)

ii) From (37) and since ∫_0^t e^T Q e dτ ≥ 0, we have
Thus, we obtain
from (39)-(40). Therefore, from (42), we can easily obtain (31), which explicitly describes
the bound of the tracking error ‖e‖. If the initial value V(0) = 0, the tracking error ‖e‖ can
be made arbitrarily small by choosing an adequate ρ. Equation (31) is very crucial to
show that the proposed IACHDNN provides closed-loop stability rigorously in
the Lyapunov sense under the Assumption (23). Q.E.D.
The block diagram of IACHDNN is shown in Fig. 2.
Remark: Equation (31) describes the relations among e , ρ , and λ min (P) . To get
more insight of (31), we first choose ρ 2 = δ in (24) to simplify the analysis. Thus,
from (24), we can see that λ min (P) is fully affected by the choice of λ min (Q) in the
way that a larger λ min (Q) leads to a larger λ min (P) , and vice versa. Now, one can
easily observe from (31) that the norm of the tracking error can be attenuated to any de-
sired small level by choosing ρ and λ min (Q) as small as possible. However, this may
lead to a large control signal which is usually undesirable in practical systems.
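For illustration, a sketch of how the two control signals of (18) and (29) would be computed at each step, given the current estimates f̂ and ĝ, the solution P of (24), and the error vector e; all names are assumptions, not the authors' code:

import numpy as np

def control_signals(e, f_hat, g_hat, yr_n, k_c, P, B_c, delta):
    # Compensation controller (29) and indirect adaptive controller (18).
    u_c = -float(B_c @ P @ e) / (2.0 * delta)
    u_id = (-f_hat + yr_n + float(k_c @ e) - u_c) / g_hat
    return u_id, u_c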
Fig. 2. Block diagram of the proposed IACHDNN (plant, compensation controller, indirect adaptive controller and adaptive laws)
5 Simulation Results
Example: Consider an inverted pendulum system. Let x1 (rad) be the angle of the
pendulum with respect to the vertical line. The dynamic equations of the inverted
pendulum are [10]
dx_1/dt = x_2
where g v = 9.8 m/s 2 is the acceleration due to gravity; mc is the mass of the cart; m
is the mass of the pole; l is the half-length of the pole; u is the applied force (control)
and d is the external disturbance. The reference signal here is y r = (π / 30) sin t , and
d is a square wave with the amplitude ±0.05 and period 2π . Also we choose
mc = 1 , m = 0.5 , and l = 0.5 . The initial states are [x1 (0) x 2 (0)] = [0.2 0.2]T . The
T
δ = 0.1 for the compensation controller. The resistance and capacitance are chosen
We solve the Riccati-like equation (24) and obtain P = \begin{bmatrix} 22.5 & 7.5 \\ 7.5 & 7.5 \end{bmatrix}. The simulation results are shown in Fig. 3: the tracking responses of the states x_1 and x_2 are shown in Figs. 3(a) and 3(b), respectively, and the associated control input is shown in Fig. 3(c). From the simulation results, we can see that the proposed IACHDNN achieves favorable tracking performance in the presence of the external disturbance. This demonstrates the strong disturbance tolerance of the proposed system.
Fig. 3. Simulation results: (a) tracking response of state x_1, (b) tracking response of state x_2 (both against the reference y_r), and (c) the associated control input versus time (sec).
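For illustration only, the following minimal sketch integrates the inverted-pendulum dynamics quoted above under the stated parameters (reference signal, disturbance and initial state). The simple PD feedback is merely a placeholder with hypothetical gains; it is not the proposed IACHDNN controller.

import numpy as np

# Plant-only sketch: inverted-pendulum dynamics as in [10], with the parameters
# quoted in the text. The PD law below is a placeholder (hypothetical gains).
g_v, m_c, m, l = 9.8, 1.0, 0.5, 0.5   # gravity, cart mass, pole mass, half-length

def pendulum(x, u, d):
    """x = [x1, x2] = [angle, angular velocity]; u = control force; d = disturbance."""
    x1, x2 = x
    denom = l * (4.0 / 3.0 - m * np.cos(x1) ** 2 / (m_c + m))
    f = (g_v * np.sin(x1) - m * l * x2 ** 2 * np.cos(x1) * np.sin(x1) / (m_c + m)) / denom
    g = (np.cos(x1) / (m_c + m)) / denom
    return np.array([x2, f + g * u + d])

dt, T = 0.001, 20.0
x = np.array([0.2, 0.2])                      # initial state given in the text
for k in range(int(T / dt)):
    t = k * dt
    y_r = (np.pi / 30.0) * np.sin(t)          # reference signal
    d = 0.05 * np.sign(np.sin(t))             # square wave, amplitude 0.05, period 2*pi
    u = -50.0 * (x[0] - y_r) - 10.0 * x[1]    # placeholder PD law (hypothetical gains)
    x = x + dt * pendulum(x, u, d)            # forward-Euler integration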
6 Conclusions
References
1. Wang, C.H., Lin, T.C., Lee, T.T., Liu, H.L.: Adaptive hybrid intelligent control for uncertain nonlinear dynamical systems. IEEE Trans. Syst., Man, Cybern. B 32(5), 583–597 (2002)
2. Li, Y., Qiang, S., Zhang, X., Kaynak, O.: Robust and adaptive backstepping control for
nonlinear systems using RBF neural networks. IEEE Trans. Neural Networks 15, 693–701
(2004)
3. Lin, C.T., George Lee, C.S.: Neural fuzzy systems: a neuro-fuzzy synergism to intelligent
systems. Prentice-Hall, Englewood Cliffs (1996)
4. Yu, D.L., Chang, T.K.: Adaptation of diagonal recurrent neural network model. Neural
Comput. & Applic. 14, 189–197 (2005)
5. Poznyak, A.S., Yu, W., Sanchez, D.N., Perez, J.P.: Nonlinear adaptive trajectory tracking
using dynamic neural networks. IEEE Trans. Neural Networks 10, 1402–1411 (1999)
6. Chow, T.W.S., Li, X.D., Fang, Y.: A real-time learning control approach for nonlinear
continuous-time system using recurrent neural networks. IEEE Trans. Ind. Electronics. 47,
478–486 (2000)
7. Ren, X.M., Rad, A.B., Chan, P.T., Lo, W.L.: Identification and control of continuous-time
nonlinear systems via dynamic neural networks. IEEE Trans. Ind. Electronics 50, 478–486
(2003)
8. Hopfield, J.J.: Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, USA 79, 2554–2558 (1982)
9. Hopfield, J.J.: Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, USA 81, 3088–3092 (1984)
10. Wang, L.X.: Adaptive Fuzzy Systems and Control - Design and Stability Analysis. Pren-
tice-Hall, Englewood Cliffs (1994)
A Neural Network Computational Model of Visual
Selective Attention
1 Introduction
Due to the great number of sensory stimuli that a person experiences at any given
point of conscious life, it is practically impossible to integrate available information
into a single perceptual event. This implies that a selective mechanism must be present in the brain to focus its resources effectively on specific stimuli; otherwise we would be constantly distracted by irrelevant information. Attention can be guided via top-down or bottom-up processing, as cognition can be regarded as a balance between internal motivations and external stimulation. Volitional shifts of attention, or endogenous attention, result from "top-down" signals originating in the prefrontal cortex, while exogenous attention is guided by salient stimuli via "bottom-up" signals in the visual cortex [1]. In this paper we emphasize and try to simulate the
behaviour of selective attention, especially in top-down tasks, mostly based on the
theoretical background behind neural mechanisms of attention as it is explained in the
field of neuroscience.
The underlying mechanisms of the neuronal basis of attention are supported by two
main hypotheses. The first is known as “biased competition” [8] and it originated
from studies with single-cell recordings. These studies have shown that attention
enhances the firing rates of the neurons that represent the attended stimuli and sup-
presses the firing rates of the neurons encoding unattended stimuli. The second, more recent hypothesis places emphasis on the synchronization of neural activity during
the process of attention. The second hypothesis stems from experiments showing that
neurons selected by the attention mechanism have enhanced gamma-frequency syn-
chronization [14, 3]. More specifically, Fries et al. [3] measured activity in area V4 of the brain of macaque monkeys while they were attending to behaviorally relevant stim-
uli and observed increased gamma frequency synchronization of attended stimuli
compared to the activity elicited by distractors.
The proposed computational model for endogenous and exogenous visual attention
is based on the second hypothesis for the neural mechanisms behind attention. The
basic functionality of the model is based on the assumption that the incoming visual
stimulus will be manipulated by the model based on its rate and temporal coding. The rate of the visual stimuli plays an important role in the case of exogenous attention, since this type of attention is mainly affected by the different features of the visual stimuli. More salient stimuli have an advantage in passing to a further stage of processing and finally accessing working memory. On the other hand, endogenous or top-
down attention is mainly affected by the synchronization of the corresponding neural
activity that represents the incoming stimuli with the neural activity initiated by inter-
nal goals that are set up when the individual is requested to carry out a specific task.
These goals are possibly maintained in the prefrontal cortex of the brain. The direct
connection of top-down attention with synchronization is supported by many recent
studies [9, 5]. For example, Saalmann et al [12] recorded neural activity simultane-
ously from the posterior parietal cortex and an earlier area in the visual pathway of the
brain of macaques while they were performing a visual matching task. Their findings
revealed that there was synchronization of the timing activities of the two regions
when the monkeys selectively attended to a location. Thus, it seems that parietal neu-
rons which presumably represent neural activity of the endogenous goals may selec-
tively increase activity in earlier sensory areas. Additionally, the adaptive resonance
theory by Grossberg [4] implies that temporal patterning of activities could be ideally
suited to achieve matching of top–down predictions with bottom–up inputs, while
Engel et al [2] in their review (p.714) have noted that “If top–down effects induce a
Fig. 1. Neural activity that corresponds to a specific visual input propagates along the visual cortex, initially to area V1. From there, the corresponding neural activity continues up the visual hierarchy, specifically to area V4. Additionally, top-down signals originate from "higher" brain areas such as the parietal and frontal lobes, where they possibly interact with the neural activity from the incoming stimuli. Correlation between these two streams of information could be examined in area V4 of the visual cortex.
representing each incoming stimulus enters a network composed of integrate-and-fire neurons. As a result, the corresponding neural activity will propagate along the network with the task of accessing a working memory node. Based on the firing rate of each incoming stimulus, a different neural activation will reach the working memory node, and if the corresponding neural activity is strong enough to cause the working memory node to fire, it can be inferred that the specific stimulus that caused this activation has accessed working memory and has thus been attended. However, at a later stage of processing, top-down signals coming from the parietal and frontal lobes enter the network and try to influence the selection based on internal goals. For example, suppose that a person is asked to identify and respond if the letter A appears in the visual field. Then, information represented by spike trains that encode how the letter A is stored in long-term memory will enter the network as top-down signals. As a result, if a visual stimulus enters the visual field and has strong correlations with the corresponding top-down information, it will be aided in its attempt to access working memory.
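The gating role of the working memory node just described can be sketched with a simple leaky integrate-and-fire unit: only a sufficiently strong input spike train drives the node above threshold. The time constant, weight and threshold below are illustrative assumptions, not the parameters of the authors' Simulink model.

import numpy as np

# Minimal leaky integrate-and-fire sketch of a working-memory node driven by an
# incoming spike train; tau, weight and threshold are illustrative assumptions.
def wm_node_fires(spike_train, weight=1.2, tau=5.0, threshold=3.0, dt=1.0):
    """spike_train: binary array (1 = spike in that 1-ms bin)."""
    v = 0.0
    for s in spike_train:
        v += dt * (-v / tau) + weight * s   # leak plus weighted spike input
        if v >= threshold:
            return True                      # node fires: stimulus is 'attended'
    return False

strong = np.array([1, 1, 0, 1, 1, 1, 0, 1, 1, 1])   # high firing rate stimulus
weak   = np.array([1, 0, 0, 0, 0, 1, 0, 0, 0, 0])   # low firing rate stimulus
print(wm_node_fires(strong), wm_node_fires(weak))    # expected: True False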
The interaction between top-down information and the neural activity generated by
each incoming stimulus is performed in the correlation control module which is the
major component of the model (Figure 2).
Fig. 2. The correlation control module, the major component of the model: the incoming stimulus and the top-down modulations interact in the correlation control module, which gates access to the working memory node.
One possible explanation of the mechanism behind the correlation control theory
proposed in this report can be made based on coincidence detector neurons.
Fig. 3. A coincidence detector neuron C will fire only if the two input neurons A and B fire
synchronously
Fig. 4. Correlation control mechanism between the endogenous goals and the incoming stimuli
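A coincidence detector of the kind shown in Fig. 3 can be sketched as a neuron that fires only when its two inputs spike within (nearly) the same time bin. The one-bin tolerance used below is an illustrative assumption.

import numpy as np

# Sketch of the coincidence detector of Fig. 3: neuron C fires only when inputs
# A and B spike (nearly) synchronously. The +/- 1 bin tolerance is an assumption.
def coincidence_output(a, b, tolerance=1):
    a, b = np.asarray(a), np.asarray(b)
    c = np.zeros_like(a)
    for t in np.flatnonzero(a):
        lo, hi = max(0, t - tolerance), min(len(b), t + tolerance + 1)
        if b[lo:hi].any():
            c[t] = 1          # A and B fired within the tolerance window
    return c

A = [0, 1, 0, 0, 1, 0, 1, 0, 0, 1]
B = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]
print(coincidence_output(A, B))   # C fires at bins 1 and 4 only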
Fig. 5. Presentation of the RSVP for the "attentional blink" experiment (Figure 5a) and the basic attentional blink curve with no blanks (series 1), with a blank at lag 1 (series 2) and a blank at lag 2 (series 3), based on the behavioral data of Raymond and Shapiro (1992) (Figure 5b)
One possible explanation for the classic U-shaped curve of Figure 5.b (red series) is
based on Electroencephalography (EEG) measurements and more importantly on two
attention related Event Related Potentials (ERPs). The first ERPs appear at about 180-
240 ms post-stimulus and are referred to as the P2/N2 signals. These signals have been
proposed as control signals for the movement of attention [6, 15]. The second compo-
nent is the P300 signal at about 350–600 ms post-stimulus which is associated with the
working memory sensory buffer site and is taken to be the signal of the availability for
report. Therefore, the explanation for the U-shaped curve lies in the assumption that the P300 signal generated by the first target falls into the time window in which the P2/N2 component of the second target would be generated. Due to this interaction, the P2/N2 component of the second target is inhibited.
The explanation behind the curves of Figure 5b (with a blank at lag 1 (green series) and a blank at lag 2 (black series)) is based on the neural mechanisms behind selection in attentional tasks. It is mostly based on the competition process between the various stimuli to access working memory. This competition is reflected in mutual inhibition between the neural activities that correspond to each stimulus.
The proposed computational model has been implemented in the Matlab-Simulink environment. Each visual stimulus is represented by a 10 ms sequence of spikes, where each ms contains a one (spike) or a zero (no spike), as seen in Figure 6.
For coding both the distractors and the targets, the same firing rate has been used, since both (targets and distractors) are affected equally by the salience filters (same brightness, intensity, etc.). However, the difference between the spike trains generated
by the targets and the spike trains generated by distractors is in the temporal patterns.
Therefore, it is possible through the coincidence detector module to capture the corre-
lation between the spike trains generated by the targets and spike trains initiated by
internal goals if those two sources have similar temporal patterns in their spike trains.
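A minimal sketch of this temporal-pattern matching follows: targets and distractors share the same firing rate, but only the target's spike timing matches the top-down template, so the coincidence count (and hence the resulting control signal) is higher for the target. The 10-ms binary trains and the normalisation used here are assumptions in line with the text, not the authors' Simulink code.

import numpy as np

# 10-ms binary spike trains (1 ms bins), as described in the text.
top_down   = np.array([1, 0, 1, 0, 0, 1, 0, 1, 0, 1])   # internal-goal template
target     = np.array([1, 0, 1, 0, 0, 1, 0, 1, 0, 1])   # same rate AND same timing
distractor = np.array([0, 1, 0, 1, 1, 0, 1, 0, 1, 0])   # same rate, different timing

def control_signal(stimulus, template):
    """Coincidence count between two spike trains, normalised to [0, 1]."""
    coincidences = np.sum(stimulus * template)
    return coincidences / max(template.sum(), 1)

print(control_signal(target, top_down))      # 1.0 -> strong N2/P2-like control signal
print(control_signal(distractor, top_down))  # 0.0 -> weak control signal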
Based on the degree of correlation between the incoming stimulus and the internal
goals, a relevant control signal is generated that could be associated with the N2/P2
component explained in the previous section. Additionally, once the working memory node that corresponds to a specific stimulus fires, another signal is generated that inhibits, at that time, any attempt by the coincidence control module to generate a new control signal.
As a consequence, three important features of the model that rely on neurophysiological evidence have given it the ability to reproduce the behavioural data from the attentional blink experiment, as shown in Figure 7 below. These important features of
Fig. 7. Comparison between simulation data (7.a) and experimental data (7.b)
the model are: a) the correlation control module that generates a control signal related to the degree of correlation, b) the interaction between the signals related to identification and response (P300) and the control signal, and c) the competitive inhibition between the incoming stimuli.
5 Discussion
The main advantages of implementing a computational model of a specific brain function are twofold. First, a biologically plausible model gives the ability to perform appropriate and detailed simulations in order to study the most important aspects of the specific brain function, as well as to strengthen or weaken related theories. On the other hand, the detailed study of the psychological and neurophysiological approach will lead to an improved understanding of the specific functionality in the brain. This, combined with knowledge from computer science, will provide the potential for advances in neurally inspired computing and information processing. Robots and other engineered systems that mimic biological capabilities, as well as brain-computer interfaces, are some of the potential applications that can benefit from and be improved by this knowledge.
Therefore, in terms of the first issue mentioned in the previous paragraph, and based on the results obtained in our attempt to simulate behavioural data from the attentional blink experiment, we emphasize the importance of temporal coding and, as a consequence, temporal coincidence as an important mechanism for regulating the flow of information through the brain. Thus, we propose that this way of controlling streams of information, as well as transferring information between different modules, could be applied in control system engineering.
References
1. Corbetta, M., Shulman, G.L.: Control of goal-directed and stimulus-driven attention in the brain. Nature Reviews Neuroscience 3, 201–215 (2002)
2. Engel, A.K., Fries, P., Singer, W.: Dynamic predictions: Oscillations and synchrony in top-down processing. Nature Reviews Neuroscience 2, 704–716 (2001)
3. Fries, P., Reynolds, J.H., Rorie, A.E., Desimone, R.: Modulation of oscillatory neuronal
synchronization by selective visual attention. Science 291, 1560–1563 (2001)
4. Grossberg, S.: The link between brain learning, attention, and consciousness. Conscious.
Cogn. 8, 1–44 (1999)
5. Gross, J., Schmitz, F., Schnitzler, I., et al.: Modulation of long-range neural synchrony -
reflects temporal limitations of visual attention in humans. PNAS 101(35), 13050–13055
(2004)
6. Hopf, J.-M., Luck, S.J., Girelli, M., Hagner, T., Mangun, G.R., Scheich, H., Heinze, H.-J.: Neural sources of focused attention in visual search. Cereb. Cortex 10, 1233–1241 (2000)
7. Kempter, R., Gerstner, W., van Hemmen, J.: How the threshold of a neuron determines its
capacity for coincidence detection. Biosystems 48(1-3), 105–112 (1998)
8. Moran, J., Desimone, R.: Selective attention gates visual processing in the extrastriate cor-
tex. Science 229, 782–784 (1985)
9. Niebur, E., Hsiao, S.S., Johnson, K.O.: Synchrony: a neuronal mechanism for attentional
selection? Cur.Op. in Neurobio. 12, 190–194 (2002)
358 K.C. Neokleous et al.
10. Niebur, E., Koch, C.: A model for the neuronal implementation of selective visual attention based on temporal correlation among neurons. Journal of Computational Neuroscience 1, 141–158 (1994)
11. Raymond, J.E., Shapiro, K.L., Arnell, K.M.: Temporary suppression of visual processing in an RSVP task: an attentional blink? Journal of Experimental Psychology: Human Perception and Performance 18(3), 849–860 (1992)
12. Saalmann, Y.B., Pigarev, I.N., et al.: Neural Mechanisms of Visual Attention: How Top-
Down Feedback Highlights Relevant Locations. Science 316, 1612 (2007)
13. Spruston, N.: Pyramidal neurons: dendritic structure and synaptic integration. Nature Re-
views Neuroscience 9, 206–221 (2008)
14. Steinmetz, P.N., Roy, A., et al.: Attention modulates synchronized neuronal firing in primate somatosensory cortex. Nature 404, 187–190 (2000)
15. Taylor, J.G., Rogers, M.: A control model of the movement of attention. Neural Net-
works 15, 309–326 (2002)
Simulation of Large Spiking Neural Networks
on Distributed Architectures,
The “DAMNED” Simulator
Introduction
Spiking neuron models open new possibilities for the Artificial Neural Networks (ANN) community, especially for modelling biological architectures and dynamics with accurate behaviour [6]. The huge computational cost of some individual neuron models [7, 9] and the interest in simulating large networks (compared with non-spiking ANN) make Spiking Neural Networks (SNN) good candidates for parallel computing [3]. However, even though distributed programming has become easier, reaching good accelerations still requires considerable work and specific knowledge. We therefore present a distributed simulator dedicated to SNN, with a minimal need for (sequential) programming to deal with a new project, possibly no programming at all. Our simulator is named DAMNED, for Distributed And Multi-threaded Neural Event-Driven simulator. It is designed to run efficiently on various architectures, as shown in Section 1, using an "event-driven" approach detailed in Section 2. Sections 3 and 4 present two optimisations: an optimized event queue and a distributed approach to virtual time handling. Section 5 presents the creation of an SNN within DAMNED and Section 6 analyses the scalability of DAMNED. Finally, Section 7 presents some future work.
1 MIMD-DM Architectures
Our goal is to build large SNN, so the parallel architecture must be scalable in
computational power. Large SNN need a large amount of memory, so the architecture must also be scalable in memory, which is not possible with shared-memory architectures. Thus, our simulator is optimized to run on MIMD-DM archi-
tectures. MIMD stands for Multiple Instruction stream, Multiple Data stream
using Flynn’s taxonomy [4], and means several processors working potentially
independently; DM stands for Distributed Memory which means each processor
accesses a private memory, thus communications between processors are nec-
essary to access information located in non-local memory. Another advantage
of using MIMD-DM architecture is the diversity of production hardware: from
the most powerful parallel computers to the cheapest workstation clusters. The
drawbacks of the MIMD-DM architecture are the complexity of the program design; the need for specific knowledge (message passing instead of well-known shared-variable programming); and the complexity of the code (not to mention debugging
and validation). However since DAMNED is end-user oriented, the distributed
programming work is already done and validated. The user has no need to enter
the parallel code or even to understand the design of the simulator.
Development has been done in C++ using the MPI 2.0 library [5]. MPI stands
for Message Passing Interface and provides methods to handle communications
(blocking, non-blocking, and collectives) and synchronization. MPI can launch
several tasks (the name of a process in MPI) on a single processor thanks to
the scheduler of the host operating system. In such a case communications will
not involve the physical network when occurring between two tasks hosted by
the same processor. As a consequence, DAMNED runs out of the box on a SMP
architecture or on a hyper-threaded architecture. Thereby DAMNED has been
developed and tested on a single SMP dual-core ultra-portable laptop computer
and validated on a MIMD-DM cluster of 35 dual-core workstations.
The DAMNED simulator is available free of charge; most implementations of MPI are free; the most suitable host operating system is Linux; and the target architecture can be a non-dedicated PC cluster with a standard Ethernet network. As a consequence, installing and running DAMNED potentially costs nothing but electricity, if it is run at times when the machines would otherwise be switched off.
2 Architecture of DAMNED
Simulating SNN implies dealing with time [1]. The simulator can divide biological time into discrete steps and scan all the neurons and synapses of the network at each time step; this method is named "clock-driven". Our simulator uses another strategy, named "event-driven" [17]. An event is a spike emission from a pre-synaptic neuron towards post-synaptic neurons, and each event is stamped with the spike emission date. Thus, DAMNED is based on an infinite loop processing events. Each time an event is processed, the simulator (i) actualizes the state of the impacted neuron, which possibly generates new events
for post-synaptic neurons; and (ii) increases its virtual clock to the event time
stamp. The event-driven strategy is suitable for spiking neuron simulation [12].
Indeed, biological spike flows are generally irregular in time, with a low average activity. Such irregular activities imply high and low activity periods. While high activity periods favour the clock-driven approach, low activity periods favour the event-driven approach. Furthermore, a clock-driven simulation loses the order of emission of spikes emitted within the same time step, which can change the behaviour of the simulation [15]. In an event-driven simulation, temporal precision only depends on the precision of the variable used for time stamps and clocks, and even a simple 16-bit variable gives a suitable precision. However, an event-driven simulation does have some drawbacks. The state of a neuron is only actualized when events are processed, i.e. on reception of a spike by a target neuron, which then decides whether to spike or not. However, when the behaviour of the neuron is described by several differential equations and/or when synaptic impacts are not instantaneous, it is still possible to predict the spike emission date [10]. Such a prediction implies heavy computational costs, and a mechanism must control a posteriori the correctness of the prediction.
To efficiently run a variety of experiments, DAMNED relies on a single "front-end task" (typically running on a workstation) and on several "simulation tasks" (typically running on a cluster). The front-end sends inputs into the simulated spiking network and possibly receives outputs. The input can be a flow of stimuli, for example matching a biological experiment. In such a case the evolution of neurons can be monitored to compare the behaviour of the simulated network with biological recordings. The input can also be a physical device such as a camera; in such an example the output could be used to control a motor moving the camera, possibly producing new inputs. More generally, the front-end can host a virtual environment producing inputs for the SNN and possibly being modified by the output of the SNN.
The simulation tasks run the heart of the simulator, that is the processing of
events; each task holds a part of the whole neural network. In a simulation of N
neurons with P tasks, each task handles N/P neurons. A simulation task is com-
posed of two threads named CMC and CPC, respectively for “ComMunication
Controller” and “ComPutation Controller”. The use of threads takes ad-
vantage of hyper-threaded and dual-core processors. More precisely, CPC and
CMC share two priority queues to manage two types of events:
– incoming Events to be Processed (by local neurons) are stored in a so-called "EtoP" queue; an EtoP is the result of an incoming spike and contains the target neuron ID, the source neuron ID, and the spike emission date;
– outgoing Events to be Emitted are stored in a so-called "EtoE" queue; an EtoE is the result of a local neuron spiking and contains the source neuron ID, the emission date, and a boolean flag used to possibly invalidate a predicted emission (see the second paragraph of the current section). Both event types are sketched below.
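A minimal sketch of these two event records and their date-ordered priority queues follows. Field names follow the description above; the Python representation is only illustrative of the C++/MPI implementation.

import heapq
from dataclasses import dataclass, field

# Sketch of the two event types shared by CMC and CPC, ordered by date stamp.
@dataclass(order=True)
class EtoP:                      # incoming Event to be Processed by a local neuron
    date: int
    target: int = field(compare=False)
    source: int = field(compare=False)

@dataclass(order=True)
class EtoE:                      # outgoing Event to be Emitted by a local neuron
    date: int
    source: int = field(compare=False)
    valid: bool = field(compare=False, default=True)  # may invalidate a predicted emission

etop_queue, etoe_queue = [], []
heapq.heappush(etop_queue, EtoP(date=12, target=3, source=41))
heapq.heappush(etop_queue, EtoP(date=7, target=9, source=2))
heapq.heappush(etoe_queue, EtoE(date=9, source=3))
print(heapq.heappop(etop_queue).date)   # 7: CPC always processes the earliest event first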
The simulation is an infinite loop where CPC processes the top event of the EtoP queue (step "CPC 1" on figure 1). The targeted neuron is activated and possibly spikes, which generates an EtoE stored in the ordered EtoE queue according to the event emission date (step "CPC 2"). The "activation" of the target neuron implies changing its state according to its own dynamics. This processing can be achieved by CPC or by a dedicated thread, as shown in figure 1.
Also within an infinite loop, CMC processes the top event of the EtoE queue
(step “CMC 2”). CMC knows the tables of postsynaptic neurons ID for all its
neurons so the processed EtoE is used to generate EtoPs in the EtoP queue
(spikes emitted to local neurons) and to pack EtoPs in messages to be sent to
other simulation tasks (spikes emitted to remote neurons). Some controls are
performed by CMC and CPC, respectively “EC” and “PC” on figure 1, to keep
the behaviour strictly conservative and avoid deadlocks (see section 4).
It is then possible to define a type of priority queue with no additional cost when inserting a new event, nor when accessing or deleting the next event. This delayed queue is a circular list containing as many sets of events as there are possible time stamps within the maximal delay δmax. Each new event is inserted at the index corresponding to time de + δ, where de is the actual time of emission and δ is the delay between the source and target fields of the event. Knowing the maximal delay δmax ensures that at time de no incoming event can be inserted later than de + δmax. When all events at time de have been computed and the authorization to increase the actual virtual time has been given (see next section), the delayed queue moves its first index to de + δt, where δt is the authorized time increment.
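The delayed queue just described can be sketched as a circular array of event buckets of length δmax, indexed modulo δmax, so that insertion and extraction are O(1). This is a simplified single-process illustration of the idea, not the DAMNED code.

# Simplified sketch of the delayed circular event queue: one bucket per possible
# time stamp within the maximal delay delta_max, indexed modulo delta_max.
class DelayedQueue:
    def __init__(self, delta_max):
        self.delta_max = delta_max
        self.buckets = [[] for _ in range(delta_max)]
        self.current = 0                      # virtual date of the first bucket

    def insert(self, event, emission_date, delay):
        # An event emitted at 'emission_date' with synaptic delay 'delay' lands at
        # emission_date + delay, which is guaranteed to be < current + delta_max.
        date = emission_date + delay
        assert self.current <= date < self.current + self.delta_max
        self.buckets[date % self.delta_max].append(event)

    def advance(self, dt):
        # Pop every event whose date falls in [current, current + dt).
        out = []
        for _ in range(dt):
            out.extend(self.buckets[self.current % self.delta_max])
            self.buckets[self.current % self.delta_max] = []
            self.current += 1
        return out

q = DelayedQueue(delta_max=8)
q.insert("spike from neuron 3", emission_date=0, delay=2)
q.insert("spike from neuron 7", emission_date=0, delay=5)
print(q.advance(3))   # ['spike from neuron 3']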
(Figure: the delayed circular event queue. (a) Excitatory and inhibitory events are inserted at the index corresponding to their date t + d, within the window bounded by t + δmax. (b) After the authorized time increment, the first index moves to t' and the window extends to t' + δmax.)
the whole clock array of Tsrc is sent as a preamble. On Ttgt, the received tei (∀i ≠ tgt) indicate that Ti is granted to run the simulation until date tei. The received clock array is used to actualize the local clock array by keeping the higher date between the received and local elements, except for the local te. The CPC thread also handles its own clock, standing for the current event processing time and named tp, set from the time stamp of the top of the EtoP queue. Please note that, as described in the following paragraphs, DAMNED will set te and tp to null or even negative values to carry special information.
While no clock change occurs, CPC and CMC continue to respectively process
EtoPs and send EtoEs. However when CMC is about to increment te (the task
virtual clock), it is necessary to check if the task is not running the simulation too
fast regarding other tasks with an “Emission Control” (“EC” on figure 1). This
issue also exists for CPC regarding tp, thus a “Processed Control” is performed
("PC" on figure 1) when CPC is about to increment tp. An inaccurate element of the clock array on one of the tasks can lead to an erroneous check and block CPC or CMC, resulting in a deadlock of the simulator. DAMNED uses a boolean array on each task Ti to ensure clock array accuracy: clock_sent[j] is set to true when the actual value of tei is sent to task Tj (possibly without events); and each element of clock_sent is set to false when tei increases.
When the EtoE queue is empty, CMC updates tesrc as described by algorithm 1, where δmin is the minimal "biological" delay within the SNN. The variable tp may have been set to null by CPC (as described in the next paragraph); in such a case algorithm 1 enters line 2 if Tsrc has nothing to do, and sending a null clock value informs the other tasks not to wait for Tsrc. Finally, if EtoE is empty but not EtoP (line 1), CMC sends −|tpsrc − δmin|, which is the opposite of the date until which Tsrc is sure that other tasks can be granted to emit.
if tpsrc ≠ 0 then
1   tesrc ← −|tpsrc − δmin|
else
2   tesrc ← 0
Algorithm 1. Update of the actual virtual time
any, i.e. if tej = 0). If both conditions are fulfilled, the local clock can be incremented (tei ← de) and emission is granted (line 5); otherwise emission is denied and tei is set to −de (line 6), meaning CMC is blocked until a future emission at de.
The control performed by CPC before activating the neuron targeted by the top EtoP ("PC" on figure 1) is presented in algorithm 3. CPC is free to process events while (i) tpi does not have to be incremented (line 1) or (ii) the incrementation keeps tpi within the lookahead provided by δmin: CPC must check that all tasks have an up-to-date tei and that it is no longer possible to receive an earlier event that could void the neuron activation (line 2). If these conditions are fulfilled, tpi can be incremented to de and the EtoP can be processed (line 3). Otherwise CPC is blocked until the reception of an earlier event (i.e. until time de) and tpi is set to the opposite of de (line 4). Furthermore, if the EtoE queue is empty at processing time (line 3), tei is set to the opposite of tpi.
1 if (tpi = de) or
2    (clock_sent[j] and (de ≤ |tej| + δmin or tej = 0), ∀j ≠ i, 0 ≤ j < P) then
       tpi ← de
3      if (tei = 0) then tei ← −tpi
       → processing granted
  else
4      tpi ← −de
       → processing denied
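A hedged sketch of this "Processed Control" check follows: task i may advance tpi to the date de of the top EtoP only if no other task can still send an event earlier than de. Variable names follow the text; the surrounding bookkeeping of the real simulator is omitted.

# Sketch of the PC check of algorithm 3: may task i advance tp_i to date de?
# te[j] is the clock array and clock_sent[j] the boolean accuracy flags (see text).
def processed_control(i, de, tp, te, clock_sent, delta_min):
    others_safe = all(
        clock_sent[j] and (de <= abs(te[j]) + delta_min or te[j] == 0)
        for j in range(len(te)) if j != i
    )
    if tp[i] == de or others_safe:
        tp[i] = de                       # line 3: processing granted
        if te[i] == 0:
            te[i] = -tp[i]               # EtoE queue empty: publish -tp_i
        return True
    tp[i] = -de                          # line 4: processing denied, block until de
    return False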
However, for now the types of neurons used by a network are defined in the code, and changing the type used implies recompiling DAMNED. Implemented models include LIF [8], GIF [14], and SRM0 [6]. The implementation of a new model implies writing a new C++ object but does not require parallel programming. The difficulty depends heavily on the type of neuron, and can be significant, for example if the spiking date must be predicted.
6 Results
The DAMNED simulator has been tested on a cluster, that is a classroom of the
French West Indies University (Université des Antilles et de la Guyane) with 35 Intel 2.2 GHz dual-core computers. Each machine has 1 gigabyte of RAM and
machines are needed for the OS not to swap (see both dashed lines on figure 4).
Making larger neural networks runnable is the main successful goal of DAMNED.
Furthermore, results show that simulation times are significantly decreased, even
with the low cost 100 Mb/s network used. The two 10 000 neurons simulations
(dashed lines) are performed at 1 Hz and 10 Hz. Both frequencies are biologically
plausible average values, however some cells can spike at a higher rate. Results
show that a higher activity implies lower speedups. Indeed it is more difficult for
computations to hide the increased number and size of messages. Thus dealing
with high average frequencies would take advantage of a better network. Finally,
the 100 000 neuron network (solid line), involving at least 27 simulation tasks, validates the scalability of DAMNED for effective simulations.
Fig. 4. Speedups as a function of the number of simulation tasks, with average spiking rates of 1 Hz and 10 Hz, and biologically plausible connectivity of respectively 0.1% and 0.01% of the network size for the 10 000 and 100 000 neuron networks
This paper has presented the DAMNED architecture and some original contri-
butions to the field of distributed event-driven simulations, namely an optimised
delayed event queue and efficient distributed virtual time handling algorithms.
DAMNED is well adapted for biologically inspired protocols and can output valu-
able simulation data. Significant speedups have been achieved and prove that
even using a non dedicated cluster of simple computers, the DAMNED simu-
lator is able to handle large SNN simulation. Such results tend to validate the
scalability of DAMNED. Lower speedups occur when the computational load is not high enough to overlap communications; thus complex neuron models would definitely lead to higher speedups. Furthermore, results show that even when DAMNED is under-loaded, increasing the number of processors does not slow down the simulation, which can in all cases reach a normal end. More evaluations will be done on dedicated parallel computers to simulate larger SNN.
Figure 5 presents the speedups of the creation of the SNN and shows that this
step already takes advantage of the cluster. However the creation time remains
high for large neural networks, about 2 000 s for a 100 000 neurons network
using 35 machines while simulating 1 s of biological time takes about 150 s. We
are currently working on this issue and significant improvements are anticipated.
Fig. 5. Speedups of the creation of the SNN for the 10 000 and 100 000 neuron networks
Regarding usability, DAMNED is end-user oriented and the creation of an SNN has been presented. A web-based interface, which makes it possible to launch simulations from a simple web browser, has already been developed and will be improved. This interface shows the available machines on the cluster, creates MPI configuration files, runs DAMNED and collects results. For now, the main remaining difficulty for an end-user is the modification of neuron models and the implementation of new models, which will be addressed by allowing new neuron models to be defined from the web interface.
References
[1] Bohte, S.M.: The evidence for neural information processing with precise spike-
times: A survey. Natural Computing 3(4), 195–206 (2004)
[2] Chandy, K.M., Misra, J.: Distributed simulation: A case study in design and
verification of distributed programs. IEEE Transactions on Software Engineering,
SE-5(5), 440–452 (1979)
[3] Ferscha, A.: Parallel and distributed simulation of discrete event systems. In: Par-
allel and Distributed Computing Handbook, pp. 1003–1041. McGraw-Hill, New
York (1996)
[4] Flynn, M.J., Rudd, K.W.: Parallel architectures. ACM Computation Sur-
veys 28(1), 67–70 (1996)
[5] Message Passing Interface Forum: MPI: A message-passing interface standard. Technical Report UT-CS-94-230, University of Tennessee (1994)
[6] Gerstner, W., Kistler, W.M.: Spiking Neuron Models: An Introduction. Cam-
bridge University Press, New York (2002)
370 A. Mouraud and D. Puzenat
[7] Izhikevich, E.M.: Simple model of spiking neurons. IEEE Transactions on Neural
Networks 14(6), 1569–1572 (2003)
[8] Knight, B.W.: Dynamics of encoding in a population of neurons. Journal of Gen-
eral Physiology 59, 734–766 (1972)
[9] Lobb, C.J., Chao, Z.C., Fujimoto, R.M., Potter, S.M.: Parallel event-driven neural network simulations using the Hodgkin-Huxley model. In: Proceedings of the Workshop on Principles of Advanced and Distributed Simulations, PADS 2005, June 2005, pp. 16–25 (2005)
[10] Makino, T.: A discrete-event neural network simulator for general neuron models.
Neural Computing and Applications 11(3-4), 210–223 (2003)
[11] Marin, M.: Comparative analysis of a parallel discrete-event simulator. In: SCCC,
pp. 172–177 (2000)
[12] Mattia, M., Giudice, P.D.: Efficient event-driven simulation of large networks
of spiking neurons and dynamical synapses. Neural Computation 12, 2305–2329
(2000)
[13] Morrison, A., Mehring, C., Geisel, T., Aertsen, A., Diesmann, M.: Advancing the
boundaries of high-connectivity network simulation with distributed computing.
Neural Computation 17, 1776–1801 (2005)
[14] Rudolph, M., Destexhe, A.: Analytical integrate-and-fire neuron models with
conductance-based dynamics for event-driven simulation strategies. Neural Com-
putation 18(9), 2146–2210 (2006)
[15] Shelley, M.J., Tao, L.: Efficient and accurate time-stepping schemes for integrate-
and-fire neuronal network. Journal of Computational Neuroscience 11(2), 111–119
(2001)
[16] Swadlow, H.A.: Efferent neurons and suspected interneurons in binocular visual cortex of the awake rabbit: Receptive fields and binocular properties. Journal of Neurophysiology 59(4), 1162–1187 (1988)
[17] Watts, L.: Event-driven simulation of networks of spiking neurons. In: Cowan,
J.D., Tesauro, G., Alspector, J. (eds.) Advances in Neural Information Processing
System, vol. 6, pp. 927–934. MIT Press, Cambridge (1994)
A Neural Network Model for the Critical Frequency of
the F2 Ionospheric Layer over Cyprus
Abstract. This paper presents the application of Neural Networks for the pre-
diction of the critical frequency foF2 of the ionospheric F2 layer over Cyprus.
This ionospheric characteristic (foF2) constitutes the most important parameter
in HF (High Frequency) communications since it is used to derive the optimum
operating frequency in HF links. The model is based on ionosonde measure-
ments obtained over a period of 10 years. The developed model successfully
captures the variability of the foF2 parameter.
1 Introduction
Skywave HF communications utilize the ability of the ionosphere to reflect waves of up to 30 MHz to achieve medium- to long-distance communication links with a minimum of infrastructure (figure 1). The ionosphere is defined as the region of the earth's upper atmosphere where sufficient ionisation exists to affect the propagation of radio waves in the frequency range 1 to 30 MHz. It ranges in height above the surface of the earth from approximately 50 km to 600 km. The influence of this region on radio waves is attributed to the presence of free electrons.
The uppermost layer of the ionosphere is the F2 layer which is the principal reflect-
ing region for long distance HF communications [1,2,3]. The maximum frequency
that can be reflected at vertical incidence by this layer is termed the F2 layer critical
frequency (foF2) and is directly related to the maximum electron density of the layer.
The F2 layer critical frequency is the most important parameter in HF communication
links since when multiplied by a factor which is a function of the link distance, it
defines the optimum usable frequency of operation. The maximum electron density of
free electrons within the F2 layer and therefore foF2 depend upon the strength of the
solar ionising radiation which is a function of time of day, season, geographical loca-
tion and solar activity [1,2,3]. This paper describes the development of a neural net-
work model to predict foF2 above Cyprus. The model development is based on
around 33000 hourly foF2 measurements recorded above Cyprus from 1987 to 1997.
The practical application of this model lies in the fact that in the absence of any real-
time or near real-time information on foF2 above Cyprus this model can provide an
alternative method to predict its value under certain solar activity conditions.
Fig. 1. Skywave propagation: HF radio waves reflected by the ionosphere back to Earth.
Fig. 2. Diurnal variation of foF2 (MHz) against local time.
The most profound solar effect on foF2 is reflected in its daily variation, as shown in figure 2. As clearly depicted, there is a strong dependency of foF2 on local time, with a sharp increase of foF2 around sunrise and a gradual decrease around sunset. This is attributed to the rapid increase in the production of electrons due to the photo-ionization process during the day and the more gradual decrease due to the recombination of ions and electrons during the night.
The long–term effect of solar activity on foF2 follows an eleven-year cycle and is
clearly shown in figure 3(a) where all the values of foF2 are plotted against time as
well as a modeled monthly mean sunspot number R which is a well established index
of solar activity (figure 3(b)). We can observe a marked correlation of the mean level
of foF2 and modeled sunspot number.
Fig. 3. Long-term foF2 and solar activity variation with time
There is also a seasonal component in the variability of foF2, which can be attributed to the seasonal change in extreme ultraviolet (EUV) radiation from the Sun. This can be clearly identified in figure 4 for noon values of foF2 for year 1991, according to which foF2 tends to be higher in winter than during the summer. In fact this variation reverses for night-time foF2 values. This particular phenomenon is termed the winter anomaly. In addition to the effects of solar activity on foF2 mentioned above, we can also identify a strong effect on the diurnal variability as solar activity gradually increases through its 11-year cycle. This is demonstrated in figure 5, where the diurnal variation of foF2 is plotted for three different days corresponding to low (1995), medium (1993) and high (1991) sunspot number periods. It is evident from this figure that the night-to-day variability in foF2 increases as the sunspot number increases.
Fig. 4. Noon values of foF2 (MHz) during 1991, illustrating the seasonal variation.
Fig. 5. Diurnal variability of foF2 for low, medium and high solar activity
3 Model Parameters
The diurnal variation of foF2 is clearly evident by observing figure 2 and figure 5. We
therefore include hour number as an input to the model. The hour number, hour, is an
integer in the range 0 ≤ hour ≤ 23. In order to avoid unrealistic discontinuity at the
midnight boundary, hour is converted into its quadrature components according to:
sinhour = \sin\left( 2\pi \frac{hour}{24} \right)   (1)
and
coshour = \cos\left( 2\pi \frac{hour}{24} \right)   (2)
A seasonal variation is also an underlying characteristic of foF2 as shown in figure
4 and is described by day number daynum in the range 1 ≤ daynum ≤ 365. Again to
avoid unrealistic discontinuity between December 31st and January 1st daynum is
converted into its quadrature components according to:
sinday = \sin\left( 2\pi \frac{daynum}{365} \right)   (3)
and
cosday = \cos\left( 2\pi \frac{daynum}{365} \right)   (4)
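These quadrature encodings are straightforward to compute; a minimal sketch (plain Python, not the authors' code) is shown below.

import numpy as np

# Quadrature encoding of hour (eqs. 1-2) and day number (eqs. 3-4), so that
# 23:00 is close to 00:00 and 31 December is close to 1 January.
def encode_inputs(hour, daynum):
    sinhour = np.sin(2 * np.pi * hour / 24)
    coshour = np.cos(2 * np.pi * hour / 24)
    sinday = np.sin(2 * np.pi * daynum / 365)
    cosday = np.cos(2 * np.pi * daynum / 365)
    return sinhour, coshour, sinday, cosday

print(encode_inputs(hour=23, daynum=365))   # nearly identical to encode_inputs(0, 1)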
Long-term solar activity has a prominent effect on foF2. To include this effect in
the model specification we need to incorporate an index, which represents a good
indicator of solar activity. In ionospheric work the 12-month smoothed sunspot num-
ber is usually used, yet this has the disadvantage that the most recent value available
corresponds to foF2 measurements made six months ago. To enable foF2 data to be
modelled as soon as they are measured, and for future predictions of foF2 to be made,
the monthly mean sunspot number values were modeled using a smooth curve defined
by a summation of sinusoids (figure 3(b)).
A Neural Network (NN) was trained to predict the foF2 value based on sinhour,
coshour, sinday, cosday and R (modeled sunspot number) model parameters. The
33149 values of the dataset recorded between 1987 and 1997 were used for training
the NN, while the 3249 values of the more recent dataset recorded from 18.09.08 until
16.04.09 were used for testing the performance of the trained NN. The training set
was sparse to a certain degree in the sense that many days had missing foF2 hourly
values and this did not allow the dataset to be approached as a time-series.
The network used was a fully connected two-layer neural network, with 5 input, 37
hidden and 1 output neuron. Both its hidden and output neurons had tan-sigmoid
activation functions. The number of hidden neurons was determined by trial and error.
The training algorithm used was the Levenberg-Marquardt backpropagation algorithm
with early stopping based on a validation set created from the last 3000 training ex-
amples. In an effort to avoid local minima ten NNs were trained with different ran-
dom initialisations and the one that performed best on the validation set was selected
for application to the test examples. The inputs and target outputs of the network were
normalized setting their minimum value to -1 and their maximum value to 1. The
results reported here were obtained by mapping the outputs of the network for the test
examples back to their original scale.
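A hedged sketch of this set-up follows: five inputs, one hidden layer of 37 tanh units, targets scaled to [-1, 1], and a held-out validation block. scikit-learn offers no Levenberg-Marquardt training, so L-BFGS is substituted here, and X, y are random placeholders for the 33149 hourly foF2 records; this only illustrates the structure described in the text, not the authors' Matlab model.

import numpy as np
from sklearn.neural_network import MLPRegressor

def scale_to_unit(a, lo, hi):
    return 2 * (a - lo) / (hi - lo) - 1          # map [lo, hi] -> [-1, 1]

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 5))           # placeholder inputs: sinhour, coshour, sinday, cosday, R
y = rng.uniform(2, 16, size=1000)                # placeholder foF2 values in MHz
y_scaled = scale_to_unit(y, y.min(), y.max())

model = MLPRegressor(hidden_layer_sizes=(37,), activation="tanh",
                     solver="lbfgs", max_iter=2000, random_state=0)
model.fit(X[:-200], y_scaled[:-200])             # hold out the last 200 rows for validation
pred = model.predict(X[-200:])
pred_mhz = (pred + 1) / 2 * (y.max() - y.min()) + y.min()   # map back to MHz
rmse = np.sqrt(np.mean((pred_mhz - y[-200:]) ** 2))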
The RMSE of the trained NN on the test set was 0.688 MHz, which is considered
acceptable for a prediction model [4,5,6]. To further evaluate the performance of the
developed network, a linear NN was applied to the same data and the performance of
the two was compared. The RMSE of the linear NN on the test set was 1.276 MHz,
which is almost double that of the multilayer NN. Some examples of measured and
predicted foF2 values are given in figure 6. These demonstrate both the good per-
formance of the developed NN and its superiority over the linear model.
Fig. 6. Examples of measured and predicted foF2 values (MHz) against local time.
Despite the good agreement of measured and predicted values under benign geo-
magnetic conditions we have noticed several occasions where the discrepancy be-
tween them increases significantly. This is the case particularly during geomagnetic
storms due to the impact of the disturbed magnetic field on the structure of the iono-
sphere causing rapid enhancements or depletions in its electron density. An example
of such an occasion is given in figure 7. It is evident that around 15:00 the intense
geomagnetic activity causes an increase in the error in the model due to excursions of
foF2 from its undisturbed behaviour.
Fig. 7. Measured and predicted foF2 values (MHz) against local time during a geomagnetic storm period.
References
1. Goodman, J.: HF Communications, Science and Technology. Van Nostrand Reinhold (1992)
2. Maslin, N.: The HF Communications, a Systems Approach, San Francisco (1987)
3. McNamara, L.F.: The Ionosphere: Communications, Surveillance, and Direction Finding. Krieger Publishing Company, Malabar (1991)
4. Altinay, O., Tulunay, E., Tulunay, Y.: Forecasting of ionospheric critical frequency using
neural networks. Geophys. Res. Lett. 24, 1467–1470 (1997)
5. Cander, L.R., Lamming, X.: Neural networks in ionospheric prediction and short-term forecasting. In: 10th International Conference on Antennas and Propagation, Edinburgh, April 14-17, vol. 436, pp. 2.27–2.30. IEE Conference Publication (1997)
6. Wintoft, P., Cander, L.R.: Short term prediction of foF2 using time delay neural networks.
Phys. Chem. Earth (C) 24, 343–347 (1999)
Dictionary-Based Classification Models. Applications for
Multichannel Neural Activity Analysis
Vincent Vigneron1, Hsin Chen2, Yen-Tai Chen3, Hsin-Yi Lai3, and You-Yin Chen3
1 IBISC CNRS FRE 3190, Université d'Evry, France
vincent.vigneron@ibisc.univ-evry.fr
2 Dept. of Electrical Engineering, National Tsing Hwa University, Taiwan
hchen@ee.nthu.edu.tw
3 Dept. of Electrical Engineering, National Chiao-Tung University, Taiwan
kenchen@cn.nctu.edu.tw
1 Introduction
The analysis of continuous and multichannel neuronal signals is complex, due to the large amount of information received from every electrode. Neural spike recordings have
This project was supported in part by funding from the Hubert Curien program of the French Ministry of Foreign Affairs and from the Taiwan NSC. The neural activity recordings were kindly provided by the Neuroengineering Lab of the National Chiao-Tung University.
shown that the primary motor cortex (M1) encodes information about movement direction [25,24,23,1], limb velocity, force [4] and individual muscle activity [22,17]. To further investigate the relationship between cortical spike firing patterns and a specific behavioral task, we have investigated motor cortex responses recorded during movement in freely moving rats: Fig. 1 shows the experimental setup for neural activity recording together with the simultaneous video capture of the related animal behavioral task.
Fig. 1. The experimental setup (top). A light-colored marker (red virtual ring) was strapped to the right forelimb so that its trajectory could be recognized by the video tracking system. The image sequence captured the rat performing the lever-press task in return for a water reward (bottom).
In this paper, we report on the classification of action potentials into a spike waveform database. We do not intend to classify the waveforms: much work has already been done on this subject, see for instance Van Staveren et al. [26]. Rather, we expect to bring to light a relationship between the rat posture or motion and the recorded neural activity. Our research therefore has two main objectives: phenomenon explanation and phenomenon prediction. This represents one of the current challenges in signal theory research.
The study, approved by the Institutional Animal Care and Use Committee at the Na-
tional Chiao Tung University, was conducted according to the standards established in
the Guide for the Care and Use of Laboratory Animals. Four male Wistar rats weigh-
ing 250-300 g (BioLASCO Taiwan Corp., Ltd.) were individually housed on a 12 h
light/dark cycle, with access to food and water ad libitum.
The dataset was collected from the motor cortex of awake animals performing a simple reward task. In this task, male rats (BioLASCO Taiwan Co., Ltd.) were trained to press a lever to initiate a trial in return for a water reward. The animals were water-restricted 8 hours/day during the training and recording sessions, but food was always provided ad libitum.
The animals were anesthetized with pentobarbital (50 mg/kg i.p.) and placed on a standard stereotaxic apparatus (Model 9000, David Kopf, USA). The dura was retracted carefully before the electrode array was implanted. Pairs of 8-microwire electrode arrays (no. 15140/13848, 50 μm in diameter; California Fine Wire Co., USA) were implanted into layer V of the primary motor cortex (M1). The area related to forelimb movement is located 2-4 mm anterior and 2-4 mm lateral to Bregma. After implantation, the exposed brain was sealed with dental acrylic and a recovery time of a week was allowed.
During the recording sessions, the animal was free to move within the behavior task box (30 cm × 30 cm × 60 cm), where the rats pressed the lever with the right forelimb only and then received a 1-ml water reward, as shown in Fig. 1. A Multi-Channel Acquisition Processor (MAP, Plexon Inc., USA) was used to record neural signals. The recorded neural signals were transmitted from the headstage to an amplifier through a band-pass filter (spike preamp filter: 450 Hz-5 kHz; gain: 15,000-20,000) and sampled at 40 kHz per channel. Simultaneously, the animals' behavior was recorded by the video tracking system (CinePlex, Plexon Inc., USA) and examined to ensure that it was consistent for all trials included in a given analysis [26].
3 Data Analysis
3.1 Preprocessing
Neural activity was collected from 400-700 ms before to 200-300 ms after lever release for each trial. Data were recorded from 33 channels; action potentials (spikes) crossing manually set thresholds were detected and sorted, and the firing rate for each neuron was computed in 33 ms time bins (Fig. 3).
In each channel, the rms noise level is constantly monitored and determines the setting of a level detector used to detect spike activity. Each trial was segmented into approximately 19-30 bins. Note that the real-time data processing software reduces the data stream by rejecting data which do not contain bioelectrical activity.
In the following experiment, the data are summarized by a 706 × 33 matrix X = (x_{ij}) and a 706 × 1 vector Z = (z_i), where x_{ij} denotes the feature of sample i for channel j and z_i stands for the class label of sample i. The class label is 1, 2, 3 or 4, depending on the trajectory followed by the rat arm in the plane of the video tracking system (Fig. 2).
Fig. 2. Arm trajectory in the video tracking system y1y2-plane (a); z-label derived from the position vector y(t) (b).
The z-label is obtained from differential geometry. We assume that a Cartesian coordinate system has been introduced in R². Then, every point in the plane can be uniquely determined by the position vector y = y(t) = (y1(t), y2(t)). Therefore the vector y(t2) − y(t1) gives the direction of the motion and lies in one of the four quadrants of the y1y2-plane. The first quadrant is labelled z = 1, the second quadrant z = 2, etc.
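A minimal sketch of this labelling follows: the movement direction y(t2) − y(t1) is assigned to one of the four quadrants of the y1y2-plane. Variable names are illustrative.

import numpy as np

# Sketch of the z-labelling: the motion direction y(t2) - y(t1) is assigned to
# one of the four quadrants of the y1-y2 plane (labels 1..4, counter-clockwise).
def quadrant_label(y_t1, y_t2):
    d1, d2 = np.asarray(y_t2) - np.asarray(y_t1)
    if d1 >= 0 and d2 >= 0:
        return 1
    if d1 < 0 and d2 >= 0:
        return 2
    if d1 < 0 and d2 < 0:
        return 3
    return 4

print(quadrant_label((0.0, 0.0), (-1.0, 2.0)))   # 2: movement up and to the left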
Let X = (x_{ij}) be the data matrix and Z = (z_i) the vector of class labels.
LDA, QDA and MDA. Statistical discriminant analysis methods such as LDA, QDA and MDA arise from a Gaussian mixture model. These methods are concerned with the construction of a statistical decision rule which makes it possible to identify the population membership of an observation. The predicted class is chosen to maximize the posterior class probability given the observation. The main assumption of these methods concerns the distribution of the observations in each class, which is assumed Gaussian. We refer to Chapter 4
Fig. 3. Raw neural recording data and an example of spike sorting. (a) Real raw data recorded in the M1 area. (b) Three individual spikes detected from the raw data in (a). (c) Visualized result of spike sorting using principal component analysis. (d) All spike timestamps displayed for one trial of a single neural recording, and (f) the firing activity histograms around the time of the behavioral task. The bin width is 33 ms. The red line denotes the time at which the animal presses the lever.
of [3] for details on LDA and QDA. MDA, developed by Hastie and Tibshirani [16], is a generalization of LDA in which each class is modeled by a mixture of Gaussians. This modelling gives more flexibility in the classification rule than LDA and allows MDA to take into account heterogeneity within a class. Breiman et al. [5] and MacLachlan and Basford [18] have contributed to and tested this generative approach in many fields.
Let k be the number of classes. The proportion of each class in the population is π_i, with i = 1, . . . , k. Let π_{ir} be the proportion of subclass r = 1, . . . , R_i of class i and φ(x, θ_{i,r}) the normal probability density function of subclass r. We can then define the probability that an observation x belongs to class i as:
P(Z = i, X = x) = \pi_i f_i(x) = \pi_i \sum_{r=1}^{R_i} \pi_{ir}\, \phi(x, \theta_{i,r}),   (1)
where \theta_{i,r} = (\mu_{i,r}, \Sigma). The maximum-likelihood estimate of \pi_i is the proportion of class i in the learning set. The parameters \Sigma, \mu_{i,r} and \pi_{i,r} are estimated with the EM algorithm. At the q-th iteration, the parameters are:
\pi_{ir}^{(q)} = \frac{\sum_{j=1}^{n} I(z_j = i)\, p_{jr}^{(q)}}{\sum_{j=1}^{n} I(z_j = i)},   (2)
\hat{\mu}_{ir} = \frac{\sum_{j=1}^{n} x_j\, I(z_j = i)\, p_{jr}^{(q)}}{\sum_{j=1}^{n} I(z_j = i)\, p_{jr}^{(q)}},   (3)
\hat{\Sigma} = \frac{1}{n} \sum_{j=1}^{n} \sum_{r=1}^{R_i} p_{jr}\, (x_j - \hat{\mu}_{ir})(x_j - \hat{\mu}_{ir})^t.   (4)
The rule of Maximum A posteriori predicts the class of an observation x : the observa-
tion x belongs to the class which gives it the largest likelihood.
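A sketch of this MAP rule follows: given fitted mixture parameters, each class is scored by π_i Σ_r π_{ir} φ(x; μ_{ir}, Σ) as in (1), and the predicted class is the arg-max. The parameters below are placeholders; in the method they come from the EM updates (2)-(4).

import numpy as np
from scipy.stats import multivariate_normal

# Sketch of the MAP rule: score each class i by pi_i * sum_r pi_ir * phi(x; mu_ir, Sigma)
# and predict the arg-max. Parameters here are placeholders for the EM estimates.
def mda_predict(x, class_priors, subclass_weights, subclass_means, cov):
    scores = []
    for pi_i, w_i, mu_i in zip(class_priors, subclass_weights, subclass_means):
        dens = sum(w * multivariate_normal.pdf(x, mean=mu, cov=cov)
                   for w, mu in zip(w_i, mu_i))
        scores.append(pi_i * dens)
    return int(np.argmax(scores)) + 1          # class labels are 1..k

# Two classes, each modelled by two Gaussian subclasses sharing one covariance.
priors = [0.5, 0.5]
weights = [[0.5, 0.5], [0.7, 0.3]]
means = [[np.array([0.0, 0.0]), np.array([1.0, 0.0])],
         [np.array([4.0, 4.0]), np.array([5.0, 5.0])]]
print(mda_predict(np.array([4.5, 4.2]), priors, weights, means, cov=np.eye(2)))   # 2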
as few atoms as possible. On the one hand, discriminative analysis methods, such as LDA, are more suitable for classification tasks. On the other hand, discriminative methods are usually sensitive to corruption in the signals because they lack crucial properties for signal reconstruction. We propose here a method of sparse representation for signal classification, which modifies the standard sparse representation framework. We first show that replacing the reconstruction error with discrimination power in the objective function of the sparse representation is more suitable for classification tasks. When the signal is corrupted, the discriminative methods may fail because discriminative analysis contains little information with which to successfully deal with noise, missing data and outliers. Let the n-dimensional vector x be decomposed as a linear combination of the vectors a_i, i = 1, . . . , m. According to [19], the vectors a_i, i = 1, . . . , m are called atoms and they collectively
form a dictionary over
which the vector x is to be decomposed. We may write x = m i=1 si ai = As, where
A [a1 , . . . , am ] is the n × m dictionary (matrix) and s (s1 , . . . , sm )T is the m × 1
vector of coefficients. If m > n, the dictionary is overcomplete, and the decomposition
is not necessarily unique. However, the so called “sparse decomposition” (SD), that is,
a decomposition with as much zero coefficients as possible has recently found a lot of
attention in the literature because of its potential applications in many different areas.
For example, it is used in Compressive Sensing (CS) [9], underdetermined Sparse Com-
ponent Analysis (SCA) and source separation [14], decoding real field codes [6], image
deconvolution [12], image denoising [11], electromagnetic imaging and Direction of
Arrival (DOA) finding [13], and Face Recognition [27].
The sparse solution of the Underdetermined System of Linear Equations (USLE)
As = x is useful because it is unique under some conditions: Let spark(A) denote the
minimum number of columns of A which form a linearly dependent set [10]. Then, if the
USLE:
As = x (5)
has a solution s with fewer than (1/2) spark(A) non-zero components, it is the unique sparsest
solution [10,15]. As a special case, if every n × n sub-matrix of A is invertible (which is
called the Unique Representation Property or URP in [13]), then a solution of (5) with
less than (n + 1)/2 non-zero elements is the unique sparsest solution.
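As a side illustration (not part of the original text), spark(A) can be computed by brute force for small dictionaries, which makes the uniqueness condition easy to verify on toy examples such as the 3 × 4 dictionary discussed later in this section:

import numpy as np
from itertools import combinations

def spark(A, tol=1e-10):
    """Smallest number of columns of A that form a linearly dependent set."""
    n, m = A.shape
    for k in range(1, m + 1):
        for cols in combinations(range(m), k):
            if np.linalg.matrix_rank(A[:, list(cols)], tol=tol) < k:
                return k
    return m + 1  # all columns are linearly independent

# The 3 x 4 example dictionary used later is rank-deficient but has spark 3:
A = np.array([[1., 2., -1., 1.], [2., -1., 1., 0.], [3., 1., 0., 1.]])
print(np.linalg.matrix_rank(A), spark(A))   # prints: 2 3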
For finding the sparse solution of (5), one may search for a solution for which the ℓ0
norm of s, i.e. the number of non-zero components of s, is minimized. This is written
as:

$$\text{minimize } \sum_{i=1}^{m} |s_i|^0 \quad \text{subject to} \quad As = x \qquad (6)$$
Direct solution of this problem needs a combinatorial search and is NP-hard. Con-
sequently, many different algorithms have been proposed in recent years for finding
the sparse solution of (5). Some examples are Basis Pursuit (BP) [8], Smoothed ℓ0
(SL0) [20,21], and FOCUSS [13]. Many of these algorithms replace the ℓ0 norm in (6)
by another function of s, and solve the problem:

$$\text{minimize } f(s) \quad \text{subject to} \quad As = x \qquad (7)$$

For example, in BP, f(s) is the ℓ1 norm of s (i.e. \sum_{i=1}^{m} |s_i|); and in SL0, f(s) is a smoothed
measure of the ℓ0 norm.
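For instance, the ℓ1-relaxed problem solved by BP can be rewritten as a linear program by splitting s into non-negative parts; the sketch below uses SciPy's generic LP solver purely for illustration and is not the solver used in the cited works:

import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, x):
    """min ||s||_1 subject to As = x, solved as an LP with s = u - v, u, v >= 0."""
    n, m = A.shape
    c = np.ones(2 * m)                       # objective: sum(u) + sum(v) = ||s||_1
    A_eq = np.hstack([A, -A])                # A @ u - A @ v = x
    res = linprog(c, A_eq=A_eq, b_eq=x, bounds=[(0, None)] * (2 * m))
    u, v = res.x[:m], res.x[m:]
    return u - v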
However, to the best of our knowledge, in all of the previous works it is explicitly or
implicitly assumed that the dictionary matrix is full-rank. Note that having the URP is
a stricter condition than being full-rank; that is, a matrix which has the URP is full-rank, but
a full-rank matrix does not necessarily have the URP. Consider, however, a dictionary A which
is not full-rank (and hence does not have the URP), but for which spark(A) > 2. This dictionary may
still be useful for SD applications, because a solution of (5) with fewer than (1/2) spark(A)
non-zero components is still unique and is the sparsest solution. As an example, A =
[1, 2, −1, 1; 2, −1, 1, 0; 3, 1, 0, 1] is not full-rank (its third row is the sum of its first two
rows), but every two of its columns are linearly independent, and hence spark(A) = 3.
On the other hand, for a non-full-rank A, the system (5) does not even necessarily
admit a solution (that is, x cannot necessarily be stated as a linear combination of the
atoms ai, i = 1, . . . , m). For example, x = (1, 2, 3.1)^T cannot be stated as a linear com-
bination of the columns of the above-mentioned A because, contrary to A, its last
component is not the sum of its first two. In this case, all the algorithms based on (6) or (7)
will fail, because the solution set of (5) is empty. In effect, in this case, the ‘sparsest
solution’ of (5) has not even been defined, because (5) has no solution!
Non-full-rank dictionaries may be encountered in some applications. For example,
in SD-based classification [27], the idea is to express a new point x as a sparse linear
combination of all data points ai, i = 1, . . . , m, and to assign to x the class of the data points
ai which have the most influence on this representation. In this application, if, for example,
one of the features (one of the components of ai) can be written as a linear combination
of the other features for all the ‘data’ points ai, i = 1, . . . , m, then the dictionary A is
non-full-rank. If this is also true for the new point x, then we are in the case where (5) has
solutions but A is non-full-rank; if not, then (5) has no solution and our classifier
will fail to provide an output (based on most current SD algorithms).
For a non-full-rank overcomplete dictionary, one may propose to simply remove the
rows of A that depend on the other rows, and obtain a full-rank dictionary. This naive
approach is not desirable in many applications. In Compressive Sensing (CS) language,
this is like throwing away some of the measurements, which were useful, in the presence of
measurement noise, for a better estimation of s (recall that when estimating the
weight of an object from several available measurements, throwing away all but
one is not a good estimation method!).
In the next section, we generalize the definition of sparse decomposition to classifica-
tion (SDC) and modify the algorithm itself to directly cover non-full-rank dictionaries.
4 Definition of SDC
In linear algebra, when the linear system As = x is inconsistent (underdetermined
as well as overdetermined), one usually considers a Least Squares (LS) solution, that
is, a solution which minimizes ‖As − x‖, where ‖·‖ stands for the ℓ2 (Euclidean) norm
throughout the paper. Naturally, we define the sparse decomposition as a decomposition
x ≈ s1a1 + · · · + smam = As which has the sparsest s among all of the minimizers of
‖As − x‖. Denoting by S the set of these minimizers, by a sparse LS solution of (5) we
mean the s ∈ S which has the minimum number of non-zero components, that is:

$$\operatorname*{argmin}_{s} \sum_{i=1}^{m} |s_i|^0 \quad \text{subject to} \quad \|As - x\| \text{ is minimized} \qquad (8)$$

Note that the constraint As = x in (6) has been replaced by s ∈ S in (8). If (5) admits a
solution, S will be the set of solutions of (5), and the above definition is the same as (6).
Replacing the reconstruction error with the discrimination power quantified by
Fisher’s discrimination criterion used in LDA, the objective function (8), refocused
purely on classification, can be written as:

$$\operatorname*{argmin}_{s} \sum_{i=1}^{m} |s_i|^0 \quad \text{subject to} \quad S: \ \frac{\Sigma_B}{\Sigma_W} \text{ is maximized} \qquad (9)$$

where ΣB is the between-class scatter and ΣW is the within-class scatter. Fisher’s crite-
rion is motivated by the intuitive idea that the discrimination power is maximized when
the spatial distributions of different classes are as far apart as possible and the spatial
distribution of samples from the same class is as concentrated as possible. Solving (9)
generates a sparse representation that has good discrimination power.
The SL0 Algorithm. The SL0 algorithm [20,21] is an SD algorithm with two main features:
1) it is very fast, and 2) it tries to directly minimize the ℓ0 norm (and hence does not suffer
from replacing ℓ0 by asymptotic equivalents). The basic idea of SL0 is to use a smooth
measure of the ℓ0 norm, f(s), and to solve (7) by steepest descent. To take into account
the constraint As = x, each iteration of (the full-rank) SL0 is composed of:
– Minimization: s ← s − μ∇f
– Projection: project s onto {s | As = x}.
To extend SL0 to non-full-rank dictionaries, the projection step should be modified:
s should be projected onto S instead of {s | As = x}. This can be seen from the lemma
in [2].
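A minimal sketch of one such modified iteration is given below (Python/NumPy). The smoothing-parameter schedule and step sizes of the actual SL0 algorithm [20,21] are omitted; the only point illustrated is that the projection onto S can be written with the pseudoinverse, which remains well defined when A is not full-rank and reduces to the usual projection onto {s | As = x} when the system is consistent:

import numpy as np

def sl0_like_iteration(s, A, x, sigma, mu=1.0):
    # Gradient step on a smooth measure of the l0 norm,
    # f(s) = m - sum_i exp(-s_i^2 / (2 sigma^2))
    grad = (s / sigma ** 2) * np.exp(-s ** 2 / (2 * sigma ** 2))
    s = s - mu * grad
    # Projection onto S, the set of least-squares minimizers of ||As - x||
    s = s - np.linalg.pinv(A) @ (A @ s - x)
    return s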
5 Results
In this section, the classification is conducted on the experimental database. An exam-
ple of a scatterplot obtained after the classification is shown in Fig. 4. This classification
consisted of an MDA classification (Fig. 4a) and the application of the SDC algorithm,
resulting in classes 1, 2, 3 and 4. Classification with SDC is conducted with the decom-
position coefficients (the s in equation (8)) as features and a support vector machine
(SVM) as the classifier.
Table 1. Classification error rates at different levels of signal-to-noise ratio (SNR)

Method    no noise   5 dB     10 dB    20 dB
MDA       0.02620    0.0308   0.1774   0.2782
SDC       0.02773    0.0523   0.0720   0.1095
In this experiment, noise is added to the signals to test the robustness of SDC, with
increasing levels of energy. Table 1 summarizes the classification error rates obtained
at different SNRs.
Results in Table 1 show that when the signals are ideal (noiseless), MDA is
the best criterion for classification. This is consistent with the known conclusion that
discriminative methods outperform reconstructive methods in classification. However,
when the noise is increased, the accuracy based on MDA degrades faster than the accu-
racy based on SDC. This indicates that the signal structures recovered by the sparse
representation are more robust to noise and thus yield less performance degradation.
Fig. 4. Expected class assignment for the waveform dataset (a). Class assignments by SDC, visu-
alized w.r.t. parameters 1 and 3 (b).
6 Conclusion
References
1. Amirikian, B., Georgopoulus, A.P.: Motor Cortex: Coding and Decoding of Directional Op-
erations. In: The Handbook of Brain Theory and Neural Networks, pp. 690–695. MIT Press,
Cambridge (2003)
2. Babaie-Zadeh, M., Vigneron, V., Jutten, C.: Sparse decomposition over non-full-rank dictio-
naries. In: Proceedings of ICASSP 2009, Taipei, TW (April 2009)
3. Bishop, C.: Pattern recognition and machine learning. Springer, New York (2006)
4. Boudreau, M., Smith, A.M.: Activity in rostral motor cortex in response to predictable force-
pulse perturbations in a precision grip task. J. Neurophysiol. 86, 1079–1085 (2005)
5. Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and regression trees.
Wadsworth International Group, Belmont (1984)
6. Candès, E.J., Tao, T.: Decoding by linear programming. IEEE Transactions Information The-
ory 51(12), 4203–4215 (2005)
7. Dash, M., Liu, H.: Feature selection for classification. Intelligent Data Analysis 1, 131–156
(1997)
8. Donoho, D.L.: For most large underdetermined systems of linear equations the minimal l1 -
norm solution is also the sparsest solution. Technical report (2004)
9. Donoho, D.L.: Compressed sensing. IEEE Transactions on Information Theory 52(4), 1289–
1306 (2006)
10. Donoho, D.L., Elad, M.: Maximal sparsity representation via ℓ1 minimization. The Proc.
Nat. Aca. Sci. 100(5), 2197–2202 (2003)
11. Elad, M.: Why simple shrinkage is still relevant for redundant representations? IEEE Trans-
actions on Image Processing 52(12), 5559–5569 (2006)
12. Figueiredo, M.A.T., Nowak, R.D.: An EM algorithm for wavelet-based image restoration.
IEEE Transactions on Image Processing 12(8), 906–916 (2003)
13. Gorodnitsky, I.F., Rao, B.D.: Sparse signal reconstruction from limited data using FOCUSS,
a re-weighted minimum norm algorithm. IEEE Transactions on Signal Processing 45(3),
600–616 (1997)
14. Gribonval, R., Lesage, S.: A survey of sparse component analysis for blind source separation:
principles, perspectives, and new challenges. In: Proceedings of ESANN 2006, pp. 323–330
(April 2006)
15. Gribonval, R., Nielsen, M.: Sparse decompositions in unions of bases. IEEE Trans. Inform.
Theory 49(12), 3320–3325 (2003)
16. Hastie, T., Tibshirani, R.: Discriminant analysis by gaussian mixtures. Technical report, AT
& T Bell laboratories, Murray Hill, NJ (1994)
17. Kakei, S., Hoffman, D.S., Strick, P.L.: Muscle and movement representation in the primary
motor cortex. Science 285, 2136–2139 (1999)
18. MacLachlan, G., Basford, K.: Mixture models: inference and applications to clustering.
Marcel Dekker, New York (1988)
19. Mallat, S., Zhang, Z.: Matching pursuits with time-frequency dictionaries. IEEE Trans. on
Signal Proc. 41(12), 3397–3415 (1993)
20. Mohimani, G.H., Babaie-Zadeh, M., Jutten, C.: Fast sparse representation based on smoothed
ℓ0 norm. In: Davies, M.E., James, C.J., Abdallah, S.A., Plumbley, M.D. (eds.) ICA 2007.
LNCS, vol. 4666, pp. 389–396. Springer, Heidelberg (2007)
21. Mohimani, H., Babaie-Zadeh, M., Jutten, C.: A fast approach for overcomplete sparse de-
composition based on smoothed ℓ0 norm. Accepted in IEEE Trans. on Signal Processing
22. Morrow, M.M., Miller, L.E.: Prediction of muscle activity by populations of sequentially
recorded primary motor cortex neurons. J. Neurophysiol. 89, 1079–1085 (2003)
23. Schwartz, A.B., Taylor, D., Tillery, S.I.H.: Extraction algorithms for cortical control of arm
prosthesis. Current opinion in Neurobiology 11, 701–707 (2001)
24. Sergio, L.E., Kalaska, J.F.: Systematic changes in directional tuning of motor cortex cell
activity with hand location in the workspace during generation of static isometric forces in
constant spatial directions. J. Neurophysiol. 78, 1170–1174 (2005)
25. Taylor, D.M., Tillery, S.I.H., Schwartz, A.B.: Direct cortical control of 3d neuroprosthetic
devices. Science 296(7), 1829–1832 (2002)
26. Van Staveren, G.W., Buitenweg, J.R., Heida, T., Rutten, W.L.C.: Wave shape classification
of spontaneous neural activity in cortical cultures on micro-electrode arrays. In: Proceedings
of the second joint EMBS/BMES conference, Houston, TX, USA, October 23–26 (2002)
27. Wright, J., Yang, A.Y., Ganesh, A., Sastry, S.S., Ma, Y.: Robust face recognition via sparse
representation. IEEE Transaction on Pattern Analysis and Machine Intelligence (Accepted,
March 2008)
Pareto-Based Multi-output Metamodeling
with Active Learning
1 Introduction
Despite the rapid advances in High Performance Computing and multi-core ar-
chitectures, it is rarely feasible to explore a design space using high-fidelity computer
simulations. As a result, data based surrogate models (otherwise known as metamodels
or response surface models) have become a standard technique to reduce this computa-
tional burden and enable routine tasks such as visualization, design space exploration,
prototyping, sensitivity analysis, and optimization.
It is important to first stress that this paper is concerned with fully reproducing the
simulator behavior with a global model. The use of metamodels to approximate the
costly function for optimization (Metamodel Assisted Optimization) is not our goal.
Our objective is to construct a high fidelity approximation model that is as accurate as
possible over the complete design space of interest using as few simulation points as
possible (= active learning). This model can then be reused in other stages of the engi-
neering design pipeline, for example as cheap accurate replacement models in design
software packages (e.g., ADS Momentum).
In engineering design, simulators are typically modeled on a per-output basis. Each
output is modeled independently using separate models (though possibly sharing the
same data). Instead, the system may be modeled directly using multi-objective algo-
rithms while maintaining the tie-in with active learning (classically a fixed data set is
chosen up front). This benefits the practitioner by giving information about output cor-
relation, facilitating the generation of diverse ensembles (from the Pareto-optimal set),
and enabling the automatic selection of the best model type on the fly for each output
without having to resort to multiple runs. The purpose of this paper is to illustrate these
concepts by discussing possible use-cases and potential pitfalls.
3 Multi-objective Modeling
There are different ways to approach the global surrogate modeling problem in a multi-
objective manner. The most obvious is to use multiple criteria to drive the hyperpa-
rameter optimization instead of a single one. In this case the minimization problem in
equation 1 becomes a multi-objective one. This approach is useful because single crite-
ria are inadequate at objectively gauging the quality of an approximation model. This is
the so-called “five percent problem” [1], which always arises during approximation.
Secondly, a multi-objective approach is also useful if models with multiple outputs
are considered. It is not uncommon that a simulation engine has multiple outputs that
need to be modeled. Also, many Finite Element packages generate multiple perfor-
mance values for free. The direct approach is to model each output independently with
separate models (possibly sharing the same data). However, it is usually more compu-
tationally efficient to approximate the different responses together in the same model.
The question then is how to drive the hyperparameter optimization. Instead of sim-
ple weighted scalarization (which is usually done) a useful approach is to tackle the
problem directly in a multi-objective way. This avoids the weight assignment problem
and the resulting multi-objective hyperparameter trace gives information about how the
structures of the responses are correlated. This is particularly useful if the hyperparame-
ters can be given physically relevant interpretations.
In addition, in both cases (multi-criteria, multi-output) the final Pareto front enables
the generation of diverse ensembles, where the ensemble members consist of the (par-
tial) Pareto-optimal set (see also references in [2]). This way all the information in the
front can be used. Rainfall runoff modeling and model calibration in hydrology [3] are
examples where this is popular. Models are generated for different output flow compo-
nents and/or derivative measures and these are then combined into a weighted ensemble
or fuzzy committee. Finally, a Pareto-based approach to multi-output modeling also al-
lows integration with the automatic surrogate model type selection algorithm described
in [4]. This enables automatic selection of the best model type (Kriging, neural net-
works, support vector machines (SVM), etc.) for each output without having to resort
to multiple runs or compromising accuracy. While for this paper we are concerned with
the global case, this also applies to the local (optimization) case.
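As a concrete illustration of this last point (hypothetical code, not part of the toolbox described below), the Pareto-optimal models can be extracted from a matrix of per-output validation errors with a plain non-dominated filter, and the surviving models then used as ensemble members:

import numpy as np

def pareto_front(errors):
    """errors: (n_models, n_outputs) validation errors, lower is better.
    Returns the indices of the non-dominated (Pareto-optimal) models."""
    n = errors.shape[0]
    front = []
    for i in range(n):
        dominated = any(np.all(errors[j] <= errors[i]) and np.any(errors[j] < errors[i])
                        for j in range(n) if j != i)
        if not dominated:
            front.append(i)
    return front

# e.g. a diverse ensemble could be built from [models[i] for i in pareto_front(errors)]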
4 Related Work
Little work seems to have been done on multi-objective multi-output modeling, with
only some results for classification problems [11]. The link with active learning has
also, it seems, not yet been explored. A multi-objective approach also enables automatic
model type selection, both for the global and local (optimization) case. As Knowles
and Nakayama state in [10]: “Little is known about which types of model accord best with par-
ticular features of a landscape and, in any case, very little may be known to guide this
choice”. Thus an algorithm to automatically solve this problem is very useful [12].
This is also noticed by Voutchkov and Keane [8], who compare different surrogate models
for approximating each objective during optimization. They note that, in theory, their
approach allows the use of a different model type for each objective. However, such an
approach will still require an a priori model type selection choice and does not allow
for dynamic switching of the model type or the use of hybrid models.
5 Problems
We now discuss two problems to illustrate how multi-output problems can be modeled
directly using multi-objective algorithms: an analytic test function and a Low Noise
Amplifier (LNA). In addition to the results described here, more data and test cases can
be found in [1]. For results involving multiple criteria the reader is also referred to [1].
$$y_2(x) = -20 \cdot \exp\left(-0.2\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2}\right) - \exp\left(\frac{1}{d}\sum_{i=1}^{d} \cos(2\pi x_i)\right) + 20 + e \qquad (3)$$
So f(x1, x2) = [y1, y2] with xi ∈ [−2, 2] and d = 2. Readers may recognize these two
responses as representing the Rosenbrock and Ackley functions, two popular test func-
tions for optimization. Plots of both functions are shown in figure 1.
We chose this combined function since it is an archetypal example of how two out-
puts can differ in structure. Thus it should show a clear trade-off in the hyperparameter
space when modeled together (i.e., a model that accurately captures y1 will be poor at
capturing y2 and vice versa).
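For reference, the two responses can be written down directly for d = 2. The Ackley expression follows eq. (3); the Rosenbrock expression uses the standard constants, since its defining equation is not reproduced in this excerpt:

import numpy as np

def f(x1, x2, d=2):
    x = np.array([x1, x2])
    y1 = 100.0 * (x2 - x1 ** 2) ** 2 + (1.0 - x1) ** 2                  # Rosenbrock (standard form)
    y2 = (-20.0 * np.exp(-0.2 * np.sqrt(np.sum(x ** 2) / d))            # Ackley, eq. (3)
          - np.exp(np.sum(np.cos(2 * np.pi * x)) / d) + 20.0 + np.e)
    return y1, y2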
It is important to stress that we are not interested in optimizing these functions di-
rectly (as is usually done); rather, we are interested in reproducing them with a regres-
sion model (with two inputs and two outputs), using a minimal number of samples.
Thus the problem is effectively a dynamic multi-objective optimization problem in the
hyperparameter space.
Fig. 1. Surface plots of the two outputs y1 (Rosenbrock) and y2 (Ackley) as functions of x1 and x2
6 Experimental Setup
6.1 SUMO-Toolbox
As the experimental platform we used the SUrrogate MOdeling (SUMO) Toolbox v6.1.1.
The SUMO Toolbox [14] is an adaptive tool that integrates different modeling ap-
proaches and implements a fully automated, adaptive global surrogate model construc-
tion algorithm. Given a simulation engine the toolbox produces a surrogate model
within the time and accuracy constraints set by the user. Different plugins are supported:
model types (rational functions, Kriging, splines, etc.), hyperparameter optimization al-
gorithms (PSO, GA, simulated annealing, etc.), active learning (random, error based,
density based, etc.), and sample evaluation methods (local, on a cluster or grid). Com-
ponents can easily be added, removed or replaced by custom implementations.
The toolbox control flow is as follows: initially, a small set of samples is cho-
sen according to some experimental design. Based on this initial set, one or more surro-
gate models are constructed and their hyperparameters optimized according to a chosen
optimization algorithm (e.g., PSO). Models are assigned a score based on one or more
measures (e.g., cross validation) and the model parameter optimization continues until
no further improvement is possible. The models are then ranked according to their score
and new samples are selected based on the best performing models and the behavior of
the response (the exact criteria depend on the active learning algorithm used). The hy-
perparameter optimization process is continued or restarted intelligently and the whole
process repeats itself until one of the following three conditions is satisfied: (1) the max-
imum number of samples has been reached, (2) the maximum allowed time has been
exceeded, or (3) the user-required accuracy has been met. The SUMO-Toolbox and all
the algorithms described here are available from http://www.sumo.intec.ugent.be.
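This control flow can be caricatured by the toy loop below. It is illustrative only and is not the SUMO-Toolbox API: a simple polynomial model with leave-one-out scoring stands in for the model and hyperparameter-optimization plugins, and a largest-gap rule stands in for the active learning plugin:

import numpy as np

def loo_error(x, y, deg):
    """Leave-one-out error of a degree-`deg` polynomial fit (the model score)."""
    errs = []
    for i in range(len(x)):
        mask = np.arange(len(x)) != i
        p = np.poly1d(np.polyfit(x[mask], y[mask], deg))
        errs.append(abs(p(x[i]) - y[i]))
    return float(np.mean(errs))

def adaptive_fit(simulator, lb, ub, n_init=5, n_max=15, target=1e-2):
    x = np.linspace(lb, ub, n_init)                    # initial experimental design
    y = np.array([simulator(v) for v in x])
    while True:
        # "hyperparameter optimization": pick the polynomial degree with the best score
        scores = {d: loo_error(x, y, d) for d in range(1, min(len(x) - 1, 8))}
        best = min(scores, key=scores.get)
        # stopping criteria: sample budget reached or required accuracy met
        if len(x) >= n_max or scores[best] <= target:
            return np.poly1d(np.polyfit(x, y, best))
        # active learning: add a new sample in the largest gap (density-based)
        xs = np.sort(x)
        gap = int(np.argmax(np.diff(xs)))
        x_new = 0.5 * (xs[gap] + xs[gap + 1])
        x = np.append(x, x_new)
        y = np.append(y, simulator(x_new))

# e.g. model = adaptive_fit(np.sin, 0.0, 6.0)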
In a first use case for this problem the Kriging [15] and NSGA-II plugins were used.
The correlation parameters (θ ) represent models in the population. Following general
practice, the correlation function was set to Gaussian, and a linear regression was used.
Starting from an initial Latin Hypercube Design of 24 points, additional points are
added each iteration (using a density based active learning algorithm) up to a maxi-
mum of 150. The density based algorithm was used since it has been shown to work best with
Kriging models [16]. The search for good Kriging models (using NSGA-II) occurs be-
tween each sampling iteration with a population size of 30. The maximum number of
generations between each sampling iteration is also 30.
A second use case of the same problem was run using the automatic model type se-
lection plugin. This algorithm is based on heterogeneous evolution using the GA island
model and is able to select the best model type for a given data source. A full discussion
of this algorithm and its settings would consume too much space. Such information can
be found in [4]. The following model types were included in the evolution: Kriging
models, single layer ANNs (based on [17]), Radial Basis Function Neural Networks
(RBFNN), Least Squares SVMs (LS-SVM, based on [18]), and Rational functions. To-
gether with the ensemble models (which result from a heterogeneous crossover, e.g., a
crossover between a neural network and a rational function), this means that six model
types compete to fit the data. In this case, the multi-objective GA implementation
of the Matlab GADS toolbox is used (which is based on NSGA-II). The population
size of each model type is set to 10 and the maximum number of generations between
each sampling iteration is set to 15. Again, note that the evolution resumes after each
sampling iteration. In all cases model selection is done using the Root Relative Square
Error (RRSE) on a dense validation set.
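Assuming the usual definition of the RRSE (the root of the squared-error sum relative to that of a mean predictor), the measure can be computed as follows:

import numpy as np

def rrse(y_true, y_pred):
    """Root Relative Square Error: 1.0 means no better than predicting the mean."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.sum((y_pred - y_true) ** 2) /
                         np.sum((y_true - y_true.mean()) ** 2)))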
6.3 LNA
The same settings were used for this problem except that the sample selection loop was
switched off and LS-SVM models were used to fit the data (instead of Kriging). Instead
of sampling, a 12² full factorial design was used. More extensive results (including
sampling and more dimensions) will be presented in a separate publication.
7 Results
7.1 Analytic Function: Use Case 1
Two snapshots of the Pareto front at different numbers of samples are shown in figure 2.
The full Pareto trace (over all numbers of samples) is shown in figure 3(a). The figures
clearly show that the Pareto front changes as the number of samples increases. Thus
the multi-objective optimization of the hyperparameters is a dynamic problem, and the
Pareto front will change depending on the data. This change can be towards a stricter
trade-off (i.e., a less well defined ‘elbow’ in the front) or towards an easier trade-off (a
more defined ‘elbow’). What happens will depend on the model type.
Fig. 2. Two snapshots of the Pareto front (validation error on Ackley vs. Rosenbrock, RRSE) during
the model parameter optimization at different sampling iterations (AF)
Fig. 3. Pareto traces of validation error on Ackley vs. Rosenbrock (RRSE): (a) with sample
selection (up to 150 samples); (b) without sampling (brute force hyperparameter search at 124
samples), showing the first Pareto front and the first 120 fronts
Fig. 4. Kriging θ-surface with the NSGA-II search trace (AF, left: Rosenbrock, right: Ackley)
From the figure it is also immediately clear that the Rosenbrock output is much easier
to approximate than the Ackley output. Strangely though, there seems to be a disconti-
nuity in the front. The Pareto front is split into two parts and as sampling proceeds the
algorithm (NSGA-II) oscillates between extending the left front over the right front (or
vice versa). The full Pareto trace in figure 3(a) also shows this.
To understand what is causing this behavior, a brute force search of the hyperparame-
ter space was performed for a fixed LHD of 124 sample points. The space of all possible
θ parameters was searched on a 100×100 grid with bounds [−4, 3] (in log10 space) in
each dimension. Simultaneously an extensive NSGA-II Kriging run was performed on
the same data set for 450 generations. In both cases a dense validation set was used
to calculate the accuracy of each model. The combination of both searches (for both
outputs) is shown in figure 4 (note that the RRSE is in log scale). The brute force search
of the θ -surface also allows the calculation of the true Pareto front (by performing a
non-dominated sorting). The resulting Pareto front (together with the next 119 fronts,
shown for clarity) is shown in figure 3(b).
Studying the surfaces in figure 4 reveals what one would expect intuitively: the
Rosenbrock output is very smooth and easy to fit, so given sufficient data a large range
of θ values will produce an accurate fit. Fitting the Ackley output, on the other hand,
requires a much more specific choice of θ to obtain a good fit. In addition, the two basins
of attraction do not overlap, leading to two distinct optima. This means that (confirming
intuition) the θ -value that produces a good model for y1 produces a poor model for y2
and vice versa. Together with figure 3(b) this explains the separate Pareto fronts seen
in figure 2. The high ridge in the Ackley surface means that there are no compromise
solutions on the first front. Any model whose performance on y2 would lie between the
two separate fronts would never perform well enough on y1 to justify a place on the
first front. Thus, the fact that NSGA-II does not find a single united front is due to
the structure of the optimization surface and not due to a limitation of NSGA-II itself.
The analytic problem was also tackled with the automatic model type selection al-
gorithm described in [4]. This should enable the automatic identification of the most
Fig. 5. Heterogeneous Pareto search trace (x-axis: ValidationSet score on Rosenbrock; model
types: Ensemble, Kriging, ANN, RBFNN)
adequate model type for each output without having to perform separate runs. Figure 5
shows the full Pareto search trace for the test function across all sampling iterations.
The figure shows the same general structure as figure 3(a): there is a strong trade-off
between both outputs, resulting in a gap in the search trace. If we examine the model selec-
tion results, we find they agree with intuition. The Rosenbrock function is very easily fit
with rational functions, and its smooth structure makes for an excellent approximation
with almost any degree assignment (illustrated by the straight line at roughly 10^-7 on
the x-axis). However, those same rational models are unable to produce a satisfactory
fit on the Ackley function, which is far more sensitive to the choice of hyperparameters.
Instead the LS-SVM models perform best, the RBF kernel function matching up nicely
with the different ‘bumps’ of the Ackley function.
Thus, we find the results to agree with intuition. The Rosenbrock function is better
approximated with a global model since there are no local non-linearities, while the
Ackley function benefits more from a local model; this is exactly what figure 5
shows. Note that the diverse front shown in the figure now allows the generation of a diverse
ensemble (further improving accuracy) using standard ensemble techniques (bagging,
weighted, etc.).
Fig. 6. SVM (c, σ)-surface with the NSGA-II search trace (LNA, left: P, right: IIP3)
(a) SVM (c, σ)-optimization search trace and Pareto front (LNA); (b) heterogeneous Pareto trace
(LNA). Axes: ValidationSet score on P vs. ValidationSet score on IIP3.
data distribution (see also the discussion in [16]). On the other hand, it is usually quite
easy to generate an LS-SVM that captures the trends in the data without being too sen-
sitive to the data distribution. An added benefit of SVM type models is that the number
of hyperparameters is independent of the problem dimensionality (unlike Kriging).
The use of metamodels to aid design space exploration, visualization, etc. has become
standard practice among scientists and engineers alike. In order to increase insight and
save time when dealing with multi-response systems, the goal of this paper was to
illustrate that a multi-objective approach to global surrogate model generation can be
useful. This allows multiple outputs to be modeled together, giving information about
the trade-off in the hyperparameter space. It further enables selecting the best model
type for each output on the fly, permits the generation of diverse ensembles, and allows the use
of multiple criteria, all for roughly the same computational cost as performing multiple
independent runs (which is still outweighed by the simulation cost).
While this paper presented some interesting use cases, much work still remains. An
important area requiring further investigation is understanding how the iterative sample
selection process influences the hyperparameter optimization landscape. For the tests
in this paper the authors have simply let the optimization continue from the previous
generation. However, some initial tests have shown that an intelligent restart strategy
can improve results. Knowledge of how the number and distribution of data points
affects the hyperparameter surface would allow for a better tracking of the optimum,
reducing the cost further. The influence of noise on the hyperparameter optimization
(e.g., neural network topology selection) also remains an issue, as is the extension to
high dimensional problems (many outputs/criteria). In general, while progress towards
dynamic multi-objective optimization has been made, this is a topic that current research
in multi-objective surrogate modeling is only just coming to terms with [10].
Acknowledgments
The authors would like to thank Jeroen Croon from the NXP-TSMC Research Center,
Device Modeling Department, Eindhoven, The Netherlands for the LNA code.
References
1. Gorissen, D., Couckuyt, I., Dhaene, T.: Multiobjective global surrogate modeling. Technical
Report TR-08-08, University of Antwerp, Middelheimlaan 1, 2020 Antwerp, Belgium (2008)
2. Jin, Y., Sendhoff, B.: Pareto-based multiobjective machine learning: An overview and case
studies. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and
Reviews 38(3), 397–415 (2008)
3. Fenicia, F., Solomatine, D.P., Savenije, H.H.G., Matgen, P.: Soft combination of local models
in a multi-objective framework. Hydrology and Earth System Sciences Discussions 4(1), 91–
123 (2007)
4. Gorissen, D., De Tommasi, L., Croon, J., Dhaene, T.: Automatic model type selection with
heterogeneous evolution: An application to rf circuit block modeling. In: Proceedings of the
IEEE Congress on Evolutionary Computation, WCCI 2008, Hong Kong (2008)
5. Mierswa, I.: Controlling overfitting with multi-objective support vector machines. In:
GECCO 2007: Proceedings of the 9th annual conference on Genetic and evolutionary com-
putation, pp. 1830–1837. ACM Press, New York (2007)
6. Fieldsend, J.E.: Multi-objective supervised learning. In: Knowles, J., Corne, D., Deb, K.
(eds.) Multiobjective Problem Solving from Nature: From Concepts to Applications. Natural
Computing Series. LNCS. Springer, Heidelberg (2008)
7. Knowles, J.: Parego: A hybrid algorithm with on-line landscape approximation for expen-
sive multiobjective optimization problems. IEEE Transactions on Evolutionary Computa-
tion 10(1), 50–66 (2006)
8. Voutchkov, I., Keane, A.: Multiobjective Optimization using Surrogates. In: Parmee, I. (ed.)
Adaptive Computing in Design and Manufacture 2006. Proceedings of the Seventh Interna-
tional Conference, Bristol, UK, pp. 167–175 (April 2006)
9. Keane, A.J.: Statistical improvement criteria for use in multiobjective design optimization.
AIAA Journal 44(4), 879–891 (2006)
10. Knowles, J.D., Nakayama, H.: Meta-modeling in multiobjective optimization. In: Branke, J.,
Deb, K., Miettinen, K., Słowiński, R. (eds.) Multiobjective Optimization. LNCS, vol. 5252,
pp. 245–284. Springer, Heidelberg (2008)
11. Last, M.: Multi-objective classification with info-fuzzy networks. In: Boulicaut, J.-F., Es-
posito, F., Giannotti, F., Pedreschi, D. (eds.) ECML 2004. LNCS, vol. 3201, pp. 239–249.
Springer, Heidelberg (2004)
12. Keys, A.C., Rees, L.P., Greenwood, A.G.: Performance measures for selection of metamod-
els to be used in simulation optimization. Decision Sciences 33, 31–58 (2007)
13. Lee, T.: The Design of CMOS Radio-Frequency Integrated Circuits, 2nd edn. Cambridge
University Press, Cambridge (2003)
14. Gorissen, D., De Tommasi, L., Crombecq, K., Dhaene, T.: Sequential modeling of a low
noise amplifier with neural networks and active learning. Neural Computing and Applica-
tions 18(5), 485–494 (2009)
15. Lophaven, S.N., Nielsen, H.B., Søndergaard, J.: Aspects of the matlab toolbox DACE. Tech-
nical report, Informatics and Mathematical Modelling, Technical University of Denmark,
DTU, Richard Petersens Plads, Building 321, DK-2800 Kgs. Lyngby (2002)
16. Gorissen, D., De Tommasi, L., Hendrickx, W., Croon, J., Dhaene, T.: Rf circuit block mod-
eling via kriging surrogates. In: Proceedings of the 17th International Conference on Mi-
crowaves, Radar and Wireless Communications, MIKON 2008 (2008)
17. Nørgaard, M., Ravn, O., Hansen, L., Poulsen, N.: The NNSYSID toolbox. In: IEEE In-
ternational Symposium on Computer-Aided Control Systems Design (CACSD), Dearborn,
Michigan, USA, pp. 374–379 (1996)
18. Suykens, J.A.K., Gestel, T.V., Brabanter, J.D., Moor, B.D., Vandewalle, J.: Least Squares
Support Vector Machines. World Scientific Publishing Co., Pte, Ltd., Singapore (2002)
Isolating Stock Prices Variation with Neural Networks
1
University of East London, 4-6 University Way, London E16 2RD, UK
2
Department of Multimedia and Graphic Arts, Cyprus University of Technology, 31
Archbishop Kyprianos Street, P. O. Box 50329, 3603 Lemesos, Cyprus
3
Department of Computer Science, University of Cyprus, 75 Kallipoleos Avenue,
P.O. Box 20537, 1678 Nicosia, Cyprus
c.draganova@uel.ac.uk, andreas.lanitis@cut.ac.cy,
cchrist@cs.ucy.ac.cy
Abstract. In this study we aim to define a mapping function that relates the
general index value among a set of shares to the prices of individual shares. In
more general terms, this is the problem of defining the relationship between multi-
variate data distributions and a specific source of variation within these distribu-
tions where the source of variation in question represents a quantity of interest
related to a particular problem domain. In this respect we aim to learn a com-
plex mapping function that can be used for mapping different values of the
quantity of interest to typical novel samples of the distribution. In our investiga-
tion we compare the performance of standard neural network based methods
like Multilayer Perceptrons (MLPs) and Radial Basis Functions (RBFs) as well
as Mixture Density Networks (MDNs) and a latent variable method, the Generative
Topographic Mapping (GTM). According to the results, MLPs and RBFs
outperform MDNs and the GTM for this one-to-many mapping problem.
1 Introduction
mapping function. With our work we aim to investigate the use of different methods
for defining a mapping associating a specific source of variation within a distribution
and a given representation of this data distribution.
As part of our performance evaluation framework we assess the performance of
different one-to-many mapping methods in a case study related to the definition of the
relationship between the index value of twenty stocks included in the FTSE 100 UK
(www.ftse.com/Indices/UK_Indices/index.jsp) and the daily individual stock prices
over a three year time period. We implement and test methods that learn the relation-
ship between the daily general index value and the corresponding individual daily
stock prices of twenty of the FTSE 100 UK stocks with largest volume that have
available data for at least three consecutive years. Once the mapping is learned we
attempt to predict the daily stock prices of each share given the value of the general
index. This application can be very useful for predicting the prices of individual share
prices based on a given value of the daily index.
As part of our experimental evaluation process we investigate the following neural
network-based methods: Multilayer Perceptron (MLP) [1], Radial Basis Functions
[2], Mixture Density Networks (MDN) [3, 4] and the non-linear latent variable
method Generative Topographic Mapping (GTM) [5]. As a reference benchmark of
the prediction accuracy we consider the values of the predicted variables that corre-
spond to the average values over certain intervals of the quantity of interest that we
are trying to isolate (the so-called Sample Average (SA) method). The SA method
provides an easy way to estimate the most typical values of each stock, for a given
index value. However, the SA method does not take into account the variance of stock
prices hence it cannot be regarded as an optimum method for this problem.
The rest of the paper is organised as follows: in section 2 we present an overview
of the relevant literature; in section 3 we describe the case study under investigation,
the experiments and give visual and quantitative results and in section 4 we present
our conclusions.
2 Literature Review
There exist well-established neural network methods for solving the mapping approxima-
tion problem such as the Multilayer Perceptron (MLP) [1] and Radial Basis Functions
(RBF) [2]. The aim of the training in these methods is to minimize a sum-of-square error
function so that the outputs produced by the trained networks approximate the average of
the target data, conditioned on the input vector [3]. It is reported in [4] and [6] that these
conditional averages may not provide a complete description of the target variables, espe-
cially for problems in which the mapping to be learned is multi-valued and the aim is to
model the conditional probability distributions of the target variables [4]. In our case,
despite the fact that we have a multi-valued mapping, we aim to model the conditional
averages of the target data, conditioned on the input that represents a source of variation
within this distribution. The idea is that when we change the value of the parameter
representing the source of variation in the allowed range, the mapping that is defined
will give a typical representation of the target parameters exhibiting the isolated source of
variation.
Bishop [3, 4] introduces a new class of neural network models called Mixture Den-
sity Networks (MDN), which combine a conventional neural network with a mixture
density model. The mixture density networks can represent in theory an arbitrary
conditional probability distribution, which provides a complete description of target
data conditioned on the input vector and may be used to predict the outputs corre-
sponding to new input vectors. Practical applications of feed forward MLP and MDN
to the acoustic-to-articulatory mapping inversion problem are considered in [6]. In
this paper, it is reported that the performance of the feed-forward MLP is comparable
with results of other inversion methods, but that it is limited to modelling points ap-
proximating a unimodal Gaussian. In addition, according to [6], the MLP does not
give an indication of the variance of the distribution of the target points around the
conditional average. In the problems considered in [4] and [6], the modality of
the distribution of the target data is known in advance and this is used in selecting the
number of the mixture components of the MDN.
Other methods that deal with the problem of mapping inversion and in particular
mapping of a space with a smaller dimension to a target space with a higher dimen-
sion are based on latent variable models [7]. Latent variables refer to variables that
are not directly observed or measured but can be inferred using a mathematical model
and the available data from observations. Latent variables are also known as hidden
variables or model parameters. The goal of a latent variable model is to find a repre-
sentation for the distribution of the data in the higher dimensional data space in terms
of a number of latent variables forming a smaller dimensional latent variable space.
An example of a latent variable model is the well-known factor analysis, which is
based on a linear transformation between the latent space and the data space [3]. The
Generative Topographic Mapping (GTM) [5] is a non-linear latent variable method
using a feed-forward neural network for the mapping of the points in the latent space
into the corresponding points in the data space and the parameters of the model are
determined using the Expectation-Maximization (EM) algorithm [8]. The practical
implementation of the GTM has two potential problems: the dimension of the latent
space has to be fixed in advance and the computational cost grows exponentially with
the dimension of the latent space [9].
Density networks [10] are probabilistic models similar to the GTM. The relation-
ship between the latent inputs and the observable data is implemented using a multi-
layer perceptron and trained by Monte Carlo methods. The density networks have
been applied to the problem of modelling a protein family [10]. The biggest disad-
vantage of the density networks is the use of the computer-intensive sampling Monte
Carlo methods, which do not scale well when the dimensionality is increased.
Even though the problem we consider in this paper bears similarities with the problem
of sensitivity analysis with respect to neural networks, there are also distinct differences.
In sensitivity analysis the significance of a single input feature to the output of a trained
neural network is studied by applying that input, while keeping the rest of the inputs
fixed and observing how sensitive the output is to that input feature (see for example [11]
and references therein). In the problem investigated in this paper, we do not have a
trained neural network, but the index based on the values of 20 stocks. Based on our
knowledge of the application, i.e., the index, we isolate a specific source of variation and
carry out a one-to-many mapping between that isolated source and the model (which is
a multivariate data distribution). More specifically, the model refers to all the 20 stock
values. This allows us to analyse the variation of the isolated source within the model.
For the experiments related to stock price prediction described in this paper we have
used the historical daily prices available at uk.finance.yahoo.com. Twenty stocks have
been selected from those that have the largest volume and that have their daily prices
between 16/12/2003 and 10/12/2007. More precisely, the set of selected stocks includes:
BA, BARC, BLT, BP, BT, CW, FP, HBOS, HSBA, ITV, KGF, LGEN, LLOY,
MRW, OML, PRU, RBS, RSA, TSCO and VOD. The daily index values for these 20
stocks have been calculated using the method described in [12] with a starting index
point set to 1000.
We first train neural network models using the MLP with the scaled conjugate gradi-
ent algorithm [13], and the RBF and MDN methods. The inputs for the neural network
model are the numerical values of the daily general index value and the output corre-
sponds to the prices of the 20 stocks. In the MLP model, the network has one input
node, one hidden layer with hyperbolic tangent (tanh) activation function and an out-
put layer with linear activation function, since the problem we consider is a regression
problem. In the case of RBF similarly to the MLP, the input layer has one node, the
output layer has linear outputs and the hidden layer consists of nodes (centres) with
Gaussian basis functions. The Gaussian basis function centres and their widths are
optimised by treating the basis functions as a mixture model and using the Expecta-
tion-Maximisation (EM) algorithm for finding these parameters. The number of hid-
den nodes in the MLP and RBF networks and the learning rate in the MLP network
are set empirically. We also set empirically the number of hidden nodes and kernel
functions (mixture components) in the MDN model. Theoretically by choosing a
mixture model with a sufficient number of kernel functions and a neural network with
a sufficient number of hidden units, the MDN can approximate as closely as desired
any conditional density. In the case of discrete multi-valued mappings the number of
kernel functions should be at least equal to the maximum number of branches of the
mapping. We performed experiments using up to five kernel functions.
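A comparable single-input, multi-output regression network can be set up as sketched below. This is not the original implementation: scikit-learn is used here for brevity, L-BFGS is substituted for the scaled conjugate gradient solver (which scikit-learn does not provide), and the data arrays are random placeholders:

import numpy as np
from sklearn.neural_network import MLPRegressor

# placeholder data: X holds daily index values, Y the corresponding 20 stock prices
rng = np.random.default_rng(0)
X = rng.uniform(715.1, 1044.8, size=(1000, 1))
Y = rng.uniform(1.0, 100.0, size=(1000, 20))

mlp = MLPRegressor(hidden_layer_sizes=(10,), activation='tanh',
                   solver='lbfgs', max_iter=2000)       # linear output layer is implicit
mlp.fit(X, Y)
print(mlp.predict([[900.0]]).shape)   # (1, 20): one predicted price per stock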
The SA method is applied by calculating the average vectors of the 20 stock prices
corresponding to the index value in fifty equal subintervals between 715.1 and
1044.8, which are the minimum and the maximum value of the general index.
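A compact sketch of this baseline (NumPy; the array names are hypothetical) that averages the 20-dimensional daily price vectors within each of the fifty index subintervals:

import numpy as np

def sample_average(index_values, prices, n_bins=50, lo=715.1, hi=1044.8):
    """prices: (n_days, 20). Returns one average 20-dimensional price vector per
    index subinterval (NaN for empty subintervals)."""
    edges = np.linspace(lo, hi, n_bins + 1)
    bins = np.clip(np.digitize(index_values, edges) - 1, 0, n_bins - 1)
    return np.array([prices[bins == b].mean(axis=0) if np.any(bins == b)
                     else np.full(prices.shape[1], np.nan) for b in range(n_bins)])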
We have also carried out an experiment for isolating one latent variable using the
GTM. The GTM models consist of an RBF non-linear mapping of the latent space
density to a mixture of Gaussians in the data space of parameters (20 stock prices).
The models are trained using the EM algorithm. After training we use the RBF mapping
to obtain the parameters corresponding to several values of the latent variable. We
show the variation of the latent variable reflected on the stock prices.
3.2 Results
Table 1 presents the quantitative results for each method used, expressed as the
mean error between actual and predicted stock prices over the considered period of
time. The mean error was calculated as: mean error = (1/n) Σ_{i=1}^{n} |y_i − a_i|, where
y_i is the predicted and a_i the actual price of the shares, and n is the total number of days
over which the share prices are predicted.
Figure 1 illustrates the graphical results for the actual and the model output prices
of one of the stocks (ITV), obtained with the SA, MLP, RBF, MDN and GTM meth-
ods. These graphical results show the variation of the index value reflected on the
prices of the stock in question.
The graphical and quantitative results corresponding to the SA, MLP and RBF
models are comparable. The results obtained with the MDN method did not produce a
better representation of the data, which again can be explained by the large dimen-
sionality of the problem. In the case of the GTM method, although the quantitative
results are worse than those obtained with the other methods, the graphical results
show that the general trend of the actual prices is captured, demonstrating therefore
the potential of the GTM method for modeling the distribution of the stock prices in
terms of one latent variable.
Table 1. Mean error between actual and predicted prices of the 20 listed shares (see text for
details) with different methods
Share    MLP      SA       RBF      MDN      GTM
BA 45.51 45.94 49.66 57.25 84.40
BARC 40.72 41.36 43.94 58.45 73.96
BLT 143.26 145.12 157.12 191.11 266.71
BP 41.46 41.93 46.16 55.64 74.51
BT 19.07 19.64 24.32 26.09 41.09
CW 14.29 14.68 17.84 20.59 27.12
FP 14.58 14.69 15.01 21.71 22.76
HBOS 67.60 68.22 70.32 99.54 121.99
HSBA 30.85 30.56 31.79 35.03 51.85
ITV 4.73 4.65 5.01 6.57 7.62
KGF 18.51 18.13 19.07 27.97 32.15
LGEN 9.62 9.80 10.88 13.39 17.88
LLOY 27.90 28.40 30.38 35.77 49.97
MRW 22.99 23.45 28.55 31.13 41.92
OML 15.32 15.47 15.65 18.63 27.73
PRU 49.34 50.07 56.36 58.32 93.83
RBS 25.93 26.13 27.70 43.30 44.50
RSA 13.17 13.03 13.69 15.44 28.95
TSCO 28.21 28.16 33.01 37.29 55.30
VOD 7.91 7.65 8.36 11.70 15.01
Total 32.05 32.35 35.24 43.25 58.96
Fig. 1. Sample graphical result for the variation of the index value reflected on the ITV stock
price; the solid lines and the scattered dots indicate the predicted and actual stock prices respec-
tively corresponding to the index values
4 Conclusions
In this paper we investigate the use of a number of different techniques in the task of
finding the mapping between a specific source of variation within a multivariate data
distribution and the multivariate data distribution itself. The source of variation repre-
sents a quantity of interest related to a given problem domain. More specifically, we aim
to define a mapping function relating the general index value among a set of shares to
the prices of individual shares. We look for a mapping which gives a typical repre-
sentation of the data distribution that exhibits the variation of the specific quantity. In
this mapping, the target space has a higher dimension than the input space and for one
input value the target output value is not unique. This leads to finding a one-to-many,
multi-valued mapping. More specifically, we investigate several well-known methods
used for solving such problems including MLP, RBF, MDN and GTM.
The results of our experiments demonstrate the potential of using neural networks
trained with the MLP and RBF methods for isolating sources of variation and generating
typical representations of the corresponding data distributions in the considered case study.
With the neural network approach we do not make any assumptions about the mapping
function; the neural networks learn the complex mapping between the desired at-
tributes and the parameters related to the specific application. The quantitative results ob-
tained with the MLP and RBF are similar. The best result is achieved with the MLP
method. The graphical results obtained with these methods are also similar. The MLP and
RBF methods give the conditional averages of the target data conditioned on the input
vectors and, as expected, they do not give a complete description of the target data, as reported
in [4, 6]. For the problem we are addressing, it is sufficient to define a mapping that gener-
ates typical samples of the data distribution (and not its entire variance) given specific
values of the desired source of variation. This makes our results with the MLP and RBF
(which are relatively simple methods, compared to MDN and GTM) not only acceptable
but quite good for the type of inversion problems we are addressing, compared to the MLP
results for the acoustic-to-articulatory inversion mapping reported in [14], [6]. For that
one-to-many problem, and also for the problem of reconstructing the same spec-
trum from different spectral line parameters [4], the entire variance of the distribution is
required. It has to be noted also that for the one-to-many problems considered in our paper
the training algorithm for the MLP does not have to be modified as suggested by Brouwer
[15], which would result in increased complexity. To the best of our knowledge, RBFs have not
previously been specifically used for one-to-many problems.
The MDN [3, 4] can give a complete description of the target data conditioned on the
input vector provided that the number of mixture components is at least equal to the
maximum number of branches of the mapping. The experiments carried out in the consid-
ered case study demonstrate that in problems for which the modality of the distribution of
the target data is very large and not known, the application of the MDN leads to a large
number of mixture components and outputs. Therefore it does not provide the desired type
of mapping, which explains the poor results obtained with the MDN method. In particular,
these results do not show the desired variation of the quantity of interest in the stock prices
case study, nor do they give a complete representation of the target data space.
The GTM [5] is used to map points in the latent variable interval to points in the
target data space and our results with this method actually show this mapping of one
latent variable to the corresponding data space. It has to be noted, though, that in
order to isolate a specific source of variation, we need to find all latent variables
which leads to a problem that is not computationally feasible [9]. In the stock price
case study the isolated latent variable might be a different source of variation and not
necessarily the desired index value.
The framework presented in the paper can be potentially useful in various applica-
tions involving multivariate distributions. With regards to the stock prices application,
it is well established that trends of the general index can be easily predicted, unlike
price fluctuations at share level. Therefore the ability to infer individual stock prices
based on the general index value can be an invaluable tool for efficient portfolio man-
agement and prediction of the behavior of individual shares.
References
1. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Nature 323, 533–536 (1986)
2. Powell, M.J.D.: Radial basis functions for multivariable interpolation: A review. In: IMA
Conference on Algorithms for the approximation of Functions and Data, pp. 143–167.
RMCS, Shrivenham (1985)
3. Bishop, C.M.: Neural Networks for Pattern Recognition. Oxford University Press, New
York (1995)
4. Bishop, C.M.: Mixture Density Networks. Technical Report NCRG/94/004, Neural Com-
puting Research Group, Aston University (1994),
http://research.microsoft.com/~cmbishop/downloads/
Bishop-NCRG-94-004.ps
5. Bishop, C.M., Svensén, M., Williams, C.K.I.: GTM: The Generative Topographic Map-
ping. Neural Computation 10(1), 215–234 (1998)
6. Richmond, K.: Mixture Density Networks, Human articulatory data and acoustic-to-
articulatory inversion of continuous speech (2001),
http://www.cstr.ed.ac.uk/downloads/publications/2001/
Richmond_2001_a.ps
7. Bartholomew, D.J.: Latent Variable Models and Factor Analysis. Charles Griffin & Com-
pany Ltd., London (1987)
8. Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM
algorithm. Journal of the Royal Statistical Society. Series B 39(1), 1–38 (1977)
9. Carreira-Perpinan, M.A.: One-to-many mappings, continuity constraints and latent variable
models. In: Proc. IEE Colloquium on Applied Statistical Pattern Recognition, Birming-
ham, pp. 14/1–14/6 (1999)
10. MacKay, D.J.C., Gibbs, M.N.: Density networks. In: Proceedings of Society for General
Microbiology, Edinburgh (1997)
11. Zeng, X., Yeung, D.S.: A Quantified Sensitivity Measure for Multilayer Perceptron to In-
put Perturbation. Neural Computation 15, 183–212 (2003)
12. FTSE Guide to UK Calculation Methods,
http://www.ftse.com/Indices/UK_Indices/Downloads/
uk_calculation.pdf#
13. Møller, M.F.: A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks 6(4), 525–533 (1993)
14. Carreira-Perpiñán, M.Á.: Continuous latent variable models for dimensionality reduction
and sequential data reconstruction. PhD thesis, Dept. of Computer Science, University of
Sheffield, UK (2001),
http://faculty.ucmerced.edu/mcarreira-perpinan/papers/
phd-thesis.html
15. Brouwer, R.K.: Feed-forward neural network for one-to-many mappings using fuzzy sets.
Neurocomputing 57, 345–360 (2004)
Evolutionary Ranking on Multiple Word Correction
Algorithms Using Neural Network Approach
1 Introduction
Computer users inevitably make typing mistakes. These may appear as spelling errors, prolonged key presses, adjacent key press errors, etc. [1]. Multiple solutions such as Metaphone [2] and n-grams [3] have been developed to correct users' typing mistakes, and each of them may have its own unique features. However, no single one of them can be identified as optimal. Therefore, it is desirable to develop a hybrid solution based on combining these technologies, one that puts the merits of those distinct solutions together.
Moreover, a single function rarely generates just one answer, and multiple functions together produce an even larger list of suggestions. This requires an evolutionary and adjustable approach to prioritize the suggestions in this list. Also, the answers may change in different contexts, and the solutions are required to evolve based on user feedback. Therefore, this research is motivated by the requirement of combining distinct word correction algorithms and subsequently producing an optimal prediction based on dataset updates and a neural network learning process.
The right side of the arrow shows the words' phonetic keys. Let's assume that a user in-
tends to type a word ‘hello’ but mistakenly typed ‘hallo’ instead, whose phonetic keys
are identical. Subsequently, the system is able to index and retrieve possible words
from the database based on the phonetic key and present them to a user for selection.
Levenshtein distance is another function that needs to be explored. It is based on the calculation of the minimum number of operations required to transform one string into another, where an operation is an insertion, deletion, or substitution of a single character.
After a comparison with each string stored in the memory, the pair with the least
distance can be considered as having the highest similarity, and then the one or the
group with the least distance can be presented through the user interface module.
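As a minimal illustration of this calculation (not the authors' implementation), a dynamic-programming edit distance in Python could look like the following; the small word list is purely hypothetical:

def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions or substitutions turning a into b."""
    prev = list(range(len(b) + 1))              # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                              # distance from a prefix of a to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# The stored word with the smallest distance is the best suggestion.
words = ["shall", "shell", "all", "hello"]
typed = "sahll"
print(min(words, key=lambda w: levenshtein(typed, w)))  # "shall" (ties with "all" at distance 2)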
A word 2-gram (i.e. word digram or word bigram) is a group of two consecutive words, and is very commonly used as the basis for simple statistical analysis of text.
For instance, given a sentence, ‘I am a student’, some word 2-gram samples are,
I am
am a
For example, under ideal conditions, 'am' can be predicted if its predecessor 'I' is typed. The predicted word 'am' can then be used to make sure that the user types it correctly.
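A toy sketch of this 2-gram prediction, assuming a small hypothetical count table rather than the paper's Access database, might be:

from collections import defaultdict

# hypothetical bigram counts, as would be accumulated from the user's typing history
bigram_counts = defaultdict(int)
for w1, w2 in [("I", "am"), ("I", "am"), ("am", "a"), ("a", "student")]:
    bigram_counts[(w1, w2)] += 1

def predict_next(previous_word: str):
    """Return candidate next words ranked by observed 2-gram frequency."""
    candidates = {w2: c for (w1, w2), c in bigram_counts.items() if w1 == previous_word}
    return sorted(candidates, key=candidates.get, reverse=True)

print(predict_next("I"))   # ['am']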
Another method similar to the Levenshtein distance is the Jaro metric [5]. The Jaro distance metric states that, given two strings s1 and s2, two characters ai, bj from s1 and s2 respectively are considered matching only if ai = bj and their positions are no further apart than half the length of the shorter string. The Jaro distance is then

Jaro(s1, s2) = 1/3 · ( |s1'| / |s1| + |s2'| / |s2| + (|s1'| − t) / |s1'| )    (2)

where |s1'|, |s2'| are the numbers of s1 matching s2 and s2 matching s1 characters respectively, and t is the number of transpositions.
A variant of the Jaro metric uses a prefix scale p, which is the length of the longest common prefix of strings s1 and s2. Defining the Jaro distance as d, the Jaro-Winkler [6] distance can then be defined as

Jaro-Winkler = d + ( max(p, 4) / 10 ) · (1 − d) .    (3)
The result of the Jaro-Winkler distance metric is normalized into the range of [0, 1]. It
is designed and best suited for short strings.
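For illustration only, a sketch of the standard Jaro and Jaro-Winkler computations follows; it uses the common convention of halving the transposition count and capping the prefix at four characters, which may differ slightly from the variant described above:

def jaro(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(len(s1), len(s2)) // 2 - 1
    m1, m2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):                       # find matching characters
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    t = k = 0
    for i in range(len(s1)):                         # count transpositions among matches
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3.0

def jaro_winkler(s1: str, s2: str) -> float:
    d = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):                         # common prefix, capped at 4
        if a != b or prefix == 4:
            break
        prefix += 1
    return d + (prefix / 10.0) * (1.0 - d)

print(round(jaro_winkler("hello", "hallo"), 3))      # approximately 0.88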
First Rank Conversion Values and First Hitting Rate Definition: In neural network post-processing, if the output follows a 'winner takes all' strategy, that is, the maximum value in the output is converted into one and the remaining values are converted into zeros, then the converted elements are named First Rank Conversion Values. Given a testing matrix P, a target matrix T and a testing result matrix R, where their numbers of rows and columns are equal and expressed as n and m respectively, the Hitting Rate is HR = { hr_i | hr_i = zeros(T − R_i) / n, i ∈ [1, m] }, where R_i is the ith Rank Conversion Values of R, zeros() is the function that computes the number of zero vectors included in a matrix, and the First Hitting Rate is hr_1.
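Read with the winner-takes-all convention above, a small NumPy sketch of the First Rank Conversion and the First Hitting Rate (matrix shapes chosen only for illustration) could be:

import numpy as np

def first_rank_conversion(outputs: np.ndarray) -> np.ndarray:
    """Winner takes all: the maximum of each output row becomes 1, the rest 0."""
    converted = np.zeros_like(outputs)
    converted[np.arange(outputs.shape[0]), outputs.argmax(axis=1)] = 1
    return converted

def first_hitting_rate(targets: np.ndarray, results: np.ndarray) -> float:
    """Fraction of rows whose converted result matches the target exactly."""
    converted = first_rank_conversion(results)
    hits = np.all(converted == targets, axis=1)
    return float(hits.mean())

# two test patterns, six possible suggestions each (cf. the six-output WLR layer)
targets = np.array([[1, 0, 0, 0, 0, 0],
                    [0, 0, 1, 0, 0, 0]])
results = np.array([[0.9, 0.1, 0.2, 0.0, 0.0, 0.0],
                    [0.2, 0.3, 0.1, 0.0, 0.0, 0.0]])
print(first_hitting_rate(targets, results))  # 0.5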
Word-List n-formula Prediction Definition: Let's assume that one has a distinct algorithm set A = {a_1, …, a_i, …, a_n}, where 1 ≤ i ≤ n and i, n are positive integers. To process a sequence s, if there exists a one-to-many mapping {s → O_i} associated with algorithm a_i between input and output, where O_i = { o_i^j | 1 ≤ j ≤ m_i }, o_i^j is a generated sequence from the algorithm, and j, m_i are positive integers, then one has Σ_{i=1}^{n} m_i sequences generated, and the sequence set is defined as the Word-List.
As illustrated above, a word correction function can combine multiple algorithms, all of which produce their own results independently; this is the so-called Word-List n-formula Prediction. The results from the different algorithms are rarely the same, while a user may require only one of them if a Success Prediction is fulfilled. A functional ranking model therefore plays a major role in presenting an efficient word list with priority. If one considers the learning factor required by a word list and the variability of its related dataset, a neural network model is a good choice, with the dataset being constantly updated.
In the L.M.T combination, the Levenshtein word distance algorithm calculates the similarity between each pair of words and presents the most similar ones; the Metaphone algorithm retrieves words based on the phonetic index, while the word 2-gram algorithm retrieves them based on the last typed word index. From the definition of Word-List n-formula prediction, the L.M.T correction can be referred to as Word-List
3-formula Prediction. Let’s use the example shown below, where the word ‘shall’ is
wrongly typed as ‘sahll’.
and assume that a database, which includes a 1-gram & 2-gram table, has been initial-
ized by a sentence,
Then, the L.M.T correction result for the word 'sahll' based on the word 2-gram algorithm is 'all'; the correction results based on the Metaphone algorithm are 'shall' and 'shell'; and the correction results based on the Levenshtein word distance algorithm are 'all' and 'shall'.
[Fig. 1 (diagram): the three algorithm blocks (Levenshtein distance, Metaphone, 2-gram) feed the WLR network, which has a 12-neuron input layer, a 3-neuron hidden layer and a 6-neuron output layer.]
Fig. 1. The circles in blue are neurons of WLR model; the circles in grey are predicted words;
the three rectangles represent the three algorithms, namely, Levenshtein distance, Metaphone
and 2-gram; the shapes in yellow show the input and the output of WLR model
Based on these rules, the word dictionary table and the 2-gram dictionary table, including their word occurrences, are initialized in the Access database. Moreover, for database efficiency purposes, all the 2-gram records whose occurrences are less than two are eliminated. Overall, about 79.10% of all 2-gram records are eliminated. This has only a very limited influence on the performance of the WLR model if one considers the thousands of repetitive trials in neural network training and testing. The occurrences of the words' 1- and 2-grams are kept updated along with the user's typing progress (if a new 2-gram is generated, the 2-gram and its occurrence are inserted into the database). Therefore, these updated frequencies can well represent a user's temporal typing state captured and stored in the database.
As a simulation of dyslexic typing, a testing sample [9] is used as the experimental dataset for the WLR model, as shown below.
Fig. 2. The numbers in red and black indicate words 1 and 2-gram frequencies respectively
As shown in Figure 2, some words within sentences are wrongly typed, such as
‘hvae’ (should be ‘have’) and ‘raed’ (should be ‘read’). The numbers which are right
under each word (in red) indicate the frequency of the word after the database initiali-
zation. For example, the frequency of the word ‘If’ is 414 and the frequency of the
word ‘you’ is 1501 in the database. The numbers in black indicate the 2-gram fre-
quency between two consecutive words. For example, the frequency between the first
two words 'If' and 'you' is eighty-five, shown as '|---85---|'.
Let's assume the frequencies of the words shown above gradually increase in the database while other words are rarely typed. Consequently, the changes in other words' frequencies will not have a big effect on the algorithms. Therefore, a simulation can be
performed by using the testing dataset which has ignored the influence brought by other
words’ frequency changes. In this research, 5505 trials of test samples are inserted into
the database gradually without considering other words’ frequency changes.
Let’s define a sampling point as a starting point of sampling in these 5505 trials,
and define a sampling step as a gap between two consecutive sampling actions.
Twenty-five sampling points are set up to collect the three algorithms' prediction
results. Only those wrongly typed and completed words are considered at every sam-
pling point. For example, the prediction results for words such as ‘hvae’ and ‘raed’
are collected; while the prediction results for right words such as ‘if’, ‘you’ and un-
completed words such as ‘hva’ of ‘hvae’ are ignored. At each sampling point, the
whole dataset is gathered and called a sample. Thus, twenty-five samples are gathered in total. The determination of the sampling points and the sampling step is based on a heuristic method, which shows that the influence of initial frequency updating is essential
while further updating influence is waning.
Figure 3 illustrates the sampling procedure, which is classified into four categories [0→5, 10→50, 55→505, 1505→5505]. As illustrated, the influence of frequency updating wanes from one category to another although the sampling steps are actually increasing. The four categories are shown as red lines in Figure 3. For
instance, five samples have been collected with the frequency being changed from
zero to five (i.e. the sampling step is one), and ten samples are collected when the
frequency changed from 55 to 505 (i.e. the sampling step is 50).
Fig. 3. X-axis refers to the frequency of the whole sample; y-axis refers to the numbers of
sampling
The first two subsets of sample one are shown in Figure 4, which lists the predicted
results of two mistakenly typed words, which are ‘hvae’ and ‘raed’.
Fig. 4. The first line is a comment which marks the three algorithms' names and 'output'. The rest
are two prediction results based on the three algorithms.
As shown in the columns of Figure 4, each of the three algorithms has generated
two predicted words. For instance, Levenshtein word distance algorithm gives two
suggestions to the word - ‘hvae’, which are ‘have’ and ‘hae’. Next to each word, the
word’s frequency and the similarity values to the target word are displayed. For ex-
ample, the frequency of the word ‘have’ is 679 and its similarity to ‘hvae’ is 0.925.
The last six columns of Figure 4 clearly show the required output for WLR neural
network model. Each of those columns corresponds to one of the words that those
three algorithms could generate. If the prediction is true, the corresponding column is
set to one, otherwise it is set to zero. For example, the first line of Figure 4 is a predic-
tion for mistakenly typed word ‘hvae’ while among the six predictions only the first
result of Levenshtein word distance algorithm is a correct prediction, therefore the
first column of the output is set to one while others are set to zeros. By default, the
processing stops at the first ‘1’, and the others will be set to zeros. So the output will
have a maximum of one ‘1’.
The data shown in Figure 4 still cannot be used by the WLR model directly, as further pre-processing is required. Therefore the following procedures are applied.
[Fig. 5 (histogram): ranking First Hitting Rate per category, namely 57.50%, 67.63%, 71.81% and 74.69%.]
Fig. 5. X-axis refers to the increase of words frequency difference; y-axis refers to the Hitting
Rate of WLR model ranking
Figure 5 shows that the samples are separated into four categories based on the step
distance of [1, 10, 50, 1000]. For example, the first histogram shows a 57.50% Rank-
ing First Hitting Rate with the samples of category one; the fourth histogram shows a
best achievement of 74.69% Ranking First Hitting Rate with more samples collected
between frequency 1505 and 5505 in five separated sampling points. Figure 5 shows
an increase of the ranking Hitting Rate as the word frequency difference and the amount of testing samples increase. This is also partly influenced by the three previously introduced algorithms with learning factors. All the algorithms adjust gradually toward a better prediction rate as the trials increase.
5 Conclusion
In this paper a hybrid solution based on multiple typing correction algorithms and a Word List Neural Network Ranking model to produce an optimal prediction is presented. Three distinct algorithms, namely Metaphone, Levenshtein distance and word 2-gram, are used in a pilot study. Several key factors including Time Change, Context Change and User Feedback are considered. Experimental results show that a 57.50% ranking First Hitting Rate is achieved with the initial samples. Further testing with updated samples indicates a best ranking First Hitting Rate of 74.69%. The findings demonstrate that a neural network, as a learning tool, can provide an optimal solution by combining distinct algorithms to learn and subsequently adapt, reaching a high ranking Hitting Rate performance. This may inspire the use of a similar approach in other applications.
In practice, an application using the WLR theory can be implemented based on propagating rewards to each algorithm and/or word. Currently the WLR model adjusts its ranking based on changes in word frequency and similarity. In the future, more parameters, such as a time element, and more typing correction algorithms can be added to achieve better performance.
Acknowledgments. The research is funded by Disability Essex [10] and Technology
Strategy Board [11]. Thanks to Pete Collings and Ray Mckee for helpful advice and
discussions.
References
1. Ouazzane, K., Li, J., et al.: A hybrid framework towards the solution for people with dis-
ability effectively using computer keyboard. In: IADIS International Conference Intelli-
gent Systems and Agents 2008, pp. 209–212 (2008)
2. Metaphone, http://en.wikipedia.org/wiki/Metaphone (accessed January
23, 2009)
3. N-gram, http://en.wikipedia.org/wiki/N-gram (accessed January 18, 2009)
4. Levenshtein algorithm, http://www.levenshtein.net/ (accessed January 23, 2009)
5. Cohen, W.W., Ravikumar, P., Fienberg, S.: A Comparison of String Distance Metrics for
Name-Matching Tasks. In: IIWeb 2003, pp. 73–78 (2003)
6. Jaro-Winkler distance, http://en.wikipedia.org/wiki/Jaro-Winkler (ac-
cessed January 23, 2009)
7. Far from the Madding Crowd,
http://en.wikipedia.org/wiki/Far_from_the_Madding_Crowd (accessed
April 15, 2009)
8. Calgary Corpus,
ftp://ftp.cpsc.ucalgary.ca/pub/projects/text.compression.corpus/
text.compression.corpus.tar.Z (accessed January 18, 2009)
9. Davis, M.: Reading jumbled texts,
http://www.mrc-cbu.cam.ac.uk/~mattd/Cmabrigde/
(accessed January 26, 2009)
10. Disability Essex, http://www.disabilityessex.org (accessed January 18, 2009)
11. Knowledge Transfer Partnership, http://www.ktponline.org.uk/ (accessed
January 18, 2009)
Application of Neural Fuzzy Controller for Streaming
Video over IEEE 802.15.1*
1 Introduction
Nowadays the Moving Picture Experts Group-4 standard (known as MPEG-4) is broadly used as a video compression technique. Real-time delivery of MPEG-4 is delay-sensitive, as a frame cannot be displayed if its data arrives damaged or late. In practice, many problems can occur and create transmission problems, such as noise from other nearby devices. The maximum delay permissible corresponds to the start-up delay accepted by the user. Data may arrive too late for the frame to be displayed, and the result is poor quality video. Not only may data arrive too late; while retransmission is taking place, other packets may either wait too long at the token bucket of the IEEE 802.15.1 or, in extreme cases, overflow the token bucket.
The IEEE 802.15.1 standard is a low-power radio technology that permits high-quality, high-security, high-speed voice and data transmission [1]. In the IEEE 802.15.1 standard
* This work was funded by the Engineering and Physical Sciences Research Council (EPSRC), UK. The work was carried out at the London Metropolitan University, UK.
2 Methodology
This section briefly introduces the fuzzy logic controller. In a fuzzy subset, each member is an ordered pair, with the first element of the pair being a member of a set S and the second element being the possibility, in the interval [0, 1], that the member is in the fuzzy subset. This should be compared with a Boolean subset, in which every member of a set S is a member of the subset with probability taken from the set {0, 1}, in which a probability of 1 represents certain membership and 0 represents no membership. In a fuzzy subset of "buffer fullness," the possibility that a buffer with a given fullness taken from the set S of fullness values may be called high is modeled by a
[Fig. 1 (block diagram): video in → MPEG-4 encoding (Rencoder) → "buffer added" → IEEE 802.15.1 token bucket → transmission device with noise and interference (Rtransmission) → reception device, buffering (Rreception), decoding and reconstruction → video out.]
Fig. 1. The RBF and NFC scheme for video transmission over IEEE 802.15.1
membership function, which is the mapping between a data value and possible mem-
bership of the subset. Notice that a member of one fuzzy subset can be a member of
another fuzzy subset with the same or a different possibility. Membership functions
may be combined in fuzzy “if then” rules to make inferences such as if x is high and y
is low, then z is normal, in which high, low, and normal are membership functions of
the matching fuzzy subsets and x, y, z are names for known data values. In practice,
the membership functions are applied to the data values to find the possibility of
membership of a fuzzy subset and the possibilities are subsequently combined
through defuzzification, which results in a crisp (non-fuzzy) value [9].
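As a minimal sketch of this fuzzify/combine/defuzzify cycle, with made-up triangular membership functions and two illustrative rules (not the controller's actual rule base), a zero-order Sugeno-style inference in Python might look like this:

def tri(x, a, b, c):
    """Triangular membership function rising from a, peaking at b, falling to c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def infer(buffer_level, bucket_level):
    """Two illustrative rules combined by a weighted average (Sugeno defuzzification)."""
    # rule 1: buffer high AND bucket high -> congestion (crisp consequent 0.9)
    w1 = min(tri(buffer_level, 0.5, 1.0, 1.5), tri(bucket_level, 0.5, 1.0, 1.5))
    # rule 2: buffer low AND bucket low -> fluid transmission (crisp consequent 0.1)
    w2 = min(tri(buffer_level, -0.5, 0.0, 0.5), tri(bucket_level, -0.5, 0.0, 0.5))
    return (w1 * 0.9 + w2 * 0.1) / (w1 + w2) if (w1 + w2) > 0 else 0.5

print(infer(0.8, 0.9))   # close to 0.9, i.e. congestion
print(infer(0.1, 0.2))   # close to 0.1, i.e. fluid transmission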
IEEE 802.15.1 standard employs variable-sized packets up to a maximum of five
frequency-hopping time-slots of 625 microseconds in duration. Every IEEE 802.15.1
standard frame consists of a packet transmitted from a sender node over 1, 3, or 5
timeslots, while a receiver replies with a packet occupying at least one slot, so that
each frame has an even number of slots. A source of error bursts is co-channel inter-
ference by other wireless sources, including other IEEE 802.15.1 piconets, IEEE
802.11b/g networks, cordless phones, and even microwave ovens. Though this has been alleviated to some extent in version 1.2 of the Bluetooth standard by adaptive frequency hopping [10], this is only effective if the interference does not span all or most
of the 2.402 to 2.480 GHz unlicensed band. IEEE 802.11b operating in direct se-
quence spread spectrum mode may occupy a 22 MHz sub-channel (with 30 dB energy
attenuation over the central frequency at ±11 MHz) within the 2.4 GHz Industrial,
Scientific and Medical (ISM) band. IEEE 802.11g employs orthogonal frequency di-
vision multiplexing (OFDM) to reduce inter-symbol interference but generates similar
interference to 802.11b. Issues of interference might arise in apartment blocks with
multiple sources occupying the ISM band or when higher-power transmission occurs
such as at Wi-Fi hotspots.
In this article only the sending part will be analysed. For the system here, as shown
in Fig.1, there is one Fuzzy Logic Controller (FLC), one neural-fuzzy regulator, and
an added buffer between the MPEG-4 encoder and the IEEE 802.15.1 standard. The IEEE 802.15.1 token bucket is a measure of channel conditions, as was demonstrated in [11]. The token bucket buffer is available to an application via the Host Controller Inter-
face (HCI) presented by an IEEE 802.15.1 standard hardware module to the upper
layer software protocol stack. Retransmissions avoid the effect of noise and interfer-
ence but also cause the master’s token bucket queue to grow, with the possibility of
packet loss from token bucket overflow.
Therefore, Fig. 2 presents the "buffer added" membership functions as used in the ANFIS training, and the token bucket membership functions are shown in Fig. 3. Here, the "buffer added" is full when the variable is very high, and the token bucket is full of tokens (no data in the buffer) when the variable is low. These figures have been produced with the MATLAB Fuzzy Logic Toolbox. The inputs were combined according to a Sugeno-type fuzzy controller to produce a single output value. As shown in Table 1, if the "buffer added" level is very high and the token bucket level is very high (no token left), then big congestion appears and delay occurs as an overflow is obtained. On the other hand, if the "buffer added" level is low (no data in) and the token bucket level is low as well (full of tokens), then the transmission is very fluid and occurs without problems. In this project the rules are trained by a neural network controller. This controller collects the two inputs (the level in the token bucket and the level in the "buffer added") and trains them with an Adaptive Neural-based Fuzzy Inference System (ANFIS) in order to smooth the output (Rtransmission).
Fig. 2. Membership function plots of the input variable "buffer added" in the Fuzzy inference
system
Fig. 3. Membership function plots of the input variable token bucket in the Fuzzy inference
system
The host can only send data when there are tokens available in the buffer. The data rate during transmission depends on the space available in the token bucket, even if a rate was set up at the beginning of the transmission. The noise is generated as Gaussian noise following the equation:
f(x) = (1 / (σ√(2π))) · exp( −(x − b)² / (2σ²) )    (1)

where 1/(σ√(2π)) is the height of the curve's peak, b is the position of the centre of the peak, and σ controls the width of the "bell".
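For instance, noise samples following equation (1) can be generated with NumPy and added to a nominal bit rate; the parameter values below are placeholders, not the values used in the simulation:

import numpy as np

rng = np.random.default_rng(seed=0)
b, sigma = 0.0, 5.0            # centre of the peak and width of the "bell" (illustrative)
n_frames = 100

# samples drawn from the Gaussian density f(x) of equation (1)
noise = rng.normal(loc=b, scale=sigma, size=n_frames)

r_transmission = 724.0 + noise  # e.g. a 724 Kbps nominal rate perturbed by channel noise
print(r_transmission[:5])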
A video clip called 'X-men' [13] is used to implement the proposed rule-based fuzzy and neural fuzzy control scheme, and the Group of Pictures (GoP) 50 to GoP 150 of this clip are used for the computer simulation. The frame size of the clip is 240 × 180 pixels.
Table 1. The fuzzy rule base

IF token bucket is low AND buffer added is low THEN output is very fluid
IF token bucket is low AND buffer added is medium THEN output is intermediate
IF token bucket is low AND buffer added is high THEN output is fluid
IF token bucket is low AND buffer added is very high THEN output is intermediate
IF token bucket is medium AND buffer added is low THEN output is small intermediate
IF token bucket is medium AND buffer added is medium THEN output is small congestion
IF token bucket is medium AND buffer added is high THEN output is small congestion
IF token bucket is medium AND buffer added is very high THEN output is congestion
IF token bucket is high AND buffer added is low THEN output is fluid
IF token bucket is high AND buffer added is medium THEN output is small congestion
IF token bucket is high AND buffer added is high THEN output is congestion
IF token bucket is high AND buffer added is very high THEN output is congestion
IF token bucket is very high AND buffer added is low THEN output is intermediate
IF token bucket is very high AND buffer added is medium THEN output is congestion
IF token bucket is very high AND buffer added is high THEN output is congestion
IF token bucket is very high AND buffer added is very high THEN output is big congestion
Table 2 compares the numerical results for the testing clip using the Artificial Intelligence (AI) system and the No-AI system for a fixed mean value. Comparing in Table 2 the standard deviations of the system with and without Artificial Intelligence, the standard deviation of the intelligent system is clearly much lower than that of the No-AI system, reducing burstiness and data loss.
Fig. 4. Token bucket availability for No-AI system X-men GOP 50 to 150 724 Kbps
Fig. 5. Buffers availability for AI system X-men GOP 50 to 150 724 Kbps
The following four figures present the MATLAB computer simulation results for the 'X-men' clip. Fig. 6 and Fig. 7 show two sets of results, respectively for the system without Artificial Intelligence (No-AI) and the system with Artificial Intelligence (AI), from the Group of Pictures (GoP) 50 to GoP 150 at a mean value of 724 Kbps. In Fig. 6, from top to bottom, the graphs show the departure rate at the entrance of the IEEE 802.15.1 standard buffer (token bucket) (Rencoder), the mean value of the device with noise (Rtransmission), and the superposition of the departure rate at the entrance of the token bucket and the mean value of the device without noise. In Fig. 7, from top to bottom, the graphs show the departure rate from the IEEE 802.15.1 token bucket, the bit rate of the "buffer added" just after the MPEG-4 encoder, the mean value of the device with noise, and the superposition of the departure rate from the token bucket, the bit rate of the "buffer added" just after the MPEG-4 encoder and the mean value of the device without noise.
Fig. 6. Rencoder, Rtransmission, No-AI system X-men GOP 50 to 150 724 Kbps
The token bucket level is an indication of the storage available in the buffer for data. The token bucket level in Fig. 4 shows when the token bucket overflows and data are lost. All values below "0" indicate overflow, and the token bucket then causes data loss. The first graph at the top of Fig. 5 indicates the degree of overflow or starvation of the "buffer added". The second graph at the bottom of Fig. 5 outlines the capacity of the token bucket. We can easily see in Fig. 4 the overflow of the token bucket (when the value goes below 0); in contrast, in Fig. 5 the token bucket always has available space between frame 200 and frame 300. For this video clip, the introduction of the RBF and NF control eliminates the data loss altogether, as there is still some storage capacity left in the "buffer added" and the token bucket.
Fig. 7. Rencoder, Rreception, Rtransmission, AI system X-men GOP 50 to 150 724 Kbps
In Fig. 7, the burstiness of the departure bit rate from the Rencoder is reduced in the Rreception signal. Fig. 7 demonstrates that the departure bit rate is much smoother, resulting in a reduction in data loss and time delay, and an improvement in picture quality over the IEEE 802.15.1 channel. Furthermore, by maintaining sufficient space available in the token bucket, the novel scheme decreases the standard deviation of the output bit rate and the number of dropped data, resulting in better picture quality and video stability.
4 Conclusion
This paper applies artificial intelligence techniques to streaming video over IEEE 802.15.1 to improve picture quality and to reduce time delay and excessive data loss during wireless communication. The Neural Fuzzy Controller (NFC) enables a smooth flow of MPEG-4 video data over IEEE 802.15.1 and reduces burstiness, while alleviating the overflow of the token bucket associated with the IEEE 802.15.1 standard. In this research, a significant reduction of the standard deviation of the MPEG-4 video and a diminution of the token bucket burstiness have been achieved by using the novel algorithms. These results therefore directly affect the stability of the communication, while providing an improvement in image quality at the receiving end. Finally, as these intelligent techniques are based on efficient algorithms, the proposed design can be applied to delay-sensitive applications such as audio, video and multimedia streaming data.
References
1. Haartsen, J.C.: The Bluetooth radio system. IEEE Personal Communications 7(1), 28–36
(2000)
2. http://www.ieee802.org/15/pub/TG1.html
3. Antoniou, P., Pitsillides, A., Vasiliou, V.: Adaptive Feedback Algorithm for Internet Video
Streaming based on Fuzzy Rate Control. In: 12th IEEE Symposium on Computers and
Communications (ISCC 2007), IEEE Catalog Number: 07EX1898C, Aveiro, Portugal,
July 1-4, pp. 219–226 (2007)
4. Vasilakos, A., Ricudis, C., Anagnostakis, K., Pedrycz, W., Pitsillides, A., Gao, X.: Evolu-
tionary - Fuzzy Prediction for Strategic ID-QoS: Routing in Broadband Networks. Interna-
tional Journal of Parallel and Distributed Systems and Networks 4(4), 176–182 (2001)
5. Razavi, R., Fleury, M., Ghanbari, M.: Fuzzy Logic Control of Adaptive ARQ for Video
Distribution over a Bluetooth Wireless Link. In: Advances in Multimedia, vol. 2007. Hin-
dawi Publishing Corporation (2007), doi:10.1155/2007/45798, Article ID 45798
6. Kazemian, H.B., Meng, L.: An adaptive control for video transmission over Bluetooth.
IEEE Transactions on Fuzzy Systems 14(2), 263–274 (2006)
7. Kazemian, H.B., Chantaraskul, S.: An integrated Neuro-Fuzzy Approach to MPEG Video
Transmission in Bluetooth. In: 2007 IEEE Symposium on Computational Intelligence in
Image and Signal Processing, CIISP 2007 (2007), IEEE 1-4244-0707-9/07
8. Chrysostomou, C., Pitsillides, A., Rossides, L., Sekercioglu, A.: Fuzzy logic controlled
RED: congestion control in TCP/IP differentiated services networks. Soft Computing - A
Fusion of Foundations, Methodologies and Applications 8(2), 79–92 (2003)
9. http://plato.stanford.edu/entries/logic-fuzzy
10. Ziemer, R.E., Tranter, W.H.: Principles of Communications – systems, modulation and
noise, 5th edn. John Wiley & Sons, Inc., Chichester (2002)
11. Razavi, R., Fleury, M., Ghanbari, M.: Detecting congestion within a Bluetooth piconet:
video streaming response. In: London Communications Symposium, London, UK, Sep-
tember 2006, pp. 181–184 (2006)
12. https://www.bluetooth.org/spec/Core_1.2_and Profile_1.2
13. http://www.movie-list.com/nowplaying.shtml
Tracking of the Plasma States in a Nuclear Fusion
Device Using SOMs
1 Introduction
Experimental data are essential information in several fields, such as chemistry, biology, physics, and engineering. Utilizing the experimental information for knowledge discovery is a critical task, and the strategy for knowledge discovery usually depends on the contents and characteristics of the data.
As an example, monitoring the state of a plant may generate thousands of multidi-
mensional data points for every experiment performed. A study may consist of many
experiments, such that in a single study a huge amount of data points may be generated.
Such volumes of data are too large to be analyzed by, e.g., sorting in spreadsheets or plotting on a single or a few graphs, and systematic methods for their inspection are required. In this paper, clustering methods are investigated for the analysis and organization of large-scale experimental data, and for knowledge extraction, in order to map the operational plasma space of a nuclear fusion device.
The Tokamak is the most promising device for nuclear fusion. In a Tokamak, the
plasma is heated in a ring-shaped vacuum chamber (vessel or torus) and kept away
from the vessel walls by applying magnetic fields. The range of ‘plasma states’ viable
both data clustering and projection. The main advantage of the SOM is in the visuali-
zation tools that allow for an intuitive data exploration.
A SOM defines a mapping from the n-dimensional input space X onto a regular
(usually two-dimensional) array of K clusters (neurons), preserving the topological
properties of the input. This means that points close to each other in the input space
are mapped on the same or neighbouring neurons in the output space.
The output neurons are arranged in a 2D lattice, where a reference vector w is associated with every neuron. The ith output neuron represents the ith cluster. Hence, the output of the ith output neuron O_i, i = 1, …, K, is:

O_i = Σ_{j=1}^{n} w_{ij} x_j    (1)
A competitive learning rule is used, choosing the winner i* as the output neuron with w closest to the input x. The learning rule is:

w_i(t+1) = w_i(t) + α(t) Λ(i, i*) [ x − w_i(t) ]    (2)
The neighbourhood function Λ(i, i*) is equal to 1 for i = i*, and falls off with the distance d_{ii*} between neurons i and i* in the output lattice. Thus, neurons close to the
winner, as well as the winner itself, have their reference vector updated, while those
further away, experience little effect. When training is completed, the reference vec-
tors associated with each neuron define the partitioning of the multidimensional data.
A typical choice for Λ(i, i*) is

Λ(i, i*) = exp( −d_{ii*}² / (2σ²) )    (3)
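A compact NumPy sketch of this competitive update with the Gaussian neighbourhood of equation (3), using an illustrative map size, data set and decay schedule rather than those of the ASDEX Upgrade study, might be:

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((1000, 7))                    # n-dimensional input samples (illustrative)
rows, cols = 10, 10                          # output lattice
W = rng.random((rows * cols, X.shape[1]))    # one reference vector per neuron
grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)

alpha, sigma = 0.5, 3.0                      # learning rate and neighbourhood width
for epoch in range(20):
    for x in X:
        winner = np.argmin(np.linalg.norm(W - x, axis=1))      # competitive step
        d2 = np.sum((grid - grid[winner]) ** 2, axis=1)        # squared lattice distances
        h = np.exp(-d2 / (2.0 * sigma ** 2))                   # neighbourhood, equation (3)
        W += alpha * h[:, None] * (x - W)                      # reference vector update
    alpha *= 0.9                                               # simple decay schedules
    sigma *= 0.9

# average quantization error over the data set
quant_err = np.mean(np.min(np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2), axis=1))
print(f"average quantization error: {quant_err:.3f}")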
A common measure used to calculate the precision of the clustering is the average
quantization error Eq [4] over the entire data set. Moreover, it is possible to evaluate
how well a SOM preserves the data set topology by evaluating the topographic error Et [4]: partitions with a good resolution are characterized by a low value of Eq and a high value of Et. The Davies-Bouldin (DB) Index [5] and the Dunn (DN) Index [6] are two
of the most popular validity indexes. Both indexes are well suited to identifying the presence of compact and well separated clusters. The optimum number of clusters corresponds to the minimum of the DB index and to the maximum of the DN index.
In the present paper, the Davies-Bouldin and Dunn indexes have been modified in order to adapt them to our application. In fact, the main goal of the mapping is the identification of different regions corresponding to different plasma states and, in particular, the safe and unsafe operational regions. The separation of clusters is not required if they contain samples of the same type. In particular, most relevance has been given to the minimum distance between disruptive and safe regions with respect to the compactness within each region. Moreover, the modified DN index (DNM) yields a high value if the safe and disruptive regions are well separated.
The samples of a disruptive shot in the interval [tPREC÷ tD] are considered disruptive
samples (ds). Moreover, two types of safe samples are considered: the samples of a
disruptive shot in the interval [tFLAT-TOP÷tPREC] (ssd), and all the samples of a safe shot
(ss), where tFLAT-TOP is the flat-top beginning time.
Following this classification, four main categories of clusters can be identified de-
pending on their composition: Empty clusters (ECs), which contain no samples; Dis-
ruptive clusters (DCs), which contain disruptive samples ds; Safe clusters (SCs),
which contain safe samples ss and ssd; Mixed Clusters (MCs), which contain both
safe and disruptive samples.
4 Results
The first issue to be addressed in order to partition the samples in the database is the selection of the number of clusters. The number of MCs has to be kept as low as possible, in order to obtain a compact representation of the plasma parameter space. To this end the number of clusters should be as large as possible. Nevertheless, using an excessive number of clusters reduces the power of the mapping as a tool to investigate the common features of the patterns belonging to the same cluster.
In the present paper, the number of clusters in the SOM has been chosen in order to optimize the quality indexes introduced in Section 2.1. As previously mentioned, this choice allows one to minimize the number of samples that belong to mixed clusters MCs while limiting the number of clusters in the SOM.
A second issue is data reduction, in order to balance the number of samples that describe the disruptive phase and the non-disruptive phase. In fact, the number of samples ss and ssd is much larger than the number of samples ds available in the disruptive phase. Data reduction is performed using a SOM applied to each shot, as proposed in [8]. As each cluster is supposed to contain samples belonging to similar plasma states, one sample for each safe cluster is retained, whereas all the samples in the disruptive phase are selected. This procedure allows one to automatically select a limited and representative number of samples for the subsequent operational space mapping.
Moreover, since the SOM uses the Euclidean metric to measure distances between data vectors, a third issue is the scaling of the variables, which have been normalized between zero and one.
The composition of the dataset to which the SOM has been applied, was: 38% of
ss; 50% of ssd, and 12% of ds.
Fig. 1 represents the percentage of MCs with respect to the total number of clusters
(dashed line) and the percentage of the samples contained in them, with respect to the
total number of samples (solid line) versus the total number of clusters. As it can be
noticed, both curves saturate for a number of clusters greater than 1421. Thus, further
increasing the number of clusters is not useful to reduce the number of MCs.
Fig. 2 shows the trend of the Average quantization error Eq and the Topographic
error Et, versus the total number of clusters. An agreement between the minimization
of Eq and the maximization of Et is achieved for 1421 clusters.
Fig. 3 shows the modified DB index and the modified Dunn index versus the total
number of clusters respectively. A good trade-off among the minimization of the DBM
Fig. 2. Average quantization error (continuous line), and Topographic error (dashed line)
index, the maximization of the DNM index and an acceptable fraction of samples that belong to MCs can be obtained with 1421 clusters. Thus, all three diagrams confirm that it is not useful to enlarge the map beyond 1421 clusters.
The resulting mapping gives: 77% of clusters are SCs; 17% are MCs; 3% are DCs;
3% are empty. Moreover, 20% of samples belong to MCs, whereas the majority of
them (78%) is in SCs and only 2% is attributed to DCs.
This means that, as expected, a pre-disruptive phase of 45 ms, optimized on the basis of all the disruptive shots contained in the dataset, cannot be valid for each of them.
Fig. 4a shows the 2D SOM mapping with 1421 clusters. In this mapping, different colours are assigned to the different types of samples, i.e., ss are blue, ssd are red, and ds are green. A large safe region (clusters coloured blue and/or red) and a smaller disruptive region (prevalence of green in the clusters) can be clearly identified. Moreover, a transition region, where ssd and ds coexist, appears between the safe and disruptive regions. In Fig. 4b, the colours of the clusters are assigned on the basis of the cluster type, rather than on the sample density. This map clearly highlights the presence of a large safe region (cyan), a disruptive region (green) on the top right side, and a transition
region (magenta) formed by MCs, which is at the boundary of the two regions. Empty
clusters are white. Therefore, safe and disruptive states of plasma are well separated in
the SOM map, provided that an appropriate number of clusters is set.
Fig. 3. DBM (continuous line), and DNM (dashed line) versus the mapping size
Fig. 4. a) SOM coloured on the basis of sample density (blue: ss; red: ssd; green: ds). b) SOM coloured on the basis of cluster type (cyan: SC; green: DC; magenta: MC; white: EC). Red
trajectory tracks a disruptive shot, whereas black trajectory tracks a safe shot.
The SOMs in Fig. 4a and Fig. 4b are also used to display the operational states of the disruptive shots and of the safe shots over time. The temporal sequence of the samples of a shot forms a trajectory on the map depicting the movement of the operating point. As can be noted, a disruptive shot (red trajectory in the upper side of the maps) starts in a SC cluster containing only ssd (red cluster in Fig. 4a), crosses mixed clusters, and arrives in a DC cluster. Following the trajectory in the map in Fig. 4b, it is possible to recognize a region (magenta region) where the risk of an imminent disruption is high. The safe shot (black trajectory in the lower side of the maps) starts in a SC cluster containing only ss (blue cluster in Fig. 4a), and evolves in time, moving within the safe region (cyan region in Fig. 4b).
5 Conclusions
A strategy to analyze the information encoded in the ASDEX Upgrade database has been developed, using the Self-Organizing Map. It reveals a number of interesting issues for the ASDEX Upgrade operational space. In particular, the proposed strategy allows us to simply display both the clustering structure of the data corresponding to different plasma states and the temporal sequence of the samples on the map, depicting the movement of the operating point during a shot. Following the trajectory in the map, it is possible to recognize the proximity to an operational region where the risk of an imminent disruption is high.
Acknowledgments
This work was supported by the Euratom Communities under the contract of Associa-
tion between EURATOM/ENEA. The views and opinions expressed herein do not
necessarily reflect those of the European Commission.
References
1. Kohonen, T.: Self-Organization and Associative Memory. Springer, New York (1989)
2. Vesanto, J., Alhoniemi, E.: Clustering of the self-organizing map. IEEE Transactions on Neural Networks 11(3), 586–600 (2000)
3. Alhoniemi, E.: Unsupervised pattern recognition methods for exploratory analysis of indus-
trial process data. Ph.D. thesis. Helsinki University of Technology, Finland (2002)
4. SOM Toolbox, Helsinki University of Technology, http://www.cis.hut.fi/projects/somtoolbox
5. Davies, D.L., Bouldin, D.W.: A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence 1, 224–227 (1979)
6. Dunn, J.: Well separated clusters and optimal fuzzy partitions. Journal of Cybernetics 4, 95–
104 (1974)
7. Sias, G.: A Disruption prediction System for ASDEX-Upgrade based on Neural Networks.
PhD. dissertation, University of Padova (2007)
8. Camplani, M., Cannas, B., Fanni, A., Pautasso, G., Sias, G., Sonato, P., Zedda, M.: The
ASDEX Upgrade Team: Mapping of the ASDEX Upgrade operational space using cluster-
ing techniques. In: Proc. of 34th EPS Conference on Plasma Physics and Controlled Fusion,
Warsaw, Poland (2007)
An Application of the Occam Factor to
Model Order Determination
1 Introduction
An important modeling problem is the inference of a functional relationship f(x, w) between a set of parameters w and attribute variables x associated with target values t. This short paper addresses the problem of the determination of both the parameter values w and the model order K for the standard polynomial model denoted HK. Models that are too simple are unlikely to fit the data adequately, while models that are too complex can fit the data and the noise element together and thus fail to generalize when presented with a different data set.
P(w | D, H_K) = P(D | w, H_K) P(w | H_K) / P(D | H_K)    (1)
The likelihood is calculated from the pdf of w; the prior is a quantitative expression of our belief about the probability of w before the data are presented. The evidence
term is derived by integrating (1) over w.
0 = ∇_w log P(D | ŵ, σ, H_K)  ⇒  ŵ = (X^T X)^{-1} X^T t    (4)

0 = ∂ log P(D | w, σ, H_K) / ∂σ  ⇒  σ̂² = (1/N) (t − X ŵ)^T (t − X ŵ)
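Equation (4) and the maximum-likelihood noise estimate can be evaluated directly with NumPy; the following sketch uses a synthetic quadratic data set as a stand-in for the one described later in the paper:

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 50)
t = 2.0 - 1.5 * x + 0.8 * x**2 + rng.normal(scale=0.1, size=x.size)  # noisy quadratic targets

K = 2                                            # polynomial order
X = np.vander(x, K + 1, increasing=True)         # design matrix [1, x, x^2]

w_hat = np.linalg.solve(X.T @ X, X.T @ t)        # equation (4): (X^T X)^{-1} X^T t
residual = t - X @ w_hat
sigma2_hat = residual @ residual / len(t)        # maximum-likelihood noise variance

print(w_hat, sigma2_hat)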
Following Ghahramani [1], the second order properties are:
P(D | H_K) = ∫ P(D | w, H_K) P(w | H_K) dw    (8)

∫ P(D | w, H_K) P(w | H_K) dw = DEN( P(D | w, H_K) P(w | H_K) ) × VOL( P(D | w, H_K) P(w | H_K) )    (9)
where DEN is the probability density at the maximum w and VOL is the volume of the box that models the volume enclosed by the probability surface over the parameter space. In the one-dimensional case DEN is just the maximum value and VOL is just the width of the box. If w is a two-dimensional vector then VOL is an area constituting the base of the box. For higher dimensions VOL is a hyper box.
To apply the DEN operator we have assumed that the maximum of the product of its two terms is the product of the maximum of each term, which can be justified by imposing the restriction of a constant prior term in some hyper region about ŵ:

DEN( P(D | w, H_K) P(w | H_K) ) = P(D | ŵ, H_K) P(ŵ | H_K)    (10)
The final expression for the marginal likelihood is the product of the maximised likelihood and MacKay's Occam factor [2].
∫ P(D | w, H_i) P(w | H_i) dw = P(D | ŵ, H_i) · [ P(ŵ | H_i) det(A / 2π)^{−1/2} ]    (12)

where the first factor is the maximised likelihood and the bracketed term is the Occam factor.
The Occam factor above is a direct consequence of the application of Bayesian reasoning. It will penalise models with a large number of parameters, since the size of the hyper box will force P(ŵ | H_i) to be low. Secondly, an over-fitted model will have a sharp narrow peak for the marginal likelihood distribution (9), which corresponds to a large A term and hence a small det(A / 2π)^{−1/2}. We approach the problem of model order determination through the resulting Occam Information Criterion (OIC).
For comparison, the OIC is compared to the Bayesian Information Criterion [3] and to the Akaike Information Criterion [4]:

BIC = −2 ln L + K ln N
AIC = −2 ln L + 2K
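Given the maximised Gaussian log-likelihood of a fitted model, the two comparison criteria can be computed as below (a sketch with stand-in residuals; the paper's OIC would additionally involve the Occam factor of (12), which requires the prior width σw and det(A), and is not reproduced here):

import numpy as np

def gaussian_log_likelihood(residual, sigma2):
    """Maximised log-likelihood of a Gaussian noise model with variance sigma2."""
    n = residual.size
    return -0.5 * n * np.log(2.0 * np.pi * sigma2) - residual @ residual / (2.0 * sigma2)

def bic(log_l, k, n):
    return -2.0 * log_l + k * np.log(n)

def aic(log_l, k):
    return -2.0 * log_l + 2.0 * k

# stand-in residuals from some fitted model with k = 3 parameters
residual = np.random.default_rng(1).normal(scale=0.1, size=50)
sigma2 = residual @ residual / residual.size
log_l = gaussian_log_likelihood(residual, sigma2)
print(bic(log_l, k=3, n=residual.size), aic(log_l, k=3))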
For all the indices above we would look for a lower value to determine a preferred model. The choice of σw is part of the problem of assessing how probable a particular w is prior to any data arriving, and it does need careful consideration [2]. In this case we have chosen a value of about twice that of the true coefficients.
We have calculated these criteria for four polynomial models of orders 0 to 3. The data set D comprises a range of points along the x axis and corresponding values from the application of a target quadratic function plus a noise component whose magnitude is a fraction, rv = 0.3, of the quadratic's standard deviation (Fig. 1).
Figure 2 shows the components needed to compute the information indices. For each model order it shows the standard deviation (sigmahat, upper left), the 'sigwbarD' term, which is the probability volume that results when the model is applied and corresponds to the det(A_K)^{−1/2} term (upper right), the information criterion (lower left) and lastly the posterior probability. We have repeated the analysis for a linear model and show the results in Figure 3. In both cases the Occam Information Criterion correctly identifies the true model order.
Fig. 2. For each model order: top left, the standard deviation; top right, the det(A_K)^{−1/2} term; lower left, the information criterion; lower right, the posterior probability
Fig. 3. Left: linear model, generated data and fitted model; right: information indices
References
1. Ghahramani, Z.: Machine Learning 2006 Course Web Page, Department of Engineering.
University of Cambridge (2006)
2. MacKay, D.: Information Theory, Inference, and Learning Algorithms. Cambridge Univer-
sity Press, Cambridge (2003)
3. Schwarz, G.: Estimating the Dimension of a Model. Annals of Statistics 6(2), 461–464 (1978)
4. Akaike, H.: A new look at the statistical model identification. IEEE Transactions on Auto-
matic Control 19(6), 716–723 (1974)
5. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer Science and Media
LLC (2006)
Use of Data Mining Techniques for Improved Detection
of Breast Cancer with Biofield Diagnostic System
1 Introduction
The Biofield Diagnostic System (BDS), developed by Biofield Corporation, USA, is
based on differentiating cancerous breast lesions from non-cancerous lesions by
measuring alterations in skin surface electropotential differences which are known to
occur concurrently with abnormal epithelial proliferation associated with breast ma-
lignancies [1]. The main requirement of BDS is that it should be used on a patient
with a suspicious lesion that has been located by palpation or by using other diagnos-
tic procedures. The device received market approval from the European Union (CE
mark) to be an adjunct to physical breast examination or relevant imaging modalities.
2.1 Dataset 1
The first dataset, henceforth referred to as the “TTSH dataset”, was obtained as a
result of the BDS clinical trial conducted at Tan Tock Seng Hospital (TTSH), Singa-
pore in 2008. 149 women scheduled for mammography and/or ultrasound tests par-
ticipated in the study. Of the 149 cases, 53 cases were malignant and 96 were benign.
No pre-processing steps were necessary as no outliers/missing data were present. The
BDS test conducted on these women resulted in a sensitivity of 96.23%, specificity of
93.80%, and an accuracy of 94.51%. The Area under the Receiver Operating Charac-
teristics (AROC) curve was 0.972. For each patient, the following features were
available for further analysis. The encoded data as used in the analysis is given in
brackets. Menopausal status (Pre: 0; Post: 1), Parity (Previous pregnancy (no risk):0;
Never pregnant (has risk): 1), Family History (No history: 0; Has history: 1), Lesion
palpability (Not palpable: 0; Palpable: 1), Lesion location (RUO: 1; RUI: 2; RLO: 3;
RLI: 4; RSA: 5; LUO: 6; LUI: 7; LLO: 8; LLI: 9; LSA: 10 where R–Right; L–Left;
SA–Sub Areolar), Age, BDS index, Post-BDS-LOS and Class (Benign: 0; Malignant:
1). Since the tested BDS device was meant for clinical purposes, it did not output
the 16 sensed electropotential values. Overall, the TTSH dataset had a total of nine
features.
2.2 Dataset 2
The second dataset, henceforth referred to as the “US dataset”, was from one of the
BDS trials conducted in the U.S. back in 1995. The dataset was obtained from the
manufacturer. It had a total of 291 cases, out of which 198 were benign and 93 were
malignant. The major difference between this dataset and the TTSH dataset is the
presence of the 16 sensor values for each patient in addition to the other features listed
in section 2.1. This dataset was purposely chosen to study the effect of these addi-
tional sensor values on the disease prediction efficiency. It has been reported that the value recorded by the SC sensor for malignant cases must be statistically greater than or equal to the mean of the SI, SO, SU and SL values and can never be less [12]. This fact was used to pick out outliers from the malignant dataset using a t-test. Subsequently, the univariate outliers were detected using the z-score method. The threshold was kept at ±3 for the benign dataset (since the sample size was greater than 80) and at ±2.5 for the malignant dataset (sample size ≤ 80). The multivariate outliers were then detected using the Mahalanobis distance (cases having probability of the distance ≤ 0.001 were considered outliers) [13]. After these data-preprocessing steps, there were 183 benign cases
and 58 malignant cases (total of 241 cases) which were used for further analysis. The
BDS test results from these 241 cases indicated a sensitivity of 89.6%, specificity of
54.6%, and an accuracy of 63.1%. The AROC was 0.771. In addition to the original
electropotential values, ten linear combinations of these electropotentials were formu-
lated and used as features. These Electropotential Differentials (EPDs) are listed in Table 1.

Table 1. Additional derived features for the US dataset; EPD: Electropotential Differential
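A sketch of the univariate and multivariate outlier screening described above (z-scores with the stated thresholds, followed by Mahalanobis distances with a chi-square tail probability ≤ 0.001) could look like the following; the data array and its shape are hypothetical:

import numpy as np
from scipy.stats import chi2

def zscore_outliers(X: np.ndarray, threshold: float) -> np.ndarray:
    """Rows with any |z| above the threshold (±3 benign, ±2.5 malignant in the text)."""
    z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    return np.any(np.abs(z) > threshold, axis=1)

def mahalanobis_outliers(X: np.ndarray, p_cut: float = 0.001) -> np.ndarray:
    """Rows whose squared Mahalanobis distance is improbably large (p <= p_cut)."""
    diff = X - X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)
    return chi2.sf(d2, df=X.shape[1]) <= p_cut

# X_benign: illustrative stand-in for the benign cases' sensor/feature values
X_benign = np.random.default_rng(0).normal(size=(198, 16))
keep = ~(zscore_outliers(X_benign, 3.0) | mahalanobis_outliers(X_benign))
X_benign_clean = X_benign[keep]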
For evaluation, the data were divided into ten parts; in each fold the classifier was developed using nine parts (learning set) and the built classifier was evaluated using the remaining one part (test set). This procedure was repeated 10 times using a different part for testing in each case. The 10 test classification performance metrics were then averaged, and the average test performance is declared as the estimate of the true generalization performance.
[Figure (methodology block diagram): pre-processing of the original dataset (missing data handling, outlier removal, normalization); feature selection by filter methods (Ranker: FFS1; correlation-based best-first: FFS2) and by an LDA-based wrapper (WFS1); classifier development using the learning set and evaluation using the test set.]
The wrapper method uses a classifier to perform classification using every possible
feature subset and finally selects the feature subset that gives the best classification
accuracy. Since a classifier is built as many times as the number of feature subsets gen-
erated, computationally intensive classifiers should not be considered. Hence, in this work, a Linear Discriminant Analysis (LDA) classifier-based wrapper technique was implemented in Matlab (version 7.4.0.287-R2007a). The search strategy employed was nondeterministic and the search direction was random. At the end of each fold, a rank was assigned to each feature. The top-ranked 25% of the features were selected for each dataset; that is, the two most significant features were selected to form the reduced TTSH dataset and the top eight features were chosen to form the reduced US dataset.
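A minimal sketch of an LDA-based wrapper of the kind described above is given below, using scikit-learn in place of the original Matlab implementation; the random subset search, the iteration budget and the cross-validation depth are illustrative assumptions:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

def lda_wrapper_select(X, y, subset_size, n_trials=200, seed=0):
    """Randomly sample feature subsets and keep the one with the best
    cross-validated LDA accuracy (a simple nondeterministic wrapper)."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    best_subset, best_score = None, -np.inf
    for _ in range(n_trials):
        subset = rng.choice(n_features, size=subset_size, replace=False)
        score = cross_val_score(LinearDiscriminantAnalysis(),
                                X[:, subset], y, cv=10).mean()
        if score > best_score:
            best_subset, best_score = subset, score
    return np.sort(best_subset), best_score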
In filter methods, features are first assigned scores based on particular criteria like
correlation (Correlation-based Feature Selection (CFS)), χ2-statistic (Chi-Squared
method (CS)) or information measure (Information Gain (IG)/Gain Ratio (GR)). Then
they are ranked using search methods such as best-first, ranker or greedy stepwise. The WEKA software (version 3.6.0) was used for this purpose [14]. Again, the top two and
top eight features were selected from TTSH and US datasets respectively. The se-
lected features for both datasets using the described feature selection methods are
shown in Table 2. It can be observed that all the filter techniques selected similar
feature subsets in case of the US dataset.
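Outside WEKA, a comparable filter-style ranking can be sketched with scikit-learn's chi-squared and mutual-information scores (mutual information is used here as a stand-in for WEKA's information gain; the feature and label arrays are placeholders):

import numpy as np
from sklearn.feature_selection import chi2, mutual_info_classif

def filter_rank(X, y, k):
    """Score each feature independently and return the indices of the top k."""
    chi2_scores, _ = chi2(np.abs(X), y)          # chi-squared requires non-negative inputs
    mi_scores = mutual_info_classif(X, y, random_state=0)
    top_chi2 = np.argsort(chi2_scores)[::-1][:k]
    top_mi = np.argsort(mi_scores)[::-1][:k]
    return top_chi2, top_mi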
The reduced datasets formed using the filter subsets were used to build and test the following supervised learning based classifiers: LDA, Quadratic Discriminant Analysis (QDA), Support Vector Machines (SVM) and Back Propagation Neural Network (BPNN).

Table 2. Selected feature subsets for both datasets using filter and wrapper methods

Dataset        Wrapper, LDA evaluator (WFS1)       Filter, CS/IG/GR Ranker (FFS1)    Filter, CFS Best First (FFS2)
TTSH dataset   Post-BDS-LOS, Lesion Palpability    Post-BDS-LOS, Age                 Post-BDS-LOS, Lesion Palpability

Since wrapper feature sets should be used with their respective
classifiers for best results, the reduced dataset formed using WFS1 was used to build an LDA classifier. Classifier performance was evaluated using sensitivity, specificity,
accuracy and AROC measures.
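The four measures can be computed from test-set predictions as in the following sketch (assuming scikit-learn and taking malignant, label 1, as the positive class; y_score denotes the continuous classifier output used for the AROC):

import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def bds_metrics(y_true, y_pred, y_score):
    """Sensitivity, specificity, accuracy and AROC with malignant = 1."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    aroc = roc_auc_score(y_true, y_score)   # area under the ROC curve
    return sensitivity, specificity, accuracy, aroc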
Table 3. Classifier performance measures for the reduced TTSH datasets formed using the
filter subsets FFS1 and FFS2
Table 4. Classifier performance measures for the reduced US datasets formed using the filter
subsets
The TTSH BDS clinical trial resulted in a sensitivity of 96.23%, specificity of 93.80%,
and an accuracy of 94.51%. The AROC was 0.972. The US trial, on the other hand,
demonstrated a sensitivity of 89.6%, very low specificity of 54.6% and low accuracy
(63.1%). The AROC was also only 0.771. It is evident that the BDS device used in the
TTSH clinical trial recorded significantly better performance measures when compared
to those recorded by the US trial device. The reason for this can be attributed to the fact
that both the clinical trials used different interpretation maps to deduce the Post-BDS-
LOS from the BDS index and the prior LOS. The TTSH device was an improved ver-
sion of the device used in the US trial. The reason for including the US dataset was to
study the use of the raw independent sensor values in disease detection and also to
evaluate the use of classifiers for improving the current very low detection accuracy.
For the US dataset, the best classification results were demonstrated by the LDA/WFS1 combination: 93.10% sensitivity, 97.81% specificity, 96.68% accuracy and an AROC of 0.9507. These values are substantially higher than those observed in the US clinical trial.
In the case of the TTSH dataset, the filter and the wrapper datasets' results for the LDA classifier were identical: the sensitivity, specificity, accuracy and AROC values were 96.23%, 91.67%, 93.29% and 0.9406 respectively. The use of data mining techniques did not improve the detection accuracy in this case, as the results were very similar to those obtained using the original TTSH clinical trial dataset. Also, it can be observed that the feature subset selected by the wrapper method provides equal or higher accuracy compared to that given by the filter-method-based subsets. This reinforces the view that the wrapper method is a better feature selection technique than the filter methods.
5 Conclusion
Overall, this study has demonstrated that the use of classifiers with carefully selected
features does improve the classification accuracy of the BDS device. The most impor-
tant observation is the fact that a feature subset formed just with original and derived
electropotentials is able to provide the best detection accuracy. The LDA/WFS1 tech-
nique that gave this best accuracy did not use the subjective Post-BDS-LOS as a fea-
ture. This conclusion indicates that there is a possibility of designing BDS as an
objective diagnostic device without any dependence on the results of other diagnostic
techniques like mammography and ultrasound for interpretation.
Acknowledgements
The authors would like to thank Dr. Michael J. Antonoplos of Biofield Corp., USA,
for providing the US clinical trial dataset used in this work for comparison with our clinical trial results, and for sharing his views and interest in the work. Support and
permission to publish these findings granted by Mr. David Bruce Hong of The
MacKay Holdings Limited is also gratefully acknowledged.
References
1. Biofield Corp., http://www.biofield.com
2. Biofield Diagnostic System, Physician’s manual, LBL 024 rev. A
3. Cuzick, J., et al.: Electropotential measurements as a new diagnostic modality for breast
cancer. Lancet 352, 359–363 (1998)
4. Dickhaut, M., Schreer, I., Frischbier, H.J., et al.: The value of BBE in the assessment of
breast lesions. In: Dixon, J.M. (ed.) Electropotentials in the clinical assessment of breast
neoplasia, pp. 63–69. Springer, NY (1996)
5. Fukuda, M., Shimizu, K., Okamoto, N., Arimura, T., Ohta, T., Yamaguchi, S., Faupel,
M.L.: Prospective evaluation of skin surface electropotentials in Japanese patients with
suspicious breast lesions. Jpn. J. Cancer Res. 87, 1092–1096 (1996)
6. Gatzemeier, W., Cuzick, J., Scelsi, M., Galetti, K., Villani, L., Tinterri, C., Regalo, L.,
Costa, A.: Correlation between the Biofield diagnostic test and Ki-67 labeling index to
identify highly proliferating palpable breast lesions. Breast Cancer Res. Treat. 76, S114
(2002)
7. Davies, R.J.: Underlying mechanism involved in surface electrical potential measurements
for the diagnosis of breast cancer: an electrophysiological approach to breast cancer. In:
Dixon, J.M. (ed.) Electropotentials in the clinical assessment of breast neoplasia, pp. 4–17.
Springer, Heidelberg (1996)
8. Faupel, M., Vanel, D., Barth, V., Davies, R., Fentiman, I.S., Holland, R., Lamarque, J.L.,
Sacchini, V., Schreer, I.: Electropotential evaluation as a new technique for diagnosing
breast lesions. Eur. J. Radiol. 24, 33–38 (1997)
9. Gallager, H.S., Martin, J.E.: Early phases in the development of breast cancer. Cancer 24,
1170–1178 (1969)
10. Goller, D.A., Weidema, W.F., Davies, R.J.: Transmural electrical potential difference as an
early marker in colon cancer. Arch. Surg. 121, 345–350 (1986)
11. Sacchini, V., Gatzemeier, W., Costa, A., Merson, M., Bonanni, B., Gennaro, M., Zan-
donini, G., Gennari, R., Holland, R.R., Schreer, I., Vanel, D.: Utility of biopotentials
measured with the Biofield Diagnostic System for distinguishing malignant from benign
lesions and proliferative from nonproliferative benign lesions. Breast Cancer Res.
Treat. 76, S116 (2002)
12. United States Patent 6351666 - Method and apparatus for sensing and processing biopoten-
tials, http://www.freepatentsonline.com/6351666.html
13. Detecting outliers (2006),
http://www.utexas.edu/courses/schwab/sw388r7_spring_2006/
SolvingProblems/
14. WEKA data mining software, http://www.cs.waikato.ac.nz/ml/weka/
Clustering of Entropy Topography in Epileptic
Electroencephalography
1 Introduction
Epilepsy represents one of the most common neurological disorders (about 1%
of the world’s population). Two-thirds of patients can benefit from antiepileptic
drugs and another 8% could benefit from surgery. However, the therapy causes side effects and surgery does not always resolve the condition. No adequate treatment is currently available for the remaining 25% of patients. The most disabling aspects of the disease lie in the sudden, unforeseen way in which the seizures arise, leading
to a high risk of serious injury and a severe feeling of helplessness that has a
strong impact on the everyday life of the patient. It is clear that a method ca-
pable of explaining and forecasting the occurrence of seizures could significantly
improve the therapeutic possibilities, as well as the quality of life of epileptic pa-
tients. Epileptic Seizures have been considered sudden and unpredictable events
for centuries. A seizure occurs when a massive group of neurons in the cerebral cortex begins to discharge in a highly organized rhythmic pattern and then,
for mostly unknown reasons, it develops according to some poorly described dy-
namics. Nowadays, there is an increasing amount of evidence that seizures might
be predictable. In fact, as shown by the results reported by different research groups working on epilepsy, seizures do not appear to be completely random and unpredictable events. Thus it is reasonable to wonder when, where and why these
epileptogenic processes start up in the brain and how they result in a seizure.
The goal of the scientific community is to predict and control epilepsy, but if
we aim to control epilepsy we must understand it first: in fact, discovering the
epileptogenic processes would throw a new light on this neurological disease. The
cutting-edge view of epileptic seizures proposed in this paper would dramatically overturn the standard approaches to epilepsy: seizures would no longer be consid-
ered the central point of the diagnostic analysis, but the entire “epileptogenic
process” would be explored. The research in this field, in fact, has been focused
only on epileptic seizures so far. In our opinion, as long as we focus on the seizure
onset and then we try to explain what happened before in a retrospective way,
we will not be able to discover the epileptogenic processes and therefore to fully
understand and control epilepsy, because seizures are only a partial aspect of a
more general problem. Epileptic seizures seem to result from an abnormal syn-
chronization of different areas of the brain, as if a kind of recruitment occurred
from a critical area towards other areas of the brain (not necessarily the focus)
until, the brain can no longer bear the extent of this recruitment and it triggers
the seizure in order to reset this abnormal condition. If this hypothesis is true,
we are not allowed to consider the onset zone the sole triggering factor, but
the seizure appears to be triggered by a network phenomenon that can involve
areas apparently not involved in seizure generation at a standard EEG visual
inspection. Seizures had been considered unpredictable and sudden events until
a few years ago. The scientific community became interested in epileptic seizure prediction during the 1970s: some results in the literature showed that seizures were likely to be a stage of a more general epileptogenic process rather than an unpredictable and sudden event. Therefore a new hypothesis was proposed: the evolution of brain dynamics towards seizures was assumed to follow the transition inter-ictal → pre-ictal → ictal → post-ictal state. This emerging
hypothesis is still under analysis and many studies have been carried out: most
of them have been carried out on intracranial electroencephalographic record-
ings (IEEG). The processes that start up in the brain and lead to seizure are
nowadays mostly unknown: investigating these dynamics from the very begin-
ning, that is minutes and even hours before seizure onset, may throw a new
light on epilepsy and upset the standard diagnosis and treatment protocols.
Many researchers tried to estimate and localize epileptic sources in the brain,
mainly analysing the ictal stage in EEG [1], [2], [3]. If the aim is to detect and
follow the epileptogenic processes, in other words to find patterns of epileptic
source activation, a long-time continuous analysis (from both the spatial and the temporal points of view) of this output in search of any information about the brain system might help. In order to understand the development of epileptic
seizures, we should understand how this abnormal order affects EEG over the
cortex, over time. Thus the point is: how can we measure this order? Thanks to its features, entropy might be the answer; here we propose to use it to investigate the spatio-temporal distribution of order over the cortex. EEG
brain topography is a technique that gives a picture of brain activity over the
cortex. It consists of plotting the EEG in 2-D maps by color coding EEG fea-
tures, most commonly the EEG power. EEG topography has been widely used
as a tool for investigating the activity of epileptic brains but just for analysing
sparse images and not for reconstructing a global picture of the brain behaviour
over time [1], [4], [5], [6], [7], [8], [9], [10], [11], [12]. Here we propose to carry
out a long-time continuous entropy topography in search of patterns of EEG
behaviour. Once entropy is mapped, a spatio-temporal SOM-based clustering is
carried out in order to put in the same cluster the electrodes that share similar
entropy levels.
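A minimal sketch of this kind of clustering is given below: each electrode is described by its entropy values over time and assigned to one of four nodes of a small one-dimensional SOM. The map size, learning schedule and number of iterations are illustrative assumptions, not those used in the study:

import numpy as np

def som_cluster_electrodes(entropy, n_clusters=4, n_iter=500,
                           lr0=0.5, sigma0=1.0, seed=0):
    """entropy: array of shape (n_electrodes, n_samples); returns a cluster
    index (0 .. n_clusters-1) for every electrode."""
    rng = np.random.default_rng(seed)
    n_elec, n_dim = entropy.shape
    weights = rng.standard_normal((n_clusters, n_dim)) * entropy.std() + entropy.mean()
    for t in range(n_iter):
        lr = lr0 * (1 - t / n_iter)                 # decaying learning rate
        sigma = sigma0 * (1 - t / n_iter) + 1e-3    # decaying neighbourhood width
        x = entropy[rng.integers(n_elec)]           # pick one electrode at random
        winner = np.argmin(np.linalg.norm(weights - x, axis=1))
        dist = np.abs(np.arange(n_clusters) - winner)
        h = np.exp(-(dist ** 2) / (2 * sigma ** 2)) # neighbourhood on the 1-D map
        weights += lr * h[:, None] * (x - weights)
    return np.array([np.argmin(np.linalg.norm(weights - e, axis=1))
                     for e in entropy])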
The paper is organized as follows: Section 2 will describe the generation of electroencephalographic potentials from neuronal sources and the genesis of epileptic
seizures, Section 3 will introduce entropy topography and spatial clustering and
Section 4 will report the results.
3 Methodology
3.1 Entropy to Measure “Order” in the Brain
Since we were interested in monitoring the degree of order of different areas of the brain, in this paper we propose to map the Renyi entropy of the EEG [13] and compare it with the mapping of EEG power. Entropy can be interpreted
as a measure of order and randomness. Given a signal, we can think about its
amplitude as a random variable X.
H_Rα(X) = (1 / (1 − α)) log Σ_i P^α(X = a_i)        (1)
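Equation (1) can be evaluated for an EEG segment as in the following sketch, with P estimated from a normalised amplitude histogram; the number of bins and the order α = 2 are illustrative choices, not necessarily those used in this work:

import numpy as np

def renyi_entropy(signal, alpha=2.0, n_bins=64):
    """Renyi entropy of order alpha (alpha != 1) of a 1-D EEG segment, with the
    amplitude distribution P estimated from a normalised histogram."""
    counts, _ = np.histogram(signal, bins=n_bins)
    p = counts / counts.sum()
    p = p[p > 0]                       # empty bins contribute nothing to the sum
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

# example: sliding-window entropy of one channel
# eeg: array of shape (n_channels, n_samples); fs = 256 (patients A and B)
# window = 2 * fs
# h = [renyi_entropy(eeg[0, t:t + window]) for t in range(0, eeg.shape[1] - window, window)]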
figure was captured as a frame of a movie and each frame was associated with the corresponding time, so that it was possible to keep the time information and
therefore to identify the critical stages while reviewing the movie itself. Movies
are available at: http://www.permano.it/research/res-eann2009.php. During the
visualization we followed the topographic trends of the most active areas in terms
of lowest and highest entropy levels, in other words, we were in search of the
electrodes that had been associated with low-entropy or high-entropy values for
the longest time and we called these electrodes the “most active”.
4 Results
4.1 Data Description
The analyzed dataset consists of three EEG recordings with 18 to 20 channels (Figure 1) from three patients: two affected by frontal lobe epilepsy (A, B) and one affected by intractable fronto-temporal epilepsy (C). The EEG was high-pass filtered at 0.5 Hz and the sampling rate was set at 256 Hz for patients A and B and at 200 Hz for patient C. The datasets of patients A and C included one seizure each,
whereas the dataset of patient B included three seizures. Figure 2 shows in 3D
some explanatory frames from the movie of patient B: Figure 2.a shows an inter-
ictal map whereas Figure 2.b-c-d show the ictal maps of the three seizures.
Fig. 1. The international 10-20 system seen from (A) left and (B) above the head. A =
Ear lobe, C = central, Pg = nasopharyngeal, P = parietal, F = frontal, Fp = frontal
polar, O = occipital.
Fig. 2. 3-D plot of entropy spatial distribution of patient B during: (a) interictal stage;
(b) first seizure; (c) second seizure; (d) third seizure. The area associated with low entropy (blue) is the epileptogenic focus.
Fig. 3. Quantification of how often (time %) each electrode (x axis) belonged to the
low-entropy cluster (cluster 1), to the neutral entropy clusters (clusters 2 and 3) and to
the high-entropy cluster (cluster 4) for patient A (a), patient B (b) and patient C (c)
ictal stage. This means that an abnormal coupling among the electrodes that
will be involved in seizure development starts before the seizure itself. We will
now detail the results patient by patient.
Visually reviewing the movie of patient A, we could see that the frontal area had been steadily associated with low entropy. Once we applied the SOM, we realized that this behaviour impacted on electrode clustering: the electrodes of the frontal areas had been clustered together from the very beginning of the recording and were associated with cluster 1, as we can see from Figure 3. In order to take a deeper look into the evolution of clustering throughout the recording, we analysed the trend of cluster 1, visualising the electrodes belonging to it from
the beginning of the recording to the time of seizure onset (see Table 1). As we
can see from Table 1, cluster 1 is clearly dominated by electrodes Fp1, Fp2, F3,
F4 and Fz with an occasional involvement of Cz.
The movie of patient B showed that the frontal area had been steadily associated with low entropy. SOM-based clustering showed that the electrodes of the frontal areas had been clustered together in cluster 4. The trend of cluster 4 is
visualised in Table 2. Cluster 4 is clearly dominated by electrodes Fp1, Fp2, F7,
F8, F4 and T3, thus there is some fronto-temporal involvement, compared to
the analysis of patient A.
The movie of patient C showed that the frontal area had been steadily associated with high entropy. SOM-based clustering showed that the electrodes of the frontal areas had been clustered together in cluster 1. The trend of cluster 1 is visualised in Table 3. Cluster 1 is dominated by electrodes Fp1, Fp2, F7, F8, F4 and F3; there is some fronto-temporal involvement for this patient too.
5 Conclusions
was introduced to study this abnormal behaviour of the neuronal sources. Renyi’s
entropy was proposed to measure the randomness/order of the brain. Three
EEG datasets from patients affected by partial epilepsy were analysed. The spatial distribution of Renyi entropy showed a clear relationship with the region of seizure onset. Entropy mapping was compared with standard power mapping, which was much less stable and selective. Thus we can infer that entropy seems to be a possible window on the synchronization of epileptic neural sources. These preliminary results are qualitative; we will pursue a more quantitative description with a larger number of experiments, evaluating the statistical significance of
the observed behavior. Future research will be devoted to the analysis of the
EEG of normal subjects in order to carry out a comparison with the entropy
mapping of normal EEG. Furthermore, the analysis will be extended to many
other epileptic patients and patterns of entropy activation will be investigated.
Acknowledgements
The authors would like to thank the doctors of the Epilepsy Regional Center
of the Riuniti Hospital of Reggio Calabria (Italy) for their insightful comments
and suggestions.
References
1. Im, C.-H., Jung, H.-K., Jung, K.-Y., Lee, S.Y.: Reconstruction of continuous and
focalized brain functional source images from electroencephalography. IEEE Trans.
on Magnetics 43(4), 1709–1712 (2007)
2. Im, C.-H., Lee, C., An, K.-O., Jung, H.-K., Jung, K.-Y., Lee, S.Y.: Precise estima-
tion of brain electrical sources using anatomically constrained area source (acas)
localization. IEEE Trans. on Magnetics 43(4), 1713–1716 (2007)
3. Knutsson, E., Hellstrand, E., Schneider, S., Striebel, W.: Multichannel magnetoen-
cephalography for localization of epileptogenic activity in intractable epilepsies.
IEEE Trans. on Magnetics 29(6), 3321–3324 (1993)
4. Sackellares, J.C., Iasemidis, L.D., Gilmore, R.L., Roper, S.N.: Clinical application
of computed eeg topography. In: Duffy, F.H. (ed.) Topographic Mapping of Brain
Electrical Activity, Boston, Butterworths (1986)
5. Nuwer, M.R.: Quantitative eegs. Journal of Clinical Neurophysiology 5, 1–86 (1988)
6. Babiloni, C., Binetti, G., Cassetta, E., Cerboneschi, D., Dal Forno, G., Del Percio,
C., Ferreri, F., Ferri, R., Lanuzza, B., Miniussi, C., Oretti, D.V.M., Nobili, F.,
Pascual-Marqui, R.D., Rodriguez, G., Romani, G., Salinari, S., Tecchio, F., Vitali,
P., Zanetti, O., Zappasodi, F., Rossini, P.M.: Mapping distributed sources of corti-
cal rhythms in mild alzheimer’s disease. a multicentric eeg study. NeuroImage 22,
57–67 (2004)
7. Miyagi, Y., Morioka, T., Fukui, K., Kawamura, T., Hashiguchi, K., Yoshida, F.,
Shono, T., Sasaki, T.: Spatio-temporal analysis by voltage topography of ictal elec-
troencephalogram on mr surface anatomy scan for the localization of epileptogenic
areas. Minim. Invasive Neurosurg. 48(2), 97–100 (1988)
8. Ebersole, J.S.: Defining epileptic foci: past, present, future. Journal of Clinical
Neurophysiology 14, 470–483 (1997)
462 N. Mammone et al.
9. Scherg, M.: From eeg source localization to source imaging. Acta Neurol.
Scand. 152, 29–30 (1994)
10. Tekgul, H., Bourgeois, B.F., Gauvreau, K., Bergin, A.M.: Electroencephalography
in neonatal seizures: comparison of a reduced and a full 10/20 montage. Pediatr.
Neurol. 32(3), 155–161 (2005)
11. Nayak, D., Valentin, A., Alarcon, G., Garcia Seoane, J.J., Brunnhuber, F., Juler,
J., Polkey, C.E., Binnie, C.D.: Characteristics of scalp electrical fields associated
with deep medial temporal epileptiform discharges. Journal of Clinical Neurophys-
iology 115(6), 1423–1435 (2004)
12. Skrandies, W., Dralle, D.: Topography of spectral eeg and late vep components
in patients with benign rolandic epilepsy of childhood. J. Neural Transm. 111(2),
223–230 (2004)
13. Hild II, K.E., Erdogmus, D., Principe, J.C.: On-line minimum mutual information
method for time varying blind source separation. In: 3rd International Conference
on Independent Component Analysis And Blind Signal Separation, pp. 126–131
(2001)
14. Delorme, A., Makeig, S.: Eeglab: an open source toolbox for analysis of single-trial
eeg dynamics including independent component analysis. Journal of Neuroscience
Methods 134, 9–21 (2004)
15. Kohonen, T.: Self-Organizing Maps. Series in Information Sciences, vol. 30.
Springer, Heidelberg (1995)
Applying Snap-Drift Neural Network to Trajectory Data
to Identify Road Types: Assessing the Effect of
Trajectory Variability
Abstract. Earlier studies have shown that it is feasible to apply an ANN to categorise user-recorded trajectory data such that the travelled road types can be revealed. This approach can be used to automatically detect, classify and report new roads and other road-related information to GIS map vendors based on users' travel behaviour. However, the effect of trajectory variability caused by varying road traffic conditions on the proposed approach was not presented; this is addressed in this paper. The results show that the variability encapsulated within the dataset is important for this approach since it aids the categorisation of the road types. Overall, the SDNN achieved a categorisation accuracy of about 71% for the original dataset and 55% for the variability-pruned dataset.
1 Introduction
Due to increasing traffic density, new roads are being constructed and existing ones
modified, thus digital GIS road network databases are rarely up-to-date. There is a
need to implement methods that would readily capture road changes, classify feature
types and insert them into the database such that users of GIS map-based applications
would have up-to-date maps. In [1] a solution that addresses this problem of automated road network updating by applying the Snap-Drift Neural Network (SDNN) to user-recorded trajectory data is presented. This solution can be incorporated into Location-Based
Service (LBS) applications dependent on Global Positioning System (GPS) and digi-
tal Geographical Information System (GIS) road network (e.g. in-car navigation sys-
tem), such that when users encounter possible new road segments (departure from the
known roads in the database), the on-board device would record such deviations. This
data would be processed by the SDNN as discussed in this paper or transferred back
to the Sat Nav provider and input into their SDNN along with similar track data pro-
vided by other service users, to decide whether or not to automatically update (add)
the “unknown road” to the road database. In this way, users of applications dependent
on road network would be provided with unified and sustainable platform to auto-
matically perform road change update. Also, possible locations of new roads are
pinpointed to road database vendors for further investigation. However, only the performance of the unsupervised SDNN was assessed in [1]. In this paper, the effect of the road traffic variability encapsulated in the recorded trajectory data on the SDNN categorisation is assessed. This would further inform on the suitability of using user-recorded trajectory data to update road network databases.
The following sections of this paper are organised as follows: in Section 2, an
overview of related work on trajectory analysis is given. This is followed in Section 3
by an overview of the SDNN. In Section 4, the data collection is described, followed by the derivation of road design parameters from the GPS data in Section 5. In Section 6, the trajectory variability analysis is presented, followed by the data types in Section 7. The results and performance of the SDNN are presented in Section 8 and Section 9 presents the conclusions.
‘episodes’ - discrete time periods for which a user’s spatio-temporal behaviour was relatively homogeneous. Episodes are discovered by analysing breakpoints, which they identified as temporal/spatial jumps and rapid changes in direction or speed. Other works have focused on using users’ trajectory data to predict users’ future locations by analysing their historical [7] or current [8] trajectory patterns. These approaches could be used for on-the-fly data pruning in anticipation of a user request in Location-Based Services (LBS).
GPS trajectory data have not been studied for the purpose of identifying road segments that might need to be updated. In this study, the concept that trajectory information is an abstraction of user movement is exploited. The characteristics of this movement would in most cases be influenced by the road type or road feature the user is travelling on. A comparison of a GPS trail with the actual road map would always reveal some level of conformance between the two datasets. As such, analysing GPS-recorded trajectory data using an Artificial Neural Network should group or classify the trajectory data to reveal the road information or road type the user was travelling on. Hence, for the road update problem, candidate roads that might need to be updated in a GIS road network database could be identified and updated. This is the principal concept in this study. Earlier work has shown that the SDNN is able to categorise recorded trajectory data to reflect the travelled roads [1]; this paper investigates the effect of the road traffic variability encapsulated in recorded trajectory data on the SDNN categorisation.
On presentation of input data patterns at the input layer F1, the distributed SDNN
(dSDNN) will learn to group them according to their features using snap-drift [15].
The neurons whose weight prototypes result in them receiving the highest activations
are adapted. Weights are normalised so that in effect only the angle of the weight
vector is adapted, meaning that a recognised feature is based on a particular ratio of
values, rather than absolute values. The output winning neurons from dSDNN act as
input data to the selection SDNN (sSDNN) module for the purpose of feature group-
ing and this layer is also subject to snap-drift learning.
The learning process is unlike error minimisation and maximum likelihood
methods in MLPs and other kinds of neural networks which perform optimization for
classification, or equivalents, by for example pushing features in the direction that
minimizes error. Such approaches do not have any requirement for the feature to be
statistically significant within the input data. In contrast, SDNN toggles its learning
mode to find a rich set of features in the data and uses them to group the data into
categories. The following is a summary of the steps that occur in SDNN [14]:
Step 1: Initialise parameters: (snap = 1, drift = 0), era = 2000
Step 2: For each epoch (t)
        For each input pattern
    Step 2.1: Find the D (D = 10) winning nodes at F21 with the largest net input
    Step 2.3: Weights of the dSDNN adapted according to the alternative learning procedure: (snap, drift) becomes Inverse(snap, drift) after every successive epoch
Step 3: Process the output pattern of F21 as the input pattern of F12
    Step 3.1: Find the node at F12 with the largest net input
    Step 3.2: Test the threshold condition:
        IF (the net input of the node is greater than the threshold)
        THEN
            Weights of the sSDNN output node adapted according to the alternative learning procedure: (snap, drift) becomes Inverse(snap, drift) after every successive epoch
        ELSE
            An uncommitted sSDNN output node is selected and its weights are adapted according to the alternative learning procedure: (snap, drift) becomes Inverse(snap, drift) after every successive epoch
The snap-drift learning algorithm combines Learning Vector Quantisation (drift) and
pattern intersection learning (snap) [16]. The top-down learning of the neural system
which is a combination of the two forms of learning, is as follows:

W_Ji(new) = α (I ∩ W_Ji(old)) + σ (W_Ji(old) + β (I − W_Ji(old)))        (1)

where W_Ji = top-down weight vectors, I = binary input vectors, α = the snap constant, σ = the drift constant and β = the drift speed constant = 0.5. In successive learning epochs, the learning is toggled between the two modes of learning. When snap = 1 and drift = 0, fast, minimalist (snap) learning is invoked, causing the top-down weights to reach their new asymptote on each input presentation, and (1) is simplified as:

W_Ji(new) = I ∩ W_Ji(old)        (2)

This learns sub-features of patterns. In contrast, when drift = 1 and snap = 0, (1) simplifies to:

W_Ji(new) = W_Ji(old) + β (I − W_Ji(old))        (3)

After each update the weights are normalised:

W_Ji(new) = W_Ji(new) / |W_Ji(new)|        (4)
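A minimal sketch of the update in equations (1)-(4), toggling between the snap and drift modes on successive epochs, is given below; binary input vectors are assumed, as above, and the helper name is ours:

import numpy as np

def snap_drift_update(W, I, snap, beta=0.5):
    """One top-down weight update for a winning node.
    W: current weight vector (values in [0, 1]); I: binary input vector."""
    if snap:                       # (2): pattern intersection (minimalist learning)
        W_new = np.minimum(W, I)   # elementwise intersection for binary/bounded vectors
    else:                          # (3): drift towards the input (LVQ-like)
        W_new = W + beta * (I - W)
    return W_new / (np.linalg.norm(W_new) + 1e-12)   # (4): keep only the angle

# toggle the learning mode after every epoch, as described above
# snap = (epoch % 2 == 0)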
group collected points to reveal travelled road segments. Relying only on the winning
node, a grouping accuracy of about 71% is achieved compared to 51% from Learning
Vector Quantisation (LVQ). On analysis and further experimentation with the SDNN
d-nodes an improved grouping accuracy was achieved, but with a high count of
unique d-node combinations.
4 Data Collection
GPS recorded trajectory data were gathered from a 31.2 km drive over a range of road
types in London (Fig. 2). The GPS data were collected using Garmin’s Etrex Vista
with a reported accuracy (95% confidence interval) of 5.8 m during the day and 2.1 m at night [19]. The GPS points were collected every 5 seconds. Voice data were also
concurrently collected noting road segment characteristics that could affect the re-
corded data; like stops at junctions, traffic lights, GPS carrier loss and other delays.
The voice data were used to identify collected points features that do not match any
road related features from the Ordnance Survey (OS) MasterMap ITN layer [20].
Seven days of data (Table 1) were collected to study the effect of traffic variation on the proposed approach. One dataset was collected each day, covering different times ranging from 7 am till 10 pm. Table 1 shows the summary of the data collection campaign. The Day 1 data were collected on a Sunday with less traffic; thus the fewest points were recorded, with the shortest journey time and the highest average speed. In contrast, 976 GPS points were recorded on Saturday and the trip took about 1 hour and 22 minutes at an average speed of 21 km/h.
It is important to note here that the variation in the data collection summary (Table 1) is due to the traffic conditions on the travelled roads. For instance, the time window of the Day-7 (Saturday) data collection coincided with when people were out shopping, so many of the points were recorded during traffic delays on roads serving shopping malls and at their car parks. To capture these variations, most of the routes were travelled more than once. Using the OS road type naming convention, roundabout features are normally part of other road classes, say A roads, local streets or minor roads. But for our purpose we treat roundabout features and points collected at traffic light stops as unique classes, considering the fact that we are grouping the trajectory data based on geometry information between successive points. GPS data points were collected from 7 road types and road-related features, namely: A road, local street, minor road, private road, roundabout, car park and traffic light.
following constraints as: the person’s capabilities for trading time for space in move-
ment; access to private and public transport; the need to couple with others at particu-
lar locations for given durations; and the ability of public or private authorities to
restrict physical presence from some locations in space and time. How different indi-
viduals negotiate these constraints while trading time for space influences a road
traffic condition. Here the locations of travel pattern variability within the test site are
sought with a view to eliminating or reducing the variability. To inform on the effect
of traffic variability on the research approach, the performance of the SDNN with the original data is compared with its performance with the variability removed or reduced. Point density analysis was carried out to determine clusters (areas of variability) in the recorded dataset. This was achieved using the kernel density tool implemented in ArcGIS 9 [22].
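A comparable point-density analysis can be sketched outside ArcGIS with a Gaussian kernel density estimate over the recorded GPS coordinates, thresholded to flag high-density (likely variability) cells; the grid resolution and the quantile threshold are illustrative assumptions:

import numpy as np
from scipy.stats import gaussian_kde

def variability_cells(xs, ys, grid_size=200, density_quantile=0.95):
    """Return grid coordinates and a boolean mask of high-density (likely
    variability) cells from the recorded GPS point coordinates."""
    kde = gaussian_kde(np.vstack([xs, ys]))
    gx = np.linspace(xs.min(), xs.max(), grid_size)
    gy = np.linspace(ys.min(), ys.max(), grid_size)
    GX, GY = np.meshgrid(gx, gy)
    density = kde(np.vstack([GX.ravel(), GY.ravel()])).reshape(GX.shape)
    return GX, GY, density > np.quantile(density, density_quantile)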
In total, 18 variability corridors were found in the test site. The result of a spatial summation of all the trajectory variability locations for each data collection day is shown in Fig. 3. This summation reveals the likely trajectory variation locations (travel delays) in the study area; these are represented by the green shades in the figure. Overall, most of the variability was found around traffic lights, traffic calming measures or temporary obstructions, road entries/exits and on road segments linking shopping malls. Comparison of the total number of GPS points recorded each day with the number of GPS points recorded in the mapped variability corridors reveals that these variations represent a significant amount (almost 50%) of the total GPS points recorded each data collection day. For instance, in Day-1 673 GPS points were recorded and 327 of these points reside in the variability corridors. Similarly, in Day-2 819 GPS points were recorded and 428 were found in the variability corridors; in Day-7 976 points were recorded and more than half (563 points) reside in the variability corridors.
A method was implemented that reduces the recorded GPS points within a trajectory variability corridor while preserving the abstraction (geometry) representing the shape of the travelled road segments. The steps are shown below:
1. Calculate the number of points in a trajectory variability corridor (TVC)
2. Calculate the total length covered by the points in the TVC
3. Find the speed of the last GPS point recorded before the TVC
4. Using the speed-distance formula, with t = 5 s and the constant speed found in step 3, derive the new equal spacing at which new points are placed
5. Get the x, y coordinates of the newly created points.
The number of representative points placed in a TVC depends on the length covered by the points in the TVC and on the speed of the last GPS point before the TVC was encountered.
Consequently, recorded GPS points in the TVC were replaced with the newly created
points and the variables recalculated for the new dataset.
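A minimal sketch of the point-reduction steps listed above is given below; planar x, y coordinates and the fixed 5 s logging interval are assumed, and the helper name is ours:

import numpy as np

def prune_tvc_points(tvc_xy, speed_before, dt=5.0):
    """Replace the GPS points recorded inside a trajectory variability corridor
    (TVC) with equally spaced points, keeping the corridor geometry.
    tvc_xy: (n, 2) array of recorded points in the TVC;
    speed_before: speed of the last point recorded before the TVC."""
    seg = np.diff(tvc_xy, axis=0)
    seg_len = np.hypot(seg[:, 0], seg[:, 1])
    total_len = seg_len.sum()                       # step 2: length covered in the TVC
    spacing = speed_before * dt                     # step 4: speed-distance formula, t = 5 s
    n_new = max(int(total_len // spacing), 1)
    cum = np.concatenate([[0.0], np.cumsum(seg_len)])
    targets = np.linspace(0.0, total_len, n_new + 1)
    new_x = np.interp(targets, cum, tvc_xy[:, 0])   # step 5: x, y of the new points
    new_y = np.interp(targets, cum, tvc_xy[:, 1])
    return np.column_stack([new_x, new_y])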
7 Data Types
Two trajectory datasets are presented to the SDNN following the trajectory variability analysis; these are:
1. The original recorded GPS trajectory dataset with its derived variables, now referred to as dataset 1
2. The variability-pruned GPS trajectory dataset with its derived variables, now referred to as dataset 2; for this dataset there is a reduction in the number of data points.
For both datasets, 7 variables represented by separate fields in the input vector were presented to the SDNN. These are the speed, the rates of acceleration and deceleration, the radii of horizontal and vertical curvature, the change in direction and the sinuosity. The training and test patterns were presented in a random order to the SDNN. This simulates the real-world scenario whereby travelled road patterns measured by GPS-based trajectory data vary depending on the travel speed, the geometry of the road and the nature of the road, such that a given road type might be repeatedly encountered while others are not encountered at all.
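Some of these variables can be derived from successive GPS fixes as in the following sketch (speed, acceleration, change in direction and sinuosity are shown; the curvature radii are omitted for brevity, and planar coordinates with the 5 s logging interval are assumed):

import numpy as np

def trajectory_variables(xy, dt=5.0):
    """xy: (n, 2) planar GPS coordinates logged every dt seconds."""
    d = np.diff(xy, axis=0)
    step = np.hypot(d[:, 0], d[:, 1])
    speed = step / dt                                   # m/s between fixes
    accel = np.diff(speed) / dt                         # rate of (de)acceleration
    heading = np.arctan2(d[:, 1], d[:, 0])
    turn = np.abs(np.diff(np.unwrap(heading)))          # change in direction (rad)
    straight = np.hypot(*(xy[-1] - xy[0]))
    sinuosity = step.sum() / max(straight, 1e-9)        # path length / straight-line distance
    return speed, accel, turn, sinuosity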
8 Results
Fig. 4 shows a comparative plot of the SDNN and dSDNN categorisations across the
road patterns for dataset 1. From this plot it can be seen that the SDNN is able to correctly categorise the majority of the patterns except for the private and minor road patterns. Most of the minor road patterns were categorised as local street patterns by the SDNN and the dSDNN, suggesting that the minor road patterns share similar features with the local street patterns. In reality, minor road attributes like the speed limit do overlap with those of local streets, and as such it is challenging for the SDNN to correctly group features of these road types. Overall, an improved categorisation is achieved with the dSDNN. Private road patterns were found on the dSDNN but wrongly generalised by the SDNN as local street and A road patterns. Using a simple winner-take-all, a categorisation accuracy of 60% is achieved by the SDNN and 71% by the dSDNN.
Fig. 4. Comparative plot of SDNN categorisation for the winning (SDNN) and the d-nodes
(dSDNN) for dataset 1
Fig. 5. Comparative plot of SDNN categorisation for the winning (SDNN) and the d-nodes
(dSDNN) for dataset 2
Fig. 5 shows a comparative plot of the different categorisation methods across the different road class patterns for the variability-removed dataset (dataset 2). From this plot it is evident that the majority of the road classes are categorised as either local street or A road patterns, while other patterns are not detected by the SDNN or dSDNN except for the car park patterns. This result shows that the variability within the dataset holds significant information that enables the SDNN to categorise the majority of the dataset to reflect the travelled road types. Hence, removing this information results in a categorisation where all patterns are grouped either as local street or A road patterns. A categorisation accuracy of 50% is achieved by the SDNN and 55% by the dSDNN for this dataset.
Fig. 6 shows a comparative plot of both dataset categorisations across the SDNN and the dSDNN nodes. It can be seen that the categorisation performance of the SDNN (and dSDNN) with dataset 1 is better than with dataset 2 (Fig. 6). For instance, all the road type patterns were detected by the SDNN and dSDNN except for private road for dataset 1. For dataset 2, only the car park, local street and A road patterns were detected by the SDNN, while only the car park, traffic light, local street and A road patterns were detected by the dSDNN. With dataset 2, the traffic light and roundabout patterns were poorly categorised, suggesting that the variability in the recorded dataset is associated with these features; hence removal of the variability also reduces the chance of these patterns being detected by the SDNN. These results also show that without the variability information, most of the roads are categorised as either A roads or local streets. Overall, it can be seen that the performance of the SDNN for the variability-pruned dataset (dataset 2) is poor compared to dataset 1. It is concluded that the variability information encapsulated by the recorded trajectory data is useful in grouping the trajectory data to reveal the travelled roads.
9 Conclusion
Unsupervised SDNN is able to categorise (classify) recorded trajectory information
such that travelled road information like road type can be revealed. This is because
the SDNN adopts a learning process that performs a real-time combination of fast, convergent, minimalist learning (snap) and more cautious learning (drift) to capture both precise sub-features in the data and more general holistic features, and is able to adapt rapidly in a non-stationary environment where new patterns (new candidate road attributes in this case) are introduced over time. Also, it can be concluded that the variability encapsulated in recorded trajectories caused by varying road traffic conditions is useful for identifying different road types when applying the SDNN to trajectory data for the purpose of travelled road class identification. Artificial elimination or reduction of this variability significantly reduces the chances of correctly identifying travelled road classes from recorded trajectory data.
Acknowledgment
The authors gratefully acknowledge the Ordnance Survey for provision of MasterMap
coverages. All road centreline data in Figure 2 are Crown Copyright.
References
1. Ekpenyong, F., Palmer-Brown, D., Brimicombe, A.: Updating of Road Network Data-
bases: Spatio-temporal Trajectory Grouping Using Snap-Drift Neural Network. In: 10th
International Conference on Engineering Applications of Neural Networks, 2007, Thessa-
loniki, Hellas, August 29–31(2007)
2. Yanagisawa, Y., Akahani, J.-i., Satoh, T.: Shape-based similarity query for trajectory of
mobile objects. In: Chen, M.-S., Chrysanthis, P.K., Sloman, M., Zaslavsky, A. (eds.)
MDM 2003. LNCS, vol. 2574, pp. 63–77. Springer, Heidelberg (2003)
3. Mountain, D.M.: An investigation of individual spatial behaviour and geographic filters
for information retrieval, Department of Information Science, City University, London
(2005)
4. Laurini, R., Thompson, D.: Geometries in fundamentals of spatial information systems.
Academic Press Ltd., London (1992)
5. Hwang, J.-R., Kang, H.-Y., Li, K.-J.: Spatio-temporal similarity analysis between trajecto-
ries on road networks. In: Akoka, J., Liddle, S.W., Song, I.-Y., Bertolotto, M., Comyn-
Wattiau, I., van den Heuvel, W.-J., Kolp, M., Trujillo, J., Kop, C., Mayr, H.C. (eds.) ER
Workshops 2005. LNCS (LNAI and LNBI), vol. 3770, pp. 280–289. Springer, Heidelberg
(2005)
6. Mountain, D.M., Raper, J.: Modelling human spatio-temporal behaviour: A challenge for
location-based services. In: 6th International Conference of GeoComputation, University
of Queensland, Brisbane, Australia, September 24- 26 (2001),
http://www.geocomputation.org/2001/papers/mountain.pdf
(Accessed January 12, 2006)
7. Liu, X., Karimi, H.A.: Location awareness through trajectory prediction. Computers, Envi-
ronment and Urban Systems 30(6), 741–756 (2006)
8. Brimicombe, A., Li, Y.: Mobile Space-Time Envelopes for Location-Based Services.
Transactions in GIS 10(1), 5–23 (2006)
9. Barsi, A., Heipke, C., Willrich, F.: Junction Extraction by Artificial Neural Network Sys-
tem - JEANS. International Archives of Photogrammetry and Remote Sensing 34, Part 3B,
18–21 (2002)
484 F. Ekpenyong and D. Palmer-Brown
10. Jwo, D.J., Lai, C.C.: Neural network-based geometry classification for navigation satellite
selection. Journal of Navigation 56(2), 291 (2003)
11. Winter, M., Taylor, G.: Modular neural networks for map-matched GPS positioning. In:
IEEE Web Information Systems Engineering Workshops (WISEW 2003), December 2003,
pp. 106–111 (2003)
12. Jwo, D.J., Lai, C.C.: Neural network-based GPS GDOP approximation and classification.
GPS Solutions 11(1), 51–60 (2007)
13. Lee, S.W., Palmer-Brown, D., Roadknight, C.M.: Performance-guided neural network for
rapidly self-organising active network management. Neurocomputing 61, 5 (2004)
14. Lee, S.W., Palmer-Brown, D.: Phonetic Feature Discovery in Speech Using Snap-Drift
Learning. In: Kollias, S.D., Stafylopatis, A., Duch, W., Oja, E. (eds.) ICANN 2006. LNCS,
vol. 4132, pp. 952–962. Springer, Heidelberg (2006)
15. Lee, S.W., Palmer-Brown, D.: Phrase recognition using snap-drift learning algorithm. In:
The International Joint Conference on Neural Networks (IJCNN 2005), Montreal,
Canada, July 31 - August 4 (2005)
16. Kohonen, T.: Improved versions of learning vector quantization. In: International Joint
Conference on Neural Networks, vol. 1, pp. 545–550 (1990)
17. Donelan, H., Pattinson, C., Palmer-Brown, D.: The Analysis of User Behaviour of a Net-
work Management Training Tool using a Neural Network. Systemics, Cybernetics and In-
formatics 3(5), 66–72 (2006)
18. Lee, S.W., Palmer-Brown, D.: Phrase Recognition using Snap-Drift Learning Algorithm.
In: The International Joint Conference on Neural Networks (IJCNN 2005), Montreal, Can-
ada, July 31- August 4 (2005)
19. Mehaffey, J., Yeazel, J.: Receiver WAAS On and Off, 30-Minute Tests in the Open,
Night and Day (2002), http://gpsinformation.net/waas/vista-waas.html
(Access on May 10, 2007)
20. Ordnance Survey, OS ITN Layer Dataset. Ordnance Survey, Great Britain (2006)
21. Miller, H.J.: A Measurement Theory for Time Geography. Geographical Analysis 37(1),
17–45 (2005)
22. ArcGIS User’s Guide, ESRI, Inc. (2007)
Reputation Prediction in Mobile Ad Hoc Networks
Using RBF Neural Networks
Abstract. Security is one of the major challenges in the design and implementa-
tion of protocols for mobile ad hoc networks (MANETs). ‘Cooperation for cor-
porate well-being’ is one of the major principles being followed in current
research to formulate various security protocols. In such systems, nodes estab-
lish trust-based interactions based on their reputation which is determined by
node activities in the past. In this paper we propose the use of a Radial Basis
Function-Neural Network (RBF-NN) to estimate the reputation of nodes based
on their internal attributes as opposed to their observed activity, e.g., packet
traffic. This technique is conducive to prediction of the reputation of a node
before it portrays any activities, for example, malicious activities that could be
potentially predicted before they actually begin. This renders the technique fa-
vorable for application in trust-based MANET defense systems to enhance their
performance. In this work we were able to achieve an average prediction per-
formance of approximately 91% using an RBF-NN to predict the reputation of
the nodes in the MANET.
1 Introduction
A Mobile ad-hoc Network (MANET) is a collection of mobile nodes connected to
each other through a wireless medium dynamically forming a network without the use
of centralized control or existing infrastructure. It is this type of network that is being
researched and is projected to be applied in urgent and critical situations like medical emergencies, disaster relief and military combat arenas. The sensitivity of such appli-
cations and the possible lack of an alternate communications path make these net-
works attractive to cyber attacks [1]. According to a recent DARPA BAA [2], among
all of the cyber threats that exist, one of the most severe is expected to be worms with
arbitrary payload, which can saturate and infect MANETs on the order of seconds.
Some examples of the mobile devices that can reside at the nodes in the MANET are
workstations, sensors, handheld devices, laptop computers, etc.
Maintaining the integrity of the network and each individual node in a MANET is
challenging due to the lack of central authority in the network. Preserving this integrity,
however, is critical for carrying out missions in a military or disaster relief scenario.
Although, for example, a computer at a node can be fully tested and configured with the
latest “known good” patches (and therefore assumed to be free from malicious code),
this known good state might be compromised in the field when a sudden change in the
mission or environment could require the users to install new applications or modify
certain settings “on-the-fly.”
Assuming the presence of a trusted monitoring component in the kernel of the op-
erating system allows a reliable measurement of the state of the computer; however,
in itself such a trusted component might be unable to detect the presence of certain
malicious code and stop the activities. Moreover, even if the malicious code can be
detected, it is often too late: by the time of the detection, the malicious code could
have compromised the system in an unknown manner as well as spread to other sys-
tems. Also, detecting attacks from remote hosts is useful as those hosts can be iso-
lated, preventing further damage. The mere fact of detecting an attack, however, does
not give insight into the vulnerability that allowed the malicious compromise, thus
limiting the system to reactive rather than proactive defense.
Intuitively, it should be possible for these trusted components to cooperate and
share information about themselves and each other. Correlating the known state of
the computers with their reputation relating to malicious activities can be used to
identify vulnerable computers based on their states even before they engage in mali-
cious activity, thereby keeping “a step ahead” of the malicious code. As we will
demonstrate in this paper, although some compromise is inevitable before the system
learns the vulnerable state, the overall damage can be well contained.
In this work we use a Radial-Basis Function Neural Network (RBF-NN) [3] at
each node to perform the “step ahead,” prediction of the node’s reputation, i.e., proac-
tive defense from malicious activities. As explained above, relying on the detection
of malicious activity by other nodes or monitoring systems to assign reputations
scores is not adequate since malicious code in compromised nodes might wait only a
few time steps before it starts attacking other nodes [4]. Moreover, we want to predict
a potential compromise before the malicious activity is well underway.
An attribute vector is computed for each network node. The vector contains 10
numeric values related to crucial status indicators of processes and physical character-
istics associated with nodal activity. An RBF-NN at each node maps the attribute
vector of that particular node to its reputation score so that trustworthiness of each
node can be estimated earlier than it would be determined by monitoring the behavior
of the node. Figure 1 illustrates a comparison between behavior monitoring systems
and our proposed RBF-NN reputation prediction system. In the figure, the node be-
comes compromised at time step n = nk and initiates its malicious activities at n = nl
(nl > nk). Behavior monitoring defense systems detect the compromise at n=nl+1
while the RBF-NN predictor can detect the compromise at n = nk + 1.
The remainder of this paper is organized as follows. Section 2 gives information on
related work and Section 3 details how the network and the nodes are modeled in this
study. Section 4 gives a brief explanation of the simulation details, including the
RBF-NN predictor training. Simulation results are presented in Section 5. Practical considerations for the implementation of the proposed system are presented in Section 6, and finally, Section 7 concludes the paper with
suggestions for future research directions.
2 Related Work
Marti et al. [5] proposed the use of trust-based systems in the selection of next hop for
routing. To ascertain the reliability of a node, the node monitors neighboring nodes to
determine if it has been cooperative in forwarding other nodes’ traffic. If a node has
been cooperative, then it has a good reputation and its neighbors are inclined to for-
ward packets to this node. In this way, packets are diverted (or directed away) from
misbehaving nodes. Moreover, Buchegger and Le Boudec [6], [7] proposed and ana-
lyzed the CONFIDANT protocol, which detects and isolates misbehaving nodes.
Similarly, they designed their protocol so that trust relationships and routing decisions are made through experience, based on the observed or reported behavior of other nodes. In [8] the authors analyzed the properties of the mechanisms that mobile
nodes use to update and agree on the reputation of other mobile nodes. They suggest
that mobile nodes can evaluate the reputation of other nodes based on both direct
observation and the reputation propagation algorithm that they formulated in their
research. The pitfalls of such systems involve relying on the detection of abnormal
and possibly malicious activity to activate the necessary countermeasures, that is,
operating reactively as opposed to proactively.
In Zheng et al. [9] it is discussed that malicious code is typically some type of coded computer program and, under certain conditions within the computer's hardware and software, it can break out and infect other code, compromise information and even possibly destroy certain devices within the system. These conditions can
be modeled by defined combinations of states related to different features at the
nodes. Saiman [10] lists features at the nodes that determine the reputation of the
nodes within the network. They are classified as performance metrics evaluation
features and quantitative trust value features. In the first category, features have dif-
ferent states assigned with certain numerical values. In the second category, the fea-
tures are assigned values by mathematical evaluation or physical measurements.
This paper proposes a system which can be applied in any reputation-based or
trust-based system. It enhances their performance by being able to determine the
reputation value faster than it would be determined by simple behavior monitoring.
4 Simulation Details
From the 50 nodes that were simulated according to the model presented in Section 3,
14 of them were compromised and displayed some level of malicious activity, i.e.,
their reputation fluctuated between 0 and 1. All of the remaining nodes had a reputation of 0 (trusted) during the simulation run. Each node had 10 attributes (settings) that
can assume any integer from 0 to 10. As previously mentioned, the vector containing
the numeric values that represent the “state” of the node is referred to as the attribute
vector. The RBF-NN predictor is used to estimate the reputation of a node given the
attribute vector at each time step. MATLAB® was used to develop the simulation.
The steps involved in training and testing the RBF-NN are now explained. A detailed explanation of RBF-NNs is given in [3].
Let the matrix $A_i \in \mathbb{Z}^{p \times q}$ (a matrix of integers), where $p = 1000$ and $q = 10$, contain
the attribute row vectors generated for the $i$th node over 1000 time steps. The vector
$R_i \in \mathbb{R}^{p \times 1}$ ($p = 1000$) contains the assigned reputation values for each of the 1000 time
steps.
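As an illustration of these data structures only, a minimal MATLAB sketch might construct placeholder versions of $A_i$ and $R_i$ as shown below; the random draws are purely hypothetical and do not reproduce the attribute/reputation model of Section 3.

    % Placeholder construction of the attribute matrix Ai and reputation vector Ri
    % for a single node (hypothetical random data, not the model of Section 3).
    p  = 1000;                      % number of time steps
    q  = 10;                        % number of attributes per node
    Ai = randi([0 10], p, q);       % p x q matrix of integer attribute row vectors
    Ri = rand(p, 1);                % p x 1 vector of reputation values in [0, 1]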
The following steps were followed to train, test, and calibrate the neural network (a condensed sketch of these steps is given after the list):
Step 2: Select 60% of the rows in $A_i$ to form $A_{i,train}$ and the remaining 40% to form $A_{i,test}$. Similarly, $R_i$ is decomposed into $R_{i,train}$ and $R_{i,test}$.
Step 3: Train the RBF-NN using $A_{i,train}$ with a spread parameter of 300 (determined experimentally).
Step 4: Sort the test data such that $R_{i,test}$ increases monotonically ($A_{i,test}$ is reordered accordingly). Generate the reputation vector estimate $\hat{R}_{i,test}$ by presenting $A_{i,test}$ to the RBF-NN trained in Step 3.
Step 5: Threshold $R_{i,test}$ so that any value greater than 0.5 is rounded to 1; otherwise it is set to 0.
Step 6: Use a Receiver Operating Characteristic (ROC) curve [11], developed for each node from the simulated data, to compute an optimal threshold $TR_i$. Specifically, the thresholded reputation test vector $R_{i,test}$ and its estimate $\hat{R}_{i,test}$ are used in the ROC analysis to determine the optimal threshold $TR_i$ associated with the output of the RBF-NN from Step 3 (this is the calibration step referred to above).
Step 7: Threshold $\hat{R}_{i,test}$ such that any value greater than $TR_i$ becomes 1; otherwise it is set to 0.
Step 9: After performing Steps 1–8 for all active nodes, the overall Reputation Prediction Performance (RPP) of the MANET is computed as the average of the individual performances of all the nodes.
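A condensed MATLAB sketch of Steps 2–7, together with the per-node performance that is averaged in Step 9, is given below for a single node. It relies on the Neural Network Toolbox function newrb for RBF training; the function name trainAndCalibrateNode, the chronological 60/40 split, and the grid search over candidate ROC thresholds are illustrative assumptions rather than the authors' exact implementation.

    function [perf, TRi] = trainAndCalibrateNode(Ai, Ri, spread)
    % Train, test and calibrate an RBF-NN reputation predictor for one node.
    % Ai: p x q attribute matrix, Ri: p x 1 reputation vector, spread: RBF spread.
    p      = size(Ai, 1);
    nTrain = round(0.6 * p);                        % Step 2: 60/40 split
    AiTrain = Ai(1:nTrain, :);      RiTrain = Ri(1:nTrain);
    AiTest  = Ai(nTrain+1:end, :);  RiTest  = Ri(nTrain+1:end);

    % Step 3: train the RBF-NN (newrb expects samples as columns)
    net = newrb(AiTrain', RiTrain', 0.0, spread);

    % Step 4: sort the test data by reputation and generate the estimate
    [RiTest, order] = sort(RiTest);
    AiTest = AiTest(order, :);
    RiHat  = sim(net, AiTest')';

    % Step 5: threshold the true test reputations at 0.5
    RiTestBin = double(RiTest > 0.5);

    % Step 6: ROC analysis -- pick the threshold whose (FPR, TPR) point
    % lies closest, in Euclidean distance, to the ideal point (0, 1)
    cand = linspace(min(RiHat), max(RiHat), 200);   % candidate thresholds
    best = inf;  TRi = cand(1);
    for t = cand
        pred = double(RiHat > t);
        TPR  = sum(pred == 1 & RiTestBin == 1) / max(sum(RiTestBin == 1), 1);
        FPR  = sum(pred == 1 & RiTestBin == 0) / max(sum(RiTestBin == 0), 1);
        d    = sqrt(FPR^2 + (1 - TPR)^2);
        if d < best, best = d; TRi = t; end
    end

    % Step 7: threshold the RBF-NN estimate with the calibrated TRi
    RiHatBin = double(RiHat > TRi);

    % Per-node prediction performance (averaged over all nodes in Step 9)
    perf = mean(RiHatBin == RiTestBin);
    end

For the full MANET, this function would be applied to each active node's $(A_i, R_i)$ pair and the RPP taken as the mean of the returned per-node performances.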
Assuming a stationary environment, once the neural network is trained on the attributes/reputations
of a particular node, a proactive approach can be taken by using the RBF-NN to perform a
"step-ahead" prediction of the node's reputation. This significantly enhances the performance of
the wireless network's defense system, since it allows malicious activity to be predicted before
it occurs.
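As a sketch of this proactive use, the helper below (the name flagNode and its interface are hypothetical) feeds the node's current attribute vector to the trained RBF-NN and compares the predicted reputation with the calibrated threshold $TR_i$, assuming the convention of Section 4 in which values near 1 indicate misbehavior.

    function suspicious = flagNode(net, TRi, attr)
    % One-step-ahead reputation check for a node (illustrative helper only).
    % net:  RBF-NN trained on the node's attribute/reputation history
    % TRi:  calibrated output threshold from the ROC analysis
    % attr: current 1 x q attribute vector reported for the node
    repHat     = sim(net, attr');   % predicted reputation for the node
    suspicious = repHat > TRi;      % flag the node before misbehavior is observed
    end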
For a non-stationary environment, training must be performed periodically "on-the-fly" to
update the RBF-NN weights as the statistical nature of the environment changes. The
frequency of retraining the RBF-NN is determined by how quickly the statistics of the
environment change.
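One simple, hypothetical retraining trigger, sketched below, compares the mean attribute vector of a recent observation window with that of the data the RBF-NN was last trained on; the mean-shift criterion and the tolerance are assumptions, not part of the paper.

    function retrain = driftDetected(AiTrain, AiRecent, tol)
    % Decide whether the RBF-NN should be retrained "on-the-fly" by comparing
    % the attribute statistics of a recent window with those of the training set.
    muTrain  = mean(AiTrain, 1);                 % mean attribute vector, training data
    muRecent = mean(AiRecent, 1);                % mean attribute vector, recent window
    retrain  = norm(muRecent - muTrain) > tol;   % retrain if the drift exceeds tol
    end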
5 Simulation Results
As previously mentioned, of the 50 nodes defined in the simulation, only 14 were
active in the MANET. From the simulation results, the average RPP achieved was
90.82% using an RBF-NN predictor at each node. The (i) actual node reputation, (ii)
RBF-NN reputation estimate, and (iii) thresholded reputation estimate for
nodes 0 and 43 are given in Figs. 2 and 3.
[Fig. 2. Actual reputation, RBF-NN reputation estimate, and thresholded reputation estimate at node 0, plotted against time step.]
[Fig. 3. Actual reputation, RBF-NN reputation estimate, and thresholded reputation estimate at node 43, plotted against time step.]
Figure 4 shows the ROC curves for nodes 0 and 43. The optimal threshold at the
output of the RBF-NN for each node is determined by finding the point on the ROC curve
(its "knee") that lies closest, in Euclidean distance, to the (0,1) point on the graph.
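In symbols, if the ROC curve of node $i$ is sampled at candidate thresholds $t_j$ with false-positive rate $FPR_i(t_j)$ and true-positive rate $TPR_i(t_j)$ (the sampling of candidate thresholds is an assumption; the paper states only the Euclidean-distance rule), the calibrated threshold is

\[ TR_i = \arg\min_{t_j} \sqrt{\bigl(FPR_i(t_j) - 0\bigr)^2 + \bigl(TPR_i(t_j) - 1\bigr)^2}. \]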
[Fig. 4. ROC curves (true positive rate versus false positive rate) for nodes 0 and 43.]
Figure 5 illustrates the performance of the RBF-NN predictor at the “active” nodes.
[Fig. 5. RBF-NN predictor performance at the active nodes (node numbers 0, 4, 7, 10, 12, 14, 19, 25, 31, 34, 35, 39, 40, and 43).]
6 Practical Considerations
There are some practical considerations that must be considered for the successful
operation of the proposed system. The first one is the existence of a trusted compo-
nent in the kernel of the operating system to regularly report the state of the node.
This component should be properly secured such that its functionality is not interfered
with or obstructed by any software or physical compromise.
The attributes associated with a particular node are definable critical processes and
physical characteristics associated with network (MANET) security. Examples are
operating system patches, encryption keys, hardware configurations, and the exposure and
location of the node [10]. The numerical assignment of these attributes should be
made in such a manner as to avoid singularity problems, so that the RBF-NN algorithm
can perform efficiently. Due to the absence of a central authority, the MANET defense
system can be executed on a dynamically assigned node or even on a specialized
node such as a cluster head.
The other issue is reputation scoring. There should be a defined benchmark, known
to all the nodes, to facilitate the assignment of reputation values for different types
of nodal behavior (or activity). In our study, an activity results in either an increase or a
decrease in the reputation score (a sketch of such a benchmark is given after the list below).
A few examples of activities that are normally expected from malicious, misbehaving, or
suspicious nodes are:
• Packet dropping [10]
• Flooding the MANET with a large number of Route Request (RREQ) packets [12]
• Sending out incorrectly addressed packets [13]
• Sending a fake Route Reply (RREP), as in a Black Hole attack [14].
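A minimal MATLAB sketch of such a scoring benchmark follows; the activity labels and score increments are hypothetical examples only, using the convention of Section 4 in which a higher score indicates a less trustworthy node.

    function rep = updateReputation(rep, activity)
    % Adjust a node's reputation score for one observed activity.
    % The labels and increments are hypothetical; a deployed benchmark would
    % have to be agreed on by all nodes in the MANET.
    switch activity
        case 'packet_drop'              % packet dropping [10]
            rep = rep + 0.10;
        case 'rreq_flood'               % flooding the MANET with RREQs [12]
            rep = rep + 0.20;
        case 'misaddressed_packet'      % incorrectly addressed packets [13]
            rep = rep + 0.05;
        case 'fake_rrep'                % fake RREP / Black Hole attack [14]
            rep = rep + 0.30;
        case 'cooperative_forwarding'   % normal behavior lowers the score
            rep = rep - 0.05;
    end
    rep = min(max(rep, 0), 1);          % keep the score within [0, 1]
    end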
It should also be noted that the RBF-NN predictor cannot be operational before it
learns the activity and state dynamics of the nodes in the MANET. Moreover, the
initial RBF-NN training does not necessarily need to be performed as an integral part
of the actual operation of the MANET.
As previously mentioned, a proactive approach to MANET defense as opposed to a
reactive approach has the potential to better protect the MANET. For this reason,
training of the RBF-NN should be carried out in a laboratory, or possibly during field
testing, but in either case before the MANET is deployed for actual operation. It
should be noted that the RBF-NN could be updated “on-the-fly” by re-training it
adaptively once the actual operation has been initiated.
Future research directions include the use of a Kalman filter as an N-step predictor capable of
estimating the reputation of a node further into the future. This would be carried out by
predicting the attribute vector out to N time steps, and then using this vector estimate to
predict the node's reputation.
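As a hedged sketch of how such an N-step scheme might be formulated (the paper does not specify a state model), suppose the attribute vector $a_k$ evolves according to a linear model $a_{k+1} = F a_k + w_k$ with process noise $w_k$. The Kalman filter's N-step-ahead prediction and the corresponding reputation estimate would then be

\[ \hat{a}_{k+N\mid k} = F^{N}\,\hat{a}_{k\mid k}, \qquad \hat{r}_{k+N} = \mathrm{RBF}\bigl(\hat{a}_{k+N\mid k}\bigr), \]

where $\mathrm{RBF}(\cdot)$ denotes the trained RBF-NN mapping from attribute vectors to reputation values.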
Acknowledgments
This research is part of a multi-institutional effort, supported by the Army Research
Laboratory via Cooperative Agreement No. W911NF-08-2-0023.
References
1. Abdelhafez, M., Riley, G., Cole, R.G., Phamdo, N.: Modeling and Simulations of TCP
MANET Worms. In: 21st International Workshop on Principles of Advanced and Distrib-
uted Simulation (PADS 2007), pp. 123–130 (2007)
2. TPOC: Ghosh, A.K.: Defense Against Cyber Attacks on Mobile ad hoc Network Systems
(MANETs). In: BAA04-18 Proposer Information Pamphlet (PIP), Defense Advanced Re-
search Projects Agency (DARPA) Advanced Technology Office (ATO) (April 2004)
3. Ham, F.M., Kostanic, I.: Principles of Neurocomputing for Science and Engineering.
McGraw-Hill, New York (2001)
4. Gordon, S., Howard, F.: Antivirus Software Testing for the New Millennium. In: 23rd Na-
tional Information Systems Security Conference (2000)
5. Marti, S., Giuli, T., Lai, K., Baker, M.: Mitigating Routing Misbehavior in Mobile Ad Hoc
Networks. In: Proceedings of the Sixth International Conference on Mobile Computing
and Networking, pp. 255–265 (August 2000)
6. Buchegger, S., Le Boudec, J.Y.: Nodes Bearing Grudges: Towards Routing Security, Fair-
ness, and Robustness in Mobile ad hoc Networks. In: 10th Euromicro Workshop on Par-
allel, Distributed and Network-Based Processing, pp. 403–410 (2002)
7. Buchegger, S., Le Boudec, J.Y.: Performance Analysis of the CONFIDANT Protocol: Co-
operation of Nodes–Fairness in Dynamic ad-hoc Networks. In: Proceedings of the IEEE/ACM
Workshop on Mobile Ad Hoc Networking and Computing, pp. 226–236 (June 2002)
8. Liu, Y., Yang, T.R.: Reputation Propagation and Agreement in Mobile Ad-Hoc Networks.
IEEE Wireless Communication and Networking 3, 1510–1515 (2003)
9. Zheng, Z., Yi, L., Jian, L., Chang-xiang, S.: A New Computer Self-immune Model against
Malicious Codes. In: First International Symposium on Data, Privacy and E-Commerce,
pp. 456–458 (2007)
10. Samian, N., Maarof, M.A., Abd Razak, S.: Towards Identifying Features of Trust in Mo-
bile Ad Hoc Networks. In: Second Asia International Conference on Modeling & Simula-
tion, pp. 271–276 (2008)
11. McDonough, R.N., Whalen, A.D.: Detection of Signals in Noise, 2nd edn. Academic
Press, New York (1995)
12. Balakrishnan, V., Varadharajan, V., Tupakula, U., Lucs, P.: TEAM: Trust Enhanced Secu-
rity Architecture of Mobile Ad-hoc Networks. In: 15th IEEE International Conference on
Networks, pp. 182–187 (2007)
13. Stopel, D., Boger, Z., Moskovitch, R., Shahar, Y., Elovici, Y.: Application of Artificial
Neural Networks Techniques to Computer Worm Detection. In: International Joint Con-
ference on Neural Networks, pp. 2362–2369 (2006)
14. Deng, H., Li, W., Agrawal, D.P.: Routing Security in Wireless Ad Hoc Networks. IEEE
Communications Magazine, 70–75 (October 2002)
Author Index