
4 Ontology engineering
Ontologies are knowledge representation languages borrowed by computer science from philoso-
phy and commonly used in the Semantic Web. In philosophy, ontologies help to model reality so
as to distinguish what is real and its categories from what is not real [57]. In computer science and
information science, ontologies help to model a domain/problem. The main goal of this chapter is
to present how ontologies are constructed. It is divided into four main sections: section 4.1 presents
ontologies, section 4.2 presents how ontologies are built, section 4.3 presents ontology learning,
and section 4.4 presents related work on ontology learning from source code.

4.1 Ontologies

The goal of the first section of this chapter is to define the notion of ontology and how it is
modelled. Section 4.1.1 will present the notion of knowledge, section 4.1.2 will present ontologies
and section 4.1.3 will present how ontologies are modelled.

4.1.1 The notion of knowledge

Knowledge is the set of facts, information and skills acquired through experience or education for
the understanding of a subject area. In a given domain, it describes concepts and facts, relations
among them and mechanisms to combine them in order to solve problems [53]. To be useful,
knowledge must be acquired from domain experts/resources and represented with a formal model
such as semantic networks, system architectures, frames, rules, ontologies, or logic.

The theory behind knowledge representation is cognitive science, which studies human thinking
in terms of representational structures in the mind and the computational procedures that operate
on those structures. It is assumed that the human mind has mental representations analogous to
computer data structures and that the computational procedures of the mind are similar to
computational algorithms. In this field, the different mental representations of the human mind are cited as
follows: logical propositions, rules, concepts, images, and analogies. These constitute the basis of
the different knowledge representation techniques for human knowledge, such as rules, frames,
and logic. Every representation provides some guidance about how knowledge can be organized
for efficient computation (e.g., frames are suitable for taxonomic reasoning) [53]. The field of
cognitive science distinguishes the different types of knowledge that humans commonly use [53]:

• Procedural knowledge: Describes how things can be done. Examples of such knowledge
include rules, problem-solving strategies, agendas, and procedure manuals;
• Declarative knowledge: It is about what is known about a topic or a problem. For example,
facts that are either true or false;
• Metaknowledge: Describes the knowledge behind knowledge;
• Heuristic knowledge: Expressed as simple heuristics which help to guide the problem-solving
process and move through the solution space;
• Structural knowledge: Describes the relationship between the different pieces of knowl-
edge from other categories;
• Inexact and uncertain knowledge: Describes prior, posterior, and conditional probabilities
of events.
• Commonsense knowledge: Denotes a vast amount of human knowledge about the world
which cannot be put easily in the form of precise theories.
• Ontological knowledge: Describes a category of things; a domain and the terms that people
use to talk about them; the relations between categories, and the axioms and constraints in
the domain. Its main components are concepts, properties of concepts, axioms and rules. In
section 4.1.2, we shall be talking about ontologies.

After the knowledge of a particular domain is gathered, it is organized and stored in a knowledge
base. This knowledge can be retrieved when needed; this is called knowledge retrieval. It is done
through reasoning to obtain conclusions, inferences, and explanations. In order to develop practical
knowledge bases, knowledge engineers have to execute a process consisting of [53]:

• Understanding knowledge properly and transforming it to make it suitable for the application
of the various knowledge representation formalisms;
• Encoding knowledge in a knowledge base using appropriate representation techniques, lan-
guages, and tools;
• Verifying and validating knowledge by running the practical intelligent system that relies on
it;
• Maintaining knowledge in the course of time.

Knowledge helps to add semantics to the Web. In the next section, we shall be talking about a
particular type of knowledge: ontologies.


4.1.2 Ontologies

In this section, we will define ontologies from the computer science point of view, present the
different components of an ontology and finally, present the different types of ontologies.

4.1.2.1 Definitions

In the literature, several definitions of ontologies can be found. Gruber defines an ontology as an
explicit specification of a conceptualisation. Based on Gruber's definition, many other definitions
were proposed: Borst defines an ontology as a formal specification of a shared conceptualization;
Studer merged Gruber's and Borst's definitions and defines an ontology as an explicit, formal
specification of a shared conceptualization; Guarino gives a definition based on the modelling tool that
is logic: according to him, an ontology is a logical theory of a conceptualization [57]. These definitions
present the key elements of an ontology:

• Conceptualization: Refers to an abstract model of some phenomenon in the world. This
model is made up of relevant domain concepts, relations, and how concepts relate to each
other.

• Explicit: Refers to the fact that the meaning of all concepts and the constraints on their use
must be explicitly defined. All concepts must be correctly interpreted by machines.

• Formal: Refers to the fact that the ontology should be machine-readable (understandable
and interpretable).

• Shared: Here, the ontology must capture a consensual knowledge (accepted by a group of
domain experts) and must be shared to facilitate communication.

4.1.2.2 Basic ontological components

An ontology is composed of these basic components [57]:

• Concept, also called Class, represents a category of objects. For instance "Health_facility"
is the concept of all health facilities, including health centers and clinics;

• Individual is an instance of a concept and corresponds to a concrete object. For example,
from the concept "Person", "Bob" is an individual;

• Property is used to describe the characteristics of individuals of a concept. Properties are
composed of DataProperties and ObjectProperties. DataProperties are properties whose values
are data types. For instance, "age" of type "Integer" can be a property of an instance
of the concept "Person". ObjectProperties are special attributes whose values are individuals
of concepts. For instance, "examined_in" defines a relationship between the concept
"Person" and the concept "Health_facility" ("A person is examined in a health facility");


• Class/Property hierarchy is one of the most important relations used to organize concepts
and properties in the ontology, and it allows inheritance mechanisms to be applied. For
instance, "Patient" subClassOf "Person" is a hierarchical relation between these two classes.
Class/Property taxonomies are generally used to construct the so-called lightweight
ontologies or taxonomies;

• Axiom is used to model statements that are always true. Heavyweight ontologies add
axioms and constraints to lightweight ontologies. Axioms and constraints clarify the intended
meaning of the terms in the ontology. For example, the assertion "the concepts "Men" and
"Women" are disjoint" is an axiom;

• Rule is a statement of the form P1, ..., Pn → P, which means that if the statements
P1, ..., Pn are true, then the statement P is true. Rules are used for knowledge inference purposes.
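
To make these components concrete, the following sketch declares a few of them with the owlready2 Python library (a minimal illustration: the IRI and all class, property and individual names are chosen for this example, not taken from an existing ontology):

    from owlready2 import get_ontology, Thing, ObjectProperty, DataProperty, AllDisjoint

    onto = get_ontology("http://example.org/health.owl")   # hypothetical IRI

    with onto:
        class Person(Thing): pass                 # concept (class)
        class Health_facility(Thing): pass        # another concept
        class Patient(Person): pass               # hierarchy: Patient subClassOf Person
        class Men(Person): pass
        class Women(Person): pass
        AllDisjoint([Men, Women])                 # axiom: Men and Women are disjoint

        class age(DataProperty):                  # DataProperty with integer values
            domain = [Person]
            range  = [int]
        class examined_in(ObjectProperty):        # ObjectProperty linking two concepts
            domain = [Person]
            range  = [Health_facility]

        bob = Person("Bob")                       # individual (instance of Person)
        clinic = Health_facility("Central_Clinic")
        bob.age = [34]
        bob.examined_in = [clinic]

    onto.save(file="health.owl", format="rdfxml") # serialize the ontology

Rules, the last component above, are typically expressed separately, for example with the rule languages presented in section 4.2.2.2.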

4.1.2.3 Types of ontologies

Several authors have classified ontologies [57]. However, in this thesis, we will be working on
domain ontologies and application ontologies. Thus, in the following paragraphs, we will present
the classification of ontologies according to Guarino [57]. This classification is done according to
the level of dependence of the ontology on a particular task:

• Top-level/Upper-level ontologies: They are also called cross-domain ontologies. They de-
scribe general concepts and provide general notions under which all root terms in existing
ontologies should be linked. They use general concepts like time, space, event and can be
shared and transferred from one context to another. Top-level ontologies are absolutely inde-
pendent from a specific domain or from a specific problem. Examples of top-level ontologies
are two ontologies built by Guarino et al. [57]: one describes universals (a universal is a concept
like "car") and the other describes particulars (a particular is an individual like "your car").

• Domain ontologies: They provide vocabularies about concepts within a domain and their
relationships; about the activities taking place in that domain; and about the theories and ele-
mentary principles governing that domain. They are limited to the representation of concepts
in a given domain and in some cases, they are a specialization of an upper-level ontology. For
example, the term "City" in a domain ontology is a specialization of a more generic concept
"Location" which is a specialization of the term "SpatialP oint" that may be defined in an
upper-level ontology.

• Task ontologies: Task ontologies describe the vocabulary related to a generic task or activity
(diagnosing, scheduling) by specializing the terms in the top-level ontologies. They are used
to model tasks or processes and how these tasks are related. They provide a systematic
vocabulary of the terms used to solve problems associated with tasks that may or may not
belong to the same domain. For example, an ontology can be constructed that describes the
task of a health professional in a Hospital.

• Application ontologies: Application ontologies are application-dependent. They combine
domain ontology and task ontology. They contain all the definitions needed
to model the knowledge required for a particular application. These ontologies often extend
and specialize the vocabulary of domain and task ontologies.

Note however that these ontologies can be heavyweight ontologies or lightweight ontologies
depending on the conceptualization used. They are called lightweight ontologies when they de-
fine only hierarchies of types and/or properties and heavyweight ontologies when they are more
expressive, using restrictions, inferences, and class construction.

4.1.3 Knowledge modelling

Knowledge can be modelled using many modelling techniques such as semantic networks, sys-
tem architecture, frames, rules and logic [53]. Ontological knowledge in particular can be classified
according to the formalism used for its modelling. It can be modelled using rules or software
engineering techniques (lightweight ontological knowledge); or logical techniques (heavyweight
ontological knowledge) [57, 62]. In this section, we are going to present the modelling of heavy-
weight ontological knowledge and lightweight ontological knowledge.

4.1.3.1 Modelling heavyweight ontological knowledge

Logic can be defined by L = (S, |=) where S is a set of statements and |= is an entailment relation.
It is used to make formal deductions and inferences, to study correct and incorrect reasoning,
and to model knowledge [112].

Logical languages. A logical language is defined by a syntax and a semantic [57, 62]:

• The Syntax is composed of a collection of symbols and rules that are combined to form
formulae;

• The Semantic gives meaning (interpretation) to symbols and formulae. With the semantic,
one can use facts to make deductions, to reason by building demonstrations (e.g., to demon-
strate that a patient has tuberculosis by demonstrating that he/she has been tested positive
for Koch’s Bacillus (KB)).

Logical deduction helps to derive formulae (provable formulae or theorems) from starting for-
mulae (axioms) or rules (inference rules) [62, 112]. There are two ways to show that a formula is a
logical consequence of other formulae: the resolution method and the tableau algorithm. With
these methods, it can be demonstrated for example, in the case of epidemiological surveillance
that any patient who takes tuberculosis drugs has positive sputum results [112]. Several logical
formalisms can be used to model knowledge. In the following, we are going to present the propo-
sitional logic, the first order logic and the description logic.

The Propositional Logic (PL). Propositional logic, or the calculus of propositions, defines
the rules of deduction that connect propositions to each other without examining
their contents. A proposition is a fact, a theorem, or an utterance that is either true or false; it can be
demonstrated or refuted. Formulae are the propositions that can be formed by combining
atomic propositions [112]. For example, the statement "a TB case is a person" can be modelled
using a complex formula consisting of two propositions (t, h) and a binary operator: t → h (t
represents a case of tuberculosis and h a person). To model knowledge with Propositional Logic,
one must specify the syntax and the semantic [112]:

• The syntax defines allowable facts. Atomic facts consist of a single proposition that can
be true or false. Complex statements are constructed from simpler ones using parentheses
and logical connectives. There are five connectives generally used: ¬ called a negation, ∩
called a conjunction, ∪ called a disjunction, ⇒ called an implication also known as "if-then"
statements, ⇐⇒ called if and only if. All sentences are constructed from atomic sentences
and the five connectives.

• The semantic specifies how to compute the truth value of any sentence given a model. It
specifies how to compute the truth of atomic sentences and how to compute the truth of
sentences formed with each of the five connectives. The rules can be expressed with truth
tables that specify the truth values of a complex sentence for each possible assignment of
truth values to its components. An interpretation maps all atomic propositions to {t, f}. If F is a formula
and I an interpretation, then I(F) is a truth value computed from F and I via the truth tables.

A knowledge base built using PL can be validated using the resolution method or the analytic
tableaux method, by demonstrating that it is satisfiable.
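
To illustrate these truth-table semantics, the short Python sketch below enumerates every interpretation of a small knowledge base built from the t → h example above and reports a satisfying one if it exists (the encoding of the formula as a Python function is only an illustration):

    from itertools import product

    def satisfiable(formula, atoms):
        """Brute-force truth-table check: return a model of the formula, or None."""
        for values in product([True, False], repeat=len(atoms)):
            interpretation = dict(zip(atoms, values))
            if formula(interpretation):
                return interpretation
        return None

    # KB = (t -> h) and t, where t = "is a TB case" and h = "is a person";
    # the implication a -> b is encoded as (not a) or b.
    kb = lambda i: ((not i["t"]) or i["h"]) and i["t"]
    print(satisfiable(kb, ["t", "h"]))   # prints a model, e.g. {'t': True, 'h': True}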

Note that with PL, one can only make statements and assertions about single objects. It is
impossible to summarize objects into a set or a class (ontological concepts) and to make statements
about a set of things; to express relationships among propositions (data/object properties); or to make
arguments on a set of objects without explicitly naming them [112]. For example, it is not possible
to model the statement "anyone with positive sputum exams is a case of tuberculosis".

First Order Logic (FOL). FOL is defined by (V, C, F, P) where V is a countably infinite set of
variables, C a set of constant symbols, F a set of function symbols (each function comes
with an arity), and P a set of predicate or relation symbols (each predicate comes with a non-negative
integer as its arity). It is assumed that the world consists of objects with certain relations among them that do
or don’t hold. Thus, it is used to express facts about some or all the objects in the universe [112].
Contrary to PL, FOL can be used to represent knowledge of complex environments in a concise
way because it is sufficiently expressive to represent a good deal of commonsense knowledge. The
language of FOL is built around its syntax and its semantic:

• To represent the syntax, the model of FOL defines the formal structures that constitute the
possible world under consideration. The basic syntax elements of FOL are the symbols that
stand for objects, relations, and functions.

– Objects are constant symbols. The domain of a model is the set of objects or domain
elements it contains. The domain is required to be nonempty. Every possible world
must contain at least one object.
– Relations are predicate symbols consisting of a set of tuples of objects that are related.
Each predicate symbol comes with an arity that fixes the number of arguments;
– The model can contain some relations considered as functions. Every function has an
arity that fixes the number of arguments.

• To define the semantic, a model consists of a set of objects and an interpretation that maps
constant symbols to objects, predicate symbols to relations on those objects, and function
symbols to functions on those objects:

1. A term is a logical expression that refers to an object;


2. Predicate symbols refer to relations among terms;
3. Atomic sentence (or atom) is formed from a predicate symbol optionally followed by
a parenthesized list of terms;
4. Complex sentences can be constructed using logical connectives;
5. Quantifiers (universal quantification - ∀ and existential quantification ∃) are used to
express properties of entire collections of objects, instead of enumerating objects by
name;
6. Nested quantifiers are used to express more complex sentences using multiple quanti-
fiers;
7. ∀ and ∃ are intimately connected with each other through negation (∀x ¬P(x, Q) is
equivalent to ¬∃x P(x, Q));
8. The equality symbol (=) is used to signify that two terms refer to the same object. It can be
used to state facts about a given function, or with negation to state that two terms are not
the same.

In a broader sense, with FOL, one can:

• Reason on a set of objects: e.g., ∀x (TBCase(x) → ∃y (Hospital(y) ∧ treatedAt(x, y)))
to say that "TB cases are treated in a hospital";

• Deduce formulae: e.g., "People take TB tests. If a person tests positive for KB, then he is
a TB case. A TB case follows treatment for at least 6 months with Rifampicin. Bob has been
on Rifampicin for 6 months. Can we assume that Bob was a TB case?"
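
One possible formalization of this last example, with illustrative predicate names, is:

    ∀x (TBCase(x) → OnRifampicin6Months(x))
    OnRifampicin6Months(Bob)

From these two premises alone, TBCase(Bob) is not a deductive consequence (concluding it would be affirming the consequent); it can only be proposed as a plausible, abductive hypothesis.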

FOL is suitable for modelling ontologies, but it is difficult to achieve a consensus in modelling;
it is cumbersome for modelling; it is complex if one has to make calculations because it is not decidable;
the same knowledge can have several possible interpretations; and it is complex to prove the accuracy
and the completeness of the statements [129]. In the following paragraphs, we are going to present
another knowledge representation formalism consisting of a family of logic called Description
Logics.

Description Logics (DLs). This is the name of a family of knowledge representation formalisms,
most of which are decidable fragments of FOL. DLs provide concepts (classes), roles
(properties), operations (and, or, not, some, all, atleast, atmost, any, ...) on the primitive elements
of the language, and a classification mechanism based on the subsumption relation between concepts or
roles. They help to represent the terminological and assertional knowledge of a domain
in a formal and structured way, and to reason effectively on this knowledge while minimizing
response time [62]. The applications of DLs are numerous: Semantic Web, medicine, bioinformatics,
knowledge engineering, software engineering, etc. Let’s define its syntax and semantic as we have
done with PL and FOL:

• The syntax of DLs defines concepts, roles, individuals, and operators.


– Concepts are unary predicates that represent entities and classes e.g., Location:{x |
Location(x)};
– Roles (also called properties) are binary predicates that represent relations between one
concept and another, e.g., treatedAt: {(x, y) | treatedAt(x, y)} connects two individuals
that belong to (possibly different) classes.
– Individuals (or concept assertions) are constants that are instantiations of a class, e.g.,
the assertion Person(Bob) says that Bob is a person.
– Operators (or constructors) are used to construct complex representations of concepts
or roles. To guarantee the decidability and low complexity of DLs, the expressivity
of the operators is limited. Fundamental operators to define classes and properties
are: logical conjunction (∩), logical disjunction (∪), negation (¬), and restricted forms of
quantification (∃, ∀).
• The semantic of DLs is given by an interpretation consisting of an interpretation domain (D)
and an interpretation function (I). The interpretation function interprets all atomic concepts
and all atomic roles: atomic concepts are mapped to subsets of the domain of discourse and
atomic roles are mapped to subsets of the Cartesian product of the domain of discourse with itself.
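
For illustration, a standard way of writing this interpretation (using the operators listed above; this presentation is generic, not specific to [62]) is:

    A^I ⊆ D for every atomic concept A, and r^I ⊆ D × D for every atomic role r;
    (¬C)^I = D \ C^I;   (C ∩ C')^I = C^I ∩ C'^I;   (C ∪ C')^I = C^I ∪ C'^I;
    (∃r.C)^I = {x ∈ D | there exists y with (x, y) ∈ r^I and y ∈ C^I};
    (∀r.C)^I = {x ∈ D | for all y, (x, y) ∈ r^I implies y ∈ C^I}.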

Figure 22: Knowledge base in DL — the TBox holds the terminological knowledge (knowledge on
concepts, properties, axioms and rules) and the ABox holds the assertional knowledge (instances of
the elements described by the TBox).

DLs divide knowledge into two parts, as presented in figure 22:

• TBox (Terminological Box): It contains the terminological knowledge and describes the
general knowledge of the domain. It is composed of Classes (describing the concepts of
the domain), Roles (defining relationships between concepts) and Axioms (additional con-
straints on classes and roles).


• ABox (Assertional Box): It represents a configuration, a situation or specific data of the
system. It describes individuals by naming them and specifying them, in terms of concepts
and roles, through assertions that relate to these individuals. Several ABoxes can be associated with the
same TBox [62].
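
As a small illustration using the vocabulary of this chapter (the axioms below are illustrative; ⊑ denotes subsumption, i.e. subClassOf, and ⊥ the empty concept):

    TBox:  Patient ⊑ Person;   TBCase ⊑ Patient ∩ ∃treatedAt.Health_facility;   Men ∩ Women ⊑ ⊥
    ABox:  Person(Bob);   Health_facility(Central_Clinic);   treatedAt(Bob, Central_Clinic)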

There are several families of description logics: AL, ALN, ALC, SH, ALCN, ALCQ, ALCF,
SHOIN, SHIQ, SHIF. The difference between one family and another is mainly expressed in terms
of expressivity.

With DLs, deductions help to derive new formulae from the starting formulae by means of
inference rules. For deduction, two assumptions are distinguished: the Closed World Assumption (CWA),
in which all knowledge is specified without the possibility of extending the model, so that any assertion
that cannot be proven true is considered false; and the Open World Assumption (OWA), in which the model
is specified while giving others the ability to extend it. If a query is made on the KB and no answer
can be derived, the KB returns "don't know" (there is simply not enough information). As
DLs use the OWA, this can lead to undecidability of reasoning on the knowledge base (when a
statement is false, the algorithm can run indefinitely). To show that a knowledge base is
consistent, one can use either the resolution or the analytic tableaux method; in DLs these methods are
extended with stopping conditions so that they terminate in finite time.

4.1.3.2 Modelling lightweight ontological knowledge

Lightweight ontological knowledge is knowledge with restricted expressivity in which concepts
are connected to other concepts using untyped associations. Lightweight ontologies include
concepts, concept taxonomies, relationships between concepts, and properties that describe concepts.
Software engineering and rule techniques are used to model this kind of knowledge because they
impose a structure on the domain knowledge and constrain the interpretations of terms [57].

Rules. Rules define constraints and always resolve to either true or false. They are composed
of the dependent clause expressing the condition and the main clause expressing the consequence.
Placed at the top of the Semantic Web stack, they can be seen as an extension of FOL that describe
knowledge that often depends on the context and cannot be easily modelled using DLs [51]. Rules
can be used to assert control or influence the behaviour of a system. One common use is to stan-
dardize access control to applications e.g. "Provide statistics of the hospital in which a decision
maker is located". One can distinguish general inference rules (Premise → Conclusion), assumption
rules (Cause → Effect), and production rules (Condition → Action). Rules are often combined
with DLs to model knowledge, but combining DLs and rules can give rise to undecidability; the
decidable fragment called DATALOG is generally used [51].

The Use of software engineering techniques. Many software engineering techniques may be
used to model lightweight ontological knowledge.

1. UML technique: UML might be used for knowledge modelling because it has a graphical
language easy to understand by people outside the computer science domain. For instance,
class diagram notations help to model concepts using classes, taxonomies using the
generalization relation between classes, attributes using class attributes, and formal axioms using
Object Constraint Language (OCL) [53, 112]. The main limit is that OCL cannot be used to
represent all axioms [57, 112].
2. MDA techniques: With the MDA technique, the system is modelled using a meta-model
and the resulting model is used to generate the application source code [19, 53]. Knowl-
edge can be extracted from meta-models [53] by using the correspondence between MDA
model components and knowledge components. Figure 14 presents the meta-model of
EPICAM, which is very close to the representation of ontologies in the Protege editor. When
the meta-model is designed with Ecore, then, as for UML, the constraints can be added using
the OCL language.
3. Database techniques: A database is an organized collection of data for a rapid search and
retrieval of information. It can be modelled using the Entity-Relation (ER) model. Onto-
logical knowledge can be modelled using database design techniques by matching the ER
model with knowledge components [57]. The ER notation allows modelling classes through
ER-entities, taxonomies using the generalization relationship between ER-entities; attributes
through ER-attributes; and formal axioms with integrity constraints. As with UML, the main
limit is the difficulty to model and to evaluate all axioms.

In this section, we presented the different ontological knowledge modelling formalisms with
their advantages and disadvantages. Table 4.1 summarizes these knowledge modelling approaches
and their limits. In this thesis, we are particularly interested in ontological knowledge.
As presented by many authors [57, 62] and in table 4.1, DLs are good candidates for ontological
knowledge modelling.

Formalism   Expressivity          Ontology type   Decidability     Supports calculation
PL          Weakly expressive     Lightweight     Decidable        Yes
FOL         Strongly expressive   Heavyweight     Not decidable    Yes
DL          Strongly expressive   Heavyweight     Decidable        Yes
Rules       Weakly expressive     Lightweight     Decidable        Yes
UML         Weakly expressive     Lightweight     Decidable        No
Database    Weakly expressive     Lightweight     Decidable        No
MDA         Weakly expressive     Lightweight     Decidable        No

Table 4.1: Summary of knowledge modelling approaches

4.2 Ontology engineering

Ontology engineering consists of the set of activities that concern the ontology development process,
the ontology life cycle, and the methodologies, tools and languages for building ontologies. The goal
of this section is to present the different artefacts used to build ontologies. In section 4.2.1, we
will present the ontology development process, methods and methodologies. In section 4.2.2 we
will present the knowledge representation languages and query languages. In section 4.2.3 we will
present the tools used to build and manage ontologies.

4.2.1 Ontology construction process, methods and methodologies

4.2.1.1 Ontology development process

Figure 23: Ontology development process [57]

The ontology development process (see figure 23) refers to activities that are being performed
when building ontologies without identifying the order in which these activities should be per-
formed. These activities are very important especially in the case where the ontology is being built
by geographically distant cooperative teams. The ontology life cycle describes the different phases
involved in the ontology development. When this life cycle is well defined, error detection is done
much earlier and it is possible to control the quality, the delays and the costs of the development.
The control will help, for example, to know whether the ontology was well built, and the validation will help
to know whether the ontology responds to the needs of the domain experts. The ontology development
activities can be organized into three categories [57]:

1. Ontology management activities: These consist of the scheduling activity, the control
activity and the quality assurance activity.
• The scheduling activity: Consists of identifying the problem to be solved, the tasks to
be performed, their scheduling, the time needed and the resources for their realization.
• The control activity: Here, all the steps must be checked in order to ensure that all
scheduled tasks are completed as intended.


• The quality assurance activity: The quality assurance activity ensures that all pro-
duced resources (ontology, documentation) during the development process are of good
quality.

2. Ontology development activities: These involve the pre-development, the development and
the post-development activities.

• The pre-development activity: During this activity, a situational analysis (environmental
study) and a feasibility study are carried out in order to identify the platforms
where the ontology will be integrated.
• The development activity: The development activity is made up of the specification,
the conceptualization, the formalization and the implementation. The specification
answers the following questions: why is the ontology being built, what are its intended
uses and who are the end-users? The conceptualisation involves structuring the domain
knowledge into a conceptual model. The formalisation is the transformation of the
conceptual model into a formal or semi-computable model. The implementation is the
serialization of the computable model into an ontology representation language.
• The post-development activity: The post-development activity is made up of the
maintenance activity and the reusing activity. The maintenance activity involves the
updates and corrections of the ontology if need be. The reusing activity consists of
reusing the constructed ontology in other ontologies or applications.

3. Ontology support activities. These include a series of activities performed at the same time
as the development activities, without which the ontology could not be built. These activities
are:

• Knowledge acquisition activity, which consists of gathering knowledge from identified
sources (humans, texts, databases, meta-models, source code, etc.).
• Evaluation activity consists of the validation of the ontology and associated resources
(documentation, software environment) by verifying if it is really the shared conceptu-
alisation of the modelled domain.
• Integration, merging, and alignment activities consisting of the construction of a new
ontology from already existing ones.
• Documentation activity. This gives every detail on all the completed stages and prod-
ucts that are being produced.
• Configuration management activity consists of managing all versions of the ontology
and documentation so that when there are errors, knowledge engineers can go back to
rectify or make corrections on the previous versions.

4.2.1.2 Ontology construction methods

To build ontologies, knowledge engineers may choose amongst the existing methods. For this
purpose, they must ask themselves some questions. For example, is it necessary to use the
top-down, bottom-up, or middle-out approach? Is it necessary to build it manually, automatically or
semi-automatically? Or will the ontology be built from existing ones?

Top-down approach: From general to particular. In the top-down approach, one starts from
general concepts and evolves towards major specializations. Firstly, one identifies the most general
concepts and creates categories at the most general level possible. The main advantage of using
a top-down approach is a better control of the level of details. However, starting at the top can
equally result in choosing and imposing arbitrary and possibly not needed high level categories
[57].

Bottom-up approach: From particular to general. In this approach, the ontology is con-
structed from the most specific concepts, which are then grouped into categories. The main result
here is a very high level of detail of terms obtained. However, this approach increases the over-
all development effort and makes it difficult to spot commonality between related concepts and
equally increases the risk of inconsistencies [57].

Middle-Out approach: Starting in the middle. In this approach, an intermediary layer of
concepts serves as a starting point. The development can go in both directions. It is recommended
to identify firstly the core of basic terms, then, specify and generalize as required. This approach
strikes a balance in terms of the level of details. Details only arise as necessary by specializing the
basic concepts, so that some wasted efforts are avoided [57].

(Semi)automatic construction of ontologies. Ontologies can be built manually, automatically
or semi-automatically. During manual construction, the various resources containing knowledge
are collected, terms are identified and the ontology is constructed. Automatic construction (also
called ontology learning) implements the generation of terms automatically from resources [57].
Ontology learning is detailed in section 4.3.

Building ontologies by reusing existing ontologies. Before the development of a new ontol-
ogy, one can consider the reuse of existing ones. There are several ways to reuse existing ontologies
[57]:

• Ontology re-engineering: This is used when domain experts do not agree on the content
of ontologies or the conceptual model of the ontology is absent. Thus, the re-engineering
process will consist of recovering the model of an ontology and transforming this model into
another, more appropriate ontology.
• Ontology enrichment: This is the process of adding new knowledge to an ontology to obtain
a more complete one. This will help to ensure its growth and to continue to meet the needs
of users [107].
• Ontology Fusion. This consists of creating a new ontology by merging existing ones. There
are several methods of ontology fusion: ONIONS allows the creation of an ontology library
from multiple sources, FCA-Merge merges two ontologies using a set of domain documents,
and PROMPT takes as input two ontologies and creates a list of matches [57].
• Ontology alignment. This consists of creating links between several ontologies without
modifying them, hence preserving the original ontologies. It is often used for complementary
domains.

• Cooperative Construction of Ontology (Co4). This is a protocol developed at INRIA for
the collaborative construction of KBs. Its goal is to allow discussion among people and
knowledge commitment in the KBs of a system.

4.2.1.3 Ontology construction methodologies

The first methodologies for ontology construction were inspired by the experience of domain
experts and knowledge engineers. Since then, a series of methodologies have been reported: the Cyc
methodology, the TOVE methodology, METHONTOLOGY, the SENSUS methodology, the On-To-Knowledge
methodology, TERMINAE, and Ontology Development 101 [57]. With time, other methodologies
appeared, based on software engineering (Unified Process Methodology for ontology design),
software architectures (ontology design pattern methodology) and agile methodologies [1].
This section details the NeOn methodology used in this thesis.

NeOn Methodology for Building Ontology Networks. An ontology network is composed of
ontologies related to one another via meta-relationships such as mapping, modularization, version, and
dependency relationships. Suárez-Figueroa et al. [126] proposed a methodology, named the
NeOn methodology, to build such ontology networks. This methodology takes into account the existence of
multiple ontologies in ontology networks, collaborative ontology development, the dynamic
dimension, and the reuse and re-engineering of knowledge resources. It is composed of a set of
scenarios that the knowledge engineer can combine in different ways, and any combination should
include Scenario 1.

• Scenario 1: From specification to implementation. The first scenario is composed of:
(1) the knowledge acquisition activity, carried out during the whole development. During this
activity, knowledge engineers simultaneously acquire knowledge and write the specification
that the ontology should fulfill. This gives rise to the ontology requirements specification
document (ORSD). Then, a quick search for knowledge resources using the terms in the
ORSD permits us to know which types of resources are available for possible reuse. (2)
The scheduling activity can then start. It uses the ORSD and the knowledge resources (if they exist) to carry
out the rest of the activities (i.e., conceptualization, formalization, and implementation) using
an existing methodology such as On-To-Knowledge.

• Scenario 2: Reusing and re-engineering non-ontological resources (NORs). Knowledge
engineers decide which NORs are to be used. Then, from these resources, the terms are extracted
and the ontology is built.

• Scenario 3: Reusing ontological resources. In this scenario, the knowledge engineers use
existing ontological resources to build the ontology. From each ontology selected, a part or a
whole can be reused. Knowledge engineers can also perform a re-engineering of ontological
resources (following Scenario 4) or the merging of ontologies of the same domain to obtain
a new ontology (following scenarios 5 and 6).


• Scenario 4: Reusing and re-engineering ontological resources. In this activity, knowledge
engineers re-engineer ontological resources before their integration in the ontology
(corresponding activity of Scenario 1).

• Scenario 5: Reusing and merging ontological resources. This scenario consists of merging/aligning
ontological resources of the same domain. Merging permits one to obtain a new ontology;
alignment permits one to establish links among the selected resources in order to create
a network.

• Scenario 6: Reusing, merging and re-engineering ontological resources. In this scenario,
knowledge engineers decide not to use the set of merged ontological resources as it is,
but to re-engineer it. The re-engineered set of merged ontologies is integrated in the
corresponding activity of Scenario 1.

• Scenario 7: Reusing ontology design patterns (ODPs). In this scenario, knowledge en-
gineers access ODPs repositories in order to reuse ODPs. ODPs can be used to reduce
modelling difficulties, speed up the modelling process or check the adequacy of modelling
decisions.

• Scenario 8: Restructuring ontological resources. This scenario can be performed as
follows: (1) modularizing the ontology into different ontology modules; (2) pruning the branches
of the taxonomy not considered necessary; (3) extending the ontology by including new
concepts and relations; and (4) specializing the branches that require more granularity by
including more specialized domain concepts and relations.

• Scenario 9: Localizing ontological resources. This consists of adapting an existing ontology to one
or several languages and culture communities in order to obtain a multilingual ontology,
for example by translating all ontology labels into one or several natural languages.

4.2.2 Knowledge representation languages and query languages

Once the ontology is modelled, it will be put in a form understandable by the machine and queries
will be made to retrieve information. At this level, a very important decision is to choose the
language(s) to model the knowledge and to make queries. Among the multitude of knowledge rep-
resentation languages, the selection criteria is based on the knowledge base that one wants to build
and the inference mechanisms needed by applications that will use the ontology. In the following
paragraphs, we will present some knowledge representation languages and queries languages.

4.2.2.1 Knowledge representation languages

In the early 1990s, a set of knowledge representation languages was invented. These languages were
based mostly on FOL (CycL, KIF, Ontolingua, OCML, and F-Logic) and others were based on DLs
and production rules (LOOM). The growth of the Internet has led to the creation of languages that exploit
the characteristics of the Web (RDF(S), OWL). These languages are called Web-based ontology
languages or Ontology Markup Languages. Knowledge representation Languages can be classified
according to the goal at which they aim. One can distinguish languages to improve the process
of ontology building (OCML), languages that help to make inferences (OIL, OWL), languages
that permit the design of ontologies (Ontolingua), and languages that permit exchange on the Web
(RDF(S), OWL, SHOE, OIL) [57]. We will be focusing on the languages that permit the exchange
of data on the Web in general, and especially on RDF(S) and OWL languages. We will present
these languages according to the following two main dimensions: Knowledge Representation (de-
scription of how the components in the ontology can be modelled) and Reasoning Mechanisms
(used to create an inference engine with the corresponding deductive mechanisms).

1. RDF(S). The acronym RDF stands for Resource Description Framework where Resource is
everything that can be uniquely identified and referenced simply by using a Uniform Re-
source Identifier (URI). Description means that all the resources are described (e.g., by us-
ing properties and relationships between resources). Framework means that they are based
on a formal template defining all possible relationships between resources. The RDF graph
can be represented by a set of triples (Subject, Predicate, Object). The triple is the smallest
description structure in RDF and each one represents a declaration. All statements follow
the same pattern. The subject is the resource to be described, the predicate (property value)
refers to a property type applicable to that resource and the object represents data (literal) or
other resources. An RDF document is a labelled multigraph in which each triple corresponds
to a directed arc whose label is the predicate, whose source node is the subject and whose target
node is the object [51, 57, 62].
RDF(S) is an extension of the RDF language which provides RDF documents with a struc-
ture. Its purpose is to provide an encoding and interpretation mechanism to represent re-
sources for software, and to describe and link all Web resources [51, 57, 62].

• RDF(S) Knowledge representation: RDF(S) provides the most basic primitives for
ontology modelling achieving a balance between expressivity and reasoning. In RDFS,
concepts are known as classes. Classes are referenced either by their name or by a
URL to a Web resource and can include their documentation and their super-classes.
Attributes of classes are defined as properties. The domain of a property is the class to
which the attribute belongs, and the range is the type of the attribute value. Neither cardinality
constraints nor default values can be defined for attributes. Class attributes can
be represented by defining the domain of the property, and including the property value
in the class definition. Concept taxonomies are built in RDF(S) by defining a class as
a subclass of one or more classes. However, neither disjoint nor exhaustive knowledge
in concept taxonomies can be expressed. Binary relations between classes are defined
in RDF(S) as properties. However, relations of higher arity cannot be represented di-
rectly. Assertions made by instances can be represented using reification (transforming
the value of a property into a statement).
• Reasoning mechanisms: Most of the inference systems for RDF(S) are devoted to
querying information about RDF ontologies.

RDFS is used to represent lightweight ontologies, which can be serialized using RDF/XML,
N-Triples or Turtle, and the serialized triples can be saved in files, in a database or in a triplestore.
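
As an illustration, the following Python sketch builds a few RDF(S) triples with the rdflib library and serializes them in Turtle (the namespace and all resource names are invented for the example):

    from rdflib import Graph, Namespace, Literal, RDF, RDFS, XSD

    EX = Namespace("http://example.org/health#")     # hypothetical vocabulary
    g = Graph()
    g.bind("ex", EX)

    # Lightweight schema: two classes and a subClassOf hierarchy
    g.add((EX.Person, RDF.type, RDFS.Class))
    g.add((EX.Patient, RDF.type, RDFS.Class))
    g.add((EX.Patient, RDFS.subClassOf, EX.Person))

    # A property with its domain and range
    g.add((EX.examined_in, RDF.type, RDF.Property))
    g.add((EX.examined_in, RDFS.domain, EX.Person))
    g.add((EX.examined_in, RDFS.range, EX.Health_facility))

    # Instance-level triples (Subject, Predicate, Object)
    g.add((EX.Bob, RDF.type, EX.Patient))
    g.add((EX.Bob, EX.age, Literal(34, datatype=XSD.integer)))
    g.add((EX.Bob, EX.examined_in, EX.Central_Clinic))

    print(g.serialize(format="turtle"))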


2. OWL (Ontology Web Language). The OWL language is an XML language used to repre-
sent ontologies modelled using DLs and for publishing and sharing knowledge on the Web.
It helps to represent very rich knowledge, to reason on the data, and it satisfies the following
conditions: expressivity, clarity, readability, unambiguity and extensibility [57, 62]. It is divided
into three layers: OWL Lite, OWL DL, OWL Full.

• OWL-Lite: It corresponds to the SHIF(D) family of DLs and is characterized by its
simplicity, its ease of programming, and its quick reasoning. It has been designed to
express simple constraints for which inference algorithms are decidable.
• OWL-DL: It is the SHOIN(D) family of DLs. It has a higher expressivity than OWL-
Lite. With OWL-DL, real world elements are represented by concepts, roles and indi-
viduals. Concepts and roles have a structured description to which semantics are asso-
ciated, and any manipulation must be consistent with that semantics. Related to it are the
following OWL 2 profiles:
– OWL EL (Existential Language): a profile based on a family of description logics
that only provides existential quantification of variables.
– OWL QL (Query Language): it permits answering queries that can be rewritten
into a relational query language.
– OWL RL (Rule Language): it designates a reasoning profile that can be
implemented using a rule-based system.

• OWL-Full: This is a language that has been designed to ensure compatibility with
RDFS without providing decidability of inference algorithms. It is characterized by
maximum expressivity, full compatibility with RDF/RDFS and very complex reason-
ing. However, it is slow and undecidable.

Concerning knowledge representation and inference mechanism in OWL:

• Knowledge representation: Different OWL languages have different knowledge
representation capabilities. In a broader sense, with OWL, concepts are known as classes. A class may
contain its documentation and a list of primitives that define it. These are the superclasses,
equivalent classes, disjoint classes, conjunction, disjunction or negation of other
classes, the enumeration of all class instances, collections of individuals of the class,
and property restrictions (existential restriction, role filler, number restriction) that contain a
reference to the property to which the restriction is applied and an element for expressing
the restriction. Class attributes must be defined as properties in the ontology. There
are two types of properties: ObjectProperty (whose range is a class) and DatatypeProperty
(whose range is a datatype). To define a property, one may state its domain
and range. Besides, in OWL, one can define property hierarchies, property equivalences,
inverse properties, transitive properties, symmetric properties, global cardinality
restrictions on all kinds of properties, functional properties and inverse functional
properties. Higher-arity relations must be defined as concepts. Instances are defined
using the RDF vocabulary. With instances, one can assert that two instances are equivalent
or different, and that a set of individuals are all different from each other.
• Reasoning mechanisms: Different ontology languages have different expressivity and
inference mechanisms. In the broader sense, the semantic of OWL is described in two
different ways: Firstly, as an extension of the RDF(S) model and secondly, as a direct
model-theoretic semantics of OWL. OWL allows the inclusion of additional statements
in its ontologies apart from those explicitly defined in the language. Multiple inheri-
tance is allowed in OWL ontologies. Constraint checking can be performed on the
values of properties and their cardinalities. OWL assumes monotonic reasoning, even
if class definition or property definitions are split up in different Web resources. This
means that facts and entailments declared explicitly or obtained with inference engines
can only be added, never deleted, and that new information cannot negate previous information.
RDF(S) query engines, storage systems, and parsers can be employed to manage OWL
ontologies since they can be serialized in RDF(S).

In the above paragraphs, we presented two Semantic Web languages for sharing and exchang-
ing data on the Web. We have seen previously that the RDF language helps to present facts without
necessarily bringing semantics to the types of data. The RDFS language complements the RDF
language by making it possible to create the data types and by creating the class hierarchy, and the
OWL language comes with more semantics, permitting the representation of classes, properties
with restrictions on these classes and properties. Note that OWL-DL does not represent
relationships between roles, so not all knowledge can be represented using OWL-DL.
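
To give an idea of how such OWL constructs are manipulated in practice, here is a minimal sketch with the owlready2 Python library (all names are illustrative; running the reasoner assumes a Java runtime, since owlready2 bundles the HermiT reasoner):

    from owlready2 import get_ontology, Thing, ObjectProperty, sync_reasoner

    onto = get_ontology("http://example.org/tb.owl")       # hypothetical IRI

    with onto:
        class Person(Thing): pass
        class Health_facility(Thing): pass
        class examined_in(ObjectProperty):
            domain = [Person]
            range  = [Health_facility]

        # Class defined by an existential property restriction:
        # an ExaminedPerson is a Person examined in some Health_facility.
        class ExaminedPerson(Person):
            equivalent_to = [Person & examined_in.some(Health_facility)]

        bob = Person("Bob")
        bob.examined_in.append(Health_facility("Central_Clinic"))

        sync_reasoner()        # DL classification; should infer ExaminedPerson for Bob

    print(bob.is_a)            # Bob's asserted and inferred classes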

4.2.2.2 Rule Languages

Rules can be seen as an alternative or complementary stack to represent semantics and inferences
over the web of data. A rule system is a specific implementation of syntax and semantics, which
can extend to include existential quantification, universal quantification, logical disjunction, logical
conjunction, and so on. The semantics of RDFS and some subset of OWL (OWL2 RL) can be
axiomatized as first-order logic that can be used as a foundation for a rule-based implementation.
Thus, the rules can be seen as part of an ontology supplementing RDF or OWL declarations [51,
57, 63]. In the following paragraphs, we will present three rule languages: RIF, DATALOG and
SWRL.

1. Rule Interchange Format (RIF). This is designed to exchange rules on the Web in general
and the Semantic Web in particular. RIF has been a W3C (World Wide Web Consortium)
recommendation since 2013. It is an extensible set of rule dialects.
RIF includes three dialects: a core dialect, a basic logic dialect, and a production rule
dialect [51].

• RIF-CORE: It is the core of all the primitives common to the RIF dialects. It corresponds
to Horn logic without function symbols.
• RIF-BLD (Basic Logic Dialect): It consists of a set of well-formed formulae built from
terms over an alphabet. It helps to represent logic programs over positive facts
and corresponds to Horn logic without the equality symbol. The reasoning is
based on the deduction of new facts by the instantiation of the universal rules (application
of the modus ponens rule and evaluation of conjunctive or disjunctive formulae).


• RIF-PRD (Production Rule Dialect): This is used to represent the production rules
whose application triggers actions to add, modify or delete facts in a class.
2. DATALOG. DATALOG was initially developed for deductive databases. The
idea was to couple a database with a set of logical rules, allowing the deduction of information
[51]. DATALOG rules are used to mix classes and relations. A knowledge base (a DATALOG
program) is a set of Horn clauses without function symbols. OWL DL does not allow the
mixing of classes and properties, but DATALOG permits it; combining OWL-DL with
DATALOG makes the ontology more expressive.
3. Semantic Web Rule Language (SWRL). SWRL is a Semantic Web rule language
submitted to the W3C in 2004. It is based on the combination of OWL and
RuleML/DATALOG, the idea being to use DATALOG-style rules on OWL ontologies to model
more knowledge [93]. To do this, the symbols in the rules can be OWL identifiers (see the example below).
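
For illustration, a SWRL-style rule over the epidemiological vocabulary used in this chapter could be written as follows (the predicate names are illustrative and not taken from a specific ontology):

    Person(?p) ∧ hasSputumResult(?p, "KB-positive") → TBCase(?p)

Added to an OWL ontology, such a rule lets a rule engine infer new assertions, here that the individual bound to ?p is a TB case.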

In the above paragraphs, we presented how ontologies are represented in a machine readable
form. We also presented OWL's limitations and showed that OWL-DL can be combined with
rules for more expressivity. Note that modelling knowledge is not enough: some mechanisms must
be put in place to obtain the information needed.

4.2.2.3 Query Languages

A query language is a language used to request and to retrieve information in a data source. There
are several types of Semantic Web query languages. Some permit the extraction of information
from the knowledge base and others can, in addition, make inferences on the data, as seen below.

1. SparQL Protocol and RDF Query Language (SparQL). Its purpose is to provide service-
level interoperability and structured data across the Internet in order to ease access to all data on
the Web. It permits the extraction of all types of data contained in RDF; the exploration
of the data; the transformation of RDF(S) from one vocabulary to another; the building of
new graphs from RDF query graphs; the updating of RDF graphs; the discovery of hidden
information using the SparQL service; and the federation of data from multiple SparQL
queries (see the query sketch after this list).
2. Semantic Query-Enhanced Web Rule Language (SQWRL). It is a query language for
extracting information from OWL ontologies [93]. It offers two types of query operators.
The basic operators use SWRL rules as a pattern, replacing the rule consequent with
SQWRL selection operations. The collection operators provide a set of operators for grouping,
aggregation, disjunction, etc.
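
As an illustration of SparQL, the sketch below runs a SELECT query over a small rdflib graph (the vocabulary and data are invented for the example):

    from rdflib import Graph, Namespace, RDF

    EX = Namespace("http://example.org/health#")    # hypothetical vocabulary
    g = Graph()
    g.add((EX.Bob, RDF.type, EX.Patient))
    g.add((EX.Bob, EX.examined_in, EX.Central_Clinic))

    # Which patients were examined, and in which health facility?
    query = """
        PREFIX ex: <http://example.org/health#>
        SELECT ?patient ?facility
        WHERE {
            ?patient a ex:Patient ;
                     ex:examined_in ?facility .
        }
    """
    for row in g.query(query):
        print(row.patient, row.facility)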

4.2.3 Ontology development tools

Building ontologies is complex and time consuming. Ontology development tools provide inter-
faces that help users carry out some of the main activities of the ontology development process
such as conceptualization, implementation, consistency checking and documentation. Ontology
tools can be classified into categories:

• Ontology development tools: These tools are used to build a new ontology from scratch.
They also give support to ontology documentation, ontology export and import, ontology
graphical edition, etc.

• Ontology merging and alignment tools: These tools are used for merging many ontologies
or for aligning different ontologies.

• Ontology learning tools: These can be used to derive ontologies (semi)automatically from
data sources.

• Ontology querying tools and inference engines: These allow querying ontologies easily
and performing inferences with them.

• Ontology evaluation tools: These tools are used to evaluate the content of ontologies and their related technologies. They aim to reduce problems that arise when one needs to integrate and use ontologies in other ontologies or in information systems.

Tools allowing the manipulation of the ontology can be classified in several categories: edition,
integration, visualization, validation, interrogation, extraction, exportation, storage, inference en-
gines, and development tools. A tool can belong to several categories depending on the features it
offers. According to this classification, we can distinguish:

• Protege, Ontolingua Server, WebOnto and OntoSaurus: These are tools used to build
ontologies completely by hand, automatically or by integrating existing ontologies.

• Virtuoso and R2RML: This set of tools allows ontologies to be built automatically by extracting them from a database and storing them in a triple store.

• Protege, VOWL, Ontolingua Server, WebOnto and ODE: These allow the visualization
of the ontology in the form of a graph.

• FaCT++, HermiT, Pellet, Drools and Racer: These make inferences over knowledge bases. They also validate whether the knowledge base is consistent.

• Virtuoso and D2RQ: These allow access to data from relational databases as an RDF graph.

• Sesame, Jena and Virtuoso: These tools offer storage facilities for RDF triples.

• Sesame, Jena and SWRL API: These tools provide APIs for developing Semantic Web
applications.

In this thesis, we will particularly use Protege and SWRL:


• Protege is an Open-source software available in desktop and web version and developed by
Stanford University [91]. Its plug-in system makes it expandable. It includes many features:
an ontology editor with which one can define the hierarchy of classes and properties, an-
notate resources and make restrictions on classes and properties; an interface allowing the
integration of ontologies (fusion, mapping, alignment); an interface to define SWRL rules
and SQWRL requests; an inference engine to validate the ontology; a rules engine to validate
rules written in SWRL and execute queries written in SQWRL; a SparQL query interface; an
ontology visualization interface; and an interface to export data in different formats (RDF/XML, RDF/Turtle, OWL/XML, JSON). The Web version allows multiple users or groups of users to build ontologies in a distributed way.

• SWRL API is a Java API for programming Semantic Web applications by applying inference
rules on ontologies encoded in RDFS or in OWL [93]. It consists of a library collection that
allows developers to work with SWRL rules and SQWRL queries in their applications; to
model, reason and query knowledge bases; to manage OWL reasoners; to use SWRL rules
engines; and to execute queries written in SQWRL. Drools is used by the SWRL API as the underlying rule engine to execute SWRL rules and SQWRL queries.
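To illustrate how these two tools are typically combined in code, here is a minimal sketch assuming the OWL API together with the SWRL API factory and query methods shown below (exact names may differ between versions); the ontology file and the class and property names are illustrative assumptions reusing the examples of this thesis.

import java.io.File;
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.OWLOntology;
import org.semanticweb.owlapi.model.OWLOntologyManager;
import org.swrlapi.factory.SWRLAPIFactory;
import org.swrlapi.sqwrl.SQWRLQueryEngine;
import org.swrlapi.sqwrl.SQWRLResult;

public class SqwrlExample {
    public static void main(String[] args) throws Exception {
        // Load the ontology with the OWL API (hypothetical file name)
        OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
        OWLOntology ontology = manager.loadOntologyFromOntologyDocument(new File("epicam.owl"));

        // Create a SQWRL query engine and select the ages of adult tuberculosis patients
        SQWRLQueryEngine queryEngine = SWRLAPIFactory.createSQWRLQueryEngine(ontology);
        SQWRLResult result = queryEngine.runSQWRLQuery("q1",
            "PatientTuberculeux(?p) ^ agePatient(?p, ?age) ^ swrlb:greaterThan(?age, 15) -> sqwrl:select(?age)");

        while (result.next()) {
            System.out.println("age = " + result.getLiteral("age").getInt());
        }
    }
}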

4.2.4 Ontology evaluation

Before reusing existing ontologies, their content should be evaluated. The purpose of ontology
evaluation is to determine what the ontology defines correctly, what it does not, and what it does
incorrectly. It includes ontology verification, ontology validation and ontology assessment [57].

• Ontology verification consists of ensuring that the ontology correctly implements the ontology requirements and the competency questions.

• Ontology validation verifies if the ontology definitions really model the real world for which
the ontology was created.

• Ontology assessment is focused on judging the ontology content from the user’s point of
view. Different types of users and applications require different means of assessing ontol-
ogy. Gómez-Pérez and al. [57] proposed the evaluation of ontologies using some evaluation
criteria:

– Technical evaluation: This is done by the developers. It permits to know if the ontol-
ogy is well built, and if it is consistent.
– User evaluation: This is performed by users. It permits to check if the ontology meets
their needs.
– Consistency evaluation: This is performed to check whether it is possible to obtain contradictory conclusions from valid input definitions. A given definition is consistent if and only if the individual definitions are consistent and no contradictory knowledge can be inferred from other definitions and axioms (an automated check of this kind is sketched after this list).


– Completeness evaluation: It is difficult to prove the incompleteness of an ontology. But, if a concept or a definition is missing, the ontology can be said to be incomplete. In a broader sense, an ontology is complete if and only if:
∗ Everything that is supposed to be in the ontology is explicitly stated or can be
deduced;
∗ Any definition is complete. This can be determined by: (a) precise knowledge of
the definition (does it define the world?) (b) all knowledge that is necessary, but not
explicit (it should be noted that definitions can be inferred from other definitions
and axioms).
– Conciseness evaluation: An ontology is concise if (a) it does not store useless definitions, (b) there are no redundancies between definitions of terms, and (c) redundant definitions cannot be deduced from explicit definitions.
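As a small illustration of the consistency check mentioned above, the following sketch, assuming the OWL API together with the HermiT reasoner and a hypothetical ontology file, asks the reasoner whether the ontology is consistent.

import java.io.File;
import org.semanticweb.HermiT.ReasonerFactory;
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.OWLOntology;
import org.semanticweb.owlapi.model.OWLOntologyManager;
import org.semanticweb.owlapi.reasoner.OWLReasoner;

public class ConsistencyCheck {
    public static void main(String[] args) throws Exception {
        // Load the ontology to be evaluated (hypothetical file name)
        OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
        OWLOntology ontology = manager.loadOntologyFromOntologyDocument(new File("epicam.owl"));

        // Create a HermiT reasoner over the ontology and check its consistency
        OWLReasoner reasoner = new ReasonerFactory().createReasoner(ontology);
        System.out.println("Consistent: " + reasoner.isConsistent());
    }
}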

4.3 Ontology learning

Acquiring knowledge for building an ontology from scratch, or for refining an existing ontology is
costly in time and resources. Ontology learning techniques are used to reduce this cost during the
knowledge acquisition process. Ontology learning refers to the extraction of ontological knowl-
edge from unstructured, semi-structured or fully structured knowledge sources in order to build an
ontology from them with little human intervention [7, 74, 120, 145]. In this section, we will present
knowledge sources generally used for ontology learning (section 4.3.1), some ontology learning
techniques (section 4.3.2) and ontology learning evaluation (section 4.3.3).

4.3.1 Knowledge sources for ontology learning

The process of developing an ontology requires knowledge acquisition from any relevant sources.
There are several possible sources of knowledge: domain experts or unstructured, semi-structured,
and structured sources [126].

4.3.1.1 Domain experts

A domain expert is a person knowledgeable about a domain. To get knowledge from domain experts, a knowledge engineer conducts interviews. This process might lead to knowledge loss or, even worse, introduce errors, because misunderstandings arise frequently in human communication.

4.3.1.2 Unstructured knowledge sources

Unstructured knowledge sources contain knowledge that does not have a pre-defined organization.
These are all kinds of textual resources (Web pages, manuals, discussion forum postings, specifications, analysis and conception documents, source code comments) and multimedia contents
(videos, photos, audio files) [5, 7, 22, 74, 29, 54, 120]. Unstructured sources are the most recurrent
and can permit us to extract a more complete knowledge. However, the unstructured sources are
easily accessible to human information processing only. For example, extracting formal specifica-
tions from arbitrary texts is still considered a hard problem because sentences might be ambiguous
and, in some cases, no unique correct syntactic analysis is possible [63].

4.3.1.3 Structured knowledge sources

Structured knowledge sources contain knowledge described by a schema. It is advantageous to use these knowledge sources because they contain directly accessible knowledge [63]. Some structured knowledge sources include:

• Ontologies: Before constructing an ontology from scratch, one may look at other ontologies
that could be reused [103, 123, 126];
• Knowledge bases: In knowledge bases, one can generate discovered rules as input to develop
a domain ontology [7, 71];
• Databases: Terms to be used to build an ontology can be extracted from a database schema [7, 28, 31, 65, 143].

4.3.1.4 Semi-structured knowledge sources

Semi-structured knowledge sources contain knowledge having a structure that already reflects part
of the semantic interdependencies. This structure facilitates the extraction of a schema [63]. Some
examples of semi-structured knowledge sources are:

• Folksonomies/thesauri: It is advantageous to extract knowledge from folksonomies and/or thesauri to build an ontology because they reflect the vocabulary of their users [52, 135];
• XML (Extensible Markup Language): The aim of XML data conversion to ontologies is the
indexing, integration and enrichment of existing ontologies with knowledge acquired from
XML documents [58];
• UML/meta-model: To learn an ontology from UML or/and meta-model, one approach is to
extract OWL classes and properties from diagrams or to use Ontology UML Profile (OUP)
which, together with Ontology Definition Meta-model (ODM), enable the usage of Model
Driven Architecture (MDA) standards in ontological engineering [53];
• Entity-relation diagram: They can be used to learn ontologies because they are used to de-
scribe the information managed by the databases [44];
• Source code [13, 14, 22, 50, 143]: Generally, in source code, the names of data structures,
variables, functions are close to the terms of the domain.


A lot of work has been done on the extraction of ontological knowledge from texts, databases,
XML files, vocabularies, and the use of ontologies to build or enrich other ontologies. This has
resulted in a wide range of models, techniques and tools for the generation of knowledge structure
that can be considered as an intermediate process when constructing ontologies. It should be noted
that few works go beyond extracting concepts and properties from source code whereas axioms
and rules are also key elements of ontologies.

4.3.2 Ontology learning techniques

To extract knowledge from knowledge sources, many techniques are used [7, 60, 74, 120]. Shams-
fard and Barforoush [120] proposed a classification of these techniques by considering symbolics,
statistics and multi-strategies.

4.3.2.1 Symbolic techniques

In symbolic techniques, the extraction process consists of examining text fragments that match
some predefined rules, looking for lexico-syntactic patterns corresponding for instance to taxo-
nomic relations or scanning for various types of templates related to ontological knowledge. A
symbolic method can be rule-based, linguistic-based or pattern-based.

1. Rule-based models are represented as a set of rules where each rule consists of a condition
and an action [145].
• Logical rules may be used to discover new knowledge by deduction (deduce new
knowledge from existing ones) or induction (synthesize new knowledge from expe-
rience). For example, inductive logic programming can be used to learn new concepts
from knowledge sources [7, 29, 82, 120];
• Association rules aim at finding correlations between items in a dataset. This technique
is generally used to learn relations between concepts [5, 7, 29, 120] and can be used
to recognize a taxonomy of relations [7] or to discover gaps in conceptual definitions
[29, 120, 138].
2. Linguistic approaches (syntactic analysis, morpho-syntactic analysis, lexico-syntactic pat-
tern parsing, semantic processing and text understanding) are used to derive knowledge
from text corpus [7, 120]. This technique can be used to derive an intentional description
of concepts in the form of natural language description [138].
3. Pattern/Template-driven approach allows searching for predefined keywords, templates or
patterns. Indeed, a large class of entity extraction tasks can be accomplished by the use of
carefully constructed regular expressions [80].
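As an illustration of this pattern-driven approach applied to Java source code, the following sketch uses two deliberately simplified regular expressions (they are illustrative assumptions and do not cover the full Java grammar) to propose candidate concepts and properties.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternExtraction {
    // Simplified patterns: class declarations and private fields
    private static final Pattern CLASS_PATTERN = Pattern.compile("\\bclass\\s+(\\w+)");
    private static final Pattern FIELD_PATTERN = Pattern.compile("\\bprivate\\s+(\\w+)\\s+(\\w+)\\s*;");

    public static void main(String[] args) {
        String source = "public class PatientTuberculeux { private int agePatient; }";

        Matcher classMatcher = CLASS_PATTERN.matcher(source);
        while (classMatcher.find()) {
            System.out.println("Candidate concept: " + classMatcher.group(1));
        }
        Matcher fieldMatcher = FIELD_PATTERN.matcher(source);
        while (fieldMatcher.find()) {
            System.out.println("Candidate property: " + fieldMatcher.group(2)
                + " (declared type " + fieldMatcher.group(1) + ")");
        }
    }
}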

Although very powerful for particular domains, symbolic techniques are inflexible because of
their strong dependency on the structure of the data. Symbolic techniques are precise and robust,
but can be complex to implement, and difficult to generalize [120].


4.3.2.2 Statistic-based techniques

Statistic analysis for ontology learning is performed from input data to build a statistical model [7,
74, 120, 145]. Several statistical methods for extracting ontological knowledge have been identified
in the literature:

1. Co-occurrence or collocation detection identifies the occurrence of some words in the same
sentence, paragraph or document. Such occurrences hint at a potential direct relation between
words [72]. These techniques can be used to discover terms that are siblings to each other
[25].

2. Clustering can be used to create groups of similar words (clusters) which can be regarded as representing concepts, and to further organize these clusters hierarchically. This technique is
generally used for learning concepts by considering clusters of related terms as concepts and
learning taxonomies by organizing these groups hierarchically [29]. Ontology alignment can
use agglomerative clustering to find candidate groups of similar entities in ontologies [138].

3. Hidden Markov Models (HMMs) are generative statistical models that are able to generate data sequences according to rather complex probability distributions and that can be used
for classifying sequential patterns [46, 112, 119]. Zhou and Su [144] have used HMM for
Named Entity Recognition; Maedche and Staab [5] have used the n-gram models based on
HMMs to process documents at the morphological level before supplying them to term ex-
traction tools. Labsky et al. [76] present the use of HMMs to extract information on products
offered by companies from HTML files.

4.3.2.3 Multi-Strategy learning

Multi-Strategy learning techniques leverage the strengths of the above techniques to extract a wide
range of ontological knowledge from different types of knowledge sources [7, 120, 145]. For ex-
ample, Maedche and Staab [5] present the use of clustering for concept learning and association
rules to learn relations between these concepts.

4.3.3 Ontology learning evaluation

After the extraction process, the evaluation phase permits to know whether the knowledge extracted
is accurate and to conclude on the quality of the knowledge source. The evaluation of ontological
knowledge has been addressed by several authors in the literature [6, 37]. Dellschaft and Staab [37] have
proposed two ways to evaluate ontological knowledge: (1) In manual evaluation by human experts,
the knowledge is presented to one or more domain experts who have to judge to what extent it is
correct; (2) The comparison of the knowledge to existing reference vocabularies/ontologies to
ensure that it covers the studied domain.


4.4 Related works on ontology learning from source code

Despite the large amount of available source code and the fact that it may contain relevant knowledge of the domain addressed by the software [13, 14, 22, 143], the number of existing works on knowledge extraction from this knowledge source is quite low. Parser-based approaches and machine learning techniques are the ones commonly used for knowledge extraction from source code.

4.4.1 Parser-based approach

A straightforward solution to extract knowledge from source code is to use a parser. There are
works in this direction for generating knowledge base (RDF triples) or extracting ontological
knowledge (concepts and properties) from source codes using parsers. For instance, CodeOntol-
ogy [10, 11] parser is able to analyze Java source code and serialize it into RDF triples. From these
triples, highly expressive queries using SPARQL (SPARQL Protocol and RDF Query Language)
can be executed for different software engineering purposes including the searching of specific
software component for reuse. Ganapathy and Sagayaraj [50] used QDox1 generator to generate
an ontology that will further enable the developers to reuse source code efficiently. QDox gener-
ator is a parser that can be used for extracting classes, attributes, interfaces and method definition
from Java source code. In the approach proposed by [143], the authors defined the component parts of the source code and broke the source code down into these components. The source code is browsed and the different components are analyzed in order to take the appropriate action, which is the extraction of the knowledge sought. This knowledge can be used in supplementing and assisting ontology development from database schemas.
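For illustration, the following is a hedged sketch of how a QDox-style parser can expose candidate concepts and properties from a source tree; the source folder is hypothetical and the calls reflect the QDox JavaProjectBuilder API as we understand it, not the exact pipeline of the cited works.

import java.io.File;
import com.thoughtworks.qdox.JavaProjectBuilder;
import com.thoughtworks.qdox.model.JavaClass;
import com.thoughtworks.qdox.model.JavaField;

public class QDoxExtraction {
    public static void main(String[] args) throws Exception {
        JavaProjectBuilder builder = new JavaProjectBuilder();
        builder.addSourceTree(new File("src/main/java"));   // hypothetical source folder

        // Every parsed class is a candidate concept, every field a candidate property
        for (JavaClass javaClass : builder.getClasses()) {
            System.out.println("Concept: " + javaClass.getName());
            for (JavaField field : javaClass.getFields()) {
                System.out.println("  Property: " + field.getName());
            }
        }
    }
}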

Beyond RDF triples, terms, concepts and properties extraction, existing parsers do not provide
services for axioms and rules extraction. To overcome these limits, they need to be improved.
However, building and/or updating parsers for programming languages is a non-trivial, laborious
and time-consuming task [45, 90].

4.4.2 Machine learning-based approach

Machine learning approaches are also proposed to extract knowledge from source code.

Kalina Bontcheva and Marta Sabou [22] presented an approach for ontology learning from software artifacts such as software documentation, discussion forums and source code by using the language processing facilities provided by the GATE platform2. GATE is an open-source software developed in Java for building and deploying Human Language Technology applications such as parsers, morphology and tagging tools, Information Retrieval tools, Information Extraction components, etc. To extract concepts from source code, Bontcheva and Sabou used the GATE key phrase extractor, which is based on TF.IDF (term frequency/inverse document frequency).
1 https://github.com/paul-hammant/qdox
2 https://gate.ac.uk/


The TF.IDF approach is an unsupervised machine learning technique which consists of finding words/phrases that are characteristic of the given text, while ignoring phrases that occur frequently in the text simply because they are common in the language as a whole. When using TF.IDF on source code, high frequency terms specific to the programming language can be eliminated and only terms specific to the given software project are selected as relevant to the domain (ontology concepts). This approach is used to extract concepts. However, ontological knowledge is also made up of properties, axioms and rules.
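The following minimal sketch, under the assumption of a toy corpus of already tokenised source files, illustrates this TF.IDF intuition: terms shared by every file (language keywords) obtain a score of zero, while project-specific terms obtain a positive score.

import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class TfIdf {
    // tf-idf of a term in one document of the corpus (natural logarithm, no smoothing)
    static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
        double tf = Collections.frequency(doc, term) / (double) doc.size();
        long docsWithTerm = corpus.stream().filter(d -> d.contains(term)).count();
        double idf = Math.log(corpus.size() / (double) docsWithTerm);
        return tf * idf;
    }

    public static void main(String[] args) {
        List<String> f1 = Arrays.asList("public", "class", "PatientTuberculeux", "int", "agePatient");
        List<String> f2 = Arrays.asList("public", "class", "Examen", "int", "id");
        List<List<String>> corpus = Arrays.asList(f1, f2);

        System.out.println(tfIdf("public", f1, corpus));             // 0.0: common to all files
        System.out.println(tfIdf("PatientTuberculeux", f1, corpus)); // > 0: candidate concept
    }
}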

Labsky et al. [76] presented an approach for information extraction about products offered by companies from their websites. To extract information from HTML documents, they used Hidden Markov Models to annotate these documents. Tokens modelled by this HMM include words, formatting tags and images. The HMM is modelled using four states: the target state (T), which is the slot to extract; the prefix and suffix states (P, S), which constitute the slot's context; and the irrelevant tokens, modelled by a single background state (B). This approach permitted the extraction of slots and of the relations between nearby slots; for example, a product image often follows its name. Unlike the authors' approach, which consists of term extraction, our approach uses meta-data extracted from source code in order to identify to which ontological component every term or group of terms corresponds.

4.5 Conclusion

This chapter presented ontology engineering. Ontologies are knowledge representation languages used to model a domain/problem. We presented in detail the different types of ontologies and the methodologies, methods and tools involved in their development. Ontologies can be classified into lightweight ontologies and heavyweight ontologies. Lightweight ontologies are mod-
elled using rules or software engineering techniques. Heavyweight ontologies are modeled using
logical techniques. To develop them, knowledge must be acquired from different sources such
as domain experts, unstructured, semi-structured, and structured sources. Semi-structured knowl-
edge sources contain knowledge facilitating the extraction of a schema. For instance, in the source
code, the names of data structures, variables, functions are close to the terms of the domain and
surrounded by a set of keywords. Several methods are proposed for ontologies development. The
top-down method consists of starting from general concepts and evolves towards major special-
izations. The bottom-up method involves the construction of the ontology from the most specific
concepts, which are then grouped into categories. The middle-Out method consists of an interme-
diary layer of concepts that serves as a starting point. With the manual construction method, the
various resources containing knowledge are collected, terms are identified and the ontology is con-
structed. Automatic or semi-automatic approaches, also called ontology learning, implement the automatic generation of terms from knowledge sources. The methods proposed for this task can
be classified in symbolic techniques, statistical techniques and multi-strategy techniques. Amongst
ontologies development methodologies proposed in the literature, we used the NeOn methodology
in this thesis. It is composed of a set of scenarios that the knowledge engineer can combine in
different ways. Once the knowledge is acquired from knowledge sources, knowledge representa-
tion languages such as RDFS and OWL allow it to be put in a form understandable by the machine, and query languages are used to retrieve information. Given that building ontologies is a tedious task, ontology development tools allow us to carry out some of the main activities of the ontology
development process. Then, tools are used to build ontologies from scratch, by merging/aligning
many ontologies, by semi-automatically extracting knowledge from knowledge sources, etc.

We have seen in this chapter that despite the large amount of source code available and the fact that it contains relevant domain knowledge, it is rarely used for ontology building. To use source code to construct an ontology, knowledge must be extracted from it. In chapter 5, we propose a method for ontology learning from source code.



Ontology learning from source code using Hidden Markov Models
5

Source code contains well-defined words in a language that everyone understands (for example the
elements generally found on the user interface), some statements with a particular lexicon specific
to the programming language and to the programmer. For example, in Java programming language,
the term "class" is used to define a class, the terms "if", "else", "switch", "case" are used to define
the business rules (candidate to become rules). Other terms defined by the programmer such as
"PatientTuberculeux" are used to represent the names of classes (candidate to be concept); the
term "examenATB" is used to define the relation (ObjectProperty) with cardinality (candidate to
become axiom) between the classes "PatientTuberculeux" and "Examen"; and the group of terms
"int agePatient" is used to define a property (DataProperty) of the class "PatientTuberculeux".
This chapter presents how ontological knowledge can be extracted from Java source code to build
an ontology. Given that the approach used to extract the knowledge is based on Hidden Markov
Models, which are probabilistic models, this chapter presents probabilistic models before the presentation and the use of the approach. Section 5.1 presents the probabilistic models in
general, section 5.2 presents the Hidden Markov Models, section 5.3 presents the source code and
section 5.4 presents the approach.

5.1 Probabilistic models

Probabilities are used to build models in order to represent random phenomena. Temporal proba-
bility models particularly are used to model phenomena that can be represented as a set of events
evolving in time. A probability model for a particular experiment is a probability distribution that
predicts the relative frequency of each outcome if the experiment is performed a large number of
times. For example, in a model of "weather" tomorrow, the outcomes might be sunny, cloudy,
rainy, and snowy. A subset of these outcomes constitutes an event. For example, the event of
precipitation is the subset consisting of {rainy, snowy}. This section presents computations with
probabilities and probabilities models.


5.1.1 Computations with Probabilities

Probability theory is the mathematical study of random phenomena. As a mathematical foundation for statistics, probability theory is essential to many human activities that involve quantitative analysis of data. Methods of probability theory also apply to the description of complex systems given only partial knowledge of their state [46, 112]. For instance, the insurance industry and markets
use actuarial science to determine pricing and to make trading decisions [46]. In this section, some
important definitions in the field of probability theory and mathematical statistics will be presented
that are relevant for the further presentation of HMMs.

Definition 5.1 (Probability). According to Oxford dictionary, probability is "the extent to which an
event is likely to occur, measured by the ratio of the favourable cases to the whole number of cases
possible".

Probability is quantified as a number between 0 and 1, where 0 indicates impossibility and 1 indicates certainty (formula 5.1). The higher the probability of an event, the more likely it is that
the event will occur [46, 112].

0 ≤ P (ω) ≤ 1 (5.1)

Definition 5.2 (Random experiment). A random experiment describes a procedure which can be
repeated arbitrary often and produces a random result from a well defined set of possible outcomes
[46, 112].

Definition 5.3 (Random event). A random event (also called elementary event) is a single result
or a set of potential results of a random experiment [46, 112].

Definition 5.4 (Sample space). A sample space (also called universe) is the complete set of all
possible results of a random experiment.

Sample space is noted by Ω and ω refers to the elements of the space. The probability of the
entire sample space is 1, and the probability of the null event is 0. The usual set of operations (con-
junction, disjunction and complement) are generally applied to events with respect to the sample
space.

Definition 5.5 (Relative frequency). The relative frequency f (A) of an event A that occurred n times during an N-fold repetition of a random experiment is obtained as the quotient of its absolute frequency c(A) = n and the total number of trials N (formula 5.2).

f(A) = \frac{c(A)}{N} = \frac{n}{N} \qquad (5.2)
A pragmatic derivation of the notion of probability is directly based on the relative frequency of an event: the probability P (A) of the occurrence of the event A is then defined as its relative frequency. This probability satisfies three axioms:


• The measure of each event is between 0 and 1, written as 0 ≤ P (A = ai ) ≤ 1, where A is a random variable representing an event and ai are its possible values.

• The measure of the whole set is 1: \sum_{i=1}^{n} P (A = ai ) = 1.

• The probability of a union of disjoint events is the sum of the probabilities of the individual events: P (A = a1 ∪ A = a2 ) = P (A = a1 ) + P (A = a2 ), where a1 and a2 are disjoint.
Definition 5.6 (Joint probability). The joint probability (also called intersection) of two events A and B is the probability that both events occur on a single performance of an experiment; it is denoted by P (A ∩ B). The full joint probability distribution is the joint probability distribution over all the random variables.
Definition 5.7 (Independence). The events A and B are statistically independent if the observation of B does not provide information about the occurrence of A and vice versa.

When events are independent, the joint probability P (A, B) is defined by the formula 5.3.

P (A|B) = P (A),
P (B|A) = P (B), (5.3)
P (A, B) = P (A)P (B).

Independence assertions are usually based on knowledge of the domain and help in reducing the size of the domain representation and the complexity of the inference problem.

In ideal cases, exploiting conditional independence reduces the complexity of representing the joint distribution from exponential to linear.
Definition 5.8 (Union). The union probability of two events A and B (denoted P (A ∪ B)), i.e. the probability that A or B occurs on a single performance of an experiment, is defined by P (A ∪ B) = P (A) + P (B) − P (A ∩ B).
Definition 5.9 (Unconditional or prior probabilities). Unconditional or prior probabilities (or just "priors") refer to the degree of belief in propositions in the absence of any other information.

In fact, most of the time, when calculating probability, some information called evidence has
already been revealed.
Definition 5.10 (Conditional probability or posterior probability). The conditional probability for
the occurrence of an event A under the condition of the occurrence of the event B having oc-
curred before is derived from the probability of A and B occurring jointly and the unconditional
probability of B (formula 5.4).

P(A \mid B) = \frac{P(A, B)}{P(B)}, \qquad P(B) \neq 0 \qquad (5.4)

A and B are independent if P (B|A) = P (B) (or equivalently, P (A|B) = P (A)).


Definition 5.11 (Marginalization). The marginal probability is the probability distribution over a subset of the ran-
dom variables.

Marginalization and conditioning turn out to be useful rules for all kinds of derivations involv-
ing probability expressions.

Definition 5.12 (Random variables). A random variable can be seen as a function that takes an
elementary event and returns a value. They are characterized by means of their distribution func-
tion:

• A discrete random variable (e.g., X) is a random variable that takes its values in a countable set of values (e.g., x1 , x2 , ..., xN ).

• A continuous random variable (e.g., X) is a random variable that takes arbitrary values
(e.g., x ∈ R).

Definition 5.13 (Probabilistic inference). Probabilistic inference is the computation of posterior probabilities for query propositions given observed evidence.

Theorem 5.1 (Chaining rule). The chaining rule links a joint probability to conditional probabilities. For n random variables, it is given by P(X_1, ..., X_n) = \prod_{i=1}^{n} P(X_i \mid X_{i-1}, ..., X_1).

Theorem 5.2 (Bayes’ Rule). Bayes rule (or Bayes’ theorem), is derived from chain rule by P (Y |X) =
P (X|Y )P (Y )/P (X).

Bayes' theorem allows the computation of the posterior probability P (B|A) of event B from the conditional probability P (A|B) by taking into account model knowledge about the events A and B in the form of the associated prior probabilities. Bayes' rule is very useful in practice because there are many cases where one does not have a good direct estimate of P (Y |X) but has good estimates of P (X|Y ), P (Y ) and P (X), and needs to compute P (Y |X).
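As a small worked example with hypothetical numbers, suppose an epidemiological setting where D denotes having tuberculosis, with prior P(D) = 0.01, and X denotes a positive screening test, with P(X|D) = 0.9 and P(X|¬D) = 0.05. Bayes' rule then gives the posterior probability of disease given a positive test:

P(D \mid X) = \frac{P(X \mid D)\,P(D)}{P(X \mid D)\,P(D) + P(X \mid \neg D)\,P(\neg D)} = \frac{0.9 \times 0.01}{0.9 \times 0.01 + 0.05 \times 0.99} \approx 0.15

Even with a fairly reliable test, the posterior remains modest because the prior probability of the disease is low.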

5.1.2 Probabilistic models

This section presents two main groups of probability models (Bayes networks, Dynamic Bayes
Networks), how to estimate their parameters, and how they are used.

5.1.2.1 Bayes networks (BNs)

A Bayesian network is a directed acyclic graph in which each node is annotated with quantita-
tive probability information (see figure 24). Its topology is defined by the set of nodes and links,
specifying the conditional independence relationships that hold in the domain, in such a way that
is made precise shortly. The intuitive meaning of an arrow is typically that X has a direct influence on Y, which suggests that causes should be parents of effects. It is usually easier for a domain expert to decide what direct influences exist in the domain than to specify the probabilities themselves.

Figure 24: Bayes Network example (nodes: host factors, parasite virulence factors, environmental factors predisposing to exposure, Plasmodium infection; outcomes: asymptomatic, mild, severe)

Bayes networks are used to represent the probabilistic knowledge of a given application. For
example, a clinician’s clinical knowledge of causal relationships between diseases and symptoms.
They are useful for modelling the knowledge of an expert system or a decision support system,
in a situation where causality plays an important role. The Pathfinder application is an example of an application based on a Bayes network, developed by extracting the expertise of different experts and modelling it in a Bayesian network. Its predictions were as good as those of the doctors' expertise [112].

One way to define what the network means (its semantics) is to define the way in which it represents a specific joint distribution over all the variables. This is done by defining, for each node, parameters corresponding to the conditional probabilities P (Xi | Parents(Xi )). In summary, the specification of a Bayesian network is given by:

• Each node corresponds to a random variable, which may be discrete or continuous;

• A set of directed links or arrows connects pairs of nodes. If there is an arrow from node X
to node Y , X is said to be a parent of Y ;

• The graph has no directed cycles. It is a directed acyclic graph;

• Each node Xi has a conditional probability distribution P (Xi | Parents(Xi )) that quantifies
the effect of the parents of the node.


5.1.2.2 Dynamic Bayes Networks(DBNs)

A Dynamic Bayes Network (DBN) is a Bayes network in which the situation to be modelled is dynamic (the world can change with time). In this model, the world can be seen as a series of snapshots, or time slices, each of which contains a set of random variables, some observable
and others not. The interval between time slices depends on the problem. DBNs are used to model
dynamic situations, e.g., situations in which information is collected as time passes. In this case,
the random variables are indexed by time and [112]:

• Dynamic changes are seen as a sequence of states in which each state represents a situation
at a given time t;

• The random variable is indexed by time where:

– Xt represents the set of unobservable (hidden) variables describing the state of the
modelled environment at time t,
– Et represents all the variables observed (evidence) at time t.

In DBNs, dynamic changes are caused by a so-called Markovian process. In fact, in these networks,
the current state depends only on a finite number of previous states. For example, in a first-order
Markov process, the hidden state at time t − 1 determines the hidden state at time t; given the state at time t − 1, the hidden state at time t is independent of all earlier hidden states.

In summary, a DBN is a Bayesian network in which the system is described by considering:

• P (X0 ) which specifies how everything get started (the prior probability distribution at time
0);

• P (Xt+1 |Xt ), called the transition model, which specifies the conditional distribution P (Xt |Xt−1 , Xt−2 , ...). In other words, a state provides enough information to make the future conditionally independent of the past, that is P (Xt |X0:t−1 ) = P (Xt |Xt−1 ).

• P (Et |Xt ) called the sensor model or the observation model, which specifies the evidence
variables Et which could depend on previous variables as well as the current state variables.

There exist specific types of DBNs:

• A Markov Chain is a special type of DBN having a finite set of states, where only the current state
influences where it goes next [46]. For example, in the source code, each source file can be
modelled by a sequence of words.

• Hidden Markov Models are special types of DBN in which the state of the process is de-
scribed by a single discrete random variable (which is considered to be hidden) and the
possible values of the variable are the possible states of the world. HMMs are detailed in section 5.2.


5.1.2.3 Inferences in temporal models

The basic task for any probabilistic inference system is to compute the posterior probability distri-
bution for a set of query variables, given some observed events. In the context of DBNs, from the
belief state and a transition model, the task is to predict how the world might evolve in the next
time step and to update the belief state. In the next paragraphs, we will present the main inference
task that must be solved in DBNs.

• Filtering (state estimation): The purpose of filtering is to calculate the belief state, that is, the posterior distribution of the most recent hidden variable, P (Xt |e1:t ). For example, what is the probability that the word the programmer will enter now is "public"?
• Prediction: The purpose of prediction is to calculate the posterior distribution over a future state given all evidence to date, P (Xt+k |e1:t ) where k > 0. For example, what is the probability that the tenth word that will be entered by the programmer is "int"?
• Smoothing: Smoothing consists of calculating the posterior distribution over a past state given evidence up to the present, that is P (Xk |e1:t ) where 0 ≤ k ≤ t. For example, what is the probability that the word "final" has been entered by the programmer?
• Most likely explanation: The purpose of the most likely explanation is to find the sequence of states that best explains the observations, argmax_{x1:t} P (x1:t |e1:t ). For example, what is the set of keywords in a program?

5.1.2.4 Learning temporal models

Learning probabilistic models consists of estimating the model parameters by taking into account the spec-
ified model structure. In the particular case of temporal models, the task of learning is to determine
the transition and sensor models. This task can be done by using data or by using data and a
specialized algorithm [112].

• Learning on data: Parameters required can be learned from sample data. Statistical parameter
estimation methods provide reliable results with sufficiently many training samples;
• Using a specialized algorithm such as Baum-Welch algorithm, the Baldi-Chauvin algorithm,
Segmental k-Means Algorithm, Viterbi training algorithm [46, 112].

5.2 Hidden Markov Models (HMMs)

Hidden Markov Models are particular types of Markov Chain composed of a finite state automaton
with edges between any pair of states that are labeled with transition probabilities. An HMM describes a two-stage statistical process in which the behavior of the process at a given time t is only dependent on the immediate predecessor state. It is characterized by the probability between states P (qt |q1 , q2 , ..., qt−1 ) = P (qt |qt−1 ), and for every state at time t an output or observation ot is gen-
erated. The associated probability distribution is only dependent on the current state qt and not on
any previous states or observations: P (ot |o1 , ..., ot−1 , q1 , ..., qt ) = P (ot |qt ) [42, 46, 48, 75, 119].

A first order HMM perfectly describes the source code because it can be seen as a string
sequence typed by a programmer in which the current word (corresponding to an assigned hidden
state) depends on the previous word. In this HMM, the observed symbol depends only on the
current state [46, 112, 119]. Equation 5.5 presents the joint probability of a series of observations
O1:T given a series of hidden states Q1:T . The HMM of figure 30 shows how the source code can be
modeled using a HMM. In this figure, the observations are the words ("public", "class", "Patient",
etc.) typed by the programmers and each of these words are labeled by the hidden states "PRE",
"TARGET", "POST", and "OTHER".


P(O_{1:T}, Q_{1:T}) = P(q_1)\, P(o_1 | q_1) \prod_{t=2}^{T} P(q_t | q_{t-1})\, P(o_t | q_t) \qquad (5.5)

Filtering, smoothing, prediction, and the most likely explanation are four uses of HMMs. The
probability that a string O is emitted by a HMM M is calculated as the sum of all possible paths
by equation 5.6.

P(O \mid M) = \sum_{q_1, \ldots, q_l} \prod_{k=1}^{l+1} P(q_{k-1} \to q_k)\, P(q_k \uparrow o_k) \qquad (5.6)

Where q0 and ql+1 are limited to qI and qN respectively and ol+1 is an end-of-word symbol. The ob-
servable output of the system is the sequence of symbols emitted by the states, but the underlying
state sequence itself is hidden.

In the most likely explanation, the goal is to find the sequence of hidden states V (O | M ) that
best explains the sequence of observations (equation 5.7) [46, 112, 119]. To this end, the sequence
of states V (O | M ) which has the greatest probability to produce an observation sequence is
searched.

For example, in automatic translation, one may want the most probable string sequence that
corresponds to the string to be translated. In this case, instead of taking the sum of the probabilities,
the maximum must be chosen (equation 5.7).

P(O \mid M) = \max_{q_1 \ldots q_l \in Q^l} \prod_{k=1}^{l+1} P(q_{k-1} \to q_k)\, P(q_k \uparrow o_k) \qquad (5.7)

Before using the model, its parameters (transition probabilities, emission probabilities and ini-
tial probabilities) must be calculated using statistical learning or specialized algorithms [46].


5.2.1 HMMs structures

In the main application areas of HMM-based modeling, the input data to be processed have a
chronological or sequential structure. One assumes that the models are run through in causal
chronological sequence and, therefore, the model states can be arranged sequentially. Transition
probabilities to states that describe data segments lying backwards in time are constantly set to
zero. In graphical representations of HMMs (see figures 25, 26, 27, 28), such edges which are
excluded from possible state sequences are omitted for the purpose of simplification. The diagram
of figure 25 shows the general architecture of an instantiated HMM. Each oval shape represents a
random variable that can adopt any of a number of values. The random variable x(t) is the hidden
state at time t (with the model from the above diagram, x(t) ∈ {x1, x2, x3}). The random variable
y(t) is the observation at time t (with y(t) ∈ {y1, y2, y3, y4}). The arrows in the diagram denote
conditional dependencies. For HMMs to be applied for the analysis of data that is already avail-
able, one must first assume that the data to be analysed was generated by a natural process which
obeys similar statistical regularities. Then one tries to reproduce this process with the capabilities
of HMMs as closely as possible. If this attempt is successful, inferences about the real process can
be drawn on the basis of the artificial model.

Figure 25: General architecture of HMMs (hidden states x(t − 1), x(t), x(t + 1), each emitting an observation y(t − 1), y(t), y(t + 1))

There are many types of HMMs [46]:

• Linear HMMs (figure 26) are the simplest models, in which only transitions to the
respective next state and to the current state itself are possible with some positive probability.
This model is used to capture variations in the temporal extension of the patterns described
with the help of the self-transitions.

• Bakis models (figure 27) are models in which the modeling of duration is achieved if the
skipping of individual states within a sequence is possible. This model is widely used in the
field of automatic speech and handwriting recognition.

• Left-to-right models (figure 28) are used to model larger variations in the temporal structure
of the data.


Figure 26: Linear HMM

Figure 27: Bakis HMM

Figure 28: Left-to-right HMM

• Ergodic models (figure 29) are models having a completely connected structure.

Figure 29: Ergodic HMM

Several inference problems are associated with hidden Markov models: filtering, prediction,
smoothing, and the most likely explanation [46, 112].


5.2.2 Parameters estimations

Parameter estimation consists of finding, given an output sequence (or a set of such sequences),
the best set of state transition and emission probabilities. The task is usually to derive the max-
imum likelihood estimation of the parameters of the HMMs given the set of output sequences.
This task can be done by training the model on a dataset using statistical learning or using special-
ized algorithms such as Baum-Welch algorithm, the Baldi-Chauvin algorithm, Segmental k-Means
Algorithm, EM algorithm, Viterbi training algorithm [46, 112].

Statistical parameter estimation methods provide reliable results with sufficiently many train-
ing samples. Powerful HMMs can thus be created only if sample sets of considerable size are
available for the parameter training. Moreover, only the parameters of the models and not their con-
figuration (e.g., the structure and the number of free parameters) can be determined automatically
by the training algorithms. Considered intuitively, the parameter estimation methods for HMMs
are based on the idea to "observe" the actions of the model during the generation of an observation
sequence. The original state transition and output probabilities are then simply replaced by the
relative frequencies of the respective events.
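As a minimal sketch of this intuition, assuming labelled state sequences are available, the following Java method estimates transition probabilities as the relative frequencies of consecutive labels; the toy sequences in the main method are hypothetical.

import java.util.*;

public class HmmTraining {
    // Estimates P(next state | current state) as relative frequencies over labelled sequences
    static Map<String, Map<String, Double>> estimateTransitions(List<List<String>> sequences) {
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        for (List<String> seq : sequences) {
            for (int t = 1; t < seq.size(); t++) {
                counts.computeIfAbsent(seq.get(t - 1), k -> new HashMap<>())
                      .merge(seq.get(t), 1, Integer::sum);
            }
        }
        Map<String, Map<String, Double>> probabilities = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> entry : counts.entrySet()) {
            double total = entry.getValue().values().stream().mapToInt(Integer::intValue).sum();
            Map<String, Double> row = new HashMap<>();
            entry.getValue().forEach((next, c) -> row.put(next, c / total));
            probabilities.put(entry.getKey(), row);
        }
        return probabilities;
    }

    public static void main(String[] args) {
        // Two toy labelled state sequences (annotations of two hypothetical source files)
        List<List<String>> sequences = Arrays.asList(
            Arrays.asList("PRE", "PRE", "TARGET", "POST"),
            Arrays.asList("PRE", "TARGET", "OTHER"));
        System.out.println(estimateTransitions(sequences));
    }
}

Emission and initial probabilities can be estimated in the same way, by counting (state, word) pairs and first labels respectively.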

5.2.3 HMMs usage

Hidden Markov Models are applied in many fields where the task is the modelling and analysis of chronologically organized data such as genetic sequences, handwritten texts, and automatic
speech recognition. In the following paragraphs, we are going to present some successful applica-
tions for music genre classification, speech and handwriting recognition.

5.2.3.1 Music genre classification

Musical genres are labels used to distinguish between different types or categories of musical style.
The growing amount of music available creates a need for automated classification. This task can
be done by assigning a genre according to the listening impression. Music can be considered as a
high-dimensional digital time-variant signal, and music databases can be very large. As music is a
time-varying signal, several segments can be employed to extract features in order to produce a set
of feature vectors that characterizes a decomposition of the original signal according to the time
dimension. Then, HMMs can be used for music genre classification [66]. Iloga et al. proposed an approach based on HMMs that represents each genre with a state; the model statistically captures the transitions between genres. This approach is used to classify music genres by modelling each genre with one HMM [66].

5.2.3.2 Speech recognition

In automatic speech recognition, the output of the models corresponds to a parametric feature
representation extracted from the acoustic signal. In contrast, the model states define elementary acoustic events (speech sounds of a certain language). Sequences of states then correspond to
words and complete spoken utterances. If one is able to reconstruct the expected internal state
sequence for a given speech signal, then hopefully the correct sequence of words spoken can be
associated with it and the segmentation and classification problem can be solved in an integrated
manner. The possibility to treat segmentation and classification within an integrated formalism
constitutes the predominant strength of HMMs. When decomposing models for spoken or writ-
ten words into a sequence of sub-word units, we implicitly assumed that more complex models
can be created from existing partial HMMs by concatenation. Such construction principles are ei-
ther explicitly or implicitly applied in order to define compound models for different recognition
tasks. There are many speech recognition systems based on HMMs [46]. For instance the speech
recognition system of RWTH Aachen University [46]; ESMERALDA development environment
for pattern recognition [46].

5.2.3.3 Handwriting recognition

In the field of handwriting recognition, the signal data considered can be represented as a linear se-
quence of words written. The temporal progress of the writing process itself defines a chronological
order of the position measurements which are provided by the respective sensors. The time-line of
the signal thus virtually runs along the trajectory of the pen. The classical application of automatic
processing of writing is called Optical Character Recognition (OCR). The goal is to automatically
transcribe the image of the writing into a computer internal symbolic representation of the text.
Many systems based on HMMs are developed to address the problem of hand writing recogni-
tion [46]: Raytheon BBN Technologies [46]; RWTH Aachen handwriting recognition system [46];
ESMERALDA Offline HWR Recognition System [46].

5.3 Source code

During software development, it is recommended to write the source code according to good pro-
gramming practices, including naming conventions [18]. These practices inform programmers on
how to name variables, organize and present the source code. This organization can be used to
model source code using HMMs. For example, from Java source code, we can say that at a time
t, the programmer enters a word (e.g. "public" at the beginning of a Java source file). Thus, the
keyword "public" at time t conditions the next word at time t + 1 which in this case can be "class",
"int", etc. We can say that P RE and T ARGET are the hidden states and "public" and "class" are
respectively their observations.

5.3.1 Source code description

Source code contains several types of files: files describing data, files processing data, user interface
files and configuration files.


5.3.1.1 Files describing data

These files describe the data to be manipulated and equally, some constraints on this data (e.g., data
types). In Java EE for example, there are entities whose names are close to the terms of the domain
that will be transformed into tables in the database. These files often contain certain rules to verify
the reliability of the data. Thus, from these files, we can retrieve concepts, properties, axioms and
rules.

5.3.1.2 Files containing data processing

Located between user interface files and data description files is the data processing files of the
source code consisting of:

• Control: For example, restricting certain data from certain users (e.g., only the attending
physician has the right to access the data), checking the validity of a field (checking whether
the data entered in an "age" field is of type integer);
• Calculation: For example, converting a date of birth into an age, determining the date of the
next appointment of a patient, calculating the body mass index of a patient based on his/her
weight and height.

These are the algorithms implementing the business rules to be applied to the data. They are thus
good candidates for axioms and rules extraction.

5.3.1.3 User interfaces files

The User interfaces are composed of files which describe the information that will be presented to
users for data viewing or recording. Unlike the first two file types, these files contain the words of
a human-readable vocabulary that can be found in a dictionary. User interfaces usually provide:

• Translations allowing navigation from one language to another, control for users to enter the
correct data;
• An aid allowing users to know for example, the role of a data entry field.

User Interfaces are therefore good candidates for concepts and their definitions, properties, axioms
and rules extraction.

5.3.1.4 Configuration files

These files allow developers to specify certain information such as the type and path of a data
source, different languages used by users, etc. For instance, from these files, the language labels (e.g. English, French, Spanish) for terms can be extracted.

The files we just presented generally contain comments that can be useful for knowledge ex-
traction or ontology documentation. Knowledge extraction from user interfaces/web interfaces
has already been addressed in [25, 143], knowledge extraction from text has been presented in
[4, 5, 22, 29]. In this chapter, we will focus on knowledge extraction from files describing data and
their processing.

5.3.2 Modelling source code using HMMs

Figure 30: An example of HMM modeling the Java source code

A first order HMM perfectly describes the source code because it can be seen as a sequence
typed by a programmer in which the current word (corresponding to an assigned hidden state)
depends on the previous word. In this HMM, the observed symbol depends only on the current
state [46, 112, 119]. Formula 5.8 presents the joint probability of a series of observations O1:T
given a series of hidden states Q1:T . The HMM of figure 30 shows how the source code can be
modelled using a HMM. In this figure, the observation states are the words ("public", "class",
"Patient", etc.) typed by the programmers and each of these words are labeled by the hidden states
"PRE", "TARGET", "POST", and "OTHER".


P(O_{1:T}, Q_{1:T}) = P(q_1)\, P(o_1 | q_1) \prod_{t=2}^{T} P(q_t | q_{t-1})\, P(o_t | q_t) \qquad (5.8)

The probability that a word X is emitted by a HMM M is calculated as the sum of all possible
paths by:

P(X \mid M) = \sum_{q_1, \ldots, q_l} \prod_{k=1}^{l+1} P(q_{k-1} \to q_k)\, P(q_k \uparrow x_k) \qquad (5.9)

Where q0 and ql+1 are limited to qI and qN respectively and xl+1 is an end-of-word symbol. The
observable output of the system is the sequence of words in the source code files and emitted by
the states, but the underlying state sequence itself is hidden.


In the most likely explanation, the goal is to find the sequence of hidden states (or source code
labels) V (X | M ) that best explains the sequence of observations (words in the source code-
formula 5.10) [46, 112, 119]. To do this, the sequence of states V (X | M ) which has the greatest
probability to produce a sequence of source code is searched [46, 112, 119].

P(X \mid M) = \max_{q_1 \ldots q_l \in Q^l} \prod_{k=1}^{l+1} P(q_{k-1} \to q_k)\, P(q_k \uparrow x_k) \qquad (5.10)
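The sketch below is a minimal Java implementation of this most likely explanation (Viterbi decoding, formula 5.10) for the source code labels used in this thesis; the transition and emission probabilities in the main method are toy values given only for illustration, since in practice they are estimated from labelled files as described in section 5.4.

import java.util.*;

public class ViterbiLabeller {
    // Returns the most likely sequence of labels for the observed words,
    // given initial, transition and emission probabilities.
    static List<String> viterbi(List<String> words, List<String> states,
                                Map<String, Double> init,
                                Map<String, Map<String, Double>> trans,
                                Map<String, Map<String, Double>> emit) {
        double smooth = 1e-6;                       // probability for unseen events
        int n = words.size();
        double[][] delta = new double[n][states.size()];
        int[][] backPointer = new int[n][states.size()];

        for (int s = 0; s < states.size(); s++) {
            double e = emit.get(states.get(s)).getOrDefault(words.get(0), smooth);
            delta[0][s] = Math.log(init.getOrDefault(states.get(s), smooth)) + Math.log(e);
        }
        for (int t = 1; t < n; t++) {
            for (int s = 0; s < states.size(); s++) {
                double e = emit.get(states.get(s)).getOrDefault(words.get(t), smooth);
                double best = Double.NEGATIVE_INFINITY;
                for (int p = 0; p < states.size(); p++) {
                    double score = delta[t - 1][p]
                        + Math.log(trans.get(states.get(p)).getOrDefault(states.get(s), smooth));
                    if (score > best) { best = score; backPointer[t][s] = p; }
                }
                delta[t][s] = best + Math.log(e);
            }
        }
        int bestLast = 0;                           // backtrack from the best final state
        for (int s = 1; s < states.size(); s++)
            if (delta[n - 1][s] > delta[n - 1][bestLast]) bestLast = s;
        LinkedList<String> labels = new LinkedList<>();
        for (int t = n - 1, s = bestLast; t >= 0; s = backPointer[t][s], t--)
            labels.addFirst(states.get(s));
        return labels;
    }

    public static void main(String[] args) {
        List<String> states = Arrays.asList("PRE", "TARGET", "POST", "OTHER");
        // Toy parameters (illustrative assumptions, not trained values)
        Map<String, Double> init = Map.of("PRE", 0.7, "TARGET", 0.1, "POST", 0.1, "OTHER", 0.1);
        Map<String, Map<String, Double>> trans = Map.of(
            "PRE", Map.of("TARGET", 0.6, "PRE", 0.4),
            "TARGET", Map.of("POST", 0.5, "OTHER", 0.5),
            "POST", Map.of("OTHER", 1.0),
            "OTHER", Map.of("OTHER", 1.0));
        Map<String, Map<String, Double>> emit = Map.of(
            "PRE", Map.of("public", 0.5, "class", 0.5),
            "TARGET", Map.of("PatientTuberculeux", 1.0),
            "POST", Map.of("{", 1.0),
            "OTHER", Map.of("}", 1.0));
        System.out.println(viterbi(Arrays.asList("public", "class", "PatientTuberculeux", "{"),
                                   states, init, trans, emit));   // [PRE, PRE, TARGET, POST]
    }
}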

5.4 Ontology Learning from Source Code

The previous sections present how the source code can be modelled using HMMs. This section
presents how the HMMs can be modelled, trained and used to extract knowledge from Java source
code. The experimentation is made on EPICAMTB, the epidemiological surveillance platform
presented in chapter 3 and all the source code used are available on github1 Then, sections 5.4.1,
5.4.2, 5.4.3, 5.4.4 will present the approach we proposed for knowledge extraction from source
code, the definition and training of HMMs for knowledge extraction from Java source code, the
extraction of knowledge from EPICAMTB source code and the evaluation of knowledge extracted
respectively.

5.4.1 An approach based on HMMs for ontology learning from source code

To extract knowledge from Java source code, we designed a method divided into five main steps:
data collection, data preprocessing, entity labeling, formal language translation, and knowledge
validation.

5.4.1.1 Data collection and preprocessing

This section presents the first and the second step of the approach which are data collection and
data preprocessing.

Data collection. The data collection step consists of the extraction of a dataset necessary for the
next steps. In Java files, statements for importing third-party libraries and comments are deleted.
We proposed the definition of a regular expression that allows them to be identified.
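The sketch below illustrates this preprocessing step with two simplified regular expressions for import statements and comments; these patterns are illustrative assumptions and are not the exact expressions used in the EPICAM experimentation.

import java.util.regex.Pattern;

public class SourcePreprocessor {
    // Hypothetical patterns: import statements, block (/* ... */) comments and line (//) comments
    private static final Pattern IMPORTS = Pattern.compile("(?m)^\\s*import\\s+[^;]+;\\s*$");
    private static final Pattern BLOCK_COMMENTS = Pattern.compile("(?s)/\\*.*?\\*/");
    private static final Pattern LINE_COMMENTS = Pattern.compile("//.*");

    static String clean(String javaSource) {
        String result = IMPORTS.matcher(javaSource).replaceAll("");
        result = BLOCK_COMMENTS.matcher(result).replaceAll("");
        return LINE_COMMENTS.matcher(result).replaceAll("");
    }

    public static void main(String[] args) {
        String source = "import java.util.List;\n/* entity */ public class PatientTuberculeux { // data\n}";
        System.out.println(clean(source));
    }
}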

Data preprocessing. The purpose of data preprocessing is to put data in a form compatible
with the tools to be used in the next steps. During this phase, potentially relevant knowledge will be
identified and retrieved, and some entities will be re-coded. The problem of extracting knowledge
from the source code has been reduced to the problem of syntactic labeling. This consists of determining
1 https://github.com/jiofidelus/source2onto/


the syntactic label of the words of a text [112]. In our case, it will be a matter of assigning a label
to all the words of the source code and extracting the words marked as target words. This problem
can be solved using HMMs [112, 119]. In the following paragraphs, we will first present the HMM
structure for source code modelling. Then, we will show how this HMM is trained and finally, how
it is used to extract the knowledge from Java source code.

HMMs structure definition. To define the structure of the HMMs, we manually studied the
organization of Java source code. Generally, data structures, attributes, and conditions are
surrounded by one or more specific words, some of which are predefined by the programming language.
To label the source code, we defined four labels, corresponding
to the four hidden states of the HMM:

• PRE: Corresponding to the preamble of the knowledge. This preamble is usually defined in
advance;

• TARGET: The target (i.e. the knowledge sought), which may be preceded by one or more words
belonging to the PRE set. The knowledge we are looking for consists of the names of classes, at-
tributes, and methods, and the relationships between classes. They are usually preceded by a
piece of meta-knowledge which describes them. For example, the meta-knowledge "class" allows for
concept identification;

• POST: Any information that follows the knowledge sought. In some cases, POST is a punc-
tuation character or braces;

• OTHER: Any other word in the source code that neither precedes nor follows the knowledge
sought.

An example of an HMM annotated with labels is given in Fig. 30. Concepts, properties, axioms, and
rules are usually arranged differently in the source code. We therefore define two HMMs
to identify them: one to identify concepts, properties and axioms, and the other one
to identify rules.

Learning Model Parameters. There are several techniques to determine the parameters of an
HMM: statistical learning on data, or specialized algorithms such as Baum-Welch or Viterbi training
[46, 112]. In this work, we chose statistical learning on data to train the HMMs modelled
in the previous paragraphs. Thus, we assumed that we have access to T labeled source code files
f_t, knowing that f_t is not just a sequence of words, but a sequence of word pairs, each made of a word
and its label (see figure 30), modelled by equation 5.11. To train the model, we assume that we
can define the order in which the different words are entered by the programmer. We assume that,
before entering the first word, the programmer reflects on the label of that word and, as a function
of it, defines the label of the next word, and so on. For example, before entering the word public, the
programmer knows that its label is PRE and that the label of the next word is TARGET. Thus,
the current word depends only on the current label, the following label depends on the previous
label, and so on. The process continues until the end of the file.


f_t = [(w_1^t, e_1^t), \ldots, (w_d^t, e_d^t)],
words(f_t) = [w_1^t, \ldots, w_d^t],        (5.11)
labels(f_t) = [e_1^t, \ldots, e_d^t].

In equation 5.11, w_i^t and e_i^t are respectively the words and the labels of file f_t. In practice, the w_i^t are
the words contained in the source code (observations) and the e_i^t are the labels of the w_i^t, used as hidden states.

From the training data, we can extract statistics on:

• The first label P(q_1) (equation 5.12). The a priori probability that the first label is equal to
'a' is the number of times the first label of a source code file is 'a',
divided by the number of source code files.


P(Q_1 = a) = \frac{\sum_t freq(e_1^t = a, f_t)}{T}        (5.12)

• The relation between a word and its label, P(O_k | q_k) (equation 5.13). The conditional
probability that the k-th word is 'w', knowing that its label is 'b', corresponds to the number
of times the word 'w' is associated with the label 'b' in the source code files f_t, normalized by
the number of times the label 'b' is associated with any word in the f_t source code. For example,
"Patient" can be a concept or an attribute, but cannot be a rule.

P(O_k = w | q_k = b) = \frac{\alpha + \sum_t freq((w, b), f_t)}{\beta + \sum_t freq((*, b), f_t)}        (5.13)
To avoid zero probabilities for observations that do not occur in the training data, we added
smoothing terms (α and β).

• The relation between adjacent syntactic labels, P(q_{k+1} | q_k) (equation 5.14). The prob-
ability that q_{k+1} is equal to label 'a', knowing that q_k is equal to label 'b' (previous hidden
state), is the number of times 'a' follows 'b' in the source code of the training data, divided by
the number of times 'b' is followed by any other label.

P(q_{k+1} = a | q_k = b) = \frac{\alpha + \sum_t freq((b, a), labels(f_t))}{\beta + \sum_t freq((b, *), labels(f_t))}        (5.14)

To avoid zero probabilities for transitions that do not occur in the training data, we added
smoothing terms (α and β).

Let us consider the HMM in Fig. 30. Then, training data to identify concepts and attributes
would be: [("public", PRE), ("class", TARGET), ("Patient", TARGET), ("extends", TARGET),


Table 5.1: The initial vector - probability to have a state as the first label
f(PRE) f(TARGET) f(POST) f(OTHER)

Table 5.2: An example of a transition table


States PRE TARGET POST OTHER
PRE f(PRE,PRE) f(PRE,TARGET) f(PRE,POST) f(PRE,OTHER)
TARGET f(TARGET,PRE) f(TARGET,TARGET) f(TARGET,POST) f(TARGET,OTHER)
POST f(POST,PRE) f(POST,TARGET) f(POST,POST) f(POST,OTHER)
OTHER f(OTHER,PRE) f(OTHER,TARGET) f(OTHER,POST) f(OTHER,OTHER)

("ImogEntityImpl", TARGET), ("{", OTHER), (...), ("int", TARGET), ("age", TARGET), ...]. Tab.
5.1 presents the initial vector, which is the probability that the first label is PRE, TARGET, POST,
or OTHER; Tab. 5.2 presents the transition vector containing the frequencies that a state follows
another state; and Tab. 5.3 presents the emission vector containing the frequencies that a state emits
an observation.
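A minimal Java sketch of this counting step is given below; it only illustrates how the frequencies of Tabs. 5.1, 5.2 and 5.3 can be accumulated from labeled files (the class and field names, and the α/β smoothing values, are illustrative and are not taken from the source2onto repository):

import java.util.*;

// A minimal sketch of HMM parameter estimation from labeled source code files
// (equations 5.12-5.14). Class and field names are illustrative.
public class HmmTrainingSketch {
    static final double ALPHA = 1e-3, BETA = 1e-3;   // smoothing terms

    // A labeled token: the word typed by the programmer and its hidden label.
    record LabeledToken(String word, String label) {}

    public static void main(String[] args) {
        // Each inner list is one labeled source file f_t (cf. equation 5.11).
        List<List<LabeledToken>> files = List.of(
            List.of(new LabeledToken("public", "PRE"),
                    new LabeledToken("class", "TARGET"),
                    new LabeledToken("Patient", "TARGET"),
                    new LabeledToken("{", "OTHER")));

        Map<String, Integer> initial = new HashMap<>();
        Map<String, Map<String, Integer>> trans = new HashMap<>();
        Map<String, Map<String, Integer>> emit = new HashMap<>();

        for (List<LabeledToken> f : files) {
            initial.merge(f.get(0).label(), 1, Integer::sum);            // counts for P(q1)
            for (int i = 0; i < f.size(); i++) {
                LabeledToken t = f.get(i);
                emit.computeIfAbsent(t.label(), k -> new HashMap<>())
                    .merge(t.word(), 1, Integer::sum);                   // counts for P(O_k | q_k)
                if (i + 1 < f.size()) {
                    trans.computeIfAbsent(t.label(), k -> new HashMap<>())
                         .merge(f.get(i + 1).label(), 1, Integer::sum);  // counts for P(q_{k+1} | q_k)
                }
            }
        }

        // Example: smoothed transition probability P(TARGET | PRE), equation 5.14.
        System.out.println(prob(trans, "PRE", "TARGET"));
    }

    static double prob(Map<String, Map<String, Integer>> counts, String from, String to) {
        Map<String, Integer> row = counts.getOrDefault(from, Map.of());
        int total = row.values().stream().mapToInt(Integer::intValue).sum();
        return (ALPHA + row.getOrDefault(to, 0)) / (BETA + total);
    }
}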

Knowledge extraction. The model previously defined and trained can be applied to any Java
source code in order to identify TARGET elements. It is necessary to find, from the files
f_1, ..., f_n, a plausible sequence of states q_1, ..., q_n. For this, equation 5.10 is used to
determine the most plausible state sequence. From this sequence, the hidden states are identified
and the targets (words labeled TARGET) are extracted. In our approach, we used the
Viterbi algorithm, which provides an efficient way of finding the most plausible sequence of
hidden states [47, 134]. Algorithm 1 gives an overview of the Viterbi algorithm. More details
can be found in [46].

Any source code can then be submitted to the trained HMM, and a table similar to Tab. 5.10,
containing the probability for the hidden states to emit each word of the source code, is built.

Re-coding variables. Programmers usually use expressions made up of words from a specific
lexicon, sometimes encoded with "ad hoc" expressions, which require specific processing to assign a
new name or a human-understandable label before use. These identifiers are generally split
into words or groups of words according to the naming conventions of the programming language.

Table 5.3: An example of an observation table

         package            pac             ;             public            class            patient            ...
PRE      f(PRE,package)     f(PRE,pac)      f(PRE,;)      f(PRE,public)     f(PRE,class)     f(PRE,patient)     ...
TARGET   f(TARGET,package)  f(TARGET,pac)   f(TARGET,;)   f(TARGET,public)  f(TARGET,class)  f(TARGET,patient)  ...
POST     f(POST,package)    f(POST,pac)     f(POST,;)     f(POST,public)    f(POST,class)    f(POST,patient)    ...
OTHER    f(OTHER,package)   f(OTHER,pac)    f(OTHER,;)    f(OTHER,public)   f(OTHER,class)   f(OTHER,patient)   ...


Algorithm 1: The Viterbi algorithm [46, 134]

Let M = (π, A, B) be our HMM, with π the vector of start probabilities, A the matrix of
state-transition probabilities, and B the matrix of observation probabilities.
Let δ_t(i) = max_{q_1,...,q_{t−1}} P(O_1, ..., O_t, q_1, ..., q_{t−1}, q_t = i | M)

1. Initialization
   δ_1(i) := π_i b_i(O_1);   ψ_1(i) := 0
2. Recursion
   For all times t, t = 1, ..., T − 1:
      δ_{t+1}(j) := max_i {δ_t(i) a_{ij}} b_j(O_{t+1})
      ψ_{t+1}(j) := argmax_i {δ_t(i) a_{ij}}
3. Termination
   P*(O | M) = P(O, q* | M) = max_i δ_T(i)
   q*_T := argmax_j δ_T(j)
4. Back-tracking of the optimal path
   For all times t, t = T − 1, ..., 1:
      q*_t = ψ_{t+1}(q*_{t+1})
For example, we can have "PatientTuberculeux" → "Patient tuberculeux", "agePatient" → "Age
Patient", "listeExamens" → "liste Examens", etc. Therefore, during the re-coding, these names are
separated in order to find their real meaning in human understandable language.
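A minimal sketch of such a re-coding step, assuming the Java camelCase naming convention, is given below (the class and method names are illustrative):

// Splits a camelCase identifier into human-readable words, e.g.
// "agePatient" -> "age Patient", "listeExamens" -> "liste Examens".
public class RecoderSketch {
    public static String recode(String identifier) {
        // Insert a space before every upper-case letter preceded by a lower-case letter.
        return identifier.replaceAll("(?<=[a-z])(?=[A-Z])", " ");
    }

    public static void main(String[] args) {
        System.out.println(recode("PatientTuberculeux")); // Patient Tuberculeux
        System.out.println(recode("agePatient"));         // age Patient
        System.out.println(recode("listeExamens"));       // liste Examens
    }
}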

5.4.1.2 Entities labeling and translation into a formal language

After the extraction of knowledge, the last two steps consist of giving labels to all the extracted
terms and, given these labels, translating the extracted knowledge into a formal language.

Entities labeling. The extraction of relevant terms has yielded knowledge and meta-knowledge.
This knowledge and meta-knowledge permit us to identify to which ontological components
they may belong. For example, the code "class Patient extends Person int age", submitted to an
HMM trained to identify concepts and relations, will yield three pieces of meta-knowledge ("class",
"extends" and "int") that will be used to identify two concepts (Patient and Person), one attribute of
type integer and a hierarchical relation between "Patient" and "Person". From the extracted
knowledge, two candidates to be concepts are related if one is declared in the structure of the other. One
may identify three types of relations:

• ObjectProperty: If two classes ’A’ and ’B’ are candidates to be concepts and ’b’ of type B
is declared as attribute of class ’A’, then classes ’A’ and ’B’ are related. The attribute ’b’ is
an ObjectProperty having ’A’ as domain and ’B’ as range.

• DatatypeProperty: If a class 'A' is a candidate to be a concept and contains the attributes
'a' and 'b' of basic data types (integer, string, boolean, etc.), then 'a' and 'b' are
DatatypeProperties having the class 'A' as domain;


• Taxonomy (subClassOf): If two classes ’A’ and ’B’ are candidates to be concepts and the
class ’B’ extends the class ’A’ (in Java, the keyword "extends" is used), then, one can define
a taxonomic relation between the classes ’B’ and ’A’.

Translation into a formal language. Once all relevant knowledge is identified in the previous
phase, it is automatically translated into a machine-readable language. We use OWL for concepts,
properties and axioms, and SWRL for rules.
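The sketch below illustrates this translation with the OWL API for the example "class Patient extends Person ... int age"; the namespace and entity IRIs are illustrative, and the actual translator of the thesis is a separate Java program:

import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.*;

// Translates the identified entities of the example
// "class Patient extends Person ... int age" into OWL axioms.
public class OwlTranslationSketch {
    public static void main(String[] args) throws OWLOntologyCreationException {
        OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
        OWLDataFactory factory = manager.getOWLDataFactory();
        String ns = "http://example.org/epicam#";                       // illustrative namespace
        OWLOntology onto = manager.createOntology(IRI.create("http://example.org/epicam"));

        OWLClass patient = factory.getOWLClass(IRI.create(ns + "Patient"));
        OWLClass person  = factory.getOWLClass(IRI.create(ns + "Person"));
        OWLDataProperty age = factory.getOWLDataProperty(IRI.create(ns + "age"));

        // "class Patient extends Person" -> taxonomic relation (subClassOf)
        manager.addAxiom(onto, factory.getOWLSubClassOfAxiom(patient, person));
        // "int age" -> DatatypeProperty with Patient as domain, plus a functionality axiom
        manager.addAxiom(onto, factory.getOWLDataPropertyDomainAxiom(age, patient));
        manager.addAxiom(onto, factory.getOWLFunctionalDataPropertyAxiom(age));
    }
}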

5.4.1.3 Knowledge evaluation

After the extraction process, the evaluation phase permits us to know if this knowledge is relevant
to the related domain and to conclude on the relevance of using source code as a knowledge source.
Given that the knowledge extracted is ontological knowledge, two evaluation techniques will be
used: (1) manual evaluation by human experts, in which the knowledge extracted is presented to
one or more domain experts who have to judge to what extent this knowledge is correct; (2)
the comparison (alignment) of the extracted knowledge with gold standards, which will be existing
ontologies.

5.4.2 HMMs definition, training and use

To extract knowledge from Java source code, two HMMs have to be defined and trained: an HMM
for concept, property and axiom identification, and an HMM for rule identification. All the
algorithms for HMM training and usage have been coded in Java2.

5.4.2.1 HMM structure for concepts, properties and axioms

The HMM used to identify concepts, properties and axioms is defined by:

1. PRE = {public, private, protected, static, final}, the set of words that precede TARGET;

2. TARGET = {package, class, interface, extends, implements, abstract, enum, w_i}, ∀i,
w_{i−1} ∈ PRE ∨ (w_{i−2} ∈ PRE ∧ w_{i−1} ∈ PRE), the set of all words that we are seeking;

3. POST = { "{", ";", "}" }, the set of words that follow TARGET;

4. OTHER = {w_i}, w_i ∉ PRE ∧ w_i ∉ TARGET ∧ w_i ∉ POST, the set of all other words.

Each HMM state emits a term corresponding to a word from the source code. We have seen
that the observations emitted by the PRE set can be enumerated. However, the observations of
2 https://github.com/jiofidelus/source2onto


Table 5.4: The initial vector of the HMM for concepts, properties and axioms extraction

PRE TARGET POST OTHER


0.0 1.0 0.0 0.0

the TARGET and OTHER sets cannot be enumerated because they depend on the programmer.
We therefore considered data to be all the observations emitted by TARGET and other to be all the
observations emitted by OTHER. We obtained the HMM described by an initial vector (e.g., Tab.
5.4), a transition vector (e.g., Tab. 5.5), and an observation vector (e.g., Tab. 5.6).
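A minimal sketch of this collapsing of the open vocabulary, applied at training time when the label of each token is known, could look as follows (class and method names are illustrative):

import java.util.Set;

// During training the open vocabulary is collapsed: every token whose label is
// TARGET is replaced by the pseudo-observation "data" and every token labeled
// OTHER by "other", while the enumerable PRE and POST tokens are kept as-is.
public class ObservationMapperSketch {
    private static final Set<String> ENUMERABLE = Set.of(
        "public", "private", "protected", "static", "final", "{", ";", "}");

    static String observationOf(String token, String label) {
        if (ENUMERABLE.contains(token)) return token;
        return "TARGET".equals(label) ? "data" : "other";
    }
}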

5.4.2.2 HMM structure for rules

Rules can be contained in conditions; we therefore exploit the structure of the source code to
extract the rules. For example, the portion of code (if (agePatient > 21) {Patient = Adult}) is a rule
determining whether a patient is an adult or not. It must therefore be extracted.

The HMM to identify the rules is composed of:

1. PRE = { "}", ";", "{" }, the set of words that precede one or more TARGET;

2. TARGET = {if, else, switch, w_i} | ∃k, r ∈ ℕ, w_{i−k} ∈ PRE ∧ w_{i+r} ∈ POST: the set
of all words that follow PRE and precede POST;

3. POST = { "}" }, the end of the condition;

4. OTHER = {w_i} | w_i ∉ PRE ∪ TARGET ∪ POST: the set of all other words.

We can identify the beginning and the end of a condition, represented here by the sets PRE
and POST respectively. Note that all the observations emitted by the TARGET and OTHER sets
cannot be fully enumerated. Therefore, we have considered data to be all the observations emitted
by TARGET, and other to be all the observations emitted by OTHER.

5.4.2.3 Statistical learning of the HMMs

The LearnJava source code (composed of 59 files and 2663 statements) was downloaded from GitHub3
and, from this source code, we used statistical learning on data to calculate the values of the HMM
parameters4. Tabs 5.4, 5.5, 5.6, 5.7, 5.8 and 5.9 respectively present the initialization, transition and
observation vectors obtained after the training step.
3 https://github.com/mafudge/LearnJava
4 https://github.com/jiofidelus/source2onto/blob/master/code2onto-model/src/main/java/cm/uy1/training/HMMTrainingData.java


Table 5.5: Transition vector of the HMM for concepts, properties and axioms extraction

PRE TARGET POST OTHER


PRE 0.1686 0.8260 0.0027 0.0027
TARGET 0.0008 0.7523 0.2461 0.0008
POST 0.0603 0.0033 0.0234 0.9130
OTHER 0.7364 0.1133 0.0025 0.1478

Table 5.6: Observation vector of the HMM for concepts, properties and axioms extraction

         public   private  protected  static   final    data     {        ;        }        other
PRE      0.6417   0.1684   0.0053     0.1124   0.0722   0.0      0.0      0.0      0.0      0.0
TARGET   0.0      0.0      0.0        0.0      0.0      1.0      0.0      0.0      0.0      0.0
POST     0.0      0.0      0.0        0.0      0.0      0.0      0.6678   0.3256   0.0066   0.0
OTHER    0.0      0.0      0.0        0.0      0.0      0.0      0.0      0.0      0.0      1.0

Table 5.7: The initial vector of the HMM for rules extraction

PRE TARGET POST OTHER


0.0 0.0 0.0 1.0

Table 5.8: Transition vector of the HMM for rules extraction

PRE TARGET POST OTHER


PRE 0.0667 0.7999 0.0667 0.0667
TARGET 0.0010 0.9321 0.0659 0.0010
POST 0.0172 0.0172 0.0172 0.9484
OTHER 0.0072 0.0001 0.0001 0.9926

Table 5.9: Observation vector of the HMM for rules extraction

         {        }        ;        if       else     switch   data     other
PRE      0.8462   0.0769   0.0769   0.0      0.0      0.0      0.0      0.0
TARGET   0.0      0.0      0.0      0.0185   0.0031   0.0010   0.9774   0.0
POST     0.0      1.0      0.0      0.0      0.0      0.0      0.0      0.0
OTHER    0.0      0.0      0.0      0.0      0.0      0.0      0.0      1.0

5.4.2.4 Knowledge extraction

Once the HMMs are built, we can apply them to the source code of any Java application in order to
extract the knowledge. To do this, the most likely state sequence (equation 5.10) that produced this
source code is calculated. To calculate the most likely state sequence, we have implemented the


Viterbi algorithm [46, 47, 134] in Java5. In fact, we have exploited the structure of the HMM in the
context of dynamic programming. It consists of breaking down the calculations into intermediate
calculations which are structured in a table. An example of the Viterbi table is given by Tab.
5.10. Every element of the table is calculated using the previous ones. From this table, the
Viterbi path is retrieved by getting the frame with the highest probability in the last column and,
given this frame, searching for all the frames that were used to build it. All the elements whose labels
are TARGET are extracted as candidates.

Figure 31: An overview of the Java source code of the EPICAM project

5.4.3 Knowledge extraction from the EPICAM source code

This section presents the experimentation of the approach described in section 5.4.1. This experi-
mentation consists in extracting ontological knowledge from EPICAM source code composed of
1254 Java files and 271782 instructions. Fig. 31 presents a screenshot of some concepts from the
EPICAM source code.

5.4.3.1 Knowledge extraction from EPICAM

To extract ontological knowledge from EPICAM source code, we proceeded step by step using the
method presented in section 5.4.1.

Data collection The source files of the EPICAM platform are composed of statements, imported
libraries and comments. Data collection involves removing the imported libraries and the comments.
5 https://github.com/jiofidelus/source2onto/blob/master/code2onto-model/src/main/java/cm/uy1/modelUse/KnowledgeExtractionHMM.java


To this end, we defined the regular expression
import[\u0000-\uffff]*?;|//(.)*\n|(/\*[\u0000-\uffff]*?\*/)
to identify them. Once identified, we wrote a Java program to delete them.
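A minimal sketch of this cleaning step is shown below; it applies the regular expression above with java.util.regex (the class name and the sample input are illustrative):

import java.util.regex.Pattern;

// Removes import statements, line comments and block comments from Java source
// code, using the regular expression defined above.
public class SourceCleanerSketch {
    private static final Pattern NOISE = Pattern.compile(
        "import[\\u0000-\\uffff]*?;|//(.)*\\n|(/\\*[\\u0000-\\uffff]*?\\*/)");

    public static String clean(String javaSource) {
        return NOISE.matcher(javaSource).replaceAll("");
    }

    public static void main(String[] args) {
        String code = "import java.util.List;\n// a comment\npublic class Patient { int age; }";
        System.out.println(clean(code));   // prints the code with imports and comments removed
    }
}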

Data preprocessing Data preprocessing consists in extracting the elements likely to be relevant
from the source code and re-coding them if necessary. We have used the HMMs defined and trained
in section 5.4.2. These HMMs were applied to the source code of EPICAM by calculating the
values of the Viterbi table (see Tab. 5.10). Once the table is built, we search the Viterbi path by
getting the frame with the highest probability in the last column and, using this frame, we search for
all the frames that were used to build it. Once the Viterbi path is identified, all the elements labeled
TARGET are extracted.

         package   org.epicam      ;               public          ...    }
PRE      0         α(PRE, 2)       α(PRE, 3)       α(PRE, 4)       ...    0
TARGET   1         α(TARGET, 2)    α(TARGET, 3)    α(TARGET, 4)    ...    1
OTHER    0         α(OTHER, 2)     α(OTHER, 3)     α(OTHER, 4)     ...    0

Table 5.10: The Viterbi table (α table) built using EPICAM source code

Fig. 32 presents the set of candidates for concepts, properties, and axioms identified and Fig.
33 presents the set of candidates for rules identified.

Re-coding terms and rules To re-code the candidates extracted, we used the Java naming con-
ventions. All the candidates were browsed and, for the candidates containing keywords of the
programming language or of the frameworks used, these keywords were removed. For example, consider
the term CasTuberculoseEditorWorkflow that was extracted from the source code; the terms Editor and
Workflow are keywords of Google Web Toolkit, the technology used to build the EPICAM platform. Then, the
terms Editor and Workflow are removed and the term CasTuberculose is retained as a candidate.

After the re-coding, we moved to the next step which is the translation into formal language.

Entities identification and translation into OWL The data preprocessing phase produced a file
containing only the meta-knowledge (e.g. "package", "class", "extends", "if", "switch") and the
knowledge (e.g. "patientManagement.Patient", "Patient" or "serology"). We wrote a Java program
to browse this file in order to identify relevant knowledge. Meta-knowledge allows the identi-
fication of the candidates as concepts, properties and axioms. For example, if the string "pack-
age minHealth.Region.District.hospitals.patientRecord ... class Patient extends Person ... int age ...
List<Exam> listExam" is extracted, then the following ontological knowledge is identified:

• "package minHealth.Region.District.hospitals. patientRecord:" This is used to identify


the class hierarchy;

• "class Patient extends Person": This expression means that "Patient" and "Person" are

Semantic-aware epidemiological surveillance system


5.4 Ontology Learning from Source Code 115

Figure 32: An excerpt of candidates extracted for concepts, properties and axioms

Figure 33: An excerpt of candidates extracted for rules identification

candidates that will become concepts and there is a hierarchical relation between concepts
"Patient" and "Person";

• "int age; List <Exam> listExam": This expression means that "age" and "listExam" are

Semantic-aware epidemiological surveillance system


5.4 Ontology Learning from Source Code 116

properties of the concept "Patient"; the following axiom is also defined: a patient has only a
single age (i.e. age is a functional property).

After the identification of entities, we proposed a second Java program6 to automatically translate
them into an OWL ontology7 .

In the same way, rules were also extracted and translated into Semantic Web Rule Language8.
An example of a rule specifying the rights of a doctor on patient data is given by:
doctorsRule = "Personnel (?pers) ∧ personnel_login (?pers, login) ∧ personnel_passwd (?pers,
passwd) ∧ Patient (?p) ∧ RendezVous (?rdv) ∧ hasRDV (?rdv, ?p) ∧ patient_nom (?p, ?nom) ∧
patient_age (?p, ?age) ∧ patient_sexe (?p, ?sexe) ∧ patient_telephoneUn (?p, ?telephone) ∧
rendezVous_dateRendezVous (?rdv, ?datardv) ∧ rendezVous_honore (?rdv, ?honore) ∧
rendezVous_honore (?rdv, Non) → sqwrl:select (?nom, ?age, ?sexe, ?telephone, ?datardv, ?honore)";

5.4.3.2 Analysis of the elements extracted

The extraction process produced a set of candidates (Figs 32 and 33), but also false positives
(Tab. 5.11 presents the statistics). The false positives consist of the set of candidates that belong
to the PRE, POST or OTHER sets that normally should not be extracted as observations of
TARGET. We wrote a Java program to identify and delete them.

Tab. 5.11 presents the statistics of the candidates/groups of candidates that were extracted. After the
extraction process, we obtained different types of candidates/groups of candidates:

• Irrelevant candidates/group of candidates: These are utility classes and temporary vari-
ables. Utility classes are classes that the programmer defines to perform certain operations.
These classes usually contain constants and methods. The names of these classes are usually
not related to the domain. Temporary variables (e.g., the variables used in a loop) are used
temporarily in the source code and are not related to the domain.

• Relevant candidates/groups of candidates: These are the pieces of knowledge found. These
candidates are composed of synonyms (candidates of identical meaning) and redundancies (candidates
that come up several times). We wrote a Java program to identify and remove redundant
candidates automatically.

We also extracted conditions that are candidates to be rules. As we did with the candidates to be con-
cepts, properties and axioms, false positives were identified and deleted. From the rules extracted,
we found:
6 https://github.com/jiofidelus/source2onto/blob/master/code2onto-model/src/main/java/cm/uy1/helper/OWLHelper.java
7 https://github.com/jiofidelus/ontologies/blob/master/epicam/epicam.owl
8 https://github.com/jiofidelus/ontologies/blob/master/epicam/epicamrules.owl


Candidates Relevant Irrelevant


Concepts 1840 (72.87%) 685 (27.13%)
Properties 38355 (81.42%) 8755 (18.58%)
Axioms 3397 (83.22%) 685 (16.78%)
Rules 1484 (07.89%) 17332 (92.11%)

Table 5.11: Statistics on candidates extracted

• Irrelevant conditions: These are conditions that are not really important. For example, test-
ing whether a temporary variable is positive or is equal to a certain value. These conditions
were the most numerous;
• Relevant conditions: Conditions corresponding to a business rule (e.g., testing if a user has
access right to certain data).

5.4.4 Knowledge evaluation

The concepts, properties and axioms extracted were translated into an OWL ontology. The ex-
tracted rules are represented in SWRL. We used the Protege editor to provide a graphical visual-
ization of the ontology and rules to human experts for their evaluation. Fig. 34 presents an overview
of the ontology obtained.

Three experts from the tuberculosis surveillance domain involved in the EPICAM project
were invited to evaluate the knowledge extracted. They are from three different organizations in
Cameroon (Centre Pasteur of Cameroon, National Tuberculosis Control Program and a hospital
in Yaounde). The domain experts were asked to check first if the terms extracted are relevant to
the tuberculosis clinical or epidemiological perspectives. Second, they analyzed the axioms and
rules. First of all, they found that the terminology was relevant to tuberculosis. However, they
suggested correcting some typos caused by the names of the classes and attributes given by pro-
grammers. Axioms and rules were generally correct. Some rules were suggested to be updated,
as the business rules have evolved (e.g. user access to patient data has been refined to take into
account the user's post, such as epidemiologist, physician, nurse or administrative staff).

In line with the experts' validation, we evaluated the coverage of the ontology terms by taking
other ontologies in the biomedical domain as references. We used BioPortal [136] as a biomedical
ontology repository. BioPortal contains more than 300 ontologies, including a large number of med-
ical terminologies such as SNOMED (Systematized Nomenclature of Medicine) [122]. BioPortal
has an Ontology Recommender module that is used to find the best ontologies for a biomedical text
or a set of keywords [110]. This task is done according to four criteria: (1) the extent to which the
ontology covers the input data; (2) the acceptance of the ontology in the biomedical community;
(3) the level of detail of the ontology classes that cover the input data; (4) and the specialization
of the ontology to the domain of the input data. We gave as input keywords to the Recommender
the set of terms (concepts and properties) of the ontology extracted by our HMM. Fig. 35 shows
that the ontology terms are covered by many biomedical ontologies. In the first line of the rec-
ommended ontologies, we could see that NCIT, SNOMEDCT and ONTOPARON (accepted by the
community with a score of 75.6%) cover the terms from our ontology with a score of 82.9%, with a
level of detail of 64% and a level of specialization of 40%. We came to the conclusion that the
terms extracted by our HMM are relevant to the biomedical domain.

Figure 34: An overview of the generated OWL ontology

5.5 Conclusion

We proposed in this chapter an approach for knowledge extraction from Java source code using
Hidden Markov Models (HMMs). This approach consists of the definition of a Hidden Markov
Model, its training and use for knowledge extraction from source code. The HMMs are defined
by labeling the source code with the labels PRE, POST, TARGET and OTHER. There-
after, they are trained using existing source code and used to extract knowledge. We experimented with
this approach by extracting ontological knowledge from EPICAM, a tuberculosis epidemiological
surveillance platform developed in Java. Evaluation by domain experts (clinicians and epidemiol-
ogists) permitted us to show the relevance of the knowledge extracted. In line with the experts'
validation, we evaluated the coverage of the extracted terms by reference ontologies in the biomedical
domain. We used the Ontology Recommender from the BioPortal repository. The results of the evaluation
show that the terms are well covered by many biomedical ontologies (e.g., NCIT, SNOMEDCT,
ONTOPARON). In chapter 6, we will show how the knowledge extracted in this chapter was
used to build an ontology for tuberculosis surveillance.

Figure 35: The Ontology Recommender output from the extracted ontology terms



An Ontology for Tuberculosis Surveillance System (O4TBSS)
6

Effective management of tuberculosis requires putting in place a system which provides all the
information stakeholders need. In chapter 3, we presented the EPICAM platform used for epi-
demiological surveillance of tuberculosis in Cameroon. This platform permitted the National Tu-
berculosis Control Program to collect data and obtain the statistics they generally use. The EPI-
CAM platform uses PostgreSQL to store data and the SQL language to get information and build
statistical tables and graphics. However, the lack of logical and machine-readable relations among
PostgreSQL tables prevents computer-assisted automated reasoning. The ability to reason, that is, to
draw inferences from existing knowledge to derive new knowledge, is an important element of
modern medical applications [59] such as epidemiological surveillance systems. To support auto-
mated reasoning, ontological terms are often expressed in formal logic [59, 81]. In this chapter, we
report the development of an Ontology for Tuberculosis Surveillance System (O4TBSS) that will
permit users of TB surveillance, by using reasoning mechanisms, to derive new knowledge from
existing knowledge. The rest of the chapter is organized as follows: section 6.1 presents the methodology
used to construct the ontology, section 6.2 presents the development of the ontology and section
6.3 presents the use cases.

6.1 Ontology development methodology

During the development of the Ontology for Tuberculosis Surveillance System (O4TBSS), we
followed a methodology made up of a set of principles, design activities and phases, based on
the agile software development methodology [3, 39] presented in chapter 3 and the NeOn methodology
[125] presented in chapter 4. Our methodology is composed of the Pre-development step, presented
in section 6.1.1, and the Development and Post-development steps, presented in section 6.1.2.


6.1.1 The Pre-development step

The Pre-development step involves the specification, the analysis and the design of the application
in which the ontology will be integrated. To produce the system specifications, the Scrum Team, led
by the Scrum Master, writes an Application Specification Document (ASD). This document
contains the users' needs and all the features of the software to be developed. The analysis activity uses
the ASD to understand the system in order to delineate and identify its features. To do this, we
recommend the use of Unified Modelling Language (UML) in order to identify the actors of the
system and the use cases that will be executed by these actors. Recall that an actor is any user
outside the system who can be a person or another system. He/she uses the system and runs use
cases. A use case determines a system functionality and meets a need.

Based on software specifications and analysis, software design specifies how to represent and
build the solution. During the design of the software architecture, the different modules and the
relations among these modules are defined. If the ontology is necessary, it will be specified in the
software architecture and its role will be clearly defined. This step corresponds to scenario 1 of the
NeOn methodology in which knowledge engineers make the Ontology Requirements Specification
Document (ORSD).

At the end of the Pre-development step, the first version of the application specification, anal-
ysis and design is produced. The Product Backlog of the ontology to be built is also produced and
a Scrum Meeting will permit us to define the list of tasks to be executed to build the ontology.

6.1.2 The Development and Post-development steps

The goal of the development step is to develop the ontology through repeated cycles (iteratively)
and in modules (incrementally), allowing the Scrum Team to take advantage of what was learned
during development of earlier versions. The tasks contained in the Product Backlog are organized
in many Sprint Backlogs and executed. At each Scrum Meeting, a Scrum Review is made to evalu-
ate the evolution of the development. This step is composed of two main phases: the development
of the first version of the ontology and the development of the next versions.

First version The first phase consists of the development of the first version of the ontology
given the specifications, the analysis and the design provided by the Pre-development step. It is
composed of three activities and proceeds as follows:

1. Identification of knowledge sources: During this activity, an inventory of existing knowledge
sources (human experts, domain resources, existing ontologies) is made. Firstly, existing
ontologies are listed and analyzed. If one of them matches the needs, it is adopted. If not,
the resources identified previously must be used to build the ontology. For each resource,
the method to be used for knowledge acquisition is determined. The method chosen will guide
the choice of tool. For example, if existing ontologies are identified as relevant resources,
the Protege software [92] can be used to build the ontology by importing/merging them.


2. Knowledge acquisition: The second activity in the development step is the most critical.
Four aspects are to be considered:

(a) Acquiring knowledge from domain experts: Ideally, knowledge must be obtained
from domain experts. However, domain experts are not always available for interviews;
(b) Acquiring knowledge from existing ontologies: for each ontology selected during
the identification of knowledge sources, a part or the whole ontology can be used. For
this task, existing ontologies can be re-engineered or relevant terms can be manually
or (semi)automatically extracted. The knowledge obtained can be used to build the
ontology by merging/aligning the knowledge extracted;
(c) Using domain resources: When using domain knowledge sources for ontology build-
ing, knowledge is manually/automatically extracted from these resources. This is called
ontology learning [7, 120];
(d) The mixed approach: The mixed approach consists of the use of both existing ontologies
and domain resources to acquire the relevant knowledge and build the ontology.

3. Knowledge representation: During the knowledge representation activity, the knowledge
extracted previously is serialized in a machine-readable form. This activity can be composed
of: the construction of the ontology, which consists of converting the concepts, properties,
axioms and rules into a knowledge representation language; the adaptation of the ontology to
one or more languages and cultural communities; and the population of the ontology
obtained with instances. After the knowledge representation activity, one obtains a knowledge
base which can use automated reasoning to reason over the knowledge, make
inferences and derive new knowledge.

After the development of the first version of the ontology, the evaluation is performed. The
feedback of the evaluation, presented during the Scrum Meetings will permit us to define the next
steps of the ontology development.

The next versions. The second phase is an iterative and incremental phase in which each
increment consists of exploiting the evaluation feedback in order to complete specifications, anal-
ysis and design, and to develop the new versions of the ontology. Each increment involves the Sprint
Planning Meeting, which results in the set of features that the ontology must meet; knowledge
identification, consisting of identifying relevant knowledge to be used to complete the ontology
constructed during the previous Sprints; knowledge acquisition, which is based on the resources
identified and involves the identification of methods and tools for knowledge acquisition;
and knowledge representation. At the end of each Sprint, a Sprint Review Meeting permits us to
evaluate the ontology given the specifications, analysis and design. Note that at each review, a
reasoner is used to check the ontology consistency.

Post-development step. The Post-development step involves the integration of the developed
ontology in the related software. For example, a query interface can be developed to allow users to
access knowledge.


6.2 Ontology building

The dramatic increase in the use of knowledge discovery applications requires end users to write
complex database queries to retrieve information. Such users are not only expected to grasp the
structural complexity of complex databases but also the semantic relationships between data stored
in these databases. In order to overcome such difficulties, researchers have been focusing on knowl-
edge representation and interactive query generation through ontologies [59, 88, 121]. In clinical
practice particularly, Hauer et al. [59] have proved the relevance of the use of ontologies for knowl-
edge discovery. In this thesis, we propose the use of an ontology named Ontology for Tuberculosis
Surveillance System (O4TBSS) for knowledge discovery during epidemiological surveillance of
tuberculosis. Then, in this section, we will show how the methodology presented in section 6.1
has been followed to develop O4TBSS. Section 6.2.1 will present the Pre-development step and
section 6.2.2 will present the development step.

6.2.1 Pre-development

During the Pre-development, the specifications, analysis and design of the application which will
integrate the ontology will permit us to determine the need and role of an ontology.

6.2.1.1 Software specifications

To fight against TB, the government of Cameroon has recognized the National Tuberculosis Con-
trol Program (NTCP) as a priority program of the Ministry of Health in 2012. The goal of the
NTCP is to detect and treat patients with TB and prevent it. To this end, all the stakeholders at
the NTCP must have all needed information for decision making. Firstly, we have developed a
platform named EPICAM used for the epidemiological surveillance of TB [68].

Figure 36: Searching for patients using criteria defined by the NTCP

The EPICAM platform permits the NTCP to obtain data for tuberculosis management. This
platform integrates interfaces which allow users to request information. The figure 36 presents


an example of patient search using multiple searching criteria. The EPICAM platform uses Post-
greSQL to store data and get information given the search criteria provided by users. However,
the lack of logical and machine-readable relations among PostgreSQL tables prevents computer-
assisted automated reasoning, and useful information may be lost. Therefore, a new module of the
EPICAM platform which enables users to access all needed information is required. The main func-
tionalities of this module are:

• Provide all needed information to stakeholders;

• Facilitate the integration of other data sources such as climate and demographic data, in order
to establish risk factors;

• Discovering new knowledge from existing knowledge. For example, to correctly answer
queries such as "is patient x at risk of becoming TB-MDR?", the system must have access
to knowledge about patients (e.g., patient characteristics and treatment behaviour) and be able to
reason based on this knowledge.

6.2.1.2 Analysis

The new module of the EPICAM software must permit doctors, epidemiologists and decision
makers to get access to all the relevant knowledge. The use case these actors will execute is given
by the figure 37.

Figure 37: The general use case executed by all users

6.2.1.3 System design

To permit users to have access to all knowledge, the data must be stored using a data structure
supporting inferences. As many researchers have proved that ontologies are the best choice for
knowledge modelling [84, 121], we have chosen to use an ontology.

The architecture of figure 38 shows how the ontology can be integrated in the existing system.
This architecture is composed of two main modules: the EPICAM module [68], which permits
stakeholders to obtain tuberculosis data and the OEPICAM module which helps users access in-
formation. Note that the EPICAM module is in use. The OEPICAM module is composed of an


Figure 38: The general architecture presenting the integration of an ontology in the EPICAM
platform.

ontology populated with the data extracted from the EPICAM database, an inference system which
will be used to infer new knowledge and a user interface which will be used by the users to access
information.

6.2.1.4 Product Backlog definition

The product backlog comprises the list of tasks to be executed in order to develop the ontology.
They are:

• Identification and evaluation of existing ontologies. This task consists of finding existing
ontologies that can be used in the system;

• Identification of domain resources. During the identification of domain resources, existing
knowledge sources will be identified;

• Knowledge acquisition from ontological knowledge sources. This task consists of using ex-
isting methodologies to acquire knowledge from existing ontologies and domain sources;

• Knowledge representation. After the knowledge is obtained, it is serialized in a machine-
readable form;

• Ontology population. The ontology obtained after its serialization is populated with in-
stances.

The identification of ontological resources, knowledge acquisition and knowledge representation
are based on the NeOn methodology and are done iteratively (in many Sprints) and incrementally
(until the ontology fulfills the needs). After the Pre-development step and each Sprint, the Scrum
Master organized Scrum Meetings with the Scrum Team, composed of the knowledge engineer
and the epidemiologist. During these meetings, the ontology is evaluated and the Sprint Backlog


containing what to do in the next Sprint is defined. Each evaluation permitted us to check the
consistency of the ontology using the Pellet reasoner, and to determine to what extent the developed
ontology fulfills the requirements.

6.2.2 Development

The O4TBSS was developed in five Sprints.

6.2.2.1 First Sprint: Searching for existing ontologies that fulfilled the need

According to the NTCP, during epidemiological surveillance of TB, the following information are
recorded:

• Patients and their follow-up: Captures information about patients and the follow-up of
their treatment;
• Symptoms of the disease: Contains information that can be used to suspect, rule out or confirm
that a patient is suffering from tuberculosis;
• Laboratory testing: Models the laboratory examinations that confirm or rule out that a patient
is suffering from tuberculosis;
• Epidemiology: Contains a set of indicators used to provide information for a better moni-
toring of the disease;
• Drugs: Contains information on the medication used for TB treatment;
• Sensitization: Captures information on patients and population sensitization;
• Users: Captures information on all the persons involved in the surveillance;
• Training and training materials: Models the management of the training of health workers
and their training materials.

The ontology modelling epidemiological surveillance used by the NTCP must contain all
this information. We have conducted a review of existing ontologies using Bioportal [136] and
Google’s Search Engine. Keywords such as "tuberculosis", "tuberculosis surveillance", "ontology
for tuberculosis surveillance" and "tuberculosis ontology" were used to carry out searches. In sum-
mary, we proceeded as follows:

• Firstly, we searched for ontologies modelling the TB surveillance on the Bioportal repository
using the keywords "tuberculosis" and "tuberculosis surveillance." A total of 38 ontologies
were found using the keyword "tuberculosis" and 48 ontologies were found using the key-
word "tuberculosis surveillance." These ontologies were examined and all excluded because
they did not focus on epidemiological surveillance of tuberculosis.


• Secondly, we used the keywords "ontology for tuberculosis surveillance" and "tuberculosis
ontology" to search for existing ontologies using Google’s Search Engine. Scientific papers
obtained were analyzed. A total of 12 scientific papers were initially identified, 9 of these
papers were excluded because they did not focus on tuberculosis ontology and four papers
were retained. The first one entitled "A Tuberculosis Ontology for Host Systems Biology"
[79] focuses on clinical terminology. The ontology presented has been made available in
a csv format; "RepTB: a gene ontology based drug repurposing approach for tuberculosis"
[100] focuses on drug repurposing; "An ontology for factors affecting tuberculosis treatment
adherence behavior in sub-Saharan Africa" [94] focuses on the factors that influence TB
treatment behaviour in sub-Saharan Africa; and "An Ontology based Decision support for
Tuberculosis Management and Control in India" [2] which presents the use of an ontology
for TB management in India. Although these papers are about ontologies of TB, only one
ontology is available for download in a csv format and this ontology covers just the clinical
aspects of epidemiological surveillance.

At the end of the first Sprint, we have noted that no existing ontology covers the domain that we
want to represent. This justifies the development of a new ontology.

6.2.2.2 Second Sprint: knowledge extraction from EPICAM source code

To develop the new ontology, the first domain resource we used is the EPICAM source code. In
fact, source code is any fully executable description of a software designed for a specific domain
such as medical, industrial, military, communication, aerospace, commercial, scientific, etc. In the
software design process, a set of knowledge related to the domain is captured and integrated in the
source code [13, 14, 15].

In a previous work, we extracted knowledge from the source code of the EPICAM platform
and used this knowledge to construct an ontology (named ontoEPICAM), modelling epidemiolog-
ical surveillance of tuberculosis in Cameroon [15]. This ontology is composed of 329 terms with
97 classes, 117 DataProperties and 115 ObjectProperties. Given that this ontology models the epi-
demiological surveillance system of tuberculosis in Cameroon, it still has to be evaluated to see if it
is complete. That is why we evaluated this ontology against two criteria: (1) the completeness of the
modelled domains, which measures if all the domains covered by epidemiological surveillance are
well covered by the ontology; (2) the completeness of the ontology for each domain involved in the
epidemiological surveillance, which measures if each domain of interest is appropriately covered
in this ontology.

The keywords were identified from ontoEPICAM terms and used to carry out searches of
existing ontologies on Bioportal repository and Google’s Search Engine. The ontologies found
were examined using the browsing tool integrated in Bioportal. The figure 39 is an example of
browsing the "Human Disease Ontology". We found 275 ontologies. For each term, we noted the
list of ontologies obtained. For the ontologies found in the BioPortal repository, the BioPortal
ontology visualization tool was used to visualize the terms that are presented in the ontology. If an
ontology contains the relevant terms, it is selected. In many cases, two ontologies have the same
terms when searching using certain keywords e.g., "patient", "doctor", "nurse", "tuberculosis", etc.


Figure 39: Example of browsing Human Disease Ontology (DOID) using Bioportal visualization
tool

Then, the most complete ones were selected. The ontologies not present in the BioPortal repository were
examined using Protege. The ontology provided as a CSV file was examined using LibreOffice Calc. Table
6.1 presents the ontologies selected for our purpose. In this table, the keyword column presents
the keywords that were used to find the ontology, the ontology column presents the ontology
selected given the keyword, the covered domain column presents the domain of epidemiological
surveillance covered by the ontology, and the description column presents a brief description of the
ontology.

In table 6.2, which presents the comparison of the selected ontologies with the EPICAM ontology,
patient information, characteristics and follow-up is represented by "Patients"; symptoms of
the disease are represented by "Symp"; and training and training materials are represented by "Training".
Comparing the ontologies selected using ontoEPICAM terms (see table 6.2), we found that only
ontoEPICAM takes into account all the aspects of epidemiological surveillance. However, by con-
sidering the completeness of each domain covered by epidemiological surveillance, we remarked
that the ontologies selected are more complete. For example, information about patient follow-up
is more complete in Mental Health Management Ontology (MHMO) than in ontoEPICAM. In the


Keywords | Ontology | Covered domains | Ontology description

Epidemiological surveillance | Epidemiology Ontology (EPO) | Epidemiology | This is an ontology describing the epidemiological, demographics and infection transmission process [102].
Tuberculosis symptoms | Symptom Ontology (Symp) | Tuberculosis sign and symptoms | Symp aims to understand the relationship between signs and symptoms and capture the terms relative to the signs and symptoms of a disease1.
Tuberculosis | Human Disease Ontology (DOID) | Patients and their follow-up, Epidemiology | Human Disease Ontology is an ontology that represents a comprehensive hierarchically controlled vocabulary for human disease representation [117].
Tuberculosis ontology | A Tuberculosis Ontology for Host Systems Biology | Patients, symptoms, laboratory testing | Tuberculosis Ontology for Host Systems Biology focuses on clinical terminology of tuberculosis diagnosis and treatment. It is available in a csv format [79].
Patient | Adherence and Integrated Care2 | Patient and their follow-up | This ontology is an ontology that defines the medication adherence of patients.
Patient | Presence Ontology (PREO)3 | Patient and their follow-up | This ontology defines relationships that model the encounters taking place every day among providers, patients, and family members or friends in environments such as hospitals and clinics.
Patient | Mental Health Management Ontology (MHMO) | Patient and their follow-up | The Mental Health Management Ontology is an ontology for mental healthcare management [141].

Table 6.1: The list of ontologies selected for our purpose.

next Sprint, we will show how knowledge has been extracted from the ontologies presented in table
6.1 and combined with ontoEPICAM to build the Ontology for Tuberculosis Surveillance.

Ontologies                                          Patients  Symp  Lab testing  Epidemiology  Drugs  Users  Sensitization  Training
ontoEPICAM                                          Yes       Yes   Yes          Yes           Yes    Yes    Yes            Yes
Epidemiology Ontology                               No        No    No           Yes           No     No     No             No
Symptom Ontology                                    No        Yes   No           No            No     No     No             No
Adherence and Integrated Care                       Yes       No    No           No            No     Yes    No             No
Presence Ontology                                   Yes       No    No           No            No     Yes    No             No
Human Disease Ontology                              Yes       No    No           No            No     No     No             No
Mental Health Management Ontology                   Yes       Yes   No           No            No     Yes    No             No
A Tuberculosis Ontology For Host Systems Biology    Yes       Yes   Yes          No            Yes    No     No             No

Table 6.2: Comparison of selected ontologies with the EPICAM ontology.

6.2.2.3 Third Sprint: Ontology construction

The ontologies selected in the second Sprint were used to construct O4TBSS. To do so, ontological
knowledge was extracted using either Ontofox [139] or Protege. With Ontofox, we
specified the source ontology, the classes, the keyword "includeAllChildren" to extract the terms of
the ontology hierarchy branch, and the keyword "includeAllAxioms" to extract all annotations. For
the ontologies not available in Ontofox, such as the "Adherence and Integrated Care" ontology, the
Protege software was used for their examination, the identification of irrelevant terms and the deletion
of the latter. The knowledge obtained was imported into Protege and examined term by term with the
help of an epidemiologist to evaluate each term and identify redundancies. Redundant terms identified
were removed. Additional terms were extracted from ontoEPICAM to enrich the ontology
obtained. The Pellet reasoner in Protege permitted us to verify the consistency of the ontology
obtained. Table 6.3 and figure 40 respectively present the metrics and a part of the ontology
obtained.


# Ontologies Classes DataProperties ObjectProperties Total


O4TBSS 865 123 13 1001
1 Epidemiology Ontology 95 0 0 95
2 Symptom Ontology 12 0 0 12
3 Adherence and Integrated Care 246 12 2 260
4 Presence Ontology 205 25 5 235
5 Human Disease Ontology 22 14 0 36
6 Mental Health Management Ontology 143 64 0 217
7 A Tuberculosis Ontology For Host Systems Biology 125 0 0 125
8 ontoEPICAM 17 8 6 31

Table 6.3: O4TBSS terms and terms imported from 7 other ontology sources, enriched with
EPICAM terms

Figure 40: A screenshot of O4TBSS obtained after the third Sprint

6.2.2.4 Fourth Sprint: Ontology enrichment

After building the ontology, we decided to populate it with data gathered from the EPICAM database.
But we remarked that some data contained in the database can be considered as concepts/properties.
For example, the occupation of patients was already represented in the ontology as a class with a
list of occupations as subclasses. Some occupations (specific to Cameroon, like taxi drivers) were
not represented in the ontology. Then, with an SQL query, we extracted this knowledge, composed
of 70 classes, and enriched our ontology. The current version of the ontology is composed of 1068
terms, with 935 classes, 123 ObjectProperties and 13 DataProperties. The complete ontology and
the source code written for its population are available on GitHub4.

6.2.2.5 Fifth Sprint: Ontology population

The purpose of ontology population is to complete the built ontology with instances. It consists of
extracting and systematically listing all the instances contained in the database that reflect a
concept or a relationship of the domain to be modelled. To this end, we developed an inte-
4 https://github.com/jiofidelus/ontologies/tree/master/O4TBSS


grator (see the architecture presented by the figure 38) which permits us to import and manage all
the data from the database to the ontology in Java. The source code of this integrator is available on
github5 . A flat view of the database was created by making a simple SQL query. This query permits
us to gain access to information and the information obtained was populated in the ontology. To
keep the relation between the tables in the database, the tuples identification in the database were
used as the identification of these instances in the ontology. For example, the TB case with ID "TB-
CASE_14f7ee" is linked to its appointment with ID "RDV_14f5e7a" in the database. Then, in the
ontology, their identifications will also be "TBCASE_14f7ee" and "RDV_14f5e7a". The complete
ontology and the source code write for its population is available on github6 .
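The core of such an integrator is a loop that turns database tuples into ontology individuals, reusing the tuple identifiers as individual names and re-creating the links between tables as object property assertions. The sketch below illustrates this idea with JDBC and the OWL API; the class names (TBCase, Appointment), the object property hasAppointment, the flat view and its columns, and the JDBC URL are hypothetical placeholders rather than the actual O4TBSS vocabulary or EPICAM schema.

import java.io.File;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.*;

public class PopulateO4TBSS {
    public static void main(String[] args) throws Exception {
        OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
        OWLOntology onto = manager.loadOntologyFromOntologyDocument(new File("O4TBSS.owl"));
        OWLDataFactory df = manager.getOWLDataFactory();
        String ns = "http://example.org/O4TBSS#";                                  // hypothetical namespace

        OWLClass tbCase = df.getOWLClass(IRI.create(ns + "TBCase"));               // hypothetical
        OWLClass appointment = df.getOWLClass(IRI.create(ns + "Appointment"));     // hypothetical
        OWLObjectProperty hasAppointment =
                df.getOWLObjectProperty(IRI.create(ns + "hasAppointment"));        // hypothetical

        try (Connection con = DriverManager.getConnection("jdbc:postgresql://localhost/epicam", "user", "password");
             Statement st = con.createStatement();
             // Hypothetical flat view joining TB cases with their appointments
             ResultSet rs = st.executeQuery("SELECT tbcase_id, rdv_id FROM flat_view")) {
            while (rs.next()) {
                // Reuse the tuple identifiers as individual names, e.g. TBCASE_14f7ee, RDV_14f5e7a
                OWLNamedIndividual caseInd = df.getOWLNamedIndividual(IRI.create(ns + rs.getString("tbcase_id")));
                OWLNamedIndividual rdvInd = df.getOWLNamedIndividual(IRI.create(ns + rs.getString("rdv_id")));
                manager.addAxiom(onto, df.getOWLClassAssertionAxiom(tbCase, caseInd));
                manager.addAxiom(onto, df.getOWLClassAssertionAxiom(appointment, rdvInd));
                manager.addAxiom(onto, df.getOWLObjectPropertyAssertionAxiom(hasAppointment, caseInd, rdvInd));
            }
        }
        manager.saveOntology(onto, IRI.create(new File("O4TBSS-populated.owl").toURI()));
    }
}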

6.3 Use cases

In section 6.2, we presented O4TBSS, an ontology that we built for the epidemiological surveillance of tuberculosis. This ontology was developed in OWL and populated with epidemiological surveillance data on TB in Cameroon. Given that O4TBSS supports OWL ontological reasoning, this section presents two use cases in which the reasoning mechanism permits us to derive new knowledge from existing knowledge. To carry out these use cases, we populated the ontology with 100 patients, 88 of whom are tuberculosis patients. We then used DL queries, written in the DL Query tab of Protege, together with the Pellet reasoner engine to query the ontology.

6.3.1 Use case 1: inferring patient instances

The first use case of O4TBSS concerns inference about the patients who come to the hospital with health problems. In fact, the epidemiologist with whom we work believes that an extensive set of patient data would reveal subtle patterns, provided these patterns can be identified. The epidemiologist explained that all the patients who come for consultation are important for his/her work, because he/she wants to know the characteristics of the patients suffering from TB and of those who do not, in order to see the differences between them. However, the EPICAM platform does not provide information on the patients who come for consultation, because the main goal of the platform was to follow up TB patients. Given that in the ontology "TB Patient" is a subclass of "Patient", the first use case consists of showing the results obtained with and without an inference system. Figure 41 presents, on the left, the list of all patients when the inference system is not used and, on the right, the list of all patients when inference is used. The reasoner uses the set of assertions and the knowledge accumulated in the ontology to answer semantic queries, in particular to return the inferred patients.
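The same result can be reproduced programmatically: asking a reasoner for the direct and indirect instances of "Patient" also returns the individuals asserted as TB patients. The Java sketch below illustrates this with the OWL API and the Openllet distribution of Pellet; the file name, namespace and class IRI are hypothetical placeholders, not the actual O4TBSS identifiers.

import java.io.File;

import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.*;
import org.semanticweb.owlapi.reasoner.OWLReasoner;

import openllet.owlapi.OpenlletReasonerFactory;

public class InferPatients {
    public static void main(String[] args) throws Exception {
        OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
        OWLOntology onto = manager.loadOntologyFromOntologyDocument(new File("O4TBSS-populated.owl"));
        OWLDataFactory df = manager.getOWLDataFactory();
        String ns = "http://example.org/O4TBSS#";                        // hypothetical namespace

        OWLClass patient = df.getOWLClass(IRI.create(ns + "Patient"));   // hypothetical IRI
        OWLReasoner reasoner = OpenlletReasonerFactory.getInstance().createReasoner(onto);

        // direct = false: individuals asserted under the subclass "TB Patient" are also
        // returned as instances of "Patient" (what the DL query "Patient" shows in Protege)
        for (OWLNamedIndividual ind : reasoner.getInstances(patient, false).getFlattened()) {
            System.out.println(ind.getIRI());
        }
        reasoner.dispose();
    }
}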

This use case shows that using O4TBSS permits epidemiologists to identify, from a set of patient data, the TB patients and the non-TB patients. The characteristics of these patients can then be used to identify risk factors.


Figure 41: Querying the list of patients without the inference system (left) and with the inference system (right)

6.3.2 Use case 2: automatic detection of TB-MDR-susceptible patients by reasoning on the ontology

TB-MDR is generally caused by inadequate treatment of tuberculosis, which can give rise to an epidemic of TB that is difficult to cure. In fact, poor adherence or non-adherence of patients to TB treatment is a major cause of treatment failure in Africa. Poor adherence is the failure of patients to take medication or to follow a diet and lifestyle in accordance with the clinician's prescription. Patients with poor adherence to TB treatment over a period of time have a high risk of becoming resistant to the prescribed drugs [94]. According to the NTCP, the patients who do not come to their appointments to get their medication are those who will later develop drug resistance and come back with TB-MDR. A health worker revealed that some patients often follow the first four months of treatment and then, feeling better, do not come back during the last two months, only to return later with TB-MDR. According to the epidemiologist, these patients and their characteristics must be identified in time and action must be taken.

The current version of the EPICAM platform does not handle TB-MDR patients. However, the information on the patients' follow-up appointments is stored in the database. To access this information, an SQL query must be made. Moreover, given the structure of the database, accessing the other information linked to the patients requires a join over 6 tables and source code written to filter patient information (e.g., patient health center, district and/or region). The current use case shows how a simple DL query with the inference system makes it possible to obtain all the patients at risk of becoming TB-MDR (Figure 42). A simple click gives access to the patients' characteristics.
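The DL query behind Figure 42 is not reproduced in this chapter; as an illustration of the pattern, the sketch below builds an equivalent class expression with the OWL API, of the form "Patient and (hasMissedAppointment some Appointment)", and asks the reasoner for its instances. The property hasMissedAppointment and the other IRIs are hypothetical names used only for this example; the real O4TBSS vocabulary may differ.

import java.io.File;

import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.*;
import org.semanticweb.owlapi.reasoner.OWLReasoner;

import openllet.owlapi.OpenlletReasonerFactory;

public class AtRiskOfMDR {
    public static void main(String[] args) throws Exception {
        OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
        OWLOntology onto = manager.loadOntologyFromOntologyDocument(new File("O4TBSS-populated.owl"));
        OWLDataFactory df = manager.getOWLDataFactory();
        String ns = "http://example.org/O4TBSS#";                                  // hypothetical namespace

        OWLClass patient = df.getOWLClass(IRI.create(ns + "Patient"));             // hypothetical
        OWLClass appointment = df.getOWLClass(IRI.create(ns + "Appointment"));     // hypothetical
        OWLObjectProperty hasMissed =
                df.getOWLObjectProperty(IRI.create(ns + "hasMissedAppointment"));  // hypothetical

        // Class expression equivalent to the DL query
        // "Patient and (hasMissedAppointment some Appointment)"
        OWLClassExpression atRisk = df.getOWLObjectIntersectionOf(
                patient, df.getOWLObjectSomeValuesFrom(hasMissed, appointment));

        OWLReasoner reasoner = OpenlletReasonerFactory.getInstance().createReasoner(onto);
        for (OWLNamedIndividual ind : reasoner.getInstances(atRisk, false).getFlattened()) {
            System.out.println(ind.getIRI()); // patients inferred to be at risk of TB-MDR
        }
        reasoner.dispose();
    }
}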

Figure 42: Inferring the patients at risk of TB-MDR

This second use case shows that the ontology can be used to classify patients according to their behavior. It can also be used to detect, by inference, other types of patients, for example patients with positive microscopy and patients with negative microscopy, and to study the differences between them, or to identify risk factors according to many parameters such as time, location, etc.

6.3.3 Another useful feature of O4TBSS

One of the major benefits of using an ontology is the possibility of using a reasoner to automatically compute the class hierarchy. The ontology developed in this thesis also facilitates checking for class subsumption: the reasoner is used to automatically compute a classification hierarchy from the class definitions. Figure 43 shows some class hierarchies obtained by reasoning over the asserted classes.
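For completeness, the same classification can be obtained outside Protege by asking the reasoner to precompute the class hierarchy and then querying for inferred subclasses. The sketch below is an illustration under the same assumptions as the previous listings (OWL API, Openllet/Pellet, hypothetical file name and IRIs).

import java.io.File;

import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.*;
import org.semanticweb.owlapi.reasoner.InferenceType;
import org.semanticweb.owlapi.reasoner.OWLReasoner;

import openllet.owlapi.OpenlletReasonerFactory;

public class InferredHierarchy {
    public static void main(String[] args) throws Exception {
        OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
        OWLOntology onto = manager.loadOntologyFromOntologyDocument(new File("O4TBSS.owl"));
        OWLDataFactory df = manager.getOWLDataFactory();
        String ns = "http://example.org/O4TBSS#";              // hypothetical namespace

        OWLReasoner reasoner = OpenlletReasonerFactory.getInstance().createReasoner(onto);
        reasoner.precomputeInferences(InferenceType.CLASS_HIERARCHY);

        // Inferred direct subclasses of a class, e.g. Patient (hypothetical IRI), as shown
        // in Protege's inferred class hierarchy view
        OWLClass patient = df.getOWLClass(IRI.create(ns + "Patient"));
        for (OWLClass sub : reasoner.getSubClasses(patient, true).getFlattened()) {
            System.out.println(sub.getIRI());
        }
        reasoner.dispose();
    }
}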

6.4 Conclusion

In this chapter, we reported the development of an ontology for tuberculosis surveillance. This ontology can be used for the annotation of clinical and epidemiological data on tuberculosis. Our motivation was to provide a model of epidemiological data which permits the stakeholders involved in the epidemiological surveillance of TB to have access to all the information they need, the goal being to infer new knowledge from asserted knowledge using the reasoning mechanism. During the development of O4TBSS, we found many biomedical ontologies, which we classified into three main groups. The first group was made of large ontologies, such as the "Human Disease ontology", each modeling one aspect of O4TBSS. These ontologies were not completely reused because they were too large and the parts relevant to the epidemiological surveillance of TB were too small. The second group
