An Application of Apriori Algorithm
on a Diabetic Database
Nevcihan Duru
Department of Computer Eng., University of Kocaeli, 41440, Izmit, Kocaeli, Turkey
[email protected]
Abstract. In recent years, mining information from large databases has attracted the
attention of many researchers, and many data mining techniques and systems
have been developed. In this study, a software tool (DMAP) that uses the Apriori
algorithm was developed. Apriori is an influential algorithm used in data
mining. Its name reflects the fact that the algorithm uses
prior knowledge of frequent item set properties. The software is used for
discovering the social status of diabetics. A diabetic database belonging to the
Faculty of Medicine of Kocaeli University was used. For test purposes, the software was
executed on a database containing the records of 66 patients. In the
literature, diabetic databases have often been analyzed with rough sets. In this
paper, the Apriori algorithm, which is usually used for market basket
analysis, was used to analyze a diabetic database.
1 Introduction
The explosive growth in databases has generated an urgent need for new techniques
and tools that can intelligently and automatically transform the processed data into
useful information and knowledge [1]. In fact, as data volumes grow dramatically,
manual data analysis is becoming completely impractical in many domains.
Databases are increasing in size in two ways: (1) the number N of records or objects
in the database and (2) the number d of fields or attributes to an object [2]. Therefore,
data mining has become a research area of increasing importance [3,4]. Although
data mining and knowledge discovery in databases are often treated as synonyms, data
mining is actually one part of the knowledge discovery process. There have been many
advances in data mining research and development, and many data mining
techniques and systems have recently been developed. Different classification
schemes can be used to categorize data mining methods and systems based on the
kinds of databases to be studied, the kinds of knowledge to be discovered, and the
kinds of techniques to be utilized [5]. In this study, the Apriori algorithm is used to
generate association rules from a diabetic database.
Diabetic databases have been studied for a variety of purposes.
Michel and Beguin used a database to query for diabetes mellitus [6]. Kopelman
and Sanderson used a database to provide continuous quality improvement in
diabetes care [7]. Breault applied rough sets to the Pima Indian diabetic database, which has
become a standard for testing how accurately data mining algorithms predict
diabetic status [8]. According to Knowler, Bennett et al., the Pima Indians may
be genetically predisposed to diabetes [9]. For this reason, there have been many
studies applying data mining techniques to the Pima Indian database.
R. Khosla et al. (Eds.): KES 2005, LNAI 3681, pp. 398–404, 2005.
© Springer-Verlag Berlin Heidelberg 2005
Diabetes is well suited to data mining for a number of reasons.
First, a tremendous amount of data is available. Second, diabetes is a disease that can cause
many complications, such as blindness, kidney failure, amputation and premature
cardiovascular death [8]. Third, a person’s predisposition to diabetes can be assessed
by examining this type of database.
1.1 A Brief Look at Data Mining and the Models
Data mining is a step in the knowledge discovery process. However, in industry, in
the media and in the database research area, the term data mining has become more
popular than the term knowledge discovery in databases. It refers to the nontrivial
extraction of implicit, previously unknown and potentially useful information
(such as knowledge rules, constraints, regularities) from data in databases [10]. Data
mining should be applicable to relational databases, data warehouses, the World Wide
Web and advanced database systems such as object-oriented and object-relational
databases.
There are many methods for mining different kinds of knowledge, including
association rules, characterization, classification, clustering, etc. In general, the
models that are used in data mining can be classified into two categories: predictive and
descriptive [1]. Descriptive mining tasks characterize the general properties of the
data in the database. Predictive mining tasks perform inference on the current data in
order to make predictions. Classification and regression are predictive; association
rules and clustering are descriptive models [11].
Association analysis is the general process of determining which things go together.
That is, it is the process of discovering association rules showing attribute-value
conditions that occur frequently together in a given set of data. Association
rules are unlike traditional classification rules in that an attribute appearing as a
precondition in one rule may appear in the consequent of a second rule. In addition,
traditional classification rules usually limit the consequent of a rule to a single
attribute, whereas association rules allow the consequent to contain one or several
attribute values [12]. Association analysis is widely used for market basket analysis.
In this study it is applied to a diabetic database by means of the developed software.
A mathematical model was proposed in [13] to address the problem of mining
association rules. Let I = {i_1, i_2, …, i_m} be a set of literals, called items. Let D be a set of
transactions, where each transaction T is a set of items such that T ⊆ I. Note that the
quantities of items bought in a transaction are not considered, meaning that each item
is a binary variable representing whether the item was bought. Each transaction is associated
with an identifier, called its TID.
Let X be a set of items. A transaction T is said to contain X if and only if X ⊆ T. An
association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I and X ∩ Y = ∅.
The rule X ⇒ Y holds in the transaction set D with confidence c if c% of the
transactions in D that contain X also contain Y. The rule X ⇒ Y has support s in the
transaction set D if s% of the transactions in D contain X ∪ Y.
Confidence denotes the strength of the implication, and support indicates the frequency
of the patterns occurring in the rule. It is often desirable to pay attention only to
those rules that have reasonably large support. Rules with high confidence
and strong support are referred to as strong rules. The task of mining association
rules is essentially to discover strong association rules in large databases [5].
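As an illustration of these definitions, the following Python sketch computes the support and confidence of a rule X ⇒ Y over a small transaction set. The toy transactions and the function names `support` and `confidence` are assumptions made for this example; they are not taken from DMAP or from [13].

```python
from typing import FrozenSet, List

Transaction = FrozenSet[str]


def support(itemset: FrozenSet[str], transactions: List[Transaction]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(x: FrozenSet[str], y: FrozenSet[str],
               transactions: List[Transaction]) -> float:
    """Fraction of the transactions containing X that also contain Y."""
    return support(x | y, transactions) / support(x, transactions)


# Hypothetical toy transactions (illustrative only, not the study's data).
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

x = frozenset({"bread"})
y = frozenset({"milk"})
print(support(x | y, transactions))    # 0.5: two of the four transactions hold both
print(confidence(x, y, transactions))  # 2/3: two of the three bread transactions
```

Multiplying either value by 100 gives the percentages s and c used in the definitions above.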
When several attributes are present, generating association rules becomes
difficult because of the large number of possible conditions for the
consequent of each rule. Special algorithms have therefore been developed to generate
the rules efficiently. One such algorithm is the Apriori algorithm [14], which
is one of the best-known algorithms for generating association rules.
This algorithm generates item sets, which are attribute-value combinations.
Combinations that do not meet the coverage requirement are discarded.
Because of this, the rule generation process can be completed in a reasonable
amount of time.
Apriori association rule generation is a two-step process. The first step is item set
generation, and the second step is to generate a set of association rules from the
generated item sets.
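The two steps can be sketched in Python as follows. This is a minimal illustration of the general Apriori scheme, not the DMAP implementation (which was written in Delphi and is not listed in the paper). The function names `apriori` and `rules`, the toy records and the thresholds are all assumptions, and the coverage requirement is taken here to mean "count at least `min_count`".

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Step 1: return {item set: count} for every frequent item set."""
    # First pass: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)
    k = 2
    while frequent:
        # Join frequent (k-1)-item sets into k-item candidates.
        candidates = {a | b for a, b in combinations(list(frequent), 2)
                      if len(a | b) == k}
        # Prior knowledge: drop candidates with an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))}
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        result.update(frequent)
        k += 1
    return result


def rules(frequent, min_conf):
    """Step 2: return (antecedent, consequent, confidence) for strong rules."""
    out = []
    for itemset, count in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(itemset, r):
                lhs = frozenset(lhs)
                # Every subset of a frequent item set is itself frequent,
                # so its count is always available in `frequent`.
                conf = count / frequent[lhs]
                if conf >= min_conf:
                    out.append((lhs, itemset - lhs, conf))
    return out


# Hypothetical records encoded as attribute=value items (not the study's data).
records = [
    frozenset({"Sex=2", "Career=2", "MaritalStatus=1"}),
    frozenset({"Sex=2", "Career=2", "MaritalStatus=1"}),
    frozenset({"Sex=1", "Career=1", "MaritalStatus=3"}),
    frozenset({"Sex=2", "Career=3", "MaritalStatus=1"}),
]
frequent = apriori(records, min_count=2)
strong = rules(frequent, min_conf=1.0)
print(len(frequent))  # 7 frequent item sets: three singles, three pairs, one triple
```

Discarding infrequent combinations at each level is what keeps the number of candidates, and therefore the running time, manageable.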
2 Implementation of the Algorithm
In this study, a software tool that uses the Apriori algorithm was developed. Apriori is
an influential algorithm used in data mining. The software is named
DMAP (Data Mining with Apriori). Here, DMAP is used to determine the social status of
diabetic persons. A diabetic database previously created by
the Faculty of Medicine, Kocaeli University, was used. The database was originally
created in SPSS. It contains n=66 patients, each described by 8 variables: 1)
Family type, 2) Age, 3) Career, 4) Date of diagnosis, 5) Marital status, 6) Education,
7) Method of care, 8) Sex. The available values for these variables (attributes)
are given in Table 1.
Table 1. The available values for the attributes
DV  FmlyType     Age    Career       Beg.Date (years)  MaritalStatus  Education   MethodofCare  Sex
1   BasicFamily  30-39  Housewife    Less than 1       Married        Literate    Insulin       Male
2   LargeFamily  40-49  Retired      1-5               Unmarried      Primary     Oral          Female
3   Alone        50-59  Independent  6-10              Widow          Secondary   Diet
4                60-69  Official     11+                              University
5                70+    Worker
6                       Unemployed
The database file, which had been built in SPSS, was converted to an MS Access file.
During this conversion, only integer values between 1 and 6 were used. These
six values are shown in the DV field of Table 1. For example, the number 2 was used if
the patient was “female”, and the number 6 was used if the patient was “unemployed”.
After this value conversion, the database took the form shown in Figure 1.
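The value conversion can be pictured with a short Python sketch. The code tables below are transcribed from Table 1 for three of the attributes. The names `CODES` and `encode` and the sample patient are hypothetical, since the paper does not describe how the conversion was actually carried out.

```python
# Label -> DV code, transcribed from Table 1 (three attributes only).
CODES = {
    "Sex": {"Male": 1, "Female": 2},
    "Career": {
        "Housewife": 1, "Retired": 2, "Independent": 3,
        "Official": 4, "Worker": 5, "Unemployed": 6,
    },
    "MethodofCare": {"Insulin": 1, "Oral": 2, "Diet": 3},
}


def encode(record):
    """Replace each textual attribute value by its integer DV code."""
    return {attr: CODES[attr][label] for attr, label in record.items()}


# Hypothetical patient record, for illustration only.
patient = {"Sex": "Female", "Career": "Unemployed", "MethodofCare": "Oral"}
print(encode(patient))  # {'Sex': 2, 'Career': 6, 'MethodofCare': 2}
```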
DMAP was developed using Borland Delphi 7.0. The first item set table created
contains the single-item sets. This set is shown in Table 2. These values were determined
with the DMAP interface shown in Fig. 2.
In the first iteration, Apriori simply scans all the transactions to count the number
of occurrences of each item. The candidate 1-itemsets obtained are shown in Table 2.
FmlyType Age Career MaritalStatus Education MethodofCare BegDate Sex
1 5 1 3 1 1 1 1
1 5 2 1 2 1 3 2
1 5 1 3 6 2 3 1
1 4 2 1 2 2 2 2
1 5 2 1 2 2 1 2
1 4 3 1 4 1 3 2
3 5 3 3 2 2 3 2
1 5 2 1 2 1 3 2
1 4 1 3 2 1 1 1
1 5 2 1 2 2 1 2
Fig. 1. Diabetic database for 10 samples
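The first-iteration scan can be sketched in Python on the ten sample records of Fig. 1. The rows are transcribed from the figure. The names `COLUMNS`, `ROWS` and `counts` are illustrative assumptions, and the resulting counts cover only these ten records, so they differ from the 66-record counts reported in Table 2.

```python
from collections import Counter

COLUMNS = ["FmlyType", "Age", "Career", "MaritalStatus",
           "Education", "MethodofCare", "BegDate", "Sex"]

# The ten sample records of Fig. 1, one tuple per patient.
ROWS = [
    (1, 5, 1, 3, 1, 1, 1, 1),
    (1, 5, 2, 1, 2, 1, 3, 2),
    (1, 5, 1, 3, 6, 2, 3, 1),
    (1, 4, 2, 1, 2, 2, 2, 2),
    (1, 5, 2, 1, 2, 2, 1, 2),
    (1, 4, 3, 1, 4, 1, 3, 2),
    (3, 5, 3, 3, 2, 2, 3, 2),
    (1, 5, 2, 1, 2, 1, 3, 2),
    (1, 4, 1, 3, 2, 1, 1, 1),
    (1, 5, 2, 1, 2, 2, 1, 2),
]

# One scan over the records: count every attribute=value item.
counts = Counter(
    f"{column}={value}"
    for row in ROWS
    for column, value in zip(COLUMNS, row)
)

print(counts["FmlyType=1"])  # 9
print(counts["Age=5"])       # 7
print(counts["Sex=2"])       # 7
```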
Table 2. Single-item set
Single-Item sets Number of Items Single-Item sets Number of Items
Family Type=1 51 Education=1 10
Family Type=2 13 Education=2 36
Family Type=3 2 Education=3 6
Age=1 3 Education=4 4
Age=2 21 Education=5 3
Age=3 14 Education=6 7
Age=4 17 MethodofCare=1 22
Age=5 11 MethodofCare=2 42
Career=1 36 MethodofCare=3 2
Career=2 21 Sex=1 40
Career=3 9 Sex=2 26
MaritalStatus=1 54
MaritalStatus=3 12
BeGdate=1 28
BeGdate=2 16
BeGdate=3 22
Fig. 2. DMAP interface (frequency values of the attributes)
Assuming that the minimum support required is 6, some of the items shown in Table 2
are discarded and a new set is produced. This new set is shown in Table 3. As noted,
Table 3. The set of frequent 1-itemset
Single-Item sets Support Single-Item sets Support
Family Type=1 51 BeGdate=1 28
Family Type=2 13 BeGdate=2 16
Age=2 21 BeGdate=3 22
Age=3 14 Education=1 10
Age=4 17 Education=2 36
Age=5 11 Education=6 7
Career=1 36 MethodofCare=1 22
Career=2 21 MethodofCare=2 42
Career=3 9 Sex=1 40
MaritalStatus=1 54 Sex=2 26
MaritalStatus=3 12
new n-itemsets can be generated with DMAP. The sets are generated by executing the
interface shown in Fig. 3 iteratively. As seen at the top of the
2-item set list in Fig. 3, the first candidate is
Family Type=1 (Basic Family) & Career=1 (Housewife). The rule confidence is
obtained as 39.3%. The next step is to use the attribute-value combinations from the
2-item set table to generate 3-item sets.
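The pruning that leads from Table 2 to Table 3, and the pairing of the surviving items into candidate 2-item sets, can be sketched in Python as follows. The counts are transcribed from Table 2. Two points are assumptions of this sketch rather than statements from the paper: the threshold is applied strictly (count greater than 6), because Table 3 omits Education=3, whose count is exactly 6; and pairs of values of the same attribute are skipped up front, since one record cannot hold both.

```python
from itertools import combinations

# Single-item counts transcribed from Table 2.
TABLE2 = {
    "Family Type=1": 51, "Family Type=2": 13, "Family Type=3": 2,
    "Age=1": 3, "Age=2": 21, "Age=3": 14, "Age=4": 17, "Age=5": 11,
    "Career=1": 36, "Career=2": 21, "Career=3": 9,
    "MaritalStatus=1": 54, "MaritalStatus=3": 12,
    "BeGdate=1": 28, "BeGdate=2": 16, "BeGdate=3": 22,
    "Education=1": 10, "Education=2": 36, "Education=3": 6,
    "Education=4": 4, "Education=5": 3, "Education=6": 7,
    "MethodofCare=1": 22, "MethodofCare=2": 42, "MethodofCare=3": 2,
    "Sex=1": 40, "Sex=2": 26,
}
MIN_SUPPORT = 6

# Assumed strict comparison, chosen so that the result matches Table 3.
frequent = {item: n for item, n in TABLE2.items() if n > MIN_SUPPORT}


def attribute(item):
    """Return the attribute name of an 'attribute=value' item."""
    return item.split("=")[0]


# Candidate 2-item sets: pairs of frequent items from different attributes.
pairs = [(a, b) for a, b in combinations(frequent, 2)
         if attribute(a) != attribute(b)]

print(len(frequent))  # 21 items, as listed in Table 3
print(len(pairs))     # 191 candidate pairs
```

Each candidate pair would then be counted against the 66 records and pruned in the same way before 3-item sets are formed.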
Fig. 3. Generation of 2-itemsets
Fig. 4 shows another example, for a different group of 2-itemsets. For Age=2 (40-49)
& MaritalStatus=1 (Married), the rule confidence is obtained as 30.3%. This
process can be repeated to generate 3-itemsets and so on.
Association rules are particularly popular because of their ability to find
relationships in large databases without the restriction of choosing a single dependent
variable. However, caution must be exercised in the interpretation of association rules,
since many discovered relationships turn out to be trivial [12].
Fig. 4. Generation of 2-itemsets for other items
3 Conclusions
In this study, a software tool that uses the Apriori algorithm was developed. Although
there are many commercial tools for data mining, they are costly and not
available in our university. Our purpose in this work was therefore to develop software
that could be used for data mining. The software is used for discovering the social status of
diabetics. A diabetic database belonging to the Faculty of Medicine of Kocaeli
University was used. It has records of 66 patients, each with 8 numeric variables.
The purpose of this software is to support the analysis of diabetics. Before the
development of this software, the SPSS database could only be surveyed by eye, and only
single-itemset relations could be generated. The software made it possible to
generate two-, three- and even four-item sets. It is concluded that the developed software
and the methodology have served the purpose and worked well. A comparative
analysis of different data mining techniques would be interesting future work.
References
1. Han, J. and M. Kamber (2001). Data mining: concepts and techniques. San Francisco,
Morgan Kaufmann Publishers.
2. Fayyad U.: From Data Mining to Knowledge Discovery in Databases, American
Association for Artificial Intelligence, 1996.
3. U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowl-
edge Discovery and Data Mining. AAAI/MIT Press, 1996.
4. G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT
Press, 1991.
5. Chen M., Han J.: Data Mining: An Overview from Database Perspective, IEEE Transac-
tions on Knowledge and Data Engineering, 8(6):866-883, 1996.
6. Michel, C. and C. Beguin: Using a database to query for diabetes mellitus, Stud Health
Technol Inform 14: 1994, 178-182.
7. Kopelman, P. G. and A. J. Sanderson: Application of database systems in diabetes care,
Med Inform (Lond) 21(4): 1996, 259-271.
8. Breault, J. L.: Data Mining Diabetic Databases: Are Rough Sets a Useful Addition?,
Computing Science and Statistics, Vol:34, 2001.
9. Knowler, W. C., P. H. Bennett, et al. (1978). “Diabetes incidence and prevalence in Pima
Indians: a 19-fold greater incidence than in Rochester, Minnesota.” Am J Epidemiol
108(6): 497-505.
10. G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT
Press, 1991.
11. Berry, M.J.A., Linoff, G.S.: Mastering Data Mining: The Art and Science of Customer
Relationship Management, John Wiley & Sons, 1st Ed., 1999.
12. Roiger, R.J., Geatz M. W.: Data Mining, A Tutorial-based Primer, Addison Wesley, 2003.
13. Agrawal, R., Imielinski, T., Swami, A.: Mining Association Rules between Sets of Items in
Large Databases. Proceedings of ACM SIGMOD, pages 207-216, May 1993.
14. R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules in Large
Databases. Proceedings of the 20th International Conference on Very Large Data Bases, pages
487-499, September 1994.