Duke Genetic Algorithm Linking
1
The background
• Duke
– open source data matching engine (Java)
– can find near-duplicate database records
– probabilistic configuration
– http://code.google.com/p/duke/
• People find making configurations difficult
– can we help them?

Field Record 1 Record 2 Probability
Name acme inc acme inc 0.9
Assoc no 177477707 0.5
Zip code 9161 9161 0.6
Country norway norway 0.51
Address 1 mb 113 mailbox 113 0.49
Address 2 0.5
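
As a rough illustration of how per-field probabilities like those in the table could be folded into a single match probability, here is a minimal sketch. It assumes a naive-Bayes-style combination in which 0.5 means "no evidence either way"; the function name `combine` is made up for this example and is not Duke's API.

```python
def combine(probabilities):
    """Fold per-field probabilities into one, treating 0.5 as 'no evidence'."""
    result = 0.5  # neutral starting point
    for p in probabilities:
        result = (result * p) / (result * p + (1.0 - result) * (1.0 - p))
    return result

# Per-field probabilities from the table above, top to bottom
fields = [0.9, 0.5, 0.6, 0.51, 0.49, 0.5]
overall = combine(fields)
print("%.3f" % overall)  # prints 0.931
```

Under this assumption the 0.5 rows change nothing, and the 0.51 and 0.49 rows cancel each other out, so the result is driven by the name and zip code.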
2
The idea
• Given
– a test file showing the correct linkages
• can we
– evolve a configuration
• using
– genetic algorithms?
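
One way to score a candidate configuration against such a test file is to compare the links it produces with the known-correct ones. The sketch below uses the F-measure as the fitness value; that choice, and the names `f_measure`, `found` and `correct`, are assumptions for illustration and are not taken from the slides.

```python
def f_measure(found, correct):
    """Harmonic mean of precision and recall over sets of (id1, id2) links."""
    found, correct = set(found), set(correct)
    true_positives = len(found & correct)
    if true_positives == 0:
        return 0.0
    precision = true_positives / float(len(found))
    recall = true_positives / float(len(correct))
    return 2 * precision * recall / (precision + recall)

# Hypothetical data: links listed in the test file vs. links a configuration found
correct = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")}
found = {("a1", "b1"), ("a2", "b2"), ("a3", "b9")}
score = f_measure(found, correct)
print("%.3f" % score)  # precision 2/3, recall 2/4
```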
3
What a configuration looks like
4
The hill-climbing problem
5
How it works
6
Actual code
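# assumed initialisation; not shown on the slide
best, highest = None, 0.0

for generation in range(POPULATIONS):
    print("===== GENERATION %s ================================" % generation)

    # evaluate every configuration, remembering the best one seen so far
    for c in population:
        f = evaluate(c)
        if f > highest:
            best = c
            highest = f

    show_best(best, False)

    # mutate: every configuration produces its successor
    population = [c.make_new(population) for c in population]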
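
To make the shape of this loop concrete, here is a self-contained toy version. Everything in it is hypothetical: `ToyConfig` holds only a threshold, `evaluate` rewards closeness to a made-up `TARGET`, and `make_new` picks the fitter of two random configurations and perturbs it. The real `make_new` and `evaluate` are not shown on the slides and may work quite differently.

```python
import random

TARGET = 0.8  # hypothetical "ideal" threshold the toy search should discover

class ToyConfig(object):
    def __init__(self, threshold):
        self.threshold = threshold
        self.fitness = 0.0

    def make_new(self, population):
        # pick the fitter of two random configurations as the parent ...
        a, b = random.choice(population), random.choice(population)
        parent = a if a.fitness >= b.fitness else b
        # ... and return a slightly perturbed copy, clamped to [0, 1]
        nudged = parent.threshold + random.uniform(-0.05, 0.05)
        return ToyConfig(min(1.0, max(0.0, nudged)))

def evaluate(config):
    # toy fitness: 1.0 at the target, falling off linearly with distance
    config.fitness = 1.0 - abs(config.threshold - TARGET)
    return config.fitness

def evolve(generations=30, size=20, seed=42):
    random.seed(seed)
    population = [ToyConfig(random.random()) for _ in range(size)]
    best, highest = None, -1.0
    for generation in range(generations):
        for c in population:
            f = evaluate(c)
            if f > highest:
                best, highest = c, f
        population = [c.make_new(population) for c in population]
    return best, highest

best, highest = evolve()
print("best threshold %.3f, fitness %.3f" % (best.threshold, highest))
```

Because the loop tracks the best configuration seen so far, the reported fitness never decreases as more generations are run.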
7
Actual code #2
class GeneticConfiguration:
    def __init__(self):
        self._props = []
        self._threshold = 0.0

    def _copy(self):
        c = GeneticConfiguration()
        c.set_threshold(self._threshold)
        for prop in self.get_properties():
            if prop.getName() == "ID":
                c.add_property(Property(prop.getName()))
            else:
                c.add_property(Property(prop.getName(),
                                        prop.getComparator(),
                                        prop.getLowProbability(),
                                        prop.getHighProbability()))
        return c
8
But ... does it work?!?
9
Linking countries
Id http://dbpedia.org/resource/Samoa Id 17019
10
The actual configuration
Threshold 0.6
Confusing.
11
Semantic dogfood
Threshold 0.91
PersonNameComparator?!?
Otherwise as expected.
13
Hafslund
• 1st generation
– best scores: 0.47, 0.43, 0.3
• 2nd generation
– mutated 0.47 configuration scores 0.136, 0.467, 0.002, and 0.49
– best scores: 0.49, 0.467, 0.4, and 0.38
• 3rd generation
– mutated 0.49 scores 0.001, 0.49, 0.46, and 0.25
– best scores: 0.49, 0.46, 0.45, and 0.42
• 4th generation
– we hit 0.525 (modified from 0.21)
15
The progress of evolution #2
• 5th generation
– we hit 0.568 (modified from 0.479)
• 6th generation
– 0.602
• 7th generation
– 0.702
• ...
• 60th generation
– 0.765
– I’d done no better than 0.64 manually
16
Evaluation
• We don’t know
• The experts say genetic algorithms tend to get stuck at local maxima
– they also point out that well-known techniques for dealing with this are described in the literature
• Rerunning tends to produce similar configurations
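
One of the simpler remedies for local maxima is random restarts: run the search several times from different starting points and keep the best result. The toy sketch below shows the idea on a made-up one-dimensional fitness function. Everything here is hypothetical, and a plain hill climber stands in for the genetic algorithm.

```python
import math
import random

def fitness(x):
    # hypothetical bumpy landscape with several local maxima on [0, 1]
    return math.sin(10 * x) * x

def hill_climb(rng, steps=200, step_size=0.02):
    # start somewhere random and only ever accept improvements
    x = rng.random()
    for _ in range(steps):
        candidate = min(1.0, max(0.0, x + rng.uniform(-step_size, step_size)))
        if fitness(candidate) > fitness(x):
            x = candidate
    return x

def with_restarts(restarts=10, seed=1):
    # several independent climbs; keep whichever ended highest
    rng = random.Random(seed)
    runs = [hill_climb(rng) for _ in range(restarts)]
    return max(runs, key=fitness)

best_x = with_restarts()
print("x = %.3f, fitness = %.3f" % (best_x, fitness(best_x)))
```

A single climb can only reach the peak nearest its starting point; the restarts raise the chance that at least one run starts near the highest one.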
18
The literature
http://www.cleveralgorithms.com/
http://www.gp-field-guide.org.uk/
19
Conclusion
• Easy to implement
– you don’t need a GP library
• Requires reliable test data
• It actually works
• Configurations may not be very tweakable
– because they don’t necessarily make any sense
• This is a big field, with lots to learn
http://www.garshol.priv.no/blog/225.html
20
Linking data without common identifiers
1
About me
2
Agenda
3
The problem
4
A real-world example
DBPEDIA                                  MONDIAL
Id  http://dbpedia.org/resource/Samoa    Id  17019
5
A difficult problem
7
Record linkage
1) http://ajph.aphapublications.org/cgi/reprint/36/12/1412
2) http://www.sciencemag.org/content/130/3381/954.citation
3) http://www.jstor.org/pss/2286061
8
Other terms for the same thing
9
Application areas
• Statistics (obviously)
• Data cleaning
• Data integration
• Conversion
• Fraud detection / intelligence / surveillance
10
Mathematical model
11
Model, simplified
12
Example
13
String comparisons
15
Existing record linkage tools
• Commercial tools
– big, sophisticated, and expensive
– have found little information on what they actually do
– presumably also effective
• Open source tools
– generally made by and for statisticians
– nice user interfaces and rich configurability
– architecture often not as flexible as it could be
16
Standard algorithm
17
Good research papers
18
Duke
DUplicate KillEr
19
Context
Suppliers
Companies
20
Requirements
21
Reviewed existing tools...
22
Duke
25
Components
26
Features
A real-world example
28
Finding properties to match
Id http://dbpedia.org/resource/Samoa Id 17019
29
Configuration – data sources
<group>
  <csv>
    <param name="input-file" value="dbpedia.csv"/>
    <param name="header-line" value="false"/>
    <column name="1" property="ID"/>
    <column name="2"
            cleaner="no.priv...CountryNameCleaner"
            property="NAME"/>
    <column name="3"
            property="AREA"/>
    <column name="4"
            cleaner="no.priv...CapitalCleaner"
            property="CAPITAL"/>
  </csv>
</group>

<group>
  <csv>
    <param name="input-file" value="mondial.csv"/>
    <column name="id" property="ID"/>
    <column name="country"
            cleaner="no.priv...examples.CountryNameCleaner"
            property="NAME"/>
    <column name="capital"
            cleaner="no.priv...LowerCaseNormalizeCleaner"
            property="CAPITAL"/>
    <column name="area"
            property="AREA"/>
  </csv>
</group>
30
Configuration – matching
<schema>
  <threshold>0.65</threshold>
  <property type="id">
    <name>ID</name>
  </property>
  <property>
    <name>NAME</name>
    <comparator>no.priv.garshol.duke.Levenshtein</comparator>
    <low>0.3</low>
    <high>0.88</high>
  </property>
  <property>
    <name>AREA</name>
    <comparator>AreaComparator</comparator>
    <low>0.2</low>
    <high>0.6</high>
  </property>
  <property>
    <name>CAPITAL</name>
    <comparator>no.priv.garshol.duke.Levenshtein</comparator>
    <low>0.4</low>
    <high>0.88</high>
  </property>
</schema>

<object class="no.priv.garshol.duke.NumericComparator"
        name="AreaComparator">
  <param name="min-ratio" value="0.7"/>
</object>

Duke analyzes this setup and decides only NAME and CAPITAL need to be searched on in Lucene.
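
To give a feel for what the low and high values of a property do, here is a minimal sketch of how a string comparison might be turned into a probability. Both the similarity normalisation and the mapping onto low/high are assumptions made for this example; Duke's own formulas may differ, and the function names are invented here.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]

def similarity(a, b):
    # assumed normalisation: 1.0 for identical strings, lower as edits pile up
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / float(longest)

def probability(a, b, low, high):
    # assumed mapping: weak similarity counts as evidence against (low),
    # strong similarity scales from 0.5 up towards high
    sim = similarity(a, b)
    if sim >= 0.5:
        return (high - 0.5) * sim * sim + 0.5
    return low

# CAPITAL uses low=0.4, high=0.88 in the schema above
print(probability("tirana", "tirane", 0.4, 0.88))
print(probability("astana", "almaty", 0.4, 0.88))
```

With this mapping a one-letter difference still counts in favour of a match, while two unrelated capitals fall back to the low value.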
31
Result
32
Examples
Field        DBpedia  Mondial      Field        DBpedia     Mondial
Name         albania  albania      Name         kazakhstan  kazakstan
Area         28748    28750        Area         2724900     2717300
Capital      tirana   tirane       Capital      astana      almaty
Probability  0.980                 Probability  0.838
33
Choosing the right match
34
An example of failure
• Duke doesn’t find this match
– no tokens matching exactly

Field    DBpedia     Mondial
Name     kazakhstan  kazakstan
Area     2724900     2717300
Capital  astana      almaty
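
The failure above can be reproduced with a toy candidate lookup. The sketch below is a stand-in for the Lucene search, not Duke's actual code: it indexes records by the exact tokens of NAME and CAPITAL and retrieves only records that share at least one token. The record ids `m1` and `m2` and all function names are made up for the example.

```python
from collections import defaultdict

def tokens(record, fields):
    # whitespace-separated tokens from the searched fields only
    return set(tok for f in fields for tok in record[f].split())

def build_index(records, fields):
    # inverted index: token -> ids of records containing that exact token
    index = defaultdict(set)
    for rec_id, record in records.items():
        for tok in tokens(record, fields):
            index[tok].add(rec_id)
    return index

def candidates(index, record, fields):
    # every record sharing at least one exact token with the query record
    found = set()
    for tok in tokens(record, fields):
        found |= index.get(tok, set())
    return found

SEARCH_FIELDS = ["NAME", "CAPITAL"]
mondial = {
    "m1": {"NAME": "albania", "CAPITAL": "tirane"},
    "m2": {"NAME": "kazakstan", "CAPITAL": "almaty"},
}
index = build_index(mondial, SEARCH_FIELDS)

albania = candidates(index, {"NAME": "albania", "CAPITAL": "tirana"}, SEARCH_FIELDS)
kazakhstan = candidates(index, {"NAME": "kazakhstan", "CAPITAL": "astana"}, SEARCH_FIELDS)
print(albania)     # m1 is retrieved through the shared token "albania"
print(kazakhstan)  # empty: no token matches exactly
```

Since the Kazakhstan record is never retrieved as a candidate, the comparators do not get the chance to score it, however similar the names are.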
36
Usage at Hafslund
37
The SESAM project
38
The big picture
[Diagram: 360, CRM, Billing and Duke connected by SDshare feeds, with the callout "DUPLICATES!". The SDshare feed from Duke contains owl:sameAs and haf:possiblySameAs.]
39
Experiences so far
40
Duke roadmap
• 0.3
– clean up the public API and document properly
– maybe some more comparators
– support for writing owl:sameAs to Sparql endpoint
• 0.4
– add a web service interface
• 0.5 and onwards
– more comparators
– maybe some parallelism
41
Comments/questions?
42