GitHub - quesada/wikiprep-esa: ESA implementation using Wikiprep output

quesada / wikiprep-esa Public

forked from faraday/wikiprep-esa

Notifications You must be signed in to change notification settings
Fork 0
Star 1

ESA implementation using Wikiprep output

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 71 Commits
esa-lucene		esa-lucene
README		README
addAnchors.py		addAnchors.py
addRedirects.py		addRedirects.py
directScan.py		directScan.py
lewis_smart_sorted_uniq.txt		lewis_smart_sorted_uniq.txt
readCatHier.py		readCatHier.py
scanCatHier.py		scanCatHier.py
scanData.py		scanData.py
scanLinks.py		scanLinks.py
selected.txt		selected.txt
wiki_stop_categories.txt		wiki_stop_categories.txt

Repository files navigation

Wikiprep-ESA

This is an effort to implement Explicit Semantic Analysis (ESA) as described in this paper:

"Wikipedia-based semantic interpretation for natural language processing"
2009, Gabrilovich, E. and Markovitch, S.

You can find this paper at: http://www.jair.org/media/2669/live-2669-4346-jair.pdf

This implementation consists of:
* scanData.py : that reads Wikiprep output into a MySQL database.
It creates "article","text" and "pagelinks" tables.

* addAnchors.py : that adds anchor text to target articles.
* addRedirects.py : that adds redirect text to target articles.

The scripts above are able to work on both Wikiprep legacy formats and modern format (as in Zemanta fork).

Evgeniy Gabrilovich provides a preprocessed dump for 5 November 2005 snapshot of Wikipedia English.

It is available at: http://www.cs.technion.ac.il/~gabr/resources/code/wikiprep/wikipedia-051105-preprocessed.tar.bz2

In its current settings, Python scripts of wikiprep-esa are ready to process this dump.
If you need to process dumps in formats of Zemanta, you need to set FORMAT in these scripts.

FORMAT can be following: "Gabrilovich", "Zemanta-legacy", "Zemanta-modern"

After reading preprocessed dump into the database and adding anchors and redirects, you need to use
"esa-lucene" to perform indexing.

* ESAWikipediaIndexer: performs indexing with Lucene by feeding it with article content from database.

* WikipediaNormalSearcher: at this step, you can use this class to perform a search in Lucene index.
keep in mind that at this point, the implementation won't be the same with Gabrilovich et al. (2009),
since cosine normalization is term-based in Gabrilovich et al. but document length based in Lucene.
Additionally, pruning is not yet applied in Lucene index as in Gabrilovich et al.

However, TF.IDF weighing scheme is the same (log-based) and is located in ESASimilarity class.

* IndexModifier: reads term frequency vectors from Lucene index and writes cosine-normalized TF.IDF values into
"tfidf" table in the database. This is done to apply the same normalization method used in Gabrilovich et al. (2009).

* IndexPruner: prunes concept vectors for each term with a sliding window.
By default, window_size = 100 and threshold = 0.05 as in Gabrilovich et al. (2009). You can modify these values
in IndexPruner class.

* ESASearcher: performs search and computes vectors by using the resulting index in the database.

* TestESAVectors: produces and displays regular feature vector.

* TestGeneralESAVectors: produces and displays "Second Order Interpretation" vector filtered with "Concept Generality Filter" as in Gabrilovich et al. (2009).

DEPENDENCIES

Python scripts use MySQL-Python to access database.
MySQL-Python: http://sourceforge.net/projects/mysql-python/

"esa-lucene" Java project used for indexing, pruning etc. uses MySQL Connector/J to access database,
Lucene 3.0 for indexing and Trove and these libraries are included in project files.

MySQL Connector/J: http://www.mysql.com/downloads/connector/j/
Lucene 3.0: http://lucene.apache.org
Trove: http://trove4j.sourceforge.net/

USAGE

This creates the pagelinks table and records incoming and outgoing link counts.

python scanLinks.py <hgw.xml file from Wikiprep dump>

As stop categories, a list "wiki_stop_categories.txt" is provided.
But if you want to descend down and include all subtrees of these categories, you can use:

python scanCatHier.py <hgw.xml file from Wikiprep dump> <cat_hier output path>

[The commands below are standard]

python scanData.py <hgw.xml file from Wikiprep dump>
python addAnchors.py <anchor_text file from Wikiprep dump> <a writeable folder>

java -cp esa-lucene.jar edu.wiki.index.ESAWikipediaIndexer <Lucene index folder>
java -cp esa-lucene.jar edu.wiki.modify.IndexModifier <Lucene index folder>
java -cp esa-lucene.jar edu.wiki.modify.IndexPruner

Then perform a feature generation to test:

To generate regular features:
java -cp esa-lucene.jar edu.wiki.demo.TestESAVectors

To generate features using only more general links:
java -cp esa-lucene.jar edu.wiki.demo.TestGeneralESAVectors