Skip to content

hacked together bad python code to index digital ancient-greek dictionaries

Notifications You must be signed in to change notification settings

no382001/a-grk_indexer

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 

Repository files navigation

hacked together bad python code to index digital ancient-greek dictionaries (i hate digital dictionaries, but you cant have everything)

install

(this will take a while)

pip install cltk pytesseract pdf2image

apt install tesseract-ocr-grc

after, run this to get the CLTK ancient greek model (or just run download_corpus.py)

from cltk.data.fetch import FetchCorpus

corpus_downloader = FetchCorpus(language="grc")
corpus_downloader.import_corpus("grc_models_cltk")

use

  • run indexer.py
  • select a pdf
  • select an area (or two) with L or RMouse (this is the only part of all the pages thats going to be OCR'd and tokenized by CLTK)
  • press Tokenize
  • ...

About

hacked together bad python code to index digital ancient-greek dictionaries

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages