ian-jooble/jooble-ir-2018-paper

PAPER search engine

Table of contents:

  • Services description
  • Index realization
  • Testing
  • Data
  • TODO

Services:

search

The search server hosts the UI and runs the whole search workflow.

Receive: AJAX POST request from the user interface with meta and the search query
Return: ranked docs

Makes calls to:

  • Analyzer service (analyzer) with the user's search query; receives the stemmed, tokenized query and its language
  • Rank service (rank) with the preprocessed query
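
The call sequence above can be sketched roughly as follows. This is only an illustration, not the repository's code: `handle_search`, the `analyze` and `rank` callables, and the payload shapes are assumptions standing in for the HTTP calls to the analyzer and rank services.

```python
from typing import Callable, List, Tuple

# Hypothetical signatures standing in for HTTP calls to the two services.
Analyze = Callable[[List[str]], Tuple[List[List[str]], List[str]]]
Rank = Callable[[List[dict], List[str]], List[str]]


def handle_search(query: str, meta: List[dict], analyze: Analyze, rank: Rank) -> List[str]:
    """Run the search workflow: analyze the query, then rank documents."""
    # Assumed shape: the analyzer takes a batch of texts, so wrap the query.
    stemmed, langs = analyze([query])
    stemmed_query = stemmed[0]
    # langs[0] holds the detected language; it is not used in this sketch.
    return rank(meta, stemmed_query)


# Stub services for demonstration only.
def fake_analyze(texts):
    return [t.lower().split() for t in texts], ["en"] * len(texts)


def fake_rank(meta, stemmed_query):
    return ["doc matching " + " ".join(stemmed_query)]


result = handle_search("Python Developer", [], fake_analyze, fake_rank)
```

Passing the services in as callables keeps the sketch runnable without a network; a real server would replace the stubs with requests to the analyzer and rank endpoints.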

analyzer

The analyzer runs the stemming, tokenization and language detection workflow.

Receive: list of lists of texts
Return: list of lists of stemmed tokens for each document & list of language codes

Makes calls to:

  • Language detection service (lang_detection) with query
  • Stemmer service (stemmer) with query
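
A minimal sketch of how the analyzer might combine the two services, assuming a flat list of texts as input. The function name `analyze` and the `detect_langs` and `stem` callables are hypothetical stand-ins for the calls to lang_detection and stemmer.

```python
from typing import Callable, List, Tuple


def analyze(
    texts: List[str],
    detect_langs: Callable[[List[str]], List[str]],
    stem: Callable[[str, str], List[str]],
) -> Tuple[List[List[str]], List[str]]:
    """Detect each text's language, then stem the text with that language."""
    langs = detect_langs(texts)
    stemmed = [stem(text, lang) for text, lang in zip(texts, langs)]
    return stemmed, langs


# Stubs for demonstration only; the real services are separate processes.
def fake_detect(texts):
    return ["en" for _ in texts]


def fake_stem(text, lang):
    return text.lower().split()


tokens, langs = analyze(["Senior Python Developer"], fake_detect, fake_stem)
```

Language detection runs first because the stemmer needs the language of each text as its second argument.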

stemmer

Provides stemming and tokenization.

Receive: text document and language
Return: list of stemmed tokens for document
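
To make the input and output shapes concrete, here is a toy sketch. The README does not name the stemming algorithm, so the naive suffix stripping below, the `SUFFIXES` table and the name `stem_text` are all assumptions for illustration and should not be read as the service's actual behaviour.

```python
import re
from typing import List

# Toy suffix lists per language code; an assumption for illustration only.
SUFFIXES = {
    "en": ("ing", "ed", "es", "s"),
}


def stem_text(text: str, lang: str) -> List[str]:
    """Tokenize text into lowercase words and strip one known suffix from each."""
    tokens = re.findall(r"\w+", text.lower())
    suffixes = SUFFIXES.get(lang, ())
    stemmed = []
    for token in tokens:
        for suffix in suffixes:
            # Strip only if at least three characters of the stem remain.
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                token = token[: -len(suffix)]
                break
        stemmed.append(token)
    return stemmed


tokens = stem_text("Hiring Python developers", "en")
```

For a language without an entry in `SUFFIXES`, the sketch only tokenizes and lowercases.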

lang_detection

Detects the language of each text in a list.

Receive: list of strs
Return: list of detected language codes (strs like de, en etc.)

rank

Searches the index and ranks the documents using the paper.index module.

Receive: list of meta, stemmed_query
Return: list of ranked documents
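
The README does not state which ranking function is used. As one possible illustration, the sketch below orders candidate documents by how often the query tokens occur in them. `rank_docs` and the `candidates` mapping (doc id to stemmed tokens) are hypothetical names, and term counting is a swapped-in technique, not necessarily what the service does.

```python
from collections import Counter
from typing import Dict, List


def rank_docs(stemmed_query: List[str], candidates: Dict[str, List[str]]) -> List[str]:
    """Order candidate doc ids by how often the query tokens occur in each doc."""

    def score(doc_tokens: List[str]) -> int:
        counts = Counter(doc_tokens)
        return sum(counts[token] for token in stemmed_query)

    # Sort by descending score; ties are broken by doc id for determinism.
    return sorted(candidates, key=lambda doc_id: (-score(candidates[doc_id]), doc_id))


candidates = {
    "1": ["python", "developer"],
    "2": ["python", "python", "developer", "remote"],
}
ranked = rank_docs(["python", "developer"], candidates)
```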

updater

Updates the index using the paper.index module.

Receive: textes - list of strs
Return: list of ids of updated documents

Makes calls to:

  • Analyzer service with textes
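
A rough sketch of the update flow under stated assumptions: `update` is a hypothetical handler name, `analyze` stands in for the call to the analyzer service, and `update_index` stands in for the paper.index function of the same name. The stubs and the integer ids exist only to make the example runnable.

```python
from typing import Callable, Dict, List, Tuple


def update(
    textes: List[str],
    analyze: Callable[[List[str]], Tuple[List[List[str]], List[str]]],
    update_index: Callable[[List[str], List[List[str]]], List[int]],
) -> List[int]:
    """Analyze the texts, index them, and return the ids of the updated docs."""
    stemmed, _langs = analyze(textes)
    return update_index(textes, stemmed)


# Stubs for demonstration only.
def fake_analyze(texts):
    return [t.lower().split() for t in texts], ["en"] * len(texts)


indexed: Dict[int, Tuple[str, List[str]]] = {}


def fake_update_index(docs, stemmed):
    ids = []
    for doc, tokens in zip(docs, stemmed):
        doc_id = len(indexed) + 1
        indexed[doc_id] = (doc, tokens)
        ids.append(doc_id)
    return ids


ids = update(["Python developer", "Data engineer"], fake_analyze, fake_update_index)
```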

Index

The index is a Redis database.

The paper.index module provides the basic functions for working with the index:

| Function | Arguments | Description |
| --- | --- | --- |
| update_index(docs, stemmed) | docs - list or array of strs; stemmed - list of lists of token strs | Indexes the texts (inverted and forward index). Return: list of ids of updated docs |
| search(tokens) | tokens - stemmed tokens to search for | Boolean AND search (intersection of sets). Return: set of doc_id strs (like "23" or "12345") |
| get_docs(ids, is_str=False) | ids - int ids of docs if is_str=False; otherwise strs of ids like 'doc:id' | Gets docs by their ids. Return: list of strs (document texts) |
| delete_all() | no arguments | Deletes all entries from the Redis database |
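
To illustrate the forward and inverted indexes and the boolean AND search, here is an in-memory sketch that mirrors the function names above. It is not the paper.index implementation: plain dicts and sets replace Redis, and the key layout, the sequential integer ids and the type hints are assumptions.

```python
from typing import Dict, List, Set

# In-memory stand-ins for the Redis structures (an assumption for illustration).
forward: Dict[str, str] = {}        # "doc:<id>" -> document text
inverted: Dict[str, Set[str]] = {}  # token -> set of doc id strs


def update_index(docs: List[str], stemmed: List[List[str]]) -> List[int]:
    """Store each doc (forward index) and map its tokens to its id (inverted index)."""
    ids = []
    for doc, tokens in zip(docs, stemmed):
        doc_id = len(forward) + 1
        forward[f"doc:{doc_id}"] = doc
        for token in tokens:
            inverted.setdefault(token, set()).add(str(doc_id))
        ids.append(doc_id)
    return ids


def search(tokens: List[str]) -> Set[str]:
    """Boolean AND: intersect the posting sets of all query tokens."""
    if not tokens:
        return set()
    postings = [inverted.get(token, set()) for token in tokens]
    return set.intersection(*postings)


def get_docs(ids, is_str: bool = False) -> List[str]:
    """Fetch texts by int ids, or by 'doc:id' keys when is_str is True."""
    keys = ids if is_str else [f"doc:{i}" for i in ids]
    return [forward[key] for key in keys]


def delete_all() -> None:
    """Drop everything from both indexes."""
    forward.clear()
    inverted.clear()


update_index(
    ["Python developer", "Java developer"],
    [["python", "develop"], ["java", "develop"]],
)
hits = search(["python", "develop"])
```

In Redis, the same idea would typically map each token to a SET of doc ids and use set intersection for the AND query, but the exact keys used by the module are not documented here.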

Testing

The testing folder contains notebooks for testing the APIs (lang_detect, update).

Data

The Data folder contains eval_textes.csv and hhru.json (parsed vacancies from hh.ru).

TODO

  • Make doc class
  • Snippets api
  • Add other index functions such as delete(ids)
  • Store each doc as a HASH in the Redis database
  • Config file for ports and other stuff
  • Config for redis db
  • Make Stemmer for other languages
  • Make Index with ./Data/by/* gzip files
