- Services description
- Index realization
- Testing
- Data
- TODO
The search server hosts the UI and runs the whole search workflow.
Receive: AJAX POST request from the user interface with meta and the search query
Return: ranked docs
Makes calls to:
- Analyzer service (analyzer) with the user's search query; receives the stemmed, tokenized query and its language
- Rank service (rank) with the preprocessed query
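The workflow above can be sketched as a single function. This is a minimal illustration, not the project's actual code: `handle_search`, `fake_analyze` and `fake_rank` are hypothetical names, the request keys `meta` and `search` are assumed from the description, and the two services are passed in as plain callables instead of real HTTP calls.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins for the HTTP calls to the two services.
AnalyzeFn = Callable[[str], Tuple[List[str], str]]  # query -> (stemmed tokens, lang)
RankFn = Callable[[Dict, List[str]], List[str]]     # (meta, tokens) -> ranked docs


def handle_search(request: Dict, analyze: AnalyzeFn, rank: RankFn) -> List[str]:
    """Run one search request through the analyzer, then the rank service."""
    query = request["search"]               # assumed request key
    meta = request.get("meta", {})          # assumed request key
    tokens, _lang = analyze(query)          # step 1: analyzer service
    return rank(meta, tokens)               # step 2: rank service


# Fake services, used only to show the data flow.
def fake_analyze(query: str) -> Tuple[List[str], str]:
    return query.lower().split(), "en"


def fake_rank(meta: Dict, tokens: List[str]) -> List[str]:
    return [f"doc about {token}" for token in tokens]


ranked = handle_search({"meta": {}, "search": "Python Developer"}, fake_analyze, fake_rank)
```

Passing the services in as callables keeps the sketch runnable without a network; in the real server these would presumably be requests to the analyzer and rank endpoints.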
The analyzer runs the stemming, tokenization and language detection workflow.
Receive: list of lists of texts
Return: list of lists of stemmed tokens for each document & list of language codes
Makes calls to:
- Language detection service (lang_detection) with query
- Stemmer service (stemmer) with query
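A rough sketch of how the analyzer could combine the two services follows. It assumes a flat list of texts for simplicity, and `analyze`, `fake_detect` and `fake_stem` are hypothetical names; the real services are reached over the network and are represented here by plain functions.

```python
from typing import Callable, List, Tuple


def analyze(
    texts: List[str],
    detect_langs: Callable[[List[str]], List[str]],
    stem: Callable[[str, str], List[str]],
) -> Tuple[List[List[str]], List[str]]:
    """Detect the language of every text, then stem each text with its language."""
    langs = detect_langs(texts)  # one call for the whole batch
    tokens = [stem(text, lang) for text, lang in zip(texts, langs)]
    return tokens, langs


# Fake services, used only to show the shapes of inputs and outputs.
def fake_detect(texts: List[str]) -> List[str]:
    return ["en" for _ in texts]


def fake_stem(text: str, lang: str) -> List[str]:
    return text.lower().split()


tokens, langs = analyze(["Hello World", "Redis index"], fake_detect, fake_stem)
```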
The stemmer service provides stemming and tokenization.
Receive: text document and language
Return: list of stemmed tokens for document
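To illustrate the input and output shapes, here is a toy tokenizer and suffix-stripping stemmer. It is not the algorithm the service uses (the source does not say which one that is); `stem_document` and the `SUFFIXES` table are assumptions made for this example only.

```python
import re
from typing import List

# Toy suffix table; a production stemmer uses far richer rules per language.
SUFFIXES = {"en": ("ing", "ed", "s")}


def stem_document(text: str, lang: str) -> List[str]:
    """Lowercase and tokenize the text, then strip one known suffix per token."""
    tokens = re.findall(r"\w+", text.lower())
    suffixes = SUFFIXES.get(lang, ())  # unknown languages are only tokenized
    stemmed = []
    for token in tokens:
        for suffix in suffixes:
            # Keep at least three characters of the stem.
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                token = token[: -len(suffix)]
                break
        stemmed.append(token)
    return stemmed


english = stem_document("Testing indexed documents", "en")
unknown = stem_document("Testing", "de")
```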
The language detection service detects the language of each text in a list.
Receive: list of strs
Return: list of detected languages (str codes such as de, en, etc.)
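The contract (list of strs in, list of language codes out) can be illustrated with a toy detector that counts stopword overlap. This is only a sketch under that assumption; the method used by the real service is not described here, and `detect_languages` and `STOPWORDS` are hypothetical names.

```python
from typing import Dict, List, Set

# Tiny stopword lists, for illustration only.
STOPWORDS: Dict[str, Set[str]] = {
    "en": {"the", "and", "is", "of", "to"},
    "de": {"der", "die", "und", "ist", "das"},
}


def detect_languages(texts: List[str]) -> List[str]:
    """Return one language code per text, chosen by stopword overlap."""
    langs = []
    for text in texts:
        words = set(text.lower().split())
        # Ties (including no overlap at all) fall back to the first language.
        best = max(STOPWORDS, key=lambda lang: len(words & STOPWORDS[lang]))
        langs.append(best)
    return langs


detected = detect_languages(["the cat is here", "der Hund ist hier"])
```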
The rank service searches the index and ranks documents using the paper.index module.
Receive: list of meta, stemmed_query
Return: list of ranked documents
Updates the index using the paper.index module.
Receive: textes - list of strs
Return: list of ids of updated documents
Makes calls to:
- Analyzer service with the texts
The index is a Redis database.
The paper.index module provides the basic functions for working with the index:
Function | Arguments | Description |
---|---|---|
update_index(docs, stemmed) | docs - list or array of strs; stemmed - list of lists of token strs | Indexes the texts (inverted and forward index). Returns: list of ids of updated docs |
search(tokens) | tokens - stemmed tokens to search for | Runs a boolean AND search (intersection of sets). Returns: set of doc_id strs (like "23" or "12345") |
get_docs(ids, is_str=False) | ids - int ids of docs if is_str=False; otherwise strs of ids like 'doc:id' | Gets docs by their ids. Returns: list of strs (document texts) |
delete_all() | none | Deletes all entries from the Redis database |
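The behaviour described in the table can be modelled in memory. The class below is a simplified stand-in, not the paper.index implementation: `MemoryIndex` is a hypothetical name, plain dicts replace Redis, ids are assigned sequentially as an assumption, and `get_docs` takes the str ids returned by `search` directly.

```python
from collections import defaultdict
from typing import Dict, List, Set


class MemoryIndex:
    """In-memory model of a forward and an inverted index (illustration only)."""

    def __init__(self) -> None:
        self.forward: Dict[str, str] = {}                      # doc_id -> text
        self.inverted: Dict[str, Set[str]] = defaultdict(set)  # token -> doc_ids
        self.next_id = 0

    def update_index(self, docs: List[str], stemmed: List[List[str]]) -> List[str]:
        """Store each text and register its tokens; return the new doc ids."""
        ids = []
        for text, tokens in zip(docs, stemmed):
            doc_id = str(self.next_id)
            self.next_id += 1
            self.forward[doc_id] = text
            for token in tokens:
                self.inverted[token].add(doc_id)
            ids.append(doc_id)
        return ids

    def search(self, tokens: List[str]) -> Set[str]:
        """Boolean AND: ids of docs that contain every token."""
        if not tokens:
            return set()
        postings = [self.inverted.get(token, set()) for token in tokens]
        return set.intersection(*postings)

    def get_docs(self, ids: List[str]) -> List[str]:
        """Return the stored texts for the given ids."""
        return [self.forward[doc_id] for doc_id in ids]

    def delete_all(self) -> None:
        """Drop everything, like clearing the database."""
        self.forward.clear()
        self.inverted.clear()
        self.next_id = 0


index = MemoryIndex()
new_ids = index.update_index(
    ["python developer", "java developer"],
    [["python", "develop"], ["java", "develop"]],
)
both = index.search(["develop"])
only_python = index.search(["python", "develop"])
```

The inverted index maps each token to the set of documents containing it, so a boolean AND query reduces to a set intersection, while the forward index is what lets `get_docs` return the original texts.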
The testing folder contains notebooks for testing the APIs (lang_detect, update).
The Data folder contains eval_textes.csv and hhru.json (parsed vacancies from hh.ru).
- Make doc class
- Snippets api
- Make other functions for index like delete(ids) etc.
- Make doc as a HASH instance in redis database
- Config file for ports and other stuff
- Config for redis db
- Make Stemmer for other languages
- Make Index with ./Data/by/* gzip files