Improve tokenizer #51

xhluca · 2024-09-04T05:00:14Z

Closes On-the-fly stemming #31
Makes tokenizing queries much faster, since we don't need to rebuild the vocab (we can just reuse the existing vocab from the corpus).
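
To illustrate the idea, here is a minimal sketch of reusing a corpus vocab when tokenizing queries. It is not the code from this PR: `SketchTokenizer`, its `update_vocab` flag, and the `\w+` splitting rule are hypothetical names and choices made for illustration; only the `word_to_id` mapping name comes from the commits below.

```python
import re


class SketchTokenizer:
    """Hypothetical illustration only; not the tokenizer class added in this PR."""

    def __init__(self):
        # Maps each known word to an integer id; built once from the corpus.
        self.word_to_id = {}

    def tokenize(self, texts, update_vocab=True):
        all_ids = []
        for text in texts:
            ids = []
            for word in re.findall(r"\w+", text.lower()):
                if word not in self.word_to_id:
                    if not update_vocab:
                        # Unknown query word: skip it instead of growing the vocab.
                        continue
                    self.word_to_id[word] = len(self.word_to_id)
                ids.append(self.word_to_id[word])
            all_ids.append(ids)
        return all_ids


tokenizer = SketchTokenizer()

# Build the vocab once while tokenizing the corpus.
corpus_ids = tokenizer.tokenize(["a cat sat", "a dog ran"], update_vocab=True)

# Tokenize queries against the existing vocab; nothing is rebuilt.
query_ids = tokenizer.tokenize(["cat and dog"], update_vocab=False)
```

In this sketch, query words that never appeared in the corpus are dropped, on the assumption that they cannot match any document anyway.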

xhluca added 14 commits September 3, 2024 19:21

Add a tokenizer class (WIP)

1f43256

fix word_to_wid logic and add example for using tokenizer class

8244a89

WIP changes to make token ids valid inputs of retrieve, still need to be thoroughly tested

35eb602

add todo to example

90fdf37

Major refactoring of tokenizer class

370d643

Minor QOL improvements

c190f80

Refactor streaming_tokenize to be faster by reducing unnecessary set checks

c5cb9f5

Remove _word_to_wid to simplify vocab design. Now, word_to_id is updated when stemmer is not used

a082501

Update beir.py utils to use new URL

16df49f

Remove unused function, lint code, add test cases

43cc36c

Add example of using the tokenizer

d105d03

Rename example

9bda058

Add details about the new tokenizer class in readme

6ccafe5

Update class to test tokenize in int, ids, strings, tuple

407f67b

xhluca merged commit 0a49c62 into main Sep 8, 2024
2 checks passed

xhluca deleted the improve-tokenizer branch September 8, 2024 04:58