Refactor retrieval to make it faster to run in numba mode #47

xhluca · 2024-08-24T22:49:14Z

This is a work in progress!

This PR will make numba mode faster by rewriting the entire retrieve process into a numba JIT-able function (see _retrieve_internal_numba_parallel).
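As a rough illustration of the idea (not the actual `_retrieve_internal_numba_parallel`), here is a minimal sketch of a batched retrieve written so that numba can compile it and run queries in parallel. The function name `retrieve_batch_sketch`, the flattened query layout (`flat_query_ids` plus `query_offsets`), the fixed float32 score dtype, and the argsort-based top-k are all assumptions made for this example. It falls back to plain Python if numba is not installed.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Fallback so the sketch still runs (slowly) without numba installed.
    def njit(*args, **kwargs):
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]
        return lambda fn: fn
    prange = range


@njit(parallel=True)
def retrieve_batch_sketch(data, indptr, indices, num_docs,
                          flat_query_ids, query_offsets, k):
    # Hypothetical helper: scores every query and keeps the top-k documents.
    # Assumes k <= num_docs and float32 scores.
    n_queries = len(query_offsets) - 1
    top_ids = np.zeros((n_queries, k), dtype=np.int64)
    top_scores = np.zeros((n_queries, k), dtype=np.float32)
    for q in prange(n_queries):  # each query is independent of the others
        scores = np.zeros(num_docs, dtype=np.float32)
        for t in range(query_offsets[q], query_offsets[q + 1]):
            token_id = flat_query_ids[t]
            for j in range(indptr[token_id], indptr[token_id + 1]):
                scores[indices[j]] += data[j]
        order = np.argsort(scores)  # ascending, so read it from the end
        for r in range(k):
            doc = order[num_docs - 1 - r]
            top_ids[q, r] = doc
            top_scores[q, r] = scores[doc]
    return top_ids, top_scores


# Toy index: 3 tokens, 4 documents, 2 queries ([0, 2] and [1]).
data = np.array([1.0, 0.5, 2.0, 0.25, 0.75, 1.5], dtype=np.float32)
indices = np.array([0, 2, 1, 0, 1, 3], dtype=np.int64)
indptr = np.array([0, 2, 3, 6], dtype=np.int64)
flat_query_ids = np.array([0, 2, 1], dtype=np.int64)
query_offsets = np.array([0, 2, 3], dtype=np.int64)

top_ids, top_scores = retrieve_batch_sketch(
    data, indptr, indices, 4, flat_query_ids, query_offsets, 2
)
```

Keeping scoring and top-k selection inside one compiled loop is what lets the whole batch run without returning to the Python interpreter between queries.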

TODO:

Clean up retrieve_numba to make it compatible with retrieve when the BM25 object is initialized with backend="numba"
Deprecate selection_backend in retrieve so that it happens at the object init time
Potentially rename _retrieve_internal_numba_parallel
Make tqdm work in _retrieve_internal_numba_parallel
Potentially refactor the behavior of the selection and numba.selection modules
Create a tokenizer class (perhaps in a separate PR? also should handle On-the-fly stemming #31 at the same time)
Add tests for numba in numpy-disk mode and with bm25+ (use non-occurrence matrix)

xhluca · 2024-08-30T02:51:38Z

I wonder if it is possible to do inverted indexing here, by creating an array that tracks the start and end of each token's entries:

bm25s/bm25s/scoring.py

Lines 329 to 352 in daf29ce

    
           def _compute_relevance_from_scores_jit_ready( 
        
               data: np.ndarray, 
        
               indptr: np.ndarray, 
        
               indices: np.ndarray, 
        
               num_docs: int, 
        
               query_tokens_ids: np.ndarray, 
        
               dtype: np.dtype, 
        
           ) -> np.ndarray: 
        
               """ 
        
               This internal static function calculates the relevance scores for a given query, 
        
               by using the BM25 scores that have been precomputed in the BM25 eager index. 
        
               This version is ready for JIT compilation with numba, but is slow if not compiled. 
        
               """ 
        
               indptr_starts = indptr[query_tokens_ids] 
        
               indptr_ends = indptr[query_tokens_ids + 1] 
        
               scores = np.zeros(num_docs, dtype=dtype) 
        
               for i in range(len(query_tokens_ids)): 
        
                   start, end = indptr_starts[i], indptr_ends[i] 
        
                   # The following code is slower with numpy, but faster after JIT compilation 
        
                   for j in range(start, end): 
        
                       scores[indices[j]] += data[j] 
        
               return scores
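To make the start/end layout concrete, here is a small worked example on a toy index, written as a sketch rather than as the library's code. The helper name `score_query` and the toy arrays are made up for illustration. It assumes each token's slice of `indices` lists a document at most once, which is what makes the vectorized `+=` safe.

```python
import numpy as np


def score_query(data, indptr, indices, num_docs, query_token_ids):
    """Sum the precomputed per-token scores into one dense score vector."""
    scores = np.zeros(num_docs, dtype=data.dtype)
    for token_id in query_token_ids:
        # indptr[t] and indptr[t + 1] bound the entries belonging to token t.
        start, end = indptr[token_id], indptr[token_id + 1]
        # Safe only because a document appears at most once per token slice.
        scores[indices[start:end]] += data[start:end]
    return scores


# Toy index with 3 tokens and 4 documents:
#   token 0 -> docs 0, 2
#   token 1 -> doc 1
#   token 2 -> docs 0, 1, 3
data = np.array([1.0, 0.5, 2.0, 0.25, 0.75, 1.5], dtype=np.float32)
indices = np.array([0, 2, 1, 0, 1, 3], dtype=np.int32)
indptr = np.array([0, 2, 3, 6], dtype=np.int32)

toy_scores = score_query(data, indptr, indices, 4, np.array([0, 2]))
print(toy_scores.tolist())  # [1.25, 0.75, 0.5, 1.5]
```

Only the entries for the query's tokens are visited, so the cost depends on the length of those posting lists rather than on the total size of the index.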


xhluca · 2024-09-02T22:40:40Z

Deprecate selection_backend in retrieve so that it happens at the object init time

In retrospect, selection_backend remains useful for testing purposes, as well as for using the jax backend. Let's not deprecate it in 0.2.0.

xhluca · 2024-09-02T22:41:21Z

Make tqdm work in _retrieve_internal_numba_parallel

Unfortunately, tqdm won't work, so we can't add a progress bar to retrieve when backend is set to numba.

xhluca · 2024-09-02T23:01:11Z

Create a tokenizer class (perhaps in a separate PR? also should handle #31 at the same time)

Will do that in a separate PR.

xhluca added 4 commits August 21, 2024 01:34

WIP

e12ec95

Stil WIP, seems to be working, cleanup TODO

1bae05d

Add _tokenize_with_vocab (WIP)

d3308fa

Rename to tokenize_with_vocab_exp

8f3a476

xhluca mentioned this pull request Aug 24, 2024

Add more bm25s+numba results xhluca/bm25-benchmarks#8

Merged

xhluca marked this pull request as draft August 24, 2024 22:49

xhluca added 2 commits August 25, 2024 11:59

simplify _tokenize_with_vocab_exp

1e6e78c

update _tokenize_with_vocab_exp

6839751

xhluca added 2 commits September 2, 2024 18:34

Delete unecessary comment and rename retrieve utils functiion

ba2fa2a

Cleanup retrieve_numba to make it compatible with retrieve when BM25 …

be178b8

…object is initiatilized with backend="numba"

xhluca added 2 commits September 2, 2024 18:59

Add backend saving

56c2202

Add tests for numba backend, including mmap

e36780d

Move test file to numba section

a05a072

xhluca marked this pull request as ready for review September 3, 2024 00:28

xhluca merged commit 072d242 into main Sep 3, 2024
2 checks passed

xhluca deleted the refactor-retrieve-for-numba branch September 3, 2024 00:28
