-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make empty strings an acceptable token #67
Conversation
@zhuwenxing does this tackle the issue you encountered? Feel free to contribute tests for that specific instance. |
Hi, @xhluca . From my perspective, raising an error when tokenizing an empty string is not a good practice. In real-world data distribution, it's common for data to be imperfect and contain null values (or empty strings), so it's normal for data to include empty strings. Raising an error directly for empty strings might frustrate developers, who would then have to clean the data before using bm25s. Additionally, I experimented with Elasticsearch, and inserting data with empty strings didn't cause any issues. Even if all the data were empty strings, searching wouldn't result in an error. |
You are right. I've removed the valueerrror and instead use make the empty string into an accepted vocab token. Can you let me know if that works on your end? |
@zhuwenxing I tried running your example, however there was a few things i had to change, i also removed jieba to make it more generic (since the problem can be found in english as well), here's the updated code: import bm25s
from bm25s.tokenization import Tokenizer
# Create an empty corpus
corpus = ["", "", "", ""]
# Create a list of queries
queries = ["what is the meaning of life?"]
# The tokenizer will return a list of list of tokens
tokenizer = Tokenizer()
corpus_tokens = tokenizer.tokenize(corpus, return_as="tuple", allow_empty=True)
print("Corpus tokens:", corpus_tokens)
query_tokens = tokenizer.tokenize(queries, return_as="ids", update_vocab=False, allow_empty=True)
print(f"Query tokens: {query_tokens}")
retriever = bm25s.BM25()
retriever.index(corpus_tokens)
results, scores = retriever.retrieve(query_tokens, corpus=corpus, k=2)
print("Results:", results) If you run the updated code, you will find an error with the latest version of this library, however, if you run it after commit c87f2fe of this branch, the error will be resolved. Anyways, let me know if this is the intended behavior you were expecting. |
…call to avoid running into an error
…t contain ints that do not exist in the vocabulary, but allows empty string
Closes #60