
Make empty strings an acceptable token #67

Merged

xhluca merged 8 commits into main from xhluca-add-error-for-empty-string on Oct 18, 2024

Conversation

@xhluca (Owner) commented Oct 14, 2024

Closes #60

@xhluca (Owner, Author) commented Oct 14, 2024

@zhuwenxing does this tackle the issue you encountered? Feel free to contribute tests for that specific instance.

@zhuwenxing commented

Hi @xhluca. From my perspective, raising an error when tokenizing an empty string is not good practice. Real-world data is often imperfect and contains null values (or empty strings), so it's normal for a corpus to include them.

Raising an error for empty strings would frustrate developers, who would then have to clean the data before using bm25s.

Additionally, I experimented with Elasticsearch: inserting data with empty strings didn't cause any issues, and even if all the documents were empty strings, searching didn't raise an error.
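
For reference, a minimal sketch of what I tried (assuming a local Elasticsearch instance and the elasticsearch-py 8.x client; the demo index and text field names are just illustrative):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index documents that are entirely empty strings
for i in range(4):
    es.index(index="demo", id=i, document={"text": ""})
es.indices.refresh(index="demo")

# Searching succeeds: it simply returns zero hits rather than raising
resp = es.search(index="demo", query={"match": {"text": "meaning of life"}})
print(resp["hits"]["hits"])  # -> []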

@xhluca (Owner, Author) commented Oct 17, 2024

You are right. I've removed the ValueError and instead made the empty string an accepted vocab token.

Can you let me know if that works on your end?
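
Concretely, something like this should now work (a minimal sketch; the exact printed representation may differ):

from bm25s.tokenization import Tokenizer

tokenizer = Tokenizer()
# Previously the empty string raised an error; it is now mapped
# to a dedicated entry in the vocabulary instead.
tokens = tokenizer.tokenize(["", "hello world"], return_as="tuple", allow_empty=True)
print(tokens)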

@xhluca changed the title from "Raise error when tokenizer encounters empty strings" to "Make empty strings an acceptable token" on Oct 17, 2024

@xhluca (Owner, Author) commented Oct 18, 2024

@zhuwenxing I tried running your example, but there were a few things I had to change. I also removed jieba to make it more generic (since the problem occurs in English as well). Here's the updated code:

import bm25s
from bm25s.tokenization import Tokenizer

# Create a corpus consisting only of empty strings
corpus = ["", "", "", ""]
# Create a list of queries
queries = ["what is the meaning of life?"]

# Tokenize the corpus (returned as a tuple of token ids and vocab)
tokenizer = Tokenizer()
corpus_tokens = tokenizer.tokenize(corpus, return_as="tuple", allow_empty=True)
print("Corpus tokens:", corpus_tokens)

query_tokens = tokenizer.tokenize(queries, return_as="ids", update_vocab=False, allow_empty=True)
print(f"Query tokens: {query_tokens}")

retriever = bm25s.BM25()
retriever.index(corpus_tokens)

results, scores = retriever.retrieve(query_tokens, corpus=corpus, k=2)
print("Results:", results)

If you run the updated code, you will get an error with the latest released version of this library; however, if you run it after commit c87f2fe of this branch, the error is resolved.

Anyway, let me know if this is the behavior you were expecting.
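
As a further sanity check, here is a minimal variation of the example above with a mixed corpus (same API; the empty documents should simply never outrank matching ones):

import bm25s
from bm25s.tokenization import Tokenizer

# Mixed corpus: real documents alongside empty strings
corpus = [
    "a cat is a feline and likes to purr",
    "",
    "a fish is a creature that lives in water",
    "",
]
queries = ["does the fish purr like a cat?"]

tokenizer = Tokenizer()
corpus_tokens = tokenizer.tokenize(corpus, return_as="tuple", allow_empty=True)
query_tokens = tokenizer.tokenize(queries, return_as="ids", update_vocab=False, allow_empty=True)

retriever = bm25s.BM25()
retriever.index(corpus_tokens)

# The two non-empty documents should come back first
results, scores = retriever.retrieve(query_tokens, corpus=corpus, k=2)
print("Results:", results)
print("Scores:", scores)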

@xhluca merged commit e06ecf4 into main on Oct 18, 2024 (2 checks passed)
@xhluca deleted the xhluca-add-error-for-empty-string branch on October 18, 2024 at 20:57
Successfully merging this pull request may close these issues: Index out of bounds errors in 0.2.0 and 0.2.1