
Make empty strings an acceptable token #67

Merged

xhluca merged 8 commits into main from xhluca-add-error-for-empty-string on Oct 18, 2024

Conversation

@xhluca (Owner) commented Oct 14, 2024

Closes #60

@xhluca (Owner, Author) commented Oct 14, 2024

@zhuwenxing does this tackle the issue you encountered? Feel free to contribute tests for that specific instance.

@zhuwenxing commented

Hi @xhluca. From my perspective, raising an error when tokenizing an empty string is not good practice. Real-world data is often imperfect and contains null values (or empty strings), so it's normal for a corpus to include them.

Raising an error for empty strings would frustrate developers, who would then have to clean the data before using bm25s.

Additionally, I experimented with Elasticsearch: inserting data with empty strings didn't cause any issues, and even if all the documents were empty strings, searching didn't raise an error.
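
For reference, a minimal sketch of what I tried (assuming a local Elasticsearch instance and the elasticsearch-py 8.x client; the demo index and text field names are just illustrative):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index documents that are entirely empty strings
for i in range(4):
    es.index(index="demo", id=i, document={"text": ""})
es.indices.refresh(index="demo")

# Searching succeeds: it simply returns zero hits rather than raising
resp = es.search(index="demo", query={"match": {"text": "meaning of life"}})
print(resp["hits"]["hits"])  # -> []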

@xhluca (Owner, Author) commented Oct 17, 2024

You are right. I've removed the ValueError and instead made the empty string an accepted vocab token.

Can you let me know if that works on your end?
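
Concretely, something like this should now work (a minimal sketch; the exact printed representation may differ):

from bm25s.tokenization import Tokenizer

tokenizer = Tokenizer()
# Previously the empty string raised an error; it is now mapped
# to a dedicated entry in the vocabulary instead.
tokens = tokenizer.tokenize(["", "hello world"], return_as="tuple", allow_empty=True)
print(tokens)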

@xhluca changed the title from "Raise error when tokenizer encounters empty strings" to "Make empty strings an acceptable token" on Oct 17, 2024

@xhluca (Owner, Author) commented Oct 18, 2024

@zhuwenxing I tried running your example, but there were a few things I had to change. I also removed jieba to make it more generic (since the problem occurs in English as well). Here's the updated code:

import bm25s
from bm25s.tokenization import Tokenizer

# Create a corpus consisting only of empty strings
corpus = ["", "", "", ""]
# Create a list of queries
queries = ["what is the meaning of life?"]

# Tokenize the corpus (returned as a tuple of token ids and vocab)
tokenizer = Tokenizer()
corpus_tokens = tokenizer.tokenize(corpus, return_as="tuple", allow_empty=True)
print("Corpus tokens:", corpus_tokens)

query_tokens = tokenizer.tokenize(queries, return_as="ids", update_vocab=False, allow_empty=True)
print(f"Query tokens: {query_tokens}")

retriever = bm25s.BM25()
retriever.index(corpus_tokens)

results, scores = retriever.retrieve(query_tokens, corpus=corpus, k=2)
print("Results:", results)

If you run the updated code, you will get an error with the latest released version of this library; however, if you run it after commit c87f2fe of this branch, the error is resolved.

Anyway, let me know if this is the behavior you were expecting.
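
As a further sanity check, here is a minimal variation of the example above with a mixed corpus (same API; the empty documents should simply never outrank matching ones):

import bm25s
from bm25s.tokenization import Tokenizer

# Mixed corpus: real documents alongside empty strings
corpus = [
    "a cat is a feline and likes to purr",
    "",
    "a fish is a creature that lives in water",
    "",
]
queries = ["does the fish purr like a cat?"]

tokenizer = Tokenizer()
corpus_tokens = tokenizer.tokenize(corpus, return_as="tuple", allow_empty=True)
query_tokens = tokenizer.tokenize(queries, return_as="ids", update_vocab=False, allow_empty=True)

retriever = bm25s.BM25()
retriever.index(corpus_tokens)

# The two non-empty documents should come back first
results, scores = retriever.retrieve(query_tokens, corpus=corpus, k=2)
print("Results:", results)
print("Scores:", scores)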

@xhluca merged commit e06ecf4 into main on Oct 18, 2024 (2 checks passed)
@xhluca deleted the xhluca-add-error-for-empty-string branch on October 18, 2024 at 20:57
Successfully merging this pull request may close these issues: Index out of bounds errors in 0.2.0 and 0.2.1