Issues: huggingface/tokenizers
Issues list
Suggestion: speed improvement - hashmap implementation
#1722
opened Jan 15, 2025 by
royweiss1
updated Jan 15, 2025
Request for pre-tokenizer that creates words based length alone.
#1697
opened Dec 10, 2024 by
filbeofITK
updated Jan 12, 2025
batch_encode_plus doesn't work correctly
#1704
opened Dec 18, 2024 by
tempdeltavalue
updated Jan 11, 2025
if split_special_tokens==True,fast_tokenizer is slower than slow_tokenizer
#1700
opened Dec 12, 2024 by
gongel
updated Jan 10, 2025
Cannot find package 'tokenizers-linux-x64-musl' - Alpine support
#1703
opened Dec 14, 2024 by
PylotLight
updated Jan 10, 2025
Memory leak is observed when using the AutoTokenizer and AutoModel with Python 3.10.*
#1706
opened Dec 27, 2024 by
KhoiTrant68
updated Jan 9, 2025
2 of 4 tasks
out of memory when training a BBPE tokenizer on a large corpus
#1681
opened Nov 14, 2024 by
yucc-leon
updated Dec 31, 2024
[building on windows] onig_sys/oniguruma two or more data types in declaration specifiers
#1581
opened Jul 29, 2024 by
louis030195
updated Dec 25, 2024
Unable to Load Custom GPT2 Tokenizer - " data did not match any variant of untagged enum ModelWrapper at line 1 column 3193814" Error
#1562
opened Jul 5, 2024 by
maghwa
updated Dec 22, 2024
Tokenizer Training Errors: pyo3_runtime.PanicException: called Result::unwrap() on an Err value: TryFromIntError(())
#1698
opened Dec 10, 2024 by
Chimaco37
updated Dec 11, 2024
How to determine the splicing logic in post_processor based on the sentence to be tokenized?
#1696
opened Dec 5, 2024 by
gongel
updated Dec 5, 2024
NormalizedString.clear() broken?
bug
#1636
opened Sep 25, 2024 by
lkurlandski
updated Nov 30, 2024
Bug: is_pretokenized is not used when calling tokenizer.encode(...)
#1695
opened Nov 29, 2024 by
jannessm
updated Nov 29, 2024
Question: Shrinking Tokenizer Vocabulary for Reduced Memory Consumption with Pre-Trained Model (LLaMA) Fine-Tuning
#1686
opened Nov 23, 2024 by
Amerehei
updated Nov 23, 2024
wikitext-103-raw-v1.zip is not available on the amazonaws anymore
#1683
opened Nov 18, 2024 by
gec1-dev
updated Nov 18, 2024
Option to disable cache for FromPretrained and FromFile
Feature Request
#1680
opened Nov 12, 2024 by
daulet
updated Nov 15, 2024
Reduce vocab size for BPE tokenizer
Feature Request
#1668
opened Oct 29, 2024 by
fzyzcjy
updated Oct 30, 2024
Inconsistent behaviour of PreTrainedTokenizerFasts on diacritics marked texts
bug
#1663
opened Oct 11, 2024 by
sven-nm
updated Oct 22, 2024
2 of 4 tasks
docs-check.yml uses node12 which is deprecated
#1658
opened Oct 17, 2024 by
hamirmahal
updated Oct 17, 2024