Description
When using the PECOS Tfidf vectorizer, training on a corpus that contains newline characters causes an error when the saved model is later loaded. The failure occurs because the vocab file is parsed using newlines to delimit (index, token) pairs; if a token itself contains a newline, its entry spans multiple lines and loading crashes.
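A minimal sketch of the failure, assuming the Vectorizer train/save/load API from pecos.utils.featurization.text.vectorizers (exact config keys may differ between versions):

from pecos.utils.featurization.text.vectorizers import Vectorizer

# One document contains a token with an embedded newline; the word-level
# tokenizer splits on spaces only, so "bad\ntoken" survives as a single token.
corpus = ["hello world", "bad\ntoken here"]

model = Vectorizer.train(corpus, config={"type": "tfidf", "kwargs": {}})
model.save("./tfidf-model")

# Reloading aborts with "Corrupted vocab file."
model = Vectorizer.load("./tfidf-model")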
This is solvable by cleaning the input (see the sketch below), but it may be desirable to handle the case internally so that corpora in which newlines are meaningful do not break the vectorizer.
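One workaround sketch: since the on-disk vocab is newline-delimited, normalizing newlines to spaces before training avoids the crash:

# Replace newlines with spaces so no token can contain one.
clean_corpus = [doc.replace("\n", " ") for doc in corpus]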
Error message or code output
terminate called after throwing an instance of 'std::runtime_error'
what(): Corrupted vocab file.
Aborted
Environment
Operating system:
Python version: 3.10
PECOS version: 1.2
The TF-IDF word-level tokenizer expects an already space-separated text corpus and will not use any other separator.
However, it is true that, due to the current serialization design, TF-IDF cannot handle \n within a token. We are planning to upgrade the serialization format in upcoming releases.
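To illustrate the failure mode, here is a sketch of a line-delimited vocab file; the exact on-disk layout is an assumption based on the description above, not the actual C++ serialization:

# Hypothetical line-delimited vocab: one "index<TAB>token" pair per line.
vocab = {0: "hello", 1: "bad\ntoken"}
serialized = "\n".join(f"{i}\t{tok}" for i, tok in vocab.items())

# The embedded newline splits entry 1 across two lines, so a line-based
# parser sees a malformed record and reports "Corrupted vocab file".
for line in serialized.split("\n"):
    print(repr(line))  # '0\thello', '1\tbad', 'token'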