
Newlines in Tfidf vectorizer corpus cause runtime exceptions when loading a trained vectorizer #263

Open
grant-miller-faire opened this issue Oct 12, 2023 · 1 comment
Labels
bug Something isn't working

Comments

@grant-miller-faire

Description

When using the PECOS Tfidf vectorizer, training on a corpus that contains newlines causes an error when the saved vectorizer is loaded. The error occurs because the vocab file is parsed using newlines to delimit (index, vocab) pairs; if a vocab entry itself contains a newline, that entry spans multiple lines and loading crashes.
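For illustration, here is a minimal sketch of that delimiter collision; the "<index>\t<token>" line format below is an assumption for the example, not the actual PECOS file layout:

# Hypothetical line-based vocab format: one "<index>\t<token>" entry per line.
entries = {0: "test\ncorpus"}  # a token that itself contains a newline
serialized = "".join(f"{idx}\t{tok}\n" for idx, tok in entries.items())
# A line-based parser now sees two lines for what should be a single entry:
print(serialized.splitlines())  # ['0\ttest', 'corpus']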

How to Reproduce?

Using the latest version of libpecos.

Steps to reproduce

Python 3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:23:14) [GCC 10.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pecos.utils.featurization.text.vectorizers import Tfidf
>>> vectorizer = Tfidf()
>>> trained = vectorizer.train(["test\ncorpus"], config={'ngram_range':(1,1)})
>>> trained.save('test')
>>> Tfidf.load('test')
terminate called after throwing an instance of 'std::runtime_error'
  what():  Corrupted vocab file.
Aborted

What have you tried to solve it?

  1. This can be worked around by cleaning the input (see the sketch below), but it may be desirable to handle this case internally so that corpora where newlines are meaningful do not break the vectorizer.
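A minimal pre-cleaning sketch based on the reproduction above; the whitespace normalization step is an assumption about an acceptable workaround, not part of PECOS:

from pecos.utils.featurization.text.vectorizers import Tfidf

# Collapse all whitespace (including newlines) to single spaces before training,
# so no vocab token can contain "\n".
raw_corpus = ["test\ncorpus"]
clean_corpus = [" ".join(doc.split()) for doc in raw_corpus]

vectorizer = Tfidf()
trained = vectorizer.train(clean_corpus, config={'ngram_range': (1, 1)})
trained.save('test')
reloaded = Tfidf.load('test')  # should load without hitting the corrupted-vocab error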

Error message or code output


terminate called after throwing an instance of 'std::runtime_error'
  what():  Corrupted vocab file.
Aborted

Environment

  • Operating system:
  • Python version: 3.10
  • PECOS version: 1.2
grant-miller-faire added the bug label on Oct 12, 2023
@jiong-zhang
Contributor

Hi @grant-miller-faire, thanks for reporting this issue.

  1. The TF-IDF word-level tokenizer expects an already space-separated text corpus and will not use any other separator.
  2. However, it is true that, due to the current serialization design, TF-IDF cannot handle \n within a token. We are planning to upgrade the serialization format in upcoming releases.
