pipe(): ValueError Error parsing doc #429
Same issue here with the German model and |
I just tried this again and it seems to work now (reinstalled spaCy and the German model). @blang can you confirm? @syllog1sm Out of curiosity: has there been an update to the German model which fixed this, or was it a code change? |
Same problem here, working on Windows 10 with German text. I thought it was German that made it break. I also reinstalled spaCy and the German model yesterday, but this didn't fix the problem in my case. I then tried to break it down to a specific sentence, but even after removing this and successively the following sentences from my texts, the problem remained the same. |
I also had this problem with English text; it looks like a parser issue. Steps to reproduce:

```python
def texts():
    yield "11th September 11 years ago I started my first business"

for doc in nlp.pipe(texts()):
    pass
```

raises:

Curious thing: if you add a comma like this:

the error goes away. Directly doing:

is ok in both cases. |
I think the issue is arising because the entity recogniser's push-down automaton finds itself in a state with no continuations. I haven't stepped through the automaton yet (if you want to do that, use the method). I'm afraid that 1.0 might paper over this problem, because the matcher won't set entities by default anymore; this will be up to the user's control (there's a better API for customising the pipeline, though). |
This might have something to do with state kept between the documents (I don't know what, if any, is kept). I was having this exact issue with the German model and decided to just randomly shuffle the corpus to see what happens. At first errors were still being produced, but after a few shuffles of the corpus the errors, to my surprise, went away. I'll try to produce a more reliable report of the behaviour.

Update: so the error going away didn't have anything to do with

An example document where
Tested on |
I think I have this taken care of, but I'm not 100% sure. Please reopen if it reoccurs. |
FYI it did happen for me with 1.1.0, but so far I cannot provide any steps to reproduce it. The text it tried to parse isn't relevant:

PS: Now I'm getting |
Do you have a minute to video chat about this? If so click here: https://appear.in/spacy_issue429 |
Sorry, my internet isn't good for video chatting, but I'm happy to text. |
No worries. If you're getting a segfault, the handiest thing to do would be to break out the pipeline manually. Instead of:
You can do:
Then you can investigate what's going on. |
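The manual breakdown suggested above can be wrapped in a small debugging helper. This is a hedged, library-free sketch, not spaCy API: `run_pipeline_stepwise` is a hypothetical name, and the components passed in would, under the spaCy 1.x API, be callables such as `nlp.tagger`, `nlp.parser` and `nlp.entity` applied to `doc = nlp.tokenizer(text)`.

```python
def run_pipeline_stepwise(doc, components):
    # Apply each pipeline component to the doc in turn, reporting which
    # one fails instead of letting the whole nlp(text) call blow up.
    for component in components:
        try:
            component(doc)
        except Exception as exc:
            name = getattr(component, "name", repr(component))
            print("failed at %s: %r" % (name, exc))
            raise
    return doc
```

For example, `run_pipeline_stepwise(nlp.tokenizer(text), [nlp.tagger, nlp.parser, nlp.entity])` would show whether the tagger, parser or entity recogniser is the component that dies; for a hard segfault, each step would need to run in a separate process.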
The segfault is caused by the matcher. The number of matches I have is up to a million; the Python process eats about 4 GB of RAM, and there's still room for it to grow. I could investigate this later, maybe in another issue. I'm trying to narrow the scope of the ParserStateError right now. |
Hmm. Is the match proliferation expected for your use-case? |
```python
import spacy
from spacy.attrs import ORTH

nlp = spacy.load('en')

def merge_phrases(matcher, doc, i, matches):
    # Wait until the last match so the full match list is available
    if i != len(matches) - 1:
        return None
    spans = [(ent_id, label, doc[start:end]) for ent_id, label, start, end in matches]
    for ent_id, label, span in spans:
        span.merge('NNP' if label else span.root.tag_, span.text, nlp.vocab.strings[label])

doc = nlp('a')
nlp.matcher.add('key', label='TEST', attrs={}, specs=[[{ORTH: 'a'}]], on_match=merge_phrases)
doc = nlp('a b')  # ->
```
re:

> Hmm. Is the match proliferation expected for your use-case?

|
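A common pitfall with merge-in-callback code like the snippet above is that merging a span changes token offsets, which invalidates any later `(start, end)` pairs. As a library-free illustration of the safe pattern, here is a hypothetical helper (`merge_spans` is not a spaCy function) that merges from the end of the token list backwards so earlier offsets stay valid:

```python
def merge_spans(tokens, spans):
    # tokens: list of token strings; spans: list of (start, end) pairs.
    # Merging a span shifts every offset after it, so process the spans
    # from the end of the list backwards to keep earlier offsets valid.
    for start, end in sorted(spans, reverse=True):
        tokens[start:end] = [" ".join(tokens[start:end])]
    return tokens
```

The same reasoning applies to `span.merge()`: collecting all spans first and merging them back-to-front avoids reading stale offsets.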
I can easily make the matches list a numpy array if necessary. A segfault via the Python API (as opposed to the Cython API) is always a bug. So yes, please open an issue. |
I'll do it tomorrow, once I know the steps to reproduce it. I guess you now have enough info for the bug related to the current issue.

On Thu, Oct 27, 2016, 21:45 Matthew Honnibal [email protected]
|
Yes. I have the test set up and I'm pretty sure I understand the problem now. Fix should be out soon. |
…cogniser that adds missing entity types. This is a messy place to add this, because it's strange to have the method mutate state. A better home for this logic could be found.
I think this should fix the segfault too — I think they were related. Closing for now. Again, if it reoccurs, don't hesitate to reopen :) |
UPDATE: I was able to get around this by converting multiple spaces to a single space. Not sure if this was an issue with my string or with spaCy's processing.

@honnibal A spaCy error told me to reopen this thread :-/ Not sure I have the rights to do that, but here's the text:
|
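The multiple-spaces workaround mentioned above can be sketched as a small normalisation step. This assumes collapsing all whitespace runs is acceptable for the use-case; `normalize_whitespace` is a hypothetical helper, not part of spaCy:

```python
import re

def normalize_whitespace(text):
    # Collapse any run of whitespace (spaces, tabs, newlines) into a
    # single space before handing the text to the parser.
    return re.sub(r"\s+", " ", text).strip()

# e.g. doc = nlp(normalize_whitespace(raw_text))
```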
Hi, we are getting a parser state error. Here is the trace:

Traceback (most recent call last):

Here is our test: |
I'm afraid I'm getting this, too, in version 1.5.0:
All was fine until I added some matcher rules and an on_match callback:
where unit is 'BOPD', for example. The on_match callback is being called. |
Got this error in version 1.7.3:

Traceback (most recent call last):

I am using a customized tokenizer that merges the three tokens 'Linux', '.' and 'Mirai' into one token. |
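In spaCy, a case like 'Linux.Mirai' would usually be handled with a tokenizer special case (`tokenizer.add_special_case`) rather than post-processing. As a library-free illustration of the merge itself, here is a hypothetical helper (`merge_dotted_names` is not spaCy API) that re-joins word, '.', word triples:

```python
def merge_dotted_names(tokens):
    # Re-join triples like ['Linux', '.', 'Mirai'] into ['Linux.Mirai'].
    out = []
    i = 0
    while i < len(tokens):
        if (i + 2 < len(tokens)
                and tokens[i + 1] == "."
                and tokens[i].isalpha()
                and tokens[i + 2].isalpha()):
            out.append(tokens[i] + "." + tokens[i + 2])
            i += 3
        else:
            out.append(tokens[i])
            i += 1
    return out
```

Note that, per this comment, merging tokens in a custom tokenizer can apparently put the parser into the bad state this issue describes, so the special-case route is likely safer.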
I'm also running into this issue on 1.8.2, though only after processing multiple documents in parallel:

Edit: I think it's just the parallelization, that's not done by |
I found strange behaviour using the `pipe()` method (only verified on the German model): if you parse a document using `pipe()` you can get a ValueError, while `nlp(text)` works fine. I boiled it down to single words: German words work, but English words like 'windows' don't.

Steps to reproduce:

Trace:

If you use `nlp("Windows")` it works fine. Also, if you execute `nlp("Windows")` before the same `pipe()` call, `pipe()` does not raise an exception (a dictionary is built?).

Versions:

Maybe this is related to this region: syntax/parser.pyx
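When `pipe()` fails somewhere inside a stream, a quick way to localise the trigger is to fall back to per-text processing and record which inputs raise. This is a hedged sketch: `find_failing_texts` is a hypothetical helper, and `process` stands in for `nlp` or a wrapper around it.

```python
def find_failing_texts(texts, process):
    # Run `process` on each text individually; collect the texts that
    # raise, together with the error message, instead of stopping at
    # the first failure the way a single pipe() loop would.
    failures = []
    for text in texts:
        try:
            process(text)
        except ValueError as exc:
            failures.append((text, str(exc)))
    return failures

# e.g. find_failing_texts(["Windows", "Fenster"], nlp)
```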