Stanza's sentencizer only works when processors = 'tokenize,pos,lemma,depparse' #57

Open
namiyousef opened this issue Feb 3, 2021 · 1 comment

Comments

@namiyousef

Hi all,

I started an NLP project where I needed high-accuracy sentence segmentation, and therefore decided to use Stanza.

I was thrilled to find this library, since spaCy is quite intuitive. However, I found that the sentence segmentation only gets carried over into spaCy under certain conditions.

Baseline:

The baseline test is to use the Stanza model alone to see if the sentence segmentation works.

This is the simplest model that I could use: I simply turned on the tokenize processor.

Screenshot 2021-02-03 at 18 57 31
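
A baseline of this kind could look roughly like the sketch below. It is not the exact code from the screenshot: the function name `stanza_sentences` is hypothetical, English is assumed as the language, and the model is assumed to have been fetched beforehand with `stanza.download("en")`.

```python
def stanza_sentences(text, lang="en"):
    """Return the sentence strings found by Stanza's tokenize processor alone."""
    # Imported inside the function so the sketch can be loaded
    # even on a machine where Stanza is not installed.
    import stanza

    # Only the tokenize processor is enabled; in Stanza it performs
    # both tokenization and sentence segmentation.
    nlp = stanza.Pipeline(lang=lang, processors="tokenize")
    doc = nlp(text)

    # Each entry of doc.sentences is one segmented sentence.
    return [sentence.text for sentence in doc.sentences]
```

Calling `stanza_sentences("This is one sentence. This is another.")` should return one string per sentence that Stanza detects.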

Test with Spacy-Stanza:

I then tried the same thing, but this time added the spacy-stanza wrapper.

Screenshot 2021-02-03 at 18 58 00

As shown above, the text was not actually split into sentences.

Test with spacy-stanza with more processors on Stanza:

Screenshot 2021-02-03 at 18 56 23

It seems that the depparse processor is necessary, but this is rather confusing since the vanilla Stanza model does not require it to segment sentences.
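
For reference, the working configuration might be written as the following sketch. It assumes the spacy-stanza v0.2.x API (`StanzaLanguage`) together with a Stanza release from that period; newer releases of both libraries differ. The function name `spacy_stanza_sentences` is hypothetical, and the processor list is the one from the issue title.

```python
def spacy_stanza_sentences(text, lang="en"):
    """Return sentence strings from a spaCy Doc built by the spacy-stanza wrapper."""
    # Imported inside the function so the sketch can be loaded
    # without either library installed.
    import stanza
    from spacy_stanza import StanzaLanguage  # assumption: spacy-stanza v0.2.x

    # depparse is included because the wrapper only exposes sentence
    # boundaries in the spaCy Doc when dependency parses are present.
    snlp = stanza.Pipeline(lang=lang, processors="tokenize,pos,lemma,depparse")
    nlp = StanzaLanguage(snlp)
    doc = nlp(text)

    # doc.sents iterates over the sentence spans of the spaCy Doc.
    return [sent.text for sent in doc.sents]
```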

Any help would be appreciated :)

@adrianeboyd
Contributor

Yes, your analysis is correct. A typical spacy pipeline sets the boundaries from the dependency parses (the transition-based parser decides where to set the sentence breaks), so we've set up the wrapper here to work the same way even though the sentence boundaries come from tokenize and not depparse.

It would be a problem to have separate sentence boundaries that potentially conflict with the parses (the Doc can't store both), but here we know that they're consistent because they're only coming from one source in the pipeline.

I'm not sure that there's much benefit to using stanza just for sentence segmentation (I'd be interested to hear about the use case where it's a lot better?) and I'm not sure we want to make this change to spacy-stanza v0.2.x at this point, but here's what it could look like:

https://github.com/adrianeboyd/spacy-stanza/tree/feature/sent-starts
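
The general idea of taking sentence boundaries from the tokenizer rather than from the parse can be illustrated without either library. The sketch below is not the code in that branch; the helper names `sent_start_flags` and `group_by_sent_start` are hypothetical. It assumes the tokenizer returns a list of sentences, each a list of tokens, and shows how that maps to one "starts a sentence" flag per token and back again.

```python
def sent_start_flags(sentences):
    """Flatten tokenizer output into one boolean per token.

    `sentences` is a list of sentences, each a list of token strings.
    A flag is True when the token is the first token of its sentence.
    """
    flags = []
    for sentence in sentences:
        for index, _token in enumerate(sentence):
            flags.append(index == 0)
    return flags


def group_by_sent_start(tokens, flags):
    """Rebuild the sentences from a flat token list and its start flags."""
    grouped = []
    for token, is_start in zip(tokens, flags):
        # Open a new sentence at every start flag (and for the very first token).
        if is_start or not grouped:
            grouped.append([])
        grouped[-1].append(token)
    return grouped


# Hypothetical tokenizer output for a two-sentence text.
sentences = [["Hello", "world", "."], ["How", "are", "you", "?"]]
tokens = [token for sentence in sentences for token in sentence]
flags = sent_start_flags(sentences)
rebuilt = group_by_sent_start(tokens, flags)
```

Because the flags come from a single source, the rebuilt sentences are identical to the tokenizer's own segmentation, so there is nothing for them to conflict with.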
