Stanza's sentencizer only works when `processors = 'tokenize,pos,lemma,depparse'` #57

namiyousef · 2021-02-03T18:03:46Z

Hi all,

I started an NLP project where I needed high-accuracy sentence segmentation, and therefore decided to use Stanza.

I was thrilled to find this library, since spaCy is quite intuitive. However, I found that the sentence segmentation only carries over into spaCy under certain conditions.

Baseline:

The baseline test uses the Stanza model on its own to check whether the sentence segmentation works.

This is the simplest model that I could use: I enabled only the tokenize processor.
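The original snippet is not preserved in this thread, so the following is only a minimal sketch of what such a baseline might look like. It assumes the English model has already been downloaded (for example with `stanza.download("en")`); the helper names `sentence_texts` and `run_baseline` are made up for illustration.

```python
def sentence_texts(doc):
    """Return the text of each sentence in a Stanza Document."""
    return [sentence.text for sentence in doc.sentences]


def run_baseline(text):
    """Segment `text` with Stanza alone, using only the tokenize processor."""
    # Imported here so that sentence_texts() can be used without Stanza installed.
    import stanza

    # In Stanza, the tokenize processor also performs sentence segmentation.
    snlp = stanza.Pipeline(lang="en", processors="tokenize")
    return sentence_texts(snlp(text))
```

Calling `run_baseline("This is one sentence. This is another one.")` should return one string per sentence that Stanza detects.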

Test with spacy-stanza:

I then tried the same thing, but this time added the spacy-stanza wrapper.

As shown above, the text was not actually split into sentences.

Test with spacy-stanza with more processors on Stanza:

It seems that the depparse processor is necessary, but this is rather confusing since the vanilla stanza model does not require it to tokenize.
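The snippets for the two wrapper tests are also missing, so here is a rough sketch of how the comparison could be reproduced. It assumes spacy-stanza v0.2.x (where the wrapper class is `StanzaLanguage`) and an already downloaded English model; `count_sentences` and `compare_processors` are hypothetical helper names. Depending on the spaCy version, iterating `doc.sents` without sentence boundaries may raise a `ValueError` rather than return a single span, so the helper handles that case.

```python
def count_sentences(doc):
    """Count the sentences in a spaCy Doc, or return None if none are set."""
    try:
        return len(list(doc.sents))
    except ValueError:
        # spaCy can raise ValueError when sentence boundaries are unavailable.
        return None


def compare_processors(text):
    """Run the wrapper with and without depparse and count sentences."""
    # Imported here so that count_sentences() works without these packages.
    import stanza
    from spacy_stanza import StanzaLanguage  # spacy-stanza v0.2.x API

    results = {}
    for processors in ("tokenize", "tokenize,pos,lemma,depparse"):
        snlp = stanza.Pipeline(lang="en", processors=processors)
        nlp = StanzaLanguage(snlp)
        results[processors] = count_sentences(nlp(text))
    return results
```

If the behaviour described above holds, only the entry for `"tokenize,pos,lemma,depparse"` should report more than one sentence for a multi-sentence text.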

Any help would be appreciated :)


adrianeboyd · 2021-02-04T16:27:20Z

Yes, your analysis is correct. A typical spacy pipeline sets the boundaries from the dependency parses (the transition-based parser decides where to set the sentence breaks), so we've set up the wrapper here to work the same way even though the sentence boundaries come from tokenize and not depparse.
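To make the link between parses and sentence boundaries concrete, here is a simplified, library-free sketch (not spaCy's actual implementation). It assumes each token stores the index of its head, that a root points to itself, and that the trees are well formed; the function name is made up for illustration. A new sentence starts wherever the root changes, so without any heads there is nothing to derive boundaries from.

```python
def sentence_starts_from_heads(heads):
    """Return one flag per token: True if the token starts a sentence.

    heads[i] is the index of the head of token i; a root is its own head.
    Assumes well-formed trees (no cycles) covering contiguous spans.
    """

    def root_of(index):
        # Follow head links upwards until reaching a self-pointing root.
        while heads[index] != index:
            index = heads[index]
        return index

    starts = []
    previous_root = None
    for index in range(len(heads)):
        root = root_of(index)
        # A change of root means a different tree, hence a new sentence.
        starts.append(root != previous_root)
        previous_root = root
    return starts


# "I slept . She ran ." with roots at "slept" (1) and "ran" (4).
print(sentence_starts_from_heads([1, 1, 1, 4, 4, 4]))
# → [True, False, False, True, False, False]
```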

It would be a problem to have separate sentence boundaries that potentially conflict with the parses (the Doc can't store both), but here we know that they're consistent because they're only coming from one source in the pipeline.

I'm not sure that there's much benefit to using stanza just for sentence segmentation (I'd be interested to hear about the use case where it's a lot better?) and I'm not sure we want to make this change to spacy-stanza v0.2.x at this point, but here's what it could look like:

https://github.com/adrianeboyd/spacy-stanza/tree/feature/sent-starts
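As a rough illustration of the general idea (this is not the code from the branch above), sentence starts could be taken directly from the tokenizer's sentence grouping instead of from the parse. The sketch below is plain Python with a hypothetical function name; it assumes the tokenizer output is available as a list of sentences, each a list of token strings.

```python
def sent_start_flags(sentences):
    """Flatten tokenized sentences into one sentence-start flag per token.

    `sentences` is a list of sentences, each a list of token strings.
    The first token of every sentence is flagged True, all others False.
    """
    flags = []
    for sentence in sentences:
        for index, _token in enumerate(sentence):
            flags.append(index == 0)
    return flags


print(sent_start_flags([["Hello", "."], ["Bye", "now", "."]]))
# → [True, False, True, False, False]
```

Flags like these could then be assigned to the tokens of a `Doc` even when no dependency parse is present, which is what would let `tokenize` alone provide sentence boundaries.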
