v1.7.0: Neural coref!
Neural coref processor added!
- Conjunction-Aware Word-Level Coreference Resolution (https://arxiv.org/abs/2310.06165); original implementation: https://github.com/KarelDO/wl-coref/tree/master
- an updated form of Word-Level Coreference Resolution (https://aclanthology.org/2021.emnlp-main.605/); original implementation: https://github.com/vdobrovolskii/wl-coref
If you use Stanza's coref module in your work, please be sure to cite both of the above papers.
Special thanks to vdobrovolskii, who graciously agreed to allow his work to be integrated into Stanza, to @KarelDO for supporting the integration of his training enhancement, and to @Jemoka for the LoRA PEFT integration, which makes finetuning the transformer-based coref annotator much less expensive.
There is currently one model provided: a transformer-based English model trained on OntoNotes. It is based on Electra-Large, as that is more harmonious with the rest of our transformer architecture. Once we have LoRA integration with POS, depparse, and the other processors, we will revisit the question of which transformer is most appropriate for English.
Future work includes ZH and AR models from OntoNotes, additional language support from UD-Coref, and lower-cost non-transformer models.
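A minimal usage sketch follows; the processor name and the `doc.coref` attribute reflect our reading of the new API, so please check the coref documentation for the exact output format:

```python
# Sketch of running the new coref processor. This downloads the English
# models (including a large transformer) on first use; the doc.coref
# attribute holding the predicted chains is our understanding of the API.
import stanza

nlp = stanza.Pipeline("en", processors="tokenize,coref")
doc = nlp("Chris visited Stanford last week. He said the campus was beautiful.")

# each chain groups the mentions of one entity across the document
for chain in doc.coref:
    print(chain)
```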
Interface change: English MWT
English now has an MWT model by default. Text such as `won't` is now marked as a single token, split into two words, `will` and `not`. Previously, it was expected to be tokenized into two pieces, but the `Sentence` object containing that text would not have a single `Token` object connecting the two pieces. See https://stanfordnlp.github.io/stanza/mwt.html and https://stanfordnlp.github.io/stanza/data_objects.html#token for more information.
Code that used to operate with `for word in sentence.words` will continue to work as before, but `for token in sentence.tokens` will now produce one object for MWTs such as `won't`, `cannot`, `Stanza's`, etc.
Pipeline creation will not change, as MWT is automatically (but not silently) added at `Pipeline` creation time if the language and package include MWT.
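A sketch of the new behavior (this downloads the default English models on first use):

```python
# Sketch of the new English MWT behavior; requires the English models.
import stanza

nlp = stanza.Pipeline("en", processors="tokenize,mwt")
doc = nlp("I won't do that.")

for token in doc.sentences[0].tokens:
    # an MWT such as "won't" is now one Token containing multiple Words
    print(token.text, [word.text for word in token.words])
```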
Other updates
- NetworkX representation of enhanced dependencies, allowing easier use of Semgrex on enhanced dependencies. Searching over enhanced dependencies requires CoreNLP >= 4.5.6. #1295 #1298
- Sentence-ending punct tags improved for English to avoid labeling non-punct tokens as punct (and POS switched to using a DataLoader). #1000 #1303
- Optional rewriting of MWTs after the MWT processing step, giving the user more control over fixing common errors. We still encourage posting issues on GitHub so we can fix them for everyone! #1302
- Removed deprecated output methods such as `conll_as_string` and `doc2conll_text`. Use `"{:C}".format(doc)` instead. e01650f
- The mixed OntoNotes and WW NER model for English is now the default. Future versions may include CoNLL 2003 and CoNLL++ data as well.
- Sentences now have a `doc_id` field if the document they are created from has a `doc_id`. 8e2201f
- Optional processors added in cases where the user may not want the model we run by default, such as conparse for Turkish (limited training data) or coref for English (the only available model is the transformer model). 3d90d2b
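To illustrate why a graph library helps with enhanced dependencies, here is a small sketch that stores hand-written (head, relation, dependent) triples, stand-ins for real parser output, in a NetworkX graph; the sentence and attribute names are illustrative, not Stanza's API:

```python
# Sketch: enhanced dependencies as a NetworkX graph. The triples below are
# hand-written stand-ins for parser output for "She wanted to leave", where
# enhanced dependencies add an extra subject edge into the xcomp clause.
import networkx as nx

edges = [
    (0, "root", 2),         # 0 is the artificial root
    (2, "nsubj", 1),
    (2, "xcomp", 4),
    (4, "mark", 3),
    (4, "nsubj:xsubj", 1),  # the extra edge enhanced dependencies provide
]

# MultiDiGraph: a single word pair can carry several relations
graph = nx.MultiDiGraph()
for head, rel, dep in edges:
    graph.add_edge(head, dep, deprel=rel)

# Word 1 ("She") is now reachable as a subject from both predicates:
subject_heads = sorted(h for h, d, attrs in graph.edges(data=True)
                       if d == 1 and attrs["deprel"].startswith("nsubj"))
print(subject_heads)  # [2, 4]
```

Queries like "find every governor of this word" become one-line graph traversals instead of manual list scans, which is what makes downstream Semgrex-style searching easier.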
Updated requirements
- Support dropped for Python 3.6 and 3.7. The `peft` module used for finetuning the transformer in the coref processor does not support those versions.
- Added `peft` as an optional dependency for transformer-based installations.
- Added `networkx` as a dependency for reading enhanced dependencies.
- Added `toml` as a dependency for reading the coref config.