
Can't get the Chinese models to work #24

Open

Description

@victoryhb

Hi! Has anyone used the wrapper to parse Chinese text before?
I have the following code:

from stanford_corenlp_pywrapper import sockwrap

parser_path = "/Users/hbyan2/Downloads/stanford-corenlp-full-2015-04-20/*"
cn_model_path = "/Users/hbyan2/Downloads/stanford-corenlp-full-2015-04-20/stanford-chinese-corenlp-2015-04-20-models.jar"

p = sockwrap.SockWrap(
    configdict={
        'annotators': "segment, ssplit, pos, parse",
        'customAnnotatorClass.segment': 'edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator',
        'segment.model': 'edu/stanford/nlp/models/segmenter/chinese/ctb.gz',
        'segment.sighanCorporaDict': 'edu/stanford/nlp/models/segmenter/chinese',
        'segment.serDictionary': 'edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz',
        'segment.sighanPostProcessing': True,
        'ssplit.boundaryTokenRegex': '[.]|[!?]+|[。]|[!?]+',
        "parse.model": "edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz",
        "pos.model": "edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger",
    },
    corenlp_jars=[parser_path, cn_model_path],
)

p.parse_doc(u"你爱我吗?")

The config values are taken from the default CoreNLP properties file for Chinese: https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-chinese.properties
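
For reference, here is a rough sketch of pulling those same key/value pairs out of a local copy of that properties file into a Python dict, assuming plain "key = value" lines with "#" comments; the load_properties helper and the local filename are only illustrative, not part of the wrapper:

def load_properties(path):
    # Parse simple "key = value" lines from a CoreNLP .properties file,
    # skipping blank lines and "#" comments.
    props = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            props[key.strip()] = value.strip()
    return props

# e.g., with a local copy of the linked properties file:
# configdict = load_properties("StanfordCoreNLP-chinese.properties")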

When running the wrapper, I got the following error:

[Server] Started socket server on port 12340
INFO:StanfordSocketWrap:Successful ping. The server has started.
INFO:StanfordSocketWrap:Subprocess is ready.
Adding Segmentation annotation ... INFO: TagAffixDetector: useChPos=false | useCTBChar2=true | usePKChar2=false
INFO: TagAffixDetector: building TagAffixDetector from edu/stanford/nlp/models/segmenter/chinese/dict/character_list and edu/stanford/nlp/models/segmenter/chinese/dict/in.ctb
Loading character dictionary file from edu/stanford/nlp/models/segmenter/chinese/dict/character_list
Loading affix dictionary from edu/stanford/nlp/models/segmenter/chinese/dict/in.ctb
你爱我吗?
--->
[你, 爱, 我, 吗, ?]
java.lang.RuntimeException: don't know how to handle annotator segment
at corenlp.JsonPipeline.addAnnoToSentenceObject(JsonPipeline.java:282)
at corenlp.JsonPipeline.processTextDocument(JsonPipeline.java:312)
at corenlp.SocketServer.runCommand(SocketServer.java:140)
at corenlp.SocketServer.socketServerLoop(SocketServer.java:194)
at corenlp.SocketServer.main(SocketServer.java:107)

Any idea why this is happening? Many thanks in advance!
