Description
Hi! Has anyone used the wrapper to parse Chinese text before?
I have the following code:
from stanford_corenlp_pywrapper import sockwrap
parser_path = "/Users/hbyan2/Downloads/stanford-corenlp-full-2015-04-20/*"
cn_model_path = "/Users/hbyan2/Downloads/stanford-corenlp-full-2015-04-20/stanford-chinese-corenlp-2015-04-20-models.jar"
p = sockwrap.SockWrap(
    configdict={
        'annotators': "segment, ssplit, pos, parse",
        'customAnnotatorClass.segment': 'edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator',
        'segment.model': 'edu/stanford/nlp/models/segmenter/chinese/ctb.gz',
        'segment.sighanCorporaDict': 'edu/stanford/nlp/models/segmenter/chinese',
        'segment.serDictionary': 'edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz',
        'segment.sighanPostProcessing': True,
        'ssplit.boundaryTokenRegex': '[.]|[!?]+|[。]|[!?]+',
        "parse.model": "edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz",
        "pos.model": "edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger"
    },
    corenlp_jars=[parser_path, cn_model_path]
)
p.parse_doc(u"你爱我吗?")
The configs are taken from the default CoreNLP properties for parsing Chinese: https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-chinese.properties
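For anyone trying to reproduce this, here is a minimal sketch of how a configdict like the one above could be built from the text of that properties file instead of being copied by hand. The helper name `properties_to_configdict` is hypothetical (it is not part of the wrapper), the sample text is just a short excerpt reusing values from my snippet, and the parsing is deliberately simplified: it only handles plain `key = value` lines and `#`/`!` comments, not line continuations or escapes.

```python
def properties_to_configdict(text):
    """Turn simple Java .properties text into a plain dict.

    Hypothetical helper, simplified on purpose: only `key = value` lines
    are supported; line continuations and escape sequences are ignored.
    """
    config = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            # Not a key = value line; ignore it in this sketch.
            continue
        config[key.strip()] = value.strip()
    return config


# Short excerpt reusing values from the configdict above.
sample = """\
# Chinese pipeline settings (excerpt)
annotators = segment, ssplit, pos, parse
segment.model = edu/stanford/nlp/models/segmenter/chinese/ctb.gz
segment.sighanPostProcessing = true
"""

config = properties_to_configdict(sample)
```

Note that every value comes out as a string here (for example `'true'` rather than `True`), so it may not be exactly equivalent to the hand-written dict; I have not checked whether the wrapper treats the two the same way.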
When running the Wrapper, I got the following error:
[Server] Started socket server on port 12340
INFO:StanfordSocketWrap:Successful ping. The server has started.
INFO:StanfordSocketWrap:Subprocess is ready.
Adding Segmentation annotation ... INFO: TagAffixDetector: useChPos=false | useCTBChar2=true | usePKChar2=false
INFO: TagAffixDetector: building TagAffixDetector from edu/stanford/nlp/models/segmenter/chinese/dict/character_list and edu/stanford/nlp/models/segmenter/chinese/dict/in.ctb
Loading character dictionary file from edu/stanford/nlp/models/segmenter/chinese/dict/character_list
Loading affix dictionary from edu/stanford/nlp/models/segmenter/chinese/dict/in.ctb
你爱我吗?
--->
[你, 爱, 我, 吗, ?]
java.lang.RuntimeException: don't know how to handle annotator segment
    at corenlp.JsonPipeline.addAnnoToSentenceObject(JsonPipeline.java:282)
    at corenlp.JsonPipeline.processTextDocument(JsonPipeline.java:312)
    at corenlp.SocketServer.runCommand(SocketServer.java:140)
    at corenlp.SocketServer.socketServerLoop(SocketServer.java:194)
    at corenlp.SocketServer.main(SocketServer.java:107)
Any idea why this is happening? Many thanks in advance!