
Can't get the Chinese models to work #24

Open
victoryhb opened this issue Jun 21, 2015 · 1 comment

Comments

@victoryhb

Hi! I wonder if anyone has used the Wrapper to parse Chinese texts before?
I have the following code:

```python
from stanford_corenlp_pywrapper import sockwrap

parser_path = "/Users/hbyan2/Downloads/stanford-corenlp-full-2015-04-20/*"
cn_model_path = "/Users/hbyan2/Downloads/stanford-corenlp-full-2015-04-20/stanford-chinese-corenlp-2015-04-20-models.jar"

p = sockwrap.SockWrap(
    configdict={
        'annotators': "segment, ssplit, pos, parse",
        'customAnnotatorClass.segment': 'edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator',
        'segment.model': 'edu/stanford/nlp/models/segmenter/chinese/ctb.gz',
        'segment.sighanCorporaDict': 'edu/stanford/nlp/models/segmenter/chinese',
        'segment.serDictionary': 'edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz',
        'segment.sighanPostProcessing': True,
        'ssplit.boundaryTokenRegex': '[.]|[!?]+|[。]|[!?]+',
        "parse.model": "edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz",
        "pos.model": "edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger"
    },
    corenlp_jars=[parser_path, cn_model_path]
)

p.parse_doc(u"你爱我吗?")
```

The configs are taken from the default CoreNLP properties for parsing Chinese: https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-chinese.properties
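Rather than re-typing those keys by hand, one could load them straight from the upstream properties file so the config stays in sync with CoreNLP's defaults. A minimal sketch (assumption: SockWrap accepts arbitrary properties through configdict, as in the snippet above; this parser handles only simple key=value lines, not line continuations or escapes):

```python
def load_properties(path):
    """Parse a simple Java .properties file into a dict.

    Handles only plain ``key = value`` lines, comments starting with
    ``#`` or ``!``, and blank lines -- enough for the CoreNLP defaults.
    """
    props = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            # skip blank lines and comments
            if not line or line.startswith(("#", "!")):
                continue
            # split on the first '=' only, since values may contain '='
            key, _, value = line.partition("=")
            props[key.strip()] = value.strip()
    return props
```

The resulting dict could then be passed as configdict (optionally overriding individual keys afterwards).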

When running the Wrapper, I got the following error:

```
[Server] Started socket server on port 12340
INFO:StanfordSocketWrap:Successful ping. The server has started.
INFO:StanfordSocketWrap:Subprocess is ready.
Adding Segmentation annotation ... INFO: TagAffixDetector: useChPos=false | useCTBChar2=true | usePKChar2=false
INFO: TagAffixDetector: building TagAffixDetector from edu/stanford/nlp/models/segmenter/chinese/dict/character_list and edu/stanford/nlp/models/segmenter/chinese/dict/in.ctb
Loading character dictionary file from edu/stanford/nlp/models/segmenter/chinese/dict/character_list
Loading affix dictionary from edu/stanford/nlp/models/segmenter/chinese/dict/in.ctb
你爱我吗?
--->
[你, 爱, 我, 吗, ?]
java.lang.RuntimeException: don't know how to handle annotator segment
        at corenlp.JsonPipeline.addAnnoToSentenceObject(JsonPipeline.java:282)
        at corenlp.JsonPipeline.processTextDocument(JsonPipeline.java:312)
        at corenlp.SocketServer.runCommand(SocketServer.java:140)
        at corenlp.SocketServer.socketServerLoop(SocketServer.java:194)
        at corenlp.SocketServer.main(SocketServer.java:107)
```

Any idea why this is happening? Many thanks in advance!

@brendano
Owner

The wrapper doesn't support it -- you'd have to modify the Java code where the error is happening, to add the segmentation information to the JSON output.
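The shape of that fix can be sketched with a self-contained stand-in (not the actual JsonPipeline code; the class, method, and return type here are hypothetical): the wrapper dispatches on annotator names and throws for ones it doesn't recognize, which is exactly the RuntimeException in the trace above. Adding a case for "segment" -- whose output is already reflected in the token list -- would stop the exception:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class SegmentDispatchSketch {
    /**
     * Hypothetical stand-in for JsonPipeline.addAnnoToSentenceObject:
     * maps an annotator name to the key/value it would contribute to the
     * JSON sentence object, throwing for unknown annotators as the real
     * wrapper does.
     */
    static Map.Entry<String, Object> addAnnoToSentenceObject(
            String annotator, List<String> tokens) {
        switch (annotator) {
            case "ssplit":
                return Map.entry("sentences", tokens);
            case "segment":
                // the added case: segmentation is already reflected in the
                // token list, so it can simply be passed through
                return Map.entry("tokens", tokens);
            default:
                // this branch is what produced the error in the log above
                throw new RuntimeException(
                        "don't know how to handle annotator " + annotator);
        }
    }

    public static void main(String[] args) {
        List<String> toks = Arrays.asList("你", "爱", "我", "吗", "?");
        System.out.println(addAnnoToSentenceObject("segment", toks));
    }
}
```

In the real JsonPipeline.java the new case would instead serialize the tokens into the wrapper's JSON sentence object, but the dispatch structure is the same.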
