*.prp files contain ^M artifacts which break model.setup() #14

@ypuzikov

Observed on LDC2014T12 data instances:

  • train_380
  • train_961
  • train_995
  • train_1442

After preprocessing, each instance has a *.prp file containing the annotations produced by the Stanford CoreNLP tool. In all the cases above, a ^M (carriage return) shows up in the middle of the CoreNLP output, like so:

[Text=currently CharacterOffsetBegin=0 ... ]^M
[Text=america CharacterOffsetBegin=10 ... ]^M
^M
[Text=is CharacterOffsetBegin=18 ... ]^M
...

Not sure why this happens. Maybe CoreNLP does not process multi-sentence instances correctly? In any case, I am reporting it for anyone who wonders what is going on.

I solved it by manually deleting the dangling ^M from the *.prp file.
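
For anyone who would rather not edit the files by hand, the same cleanup can probably be scripted. Below is a minimal sketch, not part of this project: `clean_prp_bytes` and `clean_prp_file` are hypothetical helper names. It assumes the ^M is a literal carriage-return byte (0x0D), and that a line holding nothing but ^M should be dropped entirely rather than left blank. Check the result on one file before running it on the whole dataset.

```python
def clean_prp_bytes(data: bytes) -> bytes:
    """Return `data` with carriage returns (shown as ^M) removed.

    Assumption: a line made up only of ^M is an artifact and is dropped.
    Lines that never contained ^M, including blank ones, are left untouched.
    """
    cleaned = []
    for line in data.split(b"\n"):
        if b"\r" in line:
            line = line.replace(b"\r", b"")
            if not line.strip():
                continue  # the line held nothing but ^M: drop it
        cleaned.append(line)
    return b"\n".join(cleaned)


def clean_prp_file(path: str) -> bool:
    """Clean the file at `path` in place; return True if it was changed."""
    # Binary mode keeps Python from translating line endings on its own.
    with open(path, "rb") as f:
        data = f.read()
    cleaned = clean_prp_bytes(data)
    if cleaned == data:
        return False
    with open(path, "wb") as f:
        f.write(cleaned)
    return True
```

Calling `clean_prp_file` on the affected *.prp file rewrites it only when a ^M was actually found, so it should be safe to run on files that are already clean.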
