
IndexError: list index out of range #12

@UntotaufUrlaub

Hi,

I encountered an error:

File "/add_score.py", line 53, in add_score
    res = function(["? I haven't had a birthday since 2007. I have a b-day in October and it's almost completely ignored."], ["",])
  File "/add_score_summac.py", line 28, in <lambda>
    "my_summacZS_batched": lambda summs, docs: modelZS.score(docs, summs)['scores'],
  File "/usr/local/lib/python3.9/site-packages/summac/model_summac.py", line 351, in score
    score = self.score_one(source, gen)
  File "/usr/local/lib/python3.9/site-packages/summac/model_summac.py", line 322, in score_one
    image = self.imager.build_image(original, generated)
  File "/usr/local/lib/python3.9/site-packages/summac/model_summac.py", line 113, in build_image
    generated_chunks = self.split_text(generated, granularity=gran_sum)
  File "/usr/local/lib/python3.9/site-packages/summac/model_summac.py", line 94, in split_text
    return self.split_sentences(text)
  File "/usr/local/lib/python3.9/site-packages/summac/model_summac.py", line 71, in split_sentences
    sentences = nltk.tokenize.sent_tokenize(text)
  File "/usr/local/lib/python3.9/site-packages/nltk/tokenize/__init__.py", line 107, in sent_tokenize
    return tokenizer.tokenize(text)
  File "/usr/local/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1276, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/usr/local/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1332, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/usr/local/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1332, in <listcomp>
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/usr/local/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1322, in span_tokenize
    for sentence in slices:
  File "/usr/local/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1421, in _realign_boundaries
    for sentence1, sentence2 in _pair_iter(slices):
  File "/usr/local/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 318, in _pair_iter
    prev = next(iterator)
  File "/usr/local/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1395, in _slices_from_text
    for match, context in self._match_potential_end_contexts(text):
  File "/usr/local/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1382, in _match_potential_end_contexts
    before_words[match] = split[-1]
IndexError: list index out of range

I think it is caused by the leading "? ", which might lead to an empty sentence within the metric.
Is this expected behavior that is documented somewhere, or is it a bug?

Kind regards

Edit:
I circumvented (not fixed) this issue for now using this code:

import re

# Drop a leading punctuation-only "sentence" (e.g. "? ") before scoring.
match = re.match(r"(\s*[.?!]+\s)", summaries[i])
if match:
    summaries[i] = summaries[i][len(match.group(1)):]

The pattern covers ".", "?" and "!" because empty leading sentences made of symbols other than "?" also caused this issue.
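The workaround above can also be packaged as a small helper that is applied to every summary before scoring. The following is only a sketch: the names `strip_leading_empty_sentences` and `sanitize_summaries` are made up for illustration and are not part of summac or NLTK. Unlike the snippet above, it removes several punctuation-only sentences in a row. Whether this avoids the IndexError for every input has not been verified.

```python
import re

# Leading run of "empty sentences": optional whitespace, then one or more
# groups of sentence-ending punctuation, each followed by whitespace.
_LEADING_EMPTY = re.compile(r"^\s*(?:[.?!]+\s+)+")


def strip_leading_empty_sentences(text: str) -> str:
    """Remove punctuation-only sentences from the start of text (hypothetical helper)."""
    return _LEADING_EMPTY.sub("", text)


def sanitize_summaries(summaries):
    """Return a cleaned copy of the list; the input list is left untouched."""
    return [strip_leading_empty_sentences(s) for s in summaries]


cleaned = sanitize_summaries([
    "? I haven't had a birthday since 2007.",
    "... ! Second example.",
    "No leading punctuation here.",
])
print(cleaned)
```

The cleaned list could then be passed to the scorer in place of the raw summaries. Note that a summary consisting only of punctuation with no trailing whitespace (for example "?") is left unchanged by this pattern, so such inputs would still need separate handling.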
