lrm character and rlm character throw exception #119

Open: wants to merge 1 commit into base: master

Insutanto

When the code parses HTML like:
<b> &#8206;June, 2016</b>
the program throws an IndexError exception.
I traced this bug to the implementation of handle_charref.

handle_data may try to match the first character of the data (data[0]), but the LRM and RLM characters are defined as '' (empty).
So when the program reaches the following condition with LRM or RLM character data, data is an empty string and data[0] fails:
+++++++++++++++++++++++++++++++++++
        elif (self.preceding_stressed
              and re.match(r'[^\s.!?]', data[0])
              and not hn(self.current_tag)
              and self.current_tag not in ['a', 'code', 'pre']):
+++++++++++++++++++++++++++++++++++
This is the traceback:
Traceback (most recent call last):
  File "get_email.py", line 37, in <module>
    text = h.handle(mail_content_string)  # convert HTML to Markdown
  File "/data/vijay/emailcont_venv/lib/python3.4/site-packages/html2text/__init__.py", line 149, in handle
    self.feed(data)
  File "/data/vijay/emailcont_venv/lib/python3.4/site-packages/html2text/__init__.py", line 146, in feed
    HTMLParser.HTMLParser.feed(self, data)
  File "/usr/lib64/python3.4/html/parser.py", line 165, in feed
    self.goahead(0)
  File "/usr/lib64/python3.4/html/parser.py", line 268, in goahead
    self.handle_charref(name)
  File "/data/vijay/emailcont_venv/lib/python3.4/site-packages/html2text/__init__.py", line 186, in handle_charref
    self.handle_data(self.charref(c), True)
  File "/data/vijay/emailcont_venv/lib/python3.4/site-packages/html2text/__init__.py", line 802, in handle_data
    and re.match(r'[^\s.!?]', data[0])
IndexError: string index out of range
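
The failure can be illustrated without html2text itself. The following is a minimal sketch, not the library's code: `EMPTY_REPLACEMENTS`, `charref_to_text`, and both `starts_with_word_char_*` helpers are hypothetical names that only mimic the behavior described above (LRM and RLM mapped to an empty string, then `data[0]` evaluated). It shows one possible guard, slicing with `data[:1]`, which yields `''` instead of raising on empty input; whether that is the right fix for the project is an assumption.

```python
import re

# Hypothetical stand-in for the mapping described in the report:
# LRM (&#8206;, U+200E) and RLM (&#8207;, U+200F) are replaced by ''.
EMPTY_REPLACEMENTS = {8206: "", 8207: ""}


def charref_to_text(code_point):
    """Return the replacement text for a numeric character reference."""
    return EMPTY_REPLACEMENTS.get(code_point, chr(code_point))


def starts_with_word_char_unsafe(data):
    """Mirror the failing expression: data[0] raises IndexError on ''."""
    return bool(re.match(r"[^\s.!?]", data[0]))


def starts_with_word_char_safe(data):
    """Same check, but data[:1] is '' for empty input, so nothing raises."""
    return bool(re.match(r"[^\s.!?]", data[:1]))


if __name__ == "__main__":
    lrm_text = charref_to_text(8206)
    try:
        starts_with_word_char_unsafe(lrm_text)
    except IndexError as exc:
        print("unsafe check failed:", exc)
    print("safe check on LRM text:", starts_with_word_char_safe(lrm_text))
    print("safe check on 'June, 2016':", starts_with_word_char_safe("June, 2016"))
```

An equivalent alternative would be to return early when `data` is empty, before any of the conditions are evaluated.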