lrm character and rlm character throw exception #119

Insutanto · 2019-03-15T09:29:21Z

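When the code parses HTML like the following (the text starts with an lrm character reference, which is invisible here):
<b> ‎June, 2016</b>
the program throws an IndexError exception.
I found this bug in the implementation of handle_charref.

handle_data may index the first character of data (data[0]), but the lrm and rlm characters are defined as '' (an empty string).
So when handle_data receives the empty data for an lrm or rlm character, indexing data[0] in this condition fails: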
+++++++++++++++++++++++++++++++++++
elif (self.preceding_stressed
      and re.match(r'[^\s.!?]', data[0])
      and not hn(self.current_tag)
      and self.current_tag not in ['a', 'code', 'pre']):
+++++++++++++++++++++++++++++++++++
This is the traceback:
Traceback (most recent call last):
  File "get_email.py", line 37, in <module>
    text = h.handle(mail_content_string)  # convert HTML to Markdown
  File "/data/vijay/emailcont_venv/lib/python3.4/site-packages/html2text/__init__.py", line 149, in handle
    self.feed(data)
  File "/data/vijay/emailcont_venv/lib/python3.4/site-packages/html2text/__init__.py", line 146, in feed
    HTMLParser.HTMLParser.feed(self, data)
  File "/usr/lib64/python3.4/html/parser.py", line 165, in feed
    self.goahead(0)
  File "/usr/lib64/python3.4/html/parser.py", line 268, in goahead
    self.handle_charref(name)
  File "/data/vijay/emailcont_venv/lib/python3.4/site-packages/html2text/__init__.py", line 186, in handle_charref
    self.handle_data(self.charref(c), True)
  File "/data/vijay/emailcont_venv/lib/python3.4/site-packages/html2text/__init__.py", line 802, in handle_data
    and re.match(r'[^\s.!?]', data[0])
IndexError: string index out of range
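
For illustration, here is a minimal standalone sketch of the failing check and one possible guard. It does not use html2text itself; the function names `unguarded_check` and `guarded_check` are hypothetical and only mirror the `re.match(r'[^\s.!?]', data[0])` part of the condition above. It relies on the fact that `re.match` already anchors at the start of the string, so passing `data` instead of `data[0]` gives the same answer for non-empty input and simply returns no match for an empty string.

```python
import re

# Same pattern as in the condition quoted in the issue.
PATTERN = r'[^\s.!?]'


def unguarded_check(data):
    """Hypothetical copy of the failing test: indexes data[0] directly."""
    # Raises IndexError when data is '' (as it is for lrm/rlm here).
    return re.match(PATTERN, data[0]) is not None


def guarded_check(data):
    """Hypothetical guarded variant: match against the whole string."""
    # re.match only looks at the start of the string, so this inspects
    # the first character when there is one and returns None for ''.
    return re.match(PATTERN, data) is not None


if __name__ == "__main__":
    print(guarded_check("June, 2016"))  # first character is a letter
    print(guarded_check(""))            # empty data, no exception
    try:
        unguarded_check("")
    except IndexError as exc:
        print("unguarded check failed:", exc)
```

Whether the right fix is to guard this condition or to skip empty data earlier (for example before handle_data is called from handle_charref) is a design choice for the maintainers; this sketch only shows that the empty string is what triggers the exception.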
