Aggressively process entry content for Readability-like formatting #27

passiomatic · 2013-10-13T12:41:19Z

Currently Coldsweat does very little to format feed entries. It optionally scans entries for images and links that point to blacklisted domains and removes them, and nothing more.

As a result, entries are mostly rendered as-is with generic CSS styles applied, which mostly works. However, some entries are written like this:

Aenean lacinia bibendum nulla sed consectetur. Sed posuere consectetur est at lobortis. Duis mollis, est non commodo luctus, nisi erat porttitor ligula, eget lacinia odio sem nec elit. Cras justo odio, dapibus ac facilisis in, egestas eget quam.
<br>
<br>
Etiam porta sem malesuada magna mollis euismod. Integer posuere erat a ante venenatis dapibus posuere velit aliquet. Cras mattis consectetur purus sit amet fermentum. Vestibulum id ligula porta felis euismod semper. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus.

Or worse:

Etiam porta sem malesuada magna mollis euismod. Integer posuere erat a ante venenatis dapibus posuere velit aliquet. Cras mattis consectetur purus sit amet fermentum. Vestibulum id ligula porta felis euismod semper. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus.
<br>
<br>
<br>
<br>
<br>
<br>
[eof]

This adds a huge amount of padding at the end of the entry, which makes me think there's room for improvement.

One idea is to pass each entry's content through a processor that strips most of the HTML tags, keeping only the necessary formatting hints. Think of something like HTML -> Markdown and then Markdown -> HTML.

Empty elements

Empty elements like <p></p> or <td></td> will be stripped. Multiple consecutive occurrences of <br> will be removed too.
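
A minimal sketch of this step in Python 3, using regular expressions only. The function name `strip_empty_markup` and the list of elements in `EMPTY_CANDIDATES` are hypothetical. It also assumes that a run of `<br>` inside the text collapses to a single `<br>`, while a run at the very end of the entry is dropped entirely. A real implementation would more likely work on a parsed tree.

```python
import re

# Hypothetical list of elements to drop when empty; the issue only names p and td.
EMPTY_CANDIDATES = ("p", "td", "div", "span", "li")

# An opening tag immediately followed (ignoring whitespace) by its closing tag.
EMPTY_ELEMENT = re.compile(
    r"<(%s)\b[^>]*>\s*</\1\s*>" % "|".join(EMPTY_CANDIDATES), re.IGNORECASE)
# Two or more <br> tags separated only by whitespace.
BR_RUN = re.compile(r"(?:<br\s*/?>\s*){2,}", re.IGNORECASE)
# Any number of <br> tags at the very end of the content.
TRAILING_BR = re.compile(r"(?:<br\s*/?>\s*)+\Z", re.IGNORECASE)


def strip_empty_markup(html):
    """Drop trailing <br> runs, collapse inner runs, remove empty elements."""
    html = TRAILING_BR.sub("", html)
    html = BR_RUN.sub("<br>", html)
    # Repeat so that elements left empty by a previous pass are removed too,
    # e.g. <td><p></p></td>.
    previous = None
    while previous != html:
        previous = html
        html = EMPTY_ELEMENT.sub("", html)
    return html
```

With this sketch, the second example above (text followed by six `<br>` tags) would come out as the text alone.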

Allowed tags

Non-empty tags left as-is while parsing will be: p, table and all its child elements, ul, ol, dl, li, dt, dd, b, blockquote, strong, i, em, code, var, kbd, img, figure and figcaption, etc.
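
One way the whitelist could be applied, sketched with Python 3's standard `html.parser`. The names `TagWhitelist` and `keep_allowed_tags` are hypothetical, and "table and all its child elements" is expanded by assumption. Tags outside the whitelist are dropped but their text is kept. Attributes pass through untouched here.

```python
from html import escape
from html.parser import HTMLParser

# Whitelist taken from the list above; the table child elements are an assumption.
ALLOWED_TAGS = {
    "p", "table", "thead", "tbody", "tfoot", "tr", "th", "td", "caption",
    "ul", "ol", "dl", "li", "dt", "dd", "b", "blockquote", "strong",
    "i", "em", "code", "var", "kbd", "img", "figure", "figcaption",
}
VOID_TAGS = {"img"}  # allowed elements that never take a closing tag


class TagWhitelist(HTMLParser):
    """Re-emit only whitelisted tags; text content is always kept."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            rendered = "".join(
                ' %s="%s"' % (name, escape(value or "", quote=True))
                for name, value in attrs)
            self.out.append("<%s%s>" % (tag, rendered))

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Character references were decoded by the parser, so re-escape.
        self.out.append(escape(data, quote=False))


def keep_allowed_tags(html):
    parser = TagWhitelist()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.out)
```

For example, `keep_allowed_tags('<div><p>Hello <font color="red">world</font></p></div>')` returns `<p>Hello world</p>`.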

Script blocks

Script blocks are already removed by Feedparser.

Allowed attributes

Most formatting attributes like "style", "align", etc. will be stripped. This will help us reformat content, especially inline replaced elements like embedded images.
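
A rough sketch of attribute filtering in Python 3. It inverts the rule into a whitelist, which is an assumption: the set in `ALLOWED_ATTRIBUTES` and the helper name `render_start_tag` are hypothetical, not taken from Coldsweat. Attributes are expected as `(name, value)` pairs, the shape the standard `html.parser` passes to `handle_starttag`.

```python
from html import escape

# Hypothetical whitelist: everything else ("style", "align", ...) is dropped.
ALLOWED_ATTRIBUTES = {"href", "src", "alt", "title"}


def render_start_tag(tag, attrs):
    """Rebuild an opening tag, keeping only whitelisted attributes."""
    kept = [(name, value) for name, value in attrs
            if name.lower() in ALLOWED_ATTRIBUTES]
    rendered = "".join(
        ' %s="%s"' % (name.lower(), escape(value or "", quote=True))
        for name, value in kept)
    return "<%s%s>" % (tag, rendered)
```

For example, `render_start_tag("img", [("src", "a.png"), ("style", "float: left"), ("align", "right")])` returns `<img src="a.png">`, leaving image layout entirely to the reader's own CSS.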

References

Readability - http://www.readability.com/


ghost assigned passiomatic Oct 13, 2013

passiomatic added enhancement and removed feature labels Feb 28, 2014

passiomatic added the bluesky label Jul 25, 2014
