Aggressively process entry content for Readability-like formatting #27

passiomatic opened this issue Oct 13, 2013 · 0 comments

Currently Coldsweat does very little to format feed entries. It can optionally scan entries for images and links that point to blacklisted domains and remove them, and nothing more.

As a result, entries are rendered mostly as-is, with generic CSS styles applied on top, which works most of the time. However, some entries are written like this:

Aenean lacinia bibendum nulla sed consectetur. Sed posuere consectetur est at lobortis. Duis mollis, est non commodo luctus, nisi erat porttitor ligula, eget lacinia odio sem nec elit. Cras justo odio, dapibus ac facilisis in, egestas eget quam.
<br>
<br>
Etiam porta sem malesuada magna mollis euismod. Integer posuere erat a ante venenatis dapibus posuere velit aliquet. Cras mattis consectetur purus sit amet fermentum. Vestibulum id ligula porta felis euismod semper. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus.

Or worse:

Etiam porta sem malesuada magna mollis euismod. Integer posuere erat a ante venenatis dapibus posuere velit aliquet. Cras mattis consectetur purus sit amet fermentum. Vestibulum id ligula porta felis euismod semper. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus.
<br>
<br>
<br>
<br>
<br>
<br>
[eof]

This adds a huge amount of padding at the end of the entry, which makes me think there's room for improvement.

One idea is to pass each entry's content through a processor which strips most of the HTML tags, keeping only the necessary formatting hints. Think of something like HTML -> Markdown and then Markdown -> HTML.

Empty elements

Empty elements like <p></p> or <td></td> will be stripped. Multiple consecutive occurrences of <br> will be removed too.
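
A minimal sketch of this step, assuming a regex pass is good enough for a first cut (a real implementation would more likely work on a parsed tree). The names `strip_empty_markup`, `EMPTY_ELEMENT`, `BR_RUN` and the `br_replacement` parameter are hypothetical and not part of Coldsweat:

```python
import re

# An element whose content is only whitespace, e.g. <p></p> or <td> </td>.
EMPTY_ELEMENT = re.compile(r'<([a-z][a-z0-9]*)\b[^>]*>\s*</\1\s*>', re.IGNORECASE)

# A run of two or more <br>, <br/> or <br /> tags, optionally separated by whitespace.
BR_RUN = re.compile(r'(?:<br\s*/?>\s*){2,}', re.IGNORECASE)


def strip_empty_markup(html, br_replacement=''):
    """Remove empty elements and runs of consecutive <br> tags (hypothetical helper)."""
    # Repeat the pass so that elements left empty by a previous removal go away too,
    # e.g. <tr><td></td></tr> becomes <tr></tr> and then disappears.
    while True:
        html, count = EMPTY_ELEMENT.subn('', html)
        if count == 0:
            break
    # A single <br> is left alone; only runs of two or more are replaced.
    return BR_RUN.sub(br_replacement, html)


print(strip_empty_markup('<p></p><p>Hi</p>'))
print(strip_empty_markup('Lorem ipsum.<br>\n<br>\n<br>\n<br>'))
```

Removing a `<br>` run outright also merges paragraphs that were separated only by `<br><br>`, as in the first example above. Passing something like `br_replacement='<br>'` would keep one break instead; which behaviour is preferable is an open design choice.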

Allowed tags

Non-empty tags left as-is while parsing will be: p, table and all its child elements, ul, ol, dl, li, dt, dd, b, blockquote, strong, i, em, code, var, kbd, img, figure and figcaption, etc.
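
One possible way to apply such a whitelist is a small filter built on the standard library's `html.parser`. This is only a sketch: the class and function names are hypothetical, the set uses the standard HTML element names `b` and `kbd`, and the table child elements listed are an assumed reading of "table and all its child elements". Tags outside the set are dropped while their text content is kept:

```python
from html import escape
from html.parser import HTMLParser

# Assumed whitelist based on the list above; extend as needed.
ALLOWED_TAGS = {
    'p',
    'table', 'caption', 'colgroup', 'col', 'thead', 'tbody', 'tfoot', 'tr', 'th', 'td',
    'ul', 'ol', 'dl', 'li', 'dt', 'dd',
    'b', 'blockquote', 'strong', 'i', 'em', 'code', 'var', 'kbd',
    'img', 'figure', 'figcaption',
}


class TagWhitelistFilter(HTMLParser):
    """Keeps whitelisted tags and all text, drops every other tag (hypothetical)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            # Emit the start tag exactly as written in the source, attributes included.
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <img/>: emit it once, with no separate end tag.
        if tag in ALLOWED_TAGS:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.parts.append('</%s>' % tag)

    def handle_data(self, data):
        # Character references were decoded by the parser, so escape the text again.
        self.parts.append(escape(data, quote=False))


def filter_tags(html):
    parser = TagWhitelistFilter()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)


print(filter_tags('<div><p>Hello <span>big</span> <em>world</em></p></div>'))
```

The sketch keeps the text inside dropped tags, so it relies on script blocks having been removed earlier in the pipeline.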

Script blocks

Script blocks are already removed by Feedparser.

Allowed attributes

Most formatting attributes like "style", "align", etc. will be stripped. This will help us reformat content, especially inline replaced elements like embedded images.
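
As a rough illustration of attribute stripping, the helper below rebuilds a start tag from `(name, value)` pairs, which is the shape `html.parser.HTMLParser.handle_starttag` passes to its callers. The function name and the blacklist are assumptions: only "style" and "align" come from the text above, the other entries are guesses at what "etc." might cover.

```python
from html import escape

# Hypothetical blacklist; only 'style' and 'align' are named in the issue text.
FORMATTING_ATTRIBUTES = {
    'style', 'align', 'valign', 'width', 'height', 'border', 'bgcolor', 'class',
}


def render_start_tag(tag, attrs):
    """Rebuild a start tag, omitting formatting attributes (hypothetical helper)."""
    kept = []
    for name, value in attrs:
        if name.lower() in FORMATTING_ATTRIBUTES:
            continue
        if value is None:
            # Attribute written without a value, e.g. <td nowrap>.
            kept.append(name)
        else:
            kept.append('%s="%s"' % (name, escape(value, quote=True)))
    return '<%s>' % ' '.join([tag] + kept)


print(render_start_tag('img', [
    ('src', 'photo.jpg'),
    ('style', 'float: left'),
    ('align', 'right'),
    ('alt', 'A photo'),
]))
```

With presentational attributes gone, an embedded image carries only `src` and `alt`, so its size and alignment can be decided by the reader's own stylesheet.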
