Add article content extraction pipeline with intelligent text extraction #10
Co-authored-by: njt <66792+njt@users.noreply.github.com>
@copilot I have added https://ww.rnz.co.nz to the allowed urls list. Try again.
Tested successfully! The HTML fetcher is now working with RNZ URLs (www.rnz.co.nz). Verification results:
@copilot Don't just strip out style and script tags, take it back to the content -- paragraphs of text from the current article, external links. No point burning OpenAI tokens on unnecessary markup.
Done. Updated the extraction to keep only article content (paragraphs, headings, lists, external links) while removing all unnecessary markup. Results:
RSS feeds only provide headlines and summaries. This adds a pipeline stage to fetch articles and extract clean content optimized for LLM processing.
Changes
Database schema (`database.py`)
- `raw_html` column to articles table (stores extracted content, not raw HTML)
- `update_article_html()` and `get_articles_without_html()` methods

HTML fetcher (`html_fetcher.py` - new module)
- `HTMLFetcher` class fetches URLs and intelligently extracts article content

CLI (`forcible.py`)
- `fetch-html` command with optional `--limit` parameter

Dependencies
- `beautifulsoup4>=4.12.0`

Token Optimization
The content extraction significantly reduces OpenAI token usage:
Only article content is preserved - no unnecessary markup, navigation, or page elements.
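To make the kind of extraction described above concrete, here is a minimal sketch, not the code from this PR. The PR depends on `beautifulsoup4`; this sketch swaps in the standard library's `html.parser` so it runs without extra dependencies. The names `ArticleTextExtractor`, `extract_article_text`, `SKIP_TAGS` and `BLOCK_TAGS`, as well as the tag lists themselves, are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Tags whose contents are dropped entirely (assumed list, not taken from the PR).
SKIP_TAGS = {"script", "style", "nav", "footer", "aside", "form"}
# Block-level tags whose text is kept (also an assumption).
BLOCK_TAGS = {"p", "h1", "h2", "h3", "li"}
# Markdown-style prefixes for the kept blocks.
PREFIXES = {"h1": "# ", "h2": "## ", "h3": "### ", "li": "- "}


class ArticleTextExtractor(HTMLParser):
    """Collects paragraph, heading and list text, keeping external links."""

    def __init__(self, base_host):
        super().__init__(convert_charrefs=True)
        self.base_host = base_host
        self.blocks = []       # finished text blocks
        self._buf = []         # text pieces of the block being built
        self._tag = None       # currently open block tag, if any
        self._skip_depth = 0   # > 0 while inside a skipped tag
        self._href = None      # href of the open external <a>, if any

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
            return
        if self._skip_depth:
            return
        if tag in BLOCK_TAGS:
            self._tag, self._buf = tag, []
        elif tag == "a" and self._tag:
            href = dict(attrs).get("href") or ""
            host = urlparse(href).netloc
            # Only links pointing at a different host count as external.
            self._href = href if host and host != self.base_host else None
            if self._href:
                self._buf.append("[")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == "a" and self._href:
            self._buf.append(f"]({self._href})")
            self._href = None
        elif tag == self._tag:
            # Collapse runs of whitespace and drop empty blocks.
            text = " ".join("".join(self._buf).split())
            if text:
                self.blocks.append(PREFIXES.get(tag, "") + text)
            self._tag = None

    def handle_data(self, data):
        if self._skip_depth == 0 and self._tag:
            self._buf.append(data)


def extract_article_text(html, base_host):
    """Return markdown-style text for the article content in `html`."""
    parser = ArticleTextExtractor(base_host)
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.blocks)
```

Calling `extract_article_text(page_html, "www.rnz.co.nz")` would keep headings, paragraphs and list items, render links to other hosts as `[text](url)`, and reduce same-site links to their plain text.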
Usage
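As a rough illustration of how a `fetch-html` subcommand with an optional `--limit` could be wired up, here is a sketch using `argparse`. Whether `forcible.py` actually uses `argparse` is an assumption, and `build_parser` is a hypothetical helper name.

```python
import argparse


def build_parser():
    """Build a parser with a fetch-html subcommand (hypothetical layout)."""
    parser = argparse.ArgumentParser(prog="forcible.py")
    sub = parser.add_subparsers(dest="command", required=True)
    fetch = sub.add_parser("fetch-html", help="Fetch articles and extract content")
    # --limit is optional; None is taken here to mean "process every pending article".
    fetch.add_argument("--limit", type=int, default=None,
                       help="Maximum number of articles to process")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is self-contained.
args = build_parser().parse_args(["fetch-html", "--limit", "5"])
```

With this layout, `args.command` holds the subcommand name and `args.limit` holds the integer limit, or `None` when the flag is omitted.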
The `raw_html` column is nullable, so existing databases auto-migrate on first run. Content is extracted and formatted as markdown-style text, ready for LLM processing without wasting tokens on markup.
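One way a nullable column can be added to an existing database on first run is sketched below. It assumes SQLite and an `articles` table with `id` and `url` columns, neither of which is confirmed by the PR. The method names come from the change list, but their signatures here are guesses, and `ensure_raw_html_column` is a hypothetical helper.

```python
import sqlite3


def ensure_raw_html_column(conn):
    """Add the nullable raw_html column if the table does not have it yet."""
    cols = {row[1] for row in conn.execute("PRAGMA table_info(articles)")}
    if "raw_html" not in cols:
        # A column added without NOT NULL is nullable, so existing rows get NULL.
        conn.execute("ALTER TABLE articles ADD COLUMN raw_html TEXT")
        conn.commit()


def get_articles_without_html(conn, limit=None):
    """Return (id, url) rows that have no extracted content yet."""
    sql = "SELECT id, url FROM articles WHERE raw_html IS NULL ORDER BY id"
    params = ()
    if limit is not None:
        sql += " LIMIT ?"
        params = (limit,)
    return conn.execute(sql, params).fetchall()


def update_article_html(conn, article_id, content):
    """Store the extracted content for one article."""
    conn.execute("UPDATE articles SET raw_html = ? WHERE id = ?",
                 (content, article_id))
    conn.commit()
```

Because the check runs against `PRAGMA table_info`, calling `ensure_raw_html_column` repeatedly is harmless, which is what lets an older database pick up the column without a separate migration step.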