Add article content extraction pipeline with intelligent text extraction#10

Draft
Copilot wants to merge 6 commits into master from copilot/fetch-article-urls-into-folder

Conversation

Contributor

Copilot AI commented Dec 14, 2025

RSS feeds only provide headlines and summaries. This adds a pipeline stage to fetch articles and extract clean content optimized for LLM processing.

Changes

Database schema (database.py)

  • Added raw_html column to articles table (stores extracted content, not raw HTML)
  • Added update_article_html() and get_articles_without_html() methods
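A minimal sketch of the two helper methods, assuming an SQLite backing store. The PR names only the methods and the `articles` table; the `id` and `url` columns and the exact signatures below are assumptions:

```python
import sqlite3

def update_article_html(conn, article_id, content):
    """Store extracted article content in the nullable raw_html column."""
    conn.execute(
        "UPDATE articles SET raw_html = ? WHERE id = ?",
        (content, article_id),
    )
    conn.commit()

def get_articles_without_html(conn, limit=None):
    """Return (id, url) pairs for articles whose content hasn't been fetched."""
    sql = "SELECT id, url FROM articles WHERE raw_html IS NULL"
    if limit is not None:
        return conn.execute(sql + " LIMIT ?", (limit,)).fetchall()
    return conn.execute(sql).fetchall()
```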

HTML fetcher (html_fetcher.py - new module)

  • HTMLFetcher class fetches URLs and intelligently extracts article content
  • Extracts only essential elements: paragraphs, headings, lists, blockquotes, and external links
  • Removes navigation, ads, scripts, styles, headers, footers, and other non-content elements
  • Formats output in markdown-style structure for clean LLM input
  • Batch fetching with progress callbacks
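The extraction approach described above could look roughly like this with BeautifulSoup. This is an illustrative sketch, not the PR's actual code; the tag lists and markdown conventions are assumptions:

```python
from bs4 import BeautifulSoup

def extract_content(html):
    """Strip non-content elements, then keep paragraphs, headings,
    lists, and blockquotes as markdown-style text."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove navigation, scripts, styles, and other non-content elements.
    for tag in soup(["script", "style", "nav", "header", "footer", "aside"]):
        tag.decompose()
    lines = []
    for el in soup.find_all(["h1", "h2", "h3", "p", "li", "blockquote"]):
        # Skip paragraphs already captured by an enclosing blockquote/list item.
        if el.name == "p" and el.find_parent(["blockquote", "li"]):
            continue
        text = el.get_text(" ", strip=True)
        if not text:
            continue
        if el.name in ("h1", "h2", "h3"):
            lines.append("#" * int(el.name[1]) + " " + text)
        elif el.name == "li":
            lines.append("- " + text)
        elif el.name == "blockquote":
            lines.append("> " + text)
        else:
            lines.append(text)
    return "\n\n".join(lines)
```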

CLI (forcible.py)

  • New fetch-html command with optional --limit parameter
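With argparse as an assumed CLI layer (the PR does not say which framework forcible.py uses), the subcommand wiring might look like:

```python
import argparse

# Hypothetical sketch of the fetch-html subcommand with its --limit option.
parser = argparse.ArgumentParser(prog="forcible.py")
sub = parser.add_subparsers(dest="command")
fetch_html = sub.add_parser("fetch-html", help="Fetch and extract article content")
fetch_html.add_argument("--limit", type=int, default=None,
                        help="Maximum number of articles to process this run")

args = parser.parse_args(["fetch-html", "--limit", "10"])
```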

Dependencies

  • Added beautifulsoup4>=4.12.0

Token Optimization

The content extraction significantly reduces OpenAI token usage:

  • ~74% reduction vs. HTML with scripts/styles removed (47K → 12K chars)
  • ~90% reduction vs. original raw HTML (120K → 12K chars)
  • ~8,700 tokens saved per article (~11,762 → ~3,025 tokens)

Only article content is preserved - no unnecessary markup, navigation, or page elements.
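The token figures above are consistent with the common rough heuristic of ~4 characters per token. A sketch of that estimate (the heuristic is an approximation, not an exact tokenizer count):

```python
def estimate_tokens(char_count, chars_per_token=4):
    """Rough token estimate using the ~4 chars/token heuristic;
    actual tokenizer counts vary with content."""
    return char_count // chars_per_token

# Approximate figures from the PR description:
# ~47K chars of cleaned HTML vs. ~12K chars of extracted content.
savings = estimate_tokens(47_000) - estimate_tokens(12_000)
```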

Usage

```shell
# Fetch RSS feeds (existing)
python forcible.py fetch

# Fetch and extract article content (new)
python forcible.py fetch-html

# Or fetch a limited batch
python forcible.py fetch-html --limit 10
```

The raw_html column is nullable, so existing databases auto-migrate on first run. Content is extracted and formatted as markdown-style text, ready for LLM processing without wasting tokens on markup.
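One common way to implement such an auto-migration in SQLite, sketched here as an assumption since the PR does not show the migration code:

```python
import sqlite3

def ensure_raw_html_column(conn):
    """Add a nullable raw_html column if an existing database lacks it.
    Illustrative only; the PR states just that existing databases
    auto-migrate on first run."""
    cols = [row[1] for row in conn.execute("PRAGMA table_info(articles)")]
    if "raw_html" not in cols:
        conn.execute("ALTER TABLE articles ADD COLUMN raw_html TEXT")
        conn.commit()
```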

Original prompt

This section details the original issue you should resolve

<issue_title>Fetch article URLs into a folder</issue_title>
<issue_description>The RSS feeds only have headlines. We need the full text.
Ultimately we'll fetch and clean the article HTML.
For you to write the article HTML cleaner, I'll need to fetch articles and make them available to you in a later issue.
So write a new stage that fetches the URLs and stores the raw HTML in the database (no need for CSS or JS).
I'll check in my database and push it back up.</issue_description>

Comments on the Issue (you are @copilot in this section)



Copilot AI and others added 3 commits December 14, 2025 04:24
Co-authored-by: njt <66792+njt@users.noreply.github.com>
Copilot AI changed the title [WIP] Fetch article URLs and store raw HTML in database Add HTML fetching pipeline for article content extraction Dec 14, 2025
Copilot AI requested a review from njt December 14, 2025 04:32
@njt
Owner

njt commented Dec 14, 2025

@copilot I have added https://www.rnz.co.nz to the allowed urls list. Try again.

Contributor Author

Copilot AI commented Dec 14, 2025

@copilot I have added https://www.rnz.co.nz to the allowed urls list. Try again.

Tested successfully! The HTML fetcher is now working with RNZ URLs (www.rnz.co.nz).

Verification results:

  • ✓ Successfully fetches HTML from RNZ articles (~47-57KB per page)
  • ✓ Removes script and style tags as designed
  • ✓ Stores cleaned HTML in database
  • ✓ Properly tracks which articles have HTML

The fetch-html command is ready to use with RNZ articles from the RSS feeds.

@njt
Owner

njt commented Dec 14, 2025

@copilot Don't just strip out style and script tags, take it back to the content -- paragraphs of text from the current article, external links. No point burning OpenAI tokens on unnecessary markup.

Copilot AI and others added 2 commits December 14, 2025 07:07
Co-authored-by: njt <66792+njt@users.noreply.github.com>
… blank lines

Co-authored-by: njt <66792+njt@users.noreply.github.com>
Contributor Author

Copilot AI commented Dec 14, 2025

@copilot Don't just strip out style and script tags, take it back to the content -- paragraphs of text from the current article, external links. No point burning OpenAI tokens on unnecessary markup.

Done. Updated the extraction to keep only article content (paragraphs, headings, lists, external links) while removing all unnecessary markup.

Results:

  • ~74% reduction vs. cleaned HTML (47K → 12K chars)
  • ~90% reduction vs. original HTML (120K → 12K chars)
  • Estimated token savings: ~11,762 → ~3,025 tokens per article
  • Clean markdown-style formatting for LLM processing
  • No navigation, ads, scripts, styles, or other non-content elements

Commits: d974c79, 74f541e

Copilot AI changed the title Add HTML fetching pipeline for article content extraction Add article content extraction pipeline with intelligent text extraction Dec 14, 2025

Development

Successfully merging this pull request may close these issues.

Fetch article URLs into a folder

2 participants