Add article content extraction pipeline with intelligent text extraction#10

Draft
Copilot wants to merge 6 commits into master from copilot/fetch-article-urls-into-folder

Conversation

Contributor

Copilot AI commented Dec 14, 2025

RSS feeds only provide headlines and summaries. This adds a pipeline stage to fetch articles and extract clean content optimized for LLM processing.

Changes

Database schema (database.py)

  • Added raw_html column to articles table (stores extracted content, not raw HTML)
  • Added update_article_html() and get_articles_without_html() methods
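A minimal sketch of the two helper methods, assuming an SQLite backing store. The PR names only the methods and the `articles` table; the `id` and `url` columns and the exact signatures below are assumptions:

```python
import sqlite3

def update_article_html(conn, article_id, content):
    """Store extracted article content in the nullable raw_html column."""
    conn.execute(
        "UPDATE articles SET raw_html = ? WHERE id = ?",
        (content, article_id),
    )
    conn.commit()

def get_articles_without_html(conn, limit=None):
    """Return (id, url) pairs for articles whose content hasn't been fetched."""
    sql = "SELECT id, url FROM articles WHERE raw_html IS NULL"
    if limit is not None:
        return conn.execute(sql + " LIMIT ?", (limit,)).fetchall()
    return conn.execute(sql).fetchall()
```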

HTML fetcher (html_fetcher.py - new module)

  • HTMLFetcher class fetches URLs and intelligently extracts article content
  • Extracts only essential elements: paragraphs, headings, lists, blockquotes, and external links
  • Removes navigation, ads, scripts, styles, headers, footers, and other non-content elements
  • Formats output in markdown-style structure for clean LLM input
  • Batch fetching with progress callbacks
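The extraction approach described above could look roughly like this with BeautifulSoup. This is an illustrative sketch, not the PR's actual code; the tag lists and markdown conventions are assumptions:

```python
from bs4 import BeautifulSoup

def extract_content(html):
    """Strip non-content elements, then keep paragraphs, headings,
    lists, and blockquotes as markdown-style text."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove navigation, scripts, styles, and other non-content elements.
    for tag in soup(["script", "style", "nav", "header", "footer", "aside"]):
        tag.decompose()
    lines = []
    for el in soup.find_all(["h1", "h2", "h3", "p", "li", "blockquote"]):
        # Skip paragraphs already captured by an enclosing blockquote/list item.
        if el.name == "p" and el.find_parent(["blockquote", "li"]):
            continue
        text = el.get_text(" ", strip=True)
        if not text:
            continue
        if el.name in ("h1", "h2", "h3"):
            lines.append("#" * int(el.name[1]) + " " + text)
        elif el.name == "li":
            lines.append("- " + text)
        elif el.name == "blockquote":
            lines.append("> " + text)
        else:
            lines.append(text)
    return "\n\n".join(lines)
```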

CLI (forcible.py)

  • New fetch-html command with optional --limit parameter
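With argparse as an assumed CLI layer (the PR does not say which framework forcible.py uses), the subcommand wiring might look like:

```python
import argparse

# Hypothetical sketch of the fetch-html subcommand with its --limit option.
parser = argparse.ArgumentParser(prog="forcible.py")
sub = parser.add_subparsers(dest="command")
fetch_html = sub.add_parser("fetch-html", help="Fetch and extract article content")
fetch_html.add_argument("--limit", type=int, default=None,
                        help="Maximum number of articles to process this run")

args = parser.parse_args(["fetch-html", "--limit", "10"])
```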

Dependencies

  • Added beautifulsoup4>=4.12.0

Token Optimization

The content extraction significantly reduces OpenAI token usage:

  • ~74% reduction vs. HTML with scripts/styles removed (47K → 12K chars)
  • ~90% reduction vs. original raw HTML (120K → 12K chars)
  • ~8,700 tokens saved per article (~11,762 → ~3,025 tokens)

Only article content is preserved - no unnecessary markup, navigation, or page elements.
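The token figures above are consistent with the common rough heuristic of ~4 characters per token. A sketch of that estimate (the heuristic is an approximation, not an exact tokenizer count):

```python
def estimate_tokens(char_count, chars_per_token=4):
    """Rough token estimate using the ~4 chars/token heuristic;
    actual tokenizer counts vary with content."""
    return char_count // chars_per_token

# Approximate figures from the PR description:
# ~47K chars of cleaned HTML vs. ~12K chars of extracted content.
savings = estimate_tokens(47_000) - estimate_tokens(12_000)
```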

Usage

```shell
# Fetch RSS feeds (existing)
python forcible.py fetch

# Fetch and extract article content (new)
python forcible.py fetch-html

# Or fetch a limited batch
python forcible.py fetch-html --limit 10
```

The raw_html column is nullable, so existing databases auto-migrate on first run. Content is extracted and formatted as markdown-style text, ready for LLM processing without wasting tokens on markup.
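One common way to implement such an auto-migration in SQLite, sketched here as an assumption since the PR does not show the migration code:

```python
import sqlite3

def ensure_raw_html_column(conn):
    """Add a nullable raw_html column if an existing database lacks it.
    Illustrative only; the PR states just that existing databases
    auto-migrate on first run."""
    cols = [row[1] for row in conn.execute("PRAGMA table_info(articles)")]
    if "raw_html" not in cols:
        conn.execute("ALTER TABLE articles ADD COLUMN raw_html TEXT")
        conn.commit()
```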

Original prompt

This section details the original issue you should resolve

<issue_title>Fetch article URLs into a folder</issue_title>
<issue_description>The RSS feeds only have headlines. We need the full text.
Ultimately we'll fetch and clean the article HTML.
For you to write the article HTML cleaner, I'll need to fetch articles and make them available to you in a later issue.
So write a new stage that fetches the URLs and stores the raw HTML in the database (no need for CSS or JS).
I'll check in my database and push it back up.</issue_description>

Comments on the Issue (you are @copilot in this section)



Copilot AI and others added 3 commits December 14, 2025 04:24
Co-authored-by: njt <66792+njt@users.noreply.github.com>
Copilot AI changed the title [WIP] Fetch article URLs and store raw HTML in database Add HTML fetching pipeline for article content extraction Dec 14, 2025
Copilot AI requested a review from njt December 14, 2025 04:32
@njt
Owner

njt commented Dec 14, 2025

@copilot I have added https://www.rnz.co.nz to the allowed urls list. Try again.

Contributor Author

Copilot AI commented Dec 14, 2025

@copilot I have added https://www.rnz.co.nz to the allowed urls list. Try again.

Tested successfully! The HTML fetcher is now working with RNZ URLs (www.rnz.co.nz).

Verification results:

  • ✓ Successfully fetches HTML from RNZ articles (~47-57KB per page)
  • ✓ Removes script and style tags as designed
  • ✓ Stores cleaned HTML in database
  • ✓ Properly tracks which articles have HTML

The fetch-html command is ready to use with RNZ articles from the RSS feeds.

@njt
Owner

njt commented Dec 14, 2025

@copilot Don't just strip out style and script tags, take it back to the content -- paragraphs of text from the current article, external links. No point burning OpenAI tokens on unnecessary markup.

Copilot AI and others added 2 commits December 14, 2025 07:07
Co-authored-by: njt <66792+njt@users.noreply.github.com>
… blank lines

Co-authored-by: njt <66792+njt@users.noreply.github.com>
Contributor Author

Copilot AI commented Dec 14, 2025

@copilot Don't just strip out style and script tags, take it back to the content -- paragraphs of text from the current article, external links. No point burning OpenAI tokens on unnecessary markup.

Done. Updated the extraction to keep only article content (paragraphs, headings, lists, external links) while removing all unnecessary markup.

Results:

  • ~74% reduction vs. cleaned HTML (47K → 12K chars)
  • ~90% reduction vs. original HTML (120K → 12K chars)
  • Estimated token savings: ~11,762 → ~3,025 tokens per article
  • Clean markdown-style formatting for LLM processing
  • No navigation, ads, scripts, styles, or other non-content elements

Commits: d974c79, 74f541e

Copilot AI changed the title Add HTML fetching pipeline for article content extraction Add article content extraction pipeline with intelligent text extraction Dec 14, 2025

Development

Successfully merging this pull request may close these issues.

Fetch article URLs into a folder

2 participants