feat: add Facebook video extractor #38

Merged
stefanodvx merged 1 commit into govdbot:main from eagle1maledetto:feat/facebook-extractor
Mar 13, 2026

@eagle1maledetto
Contributor

Summary

Adds a new extractor for downloading videos from Facebook, supporting all common URL formats:

  • Reels: facebook.com/reel/<id>
  • Videos: facebook.com/videos/<id>, facebook.com/<user>/videos/<id>
  • Watch: facebook.com/watch/?v=<id> (with v= in any query parameter position)
  • Posts: facebook.com/posts/<id>, facebook.com/<user>/posts/<id>
  • Share links: facebook.com/share/r|v|p/<id> (resolved via HTTP redirect)
  • Mobile: m.facebook.com and mbasic.facebook.com variants
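As a rough illustration of the URL shapes listed above, the sketch below matches them with a single pattern and pulls out the ID. The names `videoURLPattern` and `extractID` are hypothetical and the pattern is an assumption, not the one shipped in this PR. Share links are left out because they are resolved separately.

```go
package main

import "regexp"

// videoURLPattern is a hypothetical pattern for the non-share URL shapes.
// The character class used for post IDs is an assumption.
var videoURLPattern = regexp.MustCompile(
	`^https?://(?:www\.|m\.|mbasic\.)?facebook\.com/` +
		`(?:reel/(\d+)` + // /reel/<id>
		`|(?:[^/?#]+/)?videos/(\d+)` + // /videos/<id> or /<user>/videos/<id>
		`|watch/?\?(?:[^#]*&)?v=(\d+)` + // /watch/?v=<id>, v= in any position
		`|(?:[^/?#]+/)?posts/([0-9A-Za-z]+))`) // /posts/<id> or /<user>/posts/<id>

// extractID returns the ID captured by whichever alternative matched.
func extractID(rawURL string) (string, bool) {
	m := videoURLPattern.FindStringSubmatch(rawURL)
	if m == nil {
		return "", false
	}
	for _, group := range m[1:] {
		if group != "" {
			return group, true
		}
	}
	return "", false
}
```

Calling `extractID` on a watch URL whose `v=` parameter is not first still yields the ID, because the optional `(?:[^#]*&)?` group absorbs any preceding parameters.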

How it works

The extractor fetches the Facebook page HTML with a Chrome-like User-Agent and auth cookies. The cookies are required and are loaded automatically from private/cookies/facebook.txt in Netscape format, the same mechanism the Twitter extractor already uses.

Video URLs are extracted from the embedded JSON data in the HTML by matching progressive_url entries with "quality": "HD" or "quality": "SD" metadata. Both HD and SD formats are provided when available, letting the downstream format selector choose the best quality.
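A minimal sketch of this kind of extraction follows. It assumes each entry in the embedded JSON looks like `{"progressive_url":"<url>", ..., "metadata":{"quality":"HD"}}` with JSON-escaped URLs. That shape, the pattern, and the names `progressiveRe`, `unescapeJSONURL` and `extractProgressive` are illustrative assumptions rather than the code in util.go.

```go
package main

import (
	"regexp"
	"strings"
)

// progressiveRe assumes the quality metadata follows the URL inside the same
// JSON object, with no nested braces in between (an assumption).
var progressiveRe = regexp.MustCompile(
	`"progressive_url":"([^"]+)"[^{}]*"metadata":\{"quality":"(HD|SD)"\}`)

// unescapeJSONURL undoes two escapes commonly found in URLs embedded in JSON:
// "\/" for "/" and "\u0026" for "&".
func unescapeJSONURL(s string) string {
	s = strings.ReplaceAll(s, `\/`, `/`)
	return strings.ReplaceAll(s, `\u0026`, `&`)
}

// extractProgressive maps each quality label ("HD", "SD") to the first URL
// found for it in body.
func extractProgressive(body string) map[string]string {
	out := map[string]string{}
	for _, m := range progressiveRe.FindAllStringSubmatch(body, -1) {
		quality := m[2]
		if _, seen := out[quality]; !seen {
			out[quality] = unescapeJSONURL(m[1])
		}
	}
	return out
}
```

Returning both qualities in a map leaves the choice to the caller, which mirrors the idea of handing HD and SD to a downstream format selector.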

Video-ID-anchored extraction

Facebook pages embed data for multiple videos in a single HTML response (feed recommendations, related content, etc.). A naive global regex match on progressive_url would return the first occurrence, which often belongs to a different video than the one requested.

To solve this, the findVideoSection() function scopes the regex search to the correct video's data block:

  1. Locates the dash_mpd_debug.mpd?v=<videoID> anchor — unique per video within the videoDeliveryResponseResult JSON structure
  2. Bounds the search to the section ending at "id":"<videoID>"
  3. Applies HD/SD extraction regex only within this scoped section
  4. Falls back to full-body search for single-video pages (e.g., direct reel links)
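The four steps above can be sketched roughly as follows. This is not the PR's `findVideoSection()`: the name `scopedSection`, its signature, and the use of `strings` instead of `bytes` are assumptions made to keep the example short.

```go
package main

import "strings"

// scopedSection returns the part of body between the dash_mpd_debug anchor
// for videoID and the next `"id":"<videoID>"` marker. The boolean reports
// whether scoping succeeded. When either marker is missing, the full body is
// returned so the caller can fall back to a global search.
func scopedSection(body, videoID string) (string, bool) {
	anchor := "dash_mpd_debug.mpd?v=" + videoID
	start := strings.Index(body, anchor)
	if start < 0 {
		return body, false // no anchor: likely a single-video page
	}
	rest := body[start+len(anchor):]
	end := strings.Index(rest, `"id":"`+videoID+`"`)
	if end < 0 {
		return body, false // no closing marker: fall back to the full body
	}
	return rest[:end], true
}
```

The HD/SD regex would then run only on the returned section. One caveat of this simplified version is that the anchor is matched as a plain substring, so an ID that is a prefix of another ID could match the wrong block.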

Watch URL normalization

When scraped, /watch/?v=XXX pages return HTML whose embedded video data is inconsistent or misleading. These URLs are therefore normalized to /reel/XXX before fetching, which yields reliable single-video pages. The rewrite is transparent to the user: they submit a watch URL and get the correct video.
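One way to express this normalization with the standard library is sketched below. The function name `normalizeWatchURL` is hypothetical, and the actual helper in the PR may work differently (for example, on the already-extracted ID).

```go
package main

import (
	"net/url"
	"strings"
)

// normalizeWatchURL rewrites /watch/?v=<id> (or /watch?v=<id>) to /reel/<id>
// and drops the query string. Any other URL is returned unchanged.
func normalizeWatchURL(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	id := u.Query().Get("v")
	if strings.TrimSuffix(u.Path, "/") != "/watch" || id == "" {
		return raw, nil
	}
	u.Path = "/reel/" + id
	u.RawQuery = ""
	return u.String(), nil
}
```

Because the ID is read through `u.Query()`, the position of `v=` among the query parameters does not matter.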

Share link handling

ShareExtractor is registered before the main Extractor to match /share/r|v|p/<id> URLs first. It follows the HTTP redirect to the canonical Facebook URL, then returns it for re-matching against the main extractor pattern.
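The redirect step could look roughly like the sketch below, which issues a single request without following redirects and reads the `Location` header. Whether the real `ShareExtractor` inspects one hop or lets the client follow the whole chain is not stated here, so treat this as an assumption. `resolveShareLink` and `resolveLocation` are hypothetical names.

```go
package main

import (
	"errors"
	"net/http"
	"net/url"
)

// resolveLocation resolves a Location header value, which may be relative,
// against the URL that was requested.
func resolveLocation(requestURL, location string) (string, error) {
	base, err := url.Parse(requestURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(location)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(ref).String(), nil
}

// resolveShareLink performs one GET without following redirects and returns
// the redirect target. A nil client falls back to http.DefaultClient.
func resolveShareLink(client *http.Client, shareURL string) (string, error) {
	if client == nil {
		client = http.DefaultClient
	}
	c := *client // shallow copy so the caller's client is not modified
	c.CheckRedirect = func(*http.Request, []*http.Request) error {
		return http.ErrUseLastResponse // stop at the first response
	}
	resp, err := c.Get(shareURL)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	loc := resp.Header.Get("Location")
	if resp.StatusCode < 300 || resp.StatusCode > 399 || loc == "" {
		return "", errors.New("share link did not redirect")
	}
	return resolveLocation(shareURL, loc)
}
```

The returned URL can then be matched again against the main extractor's pattern, as described above.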

Files

| File | Description |
| --- | --- |
| internal/extractors/facebook/main.go | ShareExtractor + Extractor definitions, GetMedia(), buildMedia() |
| internal/extractors/facebook/models.go | VideoData struct (HD/SD URLs, title, dimensions) |
| internal/extractors/facebook/util.go | GetVideoData(), parseVideoFromBody(), findVideoSection(), URL/Unicode unescape helpers |
| internal/extractors/main.go | Registration of both extractors in the global Extractors slice |
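For orientation, a struct holding HD/SD URLs, a title, and dimensions might look like the sketch below. The field names, the `format` type, and the `formats` method are all hypothetical. The real `VideoData` in models.go may be laid out differently.

```go
package main

// videoData is a hypothetical stand-in for the VideoData struct.
type videoData struct {
	HDURL  string
	SDURL  string
	Title  string
	Width  int
	Height int
}

// format pairs a quality label with its download URL.
type format struct {
	Quality string
	URL     string
}

// formats lists the available qualities, HD first, skipping any that are
// missing so a downstream selector only sees usable entries.
func (v videoData) formats() []format {
	var out []format
	if v.HDURL != "" {
		out = append(out, format{Quality: "HD", URL: v.HDURL})
	}
	if v.SDURL != "" {
		out = append(out, format{Quality: "SD", URL: v.SDURL})
	}
	return out
}
```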

Compliance with project standards

  • File structure: follows the standard 3-file extractor pattern (main.go, models.go, util.go) used by all existing extractors
  • Cookie handling: uses the existing private/cookies/<id>.txt Netscape format mechanism (same as Twitter)
  • Extractor registration: added to internal/extractors/main.go with ShareExtractor before Extractor (order matters for URL matching)
  • Commit style: Conventional Commits (feat: ...)
  • No new dependencies: uses only bytes, regexp, strings, and existing project packages
  • No tests: consistent with the rest of the codebase (no *_test.go files exist in the project)
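To illustrate why registration order matters, here is a small first-match dispatcher. The `extractor` type, the IDs, and both patterns are assumptions made for the example and do not reflect the project's actual types. The point is only that a broad pattern placed first would shadow the share pattern.

```go
package main

import "regexp"

// extractor is a hypothetical, stripped-down registry entry.
type extractor struct {
	ID      string
	Pattern *regexp.Regexp
}

// extractors lists the share extractor before the general one, so /share/
// URLs are claimed before the broader pattern gets a chance to match.
var extractors = []extractor{
	{ID: "facebook_share", Pattern: regexp.MustCompile(`facebook\.com/share/[rvp]/[^/?#]+`)},
	{ID: "facebook", Pattern: regexp.MustCompile(`facebook\.com/.+`)},
}

// matchExtractor returns the ID of the first entry whose pattern matches.
func matchExtractor(list []extractor, rawURL string) (string, bool) {
	for _, e := range list {
		if e.Pattern.MatchString(rawURL) {
			return e.ID, true
		}
	}
	return "", false
}
```

With the order reversed, a share URL would be handed to the general entry, which is the situation the registration order avoids.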

Configuration

Requires a private/cookies/facebook.txt file with valid Facebook session cookies in Netscape format. The following cookies are needed: datr, sb, c_user, fr, xs.
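A quick way to sanity-check a cookie file before deploying is sketched below. It parses the Netscape format (seven tab-separated fields per line: domain, subdomain flag, path, secure flag, expiry, name, value) and reports which of the five cookies listed above are missing. The project loads cookies through its own mechanism, so `parseNetscapeCookies` and `missingCookies` are hypothetical helpers, not part of this PR.

```go
package main

import "strings"

// requiredCookies are the cookie names listed in the Configuration section.
var requiredCookies = []string{"datr", "sb", "c_user", "fr", "xs"}

// parseNetscapeCookies returns name -> value for every well-formed line.
// Comment lines start with "#". The "#HttpOnly_" prefix written by some
// tools marks a real cookie line and is stripped rather than skipped.
func parseNetscapeCookies(text string) map[string]string {
	cookies := map[string]string{}
	for _, line := range strings.Split(text, "\n") {
		line = strings.TrimSpace(line)
		line = strings.TrimPrefix(line, "#HttpOnly_")
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		fields := strings.Split(line, "\t")
		if len(fields) != 7 {
			continue // malformed line
		}
		cookies[fields[5]] = fields[6]
	}
	return cookies
}

// missingCookies lists the required cookies that are absent or empty.
func missingCookies(have map[string]string) []string {
	var missing []string
	for _, name := range requiredCookies {
		if have[name] == "" {
			missing = append(missing, name)
		}
	}
	return missing
}
```

Running `missingCookies(parseNetscapeCookies(fileContents))` on the contents of the cookie file gives an empty result when all five cookies are present.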

Enabling TLS fingerprint impersonation (impersonate: true) in the extractor YAML config is also recommended, to avoid detection by Facebook's bot protection.

Add a new extractor for Facebook video URLs, supporting:
- /reel/<id> (Reels)
- /videos/<id> and /video/<id>
- /watch/?v=<id> (Watch, with any query parameter order)
- /posts/<id> and /<user>/videos/<id> variants
- /share/r|v|p/<id> (Share short links, resolved via redirect)
- mobile (m.facebook.com) and mbasic variants

## Architecture

Follows the standard extractor structure (`main.go`, `models.go`,
`util.go`) consistent with existing extractors like TikTok, Twitter,
and Instagram.

### Files

- `main.go`: defines two extractors:
  - `ShareExtractor` — handles /share/ URLs by following the redirect
    to the canonical Facebook URL, then delegating to the main extractor.
  - `Extractor` — handles all other video URL patterns. Requires auth
    cookies (loaded automatically from `private/cookies/facebook.txt`
    in Netscape format, same mechanism used by Twitter).

- `models.go`: minimal `VideoData` struct holding HD/SD URLs, title,
  and dimensions.

- `util.go`: core scraping logic. Fetches the Facebook page HTML and
  extracts video URLs from the embedded JSON data using regex patterns
  matching `progressive_url` entries with HD/SD quality metadata.

### Key design decisions

**Video-ID-anchored extraction**: Facebook pages embed multiple videos
(feed recommendations, related content) in a single HTML response.
A naive global regex match on `progressive_url` returns the FIRST
occurrence, which often belongs to a different video. The
`findVideoSection()` function solves this by:
1. Locating the `dash_mpd_debug.mpd?v=<videoID>` anchor specific to
   the requested video's delivery data block.
2. Bounding the search to the section ending at `"id":"<videoID>"`.
3. Applying HD/SD regex only within this scoped section.
4. Falling back to full-body search for pages with a single video.

**Watch URL normalization**: `/watch/?v=XXX` pages return inconsistent
video data when scraped. These URLs are converted to `/reel/XXX` which
yields reliable, single-video pages.

## Compliance

- Standard 3-file extractor structure (main.go, models.go, util.go)
- Cookie loading via existing `private/cookies/<id>.txt` mechanism
- Registered in `internal/extractors/main.go` (ShareExtractor first
  to match /share/ URLs before the general pattern)
- Conventional Commits style
- No new dependencies
stefanodvx merged commit c49232d into govdbot:main on Mar 13, 2026
3 checks passed
eagle1maledetto deleted the feat/facebook-extractor branch on March 13, 2026, 12:18