Intelligent tiered web scraping. A single Go binary that routes each URL through the cheapest engine that returns valid content, remembers which tier worked per host, persists the frontier so long crawls survive crashes, and produces clean markdown + structured extraction ready for downstream pipelines.
No API key. No runtime dependency. No Docker required. `go install` and you're done.
```sh
# One URL, clean markdown out, page metadata auto-extracted
trawl scrape https://linear.app --format markdown --readability

# Batch of URLs from a file, resumable, CSV output
trawl batch urls.txt \
  --selector "title=h1" --selector "price=.price" \
  --output results.csv

# BFS-crawl an entire site, clean markdown from every page
trawl crawl https://example.com \
  --depth 2 --same-domain --limit 500 \
  --format markdown --readability \
  --output site.jsonl

# Enumerate every URL a site publishes (sitemap + HTML crawl, deduped)
trawl map https://example.com > urls.txt
trawl batch urls.txt --output results.jsonl

# Structured extraction via YAML schema (nested fields, arrays of objects)
trawl scrape https://plato.stanford.edu/entries/kant/ \
  --schema docs/examples/sep-article.yaml \
  --format markdown --readability

# Iterate on selectors cheaply: opt-in content cache, 24h TTL
trawl batch urls.txt --selector "title=h1" --cache -o results.jsonl

# Full-page PNG screenshots for chromium-served pages
trawl scrape https://linear.app --tiers chromium --screenshot-dir shots/

# Per-host politeness: slow-crawl SEP, normal speed elsewhere
trawl batch mixed-urls.txt --politeness docs/examples/politeness.yaml

# CSV with hybrid discovery: try pricing_url first, fall back to
# homepage + link follow on http_4xx / dns_failure
trawl batch companies.csv \
  --url-column pricing_url \
  --fallback-column homepage \
  --fallback-selector 'a[href*="pricing"]'

# Resume a job that was SIGINT'd mid-crawl
trawl resume <job-id>
```

One-liner (darwin/linux, amd64/arm64 — installs to `~/.local/bin`):
```sh
curl -fsSL https://raw.githubusercontent.com/jeffdhooton/trawl/main/scripts/install.sh | sh
```

The installer pulls the latest tagged release, verifies the SHA256 checksum, extracts the binary to `INSTALL_DIR` (default `~/.local/bin`), and prints a PATH advisory if needed. Customize via `TRAWL_VERSION=vX.Y.Z` to pin, `INSTALL_DIR=/usr/local/bin` to relocate, or `TRAWL_REPO=your/fork` to install from a fork.

Via the Go toolchain (builds from source, keeps the binary's version string as `dev`):

```sh
go install github.com/jeffdhooton/trawl/cmd/trawl@latest
```

From a clone (for development):

```sh
git clone https://github.com/jeffdhooton/trawl.git
cd trawl
go build ./cmd/trawl
```

Trawl has no CGO dependencies and produces a single static binary suitable for copying onto a $5 VPS.
- **Tiered routing.** Each URL starts at HTTP (`net/http` + `goquery`); if the response is invalid (SPA shell, timeout, etc.) the router escalates to Chromium (`chromedp`). Invalid-final responses (404, DNS failure) short-circuit — chromium can't help there either.
- **HTTP retries with backoff.** Transient failures (429, 5xx, connection resets, timeouts) retry automatically with exponential backoff and ±25% jitter, capped at 10s. Permanent failures (4xx except 429, TLS cert errors, context cancellation) return immediately. `--retries` and `--retry-delay` tune the policy; chromium does not retry in v1.
- **Persistent frontier.** BadgerDB-backed queue survives SIGINT, machine restarts, and mid-crawl crashes. `trawl resume <job-id>` picks up where it left off.
- **Tier learning.** After the first run, trawl remembers which tier served each host successfully. Subsequent crawls skip the HTTP tier on known-SPA hosts, saving wasted work. Cache lives at `$TRAWL_HOME/tier-cache` and is cross-job by design.
- **BFS crawl mode.** `trawl crawl <seed> --depth N --same-domain --limit N` walks a site breadth-first, respecting a depth cap and a hard cap on URLs enqueued. Link discovery uses the same tiered pipeline as scrape, so SPA pages get chromium-rendered before their links are extracted. Composes with `--format markdown` to become "give me clean content from an entire site."
- **URL mapping.** `trawl map <url>` combines sitemap parsing with lightweight HTML link discovery (HTTP tier only) to produce a deduped list of URLs on stdout. Fast, composable with `trawl batch`.
- **Content extraction.** `--format markdown` runs an HTML→markdown converter; `--readability` strips nav/footer/ads before conversion (or before CSS extraction). Every record gets automatic page metadata: title, description, canonical URL, language, Open Graph, Twitter cards, JSON-LD structured data, published date.
- **Schema extraction.** `--schema sep-article.yaml` runs a nested, declarative extraction against the page. Supports `selector`, `attr`, `multiple` (arrays of strings or objects), and empty-selector self-reference inside nested contexts (needed for "array of objects where each object is a link's text + href"). YAML or JSON; the strict parser catches typos. See `docs/examples/sep-article.yaml` for a working Stanford Encyclopedia of Philosophy schema.
- **Screenshots.** `--screenshot-dir dir/` captures a full-page PNG for every chromium-served row via chromedp's `captureBeyondViewport`. Deterministic filenames (`<sha256-of-url>.png`) so re-runs overwrite rather than accumulate. HTTP-served rows leave `metadata.screenshot_path` empty — zero cost on the happy path.
- **Content cache.** `--cache` opts into a persistent content cache keyed by canonical URL + tier. A cache hit short-circuits the tier loop without touching the live site. Default TTL 24h (tunable via `--cache-ttl`). `metadata.from_cache: true` on replay rows so downstream can distinguish live from cached. Lives at `$TRAWL_HOME/content-cache`.
- **Hybrid discovery.** CSV seeds can declare a primary URL and a fallback URL per row. When the primary fails with `http_4xx` or `dns_failure`, trawl re-routes through the fallback URL + a CSS selector to resolve the target, recovering rows that neither path alone would get.
- **Sitemap discovery.** `trawl sitemap <url>` fetches robots.txt, extracts `Sitemap:` directives, falls back to well-known paths, recursively walks `<sitemapindex>` files, handles gzip, dedupes URLs — stream-friendly for large sites.
- **Polite by default.** robots.txt, per-domain rate limits, and concurrency caps are opt-out, not opt-in. `--politeness <file.yaml>` overrides rate and concurrency for specific hosts via exact or `*.suffix` wildcard match (see `docs/examples/politeness.yaml`). `--ignore-robots` exists but logs a warning.
- **CSV / TSV output.** Extension-sniffed: `-o results.csv` or `-o results.tsv` writes flattened rows instead of JSONL. Default columns are a stable base set plus auto-discovered `extracted.*` keys from the first record; `--csv-columns` overrides with explicit dot-paths. Nested values get JSON-encoded inline so cells stay single-valued.
- **Failure classification + stats.** Every job emits a `stats.json` with reachable/unreachable counts, per-category failure breakdown, per-tier latency, chromium escalation rate, and fallback yield.
| Command | What it does |
|---|---|
| `trawl scrape` | Scrape one URL, emit one record |
| `trawl batch` | Scrape a URL list (plain text, CSV, or TSV), resumable |
| `trawl crawl` | BFS-crawl a site from a seed URL |
| `trawl map` | Enumerate URLs from a site (sitemap + HTML crawl) |
| `trawl sitemap` | Discover URLs from a site's sitemap(s) only |
| `trawl resume` | Resume an interrupted job by ID |
| `trawl version` | Print version info |

Run `trawl <command> --help` for the full flag surface.
Every record is one line of JSONL (or one CSV row). The stable fields:

```json
{
  "url": "https://example.com/pricing",
  "canonical_url": "https://example.com/pricing",
  "fetched_at": "2026-04-10T22:49:41Z",
  "tier": "http",
  "status_code": 200,
  "duration_ms": 384,
  "content_hash": "sha256:...",
  "extracted": {
    "title": "...",
    "plans": [
      { "name": "Free", "price": "$0" },
      { "name": "Plus", "price": "$10" }
    ]
  },
  "body": "# Pricing\n\n...",
  "body_format": "markdown",
  "metadata": {
    "content_type": "text/html; charset=utf-8",
    "body_bytes": 24815,
    "final_url": "https://example.com/pricing",
    "screenshot_path": "/tmp/shots/<sha256>.png",
    "from_cache": false,
    "page": {
      "title": "...",
      "description": "...",
      "canonical": "https://example.com/pricing",
      "language": "en",
      "published_at": "2026-02-14T09:30:00Z",
      "open_graph": { "title": "...", "image": "..." },
      "json_ld": [ { "@type": "Article", ... } ]
    }
  },
  "failure_category": "success"
}
```

Which fields are populated:

- `body` / `body_format` only when `--format` is set.
- `extracted` when `--selector` and/or `--schema` ran successfully.
- `metadata.page` unless `--no-metadata`.
- `metadata.screenshot_path` only when `--screenshot-dir` is set AND the chromium engine served the row.
- `metadata.from_cache: true` only on cache replay rows.
Failed records still get written, with `failure_category` set to one of the classified buckets (`http_4xx`, `dns_failure`, `tls_error`, `spa_shell`, etc.) — easy to jq-filter for the real failures.
See `docs/examples/`:

- `sep-article.yaml` — nested schema for Stanford Encyclopedia of Philosophy entries (title, pubinfo, TOC, related entries, author, copyright). Verified against real `plato.stanford.edu/entries/kant/` during development.
- `politeness.yaml` — per-host rate and concurrency overrides with exact and `*.suffix` wildcard match.
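For orientation, a politeness override file might look like the sketch below. The key names and units here are guesses, not the documented format — consult `docs/examples/politeness.yaml` for the real one; only the host-matching behavior (exact and `*.suffix` wildcard) is documented above:

```yaml
# Hypothetical shape only — key names and units are illustrative.
hosts:
  plato.stanford.edu:      # exact host match: crawl SEP slowly
    rate: 1
    concurrency: 1
  "*.example.com":         # wildcard suffix match
    rate: 5
    concurrency: 4
```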
- `docs/ROADMAP.md` — current phase, strategic gap analysis, in-scope/out-of-scope list. Start here.
- `docs/SPEC.md` — the original PRD. Architecture, non-goals, technology choices. Historical design reference; ROADMAP is the live source of truth.
- `docs/DECISIONS.md` — log of architectural calls that deviate from SPEC, each with the data that drove the decision.
- `docs/BENCHMARK.md` — operational playbook for running Phase 0 against the 7000-company test dataset.
- `docs/EVASION.md` — anti-detection / stealth design doc. Tiered opt-in model for sites that fight back, with explicit refusals for CAPTCHA solvers, credential bypass, and other identity-changing features.
- `docs/RELEASING.md` — operational checklist for cutting a new release. Semver policy, the full pre-flight → tag → workflow → smoke-test procedure, and common gotchas.
- `docs/TODO.md` — standing commitments and open papercuts.
Trawl is a general-purpose CLI for content extraction and tiered scraping. It is not a managed service, not an LLM framework, not a proxy-rotation toolkit, not a distributed crawler. For the list of things deliberately kept out of scope — LLM-based extraction, search integration, interactive actions, webhooks, pricing-aware logic — see `docs/ROADMAP.md`'s "explicitly deferred or out of scope" section.
Trawl is also not a bypass tool. For sites that actively fight back against automated traffic, trawl's design intent is a tiered opt-in evasion model (realistic browser headers, chromium stealth patches, optional TLS fingerprint forgery) with explicit refusals for CAPTCHA-solving services, credential-based auth bypass, and DoS-level rate patterns. See `docs/EVASION.md` for the considered stance — that doc locks in the principled shape before any of it ships, so the eventual implementation preserves trawl's polite-by-default identity rather than drifting into an arms-race tool.
The Unix-pipeline answer for LLM extraction is: pipe `trawl scrape ... --format markdown` into whatever LLM tool you prefer. Trawl's job ends at clean content.
TBD.