recrawl

A web crawler written in Go

Installation

Go

go install github.com/root4loot/recrawl@master

Docker

git clone https://github.com/root4loot/recrawl.git && cd recrawl
docker build -t recrawl .
docker run -it recrawl -h

Usage

Usage:
  recrawl [options] (-t <target> | -i <targets.txt>)

TARGETING:
  -i, --infile            file containing targets                (one per line)
  -t, --target            target domain/url                      (comma-separated)
  -ih, --include-host     also crawls this host (if found)       (comma-separated)
  -eh, --exclude-host     do not crawl this host (if found)      (comma-separated)

CONFIGURATIONS:
  -c, --concurrency       number of concurrent requests          (Default: 20)
  -to, --timeout          max request timeout                    (Default: 10 seconds)
  -d, --delay             delay between requests                 (Default: 0 milliseconds)
  -dj, --delay-jitter     max jitter between requests            (Default: 0 milliseconds)
  -ua, --user-agent       set user agent                         (Default: Mozilla/5.0)
  -fr, --follow-redirects follow redirects                       (Default: true)
  -p, --proxy             set proxy                              (Default: none)
  -r, --resolvers         file containing list of resolvers      (Default: System DNS)
  -H, --header            set custom header                      (Default: none)
  -ph, --prefer-http      prefer HTTP over HTTPS for targets     (Default: false)
  -mp, --mine-params      mine HTTP parameters from responses    (Default: false)
  -ed, --enable-discovery enable web discovery fuzzing           (Default: false)

OUTPUT:
  -fs, --filter-status    filter by status code                  (comma-separated)
  -fe, --filter-ext       filter by extension                    (comma-separated)
  -v, --verbose           verbose output                         (use -vv for added verbosity)
  -o, --outfile           output results to given file
  -hs, --hide-status      hide status code from output
  -hw, --hide-warning     hide warnings from output
  -hm, --hide-media       hide media from output (images, fonts, etc.)
  -s, --silence           silence results from output
  -h, --help              display help
      --version           display version

Parameter Mining

When enabled with -mp/--mine-params, recrawl mines likely HTTP parameters from:

  • URL queries (e.g., ?q=term&page=2)
  • HTML forms (<input>, <textarea>, <select>)
  • JavaScript request bodies in fetch, XMLHttpRequest, and jQuery $.post/$.ajax
  • HTML data-* attributes
  • Hidden inputs
  • GraphQL variable declarations
  • WebSocket URLs (query string)

Parameters are grouped by certainty: high (URL/query + form fields), medium (JS/data/hidden/meta), low (reserved for weaker heuristics). Common non-parameters like class, style, etc. are ignored.

Notes:

  • The parameter summary prints after the crawl finishes.
  • Parameter names must be simple identifiers (letters, numbers, _, -) and start with a letter or underscore.
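
The identifier rule above is straightforward to express as a regular expression. Below is a minimal sketch of how such a check, combined with the non-parameter filter, could look; the names paramName, ignored, and isLikelyParam are illustrative, not recrawl's internals:

package main

import (
    "fmt"
    "regexp"
)

// paramName encodes the identifier rule from the notes above: a leading
// letter or underscore, followed by letters, digits, underscores, or hyphens.
var paramName = regexp.MustCompile(`^[A-Za-z_][A-Za-z0-9_-]*$`)

// ignored lists common non-parameters; class and style are named above,
// the full list is internal to recrawl.
var ignored = map[string]bool{"class": true, "style": true}

func isLikelyParam(name string) bool {
    return paramName.MatchString(name) && !ignored[name]
}

func main() {
    for _, c := range []string{"q", "page", "_token", "user-id", "2fast", "class"} {
        fmt.Printf("%-8s %v\n", c, isLikelyParam(c))
    }
}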

Web Discovery

When enabled with -ed/--enable-discovery, recrawl performs web discovery fuzzing after the normal crawl. It uses a curated 1022-entry wordlist containing:

  • Common files and directories (from SecLists raft-small)
  • API endpoints (api/, api/auth, etc.)
  • NPM/dev paths (package.json, node_modules, .env, etc.)
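
Conceptually, the discovery pass appends each wordlist entry to a crawled base URL and probes it, reporting paths that respond. A minimal sketch of that idea, with a hypothetical target and a few illustrative wordlist entries (recrawl's actual fuzzing and wordlist handling are internal):

package main

import (
    "fmt"
    "net/http"
    "time"
)

// A few entries of the kind described above; recrawl ships a
// curated 1022-entry wordlist.
var words = []string{"api/", "api/auth", "package.json", ".env"}

func main() {
    client := &http.Client{Timeout: 10 * time.Second}
    base := "https://example.com/" // hypothetical target

    for _, w := range words {
        resp, err := client.Get(base + w)
        if err != nil {
            continue
        }
        resp.Body.Close()
        if resp.StatusCode != http.StatusNotFound {
            fmt.Println(resp.StatusCode, base+w)
        }
    }
}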

Scope Behavior

  • If no includes are defined, everything is in scope unless explicitly excluded.
  • If includes are defined, only those targets are in scope; everything else is out of scope by default.
  • Exclusions always take priority over inclusions.
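
These three rules reduce to a short decision function. Here is a sketch of the logic restated in plain Go; recrawl's real scope handling lives in the github.com/root4loot/scope package used in the library example below:

package main

import "fmt"

// inScope restates the rules above: exclusions always win; with no
// includes defined, everything else is in scope; with includes defined,
// only those hosts are in scope.
func inScope(host string, includes, excludes map[string]bool) bool {
    if excludes[host] {
        return false
    }
    if len(includes) == 0 {
        return true
    }
    return includes[host]
}

func main() {
    includes := map[string]bool{"example.com": true}
    excludes := map[string]bool{"foo.example.com": true}
    for _, h := range []string{"example.com", "foo.example.com", "example.net"} {
        fmt.Println(h, inScope(h, includes, excludes))
    }
}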

Example

# Crawl *.example.com
➜ recrawl -t example.com

# Crawl *.example.com and IP address
➜ recrawl -t example.com,103.196.38.38

# Crawl all hosts in given file
➜ recrawl -i targets.txt

# Crawl *.example.com and also include *.example2.com if found
➜ recrawl -t example.com -ih example2.com

# Crawl *.example.com and include any discovered host containing the word example
➜ recrawl -t example.com -ih example

# Crawl *.example.com but avoid foo.example.com
➜ recrawl -t example.com -eh foo.example.com

# Only crawl hosts within explicit scope
➜ recrawl -t example.com -ih example.com,example.net

# Crawl and output mined params
➜ recrawl -t "https://example.org/?q=abc&page=1" --mine-params

Other ways to set target

Pipe the target URL:

echo hackerone.com | recrawl

Pipe a file containing targets:

cat targets.txt | recrawl

Use the -i option to provide a file with targets:

➜ recrawl -i targets.txt

Example run

Running recrawl against hackerone.com and filtering for JavaScript files:

➜ recrawl -t hackerone.com --filter-ext js

This will crawl hackerone.com and print only JavaScript files. Sample output:

[recrawl] (INF) Included extensions: js
[recrawl] (INF) Concurrency: 20
[recrawl] (INF) Timeout: 10 seconds
[recrawl] (RES) 200 https://www.hackerone.com/sites/default/files/js/js_EOrKavGmjAkpIaCW_cpGJ240OpVZev_5NI-WGIx5URg.js
[recrawl] (RES) 200 https://www.hackerone.com/sites/default/files/js/js_5JbqBIuSpSQJk1bRx1jnlE-pARPyPPF5H07tKLzNC80.js
[recrawl] (RES) 200 https://www.hackerone.com/sites/default/files/js/js_a7_tjanmGpd_aITZ38ofV8QT2o2axkGnWqPwKna1Wf0.js
[recrawl] (RES) 200 https://www.hackerone.com/sites/default/files/js/js_xF9mKu6OVNysPMy7w3zYTWNPFBDlury_lEKDCfRuuHs.js
[recrawl] (RES) 200 https://www.hackerone.com/sites/default/files/js/js_coYiv6lRieZN3l0IkRYgmvrMASvFk2BL-jdq5yjFbGs.js
[recrawl] (RES) 200 https://www.hackerone.com/sites/default/files/js/js_Z1eePR_Hbt8TCXBt3JlFoTBdW2k9-IFI3f96O21Dwdw.js
[recrawl] (RES) 200 https://www.hackerone.com/sites/default/files/js/js_LEbRIvnUToqIQrjG9YpPgaIHK6o77rKVGouOaWLGI5k.js
[recrawl] (RES) 200 https://www.hackerone.com/sites/default/files/js/js_ol7H2KkxPxe7E03XeuZQO5qMcg0RpfSOgrm_Kg94rOs.js
[recrawl] (RES) 200 https://www.hackerone.com/sites/default/files/js/js_p5BLPpvjnAGGBCPUsc4EmBUw9IUJ0jMj-QY_1ZpOKG4.js
[recrawl] (RES) 200 https://www.hackerone.com/sites/default/files/js/js_V5P0-9GKw8QQe-7oWrMD44IbDva6o8GE-cZS7inJr-g.js
...

Results are written to stdout and can be piped to other tools:

➜ recrawl -t hackerone.com --hide-status --filter-ext js | cat

Or saved to a specified file:

➜ recrawl -t hackerone.com --hide-status --filter-ext js -o results.txt

As a library

go get -u github.com/root4loot/recrawl

package main

import (
    "fmt"

    "github.com/root4loot/recrawl/pkg/recrawl"
    "github.com/root4loot/scope"
)

func main() {
    opts := recrawl.NewOptions().WithDefaults()
    s := scope.NewScope()

    _ = s.AddInclude("sub.example.com")     // also follow links here
    _ = s.AddExclude("support.example.com") // but don't follow links here

    opts.Scope = s
    opts.Concurrency = 2
    opts.Timeout = 10
    opts.Resolvers = []string{"8.8.8.8", "208.67.222.222"}
    opts.UserAgent = "recrawl"
    opts.MineParams = true // enable parameter mining

    r := recrawl.NewRecrawlWithOptions(opts)

    // Process results as they come in.
    go func() {
        for result := range r.Results {
            fmt.Println(result.StatusCode, result.RequestURL, result.Error)
        }
    }()

    // Single run.
    r.Run("example.com")

    // Print mined parameters, if any.
    if jsonStr, err := r.ParamMiner.ToJSON(); err == nil && jsonStr != "" {
        fmt.Println("Parameters:")
        fmt.Println(jsonStr)
    }
}
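
In this pattern, results are read from r.Results in a goroutine started before Run, so output streams in as the crawl progresses rather than buffering until the end; the parameter summary prints once Run returns.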

Contributing

Contributions are very welcome. See CONTRIBUTING.md
