Add resilient multi-engine dork scanner by ChrisAdamsdevelopment · Pull Request #2 · ChrisAdamsdevelopment/byob-2025

ChrisAdamsdevelopment · 2026-01-06T15:52:39Z

Summary

add a standalone multi-engine dork scanning utility with SQLi/XSS heuristics
improve result parsing to respect CSS selectors and update progress after each request batch
include export, scoring, and validation helpers for discovered targets

Testing

python -m compileall utils/dork_scanner.py

Summary by Sourcery

Add a standalone asynchronous multi-engine dork scanning utility for SQLi/XSS-oriented OSINT discovery with scoring, caching, and export capabilities.

New Features:

Introduce a CLI-based multi-search-engine dork scanner that aggregates results from Google, Bing, DuckDuckGo, and Yandex.
Add heuristic scoring and helper command generation for SQLi/XSS-focused targets discovered via dork searches.
Support optional domain filtering, proxy usage, and concurrent cross-engine scanning with basic liveness validation of top results.
Provide JSON/CSV export of ranked scan results, including per-target metadata and suggested tooling commands.

Enhancements:

Implement on-disk caching of search results to avoid reprocessing previously seen URLs across runs.

sourcery-ai · 2026-01-06T15:52:46Z

Reviewer's Guide

Introduces a new standalone asynchronous multi-engine dork scanner utility with heuristic scoring, caching, batched progress-reporting, and export helpers for SQLi/XSS-oriented OSINT scanning.

Sequence diagram for execute_cross_engine_scan workflow

sequenceDiagram
    actor User
    participant CLI as main
    participant Framework as MultiEngineDorkFramework
    participant Engines as SearchEngines
    participant Cache as CacheFile
    participant Targets as TargetValidation

    User->>CLI: invoke with arguments
    CLI->>Framework: __init__(proxies, output_format, cache_file)
    Framework->>Cache: _load_cache()
    Cache-->>Framework: cached_results

    CLI->>Framework: execute_cross_engine_scan(domain_filter, output_file, max_results, pages)
    Framework->>Framework: generate_master_dork_list()

    loop for each engine and dork
        Framework->>Framework: schedule search_engine(engine_name, dork, engine_config)
    end

    par concurrent_search_tasks
        Framework->>Engines: search_engine requests (bounded by semaphore)
        Engines-->>Framework: raw HTML responses per engine/page
        Framework->>Framework: parse HTML with CSS selectors
        Framework->>Framework: normalize URLs and score results
    end

    Framework->>Framework: deduplicate across engines

    Framework->>Targets: HEAD requests for top scored URLs
    Targets-->>Framework: liveness flags
    Framework->>Framework: filter to valid_results

    alt output_file provided
        Framework->>Framework: _export_results(results, filename)
        Framework->>Cache: _save_cache()
    else no output_file
        Framework->>Cache: _save_cache()
    end

    Framework-->>CLI: final results
    CLI-->>User: print top scored targets and helper CLI commands

Class diagram for the new multi-engine dork scanner utility

classDiagram
    class SearchResult {
        +str url
        +str title
        +str snippet
        +str engine
        +int score
        +str sqlmap_cli
        +str nuclei_cli
    }

    class MultiEngineDorkFramework {
        +List~str~ USER_AGENTS
        -List~str~ proxies
        -str output_format
        -aiohttp.ClientSession session
        -str cache_file
        -Dict~str, List~SearchResult~~ cache
        -Dict~str, Dict~str, any~~ engines
        +__init__(proxies, output_format, cache_file)
        -_load_cache() Dict~str, List~SearchResult~~
        -_save_cache() None
        -_get_session() aiohttp.ClientSession
        -_get_random_ua() str
        -_normalize_url(url) str
        -_score_dork_result(result) int
        -_generate_sqlmap_cli(url) str
        -_generate_nuclei_cli(url) str
        +generate_master_dork_list() Dict~str, List~str~~
        +_retry_on_failure(max_retries) function_wrapper
        +search_engine(engine_name, dork, engine_config, max_results, pages) List~SearchResult~
        -_get_random_proxy() str
        +execute_cross_engine_scan(domain_filter, output_file, max_results, pages) List~SearchResult~
        -_export_results(results, filename) None
    }

    class main_module {
        +async main() None
    }

    MultiEngineDorkFramework "*" o-- "*" SearchResult : produces
    main_module ..> MultiEngineDorkFramework : creates
    main_module ..> SearchResult : prints_top_results

File-Level Changes

Change: Add asynchronous multi-search-engine dork scanner with heuristic scoring, caching, and CLI entrypoint.
Files: `utils/dork_scanner.py`
Details:
- Implement MultiEngineDorkFramework class that orchestrates concurrent searches across Google, Bing, DuckDuckGo, and Yandex with per-engine CSS selectors, pagination, and rate limiting.
- Introduce SearchResult dataclass that carries URL metadata, heuristic score, and prebuilt sqlmap/nuclei command lines for discovered targets.
- Add URL normalization, scoring heuristics for SQLi/XSS indicators, and CLI generation helpers for sqlmap/nuclei to streamline follow-up testing.
- Implement retry decorator, bounded concurrency with semaphore, proxy rotation, and tqdm-based progress tracking per dork/engine/page.
- Add cross-engine scan routine that builds a master dork list, aggregates and de-duplicates normalized URLs, validates top results with async HEAD requests, and optionally exports to JSON/CSV while persisting a local cache file.
- Provide an async main() entrypoint wired to argparse for domain filtering, proxy configuration, pagination, output selection, and summary printing of top results.


sourcery-ai

Hey - I've found 4 issues, and left some high-level feedback:

The caching logic currently only reads from dork_cache.json and filters against it but never appends new SearchResult objects into self.cache before _save_cache, so cache contents will never evolve and the cache effectively doesn’t reflect newly discovered URLs; consider updating self.cache as you collect results per engine.
You’re applying rate limiting in multiple layers (sleep inside search_engine per page plus an additional sleep in bounded_search per task based on the same rate_limit), which may slow scans more than intended; you might want to consolidate the rate limiting in a single place to make the effective request rate easier to reason about.
The --proxies help string ('Proxy list[](http://ip:port)') looks slightly malformed and may confuse users; consider simplifying it to something like 'Proxy list (e.g. http://ip:port)'.

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- The caching logic currently only reads from `dork_cache.json` and filters against it but never appends new `SearchResult` objects into `self.cache` before `_save_cache`, so cache contents will never evolve and the cache effectively doesn’t reflect newly discovered URLs; consider updating `self.cache` as you collect results per engine.
- You’re applying rate limiting in multiple layers (sleep inside `search_engine` per page plus an additional sleep in `bounded_search` per task based on the same `rate_limit`), which may slow scans more than intended; you might want to consolidate the rate limiting in a single place to make the effective request rate easier to reason about.
- The `--proxies` help string (`'Proxy list[](http://ip:port)'`) looks slightly malformed and may confuse users; consider simplifying it to something like `'Proxy list (e.g. http://ip:port)'`.

## Individual Comments

### Comment 1
<location> `utils/dork_scanner.py:329-330` </location>
<code_context>
+                            )
+                            snippet = snippet_match.group(1).strip() if snippet_match else "N/A"
+
+                        norm_url = self._normalize_url(href)
+                        if norm_url in {r.url for r in self.cache.get(engine_name, [])}:
+                            continue
+                        result = SearchResult(norm_url, title, snippet, engine_name)
</code_context>

<issue_to_address>
**suggestion (performance):** Cache lookup recreates a set on every iteration, which is unnecessary overhead.

`if norm_url in {r.url for r in self.cache.get(engine_name, [])}` rebuilds a set on every iteration, which can be costly for large caches. Precompute the URL set once per engine (e.g., before the loop) or maintain a separate `Set[str]` of cached URLs to get O(1) membership checks without repeated allocations.

Suggested implementation:

```python
                        norm_url = self._normalize_url(href)
                        if norm_url in cached_urls:
                            continue

```

To fully implement the optimization, you also need to:

1. Initialize `cached_urls` once per `engine_name` before entering the loop that processes results/pages for that engine. For example, in the method where this snippet lives (`search_engine`, per the class diagram), after you have `engine_name` and before iterating over pages/results, add:
   ```python
   cached_urls = {r.url for r in self.cache.get(engine_name, [])}
   ```

2. Whenever you add new `SearchResult` objects to `self.cache[engine_name]`, also add their URLs to `cached_urls` to keep the set in sync:
   ```python
   self.cache[engine_name].append(result)
   cached_urls.add(result.url)
   ```
   Adjust this to match your existing cache update logic and data structures.

These changes ensure you only build the set once per engine and then benefit from O(1) membership checks without repeated allocations inside the loop.
</issue_to_address>

### Comment 2
<location> `utils/dork_scanner.py:332-335` </location>
<code_context>
+                        norm_url = self._normalize_url(href)
+                        if norm_url in {r.url for r in self.cache.get(engine_name, [])}:
+                            continue
+                        result = SearchResult(norm_url, title, snippet, engine_name)
+                        result.score = self._score_dork_result(result)
+                        result.sqlmap_cli = self._generate_sqlmap_cli(norm_url)
+                        result.nuclei_cli = self._generate_nuclei_cli(norm_url)
+                        results.append(result)
+
</code_context>

<issue_to_address>
**issue (bug_risk):** New search results are not added to the cache, so the cache never grows beyond what was loaded from disk.

`self.cache.get(engine_name, [])` is only used for the membership check; the new `result` instances are never added back to `self.cache`. As a result, de-duplication only considers entries loaded from disk, not those found in the current run. You could append each new result to `self.cache.setdefault(engine_name, [])` so `_save_cache` persists them and the cache reflects all past scans.
</issue_to_address>

### Comment 3
<location> `utils/dork_scanner.py:376-388` </location>
<code_context>
+
+        semaphore = asyncio.Semaphore(40)
+
+        async def bounded_search(task, delay):
+            async with semaphore:
+                res = await task
+                await asyncio.sleep(delay + random.uniform(0, 0.5))
+                pbar.update(pages)
+                return res
</code_context>

<issue_to_address>
**suggestion (performance):** Rate limiting is applied both inside `search_engine` and again in `bounded_search`, which may be redundant.

`search_engine` already does `await asyncio.sleep(engine_config["rate_limit"] + random.uniform(0, 0.5))` per page, and `bounded_search` adds another `await asyncio.sleep(delay + random.uniform(0, 0.5))` after each task. This stacked delay likely slows scans more than intended without much benefit. Consider centralizing the throttling (either per page or per task) so the effective request rate matches the configured `rate_limit`.

```suggestion
                    task = self.search_engine(engine_name, query_dork, engine_config, max_results, pages)
                    tasks.append(task)

        semaphore = asyncio.Semaphore(40)

        async def bounded_search(task):
            async with semaphore:
                res = await task
                pbar.update(pages)
                return res

        scan_tasks = [bounded_search(task) for task in tasks]
```
</issue_to_address>

### Comment 4
<location> `utils/dork_scanner.py:459` </location>
<code_context>
+    parser.add_argument("--domain", help="Domain filter (e.g., example.com)")
+    parser.add_argument("--output", help="Output file (json/csv)")
+    parser.add_argument("--format", choices=["json", "csv"], default="json", help="Output format")
+    parser.add_argument("--proxies", nargs="+", help="Proxy list[](http://ip:port)")
+    parser.add_argument("--num-results", type=int, default=10, help="Max results per page")
+    parser.add_argument("--pages", type=int, default=5, help="Pages to scan per engine")
</code_context>

<issue_to_address>
**nitpick (typo):** The proxies help string has an odd `list[]` fragment that looks accidental.

The help text currently shows `"Proxy list[](http://ip:port)"`, which looks like it has a stray `[]`. Consider wording like `"Proxy list (http://ip:port)"` or `"List of proxies, e.g. http://ip:port"` for clearer `--help` output.

```suggestion
    parser.add_argument("--proxies", nargs="+", help="List of proxies, e.g. http://ip:port")
```
</issue_to_address>


sourcery-ai · 2026-01-06T15:54:17Z

+                        norm_url = self._normalize_url(href)
+                        if norm_url in {r.url for r in self.cache.get(engine_name, [])}:


suggestion (performance): Cache lookup recreates a set on every iteration, which is unnecessary overhead.

if norm_url in {r.url for r in self.cache.get(engine_name, [])} rebuilds a set on every iteration, which can be costly for large caches. Precompute the URL set once per engine (e.g., before the loop) or maintain a separate Set[str] of cached URLs to get O(1) membership checks without repeated allocations.

Suggested implementation:

    norm_url = self._normalize_url(href)
    if norm_url in cached_urls:
        continue

To fully implement the optimization, you also need to:

Initialize cached_urls once per engine_name before entering the loop that processes results/pages for that engine. For example, in the method where this snippet lives (search_engine, per the class diagram), after you have engine_name and before iterating over pages/results, add:

cached_urls = {r.url for r in self.cache.get(engine_name, [])}

Whenever you add new SearchResult objects to self.cache[engine_name], also add their URLs to cached_urls to keep the set in sync:

    self.cache[engine_name].append(result)
    cached_urls.add(result.url)

Adjust this to match your existing cache update logic and data structures.

These changes ensure you only build the set once per engine and then benefit from O(1) membership checks without repeated allocations inside the loop.
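The precompute-and-sync idea above can be sketched in isolation. This is a minimal illustration, not the PR's code: the reduced `SearchResult` and the `collect_new_results` helper are hypothetical stand-ins, and the cache is assumed to be a plain dict of lists keyed by engine name.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class SearchResult:
    # Reduced stand-in for the PR's SearchResult; only the fields needed here.
    url: str
    engine: str


def collect_new_results(
    cache: Dict[str, List[SearchResult]],
    engine_name: str,
    candidate_urls: Iterable[str],
) -> List[SearchResult]:
    """Return results for URLs not yet cached, updating the cache as it goes."""
    cached = cache.setdefault(engine_name, [])
    # Build the lookup set once per engine, not once per candidate URL.
    cached_urls = {r.url for r in cached}
    new_results: List[SearchResult] = []
    for url in candidate_urls:
        if url in cached_urls:
            continue
        result = SearchResult(url=url, engine=engine_name)
        new_results.append(result)
        # Keep the cached list and the lookup set in sync.
        cached.append(result)
        cached_urls.add(url)
    return new_results


cache = {"bing": [SearchResult("https://example.com/a", "bing")]}
fresh = collect_new_results(
    cache,
    "bing",
    ["https://example.com/a", "https://example.com/b", "https://example.com/b"],
)
print([r.url for r in fresh])  # ['https://example.com/b']
```

Because the set is updated alongside the list, a URL that appears twice in the same run is also skipped the second time.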

sourcery-ai · 2026-01-06T15:54:17Z

+                        result = SearchResult(norm_url, title, snippet, engine_name)
+                        result.score = self._score_dork_result(result)
+                        result.sqlmap_cli = self._generate_sqlmap_cli(norm_url)
+                        result.nuclei_cli = self._generate_nuclei_cli(norm_url)


issue (bug_risk): New search results are not added to the cache, so the cache never grows beyond what was loaded from disk.

self.cache.get(engine_name, []) is only used for the membership check; the new result instances are never added back to self.cache. As a result, de-duplication only considers entries loaded from disk, not those found in the current run. You could append each new result to self.cache.setdefault(engine_name, []) so _save_cache persists them and the cache reflects all past scans.

sourcery-ai · 2026-01-06T15:54:18Z

+                    task = self.search_engine(engine_name, query_dork, engine_config, max_results, pages)
+                    tasks.append((task, engine_config["rate_limit"]))
+
+        semaphore = asyncio.Semaphore(40)
+
+        async def bounded_search(task, delay):
+            async with semaphore:
+                res = await task
+                await asyncio.sleep(delay + random.uniform(0, 0.5))
+                pbar.update(pages)
+                return res
+
+        scan_tasks = [bounded_search(task, delay) for task, delay in tasks]


suggestion (performance): Rate limiting is applied both inside search_engine and again in bounded_search, which may be redundant.

search_engine already does await asyncio.sleep(engine_config["rate_limit"] + random.uniform(0, 0.5)) per page, and bounded_search adds another await asyncio.sleep(delay + random.uniform(0, 0.5)) after each task. This stacked delay likely slows scans more than intended without much benefit. Consider centralizing the throttling (either per page or per task) so the effective request rate matches the configured rate_limit.

Suggested change

-                    task = self.search_engine(engine_name, query_dork, engine_config, max_results, pages)
-                    tasks.append((task, engine_config["rate_limit"]))
-
-        semaphore = asyncio.Semaphore(40)
-
-        async def bounded_search(task, delay):
-            async with semaphore:
-                res = await task
-                await asyncio.sleep(delay + random.uniform(0, 0.5))
-                pbar.update(pages)
-                return res
-
-        scan_tasks = [bounded_search(task, delay) for task, delay in tasks]
+                    task = self.search_engine(engine_name, query_dork, engine_config, max_results, pages)
+                    tasks.append(task)
+
+        semaphore = asyncio.Semaphore(40)
+
+        async def bounded_search(task):
+            async with semaphore:
+                res = await task
+                pbar.update(pages)
+                return res
+
+        scan_tasks = [bounded_search(task) for task in tasks]
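As a rough sketch of what "throttling in one place" could look like, the example below keeps a single sleep per page and uses the semaphore only to bound concurrency. It is an assumption-laden illustration: `fetch_page` is a hypothetical stand-in that performs no network I/O, the engine names are placeholders, and the delay and jitter values are shrunk so the example runs quickly.

```python
import asyncio
import random
from typing import List


async def fetch_page(engine: str, page: int) -> str:
    # Hypothetical stand-in for one request; yields control without any I/O.
    await asyncio.sleep(0)
    return f"{engine}:page{page}"


async def search_engine(
    engine: str, pages: int, rate_limit: float, semaphore: asyncio.Semaphore
) -> List[str]:
    results = []
    for page in range(1, pages + 1):
        # The semaphore only bounds how many requests are in flight.
        async with semaphore:
            results.append(await fetch_page(engine, page))
        # The single throttling point: one delay per page, nowhere else.
        await asyncio.sleep(rate_limit + random.uniform(0, rate_limit / 2))
    return results


async def run_scan() -> List[List[str]]:
    semaphore = asyncio.Semaphore(2)
    tasks = [
        search_engine(name, pages=2, rate_limit=0.01, semaphore=semaphore)
        for name in ("engine_a", "engine_b", "engine_c")
    ]
    # gather() returns results in the order the tasks were passed in.
    return await asyncio.gather(*tasks)


all_results = asyncio.run(run_scan())
print(all_results)
```

Sleeping outside the `async with` block releases the concurrency slot during the delay, so the configured `rate_limit` governs pacing per engine while the semaphore governs parallelism. Whether that split suits the scanner is a design choice for the author.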

sourcery-ai · 2026-01-06T15:54:18Z

+    parser.add_argument("--domain", help="Domain filter (e.g., example.com)")
+    parser.add_argument("--output", help="Output file (json/csv)")
+    parser.add_argument("--format", choices=["json", "csv"], default="json", help="Output format")
+    parser.add_argument("--proxies", nargs="+", help="Proxy list[](http://ip:port)")


nitpick (typo): The proxies help string has an odd list[] fragment that looks accidental.

The help text currently shows "Proxy list[](http://ip:port)", which looks like it has a stray []. Consider wording like "Proxy list (http://ip:port)" or "List of proxies, e.g. http://ip:port" for clearer --help output.

Suggested change

-    parser.add_argument("--proxies", nargs="+", help="Proxy list[](http://ip:port)")
+    parser.add_argument("--proxies", nargs="+", help="List of proxies, e.g. http://ip:port")
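A quick way to check the reworded option is to build a parser with only that argument and inspect what it parses. The sketch below is illustrative: `build_parser`, the description string, and the `metavar="URL"` setting are assumptions added here and are not part of the PR.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Illustrative parser")
    parser.add_argument(
        "--proxies",
        nargs="+",          # one or more values collected into a list
        metavar="URL",      # assumption: shown in --help instead of PROXIES
        help="List of proxies, e.g. http://ip:port",
    )
    return parser


parser = build_parser()
args = parser.parse_args(["--proxies", "http://10.0.0.1:8080", "http://10.0.0.2:8080"])
print(args.proxies)
print(parser.format_help())
```

Printing `format_help()` shows the text exactly as users would see it with `--help`, which makes stray characters like the original `[]` easy to spot.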

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4a4631c5ab


chatgpt-codex-connector · 2026-01-06T15:54:28Z

+            offset = 10 * (page - 1)
+            url = (
+                engine_config["base_url"]
+                + urllib.parse.quote(dork)
+                + f"{pag_param}{offset}&num={max_results}"


Use max_results to compute pagination offset

When --num-results is set to anything other than 10, the pagination offset for Google/Bing is still hardcoded to 10 * (page - 1), while the request asks for num={max_results}. The offset is then no longer aligned with the page size, so page 2 and later will overlap or skip results: users requesting, for example, 25 results per page will see duplicated or partial coverage for those engines. Consider computing the offset from max_results instead so pagination matches the requested page size.
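The arithmetic behind this comment can be shown with a small helper. `page_offset` is a hypothetical name used only for illustration, and the sketch assumes 1-based page numbers as in the diff above.

```python
def page_offset(page: int, max_results: int) -> int:
    """Offset of the first result on a 1-based page of size max_results."""
    if page < 1 or max_results < 1:
        raise ValueError("page and max_results must be positive")
    return max_results * (page - 1)


# Hardcoded page size of 10, as in the reviewed diff.
offsets_fixed = [10 * (page - 1) for page in (1, 2, 3)]
# Offset derived from the requested page size of 25.
offsets_sized = [page_offset(page, 25) for page in (1, 2, 3)]

print(offsets_fixed)  # [0, 10, 20]
print(offsets_sized)  # [0, 25, 50]
```

With 25 results per page, the hardcoded offsets start page 2 at result 10, which re-requests results 10 through 24 already returned on page 1.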


chatgpt-codex-connector · 2026-01-06T15:54:28Z

+                        norm_url = self._normalize_url(href)
+                        if norm_url in {r.url for r in self.cache.get(engine_name, [])}:
+                            continue
+                        result = SearchResult(norm_url, title, snippet, engine_name)
+                        result.score = self._score_dork_result(result)


Populate cache before saving results

The code only checks self.cache to skip cached URLs but never inserts new results into self.cache; later _save_cache() persists the unchanged cache. As a result, running the scanner multiple times with the same cache file will re-fetch already-seen targets because nothing was ever added to the cache in this run. This defeats the intended caching/de-duplication behavior and adds unnecessary load; consider appending new SearchResults into self.cache[engine_name] before saving.
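To make the cross-run effect concrete, here is a minimal round-trip sketch in which entries recorded during one run are visible after reloading from disk. The `CachedResult` type and the `load_cache`, `save_cache`, and `record` helpers are hypothetical, and the JSON layout is an assumption; the PR's actual `_load_cache`/`_save_cache` may serialize differently.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass
class CachedResult:
    # Hypothetical reduced record; the PR's SearchResult carries more fields.
    url: str
    title: str


Cache = Dict[str, List[CachedResult]]


def load_cache(path: str) -> Cache:
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as fh:
        raw = json.load(fh)
    return {
        engine: [CachedResult(**item) for item in items]
        for engine, items in raw.items()
    }


def save_cache(path: str, cache: Cache) -> None:
    serializable = {
        engine: [asdict(result) for result in results]
        for engine, results in cache.items()
    }
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(serializable, fh, indent=2)


def record(cache: Cache, engine: str, result: CachedResult) -> None:
    # The step the review says is missing: add new results before saving.
    cache.setdefault(engine, []).append(result)


cache_path = os.path.join(tempfile.mkdtemp(), "dork_cache.json")

first_run = load_cache(cache_path)  # empty on the first run
record(first_run, "bing", CachedResult("https://example.com/a", "A"))
save_cache(cache_path, first_run)

second_run = load_cache(cache_path)  # now contains the recorded entry
print(second_run)
```

Without the `record` call, `save_cache` would write back exactly what was loaded, which is the behavior both reviewers describe.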


Add resilient multi-engine dork scanner

4a4631c

ChrisAdamsdevelopment added the codex label Jan 6, 2026 — with ChatGPT Codex Connector

sourcery-ai Bot reviewed Jan 6, 2026


chatgpt-codex-connector Bot reviewed Jan 6, 2026


ChrisAdamsdevelopment wants to merge 1 commit into main from codex/debug-and-fix-multienginedorkframework
