
Add resilient multi-engine dork scanner #2

Open
ChrisAdamsdevelopment wants to merge 1 commit into main from codex/debug-and-fix-multienginedorkframework

Conversation

ChrisAdamsdevelopment (Owner) commented Jan 6, 2026

Summary

  • add a standalone multi-engine dork scanning utility with SQLi/XSS heuristics
  • improve result parsing to respect CSS selectors and update progress after each request batch
  • include export, scoring, and validation helpers for discovered targets

Testing

  • python -m compileall utils/dork_scanner.py

Codex Task

Summary by Sourcery

Add a standalone asynchronous multi-engine dork scanning utility for SQLi/XSS-oriented OSINT discovery with scoring, caching, and export capabilities.

New Features:

  • Introduce a CLI-based multi-search-engine dork scanner that aggregates results from Google, Bing, DuckDuckGo, and Yandex.
  • Add heuristic scoring and helper command generation for SQLi/XSS-focused targets discovered via dork searches.
  • Support optional domain filtering, proxy usage, and concurrent cross-engine scanning with basic liveness validation of top results.
  • Provide JSON/CSV export of ranked scan results, including per-target metadata and suggested tooling commands.

Enhancements:

  • Implement on-disk caching of search results to avoid reprocessing previously seen URLs across runs.

sourcery-ai Bot commented Jan 6, 2026

Reviewer's Guide

Introduces a new standalone asynchronous multi-engine dork scanner utility with heuristic scoring, caching, batched progress-reporting, and export helpers for SQLi/XSS-oriented OSINT scanning.

Sequence diagram for execute_cross_engine_scan workflow

sequenceDiagram
    actor User
    participant CLI as main
    participant Framework as MultiEngineDorkFramework
    participant Engines as SearchEngines
    participant Cache as CacheFile
    participant Targets as TargetValidation

    User->>CLI: invoke with arguments
    CLI->>Framework: __init__(proxies, output_format, cache_file)
    Framework->>Cache: _load_cache()
    Cache-->>Framework: cached_results

    CLI->>Framework: execute_cross_engine_scan(domain_filter, output_file, max_results, pages)
    Framework->>Framework: generate_master_dork_list()

    loop for each engine and dork
        Framework->>Framework: schedule search_engine(engine_name, dork, engine_config)
    end

    par concurrent_search_tasks
        Framework->>Engines: search_engine requests (bounded by semaphore)
        Engines-->>Framework: raw HTML responses per engine/page
        Framework->>Framework: parse HTML with CSS selectors
        Framework->>Framework: normalize URLs and score results
    end

    Framework->>Framework: deduplicate across engines

    Framework->>Targets: HEAD requests for top scored URLs
    Targets-->>Framework: liveness flags
    Framework->>Framework: filter to valid_results

    alt output_file provided
        Framework->>Framework: _export_results(results, filename)
        Framework->>Cache: _save_cache()
    else no output_file
        Framework->>Cache: _save_cache()
    end

    Framework-->>CLI: final results
    CLI-->>User: print top scored targets and helper CLI commands

Class diagram for the new multi-engine dork scanner utility

classDiagram
    class SearchResult {
        +str url
        +str title
        +str snippet
        +str engine
        +int score
        +str sqlmap_cli
        +str nuclei_cli
    }

    class MultiEngineDorkFramework {
        +List~str~ USER_AGENTS
        -List~str~ proxies
        -str output_format
        -aiohttp.ClientSession session
        -str cache_file
        -Dict~str, List~SearchResult~~ cache
        -Dict~str, Dict~str, any~~ engines
        +__init__(proxies, output_format, cache_file)
        -_load_cache() Dict~str, List~SearchResult~~
        -_save_cache() None
        -_get_session() aiohttp.ClientSession
        -_get_random_ua() str
        -_normalize_url(url) str
        -_score_dork_result(result) int
        -_generate_sqlmap_cli(url) str
        -_generate_nuclei_cli(url) str
        +generate_master_dork_list() Dict~str, List~str~~
        +_retry_on_failure(max_retries) function_wrapper
        +search_engine(engine_name, dork, engine_config, max_results, pages) List~SearchResult~
        -_get_random_proxy() str
        +execute_cross_engine_scan(domain_filter, output_file, max_results, pages) List~SearchResult~
        -_export_results(results, filename) None
    }

    class main_module {
        +async main() None
    }

    MultiEngineDorkFramework "*" o-- "*" SearchResult : produces
    main_module ..> MultiEngineDorkFramework : creates
    main_module ..> SearchResult : prints_top_results

File-Level Changes

Change: Add asynchronous multi-search-engine dork scanner with heuristic scoring, caching, and CLI entrypoint.
Details:
  • Implement MultiEngineDorkFramework class that orchestrates concurrent searches across Google, Bing, DuckDuckGo, and Yandex with per-engine CSS selectors, pagination, and rate limiting.
  • Introduce SearchResult dataclass that carries URL metadata, heuristic score, and prebuilt sqlmap/nuclei command lines for discovered targets.
  • Add URL normalization, scoring heuristics for SQLi/XSS indicators, and CLI generation helpers for sqlmap/nuclei to streamline follow-up testing.
  • Implement retry decorator, bounded concurrency with semaphore, proxy rotation, and tqdm-based progress tracking per dork/engine/page.
  • Add cross-engine scan routine that builds a master dork list, aggregates and de-duplicates normalized URLs, validates top results with async HEAD requests, and optionally exports to JSON/CSV while persisting a local cache file.
  • Provide an async main() entrypoint wired to argparse for domain filtering, proxy configuration, pagination, output selection, and summary printing of top results.
Files: utils/dork_scanner.py
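
The retry decorator listed above is not shown anywhere in this conversation, so the following is only a minimal sketch of how an async retry wrapper of that shape could look. The name `retry_on_failure`, the `base_delay` parameter, and the exponential backoff policy are assumptions for illustration, not the PR's actual `_retry_on_failure` implementation.

```python
import asyncio
import functools


def retry_on_failure(max_retries: int = 3, base_delay: float = 0.5):
    """Retry an async callable up to max_retries times (max_retries >= 1).

    Hypothetical helper: the backoff policy here is an assumption.
    """

    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            last_exc = None
            for attempt in range(max_retries):
                try:
                    return await func(*args, **kwargs)
                except Exception as exc:  # a real scanner would narrow this
                    last_exc = exc
                    if attempt < max_retries - 1:
                        # Exponential backoff between attempts: d, 2d, 4d, ...
                        await asyncio.sleep(base_delay * 2 ** attempt)
            # Every attempt failed: surface the last error to the caller.
            raise last_exc

        return wrapper

    return decorator
```

Catching a narrower exception type (for example, only network errors) would likely be preferable in practice, so that programming errors are not silently retried.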

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

@sourcery-ai sourcery-ai Bot left a comment

Hey - I've found 4 issues, and left some high level feedback:

  • The caching logic currently only reads from dork_cache.json and filters against it but never appends new SearchResult objects into self.cache before _save_cache, so cache contents will never evolve and the cache effectively doesn’t reflect newly discovered URLs; consider updating self.cache as you collect results per engine.
  • You’re applying rate limiting in multiple layers (sleep inside search_engine per page plus an additional sleep in bounded_search per task based on the same rate_limit), which may slow scans more than intended; you might want to consolidate the rate limiting in a single place to make the effective request rate easier to reason about.
  • The --proxies help string ('Proxy list[](http://ip:port)') looks slightly malformed and may confuse users; consider simplifying it to something like 'Proxy list (e.g. http://ip:port)'.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The caching logic currently only reads from `dork_cache.json` and filters against it but never appends new `SearchResult` objects into `self.cache` before `_save_cache`, so cache contents will never evolve and the cache effectively doesn’t reflect newly discovered URLs; consider updating `self.cache` as you collect results per engine.
- You’re applying rate limiting in multiple layers (sleep inside `search_engine` per page plus an additional sleep in `bounded_search` per task based on the same `rate_limit`), which may slow scans more than intended; you might want to consolidate the rate limiting in a single place to make the effective request rate easier to reason about.
- The `--proxies` help string (`'Proxy list[](http://ip:port)'`) looks slightly malformed and may confuse users; consider simplifying it to something like `'Proxy list (e.g. http://ip:port)'`.

## Individual Comments

### Comment 1
<location> `utils/dork_scanner.py:329-330` </location>
<code_context>
+                            )
+                            snippet = snippet_match.group(1).strip() if snippet_match else "N/A"
+
+                        norm_url = self._normalize_url(href)
+                        if norm_url in {r.url for r in self.cache.get(engine_name, [])}:
+                            continue
+                        result = SearchResult(norm_url, title, snippet, engine_name)
</code_context>

<issue_to_address>
**suggestion (performance):** Cache lookup recreates a set on every iteration, which is unnecessary overhead.

`if norm_url in {r.url for r in self.cache.get(engine_name, [])}` rebuilds a set on every iteration, which can be costly for large caches. Precompute the URL set once per engine (e.g., before the loop) or maintain a separate `Set[str]` of cached URLs to get O(1) membership checks without repeated allocations.

Suggested implementation:

```python
                        norm_url = self._normalize_url(href)
                        if norm_url in cached_urls:
                            continue

```

To fully implement the optimization, you also need to:

1. Initialize `cached_urls` once per `engine_name` before entering the loop that processes results/pages for that engine. For example, in the method where this snippet lives (likely something like `_search_engine` or similar), after you have `engine_name` and before iterating over pages/results, add:
   ```python
   cached_urls = {r.url for r in self.cache.get(engine_name, [])}
   ```

2. Whenever you add new `SearchResult` objects to `self.cache[engine_name]`, also add their URLs to `cached_urls` to keep the set in sync:
   ```python
   self.cache[engine_name].append(result)
   cached_urls.add(result.url)
   ```
   Adjust this to match your existing cache update logic and data structures.

These changes ensure you only build the set once per engine and then benefit from O(1) membership checks without repeated allocations inside the loop.
</issue_to_address>

### Comment 2
<location> `utils/dork_scanner.py:332-335` </location>
<code_context>
+                        norm_url = self._normalize_url(href)
+                        if norm_url in {r.url for r in self.cache.get(engine_name, [])}:
+                            continue
+                        result = SearchResult(norm_url, title, snippet, engine_name)
+                        result.score = self._score_dork_result(result)
+                        result.sqlmap_cli = self._generate_sqlmap_cli(norm_url)
+                        result.nuclei_cli = self._generate_nuclei_cli(norm_url)
+                        results.append(result)
+
</code_context>

<issue_to_address>
**issue (bug_risk):** New search results are not added to the cache, so the cache never grows beyond what was loaded from disk.

`self.cache.get(engine_name, [])` is only used for the membership check; the new `result` instances are never added back to `self.cache`. As a result, de-duplication only considers entries loaded from disk, not those found in the current run. You could append each new result to `self.cache.setdefault(engine_name, [])` so `_save_cache` persists them and the cache reflects all past scans.
</issue_to_address>

### Comment 3
<location> `utils/dork_scanner.py:376-388` </location>
<code_context>
+
+        semaphore = asyncio.Semaphore(40)
+
+        async def bounded_search(task, delay):
+            async with semaphore:
+                res = await task
+                await asyncio.sleep(delay + random.uniform(0, 0.5))
+                pbar.update(pages)
+                return res
</code_context>

<issue_to_address>
**suggestion (performance):** Rate limiting is applied both inside `search_engine` and again in `bounded_search`, which may be redundant.

`search_engine` already does `await asyncio.sleep(engine_config["rate_limit"] + random.uniform(0, 0.5))` per page, and `bounded_search` adds another `await asyncio.sleep(delay + random.uniform(0, 0.5))` after each task. This stacked delay likely slows scans more than intended without much benefit. Consider centralizing the throttling (either per page or per task) so the effective request rate matches the configured `rate_limit`.

```suggestion
                    task = self.search_engine(engine_name, query_dork, engine_config, max_results, pages)
                    tasks.append(task)

        semaphore = asyncio.Semaphore(40)

        async def bounded_search(task):
            async with semaphore:
                res = await task
                pbar.update(pages)
                return res

        scan_tasks = [bounded_search(task) for task in tasks]
```
</issue_to_address>

### Comment 4
<location> `utils/dork_scanner.py:459` </location>
<code_context>
+    parser.add_argument("--domain", help="Domain filter (e.g., example.com)")
+    parser.add_argument("--output", help="Output file (json/csv)")
+    parser.add_argument("--format", choices=["json", "csv"], default="json", help="Output format")
+    parser.add_argument("--proxies", nargs="+", help="Proxy list[](http://ip:port)")
+    parser.add_argument("--num-results", type=int, default=10, help="Max results per page")
+    parser.add_argument("--pages", type=int, default=5, help="Pages to scan per engine")
</code_context>

<issue_to_address>
**nitpick (typo):** The proxies help string has an odd `list[]` fragment that looks accidental.

The help text currently shows `"Proxy list[](http://ip:port)"`, which looks like it has a stray `[]`. Consider wording like `"Proxy list (http://ip:port)"` or `"List of proxies, e.g. http://ip:port"` for clearer `--help` output.

```suggestion
    parser.add_argument("--proxies", nargs="+", help="List of proxies, e.g. http://ip:port")
```
</issue_to_address>

Comment thread utils/dork_scanner.py
Comment on lines +329 to +330
norm_url = self._normalize_url(href)
if norm_url in {r.url for r in self.cache.get(engine_name, [])}:

suggestion (performance): Cache lookup recreates a set on every iteration, which is unnecessary overhead.

if norm_url in {r.url for r in self.cache.get(engine_name, [])} rebuilds a set on every iteration, which can be costly for large caches. Precompute the URL set once per engine (e.g., before the loop) or maintain a separate Set[str] of cached URLs to get O(1) membership checks without repeated allocations.

Suggested implementation:

                        norm_url = self._normalize_url(href)
                        if norm_url in cached_urls:
                            continue

To fully implement the optimization, you also need to:

  1. Initialize cached_urls once per engine_name before entering the loop that processes results/pages for that engine. For example, in the method where this snippet lives (likely something like _search_engine or similar), after you have engine_name and before iterating over pages/results, add:

    cached_urls = {r.url for r in self.cache.get(engine_name, [])}
  2. Whenever you add new SearchResult objects to self.cache[engine_name], also add their URLs to cached_urls to keep the set in sync:

    self.cache[engine_name].append(result)
    cached_urls.add(result.url)

    Adjust this to match your existing cache update logic and data structures.

These changes ensure you only build the set once per engine and then benefit from O(1) membership checks without repeated allocations inside the loop.
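
As a self-contained illustration of the suggestion above, the sketch below builds the URL set once per engine and keeps it in sync with the cache list. `SearchResult` is reduced to the fields needed here, and `filter_uncached` is a hypothetical helper, not a function from utils/dork_scanner.py.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class SearchResult:
    # Reduced stand-in for the PR's dataclass; only the fields used here.
    url: str
    title: str = ""
    snippet: str = ""
    engine: str = ""


def filter_uncached(
    urls: List[str], cache: Dict[str, List[SearchResult]], engine_name: str
) -> List[SearchResult]:
    """Return results for URLs not yet cached, updating the cache as it goes."""
    # Build the lookup set once per engine, outside the per-result loop.
    cached_urls: Set[str] = {r.url for r in cache.get(engine_name, [])}
    fresh: List[SearchResult] = []
    for url in urls:
        if url in cached_urls:
            continue
        result = SearchResult(url=url, engine=engine_name)
        cache.setdefault(engine_name, []).append(result)
        cached_urls.add(url)  # keep the set in sync with the list
        fresh.append(result)
    return fresh
```

Because the set is updated inside the loop, a URL that appears twice in the same batch is also skipped the second time.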

Comment thread utils/dork_scanner.py
Comment on lines +332 to +335
result = SearchResult(norm_url, title, snippet, engine_name)
result.score = self._score_dork_result(result)
result.sqlmap_cli = self._generate_sqlmap_cli(norm_url)
result.nuclei_cli = self._generate_nuclei_cli(norm_url)

issue (bug_risk): New search results are not added to the cache, so the cache never grows beyond what was loaded from disk.

self.cache.get(engine_name, []) is only used for the membership check; the new result instances are never added back to self.cache. As a result, de-duplication only considers entries loaded from disk, not those found in the current run. You could append each new result to self.cache.setdefault(engine_name, []) so _save_cache persists them and the cache reflects all past scans.
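
For the cache to reflect past scans, new results have to be recorded before the file is written. The PR's `_load_cache` and `_save_cache` bodies are not shown in this review, so the JSON layout and the helper names below (`record_result`, `save_cache`, `load_cache`) are assumptions; this is a sketch of one way the round trip could work.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass
class SearchResult:
    url: str
    title: str
    snippet: str
    engine: str
    score: int = 0


Cache = Dict[str, List[SearchResult]]


def record_result(cache: Cache, result: SearchResult) -> None:
    # setdefault creates the per-engine list the first time an engine is seen.
    cache.setdefault(result.engine, []).append(result)


def save_cache(cache: Cache, path: str) -> None:
    # Dataclasses are not JSON-serializable directly, so convert to dicts.
    payload = {eng: [asdict(r) for r in results] for eng, results in cache.items()}
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(payload, fh, indent=2)


def load_cache(path: str) -> Cache:
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as fh:
        raw = json.load(fh)
    return {eng: [SearchResult(**item) for item in items] for eng, items in raw.items()}


if __name__ == "__main__":
    cache: Cache = {}
    record_result(cache, SearchResult("https://example.com/a", "A", "N/A", "bing"))
    path = os.path.join(tempfile.mkdtemp(), "dork_cache.json")
    save_cache(cache, path)
    print(load_cache(path) == cache)
```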

Comment thread utils/dork_scanner.py
Comment on lines +376 to +388
                    task = self.search_engine(engine_name, query_dork, engine_config, max_results, pages)
                    tasks.append((task, engine_config["rate_limit"]))

        semaphore = asyncio.Semaphore(40)

        async def bounded_search(task, delay):
            async with semaphore:
                res = await task
                await asyncio.sleep(delay + random.uniform(0, 0.5))
                pbar.update(pages)
                return res

        scan_tasks = [bounded_search(task, delay) for task, delay in tasks]

suggestion (performance): Rate limiting is applied both inside search_engine and again in bounded_search, which may be redundant.

search_engine already does await asyncio.sleep(engine_config["rate_limit"] + random.uniform(0, 0.5)) per page, and bounded_search adds another await asyncio.sleep(delay + random.uniform(0, 0.5)) after each task. This stacked delay likely slows scans more than intended without much benefit. Consider centralizing the throttling (either per page or per task) so the effective request rate matches the configured rate_limit.

Suggested change
-                    task = self.search_engine(engine_name, query_dork, engine_config, max_results, pages)
-                    tasks.append((task, engine_config["rate_limit"]))
-
-        semaphore = asyncio.Semaphore(40)
-
-        async def bounded_search(task, delay):
-            async with semaphore:
-                res = await task
-                await asyncio.sleep(delay + random.uniform(0, 0.5))
-                pbar.update(pages)
-                return res
-
-        scan_tasks = [bounded_search(task, delay) for task, delay in tasks]
+                    task = self.search_engine(engine_name, query_dork, engine_config, max_results, pages)
+                    tasks.append(task)
+
+        semaphore = asyncio.Semaphore(40)
+
+        async def bounded_search(task):
+            async with semaphore:
+                res = await task
+                pbar.update(pages)
+                return res
+
+        scan_tasks = [bounded_search(task) for task in tasks]
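
To make the single-layer throttling concrete, here is a small runnable sketch in which the per-page sleep is the only delay and the semaphore only caps concurrency. `throttled_fetch`, `run_bounded`, and the `fetch_page` callback are hypothetical names standing in for the PR's `search_engine` and `bounded_search`.

```python
import asyncio
import random
from typing import Awaitable, Callable, Iterable, List


async def throttled_fetch(
    fetch_page: Callable[[int], Awaitable[str]],
    pages: int,
    rate_limit: float,
    jitter: float = 0.5,
) -> List[str]:
    """Fetch pages one after another, sleeping once per page."""
    bodies: List[str] = []
    for page in range(1, pages + 1):
        bodies.append(await fetch_page(page))
        # The only throttling point: configured delay plus random jitter.
        await asyncio.sleep(rate_limit + random.uniform(0, jitter))
    return bodies


async def run_bounded(coros: Iterable[Awaitable], limit: int = 40) -> list:
    """Run coroutines concurrently with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(coro):
        # The semaphore caps concurrency only; it adds no extra sleep.
        async with semaphore:
            return await coro

    return await asyncio.gather(*(bounded(c) for c in coros))
```

With this split, the effective delay per request is `rate_limit` plus jitter, and the concurrency limit can be tuned independently of it.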

Comment thread utils/dork_scanner.py
parser.add_argument("--domain", help="Domain filter (e.g., example.com)")
parser.add_argument("--output", help="Output file (json/csv)")
parser.add_argument("--format", choices=["json", "csv"], default="json", help="Output format")
parser.add_argument("--proxies", nargs="+", help="Proxy list[](http://ip:port)")

nitpick (typo): The proxies help string has an odd list[] fragment that looks accidental.

The help text currently shows "Proxy list[](http://ip:port)", which looks like it has a stray []. Consider wording like "Proxy list (http://ip:port)" or "List of proxies, e.g. http://ip:port" for clearer --help output.

Suggested change
-    parser.add_argument("--proxies", nargs="+", help="Proxy list[](http://ip:port)")
+    parser.add_argument("--proxies", nargs="+", help="List of proxies, e.g. http://ip:port")

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4a4631c5ab

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread utils/dork_scanner.py
Comment on lines +283 to +287
offset = 10 * (page - 1)
url = (
    engine_config["base_url"]
    + urllib.parse.quote(dork)
    + f"{pag_param}{offset}&num={max_results}"

P2: Use max_results to compute pagination offset

When --num-results is set to anything other than 10, the pagination offset for Google/Bing is still hardcoded to 10 * (page - 1), while the request asks for num={max_results}. Pages 2 and later then overlap or skip results because the offset no longer aligns with the page size; a user requesting, say, 25 results per page gets duplicated or partial coverage on those engines. Consider computing the offset from max_results instead so pagination matches the requested page size.
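
A minimal sketch of the offset fix, following the URL construction quoted in the comment. The function name `build_search_url` and the example base URL and pagination parameter are placeholders, not values from the PR's engine configuration.

```python
import urllib.parse


def build_search_url(
    base_url: str, dork: str, pag_param: str, page: int, max_results: int
) -> str:
    """Build a paginated search URL whose offset tracks the page size."""
    # Page 1 starts at 0, page 2 at max_results, page 3 at 2 * max_results, ...
    offset = max_results * (page - 1)
    return (
        base_url
        + urllib.parse.quote(dork)
        + f"{pag_param}{offset}&num={max_results}"
    )


# Placeholder endpoint and parameter name, for illustration only.
print(build_search_url("https://search.example/search?q=", "site:example.com", "&start=", 3, 25))
```

With 25 results per page, page 3 now starts at offset 50 rather than 20, so consecutive pages neither overlap nor leave gaps.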

Comment thread utils/dork_scanner.py
Comment on lines +329 to +333
norm_url = self._normalize_url(href)
if norm_url in {r.url for r in self.cache.get(engine_name, [])}:
    continue
result = SearchResult(norm_url, title, snippet, engine_name)
result.score = self._score_dork_result(result)

P2: Populate cache before saving results

The code only checks self.cache to skip cached URLs but never inserts new results into self.cache; later _save_cache() persists the unchanged cache. As a result, running the scanner multiple times with the same cache file will re-process already-seen targets because nothing was added to the cache in this run. This defeats the intended caching/de-duplication behavior and adds unnecessary load; consider appending new SearchResults into self.cache[engine_name] before saving.
