Add resilient multi-engine dork scanner #2
Reviewer's Guide
Introduces a new standalone asynchronous multi-engine dork scanner utility with heuristic scoring, caching, batched progress reporting, and export helpers for SQLi/XSS-oriented OSINT scanning.
Sequence diagram for execute_cross_engine_scan workflow
sequenceDiagram
actor User
participant CLI as main
participant Framework as MultiEngineDorkFramework
participant Engines as SearchEngines
participant Cache as CacheFile
participant Targets as TargetValidation
User->>CLI: invoke with arguments
CLI->>Framework: __init__(proxies, output_format, cache_file)
Framework->>Cache: _load_cache()
Cache-->>Framework: cached_results
CLI->>Framework: execute_cross_engine_scan(domain_filter, output_file, max_results, pages)
Framework->>Framework: generate_master_dork_list()
loop for each engine and dork
Framework->>Framework: schedule search_engine(engine_name, dork, engine_config)
end
par concurrent_search_tasks
Framework->>Engines: search_engine requests (bounded by semaphore)
Engines-->>Framework: raw HTML responses per engine/page
Framework->>Framework: parse HTML with CSS selectors
Framework->>Framework: normalize URLs and score results
end
Framework->>Framework: deduplicate across engines
Framework->>Targets: HEAD requests for top scored URLs
Targets-->>Framework: liveness flags
Framework->>Framework: filter to valid_results
alt output_file provided
Framework->>Framework: _export_results(results, filename)
Framework->>Cache: _save_cache()
else no output_file
Framework->>Cache: _save_cache()
end
Framework-->>CLI: final results
CLI-->>User: print top scored targets and helper CLI commands
Class diagram for the new multi-engine dork scanner utility
classDiagram
class SearchResult {
+str url
+str title
+str snippet
+str engine
+int score
+str sqlmap_cli
+str nuclei_cli
}
class MultiEngineDorkFramework {
+List~str~ USER_AGENTS
-List~str~ proxies
-str output_format
-aiohttp.ClientSession session
-str cache_file
-Dict~str, List~SearchResult~~ cache
-Dict~str, Dict~str, any~~ engines
+__init__(proxies, output_format, cache_file)
-_load_cache() Dict~str, List~SearchResult~~
-_save_cache() None
-_get_session() aiohttp.ClientSession
-_get_random_ua() str
-_normalize_url(url) str
-_score_dork_result(result) int
-_generate_sqlmap_cli(url) str
-_generate_nuclei_cli(url) str
+generate_master_dork_list() Dict~str, List~str~~
+_retry_on_failure(max_retries) function_wrapper
+search_engine(engine_name, dork, engine_config, max_results, pages) List~SearchResult~
-_get_random_proxy() str
+execute_cross_engine_scan(domain_filter, output_file, max_results, pages) List~SearchResult~
-_export_results(results, filename) None
}
class main_module {
+async main() None
}
MultiEngineDorkFramework "*" o-- "*" SearchResult : produces
main_module ..> MultiEngineDorkFramework : creates
main_module ..> SearchResult : prints_top_results
Hey - I've found 4 issues, and left some high level feedback:
- The caching logic currently only reads from `dork_cache.json` and filters against it but never appends new `SearchResult` objects into `self.cache` before `_save_cache`, so cache contents will never evolve and the cache effectively doesn’t reflect newly discovered URLs; consider updating `self.cache` as you collect results per engine.
- You’re applying rate limiting in multiple layers (sleep inside `search_engine` per page plus an additional sleep in `bounded_search` per task based on the same `rate_limit`), which may slow scans more than intended; you might want to consolidate the rate limiting in a single place to make the effective request rate easier to reason about.
- The `--proxies` help string (`'Proxy list[](http://ip:port)'`) looks slightly malformed and may confuse users; consider simplifying it to something like `'Proxy list (e.g. http://ip:port)'`.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The caching logic currently only reads from `dork_cache.json` and filters against it but never appends new `SearchResult` objects into `self.cache` before `_save_cache`, so cache contents will never evolve and the cache effectively doesn’t reflect newly discovered URLs; consider updating `self.cache` as you collect results per engine.
- You’re applying rate limiting in multiple layers (sleep inside `search_engine` per page plus an additional sleep in `bounded_search` per task based on the same `rate_limit`), which may slow scans more than intended; you might want to consolidate the rate limiting in a single place to make the effective request rate easier to reason about.
- The `--proxies` help string (`'Proxy list[](http://ip:port)'`) looks slightly malformed and may confuse users; consider simplifying it to something like `'Proxy list (e.g. http://ip:port)'`.
## Individual Comments
### Comment 1
<location> `utils/dork_scanner.py:329-330` </location>
<code_context>
+ )
+ snippet = snippet_match.group(1).strip() if snippet_match else "N/A"
+
+ norm_url = self._normalize_url(href)
+ if norm_url in {r.url for r in self.cache.get(engine_name, [])}:
+     continue
+ result = SearchResult(norm_url, title, snippet, engine_name)
</code_context>
<issue_to_address>
**suggestion (performance):** Cache lookup recreates a set on every iteration, which is unnecessary overhead.
`if norm_url in {r.url for r in self.cache.get(engine_name, [])}` rebuilds a set on every iteration, which can be costly for large caches. Precompute the URL set once per engine (e.g., before the loop) or maintain a separate `Set[str]` of cached URLs to get O(1) membership checks without repeated allocations.
Suggested implementation:
```python
norm_url = self._normalize_url(href)
if norm_url in cached_urls:
    continue
```
To fully implement the optimization, you also need to:
1. Initialize `cached_urls` once per `engine_name` before entering the loop that processes results/pages for that engine. For example, in the method where this snippet lives (likely something like `_search_engine` or similar), after you have `engine_name` and before iterating over pages/results, add:
```python
cached_urls = {r.url for r in self.cache.get(engine_name, [])}
```
2. Whenever you add new `SearchResult` objects to `self.cache[engine_name]`, also add their URLs to `cached_urls` to keep the set in sync:
```python
self.cache[engine_name].append(result)
cached_urls.add(result.url)
```
Adjust this to match your existing cache update logic and data structures.
These changes ensure you only build the set once per engine and then benefit from O(1) membership checks without repeated allocations inside the loop.
</issue_to_address>
### Comment 2
<location> `utils/dork_scanner.py:332-335` </location>
<code_context>
+ norm_url = self._normalize_url(href)
+ if norm_url in {r.url for r in self.cache.get(engine_name, [])}:
+     continue
+ result = SearchResult(norm_url, title, snippet, engine_name)
+ result.score = self._score_dork_result(result)
+ result.sqlmap_cli = self._generate_sqlmap_cli(norm_url)
+ result.nuclei_cli = self._generate_nuclei_cli(norm_url)
+ results.append(result)
+
</code_context>
<issue_to_address>
**issue (bug_risk):** New search results are not added to the cache, so the cache never grows beyond what was loaded from disk.
`self.cache.get(engine_name, [])` is only used for the membership check; the new `result` instances are never added back to `self.cache`. As a result, de-duplication only considers entries loaded from disk, not those found in the current run. You could append each new result to `self.cache.setdefault(engine_name, [])` so `_save_cache` persists them and the cache reflects all past scans.
</issue_to_address>
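The two cache comments above can be addressed together. The following is a minimal sketch, not the PR's actual code: the `SearchResult` fields follow the class diagram, while `collect_new_results` and its arguments are hypothetical names introduced only for illustration. It builds the URL set once per engine and records every new result in the cache so that a later `_save_cache` call would have something new to persist.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class SearchResult:
    # Field names taken from the class diagram; defaults are an assumption.
    url: str
    title: str
    snippet: str
    engine: str
    score: int = 0


def collect_new_results(
    cache: Dict[str, List[SearchResult]],
    engine_name: str,
    candidates: Iterable[SearchResult],
) -> List[SearchResult]:
    """Return unseen results and record them in the cache (hypothetical helper)."""
    cached = cache.setdefault(engine_name, [])
    # Build the lookup set once per engine, not once per candidate.
    cached_urls = {r.url for r in cached}
    fresh: List[SearchResult] = []
    for result in candidates:
        if result.url in cached_urls:
            continue
        cached.append(result)        # the cache now grows during the run
        cached_urls.add(result.url)  # keep the lookup set in sync
        fresh.append(result)
    return fresh
```

Because the set is updated inside the loop, duplicates within the same run are skipped as well, not only entries loaded from disk.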
### Comment 3
<location> `utils/dork_scanner.py:376-388` </location>
<code_context>
+
+ semaphore = asyncio.Semaphore(40)
+
+ async def bounded_search(task, delay):
+     async with semaphore:
+         res = await task
+         await asyncio.sleep(delay + random.uniform(0, 0.5))
+         pbar.update(pages)
+         return res
</code_context>
<issue_to_address>
**suggestion (performance):** Rate limiting is applied both inside `search_engine` and again in `bounded_search`, which may be redundant.
`search_engine` already does `await asyncio.sleep(engine_config["rate_limit"] + random.uniform(0, 0.5))` per page, and `bounded_search` adds another `await asyncio.sleep(delay + random.uniform(0, 0.5))` after each task. This stacked delay likely slows scans more than intended without much benefit. Consider centralizing the throttling (either per page or per task) so the effective request rate matches the configured `rate_limit`.
```suggestion
task = self.search_engine(engine_name, query_dork, engine_config, max_results, pages)
tasks.append(task)

semaphore = asyncio.Semaphore(40)

async def bounded_search(task):
    async with semaphore:
        res = await task
        pbar.update(pages)
        return res

scan_tasks = [bounded_search(task) for task in tasks]
```
</issue_to_address>
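One way to keep the throttle in a single place is sketched below, under stated assumptions: `search_pages`, `bounded`, and the injectable `sleep` parameter are hypothetical names, and `fetch_page` stands in for whatever request logic `search_engine` actually performs. The per-page loop is the only place that sleeps; the semaphore wrapper limits concurrency and nothing else.

```python
import asyncio
import random
from typing import Awaitable, Callable, List


async def search_pages(
    fetch_page: Callable[[int], Awaitable[str]],
    pages: int,
    rate_limit: float,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> List[str]:
    """Fetch pages 1..pages, pausing once per page (the only throttle point)."""
    bodies: List[str] = []
    for page in range(1, pages + 1):
        bodies.append(await fetch_page(page))
        # Jittered delay, mirroring the rate_limit + uniform(0, 0.5) pattern.
        await sleep(rate_limit + random.uniform(0, 0.5))
    return bodies


async def bounded(semaphore: asyncio.Semaphore, coro: Awaitable[List[str]]) -> List[str]:
    """Cap concurrency only; no additional sleep happens here."""
    async with semaphore:
        return await coro
```

With this split, the effective delay per request is `rate_limit` plus jitter, which is easier to reason about than two stacked sleeps.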
### Comment 4
<location> `utils/dork_scanner.py:459` </location>
<code_context>
+ parser.add_argument("--domain", help="Domain filter (e.g., example.com)")
+ parser.add_argument("--output", help="Output file (json/csv)")
+ parser.add_argument("--format", choices=["json", "csv"], default="json", help="Output format")
+ parser.add_argument("--proxies", nargs="+", help="Proxy list[](http://ip:port)")
+ parser.add_argument("--num-results", type=int, default=10, help="Max results per page")
+ parser.add_argument("--pages", type=int, default=5, help="Pages to scan per engine")
</code_context>
<issue_to_address>
**nitpick (typo):** The proxies help string has an odd `list[]` fragment that looks accidental.
The help text currently shows `"Proxy list[](http://ip:port)"`, which looks like it has a stray `[]`. Consider wording like `"Proxy list (http://ip:port)"` or `"List of proxies, e.g. http://ip:port"` for clearer `--help` output.
```suggestion
parser.add_argument("--proxies", nargs="+", help="List of proxies, e.g. http://ip:port")
```
</issue_to_address>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4a4631c5ab
    offset = 10 * (page - 1)
    url = (
        engine_config["base_url"]
        + urllib.parse.quote(dork)
        + f"{pag_param}{offset}&num={max_results}"
Use max_results to compute pagination offset
When --num-results is set to anything other than 10, the pagination offset for Google/Bing is still hardcoded to 10 * (page - 1), while the request asks for num={max_results}. Page 2 and later will therefore overlap or skip results because the offset is no longer aligned with the page size, so a user requesting 25 results per page, for example, will see duplicated or partial coverage for those engines. Consider computing the offset from max_results instead so pagination matches the requested page size.
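A minimal sketch of the suggested offset calculation follows. The helper name `page_offset` is hypothetical, and the sketch assumes the engine uses a zero-based result offset; whether a given engine honors the requested page size at all is engine-specific and not confirmed by this review.

```python
def page_offset(page: int, max_results: int) -> int:
    """Zero-based result offset for a 1-based page number."""
    if page < 1 or max_results < 1:
        raise ValueError("page and max_results must be >= 1")
    # Align the offset with the requested page size instead of a fixed 10.
    return max_results * (page - 1)
```

For 25 results per page, pages 1, 2, and 3 start at offsets 0, 25, and 50, so consecutive pages neither overlap nor leave gaps.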
    norm_url = self._normalize_url(href)
    if norm_url in {r.url for r in self.cache.get(engine_name, [])}:
        continue
    result = SearchResult(norm_url, title, snippet, engine_name)
    result.score = self._score_dork_result(result)
Populate cache before saving results
The code only checks self.cache to skip cached URLs but never inserts new results into self.cache; later _save_cache() persists the unchanged cache. As a result, running the scanner multiple times with the same cache file will re-fetch already-seen targets because nothing was ever added to the cache in this run. This defeats the intended caching/de-duplication behavior and adds unnecessary load; consider appending new SearchResult objects to self.cache[engine_name] before saving.
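The review does not show how `_load_cache` and `_save_cache` are implemented, so the following is only a hypothetical sketch of a JSON round trip for a cache shaped like `Dict[str, List[SearchResult]]`. The field names come from the class diagram; the default values and the standalone `save_cache`/`load_cache` functions are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass
class SearchResult:
    # Fields from the class diagram; the defaults are assumed.
    url: str
    title: str
    snippet: str
    engine: str
    score: int = 0
    sqlmap_cli: str = ""
    nuclei_cli: str = ""


def save_cache(path: str, cache: Dict[str, List[SearchResult]]) -> None:
    """Write the cache as JSON, one list of plain dicts per engine."""
    payload = {engine: [asdict(r) for r in results] for engine, results in cache.items()}
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(payload, fh, indent=2)


def load_cache(path: str) -> Dict[str, List[SearchResult]]:
    """Read the cache back; a missing file yields an empty cache."""
    try:
        with open(path, encoding="utf-8") as fh:
            payload = json.load(fh)
    except FileNotFoundError:
        return {}
    return {
        engine: [SearchResult(**item) for item in items]
        for engine, items in payload.items()
    }
```

Saving only helps if the in-memory cache was updated during the scan, which is the point of the comment above.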
Summary by Sourcery
Add a standalone asynchronous multi-engine dork scanning utility for SQLi/XSS-oriented OSINT discovery with scoring, caching, and export capabilities.