Conversation
- Add comprehensive unit tests for strix/llm module:
  - test_llm_utils.py: 35 tests for tool parsing, HTML entity decoding
  - test_config.py: 18 tests for LLMConfig initialization
  - test_memory_compressor.py: 27 tests for token counting, compression
  - test_request_queue.py: 19 tests for rate limiting, retries
- Create test fixtures:
  - Sample LLM responses (valid, truncated, multiple functions)
  - Vulnerability test cases (SQL injection, XSS, IDOR)
- Add conftest.py with shared fixtures for testing

Results:
- 97 tests passing, 2 skipped
- LLM module coverage: 53%
- utils.py: 100%, config.py: 100%, request_queue.py: 98%

This completes Phase 1 of the optimization plan.
Phase 2 - Prompt Optimization & False Positive Reduction:

New Features:
- Add confidence scoring system (strix/llm/confidence.py):
  - ConfidenceLevel enum (HIGH, MEDIUM, LOW, FALSE_POSITIVE)
  - VulnerabilityFinding dataclass with serialization
  - Automatic analysis of responses for FP and exploitation indicators
  - Pattern dictionaries for common vulnerability types
- Add tool invocation validation (strix/llm/utils.py):
  - validate_tool_invocation() with URL, path, command validation
  - KNOWN_TOOLS dictionary with required parameters
  - validate_all_invocations() for batch validation

Prompt Improvements:
- Add Chain-of-Thought (CoT) vulnerability validation protocol to system_prompt.jinja with 6-step mandatory analysis
- Add detailed <false_positive_indicators> sections to:
  - sql_injection.jinja (WAF, rate limit, encoding checks)
  - xss.jinja (CSP, sanitization, execution verification)
  - idor.jinja (public vs private, authorization checks)
  - ssrf.jinja (client vs server, egress verification)

Testing:
- 79 new tests for Phase 2 features
- Total: 176 tests passing (2 skipped)
- Coverage improved: 62% (up from 53%)
- confidence.py: 100% coverage
- utils.py: 95% coverage
Pull request overview
This pull request implements Phase 2 of a 3-phase LLM optimization plan for Strix, an open-source AI security testing tool. The primary goal is to reduce false positives in vulnerability detection by at least 25% through prompt optimization, confidence scoring, and structured validation.
Key Changes:
- Implements a confidence scoring system to classify vulnerability findings (HIGH/MEDIUM/LOW/FALSE_POSITIVE); a usage sketch follows this list
- Adds comprehensive tool invocation validation with URL, file path, and command checks
- Enhances vulnerability detection prompts with detailed false positive indicators for SQL injection, XSS, SSRF, and IDOR
- Adds a mandatory 6-step Chain-of-Thought analysis protocol to the system prompt
- Introduces 176 unit tests (79 new tests for Phase 2), bringing total coverage to 62% for the LLM module
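A minimal sketch of how the two new APIs compose, using the signatures quoted later in this review (the invocation dict comes from the PR's own docstring example; the indicator strings are illustrative, not the PR's actual values):

```python
from strix.llm.confidence import ConfidenceLevel, calculate_confidence
from strix.llm.utils import validate_tool_invocation

# Validate a tool call before executing it
invocation = {"toolName": "browser_actions.navigate", "args": {"url": "https://example.com"}}
is_valid, errors = validate_tool_invocation(invocation)
if not is_valid:
    print(f"Rejected invocation: {errors}")

# Score a finding from collected evidence (illustrative indicator strings)
level = calculate_confidence(
    indicators=["sql_error", "data_leak"],
    fp_indicators=["waf_block"],
    exploitation_confirmed=False,
    vuln_type="sql_injection",
)
assert level is ConfidenceLevel.MEDIUM  # per the decision rules quoted below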
Reviewed changes
Copilot reviewed 28 out of 28 changed files in this pull request and generated 16 comments.
| File | Description |
|---|---|
| todo.md | Comprehensive 3-phase optimization plan document (1122 lines) detailing Phase 1 completion and Phase 2 implementation status |
| strix/llm/confidence.py | New confidence scoring module with pattern-based false positive and exploitation detection |
| strix/llm/utils.py | Extended tool validation functions for URLs, file paths, and commands with structured error reporting |
| strix/agents/StrixAgent/system_prompt.jinja | Added <vulnerability_validation_protocol> with 6-step mandatory CoT analysis and false positive avoidance guidance |
| strix/prompts/vulnerabilities/*.jinja | Enhanced SQL injection, XSS, SSRF, and IDOR prompts with detailed <false_positive_indicators> sections |
| tests/unit/test_confidence.py | 46 comprehensive tests for confidence scoring system |
| tests/unit/test_llm_utils.py | 33 new validation tests added to existing tool parsing tests |
| tests/unit/test_*.py | Additional unit tests for request queue, memory compressor, and config modules |
| tests/fixtures/* | Sample responses and vulnerability test cases for validation |
| tests/conftest.py | Shared pytest fixtures for testing LLM responses and vulnerability findings |
| pyproject.toml | Updated pytest configuration with test markers (unit/integration/slow) |
```python
def to_dict(self) -> dict[str, Any]:
    """Convierte el hallazgo a diccionario para serialización."""
```
[nitpick] The docstring comment uses Spanish ("Convierte el hallazgo a diccionario para serialización") while the codebase appears to use English for documentation. Consider maintaining consistency by using English for all docstrings and comments.
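The fix is a direct translation (editor's rendering of the English version):

```python
def to_dict(self) -> dict[str, Any]:
    """Convert the finding to a dictionary for serialization."""
```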
| """Valida que una invocación de herramienta sea correcta. | ||
|
|
||
| Realiza validaciones de: | ||
| - Presencia de toolName | ||
| - Formato correcto de args | ||
| - Parámetros requeridos según la herramienta | ||
| - Validación de URLs para herramientas de browser | ||
|
|
||
| Args: | ||
| invocation: Diccionario con la invocación de herramienta | ||
|
|
||
| Returns: | ||
| Tuple de (es_válido, lista_de_errores) | ||
|
|
||
| Example: | ||
| >>> invocation = {"toolName": "browser_actions.navigate", "args": {"url": "https://example.com"}} | ||
| >>> is_valid, errors = validate_tool_invocation(invocation) | ||
| >>> is_valid | ||
| True | ||
| """ |
[nitpick] The docstrings in this file mix Spanish and English. For example, "Valida que una invocación de herramienta sea correcta" is in Spanish while the example is in English. Maintain consistency by using English throughout for better international collaboration and code maintainability.
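An English version of the same docstring, offered as the suggested fix (editor's translation):

```python
"""Validate that a tool invocation is well formed.

Performs validation of:
- Presence of toolName
- Correct format of args
- Required parameters for the given tool
- URL validation for browser tools

Args:
    invocation: Dictionary containing the tool invocation

Returns:
    Tuple of (is_valid, list_of_errors)

Example:
    >>> invocation = {"toolName": "browser_actions.navigate", "args": {"url": "https://example.com"}}
    >>> is_valid, errors = validate_tool_invocation(invocation)
    >>> is_valid
    True
"""
```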
```diff
 addopts = [
     "-v",
     "--strict-markers",
     "--strict-config",
-    "--cov=strix",
-    "--cov-report=term-missing",
-    "--cov-report=html",
-    "--cov-report=xml",
-    "--cov-fail-under=80",
+    "--tb=short",
 ]
```
The coverage-related options (--cov=strix, --cov-report=*, --cov-fail-under=80) have been removed from the default pytest addopts. This means coverage will no longer be measured by default when running tests. If this is intentional, it should be documented. Otherwise, these options should be retained to maintain test quality standards.
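If the removal is intentional, coverage can still be collected on demand with standard pytest-cov flags, e.g. `pytest --cov=strix --cov-report=term-missing --cov-fail-under=80`; otherwise, restoring the removed options keeps the 80% coverage gate enforced by default.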
```jinja
4. CHAIN-OF-THOUGHT ANALYSIS (MANDATORY):
   Before concluding any finding, analyze step by step:

   Step 1 - Initial Observation:
   "I observed [specific behavior] when sending [specific payload]"

   Step 2 - Hypothesis:
   "This could indicate [vulnerability type] because [reasoning]"

   Step 3 - Verification:
   "To verify, I will [additional tests to perform]"

   Step 4 - Evidence Evaluation:
   "The evidence [supports/contradicts] my hypothesis because [specific reasons]"

   Step 5 - False Positive Check:
   "I checked for false positive indicators: [list what you checked]"

   Step 6 - Conclusion:
   "My confidence level is [HIGH/MEDIUM/LOW/FALSE_POSITIVE] because [justification]"

5. AVOID COMMON FALSE POSITIVE PATTERNS:
   - Generic error pages mistaken for injection success
   - Rate limiting responses confused with vulnerability indicators
   - Cached responses giving inconsistent results
   - WAF blocks interpreted as application errors
   - Input validation errors vs actual vulnerabilities
   - Timing variations due to network latency vs actual time-based injection
</vulnerability_validation_protocol>
```
The new <vulnerability_validation_protocol> section mandates a 6-step Chain-of-Thought analysis. Confirm this matches the documented requirement of "CoT obligatorio de 6 pasos" (mandatory 6-step CoT) mentioned on line 415 of todo.md.
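The false-positive patterns listed in point 5 mirror the "pattern dictionaries for common vulnerability types" that this PR adds to strix/llm/confidence.py. A hypothetical shape for such a dictionary (keys and pattern strings are the editor's illustration, not the PR's actual values):

```python
# Illustrative only: maps vulnerability type to response substrings that
# suggest a false positive rather than successful exploitation.
FALSE_POSITIVE_PATTERNS: dict[str, list[str]] = {
    "sql_injection": ["rate limit exceeded", "access denied", "generic error page"],
    "xss": ["content-security-policy", "payload reflected but HTML-encoded"],
    "ssrf": ["request resolved client-side", "no egress observed"],
}
```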
```python
def calculate_confidence(
    indicators: list[str],
    fp_indicators: list[str],
    exploitation_confirmed: bool = False,
    vuln_type: str = "generic",
) -> ConfidenceLevel:
    """Calcula el nivel de confianza basado en evidencia.

    Args:
        indicators: Lista de indicadores positivos de vulnerabilidad
        fp_indicators: Lista de indicadores de falso positivo encontrados
        exploitation_confirmed: Si la explotación fue confirmada
        vuln_type: Tipo de vulnerabilidad para aplicar reglas específicas

    Returns:
        ConfidenceLevel apropiado basado en la evidencia

    Example:
        >>> calculate_confidence(
        ...     indicators=["sql_error", "data_leak", "timing_diff"],
        ...     fp_indicators=[],
        ...     exploitation_confirmed=True
        ... )
        ConfidenceLevel.HIGH
    """
    # Si hay explotación confirmada con suficiente evidencia, es HIGH
    if exploitation_confirmed and len(indicators) >= 2:
        return ConfidenceLevel.HIGH

    # Si los indicadores de FP superan a los positivos, es FALSE_POSITIVE
    if len(fp_indicators) > len(indicators) and not exploitation_confirmed:
        return ConfidenceLevel.FALSE_POSITIVE

    # Si hay múltiples indicadores sin FP significativos
    if len(indicators) >= 3 and len(fp_indicators) <= 1:
        return ConfidenceLevel.HIGH if exploitation_confirmed else ConfidenceLevel.MEDIUM

    # Si hay algunos indicadores
    if len(indicators) >= 2:
        return ConfidenceLevel.MEDIUM

    # Pocos indicadores = baja confianza
    return ConfidenceLevel.LOW
```
The function calculate_confidence has an unused parameter vuln_type that is documented but never used in the function body. Either utilize this parameter for vulnerability-specific confidence calculations or remove it from the signature.
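One way the parameter could earn its keep is vulnerability-specific evidence thresholds. A sketch (threshold names and values are the editor's assumption, not part of the PR):

```python
# Hypothetical per-type minimum indicator counts; values are illustrative.
MIN_INDICATORS_BY_TYPE: dict[str, int] = {
    "sql_injection": 2,  # e.g. SQL error plus timing or data evidence
    "xss": 1,            # confirmed script execution is usually decisive
    "generic": 2,
}

def min_indicators(vuln_type: str) -> int:
    """Return the evidence threshold for a vulnerability type, falling back to generic."""
    return MIN_INDICATORS_BY_TYPE.get(vuln_type, MIN_INDICATORS_BY_TYPE["generic"])
```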
```python
import os
import pytest
from unittest.mock import MagicMock, AsyncMock, patch
```
Import of 'AsyncMock' is not used.
```diff
-from unittest.mock import MagicMock, AsyncMock, patch
+from unittest.mock import MagicMock, patch
```
```python
import os
import pytest
from typing import Any
```
Import of 'Any' is not used.
```diff
-from typing import Any
```
```python
- Detección de indicadores de explotación
- Creación de VulnerabilityFinding
"""
import pytest
```
Import of 'pytest' is not used.
```diff
-import pytest
```
```python
- Content cleaning
"""

import pytest
```
Import of 'pytest' is not used.
```diff
-import pytest
```
```python
import os
import pytest
import asyncio
from unittest.mock import patch, MagicMock, AsyncMock
```
Import of 'AsyncMock' is not used.
```diff
-from unittest.mock import patch, MagicMock, AsyncMock
+from unittest.mock import patch, MagicMock
```