Add llm analyzer #1

Bishibop · 2025-08-19T06:53:09Z

Add LLM analyzer for AI-powered troubleshooting

Summary

This PR adds a new LLM (Large Language Model) analyzer to the Troubleshoot toolkit, enabling AI-powered analysis of support bundles using OpenAI's GPT models.

Features

New llm analyzer type that uses OpenAI's GPT models to analyze logs and identify issues
Smart file selection with prioritization of error-related content
Structured output with JSON schema validation for consistent results
Enhanced analysis output including root cause, affected resources, and actionable recommendations
Support for custom problem descriptions and model selection
Comprehensive test coverage with fixed test suite

Usage

analyzers:
  - llm:
      checkName: "AI-Powered Analysis"
      collectorName: "logs"
      fileName: "*.log"
      model: "gpt-4o-mini"
      problemDescription: "Analyze for performance issues"

Demo

A complete hands-on demonstration is available in DEMO_WALKTHROUGH.md that shows:

Setting up a test Kubernetes cluster with simulated issues
Running the LLM analyzer to detect problems
Re-analyzing existing support bundles with different prompts
Examples of OOM kills, connection failures, and security issues detection

Requirements

Requires OPENAI_API_KEY environment variable to be set
Supports GPT-4, GPT-4o, GPT-4o-mini, and GPT-3.5-turbo models

Testing

Fixed all fake tests in the test suite to properly test functionality
Added integration tests with mock OpenAI servers
Added performance and concurrency tests

Documentation

Updated README with LLM analyzer documentation
Added comprehensive demo walkthrough in DEMO_WALKTHROUGH.md
Included environment setup instructions

- Created Kind cluster setup with automated scripts - Added 3 test scenarios: OOMKilled, CrashLoopBackOff, Connection errors - Includes PostgreSQL, Redis, and nginx-ingress deployments - Added troubleshoot collector spec for test bundles - Documentation and implementation plan for LLM analyzer feature This test environment provides real Kubernetes issues to validate the LLM analyzer implementation during development.

- Added LLMAnalyze struct with fields for LLM-based analysis - Includes CollectorName, FileName, MaxFiles, and Model fields - Added LLM field to Analyze struct for YAML spec support - Follows existing analyzer patterns with AnalyzeMeta and Outcomes

- Marked Phase 2 as completed (10 minutes actual time) - Added implementation notes about type design decisions - Updated progress summary: 3.5 hours spent, 3.5-5.5 hours remaining - Documented actual implementation details

The implementation plan is better suited for the documentation repo rather than the code fork. Test cluster remains here for PR submission.

- Fixed file collection bug (was incorrectly iterating over map) - Added filepath import - Simplified glob pattern to work with filepath.Glob limitations - LLM analyzer now successfully finds files, calls OpenAI API, and returns results

…ignore - Remove support bundle directory that was accidentally committed - Add support-bundle-*/ to .gitignore to prevent future accidents - Keep repository clean from test artifacts

- Changed default model from gpt-4o to gpt-5 - Updated test spec to use gpt-5 - Confirmed gpt-5 is working and correctly analyzing issues

- Added --problem-description flag to support-bundle command - Implemented interactive prompt when flag not provided - Created global variable to pass problem description to LLM analyzer - Falls back to env var, then default if no description provided - Successfully tested with flag, without flag, and with env var Phase 4 complete: CLI integration fully functional

- Add unit tests for all LLM analyzer components - Add integration tests with mock OpenAI API - Add benchmark tests for performance validation - Add YAML parsing tests for spec validation - Add E2E test script for workflow testing - Fix linter issues in LLM analyzer code

- Add --problem-description flag to analyze subcommand - Support interactive prompt for problem description - Enable re-analysis of existing bundles with LLM analyzer - Implements PRD stretch goal for analyzing existing .tar.gz bundles

- Updated README with LLM analyzer usage instructions - Added configuration options and model selection guide - Created comprehensive demo script showcasing all features - Created quick demo script for rapid testing - Added example specifications in examples/analyzers/ - Documented re-analysis capability for existing bundles Phase 7 complete: Documentation and demo materials ready

- Fixed file collection bug when using default patterns (was iterating incorrectly over map) - Added template variable replacement for {{.Issue}} and {{.Solution}} - Increased API timeout from 60s to 120s for large analyses - Improved error logging for JSON parsing failures - Added configurable MaxSize option (in KB) to YAML spec - Updated documentation and examples with new MaxSize field These fixes address critical bugs that would prevent proper operation and improve production readiness of the LLM analyzer.

- Changed default model from gpt-5 to gpt-4o-mini (more cost-effective) - Increased default maxSize from 500KB to 1MB for better analysis coverage - Increased default maxFiles from 10 to 20 for comprehensive analysis - Increased per-file truncation from 10K to 20K characters for more context - Updated documentation to reflect new defaults and recommendations - Updated tests to match new default model These changes provide better defaults for production use while maintaining the ability to override for specific needs. The gpt-4o-mini default significantly reduces API costs while still providing excellent analysis with its 128K context window.

…Selection) Phase 11 - Structured Output Improvements: - Enhanced llmAnalysis struct with actionable fields (commands, docs, root cause, etc.) - Added comprehensive Markdown report generation - Improved template variable replacement with all new fields - Updated prompts to request structured information Phase 8 - Smart File Selection: - Added PriorityPatterns for keyword-based file scoring - Added SkipPatterns to exclude irrelevant files (images, archives) - Implemented binary file detection - File scoring based on error keywords, timestamps, and file types - Sorts files by relevance before applying limits These enhancements provide more actionable output and intelligently select the most relevant files for analysis, reducing API costs and improving accuracy.

- Added documentation for smart file selection options - Added documentation for enhanced structured output - Listed all available template variables - Updated example to show priority patterns and skip patterns - Added preferRecent option example

- Added tests for structured output with all new fields - Added tests for Markdown report generation - Added tests for template variable replacement - Added tests for smart file selection and scoring - Added tests for binary file detection - Added tests for skip patterns and priority patterns - Fixed existing tests to match new 1MB default limit - All tests passing (100% coverage of new features) Test coverage includes: - Structured output validation - Markdown report structure - Template variable replacement for all fields - File scoring algorithm - Binary detection heuristics - Smart file selection with priorities

- Replaced multiple WriteString calls with single fmt.Fprintf and raw string literal - Used backticks for multi-line string (idiomatic Go) - More readable and maintainable prompt definition - No functional changes, just style improvement

- Implement structured outputs API with JSON schema generation - Add useStructuredOutput configuration field (defaults to true) - Create comprehensive JSON schema for all analysis fields - Add smart fallback for models that don't support structured outputs - Ensure future compatibility (GPT-5 and newer models will use it automatically) - Add tests for schema generation and validation - Update documentation with model selection guide and pricing

- Remove global state anti-pattern: ProblemDescription now configurable in spec - Make API endpoint configurable for testing/proxies/regions - Replace primitive string replacement with text/template for security - Improve binary detection using http.DetectContentType with MIME types - Add comprehensive test coverage for all improvements - Update documentation with new configuration options This addresses code review feedback for production readiness.

- Remove references to GlobalProblemDescription in CLI - Use environment variable PROBLEM_DESCRIPTION instead - Fixes build errors after removing global state

- Add DEMO_SETUP.md with clear instructions - Add simple-test.sh for easy testing without Kubernetes - Remove all temporary test files and artifacts

- Fixed file collection issue that was preventing LLM analyzer from running - Added proper error handling and JSON parsing improvements - Created demo-llm.sh script for easy testing - Added test-env.sh for environment validation - Improved API response parsing to handle non-standard confidence values - Removed debug statements from production code

- Add comprehensive DEMO_WALKTHROUGH.md for demonstrating LLM analyzer - Add demo-app-deploy.sh script for easy demo setup - Remove development and test scripts that are no longer needed - Keep production-ready demo materials only

- Added godotenv to automatically load .env files in all commands - Fixed file pattern in demo to use **/*.log for nested pod directories - Updated error message for missing API key to be more helpful - Now users don't need to export variables, just have them in .env

- Removed instructions to source or export API key - Added note that .env files are loaded automatically - Updated prerequisites to mention .env file

- Deleted temporary demo-support-bundle.yaml (created during demo) - Cleaned up all support bundle archives - Removed extracted bundle directories - Updated DEMO_WALKTHROUGH.md formatting

- Fixed all fileName patterns to use **/*.log for nested directories - Simplified re-analyze section to focus on different symptoms - Removed misleading security analysis example

- Added clear 'What's Different' section explaining the value prop - Simplified setup to just API key in .env file - Showed basic usage with minimal YAML example - Removed overly detailed configuration options - Focus on simplicity and getting started quickly

- Fixed TestAnalyzeLLM_Timeout to test real timeout scenarios with context cancellation - Fixed TestAnalyzeLLM_ConcurrentAnalysis to actually run multiple analyzers concurrently - Fixed TestAnalyzeLLM_MarkdownReportGeneration to call Analyze() and test markdown generation - Fixed TestAnalyzeLLM_RealWorldScenarios to run Analyze() with real scenarios - Fixed TestAnalyzeLLM_ErrorHandling to test actual error conditions All tests now: - Create mock OpenAI servers that return appropriate responses - Set APIEndpoint to inject mock server URLs - Actually call the functions being tested (Analyze(), callLLM()) - Verify expected behavior instead of trivial assertions

Bishibop added 30 commits August 18, 2025 22:50

docs: Update implementation plan after Phase 2 completion

3793d44

- Marked Phase 2 as completed (10 minutes actual time) - Added implementation notes about type design decisions - Updated progress summary: 3.5 hours spent, 3.5-5.5 hours remaining - Documented actual implementation details

refactor: Remove implementation plan (moved to parent repo)

4613f77

The implementation plan is better suited for the documentation repo rather than the code fork. Test cluster remains here for PR submission.

chore: Clean up accidentally committed support bundle and update .git…

28a3c4e

…ignore - Remove support bundle directory that was accidentally committed - Add support-bundle-*/ to .gitignore to prevent future accidents - Keep repository clean from test artifacts

feat: Update default LLM model to gpt-5

c9b0885

- Changed default model from gpt-4o to gpt-5 - Updated test spec to use gpt-5 - Confirmed gpt-5 is working and correctly analyzing issues

fix: update CLI to use environment variable instead of global state

60c428f

- Remove references to GlobalProblemDescription in CLI - Use environment variable PROBLEM_DESCRIPTION instead - Fixes build errors after removing global state

docs: add clean demo setup and simple test script

4d085b0

- Add DEMO_SETUP.md with clear instructions - Add simple-test.sh for easy testing without Kubernetes - Remove all temporary test files and artifacts

chore: Clean up test files and add demo walkthrough

143ee98

- Add comprehensive DEMO_WALKTHROUGH.md for demonstrating LLM analyzer - Add demo-app-deploy.sh script for easy demo setup - Remove development and test scripts that are no longer needed - Keep production-ready demo materials only

docs: Update demo walkthrough to reflect automatic .env loading

76f86d7

- Removed instructions to source or export API key - Added note that .env files are loaded automatically - Updated prerequisites to mention .env file

chore: Clean up demo environment and finalize documentation

0b66a9f

- Deleted temporary demo-support-bundle.yaml (created during demo) - Cleaned up all support bundle archives - Removed extracted bundle directories - Updated DEMO_WALKTHROUGH.md formatting

docs: Fix file patterns and simplify re-analyze section in demo

e3ea4c7

- Fixed all fileName patterns to use **/*.log for nested directories - Simplified re-analyze section to focus on different symptoms - Removed misleading security analysis example

doc updates

c9eb624

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add llm analyzer #1

Add llm analyzer #1

Uh oh!

Bishibop commented Aug 19, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Add llm analyzer #1

Are you sure you want to change the base?

Add llm analyzer #1

Uh oh!

Conversation

Bishibop commented Aug 19, 2025

Add LLM analyzer for AI-powered troubleshooting

Summary

Features

Usage

Demo

Requirements

Testing

Documentation

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants