Add Shannon entropy analysis for secret detection#53

Merged
GuthL merged 7 commits into master from feature/shannon-entropy
Mar 9, 2026
@GuthL commented Mar 8, 2026

Summary

  • Adds Shannon entropy analysis as a second detection pass in RuleSet::find_secrets(), catching high-entropy tokens (≥3.5 bits/char, ≥20 chars) that regex rules miss
  • New src/entropy.rs module with shannon_entropy(), tokenizer, and find_high_entropy_tokens()
  • Entropy matches emit as SecretMatch { rule_id: "entropy" } — zero changes to downstream pipeline (placeholder, vault, resolution, notices)
  • Configurable via KEYCLAW_ENTROPY_ENABLED, KEYCLAW_ENTROPY_THRESHOLD, KEYCLAW_ENTROPY_MIN_LEN env vars
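For readers unfamiliar with the measure, here is a minimal sketch of how a per-character Shannon entropy check with the thresholds quoted above might look. It is not the code from src/entropy.rs: the body of `shannon_entropy` and the helper `is_high_entropy` are assumptions written for illustration, and the length check counts characters rather than bytes.

```rust
use std::collections::HashMap;

/// Shannon entropy of `token` in bits per character.
/// Illustrative sketch only; the PR's own `shannon_entropy()` may differ.
fn shannon_entropy(token: &str) -> f64 {
    let mut counts: HashMap<char, usize> = HashMap::new();
    let mut total = 0usize;
    for c in token.chars() {
        *counts.entry(c).or_insert(0) += 1;
        total += 1;
    }
    if total == 0 {
        return 0.0;
    }
    let n = total as f64;
    // H = -sum(p_i * log2(p_i)) over the distinct characters.
    counts
        .values()
        .map(|&k| {
            let p = k as f64 / n;
            -p * p.log2()
        })
        .sum::<f64>()
}

/// Hypothetical helper: applies a threshold (bits/char) and a minimum
/// length (in characters) to a single token.
fn is_high_entropy(token: &str, threshold: f64, min_len: usize) -> bool {
    token.chars().count() >= min_len && shannon_entropy(token) >= threshold
}
```

With the defaults from the summary, `is_high_entropy(token, 3.5, 20)` rejects a run of twenty identical characters (entropy 0) and accepts a 32-character token in which every character is distinct (entropy 5 bits/char).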

Closes #52

Test plan

  • 8 unit tests in entropy.rs (entropy math, tokenization, offset correctness, edge cases)
  • 2 integration tests in gitleaks_rules.rs (entropy matches in find_secrets(), disabled mode)
  • 2 config tests (env var reading, defaults)
  • 1 pipeline integration test (high-entropy token redacted end-to-end)
  • Existing e2e test updated to disable entropy when testing custom gitleaks config
  • Full test suite passes, release build succeeds

🤖 Generated with Claude Code

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 006eb715ad

    matches: &mut Vec<EntropyMatch<'a>>,
) {
    let token = &input[start..end];
    if token.len() < min_len {

P1: Measure entropy token length in characters, not UTF-8 bytes

check_token gates candidates with token.len() < min_len, but len() is byte length, so the default KEYCLAW_ENTROPY_MIN_LEN=20 starts evaluating non-ASCII text after only ~7 CJK characters. Because the only skip heuristic is is_all_lowercase_alpha, ordinary Japanese/Chinese/Korean message content can be treated as high-entropy secrets and rewritten into placeholders, which corrupts normal multilingual prompts before they reach the model.
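The difference between the two length measures can be shown with a short sketch. The function names `passes_byte_gate` and `passes_char_gate` are hypothetical and exist only for this comparison; the first mirrors the `token.len() < min_len` check quoted above, the second counts Unicode scalar values instead.

```rust
/// Byte-length gate, mirroring the reviewed `token.len() < min_len` check.
/// `str::len()` returns the number of UTF-8 bytes, not characters.
fn passes_byte_gate(token: &str, min_len: usize) -> bool {
    token.len() >= min_len
}

/// Character-count gate, one possible way to apply the review's suggestion.
/// `chars().count()` counts Unicode scalar values.
fn passes_char_gate(token: &str, min_len: usize) -> bool {
    token.chars().count() >= min_len
}
```

The seven-character string "こんにちは世界" occupies 21 UTF-8 bytes, so it passes the byte gate at `min_len = 20` but not the character gate. A 20-character ASCII token passes both, since each ASCII character is one byte.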

GuthL and others added 7 commits March 9, 2026 16:26
Add entropy-based secret detection as a complementary pass alongside
regex rules. Tokens with Shannon entropy >= 3.5 and length >= 20 are
flagged, catching base64-encoded API keys and similar machine-generated
secrets that regex patterns may miss. The entropy pass runs after regex
matching and skips overlapping matches and existing placeholders.
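One way the overlap rule described in this commit message could work is sketched below, assuming matches are tracked as half-open byte ranges. The names `overlaps` and `drop_overlapping` are hypothetical, and the placeholder check is left out because the source does not describe how placeholders are recognized.

```rust
use std::ops::Range;

/// True when two half-open byte ranges share at least one byte.
fn overlaps(a: &Range<usize>, b: &Range<usize>) -> bool {
    a.start < b.end && b.start < a.end
}

/// Keeps only the entropy candidates that do not overlap any regex match.
/// Both arguments are assumed to hold byte ranges into the same input.
fn drop_overlapping(
    regex_matches: &[Range<usize>],
    entropy_candidates: Vec<Range<usize>>,
) -> Vec<Range<usize>> {
    entropy_candidates
        .into_iter()
        .filter(|c| !regex_matches.iter().any(|m| overlaps(m, c)))
        .collect()
}
```

Because the ranges are half-open, a candidate that starts exactly where a regex match ends is treated as adjacent, not overlapping, and is kept.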

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the split+find offset calculation with a char_indices-based
manual tokenizer. The previous approach could misidentify byte offsets
when segments repeated or consecutive delimiters appeared. Also fix
module-level doc comments from /// to //!.
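A `char_indices`-based tokenizer of the kind this commit message describes might look like the sketch below. It is an illustration under assumptions, not the PR's code: the delimiter set in `is_delimiter` and the `(start, end, token)` return shape are invented here. The point is that offsets come directly from the iteration, so repeated segments and runs of delimiters cannot shift them.

```rust
/// Hypothetical delimiter set; the PR's actual delimiters are not shown.
fn is_delimiter(c: char) -> bool {
    c.is_whitespace()
        || matches!(
            c,
            '"' | '\'' | ',' | ';' | '=' | ':' | '(' | ')' | '[' | ']' | '{' | '}'
        )
}

/// Splits `input` on delimiters and returns `(start, end, token)` tuples,
/// where `start..end` is the token's byte range in `input`.
fn tokenize(input: &str) -> Vec<(usize, usize, &str)> {
    let mut tokens = Vec::new();
    let mut start: Option<usize> = None;
    for (i, c) in input.char_indices() {
        if is_delimiter(c) {
            // A delimiter closes the token in progress, if there is one.
            if let Some(s) = start.take() {
                tokens.push((s, i, &input[s..i]));
            }
        } else if start.is_none() {
            // First non-delimiter character opens a new token.
            start = Some(i);
        }
    }
    // Flush a token that runs to the end of the input.
    if let Some(s) = start {
        tokens.push((s, input.len(), &input[s..]));
    }
    tokens
}
```

For the input `key=abc  abc`, the two `abc` tokens are reported at byte ranges 4..7 and 9..12. A search-based approach that looks up each segment's text in the input would be at risk of finding the first occurrence again for the second token.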

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The test verifies that empty gitleaks rules produce no redaction.
With entropy detection now active, the high-entropy test secret
gets caught even without regex rules. Disable entropy for this
specific test to preserve its original intent.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Verifies that a high-entropy token not matched by any gitleaks
regex rule is still caught and redacted by the entropy analyzer.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Parse the `entropy` field from gitleaks rules and use it as a
minimum Shannon entropy threshold for regex matches. 130 bundled
rules define this field — matches with entropy below the threshold
are now discarded as likely false positives.
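The filtering step could be sketched as follows. The `Rule` struct, its field names, and `keep_match` are hypothetical stand-ins rather than the project's real types, and whether the real comparison is strict or inclusive is an assumption here.

```rust
use std::collections::HashMap;

/// Shannon entropy in bits per character (illustrative implementation).
fn shannon_entropy(token: &str) -> f64 {
    let mut counts: HashMap<char, usize> = HashMap::new();
    for c in token.chars() {
        *counts.entry(c).or_insert(0) += 1;
    }
    let n = token.chars().count() as f64;
    if n == 0.0 {
        return 0.0;
    }
    counts
        .values()
        .map(|&k| {
            let p = k as f64 / n;
            -p * p.log2()
        })
        .sum::<f64>()
}

/// Hypothetical, simplified view of a parsed gitleaks rule.
struct Rule {
    id: String,
    /// Optional minimum entropy taken from the rule's `entropy` field.
    entropy: Option<f64>,
}

/// Keeps a regex match unless the rule sets a threshold and the matched
/// text falls below it. Rules without a threshold keep every match.
fn keep_match(rule: &Rule, matched: &str) -> bool {
    match rule.entropy {
        Some(min) => shannon_entropy(matched) >= min,
        None => true,
    }
}
```

Under this sketch, a rule with `entropy: Some(3.5)` discards a match made of one repeated character, while a rule with no `entropy` field keeps it.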

Closes #54

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The per-rule entropy threshold on generic-api-key (≥3.5) filtered out
the low-entropy test UUIDs (entropy ~2.42). Replace with realistic
UUIDs (entropy ~3.9) and fix all rustfmt violations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@GuthL GuthL force-pushed the feature/shannon-entropy branch from 31f8a30 to 609e928 Compare March 9, 2026 16:26
@GuthL GuthL merged commit ac75667 into master Mar 9, 2026
7 checks passed
GuthL added a commit that referenced this pull request Mar 9, 2026
Add Shannon entropy analysis for secret detection
GuthL added a commit that referenced this pull request Mar 10, 2026
Add Shannon entropy analysis for secret detection