Add Shannon entropy analysis for secret detection #53
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 006eb715ad
```rust
    matches: &mut Vec<EntropyMatch<'a>>,
) {
    let token = &input[start..end];
    if token.len() < min_len {
```
Measure entropy token length in characters, not UTF-8 bytes
`check_token` gates candidates with `token.len() < min_len`, but `len()` is byte length, so the default `KEYCLAW_ENTROPY_MIN_LEN=20` starts evaluating non-ASCII text after only ~7 CJK characters. Because the only skip heuristic is `is_all_lowercase_alpha`, ordinary Japanese/Chinese/Korean message content can be treated as high-entropy secrets and rewritten into placeholders, corrupting normal multilingual prompts before they reach the model.
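A minimal sketch of the fix the review suggests: gate candidate length on character count rather than byte length. The helper name `long_enough` and the gating shape are illustrative assumptions, not the PR's actual code.

```rust
// Sketch only: gate entropy candidates on character count, not byte length,
// so short CJK strings don't slip past KEYCLAW_ENTROPY_MIN_LEN.
// `long_enough` is a hypothetical helper, not the PR's actual code.
fn long_enough(token: &str, min_len: usize) -> bool {
    token.chars().count() >= min_len
}

fn main() {
    let cjk = "これは普通の日本語です"; // 11 characters, 33 UTF-8 bytes
    assert!(cjk.len() >= 20);        // byte length passes the old gate
    assert!(!long_enough(cjk, 20));  // character count correctly rejects it
}
```

With this change, ordinary CJK prose of fewer than 20 characters never reaches the entropy check, while genuinely long tokens are still evaluated.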
Add entropy-based secret detection as a complementary pass alongside regex rules. Tokens with Shannon entropy ≥ 3.5 and length ≥ 20 are flagged, catching base64-encoded API keys and similar machine-generated secrets that regex patterns may miss. The entropy pass runs after regex matching and skips overlapping matches and existing placeholders.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
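The commit above describes the core measurement; the following is a self-contained sketch of Shannon entropy in bits per character. This is the textbook formulation, not necessarily the PR's exact implementation.

```rust
use std::collections::HashMap;

/// Shannon entropy in bits per character: H = -sum over c of p(c) * log2(p(c)).
/// Textbook sketch, not necessarily the PR's exact code.
fn shannon_entropy(token: &str) -> f64 {
    let mut counts: HashMap<char, f64> = HashMap::new();
    let mut total = 0.0;
    for c in token.chars() {
        *counts.entry(c).or_insert(0.0) += 1.0;
        total += 1.0;
    }
    counts
        .values()
        .map(|&n| {
            let p = n / total;
            -p * p.log2()
        })
        .sum()
}

fn main() {
    // A single repeated symbol carries zero bits per character.
    assert!(shannon_entropy("aaaaaaaaaaaaaaaaaaaa").abs() < 1e-9);
    // 21 distinct characters give log2(21) ≈ 4.39 bits/char, above the 3.5 cutoff.
    assert!(shannon_entropy("sk-AbC123xYz987QwErTy") > 3.5);
}
```

Machine-generated secrets (base64, hex, random alphanumerics) cluster near the top of this scale, which is why a ≥3.5 bits/char threshold separates them from ordinary English words.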
Replace the split+find offset calculation with a `char_indices`-based manual tokenizer. The previous approach could misidentify byte offsets when segments repeated or consecutive delimiters appeared. Also fix module-level doc comments from `///` to `//!`.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
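A sketch of the offset-tracking idea this commit describes: `char_indices` yields byte offsets directly, so repeated segments and consecutive delimiters cannot desynchronize the spans. The token character class below is an assumption; the PR's actual delimiter rules may differ.

```rust
/// Return (start, end) byte spans of candidate tokens. Sketch only:
/// the real PR's delimiter/character rules may differ.
fn token_spans(input: &str) -> Vec<(usize, usize)> {
    let mut spans = Vec::new();
    let mut start: Option<usize> = None;
    for (i, c) in input.char_indices() {
        // Treat base64-ish characters as token members (an assumption).
        let is_token_char = c.is_alphanumeric() || "+/=_-".contains(c);
        if is_token_char {
            start.get_or_insert(i);
        } else if let Some(s) = start.take() {
            spans.push((s, i)); // byte offsets come straight from char_indices
        }
    }
    if let Some(s) = start {
        spans.push((s, input.len()));
    }
    spans
}

fn main() {
    // Repeated segment plus consecutive delimiters: a split+find approach could
    // locate the first "foo" twice; offset tracking keeps each span distinct.
    assert_eq!(token_spans("foo  foo"), vec![(0, 3), (5, 8)]);
}
```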
The test verifies that empty gitleaks rules produce no redaction. With entropy detection now active, the high-entropy test secret gets caught even without regex rules. Disable entropy for this specific test to preserve its original intent.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Verifies that a high-entropy token not matched by any gitleaks regex rule is still caught and redacted by the entropy analyzer.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Parse the `entropy` field from gitleaks rules and use it as a minimum Shannon entropy threshold for regex matches. 130 bundled rules define this field; matches with entropy below the threshold are now discarded as likely false positives.

Closes #54

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
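One way the per-rule threshold could be applied; the struct and field names below are illustrative, not the crate's actual types.

```rust
use std::collections::HashMap;

// Illustrative types only; the crate's actual rule struct differs.
struct Rule {
    id: &'static str,
    entropy: Option<f64>, // parsed from the gitleaks `entropy` field
}

fn shannon_entropy(token: &str) -> f64 {
    let mut counts: HashMap<char, f64> = HashMap::new();
    for c in token.chars() {
        *counts.entry(c).or_insert(0.0) += 1.0;
    }
    let total = token.chars().count() as f64;
    counts
        .values()
        .map(|&n| {
            let p = n / total;
            -p * p.log2()
        })
        .sum()
}

/// Keep a regex match only if it meets the rule's entropy floor (if any).
fn passes_entropy(rule: &Rule, matched: &str) -> bool {
    rule.entropy.map_or(true, |min| shannon_entropy(matched) >= min)
}

fn main() {
    let generic = Rule { id: "generic-api-key", entropy: Some(3.5) };
    // A low-entropy placeholder UUID is discarded as a likely false positive...
    assert!(!passes_entropy(&generic, "11111111-1111-1111-1111-111111111111"));
    // ...while a rule without a threshold keeps every regex match.
    let no_floor = Rule { id: "other-rule", entropy: None };
    assert!(passes_entropy(&no_floor, "11111111-1111-1111-1111-111111111111"));
    println!("checked rules: {}, {}", generic.id, no_floor.id);
}
```

This matches the behavior described in the next commit: placeholder-style UUIDs (entropy ~2.42) fall below `generic-api-key`'s 3.5 floor, while realistic UUIDs (~3.9) pass.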
The per-rule entropy threshold on `generic-api-key` (≥3.5) filtered out the low-entropy test UUIDs (entropy ~2.42). Replace them with realistic UUIDs (entropy ~3.9) and fix all rustfmt violations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed 31f8a30 to 609e928
Summary
- Entropy pass runs inside `RuleSet::find_secrets()`, catching high-entropy tokens (≥3.5 bits/char, ≥20 chars) that regex rules miss
- New `src/entropy.rs` module with `shannon_entropy()`, tokenizer, and `find_high_entropy_tokens()`
- Entropy matches are emitted as `SecretMatch { rule_id: "entropy" }`; zero changes to downstream pipeline (placeholder, vault, resolution, notices)
- Configurable via `KEYCLAW_ENTROPY_ENABLED`, `KEYCLAW_ENTROPY_THRESHOLD`, `KEYCLAW_ENTROPY_MIN_LEN` env vars

Closes #52
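A hedged sketch of how the three env vars might map onto a runtime config. Only the variable names come from the PR; the parsing, fallback behavior, and defaults (3.5 / 20 / enabled) are assumptions drawn from the description above.

```rust
// Sketch of reading the PR's configuration knobs; exact parsing and
// fallback behavior are assumptions; only the variable names are from the PR.
struct EntropyConfig {
    enabled: bool,
    threshold: f64, // bits per character
    min_len: usize, // minimum token length
}

impl EntropyConfig {
    fn from_env() -> Self {
        // Generic helper: parse the env var if set and valid, else fall back.
        fn var_or<T: std::str::FromStr>(name: &str, default: T) -> T {
            std::env::var(name)
                .ok()
                .and_then(|v| v.parse().ok())
                .unwrap_or(default)
        }
        EntropyConfig {
            enabled: var_or("KEYCLAW_ENTROPY_ENABLED", true),
            threshold: var_or("KEYCLAW_ENTROPY_THRESHOLD", 3.5),
            min_len: var_or("KEYCLAW_ENTROPY_MIN_LEN", 20),
        }
    }
}

fn main() {
    let cfg = EntropyConfig::from_env();
    println!(
        "enabled={} threshold={} min_len={}",
        cfg.enabled, cfg.threshold, cfg.min_len
    );
}
```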
Test plan
- Unit tests in `entropy.rs` (entropy math, tokenization, offset correctness, edge cases)
- Integration tests in `gitleaks_rules.rs` (entropy matches in `find_secrets()`, disabled mode)

🤖 Generated with Claude Code