Add Shannon entropy analysis for secret detection #53
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 006eb715ad
```rust
    matches: &mut Vec<EntropyMatch<'a>>,
) {
    let token = &input[start..end];
    if token.len() < min_len {
```
Measure entropy token length in characters, not UTF-8 bytes
`check_token` gates candidates with `token.len() < min_len`, but `len()` is byte length, so the default `KEYCLAW_ENTROPY_MIN_LEN=20` starts evaluating non-ASCII text after only ~7 CJK characters. Because the only skip heuristic is `is_all_lowercase_alpha`, ordinary Japanese/Chinese/Korean message content can be treated as high-entropy secrets and rewritten into placeholders, corrupting normal multilingual prompts before they reach the model.
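A minimal sketch of the fix the review suggests: gate candidate length on character count rather than byte length. The helper name `long_enough` and the gating shape are illustrative assumptions, not the PR's actual code.

```rust
// Sketch only: gate entropy candidates on character count, not byte length,
// so short CJK strings don't slip past KEYCLAW_ENTROPY_MIN_LEN.
// `long_enough` is a hypothetical helper, not the PR's actual code.
fn long_enough(token: &str, min_len: usize) -> bool {
    token.chars().count() >= min_len
}

fn main() {
    let cjk = "これは普通の日本語です"; // 11 characters, 33 UTF-8 bytes
    assert!(cjk.len() >= 20);        // byte length passes the old gate
    assert!(!long_enough(cjk, 20));  // character count correctly rejects it
}
```

With this change, ordinary CJK prose of fewer than 20 characters never reaches the entropy check, while genuinely long tokens are still evaluated.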
Add entropy-based secret detection as a complementary pass alongside regex rules. Tokens with Shannon entropy ≥ 3.5 and length ≥ 20 are flagged, catching base64-encoded API keys and similar machine-generated secrets that regex patterns may miss. The entropy pass runs after regex matching and skips overlapping matches and existing placeholders.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
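The commit above describes the core measurement; the following is a self-contained sketch of Shannon entropy in bits per character. This is the textbook formulation, not necessarily the PR's exact implementation.

```rust
use std::collections::HashMap;

/// Shannon entropy in bits per character: H = -sum over c of p(c) * log2(p(c)).
/// Textbook sketch, not necessarily the PR's exact code.
fn shannon_entropy(token: &str) -> f64 {
    let mut counts: HashMap<char, f64> = HashMap::new();
    let mut total = 0.0;
    for c in token.chars() {
        *counts.entry(c).or_insert(0.0) += 1.0;
        total += 1.0;
    }
    counts
        .values()
        .map(|&n| {
            let p = n / total;
            -p * p.log2()
        })
        .sum()
}

fn main() {
    // A single repeated symbol carries zero bits per character.
    assert!(shannon_entropy("aaaaaaaaaaaaaaaaaaaa").abs() < 1e-9);
    // 21 distinct characters give log2(21) ≈ 4.39 bits/char, above the 3.5 cutoff.
    assert!(shannon_entropy("sk-AbC123xYz987QwErTy") > 3.5);
}
```

Machine-generated secrets (base64, hex, random alphanumerics) cluster near the top of this scale, which is why a ≥3.5 bits/char threshold separates them from ordinary English words.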
Replace the split+find offset calculation with a `char_indices`-based manual tokenizer. The previous approach could misidentify byte offsets when segments repeated or consecutive delimiters appeared. Also fix module-level doc comments from `///` to `//!`.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
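A sketch of the offset-tracking idea this commit describes: `char_indices` yields byte offsets directly, so repeated segments and consecutive delimiters cannot desynchronize the spans. The token character class below is an assumption; the PR's actual delimiter rules may differ.

```rust
/// Return (start, end) byte spans of candidate tokens. Sketch only:
/// the real PR's delimiter/character rules may differ.
fn token_spans(input: &str) -> Vec<(usize, usize)> {
    let mut spans = Vec::new();
    let mut start: Option<usize> = None;
    for (i, c) in input.char_indices() {
        // Treat base64-ish characters as token members (an assumption).
        let is_token_char = c.is_alphanumeric() || "+/=_-".contains(c);
        if is_token_char {
            start.get_or_insert(i);
        } else if let Some(s) = start.take() {
            spans.push((s, i)); // byte offsets come straight from char_indices
        }
    }
    if let Some(s) = start {
        spans.push((s, input.len()));
    }
    spans
}

fn main() {
    // Repeated segment plus consecutive delimiters: a split+find approach could
    // locate the first "foo" twice; offset tracking keeps each span distinct.
    assert_eq!(token_spans("foo  foo"), vec![(0, 3), (5, 8)]);
}
```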
The test verifies that empty gitleaks rules produce no redaction. With entropy detection now active, the high-entropy test secret gets caught even without regex rules. Disable entropy for this specific test to preserve its original intent.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Verifies that a high-entropy token not matched by any gitleaks regex rule is still caught and redacted by the entropy analyzer.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Parse the `entropy` field from gitleaks rules and use it as a minimum Shannon entropy threshold for regex matches. 130 bundled rules define this field; matches with entropy below the threshold are now discarded as likely false positives.

Closes #54

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
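One way the per-rule threshold could be applied; the struct and field names below are illustrative, not the crate's actual types.

```rust
use std::collections::HashMap;

// Illustrative types only; the crate's actual rule struct differs.
struct Rule {
    id: &'static str,
    entropy: Option<f64>, // parsed from the gitleaks `entropy` field
}

fn shannon_entropy(token: &str) -> f64 {
    let mut counts: HashMap<char, f64> = HashMap::new();
    for c in token.chars() {
        *counts.entry(c).or_insert(0.0) += 1.0;
    }
    let total = token.chars().count() as f64;
    counts
        .values()
        .map(|&n| {
            let p = n / total;
            -p * p.log2()
        })
        .sum()
}

/// Keep a regex match only if it meets the rule's entropy floor (if any).
fn passes_entropy(rule: &Rule, matched: &str) -> bool {
    rule.entropy.map_or(true, |min| shannon_entropy(matched) >= min)
}

fn main() {
    let generic = Rule { id: "generic-api-key", entropy: Some(3.5) };
    // A low-entropy placeholder UUID is discarded as a likely false positive...
    assert!(!passes_entropy(&generic, "11111111-1111-1111-1111-111111111111"));
    // ...while a rule without a threshold keeps every regex match.
    let no_floor = Rule { id: "other-rule", entropy: None };
    assert!(passes_entropy(&no_floor, "11111111-1111-1111-1111-111111111111"));
    println!("checked rules: {}, {}", generic.id, no_floor.id);
}
```

This matches the behavior described in the next commit: placeholder-style UUIDs (entropy ~2.42) fall below `generic-api-key`'s 3.5 floor, while realistic UUIDs (~3.9) pass.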
The per-rule entropy threshold on `generic-api-key` (≥3.5) filtered out the low-entropy test UUIDs (entropy ~2.42). Replace them with realistic UUIDs (entropy ~3.9) and fix all rustfmt violations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed 31f8a30 to 609e928
Summary
- Entropy pass runs inside `RuleSet::find_secrets()`, catching high-entropy tokens (≥3.5 bits/char, ≥20 chars) that regex rules miss
- New `src/entropy.rs` module with `shannon_entropy()`, tokenizer, and `find_high_entropy_tokens()`
- Entropy matches are emitted as `SecretMatch { rule_id: "entropy" }`; zero changes to downstream pipeline (placeholder, vault, resolution, notices)
- Configurable via `KEYCLAW_ENTROPY_ENABLED`, `KEYCLAW_ENTROPY_THRESHOLD`, `KEYCLAW_ENTROPY_MIN_LEN` env vars

Closes #52
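A hedged sketch of how the three env vars might map onto a runtime config. Only the variable names come from the PR; the parsing, fallback behavior, and defaults (3.5 / 20 / enabled) are assumptions drawn from the description above.

```rust
// Sketch of reading the PR's configuration knobs; exact parsing and
// fallback behavior are assumptions; only the variable names are from the PR.
struct EntropyConfig {
    enabled: bool,
    threshold: f64, // bits per character
    min_len: usize, // minimum token length
}

impl EntropyConfig {
    fn from_env() -> Self {
        // Generic helper: parse the env var if set and valid, else fall back.
        fn var_or<T: std::str::FromStr>(name: &str, default: T) -> T {
            std::env::var(name)
                .ok()
                .and_then(|v| v.parse().ok())
                .unwrap_or(default)
        }
        EntropyConfig {
            enabled: var_or("KEYCLAW_ENTROPY_ENABLED", true),
            threshold: var_or("KEYCLAW_ENTROPY_THRESHOLD", 3.5),
            min_len: var_or("KEYCLAW_ENTROPY_MIN_LEN", 20),
        }
    }
}

fn main() {
    let cfg = EntropyConfig::from_env();
    println!(
        "enabled={} threshold={} min_len={}",
        cfg.enabled, cfg.threshold, cfg.min_len
    );
}
```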
Test plan
- Unit tests in `entropy.rs` (entropy math, tokenization, offset correctness, edge cases)
- Integration tests in `gitleaks_rules.rs` (entropy matches in `find_secrets()`, disabled mode)

🤖 Generated with Claude Code