Shared: Improve sensitive data heuristics #20024
Pull Request Overview
This PR improves sensitive data heuristics across multiple programming languages by enhancing the regular expressions used to identify sensitive data patterns. The changes add new matching patterns for various types of sensitive information while also improving exclusions to reduce false positives.
- Expands detection patterns for authentication data (oauth, api keys, mfa), financial information (cvv, iban, routing numbers), health data (gender, patient records), and device information (IP addresses, MAC addresses, fingerprints)
- Adds "confidential" to secret detection patterns and improves the trusted pattern exclusion
- Enhances exclusion patterns to reduce false positives for file paths, URLs, and accounting-related terms
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
File | Description |
---|---|
swift/ql/lib/codeql/swift/security/internal/SensitiveDataHeuristics.qll | Updates sensitive data detection patterns for Swift |
rust/ql/lib/codeql/rust/security/internal/SensitiveDataHeuristics.qll | Updates sensitive data detection patterns for Rust |
ruby/ql/lib/codeql/ruby/security/internal/SensitiveDataHeuristics.qll | Updates sensitive data detection patterns for Ruby |
python/ql/lib/semmle/python/security/internal/SensitiveDataHeuristics.qll | Updates sensitive data detection patterns for Python |
javascript/ql/lib/semmle/javascript/security/internal/SensitiveDataHeuristics.qll | Updates sensitive data detection patterns for JavaScript |
rust/ql/test/library-tests/sensitivedata/test.rs | Updates test expectations to reflect improved heuristics |
Based on DCA results across all affected languages I think I'm going to:
Improve sensitive data heuristics, by adding new matching heuristics and new exclusions. This is very, very loosely based on an initial version by Copilot, though I had to cut or refine the majority of those suggestions to get good results (testing with Rust).
I will start DCA runs on all affected languages. I do anticipate some further refinements may be required as common patterns vary slightly between languages.
There are three things I'd like opinions about:
- The `e(mail|_mail)` heuristic? It's labelled with a comment "seems too noisy", but I don't understand what motivated that - it always worked very well in my tests (Swift + Rust).
- The `uid` heuristic (`([uU]|^|_|[a-z](?=U))([uU][iI][dD])`)? I believe it's responsible for a lot of false positive results because UID might stand for "User ID" (likely sensitive) or just "Unique ID" (which may or may not be sensitive; often they aren't). What are your experiences? (@andersfugmann FYI as we discussed this recently)
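
To make the UID ambiguity concrete, here is a minimal sketch in Python that runs the pattern quoted above against a few identifier names. The names are hypothetical and picked only for illustration, and using a case-sensitive `re.search` is an assumption about how the pattern is applied, not a description of the CodeQL implementation.

```python
import re

# Pattern quoted in the PR description. Applying it with a case-sensitive
# re.search is an assumption made for this sketch.
UID_PATTERN = re.compile(r"([uU]|^|_|[a-z](?=U))([uU][iI][dD])")


def matches_uid(name: str) -> bool:
    """Return True if the identifier name contains a match for the uid pattern."""
    return UID_PATTERN.search(name) is not None


# Hypothetical identifier names, chosen only for illustration.
for name in ["uid", "user_uid", "userUid", "uuid", "unique_uid", "guid", "fluid"]:
    print(f"{name}: {matches_uid(name)}")
```

Under these assumptions, names such as `uuid` and `unique_uid` match alongside `user_uid`, which is the kind of "Unique ID" hit the question is about, while `guid` and `fluid` do not match.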