-
Notifications
You must be signed in to change notification settings - Fork 527
dropbox repeated token attack #1244
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this PR doesn't pass tests (see https://reference.garak.ai/en/latest/extending.html) and the probe and detector don't fix garak's expectations/format
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these probes may fit in the glitch
module as something like glitch.DropboxRepeatedToken
or possibly glitch.cl100k_base
. My read suggests a similar technique in play here.
Also please note from a naming convention perspective the plugin type probe/detector
is not needed in the final class name as the package path inherently includes the plugin class type.
Co-authored-by: Jeffrey Martin <[email protected]> Signed-off-by: Divya Chitimalla <[email protected]>
Tests are passing now, please take a look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor naming preference.
Signed-off-by: Divya Chitimalla <[email protected]>
Co-authored-by: Jeffrey Martin <[email protected]> Signed-off-by: Divya Chitimalla <[email protected]>
This PR introduces a new probe and corresponding detector based on Dropbox’s LLM Security Research, specifically targeting repeated token attacks that can cause model divergence, hallucinations, or instruction bypass.
Repeated token attacks have been shown to:
Cause LLMs to hallucinate memorized training data
Break instruction-following capabilities
Trigger unsafe or nonsensical completions