dropbox repeated token attack #1244

Open: dchiitmalla wants to merge 22 commits into main

Conversation

dchiitmalla
Contributor

This PR introduces a new probe and corresponding detector based on Dropbox’s LLM Security Research, specifically targeting repeated token attacks that can cause model divergence, hallucinations, or instruction bypass.

Repeated token attacks have been shown to:

  • Cause LLMs to hallucinate memorized training data

  • Break instruction-following capabilities

  • Trigger unsafe or nonsensical completions
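
For readers unfamiliar with the technique, here is a minimal sketch of how repeated-token prompts might be generated. It is not the code from this PR: the function name `build_repeated_token_prompts`, the `instruction` parameter, and the example tokens and repeat counts are all assumptions made for illustration.

```python
from itertools import product


def build_repeated_token_prompts(tokens, repeat_counts, instruction=""):
    """Return one prompt per (token, count) pair.

    Each prompt is the token repeated `count` times, separated by spaces,
    optionally preceded by an instruction string. Hypothetical helper,
    not part of garak or of this PR.
    """
    prompts = []
    for token, count in product(tokens, repeat_counts):
        repeated = " ".join([token] * count)
        prompts.append(f"{instruction}{repeated}")
    return prompts


if __name__ == "__main__":
    # Illustrative values only; not the tokens or counts used in the PR.
    for prompt in build_repeated_token_prompts(["poem", "a"], [3, 5]):
        print(prompt)
```

A real probe would likely use much larger repeat counts and a curated token list; the small values here only keep the output readable.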

leondz added the probes (Content & activity of LLM probes) label Jun 2, 2025
Collaborator

leondz left a comment

This PR doesn't pass tests (see https://reference.garak.ai/en/latest/extending.html), and the probe and detector don't fit garak's expectations/format.

dchiitmalla marked this pull request as draft June 2, 2025 18:08
dchiitmalla requested a review from leondz June 8, 2025 22:32
dchiitmalla marked this pull request as ready for review June 8, 2025 22:33
Collaborator

jmartin-tech left a comment

I think these probes may fit in the glitch module as something like glitch.DropboxRepeatedToken or possibly glitch.cl100k_base. My read suggests a similar technique in play here.

Also, please note that from a naming-convention perspective the plugin type (probe/detector) is not needed in the final class name, since the package path already includes the plugin type.
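
To illustrate the naming point, here is a small sketch. The helpers `qualified_plugin_name` and `has_redundant_suffix` are hypothetical and are not garak functions; they only show why a `Probe` or `Detector` suffix repeats information already carried by the package path.

```python
# Suffixes that repeat the plugin type already present in the package path.
REDUNDANT_SUFFIXES = ("Probe", "Detector")


def qualified_plugin_name(plugin_type: str, module: str, class_name: str) -> str:
    """Join plugin type, module, and class into a dotted plugin name."""
    return f"{plugin_type}.{module}.{class_name}"


def has_redundant_suffix(class_name: str) -> bool:
    """Return True if the class name ends with a plugin-type suffix."""
    return class_name.endswith(REDUNDANT_SUFFIXES)


if __name__ == "__main__":
    # The plugin type appears twice in the first name, once in the second.
    for name in ("DropboxRepeatedTokenProbe", "DropboxRepeatedToken"):
        print(qualified_plugin_name("probes", "glitch", name), has_redundant_suffix(name))
```

With the suffix dropped, the dotted name reads `probes.glitch.DropboxRepeatedToken`, and the plugin type appears only once.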

dchiitmalla
Contributor Author

This PR doesn't pass tests (see https://reference.garak.ai/en/latest/extending.html), and the probe and detector don't fit garak's expectations/format.

Tests are passing now, please take a look.

dchiitmalla requested a review from jmartin-tech July 24, 2025 19:38
Collaborator

jmartin-tech left a comment

Minor naming preference.

Co-authored-by: Jeffrey Martin <[email protected]>
Signed-off-by: Divya Chitimalla <[email protected]>
dchiitmalla requested a review from jmartin-tech July 24, 2025 20:28

3 participants