Add token-based field anonymization middleware #16

Merged: sergiobayona merged 3 commits into main from anonymizer on Apr 13, 2026

Conversation

@sergiobayona
Owner

Introduces a general-purpose anonymization pipeline that substitutes selected string fields in outbound tool results with stable opaque tokens before they reach the LLM, and restores the originals on inbound tool invocations. All domain knowledge (which keys to match, token prefixes, atomic-blob handling) is supplied by the application at wiring time, so the library itself stays free of PII, CRM, or field-naming opinions.

Three components, each in its own file:

  • VectorMCP::TokenStore — thread-safe bidirectional value <-> token store backed by Concurrent::Hash. Tokenization is idempotent for the same (value, prefix) pair. Tokens have the shape "PREFIX_XXXXXXXX" (8 uppercase hex chars). token? is a pure pattern check and does not consult the store, so middleware can detect tokens without holding a store reference. resolve returns nil for unknown tokens rather than raising.

  • VectorMCP::Util::TokenSweeper — stateless recursive traversal utility for parsed JSON-like structures. Yields each String leaf together with its parent Hash key (propagated across enclosing Arrays), returns a new structure without mutating the input, and defends against circular references via an identity-compared visited set.

  • VectorMCP::Middleware::Anonymizer — wires the store and sweeper together with application-supplied field rules and an optional atomic_keys regexp. sweep_outbound tokenizes matched string fields and (if atomic_keys is set) collapses Hash nodes under matching parent keys into a single canonical-JSON token. sweep_inbound reverses the mapping; unknown token-shaped strings pass through unchanged. before_tool_call and after_tool_call implement the inbound and outbound sweeps against context.params["arguments"] and context.result respectively. install_on(server, priority:) registers the configured instance via a generated Base-subclass adapter, working around the middleware manager's argumentless instantiation.
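As a rough illustration of how the three pieces described above could fit together, here is a minimal stdlib-only sketch. `SketchStore`, `sketch_sweep`, and the `"email"` field rule are hypothetical stand-ins rather than the VectorMCP API: a plain Hash plus Mutex replaces Concurrent::Hash, the token pattern assumes uppercase-letter prefixes, `token?` is assumed to be callable on the class, and returning the original node on a cycle is an assumption about cycle handling. Atomic-key collapse is not covered.

```ruby
require "securerandom"
require "set"

# Hypothetical stand-in for the token store (plain Hash + Mutex, not Concurrent::Hash).
class SketchStore
  TOKEN_PATTERN = /\A[A-Z]+_[0-9A-F]{8}\z/ # assumed shape: PREFIX_XXXXXXXX

  def initialize
    @forward = {}
    @reverse = {}
    @lock = Mutex.new
  end

  # Idempotent for the same (value, prefix) pair.
  def tokenize(value, prefix:)
    @lock.synchronize do
      @forward[[value, prefix]] ||= begin
        token = "#{prefix}_#{SecureRandom.hex(4).upcase}"
        @reverse[token] = value
        token
      end
    end
  end

  # Returns nil for unknown tokens instead of raising.
  def resolve(token)
    @lock.synchronize { @reverse[token] }
  end

  # Pure pattern check; never consults a store instance.
  def self.token?(str)
    str.is_a?(String) && TOKEN_PATTERN.match?(str)
  end
end

# Yields each String leaf with its parent Hash key (carried through Arrays)
# and returns a new structure without mutating the input.
def sketch_sweep(node, parent_key = nil, visited = Set.new.compare_by_identity, &block)
  case node
  when Hash, Array
    return node if visited.include?(node) # cycle: hand back the original node

    visited.add(node)
    begin
      if node.is_a?(Hash)
        node.each_with_object({}) { |(k, v), out| out[k] = sketch_sweep(v, k, visited, &block) }
      else
        node.map { |v| sketch_sweep(v, parent_key, visited, &block) }
      end
    ensure
      visited.delete(node)
    end
  when String
    block.call(node, parent_key)
  else
    node
  end
end

store = SketchStore.new
rules = { "email" => "EMAIL" } # hypothetical rule: matching key => token prefix
result = { "email" => ["a@example.com", "b@example.com"], "note" => "vip" }

# Outbound: tokenize strings whose parent key matches a rule.
outbound = sketch_sweep(result) do |str, key|
  rules.key?(key) ? store.tokenize(str, prefix: rules[key]) : str
end

# Inbound: restore known tokens; unknown token-shaped strings pass through.
inbound = sketch_sweep(outbound) do |str, _key|
  SketchStore.token?(str) ? (store.resolve(str) || str) : str
end
```

In this sketch `inbound` is equal to `result`, and `outbound["note"]` is left alone because `"note"` matches no rule.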

Spec coverage: 49 new examples covering tokenization idempotency, thread safety under 100 concurrent tokenize calls, sweeper traversal and cycle handling, outbound/inbound round-tripping, atomic-node collapse, and the tool-call middleware hooks. Full suite: 1410 examples passing; rubocop clean on all new files.

@sergiobayona
Owner Author

/gemini-review

@github-actions github-actions Bot left a comment

🤖 Gemini AI Code Review

Found 2 suggestions for improvement:

  • 💡 Medium: 2

This review was automatically generated by Gemini AI. Please review the suggestions carefully.

token
end
end

There's a subtle race condition here. Another thread could read the token from @forward after this line executes but before the @reverse map is updated on the next line. If that other thread's caller then immediately tries to resolve the token, it might get nil because the reverse mapping doesn't exist yet.

To ensure the store is always in a consistent state, you should populate the @reverse map before making the token visible in the @forward map. This guarantees that any token found in @forward is resolvable.

Suggested change
@reverse[token] = value
@forward[key] = token
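
To make the window described above concrete without relying on thread timing, the sketch below fires a callback between the two writes, standing in for "another thread runs here". `OrderingDemo`, `observe`, and `observe_mid_write` are hypothetical names for illustration only; the real store uses Concurrent::Hash and a different interface.

```ruby
# Hypothetical two-map store with a hook between the two writes.
class OrderingDemo
  def initialize(reverse_first:)
    @forward = {}
    @reverse = {}
    @reverse_first = reverse_first
  end

  def tokenize(value, token, &between_writes)
    if @reverse_first
      @reverse[token] = value
      between_writes&.call # simulated concurrent reader
      @forward[value] = token
    else
      @forward[value] = token
      between_writes&.call # simulated concurrent reader
      @reverse[token] = value
    end
    token
  end

  # What a concurrent reader would do: find the token, then resolve it.
  # Returns :not_visible, :unresolvable, or the resolved value.
  def observe(value)
    token = @forward[value]
    return :not_visible if token.nil?

    @reverse.fetch(token, :unresolvable)
  end
end

def observe_mid_write(reverse_first:)
  demo = OrderingDemo.new(reverse_first: reverse_first)
  seen = nil
  demo.tokenize("alice@example.com", "EMAIL_0000ABCD") do
    seen = demo.observe("alice@example.com")
  end
  seen
end

observe_mid_write(reverse_first: false) # => :unresolvable
observe_mid_write(reverse_first: true)  # => :not_visible
```

With the original order the reader can hold a token that does not resolve yet. With the suggested order the worst case is that the token is not yet visible, which the reader handles the same way as a value that has never been tokenized.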

end
ensure
visited.delete(hash)
end

The current traversal logic correctly handles cycles using a visited set that tracks the current recursion path. However, this approach may have performance implications for input structures that are Directed Acyclic Graphs (DAGs), where the same object (Hash or Array) is referenced from multiple places.

With the current implementation, each time a shared sub-structure is encountered, it will be fully traversed and a new corresponding structure will be built. For large and highly-shared structures, this could be inefficient.

A common pattern to optimize this is to use memoization. You could adapt the visited hash to cache the transformed result of each container. This would both improve performance on DAGs and potentially make the cycle-detection logic more explicit.

Here is an example of how walk_hash could be adapted:

# Define a sentinel value at the module level
VISITING = Object.new.freeze

private

# memo must compare keys by identity (for example {}.compare_by_identity),
# so that equal-but-distinct containers are not treated as the same node.
def walk_hash(hash, memo, &)
  memo_val = memo[hash]
  return hash if memo_val.equal?(VISITING) # Cycle detected: return the original node
  return memo_val if memo_val # Return memoized result

  memo[hash] = VISITING # Mark as currently being processed

  new_hash = hash.each_with_object({}) do |(key, value), out|
    out[key] = walk(value, key, memo, &)
  end

  memo[hash] = new_hash # Cache the final result (also the return value)
end

A similar change would apply to walk_array. This would avoid re-processing shared nodes while still correctly handling cycles.
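
For reference, a self-contained sketch of the memoized traversal suggested above, under a few assumptions: `memo_walk` and its helpers are hypothetical names, the memo is an identity-compared Hash, and only Hash nodes are cached. Arrays are guarded against cycles but not cached here, because the key handed to an Array's string leaves is inherited from the enclosing Hash and could differ between two references to the same Array.

```ruby
VISITING = Object.new.freeze # sentinel: node is on the current recursion path

def memo_walk(node, parent_key = nil, memo = {}.compare_by_identity, &block)
  case node
  when Hash   then memo_walk_hash(node, memo, &block)
  when Array  then memo_walk_array(node, parent_key, memo, &block)
  when String then block.call(node, parent_key)
  else node
  end
end

def memo_walk_hash(hash, memo, &block)
  cached = memo[hash]
  return hash if cached.equal?(VISITING) # cycle: hand back the original node
  return cached if cached                # shared node: reuse the earlier result

  memo[hash] = VISITING
  memo[hash] = hash.each_with_object({}) do |(key, value), out|
    out[key] = memo_walk(value, key, memo, &block)
  end
end

def memo_walk_array(array, parent_key, memo, &block)
  return array if memo[array].equal?(VISITING) # cycle through an Array

  memo[array] = VISITING
  begin
    array.map { |value| memo_walk(value, parent_key, memo, &block) }
  ensure
    memo.delete(array) # path marker only; Arrays are not cached
  end
end

shared = { "email" => "a@example.com" }
input  = { "primary" => shared, "backup" => shared }
calls  = 0
output = memo_walk(input) { |str, _key| calls += 1; str.upcase }
# calls is 1 here, and output["primary"] and output["backup"] are the same object.
```

One consequence of caching is that the output shares sub-structures wherever the input did, which matters if callers later mutate the result in place.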

The original tokenize implementation inserted into @forward before
@reverse:

    @forward[key] = token
    @reverse[token] = value

Because @forward and @reverse are independent Concurrent::Hash
instances, a second thread could observe the token in @forward (via
its own mutex-protected tokenize or via the fast path that reads
@forward before entering the mutex) and then call resolve(token)
before the originating thread has populated @reverse, yielding nil
for a token that should be resolvable.

Swap the order so @reverse is written first. Any thread that sees
the token in @forward is then guaranteed to find it in @reverse.

Add a regression spec that exercises the invariant under concurrent
reader/writer load.
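
A rough sketch of what such a reader/writer check might look like outside RSpec. `OrderedStore`, `peek`, and `unresolvable_sightings` are hypothetical names, and plain Hashes stand in for Concurrent::Hash, which is only a reasonable substitution on MRI where the GVL serialises individual Hash operations. A stress run like this can expose a broken ordering but cannot prove the absence of a race.

```ruby
require "securerandom"

# Hypothetical store: lock-free fast path on @forward, mutex around writes,
# and @reverse populated before @forward.
class OrderedStore
  def initialize
    @forward = {}
    @reverse = {}
    @lock = Mutex.new
  end

  def tokenize(value, prefix:)
    key = [value, prefix]
    existing = @forward[key] # fast path, no lock taken
    return existing if existing

    @lock.synchronize do
      @forward[key] || begin
        token = "#{prefix}_#{SecureRandom.hex(4).upcase}"
        @reverse[token] = value # make the token resolvable first
        @forward[key] = token   # then publish it
      end
    end
  end

  def resolve(token)
    @reverse[token]
  end

  # Read-only lookup used by the readers below.
  def peek(value, prefix:)
    @forward[[value, prefix]]
  end
end

# Counts how often a reader saw a token in @forward that did not resolve.
def unresolvable_sightings(store, values)
  misses = Queue.new
  writers = values.each_slice(50).map do |slice|
    Thread.new { slice.each { |v| store.tokenize(v, prefix: "EMAIL") } }
  end
  readers = Array.new(4) do
    Thread.new do
      while writers.any?(&:alive?)
        values.each do |v|
          token = store.peek(v, prefix: "EMAIL")
          misses << v if token && store.resolve(token).nil?
        end
      end
    end
  end
  (writers + readers).each(&:join)
  misses.size
end

emails = Array.new(500) { |i| "user#{i}@example.com" }
unresolvable_sightings(OrderedStore.new, emails) # should be 0 with reverse-first ordering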

Addresses review feedback on PR #16.
@sergiobayona sergiobayona merged commit 626115e into main Apr 13, 2026
8 checks passed