Add token-based field anonymization middleware #16
Introduces a general-purpose anonymization pipeline that substitutes selected string fields in outbound tool results with stable opaque tokens before they reach the LLM, and restores the originals on inbound tool invocations. All domain knowledge (which keys to match, token prefixes, atomic-blob handling) is supplied by the application at wiring time, so the library itself stays free of PII, CRM, or field-naming opinions.

Three components, each in its own file:

* VectorMCP::TokenStore: thread-safe bidirectional value <-> token store backed by Concurrent::Hash. Tokenization is idempotent for the same (value, prefix) pair. Tokens have the shape "PREFIX_XXXXXXXX" (8 uppercase hex chars). token? is a pure pattern check and does not consult the store, so middleware can detect tokens without holding a store reference. resolve returns nil for unknown tokens rather than raising.
* VectorMCP::Util::TokenSweeper: stateless recursive traversal utility for parsed JSON-like structures. Yields each String leaf together with its parent Hash key (propagated across enclosing Arrays), returns a new structure without mutating the input, and defends against circular references via an identity-compared visited set.
* VectorMCP::Middleware::Anonymizer: wires the store and sweeper together with application-supplied field rules and an optional atomic_keys regexp. sweep_outbound tokenizes matched string fields and (if atomic_keys is set) collapses Hash nodes under matching parent keys into a single canonical-JSON token. sweep_inbound reverses the mapping; unknown token-shaped strings pass through unchanged. before_tool_call and after_tool_call implement the inbound and outbound sweeps against context.params["arguments"] and context.result respectively. install_on(server, priority:) registers the configured instance via a generated Base-subclass adapter, working around the middleware manager's argumentless instantiation.

Spec coverage: 49 new examples covering tokenization idempotency, thread safety under 100 concurrent tokenize calls, sweeper traversal and cycle handling, outbound/inbound round-tripping, atomic-node collapse, and the tool-call middleware hooks. Full suite: 1410 examples passing; rubocop clean on all new files.
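
For readers who want to see the round trip in miniature, here is a rough sketch of the idea described above. It is not the code in this PR: SketchStore, SketchAnonymizer, the rules Hash (key pattern to token prefix), and the assumption that prefixes are uppercase letters are all hypothetical, and the sketch leaves out thread safety, cycle detection, and atomic_keys collapsing.

```ruby
require "securerandom"

# Hypothetical stand-in for the token store (no thread safety in this sketch).
class SketchStore
  # Assumed token shape: uppercase prefix, underscore, 8 uppercase hex chars.
  TOKEN_PATTERN = /\A[A-Z]+_[0-9A-F]{8}\z/

  def initialize
    @forward = {} # [value, prefix] => token
    @reverse = {} # token => value
  end

  # Idempotent for the same (value, prefix) pair.
  def tokenize(value, prefix:)
    key = [value, prefix]
    return @forward[key] if @forward.key?(key)

    token = "#{prefix}_#{SecureRandom.hex(4).upcase}"
    @reverse[token] = value
    @forward[key] = token
  end

  # Returns nil for unknown tokens rather than raising.
  def resolve(token)
    @reverse[token]
  end

  # Pure pattern check; never consults a store.
  def self.token?(string)
    TOKEN_PATTERN.match?(string)
  end
end

# Hypothetical anonymizer: rules map a key pattern to a token prefix.
class SketchAnonymizer
  def initialize(store:, rules:)
    @store = store
    @rules = rules
  end

  # Replaces String leaves whose parent key matches a rule with tokens.
  def sweep_outbound(payload)
    sweep(payload) do |value, key|
      _pattern, prefix = @rules.find { |pattern, _prefix| pattern.match?(key.to_s) }
      prefix ? @store.tokenize(value, prefix: prefix) : value
    end
  end

  # Restores known tokens; unknown token-shaped strings pass through unchanged.
  def sweep_inbound(payload)
    sweep(payload) do |value, _key|
      SketchStore.token?(value) ? (@store.resolve(value) || value) : value
    end
  end

  private

  # Builds a new structure; Arrays pass their parent Hash key down to their items.
  def sweep(node, parent_key = nil, &block)
    case node
    when Hash   then node.to_h { |key, value| [key, sweep(value, key, &block)] }
    when Array  then node.map { |item| sweep(item, parent_key, &block) }
    when String then block.call(node, parent_key)
    else node
    end
  end
end

anonymizer = SketchAnonymizer.new(
  store: SketchStore.new,
  rules: { /email/i => "EMAIL", /\Aname\z/i => "NAME" }
)
result = { "name" => "Ada", "emails" => ["ada@example.com"], "note" => "hi" }
masked = anonymizer.sweep_outbound(result)  # "name" and "emails" values become tokens
restored = anonymizer.sweep_inbound(masked) # equal to the original result
```

Because the rules live in the caller's Hash, the sketch mirrors the design goal stated above: the traversal and the store know nothing about which fields are sensitive.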
/gemini-review

          token
        end
      end
There's a subtle race condition here. Another thread could read the token from @forward after this line executes but before the @reverse map is updated on the next line. If that other thread's caller then immediately tries to resolve the token, it might get nil because the reverse mapping doesn't exist yet.
To ensure the store is always in a consistent state, you should populate the @reverse map before making the token visible in the @forward map. This guarantees that any token found in @forward is resolvable.
    @reverse[token] = value
    @forward[key] = token
      end
    ensure
      visited.delete(hash)
    end
The current traversal logic correctly handles cycles using a visited set that tracks the current recursion path. However, this approach may have performance implications for input structures that are Directed Acyclic Graphs (DAGs), where the same object (Hash or Array) is referenced from multiple places.
With the current implementation, each time a shared sub-structure is encountered, it will be fully traversed and a new corresponding structure will be built. For large and highly-shared structures, this could be inefficient.
A common pattern to optimize this is to use memoization. You could adapt the visited hash to cache the transformed result of each container. This would both improve performance on DAGs and potentially make the cycle-detection logic more explicit.
Here is an example of how walk_hash could be adapted:

    # Define a sentinel value at the module level
    VISITING = Object.new.freeze

    private

    def walk_hash(hash, memo, &)
      memo_val = memo[hash]
      return hash if memo_val.equal?(VISITING) # Cycle detected
      return memo_val if memo_val              # Return memoized result

      memo[hash] = VISITING # Mark as currently being processed
      new_hash = hash.each_with_object({}) do |(key, value), out|
        out[key] = walk(value, key, memo, &)
      end
      memo[hash] = new_hash # Cache the final result (also the return value)
    end

A similar change would apply to walk_array. This would avoid re-processing shared nodes while still correctly handling cycles.
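
To see how the suggestion might look once walk_array is included, here is a self-contained sketch. MemoSweeper and its method names are hypothetical and not the library's TokenSweeper. One detail is an assumption of mine, not part of the review: an Array's leaves are yielded with the key of the enclosing Hash, so the sketch memoizes Arrays per (array, parent key) pair, not by identity alone. Otherwise an Array shared under two different keys would reuse a result computed for the wrong key.

```ruby
# Hypothetical module; a sketch of memoized traversal, not the library's code.
module MemoSweeper
  VISITING = Object.new.freeze # marks a container that is still being walked

  module_function

  # Returns a transformed copy of +node+. The block receives each String leaf
  # and the key of its nearest enclosing Hash entry.
  def sweep(node, &block)
    walk(node, nil, {}, &block)
  end

  def walk(node, parent_key, memo, &block)
    case node
    when Hash
      # A Hash supplies the keys for its own children, so its result does not
      # depend on the outer parent key.
      memoize(memo, [node.object_id, nil], node) do
        node.to_h { |key, value| [key, walk(value, key, memo, &block)] }
      end
    when Array
      # An Array's leaves inherit the outer key, so the key is part of the memo key.
      memoize(memo, [node.object_id, parent_key], node) do
        node.map { |item| walk(item, parent_key, memo, &block) }
      end
    when String then block.call(node, parent_key)
    else node
    end
  end

  def memoize(memo, memo_key, original)
    cached = memo[memo_key]
    return original if cached.equal?(VISITING) # cycle: keep the original reference
    return cached if cached                    # shared node: reuse the earlier result

    memo[memo_key] = VISITING
    memo[memo_key] = yield
  end
end

shared = { "email" => "ada@example.com" }
calls = 0
output = MemoSweeper.sweep({ "primary" => shared, "backup" => shared }) do |value, _key|
  calls += 1
  value.upcase
end
# The shared Hash is walked once, and both keys in output point at the same new Hash.
```

One trade-off worth weighing before adopting this: memoization makes shared input nodes shared in the output as well, whereas the path-based visited set produces an independent copy for each reference.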
The original tokenize implementation inserted into @forward before @reverse:

    @forward[key] = token
    @reverse[token] = value

Because @forward and @reverse are independent Concurrent::Hash instances, a second thread could observe the token in @forward (via its own mutex-protected tokenize or via the fast path that reads @forward before entering the mutex) and then call resolve(token) before the originating thread has populated @reverse, yielding nil for a token that should be resolvable.

Swap the order so @reverse is written first. Any thread that sees the token in @forward is then guaranteed to find it in @reverse. Add a regression spec that exercises the invariant under concurrent reader/writer load.

Addresses review feedback on PR #16.
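
The ordering and the reader/writer check described above can be sketched roughly as follows. OrderedStore and unresolved_tokens are hypothetical names, not the PR's TokenStore or its spec. The sketch uses plain Hashes to stay dependency-free where the PR uses Concurrent::Hash, so the lock-free fast-path read assumes CRuby's global VM lock.

```ruby
require "securerandom"

# Hypothetical stand-in for the store; see the assumptions in the lead-in.
class OrderedStore
  def initialize
    @forward = {} # [value, prefix] => token
    @reverse = {} # token => value
    @mutex = Mutex.new
  end

  def tokenize(value, prefix:)
    key = [value, prefix]
    existing = @forward[key] # fast path: read without taking the mutex
    return existing if existing

    @mutex.synchronize do
      existing = @forward[key] # re-check: another thread may have won the race
      next existing if existing

      token = "#{prefix}_#{SecureRandom.hex(4).upcase}"
      @reverse[token] = value # 1. make the token resolvable
      @forward[key] = token   # 2. only then make it discoverable
      token
    end
  end

  def resolve(token)
    @reverse[token]
  end
end

# Tokenizes the same values from several threads and collects every token
# that a thread obtained but could not resolve immediately afterwards.
def unresolved_tokens(store, values, thread_count: 8)
  failures = Queue.new
  threads = Array.new(thread_count) do
    Thread.new do
      values.shuffle.each do |value|
        token = store.tokenize(value, prefix: "VAL")
        failures << token if store.resolve(token).nil?
      end
    end
  end
  threads.each(&:join)
  Array.new(failures.size) { failures.pop }
end

values = Array.new(500) { |i| "value-#{i}" }
failures = unresolved_tokens(OrderedStore.new, values)
# With @reverse written first, every token a thread can see is already resolvable.
```

A check like this can only catch the old ordering probabilistically, since the window between the two writes is small; a pass against the buggy version would not prove the invariant holds.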