Add token-based field anonymization middleware by sergiobayona · Pull Request #16 · sergiobayona/vector_mcp

sergiobayona · 2026-04-13T18:44:50Z

Introduces a general-purpose anonymization pipeline that substitutes selected string fields in outbound tool results with stable opaque tokens before they reach the LLM, and restores the originals on inbound tool invocations. All domain knowledge (which keys to match, token prefixes, atomic-blob handling) is supplied by the application at wiring time, so the library itself stays free of PII, CRM, or field-naming opinions.

Three components, each in its own file:

VectorMCP::TokenStore — thread-safe bidirectional value <-> token store backed by Concurrent::Hash. Tokenization is idempotent for the same (value, prefix) pair. Tokens have the shape "PREFIX_XXXXXXXX" (8 uppercase hex chars). token? is a pure pattern check and does not consult the store, so middleware can detect tokens without holding a store reference. resolve returns nil for unknown tokens rather than raising.
VectorMCP::Util::TokenSweeper — stateless recursive traversal utility for parsed JSON-like structures. Yields each String leaf together with its parent Hash key (propagated across enclosing Arrays), returns a new structure without mutating the input, and defends against circular references via an identity-compared visited set.
VectorMCP::Middleware::Anonymizer — wires the store and sweeper together with application-supplied field rules and an optional atomic_keys regexp. sweep_outbound tokenizes matched string fields and (if atomic_keys is set) collapses Hash nodes under matching parent keys into a single canonical-JSON token. sweep_inbound reverses the mapping; unknown token-shaped strings pass through unchanged. before_tool_call and after_tool_call implement the inbound and outbound sweeps against context.params["arguments"] and context.result respectively. install_on(server, priority:) registers the configured instance via a generated Base-subclass adapter, working around the middleware manager's argumentless instantiation.
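
To make the sweep described above concrete, here is a minimal, self-contained sketch of the traversal idea. It is not the VectorMCP::Util::TokenSweeper or Anonymizer source: the `sweep` helper, the `rules` table, and the plain-Hash token maps are all hypothetical stand-ins chosen for illustration. It shows String leaves being yielded with the nearest enclosing Hash key (carried through Arrays), a new structure being returned, and a path-scoped identity set guarding against cycles.

```ruby
require "securerandom"

# Hypothetical helper (not the library's API). Walks Hash/Array structures,
# yields every String leaf with the key of its nearest enclosing Hash, and
# builds a new structure from the block's return values.
def sweep(node, parent_key = nil, visited = {}.compare_by_identity, &block)
  case node
  when Hash
    return node if visited.key?(node) # cycle on the current path: leave as-is

    visited[node] = true
    begin
      node.each_with_object({}) { |(k, v), out| out[k] = sweep(v, k, visited, &block) }
    ensure
      visited.delete(node)
    end
  when Array
    return node if visited.key?(node)

    visited[node] = true
    begin
      # Arrays pass the parent key through to their elements.
      node.map { |v| sweep(v, parent_key, visited, &block) }
    ensure
      visited.delete(node)
    end
  when String
    block.call(node, parent_key)
  else
    node
  end
end

# Hypothetical field rules: Hash key => token prefix.
rules = { "name" => "NAME", "email" => "EMAIL" }
forward = {} # [value, prefix] => token
reverse = {} # token => value

tokenize = lambda do |value, prefix|
  forward[[value, prefix]] ||= begin
    token = "#{prefix}_#{SecureRandom.hex(4).upcase}" # 8 uppercase hex chars
    reverse[token] = value
    token
  end
end

result = {
  "name" => "Ada",
  "email" => ["ada@example.com", "countess@example.com"],
  "note" => "hello",
  "age" => 36
}

# Outbound: tokenize only the fields named in the rules.
outbound = sweep(result) { |str, key| rules.key?(key) ? tokenize.call(str, rules[key]) : str }

# Inbound: swap known tokens back; anything unknown passes through unchanged.
restored = sweep(outbound) { |str, _key| reverse.fetch(str, str) }
```

In this sketch `restored` equals `result`, while `outbound` carries tokens in place of the `name` and `email` values and leaves `note` and `age` untouched.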

Spec coverage: 49 new examples covering tokenization idempotency, thread safety under 100 concurrent tokenize calls, sweeper traversal and cycle handling, outbound/inbound round-tripping, atomic-node collapse, and the tool-call middleware hooks. Full suite: 1410 examples passing; rubocop clean on all new files.
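
The atomic-node collapse mentioned above can be pictured with a small sketch. The PR text does not spell out how "canonical JSON" is produced, so the version below assumes it means serializing with keys sorted recursively; `canonical_json`, `collapse_atomic`, the `BLOB` prefix, and the `address` rule are all hypothetical names used only for illustration.

```ruby
require "json"
require "securerandom"

# Assumption: "canonical" means keys are sorted recursively before serializing,
# so two Hashes with the same content but different key order serialize alike.
def canonical_json(node)
  sorted = lambda do |n|
    case n
    when Hash then n.sort_by { |k, _| k.to_s }.to_h { |k, v| [k.to_s, sorted.call(v)] }
    when Array then n.map { |v| sorted.call(v) }
    else n
    end
  end
  JSON.generate(sorted.call(node))
end

# Replaces any Hash found under a parent key matching atomic_keys with a single
# token that stands for the Hash's canonical JSON.
def collapse_atomic(node, atomic_keys, forward, reverse, parent_key = nil)
  case node
  when Hash
    if parent_key && atomic_keys.match?(parent_key.to_s)
      json = canonical_json(node)
      forward[json] ||= begin
        token = "BLOB_#{SecureRandom.hex(4).upcase}"
        reverse[token] = json
        token
      end
    else
      node.to_h { |k, v| [k, collapse_atomic(v, atomic_keys, forward, reverse, k)] }
    end
  when Array
    node.map { |v| collapse_atomic(v, atomic_keys, forward, reverse, parent_key) }
  else
    node
  end
end

forward = {} # canonical JSON => token
reverse = {} # token => canonical JSON
atomic_keys = /\Aaddress\z/ # hypothetical rule

a = collapse_atomic({ "id" => 1, "address" => { "city" => "Lima", "zip" => "15001" } },
                    atomic_keys, forward, reverse)
b = collapse_atomic({ "id" => 2, "address" => { "zip" => "15001", "city" => "Lima" } },
                    atomic_keys, forward, reverse)
```

Under that assumption both `address` blobs collapse to the same token despite their different key order, and `JSON.parse(reverse[a["address"]])` recovers the original Hash content.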

sergiobayona · 2026-04-13T18:45:09Z

/gemini-review

github-actions

🤖 Gemini AI Code Review

Found 2 suggestions for improvement:

💡 Medium: 2

This review was automatically generated by Gemini AI. Please review the suggestions carefully.

github-actions · 2026-04-13T18:49:21Z

+        token
+      end
+    end
+


There's a subtle race condition here. Another thread could read the token from @forward after this line executes but before the @reverse map is updated on the next line. If that other thread's caller then immediately tries to resolve the token, it might get nil because the reverse mapping doesn't exist yet.

To ensure the store is always in a consistent state, you should populate the @reverse map before making the token visible in the @forward map. This guarantees that any token found in @forward is resolvable.

Suggested change

@reverse[token] = value

@forward[key] = token
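
The ordering argument can be exercised with a small stand-alone sketch. This is not the VectorMCP::TokenStore source: `SketchTokenStore` is a hypothetical class, and it uses plain Hashes (relying on MRI's global VM lock for atomic Hash operations) where the PR uses Concurrent::Hash. The point it illustrates is only the write order: the reverse entry is stored before the forward entry becomes visible to the lock-free fast path.

```ruby
require "securerandom"

# Hypothetical stand-in for a bidirectional token store.
class SketchTokenStore
  def initialize
    @forward = {} # [value, prefix] => token
    @reverse = {} # token => value
    @mutex = Mutex.new
  end

  def tokenize(value, prefix:)
    key = [value, prefix]
    existing = @forward[key] # fast path: read without taking the mutex
    return existing if existing

    @mutex.synchronize do
      @forward[key] || begin
        token = "#{prefix}_#{SecureRandom.hex(4).upcase}"
        @reverse[token] = value # publish the reverse mapping first
        @forward[key] = token   # only now can other threads discover the token
        token
      end
    end
  end

  # Returns nil for unknown tokens rather than raising.
  def resolve(token)
    @reverse[token]
  end
end

# Concurrent readers and writers: every token handed out must resolve.
store = SketchTokenStore.new
misses = Queue.new
threads = Array.new(8) do
  Thread.new do
    200.times do |i|
      token = store.tokenize("value-#{i}", prefix: "VAL")
      misses << token if store.resolve(token).nil?
    end
  end
end
threads.each(&:join)
```

With the writes in this order, `misses` should stay empty, because a token is never observable in the forward map before its reverse entry exists.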

github-actions · 2026-04-13T18:49:21Z

+            end
+          ensure
+            visited.delete(hash)
+          end


The current traversal logic correctly handles cycles using a visited set that tracks the current recursion path. However, this approach may have performance implications for input structures that are Directed Acyclic Graphs (DAGs), where the same object (Hash or Array) is referenced from multiple places.

With the current implementation, each time a shared sub-structure is encountered, it will be fully traversed and a new corresponding structure will be built. For large and highly-shared structures, this could be inefficient.

A common pattern to optimize this is to use memoization. You could adapt the visited hash to cache the transformed result of each container. This would both improve performance on DAGs and potentially make the cycle-detection logic more explicit.

Here is an example of how walk_hash could be adapted:

# Define a sentinel value at the module level
VISITING = Object.new.freeze

private

def walk_hash(hash, memo, &)
  memo_val = memo[hash]
  return hash if memo_val.equal?(VISITING) # Cycle detected
  return memo_val if memo_val # Return memoized result

  memo[hash] = VISITING # Mark as currently being processed

  new_hash = hash.each_with_object({}) do |(key, value), out|
    out[key] = walk(value, key, memo, &)
  end

  memo[hash] = new_hash # Cache the final result
end

A similar change would apply to walk_array. This would avoid re-processing shared nodes while still correctly handling cycles.
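
For readers who want to try the memoization idea end to end, here is a runnable sketch that covers both Hashes and Arrays. `MemoSweep` and its method names are hypothetical and are not part of the PR; the memo table is assumed to be an identity-compared Hash, matching the identity-compared visited set the PR describes.

```ruby
# Hypothetical module illustrating memoized traversal; not the PR's code.
module MemoSweep
  VISITING = Object.new.freeze # marks a container that is still being walked

  def self.sweep(node, &block)
    walk(node, nil, {}.compare_by_identity, &block)
  end

  def self.walk(node, parent_key, memo, &block)
    case node
    when Hash
      memoized(node, memo) do
        node.each_with_object({}) { |(k, v), out| out[k] = walk(v, k, memo, &block) }
      end
    when Array
      memoized(node, memo) do
        node.map { |v| walk(v, parent_key, memo, &block) }
      end
    when String
      block.call(node, parent_key)
    else
      node
    end
  end

  def self.memoized(container, memo)
    cached = memo[container]
    return container if cached.equal?(VISITING) # cycle: hand back the original
    return cached if cached                     # shared node: reuse the result

    memo[container] = VISITING
    memo[container] = yield
  end
end

shared = { "email" => "a@example.com" }
input = { "primary" => shared, "backup" => shared }

calls = 0
out = MemoSweep.sweep(input) do |str, _key|
  calls += 1
  str.upcase
end
```

Here the block runs once for the shared Hash, and `out["primary"]` and `out["backup"]` are the same object. Two trade-offs are worth weighing before adopting this: the output now shares structure wherever the input did, and an Array reachable under two different parent keys would reuse the result computed for the first key, so a real implementation would likely need to include the parent key in the memo key for Arrays.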

@reverse

The original tokenize implementation inserted into @forward before @reverse:

@forward[key] = token
@reverse[token] = value

Because @forward and @reverse are independent Concurrent::Hash instances, a second thread could observe the token in @forward (via its own mutex-protected tokenize or via the fast path that reads @forward before entering the mutex) and then call resolve(token) before the originating thread has populated @reverse, yielding nil for a token that should be resolvable.

Swap the order so @reverse is written first. Any thread that sees the token in @forward is then guaranteed to find it in @reverse.

Add a regression spec that exercises the invariant under concurrent reader/writer load.

Addresses review feedback on PR #16.

rubocop fix

2db0753

github-actions Bot reviewed Apr 13, 2026

sergiobayona merged commit 626115e into main Apr 13, 2026
8 checks passed

sergiobayona merged 3 commits into main from anonymizer
