Graph Sandbox — Identity Resolution

Learning project. Two synthetic identity datasets (terrestrial and digital) are loaded into Neo4j. Resolution works by traversing shared attribute nodes in the graph: a Phone node shared by two records is evidence that they might be the same person.

Quick Start

docker compose up -d                    # Neo4j on localhost:7687, browser on localhost:7474
uv sync --all-groups --all-extras       # install deps including notebook extras

uv run python -m idres generate         # writes data/terrestrial.csv, data/digital.csv, data/ground_truth.json
uv run python -m idres load --drop      # loads into Neo4j (--drop clears first)
uv run python -m idres resolve          # writes RESOLVED_TO edges + PersonCluster nodes
uv run python -m idres evaluate         # prints P/R/F1 across 5 thresholds

Neo4j credentials: neo4j / idres-sandbox

CLI Reference

generate  --persons N    number of base persons (default 2000)
          --seed N       RNG seed (default 42)
          --output-dir   where to write CSVs (default data/)

load      --drop         delete all graph data before loading
          --data-dir     where to read CSVs from (default data/)
          --neo4j-uri    override bolt URI

resolve   --threshold    match score cutoff (default 0.55)

evaluate  --ground-truth   path to ground_truth.json (default data/ground_truth.json)

How Resolution Works

Generate: 2000 canonical BasePerson records → terrestrial dataset (~2100 rows, 5% duplicates) + digital dataset (~2800 rows, 1-3 records per person). Injected noise: nickname swaps, transposed phone digits, address abbreviation variance, and occasional DOBs with the year off.
Load: Each record is MERGEd into the graph. Shared attributes (Phone, Address, PersonName) become shared nodes — if two records carry the same phone number they both point to the same Phone node.
Resolve: For each TerrestrialRecord, one Cypher traversal finds all DigitalRecord nodes reachable through shared attribute nodes. Shared attribute counts feed a weighted score (name 0.30, phone 0.25, DOB 0.20, address 0.25). Pairs above threshold get a RESOLVED_TO edge.
Cluster: RESOLVED_TO edges are pulled into NetworkX; each weakly connected component becomes a PersonCluster node.
Evaluate: Pairwise precision/recall/F1 at thresholds [0.45, 0.50, 0.55, 0.60, 0.65].
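
The Cluster step can be tried without a database. The project uses NetworkX for this; the sketch below swaps in a small union-find so it runs on the standard library alone. It should group records the same way `networkx.weakly_connected_components` would, since weak connectivity ignores edge direction. The function name `cluster_pairs` and the sample IDs are hypothetical, not taken from the codebase.

```python
from collections import defaultdict


def cluster_pairs(pairs):
    """Group record IDs into weakly connected components.

    `pairs` is an iterable of (terrestrial_id, digital_id) tuples standing in
    for RESOLVED_TO edges. Edge direction is ignored.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    # Sort members and clusters so the output is stable across runs.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Hypothetical edges: T1 and T2 both resolve to D2, so they land in one cluster.
edges = [("T1", "D1"), ("T1", "D2"), ("T2", "D2"), ("T3", "D9")]
clusters = cluster_pairs(edges)
```

One consequence worth noticing: a single weak RESOLVED_TO edge is enough to merge two otherwise separate groups, so a lower threshold can grow clusters quickly.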

Neo4j Browser Queries

// See shared phone connections
MATCH (t:TerrestrialRecord)-[:HAS_PHONE]->(p:Phone)<-[:HAS_PHONE]-(d:DigitalRecord)
RETURN t, p, d LIMIT 25

// See resolved clusters
MATCH (r)-[:BELONGS_TO]->(c:PersonCluster)
RETURN c, collect(r) LIMIT 10

// Count resolved pairs
MATCH ()-[:RESOLVED_TO]->() RETURN count(*) AS pairs

What the Score Means

A score of 1.0 means all four attribute types matched (shared name + phone + DOB + address node). In practice, 0.55 sits near the precision/recall crossover; use the threshold sweep output to tune it.
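
A minimal sketch of how such a score could be computed, using the weights listed for the resolve step. How the shared attribute counts are actually combined is not stated here, so the sketch assumes the simplest rule: an attribute type contributes its full weight once at least one node of that type is shared. The names `match_score` and `is_match` are hypothetical.

```python
# Weights from the resolve step. The combination rule below is an assumption.
WEIGHTS = {"name": 0.30, "phone": 0.25, "dob": 0.20, "address": 0.25}


def match_score(shared_counts, weights=WEIGHTS):
    """Sum the weight of every attribute type with at least one shared node."""
    total = sum(w for attr, w in weights.items() if shared_counts.get(attr, 0) > 0)
    # Round to two decimals so float noise cannot flip a borderline comparison.
    return round(total, 2)


def is_match(shared_counts, threshold=0.55):
    # "Above threshold" is read here as >=; the real comparison may be strict.
    return match_score(shared_counts) >= threshold


name_and_phone = match_score({"name": 1, "phone": 1})
phone_only = match_score({"phone": 2})
all_four = match_score({"name": 1, "phone": 1, "dob": 1, "address": 1})
```

Under this assumption, name plus phone lands exactly on the default threshold, while phone plus DOB (0.45) falls below it, which illustrates why small threshold changes move precision and recall noticeably.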

Ground Truth

data/ground_truth.json is the answer key. It's built at generation time from the canonical BasePerson pool, before any noise is applied, so it reflects the actual relationships that the resolver is trying to discover.

Structure:

{
  "P000042": {
    "terrestrial_ids": ["T0000045"],
    "digital_ids": ["D0000078", "D0000079"]
  }
}

Each key is a person_id. The lists contain every terrestrial and digital record that was derived from that person. The evaluator uses this to compute pairwise TP/FP/FN: two records in the same person group that end up in the same PersonCluster are a true positive; two records in the same cluster but different person groups are a false positive; by the same pairwise logic, two records in the same person group that do not share a cluster are a false negative.
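
The pairwise counting can be sketched as follows. This is an illustration under assumptions, not the evaluator's code: it treats every unordered pair of records within a group as a pair (including terrestrial-terrestrial and digital-digital pairs), the function names are hypothetical, and the second person `P000043` with its record IDs is made-up sample data.

```python
from itertools import combinations


def co_member_pairs(groups):
    """Return every unordered pair of IDs that share a group."""
    pairs = set()
    for members in groups:
        pairs.update(frozenset(p) for p in combinations(sorted(set(members)), 2))
    return pairs


def pairwise_prf(ground_truth, clusters):
    """Compare predicted clusters with ground-truth person groups, pair by pair."""
    truth_groups = [
        entry["terrestrial_ids"] + entry["digital_ids"]
        for entry in ground_truth.values()
    ]
    truth = co_member_pairs(truth_groups)
    predicted = co_member_pairs(clusters)

    tp = len(truth & predicted)   # same person, same cluster
    fp = len(predicted - truth)   # same cluster, different persons
    fn = len(truth - predicted)   # same person, not clustered together
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# P000043 and its record IDs are hypothetical sample data.
ground_truth = {
    "P000042": {"terrestrial_ids": ["T0000045"], "digital_ids": ["D0000078", "D0000079"]},
    "P000043": {"terrestrial_ids": ["T0000046"], "digital_ids": ["D0000080"]},
}
clusters = [["T0000045", "D0000078"], ["T0000046", "D0000080", "D0000079"]]
metrics = pairwise_prf(ground_truth, clusters)
```

In this sample, misplacing the single record D0000079 costs two false negatives and two false positives, because pairwise metrics penalize every pair the record participates in.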

Ground truth is never loaded into Neo4j — it stays on disk and is read only by evaluate.

Data Lives in `data/` (gitignored)

Re-run generate at any time; it's deterministic given the same seed. load --drop wipes and reloads.
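
Seeded determinism of this kind can be illustrated with one of the noise types, transposed phone digits. The sketch below is an assumption about the general approach, not the generator's code; `transpose_digits` and `noisy_phones` are hypothetical names, and the phone numbers are fictional.

```python
import random


def transpose_digits(phone, rng):
    """Swap one randomly chosen pair of adjacent digits."""
    i = rng.randrange(len(phone) - 1)
    chars = list(phone)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def noisy_phones(phones, seed=42):
    # A private Random instance keeps results independent of global RNG state.
    rng = random.Random(seed)
    return [transpose_digits(p, rng) for p in phones]


phones = ["5550100123", "5550100456"]
first_run = noisy_phones(phones, seed=42)
second_run = noisy_phones(phones, seed=42)  # same seed, same output
```

Passing a dedicated `random.Random(seed)` around, rather than calling `random.seed` globally, is one way to keep output reproducible even when other code also draws random numbers.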
