jvspeed74/Graph-Sandbox

Graph Sandbox — Identity Resolution

Learning project. Two synthetic identity datasets (terrestrial and digital) are loaded into Neo4j. Resolution works by traversing shared attribute nodes in the graph: a Phone node shared by two records is evidence that they might be the same person.

Quick Start

docker compose up -d                    # Neo4j on localhost:7687, browser on localhost:7474
uv sync --all-groups --all-extras       # install deps including notebook extras

uv run python -m idres generate         # writes data/terrestrial.csv, data/digital.csv, data/ground_truth.json
uv run python -m idres load --drop      # loads into Neo4j (--drop clears first)
uv run python -m idres resolve          # writes RESOLVED_TO edges + PersonCluster nodes
uv run python -m idres evaluate         # prints P/R/F1 across 5 thresholds

Neo4j credentials: neo4j / idres-sandbox

CLI Reference

generate  --persons N    number of base persons (default 2000)
          --seed N       RNG seed (default 42)
          --output-dir   where to write CSVs (default data/)

load      --drop         delete all graph data before loading
          --data-dir     where to read CSVs from (default data/)
          --neo4j-uri    override bolt URI

resolve   --threshold    match score cutoff (default 0.55)

evaluate  --ground-truth path to ground_truth.json (default data/ground_truth.json)

How Resolution Works

  1. Generate: 2000 canonical BasePerson records → terrestrial dataset (~2100 rows, 5% duplicates) + digital dataset (~2800 rows, 1-3 per person). Noise injected: nickname swaps, transposed phone digits, address abbreviation variance, occasional year-off DOBs.

  2. Load: Each record is MERGEd into the graph. Shared attributes (Phone, Address, PersonName) become shared nodes — if two records carry the same phone number they both point to the same Phone node.

  3. Resolve: For each TerrestrialRecord, one Cypher traversal finds all DigitalRecord nodes reachable through shared attribute nodes. Shared attribute counts feed a weighted score (name 0.30, phone 0.25, DOB 0.20, address 0.25). Pairs above threshold get a RESOLVED_TO edge.

  4. Cluster: RESOLVED_TO edges are pulled into NetworkX; weakly connected components become PersonCluster nodes.

  5. Evaluate: Pairwise precision/recall/F1 at thresholds [0.45, 0.50, 0.55, 0.60, 0.65].
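
The clustering step (4) can be sketched without a database. The project itself is described as using NetworkX weakly connected components; the sketch below swaps in a small stdlib union-find instead, which should yield the same groups because weak connectivity ignores edge direction. The function name `cluster_records` and the record IDs are illustrative assumptions, not the project's code.

```python
from collections import defaultdict


def cluster_records(edges):
    """Group record IDs into clusters from (source, target) RESOLVED_TO pairs.

    Edge direction is ignored, which is what weak connectivity means.
    """
    parent = {}

    def find(x):
        # Walk up to the root, shortening the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    # Sort members and clusters so the output order is reproducible.
    return sorted(sorted(members) for members in groups.values())


# Hypothetical RESOLVED_TO edges (terrestrial -> digital).
edges = [("T1", "D1"), ("T1", "D2"), ("T2", "D2"), ("T3", "D9")]
clusters = cluster_records(edges)
print(clusters)  # [['D1', 'D2', 'T1', 'T2'], ['D9', 'T3']]
```

Note that T1 and T2 land in the same cluster only because both resolve to D2. A single wrong RESOLVED_TO edge can therefore merge two otherwise separate people, which is one reason the threshold matters.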

Neo4j Browser Queries

// See shared phone connections
MATCH (t:TerrestrialRecord)-[:HAS_PHONE]->(p:Phone)<-[:HAS_PHONE]-(d:DigitalRecord)
RETURN t, p, d LIMIT 25

// See resolved clusters
MATCH (r)-[:BELONGS_TO]->(c:PersonCluster)
RETURN c, collect(r) LIMIT 10

// Count resolved pairs
MATCH ()-[:RESOLVED_TO]->() RETURN count(*) AS pairs

What the Score Means

A score of 1.0 means all four attribute types matched (shared name + phone + DOB + address node). In practice 0.55 sits near the P/R crossover; use the threshold sweep printed by evaluate to tune it.
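
As a rough illustration, the score can be sketched using the weights listed under How Resolution Works. The rule assumed here is that an attribute type contributes its full weight when at least one node of that type is shared. The actual resolver may use the shared counts differently, and `match_score` is a hypothetical name.

```python
# Weights as listed in this README; they sum to 1.0.
WEIGHTS = {"name": 0.30, "phone": 0.25, "dob": 0.20, "address": 0.25}

THRESHOLD = 0.55  # default of `resolve --threshold`


def match_score(shared_counts):
    """Assumed rule: a type adds its full weight if any node of that type is shared."""
    total = sum(w for attr, w in WEIGHTS.items() if shared_counts.get(attr, 0) > 0)
    # Round to two decimals so float noise does not affect threshold comparisons.
    return round(total, 2)


all_four = match_score({"name": 1, "phone": 1, "dob": 1, "address": 1})
name_and_phone = match_score({"name": 1, "phone": 2})
phone_and_address = match_score({"phone": 1, "address": 1})

print(all_four, name_and_phone, phone_and_address)  # 1.0 0.55 0.5
```

Under this assumed rule, a shared name plus a shared phone lands exactly on the default threshold, while phone plus address falls just short. Whether the resolver's cutoff is inclusive (`>=`) or strict (`>`) is not stated here, so check the code before relying on that boundary case.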

Ground Truth

data/ground_truth.json is the answer key. It's built at generation time from the canonical BasePerson pool, before any noise is applied, so it reflects the actual relationships that the resolver is trying to discover.

Structure:

{
  "P000042": {
    "terrestrial_ids": ["T0000045"],
    "digital_ids": ["D0000078", "D0000079"]
  }
}

Each key is a person_id. The lists contain every terrestrial and digital record that was derived from that person. The evaluator uses this to compute pairwise TP/FP/FN: two records in the same person group that end up in the same PersonCluster are a true positive; two records in the same cluster but different person groups are a false positive; two records in the same person group that do not end up in the same PersonCluster are a false negative.
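
A minimal sketch of that pairwise computation, assuming every unordered pair of records within a group counts as one pair. The function names, the second person entry, and the cluster lists are made up for illustration; the real evaluator may differ in detail.

```python
from itertools import combinations


def pairs_within(groups):
    """Return every unordered pair of record IDs that share a group."""
    pairs = set()
    for members in groups:
        pairs.update(frozenset(p) for p in combinations(sorted(members), 2))
    return pairs


def pairwise_prf(ground_truth, clusters):
    """Compute pairwise TP/FP/FN and precision, recall, F1.

    ground_truth follows the ground_truth.json layout; clusters is a list of
    record-ID lists standing in for PersonCluster membership.
    """
    truth_groups = [g["terrestrial_ids"] + g["digital_ids"] for g in ground_truth.values()]
    true_pairs = pairs_within(truth_groups)
    predicted_pairs = pairs_within(clusters)

    tp = len(true_pairs & predicted_pairs)
    fp = len(predicted_pairs - true_pairs)
    fn = len(true_pairs - predicted_pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return tp, fp, fn, precision, recall, f1


ground_truth = {
    "P000042": {"terrestrial_ids": ["T0000045"], "digital_ids": ["D0000078", "D0000079"]},
    # Hypothetical second person, added only to make a false positive possible.
    "P000043": {"terrestrial_ids": ["T0000046"], "digital_ids": ["D0000080"]},
}
# Hypothetical resolver output: one correct pair and one wrong merge.
clusters = [["T0000045", "D0000078"], ["T0000046", "D0000079"]]

tp, fp, fn, precision, recall, f1 = pairwise_prf(ground_truth, clusters)
print(tp, fp, fn, precision, recall)  # 1 1 3 0.5 0.25
```

Counting pairs rather than clusters means a large wrong merge is penalized more heavily than a small one, since the number of pairs grows quadratically with cluster size.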

Ground truth is never loaded into Neo4j — it stays on disk and is read only by evaluate.

Data Lives in data/ (gitignored)

Re-run generate at any time; it's deterministic given the same seed. load --drop wipes and reloads.
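
One common way to get this kind of determinism is to route all randomness through a single seeded `random.Random` instance. The sketch below assumes that approach and imitates the "transposed phone digits" noise. `transpose_digits` and `noisy_phones` are hypothetical helpers, not the generator's actual functions.

```python
import random


def transpose_digits(phone, rng):
    """Swap one randomly chosen pair of adjacent characters in a phone string."""
    i = rng.randrange(len(phone) - 1)
    chars = list(phone)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def noisy_phones(phones, seed):
    # A private Random instance keeps the output independent of global RNG state.
    rng = random.Random(seed)
    return [transpose_digits(p, rng) for p in phones]


phones = ["5551234567", "5559876543"]
run_a = noisy_phones(phones, seed=42)
run_b = noisy_phones(phones, seed=42)
print(run_a == run_b)  # True
```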
