Learning project. Two synthetic identity datasets (terrestrial and digital) loaded into Neo4j. Resolution works by traversing shared attribute nodes in the graph — a Phone node shared by two records is the evidence they might be the same person.
docker compose up -d # Neo4j on localhost:7687, browser on localhost:7474
uv sync --all-groups --all-extras # install deps including notebook extras
uv run python -m idres generate # writes data/terrestrial.csv, data/digital.csv, data/ground_truth.json
uv run python -m idres load --drop # loads into Neo4j (--drop clears first)
uv run python -m idres resolve # writes RESOLVED_TO edges + PersonCluster nodes
uv run python -m idres evaluate # prints P/R/F1 across 5 thresholds

Neo4j credentials: neo4j / idres-sandbox
generate  --persons N      number of base persons (default 2000)
          --seed N         RNG seed (default 42)
          --output-dir     where to write CSVs (default data/)
load      --drop           delete all graph data before loading
          --data-dir       where to read CSVs from (default data/)
          --neo4j-uri      override bolt URI
resolve   --threshold      match score cutoff (default 0.55)
evaluate  --ground-truth   path to ground_truth.json (default data/ground_truth.json)
- Generate: 2000 canonical BasePerson records → terrestrial dataset (~2100 rows, 5% duplicates) + digital dataset (~2800 rows, 1-3 per person). Noise injected: nickname swaps, transposed phone digits, address abbreviation variance, occasional year-off DOBs.
- Load: Each record is MERGEd into the graph. Shared attributes (Phone, Address, PersonName) become shared nodes: if two records carry the same phone number, they both point to the same Phone node.
- Resolve: For each TerrestrialRecord, one Cypher traversal finds all DigitalRecord nodes reachable through shared attribute nodes. Shared attribute counts feed a weighted score (name 0.30, phone 0.25, DOB 0.20, address 0.25). Pairs above threshold get a RESOLVED_TO edge.
- Cluster: RESOLVED_TO edges are pulled into NetworkX; weakly connected components become PersonCluster nodes.
- Evaluate: Pairwise precision/recall/F1 at thresholds [0.45, 0.50, 0.55, 0.60, 0.65].
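The Cluster step above can be illustrated without a graph library. The project itself uses NetworkX for this; the sketch below swaps in a small union-find so it runs on the standard library alone, and it should give the same grouping as weakly connected components because edge direction is ignored in both. The function name `cluster_pairs` and the record IDs are hypothetical and not taken from the idres code.

```python
from collections import defaultdict


def cluster_pairs(edges):
    """Group record IDs into clusters, treating each edge as undirected.

    `edges` is an iterable of (terrestrial_id, digital_id) pairs, standing in
    for RESOLVED_TO edges. Union-find is a stand-in for NetworkX here.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    groups = defaultdict(set)
    for node in parent:
        groups[find(node)].add(node)
    return list(groups.values())


# Illustrative IDs only: T1 links to two digital records, T2 to one.
example_edges = [("T1", "D1"), ("T1", "D2"), ("T2", "D3")]
example_clusters = cluster_pairs(example_edges)
```

Records that never receive a RESOLVED_TO edge do not appear in any cluster in this sketch; whether the real pipeline creates singleton PersonCluster nodes for them is not stated here.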
// See shared phone connections
MATCH (t:TerrestrialRecord)-[:HAS_PHONE]->(p:Phone)<-[:HAS_PHONE]-(d:DigitalRecord)
RETURN t, p, d LIMIT 25
// See resolved clusters
MATCH (r)-[:BELONGS_TO]->(c:PersonCluster)
RETURN c, collect(r) LIMIT 10
// Count resolved pairs
MATCH ()-[:RESOLVED_TO]->() RETURN count(*) AS pairs

A score of 1.0 means all four attribute types matched (shared name + phone + DOB + address node). In practice 0.55 sits near the P/R crossover; use the threshold sweep output to tune.
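To make the score arithmetic concrete, here is a minimal sketch of the weighted sum using the weights listed for the Resolve step. It rests on two assumptions that are not confirmed by the project code: an attribute type contributes its full weight as soon as at least one node of that type is shared, and a score exactly equal to the cutoff counts as a match. The names `match_score` and `is_match` are hypothetical.

```python
# Weights from the Resolve step, stored in hundredths so that sums are exact
# integers and comparisons at the cutoff are not affected by float rounding.
WEIGHTS_PCT = {"name": 30, "phone": 25, "dob": 20, "address": 25}


def match_score(shared_counts: dict) -> float:
    """Return a score in [0, 1] from per-attribute shared-node counts.

    Assumption: any count > 0 earns the full weight for that attribute type.
    """
    total = sum(w for attr, w in WEIGHTS_PCT.items() if shared_counts.get(attr, 0) > 0)
    return total / 100


def is_match(shared_counts: dict, threshold: float = 0.55) -> bool:
    """Assumption: the cutoff is inclusive (score >= threshold)."""
    return match_score(shared_counts) >= threshold


# Sweep the same thresholds that evaluate reports, for a pair sharing name + phone.
pair = {"name": 1, "phone": 1}
for t in [0.45, 0.50, 0.55, 0.60, 0.65]:
    print(t, is_match(pair, t))
```

Under these assumptions the default cutoff of 0.55 is reached by a shared name plus either a shared phone or a shared address, while phone plus DOB (0.45) falls short.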
data/ground_truth.json is the answer key. It's built at generation time from the canonical BasePerson pool, before any noise is applied, so it reflects the actual relationships that the resolver is trying to discover.
Structure:
{
"P000042": {
"terrestrial_ids": ["T0000045"],
"digital_ids": ["D0000078", "D0000079"]
}
}

Each key is a person_id. The lists contain every terrestrial and digital record that was derived from that person. The evaluator uses this to compute pairwise TP/FP/FN: two records in the same person group that end up in the same PersonCluster are a true positive; two records in the same cluster but different person groups are a false positive.
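The pairwise counting described above can be sketched as set arithmetic over record pairs. This is an illustration under stated assumptions, not the evaluator's actual code: the function names are hypothetical, the second person and all cluster contents are made-up sample data, and a false negative is taken to be a pair from the same person group that does not share a cluster (the usual pairwise definition).

```python
from itertools import combinations


def pair_set(groups):
    """Return every unordered pair of record IDs that share a group."""
    pairs = set()
    for members in groups:
        pairs.update(frozenset(p) for p in combinations(sorted(members), 2))
    return pairs


def pairwise_prf(ground_truth, clusters):
    """Compute pairwise precision, recall and F1.

    `ground_truth` follows the ground_truth.json structure shown above;
    `clusters` is an iterable of collections of record IDs.
    """
    truth_groups = [g["terrestrial_ids"] + g["digital_ids"] for g in ground_truth.values()]
    truth = pair_set(truth_groups)
    predicted = pair_set(clusters)

    tp = len(truth & predicted)   # same person, same cluster
    fp = len(predicted - truth)   # same cluster, different persons
    fn = len(truth - predicted)   # same person, not clustered together (assumed)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Illustrative data: one digital record of P000042 lands in the wrong cluster.
sample_truth = {
    "P000042": {"terrestrial_ids": ["T0000045"], "digital_ids": ["D0000078", "D0000079"]},
    "P000043": {"terrestrial_ids": ["T0000046"], "digital_ids": ["D0000080"]},
}
sample_clusters = [
    {"T0000045", "D0000078"},
    {"T0000046", "D0000080", "D0000079"},
]
precision, recall, f1 = pairwise_prf(sample_truth, sample_clusters)
```

In this sample there are four true pairs and four predicted pairs, two of which overlap, so precision, recall and F1 all come out to 0.5.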
Ground truth is never loaded into Neo4j — it stays on disk and is read only by evaluate.
Re-run generate at any time; it's deterministic given the same seed. load --drop wipes and reloads.