This repository contains the code and data for the Data Harmonization Benchmark. The benchmark is a collection of datasets used to evaluate the performance of data harmonization methods, including schema matching and value mapping.
Note: the datasets can be downloaded by following the instructions in the next section.
|-- data_harmonization_benchmark
    |-- datasets                        # Put everything downloaded from the links below here
    |-- parse_valentine_benchmark.ipynb # Convert the Valentine data format to our format
    |-- matchers                        # Schema matching methods
        |-- Coma
        |-- ComaInst
        |-- DistributionBased
        |-- ISResMat                    # X. Du et al., "In Situ Neural Relational Schema Matcher" (DOI: 10.1109/ICDE60146.2024.00018)
        |-- JaccardDistance
        |-- Magneto                     # A method introduced by our team; source code: https://github.com/VIDA-NYU/data-integration-eval
        |-- SimilarityFlooding
        |-- Unicorn                     # Tu et al., "Unicorn: A Unified Multi-tasking Model for Supporting Matching Tasks in Data Integration"
    |-- utils
        |-- mrr.py                      # Mean reciprocal rank (MRR) metric; see the sketch after this tree
        |-- result_proc.py              # Process the results of schema matching methods
    |-- config.py                       # Configuration file covering source, target, and running configurations
    |-- matching.py                     # Wrapper for the different matchers
    |-- runbenchmark.py                 # Run benchmark tasks
    |-- slurm_run                       # SLURM scripts for running schema matching methods on a server
        |-- benchmark_batch.sh          # Run all schema matching methods
        |-- benchmark_scalabilty.sh     # Run scalability benchmarks on varying target sample sizes
        |-- setup_penv.sh               # Set up the Python environment with conda
        |-- slurm_job_cpu.SBATCH        # SLURM job script for CPU
        |-- slurm_job_gpu.SBATCH        # SLURM job script for GPU
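The headline metric in utils/mrr.py is mean reciprocal rank (MRR): each source attribute is scored by the reciprocal of the rank at which its correct match appears in the candidate list, and the scores are averaged. Below is a minimal sketch of the metric under assumed input shapes; it is an illustration, not the repository's implementation:

```python
def mean_reciprocal_rank(ranked_candidates, ground_truth):
    """Hypothetical sketch, not the repository's utils/mrr.py.

    ranked_candidates: maps each source column to a list of target
        columns ordered from most to least similar.
    ground_truth: maps each source column to its correct target column.
    """
    reciprocal_ranks = []
    for source_col, candidates in ranked_candidates.items():
        truth = ground_truth.get(source_col)
        if truth in candidates:
            # Rank is the 1-based position of the first correct match.
            reciprocal_ranks.append(1.0 / (candidates.index(truth) + 1))
        else:
            # No correct match among the candidates contributes 0.
            reciprocal_ranks.append(0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)
```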
The datasets used in this benchmark are available for download via the following links:
After downloading the datasets, unzip the archives into the datasets directory. The resulting directory structure should look like this:
|-- data_harmonization_benchmark
    |-- datasets
        |-- datasets
            |-- GDC
            |-- OpenData
            |-- TPC-DI
            |-- ...
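This step can also be scripted. The sketch below assumes the downloads are .zip archives placed directly under the datasets directory:

```python
import zipfile
from pathlib import Path

# Illustrative helper: unpack every downloaded .zip archive into the
# datasets directory, producing the layout shown above.
datasets_dir = Path("data_harmonization_benchmark/datasets")
for archive in datasets_dir.glob("*.zip"):
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(datasets_dir)
```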
Schema matching is the process of identifying correspondences between attributes from two database schemas. Typically, schema matching methods employ one or more functions to establish a similarity value between pairs of elements from the schemas, referred to as matching candidates. These functions, known as matchers, take two elements as input and estimate a similarity value between 0 and 1, where a higher value indicates greater similarity. Matchers can utilize a variety of strategies to estimate similarities, such as comparing schema element names, assessing their semantic similarity using a thesaurus, analyzing data types and cardinality, or even examining data values when available.
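As a concrete (toy) example of this contract, the sketch below scores two attribute names by token-level Jaccard similarity. It is far simpler than any method in this repository and is meant only to illustrate a matcher mapping a pair of elements to a similarity value in [0, 1]:

```python
def jaccard_name_matcher(source_name: str, target_name: str) -> float:
    """Toy name-based matcher: token-level Jaccard similarity in [0, 1]."""
    source_tokens = set(source_name.lower().replace("_", " ").split())
    target_tokens = set(target_name.lower().replace("_", " ").split())
    if not source_tokens or not target_tokens:
        return 0.0
    overlap = source_tokens & target_tokens
    union = source_tokens | target_tokens
    return len(overlap) / len(union)

# Example: jaccard_name_matcher("patient_age", "age_of_patient") == 2/3
```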
We support the following schema matching methods, all of which can be run locally or on a server with SLURM.
- Coma: https://github.com/delftdata/valentine/tree/master/valentine/algorithms/coma
- ComaInst: https://github.com/delftdata/valentine/tree/master/valentine/algorithms/coma
- DistributionBased: https://github.com/delftdata/valentine/tree/master/valentine/algorithms/distribution_based
- ISResMat: X. Du et al., "In Situ Neural Relational Schema Matcher" (DOI: 10.1109/ICDE60146.2024.00018)
- JaccardDistance: https://github.com/delftdata/valentine/tree/master/valentine/algorithms/jaccard_distance
- Magneto: https://github.com/VIDA-NYU/data-integration-eval
- SimilarityFlooding: https://github.com/delftdata/valentine/tree/master/valentine/algorithms/similarity_flooding
- Unicorn: https://github.com/ruc-datalab/Unicorn
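Several of the links above point to matcher implementations in the Valentine library, which can also be tried directly outside this benchmark. A minimal sketch, assuming the valentine package is installed; the file paths are illustrative and not part of this repository:

```python
import pandas as pd
from valentine import valentine_match
from valentine.algorithms import Coma

# Illustrative paths; substitute any pair of unzipped benchmark tables.
source_df = pd.read_csv("datasets/datasets/GDC/source.csv")
target_df = pd.read_csv("datasets/datasets/GDC/target.csv")

# valentine_match returns a mapping from (source column, target column)
# pairs to similarity scores in [0, 1].
matches = valentine_match(source_df, target_df, Coma())
print(matches)
```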