This repository contains the code and data for the Data Harmonization Benchmark. The benchmark is a collection of datasets used to evaluate the performance of data harmonization methods, including schema matching and value mapping.
Note: the datasets can be downloaded by following the instructions in the next section.
|-- data_harmonization_benchmark
    |-- datasets                        # Put everything downloaded from the links below here
    |-- parse_valentine_benchmark.ipynb # Convert the Valentine data format to our format
    |-- matchers                        # Schema matching methods
        |-- Coma
        |-- ComaInst
        |-- DistributionBased
        |-- ISResMat                    # X. Du et al., "In Situ Neural Relational Schema Matcher" (DOI: 10.1109/ICDE60146.2024.00018)
        |-- JaccardDistance
        |-- Magneto                     # A method introduced by our team; source code: https://github.com/VIDA-NYU/data-integration-eval
        |-- SimilarityFlooding
        |-- Unicorn                     # Tu et al., "Unicorn: A Unified Multi-tasking Model for Supporting Matching Tasks in Data Integration"
    |-- utils
        |-- mrr.py                      # Mean reciprocal rank (MRR) metric; see the sketch after this tree
        |-- result_proc.py              # Process the results of schema matching methods
    |-- config.py                       # Configuration file covering source, target, and running configurations
    |-- matching.py                     # Wrapper for the different matchers
    |-- runbenchmark.py                 # Run benchmark tasks
    |-- slurm_run                       # SLURM scripts for running schema matching methods on a server
        |-- benchmark_batch.sh          # Run all schema matching methods
        |-- benchmark_scalabilty.sh     # Run scalability benchmarks on varying target sample sizes
        |-- setup_penv.sh               # Set up the Python environment with conda
        |-- slurm_job_cpu.SBATCH        # SLURM job script for CPU
        |-- slurm_job_gpu.SBATCH        # SLURM job script for GPU
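The headline metric in utils/mrr.py is mean reciprocal rank (MRR): each source attribute is scored by the reciprocal of the rank at which its correct match appears in the candidate list, and the scores are averaged. Below is a minimal sketch of the metric under assumed input shapes; it is an illustration, not the repository's implementation:

```python
def mean_reciprocal_rank(ranked_candidates, ground_truth):
    """Hypothetical sketch, not the repository's utils/mrr.py.

    ranked_candidates: maps each source column to a list of target
        columns ordered from most to least similar.
    ground_truth: maps each source column to its correct target column.
    """
    reciprocal_ranks = []
    for source_col, candidates in ranked_candidates.items():
        truth = ground_truth.get(source_col)
        if truth in candidates:
            # Rank is the 1-based position of the first correct match.
            reciprocal_ranks.append(1.0 / (candidates.index(truth) + 1))
        else:
            # No correct match among the candidates contributes 0.
            reciprocal_ranks.append(0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)
```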
The datasets used in this benchmark are available for download via the following links:
After downloading the datasets, unzip the archives into the datasets directory. The resulting directory structure should look like this:
|-- data_harmonization_benchmark
    |-- datasets
        |-- datasets
            |-- GDC
            |-- OpenData
            |-- TPC-DI
            |-- ...
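This step can also be scripted. The sketch below assumes the downloads are .zip archives placed directly under the datasets directory:

```python
import zipfile
from pathlib import Path

# Illustrative helper: unpack every downloaded .zip archive into the
# datasets directory, producing the layout shown above.
datasets_dir = Path("data_harmonization_benchmark/datasets")
for archive in datasets_dir.glob("*.zip"):
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(datasets_dir)
```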
Schema matching is the process of identifying correspondences between attributes from two database schemas. Typically, schema matching methods employ one or more functions to establish a similarity value between pairs of elements from the schemas, referred to as matching candidates. These functions, known as matchers, take two elements as input and estimate a similarity value between 0 and 1, where a higher value indicates greater similarity. Matchers can utilize a variety of strategies to estimate similarities, such as comparing schema element names, assessing their semantic similarity using a thesaurus, analyzing data types and cardinality, or even examining data values when available.
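As a concrete (toy) example of this contract, the sketch below scores two attribute names by token-level Jaccard similarity. It is far simpler than any method in this repository and is meant only to illustrate a matcher mapping a pair of elements to a similarity value in [0, 1]:

```python
def jaccard_name_matcher(source_name: str, target_name: str) -> float:
    """Toy name-based matcher: token-level Jaccard similarity in [0, 1]."""
    source_tokens = set(source_name.lower().replace("_", " ").split())
    target_tokens = set(target_name.lower().replace("_", " ").split())
    if not source_tokens or not target_tokens:
        return 0.0
    overlap = source_tokens & target_tokens
    union = source_tokens | target_tokens
    return len(overlap) / len(union)

# Example: jaccard_name_matcher("patient_age", "age_of_patient") == 2/3
```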
We support the following schema matching methods, all of which can be run locally or on a server with SLURM.
- Coma: https://github.com/delftdata/valentine/tree/master/valentine/algorithms/coma
- ComaInst: https://github.com/delftdata/valentine/tree/master/valentine/algorithms/coma
- DistributionBased: https://github.com/delftdata/valentine/tree/master/valentine/algorithms/distribution_based
- ISResMat: X. Du et al., "In Situ Neural Relational Schema Matcher" (DOI: 10.1109/ICDE60146.2024.00018)
- JaccardDistance: https://github.com/delftdata/valentine/tree/master/valentine/algorithms/jaccard_distance
- Magneto: https://github.com/VIDA-NYU/data-integration-eval
- SimilarityFlooding: https://github.com/delftdata/valentine/tree/master/valentine/algorithms/similarity_flooding
- Unicorn: https://github.com/ruc-datalab/Unicorn
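Several of the links above point to matcher implementations in the Valentine library, which can also be tried directly outside this benchmark. A minimal sketch, assuming the valentine package is installed; the file paths are illustrative and not part of this repository:

```python
import pandas as pd
from valentine import valentine_match
from valentine.algorithms import Coma

# Illustrative paths; substitute any pair of unzipped benchmark tables.
source_df = pd.read_csv("datasets/datasets/GDC/source.csv")
target_df = pd.read_csv("datasets/datasets/GDC/target.csv")

# valentine_match returns a mapping from (source column, target column)
# pairs to similarity scores in [0, 1].
matches = valentine_match(source_df, target_df, Coma())
print(matches)
```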