A hypergraph-based tabular language model.
This repository contains the official implementation of the paper "HyTrel: Hypergraph-enhanced Tabular Data Representation Learning", together with code, data, and checkpoints.
We recommend using Python 3.9.
Here is an example of creating the environment with Anaconda.
- Create the virtual environment using conda create -n hytrel python=3.9
- Install the required packages with the corresponding versions from requirements.txt
Note: If you have difficulty installing torch_geometric, refer to its installation instructions and install it according to your environment settings.
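After installation, a quick sanity check can confirm that the interpreter and key packages match the setup described above. The following is only a minimal sketch: the helper name check_environment is hypothetical and not part of this repository, and the default package list (torch, torch_geometric) is an assumption about what requirements.txt installs.

```python
import importlib.util
import sys


def check_environment(required=("torch", "torch_geometric")):
    """Return a list of human-readable problems; an empty list means all checks passed.

    `required` holds top-level package names; the defaults are an assumption,
    not a copy of requirements.txt.
    """
    problems = []
    # The README recommends Python 3.9; other versions are reported, not rejected.
    if sys.version_info[:2] != (3, 9):
        problems.append(
            f"Python 3.9 is recommended, found "
            f"{sys.version_info.major}.{sys.version_info.minor}"
        )
    for name in required:
        # find_spec returns None when a top-level package cannot be located.
        if importlib.util.find_spec(name) is None:
            problems.append(f"package not importable: {name}")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```

Run it inside the activated hytrel environment. If torch_geometric is reported as not importable, the note above applies.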
- Pre-process the raw data by slicing the big file into chunks, and put the *.jsonl files into the directory /data/pretrain/chunks/. Sample data is included, and its files can be used as a reference.
  Note: The pretraining data *.jsonl files are acquired and preprocessed using the scripts from TaBERT.
- Run python parallel_clean.py to clean and serialize the tables.
  Note: We serialize the tables as Arrow to reduce memory usage.
- Run sh pretrain_electra.sh to pretrain HyTrel with the ELECTRA objective.
- Run sh pretrain_contrast.sh to pretrain HyTrel with the contrastive objective.
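The first step above, slicing one big file into chunks, could be sketched as follows. This is not the TaBERT preprocessing script or anything shipped with this repository. The function name split_jsonl, the chunk size, and the chunk_00000.jsonl naming scheme are all assumptions made for illustration.

```python
from pathlib import Path


def split_jsonl(src, out_dir, lines_per_chunk=10000):
    """Split one large .jsonl file into numbered chunk files.

    Returns the list of chunk paths in the order they were written.
    The file naming scheme is hypothetical.
    """
    if lines_per_chunk <= 0:
        raise ValueError("lines_per_chunk must be positive")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    chunk = None
    try:
        with open(src, "r", encoding="utf-8") as fin:
            for i, line in enumerate(fin):
                # Start a new chunk file every `lines_per_chunk` lines.
                if i % lines_per_chunk == 0:
                    if chunk is not None:
                        chunk.close()
                    path = out_dir / f"chunk_{len(written):05d}.jsonl"
                    chunk = open(path, "w", encoding="utf-8")
                    written.append(path)
                chunk.write(line)
    finally:
        # Close the last open chunk even if reading fails midway.
        if chunk is not None:
            chunk.close()
    return written
```

The file is streamed line by line, so memory use stays flat regardless of input size. Each JSON record stays intact as long as the input holds one record per line.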
First, put the ELECTRA-pretrained checkpoint in /checkpoints/electra/ and the contrast-pretrained checkpoint in /checkpoints/contrast/.
- Put the data files {train, dev, test}.table_col_type.json and type_vocab.txt into the directory /data/col_ann/.
- Run sh evaluate_cta_electra.sh with the ELECTRA-pretrained checkpoint.
- Run sh evaluate_cta_contrast.sh with the contrast-pretrained checkpoint.
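Before launching an evaluation script, it can help to confirm that the expected files are in place. The sketch below expands the {train, dev, test} pattern from the steps above into concrete file names. The helper missing_files is hypothetical, and treating the data directory as relative to the repository root is an assumption.

```python
from pathlib import Path

# File names expanded from {train, dev, test}.table_col_type.json plus the vocabulary file.
CTA_FILES = [
    "train.table_col_type.json",
    "dev.table_col_type.json",
    "test.table_col_type.json",
    "type_vocab.txt",
]


def missing_files(data_dir, expected):
    """Return the names from `expected` that are not regular files in `data_dir`."""
    data_dir = Path(data_dir)
    return [name for name in expected if not (data_dir / name).is_file()]


if __name__ == "__main__":
    # Assumes the script is run from the repository root.
    for name in missing_files("data/col_ann", CTA_FILES):
        print(f"missing: {name}")
```

The same helper could be reused for other data directories by passing a different list of file names.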
- Put the data files {train, dev, test}.table_rel_extraction.json and relation_vocab.txt into the directory /data/col_rel/.
- Run sh evaluate_cpa_electra.sh with the ELECTRA-pretrained checkpoint.
- Run sh evaluate_cpa_contrast.sh with the contrast-pretrained checkpoint.
- Extract ttd.tar.gz into train, dev, and test data folders under the directory /data/ttd/.
- Run sh evaluate_ttd_electra.sh with the ELECTRA-pretrained checkpoint.
- Run sh evaluate_ttd_contrast.sh with the contrast-pretrained checkpoint.
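Unpacking the archive in the first step above can be done with any tar tool. As one option, here is a minimal sketch using Python's standard tarfile module. The function name extract_archive is hypothetical, and the sketch assumes the archive holds the train, dev, and test folders at its top level, which has not been verified against the actual file.

```python
import tarfile
from pathlib import Path


def extract_archive(archive, dest):
    """Extract a .tar.gz archive into `dest` and return the sorted top-level entry names.

    Only use this on archives from a trusted source, since extraction
    writes whatever paths the archive contains.
    """
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
    return sorted(p.name for p in dest.iterdir())


if __name__ == "__main__":
    # Assumes the script is run from the repository root with ttd.tar.gz present there.
    print(extract_archive("ttd.tar.gz", "data/ttd"))
```

If the printed names are not the train, dev, and test folders, the archive layout differs from the assumption and the folders may need to be moved by hand.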
Please cite our paper.
@inproceedings{NEURIPS2023_66178bea,
author = {Chen, Pei and Sarkar, Soumajyoti and Lausen, Leonard and Srinivasan, Balasubramaniam and Zha, Sheng and Huang, Ruihong and Karypis, George},
booktitle = {Advances in Neural Information Processing Systems},
editor = {A. Oh and T. Neumann and A. Globerson and K. Saenko and M. Hardt and S. Levine},
pages = {32173--32193},
publisher = {Curran Associates, Inc.},
title = {HyTrel: Hypergraph-enhanced Tabular Data Representation Learning},
url = {https://proceedings.neurips.cc/paper_files/paper/2023/file/66178beae8f12fcd48699de95acc1152-Paper-Conference.pdf},
volume = {36},
year = {2023}
}
The data and model checkpoints can be found in the checkpoints folder.
If you have more questions, please email: [email protected] (Pei Chen)