This is the repository for the paper -- A Hybrid Probabilistic Approach for Table Understanding
The code requires the tables to be in json line format. Each json object is a single table containing the fields:
- table_array: a 2D list of cell contents
- table_id: a id (string)
- file_name: a file name (string)
- embeddings: a 3D list of cell vector representations (obtained from a pre-trained cell embedding model)
- feature_array: a 3D list of features
- blocks: a list such that each item represents a block (top index, left index, bottom index, right index, functional label)
- data_types: a 2D list of cell data types
- layouts: a list such that each item represents a relationship (block_a top index, block_a left index, block_a bottom index, block_a right index, block_b top index, block_b left index, block_b bottom index, block_b right index, relation type)
The processed datasets are available in datasets.tar.gz
. It includes 4 available datasets. They do not contain the embeddings
field. To get the cell representations, you can use the pre-trained cell embedding model. The pre-processed datasets (datasets_w_emb.tar.gz
) can be found here.
dg_all.jl
has annotations of cell data types, blocks, and relationships between blocks.
cius_blocks.jl
, saus_blocks.jl
and deex_blocks.jl
have annotations for blocks. The blocks are automatically generated from cell-level labels. The original datasets can be found here.
See the example in cfg/dg_config.yaml
.
python generate_folds.py --config cfg/dg_config.yaml
python train_cl.py --config cfg/dg_config.yaml --cell --block
Run the cell classifier, the block detector and the layout predictor as follows. The predictions will be saved to the results/dg/
directory. The output files are in json format including predictions for k
folds. Each fold has a field predict
presenting the predictions.
python test_cc.py --config cfg/dg_config.yaml --method psl
python test_be.py --config cfg/dg_config.yaml --method psl
python test_lp.py --config cfg/dg_config.yaml --method psl