This repository contains the scripts necessary to create the data for the h2o benchmark, as well as performance reports for the groupby queries run on parquet data.
- Clone this repository
- Create and activate the conda environment

$ conda env create -f environment.yml
$ conda activate h2o_data
You can find the script to generate the datasets under the `scripts` directory. The `create_groupby_data.py` script will generate the groupby data for the h2o benchmark according to the arguments provided. For information on the arguments, run:
$ python create_groupby_data.py --help
Notes
- This script does not include NAs, and the data is NOT sorted.
- This script generates multiple files in parallel using dask; you will need to adjust the number of files according to your available RAM (a minimal sketch of this pattern is shown after these notes).
- If you desire one single file, you can join the `num_files` generated files using `scripts/create_single_csv.py`.
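As a rough illustration of the parallel-generation note above, the sketch below writes several CSV files concurrently with `dask.delayed`. The column names and value ranges used here (`id1`, `id4`, `v3`) are assumptions meant to mimic the h2o groupby layout; the actual schema is whatever `create_groupby_data.py` produces.

```python
# Illustrative sketch only: parallel CSV generation with dask.delayed.
# The columns (id1, id4, v3) are assumed; see create_groupby_data.py for the real schema.
import os

import dask
import numpy as np
import pandas as pd

num_rows, num_groups, num_files = 1_000_000, 100, 10
rows_per_file = num_rows // num_files
os.makedirs("test", exist_ok=True)


@dask.delayed
def write_file(i):
    rng = np.random.default_rng(i)
    df = pd.DataFrame(
        {
            "id1": [f"id{n:03d}" for n in rng.integers(1, num_groups + 1, rows_per_file)],
            "id4": rng.integers(1, num_groups + 1, rows_per_file),
            "v3": rng.random(rows_per_file).round(6),
        }
    )
    df.to_csv(f"test/groupby-file_{i}.csv", index=False)
    return i


# Each file is written by a separate task, so more (smaller) files means less RAM per task.
dask.compute(*[write_file(i) for i in range(num_files)])
```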
Use the following example to check that everything has been set up correctly before you attempt the bigger files.
Example: 1 million rows (num_rows) and 100 groups (num_groups)
The following code will create 10 files in a directory called `test`, located at the same level from which the script is run. For a different directory location you can provide a full or relative path. Note that if the directory doesn't exist, it will be created.
$ python create_groupby_data_dask.py --num-rows 1e6 --num-groups 1e2 --num-files 10 --output-dir test
To obtain a single file, you can join them using `create_single_csv.py`; the command below takes care of the headers, keeping only the first one.
$ python create_single_csv.py -of output_dir/N_1e6_K_1e2_single.csv -ifn output_dir/groupby-N_1000000_K_100_file_ --num-files 10
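For reference, the header handling described above can be approximated with a few lines of plain Python; this is only a sketch, with file names taken from the example (the supported way remains `create_single_csv.py`).

```python
# Sketch of concatenating the generated CSVs while keeping only the first header.
# File names follow the example above and are otherwise assumptions.
import shutil

num_files = 10
prefix = "output_dir/groupby-N_1000000_K_100_file_"  # assumed naming pattern

with open("output_dir/N_1e6_K_1e2_single.csv", "w") as out:
    for i in range(num_files):
        with open(f"{prefix}{i}.csv") as part:
            if i > 0:
                next(part)  # skip the header of every file after the first
            shutil.copyfileobj(part, out)
```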
The H2O benchmark uses three sizes of data, each stored in a single file. You can generate them on your machine, but note that you might need to increase the number of files depending on your RAM. (The following were tested on an M1 arm64 machine with 16GB of RAM.)
This results in 10 files
$ python create_groupby_data_dask.py --num-rows 1e7 --num-groups 1e2 --num-files 10 --output-dir output_dir
This results in 100 files
$ python create_groupby_data_dask.py --num-rows 1e8 --num-groups 1e2 --num-files 100 --output-dir output_dir
This results in 1000 files
$ python create_groupby_data.py --num-rows 1e9 --num-groups 1e2 --num-files 1000 --output-dir output_dir
How to get a single file:
In any of the cases above, if you want to join the files, you can do so using the `create_single_csv.py` script. The script contains details on how to use it; for example, for the case of `num_rows=1e7`, `num_groups=1e2`, and `num_files=10`:
$ python create_single_csv.py \
--output-file output_dir/merged.csv \
--input-files-name output_dir/groupby-num_rows_10000000-num_groups_100-file_ \
--num-files 10
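A quick way to check the merged file is to read it back with dask and confirm the row count; the snippet below is only a suggested sanity check for the `num_rows=1e7` example above.

```python
# Optional sanity check: the merged CSV should contain num_rows rows.
import dask.dataframe as dd

ddf = dd.read_csv("output_dir/merged.csv")
assert len(ddf) == 10_000_000  # num_rows = 1e7 in this example
```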
If you want to create the equivalent parquet datasets, use the script `scripts/csv_to_parquet_data.py`, which creates a parquet dataset based on the csv dataset provided.
For example, for the case of 1e7 rows, the following command takes the folder containing the csv files generated with `create_groupby_data.py` that we host in S3, and creates a local directory named `test_parquet` with the corresponding parquet files for the 1e7-row case.
$ python csv_to_parquet_data.py --csv-dir s3://coiled-datasets/h2o-benchmark/N_1e7_K_1e2/ --output-dir test_parquet
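A conversion like this can also be expressed directly with `dask.dataframe`; the snippet below is a simplified local sketch (default dtypes and partitioning) and is not necessarily what `csv_to_parquet_data.py` does internally.

```python
# Simplified sketch of a CSV -> parquet conversion with dask.dataframe.
# csv_to_parquet_data.py may handle dtypes, partitioning, and compression differently.
import dask.dataframe as dd

ddf = dd.read_csv("N_1e7_K_1e2/*.csv")  # folder containing the generated CSV files
ddf.to_parquet("test_parquet/", engine="pyarrow", write_index=False)
```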
The following S3 bucket contains the data to perform the h2o benchmark. We provide the single files for every case (`num_rows=1e7`, `1e8`, and `1e9`), as well as a folder for each case that contains `num_files=10, 100, 1000` respectively. Folders with `_parquet` in the name contain the parquet files.
S3 URI: s3://coiled-datasets/h2o-benchmark/
S3 URI (id3 no categorical): s3://coiled-datasets/h2o-benchmark/id3_nocat/
S3 URI 50GB (N=1e9), 500GB (N=1e10) with id1, id2, id3 as pyarrow strings: s3://coiled-datasets/h2o-benchmark/pyarrow_strings/
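As an example of consuming the hosted data, the snippet below reads one of the parquet folders with dask and runs a single h2o-style groupby (sum of `v1` by `id1`). The exact folder name and the anonymous-access setting are assumptions; adjust them to the bucket layout described above.

```python
# Example: read a hosted parquet folder and run one h2o-style groupby query.
# The folder name is illustrative; list the bucket to find the exact *_parquet paths.
import dask.dataframe as dd

ddf = dd.read_parquet(
    "s3://coiled-datasets/h2o-benchmark/N_1e7_K_1e2_parquet/",
    storage_options={"anon": True},  # assumes public, anonymous read access
)
result = ddf.groupby("id1").agg({"v1": "sum"}).compute()
print(result.head())
```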
These performance reports were obtained by running the code in `h2o_benchamarks.ipynb`, using `dask=2022.9.0`.