Ratchet is a prototype system for pipeline-based query suspension and resumption.
The Ratchet implementation is modified from DuckDB.
It is highly recommended to add only third-party libraries whose entire source code is in a single header file. You can then add them by:
- copying the header file of the third-party library to the `third_party` folder
- adding `include_directories(third_party/xxx)` after `include_directories(src/include)` in the `CMakeLists.txt` at the root directory. You may have to recompile the source code, using `make` in the root folder.
- if you are working on the Python client, also updating `third_party_includes()` in `scripts/package_build.py`. You may have to reinstall the Python client to reflect the change, using `pip3 install .` in `/tools/pythonpkg`.
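The `CMakeLists.txt` step above can be scripted if you add libraries often. The following is a minimal sketch, not part of Ratchet: the helper name `add_third_party_include` and the library folder `json` are hypothetical, and it assumes the anchor line appears exactly as `include_directories(src/include)`.

```python
def add_third_party_include(cmake_text: str, lib_dir: str) -> str:
    """Return cmake_text with include_directories(third_party/<lib_dir>)
    added right after the include_directories(src/include) line.

    Hypothetical helper for illustration; it only transforms text and
    does not touch any file on disk.
    """
    anchor = "include_directories(src/include)"
    new_line = f"include_directories(third_party/{lib_dir})"
    lines = cmake_text.splitlines()

    # Do nothing if the library is already registered.
    if any(line.strip() == new_line for line in lines):
        return cmake_text

    out = []
    inserted = False
    for line in lines:
        out.append(line)
        if not inserted and line.strip() == anchor:
            # Reuse the indentation of the anchor line.
            indent = line[: len(line) - len(line.lstrip())]
            out.append(indent + new_line)
            inserted = True

    if not inserted:
        raise ValueError(f"anchor line not found: {anchor}")

    trailing_newline = "\n" if cmake_text.endswith("\n") else ""
    return "\n".join(out) + trailing_newline


# Example on an in-memory snippet (assumed content, not the real file).
sample = "project(x)\ninclude_directories(src/include)\nadd_subdirectory(src)\n"
print(add_third_party_include(sample, "json"))
```

After applying the result to the real `CMakeLists.txt`, you would still rebuild with `make` as described above.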
We use nlohmann/json to serialize and deserialize JSON. GitHub: https://github.com/nlohmann/json
When you want to add a new Python API or modify an existing one for DuckDB, especially when working in a virtual environment, you need to:
- Install the `mypy` Python library in the virtual environment
- Modify the source code in `tools/pythonpkg/src` to reflect the API change
- Run `scripts/regenerate_python_stubs.sh` at the root directory of DuckDB, making sure `<Ratchet-DuckDB>/tools/pythonpkg/duckdb-stubs/__init__.pyi` reflects the API change
- Install the modified DuckDB again using `python setup.py install` in `<Ratchet-DuckDB>/tools/pythonpkg`
- If the change to the Python client APIs still does not take effect, repeat steps 3 and 4 (regenerating the stubs and reinstalling) a few times; it should then work.
The main codebase is written in C++, so CMake is used to compile the source code. To build, run `make` in the root folder.
Install pybind11 using `pip3 install pybind11` (system-wide or in a virtual environment).
`pip3 show pybind11` will tell you where pybind11 is installed, for example, `/home/{user_path}/{venv}/lib/python3.7/site-packages`.
Then `</path/to/pybind11>` is, for example, `/home/{user_path}/{venv}/lib/python3.7/site-packages/pybind11`.
If you are using the CLion IDE for development and want CLion to resolve all the source code, you may need to add `-DBUILD_PYTHON_PKG=TRUE -DCMAKE_PREFIX_PATH=</path/to/pybind11>` in Settings | Build, Execution, Deployment | CMake | CMake Options. This tells CLion where to find pybind11.
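As an alternative to reading the output of `pip3 show pybind11`, the package directory can be looked up from Python inside the same environment. This is a small sketch using only the standard library; the helper names `find_package_dir` and `clion_cmake_options` are hypothetical and not part of Ratchet or pybind11.

```python
import importlib.util
import os


def find_package_dir(package_name: str):
    """Return the directory of an importable package, or None if it is
    not installed in the current interpreter's environment."""
    spec = importlib.util.find_spec(package_name)
    if spec is None or spec.origin is None:
        return None
    # For a regular package, spec.origin points at its __init__.py.
    return os.path.dirname(spec.origin)


def clion_cmake_options(pybind11_dir: str) -> str:
    """Build the CMake options string quoted in the CLion instructions."""
    return f"-DBUILD_PYTHON_PKG=TRUE -DCMAKE_PREFIX_PATH={pybind11_dir}"


if __name__ == "__main__":
    path = find_package_dir("pybind11")
    if path is None:
        print("pybind11 is not installed in this environment")
    else:
        print(clion_cmake_options(path))
```

Run it with the interpreter of the virtual environment you installed pybind11 into, so that the reported path matches the one CLion should use.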
Ratchet-DuckDB can be used and tested through a Python client. It is recommended to install the Python client in a Python virtual environment.
source <path/to/python-virtual-environment/bin/activate>
cd <Ratchet>/tools/pythonpkg
pip3 install .
# or python setup.py install

`Sink()`, `Finalize()`, and `GetData()` are the functions where query suspension and resumption are implemented. Usually, query suspension should happen in `Finalize()`, while query resumption should happen in `Sink()`. However, this varies case by case for implementation or performance reasons; for example, resumption for aggregation may happen in `GetData()`.
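To make the division of labor between the three functions concrete, here is a toy model in Python. It is not Ratchet's or DuckDB's code: the class `ToySumAggregate`, its method signatures, and the JSON state layout are all assumptions made for illustration. It only mirrors the idea stated above, that state is written out at finalize time and read back at sink time.

```python
import json


class ToySumAggregate:
    """Toy stand-in for a pipeline sink operator (illustration only)."""

    def __init__(self):
        self.total = 0
        self.rows_seen = 0

    def sink(self, chunk, resume_from=None):
        # Resumption point: restore previously saved state before
        # consuming any further input.
        if resume_from is not None:
            state = json.loads(resume_from)
            self.total = state["total"]
            self.rows_seen = state["rows_seen"]
        self.total += sum(chunk)
        self.rows_seen += len(chunk)

    def finalize(self, suspend=False):
        # Suspension point: the sink state is complete here, so it can
        # be serialized. Returns a JSON string when suspending.
        if suspend:
            return json.dumps({"total": self.total, "rows_seen": self.rows_seen})
        return None

    def get_data(self):
        # Produce the operator's result for the next pipeline.
        return self.total


# Run part of the work, then suspend.
op = ToySumAggregate()
op.sink([1, 2, 3])
snapshot = op.finalize(suspend=True)

# Later, in a fresh operator, resume from the snapshot and continue.
resumed = ToySumAggregate()
resumed.sink([4, 5], resume_from=snapshot)
resumed.finalize()
print(resumed.get_data())  # prints 15
```

How the real operators lay out and restore their state is not described here; the files listed in this README are the place to look for that.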
- Adding suspension and resumption APIs in `pyconnection.cpp` and `pyconnection.hpp`
- Checking for finished pipelines during resumption in `pipeline.cpp`
- Suspending and resuming ungrouped aggregation in `physical_ungrouped_aggregate.cpp`
- Suspending and resuming in-memory hash join in `physical_hash_join.cpp` and `perfect_hash_join_executor.cpp`
- Suspending and resuming external hash join in `physical_hash_join.cpp`
- Suspending and resuming grouped aggregation in `physical_hash_aggregate.cpp`

- tools/pythonpkg/src/pyconnection.cpp
- tools/pythonpkg/include/duckdb_python/pyconnection/pyconnection.hpp
- src/include/duckdb/common/constants.hpp
- src/include/duckdb/common/types/data_chunk.hpp
- src/include/duckdb/common/vector_operations/aggregate_executor.hpp
- src/include/duckdb/execution/operator/join/perfect_hash_join_executor.hpp
- src/include/duckdb/execution/executor.hpp
- src/include/duckdb/main/client_config.hpp
- src/include/duckdb/parallel/pipeline.hpp
- src/common/constants.cpp
- src/main/settings/settings.cpp
- src/execution/operator/aggregate/physical_hash_aggregate.cpp
- src/execution/operator/aggregate/physical_ungrouped_aggregate.cpp
- src/execution/operator/join/perfect_hash_join_executor.cpp
- src/execution/operator/join/physical_hash_join.cpp
- src/execution/operator/join/physical_range_join.cpp
- src/execution/operator/order/physical_order.cpp
- src/execution/operator/scan/physical_table_scan.cpp
- src/execution/join_hashtable.cpp
- src/parallel/executor.cpp
- src/parallel/pipeline.cpp
- src/parallel/pipeline_executor.cpp
DuckDB is a high-performance analytical database system. It is designed to be fast, reliable and easy to use. DuckDB provides a rich SQL dialect, with support far beyond basic SQL. DuckDB supports arbitrary and nested correlated subqueries, window functions, collations, complex types (arrays, structs), and more. For more information on the goals of DuckDB, please refer to the Why DuckDB page on our website.
If you want to install and use DuckDB, please see our website for installation and usage instructions.
For CSV files and Parquet files, data import is as simple as referencing the file in the FROM clause:
SELECT * FROM 'myfile.csv';
SELECT * FROM 'myfile.parquet';

Refer to our Data Import section for more information.
The website contains a reference of functions and SQL constructs available in DuckDB.
For development, DuckDB requires CMake, Python3 and a C++11 compliant compiler. Run `make` in the root directory to compile the sources. For development, use `make debug` to build a non-optimized debug version. You should run `make unit` and `make allunit` to verify that your version works properly after making changes. To test performance, you can run `BUILD_BENCHMARK=1 BUILD_TPCH=1 make` and then perform several standard benchmarks from the root directory by executing `./build/release/benchmark/benchmark_runner`. The details of the benchmarks are in our Benchmark Guide.
Please also refer to our Contribution Guide.