-
Notifications
You must be signed in to change notification settings - Fork 190
Add PreparedInsert flow #443
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
| if (chtype->GetCode() == Type::LowCardinality) { | ||
| chtype = col->As<ColumnLowCardinality>()->GetNestedType(); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm honestly not sure this is the right thing to do. Might one need Type::LowCardonality?
|
|
||
| void FinishInsert(); | ||
|
|
||
| void SendData(const Block& block); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to move this to public so that PreparedInsert can call it. Not in the header file, though, so shouldn't matter.
| public: | ||
| Block * GetBlock(); | ||
| void Execute(); | ||
| // XXX This shouldn't be public. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't figure out how to make this private. Suggestions appreciated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be nice if it worked declared public in the .cpp file, but I think I could also use an Impl class like Client does to hide such things.
51d8216 to
c93c844
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces a PreparedInsert pattern for more memory-efficient bulk data insertion. Instead of accumulating all data before sending, users can now prepare an INSERT statement once and execute multiple smaller batches within a single ClickHouse operation.
Key Changes:
- Added
PreparedInsertclass withGetBlock(),Execute(), andFinish()methods for iterative data insertion - Implemented
PrepareInsert()methods in Client for initiating prepared inserts - Added comprehensive unit test demonstrating the prepared insert workflow
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| clickhouse/client.h | Declared PreparedInsert nested class and PrepareInsert() methods with detailed documentation |
| clickhouse/client.cpp | Implemented PreparedInsert class methods, ReceivePreparePackets(), and refactored insert finalization logic |
| clickhouse/block.h | Fixed spelling in comments ("Convinience" → "Convenience") |
| ut/client_ut.cpp | Added PrepareInsert test case and fixed spelling in existing comment ("Spontaneosly" → "Spontaneously") |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Add a new pattern for "prepared inserts". It works like this:
* Call `PrepareInsert` with an `INSERT` query with optional columns
and ending in `VALUES`. No values should be included in the string.
* It returns a `PreparedInsert` object that has two methods:
* `Block()` returns a `Block` pre-configured with columns as
declared in the `INSERT` statement
* `Execute()` inserts data from the block then clears it.
* Call `Finish()` or just let the `PreparedInsert` object go out of
scope to send any remaining rows and to signal the server that it's
done.
This allows one to send smaller batches of blocks, thereby using less
memory, but still in a single ClickHouse `INSERT` operation.
Expected to be useful in the Postgres foreign data wrapper insert API,
where multiple rows can be inserted at once but its API handles
one-at-a-time insertion. It will also support the FDW COPY API, which
can submit huge batches of data to insert, as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for contributing this feature. It has been on the list for quite some time, and I’m glad someone has started looking into it.
However, I have a few remarks.
In general, if you look at the codebase, there is no manual memory management, that is, instead of using new and delete, we rely on std::unique_ptr and std::shared_ptr to manage heap-allocated resources. In fact, the delete keyword is never used anywhere in the project. Using manual memory management of the PreparedInsert class introduces a very bad situation where PreparedInsert can be inadvertently copied.The compiler will automatically generate the copy assignment and the copy constructor operators, which could lead to shallow copies of pointers and ultimately a double-free error, if users are not careful. This can easily happen by accident.
My second remark is a bit tougher. I know you’ve put thought and care into this design, but I’ll have to ask for large changes. The PreparedInsert is not needed here, and the API is simpler without it. The insert operation should be simple and not require many visible moving parts. Ideally, I would approach it like this:
Block block = client.BeginInsert("INSERT INTO test_clickhouse_cpp_insert VALUES");
for (const auto& td : TEST_DATA) {
id->Append(td.id);
name->Append(td.name);
f->Append(td.f);
}
client.SendData(block);
...
client.SendData(block);
...
client.SendData(block);
client.EndInsert();The main points here are:
BeginInsertandEndInsertclearly form a pair and serve one another.- It’s unambiguous that no other insert or select statements should occur between them. The current
PreparedInsertdesign creates room for sharing thePreparedInsertaround, which risks losing the connection state and start using the client object for something else in the meantime. The proposed pattern enforces a clear principle: one operation → one connection → one client object. Need another parallel operation - create another client. - Here the
Blockobject is detached, and ownership is passed to the user code. The user knows it’s not an internal part ofPreparedInsertand can freely modify it if needed. - You can still preserve automatic
EndInsertbehavior when the client goes out of scope by tracking its state - if it’s in insert mode, callEndInsertin the destructor. - I would avoid using the word
Prepare...here, because it seem to have a bit different idea than what we are trying achiave here.
Thank you again for your work. Please let me know if you’d like any help, I’d be happy to assist.
Add a new pattern for "prepared inserts". It works like this:
PrepareInsertwith anINSERTquery with optional columns and ending inVALUES. No values should be included in the string.PreparedInsertobject that has two methods:Block()returns aBlockpre-configured with columns as declared in theINSERTstatementExecute()inserts data from the block then clears it.PreparedInsertobject goes out of scope it first signals the server that it's done sending data.This allows one to send smaller batches of blocks, thereby using less memory, but still in a single ClickHouse
INSERToperation.Expected to be useful in the Postgres foreign data wrapper insert API, where multiple rows can be inserted at once but its API handles one-at-a-time insertion. It will also support the FDW COPY API, which can submit huge batches of data to insert, as well.