@theory theory commented Oct 30, 2025

Add a new pattern for "prepared inserts". It works like this:

  • Call PrepareInsert with an INSERT query with optional columns and ending in VALUES. No values should be included in the string.
  • It returns a PreparedInsert object that has two methods:
    • Block() returns a Block pre-configured with columns as declared in the INSERT statement
    • Execute() inserts data from the block then clears it.
  • When the PreparedInsert object goes out of scope, it signals the server that it's done sending data.

This allows one to send smaller batches of blocks, thereby using less memory, but still in a single ClickHouse INSERT operation.

This is expected to be useful for the Postgres foreign data wrapper (FDW) insert API, where multiple rows can be inserted at once but the API handles them one at a time. It should also support the FDW COPY API, which can submit huge batches of data to insert.

Comment on lines +1191 to +1178
if (chtype->GetCode() == Type::LowCardinality) {
    chtype = col->As<ColumnLowCardinality>()->GetNestedType();
}

I'm honestly not sure this is the right thing to do. Might one need Type::LowCardinality?


void FinishInsert();

void SendData(const Block& block);

I had to move this to public so that PreparedInsert can call it. Not in the header file, though, so shouldn't matter.

public:
    Block * GetBlock();
    void Execute();
    // XXX This shouldn't be public.

I couldn't figure out how to make this private. Suggestions appreciated.

Would be nice if it worked declared public in the .cpp file, but I think I could also use an Impl class like Client does to hide such things.
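
For readers unfamiliar with the Impl approach mentioned here, the following is a minimal sketch of the pattern, not the code from this PR. The names `Inserter`, `Impl`, and `SentCount` are hypothetical; the point is only that members declared inside a forward-declared `Impl` class never appear in the public header.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// --- What would live in the public header (hypothetical names) ---
class Inserter {
public:
    Inserter();
    ~Inserter();  // defined below, where Impl is a complete type
    void Execute(const std::string& row);
    std::size_t SentCount() const;

private:
    class Impl;                   // forward declaration only
    std::unique_ptr<Impl> impl_;  // owns the hidden implementation
};

// --- What would live in the .cpp file ---
class Inserter::Impl {
public:
    // Public here, but invisible to anyone who only includes the header.
    void SendData(const std::string& row) { sent_.push_back(row); }
    std::size_t SentCount() const { return sent_.size(); }

private:
    std::vector<std::string> sent_;  // stands in for a real connection
};

Inserter::Inserter() : impl_(std::make_unique<Impl>()) {}
Inserter::~Inserter() = default;

void Inserter::Execute(const std::string& row) { impl_->SendData(row); }
std::size_t Inserter::SentCount() const { return impl_->SentCount(); }
```

Because `impl_` is a `std::unique_ptr`, the outer class is also non-copyable by default, so the hidden state cannot be duplicated by accident.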

@theory theory force-pushed the insert-block branch 5 times, most recently from 51d8216 to c93c844 Compare October 31, 2025 20:50
@mshustov mshustov requested review from Copilot and slabko November 4, 2025 08:25
Copilot AI left a comment

Pull Request Overview

This PR introduces a PreparedInsert pattern for more memory-efficient bulk data insertion. Instead of accumulating all data before sending, users can now prepare an INSERT statement once and execute multiple smaller batches within a single ClickHouse operation.

Key Changes:

  • Added PreparedInsert class with GetBlock(), Execute(), and Finish() methods for iterative data insertion
  • Implemented PrepareInsert() methods in Client for initiating prepared inserts
  • Added comprehensive unit test demonstrating the prepared insert workflow

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| clickhouse/client.h | Declared PreparedInsert nested class and PrepareInsert() methods with detailed documentation |
| clickhouse/client.cpp | Implemented PreparedInsert class methods, ReceivePreparePackets(), and refactored insert finalization logic |
| clickhouse/block.h | Fixed spelling in comments ("Convinience" → "Convenience") |
| ut/client_ut.cpp | Added PrepareInsert test case and fixed spelling in existing comment ("Spontaneosly" → "Spontaneously") |

Add a new pattern for "prepared inserts". It works like this:

*   Call `PrepareInsert` with an `INSERT` query with optional columns
    and ending in `VALUES`. No values should be included in the string.
*   It returns a `PreparedInsert` object that has two methods:
    *   `Block()` returns a `Block` pre-configured with columns as
        declared in the `INSERT` statement
    *   `Execute()` inserts data from the block then clears it.
*   Call `Finish()` or just let the `PreparedInsert` object go out of
    scope to send any remaining rows and to signal the server that it's
    done.

This allows one to send smaller batches of blocks, thereby using less
memory, but still in a single ClickHouse `INSERT` operation.

Expected to be useful in the Postgres foreign data wrapper insert API,
where multiple rows can be inserted at once but its API handles
one-at-a-time insertion. It will also support the FDW COPY API, which
can submit huge batches of data to insert, as well.

@slabko slabko left a comment

Thank you very much for contributing this feature. It has been on the list for quite some time, and I’m glad someone has started looking into it.

However, I have a few remarks.

In general, if you look at the codebase, there is no manual memory management: instead of using new and delete, we rely on std::unique_ptr and std::shared_ptr to manage heap-allocated resources. In fact, the delete keyword is never used anywhere in the project. Manual memory management in the PreparedInsert class introduces a very bad situation where PreparedInsert can be inadvertently copied. The compiler will automatically generate the copy constructor and the copy assignment operator, which could lead to shallow copies of pointers and ultimately a double-free error if users are not careful. This can easily happen by accident.
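
The copy hazard described above can be shown with a small self-contained sketch. `FakeBlock`, `RawOwner`, and `SafeOwner` are hypothetical names for illustration, not types from this PR; the `static_assert` lines only check what the compiler generates for each class.

```cpp
#include <cassert>
#include <memory>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for heap-allocated state; not the clickhouse-cpp Block.
struct FakeBlock {
    int rows = 0;
};

// Raw-pointer ownership: the compiler still generates copy operations, so a
// copy would hold the same pointer and both destructors would delete it.
class RawOwner {
public:
    RawOwner() : block_(new FakeBlock()) {}
    ~RawOwner() { delete block_; }

private:
    FakeBlock* block_;
};

// unique_ptr ownership: copying is disabled automatically, moving still works,
// and no hand-written destructor is needed.
class SafeOwner {
public:
    SafeOwner() : block_(std::make_unique<FakeBlock>()) {}
    FakeBlock& block() { return *block_; }

private:
    std::unique_ptr<FakeBlock> block_;
};

static_assert(std::is_copy_constructible<RawOwner>::value,
              "raw-pointer owner is silently copyable (double-free hazard)");
static_assert(!std::is_copy_constructible<SafeOwner>::value,
              "unique_ptr member deletes the copy constructor");
static_assert(!std::is_copy_assignable<SafeOwner>::value,
              "unique_ptr member deletes copy assignment");
static_assert(std::is_move_constructible<SafeOwner>::value,
              "ownership can still be transferred by move");
```

With the `std::unique_ptr` member, an accidental copy becomes a compile error rather than a runtime double free.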

My second remark is a bit tougher. I know you’ve put thought and care into this design, but I’ll have to ask for large changes. The PreparedInsert is not needed here, and the API is simpler without it. The insert operation should be simple and not require many visible moving parts. Ideally, I would approach it like this:

Block block = client.BeginInsert("INSERT INTO test_clickhouse_cpp_insert VALUES");
for (const auto& td : TEST_DATA) {
    id->Append(td.id);
    name->Append(td.name);
    f->Append(td.f);
}
client.SendData(block);
...
client.SendData(block);
...
client.SendData(block);
client.EndInsert();

The main points here are:

  1. BeginInsert and EndInsert clearly form a pair and serve one another.
  2. It’s unambiguous that no other insert or select statements should occur between them. The current PreparedInsert design creates room for passing the PreparedInsert around, which risks losing the connection state if the client object is used for something else in the meantime. The proposed pattern enforces a clear principle: one operation → one connection → one client object. If you need another parallel operation, create another client.
  3. Here the Block object is detached, and ownership is passed to the user code. The user knows it’s not an internal part of PreparedInsert and can freely modify it if needed.
  4. You can still preserve automatic EndInsert behavior when the client goes out of scope by tracking its state - if it’s in insert mode, call EndInsert in the destructor.
  5. I would avoid using the word Prepare... here, because it seems to suggest a somewhat different idea from what we are trying to achieve here.
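
The state tracking behind points 1, 2, and 4 could look roughly like the sketch below. It is a mock under stated assumptions, not the clickhouse-cpp implementation: `MockClient`, `MockBlock`, `Select`, `InsertInBatches`, and the accessor names are hypothetical, nothing is sent over a network, and the caller is assumed to clear the block after each `SendData`.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-ins: MockBlock and MockClient are NOT clickhouse-cpp types.
struct MockBlock {
    std::vector<int> ids;
    std::size_t RowCount() const { return ids.size(); }
    void Clear() { ids.clear(); }
};

class MockClient {
public:
    MockClient() = default;
    // One operation, one connection, one client object: forbid copies.
    MockClient(const MockClient&) = delete;
    MockClient& operator=(const MockClient&) = delete;

    // Point 4: finish an insert that is still open at destruction.
    ~MockClient() {
        if (inserting_) EndInsert();
    }

    MockBlock BeginInsert(const std::string& query) {
        RequireIdle();
        query_ = query;
        inserting_ = true;
        return MockBlock{};  // point 3: the caller owns the block
    }

    void SendData(const MockBlock& block) {
        RequireInserting();
        rows_sent_ += block.RowCount();  // a real client would write to the socket
        ++batches_sent_;
    }

    void EndInsert() {
        RequireInserting();
        inserting_ = false;
    }

    // Point 2: any other statement is rejected while an insert is open.
    void Select(const std::string& /*query*/) { RequireIdle(); }

    bool inserting() const { return inserting_; }
    std::size_t rows_sent() const { return rows_sent_; }
    std::size_t batches_sent() const { return batches_sent_; }

private:
    void RequireIdle() const {
        if (inserting_) throw std::logic_error("insert in progress");
    }
    void RequireInserting() const {
        if (!inserting_) throw std::logic_error("no insert in progress");
    }

    std::string query_;
    bool inserting_ = false;
    std::size_t rows_sent_ = 0;
    std::size_t batches_sent_ = 0;
};

// Sends `ids` in batches of at most `batch_size` rows within a single insert.
void InsertInBatches(MockClient& client, const std::vector<int>& ids,
                     std::size_t batch_size) {
    MockBlock block = client.BeginInsert("INSERT INTO t VALUES");
    for (int id : ids) {
        block.ids.push_back(id);
        if (block.RowCount() == batch_size) {
            client.SendData(block);
            block.Clear();  // keeps at most batch_size rows in memory
        }
    }
    if (block.RowCount() > 0) client.SendData(block);
    client.EndInsert();
}
```

Keeping the insert state on the client makes misuse fail loudly: a `Select` issued between `BeginInsert` and `EndInsert` throws instead of corrupting the connection state.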

Thank you again for your work. Please let me know if you’d like any help, I’d be happy to assist.
