Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Data race in SimpleRecordBatch::columns #45371

Open
colin-r-schultz opened this issue Jan 28, 2025 · 1 comment
Open

[C++] Data race in SimpleRecordBatch::columns #45371

colin-r-schultz opened this issue Jan 28, 2025 · 1 comment

Comments

@colin-r-schultz
Copy link

Describe the bug, including details regarding any error messages, version, and platform.

The following example test case, when run under TSAN, reports a data race.

TEST_F(TestRecordBatch, ColumnsThreadSafety) {
  const int length = 10;

  random::RandomArrayGenerator gen(42);
  std::shared_ptr<ArrayData> array_data = gen.ArrayOf(utf8(), length)->data();
  auto schema = ::arrow::schema({field("f1", utf8())});
  auto record_batch = RecordBatch::Make(schema, length, {array_data});
  std::atomic_bool start_flag{false};
  std::thread t([record_batch, &start_flag]() {
    start_flag.store(true);
    auto columns = record_batch->columns();
    ASSERT_EQ(columns.size(), 1);
  });
  // Wait for thread startup
  while (!start_flag.load()) {
  };
  auto columns = record_batch->columns();
  ASSERT_EQ(columns.size(), 1);
  t.join();
}

The relevant definitions in record_batch.cc are below

  const std::vector<std::shared_ptr<Array>>& columns() const override {
    for (int i = 0; i < num_columns(); ++i) {
      // Force all columns to be boxed
      column(i);
    }
    return boxed_columns_;
  }

  std::shared_ptr<Array> column(int i) const override {
    std::shared_ptr<Array> result = std::atomic_load(&boxed_columns_[i]);
    if (!result) {
      result = MakeArray(columns_[i]);
      std::atomic_store(&boxed_columns_[i], result);
    }
    return result;
  }

The columns() method returns a reference to mutable boxed_columns_, assuming that it is fully initialized and will not be written to again. However, multiple threads can race to initialize boxed_columns_[i], leading to additional atomic writes after column(i) has been called for the first time. These atomic writes can race against non atomic reads of the boxed_columns_ vector after it is returned by columns(). This is undefined behavior and can lead to a use-after-free of the contained Arrays.

Component(s)

C++

@colin-r-schultz
Copy link
Author

I'll take this since I have a patch ready

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant