Skip to content

fix: prevent crash on KEYLOCK_ACQUIRED check for NO_KEY_TRANSACTIONAL commands #5185

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
May 29, 2025

Conversation

vyavdoshenko
Copy link
Contributor

@vyavdoshenko vyavdoshenko commented May 27, 2025

Fixes: #5173

Problem:
Server crashes with Check failed: !lock_args.fps.empty() during concurrent search indexing operations (FT.CREATE, FT.DROPINDEX, FT.SEARCH, etc.).

Root Cause:
Commands with CO::NO_KEY_TRANSACTIONAL flag don't have keys to lock, resulting in empty fingerprints (kv_fp_). However, the transaction system:

  • Always sets KEYLOCK_ACQUIRED flag in ScheduleInShard()
  • Expects this flag when releasing locks in CancelShardCb()
  • Attempts to release locks on empty fingerprint arrays

Solution:
Add proper conditional checks in CancelShardCb()

finally:
client.close()

def writer2_thread():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you do not need to clone the same function twice, you can just launch a second thread with the same function.

stop_threads.set()

# Check if server crashed
if server_crashed.is_set():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the server crashed, the testing framework should identify it automatically.

@vyavdoshenko vyavdoshenko self-assigned this May 27, 2025
@vyavdoshenko vyavdoshenko changed the title fix: race condition in CancelShardCb when kv_fp_ is empty in multi-transactions fix: prevent crash on KEYLOCK_ACQUIRED check for NO_KEY_TRANSACTIONAL commands May 28, 2025
@vyavdoshenko vyavdoshenko requested review from romange and kostasrim May 28, 2025 14:21
std::string query_numeric = absl::StrCat("@numeric_field:[", random_val_numeric, " ",
(random_val_numeric + 100), "]");
Run({"ft.search", kIndexName, query_content});
Run({"ft.search", kIndexName, query_tags});
Copy link
Collaborator

@romange romange May 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am pretty sure the bug is unrelated to query contents.
the goal is to provide a minimal reproducible example - everything else is just noise.

  1. please make the query constant
  2. there is no need for 3 calls. you can increase kSearcherOps if needed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made the test smaller

Run({"ft.search", kIndexName, query_tags});
Run({"ft.search", kIndexName, query_numeric});
if (i % 50 == 0)
ThisFiber::SleepFor(std::chrono::microseconds(200 * (1 + i % 2)));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect that a constant will suffice here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed all sleeps. The bug is reproducible without them.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

even better.

"TEXT", "SORTABLE", "tags", "TAG", "SORTABLE", "numeric_field", "NUMERIC", "SORTABLE"});
ThisFiber::SleepFor(std::chrono::milliseconds(10 + (i % 5 * 5)));
Run({"ft.dropindex", kIndexName});
ThisFiber::SleepFor(std::chrono::microseconds(500 * (1 + i % 2)));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you sure you need reindexer_fiber at all?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to remove every part. This test is as minimal as I could create to reproduce.

@romange
Copy link
Collaborator

romange commented May 28, 2025

pls rebase as well

Co-authored-by: Roman Gershman <[email protected]>
Signed-off-by: Volodymyr Yavdoshenko <[email protected]>
@@ -2768,6 +2768,47 @@ TEST_F(SearchFamilyTest, JsonSetIndexesBug) {
EXPECT_THAT(resp, IsUnordArrayWithSize(IsMap("text", "some text")));
}

TEST_F(SearchFamilyTest, SearchReindexWriteSearchRace) {
const std::string kIndexName = "myRaceIdx";
const int kWriterOps = 1000;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it reproduces for me with all constants be 100

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please check if also for you. I prefer not having long running unit tests

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated to 200 because 100 sometimes passes

@vyavdoshenko vyavdoshenko merged commit 9d7c713 into main May 29, 2025
10 checks passed
@vyavdoshenko vyavdoshenko deleted the bobik/multitransation_race_condition branch May 29, 2025 05:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Race Condition in Search Indexing Leading to Crash During Concurrent Operations
3 participants