Counts incorrect when fluo restarted in middle of stress test #456
Comments
I noticed that in the three cases above the primary column had committed successfully, but locks were left on other columns in the transaction. For one of those cases I tried scanning the two locked columns.
Then I looked for notifications, and scanning the columns had caused one to be created.
Eventually the node at level 6 had its count corrected from 233 to 234. First wait was set and then seen was updated.
There was a bit of time between these.
Based on this I tried scanning the entire table to clean up all locks.
This caused a lot of notifications to be created. Eventually the counts came out correct!
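Why a scan would repair the counts can be illustrated with a toy model. The sketch below is not Fluo's code. It assumes a Percolator-style scheme in which a reader that meets a lock checks whether the owning transaction's primary column committed and, if so, rolls the locked cell forward and re-creates any notification that write should have produced. `Lock`, `Table`, `scan`, and the `notify` flag are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Lock:
    primary: str   # cell serving as the owning transaction's primary column
    value: int     # value the transaction intended to write
    notify: bool   # hypothetical flag: this write should trigger an observer


@dataclass
class Table:
    data: dict = field(default_factory=dict)
    locks: dict = field(default_factory=dict)
    committed: set = field(default_factory=set)      # primaries known to have committed
    notifications: set = field(default_factory=set)

    def scan(self, cell):
        """Read a cell, rolling forward a leftover lock if its primary committed."""
        lock = self.locks.get(cell)
        if lock is not None and lock.primary in self.committed:
            self.data[cell] = lock.value          # finish the abandoned write
            if lock.notify:
                self.notifications.add(cell)      # re-create the lost notification
            del self.locks[cell]
        # A lock whose primary has not committed is left in place in this sketch.
        return self.data.get(cell)


def demo():
    # Two cells left locked by a transaction whose primary "p" did commit.
    table = Table(
        data={"a": 1, "b": 1},
        locks={"a": Lock("p", 2, True), "b": Lock("p", 2, False)},
        committed={"p"},
    )
    for cell in ("a", "b"):
        table.scan(cell)
    return table
```

In this model, scanning every cell is what clears the leftover locks and produces the notifications, which matches the behaviour observed above; whether Fluo resolves locks in exactly this way is an assumption.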
I looked into how this currently works and found the following.
The trigger column is always chosen as the primary column, so the notification is deleted as soon as the primary column is committed. If the worker dies after this, no notification remains to re-create the transaction.
I am thinking that regular and weak notifications should only be deleted after the transaction is completed. For weak notifications this may result in the transaction running multiple times, but that's OK. For regular notifications, if a process dies after committing but before deleting the notification, then the ACK should prevent it from running again. In the case where there is an ACK, the notification would still need to be deleted.
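The difference between the two deletion points can be sketched with a small simulation. This is a toy model, not Fluo's implementation: the state dict, `run_once`, `crash_then_retry`, and `WorkerDied` are hypothetical, and `acked` simply stands in for whatever record marks a regular notification's transaction as already committed.

```python
class WorkerDied(Exception):
    """Simulates a worker being killed part-way through a transaction."""


def run_once(state, delete_at_end, die_after_primary=False):
    """Process one pending notification in a toy state dict.

    state keys: 'notified' and 'acked' (bools), 'completed_runs' (int).
    """
    if not state["notified"]:
        return "nothing to do"
    if state["acked"]:
        # An earlier run already committed: only remove the stale notification.
        state["notified"] = False
        return "skipped, notification removed"

    # --- primary column commits here ---
    if not delete_at_end:
        state["notified"] = False      # deletion tied to the primary commit
    if die_after_primary:
        raise WorkerDied()             # remaining columns stay locked

    # --- remaining columns commit here ---
    state["completed_runs"] += 1
    state["acked"] = True
    if delete_at_end:
        state["notified"] = False      # proposed: delete only after completion
    return "completed"


def crash_then_retry(delete_at_end):
    """Kill the worker after the primary commit, then let another worker try."""
    state = {"notified": True, "acked": False, "completed_runs": 0}
    try:
        run_once(state, delete_at_end, die_after_primary=True)
    except WorkerDied:
        pass
    run_once(state, delete_at_end)
    return state
```

With `delete_at_end=False` the retry finds no notification and the work never completes; with `delete_at_end=True` the notification survives the crash and the second attempt finishes the work.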
fixes #456 moved notification deletion to end of transaction
On a 12-node EC2 cluster I ran the stress test, loading 1 million integers via M/R and then 3 million with load transactions. While the workers were processing the 3 million, I restarted fluo twice using `fluo yarn stop` and `fluo yarn start`. After the workers had processed all notifications, I compared the counts from fluo and M/R. The counts differed.
Using the technique described in a comment on astralway/stresso#7, I compared the fluo-generated tree with an M/R-generated tree. Below are some, but not all, of the problems I found. It seems locks were just left behind.
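The comparison itself amounts to diffing two sets of per-node counts. The snippet below is a minimal sketch of that idea, not the technique from astralway/stresso#7: it assumes each tree has already been exported as a mapping from `(level, node)` to count, and `diff_counts` is a hypothetical helper. The two mismatching entries use level-6 counts reported in this issue; the third entry is made up to show that matching nodes are filtered out.

```python
def diff_counts(fluo_counts, expected_counts):
    """Return {(level, node): (fluo, expected)} for every key whose counts differ."""
    keys = set(fluo_counts) | set(expected_counts)
    return {
        key: (fluo_counts.get(key), expected_counts.get(key))
        for key in sorted(keys)
        if fluo_counts.get(key) != expected_counts.get(key)
    }


fluo_tree = {
    (6, "0000000011fe0000"): 270,
    (6, "00000000231f0000"): 253,
    (6, "00000000ffff0000"): 100,   # made-up node, identical in both trees
}
mr_tree = {
    (6, "0000000011fe0000"): 271,
    (6, "00000000231f0000"): 254,
    (6, "00000000ffff0000"): 100,   # made-up node, identical in both trees
}

mismatches = diff_counts(fluo_tree, mr_tree)
for (level, node), (got, want) in mismatches.items():
    print(f"level {level} node {node}: fluo={got} expected={want}")
```

A node missing from one tree shows up with `None` on that side, so absent nodes are reported alongside wrong counts.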
There was a problem at level 6 with the count for 0000000021aa0000. I saw the following at levels 7 and 8.
The count for 0000000011fe0000 at level 6 was 270. It was supposed to be 271; I found the following problem.
At level 6, 00000000231f0000 was 253 instead of 254. I found the following problems at levels 7 and 8.
The table had no notifications. Maybe the notifications for these were prematurely deleted?