Stress test counts wrong when killing workers #642

keith-turner · 2016-04-12T15:18:44Z

I have been running the stress test while stopping and starting Fluo. This causes YARN to kill all of the workers and the Oracle. After the test completes, the counts are sometimes wrong. However, running a diagnostic scan of the entire table to diff levels of the tree causes the counts to become correct. This full scan causes some partially completed transactions to roll forward, which causes notifications to be written. These notifications trigger observers, which complete the computation.

The problem is that notifications are written in the 2nd phase of commit. I think the solution is to write notifications in the 1st phase of commit. This way, if a transaction only partially completes the 2nd phase, observers will still trigger and roll the transaction forward.

keith-turner · 2016-04-12T17:30:17Z

Seems I ran into this problem before in #456 and fixed it incorrectly. The reason I did not properly fix it then is that I was making the following incorrect assumption:

If a notification causes an observer to create other notifications, then in the case of failures rerunning the observer with the same notification will always recreate the same notifications.

However, this is not true if the observer runs again and finds there is nothing to do, mainly in the case where it failed previously but did complete part of the 2nd phase of commit. In this case, it may not recreate the notifications it created last time, and those notifications will never be created until something tries to read the locked data (which may never happen).

Since the solution in #456 was based on an incorrect premise, I may be able to delete the notifications earlier, as was done before the "fix" for #456, in addition to setting notifications in the lock phase.

keith-turner · 2016-04-12T18:45:12Z

Thinking about this, if notifications are set during the 1st phase of commit, they would use the start timestamp. Need to be careful that notifications are not deleted when they should not be. Notifications are currently deleted using the start timestamp of a transaction. When writing notifications using startTs (instead of commitTs, as is currently done), need to do something to avoid a situation like the following, where the notification created by TX3 is lost.

TX1 snotify C1 with startTs 10 commitTs 12
TX3 starts with startTs 13
TX2 triggered by C1 with startTs 14
TX2 ACK C1 with startTs 14 (deletes notification using startTs 14)
TX3 snotify C1 with startTs 13 commitTs 17

Another issue is that even if the notification was not deleted in the situation above, the current code would consider it already acknowledged, because this is based on comparing the ACK timestamp with the notification timestamp. So TX2 would write an ACK with timestamp 14, which would supersede the notification from TX3.
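
The scenario above can be replayed with a small model. This is only a sketch under stated assumptions: `NotificationTimestamps` and its methods are hypothetical names, the cell keeps only the newest notification and ACK timestamps, and none of this is Fluo's actual notification code. It only illustrates the ACK-versus-notification timestamp comparison described above.

```java
import java.util.OptionalLong;

// Hypothetical model of the notification state for one cell (C1).
// Assumption: only the newest notification and newest ACK timestamps matter.
public class NotificationTimestamps {
    private OptionalLong notifyTs = OptionalLong.empty();
    private OptionalLong ackTs = OptionalLong.empty();

    // Record a notification, keeping the largest timestamp seen so far.
    public void writeNotification(long ts) {
        if (!notifyTs.isPresent() || ts > notifyTs.getAsLong()) {
            notifyTs = OptionalLong.of(ts);
        }
    }

    // Record an ACK written with the start timestamp of the triggered transaction.
    public void ack(long observerStartTs) {
        if (!ackTs.isPresent() || observerStartTs > ackTs.getAsLong()) {
            ackTs = OptionalLong.of(observerStartTs);
        }
    }

    // A notification is pending only if it is newer than the newest ACK.
    public boolean isPending() {
        if (!notifyTs.isPresent()) {
            return false;
        }
        return !ackTs.isPresent() || notifyTs.getAsLong() > ackTs.getAsLong();
    }

    // Replays TX1, TX2, TX3 from the scenario and reports whether
    // the notification from TX3 is still pending at the end.
    public static boolean scenario(boolean useStartTs) {
        NotificationTimestamps c1 = new NotificationTimestamps();
        c1.writeNotification(useStartTs ? 10 : 12); // TX1: startTs 10, commitTs 12
        c1.ack(14);                                 // TX2: startTs 14, ACKs C1
        c1.writeNotification(useStartTs ? 13 : 17); // TX3: startTs 13, commitTs 17
        return c1.isPending();
    }

    public static void main(String[] args) {
        System.out.println("notify with commitTs, TX3 pending: " + scenario(false));
        System.out.println("notify with startTs,  TX3 pending: " + scenario(true));
    }
}
```

In this model, writing with commitTs leaves the notification from TX3 (timestamp 17) newer than the ACK at 14, while writing with startTs leaves it at 13, older than the ACK, so it is treated as already acknowledged.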

keith-turner · 2016-04-12T22:31:43Z

Was thinking of writing notifications while writing locks; however, there is a problem with this. If a transaction writes a notification without writing all of its locks, then an observer triggered by the transaction is not guaranteed to see everything the transaction wrote. A notification can only be written after all locks are written. Beginning to think that strong notifications should be written in the same way that weak notifications are currently written. Weak notifications are written after all locks are written, but before finishing the commit on the primary column. In the case of failures, the transactions will be rolled back and all notifications rewritten. For strong notifications, need to ensure observers fail when triggered by a notification written for a transaction that fails.
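
The ordering argument can be sketched by enumerating crash points. This is a simplified, hypothetical model (the `CommitOrderSketch` class, the `Step` names, and the three-step commit are illustrative assumptions, not Fluo's commit code). It checks whether a crash can leave the primary column committed with no notification written.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model: a commit is a fixed sequence of steps and a crash
// may happen after any prefix of that sequence.
public class CommitOrderSketch {
    public enum Step { WRITE_LOCKS, WRITE_NOTIFICATIONS, COMMIT_PRIMARY }

    // Ordering described as the problem: notifications written after the primary commit.
    public static final List<Step> NOTIFY_IN_SECOND_PHASE =
        Arrays.asList(Step.WRITE_LOCKS, Step.COMMIT_PRIMARY, Step.WRITE_NOTIFICATIONS);

    // Ordering proposed above: after all locks, before the primary commit.
    public static final List<Step> NOTIFY_BEFORE_PRIMARY =
        Arrays.asList(Step.WRITE_LOCKS, Step.WRITE_NOTIFICATIONS, Step.COMMIT_PRIMARY);

    // True if some crash point leaves a committed transaction with no notification.
    public static boolean canStrandCommittedWork(List<Step> order) {
        for (int completed = 0; completed <= order.size(); completed++) {
            List<Step> done = order.subList(0, completed);
            if (done.contains(Step.COMMIT_PRIMARY)
                && !done.contains(Step.WRITE_NOTIFICATIONS)) {
                return true;
            }
        }
        return false;
    }

    // True if a notification can exist before all locks are written.
    public static boolean canNotifyBeforeLocks(List<Step> order) {
        return order.indexOf(Step.WRITE_NOTIFICATIONS) < order.indexOf(Step.WRITE_LOCKS);
    }

    public static void main(String[] args) {
        System.out.println("2nd-phase notify can strand work: "
            + canStrandCommittedWork(NOTIFY_IN_SECOND_PHASE));
        System.out.println("notify-before-primary can strand work: "
            + canStrandCommittedWork(NOTIFY_BEFORE_PRIMARY));
    }
}
```

The model does not cover the remaining concern from the comment above: with the proposed ordering a notification can exist for a transaction that later fails, so observers triggered by it still need to fail.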

fixes #642 write strong notifcations at same place as weak

keith-turner self-assigned this Apr 12, 2016

keith-turner added a commit to keith-turner/fluo that referenced this issue Apr 13, 2016

fixes apache#642 write strong notifcations at same place as weak

a5da8c2

mikewalch closed this as completed in 0edd91d Apr 14, 2016

mikewalch added a commit that referenced this issue Apr 14, 2016

Merge pull request #645 from keith-turner/fluo-642

c8edbe6

fixes #642 write strong notifcations at same place as weak

pono unassigned keith-turner Jun 16, 2016
