Counts incorrect when fluo restarted in middle of stress test #456

Closed
keith-turner opened this issue Mar 26, 2015 · 3 comments

@keith-turner (Contributor):

On a 12 node EC2 cluster I ran the stress test, loading 1 million integers via M/R and then 3 million with load transactions. While the workers were processing the 3 million, I restarted fluo twice using fluo yarn stop and fluo yarn start. After the workers had processed all notifications I compared the counts from fluo and M/R. The counts differed.

[ec2-user@leader1 fluo-stress]$ ./bin/print.sh 2>/dev/null
Total at root : 3991977
Nodes Scanned : 15259
[ec2-user@leader1 fluo-stress]$
[ec2-user@leader1 fluo-stress]$
[ec2-user@leader1 fluo-stress]$ grep UNIQ logs/unique.err
                UNIQUE=3992004

Using the technique described in a comment on astralway/stresso#7, I compared the fluo-generated tree with an M/R-generated tree. Below are some, but not all, of the problems I found. It seems locks were simply left behind.

There was a problem at level 6 with the count for 0000000021aa0000. I saw the following at levels 7 and 8.

root@instance16 fluo> grep 0000000021aa4f
07:s1o9:08:0000000021aa4f00 count:wait [] 14099658-LOCK    08:0iv6:08:0000000021aa4f4d count wait  $
07:s1o9:08:0000000021aa4f00 count:wait [] 14099658-DATA    1
08:0iv6:08:0000000021aa4f4d count:seen [] 14099658-LOCK    08:0iv6:08:0000000021aa4f4d count wait  $
08:0iv6:08:0000000021aa4f4d count:seen [] 14099658-DATA    1
08:0iv6:08:0000000021aa4f4d count:wait [] 6874041-TX_DONE
08:0iv6:08:0000000021aa4f4d count:wait [] 14101136-WRITE    14099658 TRUNCATION PRIMARY
08:0iv6:08:0000000021aa4f4d count:wait [] 6874041-WRITE    6872050 TRUNCATION PRIMARY
08:0iv6:08:0000000021aa4f4d count:wait [] 6872050-DATA    1

The count for 0000000011fe0000 at level 6 was 270 when it should have been 271. I found the following problem.

root@instance16 fluo> grep 0000000011fec3
07:0lan:08:0000000011fec300 count:wait [] 14099649-LOCK    08:0ilh:08:0000000011fec323 count wait  $
07:0lan:08:0000000011fec300 count:wait [] 14099649-DATA    1
08:0ilh:08:0000000011fec323 count:seen [] 14099649-LOCK    08:0ilh:08:0000000011fec323 count wait  $
08:0ilh:08:0000000011fec323 count:seen [] 14099649-DATA    1
08:0ilh:08:0000000011fec323 count:wait [] 7815895-TX_DONE
08:0ilh:08:0000000011fec323 count:wait [] 14101134-WRITE    14099649 TRUNCATION PRIMARY
08:0ilh:08:0000000011fec323 count:wait [] 7815895-WRITE    7813650 TRUNCATION PRIMARY
08:0ilh:08:0000000011fec323 count:wait [] 7813650-DATA    1

At level 6, 00000000231f0000 was 253 instead of 254. I found the following problems at levels 7 and 8.

root@instance16 fluo> grep -np 00000000231f32
08:have:08:00000000231f321a count:seen [] 9555339-WRITE    9554544 TRUNCATION 
08:have:08:00000000231f321a count:seen [] 9554544-DATA    1
08:have:08:00000000231f321a count:wait [] 9555339-TX_DONE    
08:have:08:00000000231f321a count:wait [] 3777912-TX_DONE    
08:have:08:00000000231f321a count:wait [] 9555339-WRITE    9554544 TRUNCATION PRIMARY
08:have:08:00000000231f321a count:wait [] 3777912-WRITE    3776950 TRUNCATION PRIMARY
08:have:08:00000000231f321a count:wait [] 3776950-DATA    1
07:5g59:08:00000000231f3200 count:seen [] 13339534-WRITE    13339363 TRUNCATION 
07:5g59:08:00000000231f3200 count:seen [] 13339363-DATA    2
07:5g59:08:00000000231f3200 count:wait [] 13339534-TX_DONE    
07:5g59:08:00000000231f3200 count:wait [] 13339534-WRITE    13339363 TRUNCATION PRIMARY
07:5g59:08:00000000231f3200 count:wait [] 14099648-LOCK    08:0ild:08:00000000231f32a7 count wait  WRITE NOT_DELETE NOT_TRIGGER l
07:5g59:08:00000000231f3200 count:wait [] 14099648-DATA    1
08:ogzn:08:00000000231f32c8 count:seen [] 9604411-WRITE    9603694 TRUNCATION 
08:ogzn:08:00000000231f32c8 count:seen [] 9603694-DATA    1
08:ogzn:08:00000000231f32c8 count:wait [] 9604411-TX_DONE    
08:ogzn:08:00000000231f32c8 count:wait [] 3581402-TX_DONE    
08:ogzn:08:00000000231f32c8 count:wait [] 9604411-WRITE    9603694 TRUNCATION PRIMARY
08:ogzn:08:00000000231f32c8 count:wait [] 3581402-WRITE    3580467 TRUNCATION PRIMARY
08:ogzn:08:00000000231f32c8 count:wait [] 3580467-DATA    1
08:0ild:08:00000000231f32a7 count:seen [] 14099648-LOCK    08:0ild:08:00000000231f32a7 count wait  WRITE NOT_DELETE TRIGGER l
08:0ild:08:00000000231f32a7 count:seen [] 14099648-DATA    1
08:0ild:08:00000000231f32a7 count:wait [] 7033621-TX_DONE    
08:0ild:08:00000000231f32a7 count:wait [] 14101124-WRITE    14099648 TRUNCATION PRIMARY
08:0ild:08:00000000231f32a7 count:wait [] 7033621-WRITE    7027699 TRUNCATION PRIMARY
08:0ild:08:00000000231f32a7 count:wait [] 7027699-DATA    1

The table had no notifications. Maybe the notifications for these were prematurely deleted?

@keith-turner (Contributor, Author):

I noticed in the three cases above that the primary column had successfully committed; the locks were left on the other columns in the transaction. For one of those cases I tried scanning the two locked columns.

fluo scan -r 08:0iv6:08:0000000021aa4f4d
08:0iv6:08:0000000021aa4f4d count seen  1
fluo scan -r 07:s1o9:08:0000000021aa4f00
07:s1o9:08:0000000021aa4f00 count wait  1
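
In Fluo, a read that encounters a lock older than the rollback timeout triggers lock recovery, which is presumably what recreated the notification seen below. For reference, a minimal sketch of an equivalent read through the client API (shown with the later Apache Fluo 1.x package names; the io.fluo packages in use at the time of this issue differ):

import java.io.File;

import org.apache.fluo.api.client.FluoClient;
import org.apache.fluo.api.client.FluoFactory;
import org.apache.fluo.api.client.Snapshot;
import org.apache.fluo.api.config.FluoConfiguration;
import org.apache.fluo.api.data.Column;

public class ReadLockedCell {
  public static void main(String[] args) {
    // Assumes a fluo.properties pointing at the stress test application.
    FluoConfiguration config = new FluoConfiguration(new File("fluo.properties"));
    try (FluoClient client = FluoFactory.newClient(config);
         Snapshot snap = client.newSnapshot()) {
      // Reading a cell whose lock is past the rollback timeout forces Fluo to
      // resolve the stuck transaction, the same as the fluo scan -r commands above.
      String seen = snap.gets("08:0iv6:08:0000000021aa4f4d", new Column("count", "seen"));
      System.out.println("count:seen = " + seen);
    }
  }
}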

Then I looked for notifications; scanning the columns had caused one to be created.

root@instance16 fluo> scan -c ntfy
07:s1o9:08:0000000021aa4f00 ntfy:count\x00wait [] 14101136-    

Eventually the count for the level 6 node corrected itself from 233 to 234. First wait was set, and then seen was updated.

[ec2-user@leader1 ~]$ fluo scan -p 06 | grep 0000000021aa0000
06:0ajr:08:0000000021aa0000 count seen  233
06:0ajr:08:0000000021aa0000 count wait  1

A bit of time passed between these two scans.

[ec2-user@leader1 ~]$ fluo scan -p 06 | grep 0000000021aa0000
06:0ajr:08:0000000021aa0000 count seen  234

Based on this I tried scanning the entire table to clean up all locks.

fluo scan > /dev/null

This caused a lot of notifications to be created. Eventually the counts came out correct!

[ec2-user@leader1 fluo-stress]$ ./bin/print.sh 2>/dev/null
Total at root : 3992004
Nodes Scanned : 15259

@keith-turner keith-turner self-assigned this Mar 27, 2015
@mikewalch mikewalch added this to the 1.0.0-beta-1 milestone Apr 15, 2015
@keith-turner (Contributor, Author):

I looked into how this currently works and found the following.

  • Regular notifications are deleted when the trigger column is committed, using the startTs. The ACK is written in the same mutation.
  • Regular notifications are written when an observed column is committed, using the commitTs.
  • Weak notifications are written after all locks are written but before the primary column is committed. This is done using the commitTs.
  • Weak notifications are deleted before any locks are written (this is really, really bad; the notification could be completely lost without any work being done).

The trigger column is always chosen as the primary column, so when the primary column is committed the notification is deleted. If the worker dies after this, there is no notification left to cause the transaction to be rerun.
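
To make the ordering concrete, here is a rough sketch of the commit sequence described above. The method names (and the exact timestamp arguments) are made-up placeholders, not Fluo's actual internals; only the order of the steps is taken from the observations above.

// Rough outline of the current commit ordering; only the ORDER of the steps
// reflects the behavior described above, the method names are hypothetical.
public class CurrentCommitOrder {

  void commit(long startTs, long commitTs) {
    // Weak notification for the trigger is deleted before any locks are
    // written -- a crash right after this loses the notification without any
    // work having been done.
    deleteWeakNotification();

    writeLocks(startTs);               // lock the primary and the other columns

    writeWeakNotifications(commitTs);  // new weak notifications, before the primary commits

    // The trigger column is always chosen as the primary, so committing the
    // primary also deletes the regular notification (at startTs) and writes
    // the ACK in the same mutation.
    commitPrimaryDeleteNotificationAndAck(startTs, commitTs);

    // A crash HERE leaves the other columns locked, with no notification left
    // to get the transaction rerun -- the state seen in the scans above.
    commitOtherColumns(commitTs);      // regular notifications for observed
                                       // columns are written here, at commitTs
  }

  // No-op stubs so the sketch compiles; the real logic lives in Fluo.
  void deleteWeakNotification() {}
  void writeLocks(long ts) {}
  void writeWeakNotifications(long ts) {}
  void commitPrimaryDeleteNotificationAndAck(long startTs, long commitTs) {}
  void commitOtherColumns(long ts) {}
}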

@keith-turner (Contributor, Author):

I am thinking that regular and weak notifications should only be deleted after the transaction is completed. For weak notifications this may result in the transaction running multiple times, but that's ok. For regular notifications, if a process dies after committing but before deleting the notification, then the ACK should prevent it from running again. In the case where there is an ACK, the notification would still need to be deleted.
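
Using the same hypothetical placeholders as the sketch above, the proposed change would move all notification deletion to the very end of the commit:

// Same hypothetical placeholders as the sketch above; the only change is that
// notification deletion moves to the end of the transaction.
public class ProposedCommitOrder {

  void commit(long startTs, long commitTs) {
    writeLocks(startTs);
    commitPrimaryAndAck(startTs, commitTs);  // ACK still written with the primary,
                                             // but the notification is kept for now
    commitOtherColumns(commitTs);

    // Deleted only after the transaction has completed. A crash before this
    // point leaves the notification behind, so the work is retried:
    //  - weak notification: the observer may simply run again (acceptable)
    //  - regular notification: the ACK written above is seen on retry, so the
    //    transaction is not re-executed; the retry only deletes the stale
    //    notification.
    deleteNotifications();
  }

  // No-op stubs so the sketch compiles.
  void writeLocks(long ts) {}
  void commitPrimaryAndAck(long startTs, long commitTs) {}
  void commitOtherColumns(long ts) {}
  void deleteNotifications() {}
}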

keith-turner added a commit to keith-turner/fluo that referenced this issue Apr 22, 2015
keith-turner added a commit that referenced this issue Apr 23, 2015
fixes #456 moved notification deletion to end of transaction