Counts incorrect when fluo restarted in middle of stress test #456

Closed
keith-turner opened this issue Mar 26, 2015 · 3 comments

@keith-turner (Contributor):

On a 12 node EC2 cluster I ran the stress test, loading 1 million integers via M/R and then 3 million with load transactions. While the workers were processing the 3 million, I restarted fluo twice using fluo yarn stop and fluo yarn start. After the workers had processed all notifications I compared the counts from fluo and M/R. The counts differed.

[ec2-user@leader1 fluo-stress]$ ./bin/print.sh 2>/dev/null
Total at root : 3991977
Nodes Scanned : 15259
[ec2-user@leader1 fluo-stress]$
[ec2-user@leader1 fluo-stress]$
[ec2-user@leader1 fluo-stress]$ grep UNIQ logs/unique.err
                UNIQUE=3992004

Using the technique described in a comment on astralway/stresso#7, I compared the fluo-generated tree with an M/R-generated tree. Below are some, but not all, of the problems I found. It seems locks were simply left behind.

There was a problem at level 6 with the count for 0000000021aa0000. I saw the following at levels 7 and 8.

root@instance16 fluo> grep 0000000021aa4f
07:s1o9:08:0000000021aa4f00 count:wait [] 14099658-LOCK    08:0iv6:08:0000000021aa4f4d count wait  $
07:s1o9:08:0000000021aa4f00 count:wait [] 14099658-DATA    1
08:0iv6:08:0000000021aa4f4d count:seen [] 14099658-LOCK    08:0iv6:08:0000000021aa4f4d count wait  $
08:0iv6:08:0000000021aa4f4d count:seen [] 14099658-DATA    1
08:0iv6:08:0000000021aa4f4d count:wait [] 6874041-TX_DONE
08:0iv6:08:0000000021aa4f4d count:wait [] 14101136-WRITE    14099658 TRUNCATION PRIMARY
08:0iv6:08:0000000021aa4f4d count:wait [] 6874041-WRITE    6872050 TRUNCATION PRIMARY
08:0iv6:08:0000000021aa4f4d count:wait [] 6872050-DATA    1

The count for 0000000011fe0000 at level 6 was 270 when it should have been 271. I found the following problem.

root@instance16 fluo> grep 0000000011fec3
07:0lan:08:0000000011fec300 count:wait [] 14099649-LOCK    08:0ilh:08:0000000011fec323 count wait  $
07:0lan:08:0000000011fec300 count:wait [] 14099649-DATA    1
08:0ilh:08:0000000011fec323 count:seen [] 14099649-LOCK    08:0ilh:08:0000000011fec323 count wait  $
08:0ilh:08:0000000011fec323 count:seen [] 14099649-DATA    1
08:0ilh:08:0000000011fec323 count:wait [] 7815895-TX_DONE
08:0ilh:08:0000000011fec323 count:wait [] 14101134-WRITE    14099649 TRUNCATION PRIMARY
08:0ilh:08:0000000011fec323 count:wait [] 7815895-WRITE    7813650 TRUNCATION PRIMARY
08:0ilh:08:0000000011fec323 count:wait [] 7813650-DATA    1

At level 6, 00000000231f0000 was 253 instead of 254. I found the following problems at levels 7 and 8.

root@instance16 fluo> grep -np 00000000231f32
08:have:08:00000000231f321a count:seen [] 9555339-WRITE    9554544 TRUNCATION 
08:have:08:00000000231f321a count:seen [] 9554544-DATA    1
08:have:08:00000000231f321a count:wait [] 9555339-TX_DONE    
08:have:08:00000000231f321a count:wait [] 3777912-TX_DONE    
08:have:08:00000000231f321a count:wait [] 9555339-WRITE    9554544 TRUNCATION PRIMARY
08:have:08:00000000231f321a count:wait [] 3777912-WRITE    3776950 TRUNCATION PRIMARY
08:have:08:00000000231f321a count:wait [] 3776950-DATA    1
07:5g59:08:00000000231f3200 count:seen [] 13339534-WRITE    13339363 TRUNCATION 
07:5g59:08:00000000231f3200 count:seen [] 13339363-DATA    2
07:5g59:08:00000000231f3200 count:wait [] 13339534-TX_DONE    
07:5g59:08:00000000231f3200 count:wait [] 13339534-WRITE    13339363 TRUNCATION PRIMARY
07:5g59:08:00000000231f3200 count:wait [] 14099648-LOCK    08:0ild:08:00000000231f32a7 count wait  WRITE NOT_DELETE NOT_TRIGGER l
07:5g59:08:00000000231f3200 count:wait [] 14099648-DATA    1
08:ogzn:08:00000000231f32c8 count:seen [] 9604411-WRITE    9603694 TRUNCATION 
08:ogzn:08:00000000231f32c8 count:seen [] 9603694-DATA    1
08:ogzn:08:00000000231f32c8 count:wait [] 9604411-TX_DONE    
08:ogzn:08:00000000231f32c8 count:wait [] 3581402-TX_DONE    
08:ogzn:08:00000000231f32c8 count:wait [] 9604411-WRITE    9603694 TRUNCATION PRIMARY
08:ogzn:08:00000000231f32c8 count:wait [] 3581402-WRITE    3580467 TRUNCATION PRIMARY
08:ogzn:08:00000000231f32c8 count:wait [] 3580467-DATA    1
08:0ild:08:00000000231f32a7 count:seen [] 14099648-LOCK    08:0ild:08:00000000231f32a7 count wait  WRITE NOT_DELETE TRIGGER l
08:0ild:08:00000000231f32a7 count:seen [] 14099648-DATA    1
08:0ild:08:00000000231f32a7 count:wait [] 7033621-TX_DONE    
08:0ild:08:00000000231f32a7 count:wait [] 14101124-WRITE    14099648 TRUNCATION PRIMARY
08:0ild:08:00000000231f32a7 count:wait [] 7033621-WRITE    7027699 TRUNCATION PRIMARY
08:0ild:08:00000000231f32a7 count:wait [] 7027699-DATA    1

The table had no notifications. Maybe the notifications for these were prematurely deleted?

@keith-turner (Contributor, Author):

I noticed in the three cases above that the primary column had successfully committed; the locks were left on the other columns in the transaction. For one of those cases I tried scanning the two locked columns.

fluo scan -r 08:0iv6:08:0000000021aa4f4d
08:0iv6:08:0000000021aa4f4d count seen  1
fluo scan -r 07:s1o9:08:0000000021aa4f00
07:s1o9:08:0000000021aa4f00 count wait  1
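
In Fluo, a read that encounters a lock older than the rollback timeout triggers lock recovery, which is presumably what recreated the notification seen below. For reference, a minimal sketch of an equivalent read through the client API (shown with the later Apache Fluo 1.x package names; the io.fluo packages in use at the time of this issue differ):

import java.io.File;

import org.apache.fluo.api.client.FluoClient;
import org.apache.fluo.api.client.FluoFactory;
import org.apache.fluo.api.client.Snapshot;
import org.apache.fluo.api.config.FluoConfiguration;
import org.apache.fluo.api.data.Column;

public class ReadLockedCell {
  public static void main(String[] args) {
    // Assumes a fluo.properties pointing at the stress test application.
    FluoConfiguration config = new FluoConfiguration(new File("fluo.properties"));
    try (FluoClient client = FluoFactory.newClient(config);
         Snapshot snap = client.newSnapshot()) {
      // Reading a cell whose lock is past the rollback timeout forces Fluo to
      // resolve the stuck transaction, the same as the fluo scan -r commands above.
      String seen = snap.gets("08:0iv6:08:0000000021aa4f4d", new Column("count", "seen"));
      System.out.println("count:seen = " + seen);
    }
  }
}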

Then I looked for notifications; scanning the columns had caused one to be created.

root@instance16 fluo> scan -c ntfy
07:s1o9:08:0000000021aa4f00 ntfy:count\x00wait [] 14101136-    

Eventually the count for the level 6 node corrected itself from 233 to 234. First wait was set, and then seen was updated.

[ec2-user@leader1 ~]$ fluo scan -p 06 | grep 0000000021aa0000
06:0ajr:08:0000000021aa0000 count seen  233
06:0ajr:08:0000000021aa0000 count wait  1

A bit of time passed between these two scans.

[ec2-user@leader1 ~]$ fluo scan -p 06 | grep 0000000021aa0000
06:0ajr:08:0000000021aa0000 count seen  234

Based on this I tried scanning the entire table to clean up all locks.

fluo scan > /dev/null

This caused a lot of notifications to be created. Eventually the counts came out correct!

[ec2-user@leader1 fluo-stress]$ ./bin/print.sh 2>/dev/null
Total at root : 3992004
Nodes Scanned : 15259

@keith-turner keith-turner self-assigned this Mar 27, 2015
@mikewalch mikewalch added this to the 1.0.0-beta-1 milestone Apr 15, 2015
@keith-turner (Contributor, Author):

I looked into how this currently works and found the following.

  • Regular notifications are deleted when the trigger column is committed, using the startTs. The ACK is written in the same mutation.
  • Regular notifications are written when an observed column is committed, using the commitTs.
  • Weak notifications are written after all locks are written but before the primary column is committed. This is done using the commitTs.
  • Weak notifications are deleted before any locks are written (this is really, really bad; the notification could be completely lost without any work being done).

The trigger column is always chosen as the primary column, so when the primary column is committed the notification is deleted. If the worker dies after this, there is no notification left to cause the transaction to be rerun.
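
To make the ordering concrete, here is a rough sketch of the commit sequence described above. The method names (and the exact timestamp arguments) are made-up placeholders, not Fluo's actual internals; only the order of the steps is taken from the observations above.

// Rough outline of the current commit ordering; only the ORDER of the steps
// reflects the behavior described above, the method names are hypothetical.
public class CurrentCommitOrder {

  void commit(long startTs, long commitTs) {
    // Weak notification for the trigger is deleted before any locks are
    // written -- a crash right after this loses the notification without any
    // work having been done.
    deleteWeakNotification();

    writeLocks(startTs);               // lock the primary and the other columns

    writeWeakNotifications(commitTs);  // new weak notifications, before the primary commits

    // The trigger column is always chosen as the primary, so committing the
    // primary also deletes the regular notification (at startTs) and writes
    // the ACK in the same mutation.
    commitPrimaryDeleteNotificationAndAck(startTs, commitTs);

    // A crash HERE leaves the other columns locked, with no notification left
    // to get the transaction rerun -- the state seen in the scans above.
    commitOtherColumns(commitTs);      // regular notifications for observed
                                       // columns are written here, at commitTs
  }

  // No-op stubs so the sketch compiles; the real logic lives in Fluo.
  void deleteWeakNotification() {}
  void writeLocks(long ts) {}
  void writeWeakNotifications(long ts) {}
  void commitPrimaryDeleteNotificationAndAck(long startTs, long commitTs) {}
  void commitOtherColumns(long ts) {}
}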

@keith-turner (Contributor, Author):

I am thinking that regular and weak notifications should only be deleted after the transaction is completed. For weak notifications this may result in the transaction running multiple times, but that's ok. For regular notifications, if a process dies after committing but before deleting the notification, then the ACK should prevent it from running again. In the case where there is an ACK, the notification would still need to be deleted.
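
Using the same hypothetical placeholders as the sketch above, the proposed change would move all notification deletion to the very end of the commit:

// Same hypothetical placeholders as the sketch above; the only change is that
// notification deletion moves to the end of the transaction.
public class ProposedCommitOrder {

  void commit(long startTs, long commitTs) {
    writeLocks(startTs);
    commitPrimaryAndAck(startTs, commitTs);  // ACK still written with the primary,
                                             // but the notification is kept for now
    commitOtherColumns(commitTs);

    // Deleted only after the transaction has completed. A crash before this
    // point leaves the notification behind, so the work is retried:
    //  - weak notification: the observer may simply run again (acceptable)
    //  - regular notification: the ACK written above is seen on retry, so the
    //    transaction is not re-executed; the retry only deletes the stale
    //    notification.
    deleteNotifications();
  }

  // No-op stubs so the sketch compiles.
  void writeLocks(long ts) {}
  void commitPrimaryAndAck(long startTs, long commitTs) {}
  void commitOtherColumns(long ts) {}
  void deleteNotifications() {}
}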

keith-turner added a commit to keith-turner/fluo that referenced this issue Apr 22, 2015
keith-turner added a commit that referenced this issue Apr 23, 2015
fixes #456 moved notification deletion to end of transaction