Consensus stuck in votesync upon restart #850
Comments
Thanks @aditiharini for the issue. I looked at the logs and here is a preliminary analysis:
Node 1:
Node 2:
Node 3:
So there are a couple of things that we need to fix:
|
There is another case where consensus halts even though the proposed values are consistent across the restart. Here, there seems to be an issue with lost or missing votes. I've attached the logs.
Thanks @aditiharini for the new logs!
There are two more problems (see notes for Node 3):
See more in the notes below and also the snapshot of votes and proposals at each round.
Node 1:
Proposals and votes for round 0:
Proposals and votes for round 1:
Node 2:
Proposals and votes for round 0:
Proposals and votes for round 1:
Node 3:
Proposals and votes for round 0:
Proposals and votes for round 1:
|
@cason @josef-widder @romac - this is related to our discussion about votesync and |
Hey @ancazamfir, in the description of the first problem scenario, is this a typo in the Node 3 section:
Is it from |
Sorry about that, it's from Node 1. Good catch!
Switch back to referencing malachite via github and upgrade to latest commit so we can pull in fixes for informalsystems/malachite#850
Thanks @ancazamfir! Is the scenario you described round 0? Do they halt in this round? I understand that your description already shows the wrong behaviors you highlighted, but I would like to understand the whole execution so we can be sure we are not missing other faulty behaviors.
The first scenario described (for height 3291) is for round 0. Consensus gets stuck later (in round 2). Briefly, what happens:
Due to this, Node 2 and Node 3 stay stuck in the Propose step because (with the bug) the Propose timeout has been canceled. But note that even without the bug, Node 2 and Node 3 would have timed out waiting for the proposal, prevoted nil, then precommitted nil and moved to the next round, where the cycle repeats. So to summarize:
If things are not clear, I can update the first scenario (height 3291) with more details for rounds 1 and 2.
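The two outcomes described for Node 2 and Node 3 (stuck in Propose when the timeout has been canceled, versus cycling through nil votes into the next round when it fires) can be sketched as a tiny state machine. This is a simplified illustration only, not malachite's driver: `Step`, `Event`, `RoundState`, `apply` and `run_round_without_proposal` are hypothetical names, and quorum counting and the precommit timeout are collapsed into single events.

```rust
/// Steps of one round in a simplified Tendermint-style state machine.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Step {
    Propose,
    Prevote,
    Precommit,
}

/// Hypothetical inputs; real implementations have many more.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    /// The propose timeout fired without a usable proposal.
    TimeoutPropose,
    /// A quorum of nil prevotes was observed.
    NilPrevoteQuorum,
    /// A quorum of nil precommits was observed.
    NilPrecommitQuorum,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RoundState {
    pub round: u32,
    pub step: Step,
}

/// Apply one event. Events that do not match the current step are ignored.
pub fn apply(state: RoundState, event: Event) -> RoundState {
    match (state.step, event) {
        // No proposal in time: prevote nil.
        (Step::Propose, Event::TimeoutPropose) => RoundState { step: Step::Prevote, ..state },
        // Nil prevotes reach quorum: precommit nil.
        (Step::Prevote, Event::NilPrevoteQuorum) => RoundState { step: Step::Precommit, ..state },
        // Nil precommits reach quorum: start the next round.
        (Step::Precommit, Event::NilPrecommitQuorum) => RoundState {
            round: state.round + 1,
            step: Step::Propose,
        },
        _ => state,
    }
}

/// Run one round in which no usable proposal arrives.
/// `timeout_armed == false` models the bug where the propose timeout was canceled.
pub fn run_round_without_proposal(start: RoundState, timeout_armed: bool) -> RoundState {
    let mut events = Vec::new();
    if timeout_armed {
        events.push(Event::TimeoutPropose);
    }
    events.push(Event::NilPrevoteQuorum);
    events.push(Event::NilPrecommitQuorum);
    events.into_iter().fold(start, apply)
}
```

With `timeout_armed` set to `false` the node never leaves `Step::Propose`; with `true` it only advances to the next round without deciding, which matches the observation that fixing the canceled timeout alone would not make the height progress.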
Thanks @ancazamfir, you don't need to update it, it is super clear now! Just to summarize everything:
And lastly, although Tendermint does not guarantee tolerance of any Byzantine behavior in this setup, since we have only 3 nodes, in general we should be able to tolerate equivocation. Namely, an honest node must be able to learn equivocated votes and proposals to ensure liveness. Is that what #857 is about?
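One way to picture "learning equivocated votes" is a vote log that keeps every distinct vote a validator sends in a round instead of dropping the second one. The sketch below is an assumption-laden illustration, not malachite's vote keeper: `VoteLog`, `VoteValue`, `record` and `support` are hypothetical names, validators are plain `u8` ids, and voting power is ignored (each validator counts as one).

```rust
use std::collections::{BTreeMap, BTreeSet};

/// What a vote is for: nil, or a value identified by a hypothetical numeric id.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub enum VoteValue {
    Nil,
    Value(u64),
}

/// Keeps all distinct votes per (round, validator), including conflicting ones.
#[derive(Debug, Default)]
pub struct VoteLog {
    votes: BTreeMap<(u32, u8), BTreeSet<VoteValue>>,
}

impl VoteLog {
    /// Record a vote. Returns true if this validator has now voted for more
    /// than one distinct value in the round (i.e. it equivocated).
    pub fn record(&mut self, round: u32, validator: u8, value: VoteValue) -> bool {
        let seen = self.votes.entry((round, validator)).or_default();
        seen.insert(value);
        seen.len() > 1
    }

    /// Number of distinct validators that voted for `value` in `round`.
    /// An equivocating validator counts toward every value it voted for.
    pub fn support(&self, round: u32, value: &VoteValue) -> usize {
        self.votes
            .iter()
            .filter(|(key, vals)| key.0 == round && vals.contains(value))
            .count()
    }
}
```

The design choice illustrated here is that a conflicting vote is stored and counted rather than rejected, so a node can still observe any quorum that other nodes observed.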
On restarts where the proposer has to restream a precommitted value (i.e. hits `RestreamValue`), consensus halts in votesync. The non-proposing nodes don't have the precommit votes for the restreamed value, even though the value has a POL round on it. The proposing node never gets votes from its peers and gets stuck in votesync indefinitely.
Logs attached:
node1-shard8.txt
node2-shard8.txt
node3-shard8.txt