-
@stevana Let me break down an answer into small points.

### 1. The workload

Ideally, we would use the Haskell node as a baseline for an answer here. "At least as good" is what comes to mind, though I am not sure it has been properly formulated for the Haskell node either. So that is perhaps something worth spending some time on, possibly using the simulation from the DeltaQ framework that was developed as part of the Haskell development?

One particular constraint that keeps recurring is that when a new block is produced, it must be able to propagate to 95% of the network in less than 5s. I recall that the Haskell team even uses a stricter bound (less than 2s, I reckon, but don't quote me on that).

Another way to look at it is that we want to be able to adopt a block before the next block comes in. While the Praos distribution puts the average block production time around 20s, in practice the only strong guarantee we have is that there cannot be two blocks less than 1s from one another (discarding slot-battle-induced forks here, where we could potentially receive two blocks within a single second, but from different peers/chains). So, in the worst-case scenario, we must be able to adopt blocks at a rate of 1 block per second.

Now, not all blocks are born equal, but blocks are bounded on all dimensions. More specifically, a block is sized along three axes:

- its raw size, in bytes;
- its execution steps (abstract CPU units consumed by scripts);
- its execution memory (abstract memory units consumed by scripts).

Each dimension comes with a maximum value, defined by (updatable) protocol parameters. Since we have to optimize for the worst-case scenario (to ensure we can survive adversarial behaviours), the workload has to be considered given those bounds. We have to be careful here, though, because execution steps and memory are abstract units, measured and costed according to a model; they do not necessarily reflect the actual performance cost on the system. More so, they are modeled after the Haskell node, which has a vastly different runtime execution model than Rust.

One practical way to approach this workload question could be to sample the mainnet chain and run a simulation where blocks from mainnet are applied every second. This would not necessarily reflect a worst-case situation, but would at least model a first "realistic" workload. From there, we can search for scenarios that maximise, if not all, then at least one of the block's dimensions.
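Purely as an illustration of such a worst-case driver, here is a sketch in Rust. The struct and field names are invented for the example, the magnitudes are mainnet-like but not authoritative, and the validation step is a stub:

```rust
use std::time::{Duration, Instant};

/// The three axes a block is bounded on. Values are illustrative,
/// mainnet-like magnitudes, not authoritative protocol parameters.
struct BlockBounds {
    max_body_size: usize, // raw size, in bytes
    max_ex_steps: u64,    // abstract execution steps (script CPU budget)
    max_ex_mem: u64,      // abstract execution memory units
}

/// Stand-in for block validation/adoption: a real harness would feed a
/// sampled (or maxed-out) mainnet block into the node here.
fn adopt_block(_bounds: &BlockBounds) {
    // ... ledger validation work ...
}

fn main() {
    let worst_case = BlockBounds {
        max_body_size: 90_112,        // ~88 KiB
        max_ex_steps: 20_000_000_000, // per-block script CPU budget
        max_ex_mem: 62_000_000,       // per-block script memory budget
    };
    let budget = Duration::from_secs(1); // worst case: 1 block per second

    for slot in 0..10 {
        let start = Instant::now();
        adopt_block(&worst_case);
        assert!(
            start.elapsed() < budget,
            "slot {slot}: block adoption exceeded the 1s budget"
        );
    }
}
```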
### 2. Liveness & safety properties

#### Consensus

I can think of a few safety properties in the system. The one you stated already is, I believe, one of them:

> All honest nodes in the network ultimately converge to the same chain in no more than k blocks given there's less than 50% adversarial stake.
The reason I would argue it is a safety property is that any system that would break it at any point in time would be considered invalid. In fact, the only "liveness" property I can truly think of in terms of the protocol is that a valid transaction that is continuously re-submitted to at least one honest node with non-zero stake will eventually land in a block (unless it becomes invalid due to an external factor, e.g. a double-spend).

#### Ledger

Since the simulation is focusing on the whole system, I think it also makes sense to look at properties of the ledger. There is, in fact, one particularly interesting one: the sum of all money pots (UTxO balances, reward account balances, deposits, reserves, treasury and fees) always, at any point in time, equals the max supply (45B Ada on Mainnet). Said differently: the ledger isn't losing (or creating!) money; everything is always accounted for (a minimal sketch of this invariant follows at the end of this comment).

Interestingly, there are liveness properties that the ledger does not have, such as the guarantee that it is always possible to spend Ada. It is very much possible to end up in a situation where every single Ada is either locked in smart contracts or stuck in the treasury. I don't think there's anything we can do about it, though.
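To make the "preservation of value" property concrete, here is a minimal sketch; the pot names are hypothetical, not Amaru's actual types:

```rust
/// Illustrative sketch of the "preservation of value" invariant; the
/// pot names are assumptions for this example, not Amaru's actual types.
const MAX_SUPPLY_LOVELACE: u64 = 45_000_000_000_000_000; // 45B Ada

struct Pots {
    utxo: u64,
    rewards: u64,
    deposits: u64,
    reserves: u64,
    treasury: u64,
    fees: u64,
}

impl Pots {
    /// Safety property: at any point in time, the pots sum to max supply.
    fn preserves_value(&self) -> bool {
        self.utxo + self.rewards + self.deposits + self.reserves + self.treasury + self.fees
            == MAX_SUPPLY_LOVELACE
    }
}
```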
-
I would frame the problem in terms of observable behaviour. In classical DB systems, whether key-value, relational, or whatever, the observable behaviour of the system is expressed in terms of some operations and queries one can enact on the system, e.g. write some key/value, read the value at some key. What we care about from a distributed-systems perspective is then how distribution, concurrency, failures, etc. impact (or not) this observable behaviour. How the system implements distribution, consensus, and replication is basically irrelevant, but it is typically carried out through a Replicated State Machine based on some consensus algorithm.

In Cardano, and even more so in Amaru at this stage, the main observable behaviour of the system is the command log that each node constructs out of all the logs it observes from other nodes, which is precisely the role of the Nakamoto consensus algorithm we are implementing. So in my opinion it makes sense to define workloads in terms of this observable behaviour, given that our main purpose is to build a node that correctly implements the Ouroboros Praos algorithm.
For those scenarios, we largely don't care about the content of the blocks and the details of the ledger behaviour, beyond its ability to provide a stake distribution. There are interesting things we might want to test that somehow depend on the ledger, like what happens if a node receives a block that's not valid, but this is easy to make happen. In a way, we could perfectly imagine running those tests with a "mock ledger", which is worth considering as it could make it easier to simulate things like lengthy block validations or issues with the stake distribution. The workloads that @KtorZ describes are interesting to assess the performance of the system, but are not that relevant to the correctness of the consensus per se. But of course, this is something we'll need to do.
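As a sketch of an oracle over this observable behaviour (types are hypothetical): each node's "command log" is the chain of block hashes it has adopted, and we check that all honest nodes agree once the last k, possibly still contested, blocks are set aside:

```rust
/// Hypothetical oracle: a node's observable "command log" is modelled
/// here as the chain of block hashes it has adopted.
type BlockHash = u64;

/// True when `a` is a prefix of `b`.
fn is_prefix(a: &[BlockHash], b: &[BlockHash]) -> bool {
    a.len() <= b.len() && a == &b[..a.len()]
}

/// Check that all honest nodes agree on their chains once the last `k`
/// (possibly still contested) blocks are discarded.
fn converged(chains: &[Vec<BlockHash>], k: usize) -> bool {
    let truncated: Vec<&[BlockHash]> = chains
        .iter()
        .map(|c| &c[..c.len().saturating_sub(k)])
        .collect();
    truncated
        .iter()
        .all(|&a| truncated.iter().all(|&b| is_prefix(a, b) || is_prefix(b, a)))
}
```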
-
@abailly Yes, I agree with framing the problem in terms of observable behaviour. In particular, if we consider the following picture:

*(diagram: a "Client" exchanging requests and responses with a "System" box)*
Observable behaviour should be from the point of view of the client. The first question to ask is: who's the "Client"? What's inside the "System" box, and what are the requests and responses (where are these API calls documented)? How do you express the properties you described in terms of a concurrent trace of request/response pairs? For example, if the system is a linearisable distributed key-value store, then the client is the user of the store, and the requests are "write key value" and "read key". The property is that we can take the concurrent trace and find a sequential interleaving (i.e. an execution that could be performed by one CPU) that respects a sequential model, where the sequential model is defined by the following state machine:
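A minimal sketch of that sequential model, assuming string keys and values for illustration:

```rust
use std::collections::HashMap;

/// Requests and responses observable at the client boundary.
enum Request {
    Write { key: String, value: String },
    Read { key: String },
}

#[derive(PartialEq, Debug)]
enum Response {
    Ack,
    Value(Option<String>),
}

/// The sequential model: what a single-CPU execution would answer.
/// Linearisability then asks whether the concurrent trace admits an
/// interleaving whose responses all match this model.
fn step(state: &mut HashMap<String, String>, request: &Request) -> Response {
    match request {
        Request::Write { key, value } => {
            state.insert(key.clone(), value.clone());
            Response::Ack
        }
        Request::Read { key } => Response::Value(state.get(key).cloned()),
    }
}
```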
@KtorZ Yes, I think we should focus on whole-system properties, because observable behaviour from the client's point of view typically does care about whole-system behaviour. This "[..] the ledger isn't losing (or creating!) money; everything is always accounted for." sounds like a good safety property to me (as a client/user of the system, this is something I'd like to be reassured of). It would be good if we can make this more precise over time (it doesn't have to be now), where by precise I mean something at least as detailed as my key-value store example above.

Regarding "All honest nodes in the network ultimately converge to the same chain in no more than k blocks given there's less than 50% adversarial stake." (or variants of it): it's still not clear to me if this is a safety or a liveness property, and you two don't seem to agree either. It would be good to come to a consensus on this ;-). But let me throw in a question, pardon my ignorance: why should a client/user care about this property? If the property didn't hold, what would the consequences be? Would it be possible to break the "ledger isn't losing or creating money" property?

Both of you mention things like "sample the mainnet chain" or "simulate block forgers". Sure, we can fake things until they are actually implemented, and I think those can be valuable tests to have, but I think we should keep our sights on observable behaviour from a client's point of view (rather than from the point of view of a developer of a subsystem), because ultimately it's a client/user of the system that needs to be convinced of the system's usefulness (and whole-system tests from the client's point of view should cover all the properties that a developer of a subsystem cares about).
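One standard lens that may help settle the safety-vs-liveness question is the classical Alpern-Schneider characterisation: a safety property is one whose violations are witnessed by a finite prefix of the trace, while a liveness property is one that no finite prefix can ever refute. For a property $P$ over infinite traces $\sigma$ with finite prefixes $\tau$:

$$
\begin{aligned}
P \text{ is safety} &\iff \forall \sigma \notin P,\ \exists \text{ finite } \tau \preceq \sigma,\ \forall \sigma' \succeq \tau,\ \sigma' \notin P\\
P \text{ is liveness} &\iff \forall \text{ finite } \tau,\ \exists \sigma \succeq \tau,\ \sigma \in P
\end{aligned}
$$

Under this reading, the answer seems to depend on whether the k-block bound is part of the property: with the bound, two honest nodes still disagreeing k blocks later is a finite counterexample (safety flavour); without it, "eventually converge" can never be refuted by a finite prefix (liveness flavour).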
-
Work on the amaru simulator has slowed down to a crawl on my side because of other duties, but here is a proposal for a format we could use as a starting point to test Amaru's consensus using a predefined block tree. This is a JSON format describing the block tree to be served.
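Purely as a hypothetical illustration (all field names below are invented for the example, not the actual proposal), such a file could look like:

```json
{
  "stakeDistribution": { "pool-a": 0.6, "pool-b": 0.4 },
  "blockTree": {
    "slot": 0,
    "hash": "genesis",
    "children": [
      {
        "slot": 12,
        "hash": "h1",
        "issuer": "pool-a",
        "children": [
          { "slot": 31, "hash": "h2a", "issuer": "pool-b", "children": [] },
          { "slot": 31, "hash": "h2b", "issuer": "pool-a", "children": [] }
        ]
      }
    ]
  }
}
```

The two children at slot 31 illustrate a fork no longer than 1 block, matching the generated chain described next.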
I think the tester/simulator should be able to read such a file and then serve the headers "following" the block tree, simulating how different upstream peers can serve different headers, possibly introducing delays and, hopefully, forks. The generated chain is very small and does not have forks longer than 1 block, but I think this is a good starting point. My plan is to reuse what the Consensus folks have done for Genesis, which is much more involved and thorough, but of course requires a bit more work to be used in the Amaru context.
-
For more context see: https://github.com/pragma-org/amaru/wiki/log#2025-02-20