Design document for the Zippy testing framework #11025
base: main
Conversation
Just gave this a quick skim, and it all sounds reasonable... but also it sounds a lot like Jepsen?
The implementation details here sound great to me, e.g., re-using mzcompose for this, outputting reproducible test sequences that can be plumbed back into mzcompose scripts, etc. I don't have nearly the depth of expertise on the philosophical side of things, so I'm gonna let you and @aphyr chat about that.
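For illustration only (the step names and format below are invented, not the actual Zippy or mzcompose output), such a reproducible test sequence written out by a failing run might look something like this:

```python
# Hypothetical output of a failed run: the random seed plus the exact step
# sequence, so a developer can replay it without re-running the generator.
failing_run = {
    "seed": 12345,
    "steps": [
        ("create-topic", {"name": "topic-0"}),
        ("ingest", {"topic": "topic-0", "rows": 1000}),
        ("create-materialized-view", {"name": "view-0", "source": "topic-0"}),
        ("restart-materialized", {}),
        ("validate", {"view": "view-0", "expected_rows": 1000}),
    ],
}
```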
Some clarifications in response to @aphyr's comments in chat. His comments start here https://materializeinc.slack.com/archives/C035TU9QF5W/p1646684216688189 but I am intentionally not pasting them in this public PR in case they contain proprietary information.

Reproducibility

The framework will be using Mz and the external containerized services as they are, with all the non-determinism this entails. What I am aiming for here is not byte-by-byte, instruction-by-instruction reproducibility, like Antithesis, but more of a "human-scale" reproducibility, which is somewhat akin to general developer-friendliness. What I am after is not increasing the chance that the bug will be reproducible on every run to 100%, but increasing the chance that developers will feel compelled to look at the bug, work on it, and fix it. Something along the lines of:
When the FoundationDB presentation came out a long time ago, it appeared that the issues it would find were so convoluted that I did not see which developer would be willing to debug them (pre-processed C++, mocking of I/O, etc., etc.). I assumed that a classical debugger would be useless, given the way their testing framework takes over the code.

Workload Realism

Trying for realistic, or, at the very least, understandable workloads is also key to developer interest and bug-fix productivity. Test cases where a lot of invalid/unrealistic things happen are great for finding edge cases, but then it becomes an uphill battle to convince people to fix those bugs. If the workload can be described in a single sentence and sounds unobjectionably possible to happen at a customer, the bug reports meet with much less resistance. By aiming for a degree of realism you also protect yourself against the situation where tables / Kafka topics / materialized views are dropped and recreated so frequently that they never have the chance to accumulate enough state to make them fall over. So, while abandoning some of the stressfulness of the workload, this framework will bring in:
Sequential execution

This is a bit of an artifact of our current way of describing tests in Python, which are essentially a series of steps that one executes against Docker containers sequentially. This is indeed a major limitation that I would like to lift at some point, once I have milked the sequential execution to the max.
No trade secrets in that discussion! Feel free to talk/repost whatever you like. :-)

Reproducibility / Sequential execution

On the questions of determinism & concurrency--this is not to argue one way or the other, but I'd like to echo some of what you've said, offer a little bit of my experience from testing, and touch on some things that might help guide your choices. I bet you've already thought about a lot of this in depth, but perhaps there's one or two things here that might be novel. :-)

First, you've mentioned that the current approach is to execute steps against Docker containers sequentially. Sequential tests are nice--they're easy to read & understand, they simplify what I'll call a "generator"--the thing that constructs the operations to perform in the test--and they also simplify the "checker"--the thing that validates that your history of completed operations is correct. They also have a nice (or frustrating, depending on your perspective) side effect, which is that they tend to make concurrent systems (that would normally be highly nondeterministic) much more deterministic, by virtue of getting rid of most or all of the concurrency.

My suspicion (and please correct me if I'm wrong) is that the thing you want to test--Materialize, and possibly the composition of Materialize with other systems, like Kafka--is highly concurrent. Each node likely has multiple threads, there are probably multiple nodes, and you've got multiple systems. You may also have internal work queues, periodic tasks, and so on, both inside Materialize and in, say, a Kafka poller. But so long as you perform only one operation at a time against this concurrent system, much of that concurrency ought to disappear, right?

To some extent, yes. But there are some ways in which even a sequential test winds up exhibiting nondeterminism thanks to internal concurrency. For instance, take the following history of operations, which are executed in strictly sequential order: one waits for each operation to complete before beginning the next.
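As a purely hypothetical stand-in (field names, keys, and values invented for illustration), such a history might look like:

```python
# Hypothetical strictly sequential history: each operation completes before
# the next one begins. Keys, values, and structure are illustrative only.
history = [
    {"process": 0, "f": "write", "key": "x", "value": 1, "status": "ok"},
    {"process": 0, "f": "read",  "key": "x", "value": "?", "status": "ok"},  # what should this read see?
]
```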
Should the read at the end of that history observe the write? Because the write may still be percolating through queues and asynchronous tasks inside the system when the read arrives, the answer can vary from run to run, even though the client issued the operations strictly one at a time.

There are a couple of things you can do to mitigate this nondeterminism. One of them is to slow down the request rate so that you're very confident each operation has fully "percolated" through these asynchronous, concurrent processes. This makes testing slower, and may mask some classes of bugs which depend on high throughput. Another approach is to have some kind of causality token, so that a later operation can ask to observe a state at least as recent as some earlier operation, and block until the system has caught up. This problem of getting stuck is much harder when you start introducing faults, because faults may render the system unable to process certain operations until the fault is resolved.

This also points to another challenge: in distributed systems, even workloads executed by a single thread become logically concurrent as soon as any request returns an indeterminate result. For example, imagine this history:
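Again purely as a hypothetical illustration (not the original example), such a history might be:

```python
# Hypothetical single-threaded history in which one request returns an
# indeterminate result (keys and values invented for illustration):
history = [
    {"process": 0, "f": "write", "key": "x", "value": 1, "status": "ok"},
    {"process": 0, "f": "write", "key": "x", "value": 2, "status": "info"},  # timed out: outcome unknown
    {"process": 0, "f": "read",  "key": "x", "value": 1, "status": "ok"},
]
```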
What happens if the write of 2 never really failed--the client saw a timeout, but the operation is still quietly working its way through the system?
The fundamental problem here is that once an operation crashes, we must assume it is effectively concurrent with every single later operation in the history. It could be in flight in the network, or replicated to some nodes but not others, or sitting in an in-memory queue somewhere, just waiting to take effect five minutes from now. This tells us that once we allow operations to fail, there is no such thing as a truly sequential distributed systems test! In my experience with Jepsen, this kind of indeterminate concurrency is the rule, rather than the exception.

This has significant implications for test design. First, even if the generator doesn't produce concurrent operations, the way you record the outcomes of those operations needs to have an explicit way to record both indeterminate failures and concurrency. Second, your checker needs to take both of these into account. It has to understand that if the write of some value failed indeterminately, a later read might still legally observe that value.

So... what I'd like to encourage you to do here is to plan for concurrency & nondeterminism from the start, rather than trying to retrofit it into the test design at a later time. It's OK to leave the generator sequential, but the thing that records the history of operations, and the thing that checks that history to see if it's valid, need to be fundamentally concurrency & failure-aware. It's definitely more up-front work, but I think it'll save you headaches down the line. :-)

More on reproducibility

It sounds like you tried out Antithesis (which I didn't know existed until yesterday) and found it difficult for a few reasons. I might have this wrong, but let me try and repeat what I think you saw:
I think these are all valid concerns, and I think your intuition is right that a "somewhat, but not entirely" reproducible test suite might be a good happy medium. While knowing absolutely nothing about Antithesis itself: a purely opaque-box, end-to-end test like Jepsen (and perhaps mzcompose/Zippy?) avoids issues 1, 2, and 3 by running Real Binaries on Real Computers with Real Networks. That's really nice because when you find a bug, it's something that a real user could see in production. It also means that stacktraces and network traffic look exactly how you'd expect--there's no source code rewriting, injection of probes, etc.

On the other hand, there are a couple of big drawbacks with this approach. One of them is that it usually involves different processes running on 5-10 nodes. Bugs often manifest across multiple nodes, rather than on one--so where does one attach a debugger to see the whole bug? Even if a bug does occur on a single node, it's not clear which one you should attach to. Moreover, if you pause execution via e.g. a breakpoint, it tends to immediately throw the system into a new regime--the other nodes will time out requests to the one that paused, and that can mean the bug disappears when you try to look at it.

If a debugger workflow is critical, one thing that you might want to consider--perhaps not using Antithesis, but just with some judicious choices about your regular code design--is being able to run an entire cluster in a single binary. You can still use wall-clock time, regular network calls, the regular thread scheduler, etc., but the network calls will all go over loopback and back into the same process. That gives you the ability to attach a debugger to the whole cluster, with essentially minimal code changes. This will likely still have concurrency issues--if you throw up a breakpoint in one node, that doesn't necessarily pause the others. If you've got a handy-dandy concurrency-aware debugger that's great at pausing all threads and handing off data between them in the right way, that's fantastic. Even if you don't, you can work on those issues piecemeal, and maybe get to a point where it's still meaningful to attach a debugger.

You may already have this, but if you don't--the first place to interpose might be the clock! I'd focus on current-time, timeouts, and scheduled tasks. What you want is a clock shim that in normal operation calls the real system clock, but which the test harness can replace with a clock it controls. A (rigorously used) clock shim also helps you work around a problem with testing in containers--you can't test clock skew!

The next place I'd interpose would be the network. One of the big problems with Jepsen (and that you might face injecting faults on top of Docker containers) is that its instruments for interrupting network traffic are "blunt"--it can cut off all packets using iptables, but it's much harder at that level to delay, reorder, or drop individual application-level messages; a network shim inside the process gives you that finer control.

Two other things you can do with a network shim! One, it gets you closer to determinism--you can pick message delivery based on the same random seed you're using for the test as a whole. Second, it lets you get application-level traces of network messages, which are incredibly useful debugging tools. For instance, that's the kind of trace Maelstrom's in-memory network records, showing clients and nodes exchanging messages. I've built several systems using these auto-generated Lamport diagrams as a debugging aid, and it's just... oh it's SO useful to be able to see the flow of messages that led to some weird outcome, instead of trying to piece it together from a dozen log files.
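A minimal sketch of what such a network shim could look like--assuming invented names and a simple in-process Python harness, not any existing Materialize or Maelstrom API:

```python
import random

class RecordingNet:
    """Hypothetical in-process network shim: messages pass through here
    instead of real sockets, so the test can both choose delivery order from
    a fixed seed and keep an application-level trace for debugging."""

    def __init__(self, seed: int):
        self.rng = random.Random(seed)  # same seed as the rest of the test
        self.in_flight = []             # (src, dst, payload) not yet delivered
        self.trace = []                 # every send/deliver event, in order

    def send(self, src: str, dst: str, payload: dict) -> None:
        self.trace.append(("send", src, dst, payload))
        self.in_flight.append((src, dst, payload))

    def deliver_one(self):
        """Deliver one pseudo-randomly chosen in-flight message, if any."""
        if not self.in_flight:
            return None
        i = self.rng.randrange(len(self.in_flight))
        src, dst, payload = self.in_flight.pop(i)
        self.trace.append(("deliver", src, dst, payload))
        return dst, payload
```

Because delivery order comes from the test's own seed, replaying the same seed replays the same message interleaving, and the recorded trace doubles as the raw material for a Lamport-style diagram.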
Not that you have to do all this from the start, but if you're laying groundwork for a new codebase, this might be something to have in mind as a potential goal. :-)

The last thing I think I'd tackle would be scheduler interposition--trying to get the thread scheduler itself to be deterministic and driven by the test. I know of systems that do this sort of thing--for instance, the Pulse scheduler for Erlang, which helps you QuickCheck concurrent code. I don't have any experience with this personally though, and I don't know what that would look like for something like Rust.

Workload Realism

This might be a different story at Materialize, and you know your engineering culture best. That said, I don't think you have to give up hope here! I've reported bugs in roughly 30 DBs over the last 9 years, and I've found that dev teams are generally very willing to fix issues that arise from "unrealistic" workloads. And in particular, those engineers often prefer an unrealistic workload that reproduces a bug in, say, 30 seconds, to a realistic workload that takes 10 hours.

This doesn't have to be an either-or situation--it's perfectly OK to write realistic workloads too, and also to mix-and-match within a single test. One thing I do in Jepsen is create new keys/tables/topics aggressively (e.g. every few seconds) for a handful of keys, but to let other keys accumulate writes for the entire history of the test (a rough sketch appears below). That way you get to explore more codepaths and parts of the state space--some bugs manifest only on first/last writes to some key, whereas others require sustained writes over a certain volume of data or span of time.

OK, I think I've prattled on far too much! Is any of this novel? Helpful? If you're curious, happy to dive into any of this stuff in depth.
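As a rough, hypothetical sketch of that mix-and-match key strategy (the step names and probabilities below are invented, not taken from Jepsen or from this design):

```python
import random

def next_step(rng: random.Random, long_lived, churned):
    """Hypothetical generator step mixing long-lived objects, which accumulate
    state over the whole test, with a small pool of aggressively churned ones."""
    roll = rng.random()
    if roll < 0.05:
        # Occasionally drop & recreate a short-lived topic, exercising
        # create/drop codepaths and first/last-write edge cases.
        return ("recreate-topic", rng.choice(churned))
    if roll < 0.25:
        return ("insert", rng.choice(churned), rng.randint(0, 1_000_000))
    # Most writes go to long-lived topics so sustained state can build up.
    return ("insert", rng.choice(long_lived), rng.randint(0, 1_000_000))

# Example: next_step(random.Random(42), ["topic-long-0"], ["topic-churn-0"])
```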
Just to clarify a potential source of confusion:
@aphyr This has been very useful, thanks! I will make sure to follow all the pointers. My entire plane of being is way more pedestrian than the stuff you describe, so the way I see this is that the heavy hitting regarding network, clock, and scheduler manipulation is left to Jepsen and Antithesis, whereas Zippy would be:
That sounds reasonable! I should mention that Jepsen has no way to do scheduler faults, so that's probably something best left to Antithesis. I like your idea of being able to take these Zippy workloads and run them under Antithesis to get more sophisticated faults too--that way you can write tests once and get some of the advantages of each approach. :-)
This is a great overview of testing -- thanks for taking the time to write it all up, Kyle. My two biggest takeaways are:
Oh my gosh yes, YES, once you have this it is so hard to go back to Wireshark!
This document describes the overall design of a new framework for testing Platform and all parts connected to it.
Rendered view: https://github.com/philip-stoev/materialize/blob/0119667cea45c9b6551e9c920e75982d21e23010/doc/developer/design/2022-02-28_zippy_testing_framework.md