Design document for the Zippy testing framework #11025

Open
wants to merge 1 commit into main

Conversation

@philip-stoev commented Mar 4, 2022

This document describes the overall design of a new framework for testing Platform and all parts connected to it.

Rendered view: https://github.com/philip-stoev/materialize/blob/0119667cea45c9b6551e9c920e75982d21e23010/doc/developer/design/2022-02-28_zippy_testing_framework.md

@philip-stoev requested review from benesch and uce on March 4, 2022
@benesch commented Mar 7, 2022

Just gave this a quick skim, and it all sounds reasonable... but also it sounds a lot like Jepsen?

@benesch left a comment

The implementation details here sound great to me, e.g., re-using mzcompose for this, outputting reproducible test sequences that can be plumbed back into mzcompose scripts, etc. I don't have nearly the depth of expertise on the philosophical side of things, so gonna let you and @aphyr chat about that.

@philip-stoev (author) commented

Some clarifications in response to @aphyr's comments in chat. His comments start here: https://materializeinc.slack.com/archives/C035TU9QF5W/p1646684216688189 but I am intentionally not pasting them into this public PR in case they contain proprietary information.

Reproducibility

The framework will use Mz and the external containerized services as they are, with all the non-determinism this entails. What I am aiming for here is not byte-by-byte, instruction-by-instruction reproducibility, like Antithesis, but more of a "human-scale" reproducibility, which is somewhat akin to general developer-friendliness. What I am after is not increasing the chance that the bug will be reproducible on every run to 100%, but increasing the chance that developers will feel compelled to look at the bug, work on it and fix it. Something along the lines of:

  • developers should be able to run the test case on their laptop
  • it should run using tools they are familiar with, e.g. mzcompose, etc.
  • it should be possible to use the test case as a regression test with minimum extra developer work

When the FoundationDB presentation came out a long time ago, it appeared that the issues it would find were so convoluted that I could not see which developer would be willing to debug them (pre-processed C++, mocking of I/O, etc.). I assumed that a classical debugger would be useless given the way their testing framework takes over the code.

Workload Realism

Trying for realistic, or at the very least understandable, workloads is also key to developer interest and bug-fix productivity. Test cases where a lot of invalid/unrealistic things happen are great for finding edge cases, but then it becomes an uphill battle to convince people to fix those bugs. If the workload can be described in a single sentence and sounds unobjectionably possible to happen at a customer, the bug reports meet much less resistance.

By aiming for a degree of realism you also protect yourself against the situation where tables / Kafka topics / materialized views are dropped and recreated so frequently that they never have the chance to accumulate enough state to cause them to fall over. So, while abandoning some of the stressfulness of the workload, this framework will bring in:

  • the ability for a very long test run to be considered a valid "longevity test"
  • the ability for a successful run to be representative of the stability of the product when used "in the wild", so that it can be used as a qualification test for releases and such.

Sequential execution

This is a bit of an artifact of our current way of describing tests in Python, which are essentially a series of steps that one executes against Docker containers sequentially. This is indeed a major limitation that I would like to lift at some point, once I have milked sequential execution to the max.

@aphyr commented Mar 8, 2022

No trade secrets in that discussion! Feel free to talk/repost whatever you like. :-)

Reproducibility / Sequential execution

On the questions of determinism & concurrency--this is not to argue one way or the other, but I'd like to echo some of what you've said, offer a little bit of my experience from testing, and touch on some things that might help guide your choices. I bet you've already thought about a lot of this in depth, but perhaps there's one or two things here that might be novel. :-)

First, you've mentioned that the current approach is to execute steps against Docker containers sequentially. Sequential tests are nice--they're easy to read & understand, they simplify what I'll call a "generator"--the thing that constructs the operations to perform in the test, and they also simplify the "checker"--the thing that validates that your history of completed operations is correct. They also have a nice (or frustrating, depending on your perspective) side effect, which is that they tend to make concurrent systems (that would normally be highly nondeterministic) much more deterministic, by virtue of getting rid of most or all of the concurrency.

My suspicion (and please correct me if I'm wrong) is that the thing you want to test--Materialize, and possibly the composition of Materialize with other systems, like Kafka--is highly concurrent. Each node likely has multiple threads, there are probably multiple nodes, and you've got multiple systems. You may also have internal work queues, periodic tasks, and so on, both inside Materialize and in, say, a Kafka poller.

But so long as you perform only one operation at a time against this concurrent system, much of that concurrency ought to disappear, right? To some extent, yes. But there are some ways in which even a sequential test winds up exhibiting nondeterminism thanks to internal concurrency. For instance, take the following history of operations which are executed in strictly sequential order: one waits for each operation to complete before beginning the next.

  1. Send message x to Kafka
  2. Query Materialize to see if x is present

Should x be present? That depends on whether Kafka was ready to hand x out to Materialize, whether Materialize had polled Kafka in between that time and the query, and whether Materialize's internal processes--differential dataflow, various caches, etc.--had processed x, and perhaps also whether that processing had propagated to the node being queried.

There are a couple of things you can do to mitigate this nondeterminism. One of them is to slow down the request rate so that you're very confident each operation has fully "percolated" through these asynchronous, concurrent processes. This makes testing slower, and may mask some classes of bugs which depend on high throughput. Another approach is to have some kind of causality token, so that 2 can be phrased as "repeatedly query until x appears". This restores some degree of determinism, but makes it easy for tests to get "stuck" indefinitely.
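To make the retry idea concrete, here is a minimal, hypothetical Python sketch; `query_materialize` is a stand-in for however the test framework actually runs a SELECT against Materialize, and the table name is invented for the example:

```python
# Minimal sketch of the "repeatedly query until x appears" pattern.
# `query_materialize` is a placeholder callable, not a real Zippy/mzcompose API.
import time

def wait_until_visible(query_materialize, expected, timeout_secs=60, poll_interval=0.5):
    """Poll until `expected` shows up in the query result, or give up.

    The timeout keeps a fault-induced stall from hanging the whole test,
    at the cost of reintroducing some nondeterminism when it fires.
    """
    deadline = time.monotonic() + timeout_secs
    while time.monotonic() < deadline:
        rows = query_materialize("SELECT key FROM ingested_topic")
        if expected in {row[0] for row in rows}:
            return True
        time.sleep(poll_interval)
    return False
```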

This problem of getting stuck is much harder when you start introducing faults, because faults may render the system unable to process certain operations until the fault is resolved. This also points to another challenge: in distributed systems, even workloads executed by a single thread become logically concurrent as soon as any request returns an indeterminate result. For example, imagine this history:

  1. Send x to Kafka
  2. Send y to Kafka
  3. Query Materialize to verify that it saw x, then y, in order.

What happens if the write of x to Kafka throws a timeout, perhaps because of a GC pause or IO stall in Kafka, or because we injected a network or process fault? We can't assume x was written, and we also can't assume x was not written. What do we do? We could bail on the test entirely--but now our test has exploded for frustrating, likely nondeterministic reasons, and hasn't told us anything useful about Materialize. Or we could move on to write y--but if we do this, we run several risks. Legal outcomes in Kafka at this point could be:

  1. [x, y]
  2. [y]
  3. [y, x]

The fundamental problem here is that once an operation crashes, we must assume it is effectively concurrent with every single later operation in the history. It could be in flight in the network, or replicated to some nodes but not others, or sitting in an in-memory queue somewhere, just waiting to take effect five minutes from now. This tells us that once we allow operations to fail, there is no such thing as a truly sequential distributed systems test! In my experience with Jepsen, this kind of indeterminate concurrency is the rule, rather than the exception.

This has significant implications for test design. First, even if the generator doesn't produce concurrent operations, the way you record the outcomes of those operations needs to have an explicit way to record both indeterminate failures and concurrency. Second, your checker needs to take both of these into account: it has to understand that if the write of x is an indeterminate failure, then legal outcomes could be either [x, y] or [y], and it has to understand that concurrency structure, which allows us to see [y, x].
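As a rough illustration of what "failure-aware" means for the checker, here is a hypothetical Python sketch. The history record format is invented for the example, and the model is deliberately simplified:

```python
# Hypothetical sketch of a failure-aware ordering check. Each write in
# `history` is recorded as {"value": ..., "status": "ok" | "indeterminate"};
# `observed` is the ordered list of values the final query returned.
def check_order(history, observed):
    by_value = {w["value"]: w for w in history}
    # Nothing may appear that was never written.
    if any(v not in by_value for v in observed):
        return False
    # Every acknowledged ("ok") write must be present.
    if any(w["status"] == "ok" and w["value"] not in observed for w in history):
        return False
    # Acknowledged writes must keep their issue order relative to each other;
    # indeterminate writes may appear anywhere, or not at all. (A full checker
    # would also bound where an indeterminate write may float to, since it is
    # only concurrent with operations issued after it.)
    ok_issue_order = [w["value"] for w in history if w["status"] == "ok"]
    ok_observed = [v for v in observed if by_value[v]["status"] == "ok"]
    return ok_observed == ok_issue_order

# With history = [x (indeterminate), y (ok)], this accepts [x, y], [y],
# and [y, x], but rejects e.g. [] (the acknowledged y is missing).
```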

So... what I'd like to encourage you to do here is to plan for concurrency & nondeterminism from the start, rather than trying to retrofit it into the test design at a later time. It's OK to leave the generator sequential, but the thing that records the history of operations, and the thing that checks that history to see if it's valid, need to be fundamentally concurrency- & failure-aware. It's definitely more up-front work, but I think it'll save you headaches down the line. :-)

More on reproducibility

It sounds like you tried out Antithesis (which I didn't know existed until yesterday) and found it difficult for a few reasons. I might have this wrong, but let me try and repeat what I think you saw:

  1. It did some kind of source code transformation (?) which made it harder to debug
  2. The mock IO also complicated debugging
  3. It generated issues which seemed unlikely to occur in real life--perhaps with pathological thread/network schedules?
  4. It required the use of tools that Materialize developers weren't familiar with
  5. It made classical debuggers useless by taking over the structure of the code somehow

I think these are all valid concerns, and I think your intuition is right that a "somewhat, but not entirely" reproducible test suite might be a good happy medium. While knowing absolutely nothing about mzcompose, Rust, or how Materialize is actually built, I've got a few very loose ideas that might help in figuring out... exactly where you want to draw that line. Somewhat reiterating points from our chat discussion...

A purely opaque-box, end-to-end test like Jepsen (and perhaps mzcompose/Zippy?) avoids issues 1, 2, and 3 by running Real Binaries on Real Computers with Real Networks. That's really nice because when you find a bug, it's something that a real user could see in production. It also means that stacktraces and network traffic look exactly how you'd expect--there's no source code rewriting, injection of probes, etc.

On the other hand, there are a couple big drawbacks with this approach. One of them is that it usually involves different processes running on 5-10 nodes. Bugs often manifest across multiple nodes, rather than on one--so where does one attach a debugger to see the whole bug? Even if a bug does occur on a single node, it's not clear which one you should attach to. Moreover, if you pause execution via e.g. a breakpoint, it tends to immediately throw the system into a new regime--the other nodes will time out requests to the one that paused, and that can mean the bug disappears when you try to look at it.

If a debugger workflow is critical, one thing that you might want to consider is--and perhaps not using Antithesis, but just with some judicious choices about your regular code design--being able to run an entire cluster in a single binary. You can still use wall-clock time, regular network calls, the regular thread scheduler, etc., but the network calls will all go over loopback and back into the same process. That gives you the ability to attach a debugger to the whole cluster, with essentially minimal code changes.
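As a very rough sketch of the single-binary idea (purely illustrative; nothing here reflects how Materialize is actually structured), each "node" runs as a thread in one process and listens on its own loopback port, so a single debugger session covers the whole cluster:

```python
# Illustrative only: run every "node" of a toy cluster inside one process.
import socketserver
import threading

class NodeHandler(socketserver.BaseRequestHandler):
    def handle(self):
        data = self.request.recv(4096)
        self.request.sendall(b"ack:" + data)  # placeholder node logic

def start_cluster(ports=(7001, 7002, 7003)):
    servers = []
    for port in ports:
        srv = socketserver.ThreadingTCPServer(("127.0.0.1", port), NodeHandler)
        threading.Thread(target=srv.serve_forever, daemon=True).start()
        servers.append(srv)
    # All nodes share one PID, so attaching a debugger once sees the whole cluster.
    return servers
```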

This will likely still have concurrency issues--if you throw up a breakpoint in one node, that doesn't necessarily pause the others. If you've got a handy-dandy concurrency-aware debugger that's great at pausing all threads and handing off data between them in the right way, that's fantastic. Even if you don't, you can work on those issues piecemeal, and maybe get to a point where it's still meaningful to attach a debugger.

You may already have this, but if you don't--the first place to interpose might be the clock! I'd focus on current-time, timeouts, and scheduled tasks. What you want is a clock shim that in normal operation calls gettimeofday etc, but in testing allows you to pause/advance time--from tests, from the debugger, etc. Having a clock shim doesn't just help with interactive debugging, but also lets you make time controllable in unit and even integration tests. Tests that used to take a long time can run nearly instantly, and with deterministic results. For an example of this sort of thing, see Tea-Time.

A (rigorously used) clock shim also helps you work around a problem with testing in containers--you can't test clock skew!
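A clock shim along these lines can be tiny. Here is a hedged Python sketch (the class names are invented for the example): production code asks the real clock, while tests substitute one they can pause and advance.

```python
import time

class SystemClock:
    """Production clock: delegates to the real wall clock."""
    def now(self) -> float:
        return time.time()

class TestClock:
    """Test clock: time only moves when the test says so."""
    def __init__(self, start: float = 0.0):
        self._now = start

    def now(self) -> float:
        return self._now

    def advance(self, seconds: float) -> None:
        # Fast-forward timeouts and scheduled tasks instantly. Giving each
        # component its own TestClock even lets a test simulate clock skew,
        # which containers alone can't do.
        self._now += seconds
```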

The next place I'd interpose would be the network. One of the big problems with Jepsen (and that you might face injecting faults on top of Docker containers) is that its instruments for interrupting network traffic are "blunt"--it can cut off all packets using iptables, but that's not particularly good at introducing subtle reorderings, and takes a good deal of time to take effect. You're effectively limited by TCP & application-level timeouts, which constrains how quickly you can inject faults and actually observe phase changes in the system. So if you're willing, consider writing your network layer between nodes so that it has an alternate mode of operation, where the network is just a bunch of in-memory prioqueues, or even unordered sets. This gives you fine-grained control over message omission & reordering, which helps you get into weird corners of the state space that are hard to reach with iptables. And it's a lot easier to implement than you might think! Here's an in-memory network implementation in just a few hundred lines.
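For a sense of scale, a bare-bones in-memory network really can be small. This is a hypothetical Python sketch (not Maelstrom's implementation): delivery is driven entirely by a seeded RNG, which gives the fine-grained control over omission and reordering described above, and logging every send/deliver call is the natural hook for the message traces mentioned next.

```python
import heapq
import random

class InMemoryNetwork:
    """Toy in-memory network: per-node priority queues ordered by delivery time."""
    def __init__(self, seed=0, drop_prob=0.0, max_jitter=0.05):
        self.rng = random.Random(seed)  # same seed => same message schedule
        self.queues = {}                # node id -> heap of (deliver_at, seq, msg)
        self.drop_prob = drop_prob
        self.max_jitter = max_jitter
        self._seq = 0                   # tie-breaker for equal delivery times

    def send(self, now, dst, msg):
        if self.rng.random() < self.drop_prob:
            return  # message omission fault
        deliver_at = now + self.rng.uniform(0.0, self.max_jitter)
        heapq.heappush(self.queues.setdefault(dst, []), (deliver_at, self._seq, msg))
        self._seq += 1

    def deliver_ready(self, now, dst):
        """Pop every message for `dst` whose delivery time has arrived."""
        ready, q = [], self.queues.get(dst, [])
        while q and q[0][0] <= now:
            ready.append(heapq.heappop(q)[2])
        return ready
```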

Two other things you can do with a network shim! One, it gets you closer to determinism--you can pick message delivery based on the same random seed you're using for the test as a whole. Second, it lets you get application-level traces of network messages, which are incredibly useful debugging tools. For instance, here's a trace recorded by Maelstrom's in-memory network, showing clients (c4, c5) making requests of transaction coordinators (n0, n1), which in turn use a linearizable (lin-kv) and an eventually consistent (lww-kv) storage service for their data.

[Screenshot from 2022-03-08: auto-generated Lamport diagram of a Maelstrom in-memory network trace]

I've built several systems using these auto-generated Lamport diagrams as a debugging aid, and it's just... oh it's SO useful to be able to see the flow of messages that led to some weird outcome, instead of trying to piece it together from a dozen log files. Not that you have to do all this from the start, but if you're laying groundwork for a new codebase, this might be something to have in mind as a potential goal. :-)

The last thing I think I'd tackle would be scheduler interposition--trying to get the thread scheduler itself to be deterministic and driven by the test. I know of systems that do this sort of thing--for instance, the Pulse scheduler for Erlang, which helps you Quickcheck concurrent code. I don't have any experience with this personally though, and I don't know what that would look like for something like Rust.

Workload Realism

This might be a different story at Materialize, and you know your engineering culture best. That said, I don't think you have to give up hope here! I've reported bugs in roughly 30 DBs over the last 9 years, and I've found that dev teams are generally very willing to fix issues that arise from "unrealistic" workloads. And in particular, those engineers often prefer an unrealistic workload that reproduces a bug in, say, 30 seconds, to a realistic workload that takes 10 hours.

This doesn't have to be an either-or situation--it's perfectly OK to write realistic workloads too, and also to mix-and-match within a single test. One thing I do in Jepsen is create new keys/tables/topics aggressively (e.g. every few seconds) for a handful of keys, but to let other keys accumulate writes for the entire history of the test. That way you get to explore more codepaths and parts of the state space--some bugs manifest only on first/last writes to some key, whereas others require sustained writes over a certain volume of data or span of time.
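A hedged Python sketch of that mixed strategy (the action names and probabilities are invented, not real Zippy Actions):

```python
import random

LONG_LIVED_TOPICS = ["topic_anchor_1", "topic_anchor_2"]  # live for the whole run

def next_action(rng: random.Random, step: int):
    """Mostly write to long-lived topics, but keep some aggressive churn."""
    roll = rng.random()
    if roll < 0.10:
        return ("create_topic", f"topic_ephemeral_{step}")    # frequent creation
    if roll < 0.20 and step > 0:
        return ("drop_topic", f"topic_ephemeral_{step - 1}")  # frequent deletion
    # The long-lived topics accumulate state for the entire test, so the run
    # doubles as a longevity test while the churn explores DDL codepaths.
    return ("ingest", rng.choice(LONG_LIVED_TOPICS))
```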

OK, I think I've prattled on far too much! Is any of this novel? Helpful? If you're curious, happy to dive into any of this stuff in depth.

@philip-stoev (author) commented

Just to clarify a potential source of confusion:

  • FoundationDB used the approach with heavily instrumented code which I thought would be difficult to debug with standard tools. This approach is described in their original video presentation at https://www.youtube.com/watch?v=4fFDFbi3toc
  • The same people then founded Antithesis, but their new approach does not require instrumenting the code. Instead, they use what they call a "deterministic hypervisor" to run uninstrumented (or barely instrumented; a C++ library is linked in and that is it) code. This one is difficult to debug as their deterministic environment does not allow live debugging or replay. You only get logs and possibly core files out of it.

@philip-stoev (author) commented

@aphyr This has been very useful, thanks! I will make sure to follow all the pointers. My entire plane of being is way more pedestrian than the stuff you describe, so the way I see this is that the heavy lifting regarding network, clock and scheduler manipulation is left for Jepsen and Antithesis, whereas Zippy would be:

  • understandable for mere mortals across the entire QA+development process -- e.g. adding new Actions to perform; test case minimization; reproduction; triage; debugging and creation of a regression test;
  • used in the CI with the familiar containerized tooling and reporting;
  • used for longevity and release qualification tests;
  • a source of workloads for Antithesis (as their service is bring-your-own-workload), on top of which they will apply their fault-injection operations.

@aphyr commented Mar 9, 2022

That sounds reasonable! I should mention that Jepsen has no way to do scheduler faults, so that's probably something best left to Antithesis. I like your idea of being able to take these Zippy workloads and run them under Antithesis to get more sophisticated faults too--that way you can write tests once and get some of the advantages of each approach. :-)

@cjubb39 commented Mar 17, 2022

This is a great overview of testing -- thanks for taking the time to write it all up, Kyle. My two biggest takeaways are:

  1. this take on sequential execution in testing, which I think super succinctly captures why it's quite a hard problem:

once an operation crashes, we must assume it is effectively concurrent with every single later operation in the history.

  2. wow! auto-generated Lamport diagrams?? This would have saved me so much time at previous companies / projects

@aphyr commented Mar 18, 2022

wow! auto-generated Lamport diagrams??

Oh my gosh yes, YES, once you have this it is so hard to go back to Wireshark!

@CLAassistant commented Dec 8, 2022

CLA assistant check: All committers have signed the CLA.

@uce removed their request for review on January 25, 2024