disconnect()/connect() and consistency/correctness #253


Hi,

This is a breakout of issue "db.connect()/disconnect() issues: db.prop and db.syncStatus.prop values, delayed replication, and read your own writes anomaly" #226.

It's also a way to trigger "Local changes reverted when a replication delay is present" powersync-ja/powersync-sqlite-core#54

The "Local changes reverted when a replication delay is present" in powersync-sqlite-core was easy to reproduce in v1.10.0 and became harder in v1.11.0 using disconnecting/connecting to trigger the bug.

As of the current version, v1.12.1, adding some network instability, e.g. an occasional network partition, on top of disconnecting/connecting makes it possible to reliably trigger the bug.

Under these conditions, the bug presents in two ways:

  • a {key: value} is read that is older than the current, last-read value, i.e. a value that had already been superseded in prior reads
    • the {key: value} appears non-monotonic to the client, i.e. it goes backwards in time (see the sketch after this list)
  • the client reads an empty table, i.e. SELECT * returns [], even though there were successful prior reads
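
Here's a minimal sketch of the kind of per-client monotonicity check involved, assuming, as in the mww (max-write-wins) table, that values only ever increase. The class and method names are illustrative, not the fuzzer's actual CausalChecker:

// Hypothetical sketch: track each client's highest-seen value per key and
// flag any read that goes backwards. Values are assumed to be monotonically
// increasing, as with max-write-wins registers.
class MonotonicReadChecker {
  // Highest value observed so far for each key, per client.
  final Map<int, Map<int, int>> _lastRead = {};

  /// Returns an error message if any key's value went backwards relative to
  /// this client's previous reads, otherwise null.
  String? checkRead(int clientNum, Map<int, int> read) {
    final previous = _lastRead.putIfAbsent(clientNum, () => <int, int>{});
    for (final entry in read.entries) {
      final prior = previous[entry.key];
      if (prior != null && entry.value < prior) {
        return 'read of {${entry.key}: ${entry.value}} is less than '
            'expected read of {${entry.key}: $prior}';
      }
      previous[entry.key] = entry.value;
    }
    return null;
  }
}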

A fuzzing application has been created to reproduce this and other anomalies, and to fuzz test PowerSync in general.

Please see A Dart CLI fuzzer for PowerSync.

There's a GitHub Actions matrix to fuzz-non-monotonic-reads.

And a script for reproducing locally: Local Host - Helper Script to Loop Until Error.


Here's a typical example of working through the powersync_fuzz.log output from a failed run.

./powersync-fuzz-loop.sh ./powersync_fuzz --table mww --clients 5 --rate 10 --time 100 --disconnect --partition
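
For reference, a sketch of how those flags might be declared with package:args; the actual powersync_fuzz argument handling may differ:

// Illustrative only: option and flag names match the command line above,
// but help texts are assumptions.
import 'package:args/args.dart';

ArgResults parseFuzzArgs(List<String> args) {
  final parser = ArgParser()
    ..addOption('table', help: 'table type under test, e.g. mww')
    ..addOption('clients', help: 'number of PowerSync clients')
    ..addOption('rate', help: 'transactions per second')
    ..addOption('time', help: 'test duration in seconds')
    ..addFlag('disconnect', help: 'enable the disconnect/connect nemesis')
    ..addFlag('partition', help: 'enable the network partition nemesis');
  return parser.parse(args);
}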

We can see the disconnecting, connecting, and partitioning nemesis activity:

$ grep 'nemesis' powersync_fuzz.log

[2025-03-11 02:18:57.576457] [main] [INFO] nemesis: disconnect/connect: start listening to stream of disconnected/connected messages
[2025-03-11 02:18:57.576504] [main] [INFO] nemesis: partition: start listening to stream of partition messages
[2025-03-11 02:19:01.847681] [main] [INFO] nemesis: disconnect/connect: disconnected: clients: (1, 3)
[2025-03-11 02:19:03.575833] [main] [INFO] nemesis: disconnect/connect: connected: clients: (1, 5, 3, 2, 4)
[2025-03-11 02:19:05.290393] [main] [INFO] nemesis: partition: bidirectional
[2025-03-11 02:19:09.693580] [main] [INFO] nemesis: partition: none
[2025-03-11 02:19:13.468192] [main] [INFO] nemesis: disconnect/connect: disconnected: clients: (2)
[2025-03-11 02:19:18.185368] [main] [INFO] nemesis: partition: outbound
[2025-03-11 02:19:21.954030] [main] [INFO] nemesis: disconnect/connect: connected: clients: (3, 4, 2, 1, 5)
[2025-03-11 02:19:25.980436] [main] [INFO] nemesis: partition: none
...
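
In PowerSync Dart terms, the disconnect/connect nemesis boils down to calling db.disconnect() and db.connect(connector: ...) on random subsets of clients. A minimal sketch, where the subset selection and timing are illustrative assumptions:

import 'dart:math';

import 'package:powersync/powersync.dart';

// Repeatedly disconnect a random subset of clients, wait, then reconnect
// everyone. disconnect()/connect() are the public PowerSync APIs; everything
// else here is illustrative.
Future<void> nemesisLoop(List<PowerSyncDatabase> clients,
    PowerSyncBackendConnector connector) async {
  final rng = Random();
  while (true) {
    for (final db in clients.where((_) => rng.nextBool())) {
      await db.disconnect();
    }
    await Future.delayed(Duration(seconds: 2 + rng.nextInt(4)));
    for (final db in clients) {
      await db.connect(connector: connector);
    }
    await Future.delayed(Duration(seconds: 2 + rng.nextInt(4)));
  }
}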

We can see that the nemesis triggers upload/download errors in the SyncStatus stream; the client treats them as ignorable:

$ grep 'ignorable' powersync_fuzz.log

[2025-03-11 02:19:21.948436] [ps-5] [INFO] SyncStatus: ignorable upload error in statusStream: ClientException: Connection closed before full header was received, uri=http://powersync:8080/write-checkpoint2.json?client_id=adb498d8-3697-4459-b009-823ee5a1f83b
...
[2025-03-11 02:19:31.108449] [ps-1] [INFO] SyncStatus: ignorable download error in statusStream: ClosedException
...
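
The fuzzer observes these via db.statusStream. A minimal sketch of such a listener, assuming SyncStatus exposes uploadError/downloadError as the log lines above suggest:

// Log upload/download errors as they appear in the status stream. The
// "[ps-N]" formatting mirrors the fuzzer's log output; the field names are
// believed to match the Dart SDK's SyncStatus but are worth verifying.
import 'package:powersync/powersync.dart';

void watchSyncStatus(PowerSyncDatabase db, int clientNum) {
  db.statusStream.listen((status) {
    if (status.uploadError != null) {
      print('[ps-$clientNum] ignorable upload error: ${status.uploadError}');
    }
    if (status.downloadError != null) {
      print('[ps-$clientNum] ignorable download error: ${status.downloadError}');
    }
  });
}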

And now for the non-monotonic read error.

At the end of powersync_fuzz.log, the SEVERE messages will show the cause.
The easiest cause to debug is "expected because myPreviousRead".

Here is clientNum: 3 reading backwards in time:

[2025-03-11 02:20:19.507539] [main] [SEVERE] read of {0: 422} is less than expected read of {0: 703}, expected because myPreviousRead, in op: {clientNum: 3, clientType: ps, f: txn, id: 156, table: mww, type: ok, value: [{f: read-all, k: -1, v: {0: 422, ..., 99: 402}}, {f: write-some, k: -1, v: {85: 818, 14: 818, 64: 818, 27: 818}}]}
[2025-03-11 02:20:19.507554] [main] [SEVERE] CausalChecker client states with {0: [422, 703]}:
[2025-03-11 02:20:19.507561] [main] [SEVERE] 	{0: 703}: in client 1: reason: myPreviousRead
[2025-03-11 02:20:19.507565] [main] [SEVERE] 	{0: 703}: in client 3: reason: myPreviousRead
[2025-03-11 02:20:19.507569] [main] [SEVERE] CausalChecker Monotonic Read/Write Follows Reads states:
[2025-03-11 02:20:19.507583] [main] [SEVERE] 	{0: 422}: written when: [422, ..., 316]
[2025-03-11 02:20:19.507599] [main] [SEVERE] 	{0: 703}: written when: [703, ..., 650]

We can see that client 3 read the correct, current value many times previously:

$ grep 'ps-3' powersync_fuzz.log |  grep '0: 703'

[2025-03-11 02:20:09.679222] [ps-3] [FINE] SQL txn: response: {clientNum: 3, clientType: ps, f: txn, id: 140, table: mww, type: ok, value: [{f: read-all, k: -1, v: {0: 703, ..., 99: 665}}, {f: write-some, k: -1, v: {13: 720, 57: 720, 6: 720, 90: 720}}]}
...
read 15 additional times
...

And observe the correct value originally being written, uploaded, and read by client 1:

$ grep 'ps-1' powersync_fuzz.log |  grep '0: 703'

# the write
[2025-03-11 02:20:07.979747] [ps-1] [FINE] SQL txn: response: {clientNum: 1, clientType: ps, f: txn, id: 141, table: mww, type: ok, value: [{f: read-all, k: -1, v: {...}}, {f: write-some, k: -1, v: {98: 703, 12: 703, 18: 703, 0: 703}}]}

# a read
[2025-03-11 02:20:08.579595] [ps-1] [FINE] SQL txn: response: {clientNum: 1, clientType: ps, f: txn, id: 142, table: mww, type: ok, value: [{f: read-all, k: -1, v: {0: 703, ..., 99: 650}}, {f: write-some, k: -1, v: {29: 709, 98: 709, 66: 709, 25: 709}}]}

# the BackendConnector.upload()
[2025-03-11 02:20:09.179271] [ps-1] [FINER] uploadData: txn: 142 patch: {0: 703}

... multiple additional reads
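
For context, the uploadData above follows the standard PowerSync Dart connector pattern. A minimal sketch, where the endpoint, token, and logging are assumptions and the fuzzer's real connector will differ:

// Sketch of a backend connector in the documented PowerSync Dart shape:
// drain the next local CRUD transaction, apply it to the backend, then
// mark it complete. The credentials and backend call are placeholders.
import 'package:powersync/powersync.dart';

class FuzzConnector extends PowerSyncBackendConnector {
  @override
  Future<PowerSyncCredentials?> fetchCredentials() async {
    // Placeholder values pointing at the local PowerSync service.
    return PowerSyncCredentials(
        endpoint: 'http://powersync:8080', token: 'dev-token');
  }

  @override
  Future<void> uploadData(PowerSyncDatabase database) async {
    final transaction = await database.getNextCrudTransaction();
    if (transaction == null) return;
    for (final op in transaction.crud) {
      // Apply op (op.op, op.table, op.id, op.opData) to the backend here;
      // the fuzzer logs this step as "uploadData: txn: ... patch: ...".
    }
    // Only complete once the backend has accepted the whole transaction.
    await transaction.complete();
  }
}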

There may actually be multiple bugs exposed by disconnecting/connecting with a flaky network.

The GitHub Actions workflow runs a matrix of 240 jobs, of which:

  • 25 fail with non-monotonic reads
  • 16 fail with empty table reads
  • 141 fail with divergent final reads

The test matrix has been tuned over time to elicit the maximum number of failures.
This issue is scoped to the non-monotonic and empty reads. There are cleaner ways to produce divergent reads, and they behave differently, so I will open a separate issue for them.

The first local Docker build takes a bit of time, mostly creating the Flutter/Dart build image.

The local script that loops until error will sometimes find an error on the first attempt; other times it will grind away, fuzzing, for a bit.
It's the nature of fuzzing.

I know this is a lot.
I hope the fuzzing environment can be made easy to use, maybe even run as an action on commits, so please ask questions and request any changes that would be helpful.

Thanks!
