disconnect()/connect() and consistency/correctness #253


Hi,

This is a breakout of issue "db.connect()/disconnect() issues: db.prop and db.syncStatus.prop values, delayed replication, and read your own writes anomaly" #226.

It's also a way to trigger "Local changes reverted when a replication delay is present" powersync-ja/powersync-sqlite-core#54

The "Local changes reverted when a replication delay is present" in powersync-sqlite-core was easy to reproduce in v1.10.0 and became harder in v1.11.0 using disconnecting/connecting to trigger the bug.

As of the current version, v1.12.1, adding some network instability, e.g. an occasional network partition, on top of disconnecting/connecting makes it possible to reliably trigger the bug.

Under these conditions, the bug presents in two ways:

  • a {key: value} is read that is older than the current, last-read value, i.e. a value that had already been superseded in prior reads
    • the {key: value} appears non-monotonic to the client, i.e. it goes backwards in time (see the sketch after this list)
  • the client reads an empty table, i.e. SELECT * returns [], even though there were successful prior reads
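
Here's a minimal sketch of the kind of per-client monotonicity check involved, assuming, as in the mww (max-write-wins) table, that values only ever increase. The class and method names are illustrative, not the fuzzer's actual CausalChecker:

// Hypothetical sketch: track each client's highest-seen value per key and
// flag any read that goes backwards. Values are assumed to be monotonically
// increasing, as with max-write-wins registers.
class MonotonicReadChecker {
  // Highest value observed so far for each key, per client.
  final Map<int, Map<int, int>> _lastRead = {};

  /// Returns an error message if any key's value went backwards relative to
  /// this client's previous reads, otherwise null.
  String? checkRead(int clientNum, Map<int, int> read) {
    final previous = _lastRead.putIfAbsent(clientNum, () => <int, int>{});
    for (final entry in read.entries) {
      final prior = previous[entry.key];
      if (prior != null && entry.value < prior) {
        return 'read of {${entry.key}: ${entry.value}} is less than '
            'expected read of {${entry.key}: $prior}';
      }
      previous[entry.key] = entry.value;
    }
    return null;
  }
}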

A fuzzing application has been created to reproduce this and other anomalies, and to fuzz test PowerSync in general.

Please see A Dart CLI fuzzer for PowerSync.

There's a GitHub Actions matrix to fuzz-non-monotonic-reads.

And a script for reproducing locally: Local Host - Helper Script to Loop Until Error.


Here's a typical example of working through the powersync_fuzz.log output from a failed run.

./powersync-fuzz-loop.sh ./powersync_fuzz --table mww --clients 5 --rate 10 --time 100 --disconnect --partition
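
For reference, a sketch of how those flags might be declared with package:args; the actual powersync_fuzz argument handling may differ:

// Illustrative only: option and flag names match the command line above,
// but help texts are assumptions.
import 'package:args/args.dart';

ArgResults parseFuzzArgs(List<String> args) {
  final parser = ArgParser()
    ..addOption('table', help: 'table type under test, e.g. mww')
    ..addOption('clients', help: 'number of PowerSync clients')
    ..addOption('rate', help: 'transactions per second')
    ..addOption('time', help: 'test duration in seconds')
    ..addFlag('disconnect', help: 'enable the disconnect/connect nemesis')
    ..addFlag('partition', help: 'enable the network partition nemesis');
  return parser.parse(args);
}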

We can see the disconnecting, connecting, and partitioning nemesis activity:

$ grep 'nemesis' powersync_fuzz.log

[2025-03-11 02:18:57.576457] [main] [INFO] nemesis: disconnect/connect: start listening to stream of disconnected/connected messages
[2025-03-11 02:18:57.576504] [main] [INFO] nemesis: partition: start listening to stream of partition messages
[2025-03-11 02:19:01.847681] [main] [INFO] nemesis: disconnect/connect: disconnected: clients: (1, 3)
[2025-03-11 02:19:03.575833] [main] [INFO] nemesis: disconnect/connect: connected: clients: (1, 5, 3, 2, 4)
[2025-03-11 02:19:05.290393] [main] [INFO] nemesis: partition: bidirectional
[2025-03-11 02:19:09.693580] [main] [INFO] nemesis: partition: none
[2025-03-11 02:19:13.468192] [main] [INFO] nemesis: disconnect/connect: disconnected: clients: (2)
[2025-03-11 02:19:18.185368] [main] [INFO] nemesis: partition: outbound
[2025-03-11 02:19:21.954030] [main] [INFO] nemesis: disconnect/connect: connected: clients: (3, 4, 2, 1, 5)
[2025-03-11 02:19:25.980436] [main] [INFO] nemesis: partition: none
...
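
In PowerSync Dart terms, the disconnect/connect nemesis boils down to calling db.disconnect() and db.connect(connector: ...) on random subsets of clients. A minimal sketch, where the subset selection and timing are illustrative assumptions:

import 'dart:math';

import 'package:powersync/powersync.dart';

// Repeatedly disconnect a random subset of clients, wait, then reconnect
// everyone. disconnect()/connect() are the public PowerSync APIs; everything
// else here is illustrative.
Future<void> nemesisLoop(List<PowerSyncDatabase> clients,
    PowerSyncBackendConnector connector) async {
  final rng = Random();
  while (true) {
    for (final db in clients.where((_) => rng.nextBool())) {
      await db.disconnect();
    }
    await Future.delayed(Duration(seconds: 2 + rng.nextInt(4)));
    for (final db in clients) {
      await db.connect(connector: connector);
    }
    await Future.delayed(Duration(seconds: 2 + rng.nextInt(4)));
  }
}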

We can see that the nemesis triggers upload/download errors in the SyncStatus stream; the client treats them as ignorable:

$ grep 'ignorable' powersync_fuzz.log

[2025-03-11 02:19:21.948436] [ps-5] [INFO] SyncStatus: ignorable upload error in statusStream: ClientException: Connection closed before full header was received, uri=http://powersync:8080/write-checkpoint2.json?client_id=adb498d8-3697-4459-b009-823ee5a1f83b
...
[2025-03-11 02:19:31.108449] [ps-1] [INFO] SyncStatus: ignorable download error in statusStream: ClosedException
...
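
The fuzzer observes these via db.statusStream. A minimal sketch of such a listener, assuming SyncStatus exposes uploadError/downloadError as the log lines above suggest:

// Log upload/download errors as they appear in the status stream. The
// "[ps-N]" formatting mirrors the fuzzer's log output; the field names are
// believed to match the Dart SDK's SyncStatus but are worth verifying.
import 'package:powersync/powersync.dart';

void watchSyncStatus(PowerSyncDatabase db, int clientNum) {
  db.statusStream.listen((status) {
    if (status.uploadError != null) {
      print('[ps-$clientNum] ignorable upload error: ${status.uploadError}');
    }
    if (status.downloadError != null) {
      print('[ps-$clientNum] ignorable download error: ${status.downloadError}');
    }
  });
}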

And now for the non-monotonic read error.

At the end of powersync_fuzz.log, the SEVERE messages will show the cause.
The easiest cause to debug is "expected because myPreviousRead".

Here is clientNum: 3 reading backwards in time:

[2025-03-11 02:20:19.507539] [main] [SEVERE] read of {0: 422} is less than expected read of {0: 703}, expected because myPreviousRead, in op: {clientNum: 3, clientType: ps, f: txn, id: 156, table: mww, type: ok, value: [{f: read-all, k: -1, v: {0: 422, ..., 99: 402}}, {f: write-some, k: -1, v: {85: 818, 14: 818, 64: 818, 27: 818}}]}
[2025-03-11 02:20:19.507554] [main] [SEVERE] CausalChecker client states with {0: [422, 703]}:
[2025-03-11 02:20:19.507561] [main] [SEVERE] 	{0: 703}: in client 1: reason: myPreviousRead
[2025-03-11 02:20:19.507565] [main] [SEVERE] 	{0: 703}: in client 3: reason: myPreviousRead
[2025-03-11 02:20:19.507569] [main] [SEVERE] CausalChecker Monotonic Read/Write Follows Reads states:
[2025-03-11 02:20:19.507583] [main] [SEVERE] 	{0: 422}: written when: [422, ..., 316]
[2025-03-11 02:20:19.507599] [main] [SEVERE] 	{0: 703}: written when: [703, ..., 650]

We can see that client 3 read the correct, current value many times previously:

$ grep 'ps-3' powersync_fuzz.log |  grep '0: 703'

[2025-03-11 02:20:09.679222] [ps-3] [FINE] SQL txn: response: {clientNum: 3, clientType: ps, f: txn, id: 140, table: mww, type: ok, value: [{f: read-all, k: -1, v: {0: 703, ..., 99: 665}}, {f: write-some, k: -1, v: {13: 720, 57: 720, 6: 720, 90: 720}}]}
...
read 15 additional times
...

And observe the correct value originally being written, uploaded, and read by client 1:

$ grep 'ps-1' powersync_fuzz.log |  grep '0: 703'

# the write
[2025-03-11 02:20:07.979747] [ps-1] [FINE] SQL txn: response: {clientNum: 1, clientType: ps, f: txn, id: 141, table: mww, type: ok, value: [{f: read-all, k: -1, v: {...}}, {f: write-some, k: -1, v: {98: 703, 12: 703, 18: 703, 0: 703}}]}

# a read
[2025-03-11 02:20:08.579595] [ps-1] [FINE] SQL txn: response: {clientNum: 1, clientType: ps, f: txn, id: 142, table: mww, type: ok, value: [{f: read-all, k: -1, v: {0: 703, ..., 99: 650}}, {f: write-some, k: -1, v: {29: 709, 98: 709, 66: 709, 25: 709}}]}

# the BackendConnector.upload()
[2025-03-11 02:20:09.179271] [ps-1] [FINER] uploadData: txn: 142 patch: {0: 703}

... multiple additional reads
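
For context, the uploadData above follows the standard PowerSync Dart connector pattern. A minimal sketch, where the endpoint, token, and logging are assumptions and the fuzzer's real connector will differ:

// Sketch of a backend connector in the documented PowerSync Dart shape:
// drain the next local CRUD transaction, apply it to the backend, then
// mark it complete. The credentials and backend call are placeholders.
import 'package:powersync/powersync.dart';

class FuzzConnector extends PowerSyncBackendConnector {
  @override
  Future<PowerSyncCredentials?> fetchCredentials() async {
    // Placeholder values pointing at the local PowerSync service.
    return PowerSyncCredentials(
        endpoint: 'http://powersync:8080', token: 'dev-token');
  }

  @override
  Future<void> uploadData(PowerSyncDatabase database) async {
    final transaction = await database.getNextCrudTransaction();
    if (transaction == null) return;
    for (final op in transaction.crud) {
      // Apply op (op.op, op.table, op.id, op.opData) to the backend here;
      // the fuzzer logs this step as "uploadData: txn: ... patch: ...".
    }
    // Only complete once the backend has accepted the whole transaction.
    await transaction.complete();
  }
}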

There may actually be multiple bugs exposed by disconnecting/connecting with a flaky network.

The GitHub Actions workflow runs a matrix of 240 jobs, of which:

  • 25 fail with non-monotonic reads
  • 16 fail with empty table reads
  • 141 fail with divergent final reads

The test matrix has been tuned over time to elicit the maximum number of failures.
This issue is scoped to the non-monotonic and empty reads. There are cleaner ways to produce divergent reads, and they behave differently, so I will open a separate issue for them.

The first local Docker build takes a bit of time, mostly creating the Flutter/Dart build image.

The local script that loops until error will sometimes find an error on the first attempt; other times it will grind away, fuzzing, for a bit.
It's the nature of fuzzing.

I know this is a lot.
I hope the fuzzing environment can be made easy to use, maybe even run as an action on commits, so please ask questions and request any changes that would be helpful.

Thanks!
