Ztunnel does not scale up with number of worker threads in expected way due to h2 locking #1174

howardjohn · 2024-06-24T20:31:51Z

Some testinging:

Running locally (i.e. ONE ztunnel connecting to itself)
2 workers
1:  11.51 Gbits/sec
2:  11.92 Gbits/sec
3:  11.33 Gbits/sec
4:  11.30 Gbits/sec
6:  12.15 Gbits/sec
8:  12.77 Gbits/sec
16: 14.95 Gbits/sec

4 workers
1:  9.52 Gbits/sec
2:  15.89 Gbits/sec
3:  16.49 Gbits/sec
4:  17.05 Gbits/sec
6:  16.28 Gbits/sec
8:  16.15 Gbits/sec
16: 15.74 Gbits/sec

8 workers
1:  7.88 Gbits/sec
2:  12.27 Gbits/sec
3:  15.73 Gbits/sec
4:  17.14 Gbits/sec
6:  17.95 Gbits/sec
8:  18.63 Gbits/sec
16: 19.03 Gbits/sec

On Kubernetes, cross node
2 workers
1:  9.27 Gbits/sec
2:  10.11 Gbits/sec
3:  10.60 Gbits/sec
4:  11.72 Gbits/sec
6:  13.55 Gbits/sec
8:  14.34 Gbits/sec
16: 16.09 Gbits/sec

4 workers
1:  8.09 Gbits/sec
2:  9.16 Gbits/sec
3:  9.72 Gbits/sec
4:  11.31 Gbits/sec
6:  11.64 Gbits/sec
8:  13.67 Gbits/sec
16: 15.08 Gbits/sec

8 workers
1:  8.09 Gbits/sec
2:  8.61 Gbits/sec
3:  9.27 Gbits/sec
4:  10.63 Gbits/sec
6:  11.70 Gbits/sec
8:  13.17 Gbits/sec
16: 15.57 Gbits/sec


On kubernetes, raw TCP (not HBONE/TLS)
2 workers
1:  22.18 Gbits/sec
2:  35.72 Gbits/sec
3:  35.36 Gbits/sec
4:  35.28 Gbits/sec
6:  37.48 Gbits/sec
8:  37.37 Gbits/sec
16: 38.62 Gbits/sec

8 workers
1:  23.96 Gbits/sec
2:  34.57 Gbits/sec
3:  43.82 Gbits/sec
4:  51.85 Gbits/sec
6:  50.36 Gbits/sec
8:  47.48 Gbits/sec
16: 43.99 Gbits/sec

Note: I have a 16c machine.

The weird thing is I do not see the same behavior in our benchmarks (with #1171)

                        thrpt:  [12.107 Gelem/s 13.392 Gelem/s 14.496 Gelem/s]
                 change:
                        time:   [-3.0115% +5.8051% +18.549%] (p = 0.40 > 0.05)
                        thrpt:  [-15.647% -5.4866% +3.1050%]
                        No change in performance detected.
Found 2 outliers among 10 measurements (20.00%)
  2 (20.00%) high severe
throughput/hbone2       time:   [376.68 ms 384.08 ms 392.21 ms]
                        thrpt:  [21.901 Gelem/s 22.365 Gelem/s 22.804 Gelem/s]
                 change:
                        time:   [-12.720% -9.1954% -5.5575%] (p = 0.00 < 0.05)
                        thrpt:  [+5.8845% +10.127% +14.574%]
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
throughput/hbone8       time:   [501.59 ms 507.84 ms 514.72 ms]
                        thrpt:  [16.689 Gelem/s 16.915 Gelem/s 17.125 Gelem/s]
                 change:
                        time:   [-2.8183% -0.8154% +1.2152%] (p = 0.46 > 0.05)
                        thrpt:  [-1.2006% +0.8221% +2.9000%]
                        No change in performance detected.

Things I have tried with success:

H2 Window sizes - no impact
Tracing to try to see where we are waiting - inconclusive
Disabling pooling, so we have 1:1 connections - no impact

I would expect HBONE to behave like TCP: with 2 workers, we see ~2x throughput with 2 connections as 1. 2x is the theoretical best, of course, so something more like 1.5x-1.8x may be more realistic. We don't see that. We see virtually zero increase. We do see improvements at very high connection counts, though.

We need to find out what is holding up the processing

The text was updated successfully, but these errors were encountered:

keithmattix · 2024-06-24T20:41:13Z

The weird thing is I do not see the same behavior.

You don't see the same behavior where?

howardjohn · 2024-06-24T20:42:30Z

The weird thing is I do not see the same behavior.

You don't see the same behavior where?

🤦 I updated the comment

keithmattix · 2024-06-24T21:34:10Z

I would totally suspect connection pooling if you didn't rule it out. We only use HBONE for TLS right? Maybe some weird default in the hyper client/server?

howardjohn · 2024-06-24T21:44:51Z

Ah.. I think I just disabled pooling improperly. The benchmark doesn't pool due to using different client network namespaces. I can reproduce full now and see good scaling with number of connections.

At 2 workers (default):

1:  11.05 Gbits/sec
2:  17.17 Gbits/sec
3:  17.20 Gbits/sec
4:  16.24 Gbits/sec
6:  14.84 Gbits/sec
8:  15.95 Gbits/sec
16: 14.71 Gbits/sec

So looks pretty good and align with TCP numbers

howardjohn · 2024-06-24T21:48:25Z

Its still pretty interesting that 1->2 yields no improvement, but we do still scale up to some extent. 16c gives 15Gb/s which is consistent with the non-pooling...

keithmattix · 2024-06-24T22:07:10Z

Should I be looking at the most recent numbers you posted? I see improvement from 1-2 but it looks about the same as the first set you posted in the description

howardjohn · 2024-06-24T22:12:08Z

C	Pool	No Pool
1	9.27	11.05
2	10.11	17.17
3	10.60	17.20
4	11.72	16.24
6	13.55	14.84
8	14.34	15.95
16	16.09	14.71

Actually kind of interesting that we see inverse behavior

keithmattix · 2024-06-24T22:39:58Z

That feels somewhat expected: with connection pooling you get a smaller but monotonic throughput improvement with increased concurrency. Without pooling, you're more susceptible to longer running streams.

I wonder if the delay is on client or server processing....I assume the concurrency increase affects both

bleggett · 2024-06-24T22:50:29Z

When you say "workers" you mean you're constraining the Tokio runtime to X threads? How are you doing that, exactly?

Also - if ztunnel is connecting to itself, I wouldn't expect pooling to help at all, really?

EDIT: Oh, I see, you did cross node too.

We do see improvements at very high connection counts, though.

Yeah I wouldn't expect pooling to by-itself offer much of a boost at either end of the graph, it's more that it (among other things) mitigates starving the local node as the conn count increases (or would, if we didn't pool).

howardjohn · 2024-06-24T22:52:39Z

ZTUNNEL_WORKER_THREADS, we default to 2.

Ztunnel is only connecting to itself in some of the tests, the ones that are meaningful (the ones running in k8s) is cross-node.

bleggett · 2024-06-24T22:56:15Z

Also note that the one place pooling does lock is when it hits the conn limit to Dest A and needs to open another conn to Dest A.

We sort of have to lock there, as the alternative is we let clients use us to eagerly race connections to Dest A, and then we prune them after-the-fact - which is worse.

It's an intentional ratelimit for those scenarios, where a client is spamming new conns to us.

We also don't consult worker thread count when deciding to schedule new pool conns at all (we could, but do not)

howardjohn · 2024-06-24T22:57:34Z

These tests are "open connection once" then "send a ton of information" on them. The connection establishment is out of the measurement path entirely basically.

bleggett · 2024-06-24T23:09:30Z

These tests are "open connection once" then "send a ton of information" on them. The connection establishment is out of the measurement path entirely basically.

Oh ok, if the clients are opening A Single connection to ZT then H2 multiplexing doesn't matter at all then.

If connection establishment is controlled for fully then it's got to be crypto overhead on the wire or h2 server overhead somehow, right? Can't think of anything else.

It's possible that the h2 server code just isn't very good at handling multiple connections with the same src/dest tuple, since the http/2 spec doesn't exactly encourage that.

howardjohn · 2024-08-23T15:14:06Z

Root caused to hyperium/h2#531 (comment) a while back, forgot to update this issue

howardjohn changed the title ~~Ztunnel does not scale up with number of worker threads in expected way~~ Ztunnel does not scale up with number of worker threads in expected way due to h2 locking Aug 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ztunnel does not scale up with number of worker threads in expected way due to h2 locking #1174

Ztunnel does not scale up with number of worker threads in expected way due to h2 locking #1174

howardjohn commented Jun 24, 2024 •

edited

Loading

keithmattix commented Jun 24, 2024

howardjohn commented Jun 24, 2024

keithmattix commented Jun 24, 2024

howardjohn commented Jun 24, 2024

howardjohn commented Jun 24, 2024

keithmattix commented Jun 24, 2024

howardjohn commented Jun 24, 2024

keithmattix commented Jun 24, 2024

bleggett commented Jun 24, 2024 •

edited

Loading

howardjohn commented Jun 24, 2024

bleggett commented Jun 24, 2024 •

edited

Loading

howardjohn commented Jun 24, 2024

bleggett commented Jun 24, 2024 •

edited

Loading

howardjohn commented Aug 23, 2024

Ztunnel does not scale up with number of worker threads in expected way due to h2 locking #1174

Ztunnel does not scale up with number of worker threads in expected way due to h2 locking #1174

Comments

howardjohn commented Jun 24, 2024 • edited Loading

keithmattix commented Jun 24, 2024

howardjohn commented Jun 24, 2024

keithmattix commented Jun 24, 2024

howardjohn commented Jun 24, 2024

howardjohn commented Jun 24, 2024

keithmattix commented Jun 24, 2024

howardjohn commented Jun 24, 2024

keithmattix commented Jun 24, 2024

bleggett commented Jun 24, 2024 • edited Loading

howardjohn commented Jun 24, 2024

bleggett commented Jun 24, 2024 • edited Loading

howardjohn commented Jun 24, 2024

bleggett commented Jun 24, 2024 • edited Loading

howardjohn commented Aug 23, 2024

howardjohn commented Jun 24, 2024 •

edited

Loading

bleggett commented Jun 24, 2024 •

edited

Loading

bleggett commented Jun 24, 2024 •

edited

Loading

bleggett commented Jun 24, 2024 •

edited

Loading