MONGOCRYPT-792 Avoid libcrypto lock contention and fetch overhead #995
3 commits merged into master
Conversation
kevinAlbs left a comment
LGTM. The results are impressive.
> 99% of its samples following the locked op.
Am I correct in assuming this is because perf record does not measure the off-CPU lock instruction, so the samples get attributed to the subsequent instruction?
There can be a few reasons why sampling profilers report locations with a small offset from the actual bottleneck, and I'd have to dig through the x86 manuals to be absolutely sure which effects come from which causes. In this case, though, I think the instruction pointer has simply advanced past the locked instruction before the stall takes place. (The profiler doesn't know microarchitectural details; it just logs the instruction pointer plus the call stack.)
This is a proposed performance improvement for HMAC and AES operations when using libcrypto (OpenSSL) 3.0.0 or later. Earlier versions are still supported but don't benefit from the optimization.
As MONGOCRYPT-792 requests, this performs the algorithm "fetch" operation early, resolving a name like "SHA-256" into a ready-to-use internal object. On closer examination, the overhead of this fetch operation can be much higher than expected due to lock contention around access to the shared global OSSL_LIB_CTX.
In testing with the benchmark-python.sh microbenchmark, I see about a 50% improvement in single-threaded results, and an even more dramatic increase in multi-threaded results. The before results don't get much better than single-threaded, whereas after we get a bit over 4x concurrency. See full results below.

There's still a great deal of room for improvement. Profiling shows signs of contention in the atomic reference count operations when we create and destroy contexts. It would be ideal to modify the API so that the caller is responsible for allocating a non-shared temporary buffer that can be used for a batch of many crypto operations.
There are more alternatives still. The underlying algorithms here don't require temporary space or early initialization; we're just bending to support APIs that aren't ideal. If we wanted, a self-contained SHA2 implementation without external dependencies might prove faster than calling OpenSSL's HMAC this frequently. Early fetch is straightforward for ciphers, but fetching both the HMAC and its sub-algorithm early requires a workaround (ctx dup) with its own overhead. This may also be a reason to prefer other HMAC/SHA2 implementations in the future.
Update: To support the claim of atomic refcount contention above, here's more data. I ran the same "after" benchmark in 64-thread mode only, under perf record. The overall profile is similar to the one above, with EVP_MD_free on top. Annotating EVP_MD_free itself shows the refcounting has been inlined, with 96.8% of samples on the instruction following the locked operation. The next highest symbol in the profile is EVP_MD_up_ref, which has 99% of its samples following the locked op. This all represents room for improvement subsequent to this PR.