MONGOCRYPT-792 Avoid libcrypto lock contention and fetch overhead #995
This is a proposed performance improvement for HMAC and AES operations when using libcrypto (OpenSSL) 3.0.0 or later. Earlier versions are still supported but don't benefit from the optimization.
As MONGOCRYPT-792 requests, this performs the algorithm "fetch" operation early, resolving a name like "SHA-256" into a ready-to-use internal object. Upon closer examination, the overhead from this fetch operation can be much higher than expected due to lock contention around access to the shared global OSSL_LIB_CTX.
In testing with the benchmark-python.sh microbenchmark, I see about a 50% improvement in single-threaded results and an even more dramatic gain in multi-threaded results: before, throughput doesn't improve much beyond the single-threaded figure, whereas after we get a bit over 4x scaling with concurrency. See full results below.

There's still a great deal of room for improvement. Profiling shows signs of contention in the atomic reference-count operations when we create and destroy contexts. It would be ideal to modify the API so that the caller is responsible for allocating a non-shared temporary buffer that can be used for a batch of many crypto operations.
There are further alternatives. The underlying algorithms here don't require temporary space or early initialization; we're just bending to accommodate APIs that aren't ideal. If we wanted, a self-contained SHA-2 implementation with no external dependencies might prove faster than calling OpenSSL's HMAC this frequently. Early fetch is straightforward for ciphers, but fetching both the HMAC and its digest sub-algorithm early requires a workaround (ctx dup) with its own overhead. This may also be a reason to prefer other HMAC/SHA-2 implementations in the future.
Update: To support the claim of atomic refcount contention above, here's more data. I ran the same "after" benchmark in 64-thread mode only, under perf record. The overall profile is similar to the one above, with EVP_MD_free on top. Annotating EVP_MD_free itself shows the refcounting has been inlined, with 96.8% of samples landing on the instruction immediately following the locked operation. The next highest symbol in the profile is EVP_MD_up_ref, which has 99% of its samples following its locked op. All of this represents room for improvement subsequent to this PR.