MONGOCRYPT-792 Avoid libcrypto lock contention and fetch overhead #995
3 commits merged into master
Conversation
kevinAlbs left a comment
LGTM. The results are impressive.
> 99% of its samples following the locked op.
Am I correct in assuming this is because perf record does not measure the off-CPU lock instruction, so the samples get attributed to the subsequent instruction?
There can be a few reasons why sampling profilers report locations with a small offset from the actual bottleneck, and I'd have to dig through the x86 manuals to be absolutely sure which effects come from which causes. In this case, though, I think the instruction pointer has simply advanced past the locked instruction before the stall takes place. (The profiler doesn't know microarchitectural details; it just logs the instruction pointer plus the call stack.)
This is a proposed performance improvement for HMAC and AES operations when using libcrypto (OpenSSL) 3.0.0 or later. Earlier versions are still supported but don't benefit from the optimization.
As MONGOCRYPT-792 requests, this performs the algorithm "fetch" operation early, resolving a name like "SHA-256" into a ready-to-use internal object. On closer examination, the overhead of this fetch operation can be much higher than expected due to lock contention around access to the shared global OSSL_LIB_CTX.
In testing with the benchmark-python.sh microbenchmark, I see about a 50% improvement in single-threaded results, and an even more dramatic increase in multi-threaded results. The before results don't get much better than single-threaded, whereas after we get a bit over 4x concurrency. See full results below.

There's still a great deal of room for improvement. Profiling shows signs of contention in the atomic reference count operations when we create and destroy contexts. It would be ideal to modify the API so that the caller is responsible for allocating a non-shared temporary buffer that can be used for a batch of many crypto operations.
There are more alternatives still. The underlying algorithms here don't require temporary space or early initialization; we're just bending to support APIs that aren't ideal. If we wanted, a self-contained SHA2 implementation without external dependencies might prove faster than calling OpenSSL's HMAC this frequently. Early fetch is straightforward for ciphers, but fetching both the HMAC and its sub-algorithm early requires a workaround (ctx dup) with its own overhead. This may also be a reason to prefer other HMAC/SHA2 implementations in the future.
Update: To support the claim of atomic refcount contention above, here's more data. I ran the same "after" benchmark in 64-thread mode only, under perf record. The overall profile is similar to the one above, with EVP_MD_free on top. Annotating EVP_MD_free itself shows the refcounting has been inlined, with 96.8% of samples on the instruction following the locked operation. The next highest symbol in the profile is EVP_MD_up_ref, which has 99% of its samples following the locked op. This all represents room for improvement subsequent to this PR.