Skip to content

MONGOCRYPT-792 Avoid libcrypto lock contention and fetch overhead #995

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Apr 15, 2025

Conversation

mdbmes
Copy link
Contributor

@mdbmes mdbmes commented Apr 4, 2025

This is a proposed performance improvement for HMAC and AES operations when using libcrypto (OpenSSL) 3.0.0 or later. Earlier versions are still supported but don't benefit from the optimization.

As MONGOCRYPT-792 requests, this performs the algorithm "fetch" operation early, resolving a name like "SHA-256" into a ready-to-use internal object. Upon closer examination, the overhead from this fetch operation can be much higher than expected due to lock contention around access to the shared global OSSL_LIB_CTX.

In testing with the benchmark-python.sh microbenchmark, I see about 50% improvement on single-threaded results, but an even more dramatic increase in multi-threaded results. The before results don't get much better than single-threaded, whereas after we are getting a bit over 4x concurrency. See full results below.

There's still a great deal of room for improvement. Profiling shows signs of contention in the atomic reference count operations when we create and destroy contexts. It would be ideal to modify the API such that the caller is responsible for allocating a non-shared temporary buffer that can be used for a batch of many crypto operations.

There are more alternatives yet. The underlying algorithms here don't require temporary space or early initialization, we're just bending to support APIs that aren't ideal. If we wanted, a self-contained SHA2 implementation without external dependencies may prove faster than using OpenSSL's HMAC on such a frequent basis. Early fetch is straightforward for ciphers, but fetching both the HMAC and its sub-algorithm early requires a workaround (ctx dup) with its own overhead. This may also be reason to prefer other HMAC/SHA2 implementations in the future.

Measuring on rhel90-dbx-perf-large evergreen host, using ./.evergreen/benchmark-python.sh

======
BEFORE
======

Running benchmark with pymongocrypt: 1.13.0.dev0 libmongocrypt: 1.13.1
runTest (__main__.TestBulkDecryption) ... Finished TestBulkDecryption, threads=1, median ops_per_second=119.56
Finished TestBulkDecryption, threads=2, median ops_per_second=163.44
Finished TestBulkDecryption, threads=8, median ops_per_second=188.24
Finished TestBulkDecryption, threads=64, median ops_per_second=206.21
ok

Running benchmark with pymongocrypt: 1.13.0.dev0 libmongocrypt: 1.13.1
runTest (__main__.TestBulkDecryption) ... Finished TestBulkDecryption, threads=1, median ops_per_second=117.58
Finished TestBulkDecryption, threads=2, median ops_per_second=167.10
Finished TestBulkDecryption, threads=8, median ops_per_second=184.78
Finished TestBulkDecryption, threads=64, median ops_per_second=210.00
ok

Samples: 1M of event 'task-clock:upppH', Event count (approx.): 305122250000
Overhead  Command   Shared Object                                  Symbol
  37.50%  python    libc.so.6                                      [.] pthread_rwlock_unlock@@GLIBC_2.34
  33.27%  python    libc.so.6                                      [.] pthread_rwlock_rdlock@GLIBC_2.2.5
   1.11%  python    libcrypto.so.3.0.1                             [.] 0x00000000001a3f39
   1.04%  python    libcrypto.so.3.0.1                             [.] EVP_MD_up_ref
   0.95%  python    libcrypto.so.3.0.1                             [.] EVP_MD_free
   0.79%  python    libc.so.6                                      [.] __strcmp_avx2
   0.78%  python    libc.so.6                                      [.] _int_free
   0.76%  python    libcrypto.so.3.0.1                             [.] HMAC_Init_ex
   0.70%  python    libcrypto.so.3.0.1                             [.] OPENSSL_cleanse
   0.67%  python    libc.so.6                                      [.] malloc
   0.59%  python    libcrypto.so.3.0.1                             [.] 0x00000000001a3f9b
   0.52%  python    libcrypto.so.3.0.1                             [.] OPENSSL_LH_retrieve
   0.43%  python    libcrypto.so.3.0.1                             [.] EVP_MAC_free
   0.40%  python    libc.so.6                                      [.] __libc_calloc
   0.39%  python    libc.so.6                                      [.] cfree@GLIBC_2.2.5
   0.32%  python    libcrypto.so.3.0.1                             [.] SHA512_Final

=====
AFTER
=====

Running benchmark with pymongocrypt: 1.13.0.dev0 libmongocrypt: 1.13.1
runTest (__main__.TestBulkDecryption) ... Finished TestBulkDecryption, threads=1, median ops_per_second=181.09
Finished TestBulkDecryption, threads=2, median ops_per_second=310.09
Finished TestBulkDecryption, threads=8, median ops_per_second=692.31
Finished TestBulkDecryption, threads=64, median ops_per_second=823.21
ok

Running benchmark with pymongocrypt: 1.13.0.dev0 libmongocrypt: 1.13.1
runTest (__main__.TestBulkDecryption) ... Finished TestBulkDecryption, threads=1, median ops_per_second=181.51
Finished TestBulkDecryption, threads=2, median ops_per_second=314.28
Finished TestBulkDecryption, threads=8, median ops_per_second=774.59
Finished TestBulkDecryption, threads=64, median ops_per_second=902.83
ok

Samples: 1M of event 'task-clock:upppH', Event count (approx.): 275680250000
Overhead  Command   Shared Object                                  Symbol
   8.62%  python    libcrypto.so.3.0.1                             [.] EVP_MD_free
   7.74%  python    libcrypto.so.3.0.1                             [.] EVP_MD_up_ref
   4.03%  python    libcrypto.so.3.0.1                             [.] EVP_DigestUpdate
   3.96%  python    libcrypto.so.3.0.1                             [.] EVP_MD_CTX_copy_ex
   3.08%  python    libcrypto.so.3.0.1                             [.] HMAC_Init_ex
   2.76%  python    libcrypto.so.3.0.1                             [.] OPENSSL_cleanse
   2.41%  python    libc.so.6                                      [.] _int_free
   2.40%  python    libcrypto.so.3.0.1                             [.] 0x0000000000178ada
   2.38%  python    libcrypto.so.3.0.1                             [.] EVP_DigestFinal_ex
   1.96%  python    libc.so.6                                      [.] malloc
   1.81%  python    libc.so.6                                      [.] __strcmp_avx2
   1.56%  python    libcrypto.so.3.0.1                             [.] 0x000000000017857d
   1.33%  python    libc.so.6                                      [.] cfree@GLIBC_2.2.5
   1.17%  python    libcrypto.so.3.0.1                             [.] SHA512_Final
   1.03%  python    libmongocrypt.so.0.0.0                         [.] _bson_iter_next_internal
   0.97%  python    libcrypto.so.3.0.1                             [.] 0x000000000024eae0
   0.96%  python    libc.so.6                                      [.] __memmove_avx_unaligned_erms
   0.89%  python    libcrypto.so.3.0.1                             [.] 0x0000000000178b5c
   0.84%  python    libcrypto.so.3.0.1                             [.] EVP_MAC_free
   0.83%  python    libcrypto.so.3.0.1                             [.] EVP_MAC_up_ref
   0.82%  python    libcrypto.so.3.0.1                             [.] OSSL_PARAM_locate
   0.81%  python    libcrypto.so.3.0.1                             [.] 0x0000000000178e84
   0.80%  python    libcrypto.so.3.0.1                             [.] EVP_CIPHER_up_ref
   0.77%  python    libc.so.6                                      [.] __libc_calloc
   0.67%  python    libmongocrypt.so.0.0.0                         [.] bson_utf8_validate
   0.66%  python    libcrypto.so.3.0.1                             [.] 0x000000000018d113
   0.59%  python    libmongocrypt.so.0.0.0                         [.] _mongocrypt_do_decryption
   0.58%  python    libmongocrypt.so.0.0.0                         [.] _mongocrypt_buffer_init
   0.58%  python    libmongocrypt.so.0.0.0                         [.] _bson_append
   0.56%  python    libc.so.6                                      [.] __memset_avx2_unaligned_erms
   0.56%  python    libcrypto.so.3.0.1                             [.] 0x0000000000271920
   0.55%  python    libcrypto.so.3.0.1                             [.] EVP_CIPHER_free
   0.53%  python    libc.so.6                                      [.] _int_malloc
   0.51%  python    libcrypto.so.3.0.1                             [.] EVP_MD_CTX_reset
   0.48%  python    libcrypto.so.3.0.1                             [.] SHA512_Update
   0.48%  python    libmongocrypt.so.0.0.0                         [.] _mongocrypt_buffer_from_subrange
   0.40%  python    libmongocrypt.so.0.0.0                         [.] _mongocrypt_memequal
   0.39%  python    libcrypto.so.3.0.1                             [.] OPENSSL_init_crypto
   0.38%  python    libcrypto.so.3.0.1                             [.] ENGINE_finish
   0.33%  python    libcrypto.so.3.0.1                             [.] EVP_CIPHER_CTX_set_padding

Update: To support the claim of atomic refcount contention above, here's more data. I ran the same "after" benchmark in 64-thread mode only, with perf record. Overall profile is similar to above, with EVP_MD_free on top. If I annotate EVP_MD_free itself, the refcounting has been inlined and you can see 96.8% samples on the instruction following the locked operation. The next highest symbol on the profile is EVP_MD_up_ref, which has 99% of its samples following the locked op. This all represents room for improvement subsequent to this PR.

Samples: 668K of event 'task-clock:upppH', 4000 Hz, Event count (approx.): 167054750000
EVP_MD_free  /usr/lib64/libcrypto.so.3.0.1 [Percent: local period]
Percent│
       │
       │
       │    Disassembly of section .text:
       │
       │    00000000001784a0 <EVP_MD_free@@OPENSSL_3.0.0>:
  0.81 │      endbr64
  0.03 │      test    %rdi,%rdi
       │      je      98
  0.04 │      push    %r12
  0.44 │      push    %rbp
  0.45 │      mov     %rdi,%rbp
       │      sub     $0x8,%rsp
  0.06 │      mov     0x18(%rdi),%eax
  0.18 │      test    %eax,%eax
       │      je      30
  0.01 │1e:   add     $0x8,%rsp
  0.00 │      pop     %rbp
  0.03 │      pop     %r12
  0.23 │      retq
       │      nop
  0.04 │30:   mov     $0xffffffff,%eax
  0.03 │      lock    xadd    %eax,0x78(%rdi)
 96.88 │      sub     $0x1,%eax
  0.19 │      test    %eax,%eax
       │      je      a0
  0.17 │      jg      1e
       │      nop
       │48:   mov     0x60(%rbp),%rdi
       │      lea     RC4_options@@OPENSSL_3.0.0+0x4bf10,%r12
       │      mov     $0x341,%edx
       │      mov     %r12,%rsi
       │      callq   CRYPTO_free
       │      mov     0x70(%rbp),%rdi
       │      test    %rdi,%rdi
       │      je      6e
       │      callq   OSSL_PROVIDER_add_builtin@@OPENSSL_3.0.0+0x1610
       │6e:   mov     0x80(%rbp),%rdi
       │      callq   CRYPTO_THREAD_lock_free
       │      add     $0x8,%rsp
       │      mov     %r12,%rsi
       │      mov     %rbp,%rdi
       │      mov     $0x344,%edx
       │      pop     %rbp
       │      pop     %r12
       │      jmpq    CRYPTO_free
       │      nop
  0.40 │98:   retq
       │      nop
       │a0:   jmp     48

Samples: 668K of event 'task-clock:upppH', 4000 Hz, Event count (approx.): 167054750000
EVP_MD_up_ref  /usr/lib64/libcrypto.so.3.0.1 [Percent: local period]
Percent│
       │
       │
       │    Disassembly of section .text:
       │
       │    0000000000176930 <EVP_MD_up_ref@@OPENSSL_3.0.0>:
  0.20 │      endbr64
  0.09 │      mov     0x18(%rdi),%eax
  0.40 │      test    %eax,%eax
       │      je      18
       │      mov     $0x1,%eax
       │      retq
       │      nop
  0.17 │18:   lock    addl    $0x1,0x78(%rdi)
 99.00 │      mov     $0x1,%eax
  0.15 │      retq

@mdbmes mdbmes requested review from kevinAlbs and mdb-ad April 4, 2025 03:09
Copy link
Contributor

@kevinAlbs kevinAlbs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. The results are impressive.

99% of its samples following the locked op.

Am I correct in assuming this is due to perf record does not measure the off-CPU lock instruction? The samples get marked for the subsequent instruction?

@mdbmes
Copy link
Contributor Author

mdbmes commented Apr 9, 2025

Am I correct in assuming this is due to perf record does not measure the off-CPU lock instruction? The samples get marked for the subsequent instruction?

There can be a few reasons to see sampling profilers report locations that have a small offset from the actual bottleneck, and I'd have to dig through the x86 manuals to know for absolute sure which effects are due to what causes, but I think in this case it's just that the instruction pointer has already been advanced before the actual stall takes place. (And the profiler doesn't know microarchitecture details, it just logs instruction pointer plus call stack.)

@mdbmes mdbmes merged commit eacb5fa into mongodb:master Apr 15, 2025
52 of 53 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants