Skip to content

icp: Port AVX2 implementation of aes-gcm from BoringSSL #17058

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

lowjoel
Copy link

@lowjoel lowjoel commented Feb 15, 2025

Motivation and Context

Zen 3 CPUs support the VAES and VPCLMULDQ instructions which extend the width of each instruction from 128-bits to 256-bits. BoringSSL has recently implemented this version for AES-GCM and it provides up to a 80% speedup. See google/boringssl@3b6e1be.

Description

I've backported the implementation from BoringSSL, adapting code from google/boringssl@3b6e1be (but picking the tip of master), as well as from google/boringssl@62f9751 which changed the primitive signature from 6 arguments to 7 (by not implicitly relying on the address offset of the ghash structure.)

Adaptations for icp (akin to #9749) as well as to use the RET macro for kernel code are in the second commit.

The third and fourth commits combine the use_avx/use_avx2 flags into an enum, allowing toggling of the different implementations that are available. Also, define different values of CAN_USE_GCM_ASM to indicate various levels of compiler support.

How Has This Been Tested?

Compile tested.

I'm now running it on my ZFS-on-root main machine.

@robn has a Wycheproof test set (#17089) that has been merged; these changes pass.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@lowjoel lowjoel marked this pull request as draft February 15, 2025 07:34
@github-actions github-actions bot added the Status: Work in Progress Not yet ready for general review label Feb 15, 2025
@lowjoel lowjoel force-pushed the aes-gcm-avx2 branch 2 times, most recently from 940863a to fbfc371 Compare February 15, 2025 08:18
@lowjoel
Copy link
Author

lowjoel commented Feb 15, 2025

Hey @AttilaFueloep I tried to port your changes in #9749 and I can't figure out what you mean by the change with the comment // ICP does not zero key schedule. in the modified assembly sources. I've ported the other two changes, for the round offset in the AES_KEY struct, as well as the OpenSSL vs ICP representation of number of AES rounds. I've also fixed the gcm_init_vpclmulqdq_avx2 method which you mentioned ICP stores H in network order.

@lowjoel lowjoel force-pushed the aes-gcm-avx2 branch 3 times, most recently from 0ce5f7b to 1b8867f Compare February 15, 2025 09:00
@robn
Copy link
Member

robn commented Feb 15, 2025

Strong opening move! I've had a little play with it, here's what I got.

Before the final thing, you should run make checkstyle, and fix the errors it throws out.

I added this patch to let me see what it thinks is happening on my test machines:

diff --git module/zcommon/simd_stat.c module/zcommon/simd_stat.c
index d82a88ca9..da557bbb0 100644
--- module/zcommon/simd_stat.c
+++ module/zcommon/simd_stat.c
@@ -117,6 +117,10 @@ simd_stat_kstat_data(char *buf, size_t size, void *data)
 		    "pclmulqdq", zfs_pclmulqdq_available());
 		off += SIMD_STAT_PRINT(simd_stat_kstat_payload,
 		    "movbe", zfs_movbe_available());
+		off += SIMD_STAT_PRINT(simd_stat_kstat_payload,
+		    "vaes", zfs_vaes_available());
+		off += SIMD_STAT_PRINT(simd_stat_kstat_payload,
+		    "vpclmulqdq", zfs_vpclmulqdq_available());
 
 		off += SIMD_STAT_PRINT(simd_stat_kstat_payload,
 		    "osxsave", boot_cpu_has(X86_FEATURE_OSXSAVE));

With that, my old Intel 2019 junker laptop says:

# grep -E 'vaes|vpclmulqdq' /proc/spl/kstat/zfs/simd
vaes                    0
vpclmulqdq              0

My much nicer Ryzen 5 from last year says:

# grep -E 'vaes|vpclmulqdq' /proc/spl/kstat/zfs/simd
vaes                    1
vpclmulqdq              1

Unfortunately we don't have visibility on the ICP microbenchmarks like we do for checksums and raidz, but we can at least see the options available:

# cat /sys/module/zfs/parameters/icp_gcm_impl
cycle [fastest] avx avx2 generic pclmulqdq

So I'd say it's all wired up right, which is half the fun.

I set it to avx and created a dataset, then unmounted and exported. Then I set it to avx2 and tried to zfs load-key, but that didn't work:

# zfs load-key -a
Enter passphrase for 'tank/enc':
Key load error: Incorrect key provided for 'tank/enc'.

Trying to create the pool and dataset with avx2 selected gets a crash:

# zpool create tank -O encryption=aes-256-gcm -O keyformat=passphrase /home/robn/blk
Enter new passphrase:
Re-enter new passphrase:
Killed

And the kernel has a nice complaint:

[ 2769.438918] BUG: unable to handle page fault for address: ffff9c23c8307e10
[ 2769.439133] #PF: supervisor read access in kernel mode
[ 2769.439290] #PF: error_code(0x0000) - not-present page
[ 2769.439447] PGD 110e01067 P4D 110e01067 PUD 0
[ 2769.439588] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 2769.439733] CPU: 0 PID: 73114 Comm: zpool Tainted: P           OE      6.1.0-25-amd64 #1  Debian 6.1.106-3
[ 2769.440035] Hardware name: FreeBSD BHYVE/BHYVE, BIOS 14.0 10/17/2021
[ 2769.440236] RIP: 0010:aes_gcm_dec_update_vaes_avx2+0x3e/0x5e0 [zfs]
[ 2769.440596] Code: 0c 24 c4 e2 71 00 c8 c4 42 7d 5a 18 c4 62 25 00 d8 44 8b 91 78 01 00 00 46 8d 14 95 f0 ff ff ff 4e 8d 5c 91 60 c4 62 7d 5a 09 <c4> 42 7d 5a 13 c5 25 fe 1d d5 ed 0b 00 48 83 fa 7f 0f 8
6 31 03 00
[ 2769.441171] RSP: 0018:ffffb1b6c37df358 EFLAGS: 00010086
[ 2769.441339] RAX: ffffffffc089b140 RBX: ffffb1b6c37df43c RCX: ffff9c214b6eae00
[ 2769.441566] RDX: 0000000000000060 RSI: ffff9c208e918500 RDI: ffff9c208e918500
[ 2769.441793] RBP: 0000000000000006 R08: ffffb1b6c37df430 R09: ffff9c20e8eba3c0
[ 2769.442019] R10: 000000009f3073ec R11: ffff9c23c8307e10 R12: ffffb1b6c37df498
[ 2769.442102] R13: ffffffffc089b140 R14: ffffb1b6c37df420 R15: ffffb1b6c37df488
[ 2769.442102] FS:  00007f2d7ead5840(0000) GS:ffff9c23afc00000(0000) knlGS:0000000000000000
[ 2769.442102] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2769.442102] CR2: ffff9c23c8307e10 CR3: 00000001cccaa000 CR4: 00000000003506f0
[ 2769.442102] Call Trace:
[ 2769.442102]  <TASK>
[ 2769.442102]  ? __die_body.cold+0x1a/0x1f
[ 2769.442102]  ? page_fault_oops+0xd2/0x2b0
[ 2769.442102]  ? srso_alias_return_thunk+0x5/0x7f
[ 2769.442102]  ? search_bpf_extables+0x5b/0x80
[ 2769.442102]  ? exc_page_fault+0xca/0x170
[ 2769.442102]  ? asm_exc_page_fault+0x22/0x30
[ 2769.442102]  ? aesni_gcm_decrypt_avx+0x10/0x10 [zfs]
[ 2769.442102]  ? aesni_gcm_decrypt_avx+0x10/0x10 [zfs]
[ 2769.442102]  ? aes_gcm_dec_update_vaes_avx2+0x3e/0x5e0 [zfs]
[ 2769.442102]  ? aesni_gcm_decrypt_avx2+0x2a/0x50 [zfs]
[ 2769.442102]  ? gcm_decrypt_final_avx+0x167/0x460 [zfs]
[ 2769.442102]  ? crypto_update_uio+0xc0/0x120 [zfs]
[ 2769.442102]  ? aesni_gcm_decrypt_avx+0x10/0x10 [zfs]
[ 2769.442102]  ? aes_decrypt_atomic+0x1ca/0x310 [zfs]
[ 2769.442102]  ? crypto_decrypt+0x78/0x1c0 [zfs]
[ 2769.442102]  ? zio_do_crypt_uio+0x2ce/0x400 [zfs]
[ 2769.442102]  ? zio_crypt_key_unwrap+0x254/0x480 [zfs]
[ 2769.442102]  ? dsl_crypto_key_open.constprop.0+0x2d1/0x350 [zfs]
[ 2769.442102]  ? dsl_crypto_key_open.constprop.0+0x2d1/0x350 [zfs]
[ 2769.442102]  ? spa_keystore_dsl_key_hold_dd+0x12e/0x280 [zfs]
[ 2769.442102]  ? __kmalloc_node+0x4c/0x150
[ 2769.442102]  ? spa_keystore_create_mapping+0x79/0x200 [zfs]
[ 2769.442102]  ? dsl_dataset_hold_obj_flags+0x48/0x90 [zfs]
[ 2769.442102]  ? dsl_pool_create+0x1bc/0x490 [zfs]
[ 2769.442102]  ? spa_create+0x83d/0xdb0 [zfs]
[ 2769.442102]  ? zfs_ioc_pool_create+0xab/0x310 [zfs]
[ 2769.442102]  ? zfsdev_ioctl_common+0x6a0/0x7c0 [zfs]
[ 2769.442102]  ? __kmalloc_node+0xbf/0x150
[ 2769.442102]  ? srso_alias_return_thunk+0x5/0x7f
[ 2769.442102]  ? zfsdev_ioctl+0x4f/0xd0 [zfs]
[ 2769.442102]  ? __x64_sys_ioctl+0x90/0xd0
[ 2769.442102]  ? do_syscall_64+0x55/0xb0

That's all I have time for tonight. This is a good start!

@robn
Copy link
Member

robn commented Feb 15, 2025

Oh the other thing I forgot to add, after loading the module it says fastest, and fails as above. So it is selecting avx2 as fastest, which is good! (we need a stat for it, for sure).

@lowjoel lowjoel force-pushed the aes-gcm-avx2 branch 14 times, most recently from ae64131 to c0a1e8e Compare February 16, 2025 07:12
} else {
/*
* Handle the "cycle" implementation by creating avx and
* non-avx contexts alternately.
*/
gcm_ctx->gcm_use_avx = gcm_toggle_avx();
gcm_ctx->gcm_use_avx2 = gcm_toggle_avx2();
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty sure the cycle behaviour here isn't correct.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be fixed in #17061.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So does this PR have a dependency on #17061?

Copy link
Author

@lowjoel lowjoel Feb 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think cycle will still work, it just won't toggle the same way (or the assumed way) - is that behaviour documented somewhere? Under what circumstances would people be using cycle?

When this code runs on an AVX2 processor, cycle will only toggle between AVX2 and generic, ignoring AVX - I don't know if that is a deal breaker enough to consider it a "dependency"

AVX only processors should still cycle the same way.

(at least that's the intention - might have to think a bit deeper about this)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tonyhutter I've had another go at the cycle code - since movbe only applies to AVX I've modified the toggling for that to not apply to AVX2. Furthermore, if bswap is needed neither accelerated method will be usable.

@lowjoel lowjoel marked this pull request as ready for review February 16, 2025 23:10
@github-actions github-actions bot added Status: Code Review Needed Ready for review and testing and removed Status: Work in Progress Not yet ready for general review labels Feb 16, 2025
Comment on lines 1184 to 1188
static inline size_t
gcm_simd_get_htab_size(void)
{
return (2 * 6 * 2 * sizeof (uint64_t));
}
Copy link
Contributor

@AttilaFueloep AttilaFueloep Mar 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change isn't accurate. Each implementation has it's own Htab size and we have to handle it.

On lines 678 - 682 above we are allocating the Htab using this function which will return 192. But we need 16*16=256 bytes (const u128 Htable[16]). The reason this works is by coincidence. In the BoringSSL code I see

We do not use Htable[12..15].

So 192 bytes are sufficient. Still it seems like bad style to declare u128 Htable[16] but allocate u128 Htable[12].

Not sure if the removed return 0 for the non-SIMD case is needed, will look tomorrow

Copy link
Author

@lowjoel lowjoel Mar 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if the removed return 0 for the non-SIMD case is needed, will look tomorrow

Line 675-676:

if (gcm_ctx->gcm_use_avx == B_TRUE) {
	size_t htab_len = gcm_simd_get_htab_size(gcm_ctx->gcm_use_avx);

I only ever see gcm_simd_get_htab_size called with TRUE.

I agree that relying on the fact that BoringSSL only uses the first 12 bytes is a happy coincidence; I'll look at how to rewire tomorrow.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if the removed return 0 for the non-SIMD case is needed, will look tomorrow
I only ever see gcm_simd_get_htab_size called with TRUE.

Agreed.

I agree that relying on the fact that BoringSSL only uses the first 12 bytes is a happy coincidence; I'll look at how to rewire tomorrow.

Not a big deal. You could either adjust the prototypes to use u128[12] or change gcm_simd_get_htab_size to return 256 for the AVX2 case. I'm fine either way. This is mostly about avoiding readers wondering about the size of Htab.

Copy link
Contributor

@AttilaFueloep AttilaFueloep left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the risk of nitpicking, I think we should reconsider the module parameter name. Currently it's called avx2 but the main feature it adds is support for the VAES/VCLMUL extension to AES-NI/CLMUL. So I'd suggest something like vaes-avx2. (Actually it uses AVX2 only due to the fact that Zen 3 has no AVX-512/AVX10.) Thoughts?

I've some more nits but since all of them are in areas I'm refactoring right now it wouldn't make much sense to bring them up here. There's one optimization possibility regarding GCM_AVX_MIN_{DE,EN}CRYPT_BYTES (should both be 16 for VAES). But that's hard to fix and I already solved it locally. So I think it's OK to leave it as is right know.

So modulo Htab size and module parameter name we are good to go I'd say.

@lowjoel
Copy link
Author

lowjoel commented Mar 8, 2025

At the risk of nitpicking, I think we should reconsider the module parameter name. Currently it's called avx2 but the main feature it adds is support for the VAES/VCLMUL extension to AES-NI/CLMUL. So I'd suggest something like vaes-avx2. (Actually it uses AVX2 only due to the fact that Zen 3 has no AVX-512/AVX10.) Thoughts?

Yeah. I think you're right. The commit calls it VAES+AVX2; I should have stuck with it. I'll push this later today

So modulo Htab size and module parameter name we are good to go I'd say.

Added htab refactor in c49a94c - the method seemed to not fit in the program flow any more hence the slightly more aggressive rewrite. Thanks @AttilaFueloep

@lowjoel lowjoel force-pushed the aes-gcm-avx2 branch 3 times, most recently from a6b8673 to 3fc6ac8 Compare March 8, 2025 10:14
@lowjoel
Copy link
Author

lowjoel commented Mar 8, 2025

@AttilaFueloep renamed to avx2-vaes in 59fb58a; identifiers in code still remains just avx2 for symmetry with the assembly.

@lowjoel lowjoel force-pushed the aes-gcm-avx2 branch 2 times, most recently from 7f1fb8c to 1f02def Compare March 9, 2025 03:28
@AttilaFueloep
Copy link
Contributor

AttilaFueloep commented Mar 10, 2025

@lowjoel Sorry for not being clear enough, was in a hurry. My main concern was the user visible module parameter that is shown in e.g. cat /sys/module/zfs/parameters/icp_gcm_impl. I think it would be good enough to just change the string, and keep the #define as is.

gcm.c L851

static const struct {
	const char *name;
	uint32_t sel;
} gcm_impl_opts[] = {
		{ "cycle",	IMPL_CYCLE },
		{ "fastest",	IMPL_FASTEST },
#ifdef CAN_USE_GCM_ASM
		{ "avx",	IMPL_AVX },
=>		{ "avx2",	IMPL_AVX2 },
#endif
};

@AttilaFueloep
Copy link
Contributor

AttilaFueloep commented Mar 10, 2025

@lowjoel Regarding the Htab size: I'd have passed the GCM context to gcm_simd_get_htab_size() and returned an appropriate value there, but your change works as well.

@tonyhutter
Copy link
Contributor

@AttilaFueloep @lowjoel just checking in - are you guys pretty happy with where this PR is right now? Are you just waiting on more approvals?

@AttilaFueloep
Copy link
Contributor

@tonyhutter Well, I'm waiting for feedback from @lowjoel regarding my suggestion to rename the module parameter to avx2-vaes (#17058 (comment)). Besides that, I've no objections.

@lowjoel
Copy link
Author

lowjoel commented Mar 25, 2025

@tonyhutter Well, I'm waiting for feedback from @lowjoel regarding my suggestion to rename the module parameter to avx2-vaes (#17058 (comment)). Besides that, I've no objections.

Crap. I think I made a change in my branch locally and the push got rejected and I got distracted with something else... umm. Let me go figure it out later today, or if you'd want to just push the change to my branch that works too. Or if you'd want to merge and squash the change up into also works for me.

@AttilaFueloep
Copy link
Contributor

No worries, take your time.

@lowjoel
Copy link
Author

lowjoel commented Mar 26, 2025

OK it seems like I did do the work but just not the string shown to the user. Just rebased and pushed. See 42dfe38

lowjoel and others added 8 commits March 26, 2025 08:30
This uses the AVX2 versions of the AESENC and PCLMULQDQ instructions; on
Zen 3 this provides an up to 80% performance improvement.

Original source:
https://github.com/google/boringssl/blob/13840dd094f9e9c1b00a7368aa25e656554221f1/gen/bcm/aes-gcm-avx2-x86_64-linux.S

See the original BoringSSL commit at
google/boringssl@3b6e1be.

Signed-off-by: Joel Low <[email protected]>
 - Accept GCM H variable in network endianness (ICP convention)
 - Fix round count offset in AES_KEY struct (ICP convention)
 - Use RET macro for kernel code
 - Explicitly use .balign directive
 - Use ENTRY_ALIGN and SET_SIZE macros
 - Use ENDBR macro

Signed-off-by: Joel Low <[email protected]>
The BoringSSL AVX2 implementation allocates uint128_t[16] but only uses
the first 12 elements. Be explicit in the code.

Signed-off-by: Joel Low <[email protected]>
@lowjoel
Copy link
Author

lowjoel commented Mar 26, 2025

@AttilaFueloep we've got a few new commits when I was checking the license - https://github.com/google/boringssl/commits/main/crypto/fipsmodule/aes/asm/aes-gcm-avx2-x86_64.pl - I will backport the vzeroupper change after this is merged, or if it's still open later today (about to head out for work)

@@ -767,6 +754,9 @@ gcm_impl_get_ops(void)
break;
#ifdef CAN_USE_GCM_ASM
case IMPL_AVX:
#if CAN_USE_GCM_ASM >= 2
case IMPL_AVX2:
#endif
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realise that by specifying icp_gcm_impl = avx2 I'm seeing 50% of my CPU time spent in gcm_generic_mul, I believe because of this branch here. If I specify fastest this doesn't happen. Why do we override the generic implementation rather than the fastest when we explicitly set AVX/AVX2?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. If you end up in gcm_generic_mul you're in a non avx2 code path. I'd suspect either setting or using ctx->impl or activating the implementation fails. I'd run this through gdb but I've no VAES box currently. (That may change soon though.)

The code above just make sure that a valid gcm_impl_ops is returned for still open non-avx contexts , see the comment.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you mean now - if this path is called it's because it's an in flight operation while the global changed. Hmm, then I need to dig further.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly. I'll try to have a look over the weekend.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Code Review Needed Ready for review and testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants