Introduce ARM Neon SIMD. #743
The gain seems to be 7% on real-world benchmarks:
Also note that I did one more refactoring to make the introduction of SIMD easier, so you still have a conflict.
ext/json/ext/generator/simd.h
Outdated
uint8x16x4_t load_uint8x16_4(const unsigned char *table, int offset) {
  uint8x16x4_t tab;
  for(int i=0; i<4; i++) {
    tab.val[i] = vld1q_u8(table+offset+(i*16));
  }
  return tab;
}
Isn't that just vld4q_u8?
https://developer.arm.com/architectures/instruction-sets/intrinsics/vld4q_u8
Unfortunately it's not. vld4q_u8 de-interleaves the data across the 4 vector registers, so consecutive bytes land in different registers.
% cat load-test.c
#include <stdio.h>
#include <stdint.h>
#include <arm_neon.h>

void print_vec(char *msg, uint8x16_t vec) {
  printf("%s\n[ ", msg);
  uint8_t store[16] = {0};
  vst1q_u8(store, vec);
  for(int i=0; i<16; i++) {
    printf("%3d ", store[i]);
  }
  printf("]\n");
}

uint8x16x4_t load_table(uint8_t *table, int offset) {
  uint8x16x4_t tab;
  for(int i=0; i<4; i++) {
    tab.val[i] = vld1q_u8(table+offset+(i*16));
  }
  return tab;
}

int main(void) {
  uint8_t table[256];
  for(int i=0; i<256; i++) {
    table[i] = i;
  }

  uint8x16x4_t tab1 = load_table(table, 0);
  print_vec("tab1.val[0]", tab1.val[0]);
  print_vec("tab1.val[1]", tab1.val[1]);
  print_vec("tab1.val[2]", tab1.val[2]);
  print_vec("tab1.val[3]", tab1.val[3]);
  printf("\n");

  uint8x16x4_t tab1_2 = vld4q_u8(table);
  print_vec("tab1_2.val[0]", tab1_2.val[0]);
  print_vec("tab1_2.val[1]", tab1_2.val[1]);
  print_vec("tab1_2.val[2]", tab1_2.val[2]);
  print_vec("tab1_2.val[3]", tab1_2.val[3]);
  return 0;
}
% ./load-test
tab1.val[0]
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ]
tab1.val[1]
[ 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 ]
tab1.val[2]
[ 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 ]
tab1.val[3]
[ 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 ]
tab1_2.val[0]
[ 0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60 ]
tab1_2.val[1]
[ 1 5 9 13 17 21 25 29 33 37 41 45 49 53 57 61 ]
tab1_2.val[2]
[ 2 6 10 14 18 22 26 30 34 38 42 46 50 54 58 62 ]
tab1_2.val[3]
[ 3 7 11 15 19 23 27 31 35 39 43 47 51 55 59 63 ]
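The two layouts printed above can be modelled in plain C without NEON. This is only a scalar sketch of the lane mapping, not the intrinsics themselves; the names `model_vld1q_x4` and `model_vld4q` are hypothetical and exist purely for illustration.

```c
#include <stdint.h>

/* Scalar model of four consecutive vld1q_u8 loads:
   register j holds the contiguous bytes mem[16*j] .. mem[16*j + 15]. */
void model_vld1q_x4(const uint8_t *mem, uint8_t val[4][16]) {
    for (int j = 0; j < 4; j++) {
        for (int i = 0; i < 16; i++) {
            val[j][i] = mem[16 * j + i];
        }
    }
}

/* Scalar model of vld4q_u8: memory is read as 16 four-byte structures
   and de-interleaved, so register j holds mem[j], mem[4 + j], mem[8 + j], ... */
void model_vld4q(const uint8_t *mem, uint8_t val[4][16]) {
    for (int j = 0; j < 4; j++) {
        for (int i = 0; i < 16; i++) {
            val[j][i] = mem[4 * i + j];
        }
    }
}
```

With a table where `table[i] == i`, the second model puts 1, 5, 9, ... into register 1, which matches the `tab1_2.val[1]` row in the output above.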
Wow, that's so weird.
Well, maybe that loop should be unrolled then. I suspect the compiler does it, but we might as well be explicit.
Can you just include the implementation for the regular escaping? I'm not sure the script-safe version is quite worth it.
…tion. Also store the potential matches directly rather than looking up values in the escape table.
ext/json/ext/generator/generator.c
Outdated
if ((ch_len = search_escape_basic_neon_advance_lut(search)) != 0) {
    return ch_len;
}

// if ((ch_len = search_escape_basic_neon_advance_rules(search)) != 0) {
//     return ch_len;
// }
It seems like a toss-up which one is best. That might be an artifact of my M1 MacBook Air being passively cooled: it gets warm after I run the benchmark over and over.
Comparison between
Running it a second time:
…e only need 128 bytes for the lookup table as the top 128 bytes are all zeros.
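One way to see why a 128-byte table can be enough: AArch64 table lookups such as `vqtbl4q_u8` return 0 for any index outside the 64-byte table. The scalar sketch below models that behaviour in plain C. It assumes the lookup is done with a TBL-style instruction and that the table stores 1 for bytes needing an escape; neither is confirmed by this PR, and `model_tbl4`, `build_escape_table128` and `lookup_escape128` are hypothetical names.

```c
#include <stdint.h>

/* Scalar model of one lane of vqtbl4q_u8: a lookup into a 64-byte table
   that yields 0 when the index is out of range (>= 64). */
static uint8_t model_tbl4(const uint8_t table[64], uint8_t idx) {
    return idx < 64 ? table[idx] : 0;
}

/* Assumed table contents: 1 for bytes that basic JSON escaping must handle
   (control characters, '"' and '\\'), 0 otherwise. Only 128 entries. */
void build_escape_table128(uint8_t table[128]) {
    for (int c = 0; c < 128; c++) {
        table[c] = (c < 0x20 || c == '"' || c == '\\') ? 1 : 0;
    }
}

/* Look a byte up using two 64-byte lookups over the 128-byte table.
   Bytes >= 128 are out of range for both halves, so they map to 0
   without the table needing any entries for them. */
uint8_t lookup_escape128(const uint8_t table[128], uint8_t byte) {
    uint8_t lo = model_tbl4(table, byte);
    /* byte - 64 wraps around to 192..255 for byte < 64, which is out of range */
    uint8_t hi = model_tbl4(table + 64, (uint8_t)(byte - 64));
    return lo | hi;
}
```

In this model every byte from 0x80 to 0xFF resolves to 0, which is what the all-zero upper half of a 256-byte table would have produced.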
Not sure why but it's way more modest on my machine (Air M3):
Apologies for going dark for a while. I've been trying to make incremental improvements on a different branch (found here). My hope was that using a move mask would be faster than […]. Feel free to try it out though.
That's no worries at all. I want to release a […]. After that I think I can start merging some SIMD stuff. I'd like to go with the smallest possible useful SIMD acceleration to ensure it doesn't cause issues for people. If it works well, we can then go further. So yeah, no rush.
@byroot if you have a few minutes, would you be able to check out this branch and benchmark it against master? You'll have to tweak your compare script a bit to compile this branch with […]. This branch uses the bit-twiddling, sort of platform-agnostic SIMD code if the SIMD code is disabled via a […]. The results on my M1:
With that compilation flag and compared to
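For readers unfamiliar with the "bit twiddling" fallback mentioned above, here is a minimal sketch of the general SWAR (SIMD within a register) idea: test 8 bytes at a time in a `uint64_t` for anything that basic JSON escaping must handle. This is not the branch's actual code; `word_needs_escape` and `first_escape_index` are hypothetical names, and the set of escaped bytes (control characters, `"` and `\`) is an assumption.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ONES  UINT64_C(0x0101010101010101)
#define HIGHS UINT64_C(0x8080808080808080)

/* Nonzero if any of the 8 bytes in w is < 0x20, '"' or '\\'.
   Uses the classic "has byte less than n" and "has zero byte" tricks. */
static int word_needs_escape(uint64_t w) {
    uint64_t ctrl   = (w - ONES * 0x20) & ~w & HIGHS;   /* any byte < 0x20 */
    uint64_t q      = w ^ (ONES * (uint64_t)'"');       /* '"' bytes become 0 */
    uint64_t quote  = (q - ONES) & ~q & HIGHS;
    uint64_t b      = w ^ (ONES * (uint64_t)'\\');      /* '\\' bytes become 0 */
    uint64_t bslash = (b - ONES) & ~b & HIGHS;
    return (ctrl | quote | bslash) != 0;
}

/* Index of the first byte that needs escaping, or len if there is none. */
size_t first_escape_index(const unsigned char *s, size_t len) {
    size_t i = 0;
    /* Skip clean 8-byte chunks quickly. */
    while (i + 8 <= len) {
        uint64_t w;
        memcpy(&w, s + i, 8);   /* avoids unaligned access issues */
        if (word_needs_escape(w)) break;
        i += 8;
    }
    /* Scalar scan of the flagged chunk and of the tail. */
    for (; i < len; i++) {
        unsigned char c = s[i];
        if (c < 0x20 || c == '"' || c == '\\') return i;
    }
    return len;
}
```

The word test is only used as a yes/no filter, so it does not depend on endianness; the exact position is always found by the scalar loop.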
Version 2 of the introduction of ARM Neon SIMD.
There are currently two implementations:
Benchmarks (Lookup table)
Benchmarks (Rules based)
I am still working on this but I wanted to share progress.
Edit: Looks like I missed one commit so I'll have to resolve some merge conflicts.