Dedicated SV copying code in place of Perl_sv_setsv_flags #23202
Cloning is a rather unfortunate choice of words, given that it has a very specific meaning in our codebase that is quite different from what this PR is about. Renaming the PR may be helpful.
I've made a lot of changes following earlier comments - thanks for those - and have finally force-pushed. These changes aren't complete. For example:
Since around the 5.10 era, `Perl_sv_setsv_flags` has unconditionally set `SvPOK_only(dsv)` in the `SVp_POK` branch. The associated comment reads:

    /* Whichever path we take through the next code, we want this true,
       and doing it now facilitates the COW check. */

Things have changed since 5.10 though, in particular the use of `SVf_POK` to distinguish a value that started off as a string from one that was originally an integer/float and later stringified.

This commit:
* Removes the `SvPOK_only(dsv)` in favour of `SvOK_off(dsv)` and hoisting the copying of `sflags` over.
* Transforms the subsequent now-redundant `SVf_POK` toggles into asserts (to help reduce the chance of inadvertent behaviour changes).
`Perl_sv_setsv_flags` is a hot function that contains liberal sprinklings of `SvOK_off()`. This commit changes two instances, where the operand SV cannot possibly be using the OOK hack, to do direct flag twiddling instead.

`SvOK_off()` does two things:
1. Toggles off some flags: `SvFLAGS(sv) &= ~(SVf_OK|SVf_IVisUV|SVf_UTF8)`
2. Checks for use of the OOK hack and undoes it: `((void)(SvOOK(sv) && (sv_backoff(sv),0)))`

At least some compilers seem to struggle to figure out when `SvOOK(sv)` cannot be true and to then elide the call to `sv_backoff()`. Skipping that check by hand is desirable when:
1. ssv & dsv are both lower types than SVt_PV and cannot support OOK
2. inside a block following a conditional check that OOK is not in use

In the two cases identified, the flag toggling is now done explicitly.
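To make the two halves of `SvOK_off()` concrete, here is a minimal, self-contained sketch using a toy scalar. Everything prefixed `toy_`/`TOY_` is hypothetical: the flag values, struct layout, and helper names are assumptions for illustration and are not perl's real definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits for a toy scalar; NOT perl's real values. */
#define TOY_OK      0x0Fu   /* "has a value" bits */
#define TOY_IVISUV  0x10u
#define TOY_UTF8    0x20u
#define TOY_OOK     0x40u   /* start of the string buffer has been offset */

typedef struct {
    uint32_t flags;
    char    *buf;      /* current start of the string data */
    size_t   offset;   /* how far buf was advanced while TOY_OOK is set */
} toy_sv;

static int toy_backoff_calls = 0;   /* lets a caller observe the slow path */

/* Undo the offset hack: move buf back to the start of its allocation. */
static void toy_backoff(toy_sv *sv) {
    toy_backoff_calls++;
    sv->buf   -= sv->offset;
    sv->offset = 0;
    sv->flags &= ~TOY_OOK;
}

/* General form: clear the value flags AND undo the offset hack if present. */
static void toy_ok_off(toy_sv *sv) {
    sv->flags &= ~(TOY_OK | TOY_IVISUV | TOY_UTF8);
    if (sv->flags & TOY_OOK)
        toy_backoff(sv);
}

/* Specialised form for call sites that already know the offset hack is
 * not in use: a single read-modify-write, with no branch and no call. */
static void toy_ok_off_no_ook(toy_sv *sv) {
    assert(!(sv->flags & TOY_OOK));   /* document the precondition */
    sv->flags &= ~(TOY_OK | TOY_IVISUV | TOY_UTF8);
}
```

The assert in the specialised form keeps the precondition checked in debugging builds while costing nothing when `NDEBUG` is defined, which mirrors the role the commit gives to its own asserts.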
Perl_sv_freshcopy_flags creates a fresh SV that contains the values of its source SV argument. It's like calling `new_SV(dsv)` followed by `sv_setsv_flags(dsv, ssv, flags)`, but is optimized for a brand new destination SV and the most common code paths.

The intended initial users for this new function were:
* Perl_sv_mortalcopy_flags (still in sv.c)
* Perl_newSVsv_flags (now a simple function in sv_inline.h)

Perl_sv_freshcopy_flags handles the following cases:
* SVt_NULL
* SVt_IV
* SVt_NV
* SVt_PV
* SVt_LAST

For everything else, it calls S_sv_freshcopy_PVxx which handles:
* SVt_INVLIST
* SVt_PVIV
* SVt_PVNV
* SVt_PVMG - with no GET magic

For everything else, there's a fall back to sv_setsv_flags.

S_sv_freshcopy_POK is a dedicated helper for string swipe/COW/copy logic and is called from both Perl_sv_freshcopy_flags and S_sv_freshcopy_PVxx.

With these changes compared with the previous commit:
* `perl -e 'for (1..100_000_0) { my $x = { (1) x 1000 }; }'` runs about 20% faster
* `perl -e 'for (1..100_000_0) { my $x = { ("Perl") x 250 }; }'` runs about 40% faster
* `perl -e 'for (1..100_000_0) { my $x = { a => 1, b => 2, c => 3, d => 4, e => 5 }; }'` is a touch faster, but within the margin for error
* `perl -e 'for (1..100_000_0) { my $x = { a => "Perl", b => "Perl", c => "Perl", d => "Perl", e => "Perl" } ; }'` runs about 17% faster
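The shape of that dispatch (cheap, type-specific paths for a brand new destination, with a fall back to the general routine) can be sketched with a toy value type. This is only an illustration under stated assumptions: `toy_sv`, `toy_freshcopy` and `toy_setsv_generic` are hypothetical names, and the real functions deal with flags, magic, COW and buffer stealing that are not modelled here.

```c
#include <stdlib.h>
#include <string.h>

typedef enum { TOY_NULL, TOY_IV, TOY_NV, TOY_PV, TOY_OTHER } toy_type;

typedef struct {
    toy_type type;
    long     iv;
    double   nv;
    char    *pv;   /* owned, NUL-terminated string when type == TOY_PV */
} toy_sv;

static int toy_slow_path_calls = 0;   /* counts uses of the fallback */

/* Stand-in for the general-purpose assignment routine: it must cope with
 * a destination that may already hold a value, so it does more work. */
static void toy_setsv_generic(toy_sv *dst, const toy_sv *src) {
    toy_slow_path_calls++;
    free(dst->pv);                 /* destination may already own a string */
    dst->pv   = NULL;
    dst->type = src->type;
    dst->iv   = src->iv;
    dst->nv   = src->nv;
    if (src->pv) {
        size_t len = strlen(src->pv) + 1;
        dst->pv = malloc(len);
        if (dst->pv)
            memcpy(dst->pv, src->pv, len);
    }
}

/* Fresh copy: the destination is brand new, so there is nothing to tear
 * down and the common types can be handled with a few direct stores. */
static toy_sv *toy_freshcopy(const toy_sv *src) {
    toy_sv *dst = malloc(sizeof *dst);
    if (!dst)
        return NULL;
    *dst = (toy_sv){ TOY_NULL, 0, 0.0, NULL };

    switch (src->type) {
    case TOY_NULL:
        break;
    case TOY_IV:
        dst->type = TOY_IV;
        dst->iv   = src->iv;
        break;
    case TOY_NV:
        dst->type = TOY_NV;
        dst->nv   = src->nv;
        break;
    case TOY_PV: {
        size_t len = strlen(src->pv) + 1;
        dst->pv = malloc(len);
        if (!dst->pv) { free(dst); return NULL; }
        memcpy(dst->pv, src->pv, len);
        dst->type = TOY_PV;
        break;
    }
    default:                       /* everything else: general routine */
        toy_setsv_generic(dst, src);
        break;
    }
    return dst;
}
```

The design point is that the fast paths never have to inspect or clean up the destination, because it is known to be empty; only the uncommon types pay for the general routine.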
Besides using the just-introduced faster path for SV copying, this allows the check for SV_GMAGIC to be pushed into the called function without having to worry about SV leaks.

Two additional micro-optimizations are also in this commit:
* A pointer to xav_fill is cached for use in the loop. This can be used directly to update AvFILLp(av), rather than having to get there from av's SV* each time.
* The value of the loop iterator, i, is directly written into xav_fill, rather than getting the value in that slot, incrementing it (to get the same value as i), and writing it back.
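The two micro-optimizations can be illustrated with a toy array whose fill index lives in a separately allocated body, loosely in the manner of an AV. The names `toy_av`, `toy_body`, `toy_fill_reload` and `toy_fill_cached` are hypothetical, and the sketch assumes the array starts empty (fill of -1) with enough slots already allocated.

```c
#include <stddef.h>

typedef struct {
    ptrdiff_t fill;    /* index of the last used slot, -1 when empty */
    int      *items;
} toy_body;

typedef struct {
    toy_body *body;    /* the head only points at the body */
} toy_av;

/* Baseline: reach the fill slot through the head on every iteration,
 * then read it, increment it, and write it back. */
static void toy_fill_reload(toy_av *av, const int *src, ptrdiff_t n) {
    for (ptrdiff_t i = 0; i < n; i++) {
        av->body->items[i] = src[i];
        av->body->fill = av->body->fill + 1;
    }
}

/* Variant: cache a pointer to the fill slot once, and store the loop
 * index directly, since on an initially empty array it equals the
 * incremented fill value. */
static void toy_fill_cached(toy_av *av, const int *src, ptrdiff_t n) {
    ptrdiff_t *fillp = &av->body->fill;
    for (ptrdiff_t i = 0; i < n; i++) {
        av->body->items[i] = src[i];
        *fillp = i;    /* fill stays accurate after every element */
    }
}
```

Both versions keep the fill index up to date after each element rather than setting it once at the end; the cached variant just replaces a dependent load plus read-modify-write with a single store per iteration.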
sv.c
/* Passes the swipe test. */
SvLEN_set(dsv, len);
SvCUR_set(dsv, cur);
SvPV_set(dsv, SvPVX_mutable(ssv));
Split the memory read `SvPVX_mutable(ssv)` out of `SvPV_set(dsv, ...)`: do the read and save the result in a C auto right before `SvLEN_set(dsv, len);`. The reason is C's pointer aliasing rules. The C abstract machine doesn't know whether `SvCUR_set(dsv, cur)` will write on top of the 8 bytes of memory backing `SvPVX_mutable(ssv)`.
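A minimal sketch of that suggestion, using a toy head/body pair rather than perl's real structs. The names `toy_head`, `toy_body` and `toy_take_buf` are hypothetical. Caching the destination's body pointer in a local goes slightly beyond what the comment asks for; it is included as an assumption about how the same reasoning would apply to the repeated `SvANY()` load.

```c
#include <stddef.h>

typedef struct {
    size_t cur;        /* bytes in use */
    size_t len;        /* bytes allocated */
} toy_body;

typedef struct {
    toy_body *any;     /* pointer to the body */
    char     *pv;      /* string buffer pointer kept in the head */
} toy_head;

/* Read everything needed from memory into locals first, then do the
 * stores. A compiler that cannot prove the stores below leave ssv->pv
 * and dsv->any untouched would otherwise have to reload them. */
static void toy_take_buf(toy_head *dsv, toy_head *ssv,
                         size_t cur, size_t len) {
    char     *buf  = ssv->pv;    /* source buffer, read exactly once */
    toy_body *body = dsv->any;   /* destination body, read exactly once */

    body->len = len;
    body->cur = cur;
    dsv->pv   = buf;
}
```

Whether this changes the generated machine code depends on the compiler and its aliasing assumptions; on some toolchains the locals may make no difference at all.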
This doesn't make any difference on gcc builds. What do you see on MSVC builds?
I'd rather be safe than sorry. If the change does nothing on one compiler and makes no change to the final machine code, that is perfectly fine. The intent is to encourage C compilers to do the right thing, or to make it as easy as possible for the C compiler to do the right thing, but Perl/C devs can't order a C compiler to emit any deterministic byte sequence of machine code. C devs are writing in C. They are not writing in asm, which is WYSIWYG regarding the final byte sequence of machine code emitted.
Here is what I am trying to stop at all costs.
4704: SvPV_set(dsv,
000000013F40639F 48 8B 47 10 mov rax,qword ptr [rdi+10h]
000000013F4063A3 48 FF 40 F0 inc qword ptr [rax-10h]
4705: HEK_KEY(share_hek_hek(SvSHARED_HEK_FROM_PV(SvPVX_const(ssv)))));
4706: }
4707: SvLEN_set(dsv, len);
000000013F4063A7 48 8B 47 10 mov rax,qword ptr [rdi+10h]
000000013F4063AB 48 89 43 10 mov qword ptr [rbx+10h],rax
000000013F4063AF 48 8B 03 mov rax,qword ptr [rbx]
000000013F4063B2 4C 89 68 18 mov qword ptr [rax+18h],r13
4708: SvCUR_set(dsv, cur);
000000013F4063B6 48 8B 03 mov rax,qword ptr [rbx]
000000013F4063B9 4C 89 60 10 mov qword ptr [rax+10h],r12
4709: SvIsCOW_on(dsv);
000000013F4063BD 0F BA 6B 0C 1C bts dword ptr [rbx+0Ch],1Ch
4710: } else {
000000013F4063C2 E9 6A FE FF FF jmp Perl_sv_setsv_flags+0A31h (013F406231h)
4743: {
4744: const char *vstr_pv;
4745: STRLEN vstr_len;
4746: if ((vstr_pv = SvVSTRING(ssv, vstr_len))) {
More specifically
000000013F4063AF 48 8B 03 mov rax,qword ptr [rbx]
000000013F4063B2 4C 89 68 18 mov qword ptr [rax+18h],r13
4708: SvCUR_set(dsv, cur);
000000013F4063B6 48 8B 03 mov rax,qword ptr [rbx]
You see that `dsv`'s `SvANY()` member (the `mov rax,qword ptr [rbx]` load) was read twice in a row for no good reason? That is what I am trying to stop.
The swipe code in my WIP commit now looks like the following:
if (!(flags & SV_NOSTEAL) && S_SvPV_can_swipe_buf(ssv, sflags, cur, len) ) {
/* Passes the swipe test. */
char * buf = SvPVX_mutable(ssv);
SvLEN_set(dsv, len);
SvCUR_set(dsv, cur);
SvPV_set(dsv, buf);
assert(!SvOOK(ssv)); /* According to S_SvPV_can_swipe_buf() */
/* (void)SvOK_off(ssv); but without the superfluous SvOOK_off(ssv)) */
SvFLAGS(ssv) &= ~(SVf_OK|SVf_IVisUV|SVf_UTF8|SVs_TEMP);
SvPV_set(ssv, NULL);
SvLEN_set(ssv, 0);
SvCUR_set(ssv, 0);
return;
}
Is that what you were looking for?
(Again, no difference in the disassembly from a standard gcc build, so I can't tell from my current dev box.)
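For readers unfamiliar with the "swipe" idea, the following toy sketch shows the general pattern: when the source is a temporary that nobody else references, the destination takes over its buffer and the source is left empty; otherwise the bytes are copied. The predicate `toy_can_swipe` and its three conditions are assumptions for illustration only and are not the actual tests performed by `S_SvPV_can_swipe_buf`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define TOY_TEMP 0x1u   /* hypothetical: source is a temporary */
#define TOY_POK  0x2u   /* hypothetical: holds a valid string */

typedef struct {
    unsigned flags;
    unsigned refcnt;
    char    *pv;
    size_t   cur;   /* bytes in use */
    size_t   len;   /* bytes allocated, 0 if the buffer is not owned */
} toy_sv;

/* Assumed conditions: temporary, unshared, and owning its buffer. */
static bool toy_can_swipe(const toy_sv *ssv) {
    return (ssv->flags & TOY_TEMP) && ssv->refcnt == 1 && ssv->len > 0;
}

/* dsv is assumed to be fresh (no buffer of its own).
 * Returns true if the buffer was stolen, false if it was copied. */
static bool toy_assign_string(toy_sv *dsv, toy_sv *ssv) {
    if (toy_can_swipe(ssv)) {
        char *buf = ssv->pv;          /* read the source before any store */
        dsv->len = ssv->len;
        dsv->cur = ssv->cur;
        dsv->pv  = buf;
        dsv->flags |= TOY_POK;
        /* Leave the source empty so it cannot free the stolen buffer. */
        ssv->flags &= ~(TOY_POK | TOY_TEMP);
        ssv->pv  = NULL;
        ssv->len = 0;
        ssv->cur = 0;
        return true;
    }
    dsv->pv = malloc(ssv->cur + 1);
    if (!dsv->pv)
        abort();
    memcpy(dsv->pv, ssv->pv, ssv->cur);
    dsv->pv[ssv->cur] = '\0';
    dsv->cur = ssv->cur;
    dsv->len = ssv->cur + 1;
    dsv->flags |= TOY_POK;
    return false;
}
```

Stealing avoids both an allocation and a copy, which is why the swipe test sits at the front of the string path.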
/* (void)SvOK_off(ssv); but without the superfluous SvOOK_off(ssv)) */
SvFLAGS(ssv) &= ~(SVf_OK|SVf_IVisUV|SVf_UTF8|SVs_TEMP);

SvPV_set(ssv, NULL);
Move this line to below `SvCUR_set(ssv, 0);`, for pointer aliasing, ILP and pipeline stall reasons. The value of `SvANY(ssv)` is unknown until `SvPV_set(ssv, NULL);` is 100% completed by the CPU pipeline/CPU conveyor belt.
Move which of these two lines? Neither of them needs to know `SvANY(ssv)`, so I'm confused.
`SvLEN_set(ssv, 0);` and `SvCUR_set(ssv, 0);` go through `SvANY()`'s `->` operator. Each time you write them, they involve 4 Turing machine logical steps and 1 memory read:
phantom_register = RAX + U8_0xLIT;
phantom_register = *(void**)phantom_register;
phantom_register = phantom_register + U8_0xLIT;
*(void**)phantom_register = register_ssv_svcur;
The `phantom_register = *(void**)phantom_register;` step is probably the most damaging/delaying/stalling/highest latency logical sub-operation in the macros `SvLEN_set(ssv, 0);` and `SvCUR_set(ssv, 0);`.

The write operation `*(void**)phantom_register = register_ssv_svcur;` is basically asynchronous, or zero latency, because it is thrown into a write-back cache, along with a very short (read-blocked/anti-read locked for the next 1-3 instructions) Exclusive Lock on the 4/8 bytes behind the mem addr `(void**)phantom_register`.
So IMO, unless there is some very specific reason or rationale to do it differently in one particular block of C code, it's always best to do the writes and reads to an SV body struct first, then do the writes and reads to the SV head struct. The SV head opcodes will complete at the same time as the SV body opcodes if the head opcodes come after the body opcodes.
If the head opcodes are done first and the body struct opcodes second, the CPU will stall or run out of parallel work to do, or the CPU will only be 50% utilized until the long-latency body struct opcodes finish, with CPU capacity wasted because the CPU had nothing to do in parallel while the body opcodes executed. Or the CPU was at 100% utilization only during the first 50% of the body opcodes' wall time; for the later 50% of that wall time, the CPU was at 50% utilization, not 100%. It had no work to do in parallel, and didn't find anything else in the upcoming instruction stream it could do before the next conditional jump opcode, or "heavy weight"/"weird"/"slow"/"complicated" `RET` or `CALL` instructions.
Also b/c the SV head ptr is already in a CPU register, I'm going to guess the CPU already marked that SV head addr as no-SEGV or added the "can't fault" flag to those opcodes.
While for the SV body ptr, because it's 1 memory read away from L0, L1 or L2 cache, it is fuzzy and hazy whether the CPU knows yet if that addr is a bad addr (SEGV) or a good read/write-able addr. Remember the CPU must conceptually keep a small rollback "journal log" of what to reverse if any individual opcode touches a bad addr. To me, the SV head read/write opcodes look, smell and sound like they are much more "rollback journal log" friendly b/c the mem addrs that need to be reversed are already known, so they can execute in parallel with the SV body ops that may or may not SEGV after 2-3 ns/2-3 hz. The rollback log can't get an entry with {"address_to_reverse": "UNKNOWN", "old_value": "UNKNOWN" }
lol
So head after body takes maximum advantage of the inherent parallelism of modern production grade CPUs. In-order execution Pentium Classic and Pentium MMX, Intel Atoms, and FOSS ARM/x86 CPUs found on github aren't production CPUs; they are Comp Sci/Elec Eng college student homework assignments, nothing more.
Examples of Homework FOSS ARM CPUs: https://github.com/nxbyte/ARM-LEGv8/tree/master/Pipelined-Only
> `SvLEN_set(ssv, 0);` and `SvCUR_set(ssv, 0);` go through `SvANY()`'s `->` operator,
Yes, I'm aware of that. Neither of those was in the code snippet you created this comment on!
> The value of SvANY(ssv) is unknown until SvPV_set(ssv, NULL); is 100% completed by the CPU pipeline/ CPU conveyor belt.

Why would that be the case? `SvPV_set(ssv, NULL);` doesn't touch `SvANY(ssv)`, so the compiler is free to lay out the instructions within the `/* Passes the swipe test. */` block as best as it is able, and an out-of-order CPU will also execute the ops as efficiently as it can.
`Perl_sv_setsv_flags` is the heavyweight function for assigning the value(s) of a source SV to a destination SV. It contains many branches for preparing the destination SV prior to assignment. However:

This set of commits:
* […] `Perl_sv_setsv_flags` into a macro.
* […] `Perl_sv_freshcopy_flags` and two static helper functions.
* […] `Perl_newSVsv_flags` and `Perl_sv_mortalcopy_flags` to use them.

[…] should use `Perl_newSVsv_flags` or `Perl_sv_mortalcopy_flags`.

Using perl's test harness as a guide:
* […] `Perl_newSVsv_flags` and 57% of calls to `Perl_sv_mortalcopy_flags`.
* […] `SVt_PV/SVp_POK` code handles 32% of calls to `Perl_newSVsv_flags` and 36% of calls to `Perl_sv_mortalcopy_flags`.
* […] `S_sv_freshcopy_flags` code handles 95% of the remainder in `Perl_newSVsv_flags` and 91% of the remainder in `Perl_sv_mortalcopy_flags`.

With these changes compared with a build of blead:
* `perl -e 'for (1..100_000) { my $x = [ (1) x 1000 ]; }'` runs 10% faster
* `perl -e 'for (1..100_000) { my $x = [ ("Perl") x 250 ]; }'` runs 45% faster