-
-
Notifications
You must be signed in to change notification settings - Fork 31.7k
GH-131521: fix clangcl build on Windows for zlib-ng #131526
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
<ClCompile Include="$(zlibNgDir)\arch\x86\chunkset_ssse3.c"> | ||
<AdditionalOptions Condition="$(PlatformToolset) == 'ClangCL'">%(AdditionalOptions) -mssse3</AdditionalOptions> | ||
</ClCompile> | ||
<ClCompile Include="$(zlibNgDir)\arch\x86\adler32_sse42.c"> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is interesting: the filename suggests sse43 sse42, but -mssse3
deemed sufficient.
Ups: edited typo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably worth matching the filename, so that if an updated zlib-ng uses SSE4.3 instructions we don't need to modify anything here.
The file relies on intrinsics, so it's not going to automatically generate newer instructions.
PCbuild/zlib-ng.vcxproj
Outdated
@@ -97,6 +97,7 @@ | |||
<PreprocessorDefinitions>%(PreprocessorDefinitions);ZLIB_COMPAT;WITH_GZFILEOP;NO_FSEEKO;HAVE_BUILTIN_ASSUME_ALIGNED;_CRT_SECURE_NO_DEPRECATE;_CRT_NONSTDC_NO_DEPRECATE;</PreprocessorDefinitions> | |||
<PreprocessorDefinitions Condition="$(Platform) == 'Win32' or $(Platform) == 'x64'">%(PreprocessorDefinitions);X86_FEATURES;X86_HAVE_XSAVE_INTRIN;X86_SSE2;X86_SSSE3;X86_SSE42;X86_PCLMULQDQ_CRC;X86_AVX2;X86_AVX512;X86_AVX512VNNI;X86_VPCLMULQDQ_CRC</PreprocessorDefinitions> | |||
<PreprocessorDefinitions Condition="$(Configuration) == 'Debug'">%(PreprocessorDefinitions);ZLIB_DEBUG</PreprocessorDefinitions> | |||
<PreprocessorDefinitions Condition="$(PlatformToolset) == 'ClangCL'">%(PreprocessorDefinitions);HAVE_BUILTIN_CTZ</PreprocessorDefinitions> | |||
<EnableEnhancedInstructionSet Condition="$(Platform) == 'Win32' or $(Platform) == 'x64'">AdvancedVectorExtensions2</EnableEnhancedInstructionSet> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd feel better we'd not enable that for all *.c
files (at lest not on Win32
). MSVC would be totally happy without it (and AFAIR defaults to SSE2 if not set). Maybe I'd then have to throw in more -m
for clang-cl, but IMHO we're safer without that even in the MSVC case, so that the resulting binary can run on ancient CPUs, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I suspect you're right on this. I'm not sure whether the intrinsics code is skipped or still generates fine without this, but it wouldn't surprise me if it's being ignored (there are __AVX*__
preprocessor checks that will behave differently without this setting).
Putting it just on the .c files from arch\x86
should be enough. Those are already in a conditional ItemGroup, so the condition isn't needed each time (the ClangCL condition is, though).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I suspect you're right on this. I'm not sure whether the intrinsics code is skipped or still generates fine without this, but it wouldn't surprise me if it's being ignored (there are
__AVX*__
preprocessor checks that will behave differently without this setting).
Well spotted! Although MSVC allows you to use intrinsics even if the respective /arch
part isn't set, it won't set the macros __AVX__
, etc. Please note, that it never sets macros like __SSE2__
:
- Fix #4265: Add appropriate compiler defines and flags for SIMD with MSVC AcademySoftwareFoundation/OpenImageIO#4266
- https://godbolt.org/z/3McojGzvs
Doing a regex __[A-Z]+[0-9]+__
only reveals
# if (defined(X86_SSE2) && defined(__SSE2__)) || defined(__x86_64__) || defined(_M_X64) || defined(X86_NOCHECK_SSE2) E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_functions.h 76 37
# if defined(X86_SSSE3) && defined(__SSSE3__) E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_functions.h 95 37
#if defined(X86_PCLMULQDQ_CRC) && defined(__PCLMUL__) E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_functions.h 110 43
# if defined(X86_AVX2) && defined(__AVX2__) E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_functions.h 123 36
# if defined(X86_AVX512) && defined(__AVX512F__) && defined(__AVX512DQ__) && defined(__AVX512BW__) && defined(__AVX512VL__) E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_functions.h 147 38
# if defined(X86_AVX512VNNI) && defined(__AVX512VNNI__) E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_functions.h 159 44
# if defined(__PCLMUL__) && defined(__AVX512F__) && defined(__VPCLMULQDQ__) E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_functions.h 166 17
which are within an #ifdef DISABLE_RUNTIME_CPU_DETECTION
, which isn't set.
The other hits are in x86_intrins.h
:
#ifdef __AVX2__ E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h 7 8
#if (!defined(__clang__) && !defined(__NVCOMPILER) && defined(__GNUC__) && __GNUC__ < 10) \ E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h 10 63
#ifdef __AVX512F__ E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h 18 8
#endif // __AVX512F__ E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h 24 11
#endif // __AVX2__ E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h 27 11
#ifdef __AVX512F__ E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h 31 8
#if (!defined(__clang__) && !defined(__NVCOMPILER) && defined(__GNUC__) && __GNUC__ < 9) E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h 32 63
#endif // __AVX512F__ E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h 67 11
#ifdef __AVX2__ E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h 74 8
#endif // __AVX2__ E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h 78 11
#ifdef __AVX512F__ E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h 80 8
#endif // __AVX512F__ E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h 84 11
#if defined(_MSC_VER) && defined(__AVX512F__) && !defined(_MM_K0_REG8) E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h 88 34
where intrinsic quirks of some compiler versions are taken care of. There are some for MSVC that even include __AVX512F__
, which wouldn't be covered with the current AdvancedVectorExtensions2
. My latest commits should fix this.
Putting it just on the .c files from
arch\x86
should be enough. Those are already in a conditional ItemGroup, so the condition isn't needed each time (the ClangCL condition is, though).
Yeah, I've only set it to those files that really need them, rendering some clang specific stuff obsolete :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
which are within an
#ifdef DISABLE_RUNTIME_CPU_DETECTION
, which isn't set.
This makes me quite confident, that you're spot on #131438 (comment)
I believe it's detected and called dynamically based on availability, but it's all compiled at compile time.
I.e., we are in the "dynamic dispatch" code path
https://github.com/zlib-ng/zlib-ng/blob/860e4cff7917d93f54f5d7f0bc1d0e8b1a3cb988/functable.h#L12-L26
#ifdef DISABLE_RUNTIME_CPU_DETECTION
# include "arch_functions.h"
/* When compiling with native instructions it is not necessary to use functable.
* Instead we use native_ macro indicating the best available variant of arch-specific
* functions for the current platform.
*/
# define FUNCTABLE_INIT ((void)0)
# define FUNCTABLE_CALL(name) native_ ## name
# define FUNCTABLE_FPTR(name) &native_ ## name
#else
struct functable_s {
where functable_s
will be populated at runtime according to the capabilities of the CPU in:
https://github.com/zlib-ng/zlib-ng/blob/860e4cff7917d93f54f5d7f0bc1d0e8b1a3cb988/functable.c#L45-L49
static void init_functable(void) {
struct functable_s ft;
struct cpu_features cf;
cpu_check_features(&cf);
Using the /arch
option only for the needed files will now guarantee that the binary can run even on ancient CPUs and will use optimized code paths according to the CPU capabilites.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using the
/arch
option only for the needed files will now guarantee that the binary can run even on ancient CPUs and will use optimized code paths according to the CPU capabilites.
This is similar to #130213 / #130447. Thus, I've checked using clang-cl 18.1.8: it is picky about just "seeing" intrinsics without the appropriate -m
architecture - and it happily compiled :)
Maybe someone can trigger Windows tailcall CI - it's the only one that really is of interest here, but is only triggered for ceval and the like cpython/.github/workflows/tail-call.yml Lines 4 to 9 in ce79274
|
Just touch one of those files - a comment will do. And undo it before it gets merged (make the comment something like |
so that the resulting binary can be executed on older CPUs, too. Also use AdvancedVectorExtensions512 where necessary.
some of the files tail-call.yml is watching were touched on main, and thus I think it will fire :)
I think this is a skip news? |
Windows tail-call CI is now green: https://github.com/python/cpython/actions/runs/14006628024/job/39221297660 Having a detailed look, ISTM that the tests are not run, though. This is not related to this PR, the last green build some days ago did not run them either: https://github.com/python/cpython/actions/runs/13954576580/job/39062383429#step:4:326 After building, no more output can be seen. This is interesting, because cpython/.github/workflows/tail-call.yml Lines 82 to 91 in ce79274
clearly ./PCbuild/rt.bat is invoked. Maybe the reason is, because it is not used with backslashes like in
The question then is, why ./PCbuild/build.bat is working ...
@Fidget-Spinner: any idea? |
CI uses cmd instead if powershell, could that make a difference? |
Maybe? |
@@ -90,6 +90,7 @@ | |||
<ItemDefinitionGroup> | |||
<ClCompile> | |||
<AdditionalOptions>%(AdditionalOptions) /utf-8 /w34242</AdditionalOptions> | |||
<AdditionalOptions Condition="$(SupportPGO) and $(Configuration) == 'PGUpdate' and $(PlatformToolset) == 'ClangCL'">%(AdditionalOptions) -fno-profile-instr-use</AdditionalOptions> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since zlib-ng is compiled into a static lib, it won't produce any profile data. It is the first target that is built (after the _freeze_module
)
Lines 49 to 51 in 9962469
<ItemGroup> | |
<!-- Static libraries for use later in the build --> | |
<Projects Include="zlib-ng.vcxproj" Condition="$(zlibNgDir) != '' and Exists('$(zlibNgDir)\zlib-ng.h.in')" /> |
But since there is no
\$(TargetName)_*.profraw
for itcpython/PCbuild/pyproject-clangcl.props
Lines 16 to 18 in 9962469
<ItemGroup> | |
<_profrawFiles Include="$(OutDir)instrumented\$(TargetName)_*.profraw" /> | |
</ItemGroup> |
MergeClangProfileData
won't create profdata.profdata
:cpython/PCbuild/pyproject-clangcl.props
Lines 26 to 28 in 9962469
<Target Name="MergeClangProfileData" BeforeTargets="PrepareForBuild" | |
Condition="'$(SupportPGO)' and $(Configuration) == 'PGUpdate'" | |
Inputs="@(_profrawFiles)" |
IMHO, this is the best fix (instead of just setting SupportPGO=false
for it):
- it still will get compiled with
-fprofile-instr-generate
- it will be linked with
python314.dll
and (hopefully?) happily contribute its part topython314.profraw
- MSVC doesn't create
zlib-ng!1.pgc
andzlib-ng.pgd
either. No idea whether it gets PGOed there, but the same idea applies? The difference is just, that MSVC does not produce a hard error if the profile data isn't there, whereas clang does (TRACKEDEXEC : error : Error in reading profile profdata.profdata: no such file or directory
), and I still have found no way to make that a warning :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since zlib-ng is compiled into a static lib, it won't produce any profile data.
Thinking about it again, instead of doing the -fno-profile-instr-use
trick in the PGUpdate
phase, it would be better, to just create the lib once instead of twice (like I've recently done for the _freeze_module
).
There is no need to build the lib twice (same holds true for the MSVC case)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I gave it a quick try:
- introducing a
StaticLibProjects
likeFreezeProjects
inpcbuild.proj
is simple and straight forward. - Then I'd have to make the
ProjectReference Include="zlib-ng.vcxproj
conditional (otherwise it would still be built in thePGUpdate
phase) - ok, doable. - but then
zlib.h
is no longer found, because it is generated into$(IntDir)zlib.h
- likewise$(IntDir)zlib-ng.h
. So stopping here :)
So turns out to be not so easy and I'd like to keep that for a follow-up PR (if at all).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assumed the call counts for PGO are going straight into each project that references it, since a static lib is essentially just a bundle of .obj files and so the link-time optimisation happens in the referencing project, not the static lib project.
Which means yes, it's participating in PGO, and it doesn't require rebuilding. Though it's not going to hurt that badly, so I'd tend towards decoupled build configuration (it would be easy to fix by moving the header file regeneration into pythoncore.vcxproj, but that's bad coupling).
Possibly we could make SupportsPGO
choose the definition of IntDir
and then it would usually be a quick rebuild?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, but let's do that in a follow-up PR, since it is not really related to clang-cl.
Technically, the /arch
changes so that the resulting binary can run on legacy CPUs aren't either, but IMHO they are a nice fit here.
It would just put some rebase burden onto this PR, if we'd do it in a separate one.
It probably requires |
Python/ceval_macros.h
Outdated
@@ -1,3 +1,4 @@ | |||
// DO NOT MERGE - just triggering Windows tail-call CI) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remember to remove
I've created #131678 to continue the discussion there. |
…honGH-131526) Do not enable AdvancedVectorExtensions2 for all *.c files, so that the resulting binary can be executed on older CPUs, too. Also enable AdvancedVectorExtensions512 where necessary, and add the ClangCL flags required to enable vector extensions.
…honGH-131526) Do not enable AdvancedVectorExtensions2 for all *.c files, so that the resulting binary can be executed on older CPUs, too. Also enable AdvancedVectorExtensions512 where necessary, and add the ClangCL flags required to enable vector extensions.
I gave it a try. For me it compiles and
test_zlib
is green.I've only thrown in as many
-m
as needed.But only for those files that really need them!
I started to set that for all files (the diff would have been much smaller), but
*.c
files given, and the resulting binary would then not run on all hoststest_zlib
failed withinvalid instruction
if I compiled all*.c
files with the "all the-m
. This convinced me even more to be "selective" here: my ancient Haswell i5 4570 only supports up to AVX2 and it makes sense, that it would not happily accept all that instructions (up to AVX512).