Skip to content

GH-131521: fix clangcl build on Windows for zlib-ng #131526

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Mar 24, 2025

Conversation

chris-eibl
Copy link
Member

@chris-eibl chris-eibl commented Mar 20, 2025

I gave it a try. For me it compiles and test_zlib is green.

I've only thrown in as many -m as needed.

But only for those files that really need them!

I started to set that for all files (the diff would have been much smaller), but

  • this does not seem right to me: the compiler would be eligible to generate such code for all *.c files given, and the resulting binary would then not run on all hosts
  • test_zlib failed with invalid instruction if I compiled all *.c files with the "all the -m. This convinced me even more to be "selective" here: my ancient Haswell i5 4570 only supports up to AVX2 and it makes sense, that it would not happily accept all that instructions (up to AVX512).

<ClCompile Include="$(zlibNgDir)\arch\x86\chunkset_ssse3.c">
<AdditionalOptions Condition="$(PlatformToolset) == 'ClangCL'">%(AdditionalOptions) -mssse3</AdditionalOptions>
</ClCompile>
<ClCompile Include="$(zlibNgDir)\arch\x86\adler32_sse42.c">
Copy link
Member Author

@chris-eibl chris-eibl Mar 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is interesting: the filename suggests sse43 sse42, but -mssse3 deemed sufficient.
Ups: edited typo.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably worth matching the filename, so that if an updated zlib-ng uses SSE4.3 instructions we don't need to modify anything here.

The file relies on intrinsics, so it's not going to automatically generate newer instructions.

@@ -97,6 +97,7 @@
<PreprocessorDefinitions>%(PreprocessorDefinitions);ZLIB_COMPAT;WITH_GZFILEOP;NO_FSEEKO;HAVE_BUILTIN_ASSUME_ALIGNED;_CRT_SECURE_NO_DEPRECATE;_CRT_NONSTDC_NO_DEPRECATE;</PreprocessorDefinitions>
<PreprocessorDefinitions Condition="$(Platform) == 'Win32' or $(Platform) == 'x64'">%(PreprocessorDefinitions);X86_FEATURES;X86_HAVE_XSAVE_INTRIN;X86_SSE2;X86_SSSE3;X86_SSE42;X86_PCLMULQDQ_CRC;X86_AVX2;X86_AVX512;X86_AVX512VNNI;X86_VPCLMULQDQ_CRC</PreprocessorDefinitions>
<PreprocessorDefinitions Condition="$(Configuration) == 'Debug'">%(PreprocessorDefinitions);ZLIB_DEBUG</PreprocessorDefinitions>
<PreprocessorDefinitions Condition="$(PlatformToolset) == 'ClangCL'">%(PreprocessorDefinitions);HAVE_BUILTIN_CTZ</PreprocessorDefinitions>
<EnableEnhancedInstructionSet Condition="$(Platform) == 'Win32' or $(Platform) == 'x64'">AdvancedVectorExtensions2</EnableEnhancedInstructionSet>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd feel better we'd not enable that for all *.c files (at lest not on Win32). MSVC would be totally happy without it (and AFAIR defaults to SSE2 if not set). Maybe I'd then have to throw in more -m for clang-cl, but IMHO we're safer without that even in the MSVC case, so that the resulting binary can run on ancient CPUs, too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I suspect you're right on this. I'm not sure whether the intrinsics code is skipped or still generates fine without this, but it wouldn't surprise me if it's being ignored (there are __AVX*__ preprocessor checks that will behave differently without this setting).

Putting it just on the .c files from arch\x86 should be enough. Those are already in a conditional ItemGroup, so the condition isn't needed each time (the ClangCL condition is, though).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I suspect you're right on this. I'm not sure whether the intrinsics code is skipped or still generates fine without this, but it wouldn't surprise me if it's being ignored (there are __AVX*__ preprocessor checks that will behave differently without this setting).

Well spotted! Although MSVC allows you to use intrinsics even if the respective /arch part isn't set, it won't set the macros __AVX__, etc. Please note, that it never sets macros like __SSE2__:

Doing a regex __[A-Z]+[0-9]+__ only reveals

#  if (defined(X86_SSE2) && defined(__SSE2__)) || defined(__x86_64__) || defined(_M_X64) || defined(X86_NOCHECK_SSE2)	E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_functions.h	76	37
#  if defined(X86_SSSE3) && defined(__SSSE3__)	E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_functions.h	95	37
#if defined(X86_PCLMULQDQ_CRC) && defined(__PCLMUL__)	E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_functions.h	110	43
#  if defined(X86_AVX2) && defined(__AVX2__)	E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_functions.h	123	36
#  if defined(X86_AVX512) && defined(__AVX512F__) && defined(__AVX512DQ__) && defined(__AVX512BW__) && defined(__AVX512VL__)	E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_functions.h	147	38
#    if defined(X86_AVX512VNNI) && defined(__AVX512VNNI__)	E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_functions.h	159	44
#    if defined(__PCLMUL__) && defined(__AVX512F__) && defined(__VPCLMULQDQ__)	E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_functions.h	166	17

which are within an #ifdef DISABLE_RUNTIME_CPU_DETECTION, which isn't set.

The other hits are in x86_intrins.h:

#ifdef __AVX2__	E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h	7	8
#if (!defined(__clang__) && !defined(__NVCOMPILER) && defined(__GNUC__) && __GNUC__ < 10) \	E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h	10	63
#ifdef __AVX512F__	E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h	18	8
#endif // __AVX512F__	E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h	24	11
#endif // __AVX2__	E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h	27	11
#ifdef __AVX512F__	E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h	31	8
#if (!defined(__clang__) && !defined(__NVCOMPILER) && defined(__GNUC__) && __GNUC__ < 9)	E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h	32	63
#endif // __AVX512F__	E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h	67	11
#ifdef __AVX2__	E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h	74	8
#endif // __AVX2__	E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h	78	11
#ifdef __AVX512F__	E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h	80	8
#endif // __AVX512F__	E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h	84	11
#if defined(_MSC_VER) && defined(__AVX512F__) && !defined(_MM_K0_REG8)	E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h	88	34

where intrinsic quirks of some compiler versions are taken care of. There are some for MSVC that even include __AVX512F__, which wouldn't be covered with the current AdvancedVectorExtensions2. My latest commits should fix this.

Putting it just on the .c files from arch\x86 should be enough. Those are already in a conditional ItemGroup, so the condition isn't needed each time (the ClangCL condition is, though).

Yeah, I've only set it to those files that really need them, rendering some clang specific stuff obsolete :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which are within an #ifdef DISABLE_RUNTIME_CPU_DETECTION, which isn't set.

This makes me quite confident, that you're spot on #131438 (comment)

I believe it's detected and called dynamically based on availability, but it's all compiled at compile time.

I.e., we are in the "dynamic dispatch" code path
https://github.com/zlib-ng/zlib-ng/blob/860e4cff7917d93f54f5d7f0bc1d0e8b1a3cb988/functable.h#L12-L26

#ifdef DISABLE_RUNTIME_CPU_DETECTION

#  include "arch_functions.h"

/* When compiling with native instructions it is not necessary to use functable.
 * Instead we use native_ macro indicating the best available variant of arch-specific
 * functions for the current platform.
 */
#  define FUNCTABLE_INIT ((void)0)
#  define FUNCTABLE_CALL(name) native_ ## name
#  define FUNCTABLE_FPTR(name) &native_ ## name

#else

struct functable_s {

where functable_s will be populated at runtime according to the capabilities of the CPU in:
https://github.com/zlib-ng/zlib-ng/blob/860e4cff7917d93f54f5d7f0bc1d0e8b1a3cb988/functable.c#L45-L49

static void init_functable(void) {
    struct functable_s ft;
    struct cpu_features cf;

    cpu_check_features(&cf);

Using the /arch option only for the needed files will now guarantee that the binary can run even on ancient CPUs and will use optimized code paths according to the CPU capabilites.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the /arch option only for the needed files will now guarantee that the binary can run even on ancient CPUs and will use optimized code paths according to the CPU capabilites.

This is similar to #130213 / #130447. Thus, I've checked using clang-cl 18.1.8: it is picky about just "seeing" intrinsics without the appropriate -m architecture - and it happily compiled :)

@chris-eibl
Copy link
Member Author

Maybe someone can trigger Windows tailcall CI - it's the only one that really is of interest here, but is only triggered for ceval and the like

paths:
- '.github/workflows/tail-call.yml'
- 'Python/bytecodes.c'
- 'Python/ceval.c'
- 'Python/ceval_macros.h'
- 'Python/generated_cases.c.h'

@zooba
Copy link
Member

zooba commented Mar 20, 2025

Maybe someone can trigger Windows tailcall CI

Just touch one of those files - a comment will do. And undo it before it gets merged (make the comment something like // DO NOT MERGE - just triggering CI)

so that the resulting binary can be executed on older CPUs, too.
Also use AdvancedVectorExtensions512 where necessary.
some of the files tail-call.yml is watching were touched
on main, and thus I think it will fire :)
@chris-eibl chris-eibl requested a review from markshannon as a code owner March 22, 2025 07:52
@chris-eibl
Copy link
Member Author

I think this is a skip news?

@chris-eibl
Copy link
Member Author

Windows tail-call CI is now green: https://github.com/python/cpython/actions/runs/14006628024/job/39221297660

Having a detailed look, ISTM that the tests are not run, though. This is not related to this PR, the last green build some days ago did not run them either: https://github.com/python/cpython/actions/runs/13954576580/job/39062383429#step:4:326

After building, no more output can be seen.

This is interesting, because

- name: Native Windows (debug)
if: runner.os == 'Windows' && matrix.architecture != 'ARM64'
shell: cmd
run: |
choco install llvm --allow-downgrade --no-progress --version ${{ matrix.llvm }}.1.5
set PlatformToolset=clangcl
set LLVMToolsVersion=${{ matrix.llvm }}.1.5
set LLVMInstallDir=C:\Program Files\LLVM
./PCbuild/build.bat --tail-call-interp -d -p ${{ matrix.architecture }}
./PCbuild/rt.bat -d -p ${{ matrix.architecture }} -q --multiprocess 0 --timeout 4500 --verbose2 --verbose3

clearly ./PCbuild/rt.bat is invoked. Maybe the reason is, because it is not used with backslashes like in

The question then is, why ./PCbuild/build.bat is working ...

@Fidget-Spinner: any idea?

@Fidget-Spinner
Copy link
Member

CI uses cmd instead if powershell, could that make a difference?

@chris-eibl
Copy link
Member Author

Maybe?
IMHO, we shouldn't do that here in this PR but create a separate issue and PR?

@@ -90,6 +90,7 @@
<ItemDefinitionGroup>
<ClCompile>
<AdditionalOptions>%(AdditionalOptions) /utf-8 /w34242</AdditionalOptions>
<AdditionalOptions Condition="$(SupportPGO) and $(Configuration) == 'PGUpdate' and $(PlatformToolset) == 'ClangCL'">%(AdditionalOptions) -fno-profile-instr-use</AdditionalOptions>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since zlib-ng is compiled into a static lib, it won't produce any profile data. It is the first target that is built (after the _freeze_module)

<ItemGroup>
<!-- Static libraries for use later in the build -->
<Projects Include="zlib-ng.vcxproj" Condition="$(zlibNgDir) != '' and Exists('$(zlibNgDir)\zlib-ng.h.in')" />

But since there is no \$(TargetName)_*.profraw for it
<ItemGroup>
<_profrawFiles Include="$(OutDir)instrumented\$(TargetName)_*.profraw" />
</ItemGroup>

MergeClangProfileData won't create profdata.profdata:
<Target Name="MergeClangProfileData" BeforeTargets="PrepareForBuild"
Condition="'$(SupportPGO)' and $(Configuration) == 'PGUpdate'"
Inputs="@(_profrawFiles)"

IMHO, this is the best fix (instead of just setting SupportPGO=false for it):

  • it still will get compiled with -fprofile-instr-generate
  • it will be linked with python314.dll and (hopefully?) happily contribute its part to python314.profraw
  • MSVC doesn't create zlib-ng!1.pgc and zlib-ng.pgd either. No idea whether it gets PGOed there, but the same idea applies? The difference is just, that MSVC does not produce a hard error if the profile data isn't there, whereas clang does (TRACKEDEXEC : error : Error in reading profile profdata.profdata: no such file or directory), and I still have found no way to make that a warning :(

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since zlib-ng is compiled into a static lib, it won't produce any profile data.

Thinking about it again, instead of doing the -fno-profile-instr-use trick in the PGUpdate phase, it would be better, to just create the lib once instead of twice (like I've recently done for the _freeze_module).
There is no need to build the lib twice (same holds true for the MSVC case)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gave it a quick try:

  • introducing a StaticLibProjects like FreezeProjects in pcbuild.proj is simple and straight forward.
  • Then I'd have to make the ProjectReference Include="zlib-ng.vcxproj conditional (otherwise it would still be built in the PGUpdate phase) - ok, doable.
  • but then zlib.h is no longer found, because it is generated into $(IntDir)zlib.h - likewise $(IntDir)zlib-ng.h. So stopping here :)

So turns out to be not so easy and I'd like to keep that for a follow-up PR (if at all).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assumed the call counts for PGO are going straight into each project that references it, since a static lib is essentially just a bundle of .obj files and so the link-time optimisation happens in the referencing project, not the static lib project.

Which means yes, it's participating in PGO, and it doesn't require rebuilding. Though it's not going to hurt that badly, so I'd tend towards decoupled build configuration (it would be easy to fix by moving the header file regeneration into pythoncore.vcxproj, but that's bad coupling).

Possibly we could make SupportsPGO choose the definition of IntDir and then it would usually be a quick rebuild?

Copy link
Member Author

@chris-eibl chris-eibl Mar 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, but let's do that in a follow-up PR, since it is not really related to clang-cl.

Technically, the /arch changes so that the resulting binary can run on legacy CPUs aren't either, but IMHO they are a nice fit here.

It would just put some rebase burden onto this PR, if we'd do it in a separate one.

@zooba
Copy link
Member

zooba commented Mar 24, 2025

CI uses cmd instead if powershell, could that make a difference?

It probably requires call PCbuild\build.bat ... to handle someone calling exit in the batch file. In Cmd, if the entire block is a batch file, then exit will break out of the top-level one, not the nested one, unless you use call.

@@ -1,3 +1,4 @@
// DO NOT MERGE - just triggering Windows tail-call CI)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remember to remove

@chris-eibl
Copy link
Member Author

CI uses cmd instead if powershell, could that make a difference?

It probably requires call PCbuild\build.bat ... to handle someone calling exit in the batch file. In Cmd, if the entire block is a batch file, then exit will break out of the top-level one, not the nested one, unless you use call.

I've created #131678 to continue the discussion there.

@zooba zooba merged commit d16f455 into python:main Mar 24, 2025
38 checks passed
@chris-eibl chris-eibl deleted the fix_clangcl_zlibng branch March 24, 2025 19:29
diegorusso pushed a commit to diegorusso/cpython that referenced this pull request Apr 1, 2025
…honGH-131526)

Do not enable AdvancedVectorExtensions2 for all *.c files, so that the resulting binary can be executed on older CPUs, too. Also enable AdvancedVectorExtensions512 where necessary, and add the ClangCL flags required to enable vector extensions.
seehwan pushed a commit to seehwan/cpython that referenced this pull request Apr 16, 2025
…honGH-131526)

Do not enable AdvancedVectorExtensions2 for all *.c files, so that the resulting binary can be executed on older CPUs, too. Also enable AdvancedVectorExtensions512 where necessary, and add the ClangCL flags required to enable vector extensions.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants