Unwanted duplicate packages for Intel unified-dev environment on Orion #1477

Closed
srherbener opened this issue Jan 28, 2025 · 15 comments
Labels
bug Something is not working

Comments

@srherbener
Collaborator

Describe the bug

The unified-dev environment for the Intel compiler set is producing unwanted duplicates in the concretize step. Here is the output of show_duplicate_packages.py:

[ue-intel-test] orion-login-2[110] herbener$ ../../util/show_duplicate_packages.py -d log.concretize 
ph345on  [email protected]%[email protected]~ipo+python+shared+utils  build_system=cmake  build_type=Release  generator=make  test_files=none  arch=linux-rocky9-skylake_avx512
6f26fpg  [email protected]%[email protected]~ipo+python+shared+utils  build_system=cmake  build_type=Release  generator=make  test_files=none  arch=linux-rocky9-skylake
qe467cs  [email protected]%[email protected]~ipo+python  build_system=cmake  build_type=Release  generator=make  arch=linux-rocky9-skylake_avx512
rcuf3e7  [email protected]%[email protected]~ipo+python  build_system=cmake  build_type=Release  generator=make  arch=linux-rocky9-skylake_avx512
zohtrou  [email protected]%[email protected]+fix~ipo  build_system=cmake  build_type=Release  generator=make  arch=linux-rocky9-skylake_avx512
veonfpk  [email protected]%[email protected]+fix~ipo  build_system=cmake  build_type=Release  generator=make  arch=linux-rocky9-skylake_avx512
ujod65x  [email protected]%[email protected]+fix~ipo  build_system=cmake  build_type=Release  generator=make  arch=linux-rocky9-skylake
hhcb3ee  [email protected]%[email protected]  cxxflags='-fp-model  precise'  fflags='-fp-model  precise'  ~debug~external-lapack+external-parallelio+mpi+netcdf~pnetcdf+python+shared~xerces  build_system=makefile  esmf_comm=auto  esmf_os=auto  esmf_pio=auto  patches=f63d405  snapshot=none  arch=linux-rocky9-skylake_avx512
ahyggrp  [email protected]%[email protected]  cxxflags='-fp-model  precise'  fflags='-fp-model  precise'  ~debug~external-lapack+external-parallelio+mpi+netcdf~pnetcdf+python+shared~xerces  build_system=makefile  esmf_comm=auto  esmf_os=auto  esmf_pio=auto  patches=f63d405  snapshot=none  arch=linux-rocky9-skylake
i3mibbe  [email protected]%[email protected]  cxxflags='-fp-model  precise'  fflags='-fp-model  precise'  ~debug~external-lapack+external-parallelio+mpi+netcdf~pnetcdf+python+shared~xerces  build_system=makefile  esmf_comm=auto  esmf_os=auto  esmf_pio=auto  patches=f63d405  snapshot=none  arch=linux-rocky9-skylake_avx512
khhxhzv  [email protected]%[email protected]+bufrquery+fftw+hdf4  build_system=bundle  arch=linux-rocky9-skylake_avx512
jbruuvv  [email protected]%[email protected]+bufrquery+fftw+hdf4  build_system=bundle  arch=linux-rocky9-skylake_avx512
rpyt5dx  [email protected]%[email protected]~debug+extdata2g~f2py+fargparse~ipo+pflogger~pfunit~shared  build_system=cmake  build_type=Release  generator=make  arch=linux-rocky9-skylake_avx512
x2srcuy  [email protected]%[email protected]~debug+extdata2g~f2py+fargparse~ipo+pflogger~pfunit~shared  build_system=cmake  build_type=Release  generator=make  arch=linux-rocky9-skylake
5yxxf5n  [email protected]%[email protected]  build_system=python_pip  arch=linux-rocky9-skylake_avx512
cgi4wtw  [email protected]%[email protected]  build_system=python_pip  arch=linux-rocky9-skylake
7ttejsm  [email protected]%[email protected]  build_system=python_pip  arch=linux-rocky9-skylake_avx512
heksufs  [email protected]%[email protected]  build_system=python_pip  arch=linux-rocky9-skylake_avx512
mitt2pp  [email protected]%[email protected]~mpi  build_system=python_pip  arch=linux-rocky9-skylake_avx512
ws6vge6  [email protected]%[email protected]~mpi  build_system=python_pip  arch=linux-rocky9-skylake_avx512
4gm5a3p  [email protected]%[email protected]+mpi  build_system=python_pip  patches=255b5ae  arch=linux-rocky9-skylake
6nywkag  [email protected]%[email protected]+mpi  build_system=python_pip  patches=255b5ae  arch=linux-rocky9-skylake_avx512
65kaf2d  [email protected]%[email protected]  build_system=python_pip  patches=873745d  arch=linux-rocky9-skylake
hcroitu  [email protected]%[email protected]  build_system=python_pip  patches=873745d  arch=linux-rocky9-skylake_avx512
jtxqrzt  [email protected]%[email protected]~excel~performance  build_system=python_pip  arch=linux-rocky9-skylake_avx512
p2iufak  [email protected]%[email protected]~excel~performance  build_system=python_pip  arch=linux-rocky9-skylake
fjlbga4  [email protected]%[email protected]  build_system=python_pip  arch=linux-rocky9-skylake_avx512
7wjmfoe  [email protected]%[email protected]  build_system=python_pip  arch=linux-rocky9-skylake_avx512
ytctukt  [email protected]%[email protected]  build_system=python_pip  arch=linux-rocky9-skylake_avx512
a4cn5kv  [email protected]%[email protected]  build_system=python_pip  arch=linux-rocky9-skylake_avx512
ifw2owm  [email protected]%[email protected]  build_system=python_pip  patches=3720932  arch=linux-rocky9-skylake_avx512
pxikksb  [email protected]%[email protected]  build_system=python_pip  patches=3720932  arch=linux-rocky9-skylake_avx512
pr3dihn  [email protected]%[email protected]~io~parallel~viz  build_system=python_pip  arch=linux-rocky9-skylake_avx512
fblduwi  [email protected]%[email protected]~io~parallel~viz  build_system=python_pip  arch=linux-rocky9-skylake
===
Duplicates found!
[ue-intel-test] orion-login-2[111] herbener$ 

I'm getting this from the feature branches for the libirc PR #1435. I'm also seeing similar behavior from the Intel build that @RatkoVasic-NOAA recently did:

[ue-intel-test] orion-login-2[112] herbener$ ../../util/show_duplicate_packages.py -d /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.9.0-dom2/envs/ue-intel/log.concretize 
ph345on  [email protected]%[email protected]~ipo+python+shared+utils  build_system=cmake  build_type=Release  generator=make  test_files=none  arch=linux-rocky9-skylake_avx512
6f26fpg  [email protected]%[email protected]~ipo+python+shared+utils  build_system=cmake  build_type=Release  generator=make  test_files=none  arch=linux-rocky9-skylake
qe467cs  [email protected]%[email protected]~ipo+python  build_system=cmake  build_type=Release  generator=make  arch=linux-rocky9-skylake_avx512
rcuf3e7  [email protected]%[email protected]~ipo+python  build_system=cmake  build_type=Release  generator=make  arch=linux-rocky9-skylake_avx512
veonfpk  [email protected]%[email protected]+fix~ipo  build_system=cmake  build_type=Release  generator=make  arch=linux-rocky9-skylake_avx512
ujod65x  [email protected]%[email protected]+fix~ipo  build_system=cmake  build_type=Release  generator=make  arch=linux-rocky9-skylake
zohtrou  [email protected]%[email protected]+fix~ipo  build_system=cmake  build_type=Release  generator=make  arch=linux-rocky9-skylake_avx512
i3mibbe  [email protected]%[email protected]  cxxflags='-fp-model  precise'  fflags='-fp-model  precise'  ~debug~external-lapack+external-parallelio+mpi+netcdf~pnetcdf+python+shared~xerces  build_system=makefile  esmf_comm=auto  esmf_os=auto  esmf_pio=auto  patches=f63d405  snapshot=none  arch=linux-rocky9-skylake_avx512
ahyggrp  [email protected]%[email protected]  cxxflags='-fp-model  precise'  fflags='-fp-model  precise'  ~debug~external-lapack+external-parallelio+mpi+netcdf~pnetcdf+python+shared~xerces  build_system=makefile  esmf_comm=auto  esmf_os=auto  esmf_pio=auto  patches=f63d405  snapshot=none  arch=linux-rocky9-skylake
hhcb3ee  [email protected]%[email protected]  cxxflags='-fp-model  precise'  fflags='-fp-model  precise'  ~debug~external-lapack+external-parallelio+mpi+netcdf~pnetcdf+python+shared~xerces  build_system=makefile  esmf_comm=auto  esmf_os=auto  esmf_pio=auto  patches=f63d405  snapshot=none  arch=linux-rocky9-skylake_avx512
khhxhzv  [email protected]%[email protected]+bufrquery+fftw+hdf4  build_system=bundle  arch=linux-rocky9-skylake_avx512
jbruuvv  [email protected]%[email protected]+bufrquery+fftw+hdf4  build_system=bundle  arch=linux-rocky9-skylake_avx512
4j634ov  [email protected]%[email protected]~debug+extdata2g~f2py+fargparse~ipo+pflogger~pfunit~shared  build_system=cmake  build_type=Release  generator=make  arch=linux-rocky9-skylake
w4o4wmz  [email protected]%[email protected]~debug+extdata2g~f2py+fargparse~ipo+pflogger~pfunit~shared  build_system=cmake  build_type=Release  generator=make  arch=linux-rocky9-skylake_avx512
cgi4wtw  [email protected]%[email protected]  build_system=python_pip  arch=linux-rocky9-skylake
5yxxf5n  [email protected]%[email protected]  build_system=python_pip  arch=linux-rocky9-skylake_avx512
7ttejsm  [email protected]%[email protected]  build_system=python_pip  arch=linux-rocky9-skylake_avx512
heksufs  [email protected]%[email protected]  build_system=python_pip  arch=linux-rocky9-skylake_avx512
ws6vge6  [email protected]%[email protected]~mpi  build_system=python_pip  arch=linux-rocky9-skylake_avx512
mitt2pp  [email protected]%[email protected]~mpi  build_system=python_pip  arch=linux-rocky9-skylake_avx512
4gm5a3p  [email protected]%[email protected]+mpi  build_system=python_pip  patches=255b5ae  arch=linux-rocky9-skylake
6nywkag  [email protected]%[email protected]+mpi  build_system=python_pip  patches=255b5ae  arch=linux-rocky9-skylake_avx512
hcroitu  [email protected]%[email protected]  build_system=python_pip  patches=873745d  arch=linux-rocky9-skylake_avx512
65kaf2d  [email protected]%[email protected]  build_system=python_pip  patches=873745d  arch=linux-rocky9-skylake
jtxqrzt  [email protected]%[email protected]~excel~performance  build_system=python_pip  arch=linux-rocky9-skylake_avx512
p2iufak  [email protected]%[email protected]~excel~performance  build_system=python_pip  arch=linux-rocky9-skylake
7wjmfoe  [email protected]%[email protected]  build_system=python_pip  arch=linux-rocky9-skylake_avx512
fjlbga4  [email protected]%[email protected]  build_system=python_pip  arch=linux-rocky9-skylake_avx512
ytctukt  [email protected]%[email protected]  build_system=python_pip  arch=linux-rocky9-skylake_avx512
a4cn5kv  [email protected]%[email protected]  build_system=python_pip  arch=linux-rocky9-skylake_avx512
pxikksb  [email protected]%[email protected]  build_system=python_pip  patches=3720932  arch=linux-rocky9-skylake_avx512
ifw2owm  [email protected]%[email protected]  build_system=python_pip  patches=3720932  arch=linux-rocky9-skylake_avx512
pr3dihn  [email protected]%[email protected]~io~parallel~viz  build_system=python_pip  arch=linux-rocky9-skylake_avx512
fblduwi  [email protected]%[email protected]~io~parallel~viz  build_system=python_pip  arch=linux-rocky9-skylake
===
Duplicates found!
[ue-intel-test] orion-login-2[113] herbener$ 

There seems to be an issue with the skylake vs. skylake_avx512 architecture.
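One way to check the architecture observation is to group the duplicate lines by package name and see which packages were concretized for more than one target. The sketch below is only an illustration under assumptions: the helper names and the sample lines are hypothetical, and this is not how show_duplicate_packages.py is implemented.

```python
import re
from collections import defaultdict

# Hypothetical sample in the same shape as the output above:
# "<hash>  <name>@<version>%<compiler> ... arch=<platform>-<os>-<target>"
SAMPLE = """\
aaaaaaa  pkg-a@1.0%intel@2021  build_system=cmake  arch=linux-rocky9-skylake_avx512
bbbbbbb  pkg-a@1.0%intel@2021  build_system=cmake  arch=linux-rocky9-skylake
ccccccc  pkg-b@2.0%intel@2021  build_system=python_pip  arch=linux-rocky9-skylake_avx512
ddddddd  pkg-b@2.0%intel@2021  build_system=python_pip  arch=linux-rocky9-skylake_avx512
===
Duplicates found!
"""


def archs_per_package(text):
    """Map each package name to the set of arch strings seen for it."""
    archs = defaultdict(set)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2 or "@" not in fields[1]:
            continue  # skip separators such as "===" and status lines
        name = fields[1].split("@", 1)[0]
        match = re.search(r"arch=(\S+)", line)
        if match:
            archs[name].add(match.group(1))
    return dict(archs)


def mixed_arch_packages(text):
    """Return sorted names of packages that appear with more than one arch."""
    return sorted(n for n, a in archs_per_package(text).items() if len(a) > 1)


print(mixed_arch_packages(SAMPLE))
```

Packages that are duplicated but do not show up in this list differ in something that is not visible on these lines, such as a dependency.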

It's not clear whether PR #1435 introduced this behavior. I suspect not, and this behavior has likely been around for a while. I don't think this issue should hold up PR #1435.

I think this is a specific Orion/Intel configuration issue, meaning that we should not hold up the 1.9.0 installation and testing on other platforms.

To Reproduce

Follow the instructions for building a local environment here: https://spack-stack.readthedocs.io/en/latest/PreConfiguredSites.html#create-local-environment, and select orion, unified-dev, and intel compiler.

Use the feature branches from #1435, but I suspect you can use the 1.9.0 release branches or the develop branches as well.

Then run spack concretize or spack concretize --fresh.

Expected behavior

The concretize step should only produce the expected duplicates (esmf, crtm).

System:
What system(s) are you running the code on?
Orion, Intel unified-dev environment


@rickgrubin-noaa
Collaborator

I was hitting this a lot when working on Orion (and Hercules) with @develop plus some to-be-merged PR changes (including the one you mention, #1435), so prior to the release that was just cut, but close to it in terms of content.

spack diff for the hashes corresponding to jedi-base-env may provide good clues -- that helped me get past duplicates.

@srherbener
Collaborator Author

Thanks for the tip @rickgrubin-noaa! I poked around a bit and at least part of the issue is that the common/packages.yaml requires version 1.26.x for py-numpy but [email protected] with its +python variant ends up building [email protected] (latest version). These two versions become duplicate packages, and they trigger a number (perhaps all) of the other duplicate packages.

Here's the common/packages.yaml entry:

  py-numpy:
    require:
    - '@1.26'

And here's the excerpt from the esmf package.py script:

    # python library
    with when("+python"):
        extends("python")
        depends_on("py-pip")
        depends_on("py-setuptools", type="build")
        depends_on("py-wheel", type="build")
        depends_on("py-mpi4py", when="+mpi")
        depends_on("py-numpy")

There doesn't appear to be any restriction on the py-numpy version in the esmf package.py script, so why doesn't esmf select [email protected] instead?

Any thoughts?

@srherbener
Collaborator Author

@mathomp4, @climbfuji, @rickgrubin-noaa, @alexrichert any thoughts about the behavior I mentioned above? Thanks!

@srherbener
Collaborator Author

Also noticed that intel MKL is not used in the Orion intel configuration. Is this correct? Not sure MKL would be related to this issue, but thought I should double check on this. Thanks!

@climbfuji
Collaborator

> Also noticed that intel MKL is not used in the Orion intel configuration. Is this correct? Not sure MKL would be related to this issue, but thought I should double check on this. Thanks!

That's because NOAA doesn't want to move (back) to MKL from openblas. It used to be MKL with hpc-stack, but when we moved to spack-stack we had openblas first (due to issues with MKL in the stack in the early days of spack/spack-stack). And now it seems we are stuck with openblas on the RDHPCS systems (until someone actually tests MKL vs. openblas and finds out that it is safe to switch back).

@climbfuji
Collaborator

> Thanks for the tip @rickgrubin-noaa! I poked around a bit and at least part of the issue is that the common/packages.yaml requires version 1.26.x for py-numpy but [email protected] with its +python variant ends up building [email protected] (latest version). These two versions become duplicate packages, and they trigger a number (perhaps all) of the other duplicate packages.
>
> Here's the common/packages.yaml entry:
>
> spack-stack/configs/common/packages.yaml
> Lines 251 to 253 in ece9d20
>
>   py-numpy:
>     require:
>     - '@1.26'
>
> And here's the excerpt from the esmf package.py script:
>
>     # python library
>     with when("+python"):
>         extends("python")
>         depends_on("py-pip")
>         depends_on("py-setuptools", type="build")
>         depends_on("py-wheel", type="build")
>         depends_on("py-mpi4py", when="+mpi")
>         depends_on("py-numpy")
>
> There doesn't appear to be any restriction on the py-numpy version in the esmf package.py script, so why doesn't esmf select [email protected] instead?
>
> Any thoughts?

I don't know how this can happen, unless a require for Orion overwrites the common require.

@rickgrubin-noaa
Collaborator

> > Also noticed that intel MKL is not used in the Orion intel configuration. Is this correct? Not sure MKL would be related to this issue, but thought I should double check on this. Thanks!
>
> That's because NOAA doesn't want to move (back) to MKL from openblas. It used to be MKL with hpc-stack, but when we moved to spack-stack we had openblas first (due to issues with MKL in the stack in the early days of spack/spack-stack). And now it seems we are stuck with openblas on the RDHPCS systems (until someone actually tests MKL vs. openblas and finds out that it is safe to switch back).

Here's what I learned this week when asking the same question:

It boils down to what EMC has historically used. MKL support was added prior for what may have been nonspecific reasons, but we should keep using those other packages unless/until EMC asks otherwise. For a bit of added context, running the UWM RTs based on building with MKL resulted in a ton of numerical results changing, so likely a lot of work involved in reconciling/validating outputs. That's why we don't want to go down that road unless we're specifically asked.

@climbfuji
Collaborator

Thanks for the info. Let's keep openblas on the RDHPCS platforms (and continue to carry around all the special handling in the site configs).

@rickgrubin-noaa
Collaborator

> > Thanks for the tip @rickgrubin-noaa! I poked around a bit and at least part of the issue is that the common/packages.yaml requires version 1.26.x for py-numpy but [email protected] with its +python variant ends up building [email protected] (latest version). These two versions become duplicate packages, and they trigger a number (perhaps all) of the other duplicate packages.
> >
> > Here's the common/packages.yaml entry:
> >
> > spack-stack/configs/common/packages.yaml
> > Lines 251 to 253 in ece9d20
> >
> >   py-numpy:
> >     require:
> >     - '@1.26'
> >
> > And here's the excerpt from the esmf package.py script:
> >
> >     # python library
> >     with when("+python"):
> >         extends("python")
> >         depends_on("py-pip")
> >         depends_on("py-setuptools", type="build")
> >         depends_on("py-wheel", type="build")
> >         depends_on("py-mpi4py", when="+mpi")
> >         depends_on("py-numpy")
> >
> > There doesn't appear to be any restriction on the py-numpy version in the esmf package.py script, so why doesn't esmf select [email protected] instead?
> >
> > Any thoughts?
>
> I don't know how this can happen, unless a require for Orion overwrites the common require.

Building release/1.9.0 on orion, for configs/sites/tier1/orion/packages_oneapi.yaml (analogue to packages_intel.yaml):

  py-numpy:
    require::
    - '@1.26'
    - '^openblas'

results in no duplicates.

@climbfuji
Collaborator

Ahh of course, this makes sense. We had to use :: to overwrite the default mkl with openblas. Thanks @rickgrubin-noaa
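To illustrate the `::` behavior described here, below is a toy Python model of how a higher-precedence config scope is combined with a lower one. It is a sketch under assumptions, not Spack's implementation: `merge_scopes` is a hypothetical helper, a `key::` entry is represented as a dict key ending in `:`, and the "before" Orion entry (only `^openblas`) is a guess, since the thread does not show the original site config.

```python
def merge_scopes(low, high):
    """Combine two config dicts; `high` has higher precedence.

    Toy model only: a key ending in ':' (written 'key::' in YAML) replaces
    the lower-precedence value outright. Otherwise dicts merge recursively
    and lists from `high` are placed ahead of lists from `low`.
    """
    result = dict(low)
    for key, value in high.items():
        if key.endswith(":"):
            result[key[:-1]] = value  # override: drop whatever `low` had
        elif isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge_scopes(result[key], value)
        elif isinstance(value, list) and isinstance(result.get(key), list):
            result[key] = value + result[key]
        else:
            result[key] = value
    return result


# Common scope, as quoted earlier in the thread.
common = {"py-numpy": {"require": ["@1.26"]}}

# ASSUMPTION: site entry before the fix, overriding with openblas only.
site_before = {"py-numpy": {"require:": ["^openblas"]}}

# Site entry after the fix, as posted above.
site_after = {"py-numpy": {"require:": ["@1.26", "^openblas"]}}

print(merge_scopes(common, site_before))  # the '@1.26' pin is gone
print(merge_scopes(common, site_after))   # the pin is restored
```

Because the override discards the common requirement, the `@1.26` pin has to be repeated in the site entry, which is what the fix above does.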

@srherbener
Collaborator Author

This works for my case too! Thanks @rickgrubin-noaa! I'm seeing only the expected duplicates now:

[ue-intel-test] orion-login-2[55] herbener$ ../../util/show_duplicate_packages.py -d log.concretize
zohtrou  [email protected]%[email protected]+fix~ipo  build_system=cmake  build_type=Release  generator=make  arch=linux-rocky9-skylake_avx512
ujod65x  [email protected]%[email protected]+fix~ipo  build_system=cmake  build_type=Release  generator=make  arch=linux-rocky9-skylake
veonfpk  [email protected]%[email protected]+fix~ipo  build_system=cmake  build_type=Release  generator=make  arch=linux-rocky9-skylake_avx512
545czl5  [email protected]%[email protected]  cxxflags='-fp-model  precise'  fflags='-fp-model  precise'  ~debug~external-lapack+external-parallelio+mpi+netcdf~pnetcdf+python+shared~xerces  build_system=makefile  esmf_comm=auto  esmf_os=auto  esmf_pio=auto  patches=f63d405  snapshot=none  arch=linux-rocky9-skylake_avx512
6vblbfy  [email protected]%[email protected]  cxxflags='-fp-model  precise'  fflags='-fp-model  precise'  ~debug~external-lapack+external-parallelio+mpi+netcdf~pnetcdf+python+shared~xerces  build_system=makefile  esmf_comm=auto  esmf_os=auto  esmf_pio=auto  patches=f63d405  snapshot=none  arch=linux-rocky9-skylake
===
Duplicates found!
[ue-intel-test] orion-login-2[56] herbener$ 

I'll create a PR for the Orion Intel config. Should that be based on the release/1.9.0 branch?

@climbfuji
Collaborator

Yes please. I guess we need the same change for each and every RDHPCS system that has an override of the py-numpy requirements in the site config (packages_{intel,oneapi}.yaml).
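A rough way to find the affected site configs is to scan them for a `py-numpy` entry that uses `require::` without the version pin. The sketch below is an assumption-laden illustration: both helper names are hypothetical, the scan is line-based rather than a real YAML parser, and it only understands the simple block layout shown in this thread.

```python
from pathlib import Path


def numpy_override_missing_pin(yaml_text, pin="@1.26"):
    """Return True if `py-numpy` uses `require::` without listing `pin`.

    Line-based scan: expects a `py-numpy:` key, then `require::`, then
    `- item` lines, as in the snippets quoted in this thread.
    """
    in_numpy = in_override = found_override = False
    numpy_indent = 0
    items = []
    for line in yaml_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        indent = len(line) - len(line.lstrip())
        if stripped == "py-numpy:":
            in_numpy, numpy_indent, in_override = True, indent, False
            continue
        if in_numpy and indent <= numpy_indent:
            in_numpy = in_override = False  # left the py-numpy block
        if in_numpy:
            if stripped == "require::":
                in_override = found_override = True
            elif in_override and stripped.startswith("- "):
                items.append(stripped[2:].strip().strip("'\""))
            else:
                in_override = False
    return found_override and pin not in items


def scan_sites(tier_dir):
    """Yield site config paths under `tier_dir` that still need the pin."""
    for path in sorted(Path(tier_dir).glob("*/packages_*.yaml")):
        if numpy_override_missing_pin(path.read_text()):
            yield path


fixed = "  py-numpy:\n    require::\n    - '@1.26'\n    - '^openblas'\n"
print(numpy_override_missing_pin(fixed))
```

Pointing `scan_sites` at a directory such as configs/sites/tier1 would list the files to review; each hit should still be checked by hand, since the scan matches the pin string exactly.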

@rickgrubin-noaa @AlexanderRichert-NOAA Since we expect bit-for-bit (b4b) differences for the UFS when switching to the "full" OneAPI compilers (icx, icpx, ifx) in spack-stack-1.10.0 anyway, do we want to make the move back to MKL at the same time?

@AlexanderRichert-NOAA
Collaborator

AlexanderRichert-NOAA commented Jan 29, 2025

It's a question for UWM devs, I have little preference (it's one more source of possible numerical differences that will get investigated by devs from time to time, but I don't mind if the devs don't). How was UWM getting linked to MKL previously?

@climbfuji
Collaborator

> It's a question for UWM devs, I have little preference (it's one more source of possible numerical differences that will get investigated by devs from time to time, but I don't mind if the devs don't). How was UWM getting linked to MKL previously?

In hpc-stack, I don't recall the details. These were all manually configured build scripts.

@srherbener
Collaborator Author

PR #1482 resolved this issue, so I will close as completed.
