Change default IO type from NETCDF4C to PNETCDF #325
Testing:
|
|
Thanks very much, @amametjanov! This looks promising. |
|
@amametjanov, I've got 3 tests for this in the queue, one on Chrysalis and 2 on Frontier. But wait times seem to be a bit long in both places. I'll keep you posted. |
|
To fix |
|
@amametjanov, maybe this will be fixed by E3SM-Project/scorpio#670, but what I'm seeing on Chrysalis with this branch is:
The Polaris output is available at:
It seems like this won't be a short-term fix for Omega if a scorpio fix is needed, because that would mean:
It feels like we should look into whether there's some alternative way to address #323 in the next week or two. |
Update scorpio from v1.8.2 2025-Jul-14 to v1.9.0 2025-Nov-21. Also add fix for PnetCDF CDF5 types.
4d44424 to e68388f
|
Xylar, please check with the updated head of this branch to see if it fixes
This branch updates the scorpio submodule ahead of E3SM/master's version, which is still on v1.8.2 2025-Jul-14. When scorpio gets updated in E3SM/master (to v1.9.0 or later), the E3SM/master merge into Omega/develop will subsume this branch's updates. |
|
Thanks, @amametjanov! I'll retest as soon as I can. |
xylar left a comment
I tested the omega_pr suite using the fix in E3SM-Project/polaris#442, pointing to this branch for the Omega build.
I was able to run successfully with both Intel and Gnu on Chrysalis. I discovered that I can't log in to either Aurora or Frontier at the moment. I'm trying on Perlmutter (CPU and GPU) next.
In the meantime, two small questions/comments.
|
On Perlmutter-CPU (both Intel and Gnu) and -GPU (Gnu-GPU), I'm seeing the same hanging behavior that was reported in E3SM-Project/polaris#396 and that we had seen previously. It seems like that behavior is unfortunately independent of this PIO problem. |
|
@amametjanov, I ran the tests for this PR on Frontier, but I got the same PIO error:
Please see the Frontier test results on the Omega CDash dashboard at https://my.cdash.org/index.php?project=omega |
|
@amametjanov, could you please let me know when this is ready for me to re-test in Polaris? |
|
Yes, this is ready, please re-run Polaris tests. 🙏 |
Testing
I successfully ran the
(Unchecked items are still in the queue -- update soon...)
I also verified that one
@amametjanov and @grnydawn, thank you so much for figuring out these issues and fixing them! |
|
My Frontier tests are now running. All the CPU tests did okay. The GPU tests are very slow by comparison, and they are taking more than the 1 hour I had allocated. I don't know for sure, but I presume the slowness is not from this PR. I also saw the I/O failure in one
I will rerun both the tests that timed out and the one that failed. We will see what happens.
Update: It seems like the file system on Frontier might be a problem. My resubmitted jobs are hanging just trying to load the environment. |
|
In the This is for the |
|
When I try to rerun the failed test (
The original error was the same as we have seen before:
I presume these errors might indicate that Omega isn't overwriting the |
|
For Frontier, craygnu and craygnu-mphipcc are the only compilers E3SM cares about. Don't spend more than a token amount of E3SM time looking at the others. |
|
Okay, thanks @rljacob. That wasn't clear to me. |
|
I set up |
|
Nope, now I'm seeing the usual error message:
but this time in the
So I think this PR should go in, but we can't consider this problem to be solved. |
|
Thank you for re-running Polaris tests (and merging).
I heard that the Frontier scratch filesystem was hanging and slow this week; maybe that's the culprit. |
This merge updates the e3sm_submodules/Omega submodule from [f2e951a](https://github.com/E3SM-Project/Omega/tree/f2e951a) to [fc53608](https://github.com/E3SM-Project/Omega/tree/fc53608).

This update includes the following MPAS-Ocean and MPAS-Frameworks PRs (check mark indicates bit-for-bit with previous PR in the list):
- [ ] (ocn) E3SM-Project/Omega#325
- [ ] (ocn) E3SM-Project/Omega#329
Change default IO type from NETCDF4C to PNETCDF
Checklist
Testing with the following: have been run on and indicate that are all passing.
has passed, using the Polaris
e3sm_submodules/Omega baseline -p for both the baseline (Polaris e3sm_submodules/Omega) and the PR build
Fixes #333
Closes #334