<td>- <a href="../Getting_Started/System_status.md">Sign up for system status updates</a> for advance warning of any system updates or unplanned outages.<br>- <a href="https://www.docs.nesi.org.nz">Consult our User Documentation</a> pages for instructions and guidelines for using the systems.<br>- <a href="https://www.youtube.com/playlist?list=PLvbRzoDQPkuGMWazx5LPA6y8Ji6tyl0Sp">Visit our YouTube channel</a> for introductory training webinars.</td>
docs/Announcements/Known_Issues_HPC3.md
Below is a list of issues that we're actively working on. We hope to have these resolved soon.
For differences between the new platforms and Mahuika, see the more permanent [differences from Mahuika](../Getting_Started/FAQs/Mahuika_HPC3_Differences.md).
## OnDemand Apps
* Firefox will fail to render the _HPC Shell Access_ app correctly. Please switch to a Chrome or Safari browser until the vendor provides a fix.
* Slurm `sbatch` jobs can be submitted directly from your apps, such as the terminal in JupyterLab, RStudio or code-server. However, interactive jobs (`srun` or `salloc`) can only be run from the `Clusters > NeSI HPC Shell Access` dropdown menu, which opens a standard terminal window in the browser. [Watch a demo here](https://youtu.be/bkq6tpRrAwc?si=kS2KBifnCf4d6tWz).
* The resources dedicated to interactive work via a web browser are smaller, so computations requiring large amounts of memory or many CPU cores are not yet supported.
* Missing user namespaces in Kubernetes pods will interfere with most Apptainer operations: `apptainer pull` works, but the `apptainer exec`, `apptainer run` and `apptainer shell` commands cannot be executed.
## UCX ERROR
Multi-node MPI jobs may fail on the four nodes mg[13-16] with errors like `UCX ERROR: no active messages transport`. If you encounter this, add the sbatch option `-x mg[13-16]` to avoid those nodes. Single-task jobs are not affected.
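That exclusion can also go in the job script itself. A minimal sketch of a two-node MPI job, where the job name, resources and program are placeholders and only the `--exclude` value comes from this page:

```shell
#!/bin/bash -e
# Sketch of a multi-node MPI job that avoids the affected nodes.
# Everything except the --exclude line is an illustrative placeholder.
#SBATCH --job-name=mpi_test
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --time=00:10:00
#SBATCH --exclude=mg[13-16]   # long form of the -x option mentioned above

srun ./my_mpi_program
```

The option can equally be given on the command line, e.g. `sbatch -x mg[13-16] job.sl`.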
## Core dump files
Contrary to what is stated in [our documentation on core files](../Getting_Started/FAQs/What_is_a_core_file.md), these are not currently available, even if `ulimit -c unlimited` is set.
## Software
* FileSender - If you set the `default_transfer_days_valid` parameter in your `~/.filesender/filesender.py.ini` to a value greater than 20, transfers will fail with a 500 error code. Please do not modify this parameter.
* Legacy Code - Some of our environment modules cause system software to stop working, e.g. after `module load Perl/5.38.2-GCC-12.3.0`, `svn` stops working. This usually happens when the module loads `LegacySystem/7` as a dependency. The solutions are to ask us to rebuild the problematic environment module, or to keep it unloaded while doing other things.
* MPI software using 2020 or earlier toolchains, e.g. intel/2020a, may not work correctly across nodes. Trying a more recent toolchain is recommended.
## Slurm
### Requesting GPUs
If you request a GPU without specifying which *type* of GPU, you will get a random one. So please always specify a GPU type.
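For example, in a job script, a sketch only — the `A100:1` type:count value is an illustrative assumption, so substitute one of the GPU types actually offered on the cluster:

```shell
#!/bin/bash -e
# Sketch: always name the GPU type rather than requesting a bare GPU.
# The "A100" type here is an illustrative assumption.
#SBATCH --job-name=gpu_job
#SBATCH --gpus-per-node=A100:1   # type:count, not just --gpus-per-node=1
#SBATCH --time=00:10:00

nvidia-smi
```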
### BadConstraints
This uninformative message can appear in the `squeue` output as the reason a job is pending. It does not always reflect a real problem, being just a side-effect of the mechanism we use to target jobs to the right-sized node(s), together with a small bug in Slurm. If it causes your job to be put on hold (i.e. its priority appears as zero in the output of `squeue --me -S -p --Format=jobid:10,partition:13,reason:22,numnodes:.6,prioritylong:.6`), then please try `scontrol release <jobid>`, or {% include "partials/support_request.html" %} if the issue persists.