Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libjob: rabbit jobspec tests #6727

Merged
merged 4 commits into from
Mar 28, 2025

Conversation

jameshcorbett
Copy link
Member

@jameshcorbett jameshcorbett commented Mar 22, 2025

WIP PR to address #6710. Several of the tests in t/t0022-jj-reader.t fail. Some of them because I didn't bother to spend the time to get them right, but others because now jj-reader can incorrectly count the nodes. A jobspec like this:

version: 1
resources:
  - type: slot
    count: 2
    label: rabbit
    with:
    - type: ssd
      count: 1
    - type: node
      exclusive: true
      count: 1
      with:
      - type: slot
        label: task
        count: 1
        with:
        - type: core
          count: 1

should give a nodecount of 2, but it currently gives a nodecount of 1 because the reader doesn't take into account the count of the higher-level slot entry. That could be easily fixed. However, it will probably get a wrong slot count but I don't know how the slot count is supposed to be used so I'm not as sure how to fix it.

@grondo @cmoussa1 I have to step away from development on this issue for a while to get back to SCR work, but I wanted to put it up so you could see what I was thinking.

I am wondering if flux-coral2 should drop the use of jobspec_update and just update the job-manager's version of the jobspec in-place like it used to, so that only the scheduler sees the updated version (I think that's how it works). That would let us punt on the issue until we have a better way to handle it. But if I'm wrong about only the scheduler seeing the updated jobspec, and in particular if the mf_priority plugin sees the in-place-updated jobspec, we'd still hit the same problem and that wouldn't save us anything.

@grondo
Copy link
Contributor

grondo commented Mar 24, 2025

But if I'm wrong about only the scheduler seeing the updated jobspec, and in particular if the mf_priority plugin sees the in-place-updated jobspec, we'd still hit the same problem and that wouldn't save us anything.

Yeah, I don't think it would help. jobspec-update events modify the job manager's copy of the jobspec in place, then future callbacks that unpack from jobspec key of plugins args reference the updated copy.

We'll need to think of some simple update to correctly count resources in jobspec V1+ or whatever this is. I actually didn't realize that jobspec could have more than one slot. We may need to sit down and devise some rules for a v2 jobspec that scheduling rabbits requires, encode that in an RFC, then update libjj using those rules. Maybe we can discuss this at the next meeting.

Problem: some flux-coral2 rabbit jobs insert an 'ssd' vertex type
which is not accepted by the jj jobspec reader.

Ignore unknown types.
Problem: the jj-reader tests need to be updated for jj's new handling
of unknown resource types.

Change the tests, removing two tests that were previously expected
to fail but now succeed.
Problem: flux-coral2 sometimes generates jobspecs with entries above
'node' (usually 'slot' entries) where the intended meaning is that
whatever is below should be multiplied by that count.

Add logic so that if there are entries in jobspec above 'node', the
nodecount read in by `jj.c` is multiplied.
Problem: there are no tests for the multiplication of the nodecount
in jobspec if there are entries above "node".

Add tests.
@jameshcorbett jameshcorbett marked this pull request as ready for review March 28, 2025 00:56
@jameshcorbett jameshcorbett requested a review from grondo March 28, 2025 00:57
@jameshcorbett
Copy link
Member Author

After discussion in the meeting yesterday, I think this is ready for review @grondo .

Copy link
Contributor

@grondo grondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks!

@mergify mergify bot merged commit 0365918 into flux-framework:master Mar 28, 2025
35 checks passed
Copy link

codecov bot commented Mar 28, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.80%. Comparing base (52f9060) to head (d159c71).
Report is 5 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6727      +/-   ##
==========================================
- Coverage   83.82%   83.80%   -0.02%     
==========================================
  Files         534      534              
  Lines       89281    89279       -2     
==========================================
- Hits        74838    74823      -15     
- Misses      14443    14456      +13     
Files with missing lines Coverage Δ
src/common/libjob/jj.c 93.24% <100.00%> (-0.18%) ⬇️

... and 11 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants