MJF in TimeLeft #7073

Open
iueda opened this issue Jun 22, 2023 · 5 comments

@iueda
Contributor

iueda commented Jun 22, 2023

See https://ggus.eu/index.php?mode=ticket_info&ticket_id=162431
DESY claims that our DIRAC pilots do not respect the MACHINEFEATURES/shutdowntime value they set,
referring to documents from early 2016:
https://hepsoftwarefoundation.org/notes/HSF-TN-2016-02.pdf
https://twiki.cern.ch/twiki/bin/view/LCG/WMTEGEnvironmentVariables

Looking into the code:

if name is None and "MACHINEFEATURES" in os.environ and "JOBFEATURES" in os.environ:

MJFResourceUsage seems to be used only when the batch system is unknown; is that correct?
The pilots running at DESY find that the batch system is HTCondor, and the log reads

MaxRuntime attribute is not supported
Could not determine timeleft for batch system at site LCG.DESY.de
CPUTime for /Resources/Sites/LCG/LCG.DESY.de/CEs/grid-htcondorce0.desy.de/Queues/htcondorce-condor: 216000.000000

There have been some discussions in the past:
#4544 JobAgent TimeLeft computation: definitions, multi-core environments, batch system based on wallclock time
#4788 HTCondor TimeLeft module

If MaxRuntime is not available (most HTCondor queues are concerned), setting MaxCPUTime (not too high) should be sufficient.

Is MJF intentionally not used by the pilot jobs on HTCondor?
Not only for getting the wallclock time limit, but even for the downtime?

@fstagni
Contributor

fstagni commented Jun 22, 2023

Before investigating the code, I am surprised that:

  • MJF is still deployed somewhere, even though it is completely unsupported; the last time I checked, it didn't even have a working Python 3 version;
  • they ask you to comply with it!
  • LHCb (and other DIRAC users) also run at DESY, and we didn't get such complaints.

@chrisburr
Member

LHCb was also ticketed: https://ggus.eu/index.php?mode=ticket_info&ticket_id=162429

@fstagni
Contributor

fstagni commented Jun 22, 2023

As far as TimeLeft is concerned, MJF is used only when nothing else is found. So it basically won't be used when there's a known batch system. When this was initially coded, we thought of switching the priority once MJF had been deployed ~everywhere, but this never happened and the MJF project died a slow death. I will reply in LHCb's ticket.
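
To make the priority concrete, here is a minimal sketch of the fallback order described above. It is not the actual DIRAC code: the function name `select_time_left_plugin` and the `BATCH_SYSTEM_MARKERS` table are hypothetical and only illustrate the idea. The final condition mirrors the line quoted at the top of this issue.

```python
import os

# Hypothetical table: marker environment variable -> batch system name.
# The real detection logic in DIRAC may use different variables.
BATCH_SYSTEM_MARKERS = {
    "_CONDOR_JOB_AD": "HTCondor",
    "SLURM_JOB_ID": "SLURM",
    "LSB_JOBID": "LSF",
}


def select_time_left_plugin(environ=None):
    """Return the name of the plugin that would be tried, or None."""
    env = os.environ if environ is None else environ
    name = None
    # A known batch system always wins.
    for marker, batch_system in BATCH_SYSTEM_MARKERS.items():
        if marker in env:
            name = batch_system
            break
    # MJF is only a fallback, considered when no batch system was found.
    if name is None and "MACHINEFEATURES" in env and "JOBFEATURES" in env:
        name = "MJF"
    return name
```

With this ordering, a worker node that exposes both HTCondor and MJF variables never reaches the MJF branch, which matches what is seen at DESY.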

@marianne013
Contributor

Just to say that in the UK it's not used, apparently not even by Manchester who invented it.

@iueda
Contributor Author

iueda commented Jun 26, 2023

I have read https://ggus.eu/index.php?mode=ticket_info&ticket_id=162429, and https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOpsMinutes230608#HEPScore_status_update

I understand MJF was abandoned because the "numbers being published on the WNs are too unreliable in practice",
but I suppose those numbers were the "benchmarking and the CPU and WallClock time available to the job" (#4544).

Maybe it is still worthwhile to respect the "downtime", since it would usually not be filled in?
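
As a rough illustration of what respecting it could look like, here is a sketch that reads only the shutdown time, independently of the batch system. It assumes `$MACHINEFEATURES` is a local directory containing a `shutdowntime` file with a UNIX timestamp, as described in HSF-TN-2016-02; it does not handle the case where the variable is a URL. The function name `seconds_until_shutdown` is hypothetical and not part of DIRAC.

```python
import os
import time


def seconds_until_shutdown(environ=None, now=None):
    """Return the seconds left before shutdowntime, or None if not published."""
    env = os.environ if environ is None else environ
    directory = env.get("MACHINEFEATURES")
    if not directory:
        return None
    path = os.path.join(directory, "shutdowntime")
    try:
        with open(path) as handle:
            shutdown_time = int(handle.read().strip())
    except (OSError, ValueError):
        # File missing, unreadable, or not an integer: treat as "not set".
        return None
    current = time.time() if now is None else now
    # Never report a negative value if the shutdown time has already passed.
    return max(0, int(shutdown_time - current))
```

A pilot could take the minimum of this value and whatever the batch system plugin reports, and ignore it when the result is None.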

4 participants