Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

crab report wrongly reports lumis as both failed and successful for Autom.Split #5328

Open
belforte opened this issue Jul 15, 2024 · 9 comments
Assignees

Comments

@belforte
Copy link
Member

belforte commented Jul 15, 2024

ref: https://cms-talk.web.cern.ch/t/command-crab-report-not-generate-filed-files/43487

Report from @mapellidario :

User submitted task 240704_211221:matheus_crab_EGammaEXO_Run2018A_20240704_231147 with Automatic splitting, which means that if a job at step one fails,
the lumisections that it did not process will be processed by a later stage.
This means that it is possible for a lumisection to have been assigned to a job that failed,
and then re-assigned to a job at a later stage that was successful.
In particular, the following lumisections [2] of run 316457 are both in
processedLumis.json and failedLumis.json, while they are not in notFinishedLumis.json.

[2]

{516, 12, 14, 30, 32, 66, 86, 601, 114, 116, 117, 1146, 1147, 129, 1155, 1156, 1159, 136, 1161, 1167, 656, 1169, 661, 662, 663, 664, 1177, 1180, 1181, 1185, 1190, 1192, 1194, 1195, 1196, 1197, 1198, 1199, 1207, 185, 1209, 723, 724, 225, 226, 228, 229, 232, 233, 234, 242, 271, 276, 280, 281, 283, 288, 294, 299, 307, 309, 823, 312, 824, 316, 326, 329, 330, 331, 332, 337, 341, 342, 351, 355, 356, 359, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 380, 394, 395, 396, 397, 398, 400, 402, 403, 407, 409, 410, 412, 416, 425, 429, 440, 448, 489, 492, 497, 499, 503, 504}

@belforte
Copy link
Member Author

@mapellidario can you tell me how to reproduce the list above ? I can not find a failedLumis.json file what crab report writed

@mapellidario
Copy link
Member

you need to use the --recovery=failed option, which will not produce any output unless you apply the PR that I submitted on Friday

@belforte
Copy link
Member Author

but I can't find the code which writes that file !

@belforte
Copy link
Member Author

anyhow the output is there

@belforte
Copy link
Member Author

ok. found

@belforte
Copy link
Member Author

this loop should be modified to only look at tail jobs when automatic splitting is used. Even then, when some tail fail, the input will be reprocessed in additional tail subdags, We need to somehow backtrace the job number hierarchy...

elif self.options.recovery == 'failed':
for jobid, status in reportData['jobList']:
if status in ['failed']:
for run, lumiRanges in lumisToProcessPerJob[jobid].items():
if run not in notProcessedLumis:
notProcessedLumis[run] = []
for lumiRange in lumiRanges:
notProcessedLumis[run].extend(range(lumiRange[0], lumiRange[1]+1))
notProcessedLumis = LumiList(runsAndLumis=notProcessedLumis).getCompactList()
notProcLumisCalcMethMsg += " the lumis to process by jobs in status 'failed'."

I start to wonder if we should rather remove --failed option !

@belforte
Copy link
Member Author

is there any bad side effect from using simply toProcess - Processed ?

@mapellidario
Copy link
Member

I start to wonder if we should rather remove --failed option !

After spending a good 15min thinking about how to make this option work for automatic splitting, I came to the very same conclusion. If 2 people out of the 2 who thought about this came to the same conclusion i guess we have enough quorum to proceed! :)

And if we wonder about how impact on our users: well, it was broken, so nobody really could have ever used this feature!

is there any bad side effect from using simply toProcess - Processed ?

no, there is not. And if I understand correctly, this is exactly what the default crab report option does

if self.options.recovery == 'notFinished':
notProcessedLumis = BasicJobType.subtractLumis(lumisToProcess, processedLumis)

@belforte
Copy link
Member Author

At the very least we need to deprecate --failed in case of automatic splitting. In a way, when there's a reliable list of failed jobs, the ability to directly show what lumis were to be processed in those jobs may be good.
POssibly this is also less work.

@belforte belforte self-assigned this Aug 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants