Merge pull request #616 from JaimeFrey/docs-HTCONDOR-1323-job-removal…
…-debug

HTCONDOR-1323 job removal debug
GregThain authored Oct 18, 2024
2 parents 2198d64 + 83518f5 commit d226af7
Showing 1 changed file with 49 additions and 1 deletion.
50 changes: 49 additions & 1 deletion docs/v23/troubleshooting/common-issues.md
@@ -422,14 +422,28 @@ Notice the failures in the above message: `Remote Mapping: gsi@unmapped` and `Au

### Jobs go on hold

Jobs will be put on held with a `HoldReason` attribute that can be inspected with
Jobs can be put on hold with a `HoldReason` attribute that can be inspected with
[condor\_ce\_q](debugging-tools.md#condor_ce_q):

``` console
user@host $ condor_ce_q -l <JOB-ID> -attr HoldReason
HoldReason = "CE job in status 5 put on hold by SYSTEM_PERIODIC_HOLD due to no matching routes, route job limit, or route failure threshold."
```

The CE (and CE client) will put a job on hold when it encounters a problem
with the job that it doesn't know how to resolve.

If the HTCondor schedd believes that the existing job it has submitted
to a remote queue may be recoverable, then it will leave the remote job
queued and keep the `GridJobId` attribute defined in the local job ad.
If you release the local job (with `condor_ce_release`), then the schedd
will attempt to re-establish contact with the remote scheduler.

If the schedd believes the existing remote job is not recoverable, then it
will remove the job from the remote queue and set `GridJobId` to `Undefined`
in the local job ad. If you release the local job, then a new job instance
will be submitted to the remote scheduler.
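
The two cases above hinge on whether `GridJobId` is still defined in the local job ad. As a minimal sketch, assuming the attribute value has already been read as text (for example with `condor_ce_q <JOB-ID> -af GridJobId`, and assuming an unset attribute is printed as `undefined`), the expected effect of a release could be summarized as follows. The function name and the return strings are hypothetical and exist only for illustration.

```python
def release_effect(grid_job_id: str) -> str:
    """Describe what releasing a held job is expected to do.

    Assumption: an unset GridJobId is rendered as the text "undefined"
    (in any letter case); any other non-empty text is a defined value.
    """
    value = grid_job_id.strip().lower()
    if not value:
        # Empty output means no job ad was read, so nothing can be inferred.
        raise ValueError("no GridJobId value was provided")
    if value == "undefined":
        # The remote job was removed: a release submits a new job instance.
        return "resubmit"
    # The remote job was left queued: a release re-establishes contact.
    return "reconnect"
```

Checking the attribute before running `condor_ce_release` tells you in advance which of the two outcomes to expect.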

#### Held jobs: no matching routes, route job limit, or route failure threshold

Jobs on the CE will be put on hold if they are not claimed by the job router within 30 minutes.
@@ -550,6 +564,40 @@ This means that the `condor_job_router_info` (note this is not the CE version),
2. You have installed HTCondor in a non-standard location that is not in your `PATH`.
3. The `condor_job_router_info` tool itself wasn't available until Condor-8.2.3-1.1 (available in osg-upcoming).

### Jobs removed from the local batch system

When the CE removes a job from the local batch system, it may be due to
a problem the CE encountered with managing the job or it may be at the
behest of the submitter to the CE (which may be a remote HTCondor
Access Point).

Given a specific job ID in the CE logs, first find the job ad in the CE
queue with the `condor_ce_q` tool and check the value of the `GridJobId`
attribute:

``` console
user@host $ condor_ce_q <JOB_ID> -af GridJobId
```

If the job is no longer in the queue, you will have to check the history
using the `condor_ce_history` tool:

``` console
user@host $ condor_ce_history <JOB_ID> -af GridJobId
```

If the `GridJobId` is *undefined*, then the CE did the removal due to a
problem interacting with the local batch system.
Check the `HoldReason` and `LastHoldReason` attributes for why the CE
removed the job.

If `GridJobId` is set to a value (that is, it is not *undefined*), then the
submitter to the CE removed the job.
If the submitter is a remote HTCondor Access Point, its daemons may have
done the removal as part of putting its local job on hold.
In that case, the `HoldReason` attribute in the remote job queue should
indicate the source of the problem.
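
The check described above can be scripted. The following is a rough sketch rather than a supported tool: it runs the same `-af GridJobId` query shown earlier and interprets the result. The function names and return labels are hypothetical, and the code assumes that an unset attribute is printed as `undefined` and that an empty result means no matching job ad was found.

```python
import subprocess


def query_grid_job_id(job_id: str, tool: str = "condor_ce_q") -> str:
    """Run `<tool> <job_id> -af GridJobId` and return the trimmed output.

    Pass tool="condor_ce_history" for jobs that have left the queue.
    """
    result = subprocess.run(
        [tool, job_id, "-af", "GridJobId"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def removal_origin(grid_job_id: str) -> str:
    """Return "ce" or "submitter" based on the GridJobId text."""
    value = grid_job_id.strip().lower()
    if not value:
        # No output: the job ad was not found, so try the history instead.
        raise ValueError("no job ad found for this job ID")
    if value == "undefined":
        # The CE removed the job; inspect HoldReason and LastHoldReason next.
        return "ce"
    # GridJobId still holds a value: the submitter to the CE removed the job.
    return "submitter"
```

For example, `removal_origin(query_grid_job_id("<JOB_ID>"))` classifies a job still in the queue. If it raises `ValueError`, repeat the query with `tool="condor_ce_history"`.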

Getting Help
------------
