SGK `qstat+` PERL wrapper around GridEngine `qstat`

Description

qstat+ is a PERL scripts to improve the output of qstat on a system running the Grid Engine as the job scheduler/load manager.

Release 1.01 Jul 1 2024, version 5.5/2

Conceptual Description

Runs a set of qstat commands and parse their output to produce a better output than qstat

It will run one or more of the following commands:

 qstat -xml -r -ext -u USER
 qstat -xml -f -u USER
 qstat -j JID
 qstat -F ssd_res,ssd_free,ssd_total -q all.q@@ssd-hosts -u nobody
 ruptime or ssh HOST uptime

qstat+ will also compute some derived quantities, like fraction of CPU usage and memory usage. On the HPC cluster qstat+ was developed for, we have implemented a memory reservation and a local SSD usage request, hence qstat+ can return information related to both.

Practical Considerations

We run Altair's Grid Engine version 8.8.1 with a customized configuration, hence qstat+ is tailored to that configuration. It is my intention (in my spare time) to clean qstat+ to clearly identify what is specific to our configuration. Among other things:

we do not use the all.q queue, but use other queues (with short names),
our compute nodes are all called compute-NN-MM, hence qstat+ truncates it to NN-MM,
we implemented a memory reservation consumable, and
we have local SSD that can only be used by request.

qstat+ can be run in different (exclusive) modes:

show summary (simple, extended or compact)
show all jobs
show queued jobs
show running jobs
show extra jobs (dr, Eqw, t)
show info on a specific job
show node list for a (set of) job(s)
show node(s) with high load
show global cluster status
show empty slots, including expanded info
show node(s) that are down
show over-reserved jobs, i.e.: resMem/maxVMem > 2.5, & age > 1 hr
show over-subscribed jobs, i.e.: cpu% > 133%, & age > 1 hr
show inefficient jobs, i.e.: cpu% < 33%, & age > 1 hr
show SSD usage
show help
show examples

and can be run in debug mode, for debugging & testing.

Usage info

usage: qstat+ [mode] [options]
  where modes are (exclusive list):
    -s|-sx|-sc      show summary (simple, extended or compact)
    -a              show all jobs
    -q              show queued jobs
    -r              show running jobs
    -X              show extra jobs (dr, Eqw, t)
    -j      JID     show info on a specific job
    -nlist  JID     show node list for a (set of) job(s)
    -hiload N       show node(s) with high load
    -gc             show global cluster status
    -es[x]          show empty slots, -esx: expanded info
    -down           show node(s) that are down
    -ores           show overreserved jobs, i.e.: resMem/maxVMem > 2.5, & age > 1 hr
    -osub           show oversubscribed jobs, i.e.: cpu% > 133%, & age > 1 hr
    -ineff          show inefficient jobs, i.e.: cpu% < 33%, & age > 1 hr
    -ssd            show SSD usage
    -h|-help--help  show help
    -examples       show examples

where options are:
    -u USER       limit to the specified user, you can use "*" or "all" to list everybody's jobs
                  or a coma-separated list
    -njobs        show counts in no. of jobs
    -npes         show counts in no. of PEs (slots)
    -nqpe         show jobs/PEs not jobs/tasks for queued jobs
    -nqxx         show jobs/tasks/PEs for queued jobs
    -load         show the nodes' load
    -sua          show the nodes' used/avail no of slots
    -age          show the age of the jobs, i.e., elapsed time vs starting/submit time
    -cpu          show the amount of CPU used by running job(s)
    -cpu%         show the amount of CPU/age/#PE in % (job efficiency)
    -cpur         show the ratio CPU/AGE, not scaled by #PE
    -mem          show the mean memory usage for running jobs, the total requested memory for queued jobs, in GB
    -memx         show more memory info for running jobs: reserved, mean, vmem and maxvmem (slow)
    -memr         show more memory info for running jobs: reserved, mean, maxvmem and res/mxvmem (slow)
    -io           show the I/O usage
    -iow          show the I/O wait usage (slow)
    -ioops        show the IOPs usage (slow)
    +lic          show license info (default), although not in detailed outputs
    -lic          do not show license info

    -noheader     do not show the header
    -nofooter     do not show the footer
    -raw          print raw values (for easier parsing)
    -queue QSPEC  limit to jobs in queue QSPEC (RE ok)
    -oresF  VAL   change the overreserved factor to VAL
    -osubT  VAL   change the oversubscribed threshold to VAL, in %
    -osubE  VAL   set the oversubscribtion excess threshold to VAL, in hr x slots
    -ineffT VAL   change the inefficient threshold to VAL, in %
    -ageT   VAL   change the minimum age threshold to VAL, in hour

    -v|-verbose   set verbose mode
    -warn         show warnings
    -wide         wide output (132 cols, implies -notty)
    -notty        do not use the width of the terminal, as returned by stty
    -check-load   check the instantanous load (-hiload only)

shorthands:
    +a   is expanded to -a -u $USER -nofooter             show all your jobs
    +a%                 -a -u $USER -nofooter -age -cpu%  ibidem, with cpu on % of age
    +ax%                -a -u $USER -nofooter -age -cpu% -load -sua -mem -io
    +ar%                -a -u $USER -nofooter -age -cpu% -memr -io
    +r                  -r -u $USER -nofooter             show all your running jobs
    +r%                 -r -u $USER -nofooter -age -cpu%  ibidem, with cpu on % of age
    +rr%                -r -u $USER -nofooter -age -cpu% -memr -io -iow -ioops
    +rx%                -r -u $USER -nofooter -age -cpu% -load -sua -memx -io -iow -ioops
    +q                  -q -u $USER -nofooter             show all your queued jobs
    +q%                 -q -u $USER -nofooter -age        ibidem but show age
    +qx%                -q -u $USER -nofooter -age -mem   ibidem plus mem info
    +X                  -X -u $USER -nofooter             show all your extra jobs
    +n                  -notty -wide -noheader -nofooter

qstat+: Ver 5.5/2 - Jun 2024

Usage Examples

% qstat+ -examples

examples:
 qstat+ -a -u hpc                   show all of hpc's jobs
 qstat+ -r -cpu% -u hpc             show all of hpc's running jobs, cpu in %
 qstat+ -r -cpu% -load -sua -u hpc  show all of hpc's running jobs, cpu in %, and
                                the nodes' load and slot usage/availability
 qstat+ -q                          show all the queued jobs
 qstat+ -j 8683280,8683285          show info on specific job IDs
 qstat+ -nlist 8683280,8683285      show nodes list for specific job IDs
 qstat+ -hiload 1.5                 show nodes whose load is 1.5 greater than the number of slots used
 qstat+ -ineffT 50 -ineff           show jobs that are below a 50% efficiency threshold
 qstat+ -osubT 200 -ageT 48 -osub   show jobs that are above a 200% usage threshold and are older than 48 hours
 qstat+ -gc                         show the global cluster status
 qstat+ -es                         show the empty slots
 qstat+ -down                       show which nodes are down

        +ax% -u all                 is equiv to -a -u all -cpu% -load -sua -mem
                             etc... for the +XXX shorthands

qstat+: Ver 5.5/2 - Jun 2024

Customization

To run this on a different GE cluster, one needs to redefine some of the variables near the top of the PERL script:

#
# list of nodes: head node, login nodes, NSDs
#
my @logNames = qw/hydra-7 login01 login02 login04 login07/;
my @nsdNames = qw/nsd01 nsd02/;
my $nodePrfx = 'compute-';
#
# list of queues to show - all.q is ignored
#
my @queueSet = ('sThC.q', 'mThC.q', 'lThC.q', 'uThC.q', '', 
                'sThM.q', 'mThM.q', 'lThM.q', 'uThM.q', '',
                'uTxlM.rq', '',
                'lTIO.sq', '',
                'sTgpu.q','mTgpu.q', 'lTgpu.q', '',
                'qgpu.iq', 'qrsh.iq', 
    );
#
# queue to ignore, this can be a RE since it is used as if ($queue !~ /$ignQueue/)
my $ignQueue = 'all.q'; 
#
# use ruptime or ssh $host uptime?
#
my $uptimeCommand = '/usr/bin/ruptime -n=25';
# my $uptimeCommand = 'uptime';
#

Debugging & Testing

Explanation

The flag -debug DIR directs qstat+ to read from files, located in the directory DIR, instead of running qstat and parsing its output, hence these files must contain the output of the commands qstat+ would have run.

Examples of such files are provided in the debug.tgz archive, where 6 examples are available, including one when using the original SGE, not AGE 8.8.1.

One creates these files as follows:

  qstat -xml -r -ext -u \* > dbg/qstat-r.xml
  qstat -xml -f -u \*      > dbg/qstat-f.xml

or

  qstat -j JID             > dbg/qstat-j-JID.txt

with a specific job id;

or to show SSD usage and reservation:

  qstat -F ssd_res,ssd_free,ssd_total -q all.q@@ssd-hosts -u nobody > /tmp/qstat-ssd.txt

Examples

Using the provided files, you can do the following:

First untar the archive

tar xf debug.tgz

Run qstat+ in debug mode

Examples 0

qstat+ -debug ex.0
qstat+ -debug ex.0 -j 2343013
qstat+ -debug ex.0 -j 2343013.1

Examples 1: various modes

qstat+ -debug ex.1
qstat+ -debug ex.1 -gc
qstat+ -debug ex.1 -es
qstat+ -debug ex.1 -esx
qstat+ -debug ex.1 -down
qstat+ -debug ex.1 -ssd
qstat+ -debug ex.1 -sx
qstat+ -debug ex.1 -q
qstat+ -debug ex.1 -r
qstat+ -debug ex.1 -X

Examples 2: specific large MPI job

qstat+ -debug ex.2
qstat+ -debug ex.2 -j 2437501
qstat+ -debug ex.2 -nlist 2437501

Examples 3: specific user, job array

qstat+ -debug ex.3
qstat+ -debug ex.3 +a% -u erickoch
qstat+ -debug ex.3 -j 2436756
qstat+ -debug ex.3 -j 2436756.1
qstat+ -debug ex.3 -j 2436756.3

Examples 4: different user, lots of jobs

qstat+ -debug ex.4 +a% -u ingushin
qstat+ -debug ex.4 -j 2435983
qstat+ -debug ex.4 -nlist 2435983 -u ingushin

SGE example

qstat+ -debug sge -geType sge +a% -u all

Comparison file:

The file test-debug.log illustrates the output from these commands.

Click on the file to view it and see examples of qstat+ output.

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
man/man1l		man/man1l
LICENSE		LICENSE
README.md		README.md
debug.tgz		debug.tgz
qstat+		qstat+
test-debug.log		test-debug.log

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SGK `qstat+` PERL wrapper around GridEngine `qstat`

Description

Conceptual Description

Practical Considerations

Usage info

Usage Examples

Customization

Debugging & Testing

Explanation

Examples

About

Releases

Packages

Languages

License

skorzennik/ge-qstat-plus

Folders and files

Latest commit

History

Repository files navigation

SGK qstat+ PERL wrapper around GridEngine qstat

Description

Conceptual Description

Practical Considerations

Usage info

Usage Examples

Customization

Debugging & Testing

Explanation

Examples

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

SGK `qstat+` PERL wrapper around GridEngine `qstat`

Packages