HPCC Guides

Training documentation for Linux-based high performance compute clusters


Job Information

Several tools are available for getting information about the jobs that run on our cluster. The two you will use most often are qstat and showq.

qstat

The qstat command (introduced in an earlier tutorial) shows the status of your current jobs on the system, or detailed information about a single job. When used without any options, the output will look similar to this:

Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
4929338.mgt1               ..._hard_problem go.cougs        00:00:00 R batch
4929339.mgt1               and_done         go.cougs        00:00:00 C batch
4929340.mgt1               common_cold_cure go.cougs               0 Q batch
4929341.mgt1               due_tommorow     go.cougs               0 H batch

The columns displayed are the job identifier, the job name, the user who owns the job, the CPU time the job has used, the state of the job, and the queue that the job is running in or waiting for.
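As a rough illustration of that column layout, the sketch below splits the sample output shown above into one record per job and counts the jobs in each state. It assumes job names and user names never contain spaces, so that plain whitespace splitting is enough. The function name parse_qstat is hypothetical and is not part of any cluster tool.

```python
from collections import Counter

# Sample text copied from the qstat output shown above.
SAMPLE = """\
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
4929338.mgt1               ..._hard_problem go.cougs        00:00:00 R batch
4929339.mgt1               and_done         go.cougs        00:00:00 C batch
4929340.mgt1               common_cold_cure go.cougs               0 Q batch
4929341.mgt1               due_tommorow     go.cougs               0 H batch
"""


def parse_qstat(text):
    """Split default qstat output into one dict per job (hypothetical helper)."""
    jobs = []
    for line in text.splitlines():
        fields = line.split()
        # Job rows have exactly six fields and begin with a numeric job id;
        # this skips the header row and the row of dashes.
        if len(fields) != 6 or not fields[0][0].isdigit():
            continue
        job_id, name, user, time_use, state, queue = fields
        jobs.append({
            "job_id": job_id,
            "name": name,
            "user": user,
            "time_use": time_use,
            "state": state,
            "queue": queue,
        })
    return jobs


if __name__ == "__main__":
    jobs = parse_qstat(SAMPLE)
    # Count how many jobs are in each state code (the S column).
    for state, count in Counter(job["state"] for job in jobs).items():
        print(state, count)
```

On the cluster, the same function could be fed the text captured from a real qstat run instead of the embedded sample.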

The S column stands for State. This column tells you where your job is in its lifecycle. The states you will see most often, all of which appear in the sample output above, are R (running), C (completed), Q (queued), and H (held).

To get detailed information on an existing job, use qstat -f <jobid>. This prints a lot of information to the screen, but you will probably use only a few of the items regularly.

showq

The showq command gives you information about all jobs that are currently running or waiting to run on the cluster. Unlike qstat, showq displays all jobs regardless of user, and it breaks the output into three categories: active, eligible, and blocked.

Active jobs are currently running on the execution nodes. Eligible jobs meet all the requirements to run but are waiting on resources. A job can be blocked for many reasons; the most common is that the user has exceeded a quota enforced by the scheduler.

The output will look like this:

active jobs------------------------
JOBID              USERNAME      STATE PROCS   REMAINING            STARTTIME

4917639            xxxxxxxx    Running     8    22:26:15  Sat Aug  9 11:25:20
4917934            xxxxxxxx    Running     4  1:12:21:19  Mon Aug 11 13:20:24
...
4917948            xxxxxxxx    Running    40 19:19:24:58  Tue Aug 12 08:24:03
4917712            xxxxxxxx    Running   420 28:08:43:03  Sun Aug 10 21:42:08

24 active jobs        1261 of 2064 processors in use by local jobs (61.09%)
                        103 of 167 nodes active      (61.68%)

eligible jobs----------------------
JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME

4917687            xxxxxxxx       Idle    16  4:00:00:00  Sun Aug 10 14:20:16
4917724            xxxxxxxx       Idle    16  4:00:00:00  Mon Aug 11 10:02:35
...
4917955            xxxxxxxx       Idle    16  4:00:00:00  Tue Aug 12 09:01:38
4917956            xxxxxxxx       Idle    16  4:00:00:00  Tue Aug 12 09:02:17

15 eligible jobs   

blocked jobs-----------------------
JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME

4917949            xxxxxxxx       Idle    24 20:00:00:00  Tue Aug 12 08:24:24
4917950            xxxxxxxx       Idle    20 10:00:00:00  Tue Aug 12 08:24:29

2 blocked jobs   

Total jobs:  41

All of the categories show the job ID, username, state, and number of processors allocated to the job. The eligible and blocked sections show the wallclock limit and when the job was queued. The active section instead reports when the job started and how much time it has left before the scheduler will terminate it.
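The three-section layout can be read mechanically. The sketch below is a minimal illustration that uses a few rows copied from the sample output above. It assumes that the REMAINING and WCLIMIT columns use a [[DD:]HH:]MM:SS format, so that 1:12:21:19 means 1 day, 12 hours, 21 minutes, and 19 seconds. The names parse_showq and duration_to_seconds are hypothetical helpers.

```python
import re

# A few rows copied from the showq sample output above.
SAMPLE = """\
active jobs------------------------
JOBID              USERNAME      STATE PROCS   REMAINING            STARTTIME

4917639            xxxxxxxx    Running     8    22:26:15  Sat Aug  9 11:25:20
4917934            xxxxxxxx    Running     4  1:12:21:19  Mon Aug 11 13:20:24

eligible jobs----------------------
JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME

4917687            xxxxxxxx       Idle    16  4:00:00:00  Sun Aug 10 14:20:16

blocked jobs-----------------------
JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME

4917949            xxxxxxxx       Idle    24 20:00:00:00  Tue Aug 12 08:24:24
"""

SECTION = re.compile(r"^(active|eligible|blocked) jobs-+\s*$")


def duration_to_seconds(text):
    """Convert [[DD:]HH:]MM:SS (assumed format) to a number of seconds."""
    parts = [int(p) for p in text.split(":")]
    # Pad missing leading components (days, hours) with zeros.
    days, hours, minutes, seconds = [0] * (4 - len(parts)) + parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def parse_showq(text):
    """Group showq job rows by section (hypothetical helper)."""
    sections = {"active": [], "eligible": [], "blocked": []}
    current = None
    for line in text.splitlines():
        match = SECTION.match(line)
        if match:
            current = match.group(1)
            continue
        fields = line.split()
        # Job rows: id, user, state, procs, duration, then four date tokens.
        if current is None or len(fields) != 9 or not fields[0].isdigit():
            continue
        sections[current].append({
            "job_id": fields[0],
            "user": fields[1],
            "state": fields[2],
            "procs": int(fields[3]),
            "seconds": duration_to_seconds(fields[4]),  # REMAINING or WCLIMIT
            "timestamp": " ".join(fields[5:]),           # STARTTIME or QUEUETIME
        })
    return sections


if __name__ == "__main__":
    for name, jobs in parse_showq(SAMPLE).items():
        print(name, len(jobs), "jobs,", sum(j["procs"] for j in jobs), "procs")
```

The header rows, blank lines, and summary lines such as "24 active jobs" do not have nine fields, so the parser skips them.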

During busy times, showq may output thousands of lines, depending on the number of jobs on the system. The command also takes a number of options that filter the output down to what you need to know.

You can even mix and match these options. For instance, to view all the running jobs (-r) in the gp queue (-w class=gp) for the user go.cougs (-u go.cougs), you would run the following command:

showq -r -u go.cougs -w class=gp
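If you call showq from a script, the same combination can be assembled programmatically. The sketch below is only an illustration: it uses just the three options from the command above, and the function name build_showq_command and its parameters are hypothetical.

```python
import shlex


def build_showq_command(user=None, queue=None, running_only=False):
    """Assemble a showq argument list from optional filters (hypothetical helper)."""
    cmd = ["showq"]
    if running_only:
        cmd.append("-r")                  # running jobs only
    if user:
        cmd += ["-u", user]               # jobs for a single user
    if queue:
        cmd += ["-w", f"class={queue}"]   # jobs in a single queue
    return cmd


if __name__ == "__main__":
    cmd = build_showq_command(user="go.cougs", queue="gp", running_only=True)
    # Print the command as it would be typed at the shell prompt.
    print(shlex.join(cmd))
```

On a login node where showq is installed, the returned list could be passed to subprocess.run to execute the command and capture its output.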