Monitoring Jobs#


ROCKS#

A fantastic resource for monitoring the cluster via a web interface. If you can't connect to the link above, contact request or T.Clark with your machine name or IP address.

What ROCKS does:

  1. Provides us with the BIO Roll out and allows us to package our own cluster software into a site-wide Roll.
  2. Monitors job I/O, network and host behaviour.
  3. Has a very useful mailing list that people actually reply to.

qstat#

Command line job monitoring.

Try

 morecambe1> qstat 

Here's a subset of the most useful commands:

Command Flag              Description
on its own                lists your jobs
-u username               lists all of this user's jobs
-s p -u username          lists this user's jobs that are in a pending, i.e. queued, state
-s r -u username          lists this user's jobs that are in a running state
-j job_id                 very detailed info on a job, including submission parameters and host info
-j job_id | grep err      shows any SGE errors from job submission time
-j job_id | grep args     shows the command line arguments for this job
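
The flags in the table can be wrapped in small shell functions so they are quicker to type. This is only a sketch: the names jobs_pending, jobs_running and job_errors are made up for illustration and are not part of SGE. Each one simply calls qstat with the flags listed above.

```shell
# Hypothetical convenience wrappers around the qstat flags in the table.
# The function names are illustrative only; they are not SGE commands.

# Pending (queued) jobs for a user; defaults to the current user.
jobs_pending() {
    qstat -s p -u "${1:-$USER}"
}

# Running jobs for a user; defaults to the current user.
jobs_running() {
    qstat -s r -u "${1:-$USER}"
}

# Lines mentioning "err" in the detailed record of one job.
job_errors() {
    qstat -j "$1" | grep err
}
```

Once these are sourced into your shell, jobs_pending with no argument lists your own queued jobs, and job_errors job_id shows any error lines for that job.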

Also useful:

 morecambe> qstat -f -u "*"

This will show all jobs (currently running and queued) for all users. Useful for observing the load the cluster is under.
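
For a quick load summary, the job states in such a listing can be counted. The sketch below assumes the plain (non -f) qstat layout, with two header lines and the job state (r, qw, Eqw and so on) in the fifth column; check this against the output on your cluster before relying on it. The name count_job_states is made up for illustration.

```shell
# Sketch: summarise job states from plain `qstat -u "*"` output read on stdin.
# Assumption: two header lines, and the job state in column 5.
count_job_states() {
    awk 'NR > 2 && NF >= 5 { count[$5]++ }
         END { for (s in count) print s, count[s] }' | sort
}

# Intended use (needs a working qstat):
#   qstat -u "*" | count_job_states
```

The function reads from standard input rather than calling qstat itself, so it can also be tried on a saved listing.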

qalter#

Command line tool for altering job priority or submission parameters.

Command Flag                 Description
on its own                   a list of everyone's jobs
jobid -P bioinf              alter the project name for this job to bioinf
-u alobley -P bioinf         alter the project of all of this user's jobs to bioinf
jobid -l h_rt=10:00:00       alter the hard run time limit of jobid to 10 hours
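
The run time change can be wrapped in a small function that builds the h_rt value from a number of hours. This is a sketch under assumptions: set_runtime_hours is a made-up name, and the options are placed before the job ID, which is the order given in the qalter man page synopsis.

```shell
# Hypothetical helper: set the hard run time limit of a job in whole hours.
# Usage: set_runtime_hours JOB_ID HOURS
set_runtime_hours() {
    # Reject anything that is not a non-empty string of digits.
    case "$2" in
        ''|*[!0-9]*)
            echo "hours must be a whole number" >&2
            return 1
            ;;
    esac
    qalter -l "h_rt=$2:00:00" "$1"
}
```

For example, set_runtime_hours job_id 10 asks for the same 10 hour limit as the last row of the table.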

qmod#

Command line tool for modifying job submission parameters and clearing errors

Command Flag     Description
-c job_id        clear error status and resubmit this job
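
When several jobs are stuck in an error state, their IDs can be pulled out of a qstat listing and passed to qmod one at a time. The sketch below assumes the plain qstat layout (two header lines, job state in the fifth column) and that an error state shows up as a capital E in that column, for example Eqw. The name errored_job_ids is made up for illustration.

```shell
# Sketch: print the IDs of jobs whose state contains E (for example Eqw),
# reading plain qstat output on stdin. Each job ID is printed once, even if
# several tasks of an array job are listed.
# Assumption: two header lines, and the job state in column 5.
errored_job_ids() {
    awk 'NR > 2 && $5 ~ /E/ && !seen[$1]++ { print $1 }'
}

# Intended use (needs working qstat and qmod):
#   for id in $(qstat -u "$USER" | errored_job_ids); do
#       qmod -c "$id"
#   done
```

It is worth looking at qstat -j job_id | grep err for each job first, since clearing the error does not fix whatever caused it.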

qresub#

Command Flag               Description
-h job_id                  hold job with this id
-hu user_id                hold jobs from this user
-h jobid.taskid            hold task
-h jobid.taskid-taskid     hold multiple tasks

qdel#

Command Flag              Description
jobID (from qstat)        delete a single job (or array of jobs)
-u username               delete all of this user's jobs
-f -u username            force a delete

Sometimes qdel doesn't seem to work because the SGE scheduler is no longer in contact with the job, i.e. the job is no longer running under a particular process ID but SGE still thinks it is.

This is a particular pain and the only way to clear the queue is to get your friendly cluster admin to do it for you.

« This page (revision-1) was last changed on 06-Mar-2013 18:19 by UnknownAuthor