!Monitoring Jobs



Ganglia is a fantastic resource for monitoring the cluster via a [web|http://buffy1.cs.ucl.ac.uk/ganglia] interface.
If you can't connect to the above link, contact request or T.Clark with your machine name or IP address.

What ROCKS does:

#Provides us with the BIO Roll and allows us to package our own cluster software into a site-wide Roll.
#Monitors job I/O, network and host behaviour
#Has a very useful mailing list that people actually reply to


Command line job monitoring.


 morecambe1> qstat 

Here's a subset of the most useful qstat flags:

||Command Flag || Description
| | on its own, lists your jobs
| -u username| list all of this user's jobs
| -s p -u username| list the user's jobs that are in a pending (i.e. queued) state
| -s r -u username| list the user's jobs that are in a running state
| -j job_id| very detailed info on a job, including submission parameters and host info
| -j job_id ~| grep err | show any SGE errors from job submission time
| -j job_id ~| grep args | show the command line arguments for this job
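
The last two rows pipe the output of qstat -j through grep. The same filtering can be sketched in Python for use in a script. This is a minimal sketch, not an official SGE interface: the helper names grep and qstat_job_info are hypothetical, and SAMPLE is made-up text that only imitates the general shape of qstat -j output.

```python
import subprocess


def grep(text, needle):
    """Return the lines of `text` that contain `needle`, like `grep needle`."""
    return [line for line in text.splitlines() if needle in line]


def qstat_job_info(job_id):
    """Run `qstat -j job_id` and return its output as a string.

    Only works on a machine where the SGE `qstat` command is on the PATH.
    """
    result = subprocess.run(
        ["qstat", "-j", str(job_id)],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Hypothetical sample text, invented for illustration only.
SAMPLE = """job_number:                 12345
owner:                      username
job_args:                   input.fasta,output.txt
error reason    1:          can't open output file
"""

# Equivalent of: qstat -j 12345 | grep args
args_lines = grep(SAMPLE, "args")
# Equivalent of: qstat -j 12345 | grep err
err_lines = grep(SAMPLE, "err")
```

On the cluster, grep(qstat_job_info(job_id), "err") would apply the same filter to live output.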

Also useful:
 morecambe> qstat -f -u "*"

This will show all jobs (currently running and queued) for all users. Useful for observing the load the cluster is under.
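
To turn that listing into a quick load summary, the jobs can be counted per state. The sketch below makes several assumptions: it parses the plain qstat -u "*" listing (without -f, which groups jobs under queue headings), it assumes the state is the fifth whitespace-separated column, and the function name count_job_states and the sample rows are hypothetical.

```python
from collections import Counter


def count_job_states(qstat_output):
    """Count jobs per state (e.g. 'r', 'qw') in plain `qstat` output.

    Assumes each job row starts with a numeric job ID and that the
    state is the fifth whitespace-separated field.
    """
    counts = Counter()
    for line in qstat_output.splitlines():
        fields = line.split()
        # Skip the header, the dashed separator and blank lines.
        if len(fields) >= 5 and fields[0].isdigit():
            counts[fields[4]] += 1
    return counts


# Hypothetical sample listing, invented for illustration only.
SAMPLE = """\
job-ID  prior   name       user         state submit/start at     queue                  slots ja-task-ID
---------------------------------------------------------------------------------------------------------
   101 0.55500 blastrun   alice        r     01/15/2008 10:12:01 all.q@compute-0-1          1
   102 0.55500 blastrun   alice        r     01/15/2008 10:12:03 all.q@compute-0-2          1
   103 0.00000 hmmsearch  bob          qw    01/15/2008 10:15:40                            1
"""

state_counts = count_job_states(SAMPLE)
```

A large count of queued jobs relative to running ones suggests the cluster is heavily loaded.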


Command line tool for altering job priority or submission parameters (qalter)

||Command Flag || Description
| | on its own a list of everyone's jobs 
| jobid -P bioinf| alter the project name for this job to bioinf
| -u alobley -P bioinf | alter the project of all of this user's jobs to bioinf
| jobid -l h_rt=10:00:00 | alter the hard run time limit of jobid to 10 hours
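
The h_rt value in the last row is written as hours:minutes:seconds. When the limit is computed in a script, a small helper can build the string. This is only a sketch; the function name h_rt_request is hypothetical and nothing here calls SGE.

```python
def h_rt_request(seconds):
    """Format a run time in seconds as an h_rt=H:MM:SS resource request."""
    if seconds < 0:
        raise ValueError("run time must not be negative")
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"h_rt={hours}:{minutes:02d}:{secs:02d}"


# Ten hours, as in the table row above.
ten_hours = h_rt_request(10 * 3600)
```

The returned string is the value that follows the -l flag.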


Command line tool for modifying job submission parameters and clearing errors

||Command Flag || Description
|-c job_id| clear error status and resubmit this job


Command line tool for placing holds on jobs and array tasks

||Command Flag || Description
|-h job_id | hold job with this id
|-hu user_id | hold jobs from this user
| -h jobid.taskid | hold task
| -h jobid.taskid-taskid | hold multiple tasks


Command line tool for deleting jobs (qdel)

||Command Flag || Description
| jobID (from qstat) | delete a single job (or an array of jobs)
| -u username | delete all of this user's jobs
| -f -u username | force a delete

Sometimes qdel doesn't seem to work because the SGE scheduler
is no longer in contact with the job, i.e. the job is no longer
running under a particular process ID but SGE still thinks it is.

This is a particular pain, and the only way to clear the queue is
to get your friendly cluster admin to do it for you.