using the buffy cluster (V0.9, 25/1/2006)

This document is written in reStructuredText so it may easily be converted to HTML or LaTeX. See http://docutils.sourceforge.net/rst.html

Step 0 - Get an Account

Fill in this short questionnaire and email it to the bioinf-cluster mailing list. You should also subscribe to this list: http://oakham.cs.ucl.ac.uk/mailman/listinfo/bioinf-cluster

  1. Summary & motivation (Short (300 words max) description of your computing jobs)
  2. Current methods (Short description of how you currently run your jobs (if applicable))
  3. Estimated run time per job (typical and maximum)
  4. Could you divide the work into smaller units and run a greater number of shorter jobs, if we asked you to?
  5. Estimated frequency & total number of jobs:
  6. Estimated file I/O per job:
  8. Estimated RAM profile of jobs (max usage, typical usage; does it allocate all RAM at the start or does it grow slowly, and must it be physical RAM or will swap space be used?):
  8. Code description (if running own code) (Brief description of the code, its compilation details and platform dependencies.)
  9. Hardware requirements: (Minimum RAM, CPU speed, disk space)
  10. Software requirements: (What 3rd party tools you will run, and whether you require us to install these tools for you.)
  11. Disk space required on file server:

We are assuming all jobs are serial. If you would like to run parallel jobs, e.g. MPI, let us know.

Step 1 - Upload

Upload your data and job scripts to your home directory on the buffy fileserver. You can do this using SCP or SFTP. Note that you can only connect to buffy from CS machines.

For example:

scp data/my_datafile buffy:

(don't forget the colon at the end of the server name)
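
To upload a whole directory (here the data directory from the example above), you can use scp's recursive flag:

scp -r data buffy: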

or with sftp:

sftp buffy
put data/*

or using the Konqueror file manager, go to this URL and then drag your files over:

sftp://buffy

That is a kioslave. Another one that may work is:

fish://buffy

The nice thing about kioslaves is that KDE applications use them as virtual filesystems, so if you use KDE exclusively then you can just treat your remote files as if they were local. And KDE is cross platform, so this will work on Linux, Solaris, Mac OS X, maybe even Windows.

If you don't use KDE, there is a Linux-specific virtual SSH filesystem you can try: http://shfs.sourceforge.net/

Step 2 - Submit Jobs

Connect to buffy using SSH and then run the qsub command to submit your jobs. Note that you can only connect to buffy from CS machines:

ssh buffy
qsub <options> <jobscript>

We may supply default values for some of these options if you leave them out, but it is good practice to always specify them all, so you don't wonder afterwards why your job ran out of time or memory. If your job gets stuck in the qw state it probably means you have not specified a required option. You can put these options into a file and save it as ~/.sge_request to avoid specifying them on the command line each time.

Required options:

-l mf=memoryM
Memory free required for your job (the 'M' suffix signifies megabytes). This isn't enforced - you are free to lie and say your job only requires 48 kilobytes if you want - but then you risk it being scheduled on a machine that really does only have 48K free! You can also specify vf for virtual memory free, but using virtual memory could result in your job using swap space and slowing down. If you want to see how much free memory is available on the cluster, run 'qhost'.
-l h_rt=x:y:z
Hard run time limit in hours, minutes and seconds. Your job will be killed if it exceeds this limit so do not set it too low. But jobs with shorter run time limits may get priority so do not set it too high. Currently the cut off point where you will get sent to the slow queue is 11 hours, so there is no advantage to asking for only 30 minutes - you may as well ask for 10 hours 59 minutes - but this could change in the future, so it's better to only ask for what you need.

These are not required but are very useful:

-p priority
Default is zero. If you don't care how long your job takes (maybe it is a long job and you don't want to hog the cluster) you can reduce it to -1023. Managers may increase the priority up to 1024.
-l immediate=TRUE
Sends your job to the 'highpri' queue, which means other jobs will be suspended to make room for it. This option may only be used by managers. (Note that you can also use the -p option to ensure you get the highest priority, although I'm not sure if there is any situation where it would be useful; it certainly doesn't hurt to specify both.)
-e errorfile.$JOB_NAME.$JOB_ID
Where the error logs go.
-o outputfile.$JOB_NAME.$JOB_ID
Where the standard output goes. It is very useful to put both of these in .sge_request.
-M emailaddress
Where to send notifications. Usually you use your CS email address, e.g. j.user@cs.ucl.ac.uk.
-m a
Email notification of any aborted jobs. If you don't specify the -M option then the emails will go to the file /var/spool/mail/username on buffy. You can read them from there by running 'mail' or 'pine' on buffy. '-m e' will send mail when the job ends, '-m b' when it begins, etc.

See man qsub for more options.

Example

Submitting a test job with low priority that requires 500 megabytes of RAM, will finish in less than 30 minutes, and directs stderr and stdout to files:

qsub -p -1023 -l mf=500M -l h_rt=0:30:0 -e /home/rsmith/errors/$JOB_NAME.$JOB_ID -o /home/rsmith/output/$JOB_NAME.$JOB_ID test.sh

Alternatively, you can also specify these options in your job script, as comment lines at the top of the script. Here is an example - save this file as test.sh:

#!/bin/sh
# Interpret this script with the Bourne shell
#$ -S /bin/sh
#$ -p -1023
#$ -l mf=500M,h_rt=0:30:0
sleep 100
echo test done

and then put these options in your .sge_request file to apply to all scripts:

-e /home/rsmith/errors/$JOB_NAME.$JOB_ID
-o /home/rsmith/output/$JOB_NAME.$JOB_ID
-M j.user@cs.ucl.ac.uk

You can then submit the job without command line options:

qsub test.sh

Or you can ask for an email when your job ends:

qsub -m ea test.sh

Step 3 - Monitor

You can monitor your jobs using the qstat command. qhost, qdel and qresub may also be useful; see the man pages. If you want to know why a job does not run, try:

qstat -j <jobid>
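
Some other commands that may be handy (standard SGE syntax; substitute your own username and job IDs):

qstat -u username    # list only that user's jobs
qdel <jobid>         # delete a queued or running job
qresub <jobid>       # submit a copy of an existing job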

To monitor the cluster itself, visit this webpage (accessible to CS machines only): http://buffy.cs.ucl.ac.uk/ganglia/

Step 4 - Results

Use SFTP to retrieve your results from buffy.
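
For example, assuming your jobs wrote to the output and errors directories used earlier:

sftp buffy
get output/*
get errors/*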

Data on buffy is not backed up so do not leave important files there.

Hardware

There are two sets of nodes:

  • 90 'fast' nodes which have dual 1.3 GHz Pentium III CPUs and 1 GB RAM.
  • 85 'slow' nodes which have (mainly) dual 0.8 GHz Pentium III CPUs and 0.5 GB RAM.

Nodes are named 'compute-x-y' where x is the number of the rack, from 0 to 5, and y is the number of the row within the rack, from 0 to 29. Note that some nodes that require hardware maintenance may have been removed from the database entirely, so you may find some numbers simply do not exist. You can use the name 'cx-y' as a shorthand, e.g.:

ssh c5-0

would connect you to the first machine in the 6th rack.

How Buffy Schedules Jobs

At any one time a proportion of these nodes will be down for maintenance.

If you request a run time of 11 hours or more then your job will only be scheduled on a subset of the slower nodes. Currently there are 21 slow nodes that are allowed to run long jobs. Note that these 21 nodes will also be used for 'normal', i.e. short, jobs. (This may be overridden by admins using the '-l immediate=TRUE' option.)

This time limit of 11 hours may be adjusted either up or down in future based on user feedback.

Jobs of less than 11 hours will be scheduled onto any free node. Hopefully the faster nodes should be used first, but I haven't tested this. Don't be surprised if you see your short job running on a 'slowhost' in the 'longjobs' queue - if the main queue is full then this one will be used also. If you want to ensure your job runs on a fast node, request more than 500M of RAM; since the slow nodes only have 512M in total, only a fast node will be able to satisfy the request.

(You can also manually specify the name of the queue you want to use, but the queue names may change in future and in most situations you care more about minimum RAM than minimum CPU speed, so I would suggest specifying RAM.)
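
For example, a request like this should only match the fast nodes (600M is just an illustrative figure, chosen because it is more than the 512M fitted to the slow nodes):

qsub -l mf=600M -l h_rt=0:30:0 test.sh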

We are no longer using a round-robin scheduler - we use functional sharing, which should be fairer. If there is one user in the queue, he will get 100% of the nodes. If another user submits some jobs, SGE tries to give each of them 50% of the nodes; with three users, 33% each, etc.

There are some problems with this functional sharing:

  1. It ignores past usage. There is another method called 'share tree' which takes past usage into account. That would mean if you had been using the cluster 24 hours a day for the past month you would have a much lower priority than someone who only submits jobs occasionally. This is technically fairer, but it tends to annoy users, so I'm not using it. Thus if you want to get your 'rightful share' you should keep the queue fully loaded at all times.
  2. There can be a long 'lag' before it becomes fair. Assume User A currently has 100% of the nodes running his jobs. User B submits some jobs. SGE will attempt to give 50% of the nodes to each user. However, it will not kill any jobs - it will wait for User A's jobs to finish before giving their nodes to User B. So if User A's jobs have a run time of 2 days then User B will have to wait on average 24 hours before he gets his share of 50%. The solution to this is to disallow very long jobs, similar to what I proposed before. Currently jobs must be less than 11 hours, or else they get relegated to the old slow nodes. So if you want your 'rightful share' your jobs should last as long as possible, but less than 11 hours.

One advantage is that this gives us an easy way to boost priority temporarily. If User A is doing an important task that needs to finish within 1 week, then we could give him a weight multiplier of 3, so with 2 users, User A would get 75% and User B 25%.

An admin may use the -p option to override the functional share system and ensure all his jobs are scheduled before anyone else's. However, he will still have to wait for nodes to become available. If he specifies the '-l immediate=TRUE' option then his jobs will run immediately and cause the existing jobs to be suspended.
