Scripting Examples#


  • Before you start
    • Create a directory for logging standard out from cluster scripts, e.g. /home/myusername/output
    • Create a directory for logging standard error from cluster scripts, e.g. /home/myusername/error

  • How to submit jobs
    • Write a simple shell script (tcsh, sh and bash are all acceptable), or copy and save the example below, inserting your username where appropriate
    • From the directory containing the test script type:
   buffy1> qsub test.sh 
You should see a message confirming that your job (with its job number) has been submitted.

Check the error and output directories. If the test script executed correctly, you should see an empty file in the error directory and a file in the output directory containing the words "Hello World".
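As a rough sketch, the whole sequence looks like this (assuming your username is myusername and you have saved the example below as test.sh):

    buffy1> mkdir -p /home/myusername/output /home/myusername/error
    buffy1> qsub test.sh
    # once the job has run:
    buffy1> ls /home/myusername/error      # an empty test.sh.<job id>.err
    buffy1> ls /home/myusername/output     # test.sh.<job id>.out containing "Hello World"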

test.sh shell script


#!/bin/sh

#$ -S /bin/sh
#$ -o /home/myusername/output/$JOB_NAME.$JOB_ID.out
#$ -e /home/myusername/error/$JOB_NAME.$JOB_ID.err
#$ -l h_rt=0:0:20

/bin/echo Hello World

Job Submission Parameters#

    • In the test script above there are a number of job submission parameters. These can be specified in three ways:
      • In the script itself as shown above using #$ at the beginning of the line to mark a submission parameter
      • On the command line
      • In a default job submission control file called .sge_request placed in your home directory on buffy

command line

    buffy1> qsub -l h_rt=0:0:20 -o 'mypath/$JOB_ID.out' -e 'mypath/$JOB_ID.err' test.sh

Note that the qsub options go before the script name (anything after the script name is passed to the script as an argument), and the paths containing $JOB_ID are quoted so that your login shell does not expand them before qsub sees them.

.sge_request

    -o /home/myusername/output/$JOB_NAME.$JOB_ID.out
    -e /home/myusername/error/$JOB_NAME.$JOB_ID.err
    -l h_rt=0:0:20

More involved job submission parameters#

Flag                           Definition
-l h_rt=10:00:00               Hard run time - the maximum (wall clock) time the job is allowed to take
-S /bin/sh                     The shell used to run the job script - can also be csh or tcsh but MUST be set
-l mf=100M                     Only submit the job to a host with >= 100M of memory free
-l vf=100M                     Reserve 100M of virtual memory for the job throughout its runtime
-t 1-50                        For array jobs: run the script as a for loop executing 50 times (tasks 1-50), using $SGE_TASK_ID as the index
-o mydir/output.log            Path to the stdout log
-e mydir/error.log             Path to the stderr log
-M alobley@cs.ucl.ac.uk -m e   Email this address when the job finishes (-M sets the address, -m e requests mail at the end of the job)
-q fast.q                      Only submit to a particular queue, e.g. the fast nodes
-pe mpich 2                    Use a parallel environment (mpich in this case) and allocate two slots to the job (for parallel jobs)
-l hostname=buffy-2-7          Only run the job on a particular node
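A loose sketch combining several of these flags in one submission script (the queue, memory request and email address below are placeholders, not recommended settings):

#!/bin/sh

#$ -S /bin/sh
#$ -o /home/myusername/output/$JOB_NAME.$JOB_ID.out
#$ -e /home/myusername/error/$JOB_NAME.$JOB_ID.err
#$ -l h_rt=10:00:00
#$ -l vf=100M
#$ -q fast.q
#$ -M myusername@cs.ucl.ac.uk
#$ -m e

echo "Running on $HOSTNAME in queue $QUEUE"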

Special variables#

Here's a list of a few environment variables available at runtime and set up by the SGE environment. It's worth checking that you don't overwrite any of these in your own programs, and that third-party software doesn't either...

JAVA_HOME=/usr/java/jdk1.5.0_07
HMMER_DB=/share/bio/hmmer/db
HOME=/home/alobley
HOSTNAME=buffy-1-30.local
HOSTTYPE=i686
LD_LIBRARY_PATH=/opt/gridengine/lib/lx26-x86:/opt/gridengine/lib/lx26-x86:/opt/globus/lib:/opt/lam/gnu/lib
LOGNAME=alobley
MACHTYPE=i686-redhat-linux-gnu
MAIL=/var/spool/mail/alobley
NHOSTS=1
NQUEUES=1
NSLOTS=1
OSTYPE=linux-gnu
QUEUE=slow.q
REQNAME=featurama.sh
REQUEST=featurama.sh
SGE_STDERR_PATH=/home/alobley/error/8472.e
SGE_STDIN_PATH=/dev/null
SGE_STDOUT_PATH=/home/alobley/output/8472.o
SGE_TASK_ID=161
TMP=/tmp/8472.161.slow.q
TMPDIR=/tmp/8472.161.slow.q
UID=16426
USER=alobley
JOB_ID=8472
JOB_NAME=featurama.sh
JOB_SCRIPT=/opt/gridengine/default/spool/buffy-1-43/job_scripts/8472
BLASTDB=/share/bio/ncbi/db
BLASTMAT=/opt/Bio/ncbi/data

Why are these useful?

They can be used for debugging. For example, in your job submission script you can set the log names to include the host name for the job, or print the host and queue names to standard out using echo; see the sketch below.
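A minimal sketch of that idea (paths are placeholders; $HOSTNAME should be expanded by SGE in the -o/-e paths on most installs, and $HOSTNAME and $QUEUE are in any case available to echo at runtime):

#!/bin/sh

#$ -S /bin/sh
#$ -o /home/myusername/output/$JOB_NAME.$JOB_ID.$HOSTNAME.out
#$ -e /home/myusername/error/$JOB_NAME.$JOB_ID.$HOSTNAME.err
#$ -l h_rt=0:10:00

# record where and how the job actually ran, to help with debugging
echo "Job $JOB_ID ($JOB_NAME) ran on host $HOSTNAME in queue $QUEUE"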

Simple BLAST job#


#!/bin/sh

#$ -o /home/alobley/output/$JOB_NAME.$JOB_ID.out
#$ -e /home/alobley/error/$JOB_NAME.$JOB_ID.err
#$ -l h_rt=5:00:00
#$ -S /bin/sh

INFILE=$1

blastall -p blastp -i $INFILE -o $INFILE.blast -e 0.001 -d $BLAST_DB

echo "Finished blasting $INFILE"

to run

 buffy1> qsub blast.sh myfile.fasta

Array jobs#

Say you want to run 100 similar blast jobs. You have two options:

  1. Submit the first script 100 times, using a wrapper script or a for/foreach loop to specify which fasta file each job should work on (see the sketch after this list)
  2. Use an array style job, which you submit only once but which runs 100 tasks farmed out to different nodes by the SGE scheduler
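Option 1 might look like the sketch below (a hypothetical wrapper, assuming the fasta files are named 1.fasta through 100.fasta as in the array example further down):

#!/bin/sh
# submit_all.sh - submit blast.sh once per fasta file
for i in `seq 1 100`; do
    qsub blast.sh $i.fasta
done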

4 good reasons to use array jobs

  1. Lighter queue load: SGE has less work to do on the head node
  2. SGE handles I/O more efficiently for array jobs, i.e. one output, error and script file is created per job rather than per task
  3. Easier tracking of progress for you: how many tasks are left and how far through you are
  4. Kinder to your colleagues and to buffy :)

#!/bin/sh

#$ -o /home/alobley/output/$JOB_NAME.$JOB_ID.out
#$ -e /home/alobley/error/$JOB_NAME.$JOB_ID.err
#$ -l h_rt=5:00:00
#$ -S /bin/sh
#$ -t 1-100

INFILE=$SGE_TASK_ID.fasta

blastall -p blastp -i $INFILE -o $INFILE.blast -e 0.001 -d $BLAST_DB

echo "Finished blasting $INFILE"

In essence the script runs as a for loop initialised at 1 and ending at 100, with the environment variable SGE_TASK_ID indexing the loop. The script relies upon the fasta input files being numbered 1.fasta through to 100.fasta. The output file for this job should contain the echo line from each task in the array.
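If your input files are not numbered like this, one common workaround (a sketch, assuming a hypothetical list file files.txt with one fasta filename per line) is to let $SGE_TASK_ID pick a line from the list inside the job script:

# instead of INFILE=$SGE_TASK_ID.fasta
INFILE=`sed -n "${SGE_TASK_ID}p" files.txt`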

