Admin stuff (for Liam)

This document is written in reStructuredText so it may easily be converted to HTML or LaTeX. See http://docutils.sourceforge.net/rst.html

This document is quite short because we are using an off-the-shelf solution - Rocks. Rocks even installs and configures SGE for us, so you should read the Rocks and SGE docs. This document only explains where we differ from the default config and highlights a few of the more useful commands that are buried deep in the manuals.

TODO

  • A few nodes need labelling
  • Trouble installing the final 15 nodes
  • Freke NFS working but not tested
  • Get NFS working on Titin
  • Install gmond on freke and titin
  • Enable multicast on switches

Mailing list

We have a mailing list at http://oakham.cs.ucl.ac.uk/mailman/listinfo/bioinf-cluster. Use it to keep cluster users informed.

SGE users

You don't need to create SGE user objects - this is done automatically the first time a user submits a job ('enforce_user auto' in sge_conf). You just need to set up accounts on buffy for users:

useradd pdiddy
passwd pdiddy

Don't ever attempt to use any graphical Red Hat tools for user management or for any other configuration - they will mess up the Rocks configuration. The command-line tools have all been modified by Rocks.

See the section on NFS for details on giving this user access to the NFS shares.

See the section on passwords for advice on choosing good passwords. It will take a few minutes for the account to be added to the nodes. To speed this up, try:

make -C /var/411
cluster-fork 411get --all

If that doesn't work:

cluster-fork 'service autofs restart'
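
To check that the account has actually reached the nodes, you can run 'id' everywhere (using the example user above); every node should report the same UID:

cluster-fork 'id pdiddy'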

However, for important people there are some things to configure in SGE. First, make him a manager so he can use the priority option:

qconf -am pdiddy

Then add him to the 'elites' usergroup. These are the only people allowed to use the 'highpri' queue:

qconf -au pdiddy elites

To boost the amount of share a user gets, modify his user object:

qconf -muser pdiddy

fshare defaults to 100. Set it to 200 to double his share, etc.
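
If you want to double-check any of this, qconf can also show the current settings - the managers list, the members of the 'elites' userset, and the user object:

qconf -sm
qconf -su elites
qconf -suser pdiddy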

To clear all jobs:

qdel -u "*"

You can do everything graphically using Sun's qmon program (use X11 forwarding when you ssh into buffy, then run qmon).

Some configuration stuff I changed from the defaults

If you ever need to, you should be able to reinstall Rocks from scratch. Here I will try to list everything I have changed from the defaults.

Global cluster config:

qconf -mconf global

enforce_user auto - automatically create a new user object the first time a user submits a job
auto_user_fshare 100 - give 100 functional 'tickets' to each newly created user object
auto_user_delete_time 0 - never delete automatically created user objects

Scheduler config:

qconf -msconf

weight_user 0.250000 - use user weighting
weight_project 0.250000 - doesn't matter, we aren't using projects
weight_department 0.250000 - doesn't matter, we aren't using departments
weight_job 0.250000
weight_tickets_functional 10000 - use functional weighting
weight_tickets_share 0 - don't use share-tree weighting
weight_ticket 0.010000
weight_priority 1.000000 - priority more important than functional tickets

The default queue made by Rocks is 'all.q'. It contains the hostlist @allhosts. To modify it:

qconf -mq all.q

h_rt 11:00:00 - only jobs with hard run time less than 11 hours may enter this queue
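
So when submitting to all.q a user would typically request a run time under the limit, roughly like this ('myjob.sh' is just a placeholder script name):

qsub -l h_rt=10:00:00 myjob.sh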

I made another queue called 'highpri'. This is exactly the same as 'all.q' except it has infinite run time and:

subordinate_list all.q=1 - when this queue has 1 job in it, a job will be suspended from all.q
user_lists elites - only elites usergroup may use this queue
complex_values immediate=TRUE - only use this queue if user specifies this option

I had to edit the complexes to create an 'immediate' complex for this to work:

qconf -mc

immediate im BOOL == FORCED NO FALSE 1000
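
Because the complex is FORCED, the 'highpri' queue is only considered for jobs that explicitly request it, something like this ('myjob.sh' is again a placeholder):

qsub -l immediate=true myjob.sh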

I made another queue 'longjobs'. This allows infinite length jobs and has:

hostlist @slowhosts - it only contains hosts I have put into hostgroup '@slowhosts'

To add hosts to this group:

qconf -mhgrp @slowhosts
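
To just see what is currently in the group without editing it:

qconf -shgrp @slowhosts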

Note that @allhosts doesn't actually contain all the hosts now! It should probably be renamed @fasthosts. Rocks by default puts every host into @allhosts, but I have removed all the slow hosts from @allhosts and put them into @slowhosts.

Passwords

If you want to set up access using SSH keys without a password (as we already do between the frontend and the nodes), it may be a bit tricky because CS uses the commercial version of SSH. See here for instructions: http://www.cs.berkeley.edu/~dtliu/sshinterop.html

The best way of choosing a password is to think of a phrase (with numbers and uppercase letters in it) and then take the first digit or letter of each word in the phrase, e.g. 'I bought 3 Cheap Umbrellas in London' becomes 'Ib3CUiL'.

I've installed a program called 'John the Ripper' to ensure that users' passwords on buffy are not crackable. Here are some of the ways of using it. The more thorough tests take longer.

First, you should always run this test, because it only takes a second and finds lots of obvious passwords based on the username:

/root/john-1.6.40/run/john --single /etc/shadow

Then you should run a dictionary attack using one of the word files. The '--rules' option tries lots of permutations of each word (e.g. it will try 'h3ll0' for 'hello') but takes longer; which one you run just depends on how much time you have. In practice, an attacker isn't going to be able to try that many words - the attacks on my systems usually only try a few hundred. The first option below takes john only 20 seconds because john has direct access to the shadow password file, but it would probably take an attacker a few days to try that many passwords, because an attacker has to make an SSH connection for each attempt.

20 seconds:

/root/john-1.6.40/run/john --wordlist=/root/cracklib-small /etc/shadow

8 minutes:

/root/john-1.6.40/run/john --wordlist=/root/cracklib-words /etc/shadow

12 minutes:

/root/john-1.6.40/run/john --wordlist=/root/cracklib-small --rules /etc/shadow

20 minutes:

/root/john-1.6.40/run/john --wordlist=/root/all.dict /etc/shadow

6 hours:

/root/john-1.6.40/run/john --wordlist=/root/cracklib-words --rules /etc/shadow

Finally this one will take a few days but is the most comprehensive:

/root/john-1.6.40/run/john --wordlist=/root/all.dict --rules /etc/shadow

john saves cracked passwords in a file called john.pot. To view them, do:

/root/john-1.6.40/run/john --show /etc/shadow

After that, you could use john to try every possible combination of numbers and letters:

/root/john-1.6.40/run/john --incremental /etc/shadow

There is not actually much point doing that though - it may take a few weeks, but eventually it will find every password. If an attacker gets hold of your shadow password file then there is nothing you can do to stop them getting the passwords. (Which is why the NIS-based security isn't secure - it doesn't use a shadow file, so everyone has access to the password file and could easily crack many passwords.)

It would be quite interesting to write a version of john that ran on a cluster. I wonder what the world record time is for cracking every password of 10 characters or less?

Adding packages to nodes

If you can't find a RHEL4 RPM (http://rpmfind.net), compile your own and use 'checkinstall' to build the RPM. Assuming the program has a makefile, you would do:

make
/usr/local/sbin/checkinstall -R --nodoc make install

If it doesn't, put the binaries in a tmp directory and then install them like this:

cd /state/partition1/tmp_bin
/usr/local/sbin/checkinstall -R --nodoc cp -r * /bin

You should install the RPM on buffy to test it (rpm -Uvh). Note that if you didn't compile the binaries yourself, the RPM may end up with some weird dependencies and refuse to install.
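
For example, assuming a hypothetical package name, check what the RPM contains and then install it on buffy:

rpm -qlp mypackage-1.0-1.i386.rpm
rpm -Uvh mypackage-1.0-1.i386.rpm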

mv the RPM to /state/partition1/home/install/contrib/4.1/i386/RPMS/

There may be an older RPM there already with a slightly different name (the name includes the date it was created), which you should probably delete.

Then edit /home/install/site-profiles/4.1/nodes/extend-compute.xml to make sure it lists all the RPMs you are using. (You don't include version numbers, so you probably won't need to update this file when you update RPMs.)

Here is a copy of the current /home/install/site-profiles/4.1/nodes/extend-compute.xml. It does three things:

  1. install the ncbi RPM
  2. set up manual mounting of /home and disable automount (this file will soon be changed to also mount the freke and titin NFS servers)
  3. prepare to use the custom kernel

file:

<?xml version="1.0" standalone="no"?>
<kickstart>
<description>
</description>
<changelog>
</changelog>
<package>ncbi_gcc4_optimized</package>
<post>
mkdir /home
echo "buffy.local:/export/home /home nfs defaults 0 0" >>/etc/fstab
chkconfig autofs off
rm -f /etc/rc.d/rocksconfig.d/pre-09-prep-kernel-source
</post>
</kickstart>

Then rebuild the distribution:

cd /home/install
rocks-dist dist

Reinstall a test node:

ssh-agent $SHELL
shoot-node compute-0-4

If you want to watch it install, do:

ssh -p 2200 compute-0-4

If compute-0-4 reinstalls successfully, do the whole lot:

cluster-fork /boot/kickstart/cluster-kickstart

Cron

I have created /etc/cron.allow to prevent users from using cron. I noticed some users were using cron to run huge bzip2 backup tasks every single day on freke, and I don't want them doing the same on buffy without permission.
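
If /etc/cron.allow exists, only the users listed in it may use crontab, so keeping it to just root (or whoever you decide to allow) locks everyone else out. For example:

echo root > /etc/cron.allow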

Firewall

The firewall is set up by stopping iptables (/etc/init.d/iptables stop), editing /etc/sysconfig/iptables, then restarting iptables.

The rules in this file allow packets from established TCP connections, packets from the private cluster NIC, and packets on the ssh, www and https ports from the CS LAN and from my home IP addresses. They disallow everything else, so no one outside CS can initiate TCP connections, send UDP, or anything else.
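
For reference, the file uses the standard iptables-save format. A stripped-down illustration (not the real rules - the interface name and subnet here are placeholders) looks something like this:

*filter
:INPUT DROP [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
# established connections
-A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
# anything arriving on the private cluster NIC (placeholder interface name)
-A INPUT -i eth0 -j ACCEPT
# ssh, www and https from the CS LAN (placeholder subnet)
-A INPUT -p tcp --dport 22 -s 128.16.0.0/16 -j ACCEPT
-A INPUT -p tcp --dport 80 -s 128.16.0.0/16 -j ACCEPT
-A INPUT -p tcp --dport 443 -s 128.16.0.0/16 -j ACCEPT
COMMIT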

Interestingly, they also set up NAT so the nodes can connect to the outside world. I had not noticed this before - I think it is new in Rocks 4.1. It doesn't allow incoming connections, so it's not much of a security risk. It could be useful - it would have allowed us to connect directly from the nodes to freke. And Soren may find it useful.

Installation

If you need to reinstall buffy from scratch, you will need these settings:

  • IP address: 128.16.12.25
  • Gateway: 128.16.6.150
  • DNS: 128.16.6.8 and 128.16.5.31
  • NTP (Time): 128.16.64.10

You must specify this time server - no other will work due to the UCL firewall. If the installer doesn't give you the option, then do it by editing files manually afterwards.
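
If you do have to set it by hand, the usual places on a Red Hat system are /etc/ntp.conf and /etc/ntp/step-tickers (this is a sketch - check what the installer actually wrote before editing):

echo "server 128.16.64.10" >> /etc/ntp.conf
echo "128.16.64.10" > /etc/ntp/step-tickers
service ntpd restart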

Backup

To back up the important files that you will need to restore after a reinstall/upgrade:

insert-ethers --dump > /export/insert-ethers.sh
add-extra-nic --dump > /export/add-extra-nic.sh
tar -cf /backup/ssh.tar /root/.ssh
cp /etc/sysconfig/iptables /backup/
tar -cf backup`date --iso-8601`.tar /backup

Make a copy of this backup file. See the Rocks documentation for details on dumping and restoring the ethers database.

NFS

Buffy runs an NFS server which exports home directories to all the nodes. The default Rocks set-up is to use automount to mount the home directories on demand. We had some problems with this (and so have other clusters) so we have disabled it - the nodes now mount /home from the fstab at boot time.

Buffy seems to perform well as an NFS server providing it is only being used by a single user at a time, i.e. it is only serving one bioinf database. We think this is because Buffy only has 2 GB of RAM and one database will usually fit in 2 GB, so Linux is able to cache the entire database. When two users run jobs at the same time, they use two separate databases, and Buffy cannot cache 4 GB of data, so it begins to read from the disk and go into IO wait state a lot. Performance is very poor.

For this reason we are only running 3 nfsd threads on Buffy - so even if it overloads, all 4 CPUs do not get stuck in IO wait and it remains responsive to commands. (You can change this in /etc/rc.d/init.d/nfs.) I have recommended upgrading the RAM to 6 GB.
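
On Red Hat systems the thread count normally comes from the RPCNFSDCOUNT variable, which that init script reads, so the change is roughly as follows (verify against the script itself):

# in /etc/rc.d/init.d/nfs (or /etc/sysconfig/nfs, if the script sources it)
RPCNFSDCOUNT=3

service nfs restart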

Freke and Titin have 32 GB of RAM and so make better fileservers. In addition, they have direct fiber connections to the Summit 7i, which buffy doesn't have (it only has copper), which may mean they share their bandwidth much more fairly between the nodes.

We did not use them initially for several reasons:

  • until recently we only had one user, so buffy was performing fine as an NFS server
  • we didn't have fiber links to them from the new nodes
  • they are managed by TSG and the idea with the new cluster was to keep a clear boundary between our administrative domain and the TSG domain
  • Rocks is Linux-only and so cannot manage Solaris machines
  • write performance is pretty poor and I think they would be better used as compute nodes for MPI jobs

Freke and Titin are now being configured for the cluster to use. Because we are not part of the TSG NIS domain, our user names and UIDs do not correspond to those used by TSG. Possible solutions:

  1. Edit /etc/passwd after you add a new user to set his UID to be the same as it is on the TSG systems. This is probably the easiest, considering we only have a few users (a sketch follows this list).
  2. Tell the users to make their files on freke and titin publicly readable.
  3. Tell the users to only copy files to freke and titin via SFTP on buffy, so the files get their buffy UIDs.
  4. Set up NIS on buffy. (This may be complicated because I don't know how it would interact with the Rocks 411 system.)
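
A sketch of option 1 - instead of hand-editing /etc/passwd you can let usermod do it (the UID 5012 is made up, so look up the user's real TSG UID first):

# note the old UID before changing anything
id -u pdiddy
# change the UID to match the TSG one (5012 is a made-up example)
usermod -u 5012 pdiddy
# usermod re-owns the home directory, but any files the user owns elsewhere
# must be fixed by hand (replace 1234 with the old UID noted above)
find /export -xdev -uid 1234 -exec chown pdiddy {} \;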

PVM

I have installed PVM on the nodes (see instructions above on installing an RPM).

I have also configured SGE for 'loose' PVM integration; there are some instructions on the SGE website. This configuration was very tricky to get working - it was necessary to symlink rsh to ssh on all the nodes! Some files also had to be compiled and installed; you can find them in /export/home/install/contrib/4.1/i386/RPMS/pvm-extra-20060111-1.i386.rpm
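
The rsh-to-ssh symlink step was roughly the following (the paths are assumptions from memory - check where rsh and ssh actually live on the nodes first):

cluster-fork 'ln -sf /usr/bin/ssh /usr/bin/rsh'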

Sorry this is not fully documented, but the students who were supposed to be using PVM+SGE didn't actually use it, so I don't think this configuration will ever need to be repeated, and so I didn't record all the steps.

The crashing issue

Shortly after we first tried multiple users on the NFS and discovered the performance problems, Buffy began kernel panicking regularly. We never found the cause. I suspect a combination of increased load and bugs in the current version of the BIOS supplied by Dell (which we had installed a few weeks previously). Red Hat's and Dell's latest patches did not help. We eventually 'solved' the problem by installing the latest version of the mainline Linux kernel. I made the following optimisations, which did seem to improve performance (the corresponding kernel options are sketched after this list):

  • changed the system clock interrupt to 100 Hz
  • turned off all pre-emption (including kernel lock)
  • changed the IO scheduler to 'deadline'
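
In kernel configuration terms these correspond roughly to the following (option names are from the 2.6 kernels of that era, so treat this as a reminder rather than a recipe; the IO scheduler can also be selected per boot with the elevator=deadline kernel parameter in grub.conf):

# 100 Hz timer interrupt
CONFIG_HZ_100=y
CONFIG_HZ=100
# no preemption, including the preemptible big kernel lock
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_BKL is not set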

Further reading

Manuals:

Scheduling:

SSH:
