Troubleshooting#

And now for the bit you came here for. The quickest way to discover if something is amiss is to check the job queue page. If none of the frontend HTML is being served to your browser then either you can't see the public interface for bioinf4 or Rails has crashed. If you can see the queue page then a quick glance at the status column should start you off. If a great many jobs have a status of 0 (and they've been like that for hours) it's likely that the runner has crashed. If many have a status of 3 then something serious has gone amiss with the backend. If everything has a status of 4 or 1 then everything is ticking along nicely; do not panic.

  • Ruby on Rails has crashed and isn't serving any HTML

Log in to bioinf4 as rails, switch to the frontend session and restart the server.
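A minimal sketch of that restart, assuming the frontend runs in a screen session named "frontend" and is launched with the stock Rails script/server (both names are assumptions; use the actual session name and start command from the Live Front end notes):

# On bioinf4, as rails; session name and start command are assumptions
screen -r frontend                       # reattach the frontend screen session
# inside the session, stop any dead process (Ctrl-C), then from the app root:
./script/server -e production            # restart the Rails server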

  • The server is down via HTTP, yet the runner and frontend appear to be up, and mysql has either crashed or is not responding

There is a range of problems with mysql desyncing (nfs hiccups, read-write lock failures) that can cause the tables or the mysql logs to become corrupt and prevent mysql from working. This in turn stops the server from serving ANY pages. Sometimes mysql going down will take the frontend process with it too. This is by far the most common catastrophic error we currently experience and has a surprisingly simple remedy: rebuild the mysql logs. To do this, delete the old logs in /var/lib/mysql, labelled ib_logfile0 and ib_logfile1, then restart mysql and any other services you need to.
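As a concrete sketch of that remedy (paths and filenames as above; the init script is the one used elsewhere in these notes, and stopping mysqld first is an added precaution):

# As root on bioinf4: make sure mysqld is fully stopped first
/etc/rc.d/init.d/mysql stop
rm /var/lib/mysql/ib_logfile0 /var/lib/mysql/ib_logfile1
/etc/rc.d/init.d/mysql start             # mysqld recreates the logs on startup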

  • All the recent jobs have a status of 0 and there is now a huge backlog of 0 state jobs

Log into bioinf4 as rails. Switch to the runner session. Has it crashed? If so, log the error message and restart the runner (see the runner_deamon.pl instructions above). Then send the error message to the server admin so the problem can be debugged.
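Roughly, assuming the runner lives in a screen session named "runner" and is started by invoking the script directly (both assumptions; the runner_deamon.pl notes above are canonical):

# On bioinf4, as rails
screen -r runner                         # reattach and note the error message
perl runner_deamon.pl                    # relaunch the runner from its directory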

  • All the recent jobs have a status of 3

This could be any number of things. Log into bioinf4. Can you see the contents of the /webdata mount point? If not, the mount may have disappeared temporarily.
Double check by logging into each of the bios machines. Can they see the contents of /webdata?
If not, then the backend process cannot access the binaries it needs to execute or the data it needs. This is a network or file system problem and you need to contact Tristan to sort it out.
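A quick way to run those checks in one go (the bioserv hostnames, and how many there are, are assumptions; extend the list to match the real machines):

# From bioinf4: can this machine and each backend node see /webdata?
ls /webdata > /dev/null && echo "bioinf4 OK" || echo "bioinf4 MISSING"
for host in bioserv1 bioserv2 bioserv3; do
    ssh "$host" "ls /webdata > /dev/null && echo $host OK || echo $host MISSING"
done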

  • A single job on the queue has a status of 3 or a user has emailed an error message for their job

Find the job in the queue and click on the job id link at the left of the page. This returns a report of all the process outputs from the backend, from which you can see where the job died. This may be enough to understand what happened.

If you can't work it out from there, go back to the queue page and take note of the server number for the job. At the moment, subtracting 2 from the server number gives you the id of the bioserv machine that tried to run the job; so if job 715 was on server 3, it was running on bioserv1. Next, log in to bioinf4 and try to ping the relevant bioserv machine. If you can't ping it then that machine has disappeared from the network and you'll need to get Tristan to sort it out and reboot it. In the meantime you should remove that server from the configurations of all the jobs until it's back.

If you can ping it, then ssh to the machine and search through the backend server log at /var/log/bioinfd. Once you've found the error, copy it and send it to the server admin for debugging.
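For example, for job 715 on server 3 (bioserv1, by the subtract-2 rule above); the grep pattern assumes the log mentions job ids, so adjust as needed:

ping -c 3 bioserv1                       # is the machine still on the network?
ssh bioserv1 "grep 715 /var/log/bioinfd" # hunt for the job's error in the backend log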

  • All jobs assigned to a given server keep returning a status of 3

It's likely that the backend has crashed or can't see /webdata. As above, try to log into the machine and find the error in the bioinfd log (/var/log/bioinfd). Then take that machine out of the configurations for all the jobs and get Tristan to reboot it. Once the machine is back up, edit the various jobs' configurations to add the server back in.

  • The servers all went down and are back up now; what should I do?

The backend machines should come back up with no intervention on your part. If they don't, harangue Tristan to sort it out. The frontend machine will need you to restart all the frontend processes (see the Live Front end notes above). Make sure you spawn separate terminal sessions with screen for each of the frontend processes. You will also need to log in as root and restart mysql (/etc/rc.d/init.d/mysql start).
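A sketch of that recovery, with placeholder session names and start commands (one detached screen per frontend process, as the Live Front end notes describe); starting mysql first means the frontend can connect as soon as it comes up:

# As root on bioinf4: bring mysql up first
/etc/rc.d/init.d/mysql start
# As rails: one detached screen session per frontend process, e.g.:
screen -dmS frontend ./script/server -e production
screen -dmS runner perl runner_deamon.pl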

  • Someone just emailed to say they haven't been receiving emails from the service

This could be any number of things due to the vagaries of email. If you're absolutely sure that it's nothing between us and the user then it might be a problem with our services. When a job is complete the relevant service attempts to send a results email to the user. If the CS SMTP times out it will try to send the email again 4 more times (5 tries in total); if all of those fail it will log its failure (/var/log/bioinfd) and drop the email. If there is any other type of SMTP error, the error is logged and the send attempt dropped. At the moment there is no way to initiate a resend, so the user will need to submit their job again. A possible solution would be for bioinf4 to have an instance of sendmail running, so outgoing mails could stack up there and, if the CS SMTP service went away, they would just queue until it came back (seems like a lot of work to me).
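To confirm whether sends are actually failing, check the log mentioned above; the exact wording of the failure message is an assumption here, so grep loosely:

grep -i -E 'mail|smtp' /var/log/bioinfd  # look for logged send failures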

  • The runner stops processing jobs.

If there is too great a backlog of running/processing jobs at the back end, the runner stops being able to handle the flow of data and locks; possibly some sort of race condition gets triggered. Jobs can still be posted to the queue, but the runner stops reading data from the backend nodes. The only solution I've found for this is to manually log into the rails database as admin, set all the processing (state=0 and state=1) jobs to failed (state=3), and then restart the runner. This only seems to happen when there are >200 concurrent jobs at the backend, so it's pretty rare.

UPDATE jobs SET state=3 WHERE state=0 OR state=1;

Typically you'll only find out about this when someone emails to say that they "can't submit more jobs and why is the IP address wrong?" (see below). When you look at the queue or restart the runner you'll find that there are 200+ state=0 jobs.

  • Jobs on the queue fail to be assigned a state and/or server

This is a rare bug: approx 3,300 jobs in a year of running. I'm not currently sure of its source, but it is something to do with the runner. Essentially the runner fails to send new jobs to the backend nodes, so it can't pick up their details and update the jobs table accordingly; such jobs then hang on the queue with nothing happening. Sometimes such jobs cause the runner to hang entirely (genTHREADER jobs especially). Once the runner hangs, no new jobs can be assigned and no jobs can return results. At this point it's worth logging in to the frontend mysql instance, setting all such jobs to state 3, and then restarting the runner:
UPDATE jobs SET state=3 WHERE server_id IS NULL;
UPDATE jobs SET state=3 WHERE created_at < DATE_SUB(NOW(), INTERVAL 2 DAY) AND server_id IS NULL;
UPDATE jobs SET state=3 WHERE created_at < DATE_SUB(NOW(), INTERVAL 2 DAY) AND state IS NULL;

  • A single server stops returning jobs

Could be any number of things. Check the spare disk capacity on the bios machines: if /var is full, ffpred, mempack and other jobs can fail. There may be a jammed job; often you can set the oldest jobs on that server to state 3 and restart the runner.
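For example (hostnames are assumptions, as above):

# A full /var on a backend node breaks ffpred, mempack and other jobs
for host in bioserv1 bioserv2 bioserv3; do
    ssh "$host" "df -h /var | tail -1"
done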

  • You are the server admin and you keep receiving 'Server X has gone away' emails

One of the backend nodes has crashed. Log in to that node and restart it.

  • You are the server admin and you keep receiving "Runner failed to insert results for job id:X emails

I have no idea; for now this appears to be occasional DomPred jobs failing to insert the blast results into the database, I believe because they are bigger than the max packet size. We may wish to set max_allowed_packet=20M in the my.cnf, or just stop keeping those blast results that no one ever downloads.
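You can check the current limit before deciding (SHOW VARIABLES is standard mysql; 20M is the value suggested above):

mysql -u root -p -e "SHOW VARIABLES LIKE 'max_allowed_packet';"
# to raise it, put max_allowed_packet=20M under [mysqld] in my.cnf and restart mysql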

  • mySQL database Corruption

If/when this happens there may or may not be a lot you can do about it. mysqld will probably refuse to start and spew a whole load of stuff into mysqld.log (/var/log/); you may wish to read that. You may have to force startup on the database.

1 - Go to /var/lib/mysql and delete the inno_log files (ib_logfile0 and ib_logfile1, as above), then start mysqld. If these are corrupt the db won't come up, and deleting them is the simplest fix.
2 - Failing that, try http://dev.mysql.com/doc/refman/5.0/en/forcing-recovery.html
3 - Failing that, you may wish to delete the innodb file and the logs, delete the directories for the database, and restart mysqld afresh.
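A sketch of steps 1 and 2 (innodb_force_recovery is a standard mysqld option; pick the level per the manual page linked above):

# Step 1: remove the innodb logs and try a normal start
rm /var/lib/mysql/ib_logfile0 /var/lib/mysql/ib_logfile1
/etc/rc.d/init.d/mysql start
# Step 2: if that fails, add e.g. innodb_force_recovery = 1 under [mysqld]
# in my.cnf, restart, dump what you can, then remove the option afterwards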

If you end up doing 3 you will need to add the RoR and root users back into the mysql database instance. You will have to rake the rails database to regenerate it. Luckily you won't have to manually redo the application or NewPred user settings: in /backup/disaster_recovery on bioinf4 you will find dumps of all the settings and user information you need to get the database back to a working state. You will just have to live with the loss of all the job data.
