Bugs

1) Somewhere below 0.05% of user-submitted jobs enter the job queue, get assigned a backend server and sit on status 0 permanently. They do not appear to run a job on the backend: no file ever gets sent to the runner or added to request_results. This is currently fewer than 1 job in 4000. One user told me that they did receive a completion email; this seems unlikely, as the "Job Complete" message is not in the request_results table for any of these jobs. A fix might be to have the runner set all jobs older than 24 hours to state 3 (failed), and maybe have the runner email the server admin when it does that.

PARTIAL SOLUTION: For now, when the job queue is polled, any job with a created date older than 24 hours gets set to state 3 and labelled as timed out.

TODO: We should log these timeouts somewhere so we have reliable stats on the scale of this issue, just in case it's much more frequent than we think.
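A rough sketch of the timeout sweep with logging added, in plain Ruby so it runs without Rails. The Job struct, its attribute names, the constant names and the choice to only touch jobs still sitting on state 0 are all assumptions made for illustration; only the state numbers (0 and 3) and the 24 hour cut-off come from the notes above.

```ruby
require "logger"

# Hypothetical stand-in for a job record; the real model's attributes may differ.
Job = Struct.new(:id, :state, :created_at, :note)

STATE_NEW = 0                    # jobs stuck here never ran
STATE_FAILED = 3                 # failed / timed out
TIMEOUT_SECONDS = 24 * 60 * 60   # 24 hours

# Sets every state-0 job older than 24 hours to state 3, logs each one,
# and returns the jobs that were timed out.
# Assumption: only state-0 jobs are swept, so finished jobs are left alone.
def time_out_stale_jobs(jobs, now: Time.now, logger: Logger.new($stdout))
  stale = jobs.select do |job|
    job.state == STATE_NEW && (now - job.created_at) > TIMEOUT_SECONDS
  end
  stale.each do |job|
    job.state = STATE_FAILED
    job.note = "timed out"
    logger.warn("job #{job.id} timed out (created #{job.created_at})")
  end
  logger.info("timed out #{stale.size} of #{jobs.size} jobs") unless stale.empty?
  stale
end
```

Pointing the logger at a file instead of $stdout would give a running record of how often these timeouts really happen.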

2) VERY large pages can take so long to build and serve that the dept. proxy server (and even the bioinf4 apache) can time out before they complete. This is almost entirely down to the jalview initialisation for the genthreader tables when they start to have a VERY large number of rows. A future solution would be to use ajax to load the tab data only when you need it, i.e. when someone clicks the relevant tab.

PARTIAL SOLUTION: Currently the bioinf4 and CS dept. timeouts are increased to 240 seconds, and any genthreader table or alignment diagram is truncated to 30 rows.

3) The appearance of the toggled tabs on the structure and psipred forms is dependent on the toggle action, not on the actual status of the tick box. Tabs should be displayed according to the ticked status of the options, not the user's toggle action.

4) The backend passes back the psiblast matrices it calculates, with a view to caching these at the front end so users do not have to wait for a psiblast run for sequences the server has seen before. This is currently totally bugged, and invalid matrices somehow make their way into the db cache table in Rails. I have no idea how/why, and for now this functionality is disabled in the frontend, as the cache table insert is commented out in the Rails code (job_status_handler).

5) Structure jobs are not sending completion emails. A quick debugging check shows that the structJob portion in email_handler.rb gets called. The exception handling which wraps the JobMailer.struct(object, email_body) call reports no errors, and the code in job_mailer.struct() gets called with no issue.

6) Simple modeller jobs don't always get assigned to a backend node. Probably some config option.

TODOs

Search both the NewPred and NewPredServer code bases for a large number of non-critical TODOs.

Move to cloudflare. Then we have caching, better stats etc... This will require things like the submit and ongoing pages setting a short "page expires" time (60 seconds?).

Fault tolerance: redundant Rails instances on separate physical machines behind an HAProxy balancer. Add redundant MySQL instances on separate machines and look up how to get Rails to handle multiple synced MySQL instances.

Replace jmol with JSmol http://chemapps.stolaf.edu/jmol/jsmol/test2.htm or GLmol (which looks nicer) http://webglmol.sourceforge.jp/index-en.html. Would be nice to find an all-javascript alignment viewer plugin too. Could look into this: https://github.com/mathew-fleisch/Protein-Alignment-Viewer

Remove all of the graph drawing that happens on the backend with gnuplot (disopred for instance) and move that to the frontend with d3.js. Then the frontend could parse the raw datafiles and supply d3.js with the data and we could draw much nicer looking and more interactive graphs.

Would be nice for the job polling to check that the database is actually live and, if not or if requests time out, to send an email. I tried using ActiveRecord::Base.connected? but this remains true once it makes a first connection and does not update if the db crashes or is stopped. Not sure what else; could set up a cron job on the db machine to watch the process list or something. Best idea: have a cronjob on bioinf5 that checks that mysqld is in the process list and emails if not; also, on each of the bios machines (1-10), have a cronjob that watches to see if NewPredServer is running and emails if not.
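A minimal sketch of the cron check, assuming the script is handed the output of `ps -e -o comm=` (one command name per line). The function names and the alert wording are made up for illustration. Note that NewPredServer may well show up in the process list under its interpreter's name rather than its own, so the name to watch for on the bios machines would need checking.

```ruby
# Returns true if a process with the given command name appears in the
# supplied process list text (one command name or path per line).
def process_running?(name, ps_output)
  ps_output.each_line.any? { |line| File.basename(line.strip) == name }
end

# Builds a one-line alert naming the missing processes,
# or returns nil when everything expected is running.
def missing_process_alert(names, ps_output, host)
  missing = names.reject { |n| process_running?(n, ps_output) }
  return nil if missing.empty?
  "#{host}: not running: #{missing.join(', ')}"
end

# Possible use from a cron script on bioinf5:
#   alert = missing_process_alert(["mysqld"], `ps -e -o comm=`, "bioinf5")
#   puts alert if alert
```

Cron typically mails anything a job prints to the address set in MAILTO, so printing the alert may be enough to get the email without any mailer code.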

Add the queue time estimates to the submit/ongoing pages for the structure jobs.

Get rid of the ongoing controller method; we really just need the results one, which renders the ongoing page.

Rework the server pages to use Bootstrap.js. The layout shouldn't take long, but integration with JQueryUI and all the other JQuery stuff might take quite a while. A little bit of a look and feel refreshment might be nice; we've had the same styling for 5 years now and we're no longer in the same style as the main UCL site. Might want to go with something cleaner like CATH or genome3d.

Rework some of the psipred results views. Some of the layouts could/should have some better view helpers written for them. At the moment things like the genthreader_table are quite unreadable, whereas individual rows could in fact be rendered/handled by a table row helper, which would greatly increase the readability and logical separation of the code. This could be applied to most of the core view elements: other table rows, the Jmol panels, the genthreader cartoon in the summary and so forth...
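As a rough idea of what a table row helper might look like, here is a plain Ruby sketch that does not depend on Rails. The helper names, the css_class option and the column layout are all hypothetical; the 30 row default just mirrors the truncation mentioned under Bugs item 2.

```ruby
require "erb"

# Renders one <tr> from a list of cell values, HTML-escaping each value.
# css_class is an optional class attribute for the row.
def table_row(cells, css_class: nil)
  attrs = css_class ? %( class="#{ERB::Util.html_escape(css_class)}") : ""
  tds = cells.map { |c| "<td>#{ERB::Util.html_escape(c.to_s)}</td>" }.join
  "<tr#{attrs}>#{tds}</tr>"
end

# Renders the rows of a table body, one <tr> per line,
# keeping only the first max_rows rows.
def table_body(rows, max_rows: 30)
  rows.first(max_rows).map { |r| table_row(r) }.join("\n")
end
```

With something like this, the view only has to hand over the per-hit values, and the markup for a row lives in one place.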

On the sequence resubmission widget, if you slide the sliders to too short a length, a pop-up should appear and tell you that you can't submit a length that short.

Go through all NewPred and NewPredServer code and remove any hardcoded references to the bioinf /webdata/ paths; move those to config.

Go through all our code in /webdata/binaries/current/ and remove hardcoded references to the file system (grep for /webdata should do it).
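One way the hardcoded paths could move to config, sketched in Ruby with a YAML file. The key name webdata_root, the DEFAULT_PATHS constant and the helper names are assumptions for illustration; only the /webdata and binaries/current path pieces come from the notes above.

```ruby
require "yaml"

# Fallback used when the config does not set a value (assumed key name).
DEFAULT_PATHS = { "webdata_root" => "/webdata" }.freeze

# Parses YAML text and merges it over the defaults.
# Empty config text simply yields the defaults.
def load_paths(yaml_text)
  DEFAULT_PATHS.merge(YAML.safe_load(yaml_text) || {})
end

# Builds the binaries directory from the configured root
# instead of hardcoding the full path.
def binaries_dir(paths)
  File.join(paths.fetch("webdata_root"), "binaries", "current")
end
```

Code would then call binaries_dir(paths) rather than spelling the path out, so moving the data elsewhere becomes a one-line config change.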

Revisit the org.ucl.servertasks; the way all the runX() methods are written is exceedingly redundant, just look at those methods in runDomSerf. This could/should be abstracted away to a runExternal process class. Generalisation issues arise when you look at things like the methods in the runMemsatSVM class; lots of process-specific stuff going on there.
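To illustrate the shape of the abstraction only, here is a small Ruby sketch of a generic external process runner. The servertasks code itself would need this written in its own language; the ExternalRunner and ExternalResult names and their options are hypothetical.

```ruby
require "open3"

# What every external run reports back, regardless of which program ran.
ExternalResult = Struct.new(:stdout, :stderr, :exit_code) do
  def success?
    exit_code == 0
  end
end

# Runs any external command with a shared environment and working
# directory, so individual tasks only need to supply their arguments.
class ExternalRunner
  def initialize(env: {}, chdir: Dir.pwd)
    @env = env
    @chdir = chdir
  end

  # argv is the program followed by its arguments.
  def run(*argv)
    out, err, status = Open3.capture3(@env, *argv, chdir: @chdir)
    ExternalResult.new(out, err, status.exitstatus)
  end
end
```

Process-specific steps, like the ones in runMemsatSVM, could then sit in small wrappers around run rather than in a copy of the whole launch logic.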

Change the output of tmjury3d_mq_modeller_threadsafe.c in the bioserf job such that it outputs the pdb IDs for the 10 templates it has chosen, so that we can list those on the bioserf output page.

When links to the psipred server preset a method (e.g. /psipred/?domserf=1), the tabs that should open should be opened by default.

GenThreader results should not offer a Model button on the summary tab if the user didn't provide a valid MODELLER key on sequence submission. Should accept a genTHREADER job if there is no key provided (assuming no BioSerf selection) and show no Model button. If a valid key is provided then accept the job and show the model button. If an invalid key is provided then bounce the job until the user corrects the key or leaves it blank.
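The key handling rules above can be sketched as a single decision function. The function name, the symbols it returns and the key_valid callable are hypothetical, and how a MODELLER key is actually validated is not covered here. Rejecting a blank key when BioSerf is selected is also an assumption, since the note only covers the no-BioSerf case.

```ruby
# Decides what to do with a submission based on the MODELLER key.
# key_valid is a hypothetical callable that returns true for a valid key.
def modeller_decision(key, bioserf_selected:, key_valid:)
  blank = key.nil? || key.strip.empty?
  if blank
    # No key: accept a plain genTHREADER job but show no Model button.
    # Assumption: a blank key with BioSerf selected is bounced.
    return bioserf_selected ? :reject : :accept_without_model_button
  end
  # A key was supplied: accept only if it validates, otherwise bounce.
  key_valid.call(key.strip) ? :accept_with_model_button : :reject
end
```

The results view could then show the Model button only for jobs that came back as :accept_with_model_button.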

Change the /structure/ index page livery so it's differentiated from the /psipred/ index. Also, the coming cDNA handling page could have its own colour too.

Make the mysql server come up when bioinf5 boots (change the run level script from K to S). Also make the rails instance come up when bioinf4 boots and make it spawn a screen session for the runner. And make run_bioinf.sh run on the bios machines when they boot.

Rails 4 and Ruby 2.0 are coming!

ONGOING

Have added rspec to the NewPred code base; currently the Jobs, psipredcontroller, structurecontroller and psipred_api_controller have tests rolled in. That covers the core of the pieces that users will interact with. The current plan is that all new feature additions and any bug fixes require that a test be added to cover them.

In the spec_helper.rb we should really set up a working version of the backend with the job_config and servers and everything configured to a "working" test state. Then we could dry out the code in the structure and psipred rspec files. This would also allow some more thorough testing.

« This page (revision-1) was last changed on 27-Sep-2013 16:30 by UnknownAuthor