I’d like to say a bit about how cool ActiveRecord Observers are.
I am writing a highly event-driven app in Rails: for example, creating a BuildJob kicks off a SLURM job for that BuildJob. Now, the simplistic way to handle this would be to write a daemon that sits in a loop and polls for new BuildJob records. In fact, that’s essentially how the old implementation worked (I didn’t write it!).
As I was writing the application (and learning Rails at the same time), I discovered ActiveRecord::Observer. I was thunderstruck. Folks who have been programming “for real” for a while may laugh here, but this was the first time I’d seen a good use of the Observer pattern that I’d learned about way back when. I’ve been a sysadmin for most of my career, though, so cut me some slack.
Anyhow, instead of a clunky poll-based design, I now have a nice, clean, and logical design that tries to create a SLURM job whenever a new BuildJob is created. If the SLURM job creation fails, the observer raises an exception, so the BuildJob creation fails too, things are rolled back in the DB, and the database at least stays consistent.
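To make the contrast with polling concrete, here is a minimal plain-Ruby sketch of the idea, with no Rails involved. Everything in it (FakeStore, SubmitObserver, the hash standing in for a record) is made up for illustration. It is not how ActiveRecord implements observers, just the general shape: the store calls the observer's hook before saving, and an exception from the hook means the record never gets stored.

```ruby
# Toy stand-in for a model's persistence layer. Hypothetical, not ActiveRecord.
class FakeStore
  attr_reader :rows

  def initialize(observer)
    @observer = observer
    @rows = []
  end

  # Run the observer's hook first; only store the record if it didn't raise.
  def create(record)
    @observer.before_create(record)
    record[:id] = @rows.size + 1
    @rows << record
    record
  end
end

# Toy observer: "submits" a job, or raises if the scheduler is unavailable.
class SubmitObserver
  def initialize(scheduler_up)
    @scheduler_up = scheduler_up
    @next_id = 100
  end

  def before_create(record)
    raise "Unable to create job" unless @scheduler_up
    record[:resource_manager_id] = (@next_id += 1)
  end
end

# Scheduler is up: the record is stored with a job id attached.
good = FakeStore.new(SubmitObserver.new(true))
good.create({ :name => "build" })

# Scheduler is down: the hook raises and nothing is stored.
bad = FakeStore.new(SubmitObserver.new(false))
begin
  bad.create({ :name => "build" })
rescue RuntimeError => e
  warn "create aborted: #{e.message}"
end
```

Nothing polls here. The work happens at the moment of creation, and a failed submission and a failed create are the same event.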
That’s the kind of thing that makes programming really fun and gratifying. I love finding elegant solutions that solve multiple problems at the same time.
Oh, you want to see the actual code? OK.
class BuildJobObserver < ActiveRecord::Observer
  include Slurm
  include LTS

  # We want to try to create a SLURM job before saving the BuildJob
  # record. This way, we know that a BuildJob always has an associated
  # SLURM job. The problem is that we don't yet have an id for the BuildJob
  # when we try to create the SLURM job, and so we can't tell the SLURM
  # job which BuildJob it's associated with. The solution is to create
  # the SLURM job, but don't let it run until it's updated with the
  # BuildJob's id.
  #
  # before_create
  #   * create slurm job with status "held" (use SchedulerType=sched/wiki)
  #   * save slurm job id in lbats job
  # after_create
  #   * update slurm job name with lbats id for future linkage
  #   * update slurm job with priority > 0 to enable it
  def before_create(job)
    batch_options = {
      :command => lbats_job2lts_build_cmd(job),
      :name => "lbats_need_id",
      :features => job.architecture.name,
      :partition => "#{job.architecture.name}_build",
    }
    job.logger.debug "command: #{batch_options[:command]}"
    # The assignment in the condition is intentional: a nil/false id means failure.
    if (slurm_id = slurm_submit_batch_job(batch_options))
      job.resource_manager_id = slurm_id
      job.logger.info "Created new SLURM job: #{slurm_id}"
    else
      # So here's a problem, maybe... If we submit 5/6 build jobs, and
      # the 6th fails, what about the 5 SLURM jobs that were submitted OK?
      # If we bomb out here, the BuildJobs won't be created in the DB, which
      # is good, but we've still got 5 SLURM jobs that are unassociated.
      # Not sure there's much we can do here but cross our fingers. :/
      raise "Unable to create SLURM job"
    end
  end

  def after_create(job)
    slurm_update_job_attributes(job.resource_manager_id,
                                {:priority => job.build_request.priority.value,
                                 :name => "lbats_build_#{job.id}"})
  end

  def before_update(new_job)
    # Load the stored copy by id so we can compare it with the in-memory one.
    old_job = BuildJob.find(new_job.id)
    # check for a state change
    if old_job.job_state != new_job.job_state
      # tell the request about the state change
      new_job.build_request.register_job_state_change(new_job)
      # TODO: sanity-checking of all state transitions?
      case new_job.job_state
      when JobState.Canceled
        # NB: If the job is canceled outside of LBATS (i.e. someone
        # runs scancel), we'll be doing a double-cancel because
        # the jobcomp script will post an update to the LBATS job
        # and bring us here. I think this is OK, because trying
        # to scancel an already-canceled job is not fatal.
        slurm_cancel_job(new_job.resource_manager_id)
      end
    end
  end

  def after_save(job)
    # tickle the build request's updated_at
    job.build_request.save
  end

  def before_destroy(job)
    # make sure we nuke the job from SLURM, too!
    slurm_cancel_job(job.resource_manager_id)
  end
end
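If you want to play with the hold-then-release trick from before_create/after_create without a SLURM cluster handy, here is a rough sketch using an in-memory stand-in. FakeScheduler and its methods are hypothetical names I made up for this example, and treating priority 0 as "held" is an assumption of the toy, mirroring the "priority > 0 to enable it" comment above. It is not the Slurm module used in the observer.

```ruby
# In-memory stand-in for the scheduler. Hypothetical; real SLURM is not involved.
class FakeScheduler
  Job = Struct.new(:id, :name, :priority)

  attr_reader :jobs

  def initialize
    @jobs = {}
    @next_id = 0
  end

  # Phase 1: submit with priority 0, which this toy treats as "held".
  def submit_held(name)
    @next_id += 1
    @jobs[@next_id] = Job.new(@next_id, name, 0)
    @next_id
  end

  # Phase 2: rename the job and give it a real priority so it may run.
  def release(id, name, priority)
    job = @jobs.fetch(id)
    job.name = name
    job.priority = priority
  end

  # Jobs the toy scheduler would be allowed to start.
  def runnable
    @jobs.values.select { |job| job.priority > 0 }
  end
end

scheduler = FakeScheduler.new

# before_create: the record has no id yet, so use a placeholder name.
slurm_id = scheduler.submit_held("lbats_need_id")
held_count = scheduler.runnable.size

# after_create: the record now has an id (42 is made up for the example).
record_id = 42
scheduler.release(slurm_id, "lbats_build_#{record_id}", 10)
runnable_names = scheduler.runnable.map(&:name)
```

The point of the two phases is that the job exists in the scheduler before the record is saved, but can't start running until it has been renamed with the record's id.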