Picard Sidekiq Tip: Split Up Independent Operations

December 20, 2023

A while back, Joe Sondow posted a great thread on Mastodon that is a perfect example of why you should split a complex job up into several individual ones. Basically, Worf was experiencing an error that caused Riker to post more than he should have.

A big theme of my Sidekiq book is to handle failure by making jobs idempotent—allowing them to be safely retried on failure, but only having a single effect.

While Joe is not using Sidekiq, the same principles apply. His job’s logic is basically like so:

  1. Picard Tips executes
  2. Then, Riker Googling runs
  3. After that, Worf Email does its thing
  4. Last but not least, Locutus Tips posts.
Four fake Mastodon posts with arrows from one to the next in order. First is from user @PicardTips that says 'Picard background job tip: Allow your jobs to fail and retry without firing photon torpedos more than once'. Second is user @RikerGoogling that says 'redis port main computer core moriarty'. Third is user @WorfEmail that says 'Lt Barclay, You have failed jobs that must be fixed. If not addressed, they will escalate to Capt Picard's combadge Worf'. Fourth is user @LocutusTips that says 'Locutus scaleability tip: Just add more database replicas instead of clearing failed jobs.'
The happy path of the bots.

Ideally, if Worf’s email fails, it should get retried until Worf succeeds. It should not cause Riker to google more or Picard to present additional tips. And it shouldn’t prevent Locutus from sharing his wisdom, either.

For Joe’s Lambda function, this isn’t how it worked, unfortunately. Worf had an issue and while Picard was able to avoid posting more than once, Riker was not.

The same four fake Mastodon posts from the previous image, but in this case @PicardTips leads to @RikerGoogling, which leads to @WorfEmail, which leads to an error state.  The error state leads to a second post from @RikerGoogling with the same content, which then leads to another @WorfEmail post with the same content as well.  The @LocutusTips post is shown, but no path leads there. A note indicates that this post 'Sadly, never ran'
When Worf fails, the entire operation is started over.
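
To make that failure mode concrete, here is a rough simulation in plain Ruby. It is not Joe's code: the `run_all_bots` lambda, the bot names as strings, and the retry-once loop are all made-up stand-ins, and the only assumption is that the whole function gets re-run from the top when it raises.

```ruby
# Everything here is hypothetical: a simulation of one function that runs
# all four bots in order and is retried as a whole when it fails.
posts = []             # everything that got published
picard_posted = false  # Picard already guards against posting twice

run_all_bots = lambda do
  unless picard_posted
    posts << "PicardTips"
    picard_posted = true
  end
  posts << "RikerGoogling"   # no guard, so every retry posts again
  raise "Worf failed"        # Worf's step keeps failing in this simulation
  posts << "WorfEmail"       # never reached
  posts << "LocutusTips"     # never reached
end

# Simulate the platform retrying the whole function once after a failure.
attempts = 0
begin
  attempts += 1
  run_all_bots.call
rescue RuntimeError
  retry if attempts < 2
end

puts posts.inspect
```

Picard shows up once, Riker shows up once per attempt, and Locutus never appears at all.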

Joe’s solution—which he admits isn’t great—is to catch all errors and exit the entire process when one is caught.

The same four Mastodon posts from before. @PicardTips leads to @RikerGoogling, which leads to @WorfEmail, which leads to an error.  From the error, flow proceeds to the end state. @LocutusTips post is shown with the note 'Sadly, never ran'. The content of the posts is the same as the first image
Bail out on any error to avoid re-posting.

This is actually not that bad of a strategy! In Joe’s case, the bots will run the next day and if the underlying problem was transient, everyone will be fine. They’ll miss one day hearing about how Locutus thinks you should run your life, but it’s fine.
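
A minimal sketch of that bail-out approach, again in plain Ruby and with invented names (`run_once`, the `bots` hash): rescue any error, log it, and return normally so nothing upstream sees a failure and decides to retry.

```ruby
# Hypothetical stand-ins for the four bots; `posts` records what was published.
posts = []
bots = {
  "PicardTips"    => -> { posts << "PicardTips" },
  "RikerGoogling" => -> { posts << "RikerGoogling" },
  "WorfEmail"     => -> { raise "mail server unreachable" },
  "LocutusTips"   => -> { posts << "LocutusTips" },
}

# Run each bot in order. On any error, stop and return normally instead of
# re-raising, so the caller has no failure to retry.
def run_once(bots)
  bots.each_value(&:call)
  :ok
rescue StandardError => e
  warn "Bailing out: #{e.message}"
  :bailed_out
end

result = run_once(bots)
```

Nothing is posted twice, but everything after the failing bot is skipped until the next scheduled run.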

If these jobs were more important, the way to make the entire operation idempotent would be to create five jobs: one for each bot, plus one top-level job that queues the others:

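require "sidekiq"

# Top-level job: its only work is to enqueue one job per bot.
class BotsJob
  include Sidekiq::Job

  def perform
    # perform_async only enqueues; each job then runs, fails,
    # and retries on its own.
    PicardJob.perform_async
    RikerJob.perform_async
    WorfJob.perform_async
    LocutusJob.perform_async
  end
end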

Each of those jobs would then have logic that it sounds like Picard Tips already has: don’t post if you’ve already posted. But, this time, if any of the jobs fail, it won’t affect the other jobs.
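
Here is one way that "don't post if you've already posted" guard could look. This is a sketch under assumptions, not how Picard Tips actually does it: `PostLog`, the `outbox` array, the date argument, and the constructor arguments are all hypothetical. It is written in plain Ruby so it runs without Sidekiq; a real job would `include Sidekiq::Job` and would need durable storage rather than an in-memory set.

```ruby
require "set"

# Hypothetical record of what has been posted. A real job would need durable
# storage (a database row, a Redis key) so the record survives a retry.
class PostLog
  def initialize
    @seen = Set.new
  end

  def posted?(bot, date)
    @seen.include?([bot, date])
  end

  def record(bot, date)
    @seen.add([bot, date])
  end
end

# Plain-Ruby stand-in for a Sidekiq job. The log and outbox are passed in
# here only to keep the example self-contained.
class WorfJob
  def initialize(log, outbox)
    @log = log
    @outbox = outbox
  end

  def perform(date)
    # The guard: a retry for a date already handled does nothing.
    return if @log.posted?("WorfEmail", date)

    @outbox << "WorfEmail #{date}"  # stands in for the real post
    @log.record("WorfEmail", date)
  end
end

log = PostLog.new
outbox = []
job = WorfJob.new(log, outbox)
job.perform("2023-12-20")
job.perform("2023-12-20")  # simulated retry: the guard skips the post
```

There is still a small window in this sketch: if the job dies after posting but before recording, a retry would post again. How much that matters depends on how important the job is.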

A graph with the top showing a picture of the Enterprise D's computer system, LCARS. It leads to the four Mastodon posts from before, each separate on the same line: @PicardTips, @RikerGoogling, @WorfEmail, and @LocutusTips.  @WorfEmail leads to an error, which leads to a second @WorfEmail post. The other posts all lead to successes: @PicardTips's leads to 'Made it So', @RikerGoogling's to 'Jazzed', and @LocutusTips to 'Assimilated'.  Text of posts is the same as the first image.
Each bot in its own job can succeed or fail without affecting the others.

The only problem with the Ruby code for this is that we can’t call PicardJob.make_it_so!