January 3, 2017

Last time, I told you about a cool Bazaarvoice project that you can’t get your hands on (yet). This time, I thought you’d like to hear about a project that we have already open sourced!

Introducing Super-Simple Workflow (SSWF)

Super-simple workflow is a library for programming against Amazon Simple Workflow (SWF) without losing your mind.

It does not cover all possible workflows (as SWF does), but it makes probable workflows much easier to work with.

I’ll explain more about SWF for the uninitiated in a moment, but first, please join me in a brief rant.

The Rant

SWF is powerful and flexible, but not simple. I feel like, if you’re going to name your thing X Thing, your thing should be x. And not in a specialized or meta or even ironic way. It’s fine if you thought your thing was going to be x but somehow went astray without realizing it. But I really don’t think there’s any way that simplicity was a core design goal of SWF. Why not Amazon Workflow Service”? Or Universal Workflow Service” if you’re feeling grandiose? Ultimately, it’s fine, I’ll just call it SWF (and try to ignore the corner of my brain that shouts out Shockwave Flash” every time). I just felt a little betrayed when I thought, ooh, I need to write a workflow, and I like it when things are simple…”, right before burning weeks building enough specialized knowledge to actually use the darn thing.

But maybe you’ll benefit from my irritation, as I’ve taken the time to materialize the way I thought SWF should work. Here’s a quick scorecard:

category AWS SWF Super-Simple Workflow
custom code for the steps in your workflow
define the shape” of your workflow good luck writing a robust decision worker” just return a list of steps to execute
register your workflow many components to register. many of the options are quite confusing.
  • define the steps in an enum
  • define the workflow itself with constructor arguments
  • register everything with a convenience method: registerWorkflow().
  • hides the most confusing/least useful options and clearly documents the rest: here.

The Introduction

SWF is a framework (including both library and service components) for defining a workflow”. The concept of a workflow” is defined to be fluid enough that it can admit not just code components, but human ones as well. Think of a flowchart!

the best flowchartthe best flowchart

If you can describe your algorithm, business process, or whatever, as a flowchart, you can implement it in SWF.

But wait! Doesn’t this just mean that the workflow is yet another universal model of computation? Well, yes, but there are some things that SWF is particularly good at. I think it’s a huge bummer that Amazon made SWF so complicated, since the cognitive overhead makes using it a losing proposition unless your use case is very fancy indeed. Leaving aside the overhead for a second, here’s a handy magazine quiz to let you know if SWF is right for you:

  • Is your process is long running?
  • Does your process have external systems (such as humans) in the loop?
  • Does your process make resumable progress? (or the harder version: it can’t just start over when it fails and must pick up where it left off)

What this all boils down to is state. SWF is basically a safe place to stash your program’s state. Now, you might be thinking, State? Sounds like a job for a database!” and Maybe I’ll just roll my own!” and How hard can it be?”. The truth is it actually wouldn’t be that hard, as long as your state-tracking needs are very, very simple. But here are some things to consider:

  • Does your workflow run just once at a time, or are there multiple concurrent executions? How will you model this?
  • How will you ensure that all concurrent executions make progress?
  • Will you ever need to look back at the execution history of a particular run? (Spoiler alert: yes, you will.) How will you model this?
  • How will you deal with versioning your workflow, either in total or just versioning individual chunks? (Especially important if you need to roll out updates in place)
  • How will you detect and recover if a particular workflow step has crashed or stalled?

There are more pitfalls, but this should be enough to give you second thoughts.

A Super-Simple Use Case

To illustrate how to use SSWF, I’ll share a simplified version of one of my real use cases. At Bazaarvoice, I wrote a service for managing Elasticsearch clusters. Similarly to Elastic Cloud: you can create and destroy clusters, and also dynamically change the number/type of machines, as well as the cluster layout (client/data/master nodes, etc.).

This is a great example of what I see as the main selling point of SWF: this is a process with some very long-running steps, and quite a few steps, and I’m really glad I don’t have to track the state of my jobs (because the ES management is complicated enough).

Here’s a basic overview of the job:

  1. create the cluster via Cloudformation (idempotent if it already exists)
  2. silence our alerts (the resize job will cause metrics to exceed their nominal values)
  3. update ES settings to optimize for shard reallocation
  4. re-shuffle the shards
  5. reset ES settings to optimize for query performance
  6. un-silence our alerts

This example is in Scala, because it makes the example shorter, but the library is perfectly usable from Java. To prove it, I wrote the example module in pure Java.

Definition of input

The first things I’ll do is define an Input class and parser.

// This will be the input we pass to executions of the workflow.
case class MyInput(clusterName: String, 
                   esVersion: String, 
                   // for the sake of the example, let's say this is the number of nodes and the ec2 instance type.
                   clientNodes: (Int, String),
                   dataNodes: (Int, String),
                   masterNodes: (Int, String))

class MyInputParser(jacksonMapper: ObjectMapper with ScalaObjectMapper) extends com.bazaarvoice.sswf.InputParser[MyInput] {
  def serialize(input: MyInput): String = jacksonMapper.writeValueAsString(input)
  def deserialize(inputString: String): MyInput = jacksonMapper.readValue[MyInput](inputString)
}

Definition of the workflow steps

Next, I’ll make an Enum that captures all the steps of my workflow.

Note that the order doesn’t matter, this bit is just configuration.

public enum MySteps implements WorkflowStep {
    // These are generous timeouts, 5 or 10 minutes
    CREATE(60 * 10),
    TURN_ALERTS_OFF(60 * 5),
    SET_ALLOCATION_SETTINGS(60 * 5),
    RESHUFFLE_SHARDS(60 * 10), 
    SET_QUERY_SETTINGS(60 * 5),
    TURN_ALERTS_ON(60 * 5);

    private final int startToFinishTimeoutSeconds;

    MySteps(final int startToFinishTimeoutSeconds) {
        this.startToFinishTimeoutSeconds = startToFinishTimeoutSeconds;
    }

    @Override public int startToFinishTimeoutSeconds() {
        // this is how long to let the task run to completion
        // before considering the step hung or failed and re-scheduling it.
        return startToFinishTimeoutSeconds;
    }

    @Override public int startToHeartbeatTimeoutSeconds() {
        // this is how long to let the task run between heartbeat checkins
        // before considering the step hung or failed and re-scheduling it.
        // We don't have any steps that need to heartbeat, so just supply the main timeout.
        return startToFinishTimeoutSeconds;
    }

    @Override public InProgressTimerFunction inProgressTimerSecondsFn() {
        // this is how long to sleep between runs of a step when it returns "InProgress"
        return new ConstantInProgressTimerFunction(60 * 10);
    }
}

Now that we have defined the steps, we can both define and implement the workflow by implementing Workflow Definition.

class MyWorkflow extends WorkflowDefinition[MyInput, MySteps] {

  // this is basically our flowchart drawing.
  // it describes what order our steps will execute in
  def workflow(input: MyInput): java.util.List[ScheduledStep[MySteps]] = 
    List[ScheduledStep[MySteps]](
      DefinedStep(CREATE),
      DefinedStep(TURN_ALERTS_OFF),
      DefinedStep(SET_ALLOCATION_SETTINGS),
      DefinedStep(RESHUFFLE_SHARDS),
      DefinedStep(SET_QUERY_SETTINGS),
      DefinedStep(TURN_ALERTS_ON)
    )

  // Here is where we define the logic for what happens in each step.
  // I have a tendency just to dispatch to helper methods to keep things tidy.
  def act(step: MySteps, input: MyInput, stepInput: StepInput, heartbeatCallback: HeartbeatCallback, execution: WorkflowExecution): StepResult = {
    step match {
      case CREATE => doCreate(stepInput)
      case TURN_ALERTS_OFF => switchAlerts(on = false)
      case SET_ALLOCATION_SETTINGS => switchToAllocationSettings(input.clusterName)
      case RESHUFFLE_SHARDS => allocateShards(input)
      case SET_QUERY_SETTINGS => switchToQuerySettings(input.clusterName)
      case TURN_ALERTS_ON => switchAlerts(on = true)
    }
  }

  def onFail(workflowId: String, runId: String, input: MyInput, history: StepsHistory[MyInput, MySteps], message: String): Unit = {
    // notify whoever cares about failure
  }

  def onFinish(workflowId: String, runId: String, input: MyInput, history: StepsHistory[MyInput, MySteps], message: String): Unit = {
    // notify whoever cares about successful completion
  }

  def onCancel(workflowId: String, runId: String, input: MyInput, history: StepsHistory[MyInput, MySteps], message: String): Unit = {
    // notify whoever cares about cancellation
  }
}

And that’s it! We are completely liberated from having to write the decision logic. Instead, we just list the steps we want to execute in the order we want them. Obviously, we still have to write the logic for the actions, but that’s exactly where we want to spend our effort. The point of SSWF is to let you focus on the action logic and not worry about the decision stuff.

The astute observer will wonder why the only shape” of workflow we permit is a list of steps. More on this in the next section.

Bonus advice: writing the logic for a workflow step

Here’s a bonus stylistic tip when it comes to the action logic. This actually applies in many similar situations whether you use SSWF or not. You might consider replicating it even if you’re using raw SWF:

Note that in SSWF, actions must return one of Success”, Failed”, or InProgress” to report the status of the action.

When I implement the logic for a workflow step, I use this pattern:

  • check if this step needs to do anything at all, if not, return Success”.
  • check for failure conditions. If they aren’t recoverable, return Failed”.
  • check if whatever this step does (like Cloudformation stack creation) in currently in progress, if so, return InProgress”.
  • So the action for this step is not done, failed, or in progress. That means we get to do something! Run the action for this step and return InProgress”.

This may seem a little inside-out, but it lets us state clearly what invariants hold when the step finally returns success. It also doesn’t require us to occupy a thread just to monitor the progress of the step. When we return InProgress”, SSWF will schedule a sleep according to the InProgressTimerFunction defined in the MySteps enum, and the action thread becomes free to work on the next thing.

Here’s how it looks:

def doCreate(stepInput): StepResult = {
  val status = getStackStatus(stepInput.clusterName)
  status match {
    case Some("CREATE_COMPLETE") || Some("UPDATE_COMPLETE") =>
      Success("Stack exists")
    case Some("CREATE_FAILED") =>
      Failed("Stack creation failed")
    case Some("CREATE_IN_PROGRESS") || Some("UPDATE_IN_PROGRESS") =>
      InProgress("Waiting for completion...")
    case None => 
      createCfnStack(stepInput)
      InProgress("Created Cloudformation stack. Waiting for completion...")
  }
}

Registering, polling, starting, etc.

SSWF is also a convenience wrapper around many of the arcane incantations SWF requires of you around registering your domain, workflow, steps, yadda, yadda, yadda. Also it hides or explains the super-confusing configuration options it provides/requires on… everything.

I won’t melt your brain with an example of how to accomplish all this stuff in raw SWF, but here’s the configuration”, as it were, with documentation, for SSWF: https://github.com/bazaarvoice/super-simple-workflow/blob/master/sswf-core/src/main/scala/com/bazaarvoice/sswf/service/WorkflowManagement.scala#L25.

And here’s all you have to do to register everything: registerWorkflow(). That method is idemotent, so you can call it every time your service starts up.

Since SSWF is just a library, not a service, we can’t save you from having to schedule and run decision” and action” workers, but at least you won’t have to write them. You just new up one each of StepDecisionWorker and StepActionWorker.

I also provided (in a separate, sswf-guava-20 module) some service wrappers to the workers. You pass the workers to the services as constructor args, and then call startAsync() on them, and, you’re up and running.

So at a high level, your workflow app will start-up like this:

public void start() {
    workflowManagement.registerWorkflow();

    decisionService.startAsync().awaitRunning();

    actionService.startAsync().awaitRunning();
}

Take a look at the example for more detail.

Super-Simple Workflow: Simplicity via constraints

Now, I’ll justify the decision to stick to a list-of-steps style of workflow.

SSWF offers a couple of small conveniences over raw SWF, but the main one is the elimination of the decision worker” logic.

Instead, you just provide a list of steps. Why a list? I have a theory that most of the workflows that you’re actually going to create fit more or less naturally into a linear sequence of steps. In some cases, you’ll have one or two branches at the most, and there are two simple transformations you can do to arrive at a linear workflow. If you need to make arbitrary complex workflows, there’s always the full power of raw SWF, but SSWF is all about making easy stuff easy to implement.

Transformation #1: Fork early

All the workflows I’ve dealt with have at most a small number of forks, and the path has always been determined by the input to the workflow.

a very tasty workflowa very tasty workflow

One thing you can do is just cheat and programmatically generate two different workflows depending on the input (this is why SSWF passes the input to the workflow method).

forking before the wf rather than in the wfforking before the wf rather than in the wf

This idea, which says you can parse the input and then generate a linear workflow, also allows you to do loop unrolling. For example, in case you need to do a buy” step repeatedly, one for each thing on your grocery list.

Transformation #2: Noop steps

If you’re not forking to choose between different flavors of workflow, chances are you’re forking to avoid steps that aren’t applicable.

skipping a stepskipping a step

You can avoid this fork by simply having the step detect for itself if it is unnecessary and return immediately.

noop rather than skippingnoop rather than skipping

These two transformations have been enough for several diverse projects at BV, but I greatly look forward to hearing more use cases. I’m open to expanding the model, but on the other hand, I’m a huge fan of the simplicity afforded by the linear workflow model.

Final thought: AWS Step Functions

If you’re just getting started with SWF, another projet you may want to take a look at is AWS Step Functions.

Step Functions, like SSWF, offers a dramatically simplified interface to SWF. Also like SSWF, it does this by letting you declare the shape of the workflow, rather than making you implement the decision worker.

The key difference is that Step Functions moves your execution into AWS Lambda, whereas SSWF leaves it on your machine. There are going to be cases where either one of those is what you want, so I think it still makes sense to support both.

Let me know what you think, either in the comments or on Hacker News!


bazaarvoice aws


Previous post
How Bazaarvoice solved data denormalization [edit] I’ve linked to the HN thread (at the bottom) and also mentioned that we’re hiring (also at the bottom). I recently read Liron Shapira’s