John Roesler's blog (http://blog.vvcephei.org)

Introducing Super-Simple Workflow (SSWF)
http://blog.vvcephei.org/super-simple-workflow
Tue, 03 Jan 2017

Last time, I told you about a cool Bazaarvoice project that you can’t get your hands on (yet). This time, I thought you’d like to hear about a project that we have already open sourced!

Super-simple workflow is a library for programming against Amazon Simple Workflow (SWF) without losing your mind.

It does not cover all possible workflows (as SWF does), but it makes probable workflows much easier to work with.

I’ll explain more about SWF for the uninitiated in a moment, but first, please join me in a brief rant.

The Rant

SWF is powerful and flexible, but not simple. I feel like, if you’re going to name your thing “X Thing”, your thing should be x. And not in a specialized or meta or even ironic way. It’s fine if you thought your thing was going to be x but somehow went astray without realizing it. But I really don’t think there’s any way that simplicity was a core design goal of SWF. Why not “Amazon Workflow Service”? Or “Universal Workflow Service” if you’re feeling grandiose? Ultimately, it’s fine, I’ll just call it SWF (and try to ignore the corner of my brain that shouts out “Shockwave Flash” every time). I just felt a little betrayed when I thought, “ooh, I need to write a workflow, and I like it when things are simple…”, right before burning weeks building enough specialized knowledge to actually use the darn thing.

But maybe you’ll benefit from my irritation, as I’ve taken the time to materialize the way I thought SWF should work. Here’s a quick scorecard:

category: custom code for the steps in your workflow
  AWS SWF:               you write it
  Super-Simple Workflow: you write it

category: define the “shape” of your workflow
  AWS SWF:               good luck writing a robust “decision worker”
  Super-Simple Workflow: just return a list of steps to execute

category: register your workflow
  AWS SWF:               many components to register; many of the options are quite confusing
  Super-Simple Workflow:
    • define the steps in an enum
    • define the workflow itself with constructor arguments
    • register everything with a convenience method: registerWorkflow()
    • hides the most confusing/least useful options and clearly documents the rest: here

The Introduction

SWF is a framework (including both library and service components) for defining a “workflow”. The concept of a “workflow” is defined to be fluid enough that it can admit not just code components, but human ones as well. Think of a flowchart!

[image: the best flowchart]

If you can describe your algorithm, business process, or whatever, as a flowchart, you can implement it in SWF.

But wait! Doesn’t this just mean that the workflow is yet another universal model of computation? Well, yes, but there are some things that SWF is particularly good at. I think it’s a huge bummer that Amazon made SWF so complicated, since the cognitive overhead makes using it a losing proposition unless your use case is very fancy indeed. Leaving aside the overhead for a second, here’s a handy magazine quiz to let you know if SWF is right for you:

  • Is your process long-running?
  • Does your process have external systems (such as humans) in the loop?
  • Does your process make resumable progress? (Or, the harder version: it can’t just start over when it fails; it must pick up where it left off.)

What this all boils down to is state. SWF is basically a safe place to stash your program’s state. Now, you might be thinking, “State? Sounds like a job for a database!” and “Maybe I’ll just roll my own!” and “How hard can it be?”. The truth is it actually wouldn’t be that hard, as long as your state-tracking needs are very, very simple. But here are some things to consider:

  • Does your workflow run just once at a time, or are there multiple concurrent executions? How will you model this?
  • How will you ensure that all concurrent executions make progress?
  • Will you ever need to look back at the execution history of a particular run? (Spoiler alert: yes, you will.) How will you model this?
  • How will you deal with versioning your workflow, either in total or just versioning individual chunks? (Especially important if you need to roll out updates in place)
  • How will you detect and recover if a particular workflow step has crashed or stalled?

There are more pitfalls, but this should be enough to give you second thoughts.
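
Just to make that scope concrete, here’s a rough sketch (in Scala, with purely hypothetical types; nothing here is part of SWF or SSWF) of the kind of state record you’d end up designing if you rolled your own:

import java.time.Instant

// One record per workflow execution. You would have to track concurrent runs,
// per-step history, workflow versioning, and liveness (heartbeats) yourself.
case class StepRecord(step: String, status: String, startedAt: Instant, lastHeartbeatAt: Instant)

case class WorkflowRun(runId: String,
                       workflowVersion: Int,
                       input: String,
                       history: Vector[StepRecord],
                       currentStep: Option[String])

And that’s before you’ve written any of the polling, locking, or recovery logic around it.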

A Super-Simple Use Case

To illustrate how to use SSWF, I’ll share a simplified version of one of my real use cases. At Bazaarvoice, I wrote a service for managing Elasticsearch clusters. It’s similar to Elastic Cloud: you can create and destroy clusters, and also dynamically change the number/type of machines, as well as the cluster layout (client/data/master nodes, etc.).

This is a great example of what I see as the main selling point of SWF: it’s a process with quite a few steps, some of them very long-running, and I’m really glad I don’t have to track the state of my jobs myself (the ES management is complicated enough).

Here’s a basic overview of the job:

  1. create the cluster via Cloudformation (idempotent if it already exists)
  2. silence our alerts (the resize job will cause metrics to exceed their nominal values)
  3. update ES settings to optimize for shard reallocation
  4. re-shuffle the shards
  5. reset ES settings to optimize for query performance
  6. un-silence our alerts

This example is in Scala because that keeps it shorter, but the library is perfectly usable from Java. To prove it, I wrote the example module in pure Java.

Definition of input

The first thing I’ll do is define an Input class and a parser for it.

// Jackson imports, so the snippet is self-contained.
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.ScalaObjectMapper

// This will be the input we pass to executions of the workflow.
case class MyInput(clusterName: String,
                   esVersion: String,
                   // for the sake of the example, let's say this is the number of nodes and the ec2 instance type.
                   clientNodes: (Int, String),
                   dataNodes: (Int, String),
                   masterNodes: (Int, String))

class MyInputParser(jacksonMapper: ObjectMapper with ScalaObjectMapper) extends com.bazaarvoice.sswf.InputParser[MyInput] {
  def serialize(input: MyInput): String = jacksonMapper.writeValueAsString(input)
  def deserialize(inputString: String): MyInput = jacksonMapper.readValue[MyInput](inputString)
}
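
For completeness, here’s a minimal sketch of wiring the parser up. It assumes jackson-module-scala is on the classpath, and the input values are made up for illustration:

import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.{DefaultScalaModule, ScalaObjectMapper}

object MyInputParserExample {
  // A plain Jackson mapper with the Scala module registered, so the case class round-trips cleanly.
  val mapper = new ObjectMapper() with ScalaObjectMapper
  mapper.registerModule(DefaultScalaModule)

  val inputParser = new MyInputParser(mapper)

  // Round-trip: serialize to the string SSWF will store, then parse it back.
  val input = MyInput("example-cluster", "2.4.2",
                      clientNodes = (2, "m4.large"),
                      dataNodes = (6, "i2.2xlarge"),
                      masterNodes = (3, "m4.large"))
  val roundTripped: MyInput = inputParser.deserialize(inputParser.serialize(input))
}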

Definition of the workflow steps

Next, I’ll make an Enum that captures all the steps of my workflow.

Note that the order doesn’t matter; this bit is just configuration.

public enum MySteps implements WorkflowStep {
    // These are generous timeouts, 5 or 10 minutes
    CREATE(60 * 10),
    TURN_ALERTS_OFF(60 * 5),
    SET_ALLOCATION_SETTINGS(60 * 5),
    RESHUFFLE_SHARDS(60 * 10), 
    SET_QUERY_SETTINGS(60 * 5),
    TURN_ALERTS_ON(60 * 5);

    private final int startToFinishTimeoutSeconds;

    MySteps(final int startToFinishTimeoutSeconds) {
        this.startToFinishTimeoutSeconds = startToFinishTimeoutSeconds;
    }

    @Override public int startToFinishTimeoutSeconds() {
        // this is how long to let the task run to completion
        // before considering the step hung or failed and re-scheduling it.
        return startToFinishTimeoutSeconds;
    }

    @Override public int startToHeartbeatTimeoutSeconds() {
        // this is how long to let the task run between heartbeat checkins
        // before considering the step hung or failed and re-scheduling it.
        // We don't have any steps that need to heartbeat, so just supply the main timeout.
        return startToFinishTimeoutSeconds;
    }

    @Override public InProgressTimerFunction inProgressTimerSecondsFn() {
        // this is how long to sleep between runs of a step when it returns "InProgress"
        return new ConstantInProgressTimerFunction(60 * 10);
    }
}

Now that we have defined the steps, we can both define and implement the workflow by implementing WorkflowDefinition.

import scala.collection.JavaConverters._
import MySteps._

class MyWorkflow extends WorkflowDefinition[MyInput, MySteps] {

  // this is basically our flowchart drawing.
  // it describes what order our steps will execute in
  def workflow(input: MyInput): java.util.List[ScheduledStep[MySteps]] =
    List[ScheduledStep[MySteps]](
      DefinedStep(CREATE),
      DefinedStep(TURN_ALERTS_OFF),
      DefinedStep(SET_ALLOCATION_SETTINGS),
      DefinedStep(RESHUFFLE_SHARDS),
      DefinedStep(SET_QUERY_SETTINGS),
      DefinedStep(TURN_ALERTS_ON)
    ).asJava // the interface wants a java.util.List, so convert the Scala List

  // Here is where we define the logic for what happens in each step.
  // I have a tendency just to dispatch to helper methods to keep things tidy.
  def act(step: MySteps, input: MyInput, stepInput: StepInput, heartbeatCallback: HeartbeatCallback, execution: WorkflowExecution): StepResult = {
    step match {
      case CREATE => doCreate(stepInput)
      case TURN_ALERTS_OFF => switchAlerts(on = false)
      case SET_ALLOCATION_SETTINGS => switchToAllocationSettings(input.clusterName)
      case RESHUFFLE_SHARDS => allocateShards(input)
      case SET_QUERY_SETTINGS => switchToQuerySettings(input.clusterName)
      case TURN_ALERTS_ON => switchAlerts(on = true)
    }
  }

  def onFail(workflowId: String, runId: String, input: MyInput, history: StepsHistory[MyInput, MySteps], message: String): Unit = {
    // notify whoever cares about failure
  }

  def onFinish(workflowId: String, runId: String, input: MyInput, history: StepsHistory[MyInput, MySteps], message: String): Unit = {
    // notify whoever cares about successful completion
  }

  def onCancel(workflowId: String, runId: String, input: MyInput, history: StepsHistory[MyInput, MySteps], message: String): Unit = {
    // notify whoever cares about cancellation
  }
}

And that’s it! We are completely liberated from having to write the decision logic. Instead, we just list the steps we want to execute in the order we want them. Obviously, we still have to write the logic for the actions, but that’s exactly where we want to spend our effort. The point of SSWF is to let you focus on the action logic and not worry about the decision stuff.

The astute observer will wonder why the only “shape” of workflow we permit is a list of steps. More on this in the next section.

Bonus advice: writing the logic for a workflow step

Here’s a bonus stylistic tip when it comes to the action logic. This actually applies in many similar situations whether you use SSWF or not. You might consider replicating it even if you’re using raw SWF:

Note that in SSWF, actions must return one of “Success”, “Failed”, or “InProgress” to report the status of the action.

When I implement the logic for a workflow step, I use this pattern:

  • check whether this step needs to do anything at all; if not, return “Success”.
  • check for failure conditions. If they aren’t recoverable, return “Failed”.
  • check whether whatever this step does (like Cloudformation stack creation) is currently in progress; if so, return “InProgress”.
  • So the action for this step is not done, failed, or in progress. That means we get to do something! Run the action for this step and return “InProgress”.

This may seem a little inside-out, but it lets us state clearly what invariants hold when the step finally returns success. It also doesn’t require us to occupy a thread just to monitor the progress of the step. When we return “InProgress”, SSWF will schedule a sleep according to the InProgressTimerFunction defined in the MySteps enum, and the action thread becomes free to work on the next thing.

Here’s how it looks:

def doCreate(stepInput: StepInput): StepResult = {
  // getStackStatus and createCfnStack are helper methods wrapping the Cloudformation API.
  val status = getStackStatus(stepInput.clusterName)
  status match {
    case Some("CREATE_COMPLETE") | Some("UPDATE_COMPLETE") =>
      Success("Stack exists")
    case Some("CREATE_FAILED") =>
      Failed("Stack creation failed")
    case Some("CREATE_IN_PROGRESS") | Some("UPDATE_IN_PROGRESS") =>
      InProgress("Waiting for completion...")
    case Some(other) =>
      Failed(s"Unexpected stack status: $other")
    case None =>
      createCfnStack(stepInput)
      InProgress("Created Cloudformation stack. Waiting for completion...")
  }
}

Registering, polling, starting, etc.

SSWF is also a convenience wrapper around many of the arcane incantations SWF requires of you for registering your domain, workflow, steps, yadda, yadda, yadda. It also hides (or explains) the super-confusing configuration options SWF provides/requires on… everything.

I won’t melt your brain with an example of how to accomplish all this stuff in raw SWF, but here’s the “configuration”, as it were, with documentation, for SSWF: https://github.com/bazaarvoice/super-simple-workflow/blob/master/sswf-core/src/main/scala/com/bazaarvoice/sswf/service/WorkflowManagement.scala#L25.

And here’s all you have to do to register everything: registerWorkflow(). That method is idempotent, so you can call it every time your service starts up.

Since SSWF is just a library, not a service, we can’t save you from having to schedule and run “decision” and “action” workers, but at least you won’t have to write them. You just new up one each of StepDecisionWorker and StepActionWorker.

I also provided (in a separate sswf-guava-20 module) some service wrappers for the workers. You pass the workers to the services as constructor args, call startAsync() on them, and you’re up and running.

So at a high level, your workflow app will start up like this:

public void start() {
    workflowManagement.registerWorkflow();

    decisionService.startAsync().awaitRunning();

    actionService.startAsync().awaitRunning();
}

Take a look at the example for more detail.

Super-Simple Workflow: Simplicity via constraints

Now, I’ll justify the decision to stick to a list-of-steps style of workflow.

SSWF offers a couple of small conveniences over raw SWF, but the main one is the elimination of the “decision worker” logic.

Instead, you just provide a list of steps. Why a list? I have a theory that most of the workflows you’re actually going to create fit more or less naturally into a linear sequence of steps. In some cases you’ll have one or two branches at most, and there are two simple transformations you can do to arrive at a linear workflow. If you need to build arbitrarily complex workflows, there’s always the full power of raw SWF, but SSWF is all about making easy stuff easy to implement.

Transformation #1: Fork early

All the workflows I’ve dealt with have at most a small number of forks, and the path has always been determined by the input to the workflow.

[image: a very tasty workflow]

One thing you can do is just cheat and programmatically generate two different workflows depending on the input (this is why SSWF passes the input to the workflow method).

[image: forking before the wf rather than in the wf]

This idea, that you can parse the input and then generate a linear workflow, also allows you to do loop unrolling. For example, if you need to run a “buy” step repeatedly, once for each thing on your grocery list.
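
To make that concrete, here’s a hedged sketch of a “fork early” workflow method for the cluster example. The branching condition is hypothetical (MyInput doesn’t really carry a “create only” flag); the point is just that you inspect the input and return whichever linear list of steps applies:

import java.util.{List => JList}
import scala.collection.JavaConverters._
import MySteps._

def workflow(input: MyInput): JList[ScheduledStep[MySteps]] = {
  val steps: List[ScheduledStep[MySteps]] =
    if (input.dataNodes._1 == 0) {
      // hypothetical "create only" flavor: no data nodes yet, so nothing to reshuffle
      List(DefinedStep(CREATE))
    } else {
      // the full resize flavor from the example above
      List(DefinedStep(CREATE),
           DefinedStep(TURN_ALERTS_OFF),
           DefinedStep(SET_ALLOCATION_SETTINGS),
           DefinedStep(RESHUFFLE_SHARDS),
           DefinedStep(SET_QUERY_SETTINGS),
           DefinedStep(TURN_ALERTS_ON))
    }
  steps.asJava
}

Loop unrolling works the same way: map over a list in the input and emit one DefinedStep per item.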

Transformation #2: Noop steps

If you’re not forking to choose between different flavors of workflow, chances are you’re forking to avoid steps that aren’t applicable.

[image: skipping a step]

You can avoid this fork by simply having the step detect for itself whether it is unnecessary and return immediately (sketched below).

[image: noop rather than skipping]
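
In SSWF terms, a noop check is just the first branch of the step’s action logic. Here’s a minimal sketch using the alert-toggling step from the example; alertingEnabled and setAlerting are hypothetical helpers:

def switchAlerts(on: Boolean): StepResult = {
  val desired = if (on) "on" else "off"
  if (alertingEnabled() == on) {
    // noop: the world is already in the state this step wants, so there's nothing to do
    Success(s"Alerts already $desired; nothing to do.")
  } else {
    setAlerting(on)
    Success(s"Turned alerts $desired.")
  }
}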

These two transformations have been enough for several diverse projects at BV, but I greatly look forward to hearing more use cases. I’m open to expanding the model, but on the other hand, I’m a huge fan of the simplicity afforded by the linear workflow model.

Final thought: AWS Step Functions

If you’re just getting started with SWF, another project you may want to take a look at is AWS Step Functions.

Step Functions, like SSWF, offers a dramatically simplified interface to SWF. Also like SSWF, it does this by letting you declare the shape of the workflow, rather than making you implement the decision worker.

The key difference is that Step Functions moves your execution into AWS Lambda, whereas SSWF leaves it on your machine. There are going to be cases where either one of those is what you want, so I think it still makes sense to support both.

Let me know what you think, either in the comments or on Hacker News!

How Bazaarvoice solved data denormalization
http://blog.vvcephei.org/polloi-denorm
Wed, 14 Dec 2016

[edit] I’ve linked to the HN thread (at the bottom) and also mentioned that we’re hiring (also at the bottom).

I recently read Liron Shapira’s really insightful blog article called “Data denormalization is broken” (https://hackernoon.com/data-denormalization-is-broken-7b697352f405#.24urwx4e2).

He starts:

Backend engineers can’t write elegant and performant application-layer code for many types of everyday business logic. That’s because no one has yet invented a “denormalization engine”, a database with a more general kind of indexer.

I couldn’t agree more, and his article is extremely well reasoned and clearly presented. I highly recommend reading it.

At Bazaarvoice, we have a decentralized team structure, in which separate product lines, services, etc. are maintained by small, focused teams.

Many of these teams are operating at a scale that requires them to:

  1. use at least one search index to answer production queries
  2. include some up-front denormalizations to optimize the queries

We started a replatforming effort 5 or 6 years ago, and that gave us the opportunity to revisit many data-related problems, among them the denormalization one. When I say that the system we made “solves” denormalization, I mean that, yes, it correctly and efficiently computes them, but more importantly that you configure it with a simple spec.

A little background

polloi

Just for context, the denormalization system I’m describing is called Polloi. The name is derived from the phrase “hoi polloi”, or “the populace”.

The idea is to democratize the construction of denormalized indices.

I.e., we wanted to make denormalization easy enough that teams within Bazaarvoice would be empowered to customize and optimize their own indices. Otherwise, if it’s too much effort, this responsibility will naturally trend back toward a central team, which will hurt the business’s agility in the long run.

streaming data

Bazaarvoice has rebuilt itself on a streaming data backend. We use a distributed database that provides a “data bus”, which lets users create subscriptions to data changes. (FYI: the Cassandra Summit talk)

But Polloi will work with any data stream.

Motivating example

Bazaarvoice’s primary domain is user reviews and Q&A. We collect this content on products, and the products are organized into categories.

Here’s an amusingly simple version of our data model:

[image: a crude data model diagram]

Review

[{
  "type": "review",
  "id":   "r1",
  "product": {"id":"p3"},
  "status": "APPROVED",
  "rating": 1,
  "text":  "I hate this product. >:("
},
{
  "type": "review",
  "id":   "r9",
  "product": {"id":"p3"},
  "status": "APPROVED",
  "rating": 5,
  "text":  "This thing is the best thing ever. You changed my life!!!"
},
{
  "type": "review",
  "id":   "r13",
  "product": {"id":"p3"},
  "status": "REJECTED",
  "rating": 5,
  "text":  "Here, follow this spam link: http://****.  Also, check out this profanity: **** ** *** *****"
}]

Product

[{
  "type": "product",
  "id":   "p3",
  "name": "Some Kind of Thing (tm)",
  "description": "This is a product of some kind"
}]

For the sake of discussion, let’s say we want to support these use cases:

  • (uc1) query for reviews by product name
  • (uc2) show the average rating of a product over its approved reviews (alongside its other details)

In reality, our application has many more use cases than these to support. We also have an API, which supports some very different access patterns, and additionally, we have other products with totally different data and use cases, such as curated content from social media streams.

Also, this is all running at a pretty decent scale, so we are way beyond running these use cases against our main database.

Just to make the example concrete, we are generally indexing into Elasticsearch to support our queries:

[image: simple indexing]

(uc1) requires this (pseudo)code:

products = search({ "query": { "match" : { "name" : "thing" } } })
PRODUCT_IDS = products.hits.map(product => product.id)
reviews = search({ "query": { "constant_score" : { "filter" : { "terms" : { "user" : PRODUCT_IDS } } } } })
return reviews

(uc2) requires this:

product = get("p3")
result = search({ "query": { "constant_score" : { "bool" : { "filter" : [ { "term" : { "product.id" : "p3" } },
                                                                          { "term" : { "status" : "APPROVED" } } ] } }
                  "aggregations": { "avgRating" : { "avg" : { "field" : "rating" } } }
                })
return (product, result.avgRating)

Both require two queries, and Elasticsearch makes them reasonably efficient. But at Bazaarvoice, it is common to be shooting for over 10K requests/sec and under 100ms/request, which would be hard to satisfy with this approach.

However, if we pull the product’s name onto the review and compute the average rating ahead of time, (uc1) can become a single query, and (uc2) can actually become a single super-fast get.

The performance requirements, in combination with the scale of the data and the volume of requests, dictate that we need to consider denormalization as a technique for speeding up queries.

Denormalize below the level of the application

This leads us to the question of how we can compute the denormalizations.

Let’s start with the application logic that writes and modifies the data. Hypothetically, write-time is the perfect opportunity to also update denormalized values. But as Liron illustrates, this can become a nightmare for even a moderately complex application.

Beyond that, for Bazaarvoice, and for many other large companies, this is actually intractable, as data enters and leaves the company through a number of independent applications. It would be crazy for all data-writing applications to have to maintain denormalizations on behalf of data-querying apps.

materialized views

So we’re not going to denormalize in the application layer. What about in the database?

Many of the comments on Liron’s post point to materialized views in Oracle, DB2, SQL Server, and now, apparently, Postgres. Or, equivalently, they point out that you can implement materialized views / denormalization using triggers in MySQL. To this, I would add that a number of other databases provide equivalent functionality, for example, Cassandra 3+.

If you’re already using one of these databases and you’re looking for denormalizations, you may as well try out the materialized view. However, I don’t think I agree that these are “the solution” to denormalization. Primarily, there are many, many factors that play into the database you choose for your use case. You may be constrained by other factors and won’t be able to choose one that offers materialized views.

Even if your database offers a hook to compute denormalizations on the fly, you may find that it is too costly to use. The Bazaarvoice applications I’m describing query out of a search index rather than a database. We figured that, to compute denormalized values, we could just hook into the change stream. Every time a document changed, a background process would re-compute the relevant denormalizations. But we quickly discovered that this caused our application’s queries to suffer (a lot). How bad this is depends on your read/write patterns, but denormalization magnifies your write workload. In our case, we found that each write turned into ~60 denormalization queries. That’s not a nice thing to do either to your database or to your search index.

There is clearly a lot you can do to batch and throttle materialization workloads, but fundamentally we found that there was just too much competition between the materialization queries and the application ones.

Finally, returning to the separation of concerns, just as the application is most maintainable when it only has to worry about proper application concerns, the database or search index likewise will be most maintainable when it only needs to worry about its own concerns. And there are plenty: durability, availability, query performance, etc.

Our database team is focused on keeping a massive globally-distributed, multi-master, streaming data platform up and running and reasonably consistent, not to mention continuing development work on the database itself. They shouldn’t be distracted by ad-hoc requests as dozens of different teams constantly want to materialize different queries.

Conversely, those teams want to move fast to implement features and solve problems. They don’t want to coordinate with the database team to define denormalizations.

In other words, I’m convinced that at scale, you really want a separate system to run your denormalizations. And it needs to be under the control of the team who does the searches.

Therefore, it should be easy enough to create and operate denormalizers that you can create many, one for each use case.

Denormalize between your database and your application

Polloi basically consumes a stream of normalized data and produces a stream of denormalized data. As such, it will work with any source database and any destination whatsoever.

Sounds great! But what is Polloi like to use? What Liron wants (and what we all deserve) is a denormalizer that not only works, but is easy to use.

For the record, I have some ideas for a revision of the Polloi rules language to make it even easier and more intuitive, but here’s a look at an example Polloi configuration from the Bazaarvoice domain.

Denormalizing with Polloi

Polloi is configured with two text files: a schema, and a rules file.

Schema

type        string
title       string
text        string
status      string
rating      integer
product     ref
name        string

The schema tells Polloi what attributes to include, either in the computations, or just in the output documents.

Rules

[type="review"]product = expand(product, name)
[type="product"]averageRating = average(_product[type="review",status="APPROVED"].rating)

The rules define the denormalizations: in this case, we add the product name to reviews and compute average ratings for products.

Without getting into too much detail,

  • expand lets you traverse a reference (in this case, product)
  • average … computes an average.
  • _product is an inverse reference. I.e., it matches all documents that have a product reference back to the document in question.
  • The expressions in square brackets are predicates. On the left side, they are filtering which rules apply to which types of documents, and on the right of _product, they filter which documents get included in the average computation.
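
To make the semantics concrete, here’s a conceptual sketch in Scala of what those two rules compute. This is not Polloi’s implementation or API, just the equivalent logic over hypothetical case classes:

case class Product(id: String, name: String)
case class Review(id: String, productId: String, status: String, rating: Int)

// [type="review"]product = expand(product, name)
// For each review, follow its product reference and pull the product's name onto the review.
def expandedProduct(review: Review, productsById: Map[String, Product]): Map[String, String] =
  Map("id" -> review.productId, "name" -> productsById(review.productId).name)

// [type="product"]averageRating = average(_product[type="review",status="APPROVED"].rating)
// For each product, find the reviews that reference it (the "_product" inverse reference),
// keep only the APPROVED ones, and average their ratings.
def averageRating(product: Product, allReviews: Seq[Review]): Option[Double] = {
  val approved = allReviews.filter(r => r.productId == product.id && r.status == "APPROVED")
  if (approved.isEmpty) None
  else Some(approved.map(_.rating).sum.toDouble / approved.size)
}

For the example data above, the approved ratings for p3 are 1 and 5, so the average works out to (1 + 5) / 2 = 3.0, which is exactly the averageRating value you’ll see in the output below.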

At this point, we compile our code into a package and deploy it using an orchestration service we provide, which gives us a new denormalizing stream processor:

[image: polloi as a denormalizing stream processor]

That produces:

Review

[{
  "type": "review",
  "id":   "r1",
  "product": {"id":"p3", "name": "Some Kind of Thing (tm)"},
  "status": "APPROVED",
  "rating": 1,
  "text":  "I hate this product. >:("
},
{
  "type": "review",
  "id":   "r9",
  "product": {"id":"p3", "name": "Some Kind of Thing (tm)"},
  "status": "APPROVED",
  "rating": 5,
  "text":  "This thing is the best thing ever. You changed my life!!!"
},
{
  "type": "review",
  "id":   "r13",
  "product": {"id":"p3", "name": "Some Kind of Thing (tm)"},
  "status": "REJECTED",
  "rating": 5,
  "text":  "Here, follow this spam link: http://****.  Also, check out this profanity: **** ** *** *****"
}]

Product

[{
  "type": "product",
  "id":   "p3",
  "name": "Some Kind of Thing (tm)",
  "description": "This is a product of some kind",
  "averageRating": 3.0
}]

With this data in our index, here is the new code for our use cases:

(uc1)

return search({ "query": { "match" : { "product.name" : "thing" } } })

(uc2)

return get("p3")

To compare the performance more concretely, I ran some sample queries against a testing cluster that has all our data loaded. Just to simplify things, I’ll assume all use cases just fetch the first page of results. Here’s what I got:

use case   normalized                                              denormalized
(uc1)      1.3s + 0.5s for each returned product = 1.8s            1s
(uc2)      60ms for the get + 1s for the aggregation = 1.06s       60ms = 0.06s

It’s not scientific by any means, but it serves to illustrate the point.

Keep in mind that these results are against a cold, un-loaded test cluster. At this moment (10:30pm on a Friday), our production application is doing 11.2K queries per second with a 99th percentile response time of 120ms. There’s no way we could sustain that with just normalized data in our indices.

Denormalization democracy, realized

Bazaarvoice has been running its heaviest display use cases with Polloi for several years now. To sum up our experience, I would say that thinking of denormalization as an architectural problem is a small paradigm shift that really decouples projects and opens up a lot of scale and innovation possibilities.

Just to drive the point home, I’ll leave you with this diagram showing how multiple teams can define their own denormalizations, plug them into the company’s data stream, index their own optimized data, and power their applications:

[image: streaming data architecture]

This does not come without trade-offs, however. I’m planning to write more about the broader challenges and opportunities of operating a company on a streaming data backend.

Let me know what you think! I’ve enabled comments on the blog, you can discuss on Hacker News (https://news.ycombinator.com/item?id=13176411), or you can shoot me an email (see $whoami).

Bazaarvoice is Hiring!

[edit] I initially forgot to mention this, but the Polloi team is looking for someone passionate about distributed systems, streaming data, and just generally solving hard problems with good engineering. Hit me up if you’re interested.

Efficient Command-Line Accounting with Ledger and Account-CLI
http://blog.vvcephei.org/account-cli
Mon, 23 Dec 2013

Inspired by Andrew Cantino’s blog about accounting with Ledger and Reckon, I started this year out following his workflow. I’ve used MS MyMoney, GnuCash, KMyMoney, and Mint before. Ledger is my clear favorite out of these.

All of the software offerings are similar: they download transactions, and some do a reasonable job of classifying them, but I universally spend longer than I want to wrangling transactions and generating reports. Mint is awesome at sucking in transactions and categorizing them. It is focused more on budgeting than on the full complement of accounting functions. This is good for me, since I am no accountant; I just want to track my spending and keep an eye on my budget. However, I was never comfortable with all the access that Mint had to my financial life. I also wanted more flexibility with report generation than Mint offers.

Switching to the command line makes a lot of sense, since you can easily script your workflow and spend approximately no time performing routine actions. Andrew wrote about using Reckon, a Ruby program for loading transactions from CSV and appending them to your Ledger journal. I started out using the same workflow, but I quickly got tired of downloading CSV files from all my accounts. So I put together account-cli, which is similar to Reckon, but adds the ability to download transactions straight from the bank using OFX (aka Direct Connect). As an aside, I wound up writing scala-ofx and scala-ledger to support this project.

Using account-cli, here is my workflow:

  1. Download recent transactions from the one account that doesn’t support OFX
  2. Issue this command: account-cli path/to/ledger/journal.dat path/to/your/config.yaml. This downloads new transactions, adds them to my journal, and then prints out my balance and budget reports.
  3. Profit!

Having a simple 2-step process has been the major factor that keeps me on top of my finances.

How to set this up

The Ledger documentation has a really good introduction. It walks you through how to use the program, of course, but it also gives you an accessible intro to accounting.

Following Andrew’s advice and also the tutorial for ledger, I start a new Ledger file at the new year. The first block is the budget:

~ Monthly
  Expenses:Automotive:Gas   $300.00
  Expenses:Entertainment     $30.00
  Expenses:Food:Dining Out  $500.00
  Expenses:Food:Groceries   $250.00
  Assets

~ Yearly
  Expenses:Travel          $1500.00
  Assets

Then come the opening balances:

2013/01/01 * Opening Balances
    Assets:Bank:Savings               $100.00
    Liabilities:Bank:Credit Card    $-100.00
    Liabilities:Student Loan    $900.00
    Equity:Opening Balances

Then, you just keep a running list of transactions:

2013/02/04  Example Transaction
    Expenses:Automotive:Gas                 $1000.00
    Assets:Bank:Checking

2013/03/04  Example Transaction 2
    Expenses:Food:Dining Out                $400.00
    Assets:Bank:Checking

2013/10/04  Example Transaction 5
    Income:Work:Reimbursement             $-100.00
    Assets:Bank:Checking

2013/11/04  Example Transaction 4
    Income:Work:Paycheck                $-3,000.00
    Assets:Bank:Checking

2013/12/04  Example Transaction 3
    Expenses:Food:Groceries             $600.00
    Assets:Bank:Checking

The next component is the configuration file for account-cli. Here is the example config I put together with comments explaining all the parts.

Let’s say that the journal is stored in a file called 2013.dat, that the config is in config.yaml, and that you followed the couple of simple steps to install account-cli.

All you have to do to update your finances is run the command:

account-cli 2013.dat config.yaml

It will download new transactions, match them against the journal, and walk you through the categorization like this:

+----------------------+------------+-----------+------------------------+
| Assets:Bank:Savings  | 2013/12/13 | -$1000.00 | Withdrawal to Checking |
+----------------------+------------+-----------+------------------------+
To which account did this money go? ([account] / [q]uit/ [s]kip / [1]Bank:Transfer / [2]Expenses:Utilities / [3]Expenses:Entertainment) 1

+-----------------------+------------+---------+----------------+
| Assets:Bank:Checking  | 2013/12/14 | -$80.00 | ATM Withdrawal |
+-----------------------+------------+---------+----------------+
To which account did this money go? ([account] / [q]uit/ [s]kip / [1]Expenses:Entertainment / [2]Bank:Transfer / [3]Expenses:Automotive:Fees) Expenses:Misc

+-----------------------+------------+----------+----------------------+
| Assets:Bank:Checking  | 2013/12/13 | $1000.00 | Deposit from Savings |
+-----------------------+------------+----------+----------------------+
From which account did this money come? ([account] / [q]uit/ [s]kip / [1]Bank:Transfer / [2]Income:Work:Reimbursement / [3]Income:Work:Paycheck) 1

Then, it prints out the reports:

Balances: Assets and Liabilities
           $1,200.00  Assets:Bank
           $1,100.00    Checking
             $100.00    Savings
             $800.00  Liabilities
            $-100.00    Bank:Credit Card
             $900.00    Student Loan
--------------------
           $2,000.00

Balances: Expenses and Income
           $2,000.00  Expenses
           $1,000.00    Automotive:Gas
           $1,000.00    Food
             $400.00      Dining Out
             $600.00      Groceries
          $-3,100.00  Income:Work
          $-3,000.00    Paycheck
            $-100.00    Reimbursement
--------------------
          $-1,100.00

Budget: this year
      Actual       Budget         Diff  Burn  Account
   $2,000.00   $13,380.00  $-11,380.00   15%  Expenses
   $1,000.00    $3,300.00   $-2,300.00   30%    Automotive:Gas
           0      $330.00     $-330.00     0    Entertainment
   $1,000.00    $8,250.00   $-7,250.00   12%    Food
     $400.00    $5,500.00   $-5,100.00    7%      Dining Out
     $600.00    $2,750.00   $-2,150.00   22%      Groceries
           0    $1,500.00   $-1,500.00     0    Travel
------------ ------------ ------------ -----
   $2,000.00   $13,380.00  $-11,380.00   15%
Percent of Period complete:              95%

Budget: this quarter
      Actual       Budget         Diff  Burn  Account
     $600.00    $1,080.00     $-480.00   56%  Expenses
           0      $300.00     $-300.00     0    Automotive:Gas
           0       $30.00      $-30.00     0    Entertainment
     $600.00      $750.00     $-150.00   80%    Food
           0      $500.00     $-500.00     0      Dining Out
     $600.00      $250.00      $350.00  240%      Groceries
------------ ------------ ------------ -----
     $600.00    $1,080.00     $-480.00   56%
Percent of Period complete:              82%

Budget: this month
      Actual       Budget         Diff  Burn  Account
     $600.00    $1,080.00     $-480.00   56%  Expenses
           0      $300.00     $-300.00     0    Automotive:Gas
           0       $30.00      $-30.00     0    Entertainment
     $600.00      $750.00     $-150.00   80%    Food
           0      $500.00     $-500.00     0      Dining Out
     $600.00      $250.00      $350.00  240%      Groceries
------------ ------------ ------------ -----
     $600.00    $1,080.00     $-480.00   56%
Percent of Period complete:              46%

As a final note, Ledger is a powerful piece of accounting software, and you can run all kinds of reports. As I discover new, useful ones, I’ll update my scripts.

If you have any feedback or advice, I’d welcome comments and pull requests!

Cheers!

-John

Command-line Photo Processing
http://blog.vvcephei.org/command-line-photo-processing
Sun, 02 Aug 2009

My girlfriend and I recently took a trip to California, which was fantastic. We went backpacking in Yosemite for a week, and then spent another week in San Francisco and Napa Valley. During this time, we took a little over 1400 photos on our two digital cameras. When we came back, I sat down to figure out how to consolidate our pictures so that they would display chronologically. Although it took me a while to research all the different programs for this, I ended up discovering just a handful of commands that reduced the task from a several-day monotonous chore to a several-minute fire-and-forget job.

The starting point:

  • She used a Canon PowerShot SD1100 IS, which saves files like this: IMG_1700.JPG.
  • I used an OLYMPUS E-500, which saves files like this: AB276025.ORF for raw and like this: AB276025.JPG.

Here is a list of the programs I used:

  • dcraw
  • ImageMagick
  • exiv2
  • jhead

Here is a summary of the commands I issued:

dcraw *.ORF
mogrify -format jpg *.ppm
exiv2 insert -S.ORF *.jpg
jhead -n%Y.%m.%d-%H.%M.%S-e *
jhead -model E-500 -n%Y.%m.%d-%H.%M.%S-j *

I did not include the command to adjust the timestamps of my photos here because I corrected my camera’s clock, and that step will no longer be necessary. Plus, it’s confusing for other people who just want to run these commands to see what happens.

And here is the process:

Since I took about 200 raw photos, my first task was to convert them into jpegs.

dcraw *.ORF
mogrify -format jpg *.ppm

Dcraw is used to “develop” raw-formatted files into an intermediate format, ppm. In my experience, dcraw does a fine job choosing the correct settings to develop the pictures, but it also has a whole host of options for changing color, white balance, etc. Mogrify is part of the ImageMagick suite. Specifically, it is used to modify files in-place (the other programs in the ImageMagick suite create a new file each time you run them), and it has one handy feature that allows you to batch-convert files from one format to another; ppm-to-jpeg is just one of these conversions. Now, because this conversion does not preserve the metadata, which I would need later, I used this command to insert the data from the raw files into the jpegs:

exiv2 insert -S.ORF *.jpg

One thing to know about this command is that exiv2 is designed to work with Canon raw format (CRW). When applied to others, it makes a guess about which raw metadata fields map to which jpeg Exiv fields without promising to get it absolutely right. That said, it worked just fine with Olympus raw (ORF) for me.

NOTE: this command inserts into every jpg the information from the ORF file with the corresponding name. Therefore, you must do this step before renaming the jpegs.

Once I had all my photos in jpeg format, I decided to rename them by date and time so that my girlfriend’s and my photos would both be part of a single timeline. I also put an ‘e’ at the end of her pictures and a ‘j’ at the end of mine. Jhead is the perfect program for this:

jhead -n%Y.%m.%d-%H.%M.%S-e *
jhead -model E-500 -n%Y.%m.%d-%H.%M.%S-j *

The first command renames each file based on the “date taken” field. This command allows you to specify your own format for the date and time information, so I used my standard year.month.day-(24)hour.minute.second. The documentation for Jhead details how to use these tokens. The first command also renames all the files to end with ‘e’. The second command uses the -model argument to rename only my pictures so that they end in ‘j’. The -model argument can be used to limit most other commands to pictures from a specific camera. This feature was important for the last step.

When the pictures were all renamed according to when they were taken, I noticed that my pictures were slightly out of sync with my girlfriend’s. Since I did not have the cameras handy to compare their times, I just looked for a pair of pictures, one from each of our cameras, that captured a unique moment. I used the time difference in the photos’ timestamps to calculate that my camera was about 15 minutes and 20 seconds ahead of my girlfriend’s. Then I issued this command to correct it:

jhead -model E-500 -ta-0:15:20 *.jpg

The -model argument limits this command to my camera, and -ta can be used to adjust the time taken by a specified amount, positive or negative (negative, in this case). One more of the rename commands, now based on the corrected timestamps, and I was done!

jhead -model E-500 -n%Y.%m.%d-%H.%M.%S-j *

Next time, I might roll all these commands into a script, but I also like seeing the changes to the files after each step. Each of the programs I discussed here has a large number of options, so you can use them to do much more than I accomplished here.
