There was a question this week asking for the code we are most proud of. After I finished my answer, I thought it could be its own article.

It was in a midsize application, about 60,000 lines of server code, where I was implementing the API endpoints and database logic. For new features, I initially processed one entity at a time, such as a user, a comment, or a file. In the UI, when there was a list of something, users could often select multiple items and run an action on all of them together. Instead of calling the provided API endpoint multiple times, they asked me to implement a version that would accept many items at once.

Now there are basically two different approaches to doing so, which back in 2016 were not as obvious to me, because the backend code used Node-style callbacks. In the first approach, you accept many items on the server side and run the original logic using ‘Promise.all()’. This is roughly how GraphQL does it in a resolver.
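A minimal sketch of this first approach (the body of `processItem` is a hypothetical placeholder for the original one-entity handler):

```javascript
// First approach: accept a list, but reuse the single-item logic as-is.
// `processItem` stands in for the original one-entity handler.
async function processItem(item) {
  // placeholder for the real per-item work (one SQL query each)
  return { id: item.id, processed: true };
}

async function processItems(items) {
  // Runs all items concurrently, but still one call (and query) per item.
  return Promise.all(items.map(processItem));
}
```

The list version is trivial to write this way, which is exactly its appeal.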

However, this is very inefficient in terms of server performance, as it executes a lot of very small SQL statements. So I implemented a version of that function that would really take many items and run as few database queries as possible.
This is also how many people do it today in GraphQL: using the dataloader module developed by Facebook.
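To make the difference concrete, here is a small sketch (the table and column names are illustrative assumptions): instead of one tiny SQL statement per id, the batched version issues a single query for the whole list.

```javascript
const ids = [1, 2, 3];

// Naive: one round trip to the database per id.
const naive = ids.map((id) => 'SELECT * FROM items WHERE id = ?');

// Batched: a single round trip for the whole list.
const placeholders = ids.map(() => '?').join(', ');
const batchedQuery = `SELECT * FROM items WHERE id IN (${placeholders})`;
```

Three network round trips collapse into one, which is where the performance win comes from.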

The impact of this is that the code you write gets more complex. Handling a list is more complex than handling a single item. That becomes most obvious when you encounter a condition like this:

async function processItem(item) {
  if (item.prop === SOME_CONSTANT) {
    await doThis(item)
  } else {
    await doTheOther(item)
  }
}

For this situation, you have to process both cases, and the do-functions also need to accept lists. I was using the underscore library at that time:

async function processItems(items) {
  const itemsByProp = _.groupBy(items, 'prop');
  await Promise.all([
    doThis(itemsByProp[SOME_CONSTANT]),
    doTheOther(itemsByProp[OTHER_CONSTANT]),
  ]);
}

This example has the same number of lines, but the code gets bigger when there are more than two possible values for ‘prop’, or when you have more than one condition. You are likely to split the function into multiple ones because it gets too hard to read. Splitting a function into multiple is good for handling more complex logic, but maybe the code wouldn't need to be that complex in the first place. In some functions I ended up with multiple index objects, or also used ‘array.filter()’. This approach can definitely change the coding style for the entire project.
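To illustrate how an additional condition multiplies the groupings, here is a sketch in plain JavaScript (a stand-in for underscore's groupBy keeps it self-contained; the ‘special’ value and the ‘active’ flag are hypothetical):

```javascript
// Plain-JS stand-in for underscore's _.groupBy, to keep this runnable.
function groupBy(items, key) {
  const groups = {};
  for (const item of items) {
    (groups[item[key]] = groups[item[key]] || []).push(item);
  }
  return groups;
}

// With a second, hypothetical condition (`item.active`), each branch of the
// single-item code turns into another grouping or filter over the list.
function splitItems(items) {
  const itemsByProp = groupBy(items, 'prop');
  const matching = itemsByProp['special'] || [];
  return {
    active: matching.filter((item) => item.active),
    inactive: matching.filter((item) => !item.active),
    others: items.filter((item) => item.prop !== 'special'),
  };
}
```

One `if` in the single-item version became three derived lists here, which is how the index objects and filters accumulate.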

But what was the goal of these complex functions? It was to avoid constant calls to something like ‘getItemById’ with a single id, which would execute too many SQL statements that each contain only one id, are very costly in terms of networking, and together put a huge burden on the DB.

That is when I decided to take another approach. The idea: do caching, but do not cache the results; cache the function calls, and the callbacks to the functions that do the database access.
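The general idea can be sketched like this (a minimal batching wrapper written for this article, not tcacher's actual API): calls made within the same tick are collected and resolved by a single batched database access.

```javascript
// Sketch of call batching at the data layer (NOT tcacher's real API).
// Callers keep writing single-item code; the wrapper collects the keys.
function batchCalls(batchFn) {
  let queue = null;
  return function load(key) {
    if (!queue) {
      queue = [];
      // Flush on the next microtask, so all callers in this tick are collected.
      Promise.resolve().then(() => {
        const pending = queue;
        queue = null;
        batchFn(pending.map((p) => p.key)).then(
          (results) => pending.forEach((p, i) => p.resolve(results[i])),
          (error) => pending.forEach((p) => p.reject(error))
        );
      });
    }
    return new Promise((resolve, reject) => queue.push({ key, resolve, reject }));
  };
}

// Usage: every caller writes `getItemById(id)`, but all ids collected in one
// tick are served by a single batch function call.
let batchRuns = 0;
const getItemById = batchCalls(async (ids) => {
  batchRuns += 1; // pretend this is one `SELECT ... WHERE id IN (...)`
  return ids.map((id) => ({ id }));
});
```

The calling code still looks like it processes a single item, while the data layer decides how to batch.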

This is what I then wrapped into the module tcacher (today it is refactored for async functions instead of callbacks). By doing the request caching not on the API side but on the data layer, I was able to get the gains of running few SQL queries while still keeping code that looks like it processes a single item. In fact, this way even more queries were avoided, because queries from different APIs that use the same database method are batched together. Also, tcacher has been used and proven successful with different DAO libraries.

It was much later, at a new company, my current one, that I learned about dataloader, and that this functionality was not called request caching, but query batching.

Today, I think it does not matter which package you use, dataloader or tcacher. The first is more object oriented, the other more functional in style.

While writing this article, I checked the dataloader documentation again. In version 2.0.0 it has a batchScheduleFn feature, one that has been available in tcacher all along, which makes me even more proud. Now I still wonder how long it will take until dataloader no longer returns the same object to every caller, but a copy of the result for each. That would avoid nasty side effects when one of the callers mutates the result (see the code snippet).
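A hedged illustration of that hazard, using a plain cache written for this article rather than dataloader itself: when every caller receives the same cached object, one caller's mutation leaks into all the others.

```javascript
// A naive result cache that hands the same object to every caller.
const cache = new Map();

async function loadItem(id) {
  if (!cache.has(id)) {
    // pretend this object came from the database
    cache.set(id, { id, name: 'item-' + id });
  }
  return cache.get(id); // same reference for everyone
}
```

If one caller renames the item it received, every later caller of `loadItem(1)` sees the renamed object, because the cache never clones its results.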

However, what I think is important is to do the query batching on the side where you access other resources, not where other apps call your code. That is the same whether you use JS, Golang, or any other language.

I, of course, will always use tcacher, because it does one thing and does it right: batching. And I have seen engineers struggle to figure out how to use dataloader correctly together with its second feature, an actual in-memory cache, losing many of the benefits along the way.

You see, I am not only proud of having had a solution before I learned about the one provided by Facebook, but also of finding a way to keep the code clean.

This article is not meant to bash dataloader; when you are aware of its behavior, it can serve you well.
