Long-running Node processes

At Spanning we often say that “easy things are hard at scale” (at least, our VP of Engineering has said it enough that it has actually caught on). Our first product, Spanning Backup for Google Apps, uses mostly Java code on the server side to communicate with Google, Amazon Web Services, our databases, and so on. For our second product, Spanning Backup for Salesforce, we chose to use Node.js. Node was the new hotness at the time we started developing the product, piqued our curiosity, and its asynchronous nature promised the ability to easily parallelize I/O and network-bound tasks. After several prototypes, we decided that we had everything we needed to create a fully-baked product with Node so we took the plunge. We decided to use Node for everything on the server-side, including some really long running processes.

This brings us back to all of those easy things that can be hard at scale. While we had many established patterns for dealing with large-scale data processing in Java, they didn’t always translate perfectly to Node, especially due to the asynchronous nature of Node. To further complicate things, we run Node processes as background jobs that may take a considerable amount of time to complete. What’s considerable? Anywhere from minutes to hours, or even weeks to finish processing. Yes, I said weeks!

What follows are some general tips and tricks we have incorporated in Node, as well as pitfalls that we encountered with long-running background processes. So what did we learn?

Separate your concerns.

Is your web server running Node? Rad. Do you use restify or express for your REST interface? Spiffy. Do you want some long running process invoked via a REST call to block your event loop for minutes, hours, or days? No. Not cool. Even if a task isn’t blocking the event loop permanently over its entire lifecycle, it’s still possible to block it for long stretches of time and/or overload the CPU such that new requests are not serviced quickly. It’s easier than you might think to introduce blocking code. Some simple examples include:

  • Accidentally invoking a synchronous function such as fs.readFileSync(), instead of it’s asynchronous, non-blocking counterpart.
  • Iterating over a large array with a blocking operation like a forEach loop.
  • Attempting to use JSON.stringify() on a large object.

We run a somewhat SOA-inspired architecture involving separate classes of machines with different roles. This separation prevents us from blocking the event loop on any of our web servers. Specifically, our infrastructure generally involves several web servers behind a load balancer. These servers generally do very little, and then immediately return to the client. An entirely separate set of machines is responsible for handling a long-running process (such as a backup, export, or restore operation). These “worker” machines constantly poll one or more SQS queues looking for work. Once they find a message they will work it until completion. Note that SQS isn’t a requirement, it’s just what we use. You could just as easily use another messaging queue, Redis, etc.

The architecture looks something like the following:

[Typical network server diagram]

For example, if a user wishes to spin up a long running process such as a backup, one of the servers in the web tier (a WWW server) will receive the request, queue up a message in SQS, and additionally create a row in a database table to track progress. At this point the WWW server can return a successful response to the client. Meanwhile, an available worker eventually polls the queue, pops the message off of the queue, and begins to update the database with progress information. As the worker machine churns through data, the WWW servers can periodically poll the database to see what progress the worker has made. In essence, the web tier can queue up work, and then communicate progress back to the client by simply reading the same database row that a worker box is updating.

Additionally, the WWW servers are capable of handling a wide range of requests from clients, not limited to displaying progress of a running process. They can also access well-indexed, small amounts of database information to support general application functionality such as displaying historical data, displaying error messages that may have occurred, administrative functions, etc. Eventually, some or all of these operations may begin to overwhelm your web server with large CPU loads, and/or spend too much time looping over large result sets. In such cases, you will likely need to offload those tasks to (micro)services that the web servers can call. The pattern is still valid however, with your web server making a call to another service, and then continuing to process other incoming requests. When the results from the service are available they can be relayed back to the web server, and finally back to the client (likely a browser). This loosely coupled architecture works well for several reasons:

  • It easily scales horizontally as load increases. There is generally not a need for more than a few web servers, but as our queue fills up, worker boxes can be added on-demand to handle the additional load. As load decreases, the worker boxes can be scaled down to provide cost savings. We are using EC2 to virtualize our workers, so simple auto-scale rules are put in place to help us balance performance needs and cost. You may also consider using Docker, Kubernetes, or some other container solution to scale your services as load increases; the basic concept remains the same.
  • Long-running processes are handled by a set of servers that do not handle web requests. Thus, the web servers can continue to service a large number of incoming requests quickly, and not be bogged down with a blocked event loop for extended periods of time.
  • Queues and background processes can be observed and monitored. When a long-running Node process fails (as some inevitably will), an error handler can pass the message back to the queue for another worker to finish processing. In the ideal scenario, a process can be resumed from the point that a previous worker left off, so that work does not have to be repeated.
  • Since messages can be requeued and picked up by other workers when a process fails, we can give the illusion of running for days or weeks without interruption. In reality, a robust failover mechanism is actually in place to keep the ship sailing smoothly.

Promise all the things!

Gordon has said this already in a previous blog post, but it’s important enough to reiterate. Decide on a Promise library and learn to love it. While we started our project with Q promises, we ultimately settled on Bluebird for a number of reasons.1

To the point however, any asynchronous function we write is a promise returning function. We embraced promises for several reasons:

  • They help us write clean, readable code that feels almost synchronous in nature.
  • They make error handling very pleasant, which has the added bonus of making retry logic simpler.
  • They make it dead simple to fire off multiple tasks in parallel. If you have ever dealt with thread synchronization in Java you’ll love Promise.all() or Promise.map() with a concurrency argument.
  • Libraries such as Bluebird can easily promisify a single callback function, or an entire library of functions, allowing you to instantly invoke third-party libraries in a promisified manner (even though the library itself relies on callbacks).
  • They are still super easy to mock for unit testing purposes, especially via Sinon.

Test all the things!

Speaking of tests…it’s always a great idea to test your code. Tests are especially important when working with dynamically typed languages such as JavaScript. At Spanning, we have multiple layers of tests and related defenses. It starts with a linting tool, such as JSHint. We run JSHint as part of our grunt builds, and it runs in our continuous integration environment every time new code is pushed to a branch. Using a tool like JSHint allows you to avoid many of the gotchas associated with JavaScript. Additionally, JSHint rules can be put in place to enforce a common coding standard across the team.

Next, we write a significant number of unit tests. So much, that writing and updating tests can often take more time than writing the actual code that we need to accomplish something new. Our tests are extremely behavioral driven and one of the main components of our unit tests is the ability to mock functions via Sinon. If you’re coming from Java, think of Sinon as the JavaScript equivalent of Mockito. You can create stubs and spies, and get them to do all sorts of things.

We also heavily rely on the Chai assertion library for our behavioral driven testing. You can easily compare values and make assertions about conditions that should, or not should be.

Finally, if you’re using Bluebird, you will want the sinon-as-promised library, which will allow you to stub tests to fail when the code is executed, rather than when the file is parsed. Otherwise, Bluebird will complain about a possibly unhandled exception when the tests are parsed. Here is an example of how we might write unit tests:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
// 3rd party dependencies
var chai = require('chai'),
sinon = require('sinon'),
Promise = require('bluebird'),
// our own dependencies
SomeModule = require('someDirectory/SomeModule'),
// NOTE: can only be defined after chai is present
assert = chai.assert,
should = chai.should();

// override sinon with sinon-as-promised and make it return Bluebird promises
require('sinon-as-promised')(Promise);

describe('fake unit tests for foo() function', function() {
// setup some data and some stubs
var someValue = 42,
someFunctionStub, someOtherFunctionStub;

beforeEach(function(testDone) {
someFunctionStub = sinon.stub(SomeModule, 'someFunction');
someOtherFunctionStub = sinon.stub(SomeModule, 'someOtherFunction');
testDone();
});

afterEach(function(testDone) {
someFunctionStub.restore();
someOtherFunctionStub.restore();
testDone();
});

// test some random error condition
it('should reject with an error if argument is less than 5', function(testDone) {
// force the stub to reject
someFunctionStub.rejects(new Error('arg is less than 5'));

// just assume this will invoke someFunction() with the given argument and an array of [1, 2, 3]
SomeModule.foo(4)
.then(function() {
assert(false, 'should not have made it here with arg < 5');
})
.catch(function(err) {
var callArgs;

should.exist(err);
err.message.should.equal('arg is less than 5');

someFunctionStub.callCount.should.equal(1);
callArgs = someFunctionStub.getCall(0).args;
callArgs[0].should.equal(4);
callArgs[1].should.be.an('array').and.have.members([1, 2, 3]);

// the other stub should not be called due to the error
someOtherFunctionStub.callCount.should.equal(0);
})
.done(testDone, testDone);
});

// test that it behaves normally
it('should write to the database using someOtherFunction() with arg >= 5', function(testDone) {
// these stubs should return without error
someFunctionStub.returns(Promise.resolve());
someOtherFunctionStub.returns(Promise.resolve(someValue));

// just assume this will invoke someFunction() with the given argument and an array of [1, 2, 3]
SomeModule.foo(8)
.then(function(result) {
var callArgs;

// assume the result should be 8 for whatever reason
result.should.equal(8);

someFunctionStub.callCount.should.equal(1);
callArgs = someFunctionStub.getCall(0).args;
callArgs[0].should.equal(8);
callArgs[1].should.be.an('array').and.have.members([1, 2, 3]);

// the other stub should now be called
someOtherFunctionStub.callCount.should.equal(1);
callArgs = someOtherFunctionStub.getCall(0).args;
// test the rest of the arguments here...
})
.done(testDone, testDone);
});
});

Additionally, we have written a large number of integration tests. These can be especially useful to test the functionality of REST endpoints. Our typical flow is rather similar to the unit tests although instead of stubbing functions to return values or reject with errors, the before and after functions are typically responsible for populating database rows and subsequently removing them. Such tests are left as an exercise for the reader. :)

Learn to love recursion to learn to love recursion.

Let’s face it – your asynchronous code is going to fire off some task and then return immediately to its caller. Sure the callback function will be invoked once the asynchronous response is back, but there are many cases where additional processing is required. If you have nice, DRY code you’ll probably want to call yourself recursively to process that next chunk of data. Assuming a large amount of data needs to be processed, you certainly cannot process it all once. Perhaps you cannot store all of the data in memory or the database query for everything at once is too onerous. In such cases it is generally a good idea to retrieve data in small, well-indexed chunks. Furthermore, interactions with third-party APIs will likely force you to page through the data to help lower the burden on their servers.

Whatever the case may be, dealing with large data sets generally involves walking through things in a paginated fashion. Recursion and recursive helper functions are your friend. Write a function that sets up an initial state and then invokes the recursive helper function to do all of the heavy lifting. With promisified functions, you can simply and easily inspect the return value of a function and then decide to return the final result, or to recurse. An example might be paging through a list of database records, or walking a large number of files in S3. If you have promisified functions, this can be an effortless process. For example, to walk all records in a particular database table with a page size of 1000, you might do something like the following:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
var walkAllRecords = function(someValue) {
// we might be resuming work so query the DB for where we left off
return getProgessInfo(someValue)
.then(function(partialResult) {
return walkAllRecordsHelper(partialResult.numRecords, 1000, partialResult.lastOffset);
});
};

var walkAllRecordsHelper = function(numRecords, limit, offset) {
// default the values if they are falsy
numRecords = numRecords || 0;
offset = offset || 0;

// some promisified function that can get records from a database table
return getNextPageOfRecords(limit, offset)
.then(function(records) {
if (records.length === 0) {
// return the total number of records processed
return numRecords;
} else {
// do something with the records
return doSomething(records)
.then(function() {
// increment our values and recurse
numRecords += records.length;
offset += limit;
return walkAllRecordsHelper(numRecords, limit, offset);
});
}
});
};

You could also embed the helper function as a variable inside of your recursive function, but then you unfortunately cannot stub/spy it for unit tests. You do write unit tests and verify call arguments and such, right? If not, you somehow made it here without reading my previous section about testing. :)

Conclusion:

To summarize, my main suggestions are:

  • Build an architecture where concerns are separated amongst different classes of servers (different services), where each class/service is independently scalable. This is particularly important to keep your web servers responsive and not block the event loop on each one.
  • Allow your long-running processes to leave breadcrumbs as they run so that progress can be inspected and processing can be resumed by any other worker should a failure occur.
  • Wrap all asynchronous code with a solid promise library. Additionally, wrap calls to the promisified functions with retry logic where applicable.
  • Explore good, reusable patterns and then reuse them as much as possible. Recursive helper functions to deal with massive amount of data in a paginated fashion is a great example.

There are so many more suggestions and examples that could be provided. The above information works well for our purposes, but may not suit everyone. Hopefully there is some useful nugget contained here for how to scale and process large amounts of data in Node. Happy coding!



1 The Bluebird library continues to be maintained on a frequent basis, and also avoids using process.nextTick() in versions 0.12 and up (actually 0.11 and up but you shouldn’t be running an unstable branch of Node in production). You want setImmediate() instead, because process.nextTick() can show up in odd places and force your Node process to crash beyond the reaches of your code, where you can do something to handle it in a meaningful way. If you’re still running Node 0.10 you will unfortunately be susceptible to this error when using either Q or Bluebird (ask me how I know). If you cannot upgrade, my suggestion is to find patterns in the problematic areas of code, then look for ways to identify that the pattern is occurring. Finally, define a safe but reasonable threshold to be considered a maximum amount of work to perform. When the code detects that the threshold has been reached, force the code to halt to allow the stack to unwind.