Diagnosing and Resolving MySQL deadlocks

Spanning uses MySQL for several of our products, and our Backup for Salesforce product is no exception. Over the 2015 holiday season we noticed that some of our backups were reporting MySQL deadlock errors. While this wasn’t the first time we had encountered deadlocks (more on that later), this was extremely surprising to us as we hadn’t deployed any code changes over the holidays.

Ultimately, we realized that while we had not deployed any code changes, we had modified our infrastructure. Namely, we upgraded our MySQL instances in Amazon RDS to use SSDs as opposed to HDDs (or magnetic disks, as they are called in the AWS console). The goal was to provide a performance improvement to our application without having to change any code.

[Go Speed Racer, Go!]

While the upgrade accomplished this goal, it also brought along a small handful of deadlocks. To be more precise, we had experienced deadlocks due to the same root cause several months prior, but some recently improved, self-imposed monitoring of production was robust enough to bring the error to our immediate attention.

Fortunately, MySQL (using the InnoDB engine) offers an extremely simple way to diagnose deadlocks, assuming you know where to look.

Locating deadlocked transactions

At a minimum, to locate the transactions (and more specifically, the deadlocked SQL statements), you will need the PROCESS privilege for your MySQL user. If you don’t have this privilege you will simply get an error when you attempt to run the query below. If your user isn’t privileged simply ask your administrator to execute the query for you and forward the output. You can run the following to access the stored information about the most recent deadlock:

SHOW ENGINE INNODB STATUS \G

I recommend using '\G' rather than ';' as a query terminator as the output is much more friendly to read. The output can be rather large, but there is a section specifically titled LATEST DETECTED DEADLOCK which is where you should focus your efforts. Here is an example with most of the irrelevant data removed from the output:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
------------------------
LATEST DETECTED DEADLOCK
------------------------
2016-01-17 13:22:20 2b80eeb5b700
*** (1) TRANSACTION:
TRANSACTION 1418022806, ACTIVE 0 sec starting index read
mysql tables in use 1, locked 1

UPDATE `record` SET `end_date`='2016-01-31 20:04:10', `deleted`=0 WHERE
`sobject_type`='Account' AND `end_date`='9999-12-31 23:59:59' AND
`salesforce_id`='ABC' AND `start_date` <= '2016-01-31 20:04:11'
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 23 page no 26303736 n bits 112 index
`sobject_start_end_deleted` of table `db`.`record` trx id 1418022806
lock_mode X waiting


*** (2) TRANSACTION:
TRANSACTION 1418022805, ACTIVE 0 sec updating or deleting

UPDATE `record` SET `end_date`='2016-01-31 20:04:10', `deleted`=0 WHERE
`sobject_type`='Account' AND `end_date`='9999-12-31 23:59:59' AND
`salesforce_id`='XYZ' AND `start_date` <= '2016-01-31 20:04:11'
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 23 page no 26303736 n bits 112 index
`sobject_start_end_deleted` of table `db`.`record` trx id 1418022805
lock_mode X


*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 23 page no 26303736 n bits 112 index
`sobject_start_end_deleted` of table `db`.`record` trx id 1418022805
lock_mode X locks gap before rec insert intention waiting


*** WE ROLL BACK TRANSACTION (1)

Here you can see 2 UPDATE statements, one for the record with id of ‘ABC’ in transaction 1, and another with an id of ‘XYZ’ in transaction 2. Note that these statements were running concurrently. We’ll dive into the details of this output below after a slight overview on the table itself.

Diagnosing the source of the deadlock

To explain why the above deadlock happened (without going too far into our architecture and table structure), there are just a few things I should explain first. This is a gross simplification, but:

  • We have a table in MySQL that represents every version of every record we have ever backed up for a given organization.
  • Every row effectively stores a point in time when the record first became relevant, and a point in time when the version stopped being relevant. Meaning that if a version is backed up on January 1 at noon, you can think of it as being inserted with a start_date=’2016-01-01 12:00:00’ and an arbitrarily large end_date such as end_date=’9999-12-31 23:59:59’. We actually don’t use dates but it’s easier to demonstrate this way.

So let’s suppose that the table has the following structure to store records that we have retrieved from Salesforce. Salesforce calls them SObjects, and each one has a type (Account, Contact, Attachment, etc.) and a globally unique identifier that we will store in the salesforce_id column.

1
2
3
4
5
6
7
8
9
10
CREATE TABLE `record` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`salesforce_id` varchar(255) NOT NULL,
`sobject_type` varchar(255) NOT NULL,
`start_date` datetime NOT NULL,
`end_date` datetime NOT NULL DEFAULT '9999-12-31 23:59:59',
`deleted` tinyint(1) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`id`),
KEY `sobject_start_end_deleted` (`sobject_type`,`start_date`,`end_date`,`deleted`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;

Records in the database might look something like the following:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
mysql> insert into record (salesforce_id, sobject_type, start_date, deleted)
values ('ABC', 'Account', NOW(), 0);
Query OK, 1 row affected (0.00 sec)

mysql> insert into record (salesforce_id, sobject_type, start_date, deleted)
values ('XYZ', 'Account', NOW(), 0);
Query OK, 1 row affected (0.00 sec)

mysql> select * from record;
+----+---------------+--------------+---------------------+---------------------+---------+
| id | salesforce_id | sobject_type | start_date | end_date | deleted |
+----+---------------+--------------+---------------------+---------------------+---------+
| 1 | ABC | Account | 2016-01-29 19:29:13 | 9999-12-31 23:59:59 | 0 |
| 2 | XYZ | Account | 2016-01-29 19:31:00 | 9999-12-31 23:59:59 | 0 |
+----+---------------+--------------+---------------------+---------------------+---------+
2 rows in set (0.00 sec)

Note that we have the key sobject_start_end_deleted which is a non-unique index. This plays an important role in the deadlock, which (SPOILER ALERT) is caused by a technique known as a gap lock. If you aren’t familiar with gap locks, you can read an introduction to them in the MySQL docs and their explanation why it is used. If things still aren’t clicking I would suggest you refer to the following blog post.

During the course of our backup, we will retrieve a page of records from the Salesforce API, and then insert the information for each record into our table. We might receive 100 rows in a page, some of which are records we have never seen before, while others might be an update to a record that we have backed up in the past. Suppose a new version of ABC and XYZ both come in 2 days later. In this case we would need to:

  • Insert the new version of each record.
  • Locate the version of each record prior to the most recent one.
  • Update the end_date of the previous versions to be less than the newest version.

The table might now look something like this:

1
2
3
4
5
6
7
8
9
10
mysql> select * from record;
+----+---------------+--------------+---------------------+---------------------+---------+
| id | salesforce_id | sobject_type | start_date | end_date | deleted |
+----+---------------+--------------+---------------------+---------------------+---------+
| 1 | ABC | Account | 2016-01-29 19:29:13 | 2016-01-31 20:04:10 | 0 |
| 2 | XYZ | Account | 2016-01-29 19:31:00 | 2016-01-31 20:04:10 | 0 |
| 3 | ABC | Account | 2016-01-31 20:04:11 | 9999-12-31 23:59:59 | 0 |
| 4 | XYZ | Account | 2016-01-31 20:04:11 | 9999-12-31 23:59:59 | 0 |
+----+---------------+--------------+---------------------+---------------------+---------+
4 rows in set (0.00 sec)

Now imagine this running in our production code, which is having to insert, query, and potentially update millions of records on a daily basis. We have a function in our NodeJS code to update the end_date column for the previous version of a record, and if necessary, mark it as deleted. We run this update function with a small amount of concurrency using the Bluebird.map() function similar to the following:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
// number of milliseconds for every second
VAR ONE_SECOND = 1000;

var updateEndDate = function(sobjectType, newVersionDate, salesforceIds, deleted) {
var endDate = new Date(newVersionDate - ONE_SECOND),
setValues = {
end_date: endDate,
deleted: deleted
};

return Promise.map(salesforceIds, function(sfId) {
return db.update(
// the SET portion of the statement including columns and values
setValues,
// the WHERE clause
{
sobject_type: sobjectType,
end_date: '9999-12-31 23:59:59',
salesforce_id: sfId,
start_date: {
lessThan: newVersionDate
}
});
}, {
concurrency: 5
});
};

The specified concurrency argument offers a small performance gain since multiple connections can be made to the database to update several rows in parallel. Bluebird’s map function will simply iterate over the array we pass it in a non-deterministic order, and fire off 5 promises in parallel. As one resolves, another promise is created to update another row to maintain a concurrency of 5 until the array has been completely processed. However, each one of those UPDATE statements creates a transaction that tries to update rows using the above WHERE clause, which targets the non-unique index on the table. This opens the opportunity for a deadlock due to MySQL’s gap locking mechanism.

We noticed that gap lock was occurring due to the output of SHOW ENGINE INNODB STATUS which contained two transactions in the deadlock section (you can refer back to the sample output above). The first has a transaction id of 1418022806 and the second is 1418022805. From the output we can see that the second transaction acquired a row-level lock but that the first transaction acquired a gap lock (trx id 1418022805 lock_mode X locks gap before rec). This confirms that we were indeed experiencing a gap lock.

Taking both of those UPDATE statements and converting them to SELECT statements that used the same WHERE clause, we found the primary key (id) of these rows were exactly 1 apart, meaning the rows were right next to each other in the table. Thus, not only was MySQL telling us that a gap lock was in play, it was pretty easy to verify just by running queries and inspecting the output.

It is possible to avoid gap locks by ordering the statements according to the index, but unfortunately the non-deterministic order of Bluebird’s Promise.map() is out of our control, and is actually at the heart of what allows them to always maintain the specified level of concurrency since there is no guarantee how long an asynchronous request will take to complete.

Determining deadlock resolution

Resolving a deadlock obviously depends on what actions are causing the deadlock. Our first encounter with deadlocks was fairly straightforward. In anticipation of our deleted item restore (which we have subsequently improved and replaced with multi-record restore to handle both updated and deleted items), we were executing a long-running process to change the way we represented deleted records in the database. This utility would find specific rows in a table, locate a previous version of each row, then copy over some of the data and delete the row that was no longer relevant. It just so happened that the utility was attempting to delete rows for a Salesforce organization that was in the process of being backed up. Subsequently, our backup code was in the process of modifying the very same records that the fix-up utility was trying to delete, resulting in a deadlock. The path forward in this case was quite simple – ignore it and let it run its course. The fix-up utility was a temporary process that had a very slim chance of ever conflicting again with another backup, and in fact never did. It finished successfully after a few weeks of execution without another hiccup.

Our more recent deadlocks definitely required intervention as our backup code will continue to run as long as this product is still available. We decided to fix them in a two-step process, though depending on the severity and persistence of the problem, you may choose to only implement one. The process was:

  1. Remove the concurrency argument entirely so that only one row is updated at a time.
  2. Revisit the code at a later time and add retry logic around the database update call.

Removing concurrency was the simplest code change to make. This allowed us to push out an immediate fix without any real engineering effort, and continue to fully backup data for all of our customers as quickly as possible. Unfortunately it does incur a slight performance penalty, and we don’t like things taking longer than they have to. So we came along later and put the concurrency back into the code. However, the update code was wrapped in retry logic to make sure that any updates which failed due to a deadlock were retried (up to a maximum number of times). This allows us to run at maximum performance when no deadlocks occur, yet gracefully handle the problem if/when it occurs again.

Life after deadlocks

Without any monitoring in place (or even insufficient monitoring), it’s possible that many months could pass before we encounter a problem caused by deadlocks. At that point, the transaction information may have already been purged from the InnoDB engine status log. Living with problems like these can be tricky. If something appears to be working, then we have to assume it is as correctness can be a fuzzy thing to define. Fortunately for us, our self-imposed monitoring of our production environment was able to point out the deadlocks to us, and allow us to investigate them quickly.

In the past, the error itself would be found in a log file as well as a production database table that most employees don’t have access to anyway. The ones that do would have to remember to check this table on a regular basis. This made production monitoring difficult, so we worked on an internal web application that uses some internal REST APIs to help us monitor the health of every backup, export, and restore operation performed on a daily basis. Using a simple dashboard there, we can aggregate the results and see which operations are successful, partial, or outright failures.

The internal-only dashboard allowed us to pinpoint the deadlocks and immediately begin investigation into the problem. This helped us out immensely in this particular case, but such a tool is probably the basis of an entirely separate blog post itself.

Long-running Node processes

At Spanning we often say that “easy things are hard at scale” (at least, our VP of Engineering has said it enough that it has actually caught on). Our first product, Spanning Backup for Google Apps, uses mostly Java code on the server side to communicate with Google, Amazon Web Services, our databases, and so on. For our second product, Spanning Backup for Salesforce, we chose to use Node.js. Node was the new hotness at the time we started developing the product, piqued our curiosity, and its asynchronous nature promised the ability to easily parallelize I/O and network-bound tasks. After several prototypes, we decided that we had everything we needed to create a fully-baked product with Node so we took the plunge. We decided to use Node for everything on the server-side, including some really long running processes.

This brings us back to all of those easy things that can be hard at scale. While we had many established patterns for dealing with large-scale data processing in Java, they didn’t always translate perfectly to Node, especially due to the asynchronous nature of Node. To further complicate things, we run Node processes as background jobs that may take a considerable amount of time to complete. What’s considerable? Anywhere from minutes to hours, or even weeks to finish processing. Yes, I said weeks!

What follows are some general tips and tricks we have incorporated in Node, as well as pitfalls that we encountered with long-running background processes. So what did we learn?

Separate your concerns.

Is your web server running Node? Rad. Do you use restify or express for your REST interface? Spiffy. Do you want some long running process invoked via a REST call to block your event loop for minutes, hours, or days? No. Not cool. Even if a task isn’t blocking the event loop permanently over its entire lifecycle, it’s still possible to block it for long stretches of time and/or overload the CPU such that new requests are not serviced quickly. It’s easier than you might think to introduce blocking code. Some simple examples include:

  • Accidentally invoking a synchronous function such as fs.readFileSync(), instead of it’s asynchronous, non-blocking counterpart.
  • Iterating over a large array with a blocking operation like a forEach loop.
  • Attempting to use JSON.stringify() on a large object.

We run a somewhat SOA-inspired architecture involving separate classes of machines with different roles. This separation prevents us from blocking the event loop on any of our web servers. Specifically, our infrastructure generally involves several web servers behind a load balancer. These servers generally do very little, and then immediately return to the client. An entirely separate set of machines is responsible for handling a long-running process (such as a backup, export, or restore operation). These “worker” machines constantly poll one or more SQS queues looking for work. Once they find a message they will work it until completion. Note that SQS isn’t a requirement, it’s just what we use. You could just as easily use another messaging queue, Redis, etc.

The architecture looks something like the following:

[Typical network server diagram]

For example, if a user wishes to spin up a long running process such as a backup, one of the servers in the web tier (a WWW server) will receive the request, queue up a message in SQS, and additionally create a row in a database table to track progress. At this point the WWW server can return a successful response to the client. Meanwhile, an available worker eventually polls the queue, pops the message off of the queue, and begins to update the database with progress information. As the worker machine churns through data, the WWW servers can periodically poll the database to see what progress the worker has made. In essence, the web tier can queue up work, and then communicate progress back to the client by simply reading the same database row that a worker box is updating.

Additionally, the WWW servers are capable of handling a wide range of requests from clients, not limited to displaying progress of a running process. They can also access well-indexed, small amounts of database information to support general application functionality such as displaying historical data, displaying error messages that may have occurred, administrative functions, etc. Eventually, some or all of these operations may begin to overwhelm your web server with large CPU loads, and/or spend too much time looping over large result sets. In such cases, you will likely need to offload those tasks to (micro)services that the web servers can call. The pattern is still valid however, with your web server making a call to another service, and then continuing to process other incoming requests. When the results from the service are available they can be relayed back to the web server, and finally back to the client (likely a browser). This loosely coupled architecture works well for several reasons:

  • It easily scales horizontally as load increases. There is generally not a need for more than a few web servers, but as our queue fills up, worker boxes can be added on-demand to handle the additional load. As load decreases, the worker boxes can be scaled down to provide cost savings. We are using EC2 to virtualize our workers, so simple auto-scale rules are put in place to help us balance performance needs and cost. You may also consider using Docker, Kubernetes, or some other container solution to scale your services as load increases; the basic concept remains the same.
  • Long-running processes are handled by a set of servers that do not handle web requests. Thus, the web servers can continue to service a large number of incoming requests quickly, and not be bogged down with a blocked event loop for extended periods of time.
  • Queues and background processes can be observed and monitored. When a long-running Node process fails (as some inevitably will), an error handler can pass the message back to the queue for another worker to finish processing. In the ideal scenario, a process can be resumed from the point that a previous worker left off, so that work does not have to be repeated.
  • Since messages can be requeued and picked up by other workers when a process fails, we can give the illusion of running for days or weeks without interruption. In reality, a robust failover mechanism is actually in place to keep the ship sailing smoothly.

Promise all the things!

Gordon has said this already in a previous blog post, but it’s important enough to reiterate. Decide on a Promise library and learn to love it. While we started our project with Q promises, we ultimately settled on Bluebird for a number of reasons.1

To the point however, any asynchronous function we write is a promise returning function. We embraced promises for several reasons:

  • They help us write clean, readable code that feels almost synchronous in nature.
  • They make error handling very pleasant, which has the added bonus of making retry logic simpler.
  • They make it dead simple to fire off multiple tasks in parallel. If you have ever dealt with thread synchronization in Java you’ll love Promise.all() or Promise.map() with a concurrency argument.
  • Libraries such as Bluebird can easily promisify a single callback function, or an entire library of functions, allowing you to instantly invoke third-party libraries in a promisified manner (even though the library itself relies on callbacks).
  • They are still super easy to mock for unit testing purposes, especially via Sinon.

Test all the things!

Speaking of tests…it’s always a great idea to test your code. Tests are especially important when working with dynamically typed languages such as JavaScript. At Spanning, we have multiple layers of tests and related defenses. It starts with a linting tool, such as JSHint. We run JSHint as part of our grunt builds, and it runs in our continuous integration environment every time new code is pushed to a branch. Using a tool like JSHint allows you to avoid many of the gotchas associated with JavaScript. Additionally, JSHint rules can be put in place to enforce a common coding standard across the team.

Next, we write a significant number of unit tests. So much, that writing and updating tests can often take more time than writing the actual code that we need to accomplish something new. Our tests are extremely behavioral driven and one of the main components of our unit tests is the ability to mock functions via Sinon. If you’re coming from Java, think of Sinon as the JavaScript equivalent of Mockito. You can create stubs and spies, and get them to do all sorts of things.

We also heavily rely on the Chai assertion library for our behavioral driven testing. You can easily compare values and make assertions about conditions that should, or not should be.

Finally, if you’re using Bluebird, you will want the sinon-as-promised library, which will allow you to stub tests to fail when the code is executed, rather than when the file is parsed. Otherwise, Bluebird will complain about a possibly unhandled exception when the tests are parsed. Here is an example of how we might write unit tests:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
// 3rd party dependencies
var chai = require('chai'),
sinon = require('sinon'),
Promise = require('bluebird'),
// our own dependencies
SomeModule = require('someDirectory/SomeModule'),
// NOTE: can only be defined after chai is present
assert = chai.assert,
should = chai.should();

// override sinon with sinon-as-promised and make it return Bluebird promises
require('sinon-as-promised')(Promise);

describe('fake unit tests for foo() function', function() {
// setup some data and some stubs
var someValue = 42,
someFunctionStub, someOtherFunctionStub;

beforeEach(function(testDone) {
someFunctionStub = sinon.stub(SomeModule, 'someFunction');
someOtherFunctionStub = sinon.stub(SomeModule, 'someOtherFunction');
testDone();
});

afterEach(function(testDone) {
someFunctionStub.restore();
someOtherFunctionStub.restore();
testDone();
});

// test some random error condition
it('should reject with an error if argument is less than 5', function(testDone) {
// force the stub to reject
someFunctionStub.rejects(new Error('arg is less than 5'));

// just assume this will invoke someFunction() with the given argument and an array of [1, 2, 3]
SomeModule.foo(4)
.then(function() {
assert(false, 'should not have made it here with arg < 5');
})
.catch(function(err) {
var callArgs;

should.exist(err);
err.message.should.equal('arg is less than 5');

someFunctionStub.callCount.should.equal(1);
callArgs = someFunctionStub.getCall(0).args;
callArgs[0].should.equal(4);
callArgs[1].should.be.an('array').and.have.members([1, 2, 3]);

// the other stub should not be called due to the error
someOtherFunctionStub.callCount.should.equal(0);
})
.done(testDone, testDone);
});

// test that it behaves normally
it('should write to the database using someOtherFunction() with arg >= 5', function(testDone) {
// these stubs should return without error
someFunctionStub.returns(Promise.resolve());
someOtherFunctionStub.returns(Promise.resolve(someValue));

// just assume this will invoke someFunction() with the given argument and an array of [1, 2, 3]
SomeModule.foo(8)
.then(function(result) {
var callArgs;

// assume the result should be 8 for whatever reason
result.should.equal(8);

someFunctionStub.callCount.should.equal(1);
callArgs = someFunctionStub.getCall(0).args;
callArgs[0].should.equal(8);
callArgs[1].should.be.an('array').and.have.members([1, 2, 3]);

// the other stub should now be called
someOtherFunctionStub.callCount.should.equal(1);
callArgs = someOtherFunctionStub.getCall(0).args;
// test the rest of the arguments here...
})
.done(testDone, testDone);
});
});

Additionally, we have written a large number of integration tests. These can be especially useful to test the functionality of REST endpoints. Our typical flow is rather similar to the unit tests although instead of stubbing functions to return values or reject with errors, the before and after functions are typically responsible for populating database rows and subsequently removing them. Such tests are left as an exercise for the reader. :)

Learn to love recursion to learn to love recursion.

Let’s face it – your asynchronous code is going to fire off some task and then return immediately to its caller. Sure the callback function will be invoked once the asynchronous response is back, but there are many cases where additional processing is required. If you have nice, DRY code you’ll probably want to call yourself recursively to process that next chunk of data. Assuming a large amount of data needs to be processed, you certainly cannot process it all once. Perhaps you cannot store all of the data in memory or the database query for everything at once is too onerous. In such cases it is generally a good idea to retrieve data in small, well-indexed chunks. Furthermore, interactions with third-party APIs will likely force you to page through the data to help lower the burden on their servers.

Whatever the case may be, dealing with large data sets generally involves walking through things in a paginated fashion. Recursion and recursive helper functions are your friend. Write a function that sets up an initial state and then invokes the recursive helper function to do all of the heavy lifting. With promisified functions, you can simply and easily inspect the return value of a function and then decide to return the final result, or to recurse. An example might be paging through a list of database records, or walking a large number of files in S3. If you have promisified functions, this can be an effortless process. For example, to walk all records in a particular database table with a page size of 1000, you might do something like the following:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
var walkAllRecords = function(someValue) {
// we might be resuming work so query the DB for where we left off
return getProgessInfo(someValue)
.then(function(partialResult) {
return walkAllRecordsHelper(partialResult.numRecords, 1000, partialResult.lastOffset);
});
};

var walkAllRecordsHelper = function(numRecords, limit, offset) {
// default the values if they are falsy
numRecords = numRecords || 0;
offset = offset || 0;

// some promisified function that can get records from a database table
return getNextPageOfRecords(limit, offset)
.then(function(records) {
if (records.length === 0) {
// return the total number of records processed
return numRecords;
} else {
// do something with the records
return doSomething(records)
.then(function() {
// increment our values and recurse
numRecords += records.length;
offset += limit;
return walkAllRecordsHelper(numRecords, limit, offset);
});
}
});
};

You could also embed the helper function as a variable inside of your recursive function, but then you unfortunately cannot stub/spy it for unit tests. You do write unit tests and verify call arguments and such, right? If not, you somehow made it here without reading my previous section about testing. :)

Conclusion:

To summarize, my main suggestions are:

  • Build an architecture where concerns are separated amongst different classes of servers (different services), where each class/service is independently scalable. This is particularly important to keep your web servers responsive and not block the event loop on each one.
  • Allow your long-running processes to leave breadcrumbs as they run so that progress can be inspected and processing can be resumed by any other worker should a failure occur.
  • Wrap all asynchronous code with a solid promise library. Additionally, wrap calls to the promisified functions with retry logic where applicable.
  • Explore good, reusable patterns and then reuse them as much as possible. Recursive helper functions to deal with massive amount of data in a paginated fashion is a great example.

There are so many more suggestions and examples that could be provided. The above information works well for our purposes, but may not suit everyone. Hopefully there is some useful nugget contained here for how to scale and process large amounts of data in Node. Happy coding!



1 The Bluebird library continues to be maintained on a frequent basis, and also avoids using process.nextTick() in versions 0.12 and up (actually 0.11 and up but you shouldn’t be running an unstable branch of Node in production). You want setImmediate() instead, because process.nextTick() can show up in odd places and force your Node process to crash beyond the reaches of your code, where you can do something to handle it in a meaningful way. If you’re still running Node 0.10 you will unfortunately be susceptible to this error when using either Q or Bluebird (ask me how I know). If you cannot upgrade, my suggestion is to find patterns in the problematic areas of code, then look for ways to identify that the pattern is occurring. Finally, define a safe but reasonable threshold to be considered a maximum amount of work to perform. When the code detects that the threshold has been reached, force the code to halt to allow the stack to unwind.