JSFeeds: blog.npmjs.org - npm private modules outage on December 12th

Saturday, 30 December, 2017 UTC

npm private modules outage on December 12th

Summary

For a period of about 100 minutes on 12 Dec 2017, all read and write access to private packages was interrupted. For some days after that, our customer data was not in perfect sync with the source of truth—Stripe, our payments provider—and some customers renewed their subscriptions when they did not need to.

We believe we have now fully reconciled our customer database with Stripe’s data, and have refunded all renewals that were made during the incident. If we are in error about your account, please contact our Support team who will sort it out for you.

The underlying cause of this outage was a software bug that deleted all records from our customer database. Read on for more details about the root cause and a timeline of the incident.

Root cause

The root cause was a bug in the query builder software we use to interface with postgres, ormnomnom. Due to a bug in ormnomnom’s “delete” method, it omitted the WHERE clause for the query built by the list of filters. This would result in removing all records from the database table where a single record was intended to be deleted. The fix was a one-liner.

In this case, one endpoint in the microservice fronting our payments database used a DELETE query. This endpoint is exercised only by an internal tool that our support team uses to resolve customer payments problems and requests to terminate service. After the bug was deployed to production, it was triggered the first time the tool was used in this way. A DELETE with no WHERE clause was executed and all records were deleted from the targeted table.

Code audits revealed that this one API endpoint was the only place in our system where we use a straightforward DELETE, and the internal support tool is the only consumer of that endpoint. In all of our other databases, we do not ever delete rows from tables. Instead, we use “tombstoning”: setting a deleted column to a timestamp that indicates when a record was marked as deleted. In this one case, removing the data entirely is more appropriate because it is reflecting a ground truth stored in an external database. The support tool use was the only thing that could have triggered this bug, and only for that one category of data.

Timeline

All times given here are in PST, aka GMT–8.

11 Dec 2017 5:00 PM: The microservice that fronts our payments database is deployed with an upgrade to the backing framework, including an upgrade to our SQL query builder.

12 Dec 2017 1:34 PM: All customer records are deleted from our payments database. Uncached access to private modules begins failing.

2:16 PM: Our support team escalates a Twitter complaint & npm engineering is alerted.

2:32 PM: We roll back the deploy to the relevant microservice on the theory that recent changes are the most likely culprit. We observe that this does not restore service.

2:44 PM: We discover that our payments database has far fewer records in it than it should. We activate our serious incident response protocol, which formalizes communication and roles to aid internal coordination during outages.

3:07 PM: We deploy a mitigation to our payments service that treats all customer accounts as being in good standing, without checking the database. We observe that this does not restore access.

3:19 PM: We clear the access cache entirely and access to private packages is restored for all customers.

3:32 PM: We restore the database from a backup and note that all customer events from the time of the backup snapshot on will need to be replayed. We decide to leave the mitigation in place overnight to give us time to do that work in a methodical way. Meanwhile, root cause exploration continues.

3:36 PM: We find the culprit DELETE statement in the database logs. We isolate where in service code this could have been executed and confirm that a support tool invocation was the trigger, given the timestamp.

4:48 PM: The root cause is found and fixed in the query builder code.

13 Dec, through the day: We reconcile data in our behind-reality customer database with data from Stripe, our payments provider. When this is complete, we undo the mitigation that short-circuited payment checks.

27 Dec: Some customers are locked out of their organizations or private modules accounts because the 15-day grace period has expired. This happens to a handful of customers we missed in our original cleanup sweep. We identify those customers and replay the Stripe events to restore their accounts to working order.

Steps we’re taking in response

The root cause bug did not turn up in development because a new feature in ormnomnom was not completely tested. The developer wrote only cursory tests because the test suite for the library is awkward to use.

We kicked off a project to rewrite the tests for this library so that new tests are very easy to write, and will therefore be written. In our opinion, the time invested in making testing easy is time well spent. Software bugs are inevitable and testing is how we catch them early. This is the most important step we can take in response to the incident. We can measure our progress in this work by measuring test coverage of our query builder library.

None of our monitoring alerted us that anything was wrong, because our systems and software were operating correctly. Our access cache did alert briefly on CPU use, but this was not enough to make us aware that something was wrong. We were alerted by a sudden and unusually high volume of support requests and Twitter complaints. This is probably an inappropriate target for monitoring. What we can monitor, however, are unusually high rates of certain status codes being reported by our internal services. In this case, the rate of 402 Payment Required responses was an unusually high percentage of all auth checks (indeed, 100%), and could have triggered an automated alert. We will be adding such an alert to our monitoring system.

Our database logging was set to log slow queries only. Fortunately the delete query in question was quite slow, but it was only luck that we had a log line to pinpoint the time the error occurred. We will be turning on more complete logging for all data-mutating queries on databases like this one.

We were pleased with the behavior of postgres’s logical replication and will be migrating the systems we have that still use repmgr to the newer system.

An apology

We apologize to all of npm’s customers for this outage. This downtime affected all of our paying customers during peak work hours in the United States. We prefer to be an invisible and reliable part of your infrastructure, but on 12 Dec we were a source of frustration instead. We’re sorry.

... more @ blog.npmjs.org

blog.npmjs.org