Post mortem: A bug from the past
All developers make mistakes from time to time. Finding them and debugging them can be a real challenge, especially when “it works on my machine”. I am sure that most developers can relate to this!
A while back, a valued Unleash open source user reported issues with their test environments after upgrading from Unleash v3.17.6 to v4.4.4. This was not a planned upgrade, but an accident by one of their engineers. After the upgrade, Unleash simply would not start. Instead, it would produce mysterious errors in the logs at startup before immediately crashing. Bummer!
(Even though you should plan an upgrade of Unleash from v3 to v4 because of a few breaking changes highlighted in our upgrade guide, there should not be anything technically blocking this upgrade.)
The debugging
The Unleash team immediately started debugging the scenario.
Step 1: We started by installing Unleash v3.17.6 and then upgrading it to Unleash v4.4.4. Unleash upgraded and started successfully in our sandbox environment.
We reached out to the user and asked for some more information. What is Unleash complaining about? What does the error message say?
The user was kind enough to share the error message and some of their database table structures with us. It was clear that the database was not in the state we expected. Among other things, one of the global roles was missing, and the remaining roles had incomplete role descriptions. It almost looked like the user had manually changed the data in the database, which of course they had not.
Step 2: Validate the releases on GitHub. Every version of Unleash gets a unique git tag. We checked the source code at every Unleash version tag, but none of them included an incomplete database migration that would result in the state above.
Time for a coffee break!
Step 3: The source code is not the actual release!
Can we be sure that the version tag told us the whole story on what was released? Let’s see what the tarball on npm actually contains, shall we? We created a script that downloaded all versions of Unleash and extracted the contents into separate folders. Having all the releases at hand, we started searching for the incomplete “role descriptions”.
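The search step can be sketched roughly as follows. This is not the script we actually used, only a minimal illustration: it assumes the tarballs have already been downloaded and extracted into one sub-folder per version, and the function names `listFiles` and `findReleasesContaining` are made up for this example.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Recursively collect every regular file below `dir`.
function listFiles(dir: string): string[] {
  const files: string[] = [];
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      files.push(...listFiles(full));
    } else if (entry.isFile()) {
      files.push(full);
    }
  }
  return files;
}

// Return the names of the version folders under `root` that contain
// `needle` in at least one file. Assumes one sub-folder per release.
function findReleasesContaining(root: string, needle: string): string[] {
  return fs
    .readdirSync(root, { withFileTypes: true })
    .filter((entry) => entry.isDirectory())
    .map((entry) => entry.name)
    .filter((version) =>
      listFiles(path.join(root, version)).some((file) =>
        fs.readFileSync(file, "utf8").includes(needle),
      ),
    )
    .sort();
}
```

Calling `findReleasesContaining("./releases", "some role description")` would then list the versions whose published contents include that text, regardless of what the git tag says.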
And there they were! Version 3.13.0 of Unleash did indeed have an incomplete database migration file that should not have been there. It was a first draft of a new role concept, not even close to being ready to be shipped. It was part of the new RBAC concept we had been playing with for a while, scheduled to be part of Unleash v4.
Unleash uses a tool called db-migrate for database migrations. It’s a simple tool that uses a file-based approach. You must give each migration a unique name with a timestamp prefix for sortability (just check out the unleash source code). Db-migrate makes sure that migrations are executed in order and that a single migration is only executed once.
So why was this incomplete database migration causing problems? The migration was included as part of Unleash v3.13.0 by mistake. As long as the user stayed on Unleash v3.x, everything would continue to work just fine, because that data wasn’t used for anything. When the user upgraded to Unleash v4, however, the incomplete database migration file started causing issues. Db-migrate will not re-apply a migration that has already been executed. It uses a special migrations table to keep track of this. Because we used the same migration file (with the same name!) for the complete migration needed to upgrade to Unleash v4, db-migrate would never run the complete migration if you’d previously run the incomplete one. This could only happen if you had run Unleash v3.13.0 at some point.
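To make the failure mode concrete, here is a small in-memory model of a name-based migration runner. It does not use db-migrate's API; the `Migration` type, `runMigrations`, and the migration name are all hypothetical stand-ins that only mimic the "run in order, run each name once" behaviour described above.

```typescript
// In this model a migration is identified purely by its file name.
interface Migration {
  name: string;
  up: (db: Set<string>) => void;
}

// Apply migrations in name order, skipping names that are already recorded
// in `applied` (the stand-in for the migrations tracking table).
function runMigrations(
  migrations: Migration[],
  applied: Set<string>,
  db: Set<string>,
): string[] {
  const ran: string[] = [];
  const ordered = [...migrations].sort((a, b) =>
    a.name < b.name ? -1 : a.name > b.name ? 1 : 0,
  );
  for (const migration of ordered) {
    if (applied.has(migration.name)) continue; // already executed: never re-run
    migration.up(db);
    applied.add(migration.name);
    ran.push(migration.name);
  }
  return ran;
}

// Hypothetical name shared by the draft and the finished migration.
const NAME = "20210101000000-add-roles";
const draft: Migration = { name: NAME, up: (db) => db.add("roles (draft)") };
const complete: Migration = { name: NAME, up: (db) => db.add("roles (complete)") };

const applied = new Set<string>();
const db = new Set<string>();

// The release that accidentally shipped the draft runs it once.
const firstRun = runMigrations([draft], applied, db);
// A later release ships the finished migration under the same name.
const secondRun = runMigrations([complete], applied, db);
```

After both runs, `secondRun` is empty and `db` still holds only the draft state: the runner sees a name it already knows and skips the finished migration entirely.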
How many were impacted?
It’s a bit hard to say, but it took almost a year from releasing Unleash v3.13.0 until someone reported this issue. This version was also the latest version for only 3 days before it was replaced by v3.14.0. So it’s safe to assume that not many long-term Unleash users upgraded to this version before it was superseded.
Why did this happen?
How could we release something to npm that was not exactly what we had in our source code on GitHub for the same release tag? To understand that, we need to understand what actually happens when we create a new version of Unleash: because we use TypeScript, we have to compile our source code into JavaScript files, which then get bundled together as the Unleash distribution and uploaded to npm. During this process, the TypeScript compiler places the resulting files in a folder called `dist`.
When creating the v3.13.0 release of Unleash, we had only just adopted TypeScript. One of the caveats of TypeScript is that it only writes (and overwrites) the files that it actually processes. As such, if you happen to have other files in your `dist` folder, TypeScript will leave them alone. This means that if you compile your application from one branch and then switch to a different branch and compile again, you might end up with more files in your `dist` folder than you’d expect.
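The effect is easy to reproduce with a toy model. The sketch below does not invoke the TypeScript compiler; `compileInto` is a made-up stand-in for any build step that writes its outputs without deleting what is already in the output folder, and the file names are invented.

```typescript
// A folder is modelled as a map from file name to file content.
type Files = Record<string, string>;

// Write one output per source file. Existing outputs with the same name are
// overwritten; everything else already in `dist` is left untouched.
function compileInto(dist: Files, sources: Files): void {
  for (const name of Object.keys(sources)) {
    dist[name.replace(/\.ts$/, ".js")] = sources[name];
  }
}

const dist: Files = {};

// Hypothetical branch contents.
const featureBranch: Files = {
  "app.ts": "app (feature branch)",
  "migrations/draft-roles.ts": "draft migration",
};
const releaseBranch: Files = {
  "app.ts": "app (release branch)",
};

compileInto(dist, featureBranch); // build while on the feature branch
compileInto(dist, releaseBranch); // switch branch, build again
const shipped = Object.keys(dist).sort();
```

Here `shipped` contains both `app.js` and `migrations/draft-roles.js`, even though the release branch has no such migration: the second build refreshed `app.js` but never removed the leftover file.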
We were only three developers back then, so we built that release by running the release process on our local machines. This made it very easy to accidentally include code from a different local branch by mistake.
How did we fix the broken migration?
We created a new migration (commit) that would detect the inconsistency and correct the database if it was in this incorrect state. We have included it in the latest v4 version of Unleash and also created patch releases for all the previous minor versions. This means that it should be safe to upgrade to the latest Unleash v4 version, even if you were so unlucky as to encounter the v3.13 bug.
How do we prevent this from happening again?
One of the things we care about in the Unleash team is using every incident as an opportunity to learn and improve. How can we use this incident to improve our process? Starting from v3.14, the release script empties the `dist` folder as part of the build process. This at least guarantees that we are releasing only what we just built from the current version, and nothing else.
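One possible way to do such a clean-up step in Node.js is shown below. This is an illustrative sketch, not the actual Unleash release script, and `cleanDist` is a name chosen for this example.

```typescript
import * as fs from "node:fs";

// Remove the output folder and everything in it, then recreate it empty.
// `force: true` makes this a no-op if the folder does not exist yet.
function cleanDist(distDir: string): void {
  fs.rmSync(distDir, { recursive: true, force: true });
  fs.mkdirSync(distDir, { recursive: true });
}
```

Running `cleanDist("dist")` right before the compile step means every file in the folder afterwards must have come from the build that just ran.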
In addition to that, we have recently made an effort to no longer publish the npm packages from our developer machines. Instead, we now offload that responsibility to a GitHub action (source code). The only thing we have to do to trigger a new release is to tag our source code with a new version tag. The action will pick up that a new tag was created and publish a new version for us. It can even tell whether it’s a pre-release version or not, and tags the npm artifact correctly. This both reduces the chances that some local inconsistencies on our developer machines end up in the final release, and makes it easy for any new developer on the team to create a new release. A win for everyone!
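The pre-release detection can be illustrated with a few lines of TypeScript. This is a simplified sketch rather than the code in our GitHub action: the function names are hypothetical, the version check is a rough approximation of the semver rules, and using `beta` as the npm dist-tag for pre-releases is an assumption made for the example.

```typescript
// Strip an optional leading "v" from a git tag such as "v4.4.4".
function versionFromTag(tag: string): string {
  return tag.startsWith("v") ? tag.slice(1) : tag;
}

// In semver, a pre-release is marked by a hyphen directly after
// MAJOR.MINOR.PATCH, for example "4.5.0-beta.1".
function isPrerelease(version: string): boolean {
  const withoutBuild = version.split("+")[0]; // ignore build metadata
  return /^\d+\.\d+\.\d+-.+$/.test(withoutBuild);
}

// Pick the npm dist-tag to publish under ("beta" is an assumed name).
function npmDistTag(tag: string): string {
  return isPrerelease(versionFromTag(tag)) ? "beta" : "latest";
}
```

With this, a tag like `v4.4.4` would be published as `latest`, while `v4.5.0-beta.1` would be published under the pre-release dist-tag and would not be picked up by a plain `npm install`.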