Hashing it Right: Solving a Gradual Rollout Puzzle
The Role of Hashing in Unleash
At Unleash, we’re passionate about empowering developers with the tools they need to release features confidently and intelligently through feature flags. These flags aren’t just on-off switches; they’re nuanced controls that let our customers orchestrate who sees what and when, using gradual rollouts and variants. For instance, you can introduce a new feature but limit its visibility to just 10% of your users, gauging reactions and performance before a full-scale launch.
But how do you ensure that out of the thousands, or even millions, of users, a specific 10% consistently sees your new feature across different devices and sessions? That’s where hashing—specifically, Murmur hashing—becomes the hero of our story. It’s an algorithm that takes a user identifier and converts it into a deterministic hash value. This value then decides which variant bucket the user falls into, ensuring they get the same experience on a mobile app or a web browser. Murmur hashing was our algorithm of choice because it’s renowned for its speed and consistency, the twin pillars required for delivering seamless and stable user experiences.
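To make the idea concrete, here is a minimal Python sketch of deterministic bucketing. It contains a pure-Python MurmurHash3 (x86, 32-bit) and maps a user to one of 100 buckets. The `group_id:user_id` key format, the `% 100 + 1` normalization, and the helper names `bucket` and `is_enabled` are illustrative assumptions, not a description of the exact SDK code.

```python
def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & 0xFFFFFFFF


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 (x86, 32-bit) of data, returned as an unsigned int."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & 0xFFFFFFFF
    length = len(data)
    body_end = length & ~3  # largest multiple of 4 that fits

    # Mix the input four bytes at a time (little-endian words).
    for i in range(0, body_end, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = _rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF

    # Mix the remaining 1 to 3 bytes, if any.
    tail = data[body_end:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & 0xFFFFFFFF
        k = _rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k

    # Finalization: force all bits of the state to avalanche.
    h ^= length
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h


def bucket(user_id: str, group_id: str, buckets: int = 100, seed: int = 0) -> int:
    """Map a user to a stable bucket in 1..buckets (hypothetical helper)."""
    key = f"{group_id}:{user_id}".encode("utf-8")
    return murmur3_32(key, seed) % buckets + 1


def is_enabled(user_id: str, group_id: str, rollout_percent: int) -> bool:
    """A user is in the rollout if their bucket is within the percentage."""
    return bucket(user_id, group_id) <= rollout_percent


# The same identifier always lands in the same bucket, on any device.
assert bucket("user-42", "new-checkout") == bucket("user-42", "new-checkout")
```

Because the bucket depends only on the key and the seed, no per-user state has to be stored to keep the experience consistent across sessions.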
For Unleash, Murmur hashing’s performance was a game-changer, offering the robustness our feature flagging service required. It enabled us to promise and deliver consistency to our customers, who, in turn, could provide reliable experiences to their users. However, as with any technology, real-world applications can reveal quirks that lead to unexpected adventures, and our journey with hashing was no exception.
Hash Function Discrepancy
Our hashing implementation was working smoothly until we encountered an unexpected discrepancy: the user distribution across feature flag variants did not align with our expected percentages. This discrepancy prompted a thorough investigation to pinpoint the cause.
When we started a gradual rollout at 10%, the distribution of users deviated from the expected figures by as much as 10-30%, even in tests with sample sizes as large as one billion. The irregularity appeared only during partial rollouts, that is, when the rollout percentage was below 100%.
Gradual rollouts can be configured in 1% steps, which means users are effectively distributed into 100 buckets, with the hash determining each user’s bucket. For a 10% rollout, users in the first 10 buckets are included and the remaining 90% are excluded. The same hash function was then applied again to this 10% subset to assign users to specific variants, and this is where we detected significant disparities in the distribution.
We spent a long time debugging this. We suspected the Murmur implementation, the 32-bit variant of the algorithm, and inputs that were too long for the hash function, but none of these was the cause. The real reason was that we used the same seed for both hashing steps. The first pass, which filtered users into the rollout, only admitted users whose hashes fell into the first few buckets, so the surviving set was no longer a uniform sample of hash values. Because the second pass used the same function with the same seed, its output was not independent of the first, and it compounded the bias instead of spreading users evenly. The result was a skewed variant distribution that favored certain buckets.
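The bias described above can be seen with plain arithmetic, without any hash function. The sketch below assumes, purely for illustration, that the rollout step reduces the hash modulo 100 and the variant step reduces the same hash modulo 1000 across three roughly equal variants; these moduli, weights, and names are hypothetical and are not taken from the SDKs. Since both steps see the same hash value, passing the 10% gate restricts which residues the variant step can ever receive.

```python
from collections import Counter

ROLLOUT_BUCKETS = 100    # assumed: 1% granularity for the rollout gate
VARIANT_BUCKETS = 1000   # assumed: finer granularity for variant weights
ROLLOUT_PERCENT = 10

# Hypothetical variant split, as cumulative upper bounds over 1..1000.
VARIANTS = [("A", 334), ("B", 667), ("C", 1000)]


def variant_for(hash_value: int) -> str:
    """Pick the first variant whose cumulative bound covers the bucket."""
    b = hash_value % VARIANT_BUCKETS + 1
    for name, upper in VARIANTS:
        if b <= upper:
            return name
    raise AssertionError("bounds must cover every bucket")


# Each residue modulo 1000 stands in for an equally likely hash value.
admitted = [
    h for h in range(VARIANT_BUCKETS)
    if h % ROLLOUT_BUCKETS + 1 <= ROLLOUT_PERCENT
]

counts = Counter(variant_for(h) for h in admitted)
shares = {name: counts[name] / len(admitted) for name, _ in VARIANTS}
print(shares)  # {'A': 0.4, 'B': 0.3, 'C': 0.3}
```

Under these assumed numbers, the admitted users can only produce variant buckets 1-10, 101-110, 201-210, and so on. Four of those ten clusters fall inside variant A, so A receives 40% of the admitted users instead of roughly a third, while B and C receive 30% each.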
Implementing the Fix: A New Approach to Seeding
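The solution was to use a distinct seed for each of the two hashing operations, so that the variant assignment no longer depends on the value that admitted the user to the rollout. Once we switched to different seeds, users were distributed evenly and consistently across all buckets.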
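A small simulation can illustrate the effect of separating the seeds. This is a sketch under stated assumptions: it uses BLAKE2b from Python's standard library with the seed passed as a salt, as a stand-in for seeded Murmur hashing, and the seed values, the flag name `my-flag`, the moduli, and the variant bounds are all hypothetical.

```python
import hashlib
from collections import Counter

ROLLOUT_SEED = 0   # hypothetical seed values, chosen only for this sketch
VARIANT_SEED = 1


def seeded_hash(key: str, seed: int) -> int:
    """32-bit seeded hash; BLAKE2b stands in for seeded Murmur here."""
    salt = seed.to_bytes(4, "little")
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=4, salt=salt).digest()
    return int.from_bytes(digest, "little")


def variant_shares(users, rollout_percent: int, variant_seed: int) -> dict:
    """Gate users by rollout bucket, then split admitted users into variants."""
    counts = Counter()
    for user in users:
        key = f"my-flag:{user}"
        if seeded_hash(key, ROLLOUT_SEED) % 100 + 1 > rollout_percent:
            continue  # user is outside the rollout
        b = seeded_hash(key, variant_seed) % 1000 + 1
        counts["A" if b <= 334 else "B" if b <= 667 else "C"] += 1
    total = sum(counts.values())
    return {name: round(counts[name] / total, 3) for name in "ABC"}


users = [f"user-{i}" for i in range(100_000)]

# Same seed for both steps: variant buckets are correlated with the gate.
same = variant_shares(users, 10, ROLLOUT_SEED)
# Distinct seed for the variant step: the two steps are decoupled.
distinct = variant_shares(users, 10, VARIANT_SEED)

print("same seed:    ", same)
print("distinct seed:", distinct)
```

With the shared seed, the split should come out near 40/30/30, matching the skew described earlier. With a separate seed for the variant step, each variant should receive close to one third of the admitted users.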
However, introducing new seeds brought about its own set of challenges. The hash values generated by the new seed did not align with those from the previous one, which meant that end customers might be reassigned to new buckets, disrupting their existing experience. This necessitated a major update across all SDKs.
Despite the inconvenience, this solution was vital for the integrity of our system. We resolved the issue within a week. While users may encounter some changes, we are confident that our system is now more robust and will provide a better overall experience moving forward.