From the community: How my team uses Unleash Edge to handle high traffic loads
Hello again! If you haven’t gotten a chance yet, check out my previous post on how my team at Codeium uses Unleash to iterate on our AI code assistant. This is part two.
In this article I’ll cover a scaling problem we ran into due to the nature of our setup, and how we used Unleash Edge to resolve it.
As a brief recap from the previous article: my team at Codeium utilizes Unleash in both our cloud API server and within the IDE extensions running on our users’ machines.
What was the problem?
One day our system went down. All requests to the API server were timing out which meant our users weren’t seeing any code completions.
After a brief investigation we were able to identify what happened: we found an overload on our Postgres instance. This instance contains databases for our system’s telemetry, as well as Unleash-generated metrics.
We then analyzed the telemetry for each database. We found that a buggy change, introduced in the most recent release, caused our system to write far more frequently than intended. A quick revert brought our service back to a healthy state.
Still, we had cause for concern. We saw that the number of transactions per second from Unleash was massive: more than 350 transactions per second!
This put us dangerously close to overloading our Postgres instance. The buggy change pushed us over that limit.
This load alone was worrying. A bigger concern was that it seemed to be growing over time.
What was the cause?
The situation wasn’t great, so I reached out on the Unleash Community Slack for help to understand what could be going wrong and how I could fix it.
After discussing back and forth with Unleash team members, we identified a clear red flag. Our Unleash dashboard showed tens of thousands of Unleash instances registered.
This was a direct result of the distributed nature of our system combined with an unexpected usage of the Unleash SDKs.
In most scenarios, Unleash code running on user machines would utilize the Client SDKs for platforms like iOS, Android, and React. However, our situation is unique, in that we run Node-based extensions and Go binaries on our users’ machines.
These environments are usually reserved for server-side software. Because of this, we had opted for the Server SDKs without understanding the implications.
Ultimately this meant that:
- the number of SDK instances scaled with our users,
- all of the SDK instances hit our Unleash server directly, and
- metrics from each SDK instance were being written directly through to our Postgres instance.
The end result was a huge number of transactions per second.
How did Unleash Edge solve the problem?
Even after understanding the core problem, we were still left in somewhat of a bind. We couldn’t use the Client SDKs because none of them supported Node or Go. Had we been able to, their metrics would have been aggregated by an Unleash proxy before being passed to the main Unleash server.
We considered enabling our clients to fetch experiments from an intermediate server in our Cloud. The idea would be to reduce the number of Server SDK instances to one.
However, this was also infeasible. We couldn’t accept the additional round-trip latency from our users’ machines to our Cloud, then back. Our product depends on us quickly providing AI-powered code completions on each keystroke.
The Unleash team suggested using Unleash Edge.
Unleash Edge sits between the SDK instances and the main Unleash server, and provides a cached read-replica.
It is highly performant, being able to handle tens to hundreds of thousands of requests per second. Most importantly for us, it batches metric writes from all the SDK instances connected to it.
This limits the number of requests hitting the main Unleash server. At the same time, it transitively limits the number of transactions in the Postgres database.
In short, Edge promised a solution to all of the problems we were seeing.
How did we deploy Unleash Edge?
Unleash Edge was extremely easy to deploy using the Unleash Edge Helm chart.
The only configuration we needed was to set the upstream Unleash server URL. We set the URL to the Kubernetes-internal address of the Unleash server since they would both run in the same cluster.
Best of all, Unleash Edge exposes the same API as the Unleash server. This meant that we could make the transition by simply changing the DNS entry we had for the Unleash server to point to Unleash Edge.
We didn’t even have to roll out an update to our users!
After making these changes we saw an immediate effect.
The number of transactions per second in the Unleash Postgres database decreased by more than 100x, completely nullifying any concerns we had.
Edge only introduced a few “downsides”:
- It added a layer of complexity to our Unleash setup, though honestly it’s not really that complex.
- It also added propagation latency for experiment toggle updates. Unleash Edge periodically fetches updates from the Unleash server. Unleash SDK instances then fetch the update from Unleash Edge.
We felt the propagation latency was a worthwhile trade-off, and also within our control. We could simply shorten the refresh interval for both the SDK instances and Unleash Edge to make up the difference.
In closing, Edge is fantastic. I recommend other developers using Unleash to try out Unleash Edge just for its scaling advantages. You can read about all Edge can do in Unleash’s docs.