Infrastructure 2.0 [updated 2023]
Note: We’ve made a lot of progress since Infrastructure 2.0. Read up on our latest infrastructure improvements in our Infrastructure 3.0 blog.
As Unleash grows our hosted offering, we’ve seen the need to automate what we have been doing with carefully crafted AWS scripts and manual effort. This blog post goes through our process and learnings in how we handle our infrastructure.
In April 2022, four long months of effort culminated in turning off our last EC2 instance. The biggest benefit of the new setup is that it lets us scale each customer individually based on their needs and requirements. This blog post will not go into detail about any specific customer, but rather focuses on the general setup.
We had outgrown our initial hosting setup, which was running 10x the number of Unleash instances it was designed for.
This made it clear that we needed to change how we were doing deployments: we could no longer take the risk of our EC2 instances crashing or the owner of our scripts being unavailable.
Warning: The rest of this article is technical in nature. TL;DR: we’ve migrated from manually controlled instances to a Kubernetes setup.
What we have done
We have moved from manually controlling EC2 instances with carefully scripted shell scripts and AWS CodeDeploy to deploying with https://www.pulumi.com. We use Pulumi to manage two EKS clusters (one in each region where we offer hosting), with the added bonus that configuring another EKS cluster in a different region is simply a matter of adding a new Pulumi configuration. For each push to the main branch of our IaC repositories, we call pulumi up, telling Pulumi to synchronise and update our infrastructure to match the state requested in our configuration files.
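The push-to-main flow can be sketched as a minimal GitHub Actions workflow. The stack name, action version, and secret name below are illustrative, not our actual configuration:

```yaml
# Hypothetical workflow: run `pulumi up` on every push to main.
name: deploy-infrastructure
on:
  push:
    branches: [main]
jobs:
  pulumi-up:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pulumi/actions@v5
        with:
          command: up
          stack-name: eu-cluster   # illustrative stack name
        env:
          PULUMI_ACCESS_TOKEN: ${{ secrets.PULUMI_ACCESS_TOKEN }}
```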
Currently our regions run on t4g.xlarge nodes. We chose AWS Graviton2 nodes for their performance/price ratio: roughly twice the performance at half the price per node. This allows us to scale the number of nodes in each cluster without our already deployed customers noticing anything.
All our application deployment is done using Pulumi Helm releases, so defining a new customer is a matter of adding a list element to one of our Pulumi stack config files. We’ve also configured GitHub Actions to only run the stack that is currently being changed, so new customers will not have to wait for the rest of the stacks to synchronise their state.
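To make the “one list element per customer” idea concrete, here is a sketch of how such a list could be turned into Helm release inputs. The interfaces, chart name, and fields are illustrative assumptions, not Unleash’s actual schema:

```typescript
// Sketch: mapping customer entries from a stack config into Helm
// release arguments. All names and fields here are hypothetical.
interface Customer {
  name: string;
  replicas?: number;
}

interface HelmReleaseArgs {
  chart: string;
  name: string;
  namespace: string;
  values: Record<string, unknown>;
}

function toHelmRelease(customer: Customer): HelmReleaseArgs {
  return {
    chart: "unleash", // hypothetical chart name
    name: `unleash-${customer.name}`,
    // one namespace per customer enables per-tenant access control
    namespace: customer.name,
    values: { replicaCount: customer.replicas ?? 2 },
  };
}

// Adding a customer is just one more list element:
const customers: Customer[] = [
  { name: "customer-a" },
  { name: "customer-b", replicas: 3 },
];
const releases = customers.map(toHelmRelease);
```

In the real setup the resulting arguments would feed Pulumi’s Helm release resource; the point is that onboarding is pure configuration, not bespoke scripting.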
[Note: We’ve since moved from Prometheus to VictoriaMetrics. Read more in our Infrastructure 3.0 blog]
All our unleash server instances expose a metrics endpoint that Prometheus can scrape. Each Kubernetes cluster is running a Prometheus-server that scrapes all these endpoints to give us the same metrics for each instance on a per cluster basis. In addition we’ve set up a federated Prometheus service that scrapes all Kubernetes clusters to allow us to query a single data source in Grafana for building dashboards that display our health and statistics across all clusters.
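A federated setup like the one described can be wired up with a scrape job against each cluster-local Prometheus’s /federate endpoint. The hostnames and match expression below are hypothetical:

```yaml
# Sketch of a federation scrape job on the central Prometheus;
# target hostnames and the job label are illustrative.
scrape_configs:
  - job_name: "federate"
    honor_labels: true
    metrics_path: "/federate"
    params:
      "match[]":
        - '{job="unleash"}'
    static_configs:
      - targets:
          - "prometheus.cluster-eu.example.com"
          - "prometheus.cluster-us.example.com"
```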
[Note: We no longer send logs directly to CloudWatch. Instead, we log to standard output. Read more in our Infrastructure 3.0 blog]
For logging we use Amazon CloudWatch Logs, which allows us to easily query and filter our logs across all customer instances. To facilitate the integration inside Unleash, we settled on a logging framework called winston, which also has an adapter to send the logs directly to CloudWatch.
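As a purely illustrative sketch (not winston’s actual API), this is the kind of structured JSON log line such a setup emits, which is what makes the CloudWatch queries and filters useful:

```typescript
// Sketch of a structured JSON log line, similar in shape to what a
// winston JSON transport produces. Illustrative only; the real setup
// uses winston with a CloudWatch adapter.
function formatLogEntry(
  level: "info" | "warn" | "error",
  message: string,
  meta: Record<string, unknown> = {}
): string {
  return JSON.stringify({
    level,
    message,
    timestamp: new Date().toISOString(),
    ...meta, // e.g. a customer identifier for per-instance filtering
  });
}

const line = formatLogEntry("info", "instance started", {
  customer: "customer-a", // hypothetical field name
});
```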
Multiple AZ / Failover
Our clusters are set up with a minimum availability tolerance of one instance for all customer deployments. This means that every customer should always have at least one deployment available, allowing cluster scaling to happen without downtime for our customers. In addition, we’ve spread our nodes across three AZs in each of the regions where we have clusters, so even if one AZ goes down, we will still be available.
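The “at least one instance available” guarantee maps naturally onto a Kubernetes PodDisruptionBudget; a sketch with illustrative names:

```yaml
# Sketch of a per-customer PodDisruptionBudget that keeps at least one
# pod running during voluntary disruptions such as node drains.
# Names and labels are illustrative.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: unleash-customer-a
  namespace: customer-a
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: unleash
```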
What not to do with Pulumi
An early attempt configured all instances in the same stack. This caused a full deploy run to take more than 20 minutes, or to time out because AWS throttled the number of calls we could make towards EKS to synchronise state. So, if you’re using Helm releases to manage your resources, don’t put them all in the same stack.
What to do instead
In our current iteration we have one configuration file (called a stack) per group of Unleash instances that share the same RDS cluster.
This allows us to make changes to any instance in 2-3 minutes, without needing to know which AWS commands to run; we only tweak the settings for the instance we wish to change.
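A stack config file along those lines could look like the following; the file name, project name, and customer fields are hypothetical:

```yaml
# Pulumi.rds-cluster-a.yaml — sketch of a stack grouping the Unleash
# instances that share one RDS cluster. All names are illustrative.
config:
  unleash-hosting:customers:
    - name: customer-a
      plan: pro
    - name: customer-b
      plan: enterprise
```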
Benefits of current setup
Since we’re running each customer in separate namespaces in Kubernetes, we have set up strict access controls with unique service users per customer. These service users only have access to resources within their namespace, guaranteeing that customer A cannot read data for customer B.
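As a sketch of that access-control idea (names are illustrative), a RoleBinding scoped to the customer’s namespace ties the per-customer service account to a namespace-local Role, so it cannot touch anything outside its namespace:

```yaml
# Sketch: bind customer-a's service account to a Role that only exists
# in the customer-a namespace. Names are illustrative.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: customer-a-access
  namespace: customer-a
subjects:
  - kind: ServiceAccount
    name: customer-a
    namespace: customer-a
roleRef:
  kind: Role
  name: namespace-admin   # a Role defined within the customer-a namespace
  apiGroup: rbac.authorization.k8s.io
```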
Running each customer as a single tenant allows us to scale each customer separately, as well as limit the impact of a single misconfigured customer to that customer’s own clients.
Add more regions
Thanks to this setup, we can now establish Unleash hosting in a new region by adding another config file and running Pulumi to bring up the configured infrastructure.
Node usage optimization
Currently we’re running custom node groups managed through Pulumi. In the future we would like to switch to managed node groups, letting AWS keep us up to date with security patches and configuration parameters for each node. Today this is handled by Pulumi and our CloudFormation templates when we update our stacks; we’d love for this to be something we didn’t have to think about.
For even smarter resource usage we’ve also started looking at Karpenter for autoscaling. Karpenter starts single nodes instead of node groups, which should make it even faster to scale the cluster up or down.
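As a sketch of what that could look like (values are illustrative; this uses Karpenter’s v1alpha5 Provisioner API, current at the time of writing), a provisioner can be constrained to the same Graviton instance type we already run:

```yaml
# Sketch of a Karpenter Provisioner launching individual arm64 nodes
# on demand. All values are illustrative.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: kubernetes.io/arch
      operator: In
      values: ["arm64"]
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["t4g.xlarge"]
  limits:
    resources:
      cpu: "64"          # cap total provisioned CPU
  ttlSecondsAfterEmpty: 30   # scale empty nodes down quickly
```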
We’re currently deploying our apps using Helm through Pulumi. As stated in the “What not to do with Pulumi” section, this has a couple of weaknesses; among them, the more releases you manage through a single stack, the longer the synchronisation step Pulumi needs to work out which releases have changed. We’ve already seen that a deployment can take 2-3 minutes; ideally we’d like to get this below a minute. We’re looking at ways to optimise this further.