Using a mix of normal and preemptible nodes, we were able to save 44% on our Google Cloud bill.

In a busy week at Vamp, our test and demo clusters can easily exceed 100 vCPUs, especially when we are testing our multi-cluster, multi-tenant tooling. So cloud costs are a concern.

One of the attractions of using a managed Kubernetes service is that it provides an out-of-the-box auto-scaling solution. The downside is that short-lived VM instances are the most expensive to run.

“On-demand” is Expensive

We use Google Cloud’s Kubernetes Engine (GKE) as our reference platform because if it doesn’t work on GKE, it’s probably not going to work on Amazon’s EKS or Azure’s AKS.

At the current pricing (July 2019) for standard machines in Google Cloud’s europe-west4 (Netherlands) region, those 100 vCPUs cost 5.23 USD/hour, which means an annual bill of over 27,000 USD.

Google Cloud’s Compute Engine sustained use discounts and committed use discounts apply to GKE nodes, but the sustained use discounts only start after a node has been running for more than a quarter of the month (182.5 of 730 hours, roughly a week). The majority of our test and demo clusters have an average lifespan of 20 hours, so the sustained use discounts don’t help.

We do have some long-term test clusters which make it cost effective for us to commit to 16 vCPUs per year (8 in the europe-west1 region and 8 in the europe-west4 region, currently the lowest-priced European regions). This reduces our annual bill by 2%.

The Surplus Store

GKE has beta support for using preemptible VM instances as nodes. Preemptible VMs are excess Compute Engine capacity and are 80% cheaper than nodes that use normal instances. This has the potential to reduce our 27,000 USD annual bill to less than 9,500 USD.

The downside is that whilst preemptible instances never live longer than 24 hours, they can also be terminated at any time if the resources are needed for other tasks (higher-paying customers).

We cannot use only preemptible nodes. A typical test cluster contains a mix of test services and core components. Executing the test services on preemptible nodes works really well but experience shows that doing the same for the core components leads to false negatives. Investigating these failures and repeating the tests means that using preemptible nodes for these components is more expensive than using normal nodes.

Fishing in Two Pools

The solution is to use two node pools, a core pool of normal nodes for the core components and a service pool of preemptible nodes for the test services. To improve stability and to ensure the various Pods end up where we want them, we use a simple mix of node labels, taints and pod anti-affinity.

Our Release Agents use NATS Streaming to orchestrate releases across multiple tenants spread across multiple Kubernetes clusters. To facilitate this, we run a cluster of NATS Streaming Servers in each Kubernetes cluster. We also use Elasticsearch. In production we recommend one Elasticsearch cluster per region but for most tests we deploy a single node Elasticsearch cluster in each Kubernetes cluster.

The core pool for each cluster is sized for 1 Elasticsearch Pod and 3 NATS Streaming Server Pods. The NATS Streaming Servers are lightweight: less than 1% CPU and 10M memory per Pod. Elasticsearch requires 20% CPU and 2G memory.
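As a rough sketch, the resource requests implied by these figures would look something like the following fragments (in practice they would sit in the separate NATS Streaming and Elasticsearch workload specs; the exact values are illustrative):

```yaml
# Illustrative resource requests based on the figures above.
# A NATS Streaming Server Pod: under 1% of a vCPU, ~10M memory.
resources:
  requests:
    cpu: 10m
    memory: 10Mi
---
# The Elasticsearch Pod: ~20% of a vCPU, 2G memory.
resources:
  requests:
    cpu: 200m
    memory: 2Gi
```

With requests like these, 1 Elasticsearch Pod and 3 NATS Streaming Server Pods fit comfortably on three 1-vCPU nodes.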

Our advice to customers is always to use regional clusters with nodes spread across multiple zones. And we follow this best practice in our demo and test clusters.

The result is a core pool consisting of 3 normal nodes using n1-standard-1 instances without auto-scaling, 1 per zone.

gcloud command to create the core-pool
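The embedded command isn’t reproduced here, but based on the pool described above it would look something like this (the cluster name is illustrative; the label and taint flags match the manifests below):

```shell
# Sketch: 1 normal n1-standard-1 node per zone, no auto-scaling,
# labelled and tainted for the core components.
gcloud container node-pools create core-pool \
  --cluster=test-cluster \
  --region=europe-west4 \
  --machine-type=n1-standard-1 \
  --num-nodes=1 \
  --node-labels=io.vamp/role=core \
  --node-taints=io.vamp/role=core:PreferNoSchedule
```

For a regional cluster, `--num-nodes` is per zone, giving 3 nodes in total across the region’s three zones.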

Node Selection

We set a role=core label on the nodes in the core-pool, which we can then use to assign the core components to these nodes using a nodeSelector.

spec:
  containers:
  ...
  nodeSelector:
    io.vamp/role: core

Taint

We also add a role=core taint to shoo away the test service Pods and, crucially, we set the effect of the taint to PreferNoSchedule. This allows Kubernetes to schedule test service Pods onto these nodes in preference to increasing the size of the service-pool, further helping to keep the cost down.

spec:
  containers:
  ...
  nodeSelector:
    io.vamp/role: core
  tolerations:
  - key: io.vamp/role
    operator: Equal
    value: core

A typical service-pool consists of preemptible nodes using n1-standard-1 instances and auto-scaling with 1–4 nodes per zone, 3–12 nodes in total. We set a role=service label on these nodes, but it’s only used for identification; we don’t use it for assigning Pods.

gcloud command to create the service-pool
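Again, the embedded command isn’t reproduced here, but based on the pool described above it would look something like this (the cluster name is illustrative):

```shell
# Sketch: preemptible n1-standard-1 nodes, auto-scaling 1-4 per zone
# (3-12 in total for a three-zone regional cluster).
gcloud container node-pools create service-pool \
  --cluster=test-cluster \
  --region=europe-west4 \
  --machine-type=n1-standard-1 \
  --preemptible \
  --num-nodes=1 \
  --enable-autoscaling \
  --min-nodes=1 \
  --max-nodes=4 \
  --node-labels=io.vamp/role=service
```

The `--preemptible` flag is what makes every node in this pool use a preemptible instance; as with the core-pool, the node counts are per zone.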

Anti-Affinity

There is no need to add a toleration to the test services. The only change we make is to add pod anti-affinity, so Pods are deployed on different nodes. This improves the services’ resilience to the nodes being preempted.

spec:
  containers:
  ...
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - sava-cart
        topologyKey: kubernetes.io/hostname

By making these small changes to use a mix of normal and preemptible nodes, we were able to reduce our 27,000 USD annual bill to a little under 14,000 USD.

But what about unplanned node loss?

We originally resisted using preemptible nodes because we were worried about the effect of losing all the nodes at once. So far, we’ve not experienced this in any of the European regions. When nodes are preempted because they reached their 24-hour limit, the replacement nodes are typically usable within a minute.

Initially, we experimented with using a Kubernetes controller to prevent the cliff-edge situation where all the nodes could be preempted after 24 hours. In our case it is not necessary. Our longer-running tests deploy, release and undeploy hundreds of new versions of services, their dependencies and/or feature toggles. The auto-scaling means the nodes in the service-pool have a mix of ages, so we rarely have more than 50% of the pool preempted at one time.

The potential for unplanned node loss also adds a layer of realism when testing the Release Agent — Vamp’s release policies are flexible enough to allow for short-term conditions such as loss of nodes without triggering a rollback.

You can find out more about Vamp and our policy-based, data-driven approach to Service Reliability Engineering at Vamp.io.