Scaling Picnic systems in Corona times

Sander Mak · Picnic Engineering · Jul 21, 2020

As an online groceries company, Picnic saw an enormous increase in demand during the start of the Corona crisis. Our systems suddenly experienced traffic peaks of up to 10–20 times the pre-Corona peak traffic. Even though we build our systems for scale, surges like these exposed challenges we didn’t know we had. In this post, we’ll share what we learned while scaling our systems during these times. If you prefer, you can also watch the recording of an online meetup we organised around this same topic.

When we talk about scaling systems, you might be tempted to think about infrastructure and raw processing power. These are definitely important pieces of the puzzle. However, just throwing more resources at a scaling problem is rarely the best way forward. We’ll see that scaling our infrastructure goes hand in hand with improving the services that run on this infrastructure.

Before diving into solutions, let’s first get a better understanding of how Picnic operates and what issues arose.

The Picnic Promise

Customers use the Picnic app to order groceries for the next day, or further into the future. Picnic then fulfills this order by picking the products in our fulfillment centers, and delivering them to the door using our iconic electric vehicles. Unfortunately, we don’t have infinite stock, vehicles, and picking capacity. That’s why a customer can choose a delivery slot for a certain day and time, as long as it has capacity available.

It’s mid-March, right after the ‘intelligent lockdown’ in the Netherlands took effect. Our story begins with many existing and new customers turning to Picnic for their essentials in these uncertain times. Slots are filling up quickly, and customers are eagerly waiting for new slots to become available. New slots are made available at a fixed time every day, and we communicate this to give everyone a fair chance to place an order. Of course, this does lead to a big influx of customers around these slot opening times. These moments brought about the toughest scaling challenges.

Now, let’s briefly sketch the Picnic system landscape before we move on to the lessons we learned.

Picnic’s System Landscape

Our systems run on AWS. Kubernetes (EKS) is used to run compute workloads, and for data storage we use managed services like Amazon RDS and MongoDB Atlas. When users order groceries through the Picnic app, it communicates with a service we call Storefront. Behind Storefront, there are many other internal services handling every aspect of our business, from stock management to order fulfillment. They all communicate with REST over HTTP, and most of our services are implemented on a relatively uniform Java 11 + Spring Boot stack.

Scaling Kubernetes Pods

With the first surges of traffic, the Kubernetes pods running Storefront receive more traffic than they can handle. Scaling up the number of replicas is the obvious first step.

Doing so uncovered some interesting bottlenecks in the service itself, which we’ll talk about in a bit. For now, let’s focus on how the infrastructure evolved to cope with traffic surges. Manually increasing the number of replicas works, but Kubernetes also offers a way to automatically scale pods. We first tried the Horizontal Pod Autoscaler (HPA): based on a configurable CPU utilization target, it automatically adds or removes replicas as the actual load changes.
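
As an illustration, a CPU-based HPA for the Storefront deployment could look roughly like this. The names, replica counts, and utilization target below are made up for the example, and they are not Picnic’s actual values:

```yaml
# Illustrative only: names, replica counts, and target are placeholders.
# On Kubernetes versions before 1.23, use apiVersion: autoscaling/v2beta2.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: storefront
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: storefront
  minReplicas: 4
  maxReplicas: 12
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```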

There’s an issue, though: the highest traffic peaks occur when new delivery slots are opened up for customers. People anticipate this moment, and traffic surges so quickly that the autoscaler just can’t keep up.

Since we know the delivery slot opening moments beforehand, we combined the HPA with a simple Kubernetes CronJob. It automatically increases the `minReplicas` and `maxReplicas` values in the HPA configuration some time before the slots open. Note that we don’t touch the replica count of the Kubernetes deployment itself, that’s the HPA’s job!
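
A sketch of what such a CronJob might look like, assuming a small `kubectl` image and a service account that is allowed to patch HPAs. The schedule, resource names, and replica counts are illustrative, not our production values:

```yaml
# Illustrative only: schedule, HPA name, and replica counts are placeholders.
# Requires a service account with RBAC permission to patch HorizontalPodAutoscalers.
# On Kubernetes versions before 1.21, use apiVersion: batch/v1beta1.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: storefront-hpa-scale-up
spec:
  schedule: "45 6 * * *"          # some time before the daily slot-opening moment
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: hpa-patcher
          restartPolicy: OnFailure
          containers:
            - name: scale-up
              image: bitnami/kubectl:latest
              command:
                - /bin/sh
                - -c
                - >-
                  kubectl patch hpa storefront
                  --patch '{"spec": {"minReplicas": 20, "maxReplicas": 40}}'
```

A similar job after the peak lowers the values again, handing control back to the HPA’s normal CPU-based behaviour.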

Now, additional pods are started and warmed up before the peak hits. Of course, after the peak we reduce the counts in the HPA configuration commensurately. This way, we are certain to have additional capacity before the peak occurs. At the same time, we keep the flexibility of the HPA kicking in at other moments when traffic increases or decreases unexpectedly.

Scaling Kubernetes Nodes

Just scaling pods isn’t enough: the Kubernetes cluster itself must have enough capacity across all nodes to support the increased number of pods. Costs now also come into the picture: the more nodes we use, and the heavier they are, the higher the EKS bill. Fortunately, another open source project called Cluster Autoscaler solves this challenge. We created an instance group that can be automatically scaled up and down by the Cluster Autoscaler based on utilization. The nodes in this instance group were used exclusively for applications subject to the HPA scaling discussed earlier.
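
One way to keep these workloads on the autoscaled instance group is a node selector (and a matching taint/toleration) on the deployments involved. The labels, taint, and image below are placeholders; the post doesn’t detail Picnic’s exact setup:

```yaml
# Illustrative only: label/taint names and the image are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: storefront
spec:
  replicas: 4
  selector:
    matchLabels:
      app: storefront
  template:
    metadata:
      labels:
        app: storefront
    spec:
      nodeSelector:
        node-group: autoscaled        # label on nodes in the Cluster Autoscaler-managed group
      tolerations:
        - key: node-group
          operator: Equal
          value: autoscaled
          effect: NoSchedule          # keeps other workloads off these nodes
      containers:
        - name: storefront
          image: registry.example.com/storefront:latest
```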

Combining HPA and Cluster Autoscaler helped us to elastically scale the compute parts of our services. However, the managed MongoDB and RDS clusters also needed to be scaled up. Scaling up these managed services is a costly affair, and cannot be done as easily as scaling up and down Kubernetes nodes. Hence, besides scaling up, we also had to address the ways services utilize these managed resources.

MongoDB Usage Patterns

It’s not just that scaling up a MongoDB cluster is expensive. We were even hitting physical limits of the best available underlying hardware! A single MongoDB cluster was serving Storefront and all the other services shown earlier. Even after scaling up this cluster, at some point, we were saturating the 10 Gbit Ethernet link to the master node. Of course, this led to slowdowns and outages across services.

You can’t scale your way out of every problem. That’s why, in addition to all of the infrastructure efforts discussed so far, we also started looking critically at how our services were interacting with MongoDB. In order to reduce bandwidth to the cluster (especially the master node), we applied several code optimizations:

  • Read data from secondaries, wherever we can tolerate the reduced consistency guarantees.
  • Don’t read-your-writes by default. Developers do this out of habit, but often it’s not necessary because you already have all the information you need.
  • When you do read data, apply projections to reduce bytes-on-the-wire. In some cases this means introducing multiple specialized queries with projections for different use-cases, trading a bit of code reuse for performance.
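
To make the first and third points concrete, here’s a minimal sketch using the MongoDB Java driver, reading from secondaries where possible and projecting only the fields needed. The collection, field names, and document shape are invented for the example:

```java
import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Projections.excludeId;
import static com.mongodb.client.model.Projections.fields;
import static com.mongodb.client.model.Projections.include;

import com.mongodb.ReadPreference;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class DeliverySlotQueries {

    private final MongoCollection<Document> slots;

    public DeliverySlotQueries(MongoClient client) {
        this.slots = client.getDatabase("storefront")
                .getCollection("delivery_slots")
                // Tolerate slightly stale data to take read load off the primary.
                .withReadPreference(ReadPreference.secondaryPreferred());
    }

    /** Fetch only the fields needed to render a slot, not the whole document. */
    public Document findSlotSummary(String slotId) {
        return slots.find(eq("_id", slotId))
                .projection(fields(include("windowStart", "windowEnd", "available"), excludeId()))
                .first();
    }
}
```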

A somewhat bigger change we applied for some services was to provision them with their own MongoDB clusters. This is a good idea anyway (from a resilience perspective, for example), and the Corona scaling challenges only made this more apparent.

Observability

Optimizing and re-deploying services during a crisis is stressful, to say the least. Where do you start, and how do you know your changes are actually helping? For us, observability and metrics were key in finding the right spots to tune service implementations. Fortunately, we already had Micrometer, Prometheus, Grafana, and New Relic in place to track and monitor metrics.

This was especially useful when we encountered an issue that stumped us for a bit. Requests from the app to Storefront were painfully slow, yet the Storefront pods themselves were not exhibiting any signs of excessive memory or CPU use. Also, outgoing requests from Storefront to internal services were reasonably fast.

So what was causing the slowdown of all these user requests? Eventually, a hypothesis emerged: it might be a connection pool starvation issue of the Apache HTTP client we use to make internal service calls. We could have tweaked the pool settings blindly and hoped for the best, but in order to really understand what’s going on, you need accurate metrics. Apache HTTP client Micrometer bindings are available out-of-the-box, so integrating them was a relatively small change. The metrics indeed indicated that we had a large number of pending connections whenever these slowdowns occurred. With the numbers clearly visible, we tweaked the connection pool sizes and this bottleneck was cleared.
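
For reference, wiring up these bindings looks roughly like this, assuming Apache HttpClient 4.x and `micrometer-core` on the classpath. The pool sizes and client name are illustrative:

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.binder.httpcomponents.PoolingHttpClientConnectionManagerMetricsBinder;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

public class InternalHttpClientConfig {

    public CloseableHttpClient internalHttpClient(MeterRegistry registry) {
        PoolingHttpClientConnectionManager connectionManager = new PoolingHttpClientConnectionManager();
        // Pool sizes are illustrative; we tuned ours based on the observed pending-connection metrics.
        connectionManager.setMaxTotal(200);
        connectionManager.setDefaultMaxPerRoute(50);

        // Exposes connection pool gauges (including pending connections), tagged with the given client name.
        new PoolingHttpClientConnectionManagerMetricsBinder(connectionManager, "internal-httpclient")
                .bindTo(registry);

        return HttpClients.custom()
                .setConnectionManager(connectionManager)
                .build();
    }
}
```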

Service Interactions

Removing one bottleneck often reveals another, and this time was no different. Increasing the number of concurrent calls from Storefront to internal services (by enlarging the HTTP client connection pool sizes) put additional load on downstream services, revealing new areas for improvement. It also caused us to take a hard look at all the internal service interactions stemming from a single user request.

The best internal service call is the one you never make. Through careful log and code analysis, we identified several calls that could be eliminated, either because they were duplicated in different code paths, or not strictly necessary for certain requests.

Most calls, however, are there for a reason. Judicious use of caching is the next step to reduce stress on downstream services. We already had per-JVM in-memory caches for several scenarios. Again, metrics helped locate high-volume service calls where introducing caching would help the most.
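
As an example of such a per-JVM cache, here’s a sketch using Caffeine (one common choice on a Spring Boot stack; the post doesn’t name the library Picnic actually uses), with a hypothetical downstream client:

```java
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;
import java.time.Duration;

public class ProductInfoCache {

    /** Hypothetical client for a downstream service providing product data. */
    public interface ProductServiceClient {
        String fetchProductInfo(String productId);
    }

    private final LoadingCache<String, String> cache;

    public ProductInfoCache(ProductServiceClient client) {
        this.cache = Caffeine.newBuilder()
                .maximumSize(10_000)
                // A short TTL trades a little staleness for far fewer downstream calls.
                .expireAfterWrite(Duration.ofMinutes(5))
                .build(client::fetchProductInfo);
    }

    public String get(String productId) {
        return cache.get(productId);
    }
}
```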

Part of our caching setup also worked against us during scaling events. One particular type of caching we use is a polling cache: it periodically retrieves a collection of data from another service. When a service starts, these polling caches initialize themselves, and they periodically expire and reload the data. Normally, that works just fine. But when the whole system is already under high load and Kubernetes automatically starts adding many more pods, things get dicey. These new pods all try to fill their polling caches simultaneously, putting a lot of load on the downstream services providing the data. Later, these caches all expire at the same time and reload their data, again causing peak loads. To prevent such a ‘thundering herd’, we introduced random jitter into the schedule of the polling caches. This spreads the same load out over time.
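
A simplified sketch of the idea: a polling cache whose refresh schedule starts with a random offset, so freshly started pods don’t all hit the downstream service at once. The class and its shape are invented for illustration:

```java
import java.time.Duration;
import java.util.List;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Periodically reloads a snapshot of data, with random jitter to avoid synchronized reloads. */
public class PollingCache<T> {

    private volatile List<T> snapshot = List.of();

    public PollingCache(Supplier<List<T>> loader,
                        Duration refreshInterval,
                        ScheduledExecutorService scheduler) {
        // Random initial delay: pods started at the same time refresh at different moments.
        long jitterMillis = ThreadLocalRandom.current().nextLong(refreshInterval.toMillis());
        scheduler.scheduleWithFixedDelay(
                () -> { snapshot = loader.get(); },
                jitterMillis,
                refreshInterval.toMillis(),
                TimeUnit.MILLISECONDS);
    }

    public List<T> get() {
        return snapshot;
    }
}
```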

Wrapping Up

To solve scaling challenges, you need a holistic approach spanning infrastructure and software development. Every improvement described in this post (and many more) was found and implemented over the course of several weeks. During these first weeks, we had daily check-ins where both developers and infrastructure specialists analyzed the current state and discussed upcoming changes. Improvements on both code and infrastructure were closely monitored with custom metrics in Grafana and generic service metrics in New Relic. Every day, our services performed a little bit better, and we soon eliminated the downtime we experienced in the first days.

Eventually, the traffic peaks flattened and somewhat subsided as society adapted to the new reality. Our improvements, however, will be with us for a long time!

