Case Study

At 2M RPS - a Defining Moment for Linkerd Adoption

CloudCover + ShareChat + Linkerd

tl;dr

  • ShareChat, with more than 200 million users, needed a centralized way to manage security and to adopt different load-balancing and traffic-shaping algorithms.
  • CloudCover had previously carried out a successful migration to Google Cloud for ShareChat.
  • CloudCover analyzed and tested various service mesh providers, focusing on scalability and ease of implementation.
  • Results observed at 2 million RPS at the Linkerd gateway for one microservice exceeded all expectations.

Here’s what happened

ShareChat, an Indian social media and networking giant with more than 200 million active users and a valuation of more than 2 billion USD, is spearheading India’s internet revolution.

ShareChat runs the majority of its workload in Google Cloud and uses more than 100 GKE clusters to host its 100+ microservices and 300+ Kubernetes jobs spread across multiple regions.

To achieve exponential growth, ShareChat needed to scale and allocate resources dynamically. Given the success achieved in the past, re-engaging with CloudCover was a natural choice for ShareChat.

Enter CloudCover

Complex problems require sophisticated solutions

CloudCover decided to functionally test and analyze the top service mesh providers. A top-end service mesh can be a game-changer for an application that is complex in the number and depth of its services and that is fully invested in ramping up its Kubernetes infrastructure.

The search for a service mesh to manage hundreds of microservices on Kubernetes began with three requirements: a need for dynamic service discovery, a demand for automated locality-based routing or failover, and a lack of visibility and control at the network layer.

CloudCover evaluated various service mesh offerings available in the market, including Istio, Tetrate, Linkerd, Consul Connect, Google Anthos Service Mesh, Google Traffic Director, Aspen Mesh, and Kuma Service Mesh.

CloudCover’s team

CloudCover’s team of highly competent Kubernetes and service mesh engineers includes:

  • 11 Certified Kubernetes Administrators (CKA)

  • 7 Certified Kubernetes Application Developers (CKAD)

  • 2 Certified Kubernetes Security Specialists (CKS)

  • 1 Certified Istio Administrator (CIA)

The Winner

The Winner of the Meshy Marathon: Linkerd

Linkerd supports all the key service mesh features offered by the other meshes on the market. Its ability to scale was critical. It also has a gentler learning curve than the other meshes, and its minimalistic approach avoids unnecessary complexity.

Linkerd

  • Supported Protocols

    Linkerd automatically enables advanced features, including sidecar injection, metrics, load balancing, retries, and more, for HTTP, HTTP/2, and gRPC connections. It also automatically enables mutual Transport Layer Security (mTLS) for all communication between meshed applications.

  • Security

    Certificates are used for authentication. ACLs are not yet supported for authorization, but they can be enabled at the ingress level.

  • Observability

    Linkerd provides rich observability features for monitoring, logging, and tracing.

The Decisive Factor

Performance and Scalability

Performance and scalability were non-negotiable for the success of this project. ShareChat’s foremost requirement for adopting any service mesh was the scalability and performance of the mesh’s control plane and data plane. As the ultimate goal was to benchmark performance at 2 million RPS with sub-millisecond latency, we decided to proceed in phases:

  • Pod Testing

    Single-pod testing of Linkerd components and application components

  • Multiple Iterations

    Multiple iterations of load testing from 10K RPS to 2 million RPS, tweaking and tuning various configurations and resource requirements along the way
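
The phased ramp described above can be pictured as a simple schedule of load-test stages. The following is a minimal sketch, not CloudCover’s actual tooling: only the 10K and 2 million RPS endpoints come from the text, while the doubling factor and the function name `ramp_schedule` are assumptions made for illustration.

```python
from typing import List


def ramp_schedule(start_rps: int, target_rps: int, factor: float = 2.0) -> List[int]:
    """Return load-test stages from start_rps up to target_rps.

    Each stage multiplies the previous one by `factor` (an assumed value);
    the final stage is clamped to exactly target_rps.
    """
    if start_rps <= 0 or target_rps < start_rps or factor <= 1.0:
        raise ValueError("need 0 < start_rps <= target_rps and factor > 1")
    stages = []
    rps = start_rps
    while rps < target_rps:
        stages.append(rps)
        rps = int(rps * factor)
    stages.append(target_rps)
    return stages


if __name__ == "__main__":
    # Endpoints taken from the text; the doubling between stages is an assumption.
    for stage in ramp_schedule(10_000, 2_000_000):
        print(f"run load test at {stage:,} RPS, then review latency and resource usage")
```

Stepping up gradually like this makes it possible to retune configuration and resource requests after each stage instead of discovering every bottleneck at once at full load.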

Benchmarking

To benchmark the performance of the various service meshes, we used a sample microservices application developed in-house. These microservices were deployed into a multi-cluster architecture as depicted in the diagram below.

[Diagram: multi-cluster benchmarking setup]

Here are the individual components of this setup:

  • Distributed Locust for performing benchmarks

  • NGINX ingress controller as the ingress

  • Linkerd, with its multi-cluster component, running in all three clusters

  • Product and user microservices, each deployed in a dedicated cluster

Benchmarking Result

After multiple rounds of testing and tuning, the final results at the various stages of benchmarking exceeded our expectations.

Take a look at some snapshots below:

  • 500K RPS

    At 500K RPS, Linkerd proxy latencies for the microservices are almost negligible. The p99 latencies at the ingress and the Linkerd gateway are well under the 15 ms we were aiming for.

    [Snapshot: latencies at 500K RPS]

  • 1 Million RPS

    At 1 million RPS, the results don’t drift much: the p50 latency for almost all components is below 2 ms.

    [Snapshot: latencies at 1 million RPS]

  • 2 Million RPS

    At 2 million RPS, the peak p99 latency observed was below 50 ms at the Linkerd gateway for one microservice. This was one of the defining moments, when we became confident about adopting Linkerd at the mammoth scale of ShareChat.

    [Snapshot: latencies at 2 million RPS]
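
For readers less familiar with the p50 and p99 figures quoted above, here is a minimal sketch of how such percentiles can be computed from raw latency samples using the nearest-rank method. The sample values are hypothetical and are not measurements from this benchmark; the helper name `percentile` is likewise illustrative, and real monitoring stacks typically estimate percentiles from histograms instead.

```python
import math


def percentile(samples_ms, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples_ms:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples_ms)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


if __name__ == "__main__":
    # HYPOTHETICAL latencies in milliseconds, for illustration only.
    latencies_ms = [1.2, 0.9, 1.5, 1.1, 48.0, 1.3, 1.0, 1.4, 0.8, 1.6]
    print("p50:", percentile(latencies_ms, 50), "ms")
    print("p99:", percentile(latencies_ms, 99), "ms")
```

The p50 describes the typical request, while the p99 exposes the slow tail, which is why both are tracked at each throughput level.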

Resource Consumption

However good the latency numbers are, everything comes down to the cost of running a service mesh: how many additional dollars the system ends up consuming while offering all the cool mesh features.

Below are some of the stats recorded for the number of cores (vCPUs) and the memory (GiB) each component consumed at a given scale. Consumption clearly follows a linear pattern, which is less scary than abrupt spikes in usage. This was the second defining moment that pushed us to integrate Linkerd at ShareChat scale.

  • CPU

    [Chart: vCPUs consumed per component at each scale]

  • Memory

    [Chart: memory (GiB) consumed per component at each scale]
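
One way to check a “linear pattern” claim like the one above is to fit a least-squares line to (throughput, resource usage) points and look at how well the line explains the data. The sketch below is only an illustration: the data points are hypothetical, not the recorded stats, and `fit_line` is a helper name invented here.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept.

    Returns (slope, intercept, r_squared); r_squared close to 1.0
    means the points lie close to a straight line.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1.0 if ss_tot == 0 else 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared


if __name__ == "__main__":
    # HYPOTHETICAL data for illustration only: throughput (RPS) vs. vCPUs used.
    rps = [500_000, 1_000_000, 2_000_000]
    vcpus = [12.0, 23.0, 46.0]
    slope, intercept, r2 = fit_line(rps, vcpus)
    print(f"about {slope * 100_000:.2f} vCPUs per extra 100K RPS, R^2 = {r2:.4f}")
```

A slope that stays stable as more data points are added, together with an R^2 near 1.0, is what makes capacity and cost planning predictable at higher throughput.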

What are we building next?

After two successful collaborations, CloudCover and ShareChat plan to develop a seamless platform solution that makes it easier to onboard hundreds of Kubernetes clusters with hundreds of microservices and to run them alongside the Linkerd service mesh in a centralized manner.

Quote

“Hats off to Google PSO and CloudCover for being available for us at any time. Our collaboration and synergy helped us drive towards a shared goal.”

Venkatesh Ramaswamy, Vice-President of Engineering, ShareChat