At 2M RPS - a Defining Moment for Linkerd Adoption
CloudCover + ShareChat + Linkerd
- ShareChat, with more than 200 million users, required a centralized way to manage security and to adopt different load-balancing and traffic-shaping algorithms.
- CloudCover had previously carried out a successful migration to Google Cloud for ShareChat.
- CloudCover analyzed and tested various service mesh providers, focusing on scalability and ease of implementation.
- Results observed at 2 million RPS at the Linkerd gateway for one microservice exceeded all expectations.
Here’s what happened
ShareChat, an Indian social media and networking giant with more than 200 million active users and valued at more than 2 billion USD, is spearheading India’s internet revolution.
ShareChat is running the majority of its workload in Google Cloud and uses more than 100 GKE clusters to host its 100+ micro-services and 300+ Kubernetes jobs spread across multiple regions.
To achieve exponential growth, ShareChat needed to scale and allocate resources dynamically. Given the success achieved in the past, re-engaging with CloudCover became a natural choice for ShareChat.
Complex problems require sophisticated solutions
CloudCover decided to functionally test and analyze the top service mesh providers. A top-end service mesh can be a game-changer for an application that is complex in the number and depth of its services and for a team that is fully invested in ramping up its Kubernetes infrastructure.
The search for a service mesh to manage hundreds of microservices and Kubernetes clusters began with three needs: dynamic service discovery, automated locality-based routing or failover, and the visibility and control at the network layer that was lacking.
CloudCover evaluated various service mesh offerings available in the market, including Istio, Tetrate, Linkerd, Consul Connect, Google Anthos Service Mesh, Google Traffic Director, Aspen Mesh, and Kuma Service Mesh.
CloudCover’s highly competent Kubernetes and service mesh team includes:
11 Certified Kubernetes Administrators (CKA)
7 Certified Kubernetes Application Developers (CKAD)
2 Certified Kubernetes Security Specialists (CKS)
1 Certified Istio Administrator (CIA)
The Winner of the Meshy Marathon: Linkerd
Linkerd supports the key service mesh features offered by the other meshes on the market. Its ability to scale was critical. It has a gentler learning curve than the other meshes, and its minimalistic approach avoids complexity.
Linkerd automatically enables advanced features, including sidecar injection, metrics, load balancing, retries, and more for HTTP, HTTP/2, and gRPC connections. It also automatically enables mutual Transport Layer Security (TLS) for all communication between meshed applications.
Certificates are used for authentication. ACLs are not yet supported for authorization, but authorization can be enabled at the ingress level.
You also get rich observability features around monitoring, logging, and tracing.
The Decisive Factor
Performance and Scalability
Performance and scalability were non-negotiable for the success of this project. ShareChat's foremost requirement for adopting any service mesh was the scalability and performance of the mesh control plane and data plane. As the ultimate goal was to benchmark performance at 2 million RPS with sub-millisecond latency, we decided to do it in a phased manner, which included:
Single-pod testing of Linkerd components and application components
Multiple iterations of load testing from 10K RPS to 2 Million RPS, tweaking and tuning various configurations and resource requirements
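As a rough illustration of the phased approach, the sketch below generates a series of load levels between the starting and target rates. The doubling factor and the `ramp_stages` helper are assumptions made for this example; the article does not state which intermediate levels were actually used.

```python
def ramp_stages(start_rps: int, target_rps: int, factor: float = 2.0) -> list[int]:
    """Return increasing load levels from start_rps up to target_rps.

    Each stage multiplies the previous one by `factor` (an assumed value);
    the final stage is always exactly target_rps.
    """
    if start_rps <= 0 or target_rps < start_rps or factor <= 1.0:
        raise ValueError("need 0 < start_rps <= target_rps and factor > 1")

    stages = []
    rps = float(start_rps)
    while rps < target_rps:
        stages.append(int(round(rps)))
        rps *= factor
    stages.append(target_rps)  # finish on the exact target
    return stages


# 10K RPS to 2 million RPS, the range mentioned in the text.
stages = ramp_stages(10_000, 2_000_000)
for number, rps in enumerate(stages, start=1):
    print(f"stage {number}: {rps:,} RPS")
```

Each stage would be followed by a round of tuning before moving on to the next level, so a coarser or finer factor simply trades test time against resolution.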
To benchmark the performance of various service meshes, we used a sample microservices application developed in-house. These microservices were deployed into a multi-cluster architecture as depicted in the diagram below.
Here are the individual components of this setup:
Distributed Locust for performing benchmarks
NGINX Ingress as the ingress controller
Linkerd, with its multi-cluster component, running in all three clusters
Product and user micro-services, each deployed in a dedicated cluster
After multiple rounds of testing and tuning, the results at each stage of benchmarking exceeded expectations.
Take a look at some snapshots below.
1 Million RPS
2 Million RPS
At 2 million RPS, the peak p99 latency observed was below 50 ms at the Linkerd gateway for one microservice. This was one of the defining moments, when we became confident about adopting Linkerd at the mammoth scale of ShareChat.
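For readers less familiar with the metric, p99 latency is the value that 99% of requests complete within. The sketch below computes it with the nearest-rank method over synthetic samples. The `percentile` helper and the sample values are illustrative assumptions, not data from the benchmark or the way Linkerd itself reports latency.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")

    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Synthetic latencies in milliseconds (placeholder values, not measurements).
synthetic_ms = [float(v) for v in range(1, 101)]
demo_p99 = percentile(synthetic_ms, 99)
print(f"p99 of synthetic samples: {demo_p99} ms")
```

Because p99 tracks the slow tail rather than the average, it is a stricter measure of what the slowest requests experience under load.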
Strong latency numbers aside, everything comes down to the cost of running a service mesh: how many additional dollars the system ends up consuming while offering all the cool mesh features.
Below are some of the stats recorded for the number of cores (vCPUs) and memory (GiB) each component consumed at a given scale. Consumption clearly follows a linear pattern, which is far easier to plan for than abrupt spikes in usage. This was the second defining moment that pushed us to integrate Linkerd at ShareChat scale.
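One way to check a claim of linear consumption is to fit a straight line to the recorded points and look at how well it explains them. The sketch below does this with an ordinary least-squares fit. The `fit_line` helper is hypothetical, and the numbers are made-up placeholders chosen only to show the calculation; they are not the figures recorded in this benchmark.

```python
def fit_line(xs: list[float], ys: list[float]) -> tuple[float, float, float]:
    """Ordinary least-squares fit y = slope * x + intercept.

    Returns (slope, intercept, r_squared); r_squared close to 1.0
    means the points lie close to a straight line.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally sized lists with at least 2 points")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = 1.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Placeholder data only: load in millions of RPS vs. vCPUs consumed.
rps_millions = [0.5, 1.0, 1.5, 2.0]
vcpus = [10.0, 20.0, 30.0, 40.0]

slope, intercept, r_squared = fit_line(rps_millions, vcpus)
print(f"about {slope:.1f} vCPUs per million RPS, R^2 = {r_squared:.3f}")
```

With a good linear fit, the slope gives a per-unit cost that can be multiplied out to estimate resources, and therefore spend, at a higher target load.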
What are we building next?
After two successful collaborations, CloudCover and ShareChat plan to develop a seamless platform solution that makes it easier to onboard hundreds of Kubernetes clusters with hundreds of microservices and run them alongside the Linkerd service mesh in a centralized manner.
“Hats off to Google PSO and CloudCover for being available for us at any time. Our collaboration and synergy helped us drive towards a shared goal.”
Venkatesh Ramaswamy, Vice-President of Engineering, ShareChat