Intro

CloudCover helps customers choose the service mesh best suited to their use case. The first step in any evaluation is a feature comparison, checking whether all the critical features are supported by the given service mesh; the second aspect is whether the mesh remains performant and optimized under real-world usage. For a brief idea of the functionality of the top three service meshes, i.e. Istio, Linkerd, and Consul Connect, refer to our feature-comparison report.

Now that you have seen the various functional aspects and adoption trends of the service meshes from that report, it becomes clear that Istio leads from the front and is the preferred choice for production adoption. But it may not be as straightforward to pick Istio for every customer use case.

This blog is focused on covering the second aspect of service mesh performance evaluation.

Performance Evaluation: Consul vs. Linkerd vs. Istio

We started with an existing benchmarking tool (GitHub link), which we are updating to our requirements and maintaining in our cldcvr repo. The following metrics are used to evaluate the service meshes:

  1. CPU and Memory Utilization of Control Plane
  2. CPU and Memory Utilization of Data Plane
  3. Latencies
  4. Network usage

Setup

Assumption: The complete POC is done on a GKE cluster.

The workload and benchmark applications are deployed on a GKE cluster consisting of two node pools: one is labeled role: workload and the other role: benchmark. The application pods run on the nodes with the workload label and the benchmarking pods run on the nodes with the benchmark label.
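For reference, a labeled node pool can be created on GKE with commands along these lines (the cluster name, zone, and node count below are placeholders, not values from this POC):

gcloud container node-pools create workload-pool \
    --cluster my-cluster --zone us-central1-a \
    --num-nodes 3 --node-labels=role=workload

gcloud container node-pools create benchmark-pool \
    --cluster my-cluster --zone us-central1-a \
    --num-nodes 3 --node-labels=role=benchmark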

Push-Gateway Installation

  1. The Prometheus Pushgateway allows ephemeral and batch jobs to expose their metrics to Prometheus. Since these kinds of jobs may not exist long enough to be scraped, they can instead push their metrics to a Pushgateway. The Pushgateway then exposes these metrics to Prometheus.
  2. The benchmark load generator will push intermediate run-time metrics as well as final latency metrics to a Prometheus push gateway.
  3. For the push gateway installation we need the ServiceMonitor resource, which is not available by default; it is a custom resource that is part of kube-prometheus. A detailed explanation of kube-prometheus is available here. The commands required for the setup are:
git clone git@github.com:prometheus-operator/kube-prometheus.git 
kubectl create -f manifests/setup 
kubectl apply -f manifests/.
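Once the manifests are applied, the monitoring stack (Prometheus, Grafana, and the operator) should come up in the monitoring namespace and the ServiceMonitor custom resource should be available. A quick check:

kubectl -n monitoring get pods
kubectl -n monitoring get servicemonitors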

Deploy Prometheus push gateway

cd service-mesh-benchmark
helm install pushgateway --namespace monitoring configs/pushgateway
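To sanity-check the Pushgateway before running any benchmark, you can push a dummy metric to it. This assumes the Helm release exposes a service named pushgateway on the default port 9091; adjust the service name if your chart differs.

kubectl -n monitoring port-forward svc/pushgateway 9091:9091 &
echo "benchmark_smoke_test 1" | curl --data-binary @- http://localhost:9091/metrics/job/smoke-test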

Grafana Dashboard Creation

After the Grafana pod is up and running in the monitoring namespace, access the UI by forwarding the Grafana service port from the cluster:

kubectl -n monitoring port-forward svc/grafana 3000:3000 
  1. Log in to Grafana and create an API key we’ll use to upload the dashboard.

  2. Clone the benchmarking tool from the link here and move to the service-mesh-benchmark/dashboards directory. Run the commands below to create the required dashboards:

./upload_dashboard.sh "[API KEY]" grafana-wrk2-cockpit.json localhost:3000
./upload_dashboard.sh "[API KEY]" grafana-wrk2-summary.json localhost:3000

Note: [API KEY] must be created via the Grafana dashboard.
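If you prefer to create the API key from the command line instead of the UI, a minimal sketch using Grafana's HTTP API (assuming the default admin/admin credentials and the port-forward above) looks like this:

curl -s -u admin:admin -X POST -H "Content-Type: application/json" \
    -d '{"name": "dashboard-upload", "role": "Admin"}' \
    http://localhost:3000/api/auth/keys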

We are creating two dashboards:

  1. wrk2 cockpit: Grafana dashboard for live metrics of the load test.

  2. wrk2 summary: Dashboard for fetching the comparison report after benchmarking has completed for all the service meshes. Currently, this supports reports only for the three service meshes under test.

Install Service Meshes

An automation script is in place that installs all the service meshes. Clone the benchmarking tool from the link here and move to service-mesh-benchmark/scripts:

./setup-servicemeshes.sh
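After the script finishes, confirm that each control plane is healthy before deploying any workloads. The namespaces below are the usual defaults (consul, linkerd, istio-system); your installation script may use different ones.

kubectl -n consul get pods
kubectl -n linkerd get pods
kubectl -n istio-system get pods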

Deploy the application

Note: Namespace creation is integrated into the application deployment via Helm.

We will deploy the emojivoto application with all three service meshes. For this POC, three instances of the emojivoto application are deployed per service mesh, each in a separate k8s namespace.

Deploy emojivoto with consul

  1. Switch to the benchmark repo and execute the following commands to deploy the application.

    for i in {0..2} ; do
       helm install emojivoto-consul-$i --set servicemesh=consul configs/emojivoto
    done
    
  2. Validate that the application container and the proxy container are running in each pod.
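One way to validate this (for any of the three meshes) is to list the container names in each emojivoto pod; with sidecar injection working, every pod should show the application container plus the proxy container. The namespace below is a placeholder for whichever namespace the Helm chart created.

    kubectl get pods --all-namespaces | grep emojivoto
    kubectl -n <emojivoto-namespace> get pods \
        -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.containers[*].name}{"\n"}{end}'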

Deploy emojivoto with linkerd

  1. Helm will manage the creation of the k8s namespace with linkerd.io/inject: enabled annotation

  2. Switch to the benchmark repo and execute the following commands to deploy the application.

    for i in {0..2} ; do
       helm install emojivoto-linkerd-$i --set servicemesh=linkerd configs/emojivoto
    done
    
  3. Validate that the application container and the proxy container are running in each pod. We can also refer to the Linkerd dashboard for the same.
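Depending on the Linkerd version installed, the dashboard is opened via the viz extension (older releases use linkerd dashboard instead):

    linkerd viz dashboard &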

Deploy emojivoto with istio

  1. Helm will manage the creation of the k8s namespace with an istio-injection=enabled label.

  2. Switch to the benchmark repo and execute the following commands to deploy the application

    for i in {0..2} ; do 
       helm install emojivoto-istio-$i --set servicemesh=istio configs/emojivoto 
    done
    
  3. Validate that the application container and the proxy container are running in each pod.
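For Istio, istioctl can additionally confirm that every sidecar is connected to and in sync with the control plane:

    istioctl proxy-status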

Start the Benchmark (Load Test)

  1. The benchmarking tool is deployed using the code available at service-mesh-benchmark/configs/benchmark.

  2. We will launch a separate instance of the benchmark tool to load test the applications for each service mesh.

  3. Before starting the benchmark application, make the required changes in the service-mesh-benchmark/configs/benchmark/values.yaml file:

    wrk2:
      duration: 1800          # time in seconds
      connections: 96
      RPS: 500                # set the required RPS for the load test
      initDelay: 0
      serviceMesh: "linkerd"  # which service mesh we are testing
      app:
        name: emojivoto-linkerd  # emojivoto-linkerd, emojivoto-istio, or emojivoto-consul
        count: "3"               # number of application instances deployed per service mesh
      appImage: quay.io/kinvolk/wrk2-prometheus
    
  4. Set the values serviceMesh: consul and name: emojivoto-consul, then start the benchmark tool to load test the application running with the Consul service mesh:

    helm install --create-namespace benchmark-consul --namespace benchmark-consul configs/benchmark
    
  5. Set the values serviceMesh: linkerd and name: emojivoto-linkerd, then start the benchmark tool to load test the application running with the Linkerd service mesh:

    helm install --create-namespace benchmark-linkerd --namespace benchmark-linkerd configs/benchmark
    
  6. Set the values serviceMesh: istio and name: emojivoto-istio, then start the benchmark tool to load test the application running with the Istio service mesh:

    helm install --create-namespace benchmark-istio --namespace benchmark-istio configs/benchmark
    
  7. Check the pod status in all the benchmark namespaces (a quick loop for this is shown after this list).

  8. We can monitor the live metrics on the wrk2 cockpit dashboard by selecting the correct job name and time range in the variable dropdowns.
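To check the pod status across all three benchmark namespaces (step 7), a quick loop is enough:

    for ns in benchmark-consul benchmark-linkerd benchmark-istio; do
        kubectl -n "$ns" get pods
    done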

Compare and Conclude

  1. When all three benchmarking deployments have completed, we run the metrics-merger job to update the summary metrics on the wrk2 summary dashboard.

  2. The metrics-merger code is available at service-mesh-benchmark/configs/metrics-merger. Run the following command:

    helm install --create-namespace --namespace metrics-merger metrics-merger configs/metrics-merger
    
  3. After the job has completed, we can check the wrk2 summary dashboard for the final result.
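To know when the merger has finished, watch the job in its namespace; the exact job name depends on the chart, so listing all jobs is the safest check:

    kubectl -n metrics-merger get jobs --watch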

Findings

We benchmarked each service mesh for 30 minutes at 500 RPS against 3 emojivoto application instances, with 96 threads/simultaneous connections. The monitoring metrics captured for each mesh are below.

Consul metrics

Sidecar memory usage

Sidecar CPU usage

Application CPU usage

Component                          CPU Seconds Usage    Memory Usage (MB)
consul-connect-lifecycle-sidecar   0.00931              22.81
consul-connect-envoy-sidecar       0.00585              16.05

Linkerd metrics

Sidecar memory usage

Sidecar CPU usage

Application CPU usage

Component                  CPU Seconds Usage    Memory Usage (MB)
Linkerd-Proxy (Sidecar)    0.150                11.26

Istio Metrics

Sidecar memory usage

Sidecar CPU usage

Application CPU usage

Component                CPU Seconds Usage    Memory Usage (MB)
Istio-Proxy (Sidecar)    0.2546               64.3

Summary Report

Memory usage and CPU utilization

Memory Usage (Control Plane)

CPU Usage (Control Plane)

The values from the graphs are summarized in the table below:

Component    CPU Seconds Usage    Memory Usage (MB)
Consul       0.0463               178.5
Istio        0.0053               90.5
Linkerd      0.0023               55.7

The above charts show that the Consul control plane components utilized the most resources, followed by Istio and then Linkerd. The sidecar utilization metrics of the respective meshes, however, show that Istio's sidecar utilized the most resources, followed by Linkerd and then Consul. So it is important to understand the resource utilization profile before adopting any service mesh in a setup.

Latency Percentiles

Percentile-based latency

With this load, Linkerd and Istio easily generated latencies in the minutes range. No socket/HTTP errors were observed during the load test, and the effective throughput was around 500 RPS. Going by these metrics, Consul is the winner by a long margin with a 1.0-percentile (i.e. maximum) latency of 1.06 s, whereas Linkerd and Istio show 3.08 min and 5.93 min respectively.

Conclusion

Consul outperformed Linkerd and Istio on latency, with an acceptable overhead in control-plane resource consumption. With Consul, however, we need to consider the ease of implementation with respect to complex setups. Linkerd takes the edge on resource consumption; even its application CPU utilization is the lowest of the three, which shows that Linkerd is lightweight in terms of resource utilization. The final call on the appropriate service mesh depends on the requirements, where all the above pointers should be considered.