
Intro

In the first part of this five-part blog series, my colleague, Shashank, drew parallels between in-flight refueling maneuvers and cloud-to-cloud migration.

He also shared three lessons that we learned the hard way while completing a big-bang migration for one of India’s largest social networking service providers. In this blog, I’m diving into our migration challenges from a technical standpoint. I'll share five steps that you can take now to reduce risks, find practical solutions, and prepare for the unknown. These five steps also trace our own journey toward predictable outcomes and customer success.

The context of being Always On

Keeping our customer’s promise to its user base means that we have to ensure zero data loss, zero downtime, and zero or minimal changes in the application’s code base.

Unfortunately, deep knowledge of the operating environment is a preemptive strike that rarely gets the respect it’s due. It's another lesson that we learned the hard way this year.

The elephant in our room is a distributed monolith, which precludes the lift-and-shift (rehosting) migration approach.

We tested multiple hypotheses, but the complexity of dealing with intricately interwoven services makes piecemeal and batch migrations unviable options.

Which leaves us with the big-bang approach that Shashank spoke about in our first blog.

Comparing apples to oranges in a cloud-native stack

I’m not saying that being a monolith is archaic or irrelevant, but being a distributed one in the cloud is an obnoxious burden.

My team isn't just migrating the infrastructure and data from one giant cloud to another. Instead, we are changing the entire stack, rearchitecting so to speak, to keep our customer’s promise of being Always On.

Here’s how that stacks up for us after some brainstorming sessions and experiments.

| Current Stack: Amazon Web Services (AWS) | Future Stack: Google Cloud Platform (GCP) |
| --- | --- |
| Elastic Container Service (ECS) | Google Kubernetes Engine (GKE) |
| DynamoDB (with DAX) | Spanner and Bigtable (with Redis) |
| Kinesis | Pub/Sub |
| Simple Queue Service (SQS) | Pub/Sub |
| ElastiCache | Redis Labs |
| Lambda | Cloud Functions |
| Simple Storage Service (S3) | Cloud Storage |
| Application Load Balancer (ALB) | NGINX Ingress Controller for Kubernetes (ingress-nginx) |
| Public Elastic Load Balancer (ELB)/ALB | Global Google Cloud Load Balancer (GCLB) |
| CloudWatch | Stackdriver, Grafana, and Prometheus |
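
One mapping in the table above is worth calling out: both Kinesis and SQS collapse into Pub/Sub on GCP. The following is a minimal, purely illustrative sketch of what a producer looks like after that swap; the project and topic names are placeholders, not the customer's actual setup.

```python
from google.cloud import pubsub_v1

# Placeholder project and topic names, for illustration only.
PROJECT_ID = "my-gcp-project"
TOPIC_ID = "user-activity-events"

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)

def publish_event(payload: bytes, event_type: str) -> str:
    """Publish one event; Pub/Sub stands in for both the Kinesis and SQS producers."""
    future = publisher.publish(topic_path, data=payload, event_type=event_type)
    return future.result()  # blocks until Pub/Sub returns the message ID

if __name__ == "__main__":
    message_id = publish_event(b'{"user_id": "123", "action": "like"}', event_type="engagement")
    print(f"Published message {message_id}")
```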

Let's now pull the curtain back on our five steps!

  1. Create a snapshot of the current cloud reality
  2. Choose the foundational pillars for infra and data care
  3. Load test the alternate cloud reality with live traffic
  4. Fix problems as they surface
  5. Brace yourself for the unknowable unknowns

[Prerequisite] Plan for failure when the rubber meets the road

A big-bang migration has only a binary outcome. You either fail or succeed, and even the smartest people cannot plan for what’s unknown and unknowable in the field.

And without observability tools, you’d simply be shooting in the dark. For instance, in the heat of the moment, we'd easily have 50 dashboards and hundreds of metrics being tracked for hours.

Monitoring, tracing, and logging certainly help us fix problems as they surface while we are shifting live traffic from AWS to GCP.
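
The dashboards themselves are specific to our stack, but the instrumentation pattern behind them is generic. Here's a minimal sketch, with hypothetical metric names, of the kind of per-cloud request counters and latency histograms you'd watch while traffic shifts; the real setup tracks hundreds of such series across Prometheus, Grafana, and Stackdriver.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Hypothetical metric names; the real dashboards track hundreds of series.
REQUESTS = Counter("cutover_requests_total", "Requests served during the cutover", ["cloud", "status"])
LATENCY = Histogram("cutover_request_latency_seconds", "Request latency during the cutover", ["cloud"])

def handle_request(cloud: str) -> None:
    """Stand-in for a real request handler that records outcome and latency."""
    start = time.perf_counter()
    status = "ok" if random.random() > 0.01 else "error"  # simulated outcome
    REQUESTS.labels(cloud=cloud, status=status).inc()
    LATENCY.labels(cloud=cloud).observe(time.perf_counter() - start)

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes /metrics on this port
    while True:
        handle_request(cloud="gcp")
        time.sleep(0.1)
```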

There’s no rollback to AWS after 50 percent of the traffic moves to GCP because there’s no data to sync back to AWS after this point.

Segregation of duties might seem obvious in theory, but it’s a lifesaver in the field. Collaboration ceases to be a platitude when you have barely 6 hours to complete the cutover without hangups on the user experience front.

It’s at such times that you truly appreciate sweating in peace, having fierce friends (Google's team of engineers), and heading out with a plan. Stuff still breaks, but you aren't left blindsided.

Step 1: Create a snapshot of the current cloud reality

Risk mitigation is a key concern, which led us to set the following objectives.


  • Objective #1: Decouple the services from the database
  • Objective #2: Decouple the data streaming services
  • Objective #3: Bring more visibility to the services
  • Objective #4: Control flow and audit responses
  • Objective #5: Mock cutovers every day

To address these objectives, we created an alternate universe within GCP that matches the current reality within AWS. Here’s what the current cloud reality looks like in AWS.

[Diagram: the current cloud reality in AWS]

Getting down to business

The cloud infrastructure team built the CI/CD pipeline to deploy all 75+ services and 200+ jobs across 15 GKE clusters. Meanwhile, the data team worked on replicating and syncing 220+ tables from the AWS universe to the GCP one. The data migration has its own share of challenges, and my colleagues Roshan, Krishna, and Rishi will be happy to walk you through that leg of the journey.

Step 2: Choose the foundational pillars for infra and data care

Choosing Google Kubernetes Engine (GKE)

The customer has over 5K servers in their ECS clusters for services and jobs.

To avoid the maintenance overhead and risk of cramming everything into one massive cluster, we chose a multi-cluster setup.

The following are some of the implementation decisions around GKE:

  • 15 GKE clusters host 75+ services and 200+ jobs, consuming over 25K cores at peak traffic. Each GKE cluster scales up to 255 nodes.
  • Dedicated clusters are assigned to run critical jobs. Services are distributed across GKE based on the throughput and the collocation of the dependent services and jobs.
  • Ingress-nginx is preferred over Istio and Layer 7 Internal Load Balancer (L7 ILB) for inter-cluster and intra-cluster communication.
  • DaemonSet is deployed on each GKE cluster to customize the sysctl flags at the node level.
  • NodeLocal DNSCache improves DNS lookup times and, in turn, latency. Without it, every query goes to kube-dns, which adds latency. With NodeLocal DNSCache enabled, the pods running on a given node cache DNS responses locally, resulting in fewer hits to kube-dns and faster responses (see the sketch after this list). Note: The kube-dns pods not being scheduled on node pools with taints is still an open issue.
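
To see the effect described in the last item, a quick spot-check is to time repeated lookups from inside a pod and compare the numbers with and without NodeLocal DNSCache enabled. This is a generic sketch, not our production tooling, and the service hostname is a placeholder.

```python
import socket
import statistics
import time

HOSTNAME = "example-service.default.svc.cluster.local"  # placeholder service name

def time_lookups(hostname: str, attempts: int = 20) -> list:
    """Time repeated DNS lookups in milliseconds; with NodeLocal DNSCache,
    repeat lookups on the same node should be noticeably cheaper."""
    samples = []
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            socket.getaddrinfo(hostname, 80)
        except socket.gaierror:
            continue  # skip failed lookups rather than skewing the timings
        samples.append((time.perf_counter() - start) * 1000)
    return samples

if __name__ == "__main__":
    timings = time_lookups(HOSTNAME)
    if timings:
        print(f"p50={statistics.median(timings):.2f} ms, max={max(timings):.2f} ms")
```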

Choosing Google Cloud Spanner and Bigtable

As maintenance overhead is undesirable to the customer, we aren't even considering open source software (OSS) solutions, such as Cassandra and Aerospike.

Which leaves us with managed database services on GCP that offer comparable performance in terms of latency and support for secondary indexes. Given these constraints, using Spanner and Bigtable in concert seems like the most logical conclusion.

Bigtable offers comparable performance but doesn't support secondary indexes; Spanner does, which is why we pair the two.

Our decision to use NoSQL and SQL databases in concert led to the discovery of the game changer that accelerated the cloud-to-cloud migration journey.
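
Conceptually, the split is simple: key-based, high-throughput lookups go to Bigtable, and anything that needs a secondary index goes to Spanner. The sketch below only illustrates that idea; the instance, database, table, and index names are made up, and the real routing lives inside the dynamodb-adapter service that Roshan covers in the next blog.

```python
from google.cloud import bigtable, spanner

# Illustrative names only; none of these match the customer's environment.
spanner_db = spanner.Client().instance("prod-spanner").database("social")
bt_table = bigtable.Client(project="my-gcp-project").instance("prod-bt").table("user_events")

def find_user_by_handle(handle: str):
    """Secondary-index lookup: Spanner supports this directly, Bigtable doesn't."""
    with spanner_db.snapshot() as snapshot:
        rows = snapshot.execute_sql(
            "SELECT UserId, Handle FROM Users@{FORCE_INDEX=UsersByHandle} "
            "WHERE Handle = @handle",
            params={"handle": handle},
            param_types={"handle": spanner.param_types.STRING},
        )
        return list(rows)

def read_recent_events(user_id: str):
    """Primary-key lookup at high throughput: Bigtable's sweet spot."""
    return bt_table.read_row(user_id.encode("utf-8"))
```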

Step 3: Load test the alternate cloud reality with live traffic

After several rounds of functional validation for beta users on the GCP production infrastructure, it's time to load test it.

Given our big-bang approach, running synthetic load tests for 75+ services in parallel isn't feasible. Besides, we don't want to risk corrupting our production data with test data.

The only way to succeed with end-to-end testing is to rehearse the entire production traffic in a controlled manner.

To learn how we simulated the users' experience on GCP without going live on GCP, see Inter cloud routing using the Zuul API gateway.

With the help of a custom-built solution on top of the Zuul API gateway, we can shadow live traffic from AWS to GCP and uncover several issues with our GCP setup.
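
Our shadowing logic lives in a custom Zuul filter, which is Java; the following Python sketch is only a conceptual stand-in with placeholder endpoints. The idea: keep serving users from AWS while asynchronously replaying a copy of every request against GCP, so the new stack sees production-shaped load without ever affecting the live response.

```python
from concurrent.futures import ThreadPoolExecutor

import requests

# Placeholder endpoints; the real routing runs inside a custom Zuul filter.
AWS_BASE = "https://api.aws.example.com"
GCP_BASE = "https://api.gcp.example.com"

shadow_pool = ThreadPoolExecutor(max_workers=32)

def shadow_to_gcp(path: str, payload: dict) -> None:
    """Fire-and-forget copy of the request; failures here never reach the user."""
    try:
        requests.post(f"{GCP_BASE}{path}", json=payload, timeout=2)
    except requests.RequestException:
        pass  # shadow traffic must never affect the live request path

def handle_request(path: str, payload: dict) -> requests.Response:
    # AWS still serves the user; GCP silently receives an identical copy.
    shadow_pool.submit(shadow_to_gcp, path, payload)
    return requests.post(f"{AWS_BASE}{path}", json=payload, timeout=2)
```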

Here’s how our current and alternate cloud realities stack up at the time of load testing.

[Diagram: the current (AWS) and alternate (GCP) cloud realities at load-testing time]

You've already seen the AWS stack (left) in step 1. The following image illustrates the GCP stack (right).

[Diagram: the alternate cloud reality in GCP]

Roshan will dive into this GCP universe in the next blog where he will walk you through the game changer that we now call the dynamodb-adapter service (erstwhile DB Driver).

Step 4: Fix problems as they surface

At 50 percent of peak load, Istio buckled: the setup faced socket hang-ups, increased latency, and crashes in Istio Mixer's policy functionality.

As we didn't have the luxury of time or an in-house Istio expert to handle load at this scale, we dropped Istio in favor of L7 ILB (Envoy).

Envoy performed well at 50 percent of peak load, but it started to fail at 60 percent. And that's internal traffic of over 200K requests per second (RPS) we're talking about here.

After much ado and yet another round of architectural changes, we chose ingress-nginx. That point of contention is finally put to rest, although the overall performance still isn't where I'd like it to be.

With no DAX equivalent on GCP and the customer’s requirement of zero or minimal change in the application’s code base, we spent days optimizing Redis and Spanner to prevent database snags at peak traffic.
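
To give a flavor of what filling the DAX gap looks like, here's a minimal read-through cache sketch: check Redis first, fall back to Spanner, then populate Redis with a short TTL. The table, key scheme, and TTL are hypothetical; the production setup is considerably more involved and tuned per workload.

```python
import json

import redis
from google.cloud import spanner

# Hypothetical endpoint, names, and TTL; tune these per table in practice.
CACHE_TTL_SECONDS = 30
cache = redis.Redis(host="redis.internal", port=6379)
spanner_db = spanner.Client().instance("prod-spanner").database("social")

def get_user(user_id: str):
    """Read-through cache: hit Redis first, fall back to Spanner, then warm Redis."""
    cache_key = f"user:{user_id}"
    cached = cache.get(cache_key)
    if cached is not None:
        return json.loads(cached)

    with spanner_db.snapshot() as snapshot:
        rows = list(snapshot.execute_sql(
            "SELECT UserId, Handle FROM Users WHERE UserId = @id",
            params={"id": user_id},
            param_types={"id": spanner.param_types.STRING},
        ))
    if not rows:
        return None

    user = {"user_id": rows[0][0], "handle": rows[0][1]}
    cache.setex(cache_key, CACHE_TTL_SECONDS, json.dumps(user))
    return user
```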

We've listed some of these problems in the following table.

| Problem | Cause | Resolution |
| --- | --- | --- |
| High packet drop when we increase the traffic load from 5% to 10% | Components in the GCP setup talk back to DynamoDB, Kinesis, and SQS via NAT. | Allocate a dedicated NAT per subnet and a dedicated NAT for the GKE cluster with high external traffic. |
| Hotspotting on Bigtable | Monotonically increasing IDs concentrate load on a given node, and the issue gets worse at peak traffic, creating a hotspot on that node. | Hash the row keys so that they are evenly distributed across the nodes (see the sketch after this table). The dynamodb-adapter service, which Roshan will talk about in the next blog, handles the hashing of the row keys. |
| Hotspotting on Spanner | Similar to the problem that we faced with Bigtable. Note: Spanner's design best practices can help you avoid this problem altogether, so we highly recommend that you review those guidelines. | Develop a cache layer to reduce the number of reads to Spanner. Note: Due to our project constraints, we couldn't modify the application services to adhere to Spanner's design best practices; our best options were query optimization and minor changes to some services so they use latency caching. |
| Write failures on Spanner | Unnecessary fields in each read and write operation cause write contention on the in-demand rows, resulting in a Transaction Abort exception. | Remove the irrelevant fields from all the schemas. |
| Unoptimized indexes on Spanner | Unnecessary use of indexes and multiple indexes of the same type cause this issue. The sort order is ignored, which increases contention. | Recreate all indexes with the sort key in descending order. |
| Unoptimized usage of Spanner sessions | Lack of visibility into the actual service/job-to-table mapping consumed all the available Spanner sessions, including sessions that weren't in use, putting an unnecessary burden on dynamodb-adapter and on resource consumption. | Develop a session manager that uses a session pool configuration to track the number of in-use and idle sessions. The session manager tags each session in the pool and enables dynamodb-adapter to engage only the Spanner instances that are relevant to its service. More on this in Roshan's blog! |
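
The Bigtable hotspotting fix in the table above boils down to bucketing row keys by a hash so that writes spread evenly across nodes. In production this lives inside the dynamodb-adapter service; the sketch below, with a made-up key scheme, just shows the idea.

```python
import hashlib

def hashed_row_key(entity_id: str, bucket_count: int = 64) -> bytes:
    """Prefix a monotonically increasing ID with a hash bucket so that writes
    spread across Bigtable nodes instead of hotspotting a single one."""
    digest = hashlib.sha256(entity_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % bucket_count
    # Hypothetical key scheme: "<bucket>#<original id>". The bucket is
    # recomputable from the ID alone, so point reads stay a single lookup.
    return f"{bucket:02d}#{entity_id}".encode("utf-8")

if __name__ == "__main__":
    print(hashed_row_key("post-100000001"))
    print(hashed_row_key("post-100000002"))
```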

Step 5: Brace yourself for the unknowable unknowns

After performance optimization and several days of testing at peak load, it was time to flip the switch and make GCP the current cloud reality. Everyone involved was ready to get this big-bang migration over with. However, everything came to a halt 24 hours before the actual cutover!

A stranger walks into our lives.

COVID-19 derailed our migration journey. How so?

Businesses of all shapes and sizes were switching to Google Cloud to stay alive and support remote work.

To better equip the Mumbai data center for this additional capacity, our friends at Google suggested that we delay the actual cutover date. That turned out to be a blessing in disguise because it gave us time to review all our efforts and iron out some minor issues.

Our efforts paid off when we performed that gargantuan leap from one cloud to another with zero service downtime and zero data loss.

We kept the customer’s promise of being Always On by making mundane the new black.

In the next blog, Roshan will unveil the game changer that accelerated our customer’s cloud-to-cloud migration journey with zero change in the application’s code base.

Stay tuned!