I had a bad feeling about this cloud-to-cloud migration project from the get-go. Nothing about this endeavor seemed realistically achievable to me.
And I'm speaking from a place of experience because my team has completed a ton of migration projects. In fact, we're the specialists that our friends at Google holler at when they receive challenging projects. So, we don't shy away from tough migrations. However, this cloud-to-cloud migration project is certainly different and daunting. Everything about it seems unwieldy: its scale of operations is unprecedented, its architectural design is intricate to say the least, its project constraints are rigid, and its timing seems to mark the advent of COVID-19.
All the forewarnings, except for COVID-19, are already in plain sight. But we still sign up to support one of India’s largest social networking service providers in their migration journey from AWS to GCP.
The context of a risk-mitigation plan
Here are some stats about our customer that might give you a glimpse into their scale of operations:
- 60 million Monthly Active Users (MAU) and over 10 million Daily Active Users (DAU)
- Total peak-time read of 2 million queries per second (QPS) and total peak-time write of 560K QPS
Cost and time are primary project constraints, but so are zero downtime, zero data loss, and zero or minimal changes in the application's code base. Which means that service quality is a non-negotiable too, not that anyone likes to drop the ball on that one.
Against all odds, we lived to tell the tale. However, this story of success is so multithreaded that we had to split our whole experience into a five-part blog series just to present the bare bones!
Our project manager, Shashank, shared his perspectives as risk-mitigation lessons. Our data champions, Rishi and Krishna, shared their data migration story. And our cloud infrastructure team has its own share of learnings.
I've shared some of our technical challenges in my other blog. Shashank and I are now wrapping up this blog series with our 3-pronged approach to a risk-mitigation plan.
How to sense and observe complexity during live migrations?
Business as usual for this customer means ensuring zero downtime while migrating over 75 services, 200 jobs, and 220 tables (approx. 80 TB of total data) from AWS to GCP.
Tons of stuff is flying around, and we don't know where and how we should begin our sense-making process. Knowing the operating environment is critical to any plan of action, so we decide to examine issues as they arise.
One approach that seems apt to initial sense-making is to move from being a distributed team to a centralized one. Obviously, COVID-19 moved us all back to remote work. However, that initial, face-to-face engagement is crucial for project clarity.
A distributed monolith poses yet another challenge to our sense-making efforts. Sweeti and Roshan have blogs dedicated to this bone of contention, so I won't delve into this challenge in this blog.
An observability tool, such as Lightstep, helps immensely with root cause analysis (RCA). But we realized its true value only because we had an audit plan in place. In fact, we set two project objectives to steady our initial sense-making efforts and to figure out how to mitigate our overall project risks.
The first objective is to audit the flow of all the responses and the second is to mock cutovers every day. With an arguably mundane ritual in place, we can move forward with more confidence.
Why recreate a snapshot of AWS in Google Cloud?
To meet the objectives that we set during our sense-making process, we need a handle on the current reality of the customer's AWS environment.
So, we created an alternate reality in GCP and decided to load test this environment with live traffic. We directed a portion of the customer's live traffic to GCP, but we didn't send the response from GCP to the users.
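The idea of mirroring a share of live traffic while discarding the mirrored response can be sketched roughly as follows. This is a minimal illustration, not the customer's Zuul-based routing: `make_shadowing_router`, `primary`, `shadow`, and `shadow_percent` are hypothetical names, and a production setup would mirror requests asynchronously rather than inline.

```python
import random
from typing import Callable, Optional

# A handler takes a request (modelled here as a dict) and returns a response dict.
Handler = Callable[[dict], dict]


def make_shadowing_router(
    primary: Handler,
    shadow: Handler,
    shadow_percent: float,
    rng: Optional[random.Random] = None,
) -> Handler:
    """Build a handler that always answers from `primary` (think: AWS) and
    mirrors roughly `shadow_percent` percent of requests to `shadow` (think: GCP).

    All names here are hypothetical; this only illustrates the routing rule.
    """
    if not 0 <= shadow_percent <= 100:
        raise ValueError("shadow_percent must be between 0 and 100")
    rng = rng or random.Random()

    def route(request: dict) -> dict:
        # The user-facing response always comes from the primary environment.
        response = primary(request)
        if rng.random() * 100 < shadow_percent:
            try:
                # Send a copy so the shadow side cannot mutate the original request.
                shadow(dict(request))  # the shadow response is deliberately discarded
            except Exception:
                # A failure in the shadow environment must never reach the user.
                pass
        return response

    return route
```

Raising `shadow_percent` step by step corresponds to increasing the share of load-tested traffic, and the users keep receiving the primary response throughout.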
A team member from the customer's side describes this effort in his blog on inter cloud routing using the Zuul API gateway.
As we increased the percentage of load testing, problems began to surface. Fixing problems as they arise is pretty much the theme of my other blog. I unabashedly recommend it if you want to know how we sweat the details on optimizing for performance.
Let's now circle back to risk mitigation through increased visibility.
Recreating a snapshot of AWS in GCP helps us ensure parity between the responses in both the environments. It's no accident that Roshan attributes immense value to our groundwork of auditing the flow of all the responses.
He's on point when he talks about finding the levers to change this migration game. This very process of creating an alternate reality in GCP serves as a catalyst for the audit application that Roshan built, which increased visibility.
The genesis of a 3-pronged, risk-mitigation plan
Zero or minimal code change is one of our primary project constraints because the application’s code base is tightly coupled with DynamoDB.
Changing the code base would be a nightmare from a time and resource lens. So, Roshan put in a ton of effort to find a way that enables the application to continue to use the existing DynamoDB interfaces.
We chose a combination of Cloud Spanner, Bigtable, and Redis to take on DynamoDB with DAX, and Roshan benchmarked this setup for almost a month to evaluate the potential impact on the users' experience.
You can read the details in his blog on what we call his innovative solution, the dynamodb-adapter service (erstwhile DB Driver).
But we want to highlight the reason for its existence.
The dynamodb-adapter service, in its primitive form, helped us figure out the extent of coupling between the application services and DynamoDB.
It's part of that "distributed monolith" challenge that all of us have been ranting about for the umpteenth time now. So, we'll spare you the drama.
Why is dynamodb-adapter relevant to this discussion?
Because this service that Roshan built has an optional component called DB Audit, which provides visibility on the degree of data consistency between DynamoDB and Spanner/Bigtable.
It's also what helps us ensure parity between the responses in the AWS and the GCP environments.
The following image of dynamodb-adapter (erstwhile DB Driver) illustrates the DB Audit component and how it communicates with a custom-built audit service.
#1 Auditing at the API and data level for response parity
The DB Audit component within dynamodb-adapter initiates the following steps to check if the application services work correctly and if there is parity in the responses:
- For each request to Spanner or Bigtable, DB Audit fetches the relevant row from DynamoDB.
- It sends the requests and responses from DynamoDB and from Spanner or Bigtable to the Audit service. Because the Audit service receives them asynchronously, it stores them in Bigtable.
- The Audit service performs a JSON diff on the responses from both the environments. In case of a difference, it pushes the diff to Pub/Sub and InfluxDB.
- Grafana uses the data from InfluxDB to help visualize the overall API audit data.
- Cloud Dataflow jobs extract the messages from Pub/Sub and load them to BigQuery. Data Studio then helps visualize mismatches, if any.
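The JSON diff step in the list above can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual Audit service: `json_diff` and `audit_responses` are hypothetical names, and the `publish` callback merely stands in for the push to Pub/Sub and InfluxDB.

```python
from typing import Any, Callable, Dict

_MISSING = object()  # sentinel for "key or index absent on one side"


def _show(value: Any) -> Any:
    """Render the sentinel as a readable marker in the diff output."""
    return "<missing>" if value is _MISSING else value


def json_diff(expected: Any, actual: Any, path: str = "$") -> Dict[str, dict]:
    """Return {json_path: {"expected": ..., "actual": ...}} for every mismatch."""
    diffs: Dict[str, dict] = {}
    if isinstance(expected, dict) and isinstance(actual, dict):
        # Walk the union of keys so additions and removals are both reported.
        for key in sorted(expected.keys() | actual.keys()):
            diffs.update(json_diff(expected.get(key, _MISSING),
                                   actual.get(key, _MISSING),
                                   f"{path}.{key}"))
    elif isinstance(expected, list) and isinstance(actual, list):
        # Compare lists position by position; extra elements show up as missing.
        for i in range(max(len(expected), len(actual))):
            left = expected[i] if i < len(expected) else _MISSING
            right = actual[i] if i < len(actual) else _MISSING
            diffs.update(json_diff(left, right, f"{path}[{i}]"))
    elif expected is _MISSING or actual is _MISSING or expected != actual:
        diffs[path] = {"expected": _show(expected), "actual": _show(actual)}
    return diffs


def audit_responses(request_id: str, dynamodb_response: Any, gcp_response: Any,
                    publish: Callable[[dict], None]) -> Dict[str, dict]:
    """Diff the two responses and publish only when they disagree.

    `publish` is a hypothetical hook standing in for Pub/Sub and InfluxDB.
    """
    diff = json_diff(dynamodb_response, gcp_response)
    if diff:
        publish({"request_id": request_id, "diff": diff})
    return diff
```

Publishing only the mismatches keeps the downstream dashboards focused on parity failures rather than on every audited request.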
The following image illustrates the API audit dashboard.
#2 Auditing data lag for data integrity
In the data migration blog, Rishi, Krishna, and Roshan spoke about the pivotal role of auditing in ensuring the integrity of the customer's streaming data.
We had set a batch interval of 5 seconds to avoid the per-second transaction cost of moving a huge volume of streaming records from Kinesis Data Streams.
Which meant that we had to ensure that we moved all the data from AWS to GCP before the actual cutover.
Roshan built a stream processing application to ensure that we maintain the integrity and the sequence of all the streaming data.
He also added functionality to this application to calculate the lag between the time it receives data from AWS Lambda and the time when the data is written to Spanner or Bigtable.
The application pushes this time-series data to InfluxDB, and Grafana integrates with InfluxDB to enable visualization of this data lag for each of the 220+ tables.
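The lag calculation described above can be sketched along these lines. This is an illustrative, in-memory version only: `LagTracker` is a hypothetical name, timestamps are assumed to be epoch seconds, and the real application pushes the values to InfluxDB instead of keeping them in a dictionary.

```python
from collections import defaultdict
from typing import Dict, List


class LagTracker:
    """Track per-table lag between receipt of a record and its write to the target store."""

    def __init__(self) -> None:
        self._lags: Dict[str, List[float]] = defaultdict(list)

    def record(self, table: str, received_at: float, written_at: float) -> float:
        """Store and return the lag in seconds for one record.

        `received_at`: when the record arrived (from AWS Lambda, per the text above).
        `written_at`: when the write to Spanner or Bigtable completed.
        """
        lag = written_at - received_at
        if lag < 0:
            raise ValueError("written_at must not precede received_at")
        self._lags[table].append(lag)
        return lag

    def summary(self, table: str) -> dict:
        """Return count, max, and average lag for a table (what a dashboard panel might plot)."""
        lags = self._lags.get(table)
        if not lags:
            return {"count": 0}
        return {
            "count": len(lags),
            "max_s": max(lags),
            "avg_s": sum(lags) / len(lags),
        }
```

A per-table summary like this is the kind of signal that shows whether all the streaming data has caught up before a cutover.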
The following image illustrates the details for one such table.
#3 Auditing access patterns to see service-to-table mappings
We kept our challenges of dealing with a distributed monolith at the forefront of all our discussions.
So, the need to gain control through increased visibility into the level of service-database coupling seems natural to us.
To douse this burning need, we disabled the write operations in the configuration file. We then monitored the access patterns between the services and the tables.
We also spotted jobs with active write operations to the tables when the traffic was down to zero.
By monitoring the service account names and the table names, we figured out which services communicated with which tables and how often. We used InfluxDB to store this audit data, and it had a retention period of 24 hours.
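A rough sketch of this kind of access-pattern audit is shown below. It is an in-memory stand-in for the InfluxDB-backed setup: `AccessAudit` and its methods are hypothetical names, timestamps are assumed to be epoch seconds, and events are assumed to arrive in time order so that expiry can simply drop from the front of the queue.

```python
from collections import Counter, deque
from typing import Deque, Set, Tuple

RETENTION_SECONDS = 24 * 60 * 60  # mirrors the 24-hour retention mentioned above

# (timestamp, service_account, table, operation)
Event = Tuple[float, str, str, str]


class AccessAudit:
    """Record which service account touched which table, with time-based expiry."""

    def __init__(self, retention_s: float = RETENTION_SECONDS) -> None:
        self._events: Deque[Event] = deque()
        self.retention_s = retention_s

    def record(self, ts: float, service_account: str, table: str, operation: str) -> None:
        # Assumes events are recorded in non-decreasing timestamp order.
        self._events.append((ts, service_account, table, operation))

    def _expire(self, now: float) -> None:
        # Drop events older than the retention window.
        while self._events and self._events[0][0] < now - self.retention_s:
            self._events.popleft()

    def mapping(self, now: float) -> Counter:
        """Count accesses per (service_account, table) pair within the window."""
        self._expire(now)
        return Counter((sa, table) for _, sa, table, _ in self._events)

    def writers(self, now: float) -> Set[Tuple[str, str]]:
        """Return (service_account, table) pairs that still issue writes."""
        self._expire(now)
        return {(sa, table) for _, sa, table, op in self._events if op == "write"}
```

Calling `writers` while user traffic is at zero is one way to surface background jobs that still write to the tables.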
In my other blog, I spoke about the unoptimized usage of Spanner sessions, which stems from this lack of visibility about the actual service-to-table mapping.
So, auditing these access patterns helped us address some of this unoptimized usage and lift some of the unnecessary burden on resource consumption.
Increasing visibility to mitigate risks
Now that you know our 3-pronged, risk-mitigation plan, you probably get why we attribute our success to this perpetual need for increased visibility through audits.
Perhaps it's worth reiterating our objectives to show how all five blogs in this series come together. Let us know if you think otherwise. We'd love to hear from you!
|Objective #1|Objective #2|Objective #3|Objective #4|Objective #5|
|---|---|---|---|---|
|Decouple the services from the database|Decouple the data streaming services|Bring more visibility to the services|Control flow and audit responses|Mock cutovers every day|
Our auditing approach puts each of these five objectives into motion.
For example, for objective #1, we monitor the access patterns to determine the service-to-table mappings. This action helps us address potential data loss challenges that can arise because of the level of service-to-database coupling.
Similarly, for objective #2, we monitor the lag in data transfer when data moves from Kinesis Data Streams through AWS Lambda to the stream processors and onward to Spanner and Bigtable. This action helps us maintain data integrity.
Objectives #3, #4, and #5 form the core of our auditing effort, which is also why we have this blog in our five-part blog series.
Reflecting on our AWS to GCP migration journey
Sweeti and I signed up to support this migration project with uncertainty about our execution strategy and our probability of success.
To this day, I am unsure if we could have pulled this off without all those insane levels of auditing, mundane ritual honoring, and let-off-steam kind of intense conversations.
Our customer's scale of operations and the project constraints definitely bred creativity of a different kind. And we are ecstatic that Roshan's innovative solution, which we now call dynamodb-adapter (erstwhile DB Driver), will soon join the Cloud Spanner Ecosystem...Woohoo!
- What if we were to lift these constraints?
- What if service downtime was an option?
- What if we could change the application's code base?
Reflecting on all these what-if questions, I wonder how we would approach this migration project.
I know that we would definitely want to stop and rethink our big-bang migration approach. Also, we all agree that dynamodb-adapter (erstwhile DB Driver) and the stream processor wouldn't even exist if those what-if questions were true for us.
We would most definitely follow Spanner's design best practices to avoid all those hotspotting issues altogether, which would save us some heartburn.
Perhaps we'd get rid of all the secondary indexes and make this a NoSQL to NoSQL migration from DynamoDB to Bigtable, which was our first solution anyway.
With all of that said and even if we ignore all of what we just said we'd do, one action is certain. We would strangulate that distributed monolith and give it a proper microservices architecture.
I hope that you can now get a sense of why I said, "Run away as fast and as far as you can from a distributed monolith."
We've thoroughly enjoyed sharing our learnings, technical challenges, innovative solutions, and emotions with you during the course of this blog series.
That's us trying to squeeze out and amplify our shared moments of joy and camaraderie. You can see all our blog authors here and some of our other buddies. Roshan likes to call himself Neo, and you get no points for guessing that he is a fan of The Matrix.
We hope that you find each of these blogs applicable in some shape or form to your cloud-to-cloud migration journey. If you do, we'd love to hear all about it. And if you don't, we'd love to hear that all the more!
How can we help?
At CloudCover, we are always looking forward to the next challenge. Drop us a line; we would love to hear from you.