In this third part of our five-part blog series, I’m going to unveil the game changer that we've been alluding to in the first two blogs. It accelerated our customer’s cloud-to-cloud migration journey with almost no change in the application’s code base and no adverse impact on latency.
Speaking of games, this belief held by some that game changers have a 30,000-foot view of the problem piques my interest.
My two cents to this belief is that true transformation has its roots in the trenches where the foundation is laid.
And game changers, while working in those trenches, figure out exactly which foundational pieces provide the mechanical advantage to change the game itself.
From this lens, I feel that defining the problem space in terms of business challenges and customer needs seems essential to laying the foundation and gaining that advantage. So, that's where I'll start.
Defining the cloud-to-cloud migration problem space
Let’s now look at the problem space that makes this customer promise a daunting task.
| Problem | Business Challenge | Customer Need |
|---|---|---|
| A | Encountering issues at scale with DynamoDB and other business challenges with AWS, prompting the move to GCP. | To reap the full benefits of the cloud without the burden of maintenance and the subsequent need to hire specialists. |
| B | Doesn't want to renege on their service promise of being Always On. | Zero service downtime, as users' experience is paramount to their success. |
| C | The application's code base is tightly coupled with DynamoDB. Changing the code base would be a nightmare from business continuity and service excellence standpoints. | An approach that enables the application to continue to use the DynamoDB interfaces. That is, zero or minimal code change when migrating the infrastructure and data from AWS to GCP. |
| D | Doesn't want to lose the users' data when migrating from AWS to GCP. Batch and streaming data migration without data corruption or loss is daunting, to say the least. | Zero data loss, and preservation of the sequence of the streaming data. |
Shashank and Sweeti addressed problems A and B in their blogs. Rishi, Krishna, and I will address problem D in the next blog.
I am addressing problem C in this blog. So, let’s unveil this game changer.
Laying the foundation for zero code change
In the first blog, Shashank alluded to in-flight refueling maneuvers to illustrate the risks that are associated with a migration at this scale.
And he shared five objectives to mitigate risks and to ensure predictability.
Laying the foundation, in this context, means knowing your operating environment and your formidable opponent.
My formidable opponent was none other than Amazon’s NoSQL beast on steroids: DynamoDB with DAX (managed cache solution).
It's naive to think of competing against this beast with a SQL database; nobody in their right mind would consider the two comparable. DynamoDB with DAX is in a league of its own, and you cannot tame it with a SQL database alone.
So, my greatest challenge is two-fold: First, how do I tame this beast on steroids? And second, how do I accomplish this feat in less than 60 days?
Speed, cost, and quality of service are all critical, and all at the same time. Conventional project management practices balance these three pillars to offer a realistic outcome. However, the gospel of Cloud Computing and Cloud Native Computing is winning over even those on the fence, which makes such balancing discussions moot; let's not even talk about the tech startups already soaking in cloud-native technologies. And in 2020, it's pointless to resist this wave when I have a better shot at riding it. I rest my case exactly at this point.
Here I am, in the trenches, laying the foundation for that gravitational shift. And I need all the help that I could get and some more.
Finding the levers to change the game
In her blog, Sweeti described how we chose the foundational pillars for infrastructure and data care.
After testing our hypotheses, we learned that taming this beast on steroids needs three players to work in concert: Google Cloud Spanner, Bigtable, and Redis.
So, we benchmarked the setup for almost a month to figure out whether its latency was anywhere near that of DynamoDB and DAX combined. Obviously, it wasn't. However, after evaluating its impact on the users' experience, which was negligible, our customer settled for 30-50 ms of latency as acceptable.
I don’t need to tell you that DynamoDB with DAX can probably offer a response time of 3 ms.
And I benchmarked the DynamoDB database in its entirety because I feel that I might never get another opportunity to develop a solution for over 500K queries per second (QPS).
The Foundational Test
Sweeti spoke about our experience of load testing with live traffic, and you can read more about this step to customer success in her blog. This step is critical because our ability to deliver on the customer’s promise of being Always On depends on how well our foundational pieces hold up against the customer’s existing stack on AWS.
The following image illustrates how the GCP stack mirrored the customer’s AWS stack.
This image is probably hard to read, and you can see the individual cloud stacks next. However, I used it to draw your attention to the line between the external Zuul API gateway on the AWS side and Google Cloud Load Balancer (GCLB) on the GCP side, which indicates how we used a portion of ShareChat's live traffic for load testing.
The following image illustrates the customer's AWS stack.
The following image illustrates the customer's GCP stack.
Let’s briefly look at the pieces on the GCP side.
- We have 15 GKE clusters hosting 75+ services and 200+ jobs, consuming over 25K cores at peak traffic. Google Cloud Load Balancer (GCLB) is the load balancer. Ingress-nginx is in place for inter-cluster and intra-cluster communication.
- MySQL is in place for data transfer from Amazon RDS. Bigtable stores 110 tables that have no secondary indexes. Spanner stores over 110 tables that have secondary indexes. Redis is our caching layer in front of Spanner. Pub/Sub receives the change-event streams from DynamoDB (Kinesis).
- Elasticsearch Cluster is in place for search-specific data from Amazon Elasticsearch Service.
- BigQuery is the datastore for the Audit service that I’ll talk about later in this blog.
- DB Driver (now called dynamodb-adapter) is the data access layer that I'm referring to in this blog, so that’s what’s covered at length here.
With this first gigantic piece in place, which isn't as easy to come by as it is to write about, we continued to dig deeper to find the other levers to effect that gravitational shift.
Spanner Schema Choices
We had no clue to what extent the customer's application depended on the DynamoDB interface, or what all the data types and fields per object were.
Getting the schema configuration right is the prerequisite to all the subsequent tasks.
We stored the schema configuration in a table in Spanner, which has four fields: tableName, column, originalColumn, and dataType.
Spanner supports only one special character, underscore (_). However, DynamoDB supports others too. So, the column and originalColumn fields have different values when a special character other than an underscore is present.
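To make the column and originalColumn relationship concrete, here is a minimal sketch of how the name sanitization could work. The function and field names are hypothetical illustrations, not the adapter's actual API; only the four schema-configuration fields come from the text above.

```python
import re

# Hypothetical sketch: Spanner identifiers allow letters, digits, and
# underscores, so any other special character in a DynamoDB attribute
# name is replaced with an underscore. The schema-configuration row keeps
# both names so responses can be mapped back to the original attribute.
def to_spanner_column(original_column: str) -> str:
    return re.sub(r"[^A-Za-z0-9_]", "_", original_column)

def schema_config_row(table_name: str, original_column: str, data_type: str) -> dict:
    return {
        "tableName": table_name,
        "column": to_spanner_column(original_column),
        "originalColumn": original_column,
        "dataType": data_type,
    }
```

For example, a DynamoDB attribute named `profile-pic` would be stored in Spanner as `profile_pic`, with `originalColumn` preserving the source name.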
| DynamoDB Data Type | Spanner Data Type |
|---|---|
| N (number type) | FLOAT64 |
| S (string type) | STRING |
| BOOL (boolean type) | BOOL |
| M (map), L (list), NS (number set), and SS (string set) | BYTES |
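The mapping above can be sketched as a simple lookup. This is an illustrative stand-in, not the adapter's actual implementation; the type descriptors and their Spanner targets are the ones from the table.

```python
# DynamoDB type descriptor -> Spanner column type, per the mapping table.
DYNAMODB_TO_SPANNER = {
    "N": "FLOAT64",   # number
    "S": "STRING",    # string
    "BOOL": "BOOL",   # boolean
    "M": "BYTES",     # map
    "L": "BYTES",     # list
    "NS": "BYTES",    # number set
    "SS": "BYTES",    # string set
}

def spanner_type(dynamodb_type: str) -> str:
    try:
        return DYNAMODB_TO_SPANNER[dynamodb_type]
    except KeyError:
        raise ValueError(f"unsupported DynamoDB type: {dynamodb_type}")
```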
With the schema in place, we chose four application services to determine the extent of coupling between the services and DynamoDB.
The game changer that we now call dynamodb-adapter was a primitive service that I developed as a stopgap while we were all still in the trenches.
Its sole purpose was to enable the application services to field test against it so that we could figure out the gaps in coverage of the DynamoDB functionalities. Errors began to unravel the service-to-database coupling story. And that's why Shashank emphasized the need to honor the mundane routines in the first blog of this series.
It’s also why we decided to set our first objective of decoupling the services from the database.
Atomic Increments in Bigtable
Field testing moved dynamodb-adapter to its next iteration and put in place another component called Atomic Counter.
Our initial data-type mapping for the N number type within DynamoDB is FLOAT64 within Spanner. However, of the 110 tables that we migrated to Bigtable, 12 had atomic counters of type INT.
Bigtable does not support atomic increments for Float values. So, all counter fields are stored as Integer, and the other N type variables are stored as Float.
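Bigtable's increment operation works on cell values encoded as 64-bit big-endian signed integers, which is why float counters are off the table. The sketch below shows that encoding and a local stand-in for the read-modify-write increment; server-side, Bigtable performs the increment atomically per row. The function names are hypothetical.

```python
import struct

# Bigtable increments cells whose values are 64-bit big-endian signed
# integers; there is no float equivalent, so counter fields must be INT.
def encode_counter(value: int) -> bytes:
    return struct.pack(">q", value)

def decode_counter(cell: bytes) -> int:
    return struct.unpack(">q", cell)[0]

def increment(cell: bytes, delta: int) -> bytes:
    # Local stand-in: Bigtable applies this atomically on the server.
    return encode_counter(decode_counter(cell) + delta)
```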
Query and Data Transformation
As I dug deeper through the field errors and improved dynamodb-adapter, it was clear to me that I'd have to create a channel to enable communication among four distinct object types.
Bigtable’s storage and retrieval mechanism involves binary data, that of Spanner involves interface data, that of DynamoDB is totally different from Bigtable and Spanner, and DynamoDB Stream objects are a different beast altogether.
So, how do I effect change without changing the application’s code base? From this daunting question, emerged the following components.
| Component | Purpose |
|---|---|
| Config Controller | Determines which request goes to which destination: Spanner, Bigtable, or Pub/Sub |
| Query Transformer | Transforms service queries into a format that Spanner and Bigtable understand |
| Response Parser | Transforms query responses from Spanner and Bigtable into the DynamoDB format |
| DynamoDB Stream Object Collector and Creator | Collector: receives the old and new image objects from DynamoDB Streams. Creator: creates and publishes individual objects to Pub/Sub |
Up to this point, I had built seven components to solve the technical challenges as they surfaced: Schema Configuration, Atomic Counter, Config Controller, Query Transformer, Response Parser, DynamoDB Stream Object Collector, and DynamoDB Stream Object Creator.
Each of the following components has a behind-the-scenes drama that can lend itself to another blog series. But to spare you that drama and for the sake of brevity (considering how long this blog already is), I'll try my best to cover the highlights.
The Sidecar Pattern
I placed dynamodb-adapter (erstwhile DB Driver) within the same Kubernetes pod as the application service for three reasons:
- To cater to the abrupt nature of the customer’s user traffic
- To avoid the overhead in terms of cluster and infrastructure maintenance
- To reduce the network hop, which reduces latency, because running dynamodb-adapter as a sidecar makes it available on a loopback address
This sidecar pattern, as illustrated in the following image, enables dynamodb-adapter (erstwhile DB Driver) to scale along with the application service and eliminates the burden of maintenance.
Session Count Manager
We had 15 GKE clusters hosting over 75 services and 200 jobs, consuming over 25K cores at peak traffic. And dynamodb-adapter had access to seven Spanner instances.
If we consider 100 sessions, our total session count is as follows:
Session_count = 25K*7*100 = 17.5 million sessions
where 25K is the number of pods, 7 is the number of Spanner instances, and 100 is the minimum number of sessions
You might be wondering why we are engaging all seven Spanner instances. That seems ludicrous, right?
The answer lies in the second lesson that Shashank shared in the first blog: "Run away as fast and as far as you can from a distributed monolith."
Because of the inherent nature of the customer's platform, we had no clue about service-to-table mapping.
It’s something that Sweeti also highlighted in the second blog when she was fixing problems as they surfaced.
All services have direct access to a common data pool, which is not the case in a microservices architecture. Each microservice has access to only its own data, and access to another microservice's data is through a service-to-service call.
As always, the Google Spanner team was by our side. Through their internal logs, they figured out which application service was hitting which Spanner instance, thereby enabling us to engage only the relevant Spanner instances.
Addressing this issue that was deep in the trenches led me to create Session Manager.
Session Manager tags each session in the session pool and enables each dynamodb-adapter service to engage only those Spanner instances that are relevant to its service.
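Here is a minimal sketch of the Session Manager idea: sessions are tagged with the Spanner instance they belong to, and each dynamodb-adapter sidecar creates sessions only for the instances its service actually uses. The class and method names are hypothetical illustrations, not the adapter's actual API.

```python
# Hypothetical sketch: per-service session pools keyed by Spanner instance.
class SessionManager:
    def __init__(self, sessions_per_instance: int = 100):
        self.sessions_per_instance = sessions_per_instance
        self.pools = {}  # Spanner instance id -> list of tagged sessions

    def register(self, service: str, instances: list):
        # Create sessions only for the instances relevant to this service,
        # instead of engaging all seven Spanner instances from every pod.
        for instance in instances:
            self.pools.setdefault(instance, [
                f"{service}/{instance}/session-{i}"
                for i in range(self.sessions_per_instance)
            ])

    def total_sessions(self) -> int:
        return sum(len(pool) for pool in self.pools.values())
```

A service that touches only one Spanner instance then holds 100 sessions instead of 700, which is exactly the multiplier the session-count formula above was suffering from.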
Slow Query Logging
Having fixed problems as they surfaced, we continued to face new challenges. So, it was time to mitigate risks through Shashank's third objective: Bring more visibility to the services.
In her blog, Sweeti explained why we replaced Istio and Envoy with ingress-nginx and how that addressed some performance issues. However, she was unhappy with the overall performance because we couldn’t figure out why responses to four key services were exceeding the timeout threshold of two minutes.
We knew that building a caching layer in front of Spanner could address some of these issues, so we added Redis. However, we still needed to figure out what was causing the performance snag.
To view all the queries and indexes across the system, we used Lightstep (an observability tool).
I investigated all the queries that hit the timeout threshold and developed a component called Slow Query Logging to identify and ignore the slow queries, that is, queries that take more than 10 seconds to respond.
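The idea can be sketched as a wrapper around the query path: time each call and log anything that crosses the threshold so it can be investigated. This is an illustrative sketch, not the component's actual implementation; the 10-second threshold is the one mentioned above.

```python
import logging
import time

log = logging.getLogger("slow-query")

SLOW_QUERY_THRESHOLD_SECONDS = 10  # queries slower than this get flagged

# Hypothetical sketch: decorate the query execution path so that every
# slow query is logged with its elapsed time for later investigation.
def log_slow_queries(run_query):
    def wrapper(query, *args, **kwargs):
        start = time.monotonic()
        try:
            return run_query(query, *args, **kwargs)
        finally:
            elapsed = time.monotonic() - start
            if elapsed > SLOW_QUERY_THRESHOLD_SECONDS:
                log.warning("slow query (%.1f s): %s", elapsed, query)
    return wrapper
```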
Data Consistency with DB Audit
The DB Audit component ensures data consistency between DynamoDB and Spanner/Bigtable.
When a query reaches Spanner or Bigtable, DB Audit fetches the row from DynamoDB and performs a row-to-row data audit.
It sends both the objects from DynamoDB and Spanner/Bigtable to the Audit service via REST API, which calculates the difference between the objects. BigQuery serves as the datastore for this service.
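The row-to-row comparison at the heart of the audit can be sketched as a field-level diff: given the same row fetched from DynamoDB and from Spanner/Bigtable, report every field whose values disagree, plus fields present on only one side. The function and field names are hypothetical; the actual Audit service computes the difference server-side.

```python
# Hypothetical sketch of the row-to-row audit diff.
def audit_diff(dynamodb_row: dict, target_row: dict) -> dict:
    diff = {}
    for field in set(dynamodb_row) | set(target_row):
        left = dynamodb_row.get(field)
        right = target_row.get(field)
        if left != right:
            diff[field] = {"dynamodb": left, "target": right}
    return diff  # an empty dict means the rows are consistent
```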
NOTE: DB Audit isn’t a core component of dynamodb-adapter, but it’s critical to boost your confidence during such migrations because it tells you if everything crossed over. See the last blog of this series to learn how auditing increased our confidence. Also, I recommend that you turn on DB Audit to ensure data consistency if you intend to explore a database-migration route from DynamoDB to Spanner.
Using the levers to gain mechanical advantage and change the game
Being born in the cloud helps us tremendously because it naturally makes the cloud our foundation (home turf), so helping our customers ride the cloud is second nature to us.
But while working in the trenches, we figured out exactly which foundational pieces provide that mechanical advantage to change this migration game.
The following image illustrates how the components fit together to provide this mechanical advantage.
All of this materialized without touching the application's code base and without any increase in the agreed-upon latency.
Config Controller emerged as the primary lever, and it enabled dynamodb-adapter (erstwhile DB Driver) to become almost 80% configuration-driven.
Config Controller controls the read and write operations on a per-table basis. It also controls the operations of the DB Audit component on a percentage basis per table.
We stored its configuration in one of the Spanner tables with the following columns.
| Column | Description |
|---|---|
| tableName | Stores the name of the table. |
| config | Stores the read, write, and audit configuration: Read = 1 or 0, Write = 1 or 0, Audit_percentage = 0 to 100%. |
| uniqueValue | Stores information that triggers configuration updates in dynamodb-adapter. |
| cronTime | Stores the polling time, in minutes. |
| enabledStream | Stores the value that determines whether DB Streams is enabled on the table. Its value can be 0 or 1. |
| pubsubTopic | Stores the name of the Pub/Sub topic to which the data stream is pushed. Its value can be 0, 1, or the topic name. If the value is 1, the data stream is pushed to the default topic. If the value is a topic name, the data stream is pushed to that topic. |
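A sketch of how such a config row could be interpreted per request follows. The field semantics come from the table above; the function names and the default topic name are hypothetical illustrations, not the adapter's actual implementation.

```python
import random

# Hypothetical: resolve the pubsubTopic value (0, 1, or a topic name).
def resolve_stream_topic(pubsub_topic, default_topic="dynamodb-adapter-stream"):
    if pubsub_topic in (0, "0", None):
        return None              # streaming disabled for this table
    if pubsub_topic in (1, "1"):
        return default_topic     # push to the default topic
    return pubsub_topic          # push to the named topic

# Hypothetical: audit a configured percentage of requests per table.
def should_audit(audit_percentage: int, rng=random.random) -> bool:
    return rng() * 100 < audit_percentage
```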
[Spoiler Alert] We are collaborating with the Google Cloud Spanner team to develop dynamodb-adapter as an open source project under the Cloud Spanner Ecosystem.
I’ll end this blog with my quote that Shashank shared with you in the first blog. And I’m sure that you now have a better view of why I said what I said in our first blog.
We are creating a query translation engine that performs string processing logic and complicated database operations.
Latency should increase, right? But, it does not. We delivered a solution with no increase in latency!
In the next blog, Rishi, Krishna, and I will share with you the data migration story that enabled our customer to say with absolute confidence that they migrated from one giant cloud to another with zero data loss. Until then!
How can we help?
At CloudCover, we are always looking forward to the next challenge. Drop us a line; we would love to hear from you.