CloudCover + AWS Data Lake + Customer
Customer - A high-growth biotechnology company based in Singapore and the USA
- Build an effective data lake solution from disparate data sources
- Deliver near real-time analytics to drive insights and innovation
- Handle PII, making compliance a top priority
- Follow the vetted nine-step journey for building a data lake on AWS
About the Customer
The customer is a high-growth biotechnology company based in Singapore and the USA. They manufacture and deploy wearable healthcare devices for the elderly. These devices collect multiple data points about the human body (heart rate, ECG parameters, gait angle, dosage requirements, etc.). These raw data points are stored on AWS in S3 and DynamoDB without any data processing. In addition, they were ingesting multiple third-party APIs and CSV/Excel dumps into S3.
This is a fairly complicated task. The driving factors of this complexity are disparate data sources, a combination of batch and streaming data, and gaps in the skill sets needed to build scalable pipelines. The end result is data teams building under-optimized data lakes that are far from best practices for data sanity, security, and scalability.
There were multiple challenges we faced along the way in fulfilling the objectives of the project. The key problem statement was to build a data lake that would enable data scientists to ask questions of the data, support machine learning models built on top of the data, and provide their partners (healthcare providers) with analytics.
A data lake is built to enable queries that resolve quickly and cost-effectively.
The goal statement above was found to be too broad to engineer an effective data lake.
This influences the source identification and the intermediate ETL/ELT layers. Our first step for this customer was to ingest all the data into S3 from all the sources. Before any code was written for the next step, we held multiple rounds of discussions with stakeholders to understand what problems they wanted to solve.
We jointly came up with a list of tangible business outcomes and problem statements. Each outcome and problem statement is usually composed of a basket of queries.
Questions we asked ourselves and the customer
Sources of Data?
“In order to solve this problem statement, what sources of data do I need to look into?”
What and How?
“For each source of data what are the key attributes, how much data is there and how will it need to be pre-processed?”
“How do we combine data from these sources to solve the problem statement?”
At this point, ETL jobs were written to partition, combine, chain + massage the sources and reach a queryable output.
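As a minimal sketch of such an ETL step, the following joins two hypothetical sources (device readings and a third-party patient dump) and derives a queryable output. The field names and the `combine_sources` helper are illustrative assumptions, not the customer's actual schema.

```python
# Hypothetical ETL step: join raw wearable readings with patient records
# and massage them into a flat, queryable shape. Schema is illustrative.

def combine_sources(readings, patients):
    """Join device readings with patient metadata on patient_id."""
    patients_by_id = {p["patient_id"]: p for p in patients}
    output = []
    for r in readings:
        patient = patients_by_id.get(r["patient_id"])
        if patient is None:
            continue  # drop readings with no matching patient (sanitization)
        output.append({
            "patient_id": r["patient_id"],
            "age": patient["age"],
            "heart_rate": r["heart_rate"],
            "elevated": r["heart_rate"] > 100,  # simple derived attribute
        })
    return output

readings = [
    {"patient_id": "p1", "heart_rate": 72},
    {"patient_id": "p2", "heart_rate": 110},
    {"patient_id": "p9", "heart_rate": 80},  # no matching patient record
]
patients = [
    {"patient_id": "p1", "age": 78},
    {"patient_id": "p2", "age": 83},
]
rows = combine_sources(readings, patients)
```

In the actual pipeline this kind of logic runs as an AWS Glue job over the S3 sources rather than over in-memory lists.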
Defined problem statement?
“Is the output satisfactory?”
If the problem statement is not clearly defined in the first step, then the preceding steps have to be repeated multiple times until the end goal is achieved.
Data Lake Solution
CloudCover designed a data lake strategy after meetings with key stakeholders.
From a technical standpoint, the key points of discussion were: sources of ingestion, types of data sources, data lake schema, database schema, intermediate datastores, the ETL pipeline, tangible use cases, and access patterns for the data.
We deployed a combination of cloud-native and open-source tools on AWS to set up the lake. These are:
AWS Glue for ETL
Aurora for intermediate state
Glue catalog for metadata
Crawlers to discover sources
S3 as the main parquet/ORC data lake
IAM policies for least privilege access
Jupyter/Zeppelin notebooks for data scientists
Terraform for IaC
Cloudwatch for monitoring and logging
Athena for querying
The key premise was to support, out of the box, disparate data sources (batch, streaming, REST, and SQL/NoSQL databases, whether self-hosted or on other public clouds). Boilerplate ingest pipelines for these sources were deployed and scheduled to ingest into AWS. The data sanitization and ETL layer was engineered to load the data sinks and the data lake.
The end result is a columnar data lake sitting on S3. Once this solution is built, we can deploy any layer for querying and configure multiple sinks for the data depending on the access patterns and stakeholders.
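To illustrate why a columnar layout (the storage principle behind Parquet and ORC) makes analytical queries cheap, here is a toy sketch in plain Python: an aggregate over one attribute touches only that attribute's values, not whole rows. This demonstrates the idea only, not the actual Parquet file format.

```python
# Toy illustration of row-oriented vs column-oriented storage.
rows = [
    {"patient_id": "p1", "heart_rate": 72, "gait_angle": 4.2},
    {"patient_id": "p2", "heart_rate": 110, "gait_angle": 3.9},
    {"patient_id": "p3", "heart_rate": 95, "gait_angle": 4.0},
]

# Columnar layout: one contiguous list per attribute.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# An aggregate like AVG(heart_rate) scans a single column,
# leaving every other column untouched on disk.
avg_hr = sum(columns["heart_rate"]) / len(columns["heart_rate"])
```

Athena exploits exactly this property when querying Parquet/ORC on S3: it reads only the columns a query references, which lowers both latency and the per-query scan cost.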
What about future use cases?
The common issue is this: if a data lake is built with specific problem statements in mind, how do we ensure that future users can still apply the lake to problem statements that are as yet unknown?
The solution here is to have a staging data lake set up in S3 to test out future queries. ETL pipelines are broken down into cleaning and preprocessing steps to ensure that they will be reusable for future use cases. However, adding a new source or use case would entail experimentation. Having a staging data lake removes the overhead of creating resources from scratch. Additionally, since access patterns for users remain constant, the only additions/modifications are to the ingestion pipelines (in the case of a new source) and the ETL pipelines.
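A minimal sketch of that decomposition (the function names and fields are illustrative assumptions): the cleaning stage is generic and source-agnostic, so it can be reused verbatim for future use cases, while each use case layers its own preprocessing on top.

```python
# Hypothetical split of an ETL pipeline into a reusable cleaning stage
# and a use-case-specific preprocessing stage.

def clean(records):
    """Generic, reusable cleaning: drop records with no patient id,
    and deduplicate on (patient_id, timestamp)."""
    seen, out = set(), []
    for r in records:
        if r.get("patient_id") is None:
            continue
        key = (r["patient_id"], r.get("ts"))
        if key in seen:
            continue
        seen.add(key)
        out.append(r)
    return out

def preprocess_for_alerts(records):
    """Use-case-specific step: keep only elevated heart-rate readings."""
    return [r for r in records if r.get("heart_rate", 0) > 100]

raw = [
    {"patient_id": "p1", "ts": 1, "heart_rate": 72},
    {"patient_id": "p1", "ts": 1, "heart_rate": 72},  # duplicate
    {"patient_id": None, "ts": 2, "heart_rate": 50},  # missing id
    {"patient_id": "p2", "ts": 3, "heart_rate": 120},
]
alerts = preprocess_for_alerts(clean(raw))
```

A new use case would add only its own `preprocess_*` function against the staging lake; `clean` stays untouched.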
The end users of the data lake access the data in a fairly consistent manner, though the patterns usually differ by team. Data analysts may use Excel output or a SQL tool, engineers write queries using an ORM and Python, data scientists use Jupyter notebooks, and business users want to view reports in a dashboard.
Steps in Data Lake Journey
Step 1: Customers are guided to list down all problem statements and use cases
Step 2: List down all data sources
Step 3: List down access patterns (how will stakeholders access or query the data)
Step 4: Build data ingestion pipelines
Step 5: Map out the data schema, partition strategy and data models
Step 6: Build the ETL jobs
Step 7: Test the business problem with queries or code tests on the processed data
Step 8: Iterate as required
Step 9: Productionize for security and scalability
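For step 5, one common partition strategy on S3 is Hive-style, date-based key prefixes, which Glue crawlers recognize and Athena can use to prune the data scanned per query. A sketch of such a key builder follows; the table and file names are made up for illustration.

```python
from datetime import datetime, timezone

def partition_key(table, event_time, filename):
    """Build a Hive-style partitioned S3 key, e.g.
    table/year=2021/month=05/day=07/part-0001.parquet"""
    return (
        f"{table}/year={event_time.year:04d}"
        f"/month={event_time.month:02d}"
        f"/day={event_time.day:02d}/{filename}"
    )

ts = datetime(2021, 5, 7, 12, 30, tzinfo=timezone.utc)
key = partition_key("heart_rate_readings", ts, "part-0001.parquet")
```

With this layout, a query filtered to a date range reads only the matching `year=/month=/day=` prefixes instead of the whole table.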