Case Study

CloudCover + AWS Data Lake + Customer

Customer - A high-growth biotechnology company based in Singapore and the USA

tl;dr

  • Build an effective data lake solution from disparate data sources
  • Deliver near real-time analytics to drive insights and innovation
  • Make compliance for PII data a top priority
  • Follow the vetted 9-step journey for building a data lake on AWS

About the Customer

The customer is a high-growth biotechnology company based in Singapore and the USA. They manufacture and deploy wearable healthcare devices for the elderly. These devices collect multiple data points about the human body (heart rate, ECG parameters, gait angle, dosage requirements, etc.). These raw data points were stored on AWS in S3 and DynamoDB without any data processing. In addition, they were ingesting data from multiple third-party APIs as well as CSV/Excel dumps into S3.

Building a data lake from this starting point is a fairly complicated task. The complexity is driven by disparate data sources, a mix of batch and streaming data, and the skill sets needed to build scalable pipelines. The usual end result is data teams building under-optimized data lakes that fall far short of best practices for data sanity, security, and scalability.

Problem Statement

The key problem statement was to build a data lake that enables data scientists to ask questions of the data

There were multiple challenges we faced along the way in order to fulfill the objectives of the project. The key problem statement was to build a data lake that enables data scientists to ask questions of the data, build machine learning models on top of it, and provide the customer's partners (healthcare providers) with analytics.

Approach

A data lake is built to enable queries to resolve quickly and cost-effectively.

The goal statement above was found to be too broad to engineer an effective data lake against.

The specific questions the business wants answered influence both source identification and the intermediate ETL/ELT layers. Our first step for this customer was to ingest all the data from all the sources into S3. Before any code was written for the next step, we sat through multiple rounds of discussion with stakeholders to understand what problems they wanted to solve.

We jointly came up with a list of tangible business outcomes and problem statements. Each outcome and problem statement is usually composed of a basket of queries.

Questions we asked ourselves and the customer

  • Effective Data Lake

    Sources of Data?

    “In order to solve this problem statement, what sources of data do I need to look into?”

  • Access Patterns

    What and How?

    “For each source of data what are the key attributes, how much data is there and how will it need to be pre-processed?”

  • Combining Sources

    Multiple sources?

    “How do we combine data from these sources to solve the problem statement?”

    At this point, ETL jobs were written to partition, combine, chain, and massage the sources into a queryable output (a minimal sketch of such a job follows this list).

  • Iteration

    Defined problem statement?

    “Is the output satisfactory?”

    If the problem statement was not clearly defined in the first step, the preceding steps have to be repeated until the end goal is achieved.
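Below is a minimal sketch of the kind of AWS Glue (PySpark) ETL job this step refers to. The customer's actual schemas are not described in this case study, so the database, table, column, and bucket names (raw_zone, device_readings, patient_metadata, the S3 paths) are illustrative placeholders only.

```python
# Minimal AWS Glue (PySpark) ETL sketch: combine two catalogued sources and
# write a date-partitioned Parquet output to the S3 data lake.
# All database, table, column and bucket names are illustrative placeholders.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read raw device readings and patient metadata from the Glue Data Catalog
# (tables discovered earlier by crawlers).
readings = glue_context.create_dynamic_frame.from_catalog(
    database="raw_zone", table_name="device_readings").toDF()
patients = glue_context.create_dynamic_frame.from_catalog(
    database="raw_zone", table_name="patient_metadata").toDF()

# Clean, combine and derive a partition column.
combined = (
    readings
    .dropDuplicates(["device_id", "event_ts"])
    .join(patients, on="patient_id", how="left")
    .withColumn("reading_date", F.to_date("event_ts"))
)

# Land the queryable output as partitioned Parquet in the curated zone.
(combined.write
    .mode("overwrite")
    .partitionBy("reading_date")
    .parquet("s3://example-curated-lake/device_readings/"))

job.commit()
```

Writing the output as date-partitioned Parquet is what later keeps the query layer fast and cost-effective.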

Data Lake Solution

CloudCover designed a data lake strategy after meetings with key stakeholders.

From a technical standpoint, the key points of discussion were: Sources of Ingestion, Types of data sources, Data Lake schema, Database schema, Intermediate datastores, ETL pipeline, Tangible use cases, Access patterns for the data.

We deployed a combination of cloud native and open source tools on AWS to set up the lake. These are:

  • AWS Glue for ETL
  • Aurora for intermediate state
  • Glue Data Catalog for metadata
  • Glue crawlers to discover sources
  • S3 as the main Parquet/ORC data lake
  • IAM policies for least-privilege access
  • Jupyter/Zeppelin notebooks for data scientists
  • Terraform for IaC
  • CloudWatch for monitoring and logging
  • Athena for querying

The key premise was to support disparate data sources out of the box (batch, streaming, REST APIs, and SQL/NoSQL databases, whether self-hosted or on other public clouds). Boilerplate ingest pipelines for these sources were deployed and scheduled to land the data in AWS. The data sanitization and ETL layer was then engineered to load the data sinks and the data lake.
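The case study does not spell out how these boilerplate ingest pipelines were implemented, so the following is only a sketch of the pattern for one REST source: pull a batch on a schedule and land it untouched in a raw S3 zone. The endpoint, bucket, and key layout are hypothetical.

```python
# Sketch of a boilerplate ingest job for a third-party REST source: pull one
# batch and write it unmodified to the raw S3 landing zone. The endpoint,
# bucket and key layout below are hypothetical placeholders.
import datetime
import json

import boto3
import requests

S3_BUCKET = "example-raw-zone"                           # placeholder bucket
API_URL = "https://api.example-vendor.com/v1/readings"   # placeholder endpoint


def ingest_once() -> str:
    """Pull one batch from the REST source and write it as-is to S3."""
    response = requests.get(API_URL, timeout=30)
    response.raise_for_status()

    now = datetime.datetime.utcnow()
    # Date-based prefixes keep the raw zone cheap to crawl and easy to replay.
    key = f"vendor_readings/dt={now:%Y-%m-%d}/batch_{now:%H%M%S}.json"

    boto3.client("s3").put_object(
        Bucket=S3_BUCKET,
        Key=key,
        Body=json.dumps(response.json()).encode("utf-8"),
    )
    return key


if __name__ == "__main__":
    print(f"Wrote s3://{S3_BUCKET}/{ingest_once()}")
```

A script of this shape can be scheduled in whichever way suits the source, e.g. a cron-style trigger or a Glue Python shell job.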

The end result is a columnar data lake sitting on S3. With this foundation in place, we can deploy any querying layer on top and configure multiple sinks for the data, depending on the access patterns and stakeholders.

What about future use cases?

The common concern is: if a data lake is built with specific problem statements in mind, how do we ensure that future users can still use the lake for problem statements that are as yet unknown?

The solution is to set up a staging data lake in S3 to test out future queries. ETL pipelines are broken down into cleaning and preprocessing steps so that they remain reusable for future use cases. Adding a new source or use case still entails experimentation, but the staging data lake removes the overhead of creating resources from scratch. Additionally, since access patterns for users remain constant, the only additions or modifications are to the ingestion pipelines (in the case of a new source) and the ETL pipelines.
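As an illustration of this modular idea, the sketch below separates a generic cleaning step from a use-case-specific preprocessing step and lets the same pipeline be pointed at either the staging or the production lake. The function names, columns, thresholds, and bucket names are assumptions, not the customer's code.

```python
# Sketch of the reusable-steps idea: generic cleaning and use-case-specific
# preprocessing are separate functions, and the same pipeline can be pointed at
# the staging lake to trial a new use case before it touches production.
# All names, columns and thresholds below are illustrative placeholders.
import pandas as pd

STAGING_LAKE = "s3://example-staging-lake"
PRODUCTION_LAKE = "s3://example-curated-lake"


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Generic cleaning reused across use cases: dedupe and drop empty rows."""
    return df.drop_duplicates().dropna(how="all")


def preprocess_heart_rate(df: pd.DataFrame) -> pd.DataFrame:
    """Use-case-specific step: keep physiologically plausible readings only."""
    return df[(df["heart_rate"] > 30) & (df["heart_rate"] < 220)]


def run_pipeline(source_path: str, lake_root: str) -> None:
    # Reading/writing s3:// paths with pandas needs s3fs and pyarrow installed.
    df = pd.read_parquet(source_path)
    df = preprocess_heart_rate(clean(df))
    df["reading_date"] = pd.to_datetime(df["event_ts"]).dt.strftime("%Y-%m-%d")
    df.to_parquet(f"{lake_root}/heart_rate/", partition_cols=["reading_date"])


# Trial a new query shape against staging first; promote the same code to the
# production lake root once the output looks right.
run_pipeline("s3://example-raw-zone/device_readings/", STAGING_LAKE)
```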

Access Patterns

The end users of the data lake access the data in fairly consistent ways, which usually differ by team: data analysts may use Excel exports or a SQL tool, engineers write queries using an ORM and Python, data scientists use Jupyter notebooks, and business users want to view reports in a dashboard.
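As one concrete example of these patterns, the snippet below shows how a data scientist might pull an aggregate out of the Athena/S3 lake into a pandas DataFrame from a Jupyter notebook. The case study does not name the client library, so awswrangler (the AWS SDK for pandas) is used purely as an illustration, and the database and table names are placeholders.

```python
# Data-scientist access pattern: query the Athena/S3 lake into pandas from a
# Jupyter notebook. awswrangler is used only as an illustration; database and
# table names are placeholders.
import awswrangler as wr

df = wr.athena.read_sql_query(
    sql="""
        SELECT patient_id,
               avg(heart_rate) AS avg_heart_rate
        FROM device_readings
        WHERE reading_date >= date '2021-01-01'
        GROUP BY patient_id
    """,
    database="curated_zone",
)
df.head()
```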

Steps in Data Lake Journey

  • Step 1

    Customers are guided to list down all problem statements and use cases
  • Step 2

    List down all data sources
  • Step 3

    List down access patterns (how will stakeholders access or query the data)
  • Step 4

    Build data ingestion pipelines
  • Step 5

    Map out the data schema, partition strategy and data models (a minimal sketch follows this list)
  • Step 6

    Build the ETL jobs
  • Step 7

    Test the business problem with queries or code tests on the processed data
  • Step 8

    Iterate as required
  • Step 9

    Productionize for security and scalability
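To make Steps 5-7 a little more concrete, here is a hedged sketch of registering a partitioned Parquet table over the curated S3 prefix with Athena DDL, so the business questions can then be tested with plain SQL. The table name, columns, and bucket locations are assumptions for illustration only.

```python
# Sketch of Steps 5-7: register a partitioned Parquet table over the curated S3
# prefix via Athena DDL so the business questions can be tested with plain SQL.
# Table, column and bucket names are illustrative, not the customer's schema.
import time

import boto3

athena = boto3.client("athena")

DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS curated_zone.device_readings (
    patient_id  string,
    device_id   string,
    heart_rate  double,
    gait_angle  double,
    event_ts    timestamp
)
PARTITIONED BY (reading_date date)
STORED AS PARQUET
LOCATION 's3://example-curated-lake/device_readings/'
"""


def run(statement: str) -> None:
    """Submit a statement to Athena and wait for it to finish (it is async)."""
    query_id = athena.start_query_execution(
        QueryString=statement,
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
    )["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)


run(DDL)
# Load the Hive-style partitions (reading_date=YYYY-MM-DD/) written by the ETL job.
run("MSCK REPAIR TABLE curated_zone.device_readings")
```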

CloudCover Product

Introducing Data Pipes

A DataOps and insights-generation platform that helps analytics and IT teams catalog, manage, and democratize their data by automating data practices
