About Hike

Hike is a social content platform focused on privacy, expression, and bite-sized content: a place where you can privately chat with friends and consume the snackable content you love.

The app launched on 12 December 2012 (12/12/12) and crossed 100 million users in January 2016. In August 2016, Hike raised its fourth funding round of USD 175 million, led by Tencent and Foxconn, at a valuation of USD 1.4 billion, making it the fastest company in India to reach a USD 1 billion valuation, a milestone it hit in just 3.7 years.

Challenges And Goals

To achieve better app performance at scale and at a viable cost, Hike evaluated Google Cloud Platform services for performing ETL. The results indicated that GCP would help it build, test, and deploy applications quickly in a scalable, reliable cloud environment.

Why Google Cloud

While speeding up ETL was crucial for Hike, it wanted to do so cost-effectively and without sacrificing accuracy. Factors such as flexibility, ease of use, reliability, and performance made Google Cloud Platform a clear winner.

Google Cloud Products Used

BigQuery

BigQuery is Google's serverless, highly scalable, enterprise data warehouse designed to make all your data analysts productive at an unmatched price-performance.

Cloud Pub/Sub

Cloud Pub/Sub is a simple, reliable, scalable foundation for stream analytics and event-driven computing systems.

Cloud Dataflow

Cloud Dataflow simplifies stream and batch data processing, with equal reliability and expressiveness.

CloudCover’s Strategy and Approach

CloudCover consulted for Hike and implemented this transformation of its Big Data analytics ecosystem with the help of Google Cloud services. This, in turn, resulted in Hike's end-to-end migration to Google Cloud.

The POC included testing and verifying VMs, scalability, security, monitoring, billing, and Big Data workloads. CloudCover approached the migration by making the most of these tools:

  • Cloud Dataflow for data transformation in batch and streaming mode
  • BigQuery for storing processed data and performing higher-order analytics
  • Cloud Pub/Sub for ingesting live streaming data into the Dataflow pipeline
  • Google Cloud Storage (GCS) as the Big Data source/sink
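As a rough, standalone sketch of how these tools fit together (not Hike's actual code), the snippet below mimics the three batch pipeline stages, decompressing raw logs, transforming loglines, and emitting BigQuery-ready rows, using only the Python standard library. The tab-separated log format and all function names here are hypothetical.

```python
import gzip
import io
import json

def unzip_loglines(blob: bytes):
    """Stage 1 (unzip): decompress a gzipped log blob into lines."""
    with gzip.open(io.BytesIO(blob), "rt") as f:
        for line in f:
            yield line.rstrip("\n")

def transform(line: str) -> dict:
    """Stage 2 (handler transformation): parse a hypothetical
    tab-separated logline into a structured analytics event."""
    ts, user, event = line.split("\t")
    return {"ts": ts, "user": user, "event": event}

def to_bq_rows(lines) -> str:
    """Stage 3 (BigQuery insert): serialise events as newline-delimited
    JSON, a format BigQuery load jobs accept."""
    return "\n".join(json.dumps(transform(l)) for l in lines)

# Simulated batch run over a small compressed log file.
raw = gzip.compress(b"2016-08-01T00:00:00\tu1\topen\n"
                    b"2016-08-01T00:00:05\tu2\tmsg_send\n")
rows = to_bq_rows(unzip_loglines(raw))
```

In the real pipeline, each stage would be a Dataflow transform running in parallel over many workers; the point here is only the shape of the data as it moves from GCS through transformation into BigQuery.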

Solution

CloudCover identified the closest mapping of Hike’s existing components on AWS to equivalent services in GCP. These were highly scalable, highly available, no-ops managed services on Google’s world-class infrastructure:

  • Kafka cluster :: Cloud Pub/Sub (key benefit: serverless, no-ops managed message bus)
  • Hadoop MapReduce clusters (EMR) :: Cloud Dataflow, in both batch and stream mode (key benefit: no-ops, managed, high-performance distributed computing service for Big Data)
  • Hive ETL :: BigQuery
  • Redshift :: BigQuery (key benefit: a scalable, performant data warehouse, even over massive datasets, that remains relatively easy to use)
  • S3 :: GCS
  • Custom scheduled applications :: GCP App Engine

Batch Mode Architecture

[Diagram: batch mode architecture]

Batch Mode Workflow

  • A file-upload event on S3 triggers an SNS notification.
  • A custom application deployed on GCP App Engine is subscribed to the SNS topic and runs on a schedule.
  • On each run, the app processes all accumulated events and copies the corresponding log files from S3 to GCS in a batch for further processing.
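The scheduled copy step can be sketched as follows; plain dicts stand in for the S3 and GCS buckets, and all names are hypothetical (a real app would use the boto3 and google-cloud-storage clients):

```python
# Minimal sketch of the scheduled batch-copy run: drain the accumulated
# file-upload events, then copy each referenced object from the source
# bucket (S3) to the destination bucket (GCS).
s3_bucket = {"logs/a.gz": b"...", "logs/b.gz": b"..."}
gcs_bucket = {}
pending_events = [{"key": "logs/a.gz"}, {"key": "logs/b.gz"},
                  {"key": "logs/a.gz"}]  # duplicate delivery is possible

def run_batch_copy(events, src, dst):
    """Process all accumulated events in one scheduled run."""
    copied = []
    for key in {e["key"] for e in events}:  # dedupe: SNS is at-least-once
        dst[key] = src[key]
        copied.append(key)
    events.clear()
    return sorted(copied)

copied = run_batch_copy(pending_events, s3_bucket, gcs_bucket)
```

Deduplicating by object key matters because SNS offers at-least-once delivery, so the same upload event can arrive more than once between scheduled runs.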

Streaming Mode Architecture

[Diagram: streaming mode architecture]

Streaming Mode Workflow

  • Hike’s applications write log streams to a Kafka cluster running in AWS.
  • A Kafka consumer forwards the log streams to a Cloud Pub/Sub topic.

A Dataflow streaming pipeline consumes these streams from Pub/Sub and performs the required transforms in near real time.
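The forward-and-consume pattern can be sketched with in-memory queues standing in for the Kafka and Pub/Sub topics; the logline format and all names here are hypothetical (in production the forwarder would be a Kafka consumer loop publishing through the google-cloud-pubsub client):

```python
import queue

# queue.Queue stands in for both the Kafka topic and the Pub/Sub topic.
kafka_topic = queue.Queue()
pubsub_topic = queue.Queue()

def forward(kafka, pubsub, max_messages):
    """Kafka consumer: pull loglines from Kafka, publish to Pub/Sub."""
    for _ in range(max_messages):
        pubsub.put(kafka.get_nowait())

def streaming_transform(pubsub):
    """Dataflow-style streaming stage: consume and transform each
    message as it arrives (here: parse the hypothetical logline)."""
    out = []
    while not pubsub.empty():
        ts, user, event = pubsub.get_nowait().split("\t")
        out.append({"ts": ts, "user": user, "event": event})
    return out

for line in ["t1\tu1\topen", "t2\tu2\tmsg_send"]:
    kafka_topic.put(line)
forward(kafka_topic, pubsub_topic, max_messages=2)
events = streaming_transform(pubsub_topic)
```

Because the transform logic is the same function in batch and streaming mode, the same code path serves both pipelines, which is exactly the property that made the shared Dataflow codebase attractive here.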

Current Hike Implementation

During the actual implementation of Hike's analytics platform on GCP, Hike chose to stick with Kafka, since all the messaging servers were already writing to the Kafka stream and Dataflow's Kafka connector available at the time was still maturing. To avoid too many architecture changes during the migration phase, Hike also continued using Hive instead of BigQuery because of various dependencies. Hike later switched to BigQuery, a move covered in a detailed blog post.

Results

The graph below shows how Google's Cloud Dataflow service performed ETL in the most cost-effective manner.

[Graph: ETL cost comparison]

With the help of Dataflow, it was possible to process terabytes of data in significantly less time. The table below shows how scaling was achieved even for massively parallel processing.

Operation             AWS            GCP
hike.analytics ETL    30m – 1h 15m   5m – 30m
Higher Order ETL      30m – 1h       < 10m

Batch Mode (Dataflow Job Timing Statistics)

Job Name                 Duration (hh:mm:ss)   Remarks
Unzip Process*           00:06:58              Input loglines: 900 million
Handler Transformation   00:11:47              Output loglines: 1.068 billion
BigQuery Insert          00:05:27              Loaded rows: 1.068 billion
Total Time               00:24:12
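For perspective, per-stage throughput follows directly from the figures in the table; a quick back-of-the-envelope calculation:

```python
# Back-of-the-envelope throughput from the batch-mode timings above.
def rate(rows, hh, mm, ss):
    """Rows processed per second for a stage of the given duration."""
    return rows / (hh * 3600 + mm * 60 + ss)

unzip     = rate(900_000_000,   0,  6, 58)   # input loglines per second
transform = rate(1_068_000_000, 0, 11, 47)   # output loglines per second
load      = rate(1_068_000_000, 0,  5, 27)   # rows loaded per second

print(f"unzip:     {unzip:,.0f} lines/s")
print(f"transform: {transform:,.0f} lines/s")
print(f"bq load:   {load:,.0f} rows/s")
```

That works out to well over a million loglines per second at every stage, sustained across the whole batch.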

Streaming Mode (Dataflow Job Timing Statistics)

Cloud Dataflow's fast streaming performance (with the same code as the batch process) made it possible for Hike to perform ETL even on its live stream, ingested into Pub/Sub. This capability was notably absent from its previous Big Data analytics ecosystem.

Job Name                 Duration         Remarks
Unzip Process*           N/A              Input loglines: 900 million
Handler Transformation   Near real time   Output loglines: 1.068 billion
BigQuery Insert          Near real time   Loaded rows: 1.068 billion

Conclusion

This massive, cost-effective performance boost in data transformation and analysis allowed Hike to focus more on improving application functionality. Making optimum use of GCP managed services also reduced the overhead of maintaining infrastructure.

There was a dramatic improvement in the speed and responsiveness of the new user-analytics dashboard, which in turn delivered a better, friction-free user experience.

Benefits post migration (in a nutshell)

  • Post-migration costs went down by about 50–70%.
  • Queries in BigQuery run in seconds, so query performance improved out of the box and was 40–60% faster.
  • All GCP services used are managed services, so there was no hassle of scaling or maintenance; Hike's team could spend its time on automation rather than maintenance issues.

About CloudCover

CloudCover delivers the insane potential of the public cloud to start-ups & agile enterprises through a combination of weaponized geekiness, extreme automation, and battle-scarred experience.