Case Study
Data Engineering Solution for an Assessment SaaS Platform
Objective
The project aimed to enhance the data infrastructure of an Assessment SaaS platform to enable real-time reporting and comprehensive business analytics. The key objectives included:
- Real-Time Reporting for Customers: Providing up-to-date reports to customers to improve decision-making processes.
- Feature Adoption Tracking: Monitoring the adoption of new features among select users before broader rollout.
- Infrastructure Usage Analytics: Optimizing infrastructure based on usage data.
- Customer Behaviour Tracking: Understanding customer behaviour to improve user experience.
- Benchmarking Assessment Questions: Analyzing and benchmarking the effectiveness of assessment questions.
- Proactive Customer Assistance: Empowering the Customer Success team to assist customers proactively.
Challenges
- Multiple Data Sources: The data originated from multiple disparate sources.
- Transformation of Diverse and Voluminous Data: The platform handled substantial data volumes daily:
  - Daily Active Users: 200,000 - 300,000
  - Daily Transactions: 4,000,000 - 5,000,000
- Mutable Data: Handling mutable data efficiently was necessary, especially for upserts, which are typically performance-intensive in data lakes.
- GDPR Compliance: Ensuring data mutability for GDPR compliance, allowing users to erase their data upon request.
- Data Lake Requirement: A secure and private data lake within the existing cloud infrastructure was essential due to the sensitivity of the data.
- Non-Disruptive Data Pipeline: The data pipeline had to be designed to avoid impacting the performance of the transactional source databases.
Solution
To address these challenges, a robust data engineering solution was implemented:
Pipeline Design:
The pipeline was meticulously designed to optimize performance and scalability:
- Data Ingestion: Data was read from logs rather than queried from the databases directly, minimizing the load on the transactional databases.
- Custom Data Lake: A custom data lake was built on Delta Lake technology, using a combination of tools selected for their performance and scalability.
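The log-based ingestion above can be sketched in plain Python. The file name and record shape here are illustrative assumptions, not the platform's actual log format; the point is that the pipeline consumes an append-only log rather than querying the source database:

```python
import json

def read_change_records(log_path):
    """Yield change records from a newline-delimited JSON log.

    Reading from logs (rather than querying the database directly)
    keeps ingestion load off the transactional source.
    """
    with open(log_path) as log:
        for line in log:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)

# Example: write a tiny log, then ingest it.
with open("changes.log", "w") as f:
    f.write('{"table": "users", "op": "u", "id": 1, "name": "Ada"}\n')

records = list(read_change_records("changes.log"))
print(records[0]["op"])  # → u
```

In production this role is played by the CDC tooling described below; the sketch only shows why log ingestion is non-disruptive.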
Data Transformation:
Data was transformed using the following tools:
- Debezium: Used as a change data capture (CDC) connector to stream data changes.
- Apache Hudi: Managed large-scale data lakes with support for efficient upserts and deletes.
- AWS Glue: Provided serverless ETL capabilities to transform and load data into the data lake.
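A Debezium change event carries an `op` field (`c` create, `u` update, `d` delete, `r` snapshot read) plus `before`/`after` row images. A minimal sketch of applying such events to an in-memory table keyed by primary key (the dict is a stand-in for the data lake table; only the envelope fields follow Debezium's format):

```python
def apply_change_event(table, event):
    """Apply one Debezium-style change event to a dict keyed by `id`."""
    op = event["op"]
    if op in ("c", "u", "r"):      # create, update, snapshot read
        row = event["after"]
        table[row["id"]] = row
    elif op == "d":                # delete: drop the old row
        table.pop(event["before"]["id"], None)
    return table

table = {}
apply_change_event(table, {"op": "c", "after": {"id": 1, "score": 80}})
apply_change_event(table, {"op": "u", "after": {"id": 1, "score": 95}})
apply_change_event(table, {"op": "d", "before": {"id": 1}})
print(table)  # → {}
```

In the real pipeline these events are streamed and applied at scale by Hudi/Glue; the sketch only shows the per-event semantics.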
Data Lake Optimization:
- Delta Lake: Enabled scalable and high-performance data storage with support for ACID transactions, efficient upserts, and schema enforcement.
- Apache Hudi: Complemented Delta Lake by managing data versioning and handling incremental data processing.
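The efficient upserts mentioned above follow merge semantics: incoming rows are matched against the target table on a key, matches are updated, and the rest are inserted. In Delta Lake this is a Spark SQL `MERGE INTO`; a plain-Python sketch of the same logic (list-of-dicts as a stand-in table):

```python
def upsert(target, updates, key="id"):
    """Merge `updates` into `target` (lists of dicts): update rows whose
    key matches, insert the rest -- the semantics of a MERGE/upsert."""
    by_key = {row[key]: row for row in target}
    for row in updates:
        by_key[row[key]] = row  # update if key present, else insert
    return list(by_key.values())

target = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
updates = [{"id": 2, "v": "B"}, {"id": 3, "v": "c"}]
print(upsert(target, updates))
# → [{'id': 1, 'v': 'a'}, {'id': 2, 'v': 'B'}, {'id': 3, 'v': 'c'}]
```

What makes this expensive in a plain data lake is that files are immutable, so an update means rewriting files; Delta Lake and Hudi make it efficient by tracking which files contain which keys.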
Visualization and Monitoring:
- Tableau: Used for creating interactive and real-time dashboards for business analytics.
- Grafana: Monitored infrastructure usage and performance metrics.
Data Privacy and Compliance:
Ensured GDPR compliance by implementing data mutability features that allowed for data deletion requests to be processed efficiently.
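GDPR's right to erasure means physically removing a user's rows on request. A minimal sketch of that operation, again over a list-of-dicts stand-in (in the real pipeline this maps to a DELETE against the Delta/Hudi table; the `user_id` field name is illustrative):

```python
def erase_user(table, user_id):
    """Remove every record belonging to `user_id` (GDPR right to erasure).
    Returns the remaining records and a count of what was removed."""
    remaining = [row for row in table if row["user_id"] != user_id]
    removed = len(table) - len(remaining)
    return remaining, removed

events = [
    {"user_id": 7, "event": "login"},
    {"user_id": 8, "event": "submit"},
    {"user_id": 7, "event": "logout"},
]
events, removed = erase_user(events, 7)
print(removed)  # → 2
```

This is exactly why the data lake had to support mutability: an append-only store cannot honour deletion requests without rewriting data.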
Outcome
The implemented solution successfully met the project's objectives by providing real-time reporting and comprehensive analytics. The platform now offers:
- Enhanced Customer Reports: Real-time and detailed reports that help customers make informed decisions.
- Improved Business Decisions: Data-driven insights that guide strategic business decisions.
- Feature Adoption Monitoring: Effective tracking of new feature adoption, aiding in strategic rollouts.
- Optimized Infrastructure: Analytics on infrastructure usage leading to cost and performance optimization.
- Proactive Customer Support: An empowered Customer Success team with insights to proactively assist customers.
- Compliance and Privacy: Adherence to GDPR requirements, ensuring data privacy and compliance.

The customized data lake and efficient pipeline design ensured that the platform could handle large volumes of mutable data without impacting the performance of the transactional databases, ultimately delivering a scalable and high-performing data infrastructure.
Technologies Used: Debezium, Apache Hudi, Delta Lake, AWS Glue, Tableau, Grafana