Case Study
Cloud Optimization for a Cybersecurity Education Platform
Overview
Our client is a purpose-built cyber range platform that measures and curates cybersecurity skill sets. It provides on-demand, pre-configured infrastructure setups for hands-on cybersecurity training.
The client’s infrastructure faced highly variable load. To control the cost of running the infrastructure 24x7, training labs were spun up on demand. However, this required the ability to spin up more than 1,000 concurrent training lab setups at times.
Problem Statement
The following challenges were identified during the scoping discussions with the client:
-
End users sometimes had to wait up to 1 hour for a training lab to be up and running. This happened primarily because the system could not handle many (up to 1,000) concurrent requests to start a training lab during a live webinar or offline class.
-
Low ROI due to underutilization of the provisioned infrastructure and highly variable load, driven by the course schedules of the training institutes running labs.
-
Zero-downtime system updates were needed, because the scheduled training of partner institutes made it difficult to find a maintenance window.
-
The system needed to be fault-tolerant, given how critical the platform is to reputed universities.
Solution
Xponentium’s team designed and supported the implementation of the following changes:
-
Every incoming request to set up a training lab was forwarded directly to VM provisioning, so thousands of concurrent lab creation requests overwhelmed the infrastructure. Xponentium placed a RabbitMQ message queue in between and delegated lab provisioning to Celery workers, which process incoming requests in batches and pick up a new job as soon as a lab has been created. This change significantly improved average lab creation time when many users request labs simultaneously. The team also combined serverless and managed services with a mix of reserved and spot instances to scale as needed, incurring extra cost only when required.
-
Identified all single points of failure and introduced redundancy across availability zones and regions, along with a managed database service for high availability. Implemented a blue-green deployment strategy so that new updates deploy within a few minutes with almost zero downtime.
-
Optimized database queries, removed unused indexes, and converted non-clustered indexes to clustered indexes wherever possible to reduce database I/O, resulting in significant performance and cost benefits.
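The queue-and-worker pattern from the first change can be illustrated with a minimal sketch. It is not the client's implementation: Python's standard-library `queue.Queue` and threads stand in for RabbitMQ and the Celery workers, and `provision_lab`, `run_burst`, `WORKER_COUNT`, and `PROVISION_SECONDS` are hypothetical names and values chosen for illustration. It shows how a burst of requests is buffered while a fixed pool of workers takes the next job as soon as the previous lab is ready, so the provisioning backend never sees more concurrent jobs than there are workers.

```python
import queue
import threading
import time

WORKER_COUNT = 8          # hypothetical pool size (stands in for Celery worker concurrency)
PROVISION_SECONDS = 0.01  # hypothetical stand-in for the real VM spin-up time


def provision_lab(request_id):
    """Placeholder for the real lab spin-up call (hypothetical)."""
    time.sleep(PROVISION_SECONDS)
    return f"lab-{request_id}"


def run_burst(request_count, worker_count=WORKER_COUNT):
    """Queue a burst of lab requests and drain it with a fixed worker pool."""
    jobs = queue.Queue()  # plays the role of the message queue
    labs = []
    lock = threading.Lock()
    in_flight = 0
    peak_in_flight = 0

    def worker():
        nonlocal in_flight, peak_in_flight
        while True:
            request_id = jobs.get()
            if request_id is None:  # sentinel: no more work for this worker
                jobs.task_done()
                return
            with lock:
                in_flight += 1
                peak_in_flight = max(peak_in_flight, in_flight)
            lab = provision_lab(request_id)
            with lock:
                in_flight -= 1
                labs.append(lab)
            jobs.task_done()

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for request_id in range(request_count):
        jobs.put(request_id)  # requests are accepted immediately and buffered
    for _ in threads:
        jobs.put(None)        # one sentinel per worker
    jobs.join()
    for t in threads:
        t.join()
    return labs, peak_in_flight


if __name__ == "__main__":
    labs, peak = run_burst(1000)
    print(len(labs), "labs created; peak concurrent provisioning:", peak)
```

In this sketch the number of workers is the main tuning knob: it caps the load on the provisioning layer regardless of how many requests arrive at once, and the queue absorbs the rest of the burst.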
Xponentium Impact
-
Reduced the time to create concurrent labs from 60 minutes to 5 minutes, even with 1,000 concurrent lab creation requests, as measured in a benchmark load test
-
Achieved close to zero downtime
-
Reduced overall cloud cost by 50%
Technologies
RabbitMQ
Celery