
EDF EBS: Optimising Disaster Recovery and Observability
How Cloud Elemental worked with EDF’s EBS department to improve disaster recovery, observability and database resilience for a new mission-critical application.

The Client
EDF is one of the UK’s largest energy providers, delivering electricity and gas to millions of residential and business customers. EDF Business Solutions (EBS) provides power purchase agreements (PPAs) to industrial and commercial organisations and is the UK’s leading PPA provider.
With Cloud adoption maturing across the business, EBS sought to improve the resilience, security and observability of a new business-critical application and provide a template for future application delivery.
The Challenge
EBS engaged Cloud Elemental to enhance the platform’s disaster recovery (DR) posture, improve infrastructure observability, and audit database operations. The goal was to reduce operational risk and increase confidence in the platform’s stability, particularly in the event of an incident or recovery scenario.
- Disaster Recovery Gaps
EBS wanted to create DR playbooks and automate the recovery process so that Recovery Time Objectives (RTOs) were consistently achievable, reducing business impact
- Limited Observability
Monitoring relied primarily on Amazon CloudWatch, offering only basic telemetry and no integration with EDF’s incident management tooling
- Database Access and Security
EBS wanted to audit one of their core databases to ensure that it met industry standards for database security and to optimise performance

Our Approach
Cloud Elemental worked closely with EDF EBS throughout the engagement, starting with a series of discovery workshops and technical audits to assess the current-state architecture, tooling, and workflows.
Assessment
During initial assessments, we identified the following opportunities for improvement:
- No automated DR testing framework or scheduled simulations
- Incomplete DR documentation and lack of confidence in current RTOs
- Gaps in monitoring, particularly for underlying infrastructure components
- Risk-prone processes for database cloning and key management
We validated these findings with technical leads across EBS, prioritising use cases based on risk and effort.

Discovery
- Held deep-dive workshops
- Mapped out application architecture, risks and maturity levels
Design & Planning
- Defined future-state architecture
- Defined DR flows using AWS-native services and IaC


Build & Implement
- Delivered automated pipelines, integrated monitoring and infrastructure updates
Enablement & Handover
- Provided a thorough DR playbook and documentation
- Shared knowledge with internal teams

Once the priorities were defined, we delivered a set of modular improvements using infrastructure-as-code, automation, and integrations with EDF’s existing systems. Each solution was designed to be scalable and auditable, with minimal disruption to ongoing operations.
Our Solution
Our tailored recommendations enabled EDF to automate recovery, unify observability, and enforce secure, scalable database practices, all whilst using AWS-native tooling and integrating with their enterprise systems.
Here’s how we did it:
- Introduced AWS Backup to automate snapshot creation and lifecycle management.
- Developed custom recovery playbooks and supporting scripts.
- Created Terraform modules to enable repeatable infrastructure deployment in DR scenarios.
- Established monthly DR testing cycles to validate recovery processes.
- Integrated Dynatrace for deep infrastructure and application monitoring.
- Streamlined incident triage by linking Dynatrace alerts to ServiceNow via automated workflows.
- Built dashboards to improve visibility of platform health and performance trends.
- Implemented a self-service mechanism for secure, time-limited access to production databases.
- Automated database cloning for use in DR testing and non-production environments.
- Replaced legacy encryption processes with customer-managed AWS KMS keys.
- Delivered a detailed report highlighting key pain points with actionable recommendations to resolve issues
- Reduced manual steps in DR simulations, allowing for more frequent and reliable testing
- Improved auditability of database access and encryption activities
- Enabled on-call engineers to respond faster to incidents through clearer alerts and integrated tooling
- Established a foundation for continuous improvement by codifying processes and infrastructure
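To illustrate how a monthly DR testing cycle can be checked against an RTO, here is a minimal Python sketch. It is not the tooling delivered in the engagement: the `DrTestRun` class, the `rto_breaches` helper, the four-hour RTO and the timestamps are all hypothetical and chosen only to show the idea.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrTestRun:
    """One DR simulation. All field names are hypothetical."""
    name: str
    started_at: datetime
    recovered_at: datetime

    @property
    def recovery_time(self) -> timedelta:
        # Elapsed time between declaring the incident and full recovery.
        return self.recovered_at - self.started_at


def rto_breaches(runs, rto):
    """Return the runs whose recovery time exceeded the RTO."""
    return [run for run in runs if run.recovery_time > rto]


# Illustrative values only, not figures from the engagement.
rto = timedelta(hours=4)
runs = [
    DrTestRun("2024-01", datetime(2024, 1, 15, 9, 0), datetime(2024, 1, 15, 11, 30)),
    DrTestRun("2024-02", datetime(2024, 2, 15, 9, 0), datetime(2024, 2, 15, 14, 10)),
]

breaches = rto_breaches(runs, rto)
for run in breaches:
    print(f"{run.name}: recovery took {run.recovery_time}, RTO is {rto}")
```

Recording each simulation in a structured form like this makes it straightforward to track recovery times across test cycles and to flag any run that would have missed the objective.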
Our Results
By implementing automated DR, enterprise-grade observability and secure database practices, EDF EBS has significantly reduced operational risk while improving agility and control across their AWS infrastructure.
Operational Risk Reduced
- EDF EBS can now simulate DR events monthly, with confidence in recovery success and minimal overhead
Improved Observability
- Engineers now have access to real-time, actionable data to troubleshoot incidents quickly
Increased Security & Auditability
- Secure, temporary access workflows and automated encryption reduce the risk of data breaches or compliance issues
Faster Delivery Cycles
- Standardised infrastructure provisioning and cloning processes have improved agility and platform reliability
