Reliability and Business Continuity
Implement high availability, fault tolerance, backup strategies, and disaster recovery solutions (~16% of exam).
Reliability and Business Continuity are critical pillars in AWS architecture that ensure systems remain operational and recoverable during disruptions. For the AWS Certified SysOps Administrator - Associate exam, understanding these concepts is essential. **Reliability** refers to the ability of a…
Concepts covered: Connection draining, Route 53 health checks, Route 53 failover routing, RDS Multi-AZ deployments, RDS read replicas, Aurora Global Database, Aurora Auto Scaling, S3 cross-region replication, Amazon EC2 Auto Scaling, S3 lifecycle policies, DynamoDB backups, Auto Scaling groups, Launch templates, Auto Scaling policies, Target tracking scaling, DynamoDB global tables, DynamoDB point-in-time recovery, Disaster recovery strategies, Backup and restore DR, Step scaling policies, Scheduled scaling, EBS snapshots, Predictive scaling, EBS snapshot lifecycle, Application Auto Scaling, Multi-AZ deployments, Elastic Load Balancing, RDS automated backups, RDS manual snapshots, Point-in-time recovery, Application Load Balancer, Network Load Balancer, Gateway Load Balancer, Load balancer health checks, Cross-zone load balancing, AWS Backup service, Backup plans and vaults, Amazon Data Lifecycle Manager, AMI creation and management, S3 versioning, S3 object lock, Pilot light DR pattern, Warm standby DR pattern, Multi-site active-active DR, AWS Elastic Disaster Recovery
SOA-C02 - Reliability and Business Continuity Example Questions
Question 1
A global telecommunications company operates a multi-site active-active disaster recovery architecture across the AWS Regions us-west-2 and eu-central-1. Both Regions actively process customer billing transactions using Amazon Aurora Global Database. The SysOps Administrator is designing an automated testing strategy to validate that the DR architecture performs as expected during actual failures. Currently, the team performs manual failover tests only once per quarter, and the last actual regional incident revealed that several application components had configuration drift that prevented proper failover. Management requires monthly validation of the entire active-active architecture, including the application layer, data layer, and routing components. The validation must be performed with minimal risk to production traffic. Which testing approach should the administrator implement to meet these requirements?
Question 2
A marketing agency operates a campaign analytics platform on an Auto Scaling group that processes advertising data. The ASG is configured with a minimum of 2 instances, a desired capacity of 4, and a maximum of 15 instances across two Availability Zones. The team uses a simple scaling policy that adds 3 instances when the SQS ApproximateNumberOfMessagesVisible metric exceeds 500. During a recent product launch campaign, the operations team observed that after the first scale-out triggered, subsequent scaling activities were blocked for 5 minutes even though the queue depth continued to rise to over 2000 messages. This delay caused significant processing backlogs. The team wants to reduce the waiting period between consecutive scaling actions to allow a faster response to sustained high demand. Which Auto Scaling configuration should the SysOps Administrator modify to allow more frequent scaling activities?
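To make the arithmetic in the Question 2 scenario concrete, the following is a minimal simulation sketch, not AWS code. It assumes the queue depth is sampled once per minute and that a scale-out is simply refused until a fixed waiting period has elapsed since the previous one. The names `Group`, `simulate_scale_out`, `wait_seconds`, and `sample_interval` are hypothetical, and real alarm evaluation and instance warm-up are more involved than this model.

```python
from dataclasses import dataclass


@dataclass
class Group:
    """Capacity settings taken from the scenario (min 2, desired 4, max 15)."""
    min_size: int = 2
    desired: int = 4
    max_size: int = 15


def simulate_scale_out(queue_depths, wait_seconds, sample_interval=60,
                       threshold=500, step=3):
    """Return a list of (seconds, queue_depth, desired_capacity) tuples.

    Simplified model (an assumption, not AWS behavior in full detail):
    one queue-depth sample per `sample_interval` seconds, and a new
    scale-out is allowed only when `wait_seconds` have passed since
    the previous scale-out.
    """
    group = Group()
    last_action_at = None
    timeline = []
    for i, depth in enumerate(queue_depths):
        now = i * sample_interval
        waiting = (last_action_at is not None
                   and now - last_action_at < wait_seconds)
        if depth > threshold and not waiting and group.desired < group.max_size:
            # Add `step` instances, but never exceed the group's maximum.
            group.desired = min(group.desired + step, group.max_size)
            last_action_at = now
        timeline.append((now, depth, group.desired))
    return timeline


if __name__ == "__main__":
    # Hypothetical queue depths rising past 2000 over six minutes.
    depths = [600, 900, 1200, 1500, 1800, 2000, 2100]
    for wait in (300, 60):
        final = simulate_scale_out(depths, wait)[-1][2]
        print(f"wait={wait}s -> desired capacity after 6 minutes: {final}")
```

Under these assumptions, a 300-second wait leaves the group at 10 instances after six minutes, while a 60-second wait lets it reach the maximum of 15 in the same window.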
Question 3
A manufacturing company has configured Amazon Data Lifecycle Manager (DLM) to create daily snapshots of its production EBS volumes with a retention count of 30. The DLM policy is set to run at 03:00 UTC. After several weeks of operation, the storage team notices that snapshots for three specific volumes are consistently created 2-3 hours later than expected, while other volumes in the same policy complete their snapshots at the scheduled time. All volumes are attached to running EC2 instances in the same Availability Zone and are tagged correctly. The delayed volumes are 2 TB io2 volumes with high IOPS configurations, while the others are 500 GB gp3 volumes. The team needs to ensure all snapshots complete within a predictable timeframe for its nightly backup reporting. What is the most likely explanation for this behavior, and what is the appropriate resolution?
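For orientation, the schedule described in Question 3 (daily at 03:00 UTC, retain 30 snapshots) could be expressed as a DLM policy document along the following lines. This is a hedged sketch that only builds the dictionary locally. The tag key and value, the schedule name, and the helper `build_policy_details` are hypothetical, and the scenario does not state how the volumes are actually tagged. A dictionary of this shape is what the boto3 `dlm` client's `create_lifecycle_policy` call accepts as `PolicyDetails`, alongside an execution role ARN, a description, and a state.

```python
import json


def build_policy_details(tag_key="Backup", tag_value="Daily"):
    """Build a DLM PolicyDetails document for the scenario's schedule.

    The tag key/value defaults are placeholders (assumptions), since the
    scenario only says the volumes are "tagged correctly".
    """
    return {
        "ResourceTypes": ["VOLUME"],          # snapshot individual EBS volumes
        "TargetTags": [{"Key": tag_key, "Value": tag_value}],
        "Schedules": [
            {
                "Name": "daily-0300-utc",     # hypothetical schedule name
                "CreateRule": {
                    "Interval": 24,
                    "IntervalUnit": "HOURS",
                    "Times": ["03:00"],       # start time, expressed in UTC
                },
                "RetainRule": {"Count": 30},  # keep the 30 most recent snapshots
                "CopyTags": True,
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_policy_details(), indent=2))
```

Keeping the document in a small builder function makes it easy to review the schedule and retention values before they are submitted to the service.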