The cost of downtime
Downtime costs enterprises a great deal of money. The actual cost depends on the industry, but on average revenue losses amount to between $84,000 and $108,000 for every hour of unplanned downtime. Nor is that the only loss; to it you must add the intangible costs of downtime's impact on reputation and customer loyalty.
There is a big difference between scheduled and unscheduled downtime. Scheduled downtime is necessary in order to perform maintenance such as software patches, system configuration changes and database and hardware maintenance. Unscheduled downtime occurs typically as the result of hardware or software failure or an event such as a power cut or environmental catastrophe.
Designing a system for high availability is problematic. Increasing system complexity increases the number of possible failure points. Simply installing internal hardware redundancy is not an answer, as it means the whole system must be taken down for maintenance. It is necessary to design the system so that it can be maintained without affecting service availability.
Such a system needs to satisfy three criteria: high availability, fault tolerance and scalability.
High availability implies that the uptime of an application is 99.999%, which is often termed “five nines”. It equates to a maximum downtime of 5.26 minutes a year, including both planned and unplanned outages. The ultimate goal, of course, is an application that has no downtime at all and is always available.
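The downtime budget behind a given number of nines is simple arithmetic; a quick sketch:

```python
# Downtime budget per year for a given availability percentage.
MINUTES_PER_YEAR = 365.25 * 24 * 60

def downtime_minutes_per_year(availability_pct: float) -> float:
    """Maximum minutes of downtime per year at the given availability."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

print(round(downtime_minutes_per_year(99.999), 2))  # five nines -> 5.26 minutes
print(round(downtime_minutes_per_year(99.9), 2))    # three nines -> ~526 minutes
```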
Xervmon Solution: Users can now unleash the power of visualized deployments with Xervmon. Xervmon can reverse engineer your existing deployments on AWS, create a blueprint of those deployments, and then either migrate them to another provider in a different region or clone them to another region of AWS to deliver high availability.
The scalability of an application refers to its ability to accommodate growing traffic without requiring architectural changes. If an application cannot scale, its availability is likely to be impacted; but although scalability may be a critical feature of availability, simply making an application scalable does not guarantee availability.
Fault tolerance is a measure of how well an application continues to operate satisfactorily when faults occur. It is an essential property of high-availability applications and is achieved by building redundancy into the system, so that should a fault occur, an alternative part of the system automatically takes over that function.
AWS and High Availability
Amazon Web Services IaaS is an ideal platform for building highly available systems that are reliable, fault tolerant and require minimal intervention. The architectures include multiple servers, isolated redundant data centres, global regionalisation, and availability zones (AZs) within those regions.
The AWS building blocks include the following highly available and fault tolerant services that span across the AZs:
- Amazon S3
- Amazon SQS
- Amazon DynamoDB
- Amazon SNS
- Amazon CloudFront
- Amazon SES
- Amazon Route 53
- Amazon SWF
- Amazon Elastic Load Balancer
And we are able to design high availability systems using Xervmon Visual Designer with the following subset:
- Amazon EC2
- Amazon EBS
- Amazon RDS
- Amazon ELB
- Amazon VPC
Design for failure
Designing a system that is able to withstand failure means designing it so that single points of failure are eliminated. If any single function fails, the application will continue to operate and there will be no impact on the business.
Philosophically, the approach assumes that everything fails all the time and works backwards from there. It is necessary to ask what the potential failure points are and what the possible failure scenarios look like. For instance, if the architecture includes a master and slaves, it is necessary to decide what to do if the master node fails; similarly if a slave node or a load balancer fails.
It is also essential for the system to be capable of recognising failure and, when it does so, of replacing the failed node. It must also handle situations such as cache keys growing beyond the memory limit of an instance, among other potential scenarios.
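The detect-and-replace behaviour described above can be sketched as a simple reconciliation loop; the `is_healthy` and `launch_replacement` hooks here are hypothetical placeholders for real health checks and provisioning calls, not part of any actual API:

```python
# Sketch of a detect-and-replace loop. is_healthy() and launch_replacement()
# are hypothetical stand-ins for real health checks and provisioning calls.
def reconcile(nodes, is_healthy, launch_replacement):
    """Return the node list with every unhealthy node replaced."""
    result = []
    for node in nodes:
        if is_healthy(node):
            result.append(node)
        else:
            result.append(launch_replacement(node))  # replace the failed node
    return result

# Example run with fake checks: node "b" is unhealthy and gets replaced.
nodes = ["a", "b", "c"]
healthy = lambda n: n != "b"
replace = lambda n: n + "-replacement"
print(reconcile(nodes, healthy, replace))  # ['a', 'b-replacement', 'c']
```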
Existing tools for designing such a system are fragmented. Xervmon, however, offers a unique solution: a unified dashboard that can identify problem areas (nodes, servers or clusters) along with scripts and tools to replace the failed node or take other corrective actions.
Mechanisms that are designed to handle failure include:
- Build process threads that resume on reboot
- Allow the state of the system to re-sync by reloading messages from queue
- Keep pre-configured and pre-optimized virtual images ready to launch in support of the points above.
- Avoid in-memory sessions or stateful user context; move that state to data stores.
- Have a coherent backup and restore strategy for your data and automate it.
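The first two points above amount to rebuilding state by replaying messages from a durable queue. A minimal sketch, with an in-memory list standing in for a real queue such as Amazon SQS:

```python
# Rebuild in-memory state by replaying messages from a durable queue.
# A plain list stands in here for a real queue such as Amazon SQS.
def rebuild_state(messages):
    """Replay (key, value) messages in order to reconstruct state."""
    state = {}
    for key, value in messages:
        state[key] = value  # later messages win, as with an event log
    return state

queue = [("user:1", "created"), ("user:2", "created"), ("user:1", "updated")]
print(rebuild_state(queue))  # {'user:1': 'updated', 'user:2': 'created'}
```

Because the queue, not the instance, is the source of truth, a rebooted or replaced node can recover its working state simply by replaying.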
Auto Scaling enables you to:
- Automatically scale Amazon EC2 capacity up or down
- Terminate server instances at will
- Add more instances in response to an increasing load
- Launch a replacement instance immediately in case of a failure
- Let the application transition seamlessly if the primary server fails
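The scale-up/scale-down decision itself can be illustrated with a toy policy function; the 70%/30% CPU thresholds and the min/max bounds here are illustrative assumptions, not real Auto Scaling defaults, and a pure function stands in for the actual policy evaluation:

```python
# Toy scaling policy: decide a new instance count from average CPU load.
# The 70%/30% thresholds and min/max bounds are illustrative assumptions.
def desired_capacity(current, avg_cpu, min_size=1, max_size=10):
    """Return the next instance count, clamped to [min_size, max_size]."""
    if avg_cpu > 70:        # scale out under heavy load
        current += 1
    elif avg_cpu < 30:      # scale in when mostly idle
        current -= 1
    return max(min_size, min(max_size, current))

print(desired_capacity(2, 85))  # 3 - add an instance
print(desired_capacity(2, 10))  # 1 - remove an instance
print(desired_capacity(1, 10))  # 1 - never drop below min_size
```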
Elastic Load Balancing (ELB) distributes incoming application traffic across several Amazon EC2 instances. The ELB is given a DNS host name, and requests sent to that host name are delegated to a pool of Amazon EC2 instances. Unhealthy instances in the pool are detected, and traffic is automatically rerouted to healthy instances until the unhealthy instances have been restored.
ELB and Auto Scaling combine ideally: ELB gives a single DNS name for addressing and auto scaling ensures there is always the right number of healthy Amazon EC2 instances to accept requests.
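The behaviour described above, one DNS name in front of a pool with traffic sent only to healthy instances, can be illustrated with a small round-robin sketch; the health map is a stand-in for real load-balancer health checks:

```python
# Round-robin over a pool, skipping instances marked unhealthy.
# The health dict stands in for real load-balancer health checks.
import itertools

def route_requests(pool, health, n_requests):
    """Assign each request to the next healthy instance in the pool."""
    healthy = [i for i in pool if health[i]]
    cycle = itertools.cycle(healthy)
    return [next(cycle) for _ in range(n_requests)]

pool = ["i-aaa", "i-bbb", "i-ccc"]
health = {"i-aaa": True, "i-bbb": False, "i-ccc": True}
print(route_requests(pool, health, 4))  # ['i-aaa', 'i-ccc', 'i-aaa', 'i-ccc']
```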
Building fault-tolerant applications on Amazon EC2 requires following best practices, for instance:
- Commission replacement instances rapidly
- Use Amazon EBS for persistent storage
- Use multiple Availability Zones along with Elastic IP addresses
Multi AZ architecture
By distributing applications geographically, one can achieve greater fault tolerance. Since the Amazon EC2 commitment is 99.95% availability for each EC2 Region, it is essential to deploy applications across multiple AZs.
Redundant instances are placed in distinct AZs and ELB will automatically balance traffic across multiple instances and multiple AZs.
Systems Management and monitoring integrated with Xervmon
Xervmon helps companies plan for redundancy and high availability, adopt best practices, and provision servers that have been appropriately checked and certified for security vulnerabilities. With Xervmon monitoring, customers can quickly identify issues on a server or in their apps and respond to actionable events far more quickly than with a fragmented arsenal of three to five separate tools. If Xervmon does not yet support your tool, just contact us and we will add support as soon as possible. Integration is made easy for systems management.
Try Xervmon to manage and monitor your cloud assets with confidence. Sign Up
Managed Services powered by Xervmon Calculate ROI