Unless you have been living under a rock for the last ten years, you will have grasped, like most people in IT, that perfectly working applications will eventually break: sometimes badly, other times CATASTROPHICALLY. Let's chalk this up to a mix of Murphy's and Moore's laws:
"With the speed of technology advancing and refreshing constantly, whatever can go wrong eventually will..."
Or as I like to define my 1st law:
"The amount of entropy in any IT system is directly proportional to the frequency with which things will go horribly wrong."
Now the worst part of this is that the business is asking for IT systems to be more dynamic and agile moving forward. The capacity for catastrophic disaster when projects go "agile" or start a "continuous delivery" cycle is much greater if changes are not effectively tested and managed. The discrete infrastructure changes needed to support these agile and continuous deliveries can also cause havoc.
Now virtualisation has helped solve entropy in server hardware architecture by providing an abstraction layer between hardware and the operating system.
Moving further up the stack, Application Delivery Controllers (ADCs) have similarly carved out a place in the modern data centre by providing a critical abstraction between the entry point for a service and the resources that actually process a given request. The abstraction comes when you create a virtual server and define a pool of resources to handle requests to that virtual server: the ADC ensures that the request is serviced by an available resource and that any resources that are not functioning correctly are bypassed.
Now we get to my 2nd law:
"Good health checks lead to good traffic management decisions."
The more granular and effective the health checks on any given system, the more effective your traffic management will be.
Now health checks are only half of the solution. Health checks are great at letting you know when a resource fails, but deciding what to do *when* that resource fails is where the "rubber meets the road."
Let's get to the topic in the title of this post: "Just what is the backup plan for the backup plan to the backup plan?"
Over the next four posts, I'll be running through the areas where your humble Stingray Traffic Manager can be configured to provide healthy available applications when things go HORRIBLY wrong. We are going to cover a variety of topics from things you can do inside your data centre to ways you can leverage public and private clouds to ensure you are able to keep on keeping on when DISASTER strikes.
This week we are going to cover:
Load Balancing - your first line of defence.
So you have your shiny new virtual ADC installed and you have set up your first virtual server and pool of resources. You are pretty proud of yourself but you wonder:
"Is there anything I could do to make it... Better?"
Chances are that there might be. There are several things that I do instinctively when setting up ADC virtual servers. Now good decisions come from experience, and experience, as we all know, comes from *bad* decisions. So what have I learned in the last few years from all the *bad* decisions I have made looking after ADCs?
1) Make your health checks as granular as you can:
- ICMP Ping is better than nothing;
- TCP Port checks are better than ICMP Pings;
- "Basic HTTP" checks are better than TCP port checks (for web servers anyway); and
- "Full HTTP" checks that actually submit a valid HTTP request and validate the server's response are better still (again for web servers).
If we look at the Basic HTTP checks in the screenshot below, you can see that there is limited scope for customisation.
The Basic HTTP check will send a "GET /" and will accept *any* response as a successful health check. If the web server responds, it is considered healthy.
Now for a more granular health check, we could use the "Full HTTP" check. Out of the box the "Full HTTP" check will perform a similar "GET /" check, but will require the server to respond with a 2XX, 3XX or 4XX status code to be considered healthy. A "500 Internal Server Error" status code would cause the node to fail the health check:
As you can see, we could also specify:
- A Host Header (in case we have virtual hosts being used to present multiple sites on a single back end node)
- A Path to retrieve: /myapplication/login.aspx for example
- A Status Regex to match ('^[234][0-9][0-9]$' is the default that will ensure a 2XX, 3XX or 4XX message is returned)
- An HTTP response body value to match so you can ensure your application is responding correctly.
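To make those options a bit more concrete, here is a minimal Python sketch of what a "Full HTTP" style check evaluates. This is not Stingray's implementation: the `HttpCheck` class, the `is_healthy` and `probe` functions and all of the field names are hypothetical, and I am assuming a status pattern that accepts 2XX, 3XX and 4XX codes as described above.

```python
import http.client
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class HttpCheck:
    """Hypothetical settings mirroring the options listed above."""
    host_header: Optional[str] = None          # virtual host to ask for, if any
    path: str = "/"                            # path to retrieve
    status_regex: str = r"^[234][0-9][0-9]$"   # assumed pattern: 2XX, 3XX or 4XX
    body_regex: Optional[str] = None           # text the body must contain


def is_healthy(check: HttpCheck, status: int, body: str) -> bool:
    """Return True if a response satisfies the status and body rules."""
    if not re.search(check.status_regex, str(status)):
        return False
    if check.body_regex is not None and not re.search(check.body_regex, body):
        return False
    return True


def probe(node: str, port: int, check: HttpCheck, timeout: float = 5.0) -> bool:
    """Send one GET to a node and evaluate the response against the check."""
    headers = {"Host": check.host_header} if check.host_header else {}
    try:
        conn = http.client.HTTPConnection(node, port, timeout=timeout)
        try:
            conn.request("GET", check.path, headers=headers)
            response = conn.getresponse()
            body = response.read().decode("utf-8", errors="replace")
            return is_healthy(check, response.status, body)
        finally:
            conn.close()
    except (OSError, http.client.HTTPException):
        # No connection or a broken response counts as a failed check.
        return False
```

Under these assumptions, a node that answers 200 but serves an error page in place of the login form would fail a check whose `body_regex` looks for the form, even though a Basic HTTP check would happily pass it.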
Now that we have granular health checks in place, it is time to look at what we do when health checks fail.
2) Failure Pools
Failure Pools are designed to let you have a pool that has one or more servers to use if there are no healthy members of the default pool to service the request.
A common use for failure pools is to host a "Sorry" page - a pretty static content page advising users connecting to a service that there is a technical fault or a scheduled outage. To define a Failure Pool, just create a pool in the normal fashion. Let's say we called our failure pool "my_failure_pool". You can then configure your main pool to use it as a backup. Inside the configuration of the main pool, drop down the "Failure Pool" list and select it. It really is that simple, as you can see from the screenshot below:
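If it helps to think of this in code, the decision can be sketched in a few lines of Python. This is only an illustration of the idea, not how the STM does it internally: the `Pool` class, the `choose_nodes` function and the node addresses are all hypothetical. I have let a failure pool have its own failure pool, since that is the "backup plan to the backup plan" in question.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Pool:
    """Hypothetical pool: a name, its nodes and an optional failure pool."""
    name: str
    nodes: List[str]
    failure_pool: Optional["Pool"] = None


def choose_nodes(pool: Pool, healthy: Dict[str, bool]) -> List[str]:
    """Walk the chain of failure pools until one has a healthy node."""
    seen = set()
    current: Optional[Pool] = pool
    while current is not None and current.name not in seen:
        seen.add(current.name)  # guards against pools that point at each other
        alive = [n for n in current.nodes if healthy.get(n, False)]
        if alive:
            return alive
        current = current.failure_pool
    return []  # nothing healthy anywhere in the chain


sorry = Pool("my_failure_pool", ["10.0.0.50:80"])
main = Pool("my_main_pool", ["10.0.0.1:80", "10.0.0.2:80"], failure_pool=sorry)
```

With both main nodes marked unhealthy, `choose_nodes(main, ...)` falls through to the node in "my_failure_pool"; as soon as one main node recovers, the main pool is used again.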
3) Priority Lists
Priority lists are designed to allow you to control which hosts get used in a particular pool. A common use case for Priority Lists is when there is connectivity between two data centres and a pool definition includes resources from both sites. Ideally, you want to always use the nodes that are local to the ADC. If not enough of the local nodes are available, then you want to expand the pool definition to include the nodes on the remote site. In the screenshot below, you can see I have configured the 172.16.10.x nodes as the highest priority group, but if the number of available nodes in this group drops below 1, it will engage the next Priority Group level and use the nodes in the 172.16.135.x range.
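Here is a rough Python sketch of that selection logic, on the assumption that lower priority groups are added to the active set (rather than replacing it) whenever too few healthy nodes are available. The `active_nodes` function, the priority numbers and the exact node addresses are made up for illustration and are not taken from the STM.

```python
from typing import Dict, List


def active_nodes(priorities: Dict[str, int],
                 healthy: Dict[str, bool],
                 min_nodes: int = 1) -> List[str]:
    """Add priority groups, highest first, until min_nodes healthy nodes exist."""
    active: List[str] = []
    for level in sorted(set(priorities.values()), reverse=True):
        active += [node for node, prio in priorities.items()
                   if prio == level and healthy.get(node, False)]
        if len(active) >= min_nodes:
            break  # enough nodes, so lower priority groups stay idle
    return active


# Higher number = higher priority: local site first, remote site as backup.
priorities = {
    "172.16.10.1": 2,
    "172.16.10.2": 2,
    "172.16.135.1": 1,
    "172.16.135.2": 1,
}
```

With `min_nodes=1`, the remote 172.16.135.x nodes only take traffic once every 172.16.10.x node has failed its health check. Raising `min_nodes` to 2 would bring them in as soon as a single local node drops out.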
4) Autoscaling Pools
Stingray Traffic Manager (STM) can automatically scale the number of nodes in a pool based on demand for the service. Out of the box, VMware ESX (tm), Amazon EC2 (tm) and Rackspace (tm) are supported targets.
Many Stingray customers have also built integrations with other autoscaling APIs, such as environments built on OpenStack and CloudStack. The Autoscale feature is given appropriate credentials for the hypervisor or cloud environment and monitors server response times. If the defined percentage of server responses exceeds the maximum response time threshold, the STM will trigger an autoscale to increase the number of nodes in the pool. The autoscaled pool can have a low and a high watermark set to constrain the minimum and maximum number of nodes that will be dynamically provisioned. As you can see from the screenshot below, Autoscaling has been set up against a VMware hypervisor for a minimum of 2 nodes and a maximum of 10 nodes.
In the screenshots above, if 40 percent of the connections exceed a 1000ms response time for a period of 20 seconds, the STM will initiate a scale up of the pool. If 95 percent of the connections return under the 1000ms response time for more than 20 seconds, the STM will initiate a scale down of the pool.
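As a rough illustration of that decision, here is a small Python sketch using the same numbers. It is my own simplification rather than the STM's algorithm: the `scale_decision` function and its parameter names are hypothetical, and I am assuming the caller passes in the response times collected over the last 20 second window.

```python
from typing import Sequence


def scale_decision(response_ms: Sequence[float],
                   node_count: int,
                   *,
                   threshold_ms: float = 1000.0,
                   scale_up_pct: float = 40.0,
                   scale_down_pct: float = 95.0,
                   min_nodes: int = 2,
                   max_nodes: int = 10) -> int:
    """Return +1 to add a node, -1 to remove one, or 0 to leave the pool alone.

    response_ms is assumed to hold every response time seen during the
    hold period (20 seconds in the example above).
    """
    if not response_ms:
        return 0  # no traffic observed, so no evidence either way
    slow = sum(1 for t in response_ms if t > threshold_ms)
    slow_pct = 100.0 * slow / len(response_ms)
    fast_pct = 100.0 - slow_pct  # everything at or under the threshold
    if slow_pct >= scale_up_pct and node_count < max_nodes:
        return 1
    if fast_pct >= scale_down_pct and node_count > min_nodes:
        return -1
    return 0
```

Note the gap between the two thresholds: if, say, 20 percent of responses are slow, neither condition is met and the pool stays the size it is. That gap is what stops the pool from flapping up and down, and the minimum and maximum keep it inside the 2 to 10 node range no matter what the response times look like.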
So as you can see, even with basic load balancing features such as Advanced Health Checks, Failure Pools and Priority Group Lists, plus some advanced features such as Pool Autoscaling, there really are many things that you can do to add more resiliency to your deployment.