This usually happens by accident rather than by design. To get a server, application support teams worked with their technical teams to specify its configuration, waited weeks for it to arrive, checked that it met their requirements, in some cases even named it and, most of all, paid for it! The server is then reconfigured as required, perhaps multiple times over a number of years. This leads to a kind of special status being given to "their server", which usually manifests itself during outages:
Server Support: "I rang to let you know that server 1 has failed and we are working on it, can you fail over your applications?"
Application Support: "We need server 1 up and running."
Server Support: "Server 1 has a hardware issue. You have servers 2, 3 and 4 in the same location, why can't you run there?"
Application Support: "You don't understand, server 1 is special and we need it back up or X will fail!"
Through some action, either by the application or server support teams, server 1 has become unique, giving it an elevated status. At the core of this is the question: what exactly is a server?
At a simple level, a "server" is nothing more than a bunch of bits on a disk that provides compute capacity. This splits the concept of a server into two pieces:
- The hard, physical components: hardware, network connections, etc.
- The softer, logical components: operating system, configuration, etc.
It is when these two components are joined together and referenced specifically that the issues arise.
Application reliance on specific servers is something that should be avoided. I think this concept is well understood by most people in technology, however it is not always clear how it can be avoided. Use of aliases, virtualisation, clustering and load balancing are all relatively simple ways to abstract the ultimate destination of where a request to a "server" ends up. Examples of this include:
- Providing a DNS alias to a server allows the alias to be moved to other servers as necessary.
- Virtualisation allows the "server" to be moved to other hardware as necessary.
- Clustering allows applications to be moved between servers and uses an alias to a virtual IP address so that clients can maintain the same reference.
- Load balancing allows requests for an application to be redirected as required.
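The common thread in all of these techniques is a layer of indirection between the name a client uses and the host that actually serves it. A minimal sketch in Python (the registry and host names below are hypothetical stand-ins for DNS records, a cluster's virtual IP or a load balancer):

```python
# Illustration of alias-based indirection: clients look up a service
# name, never a specific server. The dict is a stand-in for DNS
# (a CNAME record), a virtual IP or a load balancer's backend map.

ALIASES = {
    "orders-db": "server2.example.com",  # hypothetical host names
}

def resolve(service_name: str) -> str:
    """Return the host currently behind the alias."""
    return ALIASES[service_name]

def fail_over(service_name: str, new_host: str) -> None:
    """Repoint the alias; clients keep using the same name."""
    ALIASES[service_name] = new_host
```

Clients only ever call `resolve("orders-db")`; after a failure, running `fail_over("orders-db", "server3.example.com")` redirects them without any client-side change.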
Virtualisation takes a good shot at solving both the hardware issues and the configuration issues associated with relying on a specific server. If the hardware fails, then the "server" runs on alternate hardware in the cluster. There is no need to worry about configuration differences, because the "server" is moved as a whole, taking its Operating System, Configuration and its identity (DNS name or IP Address) with it. Surely this solves all the problems?
Unfortunately, it doesn't. Many organisations run 24x7 businesses where there is no "good time" to take down a server for maintenance such as patching or hardware upgrades. This brings me back to the point of the post, relying on specific servers for an application is a bad idea. I provided a number of techniques for getting around direct references, but none of them really deal well with both the hardware and the configuration issues and at the same time allow for maintenance. The closest is virtualisation for "server" availability and clustering for application availability, either through a high availability, application-level or load balancing cluster.
What is needed is a means of abstracting away from the idea of an application running on a server to an application requiring compute capacity. This requires that you have the ability to easily move an application from one host to the next. I deliberately switched from the word "server" to the word "host" because it conveys more adequately what is required: somewhere to run, or "host", an application, rather than a specific "server", which is a combination of hardware, operating system and configuration.
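One way to picture the shift from "runs on server 1" to "needs compute capacity" is a placement function that picks any healthy host with room, rather than a named machine. A sketch under assumed data structures (the `Host` type and CPU figures are invented for illustration):

```python
# Hypothetical sketch: an application declares the capacity it needs,
# and a placement function chooses any suitable host from a pool.
from dataclasses import dataclass

@dataclass
class Host:
    name: str
    free_cpus: int
    healthy: bool = True

def place(app_cpus: int, pool: list) -> Host:
    """Pick any healthy host with enough free capacity."""
    for host in pool:
        if host.healthy and host.free_cpus >= app_cpus:
            host.free_cpus -= app_cpus
            return host
    raise RuntimeError("no capacity available")

pool = [Host("host1", 2, healthy=False), Host("host2", 8)]
print(place(4, pool).name)  # host1 is down; the app lands on host2
```

The application never names a host; it only states what it needs, and any host in the pool will do.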
Once you can easily move your application from one host to the next, a lot of opportunities open up to reduce application outages.
- Patching simply becomes a series of application moves: applications are moved to patched or newly built hosts, and the older hosts are patched or rebuilt while nothing is running on them. Note: I am not getting into patch vs rebuild here; I will save that for another post.
- Capacity upgrades (in a scale-up sense) become nothing more than an application move.
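Put together, patching a fleet reduces to a loop of application moves: drain a host, patch or rebuild it, move on. A hedged sketch, where `move` and the patch step are placeholders for whatever mechanism your platform actually provides:

```python
# Illustrative only: rolling patching expressed as application moves.
# 'apps' maps each application to the host it currently runs on.

def move(apps: dict, app: str, target: str) -> None:
    apps[app] = target  # placeholder for a real migration step

def rolling_patch(hosts: list, apps: dict) -> list:
    """Drain each host in turn, then patch it while it is empty."""
    patched = []
    for host in hosts:
        spare = next(h for h in hosts if h != host)
        for app, where in list(apps.items()):
            if where == host:
                move(apps, app, spare)  # drain before patching
        patched.append(host)  # stand-in for the actual patch/rebuild
    return patched
```

Because no application ever depends on a specific host, no host ever needs a "good time" to go down: there is always somewhere else for the applications to run.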
In order to move between hosts with confidence, application support teams need the assurance that one host is configured the same as the next. The only practical way to achieve this is with a tool that enforces a consistent configuration across hosts. While it is possible to use AMIs or a more traditional Ghost or gold image to handle this, the focus of the next post will be on using configuration management tools such as Puppet or Chef to deploy identically configured hosts, using the specific example of patching to show this flexible approach in action.
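Before Puppet or Chef enter the picture, the underlying requirement can be stated simply: every host's actual configuration must match one desired state, and any difference is drift to be corrected. A hypothetical sketch of that check (the settings and their keys are invented examples):

```python
# Hypothetical sketch of the guarantee configuration management gives:
# compare each host's actual settings against a single desired state.

DESIRED = {"os_release": "22.04", "ntp_server": "ntp.example.com"}  # assumed keys

def drift(actual: dict) -> dict:
    """Return the settings where a host differs from the desired state."""
    return {k: actual.get(k, "<missing>")
            for k, v in DESIRED.items()
            if actual.get(k) != v}
```

A host that matches reports no drift; a tool like Puppet or Chef goes one step further and converges the host back to the desired state, which is what makes any host as good as any other.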