Saturday, July 14, 2012

Postinstall scripts are not configuration management

This post is a reaction to reviewing a Puppet manifest at ${WORK} that installs a package and then uses an exec resource to run a script that modifies one of the package's configuration files. Something in me just screams that this approach is wrong, but I was looking for a way to explain why. Rather than just rant and rave at length, which I have been known to do on occasion, I wanted to give the team a succinct set of rules to follow. Fortunately, people aren't robots, so they usually want to understand the reasons behind those rules. This post is my way of fleshing out those reasons.

It should be noted that the author of said manifest is a junior member of staff learning Puppet and so this is not a critique of their work by any means. I wanted to take the time to explain to them and provide guidance to the rest of the team about how various components should be handled.

First, let's start with what this person did right:
  • They fully automated the installation of this agent using Puppet
  • They deployed the binaries and scripts using a package
Based on that alone, this person is already heading in the right direction and is leagues ahead of a large percentage of system administrators. If I asked "How would you install X on Y systems?" in an interview and that was their answer, they would be hired purely on their potential. Furthermore, given that the manual way to install X was to install the package and run the configuration script, it is understandable why the person who wrote the manifest would do it that way.
The issue is that using scripting to modify configuration files or settings has a number of problems:
  • Once you "shell out" of Puppet to perform an action or create configuration, Puppet has no way of knowing what was done.
  • RPM has the same issue, especially if you are using the script to modify the state of files delivered by the package. RPM does have the %config macro to help with this problem, but ultimately once you use a script to change the contents of a file from its original state, RPM verify will start reporting errors. 
  • Whatever the script is doing should be handled explicitly within Puppet. In the example, a configuration file was modified to specify which user should run the agent. That file should have been delivered as an ERB template with the user as a variable (see the sketch after this list). Other examples would be enabling a service, and so on. 
  • Scripts, whether external to the package or embedded as preinstall or postinstall scriptlets, are rarely idempotent; unless they are written specifically to detect the current state of whatever they modify, running them more than once will produce different results. An example would be adding a line to xinetd.conf for a service: to make the script idempotent, you would first need to check whether the line already exists, and only add it if it does not. 
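
To make the template suggestion concrete, here is a minimal sketch of what the manifest could look like once the postinstall script is removed. The module name, file paths, parameter and template contents are hypothetical stand-ins rather than the actual agent under review; the point is the shape: the run-as user is a variable and Puppet owns the whole file.

    # modules/agent/templates/agent.conf.erb (hypothetical)
    # The run-as user is the only site-specific value in the file.
    run_as_user = <%= @run_user %>
    log_dir     = /var/log/vendor-agent

    # modules/agent/manifests/init.pp (hypothetical)
    class agent (
      $run_user = 'agentsvc'    # the account the agent should run as
    ) {
      package { 'vendor-agent':
        ensure => installed,
      }

      # The configuration is declared, not edited by a script, so Puppet
      # knows exactly what the file should contain on every run.
      file { '/etc/vendor-agent/agent.conf':
        ensure  => file,
        owner   => 'root',
        group   => 'root',
        mode    => '0644',
        content => template('agent/agent.conf.erb'),
        require => Package['vendor-agent'],
      }

      # Enabling and running the service is also a declaration, not a scriptlet.
      service { 'vendor-agent':
        ensure    => running,
        enable    => true,
        subscribe => File['/etc/vendor-agent/agent.conf'],
      }
    }

The xinetd example gets the same treatment: rather than a script that greps for the line before appending it, the line (or better, the whole file) is declared in Puppet and converges to the same state on every run.
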
General rules

Based on the above, what general rules can be stated about packages and configuration?

Note: These general rules assume that you have a configuration management tool
  • All binaries and scripts should be packaged. If it won't change once you install it on the system, then it belongs in a package. 
  • All configuration dependencies should be expressed in Puppet (users, services etc)
  • All configuration files should be delivered as ERB templates as part of the Puppet module, and any host-specific modifications should be made by including Puppet variables or system facts. The corollary to this is that configuration files should not be packaged.
  • Packages should contain no pre- or post-install scripts; any configuration requirements should be expressed within Puppet. 
This thinking lines up with how the IPS packaging system works in Solaris 11, which is described in Stephen Hahn's paper pkg(5): a no scripting zone. While the logic of IPS is sound, there is an implicit reliance on another tool (Puppet, Chef etc.) to handle the configuration. There is a hack that uses SMF to launch a script to configure the package, but that just seems out of place and awkward. The flaw in the logic that Sun had and Oracle inherited was that Solaris had no native configuration management tool to handle this, and it still does not.

Benefits of this approach

  • If your packages only contain binaries and scripts that do not change, then package verification checks, such as Red Hat's RPM verify or Solaris' pkgchk(1M), come back clean because the expected contents of the package match what is actually on the system. 
  • There are no conflicts between your package management and configuration management systems. If package scripts modify files delivered by other packages, or packages modify files controlled by Puppet, a discrepancy will arise. Because package scripting runs only once, at installation time, Puppet will simply overwrite the configuration with what it expects. 
  • It is very clear what requirements a particular piece of software has. For example, if a piece of software needs a user defined, this should be clear from the manifest. Should the user need to change, this can easily be handled within Puppet. 

Other considerations

One problem with this approach is that information about a particular component is spread across two different areas: the packages and the Puppet code. It is important that both are kept under source control, preferably within the same repository, because there is a dependency between them. These items should also cross-reference each other in the source repository. 

The other downside arises if you have to manage both systems that are under Puppet control and systems that are not. In that case, I think you need to allow configuration files to be delivered in packages and modified in scripts so that your legacy systems continue to work. For your newer systems, you should also express that configuration within Puppet and live with the expectation that Puppet will redo the work of the package. 

Closing Thoughts

This was a general introduction to how I believe configuration files should be handled. What I would really like to see is a series of patterns that can be used to explain these sorts of concepts. The Limoncelli, Hogan and Chalup book The Practice of System and Network Administration provides a good overview, but could use an update, as the second edition is five years old as I write this. 

Friday, April 27, 2012

Virtuous cycle of devops: Standardisation - The implied link

Joe Kinsella (@joekinsella) from Sonian wrote an excellent article called "Virtuous Cycle of Devops" which succinctly summed up the benefits of adopting a more agile approach to managing systems and the applications that run on them. The article is spot on and sums up really well what I have been trying to achieve for the last two years. You should go and read it; it's very short and well worth it. The main point is that the benefits of this approach flow into each other, creating a cycle of continual improvement.

[Diagram: the virtuous cycle of devops. Source: http://www.hightechinthehub.com/2012/04/virtuous-cycle-of-devops/]

Each of the benefits described in Joe's article relies on the assumption that you are building on a standardised and consistent base. Providing a standardised configuration at the server level allows you to deliver flexibility at higher layers in the stack, because the basic layers are configured the same way and can be treated as a single piece. Standardisation lets you deliver flexibility where innovation is most beneficial to the business: at the application layer. This concept is summed up really well in an article on the SkyDingo blog called "Devops: Flexible Configuration Management? Not So Fast!", where the authors claim that by limiting gratuitous flexibility at the infrastructure level you increase value-added flexibility at the application level.

Sysadmins: this is not just for developers. Having standardised and consistent configurations across your fleet allows you to respond with agility to changing requirements at the infrastructure layer, because you know that all the servers are configured the same way and that any action taken should behave uniformly (Murphy will throw you some edge cases from time to time based on things outside of your control; that's why you need solid automated tests).

There are numerous examples where this sort of approach is applicable, but the main one of benefit to system administrators is the application of security or bug fixes. If your environment is standardised, it becomes a relatively simple exercise to test a new fix in a lab environment and then roll it out to your fleet. On the other hand, if you do not have a standardised environment, the rollout of any fix becomes a configuration-by-configuration (or worse, server-by-server) exercise. Knowing how your servers will behave with a new configuration requirement is the difference between being able to patch hundreds of servers at a time and handling them one by one. 
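
As a small illustration (hedged, with a placeholder package name, version and service), rolling a tested fix out with Puppet becomes a one-line version change that every node converges on at its next run:

    # Hypothetical sketch: every node in the fleet includes this class.
    class ssh::patched {
      # Pin the package at the errata build validated in the lab.
      # The version string is a placeholder, not a real advisory.
      package { 'openssh-server':
        ensure => '5.3p1-123.el6',
      }

      # Restart the daemon only if the package actually changed.
      service { 'sshd':
        ensure    => running,
        enable    => true,
        subscribe => Package['openssh-server'],
      }
    }

Changing one version string and letting the fleet converge is a very different exercise from logging into hundreds of servers by hand.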

Ask yourself: if a zero-day patch was released tomorrow for SSH, how would you handle it? If the answer is "roll it out by hand", you have already lost. These are the sorts of things that differentiate small-scale thinking about individual systems from large-scale thinking about an infrastructure ecosystem. 

In startups these days, working from a standard configuration is a basic assumption, and through the use of tools like Puppet, Chef and CFEngine it is becoming more and more mainstream. If you look at some of the names of companies sending people to PuppetConf and the upcoming ChefConf, these ideas are also catching on in very large enterprises, and that is a very different space with very different requirements.

In green-field environments, it is relatively easy to control configuration drift if you built the environment correctly from the start using configuration management practices. In legacy environments, it is not that easy. You have existing servers built by different people over a number of years, new operating system releases come out leaving the older ones behind, and technical debt piles up if left unchecked. Pulling that all together is really difficult, and as we see the adoption of configuration management tools in large enterprises, this is something that should be spoken about more openly at conferences. This is not a solved problem, not by a long shot.

I know of one investment bank (not my current employer, but I'd love to work there) that rebuilds its global infrastructure every night to ensure absolute consistency and avoid configuration drift. While I personally think that is a little over the top, to achieve that level of control over and confidence in your infrastructure is really the pinnacle of system administration, regardless of whether you are a startup or a bank that has been around for 200 years. 

If you are not using a configuration management system, pick one, use it and get on to more interesting things like adding value for your business. 

Friday, April 13, 2012

How to not get bitten twice (or OODA loop in action)

Over the last two years, I have had the pleasure of working with one of the best admin teams of my career, and here is a simple example of why.

You've discovered an issue; let's use an NFS performance problem as an example. Through some digging in /proc/net/rpc/nfsd (explanation of contents here) you determine that you have too few NFS threads configured (all NFS threads were busy and IO was stalling) and that this has happened a large number of times since boot. What do you do?

Note: Most of this actually happened, but some of it is what I would like to have happened (doco and build not updated yet)
  • Scan the fleet for other servers with the same issue
  • Create changes to fix the issue and advise your customers
  • Develop a custom SNMP extension that outputs a 1-minute rolling value of the stalls instead of the absolute value
  • Plug monitoring into your tool of choice (Zenoss, Zabbix, OpenView, Patrol etc)
  • Set an alert threshold to generate events in case the problem ever returns (I love the etsy engineering quote "If it moves, we track it. Sometimes we’ll draw a graph of something that isn’t moving yet, just in case it decides to make a run for it" http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/)
  • Update your internal wiki so that when the alert is generated again, the on-call guy knows what to do. 
  • Fix your build (hopefully using configuration management, as in the sketch after this list) so you won't have the same problem in the future
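
A hedged sketch of what that last step might look like in Puppet: it assumes a RHEL-style /etc/sysconfig/nfs managed via the Augeas Shellvars lens, an "nfs" service, and a thread count of 64, all of which are placeholders rather than the actual values from this incident.

    class nfs::server::tuning (
      $nfsd_count = 64    # placeholder; size this from the observed stalls
    ) {
      # Manage just the one setting rather than hand-editing the file.
      augeas { 'rpcnfsdcount':
        lens    => 'Shellvars.lns',
        incl    => '/etc/sysconfig/nfs',
        changes => "set RPCNFSDCOUNT ${nfsd_count}",
        notify  => Service['nfs'],
      }

      service { 'nfs':
        ensure => running,
        enable => true,
      }
    }

Every rebuilt or newly provisioned host then starts with the corrected thread count, and the monitoring added above becomes a check that the fix is actually holding.
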
There are a couple of things that are important for this to work properly:
  • You need to be able to understand the problem. Having people around who have a deep understanding of technologies like NFS (we use it a lot) and where problems can occur is really useful. You can have the best tools in the world, but you need the ability to use them, and that requires understanding. 
  • You need to be able to move quickly. This problem and the fix were discovered in Australia, the SNMP extension and graphing were developed in London, and the alerting, documentation and build updates were configured in the US; now that is a seamless handover process if ever I saw one. The Australians left the office at 5 PM with a problem and a fix, and when they came in the next day at 9 AM the monitoring and alerting was already plugged in, so they could assess the state of the problem (Observe, Orient, Decide, Act and back to Observe again ... just in case it makes a run for it). 
  • You need to be able to get it out there. Developing a fix is great, but you need to be able to deploy it quickly. The number of hosts affected here was limited, but had we needed to push the change to the whole fleet quickly, something like Puppet with custom resource providers for the SNMP extensions would have made that straightforward. 
  • Most of all, you need people with the right mindset and that is why I love working with the team.
Icing on the cake

If you have not read The Practice of System and Network Administration then you should. It is a fantastic book about how to be a sysadmin, not necessarily technology A or B, but rather promoting the right mindset and providing an overview of the required knowledge areas. One of the things I like about it is that it sets out recommended practices and then provides the icing on the cake section. I am all about cake and icing, both figuratively and literally.

While I would like to think that the scenario I described above is a well-implemented best practice, alas it is not, and people who take such an approach are few and far between. At any rate, if I were to improve the above scenario, instead of alerting and waking someone up (this is not necessarily worth waking someone for), I would like to see the system automatically scale up the number of NFS threads and drop a message in the logs saying it did just that. If Linux handled this automatically, that would be great, and I will log an RFE with our vendor to do just that. 

In the interim, here's something cool: Facebook have implemented a self-healing system called FBAR (FaceBook Auto Remediation - https://www.facebook.com/notes/facebook-engineering/making-facebook-self-healing/10150275248698920) that automatically responds to such issues with automatic fixes, only escalating to a human if necessary. Now if only I could figure out how to increase the number of threads without a restart ... off to Google. 

Sunday, April 8, 2012

Linux Long Term Support and why enterprises think they need it

Disclosure: I have worked in various enterprises (banking and telcos) for the last 10 years or so as a UNIX Admin, the last 5 in a Bank.

The guys at the Food Fight Show recently had a discussion called "Distro Dancing" where they shared their opinions on various Linux distros: which one they used first, which ones they like now, and why they find them useful. It was interesting to see how people's needs have changed as Linux developed from a plaything into an OS that powers many of today's data centres, and how that shift affects their requirements. 

The main gist was that if you are using tools like Chef or Puppet, then you end up caring less about the distro, because all of the functionality for configuration is abstracted behind the respective tools and you are not tied to a particular distro's administrative interfaces. While I agree with this at a high level, there are some added extras that particular distros bring to the table, such as their associated software repositories. The ability to "yum install" or "apt-get install" a particular application or library without the need for customized repos and packaging is quite useful. 

John Vincent (@lusis) made a very insightful series of comments around the side effects of distros such as RHEL and Debian/Ubuntu and their Long Term Support implementations. Long story short, the Red Hat guys lock down their major software versions about three years before the release actually ships. This means that by the time the release is generally available, it is already behind the state of the art in terms of package versions. The example given was RHEL locking the system version of Python at 2.4 for its system tools, where 2.7 is current and 2.6 is still very popular, and the impact that this has on people wanting a more up-to-date version of Python to work with. 

The question I asked myself was: why do enterprises require operating systems that are Long Term Supported? By that I am not only referring to the length of the support contract, although that does play a part, but also to some of the restrictions around major version lockdowns. Here are some reasons that came off the top of my head (in no particular order):


  • Enterprises use software from major software producers (Oracle, Sybase, Weblogic etc.). Version lockdowns allow those software producers to release software that will work with a known OS configuration.  By keeping their versions of the underlying system libraries (glibc etc) stable, the software vendors have less to worry about when it comes to certifying releases of their products. This is a feedback loop where the demand for software compounds the demand for locked down OS releases. Long story short, if you want one of the big DB players, you will end up on Red Hat or Solaris. 
  • Enterprises have usually been around for a while and have collected some serious technical debt. You can argue whether or not this is a valid argument, but it exists and it is reality for many admins. Developers move on, projects are abandoned, but unless someone is keeping a very keen eye on the software inventory, that code is still running on an OS somewhere and its support details can be somewhat sketchy. Because of this technical debt, people can become very tied to specific server configurations, and any attempt to "mess with" said configurations is met with fear and trembling. Keeping the binary compatibility guarantees of, say, Red Hat or Solaris means that these legacy apps can keep on running without intervention.
  • Enterprises do not like surprises. One of the requirements that a lot of enterprises have is that of regular patching. Were patching to introduce major version changes to parts of the system, then the anxiety associated with patching would be much higher. Imagine if Apache were to suddenly deprecate the use of single file httpd.conf configurations and force everyone over to httpd.d type configurations, this would break multitudes of applications and stop any progress of keeping systems up to date. In an ironic twist, locking down versions of system software actually helps to keep them current (at least in terms of security fixes) as there is higher confidence in the patching process. 
There are probably many more reasons that enterprises want Long Term Supported operating systems; however, I am more interested in what admins can do to avoid getting locked into specific OS releases or specific server configurations in the first place, so that moves to more up-to-date releases are easier. 


In summary

  • STOP DOING THINGS BY HAND and get a configuration management tool
  • Use said configuration management tool for defining your applications' requirements
  • Test your applications' portability to flush out hidden dependencies
  • Support all of the above with proper policies around life cycle management and configuration management. 

1. STOP DOING THINGS BY HAND. Seriously, 1999 wants its administration techniques back. With tools like Chef or Puppet available for free or with support contracts, there is no excuse for hand-crafting configurations on servers anymore. That's all very nice to say, but what are the consequences of hand-crafting a server?
  • There is no reasonable way to replicate the environment. This means that when you need to move from one OS version to another for support reasons, the process is based on the administrator's ability to document (or remember, if it is the same person) the steps required to configure the server and install the application. 
  • Hand crafting a server limits your ability to create production-like development or test environments or staging environments for moving your applications to a new server. 
  • People become very attached to their hand-crafted servers because those servers were built specifically to run their application. When this happens, they become complacent about maintaining the documentation and configuration of the servers, because the servers will always be there, right? Wrong! You will eventually need to upgrade the hardware or the OS, and that "one-off hack adding a symlink to X" will come back and bite you when you move over to the new machine.
Side note: I remember one particularly experienced application administrator telling me to always remember the three P's of things that can go wrong in a migration: Passwords, Profiles and Permissions. This has stayed with me and influenced much of my thinking around configuration management. 

2. Explicitly define *all* of your specific dependencies. If your application requires a particular version of a library, a particular user to be defined or a particular directory permission set, then this should be explicitly called out. This can be done in documentation, but we all know that documentation suffers from bit rot just like software; it becomes outdated or downright inaccurate over time. The best way to enforce these dependencies is with a configuration management system like Chef or Puppet, because not only are they set at install time, they are also actively maintained as part of the running configuration of the server, so that if a permission is changed by hand it will be changed back. Defining all of your configuration explicitly in a configuration management tool allows you to recreate the application environment (think of the three P's) on a new system without worrying that particular pieces are missing. A sketch of what this can look like follows below.
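
Here is a hedged sketch of that idea in Puppet; the package, version, user and paths are placeholders for whatever your application actually depends on, not a real inventory:

    class myapp::requirements {
      # The exact library version the application was certified against.
      package { 'zlib':
        ensure => '1.2.3-29.el6',    # placeholder version string
      }

      # The service account the application runs as.
      user { 'myappsvc':
        ensure => present,
        home   => '/opt/myapp',
      }

      # Directory ownership and permissions that are easy to lose in a migration.
      file { '/opt/myapp/data':
        ensure  => directory,
        owner   => 'myappsvc',
        group   => 'myappsvc',
        mode    => '0750',
        require => User['myappsvc'],
      }
    }

Because Puppet enforces these on every run, a permission changed by hand drifts back, and recreating the environment on a new host (think of the three P's) is a matter of applying the same class.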

Side note: This does force people to work within the confines of the configuration management tool, but that is more of an organizational issue than a technical one.

3. Keep your applications portable. I made a previous attempt at discussing this, but basically, if your application can be *easily* moved between servers, then the odds are high that you do not have undocumented underlying system dependencies, and this will allow for an easy migration. If your application runs on a single server and it was deemed important enough to have a DR backup, then you should be moving the application between servers on a semi-regular basis to ensure portability. This could be done during official DR testing, or more often if it is associated with events like regular patching and maintenance. If your application runs across multiple machines, then adding new machines to the cluster will also test the configuration and ensure that there are no hidden system dependencies.

4. Don't let things become stale in the first place. The longer you allow configurations to drift or OS releases to age in your environment, the harder it will be to move off them. The people who originally installed the servers and the applications may no longer work for the company, finding support for the operating system may become much harder (note: RHEL 4 and Solaris 8 both hit end of life recently) and the inevitable fear and trembling will set in: "You can't patch server X, no one knows how it works!". The only way to avoid this is to have strong policies in place around OS support (and subsequent upgrades) and configuration management. If people know that they will have to move their application in X years (defining X is an exercise for the reader), they will be less complacent about sloppy configuration management practices; but for this to work properly, they need the right tools to record and maintain the configuration.

Am I saying that if you do all of those things, your enterprise will move from RHEL 5 to Fedora overnight? No, but it will make the move from RHEL 5 to RHEL 6 a lot easier than it otherwise would have been. If people have confidence in your ability to move from known configuration to known configuration, then maybe there would be a more relaxed attitude to, say, moving from RHEL to a distro that is slightly more up to date; but for that to happen, you first have to put in the hard yards of collecting and maintaining all of that configuration.

There is an alternative to relying on system-defined requirements, which is to bundle all application requirements beyond the basic system libraries into an application filesystem maintained by the developers. For example, if your application requires zlib-X, then bundle it with your application. This works very well for maintaining independence from the underlying server OS, but places a very high burden on the application support teams because they need to track and update versions of software as patches are released. It is much easier to let the OS vendor track and maintain this software; however, that comes at the expense of the application's isolation from OS changes. I personally do not recommend bundling, as developers should be spending their time developing rather than tracking and maintaining dependencies.

Wednesday, April 4, 2012

Devops Days Austin - Day Two

Day two in Austin was quite useful. There were talks from Etsy on how they handle security in a devops manner, and NI also talked about how they set up a SaaS team to handle new deployments in the cloud.

The open spaces were still my favourite part with lots of discussions about centralised logging, rugged devops and devops in a large IT arena. I met people with the same sorts of problems that I have (lots of legacy stuff) and people who do not (green fields in the cloud type startups).

All in all an excellent conference.

Monday, April 2, 2012

Devops Austin Day 1

Today I attended the devops days conference in Austin (http://www.devopsdays.org/events/2012-austin/). The usual suspects were there: John Willis, Damon Edwards, Matt Ray etc., but so were a bunch of people who hadn't attended previous devops conferences. This made for a good mix of those inside and outside the echo chamber and brought in some new perspectives.

* John Willis gave a good "Devops State of the Union" presentation.
* James Turnbull had an excellent presentation on Devops and Security and how the two are not so different and actually complementary.
* Michael Cote gave a good overview of the work that Dell did with Crowbar.

There was a provisioning panel where six different people from different vendors (and one consulting company) gave presentations on their respective products. Dan Bode and Matt Ray gave presentations about Puppet and Chef respectively. Despite some of the heated discussions about the differences between the two, both at this conference and at the Austin conference last year, Matt and Dan kept it quite professional. Ultimately, it does not matter whether you use Chef or Puppet; just pick one and do something with it.

I had an excellent lunch conversation with James Turnbull and Dan Bode about Puppet (and the issues of being an Australian in the US).

This was my first "open spaces" conference. I really like the idea and "the law of two feet" meant that people could wander in between them as necessary. I went to the following open spaces

* Monitoring sucks
* Devops cultural issues
* Devops Kanban

The open spaces part of the conference is where you really get to learn as it is all very open with people asking questions and providing answers.

Things I got out of day 1

* Tools are cool, but won't fix fundamental issues
* Security should be embraced early on in a project as it is easier to integrate security requirements upfront.
* Dell's crowbar project is pretty cool, but for our bare metal provisioning, it is overkill.
* Cultural change is really helped by the higher ups changing behavior and this has a trickle down effect.
* Running remote teams is difficult, it is not just me.
* Kanban is not just for dev teams
* EVERYONE is hiring right now.

I had to duck off to a meeting so missed the drinks, but am definitely looking forward to tomorrow.

Sunday, January 29, 2012

Dependence on specific servers and its effect on agility

In the enterprise, people get very attached to servers as opposed to compute capacity. You can't hug something as amorphous as compute capacity, but you can get very attached to server 1 in location X.

This usually happens by accident rather than by design. To get a server, application support teams worked with their technical teams to specify its configuration, waited weeks for it to arrive, checked it to make sure it met their requirements, in some cases even named it, and most of all, they paid for it! The server is then reconfigured as required, perhaps multiple times over a number of years. This leads to a kind of special status being given to "their server", which usually manifests itself during outages:

Server Support: "I rang to let you know that server 1 has failed and we are working on it, can you fail over your applications?"
Application Support: "We need server 1 up and running"
Server Support: "Server 1 has a hardware issue, you have servers 2, 3 and 4 in the same location, why can't you run there?
Application Support: "You don't understand, server 1 is special and we need it back up or X will fail!"

Through some action, either by the application or server support teams, server 1 has become unique giving it an elevated status. At the core of this is the question of what exactly is a server?

At a simple level, a "server" is nothing more than a bunch of bits on a disk which provides compute capacity. This splits the concept of a server into two pieces;


  • The hard physical components - Hardware, Network connections etc.
  • The softer logical components - Operating System, Configuration etc  
It is when these two components are joined together and referenced specifically that the issues arise.  

Application reliance on specific servers is something that should be avoided. I think this concept is well understood by most people in technology; however, it is not always clear how it can be avoided. Aliases, virtualisation, clustering and load balancing are all relatively simple ways to abstract the ultimate destination of a request to a "server". Examples of this include:

  • Providing a DNS alias to a server allows the alias to be moved to other servers as necessary
  • Virtualisation allows the "server" to be moved to other hardware as necessary.
  • Clustering allows applications to be moved between servers and uses an alias to a virtual IP address to allow the clients to maintain the same reference. 
  • Load balancing allows requests for an application to be redirected as required 
Virtualisation takes a good shot at solving both the hardware issues and the configuration issues associated with relying on a specific server. If the hardware fails, then the "server" runs on alternate hardware in the cluster. There is no need to worry about configuration differences, because the "server" is moved as a whole, taking its operating system, configuration and identity (DNS name or IP address) with it. Surely this solves all the problems?

Unfortunately, it doesn't. Many organisations run 24x7 businesses where there is no "good time" to take down a server for maintenance such as patching or hardware upgrades. This brings me back to the point of the post: relying on specific servers for an application is a bad idea. I provided a number of techniques for getting around direct references, but none of them deal well with both the hardware and the configuration issues while also allowing for maintenance. The closest are virtualisation for "server" availability and clustering for application availability, whether through a high-availability, application-level or load-balancing cluster. 

What is needed is a means of abstracting away from the idea of an application running on a server to an application requiring compute capacity. This requires that you have the ability to easily move an application from one host to the next. I deliberately switched from the word server to the word host because it conveys more adequately what is required: somewhere to run, or "host", an application rather than a specific "server", which is a combination of hardware, operating system and configuration. 

Once you can easily move your application from one host to the next, this opens a lot of opportunities for you in terms of reduced outages to the applications.  
  • Patching simply becomes a series of application moves: applications are moved to patched or newly built hosts, and the older hosts are patched or rebuilt while nothing is running on them. Note: I did not get into patch vs rebuild; I will save that for another post. 
  • Capacity upgrades (in a scale up sense) become nothing more than an application move. 
In order to move between hosts with confidence, application support teams need the assurance that one host is configured the same as the next. The only practical way to achieve this is through a tool that allows one server to be configured the same as the next. While it is possible to use AMIs or the more traditional Ghost or gold image approach to handle this, the focus of the next post will be on using configuration management tools such as Puppet or Chef to deploy hosts with the same configuration, using the specific example of patching to show how this enables a more flexible approach.