Friday, April 13, 2012

How to not get bitten twice (or OODA loop in action)

Over the last two years I have had the pleasure of working with one of the best admin teams I have ever come across, and here is a simple example of why.

You've discovered an issue; let's say an NFS performance problem. After some digging in /proc/net/rpc/nfsd (explanation of contents here) you determine that you have too few NFS threads configured: all the NFS threads were busy and IO was stalling, and this has happened a large number of times since boot. What do you do?
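To make the digging concrete, here is a minimal Python sketch of pulling the thread numbers out of that file. It assumes the older layout of the "th" line, where the first number is the configured thread count and the second is how many times all threads were busy at once; newer kernels may report zeros there or omit the line entirely, so check yours. The function name and the sample numbers are made up for illustration.

```python
def parse_nfsd_threads(stats_text):
    """Return (thread_count, times_all_busy) from the 'th' line, or None.

    Assumes the older 'th <threads> <all-busy count> <histogram...>' layout.
    """
    for line in stats_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "th":
            return int(fields[1]), int(fields[2])
    return None


# Made-up sample in the assumed layout: 8 threads, all busy 1523 times since boot.
sample = (
    "rc 0 0 1\n"
    "th 8 1523 120.500 60.250 30.125 12.000 6.000 3.000 1.500 0.750 0.250 0.100\n"
)
print(parse_nfsd_threads(sample))
```

On a real NFS server you would pass in the contents of /proc/net/rpc/nfsd instead of the sample string. A large second number on a busy box is the hint that the thread count is too low.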

Note: Most of this actually happened, but some of it is what I would like to have happened (the doco and build have not been updated yet).
  • Scan the fleet for other servers with the same issue
  • Create changes to fix the issue and advise your customers
  • Develop a custom SNMP extension that outputs a 1-minute rolling value of the stalls instead of the absolute value
  • Plug monitoring into your tool of choice (Zenoss, Zabbix, OpenView, Patrol etc)
  • Set an alert threshold to generate events in case the problem ever returns. I love the Etsy engineering quote: "If it moves, we track it. Sometimes we’ll draw a graph of something that isn’t moving yet, just in case it decides to make a run for it" (http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/)
  • Update your internal wiki so that when the alert is generated again, the on-call guy knows what to do. 
  • Fix your build (hopefully using configuration management) so you won't have the same problem in the future
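The "1-minute rolling value" step is the one that benefits most from a worked example. The counter in /proc only ever goes up, so an alert on the absolute value would fire forever once it crossed the threshold. What follows is a rough Python sketch of the delta calculation only, not the extension that was actually written; the function name and sample format are my own invention, and storing the previous sample and hooking the result into the SNMP agent (for example via Net-SNMP's extend mechanism) are left out.

```python
def stalls_per_minute(prev, curr):
    """Scale the counter increase between two samples to a per-minute rate.

    prev and curr are (unix_time_seconds, counter_value) tuples.
    Returns None if the samples are not in time order.
    """
    prev_time, prev_count = prev
    curr_time, curr_count = curr
    elapsed = curr_time - prev_time
    if elapsed <= 0:
        return None
    if curr_count < prev_count:
        # Counter went backwards: the host rebooted or nfsd restarted,
        # so everything counted so far happened since the reset.
        increase = curr_count
    else:
        increase = curr_count - prev_count
    return increase * 60.0 / elapsed


# Two samples taken 60 seconds apart: 30 new stalls in that minute.
print(stalls_per_minute((1000, 100), (1060, 130)))
```

Handling the counter reset matters here: without it, the first sample after a reboot would produce a large negative value and confuse both the graph and the alert threshold.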
There are a couple of things that are important for this to work properly:
  • You need to be able to understand the problem. Having people around who have a deep understanding of technologies like NFS (we use it a lot), and of where the problems can occur, is really useful. You can have the best tools in the world, but you need the ability to use them, and that requires understanding.
  • You need to be able to move quickly. This problem and the fix were discovered in Australia, the SNMP extension and graphing were developed in London, and the alerting, documentation and build updates were configured in the US. Now that is a seamless handover process if ever I saw one. The Australians left the office at 5PM with a problem and a fix, and when they came in at 9AM the next day the monitoring and alerting were already plugged in, so they could assess the state of the problem (Observe, Orient, Decide, Act and back to Observe again ... just in case it makes a run for it).
  • You need to be able to get it out there. Developing a fix is great, but you need to be able to roll it out quickly. The number of hosts for this was limited, but had we needed to get it out to the whole fleet quickly, something like Puppet with custom resource providers for SNMP extensions would have been the way to do it.
  • Most of all, you need people with the right mindset and that is why I love working with the team.
Icing on the cake

If you have not read The Practice of System and Network Administration then you should. It is a fantastic book about how to be a sysadmin. It is not about technology A or B, but rather about promoting the right mindset and providing an overview of the required knowledge areas. One of the things I like about it is that it sets out recommended practices and then adds an "icing on the cake" section. I am all about cake and icing, both figuratively and literally.

While I would like to think that the scenario I described above is a well-implemented best practice, alas it is not, and people who take such an approach are few and far between. At any rate, if I were to improve the above scenario, instead of alerting and waking someone up (this problem is not necessarily worth that), I would like to see the system automatically scale up the number of NFS threads and drop a message in the logs saying it did just that. If Linux handled this automatically, that would be great, and I will log an RFE with our vendor to do just that.

In the interim, here's something cool: Facebook have implemented a self-healing system called FBAR (FaceBook Auto Remediation - https://www.facebook.com/notes/facebook-engineering/making-facebook-self-healing/10150275248698920) that automatically responds to such issues with automatic fixes, only escalating to a human if necessary. Now if only I could figure out how to increase the number of threads without a restart ... off to Google.
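For what it is worth, here is a rough Python sketch of what such an auto-remediation could look like. It is nothing like FBAR itself, just the scale-up-and-log idea from above. It rests on one assumption you should verify on your own kernel: that writing a number to /proc/fs/nfsd/threads changes the running thread count without a restart. The function names, step size, threshold and ceiling are all hypothetical.

```python
import logging

log = logging.getLogger("nfsd-autoscale")


def next_thread_count(current, stalls_last_minute,
                      stall_threshold=1, step=8, ceiling=128):
    """Decide the new thread count; never shrink, never exceed the ceiling."""
    if stalls_last_minute < stall_threshold or current >= ceiling:
        return current
    return min(current + step, ceiling)


def apply_thread_count(count, path="/proc/fs/nfsd/threads"):
    """Write the new count. Assumes the nfsd filesystem is mounted at the
    usual place and accepts a plain number; needs root."""
    with open(path, "w") as f:
        f.write(f"{count}\n")


def remediate(current, stalls_last_minute, apply=apply_thread_count):
    """Scale up if needed and leave a message in the logs saying so."""
    new_count = next_thread_count(current, stalls_last_minute)
    if new_count != current:
        apply(new_count)
        log.warning("NFS threads raised from %d to %d after %d stalls/min",
                    current, new_count, stalls_last_minute)
    return new_count
```

Keeping the decision separate from the write means the policy can be tested on a laptop, and the ceiling stops a runaway loop from spawning threads forever; once the ceiling is hit, that is the point to escalate to a human.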




