Thursday, December 5, 2013

Medical Data Center Reliability

Well, it isn't Wednesday, but this tweet got me thinking about a FAQ:

The frequently asked question is this: why aren't Medical Data Centers...better? Given that their mission is so important, not just to an organization but to people's lives, why aren't hospitals and labs better at keeping their systems up and running?

The short answer is that they are trying too hard with an odd set of rules and a strange budgeting process.

Trying Too Hard
In my experience, small medical organizations can't afford good technology or good people, but large ones fail as well for the opposite reason: they are trying too hard.

Redundancy is the obvious way to get good uptime (reliability, uninterrupted service, however you like to characterize "running well"). But there are two different kinds of redundancy:

System Level Redundancy, which means having more than one solution to a given problem: two independent, synchronized, preferably different systems doing the same job. If System A has a problem, System B takes over.
Component Level Redundancy, which means having more than one component in the system doing a particular job. So you only have System A, but it has lots of redundant disks, power supplies and whatever else you think you need.
The computers aboard the now-defunct Space Shuttle were an excellent example of System Level Redundancy: five independent computers, all doing the same tasks at the same time. In order for computing to fail on the Shuttle, all five systems would have to die at once. For good measure, they kept tabs on each other, so you had confidence that they were working properly.
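
To make "keeping tabs on each other" concrete, here is a minimal sketch of majority voting across redundant systems. It is not the Shuttle's actual mechanism; the function name majority_vote and the sample readings are hypothetical, chosen only for illustration.

```python
from collections import Counter


def majority_vote(outputs):
    """Return (agreed_value, dissenting_indexes) for redundant outputs.

    Raises ValueError when no value is reported by a strict majority.
    """
    if not outputs:
        raise ValueError("no outputs to compare")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among redundant systems")
    # Any system that disagrees with the majority is a suspect.
    dissenters = [i for i, out in enumerate(outputs) if out != value]
    return value, dissenters


# Hypothetical readings from five redundant computers; computer 3 disagrees.
value, suspects = majority_vote([42, 42, 42, 17, 42])
print(value, suspects)  # prints: 42 [3]
```

The useful property is that a disagreeing system gets flagged while the service keeps running on the answer the others agree on.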

Storage Area Networks are an example of Component Level Redundancy: you tell your server that it has a disk attached, but the "disk" is actually a computer which is sending your data to separate disks.

System Level Redundancy is expensive (you need multiples of everything) and daunting (you have to maintain multiple systems and keep them in sync somehow). But it covers all kinds of failures if you do it correctly: power problems, water mains breaking, etc.

Component Level Redundancy is cheaper (you only have extras of the parts you need) and more inviting (the complexity is hidden). But it makes the infrastructure more complicated and it does not help if a non-redundant component fails, e.g., if water floods your primary data center (you have an offsite backup data center, right?).
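
The trade-off can be sketched with some back-of-the-envelope availability arithmetic. Everything here is an assumption for illustration: the availability figures are made up, failures are treated as independent, and the helper names series and parallel are my own.

```python
from math import prod


def series(*availabilities):
    """Every part must work: multiply the availabilities."""
    return prod(availabilities)


def parallel(*availabilities):
    """Any one part is enough: 1 minus the chance that all fail together."""
    return 1 - prod(1 - a for a in availabilities)


# Assumed, illustrative figures only (fraction of time each thing is up).
system = 0.99     # one complete system, in its own building
disk = 0.99       # a single disk
building = 0.995  # power, water, cooling: NOT duplicated

# System Level: two complete, independent systems.
system_level = parallel(system, system)

# Component Level: three redundant disks, but only one building.
component_level = series(parallel(disk, disk, disk), building)

print(f"system level:    {system_level:.6f}")
print(f"component level: {component_level:.6f}")
```

Under these assumptions, the component-level setup can never be better than its one non-redundant part (the building), no matter how many disks you add.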

Worse, if you choose Component Level Redundancy over and over, you end up with a mind-bogglingly complex infrastructure which punishes attempts to change it and works against attempts to debug it. Is the problem in the virtual machine, or the actual machine, or the network, or the SAN, or the app? Who knows? Let the Blame Roulette begin!

So in trying to be super-reliable, frugal and conservative, medical data centers often end up being flaky, expensive and conservative. At least they're conservative.

But I find that it often isn't their fault: senior management has taken charge of the ballooning tech budget and is trying to make simple buying choices that result in complicated situations. And complexity rarely leads to clarity or performance or long-term reliability. Just ask Google or Amazon.

UPDATE Dec-30-2013
This article by a DevOps guy gives some interesting perspective on the same phenomenon:
