A cost of consolidation?

Despite the push to consolidate data centers in Washington, D.C., and in state capitals nationwide, some would-be centralizers might want to recalculate the risks involved after the recent meltdown of state services in Virginia.

The failure of a single storage system Aug. 25 at a data center near Richmond took down 485 of the state’s 4,800 data servers, knocking out services at three state agencies for more than a week and affecting operations at two dozen others.

State Chief Information Officer Sam Nixon told the Washington Post that the crash was caused by the dual failure of a pair of redundant, 3-year-old memory cards, one of which was supposed to back up the other.

"The thing that is never supposed to happen happened," Nixon said.

Virginia officials said EMC, the company that designed and supplied the storage system, told them that a failure of this kind had never occurred in 1 billion hours of system use.
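
Reliability claims on that scale generally rest on the assumption that the two redundant components fail independently. The back-of-the-envelope sketch below, using entirely made-up numbers rather than EMC's actual figures, shows both how such a claim is computed and why a common-mode fault undermines it:

# Back-of-the-envelope reliability math for a redundant pair.
# All rates below are illustrative assumptions, not EMC's figures.

CARD_FAILURE_RATE = 1e-5      # assumed failures per hour for one memory card
REPAIR_WINDOW_HOURS = 24      # assumed time to detect and replace a failed card

# Under the independence assumption, the pair goes down only if the
# second card fails while the first is still awaiting replacement.
p_first_card_down = CARD_FAILURE_RATE * REPAIR_WINDOW_HOURS
pair_failure_rate = p_first_card_down * CARD_FAILURE_RATE   # per hour

print(f"Pair failure rate (independent): {pair_failure_rate:.1e} per hour")
print(f"Mean time between pair failures: {1 / pair_failure_rate:.1e} hours")

# A common-mode fault (a shared firmware bug, a power event, or two
# cards of the same age wearing out together) adds a term that the
# independence math ignores and that can dominate the result.
COMMON_MODE_RATE = 1e-7       # assumed rate of faults that take out both cards
realistic_rate = pair_failure_rate + COMMON_MODE_RATE
print(f"Pair failure rate with common-mode term: {realistic_rate:.1e} per hour")

With these assumed numbers, the independence math predicts roughly one pair failure every 400 million hours, in the neighborhood of the figure Virginia was quoted, while even a small common-mode term makes a dual failure dozens of times more likely.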

The debacle turns a spotlight on the continuity-of-operations risks associated with consolidating multiple IT operations on fewer, larger data processing and storage systems, as Virginia has done through a $2.4 billion contract it awarded to Northrop Grumman in 2003 and renegotiated this past spring.

The incident also calls into question the design and reliability of supposedly fault-tolerant systems once real-world failures put them to the test. Information Age reported that a similar situation occurred earlier this year, when e-mail hosting provider Intermedia lost service to many of its customers after a problem on its EMC storage-area network.

Intermedia officials said a backup storage controller took over when the primary one failed because of a system bug, but the backup device had insufficient capacity to shoulder the entire workload. The company said it has taken corrective action to ensure that there is enough spare capacity on the storage-area network to continue operation in case of future failures.
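
Intermedia's account illustrates a failure mode that is straightforward to check for in advance: a standby can take over the primary's role without being able to carry the primary's load. Below is a minimal sketch of that kind of N+1 capacity check in Python; the names, data structures, and numbers are hypothetical, not Intermedia's or EMC's tooling:

from dataclasses import dataclass

@dataclass
class Controller:
    name: str
    capacity_iops: int   # throughput the controller can sustain
    load_iops: int       # throughput it currently serves

def survives_single_failure(controllers: list[Controller]) -> bool:
    """Return True if the loss of any one controller leaves enough
    capacity on the survivors to carry the total workload."""
    total_load = sum(c.load_iops for c in controllers)
    for failed in controllers:
        survivors = [c for c in controllers if c is not failed]
        spare = sum(c.capacity_iops for c in survivors)
        if spare < total_load:
            print(f"FAIL: losing {failed.name} leaves {spare} IOPS of "
                  f"capacity against {total_load} IOPS of load")
            return False
    return True

# An undersized backup, roughly the situation Intermedia described:
pair = [
    Controller("primary", capacity_iops=100_000, load_iops=90_000),
    Controller("backup",  capacity_iops=60_000,  load_iops=0),
]
print(survives_single_failure(pair))   # False: the backup cannot carry 90,000 IOPS

Run routinely as workloads grow, a check along these lines provides roughly the guarantee Intermedia said it now enforces: enough spare capacity on the network that any single failure leaves the survivors able to carry the full load.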

Virginia CIO Nixon told local TV station NBC12 that he remains committed to centralized IT services despite the recent snafu. But the news will likely have government IT officials elsewhere taking a second look at their system designs and procedures to make sure they can continue operations in case of unexpected problems.

Reader comments

Thu, Sep 23, 2010 Allen

The story is a bit more interesting, as four days of data were lost. Lesson 1: expect the unexpected, be it the Titanic or computer equipment. Lesson 2: when things break, ensure all the pieces can be put back together. Here an official account of the Va. incident is needed. Lesson 3: use your backup system from time to time. Our auditor was amazed, yes amazed, that we use last night's backup to refresh our test system every day. Hence we know - know, not think - that the restore works. May all this give voice to good people crying "test and train now." If money is an issue, they lack understanding of the problem - IMO. Kind regards.
