How to rebound from a large-scale IT failure

As HealthCare.gov gears up again for open enrollment, there are lessons to be learned for future system roll-outs.


As the newly revamped HealthCare.gov site prepares for this year’s open enrollment season, it also serves as a prime example of how to recover from a major, high-visibility failure. Another case study came earlier this year, when Congress sharply criticized the U.S. Air Force for its failed $1 billion ERP and logistics management system, and warned that the Defense Department could face similar problems related to financial management.

While it may be inevitable that IT projects do occasionally fail, it seems that the pace of failure is increasing as projects grow more complex -- and the public tolerance for failure is decreasing rapidly.

Having worked directly on a maintenance and repair solution for the Air Force, as part of a course correction from its failed ERP program, I’d like to offer the following considerations -- not only for keeping a project on track, but also for rebounding when problems arise.

Revisit requirements. Maybe it was the system implementer, maybe it was bad development practices, but at the heart of most major failures are poor requirements. The quality of requirements does not equal the quantity. In fact, the two are often inversely related, as too many requirements are difficult to design, manage and test. On a major Air Force ERP implementation that failed, there were over 1,000 requirements spread across four separate releases related to maintenance, repair and overhaul (MRO) activities. On a follow-on project picking up the pieces of the Air Force’s MRO transformation, there were roughly 80 requirements targeted at specific capabilities. This was a reduction of roughly 92 percent, and it resulted in specific, demonstrable COTS functionality that was accepted by a core set of end users. When revisiting requirements, focus on what users need versus what they want, as most users are resistant to change and simply want what they have today in a modernized package.

Don’t try to solve world hunger; focus on feeding a village. The Air Force realized major transformation of the entire supply chain was too big, and divided the effort into several smaller initiatives. This approach reduces risk for each of the smaller initiatives, but generates a larger strategic risk if the initiatives explore different solutions. An organization may end up with a combination of multiple COTS and custom solutions, requiring additional interfaces and increasing sustainment workload. Solve for each initiative, but don’t lose focus on strategic integration, as multiple stove-piped solutions can be costly to integrate in the long term.

Focus testing efforts on negative testing, performance testing and security. While it is important to test the use cases created to demonstrate required capabilities, they probably tested well in the initial implementation. The problems arose when moving into production where end users ‘experimented’ with functionality, overloaded the hardware/network, and/or compromised security measures. Employ the following strategies to make the most of your testing activities:

  • Try to break the system through negative testing. End users are creative (and initially ignorant), and will manage to (sometimes intentionally) exercise functionality not planned for in test scripts. Go off the script and find out where the bugs live.
  • Plan for more than the peak user load. A colleague worked as an independent ERP consultant specializing in warehouse management systems. He was called in to troubleshoot an ERP issue post-production. After talking to several users, he realized the issue was that everyone in the warehouse was running the same report every day at the same time, bringing the system to its knees. Go above and beyond the documented performance requirements to ensure additional demand can be handled during periods like open enrollment.
  • Security testing should not be an afterthought. In all industries, hackers, competitors and foreign entities are trying to steal our data and intellectual capital. Employ the latest security measures and use robust security testing to ensure there are no open vulnerabilities at ‘go-live.’
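
To make the first strategy concrete, here is a minimal sketch of what going off the script can look like in Python. Everything in it is an assumption made for illustration: the parse_household_size function, its 1-to-20 limit and the list of bad inputs are hypothetical and do not come from HealthCare.gov or any Air Force system. The idea is simply to feed a form handler the kinds of values no scripted use case would include, and confirm each one is rejected cleanly rather than accepted or allowed to crash the system.

```python
# Hypothetical input handler, invented for illustration only.
def parse_household_size(raw):
    """Return the household size as an int from 1 to 20, or raise ValueError."""
    if not isinstance(raw, str):
        raise ValueError("expected text input")
    text = raw.strip()
    if not text.isdigit():
        raise ValueError("not a whole number")
    value = int(text)
    if not 1 <= value <= 20:  # assumed business limit
        raise ValueError("out of range")
    return value


# Inputs a scripted "happy path" test would never try.
BAD_INPUTS = [
    "", "   ", "-1", "0", "21", "3.5", "two",
    "1; DROP TABLE users", None, "9" * 50,
]


def run_negative_tests():
    """Return the bad inputs that were wrongly accepted; an empty list is a pass."""
    accepted = []
    for raw in BAD_INPUTS:
        try:
            parse_household_size(raw)
        except ValueError:
            continue  # rejected cleanly, which is the desired outcome
        accepted.append(raw)
    return accepted
```

Calling run_negative_tests() returns whatever slipped through. In practice the list of bad inputs should keep growing as testers and early users discover new ways to misuse a field.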

Employ a crawl, walk, run approach to your release strategy. We are all under pressure to generate ROI from our IT investments. This may lead to overpromising ROI by trying to field too much capability too early. Develop an implementation plan that delivers core functionality first, with minimal interfaces and, in COTS implementations, minimal extensions. An incremental approach also minimizes the learning curve for end users by getting them familiar with simpler operations prior to implementing more complex functionality. While ROI for the first increment may be limited, it will significantly reduce implementation risk while building the core for the full set of requirements. Parallel development activities during the first increment can expedite future capability and shorten the time to ROI realization.