Operational resilience: Bringing order to a world of uncertainty
There's seemingly no end to the list of risks that can threaten IT operations. If security breaches by outside hackers or careless insiders don't corrupt sensitive information, then a man-made or natural disaster can hobble operations until critical processes are restored via far-flung backup facilities. Add regional power outages and flu pandemics to the mix, and it's no wonder that many federal CIOs feel overwhelmed by their uptime responsibilities.
But now, some IT managers are taking a new approach. Rather than striving for excellence separately in key areas such as security, business continuity, disaster recovery and IT service management, they're applying best practices designed to orchestrate those disciplines into a coordinated, smoothly functioning whole known as operational resilience.
The anticipated payoff of this centralized, enterprisewide effort is clear: Agency managers hope to better understand how best to keep the organization running even in the aftermath of an unplanned disruption.
"Operational resilience helps you prioritize where you can have the most effect in terms of your investments, as well as understanding the change management and people skills required to keep agencies running effectively when a security threat or disaster strikes," said Gregory Crabb, inspector in charge of revenue, product and global security at the U.S. Postal Inspection Service (USPIS).
USPIS adopted a resiliency approach four years ago, and Crabb said it has helped the organization pinpoint resource requirements for IT security and other areas. And revamped internal processes designed with resiliency in mind are helping USPIS identify mail fraud more quickly.
"We've improved our ability to detect fraud when it's occurring against our dot-com applications, and as a result, we're able to reduce the consequences by literally millions of dollars," Crabb said.
Operational resilience describes an organization's ability to protect its critical assets and keep essential processes and services operating during a security threat or other disruption, said Rich Caralli, technical manager of the resilient enterprise management team at the CERT Program, the Internet security program at Carnegie Mellon University's Software Engineering Institute.
"Don't think about security processes or business continuity processes, think about a resilience process that traverses all layers and functions of the organization," he said. "Everybody's job should be to make sure the overall mission of the organization is being achieved. Sometimes you do that by preventing disruptions, [and] sometimes you do it by making sure that there isn't damage when a disruption happens."
Earlier this year, White House officials acknowledged the importance of resilience to the country's critical infrastructure when they issued Presidential Policy Directive 21, which calls for stronger security and resilience against physical and cyber threats.
IT managers at federal agencies are also under pressure to maintain resilient operations, but many worry that their organizations are not up to the task. Only 8 percent of federal IT professionals are completely confident that their agency could recover 100 percent of the data required by service-level agreements in the aftermath of a natural or man-made incident, according to research published by the online government IT community MeriTalk.
In another study, 77 percent of IT and line-of-business professionals cited decentralized risk management as one reason why resiliency is a challenge, IDC Vice President David Tapper reported in the recent white paper "Lack of Operational Resilience Will Undermine Enterprise Competitiveness: A Strategy for Availability."
The answer, Tapper wrote, is a strategy for ensuring operational resilience that in part uses key performance indicators to help create a holistic, enterprisewide governance structure.
The potential payoffs
Veterans of operational resilience say a comprehensive plan offers a variety of benefits. "It's a way to help keep the IT environment running and available," said Greg Schulz, senior adviser at Server and StorageIO Group, a consulting firm that specializes in IT infrastructure technology.
A highly coordinated plan for security and uptime could also help agencies reduce the unnecessary costs and complexities that arise when "one group works from a security standpoint, another part of the organization focuses on business continuity and disaster recovery, and others perform backups and archiving," Schulz said.
A successful resiliency strategy can also spotlight policy gaps before they become a problem, Crabb said. For example, an agency might assume its operations have become safer after it installs the latest — and most expensive — security technologies to keep hackers from breaking into the internal network. "But if you don't educate your employees to not open email attachments from unknown sources, you're not going to move the ball forward relative to protecting the organization," he said.
Some resiliency models identify processes for implementing a complete mix of policies and procedures for security, uptime and related areas, he added.
Finally, the enterprisewide approach can help agencies make decisions about how to effectively allocate resources. "It may be possible to prevent all malware from being introduced into the network, but is it economically feasible to do that with the limited resources?" Crabb said. "Those value decisions can be difficult to make."
Nevertheless, implementing an operational resilience strategy can be challenging because it requires the coordination of many complex disciplines. "It's a matrix of practices and goals that you need to properly scope out and focus on to achieve the most improvement," Crabb said.
Turf wars can be another stumbling block. "Individual groups may fear that they'll lose their relevance" if they're brought together under a central strategy, Schulz said. "As with so many large IT projects, the barrier often is not the technology, it's the people, processes and politics of the organization."
For help, some agencies are turning to formal models to guide their resiliency efforts. USPIS uses the CERT Resilience Management Model (RMM), which agencies can download for free at CERT.org/resilience/rmm.html.
CERT says the process improvement model was designed to help converge activities around operational risk and resilience management. The model addresses more than 20 process areas, grouped into categories such as enterprise management, engineering, operations management and process management, and it includes metrics to gauge the performance of those processes from an operational resilience perspective.
Crabb said RMM is now being applied throughout his agency, particularly to improve malware protection and network performance in IT operations.
"The RMM helps us define the processes by which we conduct incident responses for security incidents, including how we interact with the other business units and the [chief information security officer's] office for the recovery of evidence and continuity of operations," Crabb said. "It can be used as a frame of reference to assure that our resilience improvement management is complete."
In the end, even a cohesive resiliency strategy won't bring order to a risky landscape, but it could make the work of security and uptime specialists easier.
"When you're working under uncertainty, you can't know everything," Caralli said. But "you can at least be prepared to create strategies around important assets that protect them from harm or keep them viable under degraded conditions."