The scoop on COOP
Backup and recovery software plays key role in continuity of operations
Continuity of Operations, or COOP, is a fundamental part of the planning of many government agencies, especially within the federal government. And since the September 11 attacks four years ago, it’s an area of IT management that’s gotten an ever-increasing amount of attention—both in government and in the private sector, where “business continuity” has become the mantra of many financial and IT managers.
While building a COOP plan is an agencywide effort, implementing it is increasingly becoming the job of the agency CIO and IT department. With the vast majority of federal operations now depending in some way on IT and telecommunications infrastructure, making sure the government can continue to do business in the face of disaster—or any other event that might otherwise interrupt business as usual—falls on the IT staff.
“Recovery isn’t just for disasters and viruses,” said Jeff Beallor, president of Global Data Vaulting Inc., a continuity service provider in Toronto. “It’s also about making policies that are for more of a business risk issue—like employee problems, and tampering with data.”
While technology has made it easier to shift operations to other locations in times of crisis, it hasn’t made establishing COOP any simpler. And for organizations that don’t have the internal expertise associated with running high-availability networks and data centers, it’s that much harder.
“Unfortunately, the recent attacks in London just sort of reinforce the need for this,” said Steve Higgins, director of business continuity and security for EMC Corp. of McLean, Va. “What we’re seeing is that in most cases, and maybe more on the government side, organizations are reaching out and looking for firms that bring a comprehensive plan for business continuity.”
In 2001, the bond-trading company Cantor Fitzgerald lost its primary data center (and many of its employees) in the New York World Trade Center. But the company was able to bring all its network-based services to customers back online within 47 hours—in time for the resumption of bond trading on Wall Street—because the company was using replication technology over an OC-3 fiber-optic network to a secondary data center in New Jersey.
Technology now exists that can reduce the downtime of a data center to a matter of seconds—instead of the hours or days that many organizations faced a few years ago.
Using hardware-based replication systems, the information within data center applications—that is, the data actually being processed by the application, as well as data being written to databases and storage—can now be synchronously (simultaneously) replicated to a system at a site within 20 miles on a high-bandwidth network connection.
For replication to sites farther away, or connected by a lower-bandwidth or high-latency network, hardware and software replication systems can perform what’s called “asynchronous” replication, caching changes and pushing them across to a system at a remote data center.
Keeping a complete copy
While asynchronous replication leaves room for some data loss in the event of a disaster, it allows organizations to minimize the risk of total data loss by keeping a relatively complete copy of the data far enough away that a regional disaster—like a hurricane or earthquake—won’t take an agency’s applications completely offline.
Synchronous replication is better for handling more localized failures and can bring operations back online almost instantly without any loss of data. The two methods complement each other, but until recently, IT managers had to make a choice between them.
Newer hardware from EMC and other vendors, however, lets a single application replicate its data both ways at once.
“The technology is allowing you to do two writes from a single production location—a synchronous write and an async write—at the same time,” said Higgins. “This gives CIOs more flexibility in building a strategy than they had before.”
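To make the difference concrete, here is a minimal Python sketch of one production site writing to two copies at once. It is an illustration only, not how EMC or any other vendor implements replication; the class names `Primary` and `Replica` and the `flush` method are hypothetical. The nearby copy is updated as part of every write, while changes bound for the distant copy are cached and shipped later.

```python
from collections import deque


class Replica:
    """Stand-in for a storage system at a secondary site (hypothetical)."""

    def __init__(self):
        self.blocks = {}

    def apply(self, key, value):
        self.blocks[key] = value


class Primary:
    """Production site that writes to a nearby and a distant replica."""

    def __init__(self, sync_replica, async_replica):
        self.blocks = {}
        self.sync_replica = sync_replica
        self.async_replica = async_replica
        self.pending = deque()  # changes cached for the distant site

    def write(self, key, value):
        self.blocks[key] = value
        # Synchronous: the nearby copy is updated before the write completes.
        self.sync_replica.apply(key, value)
        # Asynchronous: cache the change and push it across later.
        self.pending.append((key, value))

    def flush(self, max_changes=None):
        """Push cached changes to the distant site; return how many were sent."""
        sent = 0
        while self.pending and (max_changes is None or sent < max_changes):
            self.async_replica.apply(*self.pending.popleft())
            sent += 1
        return sent


near, far = Replica(), Replica()
primary = Primary(near, far)
primary.write("a", 1)
primary.write("b", 2)
primary.flush()
primary.write("c", 3)  # assume the primary fails before the next flush

# Compare what each replica holds against what the primary had written.
lost_near = set(primary.blocks) - set(near.blocks)
lost_far = set(primary.blocks) - set(far.blocks)
print(lost_near, lost_far)
```

In this run the nearby copy is missing nothing, while the distant copy lacks the one write that was still cached when the primary went down. That gap is the data-loss window that asynchronous replication accepts in exchange for distance.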
Unfortunately, the cost of hardware replication puts it out of the reach of many smaller federal agencies, and an even larger percentage of state and municipal governments. And even if every agency could afford multiple data centers with high-bandwidth networks connecting them, replication alone wouldn’t solve all the possible interruptions of IT services.
While mirroring data across multiple sites can protect against a big problem—like a disaster at a data center—it can’t deal with the more common causes of downtime such as software failures, virus and worm attacks and corruption of databases.
“One of the problems with hardware-based replication by itself [for COOP] is that if there’s a logical problem at one site, it gets replicated on the other,” said Kausik Desgupta, senior manager for backup products at BMC Software Inc. of Houston. The result is “rapid replication of rubbish”—corrupted data is spread across multiple sites quickly, and the mirrored systems can’t be used to restore the data effectively.
Fortunately, most organizations can overcome the COOP gap with a blend of software, hardware and services—once they figure out what data really needs to be instantly available and what can be archived in a more traditional way.
The key to a successful COOP infrastructure—without overspending—is analyzing exactly how much protection each type of data needs in order to ensure operations can resume efficiently.
“You have to be able to classify the information and application in terms of how you want to protect it,” said EMC’s Higgins.
“It’s a matter of right-sizing the recovery” for each type of failure, said BMC’s Desgupta. A system failure is “not usually a full site outage; the scope of recovery for most failures doesn’t allow a move from Site A to Site B,” he said.
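One way to picture that classification exercise is a simple lookup from each type of data to the least expensive protection that still meets its needs. The Python sketch below is an assumption-laden illustration: the two tolerance fields, the hour thresholds, the tier names and the sample inventory are all hypothetical and would be set by each agency’s own analysis.

```python
from dataclasses import dataclass


@dataclass
class DataClass:
    name: str
    max_downtime_hours: float   # how long operations can run without it
    max_data_loss_hours: float  # how much recent work could be re-entered


def choose_protection(dc: DataClass) -> str:
    """Pick the cheapest tier that meets both tolerances (thresholds are assumed)."""
    if dc.max_data_loss_hours == 0:
        return "synchronous replication"
    if dc.max_downtime_hours <= 4:
        return "asynchronous replication"
    if dc.max_downtime_hours <= 24:
        return "disk backup with snapshots"
    return "offsite tape or archival storage"


# Hypothetical inventory for illustration only.
inventory = [
    DataClass("payment transactions", 0.1, 0),
    DataClass("case management database", 2, 1),
    DataClass("office documents", 48, 24),
]

plan = {dc.name: choose_protection(dc) for dc in inventory}
for name, tier in plan.items():
    print(f"{name}: {tier}")
```

The point of the exercise is the ordering of the checks: the most expensive protection is reserved for data that cannot tolerate any loss, and everything else falls through to cheaper options.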
For example, corruption of a database or application data can more easily be handled onsite with a database backup and logging tool—especially if the problem is caused by bad software, user error or a malicious user. Tools like BMC’s BackTrack software can take advantage of the high-availability features of mainframe and distributed database systems to make rolling back a database to a previous state almost a trivial matter—and help find the cause of the problem as well.
For mainframe databases such as IMS, DB2 and VSAM, BMC’s tools take advantage of “snapshot” technology that grabs the state of the database at any moment.
“You can use the processor cache or storage for a restore,” said Rick Weaver, BMC’s product manager for database and mainframe backup. The tools also take advantage of new technologies in storage systems for snapshots, he said. EMC, IBM and Hitachi all have announced a data-set-level snapshot capability that provides an instant snapshot of the data, taking just a few seconds to restore any data set, he added.
Similar capabilities are available for distributed databases through BMC’s BackTrack tools for Oracle, Sybase and other database platforms.
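The general idea behind snapshot-plus-log recovery can be sketched in a few lines of Python. This is a toy model under stated assumptions, not a description of how BMC’s products or any database engine work internally; the `Table` class and its methods are hypothetical. A snapshot captures the state at one moment, a log records each later change, and a restore replays the log only up to the last change known to be good.

```python
import copy


class Table:
    """Toy key-value table with a snapshot and a change log (hypothetical)."""

    def __init__(self):
        self.rows = {}
        self.snapshot = {}
        self.log = []  # (sequence, key, value) entries recorded since the snapshot
        self.seq = 0

    def take_snapshot(self):
        """Capture the current state and start a fresh log."""
        self.snapshot = copy.deepcopy(self.rows)
        self.log.clear()

    def put(self, key, value):
        """Apply a change, log it, and return its sequence number."""
        self.seq += 1
        self.rows[key] = value
        self.log.append((self.seq, key, value))
        return self.seq

    def restore_to(self, last_good_seq):
        """Rebuild state from the snapshot plus logged changes up to last_good_seq."""
        rows = copy.deepcopy(self.snapshot)
        for seq, key, value in self.log:
            if seq > last_good_seq:
                break
            rows[key] = value
        self.rows = rows
        # Discard the log entries that were rolled back.
        self.log = [entry for entry in self.log if entry[0] <= last_good_seq]


t = Table()
t.put("acct-1", 100)
t.take_snapshot()
good = t.put("acct-2", 250)
t.put("acct-1", -999999)  # bad update from faulty software or user error
t.restore_to(good)
print(t.rows)
```

Because the log records which change did the damage, the same record that drives the rollback also helps trace the cause of the problem.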
Then there’s the issue of such data as word processing and other office automation files. While the data in these files can be mission-critical, they can usually be backed up to and restored from offsite disk storage, tape libraries or CD-based archival storage systems. The real problem arrives when you’ve got to restore all that data—first, you need to have a backup server running, and then the backup server needs to know what to restore and where to put it.
Fortunately, some backup management tools now provide disaster recovery capabilities that help get over these hurdles. IBM Corp.’s Tivoli Storage Manager, for example, helps create a recovery plan for data as it backs it up. “It pulls information about what kind of storage you need on the recovery site,” said Trisa Jiang, IBM’s technical evangelist for Tivoli Storage Solutions. “It also provides scripts to get things up and running. The scripts can be updated nightly; administrators can generate a DR plan at the same time they’re taking [the data] offsite.”
Having a disaster recovery plan integrated into the backup management system is especially important because, as Jiang points out, if there’s a disaster, “the local people who deal with this on a daily basis will be more concerned with safety. This way, an offsite person can take the DR plan and get backups off and running.”
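As a rough illustration of a recovery plan generated alongside the backup, the Python sketch below turns a backup catalog into a summary of the storage a recovery site needs and an ordered list of restore steps. It does not use or represent Tivoli Storage Manager’s actual interfaces; the `BackupEntry` fields, the `build_recovery_plan` function and the sample catalog are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BackupEntry:
    system: str
    size_gb: int
    restore_path: str
    priority: int  # lower number means restore first


def build_recovery_plan(catalog):
    """Summarize what the recovery site needs and the order to restore in."""
    ordered = sorted(catalog, key=lambda e: e.priority)
    steps = [
        f"{i}. restore {e.system} ({e.size_gb} GB) to {e.restore_path}"
        for i, e in enumerate(ordered, start=1)
    ]
    return {
        "storage_needed_gb": sum(e.size_gb for e in catalog),
        "steps": steps,
    }


# Hypothetical catalog, as it might look after a nightly backup run.
catalog = [
    BackupEntry("mail server", 200, "/restore/mail", 2),
    BackupEntry("benefits database", 400, "/restore/db", 1),
    BackupEntry("file shares", 100, "/restore/files", 3),
]

dr_plan = build_recovery_plan(catalog)
print(f"Storage needed: {dr_plan['storage_needed_gb']} GB")
for step in dr_plan["steps"]:
    print(step)
```

Regenerating the plan every time the catalog changes means the instructions that travel offsite always match the backups that travel with them, so someone unfamiliar with the local systems can follow them.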
But not every organization can afford to dedicate the software, hardware and personnel required to build an effective COOP solution.
For smaller agencies, and state and local agencies that may not have the resources to have “hot” backup centers, the best alternative may be a continuity service provider.
S. Michael Gallagher is an independent technology consultant based in Baltimore.