No quit in these systems

Need affordable high-availability computing? You’ve got it.

High-availability systems have always been part of the computing mix, but until fairly recently, they commanded a steep price and options were limited. Now the same inexpensive yet powerful technology that’s driving much of the sea change in enterprise computing is also bringing high availability to the masses.

PC servers with redundant storage running either Microsoft Windows or Linux — systems that cost just a few thousand dollars but can support hundreds of users and hundreds of gigabytes of data — can now be designed to stay up and running for all but a few hours a year.
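
Vendors usually express that kind of uptime as an availability percentage, and the percentages translate directly into allowable downtime per year. The short calculation below is a back-of-the-envelope illustration, not tied to any particular product: "a few hours a year" corresponds roughly to 99.9 percent availability, while the 99.999 percent level discussed later allows only about five minutes.

```python
# Convert an availability percentage into allowable downtime per year.
# Illustrative arithmetic only; figures are not tied to any specific product.

HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years

def annual_downtime_hours(availability_pct: float) -> float:
    """Return the hours per year a system may be down at a given availability."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)

for pct in (99.0, 99.9, 99.99, 99.999):
    hours = annual_downtime_hours(pct)
    print(f"{pct:7.3f}% availability -> {hours:6.2f} hours ({hours * 60:7.1f} minutes) of downtime per year")

# Expected output:
#  99.000% availability ->  87.60 hours ( 5256.0 minutes) of downtime per year
#  99.900% availability ->   8.76 hours (  525.6 minutes) of downtime per year
#  99.990% availability ->   0.88 hours (   52.6 minutes) of downtime per year
#  99.999% availability ->   0.09 hours (    5.3 minutes) of downtime per year
```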

True continuous, fault-tolerant computing — where no transactions are lost, performance never degrades and users are shielded from system failures — will likely always be expensive. But high availability that falls just short of that pinnacle is a choice now within the reach of most information technology shops.

“It has become an expected feature for business and mission-critical systems, a checkbox item,” said Robert Desautels, president of Harvard Research Group, a market research and consulting company. “Then it becomes a business decision about what level of availability is needed and what kind of outage can be absorbed.”

The Federal Aviation Administration was faced with making such a decision when the aging mainframe system that supports the National Airspace Data Interchange Network (NADIN) failed catastrophically in 2000, causing hundreds of flight delays around the country and millions of dollars in lost revenue for the industry.

No aircraft can enter, leave or fly within the National Airspace System without first filing a flight plan that is accessible to all flight controllers through NADIN. It’s the very definition of a system that needs to run around the clock with no downtime. FAA officials are in the process of replacing the old Philips DS714 mainframes that drive the NADIN message switching network with two Stratus Technologies ftServer 6600 servers, which use Intel Xeon processors.

The new systems deliver on a number of requirements, said Andy Isaksen, an FAA computer scientist and NADIN program manager. They provide for a minimum of 99.999 percent availability, they cost substantially less than the old systems, and they come with Microsoft’s Windows operating system, which has been bolstered with Stratus’ failsafe system software.

“We are all Windows on the programming side, so we had a major investment there we had to support,” Isaksen said.

The FAA is spending $6.4 million to field the new system, including the cost of the Stratus hardware and a 10-year maintenance agreement. That’s about one-tenth of what it would have cost to keep the old mainframe system up and running for the next decade, Isaksen said. He said he hopes the new system will be operational by November.

Working side by side

As late as 1998, Stratus’ systems were based on proprietary technology. Fault tolerance was achieved by a combination of specially engineered hardware and software. But when organizations began designing entire businesses around IT, the high costs of proprietary systems quickly pushed them to the industry-standard Intel processor for many important applications. Stratus had to follow, said Denny Lane, the company’s director of product marketing.

Stratus' new line of systems is essentially two computers that work side by side: one functions as the operational system and the other as the backup. The two machines process the same transactions simultaneously. If the operational system breaks down, the redundant system takes over with no interruption in processing.

The systems do not rely entirely on off-the-shelf components, however. Stratus uses specially designed chips to detect errors and faults in the system and ensure that the two computers are synchronized.

Company engineers also developed management and diagnostic software, and a system architecture that allows the servers to monitor their own condition so faults can be isolated and repaired before they can do any damage.
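
Conceptually, that kind of fault detection amounts to running the same work on both halves of the pair and checking that the results agree. The sketch below is only a simplified software illustration of the idea; Stratus does this with specially designed chips, and the replica and transaction names here are hypothetical.

```python
# Conceptual sketch of duplicate execution with output comparison.
# A simplified software illustration of a redundant pair, not Stratus'
# hardware implementation. Names and data are hypothetical.

class ReplicaFault(Exception):
    """Raised when the two replicas disagree on a result."""

def process_on_pair(transaction, primary, secondary):
    """Run the same transaction on both replicas and cross-check the results."""
    try:
        result_a = primary(transaction)
    except Exception:
        # Primary failed: continue on the surviving replica and flag it for repair.
        return secondary(transaction), "primary-faulted"

    try:
        result_b = secondary(transaction)
    except Exception:
        return result_a, "secondary-faulted"

    if result_a != result_b:
        # Divergence means one replica is unreliable; real systems apply
        # further checks to decide which copy to trust.
        raise ReplicaFault("replicas disagree on transaction result")

    return result_a, "ok"

# Hypothetical usage: both replicas apply the same deterministic function.
def apply_credit(txn):
    return txn["balance"] + txn["amount"]

result, status = process_on_pair({"balance": 100, "amount": 25},
                                 apply_credit, apply_credit)
print(result, status)  # 125 ok
```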

But the price difference for the new systems vs. Stratus’ original proprietary product line, which the company still offers, is dramatic. For example, the entry-level model in the company’s continuous availability Continuum line starts at around $500,000. In comparison, a model 2300 in Stratus’ new Windows-based ftServer W Series starts at about $10,000.

Another player in the high-availability market, Marathon Technologies, offers a product that works in a similar way, though solely through software. The company’s Marathon FTvirtual Server takes two standard PC servers connected by Gigabit Ethernet links and, using any mix of storage systems, combines them into a configuration that operates as a single system.

If the operational server or any component fails, the system hands control of transaction processing to the other server. The change is transparent, and as far as the end user is concerned, nothing has changed. The software then alerts the IT staff to the location of the failure so it can be repaired.
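
One rough way to picture that kind of transparent handoff is a standby that listens for heartbeats from the operational server and takes over when they stop. The sketch below is a generic illustration of the pattern; the timing values and the promote and notify functions are assumptions, not Marathon's actual software.

```python
import time

# Generic heartbeat-based failover sketch. Interval, miss limit and the
# promote/notify_staff actions are illustrative assumptions.

HEARTBEAT_INTERVAL = 1.0   # seconds between expected heartbeats
MISSED_LIMIT = 3           # heartbeats missed before declaring a failure

def monitor(last_heartbeat_time, now, promote, notify_staff):
    """Promote the standby if the operational server has gone quiet."""
    missed = (now - last_heartbeat_time) / HEARTBEAT_INTERVAL
    if missed >= MISSED_LIMIT:
        promote()  # standby takes over transaction processing
        notify_staff("operational server unresponsive; standby promoted")
        return True
    return False

# Hypothetical usage with stub actions:
failed_over = monitor(
    last_heartbeat_time=time.time() - 5.0,   # last heartbeat seen 5 seconds ago
    now=time.time(),
    promote=lambda: print("standby now active"),
    notify_staff=lambda msg: print("ALERT:", msg),
)
```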

The cluster option

Another option for high availability comes in the form of clusters of PC servers. Such clusters are often used to harness the combined processing power of multiple machines for big or complex computing jobs.

They also offer some high-availability capabilities because some servers in the cluster can take over if others fail.

Clusters can be an attractive option for cost-conscious IT managers. That's because in an active/active clustering approach, no server sits idle doing purely redundant work; each machine handles its own tasks until it has to pick up the load of a failed peer.

That is not possible with the Stratus and Marathon models.

But using clusters for high availability is not simple. “High availability is one of the abilities of clusters, but it has to be engineered,” said Larry Kelly, senior systems administrator for RGII Technologies. The company developed a cluster system to run public Web sites for the National Oceanic and Atmospheric Administration that provide potentially life-saving information, such as early warnings about hurricanes.

Among other features, load balancing is important in clusters because servers that have to take on new responsibilities can quickly become overwhelmed. Also, administrators must develop and test cluster-aware applications and failover scripts. And they must properly configure access to storage resources so that reserve servers have quick access to the right application data.
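
What that engineering looks like varies from site to site, but one common building block is a health check that pulls a failed node out of the rotation so the surviving servers absorb its traffic. The snippet below is a minimal, hypothetical sketch of the idea; the node addresses and health endpoint are assumptions, not RGII's or NOAA's configuration.

```python
import urllib.request

# Minimal sketch of a health-checked rotation for a web cluster.
# Node addresses and the /health endpoint are hypothetical.

NODES = ["http://web1.example.com", "http://web2.example.com", "http://web3.example.com"]

def is_healthy(node: str, timeout: float = 2.0) -> bool:
    """Consider a node healthy if its health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(node + "/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def active_rotation(nodes=NODES):
    """Return only the nodes that should keep receiving traffic."""
    healthy = [n for n in nodes if is_healthy(n)]
    # Surviving nodes absorb the failed node's share of requests, which is
    # why load balancing and capacity planning matter in cluster designs.
    return healthy
```

In a real deployment that decision usually lives in a dedicated load balancer or the cluster manager itself, alongside the failover scripts and storage configuration described above.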

Stacy Morang, Internet/intranet administrator for the Maine Legislature, considered both clusters and Marathon’s product when searching for a high-availability system to support the legislature’s Microsoft Exchange messaging and bill-drafting applications. She chose the Marathon server.

“If you use clustering, you have to make the application aware of clustering for the failover part of the operation,” Morang said. “With the Marathon product, you don’t need that kind of awareness because the application doesn’t need to know what it’s running on.”

Another point to consider with clusters: Failover and recovery typically take anywhere from a few seconds to a minute or more. End users may not see any interruption in service, but they could experience some system sluggishness.

In most cases, that kind of availability is enough, Desautels said. “Most computer users don’t need millisecond recovery times,” he said. “High availability for them is where they don’t see much [performance] difference.”

For many government applications, that kind of performance is improvement enough.

Keeping systems up and running

System availability can mean different things to people depending on their business requirements. Fortunately, there are some generally accepted availability definitions that can serve as a useful yardstick for assessing product options. The following is one of the most widely used lists of Availability Environment Classifications (AEC):

  • Fault tolerant (AEC-4) — For business functions that demand continuous computing and where any failure is transparent to the user. This means no interruption of work, no transactions lost, no degradation in performance and continuous operation.
  • Fault resilient (AEC-3) — For business functions that require uninterrupted computing services, either during essential time periods or during most hours of the day and most days of the week throughout the year. This means that the user stays online. However, the current transaction may need restarting and users may experience some performance degradation.
  • High availability (AEC-2) — For business functions that tolerate only minimal interruptions in computing services, either during essential time periods or during most hours of the day and most days of the week throughout the year. This means users will be interrupted but can quickly log back on. However, they may have to rerun some transactions from journal files and may experience some performance degradation.
  • Highly reliable (AEC-1) — For business functions that can be interrupted as long as the availability of the data is ensured. For the user, work stops and an uncontrolled shutdown occurs. A backup copy of data is available on a redundant disk, and a log-based or journal file system is used for identification and recovery of incomplete transactions.
  • Conventional (AEC-0) — For business functions that can be interrupted and where data availability is not essential. For the user, work stops and an uncontrolled shutdown occurs. Data may be lost or corrupted.

    Source: Harvard Research Group

Look who wants high availability

The applications to which users want to apply high-availability technologies are expanding as such tools migrate to lower-cost Microsoft Windows and Linux systems.

“Earlier on, it was just for very critical applications with large dollar amounts attached to them, but that changed, particularly with [the Year 2000] and the dot-com era,” said Adrien Robichaud, high-availability product manager for SunGard Data Systems. “There’s a tremendous amount of [information technology]-driven automation now that’s key to the operation of organizations, and they have a significant need to keep it going.”

The notion of which applications are considered critical has also changed. Just a few years ago, electronic messaging systems were not considered mission-critical, and some downtime could be tolerated, said Linda Mentzer, vice president of marketing at Marathon Technologies. Now many organizations regard them as applications that must run constantly.

And people are starting to think beyond just keeping systems up and running, Robichaud said.

“One of the biggest problems today is database corruption, and organizations increasingly have to keep rolling the database back to where it was before the corruption occurred,” he said. “So they are looking at applying high availability to that and such things as security, rather than just to systems.”

— Brian Robinson
