A SWAT team for enterprise app woes

The Coast Guard Operations Systems Center employs a dedicated task force to troubleshoot complex problems with its own applications.

When an enterprise application doesn't perform as well as expected, finding the cause can be a daunting task. Is it the server? The database? The application itself, or some strange interaction among them? End users may blame the application, but the fault could just as easily lie in the network, with a router or a firewall. Neither system administrators nor application management personnel have the complete view of the system needed to diagnose the problem.

Over the past year, the Coast Guard Operations Systems Center (OSC) has had a dedicated task force in place to troubleshoot complex problems with its own applications. And now it has begun offering the six-person team to other Coast Guard divisions as well, according to Lt. Anthony Baird, government project officer for the Performance Recording Operational Troubleshooting Evaluation and Consulting team.

The PROTEC unit can perform baseline testing of a new application before it goes into deployment to gauge how it will run under a full operational load. The team can also tease out where the problem resides when an application already in production is not operating as expected.
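The article does not describe the team's test harness, but a baseline load test at its simplest drives concurrent requests at an application and records how response times behave. Below is a minimal sketch of that idea in Python, using only the standard library; the URL, worker count and request count are hypothetical, not PROTEC's actual figures or tooling.

    # Minimal load-test sketch (illustrative only; not the PROTEC team's tooling).
    # Fires concurrent requests at a hypothetical endpoint and reports latency stats.
    import statistics
    import time
    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlopen

    URL = "http://app.example.mil/status"   # hypothetical endpoint
    WORKERS = 20                            # simulated concurrent users
    REQUESTS = 200                          # total requests for the baseline run

    def timed_request(_):
        start = time.perf_counter()
        with urlopen(URL, timeout=10) as resp:
            resp.read()
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        latencies = list(pool.map(timed_request, range(REQUESTS)))

    print(f"median: {statistics.median(latencies):.3f}s")
    print(f"95th percentile: {statistics.quantiles(latencies, n=20)[-1]:.3f}s")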

This is difficult work, approaching that of forensics. The applications being studied can run on anywhere from a server or two up to 40, and the problems can crop up almost anywhere the Coast Guard operates, geographically speaking. The applications the team has characterized run the gamut from Web servers, database hosting, and e-mail and financial applications to specialized Coast Guard systems such as missile tracking, rescue support and applications that track ocean currents and tides.

"We can see how well an application is being utilized right here on OSC, or in Guam, Alaska, San Francisco or wherever," said Charles Asbury, a system engineer on the team. Performance is looked at, both from across the entire system, as well as evaluated from an end user's perspective.

The first step in diagnosing trouble is to understand where the problem actually resides: the network, the Web server, the application or the database. Once the problem is pinpointed, it can usually be fixed quite easily. Pinpointing it, however, can take some effort.

To get a better handle on what is taking place, the team will look at network and server usage and even watch what calls the program code itself is making. In some cases, agents must be installed on the same machine as the client software in order to watch its performance.
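As an illustration of the kind of call-level timing such an agent collects, here is a small Python sketch; the decorated function and its workload are hypothetical stand-ins, not the team's actual instrumentation.

    # Sketch of call-level timing, the kind of data an instrumentation agent gathers.
    # The decorated function and its workload are hypothetical examples.
    import functools
    import time

    def traced(func):
        """Log how long each call takes, as an agent on the client machine might."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed = time.perf_counter() - start
                print(f"{func.__name__} took {elapsed * 1000:.1f} ms")
        return wrapper

    @traced
    def fetch_case_records(case_id):          # hypothetical application call
        time.sleep(0.05)                      # stand-in for a database round trip
        return {"case_id": case_id}

    fetch_case_records(42)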

The team uses a variety of software to monitor performance. One application is Opnet ACE, from Opnet Technologies, which the team uses to capture baseline statistics on how much bandwidth the system is using. It also comes in handy for troubleshooting.
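ACE is a commercial product and its interfaces are not described here, but a bandwidth baseline in its simplest form just samples interface counters over time. The sketch below uses the third-party psutil library purely for illustration, not Opnet's tools.

    # Rough bandwidth baseline using the psutil library (not Opnet ACE; illustrative only).
    # Samples interface counters once per second and prints throughput in kilobits/s.
    import time
    import psutil

    INTERVAL = 1.0          # seconds between samples
    SAMPLES = 60            # one minute of baseline data

    prev = psutil.net_io_counters()
    for _ in range(SAMPLES):
        time.sleep(INTERVAL)
        cur = psutil.net_io_counters()
        sent_kbps = (cur.bytes_sent - prev.bytes_sent) * 8 / 1000 / INTERVAL
        recv_kbps = (cur.bytes_recv - prev.bytes_recv) * 8 / 1000 / INTERVAL
        print(f"sent {sent_kbps:8.1f} kbit/s   recv {recv_kbps:8.1f} kbit/s")
        prev = cur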

These days, more sophisticated tools are needed to understand what is going on with a network, said Russ Elsner, an associate vice president at Opnet. The company offers a range of what it calls application performance management tools to watch the traffic and filter the results.

"Plugging a single packet-capture device up to your network and looking at a simple tracefile just doesn't work any longer," Elsner said. "There are so many moving parts in modern networks."

Other tools include Opnet's IT Guru, Hewlett-Packard's LoadRunner and HP Diagnostics, and Simena's Network Emulator.

Even with such tools in place, it is still up to the team's engineers to figure out what is going on. "The tools provide us with clues where to look, but most of the knowledge is understanding the tools and where to go from what the information provides us," Asbury said.

In many cases, problems come about not because one piece of technology fails, but because of interactions in how multiple products are configured.

In one case, the team was dispatched to diagnose a server that was taking up to two minutes to respond. It was one of 40 servers spread across four different zones. The problem turned out not to be the server, though, but a firewall that cut off connections after 30 minutes, in this case between the seemingly malfunctioning server and another server it was communicating with. The team set the application to log in again every 15 minutes, which eliminated the problem, said Jason Kowalski, a system engineer on the team.
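The article does not say how the re-login was implemented; the general pattern, though, is to refresh the session on a timer comfortably shorter than the firewall's cutoff. A minimal Python sketch of that pattern, with a hypothetical login function:

    # Sketch of the general pattern: re-establish a session on a timer shorter than
    # the firewall's 30-minute cutoff. The login function here is hypothetical.
    import threading

    RELOGIN_INTERVAL = 15 * 60      # seconds; well inside the 30-minute firewall limit

    def login_to_peer():
        """Hypothetical stand-in for the application's login/handshake with the peer server."""
        print("re-authenticating with peer server")

    def keep_session_alive():
        login_to_peer()
        # Schedule the next re-login before the firewall can drop the connection.
        timer = threading.Timer(RELOGIN_INTERVAL, keep_session_alive)
        timer.daemon = True
        timer.start()

    # In a real application this would run alongside the main workload.
    keep_session_alive()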

In another case, the team confronted an application that performed adequately in development but suffered performance problems when it was moved into production. Because the application pulled in hundreds of data fields, it was difficult for the development team to determine where the bottleneck was. They guessed that the problem lay with either the servers or the video cards, which they suspected couldn't keep up with the graphical output. The culprit, however, turned out to be three lines of ill-conceived code that were slowing the application, the kind of hotspot illustrated in the sketch below.

The Air Force has also used Opnet software to help diagnose application performance problems.
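The story does not reveal what the three ill-conceived lines were, but a classic version of this mistake is querying the database once per field instead of once per record. The before-and-after below is a hypothetical illustration, not the team's actual code.

    # Hypothetical illustration only; the article does not say what the three lines were.
    # A common hotspot of this kind: one query per field instead of one batched query.

    def load_record_slow(cursor, record_id, field_names):
        record = {}
        for name in field_names:                  # hundreds of fields ...
            cursor.execute(                       # ... means hundreds of round trips
                "SELECT value FROM fields WHERE record_id = ? AND name = ?",
                (record_id, name))
            record[name] = cursor.fetchone()[0]
        return record

    def load_record_fast(cursor, record_id):
        # One round trip that returns every field for the record.
        cursor.execute(
            "SELECT name, value FROM fields WHERE record_id = ?", (record_id,))
        return dict(cursor.fetchall())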