Understanding State's big database crash: Q&A with Greg Ambrose
On July 20, the system of 12 databases that processes U.S. visa and passport requests experienced a major glitch, going offline and creating a backlog in visa processing that disrupted travel for thousands of people around the world.
It took three days to get the system, known as the Consular Consolidated Database, back online, and about two weeks to clear the backlog in visas. The system was processing its normal daily volume of visas by Aug. 2, although it would be another month before the Bureau of Consular Affairs declared the glitch in the database fixed, according to a bureau spokesperson.
The CCD holds 165 million cases and is growing at an approximate rate of 42,000 per day. The CCD's purpose is to give the Bureau of Consular Affairs an aggregated and updated view of consular transactions around the world. The centerpiece of the system is an Oracle 10G database that the bureau is in the process of upgrading. At the time of the crash, the CCD ran on the Microsoft Windows Server 2003 operating system. As an interim step, it upgraded to Server 2008, and is now transitioning to the Linux operating system.
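To put those figures in perspective, the short Python sketch below works through the arithmetic. It uses only the two numbers quoted above (165 million cases, roughly 42,000 new cases per day); the assumption that the daily rate stays constant for a full year is ours, not the bureau's, and the function name is illustrative.

```python
# Figures quoted in the article.
TOTAL_CASES = 165_000_000
NEW_CASES_PER_DAY = 42_000


def projected_cases(days: int) -> int:
    """Linear projection of the case count after `days` days.

    Assumes the daily intake stays constant, which is a simplification.
    """
    return TOTAL_CASES + NEW_CASES_PER_DAY * days


cases_added_per_year = NEW_CASES_PER_DAY * 365
annual_growth_pct = 100 * cases_added_per_year / TOTAL_CASES

print(f"Cases added per year: {cases_added_per_year:,}")
print(f"Approximate annual growth: {annual_growth_pct:.1f}%")
print(f"Projected total after one year: {projected_cases(365):,}")
```

At a constant rate, that works out to roughly 15 million new cases a year, or a bit over 9 percent annual growth on the current total.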
The July 20 crash was not the CCD's first significant technical problem this year. It experienced a less severe glitch earlier in the year that prompted Greg Ambrose, the State Department's director of consular systems and technology, to alert stakeholders to the problem, according to an industry source familiar with the issue.
In an exclusive interview that FCW is publishing in two parts, Ambrose details the technical failings that led to the CCD crash and how the Bureau of Consular Affairs responded, on the advice of Oracle and Microsoft. In part 2, Ambrose covers broader IT challenges that the State Department is facing and what the Bureau of Consular Affairs is doing to prevent a future CCD crash.
The Q&A was edited for clarity. The reporter's asides, explaining technical details of the CCD or other context, are labeled as such and in italic.
FCW: Can you put the CCD's struggles in the context of your bureau-wide efforts to better utilize IT to carry out your mission?
Ambrose: About four years ago, there were two modernization projects, called Global Visa System and Global Citizen Services, that were expected to modernize all of the consular systems worldwide. And those two modernization programs, before they got very far along, morphed into one program that is now called Consular One.
Not long after those two programs morphed into one, the previous director [Kirit Amin, now deputy CIO of the Commerce Department] departed.
After that, there wasn't a lot of momentum for the modernization project for Consular One. Then I came in [in April 2013], took a look at where we were across all of the systems for which CA is responsible, and we kick-started the program.
We had a vacancy rate in this organization of 41 percent. So a lot of key positions were not filled [and I] started to fill those key positions.
FCW: Can you provide a detailed timeline of what went wrong with the CCD in July?
Ambrose: While CCD had a lot of instability, we didn't think we were going to have the problem that occurred in July as it did.
We had just finished setting up our test environment in June for the 11G-Linux test. We kept having instability issues in production, an outage here and there. Oracle recommended that we upgrade [the production environment] to a newer version, which they believed was the most stable version of 10G.
REPORTER'S NOTE: The CCD was running on a Windows server and Oracle 10G; the "11G-Linux" test environment was for a planned migration to both a Linux platform and a newer version of Oracle's database.
When we upgraded to 10.2.0.5 is when we started to see instability in the cluster environment. So we had four nodes running in the cluster. We upgraded to 10.2.0.5 because we had some instability in June, and July 19 was a Saturday night window in which we decided to do this patch upgrade. It worked well in testing, and it worked well when we first brought it up.
The second day, when we started to see more load on the system, when the Asian posts started coming online Sunday night, we started to see some instability in the cluster.
FCW: What do you mean by 'instability' in the cluster?
Ambrose: We had multiple nodes running simultaneously, handling the database traffic, and the cluster software was not keeping all of the nodes running and processing traffic. And so nodes kept shutting down.
FCW: Had you seen something like that before?
Ambrose: It is something we had not seen before and it was something that Oracle had not seen before. So we had Oracle support in the room with the engineers when we were doing the upgrade, and they immediately coordinated with their engineers, product development folks, and started taking a look at the data dumps, making recommendations for some configuration changes. We did that and the cluster became stable the next day.
So we were never really down. We had instability that was causing a backlog of transactions.
FCW: So is it inaccurate to say the database 'crashed'?
Ambrose: We didn't lose data. We were up and available, just not to the level that we would prefer for worldwide operations.
FCW: How did you deal with the CCD's nodes shutting down?
Ambrose: As we made some tuning recommendations per Oracle and Microsoft, we noticed that the clusters, even when we got them stable, they were not performing very well. We noticed that the system was actually running more efficiently on one node. So we took it down to one node and we started to process same-day transactions. But because we were running on one node on a very small server, that one node was pegged at 100 percent.
The one that we were having trouble with was the central server that is the brain of the operations that handles all the traffic. And so each of the ancillary databases, as they're processing requests, has to reach back, in many cases, to the primary [server].
It's interesting that even though we were running on one node, we didn't have a single point of failure because we had the other three nodes. They were having trouble working together but they could each function independently.
It was very strange and even Oracle was perplexed by it.
FCW: What did you do then?
Ambrose: We knew we could run it on one node. We needed to have one very powerful node.
REPORTER'S NOTE: Ambrose's office then ordered two servers from Hewlett-Packard that he said HP “built for us overnight” under a blanket purchase agreement.
Ambrose: We found a 40-core server that HP built for us that Microsoft believed [Windows Server] 2003 could handle. They came in, they started to make some changes to the operating system so that it could function. Long story short: the recommendation was made to just go forward to [Windows Server] 2008.
And so we had to do some testing, of course, because that changed our architecture a bit. So we went forward with two new servers, running Windows Server 2008 and the Oracle 10.2.0.5 software, going from a 16-core, single-node server to a 40-core server. We were able to handle the traffic.
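As a rough illustration of what the move from 16 to 40 cores could mean, the Python sketch below estimates CPU utilization on the larger server for the same workload. The core counts come from Ambrose's account; the linear-scaling assumption and the function name are ours and are purely illustrative. Because the old node was pegged at 100 percent, real demand may have exceeded what it could serve, so the estimate should be read as a lower bound.

```python
OLD_CORES = 16  # single-node server, per Ambrose
NEW_CORES = 40  # HP-built replacement server, per Ambrose


def estimated_utilization(old_utilization: float, old_cores: int, new_cores: int) -> float:
    """Estimate utilization on the new server for the same workload.

    Assumes throughput scales linearly with core count, which real
    database workloads rarely achieve; treat the result as optimistic.
    """
    if old_cores <= 0 or new_cores <= 0:
        raise ValueError("core counts must be positive")
    return old_utilization * old_cores / new_cores


core_ratio = NEW_CORES / OLD_CORES
new_utilization = estimated_utilization(1.0, OLD_CORES, NEW_CORES)

print(f"Core ratio: {core_ratio:.1f}x")
print(f"Estimated utilization on the new server: {new_utilization:.0%}")
```

Under those assumptions, a workload that saturated the 16-core node would occupy about 40 percent of the 40-core server, which is consistent with Ambrose's remark that the new hardware could handle the traffic.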
FCW: Where does your plan to modernize CCD stand, given its struggles in July?
Ambrose: Our modernization that we had been planning was to move from Windows Server 2003 and Oracle 10G to Linux and Oracle 11G. We're still planning that migration by January. That migration will occur on Oracle Exadata machines. We made a big investment in Exadata. We're going to test the CCD on Oracle Enterprise Linux and 11G, running on Exadata for our January release.
REPORTER'S NOTE: The bureau was working to transition the CCD to Linux and 11G in June, the month before the database crashed. Ambrose said the crash had made him more cautious about the transition.
Ambrose: We didn't want to leap to Linux and 11G and introduce an architecture that hadn't been fully tested out because we weren't sure what changes we needed to make to CCD. So we had this interim step in the middle to put 10G on a newer Windows operating system and a newer server so that we [could] stabilize production until we went to our larger Exadata, 11G Linux release.
Part 2: Workforce, infrastructure and avoiding future crashes: Q&A with Greg Ambrose.
Sean Lyngaas is an FCW staff writer covering defense, cybersecurity and intelligence issues. Prior to joining FCW, he was a reporter and editor at Smart Grid Today, where he covered everything from cyber vulnerabilities in the U.S. electric grid to the national energy policies of Britain and Mexico. His reporting on a range of global issues has appeared in publications such as The Atlantic, The Economist, The Washington Diplomat and The Washington Post.
Lyngaas is an active member of the National Press Club, where he served as chairman of the Young Members Committee. He earned his M.A. in international affairs from The Fletcher School of Law and Diplomacy at Tufts University, and his B.A. in public policy from Duke University.
Connect with Lyngaas on Twitter: @snlyngaas.