Fed bioinformatics shops making do
- By Michael Hardy
- Jun 23, 2003
As life scientists grow increasingly dependent on the burgeoning field of computational science called bioinformatics, they're learning how to get biologists, software developers and statisticians to collaborate.
Bioinformatics generally refers to computer programs that aid biologists, geneticists and other life scientists in their research. Many programs require supercomputers or, when those are unavailable, less expensive grid clusters. In government, the technology is widely used at the National Institutes of Health, in Defense Department labs studying bioterrorism and in national laboratories.
Although a few vendors offer bioinformatics software, it is an area in which government agencies often have to develop their own applications. When government or commercial organizations have trouble finding products that meet their needs, they may develop their own bioinformatics divisions.
"There are not a lot of companies right now that are stand-alone companies. They've disappeared," said Eileen Mandell, co-founder of the Bio IT Coalition in McLean, Va. "I think that bioinformatics [as a stand-alone business] is not a sustainable discipline."
SRA International Inc., well known as a defense contractor, maintains a small bioinformatics division that has collaborated with NIH to develop several software applications. David Kane, technical lead for SRA's bioinformatics group, said software developers who aren't closely connected to the scientists they support can have a hard time keeping up.
Biotechnology has progressed from determining the relatively simple, linear chains of chemicals that form genes to exploring the complex shapes of the proteins those genes cause cells to manufacture. An emerging field called systems biology explores the complex interactions and relationships among genes, proteins and cells. Those advances multiply the computational challenges enormously, Kane said.
Creating bioinformatics applications that can be widely used is difficult because there are few common data standards, said Andrew Sherman, vice president of operations at TurboWorx Inc. The New Haven, Conn., company makes tools to adapt applications for high-powered, clustered computers. Its TurboBLAST is a specialized version of BLAST, the gene database search program developed by NIH's National Center for Biotechnology Information.
"It's somewhat difficult because the data is so widely scattered and diverse," he said. "A lot of that is related to it being a relatively immature industry. Look at engineering; they've been doing sophisticated high-performance computing for decades. In the life sciences, they've been doing it [for] less than 10 years."
No life scientist doing research can get very far without bioinformatics tools, though. The amount of data researchers must analyze, file and manage is overwhelming. A typical example comes from Argonne National Laboratory, where scientists took a database containing 1.8 million known protein sequences and compared each against every other sequence as part of a project to refine the algorithms used to conduct such comparative studies.
"This science is basically the science of big numbers," said Natalia Maltsev, a computational biologist and head of the bioinformatics group in the Mathematics and Computer Science Division at Argonne. "All these humongous amounts of data, they can be meaningful only if they are somehow understood."
Using a cluster of 350 processors running the Linux operating system, scientists ran the comparison in less than four days, she said. A desktop computer would have taken almost eight years running nonstop to do the same work.
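To give a rough sense of scale, the figures quoted above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the sequence count, processor count and run times come from the article, but the four-day and eight-year values are treated as round upper bounds, and the function names are illustrative rather than anything Argonne used.

```python
from math import comb

# Figures quoted in the article; the two durations are approximate upper bounds.
N_SEQUENCES = 1_800_000      # known protein sequences in the database
CLUSTER_PROCESSORS = 350     # processors in the Linux cluster
CLUSTER_DAYS = 4             # "less than four days"
DESKTOP_YEARS = 8            # "almost eight years"


def unordered_pairs(n: int) -> int:
    """Number of distinct pairs when each of n items is compared with every other."""
    return comb(n, 2)


def implied_speedup(desktop_years: float, cluster_days: float) -> float:
    """Ratio of the desktop run time to the cluster run time, both in days."""
    return desktop_years * 365 / cluster_days


pairs = unordered_pairs(N_SEQUENCES)
speedup = implied_speedup(DESKTOP_YEARS, CLUSTER_DAYS)

print(f"sequence pairs to compare: {pairs:,}")
print(f"implied cluster speedup:   {speedup:.0f}x")
print(f"speedup per processor:     {speedup / CLUSTER_PROCESSORS:.2f}x")
```

Under these assumptions the comparison involves about 1.6 trillion sequence pairs, and the cluster comes out roughly 730 times faster than the desktop. That is about twice the processor count, which would be consistent with rounded figures or with cluster nodes that were individually faster than the desktop used for the estimate.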
Argonne is also funding research being carried out at other labs. At Fort Detrick, an Army base in Maryland, scientists are working on projects related to bioterrorism, some in collaboration with Argonne researchers.
Jaques Reifman, a senior research scientist at Argonne, heads up the computational side of some of those projects. With Army scientists, he is developing an algorithm to identify which amino acid in a protein sequence is responsible for the protein's function.
A bioterrorist could modify a microbe to make it more infectious or deadly and to hide it from conventional detection technologies, Reifman said. But "if you knew which amino acids in a sequence are key to the protein function, you could identify them even if the amino acid is masked by another amino acid," he said.
In another Fort Detrick project, scientists are combing databases of publicly available genetic information to correlate the presence of specific microbes with the activity of particular genes. They hope to discover whether specific genes are "up" (telling cells to produce certain proteins) or "down" (telling cells not to produce the proteins). If such markers can be found, physicians could eventually diagnose diseases before any symptoms appear and possibly even tell "weaponized" microbes from naturally occurring ones.
"Generally, the stuff that we [did] here — until [the Sept. 11, 2001, terrorist attacks] and the anthrax attacks — was all considered [to be of] military relevance," Reifman said. "What was once considered just military relevance now affects the public and homeland defense."
Argonne scientists are developing a public server so that universities and other research facilities can take advantage of the lab's processor clusters, which multiply computing horsepower for large-scale applications, Maltsev said.
"A lot of small universities and small labs are building clusters, trying to process all this data," she said. "It requires huge amounts of human effort and it's very redundant." An early test version of the public server will begin running in September and a beta version soon after that, she said.
"It will allow people in remote locations to collaborate successfully, transfer large amounts of data," she said. "We are running large amounts of data for ourselves, but because there are so many technologies involved, and those technologies should be meshed together and integrated, it's pretty complex."