Search engine gives FirstGov edge
The new FirstGov search engine the General Services Administration revved up last month runs smoothly and needs only a few improvements to make FirstGov a best-of-breed Web portal.
Contractor AT&T Corp. chose the new search tools from Fast Search & Transfer, a Norwegian company with substantial U.S. operations. Fast provides spidered Web catalogs for the Lycos.com portal and search tools for sites operated by Dell Computer Corp., eBay Inc., Ericsson Inc. and IBM Corp.
FirstGov’s previous Inktomi search tools came from the Federal Search Foundation Inc. of Washington at no charge under an agreement with GSA that ended this spring.
With the Fast engine, FirstGov can now index documents in 49 languages and 225 file formats including Adobe Portable Document Format, Microsoft Excel spreadsheets, Microsoft Word documents and other types, as well as HTML. So, the engine can scan a much wider range of government information than before.
But not all FirstGov visitors know they will need specific software to view some files. The site should add small icons to indicate, for example, that an Adobe Acrobat reader must be downloaded. Visitors ignorant of such matters get angry when pages don’t load.
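The icon idea comes down to a lookup from file type to required viewer. Here is a minimal sketch in Python, not anything FirstGov actually runs. The `VIEWER_HINTS` table, the function names and the example URL are all hypothetical, and a real results page would show an icon rather than a bracketed note.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical table: file extension -> software a visitor needs to open it.
VIEWER_HINTS = {
    ".pdf": "Adobe Acrobat Reader",
    ".doc": "Microsoft Word",
    ".xls": "Microsoft Excel",
}


def viewer_hint(url):
    """Return the viewer needed for a result URL, or None for ordinary Web pages."""
    # Look only at the path, so query strings do not hide the extension.
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    return VIEWER_HINTS.get(suffix)


def label_result(title, url):
    """Append a plain-text viewer note to a result title when one applies."""
    hint = viewer_hint(url)
    if hint is None:
        return title
    return f"{title} [requires {hint}]"
```

A results page could call `label_result` on each hit before display, so the warning appears before the visitor clicks.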
The search platform, running on distributed Sun Microsystems Inc. servers under Solaris, can integrate multiple data sources including real-time data. Fast spokesman Peter Gorman said the site gets fresh content nearly every day and the 1.8 million URLs are refreshed every few weeks.
The Fast search platform can scale up to hundreds of terabytes, Gorman said.
Now that the engine is running, GSA has other aspects of the search process to improve. It faces the same issues as commercial portals in perfecting search forms, query algorithms, results pages, and back-end databases or catalogs.
The search form—usually a simple box—must be easy to understand yet powerful enough to satisfy advanced users. The algorithm logic must target text properly, avoiding inappropriate results.
Results pages must be easy to understand yet provide ways to refine searches if necessary. Back-end databases or catalogs must be relevant and free of annoying duplications.
Pie in the cybersky
No search engine does all these jobs perfectly. Millions of visitors have millions of vastly differing needs. The trick is to satisfy most of the target audience most of the time.
The site handles its mission well as the government’s Web front door. But to become a truly best-of-breed portal, it should mimic some functions available on commercial portals.
Like most sites, firstgov.gov keeps the home page simple and puts advanced search tools on another page. The trouble is that visitors can’t jump immediately to the advanced search. They must conduct a separate search to find the more powerful search interface.
That’s a shame because the advanced search has a wealth of choices to search by word, phrase or combinations. Several filters can target a search at, or away from, certain sites. Power searchers would do well just to bookmark www.firstgov.gov/fgsearch/index.jsp as their FirstGov entry point.
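For readers curious how a filter that targets or avoids certain sites might work, here is a rough Python sketch. It is an illustration under assumptions, not a description of the Fast engine. The function names and the `include` and `exclude` parameters are invented for the example.

```python
from urllib.parse import urlparse


def host_matches(url, domain):
    """True if the URL's host is the domain itself or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    domain = domain.lower()
    return host == domain or host.endswith("." + domain)


def filter_results(urls, include=(), exclude=()):
    """Keep URLs on any included domain (if given) and on no excluded domain."""
    kept = []
    for url in urls:
        if include and not any(host_matches(url, d) for d in include):
            continue  # an include list was given and this host is not on it
        if any(host_matches(url, d) for d in exclude):
            continue  # host is explicitly steered away from
        kept.append(url)
    return kept
```

Matching on the full host name, not on a substring of the URL, keeps a filter for one domain from catching look-alike hosts.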
A front-page option can target a search to federal or state sites. I had some trouble getting it to work properly on the home page, but it worked well when selected from the Search Results page.
The spidered catalog is heavily de-duped to remove multiple entries of the same page. Unfortunately, it suffers from the same limitation as many other search engines: some Web pages are incorrectly formatted, lacking titles or carrying nonexistent or inappropriate descriptions. As a result, search results say little about some matching pages, forcing the visitor to click through to look at them.
Here are two ways to smooth that flaw:
- Set specific rules for each spidered site, picking up descriptions from different places on pages or substituting a page headline when a title is missing. This takes editorial effort.
- Insist that spidered government sites police themselves better by standardizing their page titles, descriptions, metatags and so on. GSA might consider setting a suggested format for pages that interact with the spider. This could be as simple as a metatag or as complex as formatting descriptions.
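The first approach, substituting a page headline when a title is missing, can be sketched with Python's standard `html.parser` module. This is a simplified illustration, not the rules Fast or GSA use. The class name, the `summarize` function and the "(untitled page)" placeholder are assumptions, and the sketch treats the first `h1` element as the headline.

```python
from html.parser import HTMLParser


class PageSummaryParser(HTMLParser):
    """Collect a page's title, meta description and first h1 headline."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.headline = ""
        self._current = None  # tag whose text is currently being captured

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title" or (tag == "h1" and not self.headline):
            self._current = tag
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = (attrs.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current == "h1":
            self.headline += data


def summarize(html):
    """Return (title, description), falling back to the headline if no title."""
    parser = PageSummaryParser()
    parser.feed(html)
    parser.close()
    title = parser.title.strip() or parser.headline.strip() or "(untitled page)"
    return title, parser.description
```

A per-site rule set, as the first bullet suggests, would swap in a different fallback for each spidered site. The second approach would make such fallbacks unnecessary by having agencies supply the title and description metatag themselves.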
Deborah Diaz, GSA’s deputy associate administrator for FirstGov, said an interagency portal group is studying how such format rules might assist the spidering.
An associated issue is access to the so-called invisible Web, the thousands of public databases that cannot easily be spidered.
To query them, visitors must find each database’s front door, then do separate searches.
Intelliseek Inc., a Cincinnati search company, has experimented with indexing Web pages that interface directly to databases. FirstGov could follow Intelliseek’s model to bring more government data to the Web.
FirstGov currently doesn’t offer a “more matches from this Web site” option among its suggested URLs. It could incorporate the function now offered at Fast’s own search portal, AllTheWeb.com. AT&T spokesman Wayne Jackson said requirements for such additions must come from GSA.
Diaz said there is no specific timetable for additions. She said GSA would analyze current traffic and monitor feedback before making any changes.
The one big thing missing from FirstGov is a navigable Web site directory, also called a subject-specific site taxonomy, like those at dir.yahoo.com. The categories let visitors drill down by subject. They also can turn up matching categories, which allow a sideways move into another search.
Such directories require some human intervention and are work to maintain. Diaz said she would like FirstGov to move in this direction. Currently, the site has a directory of agency offerings. A subject directory would, for example, show environmental information not only from the Environmental Protection Agency but from any agency.
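To make the difference between an agency directory and a subject directory concrete, here is a small Python sketch. The entries, subject names and `example.gov` URLs are made up for illustration, and the functions are hypothetical. A real taxonomy would need the editorial upkeep described above.

```python
from collections import defaultdict

# Made-up catalog entries: (agency, subject path, URL).
ENTRIES = [
    ("EPA", ("Environment", "Air quality"), "http://www.example.gov/epa/air"),
    ("Interior", ("Environment", "Wildlife"), "http://www.example.gov/doi/wildlife"),
    ("Energy", ("Environment", "Air quality"), "http://www.example.gov/doe/emissions"),
    ("Treasury", ("Money", "Taxes"), "http://www.example.gov/treasury/taxes"),
]


def build_directory(entries):
    """Index entries under every level of their subject path for drill-down."""
    directory = defaultdict(list)
    for agency, path, url in entries:
        for depth in range(1, len(path) + 1):
            directory[path[:depth]].append((agency, url))
    return directory


def agencies_for(directory, *path):
    """List the agencies that have pages filed under a subject path."""
    return sorted({agency for agency, _ in directory.get(tuple(path), [])})
```

Asking this directory for "Environment" returns every agency with environmental pages, not just the EPA, and adding "Air quality" narrows the list one level down.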
Fast’s automated directory-building tools could do much of the construction work for such a taxonomy. With the addition of a subject directory, matching categories and drill-down functions, FirstGov would make the leap to a world-class portal.
Shawn P. McCarthy designs products for a Web search engine provider. E-mail him at [email protected]