Information Today
Vol. 20 No. 6 — June 2003
CONFERENCE CIRCUIT
The Eighth Search Engine Meeting
By Donald T. Hawkins

After last year's detour to San Francisco, the Search Engine Meeting returned to Boston April 7–8 for its eighth annual gathering. Despite these shaky economic times, organizer Harry Collier of Infonortics, Ltd. was pleased with the turnout of 135 attendees. These participants were primarily technologists and researchers working at the leading edge of information searching.

Delegates were treated to a feast of 20 presentations and a panel discussion. Adobe PDF files of the presentations are available at the Infonortics Web site (http://www.infonortics.com).

One of the major themes from last year—the differences between general Web search engines and those on intranets—was continued at this year's meeting. That idea seems to have broadened somewhat to focus on search engines as components of corporate intranets' information portals. A new topic—unified systems for searching both structured and unstructured information—has emerged as the theme of much current research.

Keynote Address

David Evans, CEO of Clairvoyance Corp., opened the meeting with a keynote address that reviewed search engine history and incorporated some personal reminiscences from 1994 to the present. During this session, he wondered which search engine was the first, but his search for the answer was not entirely successful. He found a number of conflicting opinions and had difficulty defining a search engine in the context of the Web's early days. Some go back to the pre-Web period and suggest that Archie, Veronica, and Gopher qualified as search engines, while others say that AltaVista and Excite were the first. Still others date searching to the early days of UNIX and its "grep" command. (Trivia question: What does grep stand for? The answer is at the end of this article.)

By searching Google's newsgroups, Evans found a message from Aug. 14, 1994, announcing Lycos, a new search service that offered "probabilistic retrieval of over 390,000 WWW documents." (Compare that with today's marketing claims by search engine companies, which tout the billions of documents available through their services!) Searching was not nearly as widespread then as it is now. Probably fewer than 100,000 people conducted searches regularly, as opposed to millions today. Americans are now said to spend more than 1 1/2 hours a week searching for information on the Web. All this searching has led to the emergence of the term "search rage," which describes the feelings experienced when searchers don't find what they're looking for within 12 minutes.

The years 1990–1994 saw the rise of human-produced subject directories. This led to the introduction of Yahoo!, which achieved its first million-hit day in 1994. By 1996, new search engines with additional features and increased functionality had appeared on the market, and the business was in full swing. The concept of paying per click or paying for placement on a search term began shortly thereafter and is a major industry force today.

We now have improved technologies that lead to higher relevancy of search results, the spread of search engines into intranets, new avenues for obtaining revenue from searching, and interestingly, a return to ontologies and taxonomies as ways to organize information.

Search Difficulties

Evans' keynote was followed by four presentations that covered search process problems. Elizabeth Liddy of Syracuse University, a pioneer in search engine research, described three projects that address the automation of metadata generation. Metadata is now generated manually, often by professional indexers—a costly and labor-intensive process.

Research at the University of Washington, Cornell University, and Syracuse is attempting to develop metadata-generation algorithms. Liddy's experiments compare search results using algorithmically generated metatags with those obtained using human-assigned tags. Further work is needed before such systems can be put into general use, but the research shows promise.
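
To make the idea concrete, here is a minimal sketch of one common approach to algorithmic metadata generation: scoring terms with TF-IDF and keeping each document's top-ranked words as candidate metatags. It is an illustration only, with an invented toy corpus, and does not reproduce the algorithms under study at Washington, Cornell, or Syracuse.

# Minimal, hypothetical sketch of algorithmic metadata generation:
# pick each document's highest-scoring TF-IDF terms as candidate metatags.
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def candidate_metatags(docs, tags_per_doc=3):
    token_lists = [tokenize(d) for d in docs]
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    n_docs = len(docs)
    all_tags = []
    for tokens in token_lists:
        tf = Counter(tokens)
        # TF-IDF: frequent in this document, rare across the collection.
        scores = {t: (count / len(tokens)) * math.log(n_docs / doc_freq[t])
                  for t, count in tf.items()}
        ranked = sorted(scores, key=scores.get, reverse=True)
        all_tags.append(ranked[:tags_per_doc])
    return all_tags

docs = [
    "metadata generation for digital library records",
    "search engines crawl and index web documents",
    "human indexers assign subject metadata to records",
]
for doc, tags in zip(docs, candidate_metatags(docs)):
    print(tags, "<-", doc)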

Claude Vogel, chief scientist at Convera, described a different approach to document classification. Documents are often indexed using a thesaurus (i.e., pre-coordinated indexing), which may not give the best retrieval results. Better results are obtained by combining pre-coordinated indexing with terms that have been dynamically generated from a full-text search using rule-based techniques (post-coordinated indexing). The problem with this approach is its complexity.
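
The combination Vogel described can be pictured with a toy example: controlled terms mapped from a thesaurus (pre-coordinated) merged with terms pulled dynamically from the full text by a simple rule (post-coordinated). The thesaurus entries and the extraction rule below are invented for illustration.

# Hypothetical illustration of combining pre-coordinated (thesaurus) terms
# with post-coordinated terms pulled dynamically from the full text.
THESAURUS = {
    "car": "motor vehicles",
    "automobile": "motor vehicles",
    "laptop": "portable computers",
}

def index_terms(text):
    words = text.lower().split()
    # Pre-coordinated: controlled terms mapped from the thesaurus.
    controlled = {THESAURUS[w] for w in words if w in THESAURUS}
    # Post-coordinated: a simple rule keeps longer, more distinctive words.
    dynamic = {w for w in words if len(w) > 6 and w not in THESAURUS}
    return controlled | dynamic

print(index_terms("Hybrid automobile batteries require careful recycling"))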

Vogel suggested that a search engine could assign a "semantic signature" to documents and then use it to organize them, thus improving results. Prototype systems built by Convera and Endeca take slightly different approaches to this technology. Endeca reorganizes search results using indexing built from the retrieval set, while Convera utilizes an ontology to organize the results.

Information overload is a common problem in Web searches. Raul Valdes-Perez, president of Vivísimo, suggested that many people solve information overload by "information overlook"—simply ignoring much of the data they retrieve. Using a statistic from Evans' keynote address, Valdes-Perez asked, "How many documents can you open in the 12 minutes before search rage occurs?"

Information overlook has the following significant business costs:

• Employees don't get the information they need to do their jobs.

• Customers can't solve their problems.

• Publishers may lose readership.

• Web advertisers may lose click-through revenue.

• Users may miss discoveries and opportunities.

We can stop overload by eliminating useless and irrelevant information or by helping people become more efficient. Manual tagging is labor-intensive and expensive. A Forrester Research report estimates that it costs up to $50 to tag a large document. Companies that have employed automatic tagging include Northern Light, whose search engine (which is no longer publicly available) placed search results in "folders," and Vivísimo, which uses document clustering that lets searchers organize information dynamically without the need to construct and maintain taxonomies.
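
As a rough sketch of what dynamic clustering of a result set looks like, the toy code below groups retrieved titles by a shared salient word rather than by a pre-built taxonomy. The result titles and the grouping rule are invented; Vivísimo's actual clustering technology is certainly more sophisticated.

# Toy sketch of clustering search results on the fly, with invented data.
STOPWORDS = {"the", "a", "of", "for", "and", "in", "to"}

def cluster_results(titles):
    clusters = {}  # cluster label -> list of titles in that cluster
    for title in titles:
        words = [w for w in title.lower().split() if w not in STOPWORDS]
        # Join an existing cluster whose label word appears in this title,
        # otherwise start a new cluster labeled by the longest word.
        label = next((l for l in clusters if l in words), None)
        if label is None:
            label = max(words, key=len) if words else "misc"
        clusters.setdefault(label, []).append(title)
    return clusters

results = [
    "Jaguar repair manuals and parts",
    "Jaguar habitat and conservation",
    "Conservation funding for big cats",
    "Classic car parts and repair",
]
for label, members in cluster_results(results).items():
    print(label, "->", members)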

New Searching Tools

Several presentations described some practical applications of new searching technology. Frank Smadja, chief technology officer of Elron Software, noted that with the recent rapid growth of e-mail spam, a huge opportunity exists for text-categorization and filtering tools.
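
A bare-bones Naive Bayes classifier of the kind such categorization and filtering tools build on is sketched below. The training messages are invented, and this is not a description of Elron's product.

# Bare-bones Naive Bayes text categorizer, the statistical core many
# spam filters build on. Training data is invented.
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word counts
        self.label_counts = Counter()
        self.vocab = set()

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.label_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, float("-inf")
        for label in self.label_counts:
            score = math.log(self.label_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for w in words:
                # Laplace smoothing so unseen words don't zero out the score.
                p = (self.word_counts[label][w] + 1) / (total_words + len(self.vocab))
                score += math.log(p)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

nb = NaiveBayes()
nb.train("free money win prize now", "spam")
nb.train("cheap meds free offer", "spam")
nb.train("meeting agenda for monday", "ham")
nb.train("project status and budget", "ham")
print(nb.classify("win a free prize"))        # classified as "spam"
print(nb.classify("monday project meeting"))  # classified as "ham"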

Chahab Nastar, president of LTU Technologies, updated his presentation from last year's meeting that described his work retrieving images from the Corbis collection. Many "image retrieval" systems are simply doing text retrieval by searching the text of a caption or a description of the image. LTU's system looks at the actual pixels of the image and creates a "DNA signature" for them, which is then searched. (A demo database of 70,000 images is available at http://corbis.ltutech.com.)
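
LTU has not published the details of its DNA signature, but the flavor of content-based signatures can be sketched with a crude average hash: each pixel contributes one bit depending on whether it is brighter than the image's own mean, and signatures are compared by Hamming distance. The tiny grayscale "images" below are invented 2-D lists of values.

# Crude content-based image signature sketch; not LTU's method.
def signature(pixels):
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    # One bit per pixel: is it brighter than the image's own average?
    return tuple(1 if p > mean else 0 for p in flat)

def distance(sig_a, sig_b):
    # Hamming distance: how many bits differ between two signatures.
    return sum(a != b for a, b in zip(sig_a, sig_b))

dark_left   = [[10, 20, 200, 210],
               [15, 25, 205, 215],
               [12, 22, 198, 212],
               [11, 21, 202, 208]]
similar     = [[14, 24, 190, 205],
               [18, 28, 199, 210],
               [13, 23, 195, 207],
               [16, 26, 201, 209]]
different   = [[200, 210, 10, 20],
               [205, 215, 15, 25],
               [198, 212, 12, 22],
               [202, 208, 11, 21]]

base = signature(dark_left)
print(distance(base, signature(similar)))    # small distance: near-duplicate
print(distance(base, signature(different)))  # large distance: different layout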

Alan Smeaton described Dublin City University's research on the more difficult problem of searching video archives. The school's system, Físchlár, uses various characteristics of video encoding to let its users search a library of TV programs. More than 2,000 people on campus use Físchlár for research, teaching, and entertainment.

Scientific documents are difficult to index because they have multi-word concepts and multilevel hierarchies. Written at an expert level, they are generally longer than other documents and many of their concepts are not stated explicitly. Because of this complexity, automated indexing may be impossible. In its merger with Union Carbide Corp. 2 years ago, Dow Chemical Co. faced those challenges. Union Carbide's information was largely in print form and not well-indexed. A team whose members had a wide variety of skills integrated Dow's and Union Carbide's documents and created a globally accessible electronic repository. The entire collection was re-indexed using automated and human-assisted techniques.

Government Information

Dealing with the federal government is far different from dealing with academic or corporate institutions. Steve Arnold, a well-known industry observer, discussed several of the pitfalls. He also listed some Web sites where searchers can find information on government procurement. Focusing on searching, Arnold said that there are three broad areas in which the government is interested: GSA schedules, records from a single agency, and classified material.

Arnold also noted that we must be aware of four major benchmarks that the government applies to search software proposals: relevance, database content, integration, and the interface. In his opinion, many of today's database search engines do not have the functionality that government agencies demand, primarily because much of the government's computing platform is based on UNIX, not Windows.

On the second day of the meeting, Arnold led a panel that examined pay-per-click advertising as a major new trend and identified four important issues: crawling technology, index freshness, fraud prevention, and analysis of click data. The second most heavily used Web functionality (after e-mail) is search, and companies using pay-per-click technology have figured out how to monetize it. This model may well drive the future of search engines.

Enterprise Applications

A group of presentations focused on search applications for enterprises. Martin White gave an excellent list of criteria for selecting an intranet search engine. He said that because there's often little to link to on an intranet, search—not Web surfing—is the most important function. White has found that many CIOs do not understand searching, the concepts of precision and recall, or taxonomy requirements. He suggested that the evaluation of a search engine and a taxonomy development system should be done separately, not as an integrated package from a single vendor.
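
Precision and recall themselves are easy to state in code; the retrieved and relevant document IDs below are invented.

# Textbook precision and recall over invented document IDs.
def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = ["d1", "d2", "d3", "d4", "d5"]   # what the engine returned
relevant  = ["d2", "d4", "d7", "d9"]         # what the user actually needed
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")   # precision=0.40 recall=0.50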

White's session was followed by descriptions of case studies from AT&T and Deutsche Telekom. At AT&T, six teams were brought together to evaluate search, and the participants exploited synergies to develop a common set of requirements. Deutsche Telekom developed an intranet search engine that incorporates semantic features to classify e-mail and other information. The search engine was successfully integrated into the intranet and handles approximately 80,000 queries daily.

Matthew Koll, a pioneer in search engine development, now leads start-up company Wondir. He has observed that much information is not being found because it's in the invisible Web or in people's heads. Many ask-an-expert sites are available, but they're rarely used because searchers must know about them in advance. The growth of instant messaging services shows that you can find information by asking others. Wondir meets this need by providing an electronic meeting place where people can ask questions and receive answers from experts.

Koll thinks question-answering will one day be as easy as searching, with the Wondir system integrated into existing search engines as a value-added service. He said that Wondir could become "the last information service whose results remain totally free of commercial influence."

Collections of information with diverse types of data are common today, but such archives present significant challenges for search engines. Much of today's data isn't well-structured and doesn't lend itself to storage in relational databases. Paul Odom, president of Pliant Technologies, echoed many of the other speakers in describing the problems involved. Frequently, one must determine the intent of the user's query and deal with different word meanings, variant endings, concepts, and even contexts. Pliant's retrieval technology uses semantics and knowledge-based navigation combined with taxonomies. Generally, high-relevance hits can be retrieved with fewer than five mouse clicks.

Sue Feldman, vice president of content technologies research at IDC, categorized the tasks commonly done by knowledge workers as they explore, retrieve, analyze, and distribute information. Noting that we're drowning in a sea of content, she distinguished between content, data, and related technologies. Data- and content-centric applications have different technology requirements, but there's a need for integrated systems that can handle both. An integrated system could access various types of information with a single query and use standard tools to manipulate the results. Thus, the strengths of both data and content applications can be exploited. Feldman presented data showing a significant increase in retrieval using a combined system and discussed strategies for combining the technologies.
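
A toy sketch of the single-query idea follows: one search term is run against a small relational table and a handful of free-text notes, and the hits are merged into one result list. The schema and texts are invented and stand in for whatever integrated system a vendor might actually build.

# Toy sketch of a single query spanning structured and unstructured content.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER, name TEXT, price REAL)")
conn.executemany("INSERT INTO products VALUES (?, ?, ?)",
                 [(1, "laser printer", 249.0), (2, "ink cartridge", 29.0)])

support_notes = {
    "n1": "The laser printer jams when duplexing on heavy paper.",
    "n2": "Replacing the ink cartridge resets the page counter.",
}

def unified_search(term):
    hits = []
    # Structured side: SQL match against the product table.
    for row in conn.execute(
            "SELECT id, name, price FROM products WHERE name LIKE ?",
            (f"%{term}%",)):
        hits.append(("structured", row))
    # Unstructured side: naive full-text scan of the support notes.
    for note_id, text in support_notes.items():
        if term.lower() in text.lower():
            hits.append(("unstructured", (note_id, text)))
    return hits

for source, hit in unified_search("printer"):
    print(source, hit)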

Search Models

The final group of presentations dealt with search models and applications. Peter Bell, co-founder of Endeca Technologies, talked about the integration of searching and navigation. Navigation helps users find information, but it's difficult to do well because there are often only one or two paths to each record. Using aids that combine full-text retrieval with information facets (broad subject categories), Endeca has developed a navigation system to guide searchers through information. This approach allows searching of both structured and unstructured data, which are often in different databases.
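
The general mechanics of combining full-text matching with facet filtering can be shown in a few lines. The records and facet values below are invented, and the sketch makes no claim to represent Endeca's engine.

# Small sketch of full-text matching plus facet filtering, with invented data.
records = [
    {"title": "Red wool sweater", "facets": {"color": "red",  "material": "wool"}},
    {"title": "Blue wool scarf",  "facets": {"color": "blue", "material": "wool"}},
    {"title": "Red cotton shirt", "facets": {"color": "red",  "material": "cotton"}},
]

def search(text_query=None, **facet_filters):
    hits = records
    if text_query:
        hits = [r for r in hits if text_query.lower() in r["title"].lower()]
    for facet, value in facet_filters.items():
        hits = [r for r in hits if r["facets"].get(facet) == value]
    return hits

def facet_counts(hits, facet):
    # Show the user how the current result set breaks down on one facet.
    counts = {}
    for r in hits:
        value = r["facets"].get(facet)
        counts[value] = counts.get(value, 0) + 1
    return counts

hits = search(text_query="wool")
print(facet_counts(hits, "color"))             # {'red': 1, 'blue': 1}
print(search(text_query="wool", color="red"))  # narrows to the red sweater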

Raymond Lau, chief technology officer of iPhrase Technologies, addressed the problem of self-service information retrieval. He noted that the current model of searching is inefficient and places the burden of success on the user. Much content is buried deep in Web sites (more than three clicks from the home page), and few users click down to find it. The search engine often presents its results in a list that extends for several pages. Studies have shown that 85 percent of users abandon a search before looking at all the hits. Lau offered some technological solutions to these challenges, including natural language processing, dynamically designed presentation formats and user guidance, and single access to all information sources.

Prabhakar Raghavan, chief technology officer of Verity, Inc., continued the discussion of access to structured and unstructured information. He suggested that exploiting the structure (classification, tags, taxonomy, etc.) of information and tracking usage is the key to effective retrieval. Because XML is becoming a standard for tagging data, it provides a means of creating an information structure. However, most of today's search engines do not search XML data directly.

Michael Wollowski and Robert Signorelli of the Rose-Hulman Institute of Technology described their efforts to develop an XML search engine. Their prototype uses the structure inherent in XML documents, and it can display search results as plain text. The developers conducted a test with two groups of students, one using Google and the other using the XML engine. They found that the latter worked well on certain types of documents but failed as a general search engine. Searchers liked the interface once they learned how to use it.
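
Neither Verity's products nor the Rose-Hulman prototype are described here in enough detail to reproduce, but element-scoped XML searching itself is easy to demonstrate with Python's standard library. The document below is invented; the point is simply that a query can be restricted to titles or abstracts instead of treating the file as one undifferentiated text blob.

# Element-scoped search over an invented XML document using the standard
# library; not the Rose-Hulman prototype or Verity's engine.
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<articles>
  <article>
    <title>Search engines and XML retrieval</title>
    <abstract>Exploiting document structure improves relevance.</abstract>
  </article>
  <article>
    <title>Relational databases in practice</title>
    <abstract>XML appears only in passing here.</abstract>
  </article>
</articles>
""")

def search_element(root, element, term):
    # Restrict matching to one element type instead of the whole document.
    return [node.text for node in root.iter(element)
            if node.text and term.lower() in node.text.lower()]

print(search_element(doc, "title", "xml"))     # matches the first title only
print(search_element(doc, "abstract", "xml"))  # matches the second abstract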

In the final presentation, Jean Poncet of Pertimm suggested that in order to search effectively, we must consider information's context as well as its concepts. We communicate with complex mixtures of words and phrases, not Boolean logic. Pertimm uses linguistic data from seven languages in its retrieval system. By utilizing a document's text and structure, a "semantic glimpse" can be created without opening the file. The glimpse is then used to develop a search query. Excellent search results can be obtained with this approach.

The Search Engine Meetings annually show that searching is by no means a fully developed technology. It's an exciting field that continues to progress. The Ninth Search Engine Meeting will be held April 19–20, 2004, in The Hague, Netherlands.

(Trivia question answer: grep stands for "global regular expression print," from the ed editor command g/re/p. It's a UNIX command that's used to search for character strings in text files.)


Donald T. Hawkins is director of intranet content for Information Today, Inc. and editor in chief of Information Science & Technology Abstracts. His e-mail address is dthawkins@infotoday.com.