The Search Engine Meeting, sponsored by Infonortics, Ltd. of Tetbury,
U.K., grew out of a memorable Association of Information and Dissemination
Centers (ASIDIC) event held in Albuquerque, New Mexico, in 1995. Recognizing
a significant trend in the information industry, Infonortics began to host
annual meetings on Internet search engines. The first two were held in
the U.K., after which they moved to Boston for the next 4 years. Ev Brenner,
former manager of the American Petroleum Institute's APIPAT and APILIT
online databases, has been program chair for all of the meetings.
The early Search Engine Meetings concentrated on the technologies of
advanced systems such as ConQuest, DR-LINK, and CLARIT. Even though the
later gatherings continue to have a significant technological and research
emphasis, they have also expanded to include the business aspects of search
engines. The breadth of the event can be seen by the following subject
areas that have been discussed in previous years:
• What end-user and professional searchers require and expect
• The trend to human categorization
and editing
• Text summarization
• Cross-language searching
• Speech recognition
• Visualization
• Clustering
• Text mining
Because of their eclectic mix of technical and business topics, the
Search Engine Meetings tend to attract an interesting and diverse group
of attendees. This year's event, the seventh in the series, was held April
15-16 in San Francisco, and it followed the tradition of previous meetings.
In fact, according to the organizers, the diversity of the audience was
greater than ever. The theme was "Agony and Ecstasy," referring to the
agony users often face when trying to find information in the flood that's
available on the Web. The ecstasy is experienced not only when finding
the right information, but also in the research and development of new
search technologies. There were many outstanding presentations this year.
Because of space limitations, not all of them are summarized here. The
program and links to most of the presentations are available at http://www.infonortics.com/searchengines/sh02/02prog.html.
Keynote
Factiva CEO Clare Hart gave the keynote speech. In her opinion, the
people who come to the meeting are representative of the search engine industry:
approximately half are technologists, and the other half are high-level
corporate executives.
The next challenge in searching will be finding information in context
using knowledge management initiatives. Hart stressed that the complexity
and technology of searching must be hidden from end-users by eliminating
the search box found on so many of today's engines. Advanced technologies
like natural language processing (NLP) must be used, as they will lead to
gains in productivity, an increased use of taxonomies, and the ability
to find information in context. Users can't be expected to change their
behavior. Systems must be adapted instead.
Hart listed some illuminating statistics taken from a study ("How Much
Information") that was conducted by the University of California (UC)Berkeley's
School of Information Management and Systems (SIMS) (http://www.sims.berkeley.edu/research/projects/how-much-info):
• The volume of information that's being generated is enormous and growing.
Most of it is in internal documents on users' PCs. Newspapers account for
25 terabytes of information per year and magazines account for 10 terabytes,
but office documents account for 195 terabytes!
• Twenty percent of the world's data resides in relational databases.
• The first 12 exabytes (about 12 quintillion bytes) of information
were generated by humans over thousands of years, but the next 12 exabytes
will be generated in only 2 1/2 years.
• It's estimated that 610 billion e-mail messages, representing 11 terabytes
of information, are sent every year.
Global information companies face other challenges that lower their
productivity and increase their frustration. According to Hart, they're
not able to leverage their information assets fully because of end-user
search illiteracy, the lack of information in context, multilingual content,
and other factors. Real productivity gains will occur when searching comes
naturally to end-users and when they don't have to consciously think about
how to do it. To overcome these barriers, systems must be developed that
will help users by shielding them from technology details, working in the
background, and delivering information efficiently from a variety of sources.
Hart identified content management and better use of taxonomies as key
technologies that will help us reach those goals.
Future of Search Engines
Following the keynote, three presentations looked at the future of
search engines. We need to move beyond today's search technology, which
consists of merely poking at a huge amount of information and getting something
back. Instead, we require knowledge technology that uses ontologies and
semantic networks to retrieve information in context. This way, people
can be connected and sound decisions can be made.
A major goal of information retrieval is to make sense of the data by
using its "aboutness" qualities. One key technology that helps people do
this is visualization. Another is multitasking. This is helpful because
humans typically do not rely on only one source for the information they
need as they're often working on more than one project at a time. The next
generation of search systems must take this into account, make use of the
information retrieved in a first iteration, and refine subsequent retrievals.
According to Amanda Spink, an associate professor at Penn State University
who has done research on information multitasking, search engine designers
must help users by coordinating their multitasking efforts—which allows
them to display and use their search histories—and by designing for longer
and more complex searches. (Veteran online searchers will recognize many
of these capabilities as those they routinely use in commercial online
searching systems. Some more sophisticated end-users are now beginning
to ask for them.) Metasearch engines, which provide a single interface
to multiple systems, also have significant potential in this area, even
though they may have some disadvantages. (See http://www.searchtools.com
for a review of metasearch engines and their desirable features.)
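The basic mechanism behind a metasearch engine is simple to illustrate. The
Python sketch below is a minimal, hypothetical example (the engine functions
and URLs are invented placeholders, not any vendor's API): the query is fanned
out to several engines and the ranked lists are interleaved, with duplicate
URLs removed.

    # Minimal metasearch sketch: fan a query out to several engines and merge results.
    # The engine functions and URLs are hypothetical placeholders, not a real API.

    from itertools import zip_longest

    def search_engine_a(query):
        # Placeholder: imagine this calls engine A and returns URLs ranked best-first.
        return ["http://example.com/a1", "http://example.com/shared", "http://example.com/a3"]

    def search_engine_b(query):
        # Placeholder for a second engine.
        return ["http://example.com/b1", "http://example.com/shared"]

    def metasearch(query, engines):
        """Interleave ranked result lists and drop duplicate URLs."""
        merged, seen = [], set()
        result_lists = [engine(query) for engine in engines]
        for tier in zip_longest(*result_lists):   # round-robin by rank position
            for url in tier:
                if url is not None and url not in seen:
                    seen.add(url)
                    merged.append(url)
        return merged

    print(metasearch("faceted metadata", [search_engine_a, search_engine_b]))

Even this toy version shows why result merging and duplicate detection, rather
than the fan-out itself, are where most of the engineering effort goes.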
NLP and Categorization
The second session focused on NLP and categorization techniques. Susan
Feldman, research vice president at IDC, reviewed the reasons advanced
technologies are needed in search engine development. Because of the vagaries
of the English language and different ways to say the same thing, she said
that searching is essentially a language game. Words can have several meanings
or synonyms, asking questions is an art, and search engines are confusing.
Initial search engines were crude, but emerging technologies—taxonomies,
categorization, and linguistic tools—are being increasingly used in their
construction. NLP techniques are also important because they parse the
user's query and decide which terms to present to the search engine. Disambiguation
(deciding which of several word meanings is the desired one) is important
but difficult to do algorithmically.
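One common textbook approach to disambiguation, offered here only as an
illustration and not as what any of the engines discussed at the meeting
actually do, is to score each candidate sense by its overlap with the words
surrounding the ambiguous term. The sense inventory in this sketch is invented.

    # Toy word-sense disambiguation by context overlap (a simplified Lesk-style scorer).
    # The sense inventory below is invented for illustration.

    SENSES = {
        "java": {
            "programming language": {"code", "class", "compiler", "programming", "software"},
            "indonesian island":    {"island", "indonesia", "volcano", "travel", "coffee"},
        }
    }

    def disambiguate(word, context_words):
        """Pick the sense whose signature shares the most words with the context."""
        context = {w.lower() for w in context_words}
        best_sense, best_score = None, -1
        for sense, signature in SENSES.get(word, {}).items():
            score = len(signature & context)
            if score > best_score:
                best_sense, best_score = sense, score
        return best_sense

    print(disambiguate("java", "compiler error in my java class".split()))
    # -> 'programming language'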
Feldman said that search results can be improved by using categorization,
text structure, and heuristics to extract relevant concepts from the query.
Many engines depend on lists of rules to decide which terms to use. Often,
the user or a subject expert must construct these rules. Machine-learning
technologies that automatically develop rule bases are beginning to appear.
In the near future, we can expect to see better search tools because
linguistic technologies will be embedded in systems, searches will be executed
across multiple knowledge bases, and text mining and pattern detection
will improve queries, Feldman said. In the longer term, searching will
be just another feature of a portal or gateway, and additional machine
learning and agent technologies will be incorporated.
Marti Hearst, a professor at SIMS, gave an excellent presentation that
was well-illustrated by examples of "faceted metadata." She pointed out
that although today's Web search engines do well at getting users to a
starting point in their research, they tend to retrieve overwhelming numbers
of poorly organized results. Many Web sites contain links to useful information,
which can be utilized when the direction to the desired information is
clear. However, it's often difficult to detect which link should be followed.
Full-text searching does not work well because it usually produces a disorganized
mass of results.
Hearst said that the solution to many search problems is to seamlessly
integrate the searching process into the overall information architecture
and use hierarchical metadata to allow flexible navigation, organize the
results, and provide the capability to expand and refine the search. The
challenge of this approach is how to present large amounts of information
to users without overwhelming or confusing them.
Hearst's research focuses on the use of metadata to help folks navigate
through the search process. Metadata is more flexible than the simple retrieval
of a list of links, but it's less complex than doing a full search. It
helps users see where they have been and where they should go. Faceted
metadata is the use of categories to organize the metadata. For example,
if one were looking for information about a topic that occurred in a specific
geographic region on a certain date, the facets would be the topic, region,
and date. In her research, Hearst is studying the following questions:
• How many facets are allowable?
• Should they be mixed and matched?
• How much information can the
user assimilate?
• How should the hierarchies be
displayed (tabbed, progressively
revealed, etc.)?
• How should free-text results be
integrated into the search?
Some systems have attempted to utilize faceted metadata, but they don't
do it well. Hearst showed an example of Yahoo!'s awkward use of facets.
One must often drill down through many levels before arriving at the desired
data. For instance, to find information about UC Berkeley, one must
use the following navigation path:
U.S. States > California > Cities > Berkeley > Education > College and
University > Public > UC Berkeley
This example illustrates a major problem with metadata systems: They
are pre-defined and not tailored to tasks as they evolve. In contrast,
the Epicurious Web site (http://www.epicurious.com)
uses faceted metadata effectively. Epicurious creates combinations of metadata
dynamically to display the same information in different ways. It shows
the user how many sites a search will retrieve, makes it easy to back up
in the search, and supports several types of queries. In a usability study,
people found those features helpful and liked the style of the metadata
search.
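The Epicurious behavior Hearst praised, previewing how many items each facet
value would leave before the user commits to a narrowing step, is easy to
sketch. The recipe data and facet names below are invented for illustration
only; they are not Epicurious' actual schema.

    # Faceted-navigation sketch: count how many items each facet value would leave,
    # given the filters already chosen. Data and facet names are invented.

    from collections import Counter

    RECIPES = [
        {"course": "dessert", "cuisine": "french",  "season": "winter"},
        {"course": "dessert", "cuisine": "italian", "season": "summer"},
        {"course": "main",    "cuisine": "french",  "season": "winter"},
        {"course": "main",    "cuisine": "thai",    "season": "summer"},
    ]

    def facet_counts(items, chosen, facet):
        """Counts for each value of `facet` among items matching the filters in `chosen`."""
        matching = [it for it in items
                    if all(it.get(f) == v for f, v in chosen.items())]
        return Counter(it[facet] for it in matching if facet in it)

    # After the user picks course=dessert, preview how the cuisine facet narrows things:
    print(facet_counts(RECIPES, {"course": "dessert"}, "cuisine"))
    # -> Counter({'french': 1, 'italian': 1})

Because the counts are computed on the fly from whatever filters are already in
place, the same collection can be displayed in different ways, which is exactly
the dynamic behavior Hearst contrasted with Yahoo!'s fixed hierarchy.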
Image searching is another area in which faceted metadata can be utilized
advantageously. For example, architects often have large collections of
images that they use for reference and inspiration in designing new structures.
Often, their image collections are stored more or less randomly, making
retrieval difficult. Hearst and her team took a collection of about 40,000
images from the UC Berkeley Architecture Slide Library and produced
a search interface that included detailed metadata about them. In a usability
study of 19 architects, the system was received very positively, and the
subjects expressed a strong desire to continue using it. When asked to
choose between a matrix (faceted) approach and the tree structure used
by most of today's Web search engines, the participants overwhelmingly
preferred the matrix approach. They found it easier to develop their search
strategy, keep track of where they were, and understand hierarchical relationships.
Hearst's research shows that faceted metadata can support many types
of searching. It allows the user to switch from one search tactic to another
during the search, and it makes expanding and refining searches easy. Although
there are still questions to be answered, Hearst's work is fascinating
and gives us a glimpse of how the search process may be improved in the
future.
Laurent Proulx, chief technology officer at Nstein Technologies, Inc.,
discussed the current state of Web search engines. At the enterprise level,
most of the content is unstructured and resides in many different repositories
(primarily as e-mail messages or internal documents on PCs). Most users
regard searching as a box in which to type words, rarely supply more than
two or three words in a search, and look at only the first screen of results.
Proulx said that today's search engines are designed simply to match
words and cannot interpret meaning. As a result, they deliver high recall
but very low precision. They don't interact with users in any meaningful
way, so the results frequently fail to meet expectations. Their interfaces
generally use a hierarchical tree structure, but to enhance the search
process a new interface is needed. Concepts could be determined and terms
could be disambiguated by employing linguistic-based extraction techniques.
The information could be organized and retrieved by utilizing taxonomies,
which help define an information framework for users. Categorization—determining
the "aboutness" of an item—can define equivalent terms and can be done
by humans, computers, or a combination of both. Computer-aided categorization
provides editors with suggested terms and helps them define categorization
rules. Fully automated systems that use metadata and authority files are
currently being developed, and many of them show promise.
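Computer-aided categorization of the kind Proulx described can be sketched very
simply: rules map trigger terms to categories, and the system proposes whichever
categories have triggers present in a document, leaving an editor to accept or
reject them. The rules below are invented examples, not a production taxonomy.

    # Computer-aided categorization sketch: rule-based category suggestions for an editor.
    # The rules here are invented examples, not any vendor's taxonomy.

    import re

    RULES = {
        "Mergers & Acquisitions": [r"\bmerger\b", r"\bacquisition\b", r"\btakeover\b"],
        "Clinical Trials":        [r"\bphase (i{1,3}|[123])\b", r"\bplacebo\b", r"\btrial\b"],
    }

    def suggest_categories(text):
        """Return categories whose trigger patterns occur in the text."""
        text = text.lower()
        suggestions = []
        for category, patterns in RULES.items():
            if any(re.search(p, text) for p in patterns):
                suggestions.append(category)
        return suggestions

    doc = "The phase II trial compared the drug with a placebo after the takeover."
    print(suggest_categories(doc))
    # -> ['Mergers & Acquisitions', 'Clinical Trials']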
Content Filtering
The first day of the meeting concluded with a panel of speakers who
discussed the issues surrounding the filtering of search results. Chaired
by David Evans, CEO of Clairvoyance Corp. (formerly CLARITECH), it included
speakers from RuleSpace, a corporation that develops filtering software,
and FirstGov, the U.S. government's portal to over 51 million U.S. and
state Web sites.
Evans introduced the panel by identifying areas in which content may
cause concerns for users: security (viruses), copyright, offensive material
(pornography), spam, publishers' guidelines and policies, competitive intelligence,
corporate policy, and official information (national security). The Children's
Internet Protection Act (CIPA), which was enacted in 2000 and mandates
the filtering of public Internet access terminals in libraries, has raised
a storm of controversy. The government claims that filtering software has
greatly improved and that sites blocked in error can now be easily unblocked.
According to an ALA lawyer, studies have shown that because of language
vagaries and ambiguities, up to 15 percent of Web sites are blocked incorrectly.
Evans went on to show various methods of algorithmically identifying
content. From a sample of text, words are extracted, sentences are parsed
and tagged, and names and other proper terms are identified. The results
are anywhere between 40- and 85-percent accurate. Advanced technologies
such as NLP and filtering based on categorization will improve these results.
Many systems require the development of rule bases and "training sets"
of documents. Evans concluded that filtering is a challenging problem—much
more than it appears on the surface—and that caution is needed when making
claims of accuracy.
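To make the "training sets" point concrete, the sketch below trains a tiny
Naive Bayes-style word model on a handful of labeled example pages and scores
new text against it. It is a toy, not RuleSpace's system or any vendor's, and
real filters combine many more signals than word frequencies.

    # Toy text classifier trained from a labeled "training set" of documents.
    # A Naive Bayes word model, purely illustrative; not any vendor's filtering engine.

    import math
    from collections import Counter, defaultdict

    def tokenize(text):
        return [w for w in text.lower().split() if w.isalpha()]

    def train(labeled_docs):
        """labeled_docs: list of (text, label). Returns per-label word and document counts."""
        word_counts = defaultdict(Counter)
        doc_counts = Counter()
        for text, label in labeled_docs:
            doc_counts[label] += 1
            word_counts[label].update(tokenize(text))
        return word_counts, doc_counts

    def classify(text, word_counts, doc_counts):
        total_docs = sum(doc_counts.values())
        vocab = {w for counts in word_counts.values() for w in counts}
        best_label, best_score = None, float("-inf")
        for label in doc_counts:
            score = math.log(doc_counts[label] / total_docs)
            total_words = sum(word_counts[label].values())
            for word in tokenize(text):
                # Laplace smoothing so unseen words do not zero out a label's score.
                score += math.log((word_counts[label][word] + 1)
                                  / (total_words + len(vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    training = [
        ("win free prizes adult content click here", "block"),
        ("explicit adult material subscribe now", "block"),
        ("library homework help for students", "allow"),
        ("university research library catalog", "allow"),
    ]
    word_counts, doc_counts = train(training)
    print(classify("free adult content here", word_counts, doc_counts))       # -> 'block'
    print(classify("homework help at the library", word_counts, doc_counts))  # -> 'allow'

The tiny training set also hints at why Evans urged caution about accuracy
claims: the classifier is only as good as the examples and rules behind it.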
Daniel Lulich, chief technology officer at RuleSpace, presented a fascinating
view of the behind-the-scenes technical issues involved in content filtering
and how they must be balanced against users' requirements. RuleSpace's
filtering software is based on machine learning and has won many awards
for technology excellence. Several technologies that have been applied
for filtering cannot handle the wide range of human language capabilities
because their categories are too broad. They are also unable to deal with
images and other nontextual data.
Lulich said that the nature of the Web does not lend itself to content
policing because sites change so rapidly, content owners can be difficult
to find, many pornographers are experts at defeating the systems, and filtering
may inhibit people from finding useful, legitimate information. Many filtering
vendors use control lists that routinely overblock good sites, and they
are insensitive to privacy issues. Because of proprietary reasons, they
refuse to release their control lists or details on how their systems work.
RuleSpace averages approximately 1 billion hits per day on servers that
have its technology installed. The service is opt-in; users can turn it
off at any time. However, since the company began deploying its technology,
the number of parents who opt for content filtering has doubled. On the
average, RuleSpace receives 120 requests a day to unblock sites. Twenty
to 30 of the requests are for sites that were blocked in error; the rest
violate customer policies. Every day, without fail, someone asks RuleSpace
to unblock Playboy.com—and sometimes for very inventive reasons.
According to Lulich, filtering policies have evolved in the last 2 years.
In the past, most ISPs did not filter content and they were proud of it.
Now they tend to offer filtering as an added-value service. Users expect
100-percent accuracy in the technology when it comes to pornography, but
they're not so concerned about blocking alcohol- and tobacco-related sites.
Filtering is difficult because some sites are really gateways that generate
dynamic content. Therefore, rules must be developed and sites must be rated
dynamically. RuleSpace sweeps the entire Web every 60 days and rates over
38 million sites, 8 percent of which are filtered. Of those 3.2 million
sites, 81 percent are pornographic.
New Technologies
The second day of the meeting began with the always-popular "New Technologies"
session. Steve Arnold, a frequent speaker on this topic at information
industry conferences, led off. When the Web originated, everything was
simple. There were only a few sites and only a few indexers were needed.
Now there are over a billion Web sites, nothing is simple anymore, and
flexibility is needed, he said. Cheap processors, smart agents, and rule
bases ("scripts") that are fine-tuned by humans are common threads in successful
search engines.
Arnold identified "computational search" and "mereology" as promising
technologies that will advance search and transform it. Computational search
enables the building of nimble and dynamic applications. Mereology, based
on research dating to 1918, uses preliminary answers to queries to discover
information relationships and then generates well-formed abstracts that
may contain sufficient information to answer the query.
According to Arnold, three companies to watch are Nutech Solutions,
Inc. (http://www.nutech.com),
which is using mereology technology to develop a search engine; Pertimm
(http://www.pertimm.com), which
applies semantic iteration to discover new phrases and automatically update
indexes to the content; and Applied Semantics (http://www.appliedsemantics.com),
which relies on dynamic categorization and content summarization to create
metadata and transform unstructured data into categorized and organized
information. Pertimm is being used by the Questel search service. The new
Oingo search engine is utilizing Applied Semantics' solutions. Technologies
such as these have a strong potential to advance search to new heights.
Chahab Nastar, co-founder and CEO of LTU Technologies (http://www.ltutech.com),
described his company's image search technology. He pointed out that several
of today's search engines do a good job of retrieving images, but not by
searching the images themselves. Instead, they rely on textual descriptions
that accompany the image, such as the title or metadata tags. LTU's technology
analyzes the pixels of the images, which then permits construction of their
"DNA" representation.
Intranet Search Engines
Several presentations focused on search engines that are incorporated
into enterprise systems. There are striking differences between enterprise
searching and general Web searching. Andrew Littlefield, chief strategist
of enterprise solutions for Inktomi, described some of them. On the Internet,
almost all of the content is in HTML format, but on intranets, there's
much greater diversity. Figure 1, taken from studies performed by Inktomi,
shows that only 37 percent of the content on intranets is in HTML.
Connections for general Web searching are optimized for dial-up (28.8
and 56K), but most corporate intranets now enjoy higher-speed connections.
Search engine developers usually assume that the slower dial-up connections
require mainly textual interfaces. However, on corporate networks, a simple
two-line text summary of a site may not be the most effective way to present
information, so higher bandwidths can be leveraged to take advantage of
graphical interfaces. Google's Matt Cutts said that his service's search
tool for corporate intranets was introduced 2 years ago. He echoed many
of the points made by Littlefield, emphasizing that for the corporate world,
one must search many other types of files besides Web pages, such as e-mail,
catalogs, and Microsoft Office documents. It's important to keep the user
interface familiar to searchers. The lessons learned from searching the
Web can be applied to intranet search engines.
Question-Answering Systems
One often hears speakers at information industry conferences bemoaning
the fact that search engines only present users with a list of Web sites,
when what they really want are answers to their questions. A session titled
"The TREC Question-Answering Track" focused on question-answering systems
and featured two industry leaders who are actively working in this area.
The first was Donna Harman from NIST, who reported on the results from
the latest Text REtrieval Conference (TREC). (These "conferences" are really
competitions among search engine research groups using advanced technologies
and the search engines and retrieval systems they have built. Each year,
a standard database of nearly 1 million news articles is presented, and
competitors are given tasks to solve using that database. TREC began in
1992 and reports on high-quality, leading-edge technology. For more information,
see http://trec.nist.gov.)
Each year's TREC focuses on one or more themes, and the past three events
addressed question-and-answer systems. In the 2001 experiment, participants
were asked to do the following:
1) Retrieve a 50-byte snippet of text containing the answer to a list
of questions. The questions sought simple facts, and the answers were generally
named entities or short phrases. (A minimal sketch of this kind of extraction
appears after this list.)
2) Assemble a list of events that answered a question. The questions
were taken from logs of actual searches. TREC analysts found and verified
the correct answers.
3) Track objects through a series of questions and answers. A test set
of 500 questions was assembled, and answers were verified. The set included
49 questions with no known answer in the database.
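The factoid task can be caricatured in a few lines: find sentences that share
words with the question, look for an entity of the expected answer type (here,
simply a year), and return a snippet of at most 50 bytes around it. Real TREC
systems are far more sophisticated; the documents below are invented.

    # Caricature of the TREC factoid task: return a short snippet containing a likely
    # answer (here, a year) from sentences that share words with the question.

    import re

    DOCS = [
        "The American Petroleum Institute was founded in 1919 in New York.",
        "Petroleum refining expanded rapidly after the war.",
    ]

    def answer_when(question, docs, max_bytes=50):
        keywords = {w for w in re.findall(r"[a-z]+", question.lower()) if len(w) > 3}
        best = None
        for doc in docs:
            for sentence in re.split(r"(?<=[.!?])\s+", doc):
                words = set(re.findall(r"[a-z]+", sentence.lower()))
                year = re.search(r"\b(1[89]\d\d|20\d\d)\b", sentence)
                overlap = len(keywords & words)
                if year and (best is None or overlap > best[0]):
                    # Keep a snippet of at most max_bytes centered on the year.
                    start = max(0, year.start() - max_bytes // 2)
                    best = (overlap, sentence[start:start + max_bytes])
        return best[1] if best else None

    print(answer_when("When was the American Petroleum Institute founded?", DOCS))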
NIST assessors judged the correctness of the participants' answers.
To more closely approximate the real world, the evaluations took into account
that assessors' opinions differ. Many systems in the test used a lexicon.
Most did fairly well, with 40 to 70 percent of their results correct. Detecting
the "no answer" questions was very difficult for all the systems; only
five runs returned accuracies greater than 25 percent. This year's experiment
has provided a rich collection of issues and data for further research,
such as questions relating to definitions, cause and effect, narratives
as answers, and contextual answers.
Liz Liddy of Syracuse University, another leader in NLP and other information
retrieval technologies, followed Harman. Her presentation, "Why Settle
For a List When You Want an Answer?" reviewed users' information needs
and provided an example of how a search engine would use NLP to parse their
queries. Users have many types of information requirements, but today's
search engines return only lists of URLs. Question-answering is different
from document retrieval, which is based on matching queries with terms
from an inverted file. It requires very precise matching of entities and
relationships, and one type of answer does not fit all queries.
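The inverted-file model Liddy contrasts with question answering can itself be
sketched in a few lines: each term maps to the documents that contain it, and
a query is answered by intersecting those posting lists. The documents are
invented examples.

    # Minimal inverted-index sketch: term -> set of document IDs, queries answered
    # by intersecting posting lists. Documents are invented examples.

    from collections import defaultdict

    DOCS = {
        1: "natural language processing improves search",
        2: "search engines match query terms",
        3: "language models and query parsing",
    }

    def build_index(docs):
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for term in text.lower().split():
                index[term].add(doc_id)
        return index

    def search(index, query):
        """Return IDs of documents containing every query term (boolean AND)."""
        postings = [index.get(term, set()) for term in query.lower().split()]
        return set.intersection(*postings) if postings else set()

    index = build_index(DOCS)
    print(search(index, "language query"))   # -> {3}

Term matching of this kind retrieves documents, not answers, which is precisely
the gap question-answering systems aim to close.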
Liddy presented detailed examples of hypothetical questions and of how
a system could use NLP to achieve human-like understanding of the text
(both explicit and implicit meanings). She envisions two-stage information
access systems in the future in which the first stage would retrieve an
initial set of resources that have a high potential for answering the query.
A following stage would perform an in-depth analysis of that set, then
present the answer to the user. Her research group has developed an "L-2-L"
(language-to-logic) system for parsing and processing queries and arriving
at an answer. For simple fact/data queries, it's necessary to understand
what users ask about (the query dimensions), how they ask (query grammar),
and how queries can best be used to retrieve statistical answers and map
the query dimensions into metadata. Other questions, such as those asked
by college students, are far more complex because they tend to require
more "how" and "why" responses than simple facts.
Corporate Intranet Searching
The final session of the meeting focused on the search features that
are desirable in corporate intranets and included the learning experiences
of two pharmaceutical companies in deploying their systems. David Hawking
of Australia-based CSIRO led off with 10 "rules" for intranet developers
(see the meeting Web site for details).
Horst Baumgarten of Roche Diagnostics presented the first case study.
He said his company has found the following about intranet searching:
• Over half of the information resides in structured databases.
• The data are usefully characterized into directories, product catalogs,
etc.
• Extraneous or irrelevant information is largely eliminated because
only content that has been "registered" is allowed on the intranet. (Registration
is accepted by the content owners because it allows them to provide publicity
to reach the entire user base and employ common site tools such as navigation
bars, editorial systems, etc.)
• The database contains Web pages, Microsoft Office files, Adobe PDF
documents, and other types of content. Searches can be done in context,
and all content can be retrieved in a single search.
• Efficient intranet searches are possible when the content is up-to-date,
users have an understanding of searching, and their needs are known in
some detail so that the system can be designed intelligently.
The meeting concluded with a case study by Neil Margolis of Wyeth-Ayerst
Research. He said that searching is not the same as finding, and in the
future we should concentrate on finding. His observations largely mirror
the well-known fact that finding and using information are a major part
of many knowledge workers' jobs. Requirements for the ideal search system
include repositories for multiple types of information, concept-based searching,
search and retrieval from unstructured text, relevance ranking, security,
and ease of use.
Margolis said that it's difficult to impress most intranet users because
everyone has had some experience in searching the Web. People expect a
search engine to find useful information. It bothers them to see the huge
number of results they tend to receive, and it also bothers them when what
they consider to be the best hit does not appear at the top of the list.
If users don't find what they want quickly, they tend to give up.
Conclusion
For anyone interested in search engines, this annual meeting is a major
event on the conference calendar and should not be missed. The quality
of the presentations is very high, as is the content. The diverse interests
of the attendees make for an enjoyable mix of conversation and networking
opportunities. The next Search Engine Meeting will be held April 7-8,
2003, in Boston.
Donald T. Hawkins is editor in chief of Information Science Abstracts
and Fulltext Sources Online, both published by Information Today,
Inc. His e-mail address is dthawkins@infotoday.com.