The Search Engine Meeting, sponsored by Infonortics, Ltd. of Tetbury,
U.K., grew out of a memorable Association of Information and Dissemination
Centers (ASIDIC) event held in Albuquerque, New Mexico, in 1995. Recognizing
a significant trend in the information industry, Infonortics began to host
annual meetings on Internet search engines. The first two were held in
the U.K., after which they moved to Boston for the next 4 years. Ev Brenner,
former manager of the American Petroleum Institute's APIPAT and APILIT
online databases, has been program chair for all of the meetings.
The early Search Engine Meetings concentrated on the technologies of
advanced systems such as ConQuest, DR-LINK, and CLARIT. Even though the
later gatherings continue to have a significant technological and research
emphasis, they have also expanded to include the business aspects of search
engines. The breadth of the event can be seen by the following subject
areas that have been discussed in previous years:
• What end-user and professional searchers require and expect
• The trend to human categorization
and editing
• Text summarization
• Cross-language searching
• Speech recognition
• Visualization
• Clustering
• Text mining
Because of their eclectic mix of technical and business topics, the
Search Engine Meetings tend to attract an interesting and diverse group
of attendees. This year's event, the seventh in the series, was held April
15-16 in San Francisco, and it followed the tradition of previous meetings.
In fact, according to the organizers, the diversity of the audience was
greater than ever. The theme was "Agony and Ecstasy," referring to the
agony users often face when trying to find information in the flood that's
available on the Web. The ecstasy is experienced not only when finding
the right information, but also in the research and development of new
search technologies. There were many outstanding presentations this year.
Because of space limitations, not all of them are summarized here. The
program and links to most of the presentations are available at http://www.infonortics.com/searchengines/sh02/02prog.html.
Keynote
Factiva CEO Clare Hart gave the keynote speech. In her opinion, the
people who come to the meeting are representative of the search engine industry:
approximately half are technologists, and the other half are high-level
corporate executives.
The next challenge in searching will be finding information in context
using knowledge management initiatives. Hart stressed that the complexity
and technology of searching must be hidden from end-users by eliminating
the search box found on so many of today's engines. Advanced technologies
like natural language processing (NLP) must be used, as they will lead to
gains in productivity, an increased use of taxonomies, and the ability
to find information in context. Users can't be expected to change their
behavior. Systems must be adapted instead.
Hart listed some illuminating statistics taken from a study ("How Much
Information") that was conducted by the University of California (UC)Berkeley's
School of Information Management and Systems (SIMS) (http://www.sims.berkeley.edu/research/projects/how-much-info):
• The volume of information that's being generated is enormous and growing.
Most of it is in internal documents on users' PCs. Newspapers account for
25 terabytes of information per year and magazines account for 10 terabytes,
but office documents account for 195 terabytes!
• Twenty percent of the world's data resides in relational databases.
• The first 12 exabytes (about 12 quintillion bytes) of information
were generated by humans over thousands of years, but the next 12 exabytes
will be generated in only 2 1/2 years.
• It's estimated that 610 billion e-mail messages, representing 11 terabytes
of information, are sent every year.
Global information companies face other challenges that lower their
productivity and increase their frustration. According to Hart, they're
not able to leverage their information assets fully because of end-user
search illiteracy, the lack of information in context, multilingual content,
and other factors. Real productivity gains will occur when searching comes
naturally to end-users and when they don't have to consciously think about
how to do it. To overcome these barriers, systems must be developed that
will help users by shielding them from technology details, working in the
background, and delivering information efficiently from a variety of sources.
Hart identified content management and better use of taxonomies as key
technologies that will help us reach those goals.
Future of Search Engines
Following the keynote, three presentations looked at the future of
search engines. We need to move beyond today's search technology, which
consists of merely poking at a huge amount of information and getting something
back. Instead, we require knowledge technology that uses ontologies and
semantic networks to retrieve information in context. This way, people
can be connected and sound decisions can be made.
A major goal of information retrieval is to make sense of the data by
using its "aboutness" qualities. One key technology that helps people do
this is visualization. Another is multitasking. This is helpful because
humans typically do not rely on only one source for the information they
need as they're often working on more than one project at a time. The next
generation of search systems must take this into account, make use of the
information retrieved in a first iteration, and refine subsequent retrievals.
According to Amanda Spink, an associate professor at Penn State University
who has done research on information multitasking, search engine designers
must help users by coordinating their multitasking efforts—which allows
them to display and use their search histories—and by designing for longer
and more complex searches. (Veteran online searchers will recognize many
of these capabilities as those they routinely use in commercial online
searching systems. Some more sophisticated end-users are now beginning
to ask for them.) Metasearch engines, which provide a single interface
to multiple systems, also have significant potential in this area, even
though they may have some disadvantages. (See http://www.searchtools.com
for a review of metasearch engines and their desirable features.)
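The basic mechanism behind a metasearch engine is simple to illustrate. The
Python sketch below is a minimal, hypothetical example (the engine functions
and URLs are invented placeholders, not any vendor's API): the query is fanned
out to several engines and the ranked lists are interleaved, with duplicate
URLs removed.

    # Minimal metasearch sketch: fan a query out to several engines and merge results.
    # The engine functions and URLs are hypothetical placeholders, not a real API.

    from itertools import zip_longest

    def search_engine_a(query):
        # Placeholder: imagine this calls engine A and returns URLs ranked best-first.
        return ["http://example.com/a1", "http://example.com/shared", "http://example.com/a3"]

    def search_engine_b(query):
        # Placeholder for a second engine.
        return ["http://example.com/b1", "http://example.com/shared"]

    def metasearch(query, engines):
        """Interleave ranked result lists and drop duplicate URLs."""
        merged, seen = [], set()
        result_lists = [engine(query) for engine in engines]
        for tier in zip_longest(*result_lists):   # round-robin by rank position
            for url in tier:
                if url is not None and url not in seen:
                    seen.add(url)
                    merged.append(url)
        return merged

    print(metasearch("faceted metadata", [search_engine_a, search_engine_b]))

Even this toy version shows why result merging and duplicate detection, rather
than the fan-out itself, are where most of the engineering effort goes.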
NLP and Categorization
The second session focused on NLP and categorization techniques. Susan
Feldman, research vice president at IDC, reviewed the reasons advanced
technologies are needed in search engine development. Because of the vagaries
of the English language and different ways to say the same thing, she said
that searching is essentially a language game. Words can have several meanings
or synonyms, asking questions is an art, and search engines are confusing.
Initial search engines were crude, but emerging technologies—taxonomies,
categorization, and linguistic tools—are being increasingly used in their
construction. NLP techniques are also important because they parse the
user's query and decide which terms to present to the search engine. Disambiguation
(deciding which of several word meanings is the desired one) is important
but difficult to do algorithmically.
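One common textbook approach to disambiguation, offered here only as an
illustration and not as what any of the engines discussed at the meeting
actually do, is to score each candidate sense by its overlap with the words
surrounding the ambiguous term. The sense inventory in this sketch is invented.

    # Toy word-sense disambiguation by context overlap (a simplified Lesk-style scorer).
    # The sense inventory below is invented for illustration.

    SENSES = {
        "java": {
            "programming language": {"code", "class", "compiler", "programming", "software"},
            "indonesian island":    {"island", "indonesia", "volcano", "travel", "coffee"},
        }
    }

    def disambiguate(word, context_words):
        """Pick the sense whose signature shares the most words with the context."""
        context = {w.lower() for w in context_words}
        best_sense, best_score = None, -1
        for sense, signature in SENSES.get(word, {}).items():
            score = len(signature & context)
            if score > best_score:
                best_sense, best_score = sense, score
        return best_sense

    print(disambiguate("java", "compiler error in my java class".split()))
    # -> 'programming language'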
Feldman said that search results can be improved by using categorization,
text structure, and heuristics to extract relevant concepts from the query.
Many engines depend on lists of rules to decide which terms to use. Often,
the user or a subject expert must construct these rules. Machine-learning
technologies that automatically develop rule bases are beginning to appear.
In the near future, we can expect to see better search tools because
linguistic technologies will be embedded in systems, searches will be executed
across multiple knowledge bases, and text mining and pattern detection
will improve queries, Feldman said. In the longer term, searching will
be just another feature of a portal or gateway, and additional machine
learning and agent technologies will be incorporated.
Marti Hearst, a professor at SIMS, gave an excellent presentation that
was well-illustrated by examples of "faceted metadata." She pointed out
that although today's Web search engines do well at getting users to a
starting point in their research, they tend to retrieve overwhelming numbers
of poorly organized results. Many Web sites contain links to useful information,
which can be utilized when the direction to the desired information is
clear. However, it's often difficult to detect which link should be followed.
Full-text searching does not work well because it usually produces a disorganized
mass of results.
Hearst said that the solution to many search problems is to seamlessly
integrate the searching process into the overall information architecture
and use hierarchical metadata to allow flexible navigation, organize the
results, and provide the capability to expand and refine the search. The
challenge of this approach is how to present large amounts of information
to users without overwhelming or confusing them.
Hearst's research focuses on the use of metadata to help folks navigate
through the search process. Metadata is more flexible than the simple retrieval
of a list of links, but it's less complex than doing a full search. It
helps users see where they have been and where they should go. Faceted
metadata is the use of categories to organize the metadata. For example,
if one were looking for information about a topic that occurred in a specific
geographic region on a certain date, the facets would be the topic, region,
and date. In her research, Hearst is studying the following questions:
• How many facets are allowable?
• Should they be mixed and matched?
• How much information can the
user assimilate?
• How should the hierarchies be
displayed (tabbed, progressively
revealed, etc.)?
• How should free-text results be
integrated into the search?
Some systems have attempted to utilize faceted metadata, but they don't
do it well. Hearst showed an example of Yahoo!'s awkward use of facets.
One must often drill down through many levels before arriving at the desired
data. For instance, to find information about UC Berkeley, one must
use the following navigation path:
U.S. States > California > Cities > Berkeley > Education > College and
University > Public > UC Berkeley
This example illustrates a major problem with metadata systems: They
are pre-defined and not tailored to tasks as they evolve. In contrast,
the Epicurious Web site (http://www.epicurious.com)
uses faceted metadata effectively. Epicurious creates combinations of metadata
dynamically to display the same information in different ways. It shows
the user how many sites a search will retrieve, makes it easy to back up
in the search, and supports several types of queries. In a usability study,
people found those features helpful and liked the style of the metadata
search.
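The Epicurious behavior Hearst praised, previewing how many items each facet
value would leave before the user commits to a narrowing step, is easy to
sketch. The recipe data and facet names below are invented for illustration
only; they are not Epicurious' actual schema.

    # Faceted-navigation sketch: count how many items each facet value would leave,
    # given the filters already chosen. Data and facet names are invented.

    from collections import Counter

    RECIPES = [
        {"course": "dessert", "cuisine": "french",  "season": "winter"},
        {"course": "dessert", "cuisine": "italian", "season": "summer"},
        {"course": "main",    "cuisine": "french",  "season": "winter"},
        {"course": "main",    "cuisine": "thai",    "season": "summer"},
    ]

    def facet_counts(items, chosen, facet):
        """Counts for each value of `facet` among items matching the filters in `chosen`."""
        matching = [it for it in items
                    if all(it.get(f) == v for f, v in chosen.items())]
        return Counter(it[facet] for it in matching if facet in it)

    # After the user picks course=dessert, preview how the cuisine facet narrows things:
    print(facet_counts(RECIPES, {"course": "dessert"}, "cuisine"))
    # -> Counter({'french': 1, 'italian': 1})

Because the counts are computed on the fly from whatever filters are already in
place, the same collection can be displayed in different ways, which is exactly
the dynamic behavior Hearst contrasted with Yahoo!'s fixed hierarchy.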
Image searching is another area in which faceted metadata can be utilized
advantageously. For example, architects often have large collections of
images that they use for reference and inspiration in designing new structures.
Often, their image collections are stored more or less randomly, making
retrieval difficult. Hearst and her team took a collection of about 40,000
images from the UC Berkeley Architecture Slide Library and produced
a search interface that included detailed metadata about them. In a usability
study of 19 architects, the system was received very positively, and the
subjects expressed a strong desire to continue using it. When asked to
choose between a matrix (faceted) approach and the tree structure used
by most of today's Web search engines, the participants overwhelmingly
preferred the matrix approach. They found it easier to develop their search
strategy, keep track of where they were, and understand hierarchical relationships.
Hearst's research shows that faceted metadata can support many types
of searching. It allows the user to switch from one search tactic to another
during the search, and it makes expanding and refining searches easy. Although
there are still questions to be answered, Hearst's work is fascinating
and gives us a glimpse of how the search process may be improved in the
future.
Laurent Proulx, chief technology officer at Nstein Technologies, Inc.,
discussed the current state of Web search engines. At the enterprise level,
most of the content is unstructured and resides in many different repositories
(primarily as e-mail messages or internal documents on PCs). Most users
regard searching as a box in which to type words, rarely supply more than
two or three words in a search, and look at only the first screen of results.
Proulx said that today's search engines are designed simply to match
words and cannot interpret meaning. As a result, they deliver high recall
but very low precision. They don't interact with users in any meaningful
way, so the results frequently fail to meet expectations. Their interfaces
generally use a hierarchical tree structure, but to enhance the search
process a new interface is needed. Concepts could be determined and terms
could be disambiguated by employing linguistic-based extraction techniques.
The information could be organized and retrieved by utilizing taxonomies,
which help define an information framework for users. Categorization—determining
the "aboutness" of an item—can define equivalent terms and can be done
by humans, computers, or a combination of both. Computer-aided categorization
provides editors with suggested terms and helps them define categorization
rules. Fully automated systems that use metadata and authority files are
currently being developed, and many of them show promise.
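Computer-aided categorization of the kind Proulx described can be sketched very
simply: rules map trigger terms to categories, and the system proposes whichever
categories have triggers present in a document, leaving an editor to accept or
reject them. The rules below are invented examples, not a production taxonomy.

    # Computer-aided categorization sketch: rule-based category suggestions for an editor.
    # The rules here are invented examples, not any vendor's taxonomy.

    import re

    RULES = {
        "Mergers & Acquisitions": [r"\bmerger\b", r"\bacquisition\b", r"\btakeover\b"],
        "Clinical Trials":        [r"\bphase (i{1,3}|[123])\b", r"\bplacebo\b", r"\btrial\b"],
    }

    def suggest_categories(text):
        """Return categories whose trigger patterns occur in the text."""
        text = text.lower()
        suggestions = []
        for category, patterns in RULES.items():
            if any(re.search(p, text) for p in patterns):
                suggestions.append(category)
        return suggestions

    doc = "The phase II trial compared the drug with a placebo after the takeover."
    print(suggest_categories(doc))
    # -> ['Mergers & Acquisitions', 'Clinical Trials']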
Content Filtering
The first day of the meeting concluded with a panel of speakers who
discussed the issues surrounding the filtering of search results. Chaired
by David Evans, CEO of Clairvoyance Corp. (formerly CLARITECH), it included
speakers from RuleSpace, a corporation that develops filtering software,
and FirstGov, the U.S. government's portal to over 51 million U.S. and
state Web sites.
Evans introduced the panel by identifying areas in which content may
cause concerns for users: security (viruses), copyright, offensive material
(pornography), spam, publishers' guidelines and policies, competitive intelligence,
corporate policy, and official information (national security). The Children's
Internet Protection Act (CIPA), which was enacted in 2000 and mandates
the filtering of public Internet access terminals in libraries, has raised
a storm of controversy. The government claims that filtering software has
greatly improved and that sites blocked in error can now be easily unblocked.
According to an ALA lawyer, studies have shown that because of language
vagaries and ambiguities, up to 15 percent of Web sites are blocked incorrectly.
Evans went on to show various methods of algorithmically identifying
content. From a sample of text, words are extracted, sentences are parsed
and tagged, and names and other proper terms are identified. The results
are anywhere between 40- and 85-percent accurate. Advanced technologies
such as NLP and filtering based on categorization will improve these results.
Many systems require the development of rule bases and "training sets"
of documents. Evans concluded that filtering is a challenging problem—much
more than it appears on the surface—and that caution is needed when making
claims of accuracy.
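To make the "training sets" point concrete, the sketch below trains a tiny
Naive Bayes-style word model on a handful of labeled example pages and scores
new text against it. It is a toy, not RuleSpace's system or any vendor's, and
real filters combine many more signals than word frequencies.

    # Toy text classifier trained from a labeled "training set" of documents.
    # A Naive Bayes word model, purely illustrative; not any vendor's filtering engine.

    import math
    from collections import Counter, defaultdict

    def tokenize(text):
        return [w for w in text.lower().split() if w.isalpha()]

    def train(labeled_docs):
        """labeled_docs: list of (text, label). Returns per-label word and document counts."""
        word_counts = defaultdict(Counter)
        doc_counts = Counter()
        for text, label in labeled_docs:
            doc_counts[label] += 1
            word_counts[label].update(tokenize(text))
        return word_counts, doc_counts

    def classify(text, word_counts, doc_counts):
        total_docs = sum(doc_counts.values())
        vocab = {w for counts in word_counts.values() for w in counts}
        best_label, best_score = None, float("-inf")
        for label in doc_counts:
            score = math.log(doc_counts[label] / total_docs)
            total_words = sum(word_counts[label].values())
            for word in tokenize(text):
                # Laplace smoothing so unseen words do not zero out a label's score.
                score += math.log((word_counts[label][word] + 1)
                                  / (total_words + len(vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    training = [
        ("win free prizes adult content click here", "block"),
        ("explicit adult material subscribe now", "block"),
        ("library homework help for students", "allow"),
        ("university research library catalog", "allow"),
    ]
    word_counts, doc_counts = train(training)
    print(classify("free adult content here", word_counts, doc_counts))       # -> 'block'
    print(classify("homework help at the library", word_counts, doc_counts))  # -> 'allow'

The tiny training set also hints at why Evans urged caution about accuracy
claims: the classifier is only as good as the examples and rules behind it.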
Daniel Lulich, chief technology officer at RuleSpace, presented a fascinating
view of the behind-the-scenes technical issues involved in content filtering
and how they must be balanced against users' requirements. RuleSpace's
filtering software is based on machine learning and has won many awards
for technology excellence. Several technologies that have been applied
for filtering cannot handle the wide range of human language capabilities
because their categories are too broad. They are also unable to deal with
images and other nontextual data.
Lulich said that the nature of the Web does not lend itself to content
policing because sites change so rapidly, content owners can be difficult
to find, many pornographers are experts at defeating the systems, and filtering
may inhibit people from finding useful, legitimate information. Many filtering
vendors use control lists that routinely overblock good sites, and they
are insensitive to privacy issues. Because of proprietary reasons, they
refuse to release their control lists or details on how their systems work.
RuleSpace averages approximately 1 billion hits per day on servers that
have its technology installed. The service is opt-in; users can turn it
off at any time. However, since the company began deploying its technology,
the number of parents who opt for content filtering has doubled. On the
average, RuleSpace receives 120 requests a day to unblock sites. Twenty
to 30 of the requests are for sites that were blocked in error; the rest
violate customer policies. Every day, without fail, someone asks RuleSpace
to unblock Playboy.com—and sometimes for very inventive reasons.
According to Lulich, filtering policies have evolved in the last 2 years.
In the past, most ISPs did not filter content and they were proud of it.
Now they tend to offer filtering as an added-value service. Users expect
100-percent accuracy in the technology when it comes to pornography, but
they're not so concerned about blocking alcohol- and tobacco-related sites.
Filtering is difficult because some sites are really gateways that generate
dynamic content. Therefore, rules must be developed and sites must be rated
dynamically. RuleSpace sweeps the entire Web every 60 days and rates over
38 million sites, 8 percent of which are filtered. Of those 3.2 million
sites, 81 percent are pornographic.
New Technologies
The second day of the meeting began with the always-popular "New Technologies"
session. Steve Arnold, a frequent speaker on this topic at information
industry conferences, led off. When the Web originated, everything was
simple. There were only a few sites and only a few indexers were needed.
Now there are over a billion Web sites, nothing is simple anymore, and
flexibility is needed, he said. Cheap processors, smart agents, and rule
bases ("scripts") that are fine-tuned by humans are common threads in successful
search engines.
Arnold identified "computational search" and "mereology" as promising
technologies that will advance search and transform it. Computational search
enables the building of nimble and dynamic applications. Mereology, based
on research dating to 1918, uses preliminary answers to queries to discover
information relationships and then generates well-formed abstracts that
may contain sufficient information to answer the query.
According to Arnold, three companies to watch are Nutech Solutions,
Inc. (http://www.nutech.com),
which is using mereology technology to develop a search engine; Pertimm
(http://www.pertimm.com), which
applies semantic iteration to discover new phrases and automatically update
indexes to the content; and Applied Semantics (http://www.appliedsemantics.com),
which relies on dynamic categorization and content summarization to create
metadata and transform unstructured data into categorized and organized
information. Pertimm is being used by the Questel search service. The new
Oingo search engine is utilizing Applied Semantics' solutions. Technologies
such as these have a strong potential to advance search to new heights.
Chahab Nastar, co-founder and CEO of LTU Technologies (http://www.ltutech.com),
described his company's image search technology. He pointed out that several
of today's search engines do a good job of retrieving images, but not by
searching the images themselves. Instead, they rely on textual descriptions
that accompany the image, such as the title or metadata tags. LTU's technology
analyzes the pixels of the images, which then permits construction of their
"DNA" representation.
Intranet Search Engines
Several presentations focused on search engines that are incorporated
into enterprise systems. There are striking differences between enterprise
searching and general Web searching. Andrew Littlefield, chief strategist
of enterprise solutions for Inktomi, described some of them. On the Internet,
almost all of the content is in HTML format, but on intranets, there's
much greater diversity. Figure 1, taken from studies performed by Inktomi,
shows that only 37 percent of the content on intranets is in HTML.
Connections for general Web searching are optimized for dial-up (28.8
and 56K), but most corporate intranets now enjoy higher-speed connections.
Search engine developers usually assume that the slower dial-up connections
require mainly textual interfaces. However, on corporate networks, a simple
two-line text summary of a site may not be the most effective way to present
information, so higher bandwidths can be leveraged to take advantage of
graphical interfaces. Google's Matt Cutts said that his service's search
tool for corporate intranets was introduced 2 years ago. He echoed many
of the points made by Littlefield, emphasizing that for the corporate world,
one must search many other types of files besides Web pages, such as e-mail,
catalogs, and Microsoft Office documents. It's important to keep the user
interface familiar to searchers. The lessons learned from searching the
Web can be applied to intranet search engines.
Question-Answering Systems
One often hears speakers at information industry conferences bemoaning
the fact that search engines only present users with a list of Web sites,
when what they really want are answers to their questions. A session titled
"The TREC Question-Answering Track" focused on question-answering systems
and featured two industry leaders who are actively working in this area.
The first was Donna Harman from NIST, who reported on the results from
the latest Text REtrieval Conference (TREC). (These "conferences" are really
competitions among search engine research groups using advanced technologies
and the search engines and retrieval systems they have built. Each year,
a standard database of nearly 1 million news articles is presented, and
competitors are given tasks to solve using that database. TREC began in
1992 and reports on high-quality, leading-edge technology. For more information,
see http://trec.nist.gov.)
Each year's TREC focuses on one or more themes, and the past three events
addressed question-and-answer systems. In the 2001 experiment, participants
were asked to do the following:
1) Retrieve a 50-byte snippet of text containing the answer to a list
of questions. The questions sought simple facts, and the answers were generally
named entities or short phrases. (A minimal sketch of this kind of extraction
appears after this list.)
2) Assemble a list of events that answered a question. The questions
were taken from logs of actual searches. TREC analysts found and verified
the correct answers.
3) Track objects through a series of questions and answers. A test set
of 500 questions was assembled, and answers were verified. The set included
49 questions with no known answer in the database.
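The factoid task can be caricatured in a few lines: find sentences that share
words with the question, look for an entity of the expected answer type (here,
simply a year), and return a snippet of at most 50 bytes around it. Real TREC
systems are far more sophisticated; the documents below are invented.

    # Caricature of the TREC factoid task: return a short snippet containing a likely
    # answer (here, a year) from sentences that share words with the question.

    import re

    DOCS = [
        "The American Petroleum Institute was founded in 1919 in New York.",
        "Petroleum refining expanded rapidly after the war.",
    ]

    def answer_when(question, docs, max_bytes=50):
        keywords = {w for w in re.findall(r"[a-z]+", question.lower()) if len(w) > 3}
        best = None
        for doc in docs:
            for sentence in re.split(r"(?<=[.!?])\s+", doc):
                words = set(re.findall(r"[a-z]+", sentence.lower()))
                year = re.search(r"\b(1[89]\d\d|20\d\d)\b", sentence)
                overlap = len(keywords & words)
                if year and (best is None or overlap > best[0]):
                    # Keep a snippet of at most max_bytes centered on the year.
                    start = max(0, year.start() - max_bytes // 2)
                    best = (overlap, sentence[start:start + max_bytes])
        return best[1] if best else None

    print(answer_when("When was the American Petroleum Institute founded?", DOCS))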
NIST assessors judged the correctness of the participants' answers.
To more closely approximate the real world, the evaluations took into account
that assessors' opinions differ. Many systems in the test used a lexicon.
Most did fairly well, with 40 to 70 percent of their results correct. Detecting
the "no answer" questions was very difficult for all the systems; only
five runs returned accuracies greater than 25 percent. This year's experiment
has provided a rich collection of issues and data for further research,
such as questions relating to definitions, cause and effect, narratives
as answers, and contextual answers.
Liz Liddy of Syracuse University, another leader in NLP and other information
retrieval technologies, followed Harman. Her presentation, "Why Settle
For a List When You Want an Answer?" reviewed users' information needs
and provided an example of how a search engine would use NLP to parse their
queries. Users have many types of information requirements, but today's
search engines return only lists of URLs. Question-answering is different
from document retrieval, which is based on matching queries with terms
from an inverted file. It requires very precise matching of entities and
relationships, and one type of answer does not fit all queries.
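The inverted-file model Liddy contrasts with question answering can itself be
sketched in a few lines: each term maps to the documents that contain it, and
a query is answered by intersecting those posting lists. The documents are
invented examples.

    # Minimal inverted-index sketch: term -> set of document IDs, queries answered
    # by intersecting posting lists. Documents are invented examples.

    from collections import defaultdict

    DOCS = {
        1: "natural language processing improves search",
        2: "search engines match query terms",
        3: "language models and query parsing",
    }

    def build_index(docs):
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for term in text.lower().split():
                index[term].add(doc_id)
        return index

    def search(index, query):
        """Return IDs of documents containing every query term (boolean AND)."""
        postings = [index.get(term, set()) for term in query.lower().split()]
        return set.intersection(*postings) if postings else set()

    index = build_index(DOCS)
    print(search(index, "language query"))   # -> {3}

Term matching of this kind retrieves documents, not answers, which is precisely
the gap question-answering systems aim to close.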
Liddy presented detailed examples of hypothetical questions and of how
a system could use NLP to achieve human-like understanding of the text
(both explicit and implicit meanings). She envisions two-stage information
access systems in the future in which the first stage would retrieve an
initial set of resources that have a high potential for answering the query.
A following stage would perform an in-depth analysis of that set, then
present the answer to the user. Her research group has developed an "L-2-L"
(language-to-logic) system for parsing and processing queries and arriving
at an answer. For simple fact/data queries, it's necessary to understand
what users ask about (the query dimensions), how they ask (query grammar),
and how queries can best be used to retrieve statistical answers and map
the query dimensions into metadata. Other questions, such as those asked
by college students, are far more complex because they tend to require
more "how" and "why" responses than simple facts.
Corporate Intranet Searching
The final session of the meeting focused on the search features that
are desirable in corporate intranets and included the learning experiences
of two pharmaceutical companies in deploying their systems. David Hawking
of Australia-based CSIRO led off with 10 "rules" for intranet developers
(see the meeting Web site for details).
Horst Baumgarten of Roche Diagnostics presented the first case study.
He said his company has found the following about intranet searching:
• Over half of the information resides in structured databases.
• The data are usefully characterized into directories, product catalogs,
etc.
• Extraneous or irrelevant information is largely eliminated because
only content that has been "registered" is allowed on the intranet. (Registration
is accepted by the content owners because it allows them to provide publicity
to reach the entire user base and employ common site tools such as navigation
bars, editorial systems, etc.)
• The database contains Web pages, Microsoft Office files, Adobe PDF
documents, and other types of content. Searches can be done in context,
and all content can be retrieved in a single search.
• Efficient intranet searches are possible when the content is up-to-date,
users have an understanding of searching, and their needs are known in
some detail so that the system can be designed intelligently.
The meeting concluded with a case study by Neil Margolis of Wyeth-Ayerst
Research. He said that searching is not the same as finding, and in the
future we should concentrate on finding. His observations largely mirror
the well-known fact that finding and using information are a major part
of many knowledge workers' jobs. Requirements for the ideal search system
include repositories for multiple types of information, concept-based searching,
search and retrieval from unstructured text, relevance ranking, security,
and ease of use.
Margolis said that it's difficult to impress most intranet users because
everyone has had some experience in searching the Web. People expect a
search engine to find useful information. It bothers them to see the huge
number of results they tend to receive, and it also bothers them when what
they consider to be the best hit does not appear at the top of the list.
If users don't find what they want quickly, they tend to give up.
Conclusion
For anyone interested in search engines, this annual meeting is a major
event on the conference calendar and should not be missed. The quality
of the presentations is very high, as is the content. The diverse interests
of the attendees make for an enjoyable mix of conversation and networking
opportunities. The next Search Engine Meeting will be held April 7-8,
2003, in Boston.
Donald T. Hawkins is editor in chief of Information Science Abstracts
and Fulltext Sources Online, both published by Information Today,
Inc. His e-mail address is dthawkins@infotoday.com.