CONFERENCE CIRCUIT
The Ninth Annual Search Engine Meeting
By Nancy Garman
In Europe for the first time
following a long run in Boston and one diversion to San Francisco, the Search
Engine Meeting, held April 19-20 in The Hague, Netherlands, had a distinctly
international flavor. The global mix of delegates reflected the event's European
location, though it came at the expense of the usual large American contingent.
Conference organizer Harry Collier said that only a few Americans overcame
the twin obstacles of distance and high exchange rates to attend the event.
As a consequence, total attendance was down to around 100. Next year's meeting
is scheduled for April 11 to 12 in Boston. If Collier can retain the international
delegates and bring back the Americans, the 10th annual meeting could rebound
successfully.
The small attendee count encouraged participation as researchers and developers
exchanged ideas, presented solutions, and discussed current research during
sessions and social events. The Search Engine Meeting is a unique conference
where the leading edges of search engine research and development, categorization,
indexing, natural language, and computer science converge. At this event, academics
and researchers talk the same language, and the theoretical foreshadows the
operational.
Opening Keynote
In her opening keynote "Quantity Versus Quality," Karen Spärck Jones
of Cambridge University asked, "Has 50 years of research about search resulted
in anything more than the ability to find tens of thousands of references about
Britney Spears?" She contrasted the holy grail of search researchthe
world of information at our fingertips assisted by intelligent systemsto
the reality of Google's 55 million searches in 2003. "What can information
and language-processing research do for Web search engines?" she asked.
Jones then reviewed her research into computer support for human indexing.
She said that current research efforts are attacking the challenges of the
hidden Web and digital libraries. She believes that if you exploit the quantity
correctly by using machine intelligence, you get quality.
Jones said that Web search engines were developed by computer scientists
independent of the research that she and others conducted from the 1950s to
the 1970s. However, over the past 10 years this meeting and the growth of the
Web have brought those worlds together. The result is a win-win for researchers,
developers, implementers, and end users as well as a partial answer to Jones'
call for better connections and more interaction between researchers and search
engine developers.
The reality of search and the intersection of intelligent search systems
with human intervention were aptly illustrated on the second day of the conference
in an exchange between Jones and Martin Belam from the U.K.'s BBCi Search. This
dynamic, plus the juxtaposition of intelligent indexing and auto-categorization
research with the reality of Web searching, was a major theme of this meeting. Despite
their historically deep involvement in controlled vocabularies and indexing,
librarians and information professionals remain peripheral to this realm of search
research.
Research Meets Reality
In his session "Human Intervention in the Search Process," Belam described
BBCi's Best Links, a program in which a team constantly reviews and adjusts
search queries so that results match customers' expectations. BBCi monitors
the top search terms, checks the results, and puts a directory on top of the
spidering to vary terms for context and adjust for misspellings, thus increasing
recall and precision. For instance, when the space shuttle Columbia was
in the news, the increase in the number of searches for "Colombia" did not
indicate a spike in interest about the South American country, but rather a
common misspelling. BBCi adjusted its directory during that period so that
searches on "Colombia" returned hits about the space shuttle disaster.
During the break following Belam's presentation, he and Jones discussed the
whys and why-nots of machine indexing and of building directories based on human
results monitoring. Jones offered assistance from the research community in
automating the Best Links project. Research intersected with the real world
as Jones scratched boxes and terms on the back of her conference notes, while
Belam allowed that some of BBCi's monitoring might be automated. It's this
level of personal networking that makes the Search Engine Meeting a special
place for search engine developers, practitioners, and researchers.
Research on Search
On the first morning, Liz Liddy from Syracuse University and Donna Harman
from TREC delivered the researchers' perspective on search. Liddy's current
research focuses on some of the more elusive aspects of text retrieval: finding
not just the topic of a text but also the opinions and attitudes behind it.
She showed some intriguing examples of affect-mining that can add value to
retrieval. One was CiteSeer (http://www.neci.nj.nec.com/homepages/lawrence/citeseer.html),
which groups together the context of citations to a given article. This allows
researchers to easily see what's being said and why the article was cited.
Harman reported on TREC's ongoing research projects sponsored by NIST, DARPA,
and ARDA.
Delivering the search engine developer's point of view, Prabhakar Raghavan
from Verity said that given the vast amount of unstructured data in today's
business organizations, there's an imperative to develop tools to create or
extract and then exploit structure. He discussed the challenges of XML querying
and suggested that research needs to advance toward XML, text retrieval, and
information integration.
Offering a different perspective, Endeca's Peter Bell said that faceted navigation
(or guided navigation, as his company calls it) is a multidimensional browse
capability that can be more efficient than taxonomies. Suggesting that there's
some implicit structure in most types of unstructured business documents, Bell
said that facets allow multiple sources of mixed content to coexist. He claimed
that Endeca's "search plus browse" approach results in new insights into less-structured
heterogeneous content.
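Bell's presentation was a product overview rather than a technical walkthrough, but the mechanism behind faceted (guided) navigation can be sketched simply: filter the collection by the intersection of the facet values the user has selected, then recount the values that remain so the interface can offer the next refinement. The documents and facet names in this Python sketch are invented; it is not Endeca's implementation.

    from collections import Counter

    # Illustrative documents carrying facet metadata.
    DOCS = [
        {"title": "Q3 sales report", "region": "EMEA", "type": "report", "year": 2003},
        {"title": "Press release",   "region": "US",   "type": "press",  "year": 2004},
        {"title": "EMEA press kit",  "region": "EMEA", "type": "press",  "year": 2004},
    ]

    def guided_navigation(docs, selections):
        """Return docs matching every selected facet value, plus counts of the
        remaining values per facet (the 'guided' part of the browse)."""
        matches = [d for d in docs
                   if all(d.get(f) == v for f, v in selections.items())]
        counts = {}
        for d in matches:
            for facet, value in d.items():
                if facet == "title":
                    continue
                counts.setdefault(facet, Counter())[value] += 1
        return matches, counts

    docs, facets = guided_navigation(DOCS, {"region": "EMEA"})
    print([d["title"] for d in docs])   # ['Q3 sales report', 'EMEA press kit']
    print(facets["type"])               # Counter({'report': 1, 'press': 1})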
The Multilingual Web
Conference co-chair David Evans kicked off a panel session on CLIR (Cross-Language
Information Retrieval) by asking how big the problem is, whether commercial
CLIR can work on the Web, and whether we can use the Web itself to improve
CLIR. He cited an April 12 Newsweek article about search engine translations
in which the author suggested that the song "The Girl from Ipanema" might not
have been a hit if songwriter Norman Gimbel had to depend on the Google machine
translator to translate the song's original Portuguese lyrics.
The addition of 10 members to the European Union and the meeting's location
in The Hague made multilingual search and retrieval an issue of more than just
academic interest. Clearly, there's work to be done. Panelists Evans, Gregory
Grefenstette from CEA, Joop van Gent and Piek Vossen from Irion, and Wessel
Kraaij from TNO addressed various aspects of this challenge.
Grefenstette said that English speakers are now a minority on the Web (35.8
percent). He also discussed the results of his study to find out if the Web
is still dominated by English-language content. This study used predictors
to determine the frequency of English, Finnish, French, and German usage and
found that English remains the predominant language. However, Grefenstette
predicted that as broadband access grows, the amount of Web text in each
language will begin to mirror the size of that language's online population.
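Grefenstette described his predictors only in broad terms, but one common way to estimate language shares is to count language-specific function words in a sample of Web text and compare their relative frequencies. The marker lists and sample below are invented for illustration and are not his actual methodology or data.

    # Rough sketch: estimate language shares from counts of characteristic
    # function words. Marker lists and the sample text are illustrative only.
    MARKERS = {
        "English": {"the", "and", "with"},
        "French":  {"le", "la", "avec"},
        "German":  {"der", "und", "mit"},
        "Finnish": {"ja", "että", "mutta"},
    }

    def language_shares(texts):
        hits = {lang: 0 for lang in MARKERS}
        for text in texts:
            words = text.lower().split()
            for lang, markers in MARKERS.items():
                hits[lang] += sum(1 for w in words if w in markers)
        total = sum(hits.values()) or 1
        return {lang: count / total for lang, count in hits.items()}

    sample = ["the cat and the dog", "der Hund und die Katze", "le chat avec la souris"]
    print(language_shares(sample))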
Van Gent discussed the problems of selling CLIR products. He said that for
most organizations, information retrieval, not language, is the first challenge.
In addition, European governmental restrictions limit multilingual Web initiatives.
Vossen delved into the nitty-gritty of what it takes to develop a cross-lingual
retrieval system and CLIR semantics on the Web. He also discussed which strategies
might be applicable in different circumstances.
Van Gent then covered Irion's commercial answers for CLIR, which work best
in structured environments. They involve training an automatic classification
system on a multilingual data set and encouraging users to contribute more words
or phrases through a dialogue model layered on top of the classification system.
Kraaij attacked the language issue from a different angle, discussing how to mine
the Web for multilingual information: both finding translations and dealing with
multiple candidate translations. He concluded that transitive translation is a viable
approach to CLIR.
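Kraaij presented results rather than code, but the idea of transitive translation is easy to sketch: when no direct bilingual resource exists between the source and target languages, route each query term through a pivot language. The toy dictionaries below (Dutch to English to French) are invented for illustration.

    # Toy bilingual dictionaries; entries are illustrative only.
    NL_EN = {"huis": "house", "zoeken": "search"}
    EN_FR = {"house": "maison", "search": "recherche"}

    def transitive_translate(term, first_hop, second_hop):
        """Translate a query term via a pivot language; returns None if
        either hop is missing from the dictionaries."""
        pivot = first_hop.get(term)
        return second_hop.get(pivot) if pivot else None

    print(transitive_translate("zoeken", NL_EN, EN_FR))  # 'recherche'

In practice each hop yields several candidates and the ambiguity multiplies, which connects to the multiple-translations problem Kraaij described.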
Search in the Enterprise
Late on the first afternoon, Sue Feldman from IDC foreshadowed the next morning's
sessions in her presentation on enterprise search. As she described the information
infrastructure she's seeing within organizations, she forecast the emergence
of a new infrastructure or middleware layer that contains modules to acquire,
manage, analyze, and create access to all kinds of information. She discussed
the factors that are driving this trend and the next generation of search.
On the horizon, Feldman sees linguistic capabilities embedded in other applications;
rules and inference engines; interactive visualizations of information spaces,
results, and relationships; and unified access to data plus content.
Search Gets Real
Steve Arnold began the second day of the conference with his presentation "Social
Software and New Search," an outline of the search engine landscape. He talked
about the "big four" in site search: Verity, Hummingbird, Convera, and FAST.
Arnold said that no one size fits all, and he discussed newcomers such as Arikus,
Delphes, and Odyssey ISYS. He believes that Lextek and dtSearch are companies
worth watching.
As Arnold discussed the development platforms, he contrasted the old
Sun/Microsoft approach with the Google-influenced perspective: TCP/IP for everything.
Picking up on the social networking theme, Arnold said that Google's new
mail function moved the company squarely into the realm of social interaction,
an area in which Yahoo! has already become a key player.
Later on the second morning, Kasper Vad, IT manager at InfoMedia Huset in
Denmark, and Martin Belam addressed search engines on a practical level. This
was much to the relief of the corporate IT managers in the audience who found
the previous day's sessions interesting but academic.
Vad walked the audience through his selection and deployment of an in-house
search platform. He talked about how he managed the process, dealt with expectations,
and automatically converted 99.5 percent of his existing data. He reported
that auto-categorization allowed him to process eight times as many documents, a
huge increase in productivity, although at an equally huge cost. Vad said that
he had not involved librarians in the search engine selection process. This
affirmed Karen Spärck Jones' earlier observation that IR research and
search engine development are the domains of computer scientists, not librarians,
who carry the perceived baggage of traditional constraints.
Image Search
Ethan Munson (University of Wisconsin), Alan Smeaton (Dublin City University),
and Sebastian Gilles (LTU Technologies) wrapped up the last afternoon with
discussions of image searching, access to video archives, and visual content analysis
technology, putting a visual face on search-and-retrieval development.
In his studies, Munson confirmed that "image file name," "page title," and "page
file name" are the most important fields for accurately retrieving images on
the Web. The conference's underlying collaborative-research message was evoked
when an audience member asked Munson if his data set would be retained and
made available to researchers. Several attendees observed that the information
would be valuable to others working in the field.
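Munson's ranking of useful metadata fields suggests a simple weighted-field scoring scheme for image retrieval. The weights and sample records below are invented for illustration; they are not figures from his study.

    # Score candidate images by matching query terms against the metadata
    # fields Munson found most useful. Weights and sample data are invented.
    FIELD_WEIGHTS = {"image_file_name": 3.0, "page_title": 2.0, "page_file_name": 1.5}

    def score_image(query_terms, image):
        score = 0.0
        for field, weight in FIELD_WEIGHTS.items():
            text = image.get(field, "").lower()
            score += weight * sum(1 for term in query_terms if term in text)
        return score

    images = [
        {"image_file_name": "eiffel_tower.jpg", "page_title": "Paris travel guide",
         "page_file_name": "paris.html"},
        {"image_file_name": "img0042.jpg", "page_title": "Holiday photos",
         "page_file_name": "album.html"},
    ]
    query = ["eiffel", "paris"]
    ranked = sorted(images, key=lambda img: score_image(query, img), reverse=True)
    print(ranked[0]["image_file_name"])  # 'eiffel_tower.jpg'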
Smeaton's experiences proved streaming media's viability as he described
how he uses the Físchlár News Stories site to keep up with newscasts
while traveling. He showed how search could identify and personalize the retrieval
of video images.
Partnership Opportunities
Several high-level product presentations were interspersed with the conference
research papers. These were essentially technical reports from vendors that are
conducting important research and development in search, retrieval, and categorization.
One standout was Nigel Hamilton, CEO and founder of Turbo10.com, who put
his new metasearch engine through its paces for the audience and then pointed
out several unsolved issues. He invited potential partners to offer their development
skills and help resolve these problems. Ask Jeeves, Convera, and Basis Technology
also delivered important briefings on their product-development efforts.
Back to Boston in 2005
Google's IPO doesn't mean that search engines have matured and search research
has slowed to a standstill. It's quite the opposite: The scale of the Web offers
new challenges and research topics. Ongoing work on text analytics, retrieval
issues, redundancy, reputation engines, faceted navigation, machine translation,
and much more suggests that there's no shortage of topics for next year's Search
Engine Meeting.
Nancy
Garman is Information Today, Inc.'s director of conference
program planning. Her e-mail address is ngarman@infotoday.com.