Robot
software can help us in many ways. It can also be stubbornly literal —
and even foolish at times. Think of Commander Data on Star Trek: The
Next Generation, blathering unnecessary minutiae endlessly when asked
a simple historical or scientific question.
Whether it's your
favorite search engine on the public Web or an intranet search tool that
your organization supports, it's almost a certainty that you rely on a
robot to find content every day. In either case, the search engine, or spider,
relies on robotic elements to help you find
content. A robotic crawler traverses Web pages, looking for content and
for links to more content; a robotic indexer compiles a database of words
and URLs; a robotic search component fields user queries, trying to match
a user's search with the most relevant pages.
A human search
administrator starts the process, telling the spider — our robotic helper
— to start at a particular spot, and where to go — or where not to go —
in its crawling and indexing. From then on, the robotic components of your
favorite search tool proceed with very little human guidance. By and large,
end users see hit lists whose content has been determined mainly by algorithmic
choices made by the software engineers back at the factory.
A robotic spider
can be incredibly powerful and accurate — or it can deliver a hit list
full of irrelevant minutiae. Is it possible to offer the very best hits,
specifically for those searches performed the most frequently?
Most organizations
that take the time to do the research learn that a relatively small number
of searches are repeated quite frequently by a large number of users. For
those most common searches, maybe we can deliver better results by keeping
a human editor in the equation. A human editor can augment the results
the robotic spider churns out — by finding the very best pages that would
most benefit those users and inserting those items in the hit list before
the robot offers its best guesses.
A Modest
Proposal:
The Accidental
Thesaurus
-
For intranet,
online product catalog, newspaper, and campus sites.
-
Build
a thesaurus based on what people look for.
-
Don't
even try to be comprehensive.
-
Use your
search logs to find what people look for - and how they actually search.
-
Fuzzy
matching of user searches against the thesaurus, à la Ask Jeeves (see the sketch below).
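As a quick illustration of the fuzzy-matching idea, here is a minimal sketch in Python. The thesaurus entries and URLs are invented, and a real implementation would sit behind the site's search form rather than in a standalone script.

```python
# A minimal sketch of fuzzy-matching a user's search against a small,
# hand-built thesaurus. The sample terms and URLs below are invented;
# difflib is Python's standard-library approximate string matcher.
import difflib

THESAURUS = {
    "long distance rates": "http://www.example.com/rates",
    "calling card":        "http://www.example.com/calling-card",
    "human resources":     "http://www.example.com/hr",
}

def fuzzy_lookup(query, cutoff=0.6):
    """Return thesaurus terms that approximately match the user's search."""
    q = query.strip().lower()
    return difflib.get_close_matches(q, THESAURUS.keys(), n=3, cutoff=cutoff)

# Example: fuzzy_lookup("long distnace rate") returns ["long distance rates"],
# so even a misspelled search can land on the editor's chosen page.
```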
Case in point.
Soon after AT&T established consumer sales on its Web site, prospective
customers began searching for "long distance" and receiving large hit lists.
Most of the listed results turned out to be press releases, issued over
a period of months, each announcing yet another new wrinkle on a long-distance
service plan.
But sometimes customers
don't want to see a long hit list. In this case, most potential or actual
customers wanted a very short hit list, perhaps one including only
items like these:
-
Current long-distance
rates
-
How to sign up for
a long distance discount plan
-
How to get a long
distance calling card
-
How to use the card
when traveling
-
How to pay an AT&T
Long Distance bill
Customers decidedly
did not want to see every PR flack's spin as to how the company's
new plan was just the ticket for gaining market share from Sprint and MCI.
Customers came to AT&T's site wanting to do business with the company.
By letting a search engine report results from a corpus entirely useless
to the customer, the site failed its mission. (For the relatively minuscule
number of investors or financial reporters visiting att.com, the company
could have offered a localized press release index buried in the appropriate
"About AT&T" container.)
So spiders
and the search engines behind them can indeed be foolish at times. Eventually,
AT&T came up with a solution: "AT&T Keywords" (see the AT&T Search Engine
at right). Today, the search box at att.com says: Enter Search Term or
AT&T Keyword.
How do these AT&T
Keywords work? Their FAQ explains:
An AT&T Keyword
is a word or short phrase that you can type into the "Search" box and that
will take you directly to an AT&T Web page.
How do AT&T
Keywords work?
Just like Web site
search terms, you type AT&T Keywords into the "Search" box located
on every AT&T Web page. If an exact match to your keyword is found,
your browser will automatically display the associated Web page; otherwise,
you will see a list of related AT&T pages. If the word you typed matches
two or more AT&T Keywords, the results will return a list of matching
keywords (with short descriptions of each).
Here, AT&T
has interposed an editor's judgment into the equation. A human being analyzes
what customers most likely seek when they come to the AT&T site and
manually inserts key phrases and matching URLs into a database. When a
customer enters a search, the system probes both the manually constructed
database and the robotically built index. The majority of users
see what they want at the very top of the hit list.
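To make the mechanism concrete, here is a minimal sketch of that flow in Python. It is not AT&T's actual implementation; the keyword entries, URLs, and the spider_search placeholder are invented for illustration, following the behavior the FAQ describes.

```python
# A sketch (not AT&T's real code) of an editorially maintained keyword table
# sitting in front of a robotically built index.

KEYWORDS = {
    # keyword -> destination URL (both invented for illustration)
    "long distance": "http://www.example.com/long-distance",
    "calling card":  "http://www.example.com/calling-card",
}

def spider_search(query):
    """Placeholder for the robotically built index (the ordinary hit list)."""
    return ["...ordinary spider results..."]

def keyword_search(query):
    q = query.strip().lower()
    if q in KEYWORDS:
        # Exact match on one keyword: send the browser straight to the page.
        return {"redirect": KEYWORDS[q]}
    partial = {k: url for k, url in KEYWORDS.items() if q in k}
    if len(partial) >= 2:
        # The term matches two or more keywords: list the matching keywords.
        return {"keyword_choices": partial}
    # Otherwise fall back to the spider's list of related pages.
    return {"hits": spider_search(query)}
```

The essential point is simply that the hand-built table is consulted first and the robot remains the fallback.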
AT&T isn't
the only Web presence to adopt this editorially chosen keywords approach.
Consider ESPN. At their site you can type "Vitale" and go directly to the
Web site of their voluble basketball commentator. Or travel across the
pond to the BBC's site and do a search for Changing Rooms (the precursor
to the popular American TV show Trading Spaces). You'll see exactly
what you're looking for as a "Best Link" at the top of the hit list (see
the BBC homepage below).
Given the breadth
of BBC content, a robotic search engine might or might not locate the
correct page given the common English words "changing" and "rooms." The BBC wisely
has an editor intervene in the search. Why depend on the kindness of strange
robots? Why not make the choice editorially?
Perhaps the most
famous example of the editorially chosen keywords concept is "AOL Keywords"
as defined for America Online and its customers. AOL Keywords serve multiple
purposes. The keywords are a shorthand way for customers to get to popularly
sought content. Need to report someone who has violated AOL's Terms of
Service? Go to Keyword TOS. AOL Keywords provided a navigational "handle"
for millions of customers long before they understood the concept of the
URL. (Today, a single search box on the AOL service accommodates both AOL
Keywords and URLs.)
Of course, AOL
doesn't just choose its Keywords as a navigational aid or shortcut. People
who don't use AOL are mystified by the redundancy when an ad on television
says, "Come to sears.com. AOL Keyword: Sears." For AOL, that redundancy
is revenue; companies pay to be listed in AOL Keywords.
My own interest
in the "disconnect" between what users tend to type and what search engines
tend to find goes back to the days of Gopher and Veronica. In 1992, I helped
organize a national workshop on the hot tool of the day, the Internet Gopher.
Light bulbs went off when Nancy John of the University of Illinois at Chicago
provided some "remedial library school for computer people" and showed
how analysis of search logs could illuminate what content customers really
seek. Hmm, search logs as feedback into decisions on how to present information
... interesting concept.
The Accidental Thesaurus
Library science
has long understood ways to map multiple choices of terms into a single
concept. The classic example is a bibliographic database that treats Mark
Twain and Samuel Clemens as the same author. A thesaurus makes it possible.
If we were to hire
a professional thesaurus builder to solve a vocabulary problem, the consultant
would probably do a rigorous analysis of the terminology used in the literature
of the discipline — manufacturing, physics, management, whatever. We would
contract with our information scientist to build a comprehensive thesaurus
covering the language of the field.
But the examples
we've seen so far are considerably less rigorous than that. No doubt AT&T
examined its search logs to inform its choices of keywords to ensure that
the searches done by 99 percent of its customers yield pay dirt. AOL chooses
its internal Keywords for convenience — and its partner keywords for profit.
For practical, everyday solutions, we do not need the rigor of an academic
discipline examining the literature.
In a presentation
for the Access '98 conference in Calgary, I termed this user-driven, explicitly
non-comprehensive approach "the accidental thesaurus" (see A Modest Proposal
below).
In the remainder
of this article we'll explore other projects to build a smarter spider
via the accidental thesaurus approach: my own efforts at Michigan State
University and those of information scientists at Bristol-Myers Squibb.
The MSU Keywords Project
When AltaVista
burst onto the scene in 1995, it quickly caught my attention as a new high
water mark in Web search technology. Soon after its global Web search went
online, I was on the phone with Digital Equipment Corporation suggesting
that it market the product for intranet applications. In 1996, Michigan
State University beta tested the product and became the first institution
of higher education to license it as a campus-wide spider.
Since that time,
our msu.edu Web presence has grown dramatically, and this growth
in content has made it increasingly difficult to find popular
campus service points on the Web. The classic example: a simple search
for
human resources
This turns out
to be a wonderful challenge for a spider. Both terms in the phrase are
common English words. We have numerous personal pages on campus whose owners
want to become human resources professionals, and we have an academic program
in the area as well. But most users just want the HR department. The university
even has more than one unit under that name. In this context, an AltaVista
search became increasingly futile: The hit list was too polluted to be
useful, especially at the top.
Over time, as campus
search administrator, I began to receive more and more complaints from
users about the difficulty of finding commonly sought content. At the same
time, complaints rose from campus content providers reacting to what they
heard from irate customers.
We began analyzing
user search logs to see what content users sought the most. The analysis
confirmed that popular campus service points such as "human resources"
were among the most common. (Note that this was aggregated log analysis;
MSU has strict rules against monitoring the searches of any individual
user.)
Ironically, AltaVista
and its rivals had already developed technology that could solve our problem,
albeit to accommodate advertising. Before AT&T Keywords dawned
on AT&T's own site, AltaVista would deliver you an AT&T banner
ad and hit list link if you typed "long distance" as a search phrase. In
the absence of a similar built-in feature in AltaVista's intranet product,
I decided we needed to roll our own.
The Evolution of MSU Keywords
Working with a
student programmer, I began fleshing out a design for MSU Keywords. We
decided to use Active Server Pages to connect a Web user with a database
back-end. Initial prototype work was done with Microsoft Access; the production
system uses MS-SQL as the database (see above).
We began the effort
with some clear design goals. As the software became more functional and
the database grew, those goals evolved somewhat. Here are the basic specifications
of MSU Keywords:
-
An editor looks for
popular starting points, primarily by examining search logs, and enters
keywords (or phrases) and the URL for the best Web page satisfying that
search. (At times this may require considerable research.) An administrative
interface allows multiple authorized editors to contribute.
-
A user searching at
the university home page or at search.msu.edu enters a search word or phrase.
Both MSU Keywords and MSU AltaVista results appear — in that order. Each
item in the MSU Keywords hit list corresponds to one URL; no duplicate
URLs appear. If there are no matching MSU Keywords, we don't bother the
user with that fact; we just show the AltaVista results. (A sketch of this
lookup appears after this list.)
-
Aliases support variant
uses of terminology and even misspellings. If a user types "libary," the
hit list includes all appropriate Library pages.
-
We also created an
A-Z index of sites, driven from the same MSU Keywords database. Keywords
can be marked public, in which case they appear in the A-Z, or hidden —
we do not wish to show "libary" in the official directory!
-
Most sites listed
in MSU Keywords are official university content. However, in order to include
sites such as the student daily newspaper and student organizations, we
identify official sites with a logo in the hit list.
-
Because two server
boxes are involved in every search transaction, we designed the process
to be efficient and unobtrusive. The Keywords database is implemented in
SQL using stored procedures for fast performance, even on modest server
hardware. End-users don't know they are searching MSU Keywords.
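To show how these pieces fit together, here is a rough sketch of the lookup logic in Python. The production system is ASP code calling MS-SQL stored procedures, and the sample entries below are invented, so treat this only as an illustration of the flow.

```python
# Sketch of the MSU Keywords lookup: keywords and aliases map to titles and
# URLs, hidden aliases (such as misspellings) stay out of the A-Z index,
# duplicate URLs are suppressed, and keyword hits precede the spider's results.
# Sample data is invented; the real system lives in an MS-SQL database.

KEYWORDS = {
    # term -> (title, url, show_in_a_to_z)
    "library":         ("MSU Libraries", "http://www.example.edu/library/", True),
    "libary":          ("MSU Libraries", "http://www.example.edu/library/", False),  # hidden alias
    "human resources": ("Human Resources", "http://www.example.edu/hr/", True),
}

def altavista_search(query):
    """Placeholder for the campus spider's ordinary hit list."""
    return ["...spider results..."]

def campus_search(query):
    q = query.strip().lower()
    keyword_hits, seen_urls = [], set()
    for term, (title, url, _public) in KEYWORDS.items():
        if q == term or q in term:
            if url not in seen_urls:        # each URL appears at most once
                keyword_hits.append((title, url))
                seen_urls.add(url)
    # Keyword hits come first; if there are none, the user simply sees the
    # spider's results and never knows MSU Keywords was consulted.
    return keyword_hits + altavista_search(query)

def a_to_z_index():
    """Only public entries appear in the official A-Z directory."""
    return sorted({(title, url) for _t, (title, url, public) in KEYWORDS.items() if public})
```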
Development took a
number of months. The principal programmer, Mathew Shuster, jokes that
the resulting code and database, having evolved as we learned what functionality
was really required, don't represent the acme of ASP code or database
structure. He is too modest: The code is efficient; the database is compact;
the results are quite powerful (see diagram at top right).
As the service
became functional I began building a corpus of keywords and matching URLs
for the most commonly sought content. For the most part the effort was
literally driven by what users searched for the most in the existing AltaVista
service: I examined logs to find the most common searches, found the best
Web page to match, and entered the keyword (or phrase) and URL into the
database. Most entries came from user input, but not all: in some cases
I added Keywords for sites in the campus telephone book or items I encountered
in brochures or newspaper articles (see image on right).
We quietly launched
MSU Keywords early in 2002. We did nothing to highlight the new functionality
for end-users, though we did send a letter to all departments urging that
content providers submit keywords and corresponding URLs for entry into
the database.
Recent features
have augmented functionality for both users and content providers:
-
We added expiration
dates, so that an event-related Keyword automatically retires on a specific
date, and a "birth date," so that Keywords can be staged for automatic
deployment (see the sketch after this list).
-
We built a Search
Logger, also implemented as a database. All queries are logged by date
so that we can examine changing trends. (We provide the public with abbreviated
reports on the most popular searches; see search.msu.edu/info.)
-
Some search terms
defy mapping to a small list of Web pages. We built an MSU Pathfinder service
to provide annotated "pathfinders" (following the example of helpful library
practice). For examples, visit search.msu.edu and type "address," "history,"
or "adviser."
-
Inevitably, content
providers remove or rename pages. We added a daily link checker to detect
newly broken links. (Alas, content providers feel free to completely reorganize
their sites without informing the search administrator!)
-
We implemented a mechanism
to delegate ownership of a given URL to its content owner, so that new
keywords could be assigned at a moment's notice. For instance, the editors
of the university's daily news feed can assign MSU Keywords that correspond
to each day's articles.
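Below is a sketch of two of these additions, the birth/expiration dates and the date-stamped Search Logger. Field names and in-memory storage are invented for illustration; the production versions are tables and stored procedures in the MS-SQL database.

```python
# Sketch of keyword "birth" and expiration dates plus a date-stamped search
# logger. Data structures are invented; the real versions live in MS-SQL.
from collections import Counter
from datetime import date

def keyword_is_active(entry, today=None):
    """Serve an entry only between its birth date and its expiration date."""
    today = today or date.today()
    born    = entry.get("birth")   or date.min
    expires = entry.get("expires") or date.max
    return born <= today <= expires

search_log = []   # list of (date, normalized query)

def log_search(query, when=None):
    """Record every query by date so changing trends can be examined later."""
    search_log.append((when or date.today(), query.strip().lower()))

def top_searches(n=30):
    """The 'top of the charts' an editor reviews for new keyword candidates."""
    return Counter(q for _day, q in search_log).most_common(n)

# Example: an event keyword staged ahead of time and retired automatically.
# keyword_is_active({"birth": date(2002, 8, 1), "expires": date(2002, 9, 15)})
```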
Continuing to Learn from Logs
We continue to
learn from log analysis. In the age of Google, users now usually search
in a way more likely to yield good results: a key word or two with
no extraneous words. The top 30 searches over a recent period of several
weeks are illustrated in Table 1.
Some of these terms
are peculiar to the institution; "twig" appears not because we have a Department
of Forestry, but because that is the moniker of a Web gateway to an e-mail
system named "pilot." "Blackboard" is a commercial course management system.
Its consistent appearance at the top of the search charts shows its popularity
— and perhaps the difficulty in finding the service without doing a search.
Note that these
top 30 search phrases represent some 15 percent of the 200,000 searches
performed in this period. Thus, if MSU Keywords offers the best Web pages
for those 30 searches, a very small database can supply 15 percent of our
users with exactly the right content. When you consider that numerous near-match
phrases correspond to these same searches, a small editorial investment
in MSU Keywords can yield even more benefit to a larger percentage of users.
(Although we believe this data is representative, we do not capture all
searches in our logs; Google's index of msu.edu is a popular alternative
to our service.)
Conversely, the
law of diminishing returns applies as well. The question becomes: How far
down the list of unique search phrases should our human editor venture?
High Payoff, Diminishing Returns,
and the Laws of Pareto and Zipf
Just how steep
is the curve of diminishing returns? If we ask how many unique searches
it takes to account for percentiles of total searches performed, an interesting
pattern emerges (see Table 2 on page 74).
The data confirms
our supposition: There is high payoff for putting a small number of unique
phrases — perhaps several hundred out of over 50,000 — in our thesaurus,
after which returns diminish rapidly. We can match 50 percent of users'
searches by manually matching fewer than 1,000 unique search phrases —
a manageable amount of editorial effort. But if we want to achieve 90 percent
coverage, we must include over 30,000 phrases in our thesaurus. Obviously
that would entail a huge amount of effort, far beyond a reasonable allocation
of staff resources in an environment as volatile as the Web.
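The coverage figures above come from a straightforward cumulative count. Here is a sketch of the calculation in Python; at MSU the analysis actually runs against the Search Logger database, so the function below is only an illustration.

```python
# Sketch of the diminishing-returns calculation: how many unique search
# phrases, taken from most to least frequent, are needed to cover a given
# share of all searches performed?
from collections import Counter

def phrases_needed_for_coverage(queries, target):
    """Return the number of top phrases that account for `target` (0..1)
    of the total searches in `queries` (an iterable of raw query strings)."""
    counts = Counter(q.strip().lower() for q in queries)
    total = sum(counts.values())
    covered = 0
    for rank, (_phrase, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered >= target * total:
            return rank
    return len(counts)

# At MSU, covering 50 percent of searches takes fewer than 1,000 phrases,
# while reaching 90 percent takes more than 30,000.
```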
The diminishing
returns become even more obvious if we look at the distribution graphically
(see "Returns Distribution" graph on page 75).
As it happens,
scientific literature supports our observations. People often refer to
the "80-20 rule." We can thank the Italian economist Vilfredo Pareto who,
in 1906, observed that 20 percent of the populace controlled 80 percent
of the wealth. Over the years numerous corollaries to the rule have been
put forth: 20 percent of your workers provide 80 percent of your output;
20 percent of staff issue 80 percent of employee complaints, etc.
Others have proposed
analogous rules. In 1934, Samuel C. Bradford theorized that a small number
of scholarly journals contribute the vast bulk of scientific output in
any given discipline. Librarians (and journal publishers) still contemplate
Bradford's Law of Scattering. George Kingsley Zipf analyzed how frequently
each of 29,899 unique words appeared in James Joyce's Ulysses. Zipf
was independently wealthy and hired a team of workers to perform these
summations. (Fortunately at MSU we use MS-SQL and efficient stored procedures
to do our analysis.) Zipf found that a small number of reused words account
for a huge percentage of the total. His book, Human Behavior and the
Principle of Least Effort, published in 1949, analyzed the notion of
outputs disproportionate to inputs. Zipf's distribution has since been
applied to other areas, such as the populations of cities within a nation.
The distribution
curves of Pareto, Zipf, and Bradford are remarkably similar, even though
describing very different things. The distribution of search phrases at
MSU follows suit: Almost eerily, our curve mirrors theirs. The core concept
is incontrovertible: You can achieve a huge payoff with a small investment
of effort, after which you may start wasting your time.
If we look at the
tail end of the search logs — search phrases that appear only a handful
of times — we continue to observe perfectly reasonable searches, such as:
calculate GPA
student change
of address
guest policy
It is tempting
to try to put all "reasonable" queries into the MSU Keywords database,
either as unique entries or as aliases to existing ones. Carried to its
logical extreme, this would mean that we would analyze every search and
hand-enter the best matching URL into the database. But that way lies doom;
this would in effect substitute a human for a search engine. If a given
search is performed only a few times, far better to let the search engine
do its job as best it can, concentrating manual effort instead on the searches
performed hundreds or thousands of times. (And far better to use the best search engine
for relevancy, arguably Google.) Advice to the accidental thesaurus builder:
Avoid scanning raw search logs; only look at the top of the charts.
MSU Keywords: Outcomes
We have learned
a great deal from building MSU Keywords — and some of the outcomes were not intuitive.
Feedback from users and content providers indicates far greater happiness
with the search experience. Most users and content providers don't know
the role MSU Keywords plays in vastly improved search relevancy.
I continue to be
astonished by the extent to which users seek not obscure "leaf" pages but
major starting points by using the search engine. I shouldn't be surprised:
They've been trained by Google. In the first moments after planes hit the
World Trade Center, 6,000 people a minute typed "CNN" into the Google search
box. People count on search engines to find home pages that ought to be
otherwise highly visible.
Our search logs
are remarkably stable over time, with some understandable exceptions. When
the last school year ended, searches for course information declined, while
searches for online grades went up dramatically.
Thus far, our A-Z
browsing view (see page 75) sees little use. In part this is because the
MSU home page does not link to the A-Z. It's also likely that most users,
now more satisfied with the search experience, see no reason to go there.
Originally, we
planned for MSU Keywords to behave like AT&T, AOL, and ESPN Keywords:
If MSU Keywords found an exact match, the user would be immediately redirected
to the matching site. For instance, if a user sought the home page for
the Wharton Center for the Performing Arts, the keyword "Wharton," as an
exact match, would drive the user straight to the Wharton site, without
showing a hit list at all. Although this functionality is in place, it
hasn't yet been put in production as a default; MSU searchers always see
a hit list, even with an exact match.
As for content
providers, in practice it is somewhat difficult to convey what the new
service is all about and how to exploit it. Some content providers make
it hard for MSU Keywords (or any finding aid) to work well. For instance:
-
Some content providers
use techniques such as frames that array widely varying content under a
single, invariant URL. This makes it impossible to forge "deep links" that
match keywords for specific content. The utility of the service falls off
dramatically if the user lands on a starting point that isn't specific
and has to drill down to find relevant content.
-
Some content providers
propose keywords with a suggested Web page that includes no corresponding
content — as if "Kentucky Fried Chicken" were the key phrase and McDonald's
the suggested Web site. Apparently they believe that if their unit is involved
with a topic area, MSU Keywords should point to their home page, even if
that page doesn't mention the topic! Adding such keywords would be a disservice
to users.
-
In some cases, content
that users seek simply isn't online yet; we can't force service areas to
document themselves on the Web.
-
Although some content
providers respond to our entreaties for suggestions, sending us exhaustive
lists of keywords, most do not. In general MSU Keywords continues to be
driven by what users type into search boxes.
As of this writing
we are considering a transition from AltaVista to Google as the default
campus search engine. Nonetheless, we intend to continue to deliver MSU
Keywords before the Google hit list. Even the mighty Google, with its popularity-based
link analysis, doesn't always deliver the best hit first. Always bear this
in mind: Some robots are less foolish than others, but no robot is as
wise as a human editor. As a test, visit your favorite university on
the Web and search for "map." Most people doing that search want a campus
map. In many cases the spider will offer the library's map of Mesopotamia
high on the hit list, before the university's map of its own campus. Thanks
to MSU Keywords, we deliver the campus map as the first item on the hit
list.
My only regret
is that we didn't build this valuable tool years earlier.
Student programmer
Nathan Burnett developed our first search log analyzer. Journeyman student
programmer Mathew Shuster designed and developed the MSU Keywords database
and the MSU Search Logger. A graduate student, Qin Yu, also worked on the
project. Anne Hunt designed the graphics. Dennis Boone and Edward Glowacki
are past MSU AltaVista administrators.
The Accidental Thesaurus at
Bristol-Myers Squibb
Bristol-Myers Squibb
(BMS)
is a major pharmaceutical and healthcare products company with 40,000 employees
working at its New York headquarters and campuses around the world. Information
scientists at the company have developed an intranet application based
on the accidental thesaurus concept. Their efforts at Bristol-Myers Squibb
bear a remarkable resemblance to ours at Michigan State University; the
main concepts and goals were almost identical, although the user base and
implementation differ substantially.
Mike Rogers, associate
director for Information Architecture at the company, says he and his colleagues
knew they had a problem with helping company users find commonly sought
content on a huge and diverse intranet. After hearing a presentation by
Vivian Bliss about her approaches to the same problem for Microsoft's intranet,
Rogers saw an opportunity. After additional consultations with Microsoft,
Rogers set out to add an editorially chosen keywords component to his search
services.
Lydia Bauer, a
senior Information Scientist at Bristol-Myers Squibb, says that in analyzing
search logs at the company, they noticed that the top 100 search terms
were sought quite commonly. Searches related to the company's lines of
research were common, as were searches related to human resources. Like
many organizations, the company had an intranet that evolved without a
governing body, covering everything from scholarly texts on pharmaceuticals
to information about the annual picnic. Improving the search experience
would help tie things together. (Another approach, installing a company-wide
portal product, is also underway.)
Bristol-Myers Squibb
actually has two search engines, one based on Verity, and another a corporate
Web search engine. Rogers and Bauer sought to add a "Best Bets" service
(see above) that would integrate with the other two engines. They consulted
with in-house developers and decided upon Cold Fusion as the middleware
to connect to their keywords database.
It took several
months to build the database back-end and integrate it with the company
search tools. Today, when Bristol-Myers Squibb workers search the company
intranet, if a matching Best Bet exists, it displays prior to the beginning
of the hit lists from the other two engines — a "federated" search in Bauer's
words.
Bauer, who is currently
finishing work on an MLS degree, expresses surprise that she has found
very little discussion of this approach in library or information science
literature. "I've found lots of discussion of using search log analysis
in augmenting a thesaurus, but not much on building a Top 100 or Best Bets
service." Perhaps this is due to history: Only in the era of the Web do
we have robotic spiders whose logs can inform the creation of an accidental
thesaurus. Bauer also wonders how much editorial effort to invest in the
service. She originally intended to stop after the Top 100. However, in
surveying users, she finds impressive positive feedback — 87 percent found
the service good or great — with the only negative being, "We want more."
Currently, the BMS database has some 1,500 terms relating to some 450 sites
— interestingly, about the same scale as MSU's.
As with MSU Keywords,
Bauer bases her editorial decisions primarily on what users seek the most.
She reads the company's daily news reports and adds keywords for events,
new product information, etc., as reported each day. The majority of entries
in Best Bets are internal company resources — this is, after all, an intranet
service — but she has added links to Mapblast, Yahoo!, etc., on the assumption
that workers seek these resources for a reason.
Bauer sometimes
encounters situations, as I have, in which content that ought to
be online — we know this because the logs show that people look for it
— simply isn't on the intranet. In those cases she tries to contact the
relevant department and suggest it publish a new page.
MSU Keywords and
BMS' Best Bets overlap considerably, but there are differences. Each of
us has some features the other desires. Bauer gets a daily report of the
most-sought keywords not in Best Bets — very useful feedback for
the editor. Her service does not yet incorporate birth and expiration dates,
nor does it have an A-Z option. Overall, though, it's uncanny how similar
the efforts are.
One might expect
that, over time, as users learn the effectiveness of searching, they will
migrate towards full system searching. In studying logs and talking with
users, Bauer has learned that currently about the same number of people
browse as search.
Best Bets has been
a big win for users of the Bristol-Myers Squibb intranet, just as Bliss'
work has been at Microsoft. (We'll examine Microsoft's use of Best Bets
in a future article.) But Rogers and Bauer aren't resting on their laurels;
they are investigating adding a natural language facility to their search
services.
I Never
Metadata That Solved a Problem
Someone
from the Dublin Core school of thought may have this argument: "Wait a
minute — you could solve this problem if only you had good metadata." (The
"Dublin Core" is a set of core metadata elements defined at a famous meeting
in Dublin, Ohio, the home of OCLC.) The likely claim: If content providers
thoroughly describe their own content as they publish it, then your robotic
spider can harvest that data as it crawls, providing for superior searching.
I
have watched for almost a decade as the Web community has struggled with
metadata issues while in practice precious little is done in the real world.
There is not even a consistent way to determine the last time a given Web
page was modified. At my university, we struggle in a constant uphill battle
to convince content providers not to do things that hide their sites from
spiders altogether, much less to provide good metadata as they publish.
Until
authoring tools and Web publishing environments rigorously enforce metadata
standards, we won't have good metadata for robots to chew on. I offer the
East Lansing Maxim as a response to Dublin Core aficionados: "Everybody
talks about metadata, but no one does anything about it." That's not literally
true, of course; there are some promising applications of Dublin Core.
But as Mike Rogers observes, "It's very difficult to enforce metadata standards
across a large enterprise."
Even if Web publishers
provided good metadata with their content, that only solves the problem
of having good labels to describe the content. It doesn't help decide which
pages belong in the best bets service and which ones do not. Again, take
the example of maps. A university may have thousands of pages involving
maps online, from the holdings of the Map Library to databases in geography
to personal pages with Mapquest links. The vast majority of site visitors
want
a campus map. Content-provider-chosen metadata can't make essential editorial
decisions.
The Accidental Thesaurus and
Your Organization
Does the Accidental
Thesaurus make sense for your organization? I claim that any organization
that has a substantial Web presence with a significant user base should
incorporate an accidental thesaurus into its search service. Examine your
search logs. If a large number of users seek the same content areas using
the same terms, test to see if your robotic spider delivers the goods at
the top of the hit list. If it doesn't, it's time for an editor to step in.
Let's be bold here,
and propose a theorem:
For any given Web
presence, whether intranet or global, the top 500 unique search phrases
entered by users represent at least 40 percent of the total searches performed.
The examples of
Michigan State University and Bristol-Myers Squibb alone cannot prove this
theorem, but given our data as well as Pareto and Zipf, I'm confident enough
of its validity to challenge people to disprove it. Given a tool like MSU
Keywords or Bristol-Myers Squibb's Best Bets, you can vastly improve your
users' search experience with a modicum of editorial effort.
You need not build
your own software; some search products support editorially chosen "best
bets" natively. For instance, Inktomi's intranet search product
offers a "Quick Links" feature that fills the bill; go to www.nortelnetworks.com
and search for "vpn" to see how it works. Some knowledge management products
that cost six or seven figures let editors deliver even more sophisticated
result sets, essentially offering robot-assisted presentation of frequently
sought content, organized by category. Whether you use a million-dollar
tool or roll your own Best Bets service, by augmenting the spider you can
help your users find the most popular content more easily.
Ultimately, my
argument is simple: If you help a lot of people find content that they
frequently seek, you improve the overall efficiency of the organization.
In an IDC White Paper, "The High Cost of Not Finding Information," Susan
Feldman and Chris Sherman note that lack of information results in poor
decisions, duplicated effort, lost sales, and lost productivity. They estimate
that for an organization with 1,000 knowledge workers, the cost of information
not found exceeds $2.5 million per year. Even your management should understand
those reasons for investing effort in building an accidental thesaurus.