Only a few years
ago, the phrase "Web search" did not exist. Then the term began to move
rapidly into the awareness of information professionals, about as fast
as a Japanese bullet train. Today, much, though not all, of the work we
do revolves in one way or another around the Web.
With so much to
keep on top of, precious time becomes even more precious. A couple of years
ago I wrote an article trying to figure out a way to make the day 26 or
27 hours long. Unfortunately, that idea never reached the implementation
stage, though it remains an idea worth considering. Even within the narrow
bounds of 24/7/365, we must all still try to keep up to date about what
is happening with Web search engines. The fact that they seem to change
on a weekly, if not daily, basis is no excuse.
We as professionals
do not use every search engine or Web directory daily; nevertheless, we
have to know how each works and what data each does and does not
contain. I fully understand that this is easier said than done, but today
information access is a topic that everyone is aware of and talking about.
Pick up any newspaper. Turn on the television. Every day more and more articles
and reports discuss searching the Web. Many of these articles and reports
are written for and by non-information professionals. We have to stay ahead
of our clients and patrons if we hope to help them. Excite or AllTheWeb
may not be your search engines of choice, but I bet they are for someone
you know. Our colleagues, co-workers, and friends come to us as the "search
experts" and we must do our best to help. Our knowledge and understanding
in this area are great ways to make our profession look good and to make
our already valuable jobs even more valuable.
With this said,
the following reviews the latest goings-on in the search world and tries
to provide some suggestions and tools to make you more knowledgeable and
save you some time.
Price's
Priceless Tips
The Web search world
changes on what sometimes seems like an hourly basis. What follows are
a few selected tips and resources for some of the most well-known engines.
This is just the tip of the iceberg. Resources like Search Engine Showdown
and Search Engine Watch are essential for learning and keeping up with
how these tools work and change over time.
Ten Things
to Know About Google
1. The database
that Google licenses to Yahoo! [http://google.yahoo.com]
is smaller than the Google.com database and does not contain links to cached
versions of pages. This database is also used to supply "fall-through" content
(material not in Yahoo's own database), often listed as "Web page" content.
2. Google utilizes
the Open Directory Project database as its Web Directory [http://directory.google.com].
3. You can search
stop words by placing a + in front of the word (ex. "+To +Be +Or Not +To
+Be").
4. At the present
time the Google database is refreshed about once every month.
5. You can limit
your search to only .pdf files by using the syntax filetype:pdf.
6. Google is the
only major search engine to crawl Adobe Acrobat .pdf files.
7. If you are a
frequent Google searcher, save time by using the Google Toolbar [http://toolbar.google.com]
and Google Buttons [http://www.google.com/options/buttons.html].
8. A Boolean "OR"
is available with Google. For it to function, capitalize the OR.
9. Google only
crawls and makes searchable the first 110 k of a page. Long documents may
have
substantial content invisible to Google.
10. Entering a
U.S. street address into the query box will return a link to a map of that
location. Typing in a person or business name, city, and state
will also run the query against the Google phone directory. Several other
combinations will also query the phone directory service, including
typing in an area code and number to run a reverse search
[http://www.google.com/help/features.html#wp].
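To see how several of these syntax elements combine, here is a minimal sketch, in Python, of building Google query strings and turning them into search URLs. It assumes the familiar google.com/search?q= address; it is an illustration, not an official Google interface.

# A minimal sketch (not an official Google interface) showing how the syntax
# above can be combined into a single query string and turned into a URL.
from urllib.parse import urlencode

def google_query_url(terms):
    """Build a Google search URL from a raw query string."""
    # urlencode handles the spaces, plus signs, and quotation marks for us.
    return "http://www.google.com/search?" + urlencode({"q": terms})

examples = [
    '"+to +be +or not +to +be" shakespeare',              # tip 3: + forces stop words
    'annual report filetype:pdf "energy policy"',         # tip 5: limit to .pdf files
    'librarian OR "information professional" salaries',   # tip 8: OR must be capitalized
]

for query in examples:
    print(google_query_url(query))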
Ten Things to
Know About AllTheWeb
1. AllTheWeb licenses
its database to Lycos. The identical database is searched and makes up
some of the content on a Lycos results page.
2. Unlike Google
and AltaVista, this search engine does not have a limit on the amount of
content crawled on a Web page.
3. AllTheWeb indexes
every word. Words traditionally considered "stop words" are searchable.
4. AllTheWeb does
not permit the use of Boolean operators.
5. If plus and/or
minus signs are not used, AllTheWeb implies a plus sign in front of each
term or phrase, resulting in an implied AND between terms.
6. AllTheWeb is
now promising a complete refresh of its database every 9-12 days.
7. AllTheWeb permits
syntax to be used directly from the "basic" search page to limit a query.
See http://www.alltheweb.com/help/basic.html#special.
8. A query to the
AllTheWeb text database simultaneously runs the search in the AllTheWeb
Image, Video, MP3, and FTP databases. If it finds anything, these results
are linked on the right side of the results page.
9. AllTheWeb offers
a search engine dedicated to Mobile Web content [http://mobile.alltheweb.com].
10. Fast Search
and Transfer (FAST), the company behind AllTheWeb, has deployed its software
to power the Scirus science search engine from Elsevier.
Ten Things to
Know About AltaVista
1. AltaVista is
the only major search engine that allows a searcher to use the proximity
operator NEAR (in simple search; near in advanced search). Using this operator
finds terms within 10 words of each other in either direction.
2. AltaVista indexes
only the first 100 k of text on a page.
3. An asterisk
(*) can be used in a phrase to represent an entire word (e.g., "One small
step for man, one giant * for mankind").
4. AltaVista News
[http://news.altavista.com]
is "powered" by Moreover. This continuous feed of material can be searched
using AltaVista syntax.
5. The use of the
"sort by" box on the AltaVista Advanced interface allows you to give certain
words or phrases a higher relevancy weighting.
6. Caveat: If you
use Advanced Search, make sure to place some term or terms in the Sort-By
box; otherwise, results return in completely random order.
7. AltaVista's
directory comes from Looksmart.
8. AltaVista's
advanced search does not allow for the use of + and - signs.
9. If you search
AltaVista in the "simple" mode entering multiple terms without syntax,
it will result in an "implied" OR. In the advanced mode, multiple terms
are considered a phrase.
10. AltaVista software
powers the Health Resources and Services Administration (U.S. government) search
engine, which means that all AltaVista syntax can be utilized there. This site
also illustrates AltaVista's capability to index full-text .pdf documents at
the site-specific and intranet level [http://search.hrsa.gov].
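To make the NEAR operator in tip 1 concrete, here is a small illustrative Python sketch, written for this article rather than taken from AltaVista, that checks whether two terms fall within a ten-word window of each other in either direction.

# Illustrative only: a rough approximation of what a proximity (NEAR) operator
# does. It checks whether two terms occur within a given number of words of
# each other, in either direction.
def near(text, term_a, term_b, window=10):
    words = [w.strip('.,;:!?"()').lower() for w in text.split()]
    positions_a = [i for i, w in enumerate(words) if w == term_a.lower()]
    positions_b = [i for i, w in enumerate(words) if w == term_b.lower()]
    return any(abs(a - b) <= window for a in positions_a for b in positions_b)

page = ("Librarians evaluate search tools carefully, testing relevance, coverage, "
        "speed, syntax, currency, and documentation before recommending any engine to patrons.")
print(near(page, "search", "tools"))   # True: the terms are adjacent
print(near(page, "search", "engine"))  # False: the terms are 14 words apart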
Ten Things to
Know About MSN Search
1. MSN (The Microsoft
Network) Search is "powered" by an Inktomi database. Remember that
Inktomi licenses its database to many search sites. Each site gets a different
"flavor" of the total database.
2. The MSN Advanced
Search interface offers numerous limiting options via fill-in boxes and
pull-down menus [http://search.msn.com/advanced.asp].
3. The Advanced
Search interface permits limiting to pages at a certain depth in the site.
For example, limiting to Depth 3 restricts the search to pages
no more than three directories deep within a site [e.g., http://www.testsearch.com/Directory1/Directory2/Directory3/].
4. MSN Search allows
use of the asterisk (*) as a truncation symbol.
5. According to
the most current Search Engine Showdown rankings, MSN Search has the largest
database of any Inktomi partner.
6. The directory
portion of MSN search is powered by the Looksmart database.
7. On the Advanced
Search interface, checking the "Acrobat" box will retrieve pages with links
to pages that contain .pdf files. It does not search content "inside" these
files.
8. Greg Notess
points out that the same syntax available to limit Hotbot will also work
with MSN Search [http://hotbot.lycos.com/help/tips/search_features.asp].
9. Danny Sullivan
notes that MSN also employs human editors to "hand-pick" key sites in the
Web Directory and Featured Link sections of the site. Although most of
the time the "Featured Links" represent major MSN advertisers, editors
can add other content.
10. Searching
under the MSN "News Search" tab returns results predominantly from
MSNBC.
Ten Things to
Know About Northern Light
1. Make sure to
study the Northern Light "Power" search page. It provides many limiting
options without requiring knowledge of any syntax
[http://nlresearch.northernlight.com/power_research.html].
2. Instead of entering
http://www.northernlight.com,
use http://www.nlresearch.com
to go straight to the Northern Light Research site. This site, aimed at
the enterprise market (but available to any searcher), provides access to
several databases not available from the main URL. Most of these resources
are fee-based. They include EIU Search and market research content from
FIND/SVP and MarkIntel.
3. Northern Light
provides FREE full-text access to a database of continuously updated news
content from 56 newswires. Material stays in this database, available for
free access, for 2 weeks. Then the content moves to the Northern Light
Special Collection database.
4. Northern Light's
Special Editions are subject-specific portals that combine material from
the "open Web" and NL's proprietary databases. Topics of Special Editions
include XML, managed care, and electronic commerce.
5. The Northern
Light Special Collection currently contains content (fee-based, pay-per-document)
from over 7,100 sources. A catalog of these publications is available at
http://nlresearch.northernlight.com/docs/specoll_help_catlook.html.
6. Northern Light
allows the use of Boolean operators and + and - signs.
7. Multiple truncation
symbols can be used in a query. Northern Light has two truncation symbols:
the asterisk (*) for multiple letters and the percent sign (%) for a single
or absent letter (e.g., medi%eval retrieves both medieval and mediaeval; see the sketch after this list).
8. In addition
to the limiting capabilities of the "Power" search page, NL has several
terms available for field searching. These include text: and pub:.
(This last prefix allows searching in a specific Special Collection publication
title.) You can find a complete list at
http://nlresearch.northernlight.com/docs/search_help_quickref.html.
9. Northern Light's
free "Alerts" feature is one resource you must know about. This feature
allows you to set up search strategies in ANY/ALL of the NL databases and
have those strategies searched up to three times daily. If any new material
hits on the strategy, results will be delivered to you via e-mail. I use
this tool to bring me a customized feed of news via the NL News Search
database. Remember, the full-text content is free to access for 2 weeks.
10. Northern Light's
"Geo Search" provides an opportunity to search the Web with keywords plus
U.S. and Canadian address information. Results also get the benefit of
NL's organization with its "custom folders."
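Because Northern Light's two truncation symbols behave differently, here is a brief illustrative sketch, my own and not NL code, that maps them onto regular expressions so you can see exactly what each wildcard matches.

# Illustrative only: translating Northern Light's truncation symbols into
# regular expressions. * stands for zero or more letters; % stands for a
# single letter that may be absent.
import re

def truncation_to_regex(pattern):
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append("[a-z]*")
        elif ch == "%":
            parts.append("[a-z]?")
        else:
            parts.append(re.escape(ch))
    return re.compile("^" + "".join(parts) + "$", re.IGNORECASE)

medi = truncation_to_regex("medi%eval")
print(bool(medi.match("medieval")))    # True: % matched no letter at all
print(bool(medi.match("mediaeval")))   # True: % matched the letter "a"

lib = truncation_to_regex("librar*")
print(bool(lib.match("libraries")))    # True: * matched "ies"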
Ontologies,
Controlled Vocabularies, XML, and Web Search Engines
I am very excited
to see that controlled vocabularies and the building of ontologies have
come into vogue.
Some of this "hipness"
has been caused by the promise and excitement surrounding XML (eXtensible
Markup Language). However, I am not sure if the coming of XML will help
the general-purpose search engine, though it should clearly help specialized,
focused, and Invisible Web engines become much more useful resources.
Why the hesitation?
The general-purpose
engines, as we know and love them today, aim to index every page they can find:
massive amounts of data coming from just about anyone who wants to produce
Web content and put it on a publicly accessible server.
The problem for
implementation of a controlled vocabulary with this material is really
one of creation. Who would create it? Who would maintain it? Who would
do the cataloging? Would entire sites be cataloged at the page level or
only a specific page (the top page)? Who would manage such a project? Where
would the money come from?
Controlled vocabularies
and XML show a great deal of promise for certain types of search engines
because these types of engines can much more easily create and enforce
a set of agreed upon standards. Many issues would need resolution before
we could apply controlled vocabularies to make searching the massive amount
of material on the open Web more effective.
The
Future:
New Tools on
the Way
When you learn
about new search tools and share that knowledge with others, you not only
improve your own searching, but you help to make a better future for all
searchers.
Here are some new
search products that show a lot of promise, a few more potential "quick
hits." With the vulnerability of the Internet industry of late, let's hope
these products survive. Even if the actual companies do not survive, the
technology is still worth knowing about. Have fun!!!
Three New General-Purpose Search Engines: Competition for Google?
A New Image Search Tool
Real-Time Search: Patented technology to search resources updated in real time.
Natural Language Search Technology: This product is getting a lot of attention.
Now let's see
if you've learned your lessons. How long will it take before you've tried
all these new promising sites out? The test clock starts...now!
This
Article Contains Inaccuracies:
Essential Reading
In the time it
takes this article to move from the author to the editor to the publisher
to the printer to you, undoubtedly something mentioned in this article
will have changed. Some feature will have appeared, another vanished. The
working searcher must simply make a policy of staying on top of those changes.
Those of you who
need to keep current on the Web search world should monitor the following
sites as often as possible. All these sites are free and most offer free
e-mail newsletters and updates.
SearchDay
http://www.searchenginewatch.com/searchday/
Written by Chris
Sherman. Daily updates.
Search Engine Watch
http://www.searchenginewatch.com
A resource-rich
site that offers a free monthly newsletter.
Search Engine Showdown
http://www.searchengineshowdown.com
Librarian Greg
Notess's site. Updated on a regular basis. Greg also manages the Search-L
list.
ResearchBuzz
http://www.researchbuzz.com
Written and compiled
by Tara Calishain. Daily updates.
TVC (The Virtual Chase) Alert
http://www.thevirtualchase.com
Written and compiled
by Genie Tyburski. Daily updates.
The Virtual Acquisition Shelf
and News Desk
http://resourceshelf.blogspot.com
Compiled by Gary
Price. Daily updates.
Free Pint
http://www.freepint.com
Fortnightly newsletter
edited by Will Hann. Also offers Web discussion boards.
News Breaks from Info Today
https://www.infotoday.com/newsbreaks/
General information
industry coverage of breaking news that often features news of the Web
search world.
Scope Notes
Before we begin,
we need to get a definition straight, one that I think many of
us have thought about. What does "Web search" mean to the information professional?
In the early days of the Web, it meant exactly what it sounds like: material
found on the open Web.
However, as we
move forward, the term "Web search" has taken on new meanings. Does a Web
search mean using tools like Google or AltaVista to reach "open access" material?
Does it mean using the Web as a vehicle to log on to proprietary databases
such as Factiva or Dialog? Not too long ago, logging into proprietary services
required individual connections to each one. Today, any Web browser with
an Internet connection can reach those services. Perhaps it means both.
This lack of common understanding can confuse, and trying to resolve
the issue is outside the scope of this article.
This article will
primarily focus on the "traditional" Web search, i.e., search engines that
assist in locating open Web content. The approach I have taken is to try
to answer the questions I seem to get, in one form or another, at every
conference, every workshop, and in every day's stack of e-mail messages.
The Never-Ending Amount to
Learn, No Sign of Slowing Down
The single most
difficult issue for the Web searcher to face is the sheer volume and speed
of change on both the Web and the search engines that try to cope with
it. The sense of doom most searchers feel in struggling to keep pace occurs
not because of any lack of intelligence, nor any lack of interest in the
subject — far from it. Most often the cause is the reality of having only
24 hours in a day and the fact that life exists away from the computer.
I monitor what's
going on in the Web search world on a daily basis and it's almost routine
for something new to arrive or for something established to change each
day. For example, at the time of writing this article, AllTheWeb had just
undergone major changes, Google released an image search tool, and WISEnut,
a new general search tool, had come on the scene. When you couple the dynamic
nature of Web searching (both individual pages and entire resources coming
and going) with the need to stay up-to-date with traditional electronic
tools (which undergo plenty of changes as well), print resources, and other
issues of the day (can you say "copyright" or spell "Tasini"!),
there is so much to do and so little time to do it.
A lack of knowledge
and understanding about how a particular search tool works, e.g., a new
way to narrow your search, or ignorance of a more useful tool, e.g., a
new search engine going online, can waste time and produce poor results.
What Should
the Searcher Do?
I realize this
is easier said than done, but Web searchers MUST devote at least 1-2 hours
a week to stay current. This informal "continuing education" is crucial.
Often, the knowledge you gain from these sessions will pay off handsomely
with time saved and better query results in the future. The best way to
learn how a search engine works is by using it. Conducting preemptive
research on a favorite topic makes it easy to spot differences both in
content and in the way results are presented, while at the same time
gathering new resources for your own bookmarks or intranet sites. For a
list of suggested sources to keep you current, see the "Essential Reading"
sidebar.
Is an "Open Web" Search Engine
Always the Place to Begin? What Type of Information Can I Count on Finding
There?
Lately, I have
spent a great deal of time thinking about this issue. As someone who often
gives presentations about Web searching, I have tried to provide session
attendees with lists of what you can and can't find "on the Web" using
a general-purpose Web search tool. Even in the most general sense, my attempts
fall short. A few minutes after beginning, I inevitably realize that one
can't boil down a dynamic universe of data like the Web into bullet points.
Knowing, or better, understanding where to start in this world of information
resources is perhaps the most important thing to know and share.
There is no simple way of doing this. It takes time and commitment. I start
learning about new resources by asking the most basic questions: What is
this database or search engine? What kinds of questions would it
help me answer?
Often the open
Web may not be the place to begin. While it's nice to get quality material
free, how long did it take to get it? Would standing up and walking to
a bookshelf produce a useful answer in a much shorter period of time? Would
a commercial full-text search service scan the decade-long archives of
50 or 100 newspapers in a matter of minutes? At issue are the time and
money it takes to reach your answer.
Even if you choose
the open Web as your target, would a specialized or targeted search engine
more easily find your answer, rather than one of the all-encompassing engines?
Regardless, understanding how each search engine works and the many ways
an engine allows you to limit and control searches will make general-purpose
engines more productive and waste less of your time.
We need to do this
"learning" much the same way we have always "learned" traditional databases
and print resources. Think about how much focus information vendors like
Factiva and Dialog place on training. Unfortunately, Web engine companies
do not offer this kind of training, but the learning process remains crucial.
For me, the best part about being an information professional is the knowledge
of where to find an answer. This is knowledge that non-professionals desire
and makes our already important jobs even more valuable, especially with
so many new databases and new online resources becoming available.
What Should
the Searcher Do?
Consider the
open Web more a directory to answers and less an all-knowing answer
machine. Sometimes, this directory WILL become an authoritative reference
book and provide you with a timely and authoritative answer. Other times
it will assist by providing you with background knowledge that can make
using a fee-based service or a print collection more productive. Don't
forget — shifting from one format to another can be a two-way street. What
you learn from a print or commercial online source can produce an effective
search strategy for the open Web. A Web search engine may also provide
you with specific names of people to contact. Remember, the telephone and
e-mail will always be very important reference resources.
The Quality of Information:
The Biggest Challenge to Web Searching
For this Web searcher,
information quality constitutes the greatest challenge faced as both a
searcher and a teacher. We live in an age when anyone can become a publisher.
All they need is a Web connection, server space, and something to say and/or
share. Once the content goes onto a server and once a crawler finds it,
the Web search engines will make it available to everyone. Within minutes
or days, anyone with Web access can find that information. Amazing! And
frightening!
Once they have
found it, the major challenge to searchers is evaluating content. They
must judge its quality, and often very quickly, using the criteria that
information professionals have always used to evaluate information. How
does one do this? Well, this is the topic of other articles, books, and
dissertations. The most important point is to take a step back, if only
for a second, to ask yourself where this information is coming from and
why it is being placed online. Since anyone can become a publisher with
the Web as a publishing medium, the reputation and background of the site
creator, their qualifications, etc., are crucial. I would strongly recommend
taking a look at the resources our colleague Genie Tyburski makes available
on her site for judging quality [http://www.virtualchase.com/quality/index.html].
Evaluating information
quality, something that our profession has always done, offers another
in-road for sharing our skills with the public. Many who search the Web
take whatever they find to be accurate, current, and worthwhile. As information
professionals, we must protect them, often from themselves.
One more thing.
In my opinion, the challenges that information quality poses for the Web
searcher prove how important it is for our profession to include Web resources
as part of our collection development. We must try to make the Web a more
effective tool for researchers. The Web is a living organism and, unlike
an annual reference book, can change at a moment's notice. In an already
busy workday, finding time to search out Web resources in an organized
manner can be difficult. But all of us need to have an idea of what is
available and where to turn before we actually need the resource to answer
a query. Just knowing a top-level site exists that may contain the answer
will not suffice. We learn our print collections; let's learn our
Web collections and bookmarks.
Easier said than
done? Of course. Still, it remains a goal we should strive to attain.
The Domination of Google
Everyone, including
me, loves Google. How could you not like it? In most cases, it delivers
highly relevant results (though this does not always mean authoritative)
in a short amount of time. When you add in features like Google Cache (a
powerful way to find pages that might have just gone AWOL), you have a
search engine that works and works well.
Google is simple
to use at a basic search level, but still returns good results. This is
why non-professional searchers love it so much. The clean, single box home
page is simple for non-sophisticated searchers to understand. It doesn't
even allow you to directly use all three Boolean operators to return results,
yet it works! Wow! More advanced searchers will be interested to know that
Google uses AND as a default between search terms, permits the use of OR
(it must be in all caps), and can remove a word or phrase if you use a
minus (-) sign.
What I like most
about Google is its quest to improve on what it already has. Google always
seems to be introducing something new and innovative. In February 2001,
it started tracking portable document format (.pdf) material. The general
public may not put a high demand on some of this content, but PDF documents
offer information professionals masses of authoritative content from respected
sources. At the time of writing, Google was still the only general search
engine to make PDF files searchable on a large scale.
What Should
the Searcher Do?
The advanced
searcher must get to know and make use of Google at a more than "put the
words in the box" level. It's very easy. Begin by looking at the Google
Advanced Search page [http://www.google.com/advanced_search.html], and at the same time learn the syntax that
will allow you to limit your searches directly without having to use this
page. To learn more about Google, especially on how it compares to other
search engines, go to Greg Notess's Search Engine Showdown site [http://www.searchengineshowdown.com].
Here's hoping
that Google continues to improve and add new useful features. Here's also
hoping that Google continues to properly separate advertising content from
result sets. Yet with all of Google's wonderful abilities, good searchers
know that they must never make any single Web search engine the only tool
they use. No single engine makes "everything" searchable.
Understanding the Limitations
of General Web Search Tools
No single Web
search tool is the end-all/be-all. In fact, most have limitations that
need careful consideration if you plan to use them regularly or teach others
to use them. What do I mean by limitations? Here are just a few of many
possible examples:
-
Search spiders or
crawlers (the software that brings back material to a database so you can
search it) do not crawl the Web in real time (a toy illustration follows this
list). A page made available on the Web on Thursday could wait weeks before
a crawler reaches it. The major search services are improving turnaround on
recrawling and adding pages, but in general, expect to wait many days before
a keyword search will return a recent page.
-
If a site or page
is not linked to or submitted by someone (Webmaster, page author,
etc.), it will not be accessible from a search engine. Engines primarily
use these two methods of finding out about new sites and pages.
-
Simply because one,
1,000, or even more pages from a site are available does not mean that
the
engine makes every page of an entire site searchable.
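To make the first limitation above concrete, here is a toy Python crawler, deliberately minimal and nothing like a production engine, that discovers pages only by following links from pages it already knows about. Pages that are never linked to (and never submitted) simply never enter its queue.

# A toy illustration of how a crawler discovers pages: it can only follow
# links from pages it already knows about (or URLs someone submits as seeds).
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=5):
    seen, queue = set(), deque([seed_url])
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue  # unreachable pages never make it into the "index"
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            queue.append(urljoin(url, link))
    return seen

print(crawl("http://www.example.com/"))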
What Should
the Searcher Do?
Understand from
the outset that these limitations exist and can affect your search results.
Rely on more than one search engine. Make use of specialty search tools
that often go "deeper" into a site to collect more content. Take advantage
of "Invisible Web" resources. Use Web directories like the Librarians'
Index to the Internet to "mine" specific sites. When you find something
of value, bookmark it.
Using Invisible/Hidden Web
Resources
Over the last
couple of years, the phrase "the Invisible Web" has come into use; others
call it the hidden or deep Web. For the most part, the terms
are synonymous. Searchers need to know about the material in this section
of the open Web. In many cases the material comes from well-known, authoritative
sources, is available at low or no cost, but is not accessible using
a Web search engine.
Resources you interact
with, such as sites where you fill in a set of variables and then have a "custom"
page returned to you, are examples of Invisible Web pages. So is a site
that contains data that you can use for free, but only after you register.
Why don't the search engines access this material? The search spider software
seeking out material to bring back to the database finds nothing to retrieve
in these examples. In the case of the custom page, the material is not
accessible until the user calls for it and the system creates the page
on the fly. In the other example, search spiders from general-purpose Web
search engines do not fill out registration forms. So once the spider hits
a page that requires registration, the spider stops and moves on. None
of the material below that registration interface is searchable from general
engines. One other factor can block search engine access — the "no-robot"
tag. Webmasters can specify that they don't want their sites spidered, and most
of the good, responsible crawlers will respect that request, whether it covers
all or any portion of the content on a Web site. Sometimes Webmasters,
perhaps concerned about possible excessive usage, block the spiders
without fully considering how this decision can eliminate a substantial audience
for material they have taken the time, trouble, and expense to put online.
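To show what honoring the "no-robot" request looks like in practice, here is a brief sketch using Python's standard robots.txt parser. The site and rules are hypothetical examples, but the check itself is what any responsible crawler performs before fetching a page.

# A responsible crawler consults a site's robots.txt before fetching.
# A hypothetical robots.txt might read:
#   User-agent: *
#   Disallow: /members/
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("http://www.example.com/robots.txt")
robots.read()  # fetch and parse the site's robots.txt

for page in ("http://www.example.com/index.html",
             "http://www.example.com/members/archive.html"):
    if robots.can_fetch("*", page):
        print("OK to crawl:", page)
    else:
        print("Skipping (blocked by robots.txt):", page)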
Prime examples
of Invisible Web databases include American FactFinder from the U.S. Census,
most Web-accessible library catalogs, and many of the databases available
via GPO Access.
What Should
the Searcher Do?
Know what is
available before you need it. Of course, this takes time and practice.
We do much the same when becoming aware of the databases from LexisNexis
or Dialog. What makes this even a larger challenge is that there are thousands
of these databases available and, unlike Dialog, no common search syntax.
Use compilations of Invisible Web databases such as the one Chris Sherman
and I have created to support our book [http://www.invisible-web.net].
Conduct Invisible Web collection development. Develop and learn your own
collection. Using the "open Web" to attempt to find something with the
boss breathing down your back is both difficult and inefficient.
One Further
Thought
A great deal of
research and time is devoted to making the information inside these Invisible
Web databases more easily accessible from general-purpose Web search tools
and other resources. The challenge is that many of these Invisible Web
databases offer "custom" interfaces and database tools specifically to
enable interaction with the data. Although the ability to crawl all of
this data is coming and, in some cases, available now, without the proper
limiting tools to harness this information, we could face even worse problems.
We might make already massive uncontrolled databases the size of Google's,
Excite's, or AltaVista's even larger, without the proper mechanisms to
get the data out in a precise manner. In librarian-speak, this translates
into increased recall and lower precision.
Specialized, Focused, and
Site-Specific Search Tools: Important and Necessary
I often get a
bit unsettled when people and companies refer to the Invisible Web. What
many understand as the Invisible Web encompasses content actually visible
to general-purpose engines like Google and AltaVista. What many label as
Invisible, deep, or hidden Web content actually refers to basic HTML material,
easy for the general search engines to index and make accessible. Many
of the databases that are often reported as Invisible Web are actually
just beyond the reach of general Web search engine policies and procedures.
More aggressive and focused or targeted Web crawlers may go where the general
search engines have balked. For example, specialized search engines were
the first to start handling .pdf formatted files.
To penetrate these
resources, users should learn to turn to specialized or focused search
engines, important and effective tools at getting to the best answer possible
on the open Web. Well-known specialized Web search engines include Psychcrawler,
PoliticalInformation.Com, and Inomics.Com, each of which focuses on a specific
subject (psychology, political science, and economics, respectively). Site-specific
engines refer to the search engines that many sites make available to cover
their own material.
The general search
tools can, and often do, crawl material that you can also
find using a specialized, focused, and site-specific search engine. However,
in some cases, the general search engines may not cover this material as
well as the specialized ones. For example, the engines may not crawl the
key sites in a timely manner or at a deep enough level. Bottom line: Coverage
of this material by general search engines like Excite or AllTheWeb may
be spottier than the specialized search tools.
Here are just a
few of the reasons why this problem occurs:
-
Time Lag. Unless
paid for, spiders visit pages unannounced. Material changed or added since
the spider last crawled the content (as much as a month, a quarter, or longer
ago) remains, for all practical purposes, invisible.
News material is a good illustration. A new page on the CNN site is
technically crawlable by any general-purpose engine. However, for some
period after it appears, it will not be searchable through a general search engine.
-
Depth of Crawl.
Simply because a search engine makes one, 10, or 100,000 pages of a site
accessible does not mean that it has crawled the entire site. Some engines
only take a certain amount of material and then move on.
-
Each Search Engine
Database Is Unique. As the work of Greg Notess makes clear, each search
engine database differs. What Google knows about, Excite may not have in
its database. What AltaVista can find, AllTheWeb/Fast may not make accessible.
-
Dead-End Pages.
If a basic HTML page sits on your server and is not linked from
any other page that a search tool already knows about and you don't
submit it, then it will, most likely, not be discovered and crawled. A
site-specific engine can crawl every page sitting on an entire server and
make the page searchable.
Why would you want
to use one of these search engines? Several reasons. Smaller, more targeted
databases make for greater precision, though lower recall. Think about a
world with only one massive Dialog database. Just as you select
the correct database for the specific task there, the same applies to specialized
search engines.
Additionally, these
resources often offer human interaction, with a knowledgeable editor telling
the crawler where to go, how often to return, and how deep to crawl. I
think this job of human database editor will become more and more important
in the future. What a great new career for information professionals!
Finally, some of
these specialized engines, the BBC News engine for example [http://newssearch.bbc.co.uk/ksenglish/query.htm], provide extra functionality, such as constant,
even daily, updating and limiting options for search strategies.
What Should
the Searcher Do?
Check out and
use the good sources that identify and collect specialized and focused
databases. I like Profusion [http://www.profusion.com],
where they are labeled "Invisible Web," and the always reliable and always wonderful
Librarians' Index to the Internet [http://www.lii.org],
which covers a large number of specialized and Invisible Web databases.
Once you have found good tools in your areas of interest, use them and
learn their features in depth.
Using Search Tools on Specific
Sites and Possible Intranet Solutions
This is a simple
idea that I think is often overlooked by searchers. We all know that information
professionals should take full advantage of the special searching features,
such as limiting, and other resources Web search tools offer. However,
the fact that many general-purpose engines (AltaVista, Google, Ultraseek/Inktomi)
are also licensed and available to search specific sites often goes unnoticed
and unused. It shouldn't.
The power searcher
should identify when a specific "site-search" tool is actually the same
software as that of a general-purpose engine. Then we should make use of
the syntax, limiting functions, etc., still available as if the
engine were being used to search the entire Web.
Here are a few
examples to illustrate my point:
The Google engine
and the syntax it offers is used by many sites, including FindLaw LawCrawler
[http://www.lawcrawler.com],
the Energy Information Administration [http://www.eia.doe.gov/],
and IDG.net [http://www.idgnet.com].
Lycos provides the search technology available at USAToday.Com [http://www.usatoday.com].
AltaVista services Macworld.Com [http://www.macworld.com]
and Western Michigan University [http://www.wmich.edu].
UltraSeek technology (now part of Inktomi) is used by CNN [http://www.cnn.com],
and the University of Toronto [http://www.utoronto.ca].
Simply placing
an interface to a well-known proprietary search product on the end user's
desktop will not get them searching well. With so much attention placed
on the power of search tools like Google, AltaVista, and Hotbot, these
products have become synonymous with searching for the general public.
Perhaps the time has come for proprietary information vendors to begin
adapting and incorporating this widely known search software into their products.
This would allow search trainers, to some small degree, not only to share
the intricacies of becoming a more effective Web searcher, but also to apply
those same techniques to in-house proprietary databases.
The lack of standards is a major issue that needs addressing.
That many
Web search tools are also available for licensing as intranet or extranet
engines makes a great deal of sense. Greater standardization of search tools
can reduce the confusion and frustration felt by end users — not to mention,
their trainers.
What Should
the Searcher Do?
Learn more about
the various search engines and their use as possible intranet search solutions.
Start by visiting Avi Rappaport's very useful site [http://www.searchtools.com].
Not only will this resource teach you about the hundreds of different search
tools available, but the knowledge this site offers will also make you a better
searcher.
More Content Coming: The Ability
to Search Audio and Video Material
When it comes
to non-text formats, we already have tools, and shortly will have even more,
to ensure that we can provide our users with the best possible answers. The
ability to search video (e.g., newscasts) and audio (e.g., radio programs)
continues to expand. Material that we would have to wait weeks for in the
past, assuming it ever became available, is now available shortly after
the words are spoken. This material can serve many types of users, including
those in international relations and competitive intelligence. Of course,
archives of this material are also available. In many cases these keyword
databases are created using either voice recognition technology or by capturing
the text from closed captions associated with the broadcast.
Work also continues
on search tools that provide access to video and audio material through
non-text mechanisms. For example, you could search
for a specific color or type of background. An article in Technology
Review provides a good orientation to the topic [http://www.techreview.com/magazine/jul01/upstream.asp]. Much of this research will also be
available for still-image search tools. Currently, such tools, including
those from Google, Fast, and AltaVista, use the text surrounding the image,
i.e., image captions, and additional factors to determine what a still
image is about.
What Should
the Searcher Do?
Become aware
of and familiar with some of the major players in this space.
Virage [http://www.virage.com]
is a leader in the video search arena. In fact, you can keyword search
many of the reports from The NewsHour with Jim Lehrer using Virage technology
at [http://www.pbs.org/newshour/video/index.html]. Other companies of interest include
TVEyes [http://www.tveyes.com],
ShadowTV [http://www.shadowtv.com],
and WordWave [http://www.wordwave.com].
Finally, take a test drive of SpeechBot [http://www.speechbot.com],
a keyword search engine demo from Compaq that uses speech-recognition
technology to create a real-time transcript.
As for image
searches, try these two resources. Webseek allows you to search or browse
for criteria in the image [http://www.ctr.columbia.edu/webseek/].
Visoo uses software that looks for words embedded "inside the image" [http://www.visoo.com].
The Commercialization of Search
Results
This issue has
received a great deal of well-deserved attention lately. It seems to me
that searchers/researchers and the many other people
involved (the engines themselves, the search optimization community,
the advertising community) have different ideas about what the bottom line
is when it comes to Web searches. Don't misunderstand me: the engines
are profit-making businesses, or try to be, so making money is goal number
one. I understand this fact. However, those of us who use the "open Web"
as a research tool want timely and authoritative answers without advertising
or undue influence getting in the way of the best possible answer available.
Can the wants and
needs of the two groups co-exist? Absolutely, but it will take knowledge
and continuing education for both information professionals and end users
to continue to use general-purpose Web search tools as effective resources.
The bottom line here is knowledge of the issues for all parties. Using
the Web effectively without general-purpose search engines would be difficult,
time consuming, and in many cases impossible. This is particularly true
for the professional researcher.
Pay-per-placement
(pay-per-click) allows a person or company to buy a keyword or keywords and
have their listing appear at the top of the results list when that word or words
are searched. GoTo.Com is just one of many examples of this type of search
engine. The extra challenge with GoTo and others is that in addition to
offering searching at GoTo.Com, they also sell their database to other engines
to brand as their own. For example, GoTo.Com "powers" NBCi and Go.Com
(formerly Infoseek). So, if users tell you that NBCi is their engine
of choice, in actuality they are searching GoTo.Com material. Various "flavors"
of this type of branding exist in the Web search world. To get an idea
of how many of these engines are online, check http://www.payperclicksearchengines.com.
Many of the leading
engines have paid-inclusion programs in place
that allow a person or company to pay a fee and make sure that their
site is crawled and included in that particular database. Additionally,
this fee will also make sure that the site is recrawled on a regular basis,
sometimes every week or so. This can mean that searchers may assume a currency
of results, based on retrieval from the paid-inclusion sites, that does not
hold for non-paying sites.
Search optimization
consultants reverse-engineer search engines and relevancy-ranking algorithms
and then use this knowledge to get a client's Web pages higher in a search
result list.
Danny Sullivan,
the editor of Search Engine Watch [http://www.searchenginewatch.com],
covers this and most other parts of the search world on a regular basis
and at great depth. Also, to learn more about search engine optimization
take a look at Rank Write Roundtable [http://www.rankwrite.com].
By the way, keeping current with the search engine optimization discussion
can often provide searchers with deep background about how the engines
work. Again, this makes for a better searcher.
What Should
the Searcher Do?
Understand the
differences among search engines, become familiar with the terminology,
and share this knowledge with others.
In the case
of more "traditional" engines, be aware of how commercial material is labeled
and where it is placed. For example, AltaVista offers "partner listings"
at the top and bottom of a results list. Excite uses the term "sponsored
link." Hotbot places "products and services" at the top of the results
list.
At the time
of writing, Google does not offer a paid inclusion program. However, Google
will allow the purchase of keyword(s) and a link to a corresponding URL
to appear away from the ranked results list, labeled as a sponsored link
inside a colored box.
Meta-Search Tools: Problems
and Challenges
I have never been
a fan of meta-search engines. These tools simultaneously send your search
request to many engines. Why don't I like them? Several reasons. One, meta-search
engines often do not allow you to use the underlying engines in more than a basic
mode, leading to high recall but very poor precision. Equally important,
especially in the last couple of years, is the fact that many of the most
well-known meta-engines send a query to many entirely "pay for placement"
engines. A May 2001 Danny Sullivan report [http://searchenginewatch.com/sereport/01/05-metasearch.html] provides a clear view of this issue.
For example, the popular Dogpile meta-search engine sends a query to 15
engines, six of them entirely pay for placement. I think most researchers
using the Web would be disappointed by the results they receive and the
time they have wasted.
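For readers curious about the mechanics, here is a rough Python sketch of the fan-out idea behind a meta-search engine: one query sent to several engines at once, with the responses gathered for later merging. The engine addresses and parameter names are placeholders I made up, not any engine's real interface, and the all-important merging and de-duplication steps are left out.

# A rough sketch of meta-search fan-out: send one query to several engines at
# the same time and collect whatever comes back. Endpoints are placeholders.
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlencode
from urllib.request import urlopen

ENGINES = {
    "engine-a": "http://search-a.example.com/search?",
    "engine-b": "http://search-b.example.com/find?",
    "engine-c": "http://search-c.example.com/q?",
}

def query_engine(name, base_url, terms):
    url = base_url + urlencode({"q": terms})
    try:
        return name, urlopen(url, timeout=10).read()
    except Exception as exc:
        return name, exc  # one slow or dead engine should not sink the search

def metasearch(terms):
    with ThreadPoolExecutor(max_workers=len(ENGINES)) as pool:
        futures = [pool.submit(query_engine, n, u, terms) for n, u in ENGINES.items()]
        return dict(f.result() for f in futures)

results = metasearch("controlled vocabularies")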
What Should
the Searcher Do?
First, inform
other searchers, especially end users who think they are "getting it all"
by using a meta-search engine. Information professionals should take advantage
of the "power" or "advanced" mode most general engines offer, such as limiting
to a specific domain or word in the URL.
One More
Thing
In the spirit of
something for everyone, phone the neighbors and wake the children; I will
mention one meta-engine that I do like and use: Hello Vivisimo! [http://www.vivisimo.com].
So why do I like it? A few reasons.
-
It does not send your
query to any 100 percent pay-for-placement engines.
-
It does a reasonable
job of allowing you to use some advanced syntax.
-
The "advanced interface"
allows for several customization features.
-
It has some duplicate
removal capabilities.
-
Vivisimo effectively
clusters results into hierarchical sets of categories on the fly.
-
Users have the option
of previewing a page directly from a result list.
-
Vivisimo searches
several news databases and other search sites (e.g., Medline, USPTO, FirstGov.Gov)
and still takes advantage of its clustering process. This can be particularly
useful for basic searchers who only enter a few keywords and do not search
with limits. Using Vivisimo, they can at least take advantage of the categories,
which will hopefully assist them in reaching the answer they want quickly.
Where Have All the Pages
Gone?
Searching for
older material is a challenge, often an impossible one. The issue is as
old as Web searching and occurs not only in the Web search world, but in
many other areas of digital data. Currently, when most Web pages are removed
from a site, they are gone for good unless you can personally contact the
Webmaster, who can send you a copy. Luckily, many people are thinking about
and working on solving this problem. One example is the work done by OCLC and
RLG (Research Libraries Group) to develop standards and methods for archiving
older material. The National Archives and other government agencies are
doing similar work. NARA's Clinton Presidential Materials Archive [http://www.clinton.nara.gov/index.html]
is an early effort to store Web resources from a presidential administration.
Alexa Research
[http://www.alexa.com]
offers one of the earliest and most distinctive archiving efforts, the Alexa
Archive of the Web. Brewster Kahle's project makes snapshots of the Web,
archiving everything in sight. Alexa Research carries over 18 terabytes
of data covering some 5 million Web sites and some 1.9 billion pages. If
the site has preserved an archived copy of a page, the page appears in blue and
you can click to view it. If the site records a page but has no archive
for it, the link appears greyed out with the tag "Page not in Archive."
One subset of the Alexa archiving covers some 87 million
pages of material from the Election 2000 Presidential campaign [http://archive.alexa.com/].
What Should
the Searcher Do?
Long term? Become
aware of the research and projects going on in this area. Offer comments
and suggestions on how to make this material more accessible and searchable.
A great archive of quality content without the proper mechanism to access
it is not great.
Short term?
Take advantage of the Google cache feature — another "Google only" resource.
Each time the Google crawler comes around to crawl a Web page, it makes
a copy (unless told not to by the Web site owners) and places it on the
Google server. Therefore, if you search for a page using Google and then
click and find the page has been removed, return to the search results
page and look for the link, next to the URL, that says, "cached." Caveat:
The cache is a dynamic entity. A page does not stay in the Google cache
in perpetuity. It is only available from the cache until the next time
the crawler visits the page and identifies that it has gone. For more about
the Google cache, go to http://www.google.com/help/features.html#cached.
Of course, another
option is to either print out or save a copy of a page. This can be both
time consuming and a waste of paper or hard drive space. I use the SaveThis
[http://www.savethis.com]
service, which allows you to copy any Web page, save it on the server, and
access it from any Web browser. This free resource is well worth a look.
I Still Can't Find...
General, Invisible Web,
and specialized search tools still leave plenty of material out of reach.
So many types of resources to explain, so many places to search! Your boss
says that last night he or she was at home "searching" the Web for an article
from Newsweek. He or she went to AltaVista, Google, and Yahoo! and
came up empty.
"These search engines
don't contain 'everything,'" you tell your boss. However, often searching
other databases, you can access and purchase articles you need. You explain
that resources like Northern Light's Special Collection, Electric Library,
or using dowjones.com (a free site) to access and purchase an individual
article from Factiva's Publication Library are all possibilities. You go
on to tell him or her that your library also makes numerous databases available
to them through subscription licenses, databases they can access from home.
The boss says,
"Wow, I had no idea all of this material was available." On a roll, you
also suggest that the boss check with the local public library, which you
happen to know also offers access to many fee-based services. "Your tax
dollars at work," you say.
Finally, you tell
your boss, much of where you search is determined by what you need. In
some cases what you need can be found, for free, using Google or Excite,
but if you don't find it, you should know where to turn next. In some cases,
starting with Google or Excite might not be the best idea. There is still
plenty of content not digitized that may require a trip to a library with
a print or microfilm collection containing the document they need.
What Should
the Searcher Do?
You tell them.
Gary
Price's e-mail address is gprice@invisible-web.net.