Only a few years
ago, the phrase "Web search" did not exist. Then the term began to move
rapidly into the awareness of information professionals, about as fast
as a Japanese bullet train. Today, much, though not all, of the work we
do revolves in one way or another around the Web.
With so much to
keep on top of, precious time becomes even more precious. A couple of years
ago I wrote an article trying to figure out a way to make the day 26 or
27 hours long. Unfortunately, that idea never reached the implementation
stage, though it remains an idea worth considering. Even within the narrow
bounds of 24/7/365, we must all still try to keep up to date about what
is happening with Web search engines. The fact that they seem to change
on a weekly, if not daily, basis is no excuse.
We as professionals
do not use every search engine or Web directory daily; nevertheless, we
have to know how each works and what data each does and does not
contain. I fully understand that this is easier said than done, but today
information access is a topic that everyone is aware of and talking about.
Pick up any newspaper. Turn on the television. Every day more and more articles
and reports discuss searching the Web. Many of these articles and reports
are written for and by non-information professionals. We have to stay ahead
of our clients and patrons if we hope to help them. Excite or AllTheWeb
may not be your search engines of choice, but I bet they are for someone
you know. Our colleagues, co-workers, and friends come to us as the "search
experts" and we must do our best to help. Our knowledge and understanding
in this area are great ways to make our profession look good and to make
our already valuable jobs even more valuable.
With this said,
the following reviews the latest goings-on in the search world and tries
to provide some suggestions and tools to make you more knowledgeable and
save you some time.
Price's
Priceless Tips
The Web search world
changes on what sometimes seems like an hourly basis. What follows are
a few selected tips and resources for some of the most well-known engines.
This is just the tip of the iceberg. Resources like Search Engine Showdown
and Search Engine Watch are essential for learning and keeping up with
how these tools work and change over time.
Ten Things
to Know About Google
1. The database
that Google licenses to Yahoo! [http://google.yahoo.com]
is smaller than the Google.com database and does not contain links to cached
versions of pages. This database is also used to supply "fall-through" content
(material not in Yahoo's own database), often listed as "Web page" content.
2. Google utilizes
the Open Directory Project database as its Web Directory [http://directory.google.com].
3. You can search
stop words by placing a + in front of the word (ex. "+To +Be +Or Not +To
+Be").
4. At the present
time the Google database is refreshed about once every month.
5. You can limit
your search to only .pdf files by using the syntax filetype:pdf.
6. Google is the
only major search engine to crawl Adobe Acrobat .pdf files.
7. If you are a
frequent Google searcher, save time by using the Google Toolbar [http://toolbar.google.com]
and Google Buttons [http://www.google.com/options/buttons.html].
8. A Boolean "OR"
is available with Google. For it to function, capitalize the OR.
9. Google only
crawls and makes searchable the first 110 k of a page. Long documents may
have
substantial content invisible to Google.
10. Entering a
U.S. street address into the query box will return a link to a map of that
location. Typing in a person or business name, city, and state
will also run the query against the Google phone directory. Several other
combinations will also query the phone directory service, including
typing in an area code and number to run a reverse search
[http://www.google.com/help/features.html#wp].
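To see how several of these syntax elements combine, here is a minimal sketch, in Python, of building Google query strings and turning them into search URLs. It assumes the familiar google.com/search?q= address; it is an illustration, not an official Google interface.

# A minimal sketch (not an official Google interface) showing how the syntax
# above can be combined into a single query string and turned into a URL.
from urllib.parse import urlencode

def google_query_url(terms):
    """Build a Google search URL from a raw query string."""
    # urlencode handles the spaces, plus signs, and quotation marks for us.
    return "http://www.google.com/search?" + urlencode({"q": terms})

examples = [
    '"+to +be +or not +to +be" shakespeare',              # tip 3: + forces stop words
    'annual report filetype:pdf "energy policy"',         # tip 5: limit to .pdf files
    'librarian OR "information professional" salaries',   # tip 8: OR must be capitalized
]

for query in examples:
    print(google_query_url(query))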
Ten Things to
Know About AllTheWeb
1. AllTheWeb licenses
its database to Lycos. The identical database is searched and makes up
some of the content on a Lycos results page.
2. Unlike Google
and AltaVista, this search engine does not have a limit on the amount of
content crawled on a Web page.
3. AllTheWeb indexes
every word. Words traditionally considered "stop words" are searchable.
4. AllTheWeb does
not permit the use of Boolean operators.
5. If plus and/or
minus signs are not used, AllTheWeb implies a plus sign in front of each
term or phrase, resulting in an implied AND between terms.
6. AllTheWeb is
now promising a complete refresh of its database every 9-12 days.
7. AllTheWeb permits
syntax to be used directly from the "basic" search page to limit a query.
See http://www.alltheweb.com/help/basic.html#special.
8. A query to the
AllTheWeb text database simultaneously runs the search in the AllTheWeb
Image, Video, MP3, and FTP databases. If it finds anything, these results
are linked on the right side of the results page.
9. AllTheWeb offers
a search engine dedicated to Mobile Web content [http://mobile.alltheweb.com].
10. Fast Search
and Transfer (FAST), the company behind AllTheWeb, has deployed its software
to power the Scirus science search engine from Elsevier.
Ten Things to
Know About AltaVista
1. AltaVista is
the only major search engine that allows a searcher to use the proximity
operator NEAR (in simple search; near in advanced search). Using this operator
finds terms within 10 words of each other in either direction.
2. AltaVista indexes
only the first 100 k of text on a page.
3. An asterisk
(*) can be used in a phrase to represent an entire word (e.g., "One small
step for man, one giant * for mankind").
4. AltaVista News
[http://news.altavista.com]
is "powered" by Moreover. This continuous feed of material can be searched
using AltaVista syntax.
5. The use of the
"sort by" box on the AltaVista Advanced interface allows you to give certain
words or phrases a higher relevancy weighting.
6. Caveat: If you
use Advanced Search, make sure to place some term or terms in the Sort-By
box; otherwise, results return in completely random order.
7. AltaVista's
directory comes from Looksmart.
8. AltaVista's
advanced search does not allow for the use of + and - signs.
9. If you search
AltaVista in the "simple" mode entering multiple terms without syntax,
it will result in an "implied" OR. In the advanced mode, multiple terms
are considered a phrase.
10. AltaVista software
powers the Health Resources and Services Administration (U.S. government) search
engine, which means that all AltaVista syntax can be utilized there. This site
also illustrates AltaVista's capability to index full-text .pdf documents at
the site-specific and intranet level [http://search.hrsa.gov].
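To make the NEAR operator in tip 1 concrete, here is a small illustrative Python sketch, written for this article rather than taken from AltaVista, that checks whether two terms fall within a ten-word window of each other in either direction.

# Illustrative only: a rough approximation of what a proximity (NEAR) operator
# does. It checks whether two terms occur within a given number of words of
# each other, in either direction.
def near(text, term_a, term_b, window=10):
    words = [w.strip('.,;:!?"()').lower() for w in text.split()]
    positions_a = [i for i, w in enumerate(words) if w == term_a.lower()]
    positions_b = [i for i, w in enumerate(words) if w == term_b.lower()]
    return any(abs(a - b) <= window for a in positions_a for b in positions_b)

page = ("Librarians evaluate search tools carefully, testing relevance, coverage, "
        "speed, syntax, currency, and documentation before recommending any engine to patrons.")
print(near(page, "search", "tools"))   # True: the terms are adjacent
print(near(page, "search", "engine"))  # False: the terms are 14 words apart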
Ten Things to
Know About MSN Search
1. MSN (The Microsoft
Network) Search is "powered" by an Inktomi database. Remember that
Inktomi licenses its database to many search sites. Each site gets a different
"flavor" of the total database.
2. The MSN Advanced
Search interface offers numerous limiting options via fill-in boxes and
pull-down menus [http://search.msn.com/advanced.asp].
3. The Advanced
Search interface permits limiting to pages at a certain depth in the site.
For example, limiting to Depth 3 restricts the search to pages
no more than three directories deep within a site [e.g., http://www.testsearch.com/Directory1/Directory2/Directory3/].
4. MSN Search allows
use of the asterisk (*) as a truncation symbol.
5. According to
the most current Search Engine Showdown rankings, MSN Search has the largest
database of any Inktomi partner.
6. The directory
portion of MSN search is powered by the Looksmart database.
7. On the Advanced
Search interface, checking the "Acrobat" box will retrieve pages with links
to pages that contain .pdf files. It does not search content "inside" these
files.
8. Greg Notess
points out that the same syntax available to limit Hotbot will also work
with MSN Search [http://hotbot.lycos.com/help/tips/search_features.asp].
9. Danny Sullivan
notes that MSN also employs human editors to "hand-pick" key sites in the
Web Directory and Featured Link sections of the site. Although most of
the time the "Featured Links" represent major MSN advertisers, editors
can add other content.
10. Searching
under the MSN "News Search" tab returns results predominantly from
MSNBC.
Ten Things to
Know About Northern Light
1. Make sure to
study the Northern Light "Power" search page. It provides many limiting
options without requiring knowledge of any syntax
[http://nlresearch.northernlight.com/power_research.html].
2. Instead of entering
http://www.northernlight.com,
use http://www.nlresearch.com
to go straight to the Northern Light Research site. This site, aimed at
the enterprise market (but available to any searcher), provides access to
several databases not available from the main URL. Most of these resources
are fee-based. They include EIU Search and market research content from
FIND/SVP and MarkIntel.
3. Northern Light
provides FREE full-text access to a database of continuously updated news
content from 56 newswires. Material stays in this database, available for
free access, for 2 weeks. Then the content moves to the Northern Light
Special Collection database.
4. Northern Light's
Special Editions are subject-specific portals that combine material from
the "open Web" and NL's proprietary databases. Topics of Special Editions
include XML, managed care, and electronic commerce.
5. The Northern
Light Special Collection currently contains content (fee-based, pay-per-document)
from over 7,100 sources. A catalog of these publications is available at
http://nlresearch.northernlight.com/docs/specoll_help_catlook.html.
6. Northern Light
allows the use of Boolean operators and + and - signs.
7. Multiple truncation
symbols can be used in a query. Northern Light has two truncation symbols:
the asterisk (*) for multiple letters and the percent sign (%) for a single
or absent letter (e.g., medi%eval retrieves both medieval and mediaeval; see the sketch after this list).
8. In addition
to the limiting capabilities of the "Power" search page, NL has several
terms available for field searching. These include text: and pub:.
(This last prefix allows searching in a specific Special Collection publication
title.) You can find a complete list at
http://nlresearch.northernlight.com/docs/search_help_quickref.html.
9. Northern Light's
free "Alerts" feature is one resource you must know about. This feature
allows you to set up search strategies in ANY/ALL of the NL databases and
have those strategies searched up to three times daily. If any new material
hits on the strategy, results will be delivered to you via e-mail. I use
this tool to bring me a customized feed of news via the NL News Search
database. Remember, the full-text content is free to access for 2 weeks.
10. Northern Light's
"Geo Search" provides an opportunity to search the Web with keywords plus
U.S. and Canadian address information. Results also get the benefit of
NL's organization with its "custom folders."
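Because Northern Light's two truncation symbols behave differently, here is a brief illustrative sketch, my own and not NL code, that maps them onto regular expressions so you can see exactly what each wildcard matches.

# Illustrative only: translating Northern Light's truncation symbols into
# regular expressions. * stands for zero or more letters; % stands for a
# single letter that may be absent.
import re

def truncation_to_regex(pattern):
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append("[a-z]*")
        elif ch == "%":
            parts.append("[a-z]?")
        else:
            parts.append(re.escape(ch))
    return re.compile("^" + "".join(parts) + "$", re.IGNORECASE)

medi = truncation_to_regex("medi%eval")
print(bool(medi.match("medieval")))    # True: % matched no letter at all
print(bool(medi.match("mediaeval")))   # True: % matched the letter "a"

lib = truncation_to_regex("librar*")
print(bool(lib.match("libraries")))    # True: * matched "ies"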
Ontologies,
Controlled Vocabularies, XML, and Web Search Engines
I am very excited
to see that controlled vocabularies and the building of ontologies have
come into vogue.
Some of this "hipness"
has been caused by the promise and excitement surrounding XML (eXtensible
Markup Language). However, I am not sure if the coming of XML will help
the general-purpose search engine, though it should clearly help specialized,
focused, and Invisible Web engines become much more useful resources.
Why the hesitation?
The general-purpose
engines, as we know and love them today, aim to index every page they can find:
massive amounts of data coming from just about anyone who wants to produce
Web content and put it on a publicly accessible server.
The problem for
implementation of a controlled vocabulary with this material is really
one of creation. Who would create it? Who would maintain it? Who would
do the cataloging? Would entire sites be cataloged at the page level or
only a specific page (the top page)? Who would manage such a project? Where
would the money come from?
Controlled vocabularies
and XML show a great deal of promise for certain types of search engines
because these types of engines can much more easily create and enforce
a set of agreed upon standards. Many issues would need resolution before
we could apply controlled vocabularies to make searching the massive amount
of material on the open Web more effective.
The
Future:
New Tools on
the Way
When you learn
about new search tools and share that knowledge with others, you not only
improve your own searching, but you help to make a better future for all
searchers.
Here are some new
search products that show a lot of promise, a few more potential "quick
hits." With the vulnerability of the Internet industry of late, let's hope
these products survive. Even if the actual companies do not survive, the
technology is still worth knowing about. Have fun!!!
Three New General-Purpose Search Engines: Competition for Google?
A New Image Search Tool
Real-Time Search: Patented technology to search resources updated in real time.
Natural Language Search Technology: This product is getting a lot of attention.
Now let's see
if you've learned your lessons. How long will it take before you've tried
all these new promising sites out? The test clock starts...now!
This
Article Contains Inaccuracies:
Essential Reading
In the time it
takes this article to move from the author to the editor to the publisher
to the printer to you, undoubtedly something mentioned in this article
will have changed. Some feature will have appeared, another vanished. The
working searcher must simply make a policy of staying on top of those changes.
Those of you who
need to keep current on the Web search world should monitor the following
sites as often as possible. All these sites are free and most offer free
e-mail newsletters and updates.
SearchDay
http://www.searchenginewatch.com/searchday/
Written by Chris
Sherman. Daily updates.
Search Engine Watch
http://www.searchenginewatch.com
A resource-rich
site that offers a free monthly newsletter.
Search Engine Showdown
http://www.searchengineshowdown.com
Librarian Greg
Notess's site. Updated on a regular basis. Greg also manages the Search-L
list.
ResearchBuzz
http://www.researchbuzz.com
Written and compiled
by Tara Calishain. Daily updates.
TVC (The Virtual Chase) Alert
http://www.thevirtualchase.com
Written and compiled
by Genie Tyburski. Daily updates.
The Virtual Acquisition Shelf
and News Desk
http://resourceshelf.blogspot.com
Compiled by Gary
Price. Daily updates.
Free Pint
http://www.freepint.com
Fortnightly newsletter
edited by Will Hann. Also offers Web discussion boards.
News Breaks from Info Today
https://www.infotoday.com/newsbreaks/
General information
industry coverage of breaking news that often features news of the Web
search world.
Scope Notes
Before we begin,
we need to get a definition straight, one that I think many of
us have thought about. What does "Web search" mean to the information professional?
In the early days of the Web, it meant exactly what it sounds like: material
found on the open Web.
However, as we
move forward, the term "Web search" has taken on new meanings. Does a Web
search mean using tools like Google or AltaVista to reach "open access" material?
Does it mean using the Web as a vehicle to log on to proprietary databases
such as Factiva or Dialog? Not too long ago, logging into proprietary services
required individual connections to each one. Today, any Web browser with
an Internet connection can reach those services. Perhaps it means both.
This lack of common understanding can confuse, and trying to resolve
the issue is outside the scope of this article.
This article will
primarily focus on the "traditional" Web search, i.e., search engines that
assist in locating open Web content. The approach I have taken is to try
to answer the questions I seem to get, in one form or another, at every
conference, every workshop, and in every day's stack of e-mail messages.
The Never-Ending Amount to
Learn, No Sign of Slowing Down
The single most
difficult issue for the Web searcher to face is the sheer volume and speed
of change on both the Web and the search engines that try to cope with
it. The sense of doom most searchers feel in struggling to keep pace occurs
not because of any lack of intelligence, nor any lack of interest in the
subject — far from it. Most often the cause is the reality of having only
24 hours in a day and the fact that life exists away from the computer.
I monitor what's
going on in the Web search world on a daily basis and it's almost routine
for something new to arrive or for something established to change each
day. For example, at the time of writing this article, AllTheWeb had just
undergone major changes, Google released an image search tool, and WISEnut,
a new general search tool, had come on the scene. When you couple the dynamic
nature of Web searching (both individual pages and entire resources coming
and going) with the need to stay up-to-date with traditional electronic
tools (which undergo plenty of changes as well), print resources, and other
issues of the day (can you say "copyright" or spell "Tasini"!),
there is so much to do and so little time to do it.
A lack of knowledge
and understanding about how a particular search tool works, e.g., a new
way to narrow your search, or ignorance of a more useful tool, e.g., a
new search engine going online, can waste time and produce poor results.
What Should
the Searcher Do?
I realize this
is easier said than done, but Web searchers MUST devote at least 1-2 hours
a week to stay current. This informal "continuing education" is crucial.
Often, the knowledge you gain from these sessions will pay off handsomely
with time saved and better query results in the future. The best way to
learn how a search engine works is by using it. Conducting preemptive
research on a favorite topic makes it easy to spot differences both in
content and in the way results are presented, while at the same time
gathering new resources for your own bookmarks or intranet sites. For a
list of suggested sources to keep you current, see the "Essential Reading"
sidebar.
Is an "Open Web" Search Engine
Always the Place to Begin? What Type of Information Can I Count on Finding
There?
Lately, I have
spent a great deal of time thinking about this issue. As someone who often
gives presentations about Web searching, I have tried to provide session
attendees with lists of what you can and can't find "on the Web" using
a general-purpose Web search tool. Even in the most general sense, my attempts
fall short. A few minutes after beginning, I inevitably realize that one
can't boil down a dynamic universe of data like the Web into bullet points.
Knowing, or better, understanding where to start in this world of information
resources is perhaps the most important thing to know and share.
There is no simple way of doing this. It takes time and commitment. I start
learning about new resources by asking the most basic questions: What is
this database or search engine? What kinds of questions would it
help me answer?
Often the open
Web may not be the place to begin. While it's nice to get quality material
free, how long did it take to get it? Would standing up and walking to
a bookshelf produce a useful answer in a much shorter period of time? Would
a commercial full-text search service scan the decade-long archives of
50 or 100 newspapers in a matter of minutes? At issue are the time and
money it takes to reach your answer.
Even if you choose
the open Web as your target, would a specialized or targeted search engine
more easily find your answer, rather than one of the all-encompassing engines?
Regardless, understanding how each search engine works and the many ways
an engine allows you to limit and control searches will make general-purpose
engines more productive and waste less of your time.
We need to do this
"learning" much the same way we have always "learned" traditional databases
and print resources. Think about how much focus information vendors like
Factiva and Dialog place on training. Unfortunately, Web engine companies
do not offer this kind of training, but the learning process remains crucial.
For me, the best part about being an information professional is the knowledge
of where to find an answer. This is knowledge that non-professionals desire
and makes our already important jobs even more valuable, especially with
so many new databases and new online resources becoming available.
What Should
the Searcher Do?
Consider the
open Web more a directory to answers and less an all-knowing answer
machine. Sometimes, this directory WILL become an authoritative reference
book and provide you with a timely and authoritative answer. Other times
it will assist by providing you with background knowledge that can make
using a fee-based service or a print collection more productive. Don't
forget — shifting from one format to another can be a two-way street. What
you learn from a print or commercial online source can produce an effective
search strategy for the open Web. A Web search engine may also provide
you with specific names of people to contact. Remember, the telephone and
e-mail will always be very important reference resources.
The Quality of Information:
The Biggest Challenge to Web Searching
For this Web searcher,
information quality constitutes the greatest challenge faced as both a
searcher and a teacher. We live in an age when anyone can become a publisher.
All they need is a Web connection, server space, and something to say and/or
share. Once the content goes onto a server and once a crawler finds it,
the Web search engines will make it available to everyone. Within minutes
or days, anyone with Web access can find that information. Amazing! And
frightening!
Once they have
found it, the major challenge to searchers is evaluating content. They
must judge its quality, and often very quickly, using the criteria that
information professionals have always used to evaluate information. How
does one do this? Well, this is the topic of other articles, books, and
dissertations. The most important point is to take a step back, if only
for a second, to ask yourself where this information is coming from and
why it is being placed online. Since anyone can become a publisher with
the Web as a publishing medium, the reputation and background of the site
creator, their qualifications, etc., are crucial. I would strongly recommend
taking a look at the resources our colleague Genie Tyburski makes available
on her site for judging quality [http://www.virtualchase.com/quality/index.html].
Evaluating information
quality, something that our profession has always done, offers another
in-road for sharing our skills with the public. Many who search the Web
take whatever they find to be accurate, current, and worthwhile. As information
professionals, we must protect them, often from themselves.
One more thing.
In my opinion, the challenges that information quality poses for the Web
searcher prove how important it is for our profession to include Web resources
as part of our collection development. We must try to make the Web a more
effective tool for researchers. The Web is a living organism and, unlike
an annual reference book, can change at a moment's notice. In an already
busy workday, finding time to search out Web resources in an organized
manner can be difficult. But all of us need to have an idea of what is
available and where to turn before we actually need the resource to answer
a query. Just knowing a top-level site exists that may contain the answer
will not suffice. We learn our print collections; let's learn our
Web collections and bookmarks.
Easier said than
done? Of course. Still, it remains a goal we should strive to attain.
The Domination of Google
Everyone, including
me, loves Google. How could you not like it? In most cases, it delivers
highly relevant results (though this does not always mean authoritative)
in a short amount of time. When you add in features like Google Cache (a
powerful way to find pages that might have just gone AWOL), you have a
search engine that works and works well.
Google is simple
to use at a basic search level, but still returns good results. This is
why non-professional searchers love it so much. The clean, single box home
page is simple for non-sophisticated searchers to understand. It doesn't
even allow you to directly use all three Boolean operators to return results,
yet it works! Wow! More advanced searchers will be interested to know that
Google uses AND as a default between search terms, permits the use of OR
(it must be in all caps), and can remove a word or phrase if you use a
minus (-) sign.
What I like most
about Google is its quest to improve on what it already has. Google always
seems to be introducing something new and innovative. In February 2001,
it started tracking portable document format (.pdf) material. The general
public may not put a high demand on some of this content, but PDF documents
offer information professionals masses of authoritative content from respected
sources. At the time of writing, Google was still the only general search
engine to make PDF files searchable on a large scale.
What Should
the Searcher Do?
The advanced
searcher must get to know and make use of Google at a more than "put the
words in the box" level. It's very easy. Begin by looking at the Google
Advanced Search page [http://www.google.com/advanced_search.html], and at the same time learn the syntax that
will allow you to limit your searches directly without having to use this
page. To learn more about Google, especially on how it compares to other
search engines, go to Greg Notess's Search Engine Showdown site [http://www.searchengineshowdown.com].
Here's hoping
that Google continues to improve and add new useful features. Here's also
hoping that Google continues to properly separate advertising content from
result sets. Yet with all of Google's wonderful abilities, good searchers
know that they must never make any single Web search engine the only tool
they use. No single engine makes "everything" searchable.
Understanding the Limitations
of General Web Search Tools
No single Web
search tool is the end-all/be-all. In fact, most have limitations that
need careful consideration if you plan to use them regularly or teach others
to use them. What do I mean by limitations? Here are just a few of many
possible examples:
-
Search spiders or
crawlers (the software that brings back material to a database so you can
search it) do not crawl the Web in real time (a toy illustration follows this
list). A page made available on the Web on Thursday could wait weeks before
a crawler reaches it. The major search services are improving turnaround on
recrawling and adding pages, but in general, expect to wait many days before
a keyword search will return a recent page.
-
If a site or page
is not linked to or submitted by someone (Webmaster, page author,
etc.), it will not be accessible from a search engine. Engines primarily
use these two methods of finding out about new sites and pages.
-
Simply because one,
1,000, or even more pages from a site are available does not mean that
the
engine makes every page of an entire site searchable.
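To make the first limitation above concrete, here is a toy Python crawler, deliberately minimal and nothing like a production engine, that discovers pages only by following links from pages it already knows about. Pages that are never linked to (and never submitted) simply never enter its queue.

# A toy illustration of how a crawler discovers pages: it can only follow
# links from pages it already knows about (or URLs someone submits as seeds).
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=5):
    seen, queue = set(), deque([seed_url])
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue  # unreachable pages never make it into the "index"
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            queue.append(urljoin(url, link))
    return seen

print(crawl("http://www.example.com/"))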
What Should
the Searcher Do?
Understand from
the outset that these limitations exist and can affect your search results.
Rely on more than one search engine. Make use of specialty search tools
that often go "deeper" into a site to collect more content. Take advantage
of "Invisible Web" resources. Use Web directories like the Librarians'
Index to the Internet to "mine" specific sites. When you find something
of value, bookmark it.
Using Invisible/Hidden Web
Resources
Over the last
couple of years, the phrase "the Invisible Web" has come into use; others
call it the hidden or deep Web. For the most part, the terms
are synonymous. Searchers need to know about the material in this section
of the open Web. In many cases the material comes from well-known, authoritative
sources, is available at low or no cost, but is not accessible using
a Web search engine.
Resources you interact
with, such as sites where you fill in a set of variables and then have a "custom"
page returned to you, are examples of Invisible Web pages. So is a site
that contains data that you can use for free, but only after you register.
Why don't the search engines access this material? The search spider software
seeking out material to bring back to the database finds nothing to retrieve
in these examples. In the case of the custom page, the material is not
accessible until the user calls for it and the system creates the page
on the fly. In the other example, search spiders from general-purpose Web
search engines do not fill out registration forms. So once the spider hits
a page that requires registration, the spider stops and moves on. None
of the material below that registration interface is searchable from general
engines. One other factor can block search engine access — the "no-robot"
tag. Webmasters can specify that they don't want their sites spidered, and most
of the good, responsible crawlers will respect that request, whether it covers
all or any portion of the content on a Web site. Sometimes Webmasters,
perhaps concerned about possible excessive usage, block the spiders
without fully considering how this decision can eliminate a substantial audience
for material they have taken the time, trouble, and expense to put online.
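To show what honoring the "no-robot" request looks like in practice, here is a brief sketch using Python's standard robots.txt parser. The site and rules are hypothetical examples, but the check itself is what any responsible crawler performs before fetching a page.

# A responsible crawler consults a site's robots.txt before fetching.
# A hypothetical robots.txt might read:
#   User-agent: *
#   Disallow: /members/
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("http://www.example.com/robots.txt")
robots.read()  # fetch and parse the site's robots.txt

for page in ("http://www.example.com/index.html",
             "http://www.example.com/members/archive.html"):
    if robots.can_fetch("*", page):
        print("OK to crawl:", page)
    else:
        print("Skipping (blocked by robots.txt):", page)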
Prime examples
of Invisible Web databases include American FactFinder from the U.S. Census,
most Web-accessible library catalogs, and many of the databases available
via GPO Access.
What Should
the Searcher Do?
Know what is
available before you need it. Of course, this takes time and practice.
We do much the same when becoming aware of the databases from LexisNexis
or Dialog. What makes this even a larger challenge is that there are thousands
of these databases available and, unlike Dialog, no common search syntax.
Use compilations of Invisible Web databases such as the one Chris Sherman
and I have created to support our book [http://www.invisible-web.net].
Conduct Invisible Web collection development. Develop and learn your own
collection. Using the "open Web" to attempt to find something with the
boss breathing down your back is both difficult and inefficient.
One Further
Thought
A great deal of
research and time is devoted to making the information inside these Invisible
Web databases more easily accessible from general-purpose Web search tools
and other resources. The challenge is that many of these Invisible Web
databases offer "custom" interfaces and database tools specifically to
enable interaction with the data. Although the ability to crawl all of
this data is coming and, in some cases, available now, without the proper
limiting tools to harness this information, we could face even worse problems.
We might make already massive uncontrolled databases the size of Google's,
Excite's, or AltaVista's even larger, without the proper mechanisms to
get the data out in a precise manner. In librarian-speak, this translates
into increased recall and lower precision.
Specialized, Focused, and
Site-Specific Search Tools: Important and Necessary
I often get a
bit unsettled when people and companies refer to the Invisible Web. What
many understand as the Invisible Web encompasses content actually visible
to general-purpose engines like Google and AltaVista. What many label as
Invisible, deep, or hidden Web content actually refers to basic HTML material,
easy for the general search engines to index and make accessible. Many
of the databases that are often reported as Invisible Web are actually
just beyond the reach of general Web search engine policies and procedures.
More aggressive and focused or targeted Web crawlers may go where the general
search engines have balked. For example, specialized search engines were
the first to start handling .pdf formatted files.
To penetrate these
resources, users should learn to turn to specialized or focused search
engines, important and effective tools at getting to the best answer possible
on the open Web. Well-known specialized Web search engines include Psychcrawler,
PoliticalInformation.Com, and Inomics.Com, each of which focuses on a specific
subject (psychology, political science, and economics, respectively). Site-specific
engines refer to the search engines that many sites make available to cover
their own material.
The general search
tools can, and often do, crawl material that you can also
find using a specialized, focused, and site-specific search engine. However,
in some cases, the general search engines may not cover this material as
well as the specialized ones. For example, the engines may not crawl the
key sites in a timely manner or at a deep enough level. Bottom line: Coverage
of this material by general search engines like Excite or AllTheWeb may
be spottier than the specialized search tools.
Here are just a
few of the reasons why this problem occurs:
-
Time Lag. Unless
paid for, spiders visit pages unannounced. Material changed or added since
the spider last crawled the content (as much as a month, a quarter, or longer
ago) remains, for all practical purposes, invisible.
News material is a good illustration. A new page on the CNN site is
technically crawlable by any general-purpose engine. However, for some
period after it appears, it will not be searchable through a general search engine.
-
Depth of Crawl.
Simply because a search engine makes one, 10, or 100,000 pages of a site
accessible does not mean that it has crawled the entire site. Some engines
only take a certain amount of material and then move on.
-
Each Search Engine
Database Is Unique. As the work of Greg Notess makes clear, each search
engine database differs. What Google knows about, Excite may not have in
its database. What AltaVista can find, AllTheWeb/Fast may not make accessible.
-
Dead-End Pages.
If a basic HTML page sits on your server and is not linked from
any other page that a search tool already knows about and you don't
submit it, then it will, most likely, not be discovered and crawled. A
site-specific engine can crawl every page sitting on an entire server and
make the page searchable.
Why would you want
to use one of these search engines? Several reasons. Smaller, more targeted
databases make for greater precision, though lower recall. Think about a
world with only one massive Dialog database. Just as you select
the correct database for the specific task there, the same applies to specialized
search engines.
Additionally, these
resources often offer human interaction, with a knowledgeable editor telling
the crawler where to go, how often to return, and how deep to crawl. I
think this job of human database editor will become more and more important
in the future. What a great new career for information professionals!
Finally, some of
these specialized engines, the BBC News engine for example [http://newssearch.bbc.co.uk/ksenglish/query.htm], provide extra functionality, such as constant,
even daily, updating and limiting options for search strategies.
What Should
the Searcher Do?
Check out and
use the good sources that identify and collect specialized and focused
databases. I like Profusion [http://www.profusion.com],
where they are labeled "Invisible Web," and the always reliable and always wonderful
Librarians' Index to the Internet [http://www.lii.org],
which covers a large number of specialized and Invisible Web databases.
Once you have found good tools in your areas of interest, use them and
learn their features in depth.
Using Search Tools on Specific
Sites and Possible Intranet Solutions
This is a simple
idea that I think is often overlooked by searchers. We all know that information
professionals should take full advantage of the special searching features,
such as limiting, and other resources Web search tools offer. However,
the fact that many general-purpose engines (AltaVista, Google, Ultraseek/Inktomi)
are also licensed and available to search specific sites often goes unnoticed
and unused. It shouldn't.
The power searcher
should identify when a specific "site-search" tool is actually the same
software as that of a general-purpose engine. Then we should make use of
the syntax, limiting functions, etc., still available as if the
engine were being used to search the entire Web.
Here are a few
examples to illustrate my point:
The Google engine
and the syntax it offers is used by many sites, including FindLaw LawCrawler
[http://www.lawcrawler.com],
the Energy Information Administration [http://www.eia.doe.gov/],
and IDG.net [http://www.idgnet.com].
Lycos provides the search technology available at USAToday.Com [http://www.usatoday.com].
AltaVista services Macworld.Com [http://www.macworld.com]
and Western Michigan University [http://www.wmich.edu].
UltraSeek technology (now part of Inktomi) is used by CNN [http://www.cnn.com],
and the University of Toronto [http://www.utoronto.ca].
Simply placing
an interface to a well-known proprietary search product on the end user's
desktop will not get them searching well. With so much attention placed
on the power of search tools like Google, AltaVista, and Hotbot, these
products have become synonymous with searching for the general public.
Perhaps the time has come for proprietary information vendors to begin
adapting and incorporating this widely known search software into their products.
This would allow search trainers, to some small degree, not only to share
the intricacies of becoming a more effective Web searcher, but also to apply
those same techniques to in-house proprietary databases.
The lack of standards is a major issue that needs addressing.
That many
Web search tools are also available for licensing as intranet or extranet
engines makes a great deal of sense. Greater standardization of search tools
can reduce the confusion and frustration felt by end users — not to mention,
their trainers.
What Should
the Searcher Do?
Learn more about
the various search engines and their use as possible intranet search solutions.
Start by visiting Avi Rappaport's very useful site [http://www.searchtools.com].
Not only will this resource teach you about the hundreds of different search
tools available, but the knowledge this site offers will also make you a better
searcher.
More Content Coming: The Ability
to Search Audio and Video Material
When it comes
to non-text formats, we already have tools, and shortly will have even more,
to ensure that we can provide our users with the best possible answers. The
ability to search video (e.g., newscasts) and audio (e.g., radio programs)
continues to expand. Material that we would have to wait weeks for in the
past, assuming it ever became available, is now available shortly after
the words are spoken. This material can serve many types of users, including
those in international relations and competitive intelligence. Of course,
archives of this material are also available. In many cases these keyword
databases are created using either voice recognition technology or by capturing
the text from closed captions associated with the broadcast.
Work also continues
on search tools that provide access to video and audio material through
non-text mechanisms. For example, you could search
for a specific color or type of background. An article in Technology
Review provides a good orientation to the topic [http://www.techreview.com/magazine/jul01/upstream.asp]. Much of this research will also be
available for still-image search tools. Currently, such tools, including
those from Google, Fast, and AltaVista, use the text surrounding the image,
i.e., image captions, and additional factors to determine what a still
image is about.
What Should
the Searcher Do?
Become aware
of and familiar with some of the major players in this space.
Virage [http://www.virage.com]
is a leader in the video search arena. In fact, you can keyword search
many of the reports from The NewsHour with Jim Lehrer using Virage technology
at [http://www.pbs.org/newshour/video/index.html]. Other companies of interest include
TVEyes [http://www.tveyes.com],
ShadowTV [http://www.shadowtv.com],
and WordWave [http://www.wordwave.com].
Finally, take a test drive of SpeechBot [http://www.speechbot.com],
a keyword search engine demo from Compaq that uses speech-recognition
technology to create a real-time transcript.
As for image
searches, try these two resources. Webseek allows you to search or browse
for criteria in the image [http://www.ctr.columbia.edu/webseek/].
Visoo uses software that looks for words embedded "inside the image" [http://www.visoo.com].
The Commercialization of Search
Results
This issue has
received a great deal of well-deserved attention lately. It seems to me
that searchers/researchers and the many other people
involved (the engines themselves, the search optimization community,
the advertising community) have different ideas about what the bottom line
is when it comes to Web searches. Don't misunderstand me: the engines
are profit-making businesses, or try to be, so making money is goal number
one. I understand this fact. However, those of us who use the "open Web"
as a research tool want timely and authoritative answers without advertising
or undue influence getting in the way of the best possible answer available.
Can the wants and
needs of the two groups co-exist? Absolutely, but it will take knowledge
and continuing education for both information professionals and end users
to continue to use general-purpose Web search tools as effective resources.
The bottom line here is knowledge of the issues for all parties. Using
the Web effectively without general-purpose search engines would be difficult,
time consuming, and in many cases impossible. This is particularly true
for the professional researcher.
Pay-per-placement
(pay-per-click) allows a person or company to buy a keyword or keywords and
have their listing appear at the top of the results list when that word or words
are searched. GoTo.Com is just one of many examples of this type of search
engine. The extra challenge with GoTo and others is that in addition to
offering searching at GoTo.Com, they also sell their database to other engines
to brand as their own. For example, GoTo.Com "powers" NBCi and Go.Com
(formerly Infoseek). So, if users tell you that NBCi is their engine
of choice, in actuality they are searching GoTo.Com material. Various "flavors"
of this type of branding exist in the Web search world. To get an idea
of how many of these engines are online, check http://www.payperclicksearchengines.com.
Many of the leading
engines have paid-inclusion programs in place
that allow a person or company to pay a fee and make sure that their
site is crawled and included in that particular database. Additionally,
this fee will also make sure that the site is recrawled on a regular basis,
sometimes every week or so. This can mean that searchers may assume a currency
of results, based on retrieval from the paid-inclusion sites, that does not
hold for non-paying sites.
Search optimization
consultants reverse-engineer search engines and relevancy-ranking algorithms
and then use this knowledge to get a client's Web pages higher in a search
result list.
Danny Sullivan,
the editor of Search Engine Watch [http://www.searchenginewatch.com],
covers this and most other parts of the search world on a regular basis
and at great depth. Also, to learn more about search engine optimization
take a look at Rank Write Roundtable [http://www.rankwrite.com].
By the way, keeping current with the search engine optimization discussion
can often provide searchers with deep background about how the engines
work. Again, this makes for a better searcher.
What Should
the Searcher Do?
Understand the
differences among search engines, become familiar with the terminology,
and share this knowledge with others.
In the case
of more "traditional" engines, be aware of how commercial material is labeled
and where it is placed. For example, AltaVista offers "partner listings"
at the top and bottom of a results list. Excite uses the term "sponsored
link." Hotbot places "products and services" at the top of the results
list.
At the time
of writing, Google does not offer a paid inclusion program. However, Google
will allow the purchase of keyword(s) and a link to a corresponding URL
to appear away from the ranked results list, labeled as a sponsored link
inside a colored box.
Meta-Search Tools: Problems
and Challenges
I have never been
a fan of meta-search engines. These tools simultaneously send your search
request to many engines. Why don't I like them? Several reasons. One, meta-search
engines often do not allow you to use the underlying engines in more than a basic
mode, leading to high recall but very poor precision. Equally important,
especially in the last couple of years, is the fact that many of the most
well-known meta-engines send a query to many entirely "pay for placement"
engines. A May 2001 Danny Sullivan report [http://searchenginewatch.com/sereport/01/05-metasearch.html] provides a clear view of this issue.
For example, the popular Dogpile meta-search engine sends a query to 15
engines, six of them entirely pay for placement. I think most researchers
using the Web would be disappointed by the results they receive and the
time they have wasted.
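For readers curious about the mechanics, here is a rough Python sketch of the fan-out idea behind a meta-search engine: one query sent to several engines at once, with the responses gathered for later merging. The engine addresses and parameter names are placeholders I made up, not any engine's real interface, and the all-important merging and de-duplication steps are left out.

# A rough sketch of meta-search fan-out: send one query to several engines at
# the same time and collect whatever comes back. Endpoints are placeholders.
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlencode
from urllib.request import urlopen

ENGINES = {
    "engine-a": "http://search-a.example.com/search?",
    "engine-b": "http://search-b.example.com/find?",
    "engine-c": "http://search-c.example.com/q?",
}

def query_engine(name, base_url, terms):
    url = base_url + urlencode({"q": terms})
    try:
        return name, urlopen(url, timeout=10).read()
    except Exception as exc:
        return name, exc  # one slow or dead engine should not sink the search

def metasearch(terms):
    with ThreadPoolExecutor(max_workers=len(ENGINES)) as pool:
        futures = [pool.submit(query_engine, n, u, terms) for n, u in ENGINES.items()]
        return dict(f.result() for f in futures)

results = metasearch("controlled vocabularies")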
What Should
the Searcher Do?
First, inform
other searchers, especially end users who think they are "getting it all"
by using a meta-search engine. Information professionals should take advantage
of the "power" or "advanced" mode most general engines offer, such as limiting
to a specific domain or word in the URL.
One More
Thing
In the spirit of
something for everyone, phone the neighbors and wake the children; I will
mention one meta-engine that I do like and use: Hello Vivisimo! [http://www.vivisimo.com].
So why do I like it? A few reasons.
-
It does not send your
query to any 100 percent pay-for-placement engines.
-
It does a reasonable
job of allowing you to use some advanced syntax.
-
The "advanced interface"
allows for several customization features.
-
It has some duplicate
removal capabilities.
-
Vivisimo effectively
clusters results into hierarchical sets of categories on the fly.
-
Users have the option
of previewing a page directly from a result list.
-
Vivisimo searches
several news databases and other search sites (e.g., Medline, USPTO, FirstGov.Gov)
and still takes advantage of its clustering process. This can be particularly
useful for basic searchers who only enter a few keywords and do not search
with limits. Using Vivisimo, they can at least take advantage of the categories,
which will hopefully assist them in reaching the answer they want quickly.
Where Have All the Pages
Gone?
Searching for
older material is a challenge, often an impossible one. The issue is as
old as Web searching and occurs not only in the Web search world, but in
many other areas of digital data. Currently, when most Web pages are removed
from a site, they are gone for good unless you can personally contact the
Webmaster, who can send you a copy. Luckily, many people are thinking about
and working on solving this problem. One example is the work done by OCLC and
RLG (Research Libraries Group) to develop standards and methods for archiving
older material. The National Archives and other government agencies are
doing similar work. NARA's Clinton Presidential Materials Archive [http://www.clinton.nara.gov/index.html]
is an early effort to store Web resources from a presidential administration.
Alexa Research
[http://www.alexa.com]
offers one of the earliest and most distinctive archiving efforts, the Alexa
Archive of the Web. Brewster Kahle's project makes snapshots of the Web,
archiving everything in sight. Alexa Research carries over 18 terabytes
of data covering some 5 million Web sites and some 1.9 billion pages. If
the site has preserved an archived copy of a page, the page appears in blue and
you can click to view it. If the site records a page but has no archive
for it, the link appears greyed out with the tag "Page not in Archive."
One subset of the Alexa archiving covers some 87 million
pages of material from the Election 2000 Presidential campaign [http://archive.alexa.com/].
What Should
the Searcher Do?
Long term? Become
aware of the research and projects going on in this area. Offer comments
and suggestions on how to make this material more accessible and searchable.
A great archive of quality content without the proper mechanism to access
it is not great.
Short term?
Take advantage of the Google cache feature — another "Google only" resource.
Each time the Google crawler comes around to crawl a Web page, it makes
a copy (unless told not to by the Web site owners) and places it on the
Google server. Therefore, if you search for a page using Google and then
click and find the page has been removed, return to the search results
page and look for the link, next to the URL, that says, "cached." Caveat:
The cache is a dynamic entity. A page does not stay in the Google cache
in perpetuity. It is only available from the cache until the next time
the crawler visits the page and identifies that it has gone. For more about
the Google cache, go to http://www.google.com/help/features.html#cached.
Of course, another
option is to either print out or save a copy of a page. This can be both
time consuming and a waste of paper or hard drive space. I use the SaveThis
[http://www.savethis.com]
service, which allows you to copy any Web page, save it on the server, and
access it from any Web browser. This free resource is well worth a look.
I Still Can't Find...
General, Invisible Web,
and specialized search tools still leave plenty of material out of reach.
So many types of resources to explain, so many places to search! Your boss
says that last night he or she was at home "searching" the Web for an article
from Newsweek. He or she went to AltaVista, Google, and Yahoo! and
came up empty.
"These search engines
don't contain 'everything,'" you tell your boss. However, often searching
other databases, you can access and purchase articles you need. You explain
that resources like Northern Light's Special Collection, Electric Library,
or using dowjones.com (a free site) to access and purchase an individual
article from Factiva's Publication Library are all possibilities. You go
on to tell him or her that your library also makes numerous databases available
to them through subscription licenses, databases they can access from home.
The boss says,
"Wow, I had no idea all of this material was available." On a roll, you
also suggest that the boss check with the local public library, which you
happen to know also offers access to many fee-based services. "Your tax
dollars at work," you say.
Finally, you tell
your boss, much of where you search is determined by what you need. In
some cases what you need can be found, for free, using Google or Excite,
but if you don't find it, you should know where to turn next. In some cases,
starting with Google or Excite might not be the best idea. There is still
plenty of content not digitized that may require a trip to a library with
a print or microfilm collection containing the document they need.
What Should
the Searcher Do?
You tell them.
Gary
Price's e-mail address is gprice@invisible-web.net.