On The Net
Unusual Power Web Searching Commands
By Greg R. Notess
Reference Librarian, Montana State University
Web searching has
changed the nature of online. What online searcher, even a decade ago, would
have thought that hundreds of millions of online searches would be sent each
day? Who is an online searcher these days? It can be almost anyone: parents,
children, plumbers, waiters. The new style of online searching, at least as
defined by the majority of Web searchers, is to put a few query words into
Google, Yahoo!, or MSN. With the quantity of information resources on the Web
these days and the ease of finding certain kinds of popular information, this
simple technique is often effective.
If you are a searcher who prefers this new style of online searching, and
you are always happy with the results, read no further. This column takes a
look at some of the more esoteric and unusual of the advanced search commands
from the search engines. These are tools that you may never use or may only
try once in a year of searching. But for those who like to dig deeper, who
seek the hard-to-find information nuggets, using these commands can help you
mine the Internet for information in ways that others will never consider.
FEATURE CHURN
Some of these more unusual features come and go at a rather rapid rate. And
with Yahoo!'s impending acquisition of Overture, which now owns both AlltheWeb
and AltaVista (remember Yahoo! already owns Inktomi), some of these commands
could disappear as well. Overture has already said that it plans to merge the
underlying database from AlltheWeb and AltaVista, relying on the AlltheWeb
technology for crawling and indexing the Web and on AltaVista for the search
interface. How that may change those search engines remains to be seen.
The merger may cause some search options to disappear by the time this is
published or soon after. So use the advanced features while you can.
Search engines certainly look at their usage logs. The percentage of searches
that use any of these commands is quite small. Using them once in a while can
help let the search engines know that these command options are still used
and valued.
With the ever-changing nature of so many Internet tools, it should be no
surprise that some features disappear. Take for example Google's cache, a very
useful feature when encountering a dead link or changed content on a page.
But long ago, Google removed from the top of the page the date it retrieved
the cached page or the date stamp reported at that time. So, the advanced user
now must look for internal clues to those dates.
Alternatively, the little-known Comet Web Search could be used to view those
dates. It used Google's database, and, unlike several other Google partners,
it included the Google cache pages. But unlike Google, it displayed both of
those dates on the top of the cache page. Alas, that feature vanished earlier
this year when Comet Web Search switched from the Google database to the FAST
database, which does not have any cached pages. While the caching dates are
displayed at Gigablast, that is only for the Gigablast cache and not for Google's.
ALLTHEWEB SITE SEARCHING
AlltheWeb has used its own, sometimes strange, syntax for field searches.
The url.tld: and url.domain: field searches were both unusual and hard to remember.
With the introduction of site: in September 2002, AlltheWeb moved more into
the mainstream of search engine syntax and offered a field label of site: that
restricts results to only the designated site. Google, Gigablast, Teoma, and
the Inktomi search engines also use site: while it is host: at AltaVista. So
to search AlltheWeb for Hubble on the NASA site, the syntax would be
hubble site:nasa.gov
The advanced search forms give the same search option without the need to
use the site: prefix, but for searchers who know the syntax, it is usually
simpler just to type it in directly.
AlltheWeb goes even further than just a basic site: field search capability.
It also has two special operators that can be used with the site: prefix, the
caret (^) and the asterisk (*). The ^ is an anchor, while the * fulfills
its usual function of being a truncation symbol. AlltheWeb still does not have
truncation for query words, but within the site: specification, it does.
Here is how it works. The anchor ^ and the truncation symbol * can
appear at either end of the portion of the host address used with the site:
command. The anchor ^ used in the beginning means that nothing else can come
before it, and used at the end, it means that nothing else should follow. On
the flip side, the truncation sign * at the beginning means that additional
pieces can come before it, and used at the end, it means that additional pieces
can follow. In both cases, the symbols apply not at the character level but
to whole pieces of the host name, that is, the character strings separated by periods.
So the five "pieces" of www.subdir.dir.host.com are www, subdir, dir, host,
and com.
Domain names often are duplicative. While nasa.gov covers most of NASA, it
is used by more than one Web site. The Ames Research Center can be at arc.nasa.gov,
while the Jet Propulsion Lab can be jpl.nasa.gov. To search all of the nasa.gov
subsidiary sites, the site:nasa.gov command works, because the default is to
have the end anchored but not the beginning. The more fully written-out
version, site:*nasa.gov^, would work the same way.
The anchor operator helps if you want to search only nasa.gov itself and not
sites like jpl.nasa.gov. In that case, use site:^nasa.gov. Since the
default is to end with an anchor, there is no need to write it as site:^nasa.gov^.
Just remember that it will also exclude www.nasa.gov.
Then there are the other country domains that can occur on top of a standard
U.S. top-level domain. For example, total.com and total.com.au are two completely
different sites.
To search both as well as other country code top-level domains, use site:total.com*.
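The anchoring rules described above can be modeled in a few lines of Python. This is only a sketch of the behavior as this column describes it (beginning open and end anchored by default, ^ as an anchor, * as truncation, matching on whole period-separated pieces). The function names are hypothetical, and nothing here reflects how AlltheWeb actually implements the feature.

```python
def parse_site_pattern(pattern):
    """Split a site: value into (pieces, start_anchored, end_anchored).

    Assumed defaults, taken from the description above: the beginning is
    open unless it starts with ^, and the end is anchored unless it ends
    with *.
    """
    start_anchored = pattern.startswith("^")
    end_anchored = not pattern.endswith("*")
    core = pattern.strip("^*").strip(".")
    return core.lower().split("."), start_anchored, end_anchored


def site_matches(pattern, host):
    """Return True if host satisfies the site: pattern, piece by piece."""
    pieces, start_anchored, end_anchored = parse_site_pattern(pattern)
    host_pieces = host.lower().split(".")
    n = len(pieces)
    for i in range(len(host_pieces) - n + 1):
        if host_pieces[i:i + n] != pieces:
            continue
        if start_anchored and i != 0:
            continue  # something comes before the anchored beginning
        if end_anchored and i + n != len(host_pieces):
            continue  # something follows the anchored end
        return True
    return False


if __name__ == "__main__":
    for pattern, host in [
        ("nasa.gov", "jpl.nasa.gov"),
        ("^nasa.gov", "jpl.nasa.gov"),
        ("^nasa.gov", "www.nasa.gov"),
        ("total.com", "total.com.au"),
        ("total.com*", "total.com.au"),
    ]:
        print(pattern, host, site_matches(pattern, host))
```

Under this model, site:nasa.gov accepts jpl.nasa.gov, site:^nasa.gov rejects both jpl.nasa.gov and www.nasa.gov, and only the trailing * lets total.com.au through.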
WILD CARD WORD WITHIN A PHRASE
Standard truncation can be a very useful search tool to retrieve variants
on the stem of a search term without ORing the variants themselves. Currently
it is available only on AltaVista, where the truncation symbol (sometimes called
a wild card) is the asterisk (*). It is disappointing that truncation is not
available at Google, AlltheWeb, and other search engines.
But there is a bit of an exception. Both Google and AltaVista do offer something
that I call the "Wild Card Word Within a Phrase." Both engines even use the
same syntax. This only works within a phrase search (multiple words in an exact
sequence designated in the query with quotation marks [""]). The * is
an operator that represents any single word in that exact position. As with
AlltheWeb's special site: operators above, the * is not character-based
truncation but represents a whole word.
Searching to find a quotation is an easy example. Trying to find "a little
neglect may breed mischief" when you are not sure of the second to last word?
Just search "a little neglect may * mischief". If even fewer words are
known, use multiple asterisks as in "a little * * * mischief".
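To make the "one asterisk, one whole word" behavior concrete, here is a small Python sketch that turns such a phrase into a regular expression and tests it against some text. It is only an illustration of the matching rule described here, not of how Google or AltaVista process the query. The function name is hypothetical, and treating a "word" as a run of letters or digits is a simplifying assumption.

```python
import re


def phrase_to_regex(phrase):
    """Build a regex in which each * stands for exactly one whole word."""
    parts = []
    for word in phrase.split():
        if word == "*":
            parts.append(r"\w+")           # any single word
        else:
            parts.append(re.escape(word))  # a literal word
    # Words must appear in this exact order, separated by non-word characters.
    return re.compile(r"\b" + r"\W+".join(parts) + r"\b", re.IGNORECASE)


if __name__ == "__main__":
    text = "A little neglect may breed mischief."
    print(bool(phrase_to_regex("a little neglect may * mischief").search(text)))
    print(bool(phrase_to_regex("a little * * * mischief").search(text)))
    print(bool(phrase_to_regex("a little * mischief").search(text)))
```

The first two patterns match the quotation. The third does not, because a single asterisk cannot stand in for three words.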
There are several ways in which this feature is useful beyond quotation searching.
It can be used to find more sophisticated cases of plagiarism or intellectual
property theft. For advertisers, it helps to check on the variations of a new
advertising slogan that might already be in use. With rather unique phrases,
it can be used to replace regular truncation for a word. Have a word within
a phrase where you want plural, singular, and misspellings? Just replace the
whole word with the *. If there are some other phrases that have a whole
different word in that place, use the - in front of that alternative for a
NOT function.
For example, searching for matches on pinot grigio grape (which is also known
as pinot gris) as "pinot * grape" will also find pinot noir grape and
others. In that case, search again with
"pinot * grape" -"pinot noir grape" -"pinot blanc grape"
Just bear in mind that Google limits queries to 10 search words, so use AltaVista
if you need more than that. Fortunately, Google does not count the * as
a search term, so the 10-term limit refers only to real words in the query.
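A quick way to stay under that limit is to count the real words before sending the query. The Python sketch below builds a wild card phrase query with excluded phrases and counts its terms, skipping each asterisk. Both function names are hypothetical, and the sketch assumes that words inside excluded phrases count toward the limit, which this column does not confirm.

```python
def build_query(phrase, excluded_phrases=()):
    """Quote the phrase and append each unwanted phrase with a leading minus."""
    parts = ['"' + phrase + '"']
    parts += ['-"' + p + '"' for p in excluded_phrases]
    return " ".join(parts)


def count_real_terms(query):
    """Count the words in a query, ignoring quotes, bare minus signs, and *."""
    words = query.replace('"', " ").split()
    return sum(1 for w in words if w.strip("-") not in ("", "*"))


if __name__ == "__main__":
    q = build_query("pinot * grape", ["pinot noir grape", "pinot blanc grape"])
    print(q)
    print(count_real_terms(q))
```

For the pinot example, the count comes to eight real words, so two more exclusion words would still fit within a 10-word limit under these assumptions.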
PROXIMITY SEARCHING
All the main search engines have phrase searching, which uses exact proximity.
But when you want to be able to search for terms that are close to one another,
but not exactly next to each other, you have few choices. AltaVista is the
only major search engine with a documented proximity operator, NEAR, which
retrieves hits when terms are within 10 words of each other. I wrote briefly
before about AltaVista's undocumented proximity operators in my January 2002 "Internet
Search Engine Update" in ONLINE [www.infotoday.com/online/jan02/SearchEngineUpdate.htm].
Beyond NEAR, AltaVista lets you specify exactly how close or far apart the
terms should be. This only works in the advanced search in the Boolean box.
There, to specify proximity other than 10, use the "WITHIN" operator followed
by a space and the number. Alternatively, the symbol for "WITHIN" is a double
tilde: ~~. So, to search for "embossing" within five words of "tape," use
"embossing" within 5 "tape"
or
"embossing" ~~ 5 "tape"
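For readers who want to check a retrieved page against such a proximity condition, the following Python sketch tests whether two words fall within a given number of words of each other. It assumes that distance is the difference between word positions and that a word is a run of letters or digits. AltaVista's exact counting rules are not documented here, so treat this as an approximation with a hypothetical function name.

```python
import re


def within(text, a, b, n=10):
    """Return True if words a and b occur within n word positions of each other."""
    words = re.findall(r"\w+", text.lower())
    positions_a = [i for i, w in enumerate(words) if w == a.lower()]
    positions_b = [i for i, w in enumerate(words) if w == b.lower()]
    return any(abs(i - j) <= n for i in positions_a for j in positions_b)


if __name__ == "__main__":
    sample = "Use embossing on the label tape here"
    print(within(sample, "embossing", "tape", 5))
    print(within(sample, "embossing", "tape", 3))
```

The default of 10 mirrors the NEAR operator, while passing a smaller or larger number mimics WITHIN.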
PROXIMITY ON GOOGLE
Of course, the problem with using an advanced technique like proximity to
target very specific information is that its effectiveness depends on the scope
of the underlying database. With Google's database so much larger than AltaVista's,
what we need is proximity searching (and truncation for that matter) at Google.
While Google does not have any officially supported proximity searching,
you can use the wild card word within a phrase function with some creative
permutations to mimic a certain level of proximity searching. Fortunately this
is made even easier by the Google API Proximity Search (GAPS) tool [www.stagernation.com/cgi-bin/gaps.cgi].
GAPS provides proximity searching at Google at a distance of up to three words.
Using Google's API interface, GAPS basically runs multiple iterations of the
wild card word in a phrase to accomplish the proximity search. In other words,
searching for a within two words of b runs four searches:
"a * b"
"a * * b"
"b * a"
"b * * a"
Then the results are combined with an OR relationship. For three-word proximity,
GAPS runs six searches. Since the four-word proximity requires 24 and five-word,
120, you can see why the GAPS site has the three-word limit. Fortunately, for
tech-savvy searchers who'd like to try larger proximity searches, the source
code for the GAPS approach can be downloaded, installed, and run on your own
site. However, due to the limitation of 1,000 searches per day through Google's
API interface, only proximity of up to six terms could be done with this approach.
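The general idea of expanding one proximity request into several wild card phrase searches can be sketched in Python. This is not the GAPS source code. It is a minimal illustration that assumes "within n words" means one to n wild card words (*) between two terms, in either order, and the function name is hypothetical. It models only the two-term case and does not attempt to reproduce the larger search counts quoted above.

```python
def proximity_phrases(a, b, distance):
    """List the quoted phrase queries for two terms up to `distance` words apart."""
    phrases = []
    for first, second in ((a, b), (b, a)):        # both word orders
        for gap in range(1, distance + 1):        # 1..distance wild card words
            words = [first] + ["*"] * gap + [second]
            phrases.append('"' + " ".join(words) + '"')
    return phrases


if __name__ == "__main__":
    for phrase in proximity_phrases("a", "b", 2):
        print(phrase)
    print(len(proximity_phrases("a", "b", 3)))
```

Each phrase would be run as its own search, with the result lists then merged as an OR. A distance of two yields four phrases and a distance of three yields six.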
MEDIA TYPE OR PAGE CONTENT
Of the many advanced features I have explored over the years, at first glance
several seem to have no obvious uses. Embedded content was one of these. Available
for many years at Inktomi-powered search engines like HotBot and MSN Search,
and more recently offered at AlltheWeb, the page content limit searches just
for Web pages that contain specific kinds of files such as video, audio, JavaScript,
Shockwave, or RealAudio.
For all three of these search engines, these limits are readily available
on the advanced search page. HotBot and MSN limit results to pages that either
have a link to such files or have them embedded on the page itself. AlltheWeb
just limits to embedded content, but then it has a separate limit for searching
just specific file types.
Why use such a limit? I saw no obvious applications at first. But then I
was looking for examples of library tutorials that used Flash or Shockwave.
Just searching for library while adding that appropriate page content limit
effectively targeted the kinds of results I needed. The scripting limits are
quite useful when you're searching for examples of certain kinds of scripts.
And the audio limit proved quite helpful when searching for the song of a specific
bird. Adding the audio limit to the common or taxonomic name of the bird brought
up far more examples than a single Audubon recording could offer.
FUTURE POTENTIAL
Unfortunately, all of these power-searching commands and functions are subject
to the general inconsistencies that are so common among the search engines.
Remember that the search engines are designed first and foremost for the general
searcher, the one who enters a few words and hopes to get one moderately good
answer, or at least an interesting site that will distract from the original
information request.
So the search engine engineers can let useful advanced commands lapse (as
in the summer of 2003 when Google's intitle: and inurl: were not working properly
for several months). But if really popular search terms start returning really
poor results in the top 10 positions, the engines will respond much more quickly.
The good news is that advanced features are still being offered and introduced.
In August, Google added a new synonym operator of a tilde (~), which can be
used right before a search term, with no space, to tell Google to look for
synonyms of that word. AlltheWeb, and then Google, introduced calculator and
conversion functions. Gigablast added indexing for non-HTML file types.
The features and capabilities of the search engines go through regular churn.
By learning what is available and which offer special capabilities, we can
be prepared to use these unusual advanced features to retrieve the information
we need. And then we just hope that the features remain available for next
time.
Greg R. Notess (greg@notess.com; www.notess.com) is a reference librarian
at Montana State University and founder of SearchEngineShowdown.com.
Comments? Email the editor at marydee@infotoday.com.