On The Net
Unusual Power Web Searching Commands
By Greg R. Notess
Reference Librarian, Montana State University
  Web searching has
  changed the nature of online. What online searcher, even a decade ago, would
  have thought that hundreds of millions of online searches would be sent each
  day? Who is an online searcher these days? It can be almost anyone: parents,
  children, plumbers, waiters. The new style of online searching, at least as
  defined by the majority of Web searchers, is to put a few query words into
  Google, Yahoo!, or MSN. With the quantity of information resources on the Web
  these days and the ease of finding certain kinds of popular information, this
  simple technique is often effective.
 If you are a searcher who prefers this new style of online searching, and
  you are always happy with the results, read no further. This column takes a
  look at some of the more esoteric and unusual of the advanced search commands
  from the search engines. These are tools that you may never use or may only
  try once in a year of searching. But for those who like to dig deeper, who
  seek the hard-to-find information nuggets, using these commands can help you
  mine the Internet for information in ways that others will never consider.
  FEATURE CHURN
  Some of these more unusual features come and go at a rather rapid rate. And
  with Yahoo!'s impending acquisition of Overture, which now owns both AlltheWeb
  and AltaVista (remember Yahoo! already owns Inktomi), some of these commands
  could disappear as well. Overture has already said that it plans to merge the
  underlying database from AlltheWeb and AltaVista, relying on the AlltheWeb
  technology for crawling and indexing the Web and on AltaVista for the search
  interface. How that may change those search engines remains to be seen.
  The merger may cause the disappearance of some search options by the time
  or soon after this is published. So use the advanced features while you can.
  Search engines certainly look at their usage logs. The percentage of searches
  that use any of these commands is quite small. Using them once in a while can
  help let the search engines know that these command options are still used
  and valued.
  With the ever-changing nature of so many Internet tools, it should be no
  surprise that some features disappear. Take for example Google's cache, a very
  useful feature when encountering a dead link or changed content on a page.
  But long ago, Google removed from the top of the page the date it retrieved
  the cached page or the date stamp reported at that time. So, the advanced user
  now must look for internal clues to those dates.
  Alternatively, the little-known Comet Web Search could be used to view those
  dates. It used Google's database, and, unlike several other Google partners,
  it included the Google cache pages. But unlike Google, it displayed both of
  those dates on the top of the cache page. Alas, that feature vanished earlier
  this year when Comet Web Search switched from the Google database to the FAST
  database, which does not have any cached pages. While the caching dates are
  displayed at Gigablast, that is only for the Gigablast cache and not for Google's.
  ALLTHEWEB SITE SEARCHING 
  AlltheWeb has used its own, sometimes strange, syntax for field searches.
  The url.tld: and url.domain: field searches were both unusual and hard to remember.
  With the introduction of site: in September 2002, AlltheWeb moved more into
  the mainstream of search engine syntax and offered a field label of site: that
  restricts results to only the designated site. Google, Gigablast, Teoma, and
  the Inktomi search engines also use site: while it is host: at AltaVista. So
  to search AlltheWeb for Hubble on the NASA site, the syntax would be
  hubble site:nasa.gov
  The advanced search forms give the same search option without the need to
  use the site: prefix, but for searchers who know the syntax, it is usually
  simpler just to type it in directly.
  AlltheWeb goes even further than just a basic site: field search capability.
  It also has two special operators that can be used with the site: prefix, namely the
  caret (^) and the asterisk (*). The ^ is an anchor, while the * fulfills
  its usual function of being a truncation symbol. AlltheWeb still does not have
  truncation for query words, but within the site: specification, it does.
  Here is how it works. The anchor ^ and the truncation * symbols can
  appear at either end of the portion of the host address used with the site:
  command. The anchor ^ used in the beginning means that nothing else can come
  before it, and used at the end, it means that nothing else should follow. On
  the flip side, the truncation * sign at the beginning means that additional
  pieces can come before it, and used at the end, it means that additional pieces
  can follow. In both cases, the symbols apply not at the character level but
  for pieces of a URL that make up the character strings separated by periods.
  So the five "pieces" of www.subdir.dir.host.com are www, subdir, dir, host,
  and com.
  Domain names often are duplicative. While nasa.gov covers most of NASA, it
  is used by more than one Web site. The Ames Research Center can be at arc.nasa.gov,
  while the Jet Propulsion Lab can be jpl.nasa.gov. To search all of nasa.gov
  subsidiary sites, the site:nasa.gov command works, because the default is to
  have the end anchored but not the beginning, just as the more fully written-out
  version site:*nasa.gov^ would.
  The anchor operator helps if you only want to search the nasa.gov site itself and not
  sites like jpl.nasa.gov. In that case, use site:^nasa.gov. Since the
  default is to end with an anchor, there is no need to write it as site:^nasa.gov^.
  Just remember that it will also exclude www.nasa.gov.
  Then there are the other country domains that can occur on top of a standard
  U.S. top-level domain. For example, total.com and total.com.au are two completely
  different sites.
  To search both as well as other country code top-level domains, use site:total.com*.
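  The anchor and truncation rules can be easier to follow in code. The Python sketch below is only a model of the behavior described in this section, not AlltheWeb's actual implementation. The function name site_matches and its handling of edge cases are assumptions made for illustration. It compares whole pieces between periods, never individual characters, and applies the stated defaults (end anchored, beginning open).

```python
def site_matches(pattern: str, host: str) -> bool:
    """Hypothetical model of the site: operators described in the column."""
    # Defaults per the text: the end is anchored, the beginning is not.
    open_start, open_end = True, False

    # A leading ^ anchors the beginning; a leading * leaves it open.
    if pattern.startswith("^"):
        open_start, pattern = False, pattern[1:]
    elif pattern.startswith("*"):
        open_start, pattern = True, pattern[1:]

    # A trailing ^ anchors the end; a trailing * allows more pieces to follow.
    if pattern.endswith("^"):
        open_end, pattern = False, pattern[:-1]
    elif pattern.endswith("*"):
        open_end, pattern = True, pattern[:-1]

    # Work on whole pieces separated by periods, not on characters.
    want = pattern.strip(".").lower().split(".")
    have = host.lower().split(".")
    n = len(want)

    # Try each position where the pattern pieces could sit inside the host.
    for start in range(len(have) - n + 1):
        if have[start:start + n] != want:
            continue
        has_prefix = start > 0
        has_suffix = start + n < len(have)
        if has_prefix and not open_start:
            continue
        if has_suffix and not open_end:
            continue
        return True
    return False


for pattern, host in [
    ("nasa.gov", "jpl.nasa.gov"),
    ("^nasa.gov", "www.nasa.gov"),
    ("total.com", "total.com.au"),
    ("total.com*", "total.com.au"),
]:
    print(pattern, host, site_matches(pattern, host))
```

  Under this model, site:nasa.gov accepts jpl.nasa.gov, site:^nasa.gov rejects both jpl.nasa.gov and www.nasa.gov, and only the version with a trailing asterisk reaches total.com.au.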
  WILD CARD WORD WITHIN A PHRASE
  Standard truncation can be a very useful search tool to retrieve variants
  on the stem of a search term without ORing the variants themselves. It is currently
  available only on AltaVista, where the truncation symbol (sometimes called a wild
  card) is the asterisk (*). It is disappointing that truncation is not available
  at Google, AlltheWeb, and other search engines.
  But there is a bit of an exception. Both Google and AltaVista do offer something
  that I call the "Wild Card Word Within a Phrase." Both engines even use the
  same syntax. This only works within a phrase search (multiple words in an exact
  sequence designated in the query with quotation marks [""]). The * is
  an operator that represents any single word in that exact position. Like in
  AlltheWeb's special site: operators above, the * is not character-based
  truncation but represents a whole word.
  Searching to find a quotation is an easy example. Trying to find "a little
  neglect may breed mischief" when you are not sure of the second to last word?
  Just search "a little neglect may * mischief". If even fewer words are
  known, use multiple asterisks as in "a little * * * mischief".
  There are several ways in which this feature is useful beyond quotation searching.
  It can be used to find more sophisticated cases of plagiarism or intellectual
  property theft. For advertisers, it helps to check on the variations of a new
  advertising slogan that might already be in use. With rather unique phrases,
  it can be used to replace regular truncation for a word. Have a word within
  a phrase where you want plural, singular, and misspellings? Just replace the
  whole word with the *. If there are some other phrases that have a whole
  different word in that place, use the - in front of that alternative for a
  NOT function.
  For example, searching for matches on pinot grigio grape (which is also known
  as pinot gris) as "pinot * grape" will also find pinot noir grape and
  others. In that case, search again with
  "pinot * grape" -"pinot noir grape" -"pinot blanc grape"
  Just bear in mind that Google limits queries to 10 search words, so use AltaVista
  if you would have more than that. Fortunately, Google does not count the * as
  a search term, so the 10-term limit refers only to real words in the query.
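  For searchers who build such queries often, the pattern can be sketched in a few lines of Python. This is an illustrative helper, not part of any search engine's tools. The function names build_query and count_real_words are hypothetical, and the sketch assumes that words inside excluded phrases count toward the 10-word limit, which the engines do not document.

```python
from typing import Sequence


def build_query(phrase: str, exclude: Sequence[str] = ()) -> str:
    """Quote the wild card phrase and append each unwanted variant as -"..."."""
    parts = ['"%s"' % phrase]
    parts += ['-"%s"' % alt for alt in exclude]
    return " ".join(parts)


def count_real_words(query: str) -> int:
    """Count the words in a query, skipping the * wild cards."""
    count = 0
    for token in query.split():
        # Drop the quotation marks and the leading minus sign.
        word = token.strip('-"')
        if word and word != "*":
            count += 1
    return count


query = build_query("pinot * grape", ["pinot noir grape", "pinot blanc grape"])
print(query)
print(count_real_words(query))
```

  Checking the count before searching shows how many more exclusions would fit before a query outgrows Google and needs to move to AltaVista.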
  PROXIMITY SEARCHING
  All the main search engines have phrase searching, which uses exact proximity.
  But when you want to be able to search for terms that are close to one another,
  but not exactly next to each other, you have few choices. AltaVista is the
  only major search engine with a documented proximity operator, NEAR, which
  retrieves hits when terms are within 10 words of each other. I wrote briefly
  before about AltaVista's undocumented proximity operators in my January 2002 "Internet
  Search Engine Update" in ONLINE [www.infotoday.com/online/jan02/SearchEngineUpdate.htm].
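  What a NEAR match means can be illustrated on a local piece of text. The Python sketch below is not AltaVista's implementation. The function name near is hypothetical, and it assumes that "within 10 words" means the word positions of the two terms differ by no more than 10, in either order. How the engine itself counts is not documented here.

```python
import re


def near(text: str, a: str, b: str, window: int = 10) -> bool:
    """Return True if a and b occur within `window` word positions of each other."""
    # Split the text into lowercase words; punctuation is ignored.
    words = re.findall(r"[\w']+", text.lower())
    positions_a = [i for i, w in enumerate(words) if w == a.lower()]
    positions_b = [i for i, w in enumerate(words) if w == b.lower()]
    # Either order counts, so compare absolute distances.
    return any(abs(i - j) <= window for i in positions_a for j in positions_b)


sample = "Embossing a label onto vinyl tape is easy"
print(near(sample, "embossing", "tape"))
print(near(sample, "embossing", "tape", window=4))
```

  The window parameter is there only to show how a narrower or wider distance changes which passages qualify.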
  Beyond NEAR, AltaVista lets you specify exactly how close or far apart the
  terms should be. This only works in the advanced search in the Boolean box.
  There, to specify proximity other than 10, use the "WITHIN" operator followed
  by a space and the number. Alternatively, the symbol for "WITHIN" is a double
  tilde: ~~. So, to search for "embossing" within five words of "tape," use
  "embossing" within 5 "tape"
  or
  "embossing" ~~ 5 "tape"
  PROXIMITY ON GOOGLE
  Of course, the problem with using an advanced technique like proximity to
  target very specific information is that its effectiveness depends on the scope
  of the underlying database. With Google's database so much larger than AltaVista's,
  what we need is proximity searching (and truncation for that matter) at Google.
  While Google does not have any officially supported proximity searching,
  you can use the wild card word within a phrase function with some creative
  permutations to mimic a certain level of proximity searching. Fortunately this
  is made even easier by the Google API Proximity Search (GAPS) tool [www.stagernation.com/cgi-bin/gaps.cgi].
  GAPS provides proximity searching at Google for up to three words distance.
  Using Google's API interface, GAPS basically runs multiple iterations of the
  wild card word in a phrase to accomplish the proximity search. In other words,
  searching for a within two words of b runs four searches:
  "a b"
  "a * b"
  "b a"
  "b * a"
  Then the results are combined with an OR relationship. For three-word proximity,
  GAPS runs six searches. Since the four-word proximity requires 24 and five-word,
  120, you can see why the GAPS site has the three-word limit. Fortunately, for
  tech-savvy searchers who'd like to try larger proximity searches, the source
  code for the GAPS approach can be downloaded, installed, and run on your own
  site. However, due to the limitation of 1,000 searches per day through Google's
  API interface, only proximity of up to six terms could be done with this approach.
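  The idea behind the two-term case can be sketched in Python. This is not the GAPS source code and it makes no calls to Google's API. The function name proximity_queries is hypothetical, and the sketch covers only the simple two-term pattern shown in the four-search example above, so the number of searches GAPS itself runs at larger distances may differ.

```python
from typing import List


def proximity_queries(a: str, b: str, distance: int) -> List[str]:
    """Build wild card phrase queries for two terms, in both orders."""
    queries = []
    for first, second in ((a, b), (b, a)):
        # gap is the number of wild card words placed between the two terms.
        for gap in range(distance):
            words = [first] + ["*"] * gap + [second]
            queries.append('"' + " ".join(words) + '"')
    return queries


for q in proximity_queries("a", "b", 2):
    print(q)
```

  Each phrase would be run as its own search and the result lists merged, which is the OR relationship described above. With a distance of 2 the sketch produces the same four phrases listed earlier, and with a distance of 3 it produces six.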
  MEDIA TYPE OR PAGE CONTENT
  Of the many advanced features I have explored over the years, at first glance
  several seem to have no obvious uses. Embedded content was one of these. Available
  for many years at Inktomi-powered search engines like HotBot and MSN Search,
  and more recently offered at AlltheWeb, the page content limit searches just
  for Web pages that contain specific kinds of files such as video, audio, JavaScript,
  Shockwave, or RealAudio.
  For all three of these search engines, these limits are readily available
  on the advanced search page. HotBot and MSN limit results to pages that either
  have a link to such files or have them embedded on the page itself. AlltheWeb
  just limits to embedded content, but then it has a separate limit for searching
  just specific file types.
  Why use such a limit? I saw no obvious applications at first. But then I
  was looking for examples of library tutorials that used Flash or Shockwave.
  Just searching for library while adding that appropriate page content limit
  effectively targeted the kinds of results I needed. The scripting limits are
  quite useful when you're searching for examples of certain kinds of scripts.
  And the audio limit proved quite helpful when searching for the song of a specific
  bird. Adding the audio limit to the common or taxonomic name of the bird brought
  up far more examples than a single Audubon recording could offer.
  FUTURE POTENTIAL
  Unfortunately, all of these power-searching commands and functions are subject
  to the general inconsistencies that are so common among the search engines.
  Remember that the search engines are designed first and foremost for the general
  searcher, the one who enters a few words and hopes to get one moderately good
  answer, or at least an interesting site that will distract from the original
  information request.
  So the search engine engineers can let useful advanced commands lapse (like
  in the summer of 2003 when Google's intitle: and inurl: were not working properly
  for several months). But if really popular search terms start returning really
  poor results in the top 10 positions, the engines will respond much more quickly.
  The good news is that advanced features are still being offered and introduced.
  In August, Google added a new synonym operator of a tilde (~), which can be
  used right before a search term, with no space, to tell Google to look for
  synonyms of that word. AlltheWeb, and then Google, introduced calculator and
  conversion functions. Gigablast added indexing for non-HTML file types.
  The features and capabilities of the search engines go through regular churn.
  By learning what is available and which offer special capabilities, we can
  be prepared to use these unusual advanced features to retrieve the information
  we need. And then we just hope that the features remain available for next
  time.
  
  Greg R. Notess (greg@notess.com; www.notess.com) is a reference librarian at Montana State University and founder of SearchEngineShowdown.com.
Comments? Email the editor at marydee@infotoday.com.  
   