Online Magazine
Vol. 27 No. 6 — Nov/Dec 2003
On The Net
Unusual Power Web Searching Commands
By Greg R. Notess
Reference Librarian, Montana State University

Web searching has changed the nature of online. What online searcher, even a decade ago, would have thought that hundreds of millions of online searches would be sent each day? Who is an online searcher these days? It can be almost anyone: parents, children, plumbers, waiters. The new style of online searching, at least as defined by the majority of Web searchers, is to put a few query words into Google, Yahoo!, or MSN. With the quantity of information resources on the Web these days and the ease of finding certain kinds of popular information, this simple technique is often effective.

If you are a searcher who prefers this new style of online searching, and you are always happy with the results, read no further. This column takes a look at some of the more esoteric and unusual of the advanced search commands from the search engines. These are tools that you may never use or may only try once in a year of searching. But for those who like to dig deeper, who seek the hard-to-find information nuggets, using these commands can help you mine the Internet for information in ways that others will never consider.

FEATURE CHURN

Some of these more unusual features come and go at a rather rapid rate. And with Yahoo!'s impending acquisition of Overture, which now owns both AlltheWeb and AltaVista (remember that Yahoo! already owns Inktomi), some of these commands could disappear as well. Overture has already said that it plans to merge the underlying databases of AlltheWeb and AltaVista, relying on the AlltheWeb technology for crawling and indexing the Web and on AltaVista for the search interface. How that may change those search engines remains to be seen.

The merger may cause some search options to disappear by the time this is published or soon after. So use the advanced features while you can. Search engines certainly look at their usage logs, and the percentage of searches that use any of these commands is quite small. Using them once in a while can help let the search engines know that these command options are still used and valued.

With the ever-changing nature of so many Internet tools, it should be no surprise that some features disappear. Take for example Google's cache, a very useful feature when encountering a dead link or changed content on a page. But long ago, Google removed from the top of the page the date it retrieved the cached page or the date stamp reported at that time. So, the advanced user now must look for internal clues to those dates.

Alternatively, the little-known Comet Web Search could be used to view those dates. It used Google's database, and, unlike several other Google partners, it included the Google cache pages. But unlike Google, it displayed both of those dates on the top of the cache page. Alas, that feature vanished earlier this year when Comet Web Search switched from the Google database to the FAST database, which does not have any cached pages. While the caching dates are displayed at Gigablast, that is only for the Gigablast cache and not for Google's.

ALLTHEWEB SITE SEARCHING

AlltheWeb has used its own, sometimes strange, syntax for field searches. The url.tld: and url.domain: field searches were both unusual and hard to remember. With the introduction of site: in September 2002, AlltheWeb moved more into the mainstream of search engine syntax and offered a field label of site: that restricts results to only the designated site. Google, Gigablast, Teoma, and the Inktomi search engines also use site: while it is host: at AltaVista. So to search AlltheWeb for Hubble on the NASA site, the syntax would be

hubble site:nasa.gov

The advanced search forms give the same search option without the need to use the site: prefix, but for searchers who know the syntax, it is usually simpler just to type it in directly.

AlltheWeb goes even further than just a basic site: field search capability. It also has two special operators that can be used with the site: prefix, the caret (^) and the asterisk (*). The ^ is an anchor, while the * fulfills its usual function of being a truncation symbol. AlltheWeb still does not have truncation for query words, but within the site: specification, it does.

Here is how it works. The anchor ^ and the truncation * symbols can appear at either end of the portion of the host address used with the site: command. The anchor ^ used at the beginning means that nothing else can come before it, and used at the end, it means that nothing else should follow. On the flip side, the truncation * sign at the beginning means that additional pieces can come before it, and used at the end, it means that additional pieces can follow. In both cases, the symbols apply not at the character level but to the pieces of a host address, that is, the character strings separated by periods. So the five "pieces" of www.subdir.dir.host.com are www, subdir, dir, host, and com.

A single domain name is often shared by several sites. While nasa.gov covers most of NASA, it is used by more than one Web site. The Ames Research Center can be found at arc.nasa.gov, while the Jet Propulsion Lab can be found at jpl.nasa.gov. To search all of the nasa.gov subsidiary sites, the site:nasa.gov command works, because the default is to have the end anchored but not the beginning. The more fully written-out version, site:*nasa.gov^, gives the same result.

The anchor operator helps if you want to search only nasa.gov itself and not sites like jpl.nasa.gov. In that case, use site:^nasa.gov. Since the default is to end with an anchor, there is no need to write it as site:^nasa.gov^. Just remember that it will also exclude www.nasa.gov.

Then there are the other country domains that can occur on top of a standard U.S. top-level domain. For example, total.com and total.com.au are two completely different sites.

To search both as well as other country code top-level domains, use site:total.com*.
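
The piece-level matching described above can be modeled in a few lines of Python. This is only a sketch of the behavior as this column describes it, not AlltheWeb's actual implementation. The function name site_matches is hypothetical, and the truncation symbol is typed as a plain asterisk (*).

```python
def site_matches(pattern: str, host: str) -> bool:
    """Model of the site: anchor (^) and truncation (*) rules described above.

    Hypothetical helper for illustration; not AlltheWeb code.
    """
    # Defaults per the column: open at the beginning, anchored at the end.
    open_start, open_end = True, False

    if pattern.startswith("^"):
        open_start, pattern = False, pattern[1:]
    elif pattern.startswith("*"):
        pattern = pattern[1:]

    if pattern.endswith("^"):
        pattern = pattern[:-1]
    elif pattern.endswith("*"):
        open_end, pattern = True, pattern[:-1]

    # Compare whole pieces (strings between periods), never characters.
    core = pattern.strip(".").lower().split(".")
    pieces = host.lower().split(".")
    n, m = len(core), len(pieces)

    for start in range(m - n + 1):
        if pieces[start:start + n] != core:
            continue
        starts_ok = open_start or start == 0
        ends_ok = open_end or start + n == m
        if starts_ok and ends_ok:
            return True
    return False


print(site_matches("nasa.gov", "jpl.nasa.gov"))
print(site_matches("^nasa.gov", "www.nasa.gov"))
print(site_matches("total.com*", "total.com.au"))
```

Under this model, site:nasa.gov accepts jpl.nasa.gov, site:^nasa.gov rejects both jpl.nasa.gov and www.nasa.gov, and site:total.com* accepts total.com as well as total.com.au.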

WILD CARD WORD WITHIN A PHRASE

Standard truncation can be a very useful search tool to retrieve variants on the stem of a search term without ORing the variants themselves. It is currently available only on AltaVista, which uses the asterisk (*) as its truncation symbol (sometimes called a wild card). It is disappointing that truncation is not available at Google, AlltheWeb, and other search engines.

But there is a bit of an exception. Both Google and AltaVista do offer something that I call the "Wild Card Word Within a Phrase." Both engines even use the same syntax. This only works within a phrase search (multiple words in an exact sequence designated in the query with quotation marks [""]). The * is an operator that represents any single word in that exact position. As with AlltheWeb's special site: operators above, the * is not character-based truncation but represents a whole word.

Searching to find a quotation is an easy example. Trying to find "a little neglect may breed mischief" when you are not sure of the second to last word? Just search "a little neglect may * mischief". If even fewer words are known, use multiple asterisks as in "a little * * * mischief".

There are several ways in which this feature is useful beyond quotation searching. It can be used to find more sophisticated cases of plagiarism or intellectual property theft. For advertisers, it helps to check on the variations of a new advertising slogan that might already be in use. With rather unique phrases, it can be used to replace regular truncation for a word. Have a word within a phrase where you want plural, singular, and misspellings? Just replace the whole word with the *. If there are some other phrases that have a whole different word in that place, use the - in front of that alternative for a NOT function.

For example, searching for matches on pinot grigio grape (which is also known as pinot gris) as "pinot * grape" will also find pinot noir grape and others. In that case, search again with

"pinot * grape" -"pinot noir grape" -"pinot blanc grape"

Just bear in mind that Google limits queries to 10 search words, so use AltaVista if you would have more than that. Fortunately, Google does not count the * as a search term, so the 10-term limit refers only to real words in the query.
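
For anyone who builds such queries often, the pattern can be sketched in Python. The function names below are hypothetical, and the counting rule is an assumption drawn from the paragraph above: every real word counts toward the 10-word limit, including words in excluded phrases, while each * counts as nothing.

```python
GOOGLE_WORD_LIMIT = 10  # the limit reported in this column


def build_query(phrase, excluded_phrases=()):
    """Quote the wild card phrase and prefix each unwanted phrase with -."""
    parts = ['"' + phrase + '"']
    parts += ['-"' + unwanted + '"' for unwanted in excluded_phrases]
    return " ".join(parts)


def counted_terms(query):
    """Count real words only; a lone * is assumed not to count."""
    words = [token.strip('-"') for token in query.split()]
    return sum(1 for word in words if word and word != "*")


query = build_query("pinot * grape", ["pinot noir grape", "pinot blanc grape"])
print(query)
print(counted_terms(query), "of", GOOGLE_WORD_LIMIT, "words used")
```

The pinot query above uses eight counted words, so it still fits within the limit; two more excluded phrases of the same length would not.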

PROXIMITY SEARCHING

All the main search engines have phrase searching, which uses exact proximity. But when you want to be able to search for terms that are close to one another, but not exactly next to each other, you have few choices. AltaVista is the only major search engine with a documented proximity operator, NEAR, which retrieves hits when terms are within 10 words of each other. I wrote briefly before about AltaVista's undocumented proximity operators in my January 2002 "Internet Search Engine Update" in ONLINE [www.infotoday.com/online/jan02/SearchEngineUpdate.htm].

Beyond NEAR, AltaVista lets you specify exactly how close or far apart the terms should be. This only works in the advanced search in the Boolean box. There, to specify proximity other than 10, use the "WITHIN" operator followed by a space and the number. Alternatively, the symbol for "WITHIN" is a double tilde: ~~. So, to search for "embossing" within five words of "tape," use

"embossing" within 5 "tape"

or

"embossing" ~~ 5 "tape"
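
A small Python sketch can generate both spellings of the same request. The function name within_queries is hypothetical; the operator syntax is only what is shown above, and the strings are meant to be pasted into the Boolean box of AltaVista's advanced search.

```python
def within_queries(term_a, term_b, distance):
    """Return the WITHIN form and the ~~ form of one proximity query."""
    left = '"' + term_a + '"'
    right = '"' + term_b + '"'
    return (
        f"{left} within {distance} {right}",
        f"{left} ~~ {distance} {right}",
    )


for query in within_queries("embossing", "tape", 5):
    print(query)
```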

PROXIMITY ON GOOGLE

Of course, the problem with using an advanced technique like proximity to target very specific information is that its effectiveness depends on the scope of the underlying database. With Google's database so much larger than AltaVista's, what we need is proximity searching (and truncation for that matter) at Google.

While Google does not have any officially supported proximity searching, you can use the wild card word within a phrase function with some creative permutations to mimic a certain level of proximity searching. Fortunately this is made even easier by the Google API Proximity Search (GAPS) tool [www.stagernation.com/cgi-bin/gaps.cgi].

GAPS provides proximity searching at Google for distances of up to three words. Using Google's API, GAPS basically runs multiple iterations of the wild card word in a phrase to accomplish the proximity search. In other words, searching for a within two words of b runs four searches:

"a b"

"a * b"

"b a"

"b * a"

Then the results are combined with an OR relationship. For three-word proximity, GAPS runs six searches. Since the four-word proximity requires 24 and five-word, 120, you can see why the GAPS site has the three-word limit. Fortunately, for tech-savvy searchers who'd like to try larger proximity searches, the source code for the GAPS approach can be downloaded, installed, and run on your own site. However, due to the limitation of 1,000 searches per day through Google's API, only proximity of up to six terms could be done with this approach.
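
The two-term case can be sketched in Python to show where those phrase searches come from. This is not the GAPS source code, only an illustration of the permutation idea for two terms; the function name proximity_queries is hypothetical, and no call to Google's API is made.

```python
def proximity_queries(a, b, max_distance):
    """List the phrase searches for a within max_distance words of b.

    Illustrative only: covers two terms, in both orders, with zero up to
    max_distance - 1 wild card words between them.
    """
    queries = []
    for first, second in ((a, b), (b, a)):
        for gap in range(max_distance):
            words = [first] + ["*"] * gap + [second]
            queries.append('"' + " ".join(words) + '"')
    return queries


for query in proximity_queries("a", "b", 2):
    print(query)
```

With a distance of two this yields the four searches listed above, and with a distance of three it yields six. Each search would then be run separately and the result lists merged.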

MEDIA TYPE OR PAGE CONTENT

Of the many advanced features I have explored over the years, at first glance several seem to have no obvious uses. Embedded content was one of these. Available for many years at Inktomi-powered search engines like HotBot and MSN Search, and more recently offered at AlltheWeb, the page content limit searches just for Web pages that contain specific kinds of files such as video, audio, JavaScript, Shockwave, or RealAudio.

For all three of these search engines, these limits are readily available on the advanced search page. HotBot and MSN limit results to pages that either have a link to such files or have them embedded on the page itself. AlltheWeb just limits to embedded content, but then it has a separate limit for searching just specific file types.

Why use such a limit? I saw no obvious applications at first. But then I was looking for examples of library tutorials that used Flash or Shockwave. Just searching for library while adding that appropriate page content limit effectively targeted the kinds of results I needed. The scripting limits are quite useful when you're searching for examples of certain kinds of scripts. And the audio limit proved quite helpful when searching for the song of a specific bird. Adding the audio limit to the common or taxonomic name of the bird brought up far more examples than a single Audubon recording could offer.

FUTURE POTENTIAL

Unfortunately, all of these power-searching commands and functions are subject to the general inconsistencies that are so common among the search engines. Remember that the search engines are designed first and foremost for the general searcher, the one who enters a few words and hopes to get one moderately good answer, or at least an interesting site that will distract from the original information request.

So the search engine engineers can let useful advanced commands lapse (like in the summer of 2003 when Google's intitle: and inurl: were not working properly for several months). But if really popular search terms start returning really poor results in the top 10 positions, the engines will respond much more quickly.

The good news is that advanced features are still being offered and introduced. In August, Google added a new synonym operator of a tilde (~), which can be used right before a search term, with no space, to tell Google to look for synonyms of that word. AlltheWeb, and then Google, introduced calculator and conversion functions. Gigablast added indexing for non-HTML file types.

The features and capabilities of the search engines go through regular churn. By learning what is available and which offer special capabilities, we can be prepared to use these unusual advanced features to retrieve the information we need. And then we just hope that the features remain available for next time.


Greg R. Notess (greg@notess.com; www.notess.com) is a reference librarian at Montana State University and founder of SearchEngineShowdown.com.

Comments? Email the editor at marydee@infotoday.com

 

