On The Net
Fiddling with File Types
By Greg R. Notess
Reference Librarian, Montana State University
It's obvious that information is stored and communicated in many
ways, which has ramifications for its retrieval. In the computer age, each
software program seems to have one or more distinct file types in which it
saves individual documents. While the Web has pushed a single format, HTML,
one of the Net's great strengths has been its ability to make many other file
formats available as well. Images in GIF or JPG formats are just one example.
For the information seeker, the textual file formats are usually the most
desirable to find. Initially, Web search engines only indexed the text within
an HTML page and ignored text within PDF or word-processing files. Fortunately
for the information professional, this finally started to change in late January
2001 when Google began to index text within PDF documents. In November of that
year, Google expanded to include even more file types, such as PostScript and
Microsoft Word.
Before long, all the major search engines except Teoma had expanded their
indexing to include at least PDF files. Some index PowerPoint, spreadsheet,
and Flash files. Search engines have added special command line syntax for
limiting to, or excluding, specific file formats and have integrated these
documents into the search results listings.
Since these files were not necessarily created specifically for the Web,
for search engines, or for searchers, there are some unique issues to consider
when trying to find and view them. Plus, the nature of the files makes for
some unique search problems, and the commands and scope vary between search
engines.
FILE TYPE PRIMER
Almost any kind of file type can be made available on the Web. There are
hundreds of file extensions available and a somewhat smaller number of file
types. Take a look at a list such as the one at www.webopedia.com/quick_ref/fileextensions.asp to get a sense of the wide variety of files.
The file name extension is typically used to identify the file type. So in
a file named report.abc, the abc part is the extension and identifies the type
of file. Common extensions include .doc for Microsoft Word documents, .pdf
for Adobe Acrobat PDF, and .ps for PostScript. The default setting in recent
versions of the Windows operating system will hide these file extensions from
view, but on Web pages and in search engine results, these extensions are typically
viewable.
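For anyone who wants to sort a batch of result URLs by type, the extension rule above can be sketched in a few lines of Python. The function name and the small lookup table are hypothetical, built only from the extensions named in this column, so treat this as an illustration rather than a complete catalog.

```python
import os

# Illustrative table only: these are the extensions mentioned in the text,
# not a complete list of file types.
KNOWN_TYPES = {
    "doc": "Microsoft Word",
    "pdf": "Adobe Acrobat PDF",
    "ps": "PostScript",
}


def file_type_from_name(name):
    """Return (extension, description) for a file name or URL path."""
    _, ext = os.path.splitext(name)
    ext = ext.lstrip(".").lower()  # normalize ".PDF" to "pdf"
    return ext, KNOWN_TYPES.get(ext, "unknown")


print(file_type_from_name("report.abc"))
print(file_type_from_name("Annual-Report.PDF"))
```

Lowercasing the extension matters because file names on the Web are not consistent about case.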
THE CONTROVERSY
Surprisingly, I found some controversy regarding the indexing of other file
types. Librarians and other information-oriented folks appreciate the information-rich
content found within such files. Certainly PDFs and word-processing file formats
tend to be longer than the standard Web page, besides being a popular way to
post technical reports, periodicals, annual reports, press releases, and other
important information content.
Yet information professionals do not make up the majority of Internet users.
Indeed, in browsing through online discussion forums, I discovered that many
in the Webmaster and e-commerce communities, especially the search engine marketers,
are downright hostile towards PDFs and other files. They would prefer to have
them either excluded from the likes of Google or at least ranked quite low.
Personally, my preference is to have these other file types ranked higher.
Fortunately, using file type limiters, it is easy enough to pull up all the
files in a certain format, as long as you know the search engine's syntax.
Why even bother with the extra file types as separate searches? Sometimes
such documents can contain very interesting information unavailable elsewhere.
Spreadsheets are great sources for statistics. Limit to PowerPoint files to
find recent conference presentations, especially on research that may not yet
have been published elsewhere. Looking for samples of online tutorials for
a topic? Add a Flash limit to the keywords for the topic to see what Flash
tutorials the search engine can find. Mary Ellen Bates, in her April 2003 Tip
of the Month [http://batesinfo.com/tip.html#April2003], mentioned several such
uses with some great examples.
SEARCHABLE FILE TYPES
Of the hundreds of different file types available, only a few are commonly
found on the Web. Of these, which are being indexed and are thus searchable,
and by which search engines? Google and AlltheWeb have the most extensive coverage
of file types, although PDFs are by far the most common and informationally
significant of the other file types. AltaVista only indexes PDFs at this point,
while the Inktomi database used at MSN Search and HotBot has PDF, PowerPoint,
Word, and Excel capabilities.
Google provides access to files in many formats, including at least all of
those on the following list and probably some others as well. The common file
extensions for each type are listed in parentheses.
Adobe Portable Document Format (pdf)
Adobe PostScript (ps)
Corel WordPerfect (wpd, wp5, wp6, wp7)
Lotus 1-2-3 (wk1, wk2, wk3, wk4, wk5, wki, wks, wku)
Lotus WordPro (lwp)
MacWrite (mw)
Microsoft Excel (xls)
Microsoft PowerPoint (ppt)
Microsoft Word (doc)
Microsoft Works (wks, wps, wdb)
Microsoft Write (wri)
Rich Text Format (rtf)
Text (ans, txt)
AlltheWeb has a fair number as well, although it has only officially announced
PDF, Microsoft Word, and Macromedia Flash. AlltheWeb includes some files from
each of the following types and, like Google, probably covers even more that
I have not discovered. The common extensions are listed for file types not
included in Google's list.
Adobe PDF
Adobe PostScript
Corel WordPerfect
Lotus 1-2-3
Macromedia Flash (swf)
Microsoft Excel
Microsoft PowerPoint
Microsoft Word
Rich Text Format
Star Office (sdw, sdc, sdd)
Text
SEARCH SYNTAX
How do you find such files? Because the search engines index these additional
file types, sometimes nothing extra is needed to find them. These files
will simply show up in regular search results. For example, searching for SC00-2348
(a Florida court docket number) will bring up PDF files in the top few hits
at Google, AlltheWeb, MSN, and AltaVista.
For the power searcher, the search engines do offer special features to look
specifically for documents in a certain file format or to exclude such documents.
The advanced search screens at Google, AlltheWeb, AltaVista, and MSN Search
all have file type options to select. Yet the advanced search pages may only
show a few of the file formats available. For Google and AlltheWeb in particular,
the command line searching is much more powerful.
For AltaVista, with only a PDF limit, the advanced page works well. AltaVista
accepts a filetype:pdf command, but it will not work in the Boolean searching
box. For MSN Search, the advanced search page is the only option for file type
limiting since it does not yet have a command line option. HotBot, which does
include some other file types, does not have any capability to limit file types,
even on its advanced search page. HotBot's Page Content limit is similar, but
it looks for links to or embedded file types rather than searching for the
files separately.
The command line syntax is only of use at present at Google and AlltheWeb,
and of course the two use somewhat different syntax.
THE GOOGLE VERSION
Google's Advanced Search page only offers some of the most popular file type
limits. These are under the label of "File Formats" and give six choices: PDF,
PostScript, Word, Excel, PowerPoint, and Rich Text Format. The advanced page
does give the option to either limit results to a particular format or to exclude
all of a particular format, but multiple formats cannot be combined.
The command line version (which can be used in the regular Google search
box) uses the syntax of filetype: followed by the extension. This cannot be
used alone and has to be combined with another search term. To search for an
Excel spreadsheet that includes cognizant, use
cognizant filetype:xls
To search for Lotus 1-2-3 spreadsheets mentioning health, try a search like
health filetype:wks OR filetype:wku OR filetype:wk5 OR filetype:wk4 OR filetype:wk3
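Typing a long OR chain by hand invites stray spaces and dropped letters. The two example searches above can be assembled with a small helper like the following sketch. The function name is hypothetical, and it simply follows the syntax as described in this column: a search term, then one filetype: term per extension joined with OR.

```python
def google_filetype_query(keywords, extensions):
    """Combine keywords with one filetype: term per extension, joined by OR."""
    keywords = keywords.strip()
    if not keywords:
        # Per the column, filetype: cannot be used alone at Google.
        raise ValueError("filetype: must be combined with a search term")
    if not extensions:
        raise ValueError("supply at least one file extension")
    limits = " OR ".join(f"filetype:{ext}" for ext in extensions)
    return f"{keywords} {limits}"


print(google_filetype_query("cognizant", ["xls"]))
print(google_filetype_query("health", ["wks", "wku", "wk5", "wk4", "wk3"]))
```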
Google expands on its cached copy of Web pages to provide HTML versions of
many of its separately indexed additional file type documents. Look for the "View
as HTML" link in the search results list. This will show an HTML version of
the file, which is especially useful for a quick look at the content or when
you do not have the necessary viewer for that file type.
THE ALLTHEWEB ALTERNATIVE
The AlltheWeb Advanced Search page also uses the label of "File Formats" for
the file type limit and has only three choices: Adobe PDF, Macromedia Flash,
and Microsoft Word. Yet there are many more file types indexed by AlltheWeb.
These are not officially released, and sometimes they behave rather strangely.
The command syntax is similar at first to Google, using the filetype: prefix,
but then rather than using the file extensions, AlltheWeb uses their MIME type
designation.
filetype:pdf
filetype:flash
filetype:msword
filetype:rtf
filetype:powerpoint
filetype:excel
filetype:postscript
filetype:wordperfect
filetype:staroffice
filetype:lotus123
filetype:text
filetype:xml
The advantage to this approach is that a single search for filetype:staroffice
can find files with several StarOffice extensions, including sdw, sdc, and
sdd, without stringing a long OR statement together. The disadvantage is that
the syntax is different from Google and harder to remember, especially for
the frequent Google user.
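One way around the memory problem is a lookup table that turns familiar file extensions into AlltheWeb's names. The sketch below is an assumption on my part: the table is inferred from the two lists in this column, and the exact set of extensions behind each AlltheWeb name is not documented, so the mapping (and the function name) should be treated as hypothetical.

```python
# Inferred from the lists in this column; the extensions that AlltheWeb
# actually groups under each name are an assumption.
ALLTHEWEB_NAMES = {
    "pdf": "pdf",
    "swf": "flash",
    "doc": "msword",
    "rtf": "rtf",
    "ppt": "powerpoint",
    "xls": "excel",
    "ps": "postscript",
    "wpd": "wordperfect",
    "sdw": "staroffice",
    "sdc": "staroffice",
    "sdd": "staroffice",
    "wk3": "lotus123",
    "wk4": "lotus123",
    "txt": "text",
}


def alltheweb_limits(extensions):
    """Collapse file extensions into the distinct AlltheWeb filetype: terms."""
    names = []
    for ext in extensions:
        name = ALLTHEWEB_NAMES.get(ext.lower().lstrip("."))
        if name is None:
            raise KeyError(f"no known AlltheWeb name for extension {ext!r}")
        if name not in names:  # several extensions can share one name
            names.append(name)
    return [f"filetype:{name}" for name in names]


print(alltheweb_limits(["sdw", "sdc", "sdd"]))
print(alltheweb_limits(["doc", ".PDF"]))
```

The three StarOffice extensions collapse into a single term, which is exactly the convenience described above.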
AlltheWeb has no View as HTML option, just as it has no cached copies of Web
pages. Still, the availability of the Flash and StarOffice files
provides access to content not on Google. AlltheWeb still finds some files
in PDF and other common file types that Google does not.
THE GIGABLAST ALTERNATIVE
One lesser-known search engine, Gigablast, also indexes several file types (Word,
Excel, PowerPoint, PostScript) and plain-text documents.
Gigablast is significantly smaller at this point than the major search engines,
with 200+ million indexed documents compared to the billions in the major search
engines, but since Gigablast is planning on being able to expand up to 5 billion,
it is well worth watching.
No option for file type limits is available (at this point) on its advanced
search page, so searchers have to use command syntax. Instead of filetype:,
Gigablast uses type: followed by the extension. So it is more like Google
than AlltheWeb. The extensions are the same as Google except for the plain
text, which is "text" rather than just "txt."
type:pdf
type:doc
type:xls
type:ppt
type:ps
type:text
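Since the only differences from Google are the prefix and the plain-text name, converting a Google-style limit to Gigablast's form is a short exercise. This is a minimal sketch with a hypothetical function name, based only on the syntax described above.

```python
def gigablast_query(keywords, google_extension):
    """Build a Gigablast search from keywords and a Google-style extension."""
    ext = google_extension.lower().lstrip(".")
    if ext == "txt":
        ext = "text"  # Gigablast's name for plain text, per the list above
    return f"{keywords.strip()} type:{ext}"


print(gigablast_query("budget", "xls"))
print(gigablast_query("readme", "txt"))
```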
Although Gigablast is small now, it at least contains a cached copy of Web
pages and of the other file types as well. Rather than using Google's View
as HTML, Gigablast just labels them "[cached]" like the Web pages. Again they
are text versions of the files, but a great way to quickly view the information
content.
FILE ACCESS
While all it takes to put a file on the Web is to load the file on a Web
server and create a link to it, that does not necessarily mean that the rest
of us will be able to view the file. Take an Acrobat PDF as an example. To
view a PDF, a searcher must have an Acrobat viewer in addition to the Web browser.
While there is a free viewer available, it has to be installed and working
to view the content within the PDF.
For other file types, there may or may not be a free viewer available. Microsoft
Office users should have no problem viewing Word, Excel, or PowerPoint files,
but StarOffice, or even Microsoft Works files, may not be directly viewable.
If the file does not load easily, look at the file extension for a clue as
to the file type. Google and AlltheWeb will make some guess as to the file
type, but knowing that it is a Microsoft Works or PostScript file will not
help if you do not have a program to view such files. In addition, people can
use unusual file extensions. If a Word document has a .pdf at the end of its
file name, Acrobat will try to open it.
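When the extension cannot be trusted, the first few bytes of a saved file are a better clue, because most formats begin with a fixed signature. The following sketch checks a handful of well-known signatures. The function names and the choice of formats are mine, and the table is far from complete.

```python
# Leading-byte signatures for a few formats discussed in this column.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"%!PS", "PostScript"),
    (b"{\\rtf", "Rich Text Format"),
    # Container used by the binary Word, Excel, and PowerPoint formats,
    # among others; it does not say which of those programs made the file.
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "Microsoft OLE2 document"),
    (b"FWS", "Flash"),
    (b"CWS", "Flash (compressed)"),
]


def sniff_file_type(data):
    """Return a format label for the leading bytes, or None if unrecognized."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return None


def sniff_path(path):
    """Read the first bytes of a saved file and identify its format."""
    with open(path, "rb") as handle:
        return sniff_file_type(handle.read(16))


# A Word document mislabeled as .pdf still starts with the OLE2 signature.
print(sniff_file_type(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1" + b"\x00" * 8))
print(sniff_file_type(b"%PDF-1.4\n"))
```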
Sometimes one of these unusual file types will not load properly, will automatically
prompt to save to disk, or will display on the screen as gibberish. This could
be due to the remote Web server not being configured to recognize the file
as the correct MIME type. This is where Google's View as HTML feature and Gigablast's
cached copies can be so useful to view some of the content, if not the formatting.
For other search engines, try saving the file by right-clicking the link
and choosing "Save Target As...."
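To see what MIME type a correctly configured server would normally announce for a given file name, Python's standard mimetypes module offers a similar kind of extension-to-type table. This is only a rough stand-in: the function name is hypothetical, the table may differ from any particular Web server's, and results for less common extensions vary by system.

```python
import mimetypes


def expected_mime_type(file_name):
    """Return the MIME type the local table associates with a file name."""
    mime, _encoding = mimetypes.guess_type(file_name)
    return mime  # None when the extension is not in the table


print(expected_mime_type("report.pdf"))
```

If a server labels a PDF with something other than the type shown here, the browser may not hand the file to the right viewer, which is one source of the gibberish described above.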
SEARCH CONSIDERATIONS
To index the content of all these non-HTML files, the search engines have
to find a way to transform the file into one with indexable text. They have
to filter something that looks like
%âãÏÓ
157 0 obj
<<
/Linearized 1
and strip out the codes to find the remaining text. That filtering process
can lead to some strange interpretations of the text within the document. In
PDF files particularly, initial letters may be separated from the rest of the
word. Try a search on nalyze filetype:pdf to find all kinds of hits due to
an extra space. For any keywords, especially those that might start a sentence,
try leaving off the first letter.
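The dropped-letter trick is easy to apply systematically. As a minimal sketch (the function name is hypothetical, and the output uses the Google-style filetype: syntax described earlier), this generates both the normal search and its truncated twin:

```python
def first_letter_variants(keyword, extension="pdf"):
    """Return searches for a keyword and for the same word minus its first letter."""
    word = keyword.strip()
    variants = [word]
    if len(word) > 2:  # very short words have nothing useful left to search
        variants.append(word[1:])
    return [f"{variant} filetype:{extension}" for variant in variants]


print(first_letter_variants("analyze"))
```

Running both searches and comparing the hit lists shows which documents the engine found only in their mangled form.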
Many other strange things happen to these files when converted to an indexable
format, especially for more graphically oriented files like Flash or PowerPoint.
While these can be information-rich files, the filtered translation means that
they may only be found with some creative guessing of words or word fragments
found within the documents.
The non-HTML files can be great sources of information and are now an important
part of Web searching. They may appear on any search, even in the top 10 results.
Knowing how to limit to specific file types, exclude others, combine several,
and view them is not a search skill needed every day, but it is one more
technique that for certain searches can help the professional find information
that no one else can.
Greg R. Notess (greg@notess.com; www.notess.com)
is a reference librarian at Montana State University and founder of SearchEngineShowdown.com.
Comments? Email the editor at marydee@infotoday.com.