On The Net
Dating the Web: The Confusion of Chronology
By Greg R. Notess
Reference Librarian, Montana State University
For those of us grounded in the print world of publishing,
the date of publication helps identify and distinguish
different editions, specific periodical issues, and even
re-printings. The date of print publication rarely matches
the exact date of composition but does have the distinct
advantage of being basically unchangeable. Once a work is
published with a publication date on each copy, the only
way authors can update it or otherwise make a change
is to issue a new edition or a correction. Otherwise,
they would have to track down every single copy and make
the change.
The Web, of course, is pretty much the opposite. Site
owners can change any page any time they wish. Unscrupulous
but talented hackers can sometimes even change other people's
pages. And anyone who has ever tried to cite a Web page
knows that many have no publication date information listed
at all.
Yet for all its newness as a publication medium, the Web
is aging. We have had public Web sites up for more than
a decade now, though few pages remain in their original
format. As the Web ages, it becomes increasingly important
to try to understand the origination date of certain
Web content. For intellectual property cases and the historical
record, among other reasons, it can be important to know
when a Web page was actually written or first posted.
Exploring the Internet dating scene for the information
professional means understanding the dimensions, deficiencies,
and differences of the various dates associated with Web
pages.
DATING DIMENSIONS
With the ease of posting a Web page, which is then
publicly available, and the subsequent ease of changing
that page, issues of date information have several dimensions.
There is the original content creation date and possibly
an editing or updated date. The surrounding text and
graphics may come from an entirely different day and
time while the page design may have occurred at yet
a different point.
The date and time when the file containing this conglomeration
of parts was last changed are reported in the date stamp.
Any time a file on a computer is changed, a new date
and time stamp, based on the computer's internal clock,
is recorded.
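The behavior of the date stamp is easy to see on any local file. The following is a minimal Python sketch, not a description of any particular Web server: the file name and the 1998 date are hypothetical, chosen only for illustration. It backdates a file, makes one small change, and reads the stamp before and after.

```python
import os
import tempfile
from datetime import datetime, timezone
from typing import Tuple


def file_date_stamp(path: str) -> datetime:
    """Return the file's last-modified time as a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(os.path.getmtime(path), tz=timezone.utc)


def demo() -> Tuple[datetime, datetime]:
    """Backdate a temporary file to 1998, edit it, and return both stamps."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "article.html")  # hypothetical file name
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("<p>Article written in 1998</p>")
        # Pretend the file was last saved on 1 June 1998 (hypothetical date).
        written = datetime(1998, 6, 1, tzinfo=timezone.utc).timestamp()
        os.utime(path, (written, written))
        before = file_date_stamp(path)
        # Any later change, even a cosmetic one, replaces the stamp.
        with open(path, "a", encoding="utf-8") as handle:
            handle.write("<!-- new footer -->")
        after = file_date_stamp(path)
    return before, after
```

Calling demo() returns a 1998 stamp followed by one carrying the current date, even though the article text itself never changed.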
Take for example an article written in 1998 that may
have been uploaded to a Web page. The links in it may
have been updated in 2000, while the page was redesigned
with new surrounding logo graphics in 2002. Then the
whole site was redesigned using a new content management
system in 2004, resulting in the date stamp being updated
to report the current year's date. Yet the bulk of the
content of the article is still 6 years old, and the
links have not been updated in 4 years.
DATE DEFICIENCIES
The previous example shows the problems with dating
Web content. Most articles published on the Web by news
media and periodical publishers have fairly obvious
creation dates posted along with the article. Many Gannett
papers include an "originally published" date
and label at the bottom of each article. URLs also include
the year, month, and day of the original publication.
However, other news publications often have no date
listed in the article or in the URL. Still others put
the current day's date on the top of every page, even
when the articles were obviously published earlier.
Alternatively, some list a "posted on" date.
This may or may not be the same date as the date of
the article's newspaper publication.
Beyond articles, plenty of other Web pages include some
kind of a date. Far too often, it is only a small copyright
notice at the bottom of the page. Typically the current
year or a range of years such as 1995-2004 is listed.
The problem with this date statement is that, on many
sites, it may just be part of a standard footer on every
page. Check other pages on the same site to verify the
use of a standard footer. If every page has the same
copyright statement at the bottom, then it is likely
just a site-wide copyright statement.
Many other pages list no date information at all. In
this case, checking for a date stamp on the page may
be helpful. In Internet Explorer (IE), click "File"
on the drop-down menu and then click "Properties."
In Netscape, Mozilla, or Firefox, use "View/Page
Info" or the keyboard shortcut CTRL+I. Just
remember that, if accurate, this is only the date the
page was last changed. The actual writing and posting
of the content may have occurred much earlier.
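The date stamp a browser displays generally comes from the Last-Modified header that the Web server sends along with the page. As a rough sketch, the Python function below reads that header from a set of response headers that has already been fetched. The function name and the sample header values are assumptions made for illustration.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def date_stamp_from_headers(headers: Mapping[str, str]) -> Optional[datetime]:
    """Return the Last-Modified date stamp, or None if the server sent none."""
    # Header names are case-insensitive, so normalize before looking up.
    lowered = {name.lower(): value for name, value in headers.items()}
    raw = lowered.get("last-modified")
    if raw is None:
        return None
    try:
        return parsedate_to_datetime(raw)
    except (TypeError, ValueError):
        # A malformed header is treated the same as a missing one.
        return None


# Hypothetical headers for a page that reports a date stamp.
sample = {"Content-Type": "text/html",
          "Last-Modified": "Tue, 02 Mar 2004 17:45:00 GMT"}
print(date_stamp_from_headers(sample))
```

A result of None corresponds to the pages that list no date stamp at all.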
DATING DIFFICULTIES
Unfortunately, determining the date of a page can be
even more difficult. The date stamp is not reported
on many pages. For sites that use SSI, ASP, PHP, or
other server-side scripting languages (or some content
management systems), the date stamp on every page
will usually be the current day and time, because the
page is generated at the moment it is requested.
Even for those pages that do have a date stamp, various
versions and installations of Internet Explorer may
not display the correct date stamp. One solution is
to use a "Show Date" bookmarklet. Simply add
a new "Favorite" in IE with a name such as
"Show Date" and, instead of a URL, enter
javascript:alert(document.lastModified). This will help if the regular
"File/Properties" approach does not work properly.
As an alternative, just check the page in Netscape,
Mozilla, or Firefox and use the "Page Info"
display.
Tricks like the bookmarklet do not help for pages that
either do not show a date stamp or, more commonly, just
give the current date. Except for pages that are obviously
updated on a daily basis, never trust a date stamp that
matches the current date. Instead, look for other ways
to establish a publication date.
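That rule of thumb is simple enough to write down. The Python sketch below is one possible reading of it, not an established test: the function name and the updated_daily flag are invented for this example, and it assumes the stamp and today's date are expressed in the same time zone.

```python
from datetime import date, datetime
from typing import Optional


def stamp_is_plausible(stamp: Optional[datetime], today: date,
                       updated_daily: bool = False) -> bool:
    """Apply the rule: distrust a stamp that merely echoes today's date."""
    if stamp is None:
        # No date stamp was reported, so there is nothing to trust.
        return False
    if updated_daily:
        # Pages that obviously change every day are the one exception.
        return True
    return stamp.date() != today


today = date(2004, 11, 15)  # hypothetical "current" date
print(stamp_is_plausible(datetime(2004, 11, 15, 9, 0), today))  # False
print(stamp_is_plausible(datetime(2002, 3, 8, 14, 30), today))  # True
```

A False result does not mean the page is new or old. It only means the stamp says nothing useful and another source is needed.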
DATE SEARCHING
A variety of search options can help identify the
date of some Web pages. Checking in with the search
engines brings up one more date to add to the confusion.
When a search engine sends out its spider to index the
Web page content, it adds the new indexing date. With
so many sites not reporting an actual date stamp, the
last indexing date may be the only date information
that a search engine knows.
A quick search at Gigablast clearly identifies the sites
with a date stamp and those without. Those with a date
stamp have a modified date as well as an indexed date
listed. Those with only the indexed date gave no usable
date stamp to Gigablast's spider. This great date reporting
along with cached copies of pages at Gigablast make
it a good tool in helping to pin down the date of a
page.
The Internet Archive's Wayback Machine [www.archive.org]
is a better place to look. As long as the page is archived
there and is not older than late 1996, looking through
the various versions can help detect the difference
between content change and design changes. While the
early years of the archive are less complete than recent
ones, just seeing when the page first appeared in the
archive can give a good hint as to its creation date.
However, bear in mind that the page's content could
just have been previously published on a page with a
different URL or on a different Web site and could therefore
be older. Look for clues to this in the earliest pages
in the archive. Also check on the site's main URL to
see how far back it is archived and whether it points
to the same content in a different location.
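Archived copies in the Wayback Machine are identified by a 14-digit capture timestamp in the form YYYYMMDDhhmmss. Given a list of those timestamps copied from the archive's listing for a page, a short Python sketch can pick out the earliest one. The timestamps below are made-up sample data, and the function name is an assumption.

```python
from datetime import datetime
from typing import Iterable, Optional


def first_capture(timestamps: Iterable[str]) -> Optional[datetime]:
    """Return the earliest capture time, or None if the page is not archived."""
    parsed = [datetime.strptime(stamp, "%Y%m%d%H%M%S") for stamp in timestamps]
    return min(parsed, default=None)


# Hypothetical capture timestamps for one URL.
captures = ["20020814093012", "19981205120000", "20040301000000"]
print(first_capture(captures))  # 1998-12-05 12:00:00
```

The earliest capture is only an upper bound on the creation date: the page existed by then, but it may have been posted earlier.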
Yahoo! and Google have date limits that can be used
to try to home in on pages from the last few months.
To find a page from an older time period, try the AltaVista
or AlltheWeb advanced search. Even though they are basically
the Yahoo! database, they have more precise date limits
than Yahoo!. However, given the multiplicity of dates
and their general inaccuracy, do not depend too much
on the accuracy of the search engine date limit.
For more recent changes to content, the search engines'
cached copies can offer some help. With Gigablast, both
the search engine results page and the cached copy include
the last indexed date and the date stamp. Just remember
that the date stamp, listed as the modified date, was
the date stamp at the time that the page was last indexed.
It may have been updated since then. Compare the current
page to the cached version to verify.
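For long pages, the comparison can be done mechanically. The following Python sketch uses the standard difflib module to list only the lines that differ between two saved copies of a page's text. The sample text is invented, and saving the two copies is left to the reader.

```python
import difflib
from typing import List


def changed_lines(cached_text: str, current_text: str) -> List[str]:
    """List lines removed from ("- ") or added to ("+ ") the cached copy."""
    diff = difflib.ndiff(cached_text.splitlines(), current_text.splitlines())
    # Keep only real additions and removals; skip unchanged and hint lines.
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Hypothetical page text as cached and as it appears now.
cached = "Title\nLinks updated 2000\nFooter"
current = "Title\nLinks updated 2004\nFooter"
for line in changed_lines(cached, current):
    print(line)
```

An empty list suggests the page has not changed since it was last indexed, so the cached date stamp still applies.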
With Google's recent addition of its indexing date to
the cached copy of pages, it is much easier to identify
the date of the cached copy. Unfortunately, it only
has one cached copy available per date. To see the date,
click on the "cached" link and then look in
the header for the "as retrieved on" date.
Note that the time is given as Greenwich Mean Time (GMT)
so be sure to convert to the appropriate local time.
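The conversion matters most near midnight, when the date itself can change. As a minimal sketch, the Python below converts a GMT time to a fixed offset of seven hours behind GMT, standing in for U.S. Mountain Standard Time. The sample time and the choice of zone are assumptions, and a fixed offset ignores daylight saving time.

```python
from datetime import datetime, timedelta, timezone

# Assumed reader time zone: U.S. Mountain Standard Time, 7 hours behind GMT.
MOUNTAIN_STANDARD = timezone(timedelta(hours=-7), "MST")


def gmt_to_local(gmt_time: datetime, local_zone: timezone) -> datetime:
    """Interpret a naive datetime as GMT and convert it to local_zone."""
    return gmt_time.replace(tzinfo=timezone.utc).astimezone(local_zone)


# Hypothetical retrieval time read from a cached copy's header.
retrieved = datetime(2004, 11, 15, 3, 30)
print(gmt_to_local(retrieved, MOUNTAIN_STANDARD))  # 2004-11-14 20:30:00-07:00
```

Here a page retrieved on November 15 by GMT was retrieved on the evening of November 14 in local time.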
DATING DIFFERENCES
Time zones are an important consideration in the global
network. With activity and posting on the Web coming
from all around the world at all times of day and night,
when the actual time of posting is of concern, be careful
in assuming the time zone.
Even Google uses two different time zone standards.
The recent addition of the GMT indexing time in the
cache contrasts with the green date that sometimes appears
on the Google results list. Often known as the "fresh"
date (since that was the old label and the green date
only appears on recently re-indexed pages), it is not
based on GMT. While the times listed in the cache are GMT,
Google seems to be basing its "fresh" dates
on U.S. Eastern Standard Time. But with so many search
firms looking into local search, it is also possible
that the displayed time could be tied to the user's
local settings (chosen or guessed) rather than the date
and time at the creator's location.
Location on the Web is also difficult. Web servers are
often located in a different part of the world from the
author: a German living in Australia could be posting
new Web pages on a server located in the U.S. Which
date and time is displayed then? If it is a date stamp based
on the system clock, it would probably be the U.S. time
zone. If it is an author-supplied date that is visible
on the text of the page, then it is more likely based
on the author's time zone.
One nice feature of most Weblog implementations is that
the postings include a posting date and time, depending
on the blog's configuration. The time zone difference
issue still applies, especially for multi-author blogs
where the writers can be very geographically dispersed.
Even for the single author blog, the date displayed
is still at the discretion of the blogger. Some blog
software lets the writer control the posting date, making
it easy to add earlier postings and even future ones.
Editing older posts is also easy. And while a necessary
function for correcting typographical errors, bad links,
and misstatements, editing does pose a problem for those
needing to see the original text of the posting.
DATING WRAP-UP
The whole Internet dating scene is quite complex,
whether talking about the online relationship matching
or the chronological side addressed here. The danger
for the unwary information seeker is in falling into
the clutches of a misleadingly dated Web page and drawing
false conclusions about the page's content based on
that one date. When accuracy of chronology is essential,
remember to insist on more than one date verification.
Greg R. Notess (greg@notess.com; www.notess.com)
is a reference librarian at Montana State University and founder of SearchEngineShowdown.com.
Comments? Email the editor at marydee@infotoday.com.