


 




Online Magazine
Vol. 28 No. 6 — Nov/Dec 2004
On The Net
Dating the Web: The Confusion of Chronology
By Greg R. Notess
Reference Librarian Montana State University

For those of us grounded in the print world of publishing, the date of publication helps identify and distinguish different editions, specific periodical issues, and even reprintings. The date of print publication rarely matches the exact date of composition but does have the distinct advantage of being basically unchangeable. Once a work has been published with a publication date on each copy, the only way authors can update or otherwise change it is to issue a new edition or a correction. Otherwise, they would have to track down every single copy and make the change.

The Web, of course, is pretty much the opposite. Site owners can change any page any time they wish. Unscrupulous but talented hacks sometimes can even change other people's pages. And anyone who has ever tried to cite a Web page knows that many have no publication date information listed at all.

Yet for all its newness as a publication medium, the Web is aging. We have had public Web sites up for more than a decade now, though few pages remain in their original format. As the Web ages, it becomes increasingly important to try to understand the origination date of certain Web content. For intellectual property cases and the historical record, among other reasons, it can be important to know when a Web page was actually written or first posted. Exploring the Internet dating scene for the information professional means understanding the dimensions, deficiencies, and differences of the various dates associated with Web pages.

DATING DIMENSIONS

With the ease of posting a Web page, which is then publicly available, and the subsequent ease of changing that page, issues of date information have several dimensions. There is the original content creation date and possibly an editing or updated date. The surrounding text and graphics may come from an entirely different day and time while the page design may have occurred at yet a different point.

The date and time when the file containing this conglomeration of parts was last changed are reported in the date stamp. Any time a file on a computer is changed, a new date and time stamp, based on the computer's internal clock, is recorded.
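
As a rough illustration of where a date stamp comes from, the following Python sketch reads a file's last-modified time from the file system. The function name file_date_stamp is hypothetical and is not part of any tool mentioned in this column; the sketch only assumes a local file whose modification time the operating system records.

```python
import os
from datetime import datetime, timezone


def file_date_stamp(path):
    """Return the file's last-modified time as a timezone-aware UTC datetime."""
    # The operating system stores this value, based on the computer's clock,
    # whenever the file's contents are changed.
    modified_seconds = os.path.getmtime(path)  # seconds since the epoch
    return datetime.fromtimestamp(modified_seconds, tz=timezone.utc)
```

Note that this value says nothing about when the words in the file were written. It only records the last time anything in the file changed.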

Take for example an article written in 1998 that may have been uploaded to a Web page. The links in it may have been updated in 2000, while the page was redesigned with new surrounding logo graphics in 2002. Then the whole site was redesigned using a new content management system in 2004, resulting in the date stamp being updated to report the current year's date. Yet the bulk of the content of the article is still 6 years old, and the links have not been updated in 4 years.

DATE DEFICIENCIES

The previous example shows the problems with dating Web content. Most articles published on the Web by news media and periodical publishers have fairly obvious creation dates posted along with the article. Many Gannett papers include an "originally published" date and label at the bottom of each article. Their URLs also include the year, month, and day of the original publication.

However, other news publications often have no date listed in the article or in the URL. Still others put the current day's date on the top of every page, even when the articles were obviously published earlier. Alternatively, some list a "posted on" date. This may or may not be the same date as the date of the article's newspaper publication.

Beyond articles, plenty of other Web pages include some kind of a date. Far too often, it is only a small copyright notice at the bottom of the page. Typically the current year or a range of years such as 1995-2004 is listed. The problem with this date statement is that, on many sites, it may just be part of a standard footer on every page. Check other pages on the same site to verify the use of a standard footer. If every page has the same copyright statement at the bottom, then it is likely just a site-wide copyright statement.

Many other pages list no date information at all. In this case, checking for a date stamp on the page may be helpful. In Internet Explorer (IE), click "File" on the menu bar and then click "Properties." In Netscape, Mozilla, or Firefox, use "View/Page Info" or the keyboard shortcut of CTRL+I. Just remember that, if accurate, this is only the date the page was last changed. The actual writing and posting of the content may have occurred much earlier.

DATING DIFFICULTIES

Unfortunately, determining the date of a page can be even more difficult. The date stamp is not reported on many pages. For sites that use SSI, ASP, PHP, or other server-side scripting languages (or certain content management systems), the date stamp on every page will typically be the current day and time.

Even for those pages that do have a date stamp, various versions and installations of Internet Explorer may not display the correct date stamp. One solution is to use a "Show Date" bookmarklet. Simply add a new "Favorite" in IE with a name such as "Show Date" and, instead of a URL, enter javascript:alert(document.lastModified). This will help if the regular "File/Properties" approach does not work properly. As an alternative, just check the page in Netscape, Mozilla, or Firefox and use the "Page Info" display.

Tricks like the bookmarklet do not help for pages that either do not show a date stamp or, more commonly, just give the current date. Except for pages that are obviously updated on a daily basis, never trust a date stamp that matches the current date. Instead, look for other ways to establish a publication date.
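
The rule of thumb above can be sketched in a few lines of Python. This is only an illustration, assuming the date stamp arrives as the text of an HTTP Last-Modified header (the usual source of the value a browser reports). The function name assess_date_stamp and its three labels are invented for this example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def assess_date_stamp(last_modified, now=None):
    """Classify a Last-Modified value as 'missing', 'suspect', or 'plausible'."""
    if not last_modified:
        return "missing"  # the server reported no date stamp at all
    try:
        stamp = parsedate_to_datetime(last_modified)
    except (TypeError, ValueError):
        return "missing"  # the text could not be read as a date
    if stamp.tzinfo is None:
        # Assumption: treat a stamp without zone information as GMT.
        stamp = stamp.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    same_day = stamp.astimezone(timezone.utc).date() == now.astimezone(timezone.utc).date()
    # A stamp that matches today's date is more likely generated on the fly
    # than a sign that the content was written today.
    return "suspect" if same_day else "plausible"
```

A "plausible" result still only means the page was last changed on that date, so it is a starting point rather than a publication date.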

DATE SEARCHING

A variety of search options can help identify the date of some Web pages. Checking in with the search engines brings up one more date to add to the confusion. When a search engine sends out its spider to index the Web page content, it adds the new indexing date. With so many sites not reporting an actual date stamp, the last indexing date may be the only date information that a search engine knows.

A quick search at Gigablast clearly identifies the sites with a date stamp and those without. Those with a date stamp have a modified date as well as an indexed date listed. Those with only the indexed date gave no usable date stamp to Gigablast's spider. This great date reporting along with cached copies of pages at Gigablast make it a good tool in helping to pin down the date of a page.

The Internet Archive's Wayback Machine [www.archive.org] is a better place to look. As long as the page is archived there and is not older than late 1996, looking through the various versions can help detect the difference between content change and design changes. While the early years of the archive are less complete than recent ones, just seeing when the page first appeared in the archive can give a good hint as to its creation date.
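
For anyone comparing many archived versions, a short Python sketch can pull the capture dates out of the archive's URLs and find the earliest one. It assumes archived page addresses embed a 14-digit capture time (year, month, day, hour, minute, second) after "/web/". The function names and the example.com addresses are hypothetical, and no request is made to the archive itself.

```python
import re
from datetime import datetime

# Assumed URL shape: .../web/<14-digit capture time>/<original URL>
_CAPTURE = re.compile(r"/web/(\d{14})")


def capture_time(archive_url):
    """Return the capture time encoded in an archive URL, or None if absent."""
    match = _CAPTURE.search(archive_url)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")


def earliest_capture(archive_urls):
    """Return the oldest capture time found among the given URLs, or None."""
    times = [t for t in map(capture_time, archive_urls) if t is not None]
    return min(times) if times else None


if __name__ == "__main__":
    versions = [
        "http://web.archive.org/web/20020614093000/http://www.example.com/article.html",
        "http://web.archive.org/web/19981205121500/http://www.example.com/article.html",
    ]
    print(earliest_capture(versions))
```

The earliest capture is only an upper bound on the creation date: the page existed by then, but it may have been posted well before the archive first visited it.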

However, bear in mind that the page's content could just have been previously published on a page with a different URL or on a different Web site and could therefore be older. Look for clues to this in the earliest pages in the archive. Also check on the site's main URL to see how far back it is archived and whether it points to the same content in a different location.

Yahoo! and Google have date limits that can be used to try to home in on pages from the last few months. To find a page from an older time period, try the AltaVista or AlltheWeb advanced search. Even though they basically use the Yahoo! database, they have more precise date limits than Yahoo!. However, given the multiplicity of dates and their general inaccuracy, do not depend too much on the accuracy of the search engine date limit.

For more recent changes to content, the search engines' cached copies can offer some help. With Gigablast, both the search engine results page and the cached copy include the last indexed date and the date stamp. Just remember that the date stamp, listed as the modified date, was the date stamp at the time that the page was last indexed. It may have been updated since then. Compare the current page to the cached version to verify.

With Google's recent addition of its indexing date to the cached copy of pages, it is much easier to identify the date of the cached copy. Unfortunately, it only has one cached copy available per date. To see the date, click on the "cached" link and then look in the header for the "as retrieved on" date. Note that the time is given as Greenwich Mean Time (GMT) so be sure to convert to the appropriate local time.
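
The GMT conversion is simple arithmetic, but it is easy to slip across a date boundary. The Python sketch below shows one way to do it, assuming the time has been copied out by hand in a "1 Dec 2004 03:20:15" form (the exact layout in the cache header may differ) and using a fixed offset for U.S. Mountain Standard Time that ignores daylight saving. The names gmt_to_local and MST are chosen for this example only.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset for U.S. Mountain Standard Time (GMT minus 7 hours).
# Assumption: daylight saving time is ignored in this sketch.
MST = timezone(timedelta(hours=-7), "MST")


def gmt_to_local(gmt_text, local_zone=MST, fmt="%d %b %Y %H:%M:%S"):
    """Parse a GMT time written in the given format and convert it to local time."""
    gmt_time = datetime.strptime(gmt_text, fmt).replace(tzinfo=timezone.utc)
    return gmt_time.astimezone(local_zone)


if __name__ == "__main__":
    # Early morning on December 1 in GMT is still the evening of November 30 in Montana.
    print(gmt_to_local("1 Dec 2004 03:20:15"))
```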

DATING DIFFERENCES

Time zones are an important consideration in the global network. With activity and posting on the Web coming from all around the world at all times of day and night, when the actual time of posting is of concern, be careful in assuming the time zone.

Even Google uses two different time zone standards. The recently added indexing time in the cache is given in GMT, but the green date that sometimes appears on the Google results list is not. Often known as the "fresh" date (since that was the old label and the green date only appears on recently re-indexed pages), it seems to be based on U.S. Eastern Standard Time. But with so many search firms looking into local search, it is also possible that the displayed time could be tied to the user's local settings (chosen or guessed) rather than the date and time at the creator's location.

Location on the Web is also difficult. Web servers are often located in a different part of the world from the author: a German living in Australia could be posting new Web pages on his server located in the U.S. Which date and time is displayed? If it is a date stamp based on the system clock, it would probably be in the server's U.S. time zone. If it is an author-supplied date that is visible in the text of the page, then it is more likely based on the author's time zone.
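
To see how much an unlabeled time can shift, the Python sketch below takes a displayed date and time with no zone attached and works out the moment it would denote under each candidate location. The two zones, their fixed offsets (which ignore daylight saving), and the function names are assumptions made for this illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical candidates for the scenario above, as fixed offsets from GMT.
CANDIDATE_ZONES = {
    "server (U.S. Eastern, GMT-5)": timezone(timedelta(hours=-5)),
    "author (eastern Australia, GMT+10)": timezone(timedelta(hours=10)),
}


def possible_instants(naive_stamp, zones=CANDIDATE_ZONES):
    """Map each candidate zone to the GMT moment the unlabeled stamp would denote."""
    return {
        label: naive_stamp.replace(tzinfo=zone).astimezone(timezone.utc)
        for label, zone in zones.items()
    }


def uncertainty(naive_stamp, zones=CANDIDATE_ZONES):
    """Return the spread between the earliest and latest possible moments."""
    instants = possible_instants(naive_stamp, zones).values()
    return max(instants) - min(instants)


if __name__ == "__main__":
    stamp = datetime(2004, 12, 1, 8, 0)  # "December 1, 2004, 8:00" with no zone given
    for label, instant in possible_instants(stamp).items():
        print(label, instant)
    print(uncertainty(stamp))
```

With these two candidates the same displayed time could refer to moments 15 hours apart, falling on different calendar days in GMT.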

One nice feature of most Weblog implementations is that the postings include a posting date and time, depending on the blog's configuration. The time zone difference issue still applies, especially for multi-author blogs where the writers can be very geographically dispersed. Even for the single author blog, the date displayed is still at the discretion of the blogger. Some blog software lets the writer control the posting date, making it easy to add earlier postings and even future ones. Editing older posts is also easy. And while a necessary function for correcting typographical errors, bad links, and misstatements, editing does pose a problem for those needing to see the original text of the posting.

DATING WRAP-UP

The whole Internet dating scene is quite complex, whether we are talking about online relationship matching or the chronological side addressed here. The danger for the unwary information seeker is in falling into the clutches of a misleadingly dated Web page and drawing false conclusions about the page's content based on that one date. When accuracy of chronology is essential, remember to insist on more than one date verification.


Greg R. Notess (greg@notess.com; www.notess.com) is a reference librarian at Montana State University and founder of SearchEngineShowdown.com.

Comments? Email the editor at marydee@infotoday.com

 

