ONLINE, January 2000
Copyright © 2000 Information Today, Inc.
After all the work, time, and money that's invested in building and maintaining the library Web site, you and your staff will most likely want to know who, if anyone, is using it. Additionally, what features and resources do visitors use most often? Are the people accessing the site the same people who come into the library? How do people find the Web site? Do they use a search engine?
These are usage questions, and librarians already have experience in gathering usage data. For example, librarians count the number of questions asked at a reference desk as a way of measuring its use. Like the reference desk, the library Web site represents a service point. The Web site service point, however, is electronic, and it requires new methods of measuring usage.
Understanding the basics of Web server technology and the data servers record is a good start in developing usage measurement techniques. After that, you can explore the software that exists to help you make sense of Web site statistics, and find the right software for your system.
Most Web servers record each request in the common log file format, which consists of seven fields:

remotehost rfc931 authuser [date] "request" status bytes

Broken out, each component of a common log file has its own meaning:

remotehost: the host name or IP address of the computer making the request
rfc931: the remote login name of the user (rarely available; usually a dash)
authuser: the username supplied if the page required authentication
[date]: the date and time of the request
"request": the request line exactly as it came from the browser, including the file requested
status: the HTTP status code the server returned
bytes: the number of bytes in the file sent

A typical log entry might look something like this:
gateway.iso.com - - [10/May/1999:00:10:30 -0000] "GET /class.html HTTP/1.1" 200 10000
In this example, the remote host is gateway.iso.com. The next two fields, rfc931 and authuser, are blank (represented by dashes). The request was made on May 10, 1999, at 10 minutes after midnight. The file requested was class.html. The status code 200 (OK) was returned, and the file requested was 10,000 bytes in size.
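To make the field layout concrete, here is a minimal sketch in Python (an illustration only, not part of any log analysis package discussed in this article) that pulls the fields out of a common log file line with a regular expression:

import re

# A sketch of parsing one common-log-format line. The pattern assumes the
# seven-field layout described above; a production parser would also need
# to cope with malformed lines.
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<rfc931>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)$'
)

line = ('gateway.iso.com - - [10/May/1999:00:10:30 -0000] '
        '"GET /class.html HTTP/1.1" 200 10000')
match = CLF_PATTERN.match(line)
if match:
    fields = match.groupdict()
    print(fields['host'], fields['request'], fields['status'])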
The common log file format may be the standard, but variations exist, and additional information may be stored in referrer and agent logs. A referrer log records the page a visitor was viewing when he or she requested a file from your server. An entry might look like this:

08/02/99, 12:02:35, http://ink.yahoo.com/bin/query?p="sample+log+file"&b=21&hc=0&hs=0, 999.999.999.99, jaz.med.yale.edu
In this example, the referring page was a search engine, ink.yahoo.com, and the search used to find the requested page was "sample log file." (Many Web designers and marketers are interested in the search words that lead users to their sites.) Note that the IP address of the computer making the request, 999.999.999.99, is also recorded here.
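If you want to pull those search terms out yourself, the query string of the referring URL can be parsed directly. Here is a minimal Python sketch; the parameter name p is specific to this Yahoo example, and other search engines use names such as q or query:

from urllib.parse import urlparse, parse_qs

# Sketch: extract the search terms from a referring URL. parse_qs also
# converts the plus signs in the query string back into spaces.
referrer = 'http://ink.yahoo.com/bin/query?p="sample+log+file"&b=21&hc=0&hs=0'
query = parse_qs(urlparse(referrer).query)
print(query.get('p', [''])[0])  # "sample log file"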
An agent log records information about the software making each request. An entry might look like this:

07/09/99, 13:59:24, , 999.999.99.99, scooby.northernlight.com, crawler@northernlight.com, Gulliver/1.2
In addition to the standard information about the date, time, and IP address, the field crawler@northernlight.com tells you that this hit came from a crawler, in this case Northern Light's robot, Gulliver. A hit from a Web browser would instead reveal the browser name and version, such as Mozilla/4.0, which probably means the visitor used Netscape version 4.0. (Mozilla was the code name for Netscape and is still used by browsers built on the open-source Netscape code.) Browser information, however, is not always considered reliable.
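Because robot hits can swamp a report, many sites separate them from browser hits. A crude Python sketch of the idea follows; the substrings checked here are illustrative, since real robots identify themselves in many different ways, and some not at all:

# Sketch: flag hits whose agent field looks like a crawler rather than a
# browser, so they can be counted separately in reports.
ROBOT_HINTS = ('crawler', 'spider', 'robot', 'gulliver')

def looks_like_robot(agent_field):
    return any(hint in agent_field.lower() for hint in ROBOT_HINTS)

print(looks_like_robot('crawler@northernlight.com Gulliver/1.2'))          # True
print(looks_like_robot('Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)'))  # False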
Common log files, referrer logs, and agent logs are sometimes combined into one log. Whatever format your Web server uses, the first thing you will need to do is determine what type of log file is being generated. The person responsible for the server should be able to tell you what format is used. In addition, there may be options in the log file that determine what data is recorded, and you may be able to use these options to increase or decrease the data collected, depending on your needs.
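For example, if your library's server happens to run Apache, the data recorded is controlled by directives in the server configuration. The sketch below assumes the standard mod_log_config module; consult your own server's documentation for its equivalent:

# Define the common log format, then an extended format that adds the
# referrer and user-agent fields, and choose which one is written.
LogFormat "%h %l %u %t \"%r\" %>s %b" common
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
CustomLog logs/access_log combined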
To sum up, some of the things you can learn from your Web server's log files are:

- which files on your site are requested most often
- the dates and times when the site is busiest
- the domain names or IP addresses of the computers making requests
- the pages and search engines that refer visitors to your site, and the search terms they used
- the browsers, platforms, and robots that visit your site
There are also concerns about how caching affects log files. Caching occurs when you visit a Web page and your browser stores a copy of that page in memory. The next time you request the same URL, the browser pulls the page from its cache, and the server never receives the request. You are using the site, but that use is not recorded in the server's log files. ISPs also cache pages for their customers, which compounds the problem.
A good rule to remember is that the log file measures requests for specific files on a server, not exact usage. The number of requests does not translate into number of unique visitors, and the numbers may not reflect all usage because of caching. Measuring usage requires extrapolating from what the log file tells us and entails some level of error. To gain more exact knowledge about Web site usage, other means of investigation, such as questionnaires or cookies, must be used.
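The gap between requests and visitors is easy to see in a few lines of Python (access_log is an assumed name for a common-log-format file):

# Sketch: compare total requests with distinct remote hosts. The second
# number is not a visitor count: caching hides some use, and dynamic IP
# addressing can split one person across several hosts.
hosts = []
with open('access_log') as log:
    for line in log:
        hosts.append(line.split(' ', 1)[0])  # first field is the remote host

print('requests:', len(hosts))
print('distinct hosts:', len(set(hosts)))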
Also remember that dynamic addressing masks some individual users, because they are not associated with a unique IP address. Anyone who connects directly to the Internet, however, will have a unique, unchanging IP address. Even though a name is not recorded, an IP address can reveal an individual's actions on a Web site.
There are currently no laws covering how to handle the information contained in a log file, but because log files can contain information about individual IP addresses, they should be considered confidential, much as circulation records are confidential. Any data the library makes public from its log files should mask individual IP addresses. Data can always be presented at the level of usage by large groups (such as users from a particular country or in-house versus outside users).
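Masking can be as simple as truncating or hashing each address before any data leaves the library. Here is a Python sketch of two possibilities, offered as illustrations rather than a recommended policy:

import hashlib

# Truncation keeps only the network portion of the address; hashing keeps
# addresses distinct for counting without revealing them.
def truncate_ip(ip):
    return '.'.join(ip.split('.')[:3]) + '.xxx'

def hash_ip(ip):
    return hashlib.sha1(ip.encode()).hexdigest()[:8]

print(truncate_ip('192.0.2.41'))  # 192.0.2.xxx
print(hash_ip('192.0.2.41'))      # an 8-character digest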
However your library decides to analyze log files, the library Web site should carry a complete statement of what data is collected, who can see the data, and how that data is used.
Before you consider analysis software, make sure you understand what you really want to know about your library's Web site usage. Both free and commercial software are available for log analysis, and each has advantages and disadvantages. In general, commercial software offers more features, enhanced graphics, and some level of customer support. If your needs are not too complex, a simpler and less expensive alternative may suit you as well as, or better than, the most full-featured analysis packages.
The following are not reviews of software. They are quick snapshots of some of the features of free and commercial software to acquaint you with what is available and the price range. No single software package is right for everyone. Performance of individual software will be affected by the types of log files your server produces, so you need to test your own system using your own log files to evaluate what works best in your environment. As you examine software options, keep one key point in mind. Log analysis software can aid in gathering, distilling, and displaying information from log files, but no matter how sophisticated the software, it cannot add to or improve on what is already available in the log file. The contents of the log file are the ultimate limiting factor in what log analysis software can do for you.
Analog is a very popular, freely available log analysis program developed by Stephen Turner. It produces a standard report that can be configured to the user's specifications, beginning with a General Summary of requests to the Web server.
An important feature is the Request Report, which lists the pages on the site from most to least requested. For each file, the report shows the number of requests, the date it was last requested, and the file name.
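The idea behind a Request Report is simple enough to sketch in a few lines of Python (this illustrates the tallying and is not Analog itself; access_log is an assumed file name):

from collections import Counter

# Tally the file named in each request line and print the ten
# most-requested pages, most to least.
counts = Counter()
with open('access_log') as log:
    for line in log:
        try:
            request = line.split('"')[1]      # e.g. GET /class.html HTTP/1.1
            counts[request.split()[1]] += 1   # the file portion
        except IndexError:
            continue                          # skip malformed lines

for path, n in counts.most_common(10):
    print(n, path)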
In addition to the General Summary and the Request Report, Analog will display monthly, daily, and hourly summaries. These can help identify the busiest month, day of the week, and hour of the day. Analog can also show the most common domain names of the computers where requests originated, which can tell you, for example, that 35% of requests came from academic sites in the U.S. Analog makes no attempt to identify the number of visitors to a site.
Analog is widely used. It runs on a variety of platforms and recognizes many log file formats, though it does not offer advanced graphics capabilities. The two titles below are also free and can be downloaded from the Internet.
TITLE: wwwstat
URL: http://www.ics.uci.edu/pub/Websoft/wwwstat/
PRODUCER: Roy Fielding
PRICE: Free
CUSTOMER SUPPORT: No
LOG FILE FORMAT: Common log file format
PLATFORM: UNIX
TITLE: http-Analyze 2.01
URL: http://www.netstore.de/Supply/http-analyze/default.htm
PRODUCER: RENT-A-GURU
PRICE: Free for educational or individual use
CUSTOMER SUPPORT: No
LOG FILE FORMAT: Common log file format, some extended log file formats
PLATFORM: UNIX
WebTrends is a powerful software package that attempts to simplify the process of log analysis. Log profiles and reports are created and edited in menu-driven systems, with wizards and online help available to ease the process. WebTrends lets you manage multiple log files across several servers.
Generating a customized report is done easily through the Report Wizard. In the report creation module, you may elect to generate tables and graphics from General Statistics, Resources Accessed, Visitors & Demographics, Activity Statistics, Technical Statistics, Referrers & Keywords, and Browsers & Platforms. Including a table or graph is as easy as checking a box in the wizard process. Graphs can be further customized as pie, bar, or line charts. Reports can be generated as HTML, Microsoft Word, or Microsoft Excel documents.
Some of the WebTrends reports are similar to what the free software offers. For example, WebTrends will generate a report of the most requested pages on the Web site, but it includes a graph, and file addresses are also identified by page titles.
WebTrends has more reports available than the free software. In the area of Resources Accessed alone, WebTrends generates tables and graphs for entry pages, exit pages, paths through the site, downloaded files, and forms. The other report sections are also full of enhanced capabilities. Referrers & Keywords, for instance, presents the top search engines sending hits to your site and the search terms that found it.
WebTrends reports can be filtered to exclude or include particular data. For example, you can choose to exclude requests generated by library employees by filtering those IP addresses out of the report. Other filters can present data for only one page, for a particular day of the week or hour of day, or for a particular referrer page. This feature is helpful in controlling the amount of data presented and aids in more finely targeting your reports to a particular subject.
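The filtering concept itself is straightforward. Here is a Python sketch of excluding in-house requests before reporting; the address prefix is hypothetical, and this illustrates the idea rather than WebTrends itself:

# Drop requests from staff machines, assumed here to sit on one network
# prefix, and pass everything else through to the report.
STAFF_PREFIX = '10.1.2.'  # hypothetical in-house network

def outside_requests(lines):
    for line in lines:
        host = line.split(' ', 1)[0]
        if not host.startswith(STAFF_PREFIX):
            yield line

with open('access_log') as log:
    for line in outside_requests(log):
        print(line, end='')  # or feed it to your analysis instead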
WebTrends uses mathematical algorithms to try to estimate the number of visitors to your site. There are difficulties in determining unique users from log files, however, and this information may not be credible. WebTrends itself states that the only way to identify a unique visitor is to use authentication (i.e., logons and passwords). For sites that do require authentication, WebTrends offers the ability to link user profile information in databases to visitor activity on the site.
WebTrends offers many easy-to-use features. In some ways, it's a bridge between low-cost or free utilities and very high-end software packages, which can cost $7,000 to $10,000. Some of the advanced capabilities in WebTrends might be more than your library requires. The following two software packages offer many of the same capabilities as WebTrends, such as predefined and customizable reports, data filtering, graphics, and friendly user interfaces.
TITLE: Netintellect V.4.0
URL: http://www.Webmanage.com/
PRODUCER: WebManage
PRICE: $295
CUSTOMER SUPPORT: Support by phone, email, and online, as well as an online tutorial and documentation
TRIAL: Free 15-day trial
LOG FILES: Recognizes 45 log file formats
PLATFORMS: Windows 95/98/NT
TITLE: FastStats
URL: http://www.mach5.com/fast/
PRICE: $99.95
TRIAL: Free 25-day trial
CUSTOMER SUPPORT: Free technical support via email; no phone support available
PLATFORM: Windows 95/98/NT
Server logs were designed to measure traffic and demand loads on a computer server, and they work well for this purpose. When server log files are used to try to measure how people use a site, they don't work quite as well. They can, however, give you useful information about the relative usage of pages on your Web site, other sites that refer visitors to your site, and how search engines help people find your site, among other important data.
Although log analysis isn't perfect, few measures of usage are. For example, when we count people who come through the doors of our library, we don't know if they are there to read books or magazines, or just use the bathroom. When we circulate a book, we don't know why it was selected or even if it is read. Server log file analysis can be viewed in the same light, as a flawed but necessary measure of usage. The important thing is to educate yourself about the abilities and limitations of log file analysis so that you can make educated use of the data it produces.
Kathleen Bauer (kathleen.bauer@yale.edu) is an Informatics Librarian at the Yale School of Medicine Library.