FEATURE
Using Web Server Logs to Track Users Through the Electronic Forest
by Karen A. Coombs
As the electronic services librarian at SUNY Cortland,
I am responsible for the electronic presence of the
library and for providing cohesion between its various
Web-based systems and electronic information. As part
of my responsibilities, I work with the library's Webmaster
and other librarians to create a unified Web-based
presence for the library. Along with our main site,
we have Web sites for our ILL system, proxy server,
and Web-based survey host. When the university wanted
to know more about who was using our Web pages and
how they were using them, I was asked to collect some
statistics. After using three methods to assess our
site, I chose to concentrate on using Web server logs,
and in following their paths discovered a wealth of
useful data.
Welcome to the Virtual Electronic Forest
Currently, the library Web site at SUNY Cortland
has approximately 270 pages spread across two different
servers. These pages include database-driven Web pages,
XML, and Web-based forms. The site offers a variety
of Web-based services including online interlibrary
loan, electronic resources, research guides, tutorials,
and e-mail reference. Since my arrival at SUNY Cortland
almost 4 years ago, the significance of the Web site
as a library service has grown considerably. The number
of users visiting the site seems to be increasing,
and the library has been providing more Web-based services
and resources. In contrast, the last 4 years have seen
a steady decline in the number of users visiting the
physical library. In my opinion, SUNY Cortland is seeing
a shift in user behavior toward the virtual library;
this is a trend being noticed by libraries throughout
the country.
Although our library's Web site is not a "physical" space
per se, patrons utilize it in ways that are comparable
to the physical library. Since the library currently
collects and analyzes a variety of statistics on user
behavior within the physical library, I've been asked
to do the same for the Web site. Some questions that
I am interested in answering are:
What resources, services, and pages are patrons utilizing?
Do people find the Web site navigable and useful?
How do users learn about the library's Web site, and why are they motivated to visit?
Answering these questions has become part of our
plan for continuous assessment and improvement. In
particular, I have been asked to assess the effectiveness
of the library's Web site. While working at SUNY Cortland,
I've used three different methods to do this. My first
foray into the assessment arena was at the start of
my second year here. I had just finished redesigning
the Web site and wanted to get feedback from our users.
I decided to use a Web-based survey to do this. The
survey provided me with some very subjective data about
how patrons felt about the site. However, I received
a limited number of responses, which didn't tell me
much about what portions of the site were being used.
Second, I turned to our Web server logs to track
which parts of the library's Web site were being used.
At the time, I was receiving a report from the campus
Webmaster that listed our most frequently used pages.
This information was helpful in assessing the effectiveness
of different Web site pages, but it didn't answer all
of my questions. One question of particular interest
at that time was how many visitors used Netscape 4.7
when accessing our site. In spring 2001, we were implementing
a new Web catalog that did not work well with Netscape
4.7; we needed to know how many of our users this would
adversely affect. To accomplish this goal, we sent
several months' worth of Web server logs to a consultant,
who ran them through a log analysis tool. The analysis
showed that less than 4 percent of our users accessed
the site via Netscape 4.7. After this project was completed,
I contracted the consultant to analyze our server logs
on a monthly basis. This arrangement continued from
the spring of 2001 until the library successfully implemented
its own Web log analysis software in August 2004.
For our third assessment, the Webmaster and I studied
our site using task-based usability testing, in which
users are observed as they perform particular tasks
(or answer particular questions) and are encouraged
to "think out loud." It is often difficult to do this
type of testing without disrupting the user, and the
observer often captures only snippets of behavior.
The task-based usability testing we conducted was very
successful and allowed us to make significant changes
to the library site. However, the process of building
and conducting the tests made me realize that we
needed to collect and analyze the data from our Web
server logs more efficiently and consistently. These
logs could provide the foundation for our analysis
because they provide the most continuous and complete
data about our library site. Additionally, by analyzing
these log files, the Webmaster and I would be able
to focus on specific areas of the site to improve.
Having come to this conclusion, I began to investigate
how to collect and analyze our logs.
Seeking Electronic Tracks
Our first step was getting the server to generate
logs. Almost all Web servers have the capability to
do this; the function just needs to be turned on. The
crucial piece for us was making sure the data we wanted
to analyze was in the log files. After doing some research,
I learned that most log analysis tools expect very similar
information, and you can configure the server to record
exactly those fields in your log files.
Additionally, many servers let you control whether
the log files are collected daily, weekly, monthly,
and so on. Based on my reading, I set up our server
to collect log files on a daily basis. Each daily log
file contains many lines of data; each line represents
a single request made to the library's Web site.
A line in the log file typically contains
information like the date and time, the IP address
of the user, the page being accessed, and the browser
and operating system being used by the person visiting
the site.
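To make this concrete, here is a hypothetical entry showing, from left to right, the date and time, the visitor's IP address, the request, the status code the server returned, and the visitor's browser and operating system (the address and values are invented for illustration):

    2005-02-14 13:05:22 192.0.2.15 GET /library/databases.html 200 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1)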
Recognizing Users' Electronic Footprints
Simply collecting Web server logs wasn't enough.
In order to follow our virtual visitors' footprints,
we needed a Web log analysis tool. While it is possible
to look for patterns in the server log files yourself,
analysis tools make this task much easier. These tools
take the raw data in the log files and look for patterns
such as which pages are most or least visited. The
end result is an aggregation of data transformed into
a useful set of reports about what is going on in your
virtual forest.
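To make the idea concrete, here is a minimal sketch in Perl (the language AWStats itself is written in) of the core aggregation such tools perform: tallying how often each page appears in a log. It assumes W3C extended-format log lines; it is a toy illustration, not how any particular package is implemented:

    #!/usr/bin/perl
    # tally.pl -- a toy version of what a log analysis tool does:
    # count how often each page appears in a Web server log.
    use strict;
    use warnings;

    my %hits;
    while (my $line = <>) {
        next if $line =~ /^#/;    # skip W3C header directives such as #Fields
        my @fields = split ' ', $line;
        # The requested page is the field that follows the HTTP method
        my ($i) = grep { $fields[$_] =~ /^(?:GET|POST|HEAD)$/ } 0 .. $#fields;
        next unless defined $i && defined $fields[ $i + 1 ];
        $hits{ $fields[ $i + 1 ] }++;
    }
    # Report pages from most to least visited
    for my $page ( sort { $hits{$b} <=> $hits{$a} } keys %hits ) {
        print "$hits{$page}\t$page\n";
    }

Run against a day's log (for example, perl tally.pl ex050214.log, using the exYYMMDD.log names that IIS gives its daily logs), it prints a crude most-visited-pages report; real analysis tools perform this same kind of aggregation across many more dimensions and present the results as polished reports.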
Today, there are a variety of log analysis tools
available. These tools have many similarities but can
range in features and price. Since we had been receiving
log analysis reports from a consultant, I understood
the types of information these tools provide about
a Web site. Also, I knew that at the very least the
library would need the most basic analysis data, like
number of hits and visitors, referring sites, referring
search engines, and search key phrases. I found an
array of Web analysis tools by searching Google. Since
cost was a factor, I specifically searched for open
source tools and compared these with commercial solutions.
In addition, I spoke with several other Web managers
and the consultant who had been providing us with log
analysis data. This gave me insight into the reports
provided, cost, technical requirements, difficulty
of installation, configuration, maintenance, and available
features.
My research revealed that almost all log analysis
tools provide the basic set of data I was hunting for.
There is a great range in prices of log analysis packages.
Some (like WebTrends) cost several hundred dollars,
while others (such as Analog) are free. The pricing
models also differ from one tool to another: some
packages are priced by the number of servers being
analyzed, others by the volume of traffic (number of
page views) analyzed. Certain tools allow the Webmaster
to access and run them remotely. Different tools produce
reports in HTML, PDF, or other formats.
WebTrends is probably the best-known log analysis
tool for businesses. However, it costs close to $500
for a small-business license. SurfStats and Sawmill
are other commercial packages that cost about $100
for a single-server license. In addition, I found three
open source log analysis tools that seemed worth investigating:
AWStats, Analog, and Webalizer.
Choosing a Tool for Tracking and Analysis
When it came to choosing the tracking tool to analyze
our Web logs, price was an important factor. I had
wanted to perform this analysis for some time; however,
funding it had never been an organizational priority.
So, software was never acquired, and our log files
had never been consistently analyzed. Many Web managers
recommended WebTrends as a solution. However, while
WebTrends provides extremely in-depth information,
I could not justify the cost for the types of data
the library managers were interested in. The problem
was not only one of initial cost but of upgrades. Other
Web managers told me that WebTrends and many other
log analysis tools are upgraded frequently. This would
mean a software investment every other year (if I chose
to skip a version or two). While SurfStats and Sawmill
provided a lower-cost alternative, the upgrade cost
was still a factor. In addition, these products were
licensed per Web site, meaning we would need to purchase
four licenses to cover our four sites.
As a result, my search for a log analysis tool turned
to open source solutions. Currently, there are at least
a dozen available. In selecting a tool for Memorial
Library, I looked at Analog, Webalizer, and AWStats.
Analog is a C-based analysis tool that
can be run from a Web page or the command line. It
is probably the most popular of these three open
source Web analysis products. It has comparable features
to AWStats but provides no information about visits.
Additionally, it does not store the data it gathers
from the server logs in a format that can be loaded
into another product, such as a database, for further analysis.
Webalizer is a C-based analysis tool
that has to be run from the command line. While it
provides the same basic data as the other two tools,
it doesn't report users' operating systems or the
search engines they may have used to find the site.
AWStats is a Perl-based log analysis
tool that can be run from a Web page or the command
line. It has more reports about "visits" than the
other two tools, including tracking where users entered
and exited the site.
Based on this comparison, I decided to use AWStats
because of its versatility and ability to be extended.
Setting Up Tracking Gear
Next, I needed to implement AWStats on our Web servers.
The first step in this process was to download the
program and the accompanying documentation from the
Web (http://awstats.sourceforge.net).
After reading the documentation, I became a little
concerned that I might lose data during installation.
Therefore, I contacted the consultant I work with and
asked him if he could help me get the software installed
and properly configured. Together we decided the next
step was to configure the server logs to match the
format preferred by AWStats. This meant configuring
the server to collect the following data within the
log files:
date
time
c-ip (client IP address)
cs-username (authenticated user name, if any)
s-ip (IP address of the Web server)
cs-method (method: GET or POST)
cs-uri-stem (the path to the file accessed)
cs-uri-query (the query string, if any)
sc-status (the status code sent back by the Web server, e.g., 404 Not Found or 200 OK)
sc-bytes (bytes sent)
cs-bytes (bytes received)
time-taken (time taken to serve the request)
cs-version (protocol version used)
cs(User-Agent) (operating system and Web browser the patron used to access the site)
cs(Cookie) (cookie sent with the request, if any)
cs(Referer) (page the user came from when accessing the current page)
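With those settings in place, each daily log begins with a #Fields directive naming the columns, and every entry follows that order. A hypothetical example (the addresses, byte counts, and referring page are invented):

    #Fields: date time c-ip cs-username s-ip cs-method cs-uri-stem cs-uri-query sc-status sc-bytes cs-bytes time-taken cs-version cs(User-Agent) cs(Cookie) cs(Referer)
    2005-02-14 13:05:22 192.0.2.15 - 192.0.2.1 GET /library/databases.html - 200 11423 310 156 HTTP/1.1 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1) - http://www.cortland.edu/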
Next we decided to test AWStats on a sample set of
log files from my server without actually installing
the software on our Web server. When this test was
successful, the consultant installed AWStats on a test
server. During this process, we could not get the
Web-based interface for updating the statistics to work.
However, I was not sure I needed "real-time" data,
so we journeyed on. The final step in the process was
installing and configuring AWStats on our Web server.
The three-step installation process for AWStats was
relatively simple: 1) Install Perl if necessary (because
AWStats is a Perl-based program); 2) install AWStats;
3) configure AWStats. This required altering several
values in the configuration file that controls AWStats.
Once this was done, we were ready to start analyzing
log files.
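The values involved are directives in the AWStats configuration file. A minimal sketch of the kind of settings we had to supply follows; the paths and domain shown here are hypothetical, not our production values:

    # Which log file to analyze
    LogFile="C:\logs\ex050214.log"
    # W = a Web server log
    LogType=W
    # 2 = IIS W3C extended log format
    LogFormat=2
    # The domain of the site being analyzed
    SiteDomain="library.cortland.edu"
    # Where AWStats stores the statistics it builds
    DirData="C:\awstats\data"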
Because AWStats is run from the command line, the
consultant and I developed a couple of batch files
that made interactions with the program easier. One
batch file allows us to control the date range for
which the log files are being run. A second batch file
allows the analysis statistics to be automatically
updated. I have chosen to review the statistics on a
monthly basis. To keep them current, a batch file runs
as a Windows scheduled task at 11:59 p.m. every day;
each run updates the statistics and reports for the
current month and places the reports in a folder for
that month. This gives SUNY Cortland up-to-date
statistics with no human intervention. Additionally,
the statistics can be updated manually by running a
batch file if necessary.
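To illustrate, the nightly batch file amounts to little more than two AWStats commands: one to fold the latest log data into the statistics and one to regenerate the HTML report. This is a sketch only; the paths and the config name "library" are hypothetical:

    @echo off
    rem Update the statistics from the newest log entries
    cd /d C:\awstats\wwwroot\cgi-bin
    perl awstats.pl -config=library -update
    rem Rebuild the HTML report that the intranet links to
    perl awstats.pl -config=library -output -staticlinks > C:\intranet\stats\current.html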
All of the Web log analysis reports are automatically
made available via the library's intranet; this allows
me, the Webmaster, and the director to access different
pieces of information about the Web site's usage when
necessary.
Realizing the Benefits of Doing Our Own Tracking
Having our own Web log analysis statistics has had
distinct benefits. First and foremost, we now have
statistics about our Web site available on demand.
As a result, I am able to more easily answer questions
about visitors to our Web site. Since we are no longer
reliant on an external source for our data, we are
able to gather the data we want in the format we want.
Moreover, we can change the data we are gathering as
needed. Another advantage is the fact that we have
a more complete record of how our Web site is being
used. Prior to implementing AWStats, we were only analyzing
log files for our main Web site. Now we are able to
run usage statistics for all of our sites. This provides
us with greater insight into the overall patterns of
user behavior across the library's electronic forest.
As a result of analyzing our server logs, we have
learned several interesting things about our users'
behavior. There are a few pages on our site that people
utilize far more than others. The most surprising of these
is a page that lists the library's periodical holdings.
The heavy use of this page has emphasized the importance
of creating complete holdings for our journals in the
Web catalog. Additionally, users prefer the alphabetical
listing of the library's databases to a list of full-text
databases or a list of databases by subject. Data collected
from the server logs also revealed that most users
access our site while on campus. This is interesting,
considering a significant number of students live off
campus. Another important discovery is that most users
come to the library's site directly rather than through
an external source like a search engine or a link on
another site.
The information I've obtained from analyzing the
server logs has taught us many intriguing things about
our users and has created as many questions as it has
answered. Nonetheless, the data has been invaluable
in making decisions about Web-based services. We have
found many practical applications of Web server log
data, including designing future usability studies.
All of these endeavors have helped us to improve the
overall quality of our Web site. However, none of this
would have been possible without AWStats. This demonstrates
that there are low-cost solutions that can yield big
results for small and medium-sized libraries.
Further Reading
Bailey, Dorothy (2000). "Web Server Log Analysis" (http://slis-two.lis.fsu.edu/~log).
Fichter, Darlene (2003). "Server Logs: Making Sense
of the Cyber Tracks," ONLINE 27 (5): 47-55.
Haigh, Susan and Megarity, Janette (1998). "Measuring
Web Site Usage: Log File Analysis," Network Notes 57 (http://www.collectionscanada.ca/9/1/p1-256-e.html).
Kerner, Sean Michael (2003). "Handle Log Analysis
with AWStats," Builder.com (http://builder.com.com/5100-6371-5054860.html).
Open Directory Project: http://dmoz.org/Computers/Software/Internet/
Site_Management/Log_Analysis/Freeware_and_Open_Source.
Rubin, Jeffrey (2004). "Log Analysis Pays Off," Network
Computing 15 (18): 76-79.
Karen A. Coombs is the electronic services librarian
at SUNY Cortland in N.Y. She holds an M.L.S. and M.S.
in information management from Syracuse University in
N.Y. In addition to developing and maintaining the library's
Web applications (SFX, ILLiad, and OPAC), she is responsible
for implementing and maintaining the library's electronic
resources. Coombs is the author of the Library Web Chic
Weblog (http://www.librarywebchic.net) and
has published articles in Computers in Libraries and Journal
of Academic Librarianship. Her e-mail address is coombsk@cortland.edu.