When the hits just keep on coming
Posted: 07.12.99
WebTrends is tops among three Web server log analysis tools, but make sure you understand the assumptions these programs make before you rely on the data they return.
Identifying the top 40 page hits on your Web site doesn't have to be much harder than finding Top 40 hit songs on the radio. All you need is the right Web site log analyzer. Log analyzer programs look at the logs produced by Web servers and present the results in a variety of reports.
The products we tested — WebTrends Log Analyzer 4.51, Marketwave's Hit List Commerce Suite 4.0 and WebManage Technologies' NetIntellect 4.0 — don't look at Web servers in real time, but they do provide detailed site usage statistics culled from log files. All three are powerful and easy to use, but the current standout is WebTrends Log Analyzer, which is easy to install and configure, and provides the most informative reports. Marketwave's Hit List is even easier to configure than Log Analyzer and has a great deal of power. However, Hit List is more difficult to use, and its standard reports are not quite as useful as Log Analyzer's. Released late last month, WebManage's NetIntellect 4.0 is a good entry-level product with an improved interface and more reporting options than Version 3.0, which we also looked at. It's less expensive than WebTrends Log Analyzer and Marketwave's Hit List, but it's less sophisticated as well.
A need for speed?
Though vendors place a lot of emphasis on how quickly a log analyzer can pore through data, the speed issue is not really what it appears to be. All three products can crunch multiple megabytes per minute under ideal conditions. In the real world, however, initial log-read speed is often a function of your disk drive's speed, not of the program, because much of the work involves reading huge amounts of data off the drive.
Don't place too much stock in the speed figures distributed by vendors. The figures are meaningless when you consider that vendor benchmarks are often run using logs with preresolved domain names, and that the vendor benchmarks often don't require the software to perform reverse Domain Name System (DNS) lookups on IP addresses. Under typical conditions, it takes time for a Web server to resolve domain names.
In our tests, we enabled reverse DNS lookups because reports without domain names are of little value. As we expected, this made all the products considerably slower. In one instance, we found that speed dropped from an average of 34M bytes per minute to around 1M byte per minute. Network conditions and DNS server availability can account for such significant delays. To speed things up, all the products we tested attempt to cache DNS lookups and allow you to control the number of simultaneous lookups; even so, the lookups take a long time. And you'll find dramatically different results on the same log data if you flush your product's DNS cache and retest.
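To see why a DNS cache matters, consider this minimal Python sketch. It is an illustration only, not a description of how any of the tested products work internally; the `CachingResolver` class and its injectable `lookup` function are names invented for this example.

```python
import socket
from typing import Callable, Dict


def system_reverse_lookup(ip: str) -> str:
    """Resolve an IP address to a host name; fall back to the IP on failure."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return ip


class CachingResolver:
    """Hypothetical helper: remembers answers so each IP is resolved only once."""

    def __init__(self, lookup: Callable[[str], str] = system_reverse_lookup):
        self._lookup = lookup
        self._cache: Dict[str, str] = {}
        self.misses = 0  # number of slow lookups actually performed

    def resolve(self, ip: str) -> str:
        if ip not in self._cache:
            self.misses += 1
            self._cache[ip] = self._lookup(ip)
        return self._cache[ip]
```

A busy log repeats the same addresses many times, so only the first request from each address pays the network cost. Flushing the cache makes every address a miss again, which is why a retest on the same data can produce very different timings.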
Just as troublesome as the speed issue is the question of how big a log file a product can handle. Log file size is dictated more by the amount of RAM or disk space available than anything else. If you are using a memory-based log-file analysis program to crunch a gigabyte-level log, it may not work well — or at all — depending on the quality of your operating system. Having RAM equal to the size of the log can get ridiculous, so vendors use a database to store the log data. This serves two purposes: It allows you to re-query the data, and it allows you to create indexes and summaries without keeping the entire log file in memory at once.
The downside of using a database is that loading the database with data can be a time-consuming procedure. Straight memory-based log-file analysis tools will generally beat the database tools on the first pass, but these memory-based tools require a complete reread of the file to run a new query. Log Analyzer provides both options, while Hit List and NetIntellect use a database.
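The trade-off can be sketched in Python with the standard `sqlite3` module. This is an assumption-laden illustration, not the storage design of any product reviewed here; the table layout and the function names `load_requests`, `top_pages` and `requests_per_ip` are invented for the example.

```python
import sqlite3
from typing import Iterable, List, Tuple

Request = Tuple[str, str, str]  # (client IP, HTTP method, requested path)


def load_requests(rows: Iterable[Request]) -> sqlite3.Connection:
    """Pay the loading cost once; later queries reuse the stored table."""
    db = sqlite3.connect(":memory:")  # a file path would persist between runs
    db.execute("CREATE TABLE requests (ip TEXT, method TEXT, path TEXT)")
    db.executemany("INSERT INTO requests VALUES (?, ?, ?)", rows)
    db.execute("CREATE INDEX idx_path ON requests (path)")
    db.commit()
    return db


def top_pages(db: sqlite3.Connection, limit: int = 10) -> List[Tuple[str, int]]:
    """Most requested paths, answered from the database, not the raw log."""
    return db.execute(
        "SELECT path, COUNT(*) AS n FROM requests "
        "GROUP BY path ORDER BY n DESC, path LIMIT ?",
        (limit,),
    ).fetchall()


def requests_per_ip(db: sqlite3.Connection) -> List[Tuple[str, int]]:
    """A second, different query that needs no reread of the log file."""
    return db.execute(
        "SELECT ip, COUNT(*) FROM requests GROUP BY ip ORDER BY ip"
    ).fetchall()
```

The insert step is the slow first pass described above. After that, each new question is just another query against the stored table, whereas a purely memory-based tool would have to read the whole log again.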
For any decent amount of logged data, using a database is the only way to go and should help alleviate any major size issues. To lower disk-space requirements and reduce query time, most products dump certain portions of the log data from the database. Despite such steps, log files can get big — if you want to save and be able to run queries for a few years' worth of Web data, be prepared to start your own Web-specific data warehouse.
Precision hitting
To simulate a typical end user approaching log analysis, we monitored traffic on a public Web site for a month. Our log files ranged from a small 1,000-line fragment to a 100M-byte log file. For each product, we ran all the reports that comprised the vendor's definition of a complete analysis. All programs crunched the data fairly quickly.
When we compared the programs' findings, we discovered something unexpected — inconsistencies in how the products record and report Web usage data. In fact, nearly every report we generated — including usage by time of day, types of browsers used, number of visitors, number of visits and top requesting sites — yielded varied results. For example, WebManage's NetIntellect claimed the site received 3,861 requests for a sample daily log file. Marketwave's Hit List claimed 600 for the same data. WebTrends Log Analyzer didn't summarize requests, but listed 3,620 hits for the site and 529 page views, or impressions.
The same data produces such different results because the tools make different assumptions during log analysis. First, there's the issue of what constitutes a unique visitor. Without cookies, log files don't record people or machines; they record IP addresses. If a user is browsing from behind a firewall, the program records the IP address of the proxy server. By assuming that one IP address equals one user, these programs risk underreporting visitors if more than one person accesses a site through the same proxy server. Fortunately, if a site uses cookies and the product can track visitors this way, it can avoid this ambiguity.
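The difference between the two counting rules is easy to show in Python. This is a minimal sketch with invented sample data; the `(IP, cookie)` record layout and both function names are assumptions for illustration, not the logic of any reviewed product.

```python
from typing import List, Optional, Tuple

Hit = Tuple[str, Optional[str]]  # (client IP, visitor cookie or None)


def visitors_by_ip(hits: List[Hit]) -> int:
    """Treat every distinct IP address as one visitor."""
    return len({ip for ip, _ in hits})


def visitors_by_cookie(hits: List[Hit]) -> int:
    """Prefer the cookie as identity; fall back to the IP when none is set."""
    return len({("cookie", c) if c else ("ip", ip) for ip, c in hits})


# Invented sample: three people share one proxy address, one browses directly.
sample = [
    ("10.0.0.1", "alice"),
    ("10.0.0.1", "bob"),
    ("10.0.0.1", "carol"),
    ("10.0.0.1", "alice"),
    ("10.0.0.2", None),
]
```

On this sample the IP rule sees two visitors and the cookie rule sees four, which is the underreporting described above.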
There's also the question of how the programs define a visit or session. Because HTTP is a stateless protocol, we really can't tell when a user is done looking at a site, which makes it difficult to determine visitation lengths. How much inactivity should constitute the end of a visit? Five minutes? Ten minutes? Thirty minutes? The default setting varies from vendor to vendor.
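How much the timeout matters can be seen in a short Python sketch. It assumes a simple rule, that a gap longer than the timeout starts a new visit; the function name `count_visits` and the sample timestamps are invented for illustration, and real products may apply different rules.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def count_visits(hits: Iterable[Tuple[str, float]], timeout_s: float = 1800.0) -> int:
    """Count visits, where a gap longer than timeout_s starts a new visit.

    hits are (visitor id, timestamp in seconds) pairs in any order.
    """
    times: Dict[str, List[float]] = defaultdict(list)
    for visitor, ts in hits:
        times[visitor].append(ts)

    visits = 0
    for stamps in times.values():
        stamps.sort()
        visits += 1  # the first request always opens a visit
        for prev, cur in zip(stamps, stamps[1:]):
            if cur - prev > timeout_s:
                visits += 1
    return visits


# Invented sample: visitor "a" makes three requests, visitor "b" makes one.
sample_hits = [("a", 0), ("a", 400), ("a", 1500), ("b", 100)]
```

With a five-minute timeout this sample yields four visits, with ten minutes three, and with thirty minutes two. The log is identical in each case; only the definition of a visit has changed.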
The vendors' default definitions are not necessarily right or wrong, but it's important that you understand how to read and modify a product's default settings before relying on its statistical reports. For example, we closely examined a particular portion of the test log and found that Log Analyzer reported 36 file requests for that time period, while NetIntellect 3.0 showed only 34. (NetIntellect 4.0 was not available until after we conducted this round of tests; the rest of our evaluation is based on NetIntellect 4.0.) Getting the same information from Hit List in a tabular format was a chore because the product delivers a graph style by default. Once we figured out how to customize reports, we found that Hit List reported only 19 requests for the same time period.
With a little digging, we realized that Hit List was excluding images from the internal database it uses for report generation, so its count was much lower than Log Analyzer's figure, which we verified by direct inspection of the logs. While the decision to drop image requests helps reduce the size of the stored database, users need to be aware of this option and be willing to sacrifice some useful information. Once we turned off this option, Hit List came up with 36 requests.
We had to dig deeper to find out why NetIntellect reported a slightly different answer. The only clue came from a log file that showed two requests using the HTTP HEAD method. (A HEAD request asks for only the headers of a Web document, which include information such as the content type and the date of last modification, and not the document body.) By default, NetIntellect drops HEAD requests, we discovered. According to WebManage, you can tune this feature as necessary.
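The effect of such defaults can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the list of image suffixes, the function name `count_requests` and the sample requests are all invented for illustration and do not reflect the actual filter settings of Hit List or NetIntellect.

```python
from typing import Iterable, Tuple

IMAGE_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png")  # assumed list, for illustration


def count_requests(
    requests: Iterable[Tuple[str, str]],
    include_images: bool = True,
    include_head: bool = True,
) -> int:
    """Count (method, path) log entries under the chosen filter settings."""
    total = 0
    for method, path in requests:
        if not include_head and method.upper() == "HEAD":
            continue  # drop HEAD requests when asked to
        if not include_images and path.lower().endswith(IMAGE_SUFFIXES):
            continue  # drop image requests when asked to
        total += 1
    return total


# Invented sample log fragment.
sample_requests = [
    ("GET", "/index.html"),
    ("GET", "/logo.gif"),
    ("GET", "/photo.JPG"),
    ("HEAD", "/index.html"),
    ("GET", "/news.html"),
]
```

The same five entries count as five, four or three requests depending on which filters are on, so two tools can disagree without either one being wrong.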
It's difficult to say which product generates the most accurate results. From the end user's point of view, Log Analyzer provides the most information, but the information provided is occasionally imprecise. In particular, the Most Active Cities and North American States & Provinces reports determine geographic usage trends by correlating resolved domain names with data from the InterNIC WHOIS database. Under this dangerous assumption, every America Online, UUNET or PSINet user will be reported as coming from Virginia because the domain records list Virginia as the ISPs' location. Hit List and NetIntellect are far more conservative in the assumptions made by their default settings.
Standardizing log definitions would improve consistency among products. BPA International (BPAI) is one organization helping to set some standards. However, though many tool vendors claim to support BPAI standards, products may not return the same results without careful tuning — one more reason users need to investigate the assumptions their products make and the terminology used in the reports. Still, as long as you always use the same product and settings when you analyze logs, you'll be able to determine trends and reach reasonable conclusions.
WebTrends sets the bar
WebTrends Log Analyzer is a powerful program with relatively minimal needs; you can run it from any Windows 95, 98 or NT system with 32M bytes of RAM and 20M bytes of disk space.
We had no trouble installing Log Analyzer, and its configuration was the most intuitive of the products we tested. WebTrends makes it easy for anyone with log-file access to generate a report.
Log Analyzer's built-in log-file reports range from basic executive summaries of site-usage patterns to advanced reports that include banner ad tracking and search-engine phrase monitoring. Some of its reports could be somewhat misleading, however, such as the geographic trends data.
You can save customized reports in a variety of formats, including HTML, Word, Excel and comma-delimited or ASCII text. Log Analyzer runs reports immediately, or you can use a built-in scheduler that automatically retrieves logs and saves results to disk, uploads them via File Transfer Protocol (FTP) to a remote system, or mails a report to interested parties.
You can program Log Analyzer to constantly refresh the results of logs read from a shared drive. This makes near real-time monitoring possible.
The speed of the product was adequate for the log files we tested. It took Log Analyzer seconds to crunch a few megabytes of logs with DNS resolution off. It was just as fast with preresolved entries, whereby the server resolves names as it goes along. Once DNS resolution was enabled, the product slowed significantly. Fortunately, processing speed is partially improved by the inclusion of FastTrends, a caching database that stores the results of processed logs.
Advanced features include support for clusters, log analysis for proxy server logs, filters to sort out multiple domains served from a single machine, Open Database Connectivity support for access to log files stored in databases, and remote reporting from a Web browser. The latest release of the product also provides the ability to analyze streaming media data from products such as RealNetworks' G2.
If you're looking for more than log analysis, WebTrends' $1,499 Enterprise Suite bundles numerous facilities for Web site content maintenance, including link checking, site quality analysis, alerting and monitoring, and proxy file analysis, with its Log Analyzer features.
Customization distinguishes Hit List
Marketwave's Hit List Commerce Suite 4.0, the midrange offering in the Hit List family, compares in scale to WebTrends Log Analyzer. The product is easy to install and runs on Windows 95, 98 and NT. Marketwave prudently suggests RAM and disk space based upon the size of files you plan to analyze. Your requirements may range from a mere 24M bytes of RAM to 256M bytes or more.
Hit List's interface is more difficult to use than WebTrends'. The problem was most noticeable when we tried to look at logs from more than one server. Because Hit List sets up a common configuration for reports when it is installed, we were required to create a custom configuration for any other logs we wanted to process.
To store log data, Hit List relies on an internal database. You can generate reports quickly once the data is loaded, but the first reading of a log file can be relatively time-consuming. We found that it's easy to accidentally mix log files from two separate Web sites in the same database because of the interface issues we mentioned.
Hit List's predefined reports are a close second to those of Log Analyzer in terms of variety and the quality of output. As with Log Analyzer, you can save reports in numerous formats, including Word, HTML, ASCII and comma-separated values. Hit List runs reports on a schedule and will automatically mail or save them, but it offers fewer scheduling options than Log Analyzer, and the interface is not as polished.
Hit List offers more powerful report-customization capabilities than the other products tested.
The types of possible queries are limited only by your ability to define how you want to view requests related to day, time, IP address, page load and a variety of other criteria. You can also relate data in a log to data found in outside databases, which could be useful to match IP or cookie information to entries in a sales system. (Log Analyzer supports similar associations, but they are not quite as sophisticated.)
Hit List is missing Log Analyzer's ability to easily browse remote sites via FTP for log files. Instead, it assumes you know how to form an appropriate FTP URL, such as ftp://user:password@www.xyz.com/logs/access. The program also lacks the ability to automatically post results to other sites.
For sites with log files of less than 50M bytes per day, the next step down in the Hit List family is the $395 Hit List Pro. For a lower price, you'll sacrifice Hit List Commerce Suite's data-linking feature, Marketwave's proxy server plug-in and Tetranet's Linkbot software, all of which are bundled with Commerce Suite. In addition, Commerce Suite, which is designed to handle log files up to 100M bytes per day, is faster than Hit List Pro.
NetIntellect gets the job done for less
The lowest-priced entry, WebManage's $295 NetIntellect 4.0, provides basic log analysis on par with the other two products but lacks some of their sophistication. NetIntellect requires Windows 95, 98 or NT and a minimum of 32M bytes of RAM. Installation is straightforward.
You can define numerous log files to analyze with NetIntellect, which uses a database to store data and pays the same performance price on initial load as Hit List. Importing data was straightforward. In Version 4.0, we're pleased to see WebManage cleaned up a buggy FTP access interface that hurt Version 3.0. However, NetIntellect 4.0 still provides an odd approach to retrieving logs from remote sites, forcing users to manually download the log and select a local copy. We also encountered strange bugs that eventually forced us to resort to manual FTP of log files.
NetIntellect 4.0 includes a scheduler, which allows you to automate the retrieval of files and running of reports. Fortunately, the scheduler interface is much improved over the previous version.
There's a greater number of reports and more opportunity to customize them in NetIntellect 4.0 than in Version 3.0, but the new product still lacks the ability to link to external data sources. The filtering facilities provided in NetIntellect 4.0 are much improved over the previous product, and it's easy to set report filters to exclude by browsers, domains or a variety of other criteria. However, NetIntellect 4.0 still saves files only in HTML and Word formats.
While most of the features in NetIntellect 4.0 are improvements over the last version, one feature we missed from 3.0 is dynamic summary mode, which allowed us to click through data sets. The new version adopts the more common report style that WebTrends and others have popularized.
For basic log-file analysis, NetIntellect is a good deal and provides the features that most entry-level users would want.
Originally published in Network World on July 12, 1999.