Thunderstone chases Google's search watermark
Posted: 10.31.05
With the continuing explosion of unstructured Web-based content in the enterprise, a quality search engine is no longer a luxury, but a necessity. Encouraged by reader feedback after our recent Google Search Appliance Clear Choice Test, we tested a similar product, the Thunderstone Search Appliance.
Overall, the Thunderstone Software appliance is a capable, flexible and fast search platform, though at times it is hampered by its lack of polish in the areas of administration and security.
Immediately upon installation, it is clear the Thunderstone appliance does not hide its implementation details well. Packaged in a custom blue case is a fairly stock Red Hat Linux box equipped with the open source Webmin interface for addressing system tasks.
To configure the search functionality, we had to use the supplied, very rudimentary Web-based interface, which simply does not do justice to the power of the search provided. While some users may be initially attracted to what appears to be a simple form-based interface, we found the forms cluttered and confusing, containing little or no field grouping, and rife with little annoyances, most notably one-line-high scrolling text areas that don't allow you to see a field's contents at once. During testing, we also found pages occasionally not displaying the requested information.
However, once you get beyond the interface issues, you will see that the system allows for detailed customization of indexing and search results. When building a search index with the Thunderstone appliance, you first indicate the starting URL(s) and the particular file types to include or exclude during the site walk.
If you take the time to explore the complete walk settings, you will find many features that may help you handle the special cases you might encounter during a site walk. For example, it is possible to configure the system to remove the contents of certain types of tags or even remove commonly found text in page navigation, headers and footers.
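To make the idea of removing the contents of certain tags concrete, here is a minimal Python sketch of that kind of filtering. It is our own illustration, not the appliance's implementation: the class `TextExtractor`, the function `indexable_text` and the default list of excluded tags are all hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects page text, skipping everything inside the excluded tags."""

    def __init__(self, excluded_tags):
        super().__init__()
        self.excluded = set(excluded_tags)
        self.depth = 0    # greater than 0 while inside an excluded element
        self.parts = []   # text fragments that will be indexed

    def handle_starttag(self, tag, attrs):
        if tag in self.excluded:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.excluded and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every excluded element.
        if self.depth == 0 and data.strip():
            self.parts.append(data.strip())


def indexable_text(html, excluded_tags=("script", "style", "nav", "footer")):
    """Return the text of a page with the excluded elements removed."""
    parser = TextExtractor(excluded_tags)
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

Calling `indexable_text` on a page whose navigation and footer sit in `nav` and `footer` elements returns only the body copy, which is the effect the walk settings are after.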
However, you may find indexing sites with form-based logons very difficult to do, requiring lots of trial and error if you want to do more than basic Web authentication.
We were happy to find the Thunderstone crawler (the Texis software the company has offered for years) was able to traverse our test sites fairly easily because it can be configured to execute JavaScript content, including external .js files, or examine strings within JavaScript for URLs to traverse. While in practice this helped the program move around sites, there were situations where the crawler made mistakes with JavaScript content and noted many pages in error. For example, on one test site that used Google AdSense, the crawler pulled out data that was not a URL to crawl. However, these problems were forgivable given that many crawlers cannot even index sites that rely too much on JavaScript for navigation.
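The trade-off in scanning JavaScript strings for URLs can be shown with a short Python sketch. This is an assumption-laden illustration of the general technique, not Thunderstone's crawler logic: the function `candidate_urls` and both regular expressions are hypothetical, and the ad variables in the usage note are made-up placeholders.

```python
import re
from urllib.parse import urljoin

# Any quoted run of characters with no whitespace or quotes inside it.
STRING_LITERAL = re.compile(r"""["']([^"'\s]+)["']""")

# Heuristic: starts like a URL or path, or ends in a common page extension.
LOOKS_LIKE_LINK = re.compile(
    r"^(https?://|/|\.{1,2}/)|\.(html?|php|aspx?|pdf)$", re.IGNORECASE
)


def candidate_urls(js_source, base_url):
    """Return absolute URLs for string literals that look like links."""
    found = []
    for literal in STRING_LITERAL.findall(js_source):
        if LOOKS_LIKE_LINK.search(literal):
            absolute = urljoin(base_url, literal)
            if absolute not in found:
                found.append(absolute)
    return found
```

Given a script that sets `var next = "/products/list.html"` alongside ad settings such as `google_ad_format = "728x90_as"`, this heuristic keeps the path and skips the ad value. Loosen the pattern to catch more links and strings like the ad value start to look like URLs, which is the kind of mistake we saw during the walk.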
After building a search index, you can edit the results, doing things such as removing bad entries or defining matches for particular queries called Best Bets. You can define as many hard-wired matches as you like for particular URL-keyword combinations. Unfortunately, taking advantage of this would be incredibly time-consuming given the awkward interface. There appears to be no direct way to manage keyword match-ups en masse outside a suspiciously dangerous work-around suggested on the tech support site to export all settings to an XML file, make changes and reimport them.
Keeping the search index up-to-date on the Thunderstone appliance is easy if you schedule reindexing processes. We preferred the appliance's ability to perform trigger-based crawling, which lets the appliance watch a URL and rewalk a site when the contents of that resource change.
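Trigger-based crawling boils down to remembering what a watched resource looked like last time and comparing. The Python sketch below shows one way to do that with a content hash. It is a sketch under our own assumptions: `ChangeTrigger`, `should_rewalk` and the injected `fetch` callable are hypothetical, and we do not know how the appliance actually detects changes.

```python
import hashlib


class ChangeTrigger:
    """Reports when the content behind a watched URL differs from the last check."""

    def __init__(self, fetch):
        self.fetch = fetch        # callable taking a URL and returning bytes
        self.last_digest = {}     # URL -> SHA-256 hex digest from the previous check

    def should_rewalk(self, url):
        digest = hashlib.sha256(self.fetch(url)).hexdigest()
        previous = self.last_digest.get(url)
        self.last_digest[url] = digest
        # The first check only records a baseline, so it never triggers a rewalk.
        return previous is not None and previous != digest
```

Passing the fetch function in keeps the comparison logic testable without a network. In practice it would wrap an HTTP client and `should_rewalk` would be called on a timer.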
To complete the configuration, you will want to configure the search-results page and integrate a search box into your site. For most sites it will be a matter of defining a header and footer that match the visuals of the site to wrap the search results and choosing one of the eight defined result styles. Of course, you also could fetch results in XML and use Extensible Stylesheet Language Transformations to transform your results in an arbitrary manner.
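For readers who would rather post-process XML results in code than in a stylesheet, the Python sketch below turns a result document into an HTML list. Note the swap: it uses the standard library's ElementTree rather than XSLT. The XML layout (`results`, `result`, `url`, `title`, `summary`) is invented for this example and should not be read as the appliance's actual output format.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical result document; the real schema may differ.
SAMPLE_RESULTS = """<results query="annual report">
  <result>
    <url>http://example.com/report.pdf</url>
    <title>Annual Report</title>
    <summary>Revenue &amp; costs</summary>
  </result>
  <result>
    <url>http://example.com/news.html</url>
    <title>News</title>
    <summary>Press releases</summary>
  </result>
</results>"""


def render_results(xml_text):
    """Render each <result> element as an HTML list item."""
    root = ET.fromstring(xml_text)
    items = []
    for result in root.findall("result"):
        url = result.findtext("url", default="")
        title = result.findtext("title", default=url)
        summary = result.findtext("summary", default="")
        # Escape every value so result text cannot inject markup.
        items.append(
            '<li><a href="%s">%s</a><p>%s</p></li>'
            % (escape(url, quote=True), escape(title), escape(summary))
        )
    return "<ul>\n" + "\n".join(items) + "\n</ul>"
```

The returned fragment can be wrapped in the same header and footer used for the built-in result styles.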
The quality of the search results was generally good. For searches on known unique keywords and phrases, we got nearly the same results from Thunderstone as from a Google Mini used as a control. However, in searches for less-unique keywords the results were not always as useful as the Google search results. For example, some PDF files were ranked higher because the file's creation path was used as its title. In other queries, landmark pages such as home pages were often found farther down in result lists. Yet when we tuned the ranking features using an advanced option on the Thunderstone search result page, results often more closely matched those of the Google appliance.
One aspect of Thunderstone's search results we really liked was that when few results were returned, the "Did You Mean...?" option suggested multiple choices rather than just one. It also showed the number of hits for each suggested query so you could see where the rich result sets were.
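A multiple-suggestion "Did You Mean...?" feature with hit counts can be approximated in a few lines of Python. This sketch uses the standard library's `difflib` for fuzzy matching, which is our substitute and not necessarily what Thunderstone uses. The function `did_you_mean` and the `term_hits` mapping of indexed terms to hit counts are hypothetical.

```python
import difflib


def did_you_mean(query, term_hits, limit=3):
    """Suggest up to `limit` indexed terms close to `query`, with hit counts.

    term_hits maps each indexed term to the number of documents it matches.
    """
    wanted = query.lower()
    candidates = difflib.get_close_matches(
        wanted, list(term_hits), n=limit, cutoff=0.6
    )
    suggestions = [(term, term_hits[term]) for term in candidates if term != wanted]
    # Show the richest result sets first.
    return sorted(suggestions, key=lambda pair: pair[1], reverse=True)
```

Showing the hit count next to each suggestion lets the user pick the alternative most likely to return a useful result set, which is what we liked about the appliance's version.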
Administration-wise, the Thunderstone could stand some improvement. The current version does not support SNMP integration. In terms of reporting, the Web interface gives you access to some basic system logs, as well as some rudimentary query reports. Query logging could be much richer and should be searchable. Besides richer usage data, we would prefer that it be saved in a common Web log format so that it could be easily processed by log analysis tools. Furthermore, purging the query logs on rewalk does not seem a good idea, as it erases valuable historical data.
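As an illustration of what query logging in a standard format could look like, the Python sketch below writes one search request as a line in the NCSA Common Log Format that most log analysis tools accept. We are assuming that is the kind of format meant here. The function `clf_line`, the `/search` path and the `q` parameter are hypothetical and do not describe the appliance's real URLs.

```python
from datetime import datetime, timezone
from urllib.parse import quote_plus


def clf_line(client_ip, when, query, status=200, size=0, path="/search"):
    """Format one search request as a Common Log Format line.

    Layout: host ident authuser [timestamp] "request" status bytes
    """
    # %b is locale-dependent; under the default C locale it yields "Oct", "Nov", etc.
    timestamp = when.strftime("%d/%b/%Y:%H:%M:%S %z")
    request = "GET %s?q=%s HTTP/1.1" % (path, quote_plus(query))
    return '%s - - [%s] "%s" %d %d' % (client_ip, timestamp, request, status, size)
```

Appending lines like these to a file that survives rewalks would also preserve the historical query data that is currently purged.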
There also were aspects of administration that were quite nice, such as a software updating system and some integrated methods to send technical support and system configuration information to the appliance.
The documentation for the product is terse and could be improved with screen captures and more examples with explanations. The help system is not very well done, forcing you to jump up and down long form screens.
Finally, we were disappointed with the security. The Web console does not force SSL access by default and new users are created with full administrator privileges, a dangerous prospect. Fortunately, you can lock down access control by user or group, but in a convoluted manner compared with other multiuser systems we've tested. The system also allows very weak passwords, and it does not provide any user auditing that we could find. Thunderstone indicated that a future revision will include user auditing.
Thunderstone's Enterprise Search Appliance is a fast and flexible offering with abundant areas for fine-tuning search. It also lacks polish and can be awkward to use. Yet compared with the Google Search Appliance, it is quite inexpensive, and network professionals willing to put in some time may be rewarded with a powerful search facility at an affordable price.
How we did it
We tested the Thunderstone Search Appliance both within an intranet environment and on public Web sites, and indexed numerous production sites as large as 20,000 documents. We customized search pages to test integration with existing site styles. Load verification was done using inexpensive load-generation tools (www.loadtestingtool.com), given that the load required to prove the devices operated as specified did not warrant more sophisticated tools.