Analysis of Redundancy Requirements
for a Web Server Site

A highly visible organization suffered multiple embarrassing web server failures over a period of several months. Networking Unlimited, Inc. was called in to review all aspects of the operation, from validating the selection of NT versus UNIX servers, to analyzing the firewall crashes, to suggesting how best to work around ISP failures.


The organization's web site was used to publish time-critical announcements to the organization's customers and other interested parties. The web site was run from the organization's data center, protected by a firewall and router using a T1 link to a major ISP. As the web site grew in popularity and importance, however, it started to fail. One month the server crashed under the load; another month, the ISP went down for several hours; two months later, the firewall crashed.

Networking Unlimited, Inc. was called in by the IS Director to take an overall look at the situation and provide recommendations for avoiding future problems.

Technical Approach

The web site was a typical small installation running Windows NT on quality dual-processor servers, with a back-end Windows NT-based database server providing much of the content (using Cold Fusion). Cold standby redundancy was provided by a duplicate set of servers and hardware. The web server connected to the Internet through a firewall and a router, with a dedicated T1 link provided by an ISP. As is typical of many systems that have grown quickly without anyone taking the time to rethink the design, the architecture of the site was a mixture of high quality and unnecessary weakness.

The NT servers had already been worked on by a Windows consulting firm, and the cause of the NT crash was well understood. However, while the Windows consulting firm knew how to tune NT servers, its understanding of networking was deficient, and Networking Unlimited, Inc. pointed out to the client a number of misleading conclusions in the firm's final report. In particular, the firm's stress testing, while better than what the client had previously done, still left many aspects of the server and network unstressed. The testing measured the speed of serving content to a few dozen simulated users, but not the number of simultaneous connections. This is unfortunate, because a test of simultaneous connections should have revealed that the firewall, as configured, could not handle a very large number of partially open TCP connection attempts. Detecting this could have prevented the firewall crash that later occurred when the firewall ran out of memory and died during a peak traffic period.
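The memory exhaustion described above can be estimated with simple arithmetic before any load test is run. The sketch below is an illustration only and is not taken from the original report. It applies Little's law (entries held = arrival rate x time each entry is held) to partially open connections. The function names, the 75-second hold time, the 512-byte entry size, the 4 MB budget, and the traffic rates are all hypothetical assumptions chosen for the example, not measurements from the site.

```python
# Rough sizing of a firewall's table of partially open TCP connections.
# All figures used here are hypothetical; substitute values from the
# actual firewall documentation and traffic logs.


def half_open_entries(attempts_per_second, hold_seconds):
    """Steady-state number of partially open entries (Little's law)."""
    return attempts_per_second * hold_seconds


def table_memory_bytes(entries, bytes_per_entry):
    """Memory consumed by the connection table at that occupancy."""
    return entries * bytes_per_entry


def exceeds_budget(attempts_per_second, hold_seconds,
                   bytes_per_entry, budget_bytes):
    """True if the estimated table size is larger than the memory budget."""
    entries = half_open_entries(attempts_per_second, hold_seconds)
    return table_memory_bytes(entries, bytes_per_entry) > budget_bytes


if __name__ == "__main__":
    budget = 4 * 1024 * 1024  # assumed 4 MB available for the table
    for rate in (10, 200):    # assumed quiet and peak attempts per second
        entries = half_open_entries(rate, 75)
        size = table_memory_bytes(entries, 512)
        status = "over" if exceeds_budget(rate, 75, 512, budget) else "within"
        print(f"{rate} attempts/s: {entries} entries, {size} bytes, {status} budget")
```

A test that only measures page delivery speed for a few dozen users never drives the arrival rate high enough to fill such a table, which is why a separate test of simultaneous connections is needed.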

The ISP crash was a random event and had nothing to do with the site or its extremely cyclical traffic patterns. However, in subsequent discussions with management, it served as a graphic example of the need for redundancy at all levels of the operation.

The most important result of the Networking Unlimited, Inc. investigation was not the causes of the various crashes, or even how to prevent them. Rather, it was the first consistent, overall look at how the site was designed and at how the various factions and departments responsible for its operation could work together more effectively to enhance the reliability and quality of the product.

Bottom Line Results

In addition to informal discussions with operating personnel and the various managers responsible for different pieces of the web site, an extended written report was provided. The report covered why the different crashes had occurred and analyzed the effectiveness of various approaches to providing redundant operation of the site. These approaches ranged from business as usual (doing nothing to provide automatic recovery from failures) to establishing a parallel site at a distant location and letting the Domain Name servers handle failure recovery.
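One common way to compare alternatives like these is to estimate availability for each design. The sketch below is an illustration, not the method used in the report. It treats a single site as a chain in which every component must work (availabilities multiply) and a parallel site as a backup that must also fail before service is lost. The component availability figures and function names are hypothetical, and the parallel formula assumes independent failures and immediate failover. Recovery through Domain Name servers is slower than that in practice, because clients keep using cached addresses until the record's time-to-live expires.

```python
from math import prod

HOURS_PER_YEAR = 8760


def series_availability(components):
    """Availability of a chain where every component must be up."""
    return prod(components)


def parallel_availability(sites):
    """Availability when service survives as long as any one site is up."""
    return 1 - prod(1 - a for a in sites)


def downtime_hours_per_year(availability):
    """Expected hours of outage per year at the given availability."""
    return (1 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical figures for server, firewall, router, and ISP link.
    single_site = series_availability([0.995, 0.998, 0.999, 0.997])
    two_sites = parallel_availability([single_site, single_site])
    for name, a in (("single site", single_site), ("parallel sites", two_sites)):
        print(f"{name}: {a:.6f} available, "
              f"{downtime_hours_per_year(a):.2f} hours down per year")
```

The series calculation also shows why a site built from high-quality parts can still fail often: the chain is always less available than its weakest component.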

Armed with a better understanding of what they already had, and a clear picture of the alternatives available to them, MIS management was in a position to make a cost-effective decision on how to upgrade the web site, rather than simply reacting to each problem as it occurred.

