Web Servers, Earthquakes, and the Slashdot Effect


Abstract

Web servers serving earthquake information are subject to tremendous surges in traffic due to earthquakes felt by large numbers of people. Coping with them is a challenge for system administrators. This article details how the U.S. Geological Survey Pasadena Field Office has attempted to deal with this phenomenon.

Note: A slightly revised version of this article was included in the 1999 Southern California Seismic Network Bulletin, which was published in the July/August 2000 issue of Seismological Research Letters.


Introduction

There is a popular web site called Slashdot which bills itself as "News for Nerds. Stuff that Matters." Every day it serves up a list of links to articles of interest around the Internet. Its readership is substantial, numbering in the hundreds of thousands, and sites linked from it almost always experience a surge of web server traffic after being mentioned. This is known in the jargon as the "Slashdot Effect", and it has reduced many web servers to smoking piles of charred silicon.

The Slashdot Effect was chronicled in a paper by Stephen Adler. In it, he discussed the Effect as seen by the servers at bnl.gov after some papers he wrote were linked from Slashdot. The essential feature of the Effect is a sudden, large increase in web server traffic after a link to the site has been posted in a Slashdot article. The traffic peaks on the day of the initial Slashdot posting, and falls off in subsequent days.

Earthquakes and the Slashdot Effect

Here at the USGS Pasadena Office, our web server experiences a very similar effect after any earthquake that is felt by large numbers of people in the Los Angeles area. This can be seen in this graph:

[Graph: daily web server traffic, showing the surge after the Hector Mine earthquake]

The large spike is the surge of traffic experienced after the October 16, 1999 Hector Mine Earthquake. This M7.1 event was felt over all of Southern California, and as far away as Phoenix and Las Vegas. Damage was minimal, due to its remote location in the desert. The lack of damage meant that it was an ideal event for generating web traffic. The earthquake occurred at 02:46 PDT, and by 03:00 the server was swamped. A more detailed look at that day's logs gives us the following graph:

[Graph: hourly web server traffic on October 16, 1999]

Note that the spike in traffic starts essentially immediately after the event, and then tails off until dawn. There is then a second, larger spike starting around 06:00, when people in Los Angeles who didn't get out of bed at 3:00 woke up. The gap between 09:00 and 12:00 is an artifact of the logging: the server was processing so many requests that the log file filled the disk. Traffic was still being serviced during this time, but it was not logged until space was cleared on the disk.

The following graph gives a more detailed look at the period immediately following the earthquake. Note that the hit rate began increasing immediately, rising by almost three orders of magnitude within 10 minutes. The peak rate of 67/sec was reached about 15 minutes after the event:

[Graph: hit rate in the minutes immediately following the Hector Mine earthquake]

During the peak traffic periods, the server was essentially unresponsive. The average data rate during these periods was about 4Mb/sec, and a rough estimate based on the relationship between hit rate and data rate indicates that the peak data rate was probably around 7Mb/sec. The server was at that time on a 10Mb/sec link to the 100Mb Caltech network, so its Ethernet was essentially saturated. In addition, the server's CPU was 100% busy. Since the server was saturated, the actual number of requests received was likely higher than the number recorded in the logs. The logs only record requests that were serviced, so we really do not know how many requests fell on the floor that morning.

Increasing Web Server Capacity

In the aftermath of the Hector Mine event, we began looking for ways to increase our web server capacity. After some research, three options presented themselves:

Option One:

The first option came from a suggestion that we move some of our higher-traffic pages to a commercial web hosting company. A web hosting company would have many servers, capable of handling very large traffic loads. The downside of this option stems from the nature of the information we are serving. The most popular pages requested after an earthquake are three maps, all of them necessarily dynamic pages. The Community Internet Intensity Map is updated at five-minute intervals after an event, and the other two maps are updated after every significant earthquake. During a major sequence, this can translate into updates every 2-5 minutes, which would make it very difficult to propagate updated versions of the maps to an off-site provider in real time. Because of this difficulty, this option was shelved.

Option Two:

The second option was to set up several servers on the local LAN, each with a full copy of the web pages, with a Coyote Point Equalizer as a front end to distribute requests among them. In this scenario, it would still be necessary to propagate updated files to each server, but this would not be as big a problem as in the first option, since the servers would all be on the local USGS LAN. The major downside of this option is the $4,000 cost of the base-model Equalizer, plus the cost of setting up the additional web servers.

Option Three:

The third option was suggested by an article from Web Techniques titled "Load Balancing Your Web Site: Practical Approaches for Distributing HTTP Traffic". In this article, the author discusses several approaches to distributing load to a web site, finishing with a discussion of using Apache as a reverse-proxy server. The idea is to use a stripped-down Apache server to intercept incoming http requests and then to dole them out to a farm of back-end servers which actually serve up the data. By not having to do any disk I/O, the front-end server can be made to run very fast. This was the option that we finally decided to try, since it seemed to be the easiest and most cost-effective to implement.
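For illustration, a minimal sketch of such a front end, using Apache's standard mod_proxy directives (the back-end host name here is hypothetical, not our actual configuration):

# httpd.conf fragment: pass every incoming request to a back-end
# server, and rewrite redirects from the back end so they point
# at the front end (host name is illustrative only)
ProxyRequests Off
ProxyPass / http://backend.example.gov/
ProxyPassReverse / http://backend.example.gov/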

The Design Goal

At the time, we set a design goal of increasing our web server capacity by an order of magnitude. This figure was chosen because the existing server had been able to half-fill its 10Mb/sec Ethernet connection after Hector Mine. By the nature of Ethernet, 40% bandwidth utilization is about the point where performance begins to degrade, so it is about the highest level that can be practically achieved. Switching to a 100Mb/sec Fast Ethernet connection would raise our network saturation point to about 40Mb/sec. The Caltech network is 100Mb from our building to the Gigabit campus backbone, so if our improved server could fill the 100Mb link, it would be doing about as much as it could without major revisions to the USGS connection to the Caltech network.

A Fourth Option Presents Itself

While searching the Internet for practical information on setting up an Apache reverse-proxy server, we encountered scores of people who advised that, while this option works well, a caching reverse proxy known as Squid was a superior choice. Squid is primarily intended to be a proxy server for ISPs and other networks to use as a way of speeding up access to popular web sites for users on their local networks. It works by intercepting outbound http requests and caching the files that come back from the remote servers, so that it can serve repeated requests for the same pages locally, providing faster service.

Reading the Squid documentation revealed that it also has what is called an httpd-accelerator mode, which is essentially a reverse-proxy mode with caching. In this mode, Squid is set up to receive all incoming requests for web service. It forwards the requests to one or more back-end servers and caches the data they return so that it can service future requests itself. Because it caches data both in memory and on disk, it can serve requests much faster than a conventional http server.
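For illustration, a minimal sketch of an httpd-accelerator setup in squid.conf, using directives from the Squid 2.x documentation (the back-end host name is hypothetical, not our actual configuration):

# squid.conf fragment: accept web requests directly and forward
# cache misses to the back-end server (host name is illustrative only)
http_port 80
httpd_accel_host backend.example.gov
httpd_accel_port 80
httpd_accel_with_proxy off
httpd_accel_uses_host_header off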

From Pie-in-the-Sky to Proof-of-Concept

As a test, we set up a Squid server. It is an AMD K6-2/400 PC, with 384MB of RAM, and a 9GB fast-wide SCSI disk. It is running FreeBSD version 3.3, and Squid version 2.2-Stable4.
Sidebar: Performance tuning the FreeBSD kernel
A minimal amount of performance tuning was done on the Squid server machine. The FreeBSD kernel was reconfigured with:
maxusers 512                        # sizes many kernel tables (processes, open files)
options NMBCLUSTERS=32768           # more mbuf clusters for network buffers
options NO_MEMORY_HOLE              # no 15-16MB ISA memory hole on this hardware
options NO_F00F_HACK                # omit the Pentium F00F workaround (CPU is an AMD K6-2)
options MAXDSIZ=(256*1024*1024)     # maximum per-process data size: 256MB
options DFLDSIZ=(256*1024*1024)     # default per-process data size: 256MB
makeoptions COPTFLAGS="-O2 -pipe"   # optimization flags for compiling the kernel
The following commands were added to /etc/rc.local:
/sbin/sysctl -w kern.maxfiles=16384         # raise the system-wide open-file limit
/sbin/sysctl -w kern.maxfilesperproc=16384  # raise the per-process limit for Squid's many sockets
The kernel was also stripped down to remove unnecessary device drivers.
Nota Bene: It was necessary to recompile Squid after building a new kernel so that the executable would know about the new kernel configuration.
This machine was set up as a front end for the USGS Pasadena web server. Testing was done using a set of seven Sun workstations running the Apache Benchmark program, with each workstation instructed to request a set of six files between 100 and 10,000 times. In this manner, we were able to subject the Squid server to 350,000-400,000 hits per hour for six hours, simulating a sustained load about twice as large as the Hector Mine peak and ten times the average load experienced on October 16th.

The server performed well. At a hit rate of about 110/sec we experienced some slowdowns, which turned out to be due to the kernel running out of network buffers. A new kernel was built with a higher value of NMBCLUSTERS to fix this. After recompiling Squid, more tests were run, and a maximum hit rate of 367/sec was achieved. The data rate reported by the cache manager was about 64Mb/sec, well beyond the 40Mb/sec we had hoped for. This was possible because the Fast Ethernet interface was set to run full-duplex, which raises the practical saturation limit to around 80-90Mb/sec.

The Squid server was observed to be CPU-bound at this level of traffic, indicating that we had reached the limit of its ability to handle packets. Still, the server was responsive at all times during this test. Since then, we have replaced the NIC in the server with one that is more efficient at handling packets, reducing the CPU load from network traffic; this should enable us to handle even higher levels. Near-real-time graphs of the network traffic on our server can be found at http://bort.gps.caltech.edu/mrtg.
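For illustration, a load test of this sort can be driven with the Apache Benchmark tool, ab; the URL, request count, and concurrency below are placeholders rather than the values used in our tests:

# run on each workstation, once per test file
# (-n is the total number of requests, -c the number of concurrent clients)
ab -n 10000 -c 20 http://www-server.example.gov/testfile.html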

The Squid server went live on November 23rd, 1999. On November 30th, there were two earthquakes, magnitude 3.2 and 3.1, about 15 minutes apart, centered under West L.A. The first event was at 10:27 PST, and the web server experienced a traffic spike within ten minutes: from a normal daytime hit rate of about 0.4/sec, traffic jumped to 19.8/sec. The spike was brief, but the Squid server handled it easily and remained responsive at all times, even as other servers in the Seismo Lab were slowed to a crawl.

On December 7th, there was an M3.9 earthquake centered about 50km (30 miles) southeast of Los Angeles, near densely populated areas of Orange County and Riverside. This event occurred at 13:58 PST, and the hit rate on the server was increasing within two minutes. The following graph shows the hit rate reported by the internal cache manager in the Squid server:

[Graph: Squid cache manager hit rate after the December 7th earthquake]

Note that the hit rate went from 0.5/sec to 15/sec within two minutes of the earthquake, and peaked at just under 32/sec about 15 minutes after the event, roughly half the peak rate recorded after Hector Mine. This was followed by a falling-off which continued for the rest of the day; the hit rate did not return to its normal level for several hours.

During this time, the Squid server was responsive. The Squid process is not particularly CPU-intensive, as the following graph shows:

[Graph: Squid server CPU usage during the December 7th traffic spike]

Note that the relation between hit rate and CPU usage implied by the above graph puts CPU saturation at around 300 hits/sec, consistent with the behavior observed in testing.

Another graph of interest is the daily activity graph for the back-end web server. Even though the Squid server was servicing the majority of requests, requests for recently modified documents and CGI scripts had to be passed through to the back end, so its activity also shows a spike around the time of the events:

[Graph: back-end web server activity on December 7th]

Note that the peak activity is between 14:00 and 15:00, when the back-end server processed 7,682 hits. The comparable report for the Squid server shows that it handled 58,449 hits during the same hour, nearly eight times as many. This correlates well with the Squid server's internal statistics, which indicate a cache hit rate of about 90%, and it shows that the load on the main server has been reduced considerably.

On March 6, 2000, there was another M4.0 event in Orange County. After this event, the Squid server experienced a peak hit rate of 82/sec, which it handled well. This indicates that we are now much better able to handle the large traffic loads generated by felt earthquakes.

Conclusion

It would appear, from the combination of testing and the real experience of the November 30th, December 7th, and March 6th events, that the Squid server acting as a front end for our regular office server has performed well. With a bit of luck, we will be much better able to handle the traffic generated by the next big earthquake in Los Angeles.


Stan Schwarz
Honeywell Technical Services
Southern California Seismic Network Contract
Pasadena, California