Web servers serving earthquake information are subject to tremendous surges in traffic due to earthquakes felt by large numbers of people. Coping with them is a challenge for system administrators. This article details how the U.S. Geological Survey Pasadena Field Office has attempted to deal with this phenomenon.
Note: A slightly revised version of this article was included in the 1999 Southern California Seismic Network Bulletin, which was published in the July/August, 2000 issue of Seismological Research Letters.
There is a popular web site called Slashdot which bills itself as, "News for Nerds. Stuff that Matters." Every day it serves up a list of links to articles of interest around the Internet. Its readership is substantial, numbering in the hundreds of thousands, and sites that are linked from it almost always experience a surge of web server traffic after being mentioned. This is known in the jargon as the "Slashdot Effect", and has reduced many web servers to smoking piles of charred silicon.
The Slashdot Effect was chronicled in a paper by Stephen Adler. In it, he discussed the Effect as seen by the servers at bnl.gov after some papers he wrote were linked from Slashdot. The essential feature of the Effect is a sudden, large increase in web server traffic after a link to the site has been posted in a Slashdot article. The traffic peaks on the day of the initial Slashdot posting, and falls off in subsequent days.
Here at the USGS Pasadena Office, our
web server experiences a very similar effect after any earthquake that
is felt by large numbers of people in the Los Angeles area. This can be
seen in this graph:
The large spike is the surge of traffic experienced after the October 16,
1999 Hector Mine Earthquake. This M7.1 event was felt over
all of Southern California, and as far away as Phoenix and Las Vegas.
Damage was minimal because the epicenter was in a remote
part of the desert. A widely felt earthquake that caused little damage
was an ideal event for generating web traffic.
The earthquake occurred at 02:46 PDT, and
by 03:00 the server was swamped. A more detailed look at that day's logs
gives us the following graph:
Note that the spike in traffic starts essentially immediately after the
event, and then tails off until dawn. There is then a second, larger spike starting around
06:00, when people in Los Angeles who didn't get out of bed at 3:00
woke up. The gap between 09:00 and 12:00 is an artifact of the data.
The server was processing so many requests that the log file filled
the disk. Traffic was being serviced during this time, but it was
not logged until space was cleared on the disk.
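Hit-rate curves of the kind discussed here can be derived by counting log entries per minute. The following is a minimal sketch, assuming the server writes its access log in Common Log Format. The sample lines are made up for illustration, and the function names are hypothetical rather than the tools actually used to produce the graphs.

```python
import re
from collections import Counter
from datetime import datetime

# Matches the bracketed timestamp of a Common Log Format entry,
# e.g. [16/Oct/1999:03:01:07 -0700]. The log format is an assumption.
TIMESTAMP = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2}) [+-]\d{4}\]")


def hits_per_minute(lines):
    """Return a Counter mapping each minute (as a datetime) to its hit count."""
    counts = Counter()
    for line in lines:
        match = TIMESTAMP.search(line)
        if match is None:
            continue  # skip lines that do not look like log entries
        stamp = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S")
        counts[stamp.replace(second=0)] += 1
    return counts


def peak_rate_per_second(counts):
    """Average hits per second during the busiest minute."""
    return max(counts.values()) / 60.0 if counts else 0.0


# Made-up sample entries, for illustration only.
sample = [
    '10.0.0.1 - - [16/Oct/1999:03:01:07 -0700] "GET /index.html HTTP/1.0" 200 5120',
    '10.0.0.2 - - [16/Oct/1999:03:01:42 -0700] "GET /index.html HTTP/1.0" 200 5120',
    '10.0.0.3 - - [16/Oct/1999:03:02:05 -0700] "GET /map.gif HTTP/1.0" 200 20480',
]
counts = hits_per_minute(sample)
for minute in sorted(counts):
    print(minute.strftime("%H:%M"), counts[minute])
```

Minutes with no logged entries simply do not appear in the result, so a period in which nothing could be written to the log shows up as a gap rather than as a true drop in traffic.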
The following graph gives a more detailed look at the period
immediately following the earthquake.
Note that the hit rate began increasing immediately, and
increased by almost three orders of magnitude within 10
minutes. The peak hit rate of 67/sec was reached about 15
minutes after the event.
During the peak traffic periods, the server was essentially unresponsive. The average data rate during peak periods was about 4Mb/sec. A rough estimate based on the relationship between the hit rate and data rate indicates that the peak data rate was probably around 7Mb/sec. The server was at that time on a 10Mb/sec link to the 100Mb/sec Caltech network, so its Ethernet was essentially saturated. In addition, the server's CPU was 100% busy. Because the server was saturated, the actual number of requests received was probably higher than the number recorded in the logs. The logs record only requests that were serviced, so we do not know how many requests were dropped that morning.
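As a rough cross-check of these figures, the peak hit rate and the estimated peak data rate together imply a mean response size. The sketch below uses only numbers quoted in the text and assumes that "Mb" means megabits; the function and variable names are illustrative.

```python
# Figures quoted in the text; "Mb" is assumed to mean megabits.
PEAK_HITS_PER_SEC = 67
PEAK_DATA_RATE_MBPS = 7.0   # estimated peak data rate, megabits per second
LINK_CAPACITY_MBPS = 10.0   # nominal capacity of the server's Ethernet link


def bytes_per_hit(data_rate_mbps, hits_per_sec):
    """Mean response size in bytes implied by a data rate and a hit rate."""
    return data_rate_mbps * 1_000_000 / 8 / hits_per_sec


mean_size = bytes_per_hit(PEAK_DATA_RATE_MBPS, PEAK_HITS_PER_SEC)
utilization = PEAK_DATA_RATE_MBPS / LINK_CAPACITY_MBPS

print(f"mean bytes per hit: {mean_size:.0f}")
print(f"nominal link utilization: {utilization:.0%}")
```

Under these assumptions the mean response works out to roughly 13 kB, and the estimated peak corresponds to about 70% of the link's nominal capacity, before any protocol overhead is counted.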
In the aftermath of the Hector Mine event, we began looking for ways to increase our web server capacity. After some research, three options presented themselves:
While searching the Internet for practical information and experiences of setting up an Apache reverse-proxy server, we encountered scores of people who advised that, while this option works well, a caching reverse-proxy known as Squid was a superior choice. Squid is primarily intended as a proxy server that ISPs and other networks use to speed up access to popular web sites for users on their local networks. It works by intercepting outbound HTTP requests and caching the files that come back from the remote servers. It can then serve repeated requests for the same pages locally, thus providing faster service. The Squid documentation revealed that it also has what is called an http-accelerator mode, which is essentially a reverse-proxy mode with caching. In this mode, Squid is set up to receive all incoming requests for web service. It forwards the requests to one or more back-end servers, and it caches the data returned so that it can service future requests itself. It caches data both in memory and on disk, and it can serve requests much faster than a conventional HTTP server.
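The idea behind accelerator mode can be illustrated with a toy model: a front end answers repeated requests from its cache and contacts the back end only on a miss or when the cached copy is too old. This is a conceptual sketch, not Squid's implementation. The class name, the `fetch_from_backend` callable, and the fixed 60-second freshness lifetime are all hypothetical, and a real cache would also honor HTTP freshness headers and spill to disk.

```python
import time


class CachingFrontEnd:
    """Toy accelerator: answer from the cache, otherwise ask the back end."""

    def __init__(self, fetch_from_backend, max_age_seconds=60.0,
                 clock=time.monotonic):
        self._fetch = fetch_from_backend  # hypothetical callable: path -> body
        self._max_age = max_age_seconds   # hypothetical freshness lifetime
        self._clock = clock
        self._cache = {}                  # path -> (time stored, body)
        self.hits = 0
        self.misses = 0

    def get(self, path):
        now = self._clock()
        entry = self._cache.get(path)
        if entry is not None and now - entry[0] < self._max_age:
            self.hits += 1                # fresh copy: back end is not touched
            return entry[1]
        body = self._fetch(path)          # miss or stale: forward the request
        self._cache[path] = (now, body)
        self.misses += 1
        return body


backend_calls = []


def backend(path):
    """Stand-in for the real web server; records every request it receives."""
    backend_calls.append(path)
    return f"<html>content of {path}</html>"


proxy = CachingFrontEnd(backend)
for _ in range(10):
    proxy.get("/recent-quakes.html")

print(f"front-end requests: {proxy.hits + proxy.misses}")
print(f"back-end requests: {len(backend_calls)}")
```

In this model, ten requests for the same page reach the back end only once, which is the effect that lets a cache absorb a surge of requests for a handful of popular pages.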
As a test, we set up a Squid server. It is an AMD K6-2/400 PC, with 384MB of RAM, and a 9GB fast-wide SCSI disk. It is running FreeBSD version 3.3, and Squid version 2.2-Stable4.
| Sidebar: Performance tuning the FreeBSD kernel |
| A minimal amount of performance tuning was done on the Squid server machine. The FreeBSD kernel was reconfigured with: |
| maxusers=512 |
| options NMBCLUSTERS=32768 |
| options NO_MEMORY_HOLE |
| options NO_F00F_HACK |
| options MAXDSIZ=(256*1024*1024) |
| options DFLDSIZ=(256*1024*1024) |
| makeoptions COPTFLAGS="-O2 -pipe" |
| The following commands were added to /etc/rc.local: |
| The kernel was also stripped down to remove unnecessary device drivers. |
| A brief explanation of the kernel configuration can be found here. |
| Nota Bene: It was necessary to recompile Squid after building a new kernel so that the executable would know about the new kernel configuration. |
The Squid server went live on November 23rd, 1999. On November 30th, there were two earthquakes, magnitude 3.2 and 3.1, about 15 minutes apart, centered under West L.A. The first event was at 10:27 PST, and the web server experienced a traffic spike within ten minutes. During the day, the server normally experiences a hit rate of about 0.4/sec. During this traffic spike, it jumped to 19.8/sec. The Squid server handled this traffic easily. At the same time, other servers in the Seismo Lab were slowed to a crawl. The spike in traffic was brief, but even at peak activity, our server was responsive at all times.
On December 7th, there was a M3.9 earthquake centered about
50km (30 miles) southeast of Los Angeles, near densely
populated areas of Orange County and Riverside.
This event occurred at 13:58 PST, and the hit rate on the
server was increasing within two minutes.
The following graph
shows the hit rate reported by the internal cache manager
in the Squid server:
Note that the hit rate went from 0.5/sec to 15/sec within two minutes of the earthquake, and peaked at just under 32/sec at about 15 minutes. The peak rate was about half that recorded after Hector Mine. This was followed by a falling-off which continued for the rest of the day. The hit rate did not return to the normal level for several hours.
During this time, the Squid server was responsive. The Squid
process is not particularly CPU-intensive, as the following
graph shows:
Note that the implied relation between hit rate and CPU usage in the above graph indicates that CPU saturation should occur around 300 hits/sec, which was the behavior observed in testing.
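To make the extrapolation concrete, the sketch below assumes a simple linear model in which CPU usage is proportional to hit rate and reaches 100% at the 300 hits/sec saturation point quoted above. The proportionality through zero is an assumption made for illustration; the measured curve may differ.

```python
SATURATION_HITS_PER_SEC = 300.0  # saturation point quoted in the text


def estimated_cpu_percent(hits_per_sec):
    """CPU usage under a linear model that reaches 100% at saturation."""
    return min(100.0, 100.0 * hits_per_sec / SATURATION_HITS_PER_SEC)


# Peak hit rates quoted in the text for two of the events.
for label, rate in [("December 7 peak", 32.0), ("Hector Mine peak", 67.0)]:
    cpu = estimated_cpu_percent(rate)
    print(f"{label}: {rate:.0f} hits/sec -> about {cpu:.1f}% CPU")
```

Under this model, the December 7 peak would use only about a tenth of the CPU, and even a surge at the Hector Mine peak rate would leave substantial headroom.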
Another graph of interest is the daily graph for activity on the
back-end web server. Even though the Squid was servicing the majority
of requests, requests for recently modified documents and CGI scripts had to
be passed through. Thus, its activity shows a spike around the
time of the events:
Note that the peak activity is between 14:00 and 15:00, when the server processed 7,682 hits. The comparable report for the Squid server shows that during the same hour it handled 58,449 hits, a factor of about 7.6 more. This implies that roughly 87% of requests were answered from the cache, which agrees reasonably well with the Squid server's internal statistics, which indicate about a 90% hit rate for cached items. The load on the main server has thus been reduced considerably.
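The comparison between the two hourly counts can be worked through directly. The sketch below assumes that every request reaching the back-end server during that hour was forwarded by the Squid server; the function name is illustrative.

```python
BACKEND_HITS = 7_682   # back-end server, 14:00 to 15:00
SQUID_HITS = 58_449    # Squid server, same hour


def implied_cache_hit_ratio(front_end_hits, backend_hits):
    """Fraction of front-end requests that never reached the back end."""
    return 1.0 - backend_hits / front_end_hits


ratio = implied_cache_hit_ratio(SQUID_HITS, BACKEND_HITS)
reduction = SQUID_HITS / BACKEND_HITS

print(f"implied cache hit ratio: {ratio:.1%}")
print(f"load reduction factor: {reduction:.1f}")
```

With these counts the implied cache hit ratio is about 87% and the back-end load is reduced by a factor of about 7.6, in the same range as the roughly 90% figure from Squid's internal statistics.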
On March 6, 2000, there was an M4.0 event in Orange County. After this event, the Squid server experienced a peak hit rate of 82/sec, higher than the 67/sec peak logged after Hector Mine, and it handled the load well. This indicates that we are now better able to handle the large traffic loads generated by felt earthquakes, and that the Squid server is performing well.