These problems are suspiciously similar to problems we've been having
with home.mcom.com in recent weeks. Each time a failure similar to the
one you describe happens, we've been able to isolate a backbone or
trunk line somewhere that has gone down, suddenly cutting some portion
of the Internet off from our server for an indefinite amount of time.
When these failures happen, we now run traceroute against a set of
common hosts to locate the line that is down.
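As a rough illustration (the host name here is made up), running
something like

    traceroute www.far-away-site.com

against a few well-known hosts on different backbones will show the
trace stalling at the last reachable router, which tells you roughly
where the dead line is.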
Our analysis of this problem, which is confirmed by gathered evidence
and by correspondence with SGI (we run IRIX web servers), is that the
incoming connection queue is being filled. Normally, TCP kernels
maintain a queue of connections which are in the process of being
negotiated. An entry in this queue is used when a browser initiates a
connection to the server machine, and is occupied until the connection
is fully negotiated and has been accepted by the server software.
When a trunk line goes down, many of these queue slots can be occupied
by connections which are in the process of negotiation. Because the
line between the server and the client machine has been severed, those
queue slots will remain occupied either until the line comes back up
or until two minutes elapse.
If the server is accepting between 20 and 30 new connections per
second, as home.mcom.com often does, it does not take long to fill
this queue. We had our queue size set to 128, which was sufficient for
most of December, but recent failures, combined with our increased
traffic, have been enough to exhaust even this size.
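To put rough numbers on it (these are illustrative, not measured): at
25 new connections per second, if even a tenth of those come from the
part of the net behind the dead line, that is two or three slots per
second getting stuck for up to two minutes each, and a 128-entry queue
is exhausted well within a minute.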
To change your queue size, you need to raise the kernel's maximum
connection-request queue size. Under BSD, this parameter is called
SOMAXCONN. Since you're using Solaris, you can use the ndd command to
set your maximum queue size higher. The parameter you want to
experiment with is tcp_conn_req_max.
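As a rough sketch (the exact parameter name and the ceiling the kernel
will accept vary between Solaris releases, so check yours), as root
you would run something like

    ndd /dev/tcp tcp_conn_req_max
    ndd -set /dev/tcp tcp_conn_req_max 128

the first to read the current value, the second to raise it. Note that
ndd settings do not survive a reboot, so you will also want the command
in one of your rc scripts.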
Finally, you have to change your server software to use a larger queue
size. The NCSA httpd, if I remember correctly, uses a queue size of
5. Search for the call to listen() and experiment with its second
argument.
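For what it's worth, here is a minimal sketch of what that call
usually looks like -- this is not NCSA httpd's actual code, just the
shape of it, with error checking left out:

    #include <string.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    /* Create the server's listening socket.  Only the second argument
     * to listen() matters here: it is the size of the pending-connection
     * queue the kernel keeps for this socket.  The kernel silently caps
     * it at its own maximum (SOMAXCONN on BSD-style systems,
     * tcp_conn_req_max on Solaris). */
    int make_listen_socket(int port)
    {
        struct sockaddr_in addr;
        int sd;

        sd = socket(AF_INET, SOCK_STREAM, 0);

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons((unsigned short) port);
        bind(sd, (struct sockaddr *) &addr, sizeof(addr));

        listen(sd, 128);    /* was typically 5; raise to match the kernel */
        return sd;
    }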
If anyone else has more data on this problem, we'd love to hear about
it. I hope this helps anyone who is having similar problems with their
servers. I heard a rumor that these problems are somehow being caused
by routing trouble involving Sprint and MCI. Anybody know more?
--Rob