Troubleshooting erratic network performance

Sat Jun 6 00:36:04 UTC 2009

So it had to happen sometime. Friday, 4:30pm, developers go live with a 
new release. Server performs just fine until someone upgrades something 
and damn, I'm stuck in the server room on a gorgeous Friday night.

Hardware:
1) Sun Fire X4600, 16-core Opteron w/ 32 GB RAM. A beast.
2) Quad Intel 82546EB Gigabit ports
3) Old and new ethernet cables
4) 4 other identical machines that haven't shown any signs of trouble.

OS:
1) 2.6.29-gentoo-r5 kernel w/ e1000 ethernet driver loaded as a module.
2) 2.6.25-gentoo-r7 kernel w/ e1000 ethernet driver loaded as a module.

Symptoms:
1) At first it looked like apache was having a problem with mod_rails.
Disabled that, nope, still erratic timeouts on web pages. Disabled
mod_proxy_ajp, still odd slowdowns. MaxClients is fine, etc. Server had
60 days of uptime, during which time it performed admirably. The problem
seems limited to apache, but then again there isn't much other network
traffic besides a bit of mail, so it's kind of hard to diagnose without
installing a whole other webserver.

2) ab run from another host on gigabit link behind the same switch comes 
in at a paltry 1.93 pages/second. Usually 50-100 is to be expected 
depending on the vhost being served.

3) vmstat looks absolutely normal, like it has for the last 60 days. No
blocking processes, cpu almost 100% idle, i/o is negligible.

4) smartctl shows all 4 SAS drives in the RAID5 array are healthy, as
does /proc/mdstat. No disk problems that I can see.

5) Outgoing network throughput is as expected: fetching a full kernel
from kernel.org takes seconds.

6) netstat shows about 300 connections at any given time, no large
fluctuations at all.

7) The only odd looking graph (using munin) is the fork rate. I think
the spike I see of 150 forks/second is apache starting up after
rebooting though. Probably a red herring.

8) tcpdump sometimes shows dropped packets while the problem is
occurring. Only sometimes. iptables looks fine, and ifconfig doesn't
show any dropped packets. I'm not too familiar with tcpdump, but I made
sure to grab some packets so I can pore over them and try to glean
something. *Any tips as to what to look for would be appreciated.*
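
Since ifconfig only gives a snapshot, one way to double-check the
interface counters is to sample /proc/net/dev twice and see whether any
error or drop counter grew in between. This is a minimal sketch, not
something I ran on the box in question; the function names
(parse_net_dev, trouble_deltas, watch) and the 10 second interval are
my own choices for illustration.

```python
import time

# Column order used by /proc/net/dev on Linux (8 receive, 8 transmit).
RX_FIELDS = ["bytes", "packets", "errs", "drop",
             "fifo", "frame", "compressed", "multicast"]
TX_FIELDS = ["bytes", "packets", "errs", "drop",
             "fifo", "colls", "carrier", "compressed"]

# Counters that should stay flat on a healthy link.
TROUBLE = ["rx_errs", "rx_drop", "rx_fifo", "rx_frame",
           "tx_errs", "tx_drop", "tx_fifo", "tx_colls", "tx_carrier"]


def parse_net_dev(text):
    """Parse /proc/net/dev-style text into {iface: {counter: int}}."""
    stats = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        iface, _, data = line.rpartition(":")
        values = [int(v) for v in data.split()]
        if len(values) < 16:
            continue  # unexpected layout, skip rather than guess
        counters = {"rx_" + n: v for n, v in zip(RX_FIELDS, values[:8])}
        counters.update(
            {"tx_" + n: v for n, v in zip(TX_FIELDS, values[8:16])})
        stats[iface.strip()] = counters
    return stats


def trouble_deltas(before, after):
    """Return {iface: {counter: increase}} for trouble counters that grew."""
    grown = {}
    for iface, now in after.items():
        then = before.get(iface)
        if then is None:
            continue  # interface appeared between samples
        diff = {k: now[k] - then[k] for k in TROUBLE if now[k] > then[k]}
        if diff:
            grown[iface] = diff
    return grown


def watch(path="/proc/net/dev", interval=10):
    """Sample the counters twice, `interval` seconds apart, and print growth."""
    with open(path) as f:
        before = parse_net_dev(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_net_dev(f.read())
    for iface, diff in sorted(trouble_deltas(before, after).items()):
        print(iface, diff)
```

Calling watch() while an ab run is in progress should print nothing on
a clean link; any interface that does show up is worth a closer look
(cable, switch port, duplex settings).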

I haven't looked at sysctl settings at all. Things have been fine up
until now, and the other admin rebooted, so I'd expect that any odd
problem somewhere in the bowels of the system's tcp stack would have
gone away.
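
With 4 identical machines that are behaving, one cheap check would be
to dump a handful of network sysctls on a good host and on the bad one
and compare them. A rough sketch follows; the list of settings is only
my guess at what might matter, and read_settings / diff_settings are
names made up for this example.

```python
import os

# Assumed shortlist of settings to compare; extend as needed.
SETTINGS = [
    "net/ipv4/tcp_window_scaling",
    "net/ipv4/tcp_sack",
    "net/ipv4/tcp_timestamps",
    "net/ipv4/tcp_rmem",
    "net/ipv4/tcp_wmem",
    "net/core/rmem_max",
    "net/core/wmem_max",
]


def read_settings(root="/proc/sys", names=SETTINGS):
    """Read each setting under `root`; unreadable entries map to None."""
    values = {}
    for name in names:
        try:
            with open(os.path.join(root, name)) as f:
                # Collapse tabs/newlines so values compare cleanly.
                values[name] = " ".join(f.read().split())
        except OSError:
            values[name] = None
    return values


def diff_settings(good, bad):
    """Return {name: (good_value, bad_value)} for settings that differ."""
    names = sorted(set(good) | set(bad))
    return {n: (good.get(n), bad.get(n))
            for n in names if good.get(n) != bad.get(n)}
```

Run read_settings() on each host, copy one result over (printing it as
a dict literal is enough), and feed both to diff_settings(). An empty
result means these particular settings match, nothing more.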

Any thoughts or suggestions of where to look next? I consider myself a 
capable admin for basic stuff like setting up apache, databases etc., 
but this problem has me completely baffled. I'd swap the disks to
another chassis, but some may recall a problem I had with the same
servers a month ago, where after switching IP addresses and MAC
addresses it took 6 hours or so for the switch to realize what had
happened, so that's probably a no go.

Thanks! Jamon