I recently ran into an issue where servers would randomly hang or slow down significantly. I isolated the problem to activities that have to log something to syslog before they complete, such as logging in via ssh, which pointed to a syslog issue. In this particular case, all the hosts were logging to a pair of centralized syslog servers running rsyslogd.
The issue affected hosts logging via both TCP and UDP. On the sending side, netstat showed a very large Send-Q value on connections to the rsyslog servers. Both syslog servers, in turn, showed large Recv-Q values for all the incoming traffic. Additionally, netstat -su showed a large number of UDP errors and receive buffer problems.
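If you want to watch those UDP counters over time rather than eyeballing netstat, here is a minimal Python sketch. It assumes a Linux host where /proc/net/snmp carries a "Udp:" header line followed by a "Udp:" value line; as far as I know these are the same counters netstat -su summarizes. The function name parse_udp_counters is made up for illustration.

```python
def parse_udp_counters(snmp_text):
    """Return the UDP counters from /proc/net/snmp-style text as a dict.

    Assumes the usual layout: one 'Udp:' line with field names,
    followed by one 'Udp:' line with the matching integer values.
    """
    rows = [line.split() for line in snmp_text.splitlines()
            if line.startswith("Udp:")]
    if len(rows) < 2:
        return {}  # no UDP section found
    names, values = rows[0][1:], rows[1][1:]
    return dict(zip(names, (int(v) for v in values)))
```

On a syslog server you would feed it the contents of /proc/net/snmp (for example open("/proc/net/snmp").read()) and keep an eye on InErrors and RcvbufErrors. If those climb steadily while Recv-Q stays full, the daemon is not draining its socket fast enough.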
It turned out the culprit was the UDP hosts, combined with the fact that rsyslogd on the syslog servers was configured to write entries to a path that included the variable %HOSTNAME%. That forced the servers to do a hostname lookup from the IP for EVERY incoming syslog packet from every UDP-based host. On a large network, we’re talking thousands of lookups per second, and that had started to overwhelm the DNS server they were pointed at. Switching to a local DNS cache resolved the issue. You could of course also use /etc/hosts for even higher efficiency, but then you’d have to maintain it, adding or updating entries any time a syslog source is added or changed.
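To get a feel for why a cache helps so much, here is a rough Python sketch of the idea. It is not how rsyslogd or any particular DNS cache works internally; it just wraps a reverse lookup in an in-memory cache so that repeated packets from the same IP only hit the resolver once. The name make_cached_resolver and the maxsize default are invented for this example, and unlike a real DNS cache it ignores TTLs.

```python
import socket
from functools import lru_cache


def make_cached_resolver(lookup=socket.gethostbyaddr, maxsize=4096):
    """Build a function mapping an IP string to a hostname, with caching.

    `lookup` must behave like socket.gethostbyaddr: take an IP string and
    return a tuple whose first element is the hostname.
    """
    @lru_cache(maxsize=maxsize)
    def resolve(ip):
        try:
            return lookup(ip)[0]
        except OSError:
            # Lookup failed (no PTR record, resolver down): fall back to the IP.
            return ip

    return resolve
```

With this in place, calling the returned function a thousand times for the same source address triggers a single real lookup, and resolve.cache_info() reports the hit and miss counts. A local caching resolver gives you the same effect for every process on the box, which is why it took the pressure off the upstream DNS server.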