Investigating `rsyslog` issues

rsyslog describes itself as “the rocket-fast system for log processing”. It runs on all of our VMs, where it receives log messages, performs some processing, and forwards the messages to our logging system (currently logit).

We have occasionally suspected rsyslog (or other parts of our logging pipeline) of causing reliability issues. This documentation is intended to help operators diagnose issues with rsyslog more quickly.

Suspected memory usage issues

On 2018-07-31 the production router virtual machines exhausted their available memory and swap. Initial investigation suggested that the rsyslogd process on these machines was using far more than the expected amount of memory.

In #159559834 we investigated how rsyslog behaves when it can’t ship logs to its destination, but we weren’t able to reproduce the issue. The investigation is documented extensively on that ticket.

Investigating memory usage issues

If rsyslog’s memory usage is suspected of causing problems in future, there are a few things you should look into:

In Kibana, search for @source.component: "rsyslogd-pstats". The results should show whether rsyslog’s queues are full or whether it is dropping messages. (See #159559834 for more details.)

Use the bosh-cli to ssh onto a router, then:

# May show rsyslogd-pstats messages that didn't make it to Kibana
# (the pattern matches both the rsyslogd-pstats and rsyslog-pstats spellings)
tail -20000 /var/log/syslog | grep -E 'rsyslogd?-pstats'

# How much CPU / memory are the rsyslog processes using?
ps uax | awk '/rsyslo[g]/ || NR==1'

# How much overall disk usage is there?
df -h | grep ^/

# How much overall memory usage is there?
# (newer versions of free no longer accept -o; plain -m works on old and new)
free -m
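When there are many pstats lines, it can help to filter for the ones that report a full queue or dropped messages. The following is a minimal sketch, not part of the original investigation. The function name `pstats_problems` is hypothetical, and it assumes the stats are logged in rsyslog's legacy `key=value` line format with `full=` and `discarded.full=` counters. If the stats on the VM are emitted as JSON, the pattern will need adjusting.

```shell
# Print pstats lines whose "full" or "discarded.full" counter is non-zero.
# Assumption: stats lines use the legacy key=value format, e.g.
#   ... rsyslogd-pstats: main Q: size=5 enqueued=100 full=2 discarded.full=0 ...
pstats_problems() {
  awk '/rsyslogd?-pstats/ {
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^(full|discarded\.full)=[0-9]+$/) {
        split($i, kv, "=")
        # Print the whole line once, as soon as one counter is above zero
        if (kv[2] + 0 > 0) { print; next }
      }
    }
  }'
}

# Usage on a router VM:
#   tail -20000 /var/log/syslog | pstats_problems
```

No output means none of the sampled stats lines reported a full queue or discarded messages.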

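It may also be useful to capture the traffic that rsyslog sends to logit, for example to investigate possible packet loss. The following command captures 120 seconds of traffic on eth0 to and from the logit host named in the rsyslog forwarding rules, writes it to logit-traffic.pcap, and then exits: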

sudo tcpdump host "$(grep -o '[^@]*logit.io' /etc/rsyslog.d/35-syslog-release-forwarding-rules.conf)" -G 120 -W 1 -i eth0 -w logit-traffic.pcap
