Add cgroup observer to prevent infinite fanotify growth #443

Merged
bobrik merged 1 commit into cloudflare:master from bobrik:ivan/cgroup-observer on Jun 28, 2024

bobrik commented Jun 28, 2024

With fanotify we see every cgroup as it appears. Even if a cgroup has no metrics associated with it and quickly goes away, we keep its mapping in memory. This mostly works fine until there is an endless churn of cgroups, at which point memory grows without bound.

This commit adds a garbage-collecting cache that removes old entries that have not been requested in a long time.

bobrik commented Jun 28, 2024

Heap profile from a problematic instance:

(image: heap profile)

Looking with a debugger at another instance:

$ sudo dlv attach $(pidof prometheus-ebpf-exporter)
(dlv) b github.com/cloudflare/ebpf_exporter/v2/cgroup.(*fanotifyMonitor).Resolve
Breakpoint 1 set at 0x851b6e for github.com/cloudflare/ebpf_exporter/v2/cgroup.(*fanotifyMonitor).Resolve() /cfsetup_build/prometheus-ebpf-exporter/tmp/build/cloudflare-ebpf_exporter-9a207bb/cgroup/fanotify.go:185
(dlv) c
(dlv) bt
 0  0x0000000000851b6e in github.com/cloudflare/ebpf_exporter/v2/cgroup.(*fanotifyMonitor).Resolve
    at /cfsetup_build/prometheus-ebpf-exporter/tmp/build/cloudflare-ebpf_exporter-9a207bb/cgroup/fanotify.go:185
 1  0x000000000085c899 in github.com/cloudflare/ebpf_exporter/v2/cgroup.(*Monitor).Resolve
    at /cfsetup_build/prometheus-ebpf-exporter/tmp/build/cloudflare-ebpf_exporter-9a207bb/cgroup/monitor.go:35
 2  0x000000000085c899 in github.com/cloudflare/ebpf_exporter/v2/decoder.(*CGroup).Decode
    at /cfsetup_build/prometheus-ebpf-exporter/tmp/build/cloudflare-ebpf_exporter-9a207bb/decoder/cgroup.go:33
 3  0x000000000085d1e7 in github.com/cloudflare/ebpf_exporter/v2/decoder.(*Set).decode
    at /cfsetup_build/prometheus-ebpf-exporter/tmp/build/cloudflare-ebpf_exporter-9a207bb/decoder/decoder.go:75
 4  0x000000000085ddde in github.com/cloudflare/ebpf_exporter/v2/decoder.(*Set).decodeLabels
    at /cfsetup_build/prometheus-ebpf-exporter/tmp/build/cloudflare-ebpf_exporter-9a207bb/decoder/decoder.go:155
 5  0x000000000085d5ca in github.com/cloudflare/ebpf_exporter/v2/decoder.(*Set).DecodeLabelsForMetrics
    at /cfsetup_build/prometheus-ebpf-exporter/tmp/build/cloudflare-ebpf_exporter-9a207bb/decoder/decoder.go:108
 6  0x0000000000aa18ca in github.com/cloudflare/ebpf_exporter/v2/exporter.(*Exporter).mapValues
    at /cfsetup_build/prometheus-ebpf-exporter/tmp/build/cloudflare-ebpf_exporter-9a207bb/exporter/exporter.go:581
 7  0x0000000000aa043c in github.com/cloudflare/ebpf_exporter/v2/exporter.(*Exporter).collectCounters
    at /cfsetup_build/prometheus-ebpf-exporter/tmp/build/cloudflare-ebpf_exporter-9a207bb/exporter/exporter.go:467
 8  0x0000000000a9fb3e in github.com/cloudflare/ebpf_exporter/v2/exporter.(*Exporter).Collect
    at /cfsetup_build/prometheus-ebpf-exporter/tmp/build/cloudflare-ebpf_exporter-9a207bb/exporter/exporter.go:455
 9  0x0000000000921c65 in github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1
    at /home/builder/go/pkg/mod/github.com/prometheus/client_golang@v1.19.1/prometheus/registry.go:455
10  0x000000000047a781 in runtime.goexit
    at /usr/local/go/src/runtime/asm_amd64.s:1695
(dlv) print len(m.mapping)
2276851

There are definitely fewer than 2 million cgroups on this instance:

$ find /sys/fs/cgroup -type d | wc -l
665

bobrik force-pushed the ivan/cgroup-observer branch from 18d9357 to d34cec7 on June 28, 2024 21:43
bobrik force-pushed the ivan/cgroup-observer branch from d34cec7 to 9e05515 on June 28, 2024 22:25
bobrik merged commit f5dff38 into cloudflare:master on Jun 28, 2024
bobrik deleted the ivan/cgroup-observer branch on June 28, 2024 22:30