
Is workingSetBytes of 0 really an indication of a terminated process? #1330

Closed
SergeyKanzhelev opened this issue Sep 15, 2023 · 21 comments · May be fixed by #1598
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@SergeyKanzhelev
Member

What happened:

We sometimes observe that metrics stop flowing for a Pod when one of its containers starts reporting 0 as working_set_bytes.

The container in question does nothing but sleep for hours and then check some files to do some work. We do not have direct access to the repro.

The check that discards Pod metrics in this case is here:

if containerMetric.CumulativeCpuUsed == 0 || containerMetric.MemoryUsage == 0 {
	klog.V(1).InfoS("Failed getting complete container metric", "containerName", containerName, "containerMetric", containerMetric)
	return nil
} else {

This code was introduced by #759 to filter out situations where the container has already terminated. I would question the reasoning for throwing the entire Pod's stats away when one container is terminated. How would this work for a Job with two non-restartable containers, one of which has already terminated?

I also found similar logic and this comment in the heapster repo: kubernetes-retired/heapster#1708 (comment), which says (as expected) that CPU alone is a bad indicator of a dead container. Both memory and CPU need to be checked, whereas PR #759 ignores the container if either is zero.
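
As a hedged illustration only (this is not the actual metrics-server code; the field names mirror the snippet above and containerLooksTerminated is a hypothetical helper), the heapster-style rule would require both counters to be zero, and the caller could then skip just that container instead of returning nil for the whole Pod:

// containerLooksTerminated is a hypothetical helper following the
// heapster comment above: flag a container as gone only when BOTH
// cumulative CPU and memory usage are zero, since a zero in just one
// of them does not necessarily mean the container has terminated.
func containerLooksTerminated(cumulativeCpuUsed, memoryUsage uint64) bool {
	return cumulativeCpuUsed == 0 && memoryUsage == 0
}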

What you expected to happen:

0 workingSetBytes should not result in missing Pod measurements.

Anything else we need to know?:

One question I still cannot confirm and reproduce with confidence is whether 0 workingSetBytes is a legitimate situation. I tried to create a small container under memory pressure, but wasn't able to drive workingSetBytes = usage - total_inactive_file to 0. It went quite low, but not to zero. If we have a repro for this, then the behavior in this repo is definitely incorrect.

If this is impossible, then the logic may be legitimate and there is a bug somewhere else. I am opening this issue for discussion in case community wisdom can help understand this better.

/sig node

Environment:

  • Kubernetes distribution (GKE, EKS, Kubeadm, the hard way, etc.):

  • Container Network Setup (flannel, calico, etc.):

  • Kubernetes version (use kubectl version):

  • Metrics Server manifest:

  • Kubelet config:

  • Metrics server logs:

  • Status of Metrics API: kubectl describe apiservice v1beta1.metrics.k8s.io

/kind bug

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 15, 2023
@dashpole

/assign @dgrisonnet
/triage accepted

Thanks @SergeyKanzhelev

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 21, 2023
@kwiesmueller
Member

I'd add the question of whether metrics-server should be the one deciding when a pod is terminated.
I would have assumed it determines that based on the kubelet container status, not based on metrics.
Not sure whether that's just not possible, or why it wasn't chosen back then.

@SergeyKanzhelev
Member Author

I think it is because the implementation is based on parsing Prometheus metrics. I'm not sure whether the pod status is available there.

We had a short discussion with @mrunalp. We still need to understand exactly how this happens. Once we know, and once we confirm that this is expected, we may need to expose additional metrics to help understand what that 0 means. But again, we will need to fully understand this first.

@pacoxu
Member

pacoxu commented Oct 4, 2023

/cc @serathius

@serathius
Contributor

Metrics server treats 0 in WSS as undefined behavior and skips reporting the pod to avoid autoscaling making an incorrect decision. Better not to make a decision than to make a bad one.

@raywainman

@serathius I think in principle that makes sense; however, in practice we have seen a case where a single container in a pod with workingSetBytes = 0 (a container that is not part of the main application and is really not relevant to the overall pod performance) prevents the entire pod from being reported and therefore prevents autoscaling.

I'm curious if any progress has been made on the investigation here or if we have a theory as to how this can happen?

Maybe a middle ground here is to just drop the single bad container's metrics?

@serathius
Contributor

serathius commented Oct 12, 2023

It's just the status quo; if someone can bring a nicely documented case showing that the kernel can report workingSetBytes=0 for a correctly running container, we can switch it. We could also consider leaving the decision up to the user.

On the other hand, what will the HPA do if there are 10 pods under it and one of them has an issue? Can the HPA make a decision based on the remaining 9 pods?

@k8s-triage-robot

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Oct 11, 2024
@raywainman

FWIW this was solved by migrating to cgroup v2 and we haven't seen the issue since.

I don't think there is anything else to do here.

/close

@k8s-ci-robot
Contributor

@raywainman: Closing this issue.

In response to this:

FWIW this was solved by migrating to cgroup v2 and we haven't seen the issue since.

I don't think there is anything else to do here.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@chuyee

chuyee commented Nov 13, 2024

I saw the problem happen with cgroup v2 today:

On the node

[root@ip-10-236-137-241 ~]# grep cgroup /proc/filesystems
nodev	cgroup
nodev	cgroup2

[root@ip-10-236-137-241 ~]# crictl -r unix:///run/containerd/containerd.sock statsp --id c878d0e300b0d -o json |jq '.stats[].linux.containers[] | select(.attributes.metadata.name=="logtail-exporter") |.memory'
{
  "timestamp": "1731489715104494015",
  "workingSetBytes": {
    "value": "0"
  },
  "availableBytes": {
    "value": "128000000"
  },
  "usageBytes": {
    "value": "3276800"
  },
  "rssBytes": {
    "value": "2703360"
  },
  "pageFaults": {
    "value": "19701"
  },
  "majorPageFaults": {
    "value": "0"
  }
}

kubectl to the pod

$ kc exec -it wms-pms-prod-fx-5f9887f59-9tnrx -n wms-pms -c logtail-exporter -- /bin/bash
bash-5.2# cat /sys/fs/cgroup/memory/memory.usage_in_bytes
4313088
bash-5.2# cat /sys/fs/cgroup/memory/memory.stat
cache 4866048
rss 3244032
rss_huge 0
shmem 0
mapped_file 0
dirty 0
writeback 0
swap 0
pgpgin 54945
pgpgout 54038
pgfault 19668
pgmajfault 0
inactive_anon 3244032
active_anon 0
inactive_file 4866048
active_file 0
unevictable 0
hierarchical_memory_limit 128000000
hierarchical_memsw_limit 128000000
total_cache 4866048
total_rss 3244032
total_rss_huge 0
total_shmem 0
total_mapped_file 0
total_dirty 0
total_writeback 0
total_swap 0
total_pgpgin 54945
total_pgpgout 54038
total_pgfault 19668
total_pgmajfault 0
total_inactive_anon 3244032
total_active_anon 0
total_inactive_file 4866048
total_active_file 0
total_unevictable 0

According to https://github.com/google/cadvisor/blob/master/container/libcontainer/handler.go#L842-L851,
Memory.WorkingSet = memory.usage_in_bytes - total_inactive_file = 4313088 - 4866048 < 0, so WorkingSet is set to 0 in containerd (as confirmed by crictl statsp above). This causes metrics-server to think the container is gone and to discard the pod's metrics. The consequence: kubectl top pod fails to list the pod, and the HPA cannot read metrics for autoscaling. The latter had a very bad impact.
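
For reference, a small self-contained Go sketch (my own reading of the linked handler.go logic, with the values hard-coded from the output above) of the clamping that produces the 0:

package main

import "fmt"

func main() {
	// Values taken from the crictl / memory.stat output above.
	usage := uint64(4313088)             // memory.usage_in_bytes
	totalInactiveFile := uint64(4866048) // total_inactive_file

	// cAdvisor subtracts the inactive file cache from usage and clamps
	// the working set at zero when the cache exceeds the reported usage.
	workingSet := uint64(0)
	if usage > totalInactiveFile {
		workingSet = usage - totalInactiveFile
	}
	fmt.Println(workingSet) // prints 0, which metrics-server then discards
}

So a mostly idle container whose inactive file cache exceeds its total memory usage ends up with a reported working set of 0 even though the process is alive.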

Could someone shed light on how to fix this issue?

@raywainman

Interesting, we noticed the issue with logrotate which I guess has similar functionality to logtail-exporter?

Did you notice anything else happening in the node itself, was it under significant memory pressure?

@chuyee

chuyee commented Nov 14, 2024

The node is heavily packed with CPU requests:

  Resource                   Requests            Limits
  --------                   --------            ------
  cpu                        87222m (91%)        34 (35%)
  memory                     171928534528 (44%)  186099040k (47%)

But usage is not high.

CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
26141m       27%    91410Mi         24%

I also have a node where logrotate has a similar problem. Both are very easy to reproduce.

@raywainman do you want to reopen this issue or should I create a new one?

@raywainman

Re-opening this issue seems totally reasonable. Do you mind sharing how to reproduce? That was something we really struggled with on our end.

@chuyee

chuyee commented Nov 14, 2024

@dgrisonnet please reopen this issue

@chuyee

chuyee commented Nov 14, 2024

Correction. I'm using cgroup v1. @raywainman do you see this problem only on v1 or v2 as well?

tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)

@raywainman

We only saw this with cgroup v1, and it seemed to go away when upgrading to cgroup v2.

@mvp

mvp commented Nov 18, 2024

We also observe this issue with cgroup v1. The container in question is very light (2 MB RSS) and mostly sleeping, but its working set is sometimes reported as 0.

Please reopen this issue, @dgrisonnet!

@raywainman

Any chance anyone could paste some instructions to reproduce this? Would be great to capture that here.

@chuyee

chuyee commented Nov 22, 2024

See if you can reproduce it with this. It sometimes triggers 36+ hours after deployment.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: wolfi
  labels:
    app: wolfi
spec:
  replicas: 1
  selector:
    matchLabels:
      app: wolfi
  template:
    metadata:
      labels:
        app: wolfi
    spec:
      containers:
      - image: artifactory.coupang.net/ghcr-remote/chuyee/logtail-exporter:20241106103241-amd64
        imagePullPolicy: IfNotPresent
        lifecycle:
          preStop:
            exec:
              command:
              - sleep
              - "30"
        name: logtail-exporter
        resources:
          limits:
            memory: 128M
          requests:
            cpu: 200m
            memory: 128M
