We've seen quite a few incidents where a newly introduced feature unnecessarily caused high CPU usage in cadvisor (e.g., disk and network stats). The process of fixing these incidents usually involves:
1. a user/developer reports high CPU usage from the kubelet
2. we ask the user to generate a CPU profile
3. identify the issue
4. fix the issue through a cadvisor PR
5. update the dependency in kubernetes
To gate these changes earlier, we should run some tests to detect performance regressions. Preferably these tests would run as part of the per-PR build. If the tests are deemed too long, a separate builder may be needed; the latter would at least protect downstream projects such as kubernetes. I understand that not all regressions can be detected in tests, but they would at least catch the trivial cases.
Some simple metrics that come to mind (without knowing too much about cadvisor internals) are housekeeping time and the number of goroutines. It should be easy to read a summary of these metrics from Prometheus.
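A regression check along those lines could scrape cadvisor's `/metrics` endpoint and compare a few samples against baseline thresholds. A minimal sketch in Go, parsing the Prometheus text exposition format with the stdlib only — the housekeeping metric name below is illustrative, not necessarily the exact name cadvisor exports:

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sample stands in for a scrape of cadvisor's /metrics endpoint.
const sample = `
# HELP go_goroutines Number of goroutines that currently exist.
go_goroutines 85
container_housekeeping_duration_seconds 0.012
`

// metricValue returns the value of the first sample whose name matches.
func metricValue(exposition, name string) (float64, bool) {
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blanks and HELP/TYPE comments
		}
		fields := strings.Fields(line)
		if len(fields) != 2 || fields[0] != name {
			continue
		}
		v, err := strconv.ParseFloat(fields[1], 64)
		if err != nil {
			return 0, false
		}
		return v, true
	}
	return 0, false
}

func main() {
	// Hypothetical baseline: fail the check if goroutines exceed it.
	const maxGoroutines = 200.0
	if v, ok := metricValue(sample, "go_goroutines"); ok && v > maxGoroutines {
		fmt.Printf("regression: %v goroutines > %v baseline\n", v, maxGoroutines)
		return
	}
	fmt.Println("within baseline")
}
```

In a real builder, `sample` would be replaced by an HTTP GET of the endpoint, and the thresholds would come from a checked-in baseline file.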
+1. In addition to the metrics you suggested, we should track CPU and memory usage. These metrics are exposed by cadvisor already.
The existing test framework might not be ideal for this purpose. We might have to soak the binary for a few minutes to identify leaks and other issues.
Instead of running against every PR, we can run against HEAD periodically.
On Thu, Oct 29, 2015 at 4:54 PM, Jimmi Dyson [email protected] wrote:
👍 This is likely to highlight not just regressions but a number of performance improvements too.
@vishh @dchen1107 @jimmidyson, thoughts?