Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

try to fix e2e test failure #1119

Merged
merged 1 commit into from
Jan 31, 2023

Conversation

yangjunmyfm192085
Copy link
Contributor

What this PR does / why we need it:

The pull-metrics-server-test-e2e and pull-metrics-server-test-e2e-ha tests have failed frequently recently because the test case returns accurate CPU metric fails.
The most likely reason is that when the pod is just successfully deployed, the metrics obtained are inaccurate. We need to wait for one scrape cycle

https://prow.k8s.io/job-history/gs/kubernetes-jenkins/pr-logs/directory/pull-metrics-server-test-e2e
https://prow.k8s.io/job-history/gs/kubernetes-jenkins/pr-logs/directory/pull-metrics-server-test-e2e-ha

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Nov 18, 2022
@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Nov 18, 2022
@yangjunmyfm192085
Copy link
Contributor Author

/retest

@yangjunmyfm192085
Copy link
Contributor Author

/test pull-metrics-server-test-e2e

@yangjunmyfm192085
Copy link
Contributor Author

/test pull-metrics-server-test-e2e-ha

@yangjunmyfm192085
Copy link
Contributor Author

/test pull-metrics-server-test-e2e

@yangjunmyfm192085
Copy link
Contributor Author

/test pull-metrics-server-test-e2e-ha

1 similar comment
@yangjunmyfm192085
Copy link
Contributor Author

/test pull-metrics-server-test-e2e-ha

@yangjunmyfm192085
Copy link
Contributor Author

/cc @serathius , this pr try to fix the e2e test failure

@serathius
Copy link
Contributor

The most likely reason is that when the pod is just successfully deployed, the metrics obtained are inaccurate. We need to wait for one scrape cycle.

Before changing test, please ensure that this is not a an issue with metrics-server.

@yangjunmyfm192085
Copy link
Contributor Author

The most likely reason is that when the pod is just successfully deployed, the metrics obtained are inaccurate. We need to wait for one scrape cycle.

Before changing test, please ensure that this is not a an issue with metrics-server.

ok, when creating a pod, let me do more tests and get data verification directly from the metrics/resource and stats/summary endpoints of kubelet

@yangjunmyfm192085
Copy link
Contributor Author

/hold
Wait for confirmation that it is not a issue with metrics-server

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 18, 2022
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Nov 20, 2022
@yangjunmyfm192085
Copy link
Contributor Author

/test pull-metrics-server-test-e2e

2 similar comments
@yangjunmyfm192085
Copy link
Contributor Author

/test pull-metrics-server-test-e2e

@yangjunmyfm192085
Copy link
Contributor Author

/test pull-metrics-server-test-e2e

@yangjunmyfm192085
Copy link
Contributor Author

/test pull-metrics-server-test-e2e-ha

@yangjunmyfm192085
Copy link
Contributor Author

/test pull-metrics-server-test-e2e-helm

@yangjunmyfm192085
Copy link
Contributor Author

yangjunmyfm192085 commented Nov 21, 2022

/test pull-metrics-server-test-e2e
/test pull-metrics-server-test-e2e-ha

@yangjunmyfm192085
Copy link
Contributor Author

/test pull-metrics-server-test-e2e-helm

@yangjunmyfm192085
Copy link
Contributor Author

/test pull-metrics-server-test-e2e
/test pull-metrics-server-test-e2e-ha
/test pull-metrics-server-test-e2e-helm

3 similar comments
@yangjunmyfm192085
Copy link
Contributor Author

/test pull-metrics-server-test-e2e
/test pull-metrics-server-test-e2e-ha
/test pull-metrics-server-test-e2e-helm

@yangjunmyfm192085
Copy link
Contributor Author

/test pull-metrics-server-test-e2e
/test pull-metrics-server-test-e2e-ha
/test pull-metrics-server-test-e2e-helm

@yangjunmyfm192085
Copy link
Contributor Author

/test pull-metrics-server-test-e2e
/test pull-metrics-server-test-e2e-ha
/test pull-metrics-server-test-e2e-helm

@yangjunmyfm192085
Copy link
Contributor Author

@dgrisonnet any ideas?
To be sure, this is not a bug of metrics-server

@dgrisonnet
Copy link
Member

What I wanted to check in #1119 (comment) was if kubelet (cadvisor under the hood) was the culprit. To do so, we can compare the data we get from kubelet with data from either crictl stats or linux top.

Another option that might be better would be to grab CPU profiles from the cpu-consumer container, but since I don't expect this program to have pprof endpoints it might be a bit harder. We could for example get the profile by installing the parca-agent: https://www.parca.dev/docs/parca-agent and running it in CI, but that's quite a bit of work.

@yangjunmyfm192085
Copy link
Contributor Author

What I wanted to check in #1119 (comment) was if kubelet (cadvisor under the hood) was the culprit. To do so, we can compare the data we get from kubelet with data from either crictl stats or linux top.

Another option that might be better would be to grab CPU profiles from the cpu-consumer container, but since I don't expect this program to have pprof endpoints it might be a bit harder. We could for example get the profile by installing the parca-agent: https://www.parca.dev/docs/parca-agent and running it in CI, but that's quite a bit of work.

I will continue to analyze it.

@dgrisonnet
Copy link
Member

Thanks! Let me know if you need any help, this can be a little tricky

@yangjunmyfm192085
Copy link
Contributor Author

Thanks! Let me know if you need any help, this can be a little tricky

It is indeed a bit tricky. Using para, I still don't know how to export the statistics. I might need to look into it

@dgrisonnet
Copy link
Member

I am not sure if collecting raw data is supported yet. As far as I can tell from the doc, they only allow sending the profiles to a compatible server. Let me check if I can find something else we could use.

@dgrisonnet
Copy link
Member

We might be able to use kubectl-flame plugin: https://github.com/yahoo/kubectl-flame to capture the flame graph and then upload it in the artifacts.

@yangjunmyfm192085
Copy link
Contributor Author

OK, Let me do some research about it

@yangjunmyfm192085 yangjunmyfm192085 force-pushed the fixe2etest branch 2 times, most recently from e92b499 to 6041de6 Compare December 27, 2022 07:50
@yangjunmyfm192085
Copy link
Contributor Author

/cc @dgrisonnet ,I try to use kubectl-flame plugin to get cpu-consumer's process cpu statistics, but can't get it normally. The error message is similar to

<!-- NOTES: -->
<text x="600.00" y="24" >ERROR: No valid input provided to flamegraph.pl.</text>
</svg>
  , stderr: ERROR: No stack counts found

@dgrisonnet
Copy link
Member

OK, when I have some times on my hands, I'll try to have a look, but I would want us to further investigate where the issue is coming from

@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 27, 2023
@yangjunmyfm192085
Copy link
Contributor Author

Hi, @dgrisonnet As we discussed above, this is not a an issue with metrics-server.
Could we merge this pr first, and I will open an issue in the kubernetes/kubernetes repo to track this issue?
Because the ci e2e tests is very unstable recently, which affects the normal pr e2e tests

@yangjunmyfm192085
Copy link
Contributor Author

/assign @dgrisonnet

@yangjunmyfm192085
Copy link
Contributor Author

/cc @serathius Could we consider merging this PR?
e2e tests fail frequently recently

test/e2e_test.go Outdated
@@ -94,6 +94,7 @@ var _ = Describe("MetricsServer", func() {
if err != nil {
panic(err)
}
time.Sleep(15 * time.Second)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please comment why this sleep is needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, the effective key point is in

Limits: map[corev1.ResourceName]resource.Quantity{
corev1.ResourceCPU: mustQuantity("100m"),
},

I will roll back this modification

@@ -535,6 +536,9 @@ func consumeCPU(client clientset.Interface, podName string) error {
Requests: map[corev1.ResourceName]resource.Quantity{
corev1.ResourceCPU: mustQuantity("100m"),
},
Limits: map[corev1.ResourceName]resource.Quantity{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you leave a TODO to investigate and remove limit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@yangjunmyfm192085
Copy link
Contributor Author

/cc @serathius I have updated.

@serathius
Copy link
Contributor

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 31, 2023
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: serathius, yangjunmyfm192085

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 31, 2023
@k8s-ci-robot k8s-ci-robot merged commit 51e8aa8 into kubernetes-sigs:master Jan 31, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants