Add client side event spam filtering #47367

Merged: 1 commit, Sep 4, 2017

derekwaynecarr
Member

@derekwaynecarr derekwaynecarr commented Jun 12, 2017

What this PR does / why we need it:
Add client side event spam filtering to stop excessive traffic to api-server from internal cluster components.

this pr defines a per source+object event budget of 25 burst with refill of 1 every 5 minutes.
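
A budget of "25 burst with refill of 1 every 5 minutes" per source+object reads like a token bucket kept per key. The following is a minimal sketch of that idea only, not the code merged in this PR: the type names, the `allow` and `countAllowed` helpers, and the key string are all hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// bucket is a minimal token bucket for one key.
type bucket struct {
	tokens float64   // tokens currently available
	last   time.Time // last time tokens were credited
}

// spamFilter holds one bucket per key. All names here are hypothetical.
type spamFilter struct {
	burst   float64       // bucket capacity (25 in the PR description)
	refill  time.Duration // time needed to earn one token (5 minutes)
	buckets map[string]*bucket
}

func newSpamFilter(burst int, refill time.Duration) *spamFilter {
	return &spamFilter{burst: float64(burst), refill: refill, buckets: map[string]*bucket{}}
}

// allow reports whether an event for key may be sent at time now,
// consuming one token if so.
func (f *spamFilter) allow(key string, now time.Time) bool {
	b, ok := f.buckets[key]
	if !ok {
		b = &bucket{tokens: f.burst, last: now} // new keys start with a full budget
		f.buckets[key] = b
	}
	// Credit tokens earned since the last call, capped at the burst size.
	b.tokens += float64(now.Sub(b.last)) / float64(f.refill)
	if b.tokens > f.burst {
		b.tokens = f.burst
	}
	b.last = now
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

// countAllowed sends n events for key at the given offset (in seconds) from
// a fixed start time and returns how many pass the filter.
func countAllowed(f *spamFilter, key string, n int, offsetSeconds int) int {
	at := time.Unix(0, 0).Add(time.Duration(offsetSeconds) * time.Second)
	allowed := 0
	for i := 0; i < n; i++ {
		if f.allow(key, at) {
			allowed++
		}
	}
	return allowed
}

func main() {
	f := newSpamFilter(25, 5*time.Minute)
	key := "kubelet/default/pod-a" // hypothetical "source/object" key
	fmt.Println(countAllowed(f, key, 30, 0))   // 25: the burst is spent, 5 are dropped
	fmt.Println(countAllowed(f, key, 30, 300)) // 1: one token refilled after 5 minutes
}
```

Each key gets its own bucket, so one noisy source+object pair does not consume the budget of another.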

i tested this pr on the following scenarios:

Scenario 1: Node with 50 crash-looping pods

# create 50 crash-looping pods on a single node
$ kubectl run bad --image=busybox --replicas=50 --command -- derekisbad

Before:

  • POST events with peak of 1.7 per second, long-tail: 0.2 per second
  • PATCH events with peak of 5 per second, long-tail: 5 per second

After:

  • POST events with peak of 1.7 per second, long-tail: 0.2 per second
  • PATCH events with peak of 3.6 per second, long-tail: 0.2 per second

Observation:

Scenario 2: replication controller limited by quota

$ kubectl create quota my-quota --hard=pods=1
$ kubectl run nginx --image=nginx --replicas=50

Before:

  • POST events not relevant as aggregation worked well here.
  • PATCH events with peak and long-tail of 13.6 per second

After:

  • POST events not relevant as aggregation worked well here.
  • PATCH events with peak: .35 per second, and long-tail of 0

Which issue this PR fixes
fixes #47366

Special notes for your reviewer:
this was a significant problem in a kube 1.5 cluster we are running where events were co-located in a single etcd. it was normal for this cluster to have large numbers of unhealthy pods as well as denials by quota.

Release note:

add support for client-side spam filtering of events

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 12, 2017
@derekwaynecarr derekwaynecarr changed the title Add client side event spam filtering WIP: Add client side event spam filtering Jun 12, 2017
@derekwaynecarr derekwaynecarr added the do-not-merge DEPRECATED. Indicates that a PR should not merge. Label can only be manually applied/removed. label Jun 12, 2017
@derekwaynecarr derekwaynecarr added this to the v1.7 milestone Jun 12, 2017
@derekwaynecarr
Member Author

marking do-not-merge until i add a unit test.

while I think server-side spam filtering is important, this prevents our internal agents from abusing the apiserver. we have experienced a significant amount of abuse in a few clusters, both intentional (miners creating replica sets with huge numbers) and unintentional (users' pods that are in constant crashloop backoff). in either scenario, if an event keeps recurring, ttl doesn't help, and we need to reduce the frequency of traffic from our agents. this client-side spam detection does that.

/cc @eparis @sjenning @liggitt @smarterclayton

@derekwaynecarr
Member Author

this is a production stability problem for us right now, so marking 1.7 milestone.

@k8s-github-robot k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Jun 12, 2017
@smarterclayton
Contributor

Rationale:

  1. In production clusters we see large numbers (100s) of event sources that persist for multiple days, with only 5-second interruptions
  2. The difference between an event you've seen once, and 5 times, is significant. The difference between an event you've seen 6100 times and 6200 times is not.
  3. Uncontrolled pathological event spam is surprisingly common from infra components, so client side normalization is reasonable.
  4. Setting a maximum rate of recurrence is equivalent to pod backoff.

@@ -362,11 +471,11 @@ func NewEventCorrelator(clock clock.Clock) *EventCorrelator {

// EventCorrelate filters, aggregates, counts, and de-duplicates all incoming events
func (c *EventCorrelator) EventCorrelate(newEvent *v1.Event) (*EventCorrelateResult, error) {
	aggregateEvent, ckey := c.aggregator.EventAggregate(newEvent)
Contributor

How expensive is EventAggregate? I assume that's why it was lower before - maybe we should filter twice, instead of once?

Member Author

event aggregate is cheap. we had it lower before because i had not thought through spam detection well enough when it was first written. we need spam filtering after the event we plan to send to the server is aggregated so spam filtering works on aggregated event itself.

Contributor

@sjenning sjenning left a comment

looks good. just some readability recommendations.

if interval > maxInterval {
	// enough time has transpired, we create a new record
	record = spamRecord{}
}
Contributor

This is somewhat confusing. I'd move L138-146 inside the if block at L134 and only overwrite record if interval < maxInterval. That way we don't create a spamRecord on L123, lose it on L135, and make another new one on L145.

Contributor

Also, if the record is outside the interval, we should remove it from the cache. Unless the eventKey is the same for the new record we are adding, then it'll just be an overwrite. Looking at getEventKey(), it looks like the key would be the same for repeated records. Just need to check.
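
The two review comments above suggest one lookup path: reuse the record only if it is recent enough, otherwise start fresh, then write back under the same key so the stale entry is simply overwritten. A minimal sketch under assumed names follows; the `spamRecord` fields, `observe`, and `countsAt` are hypothetical and not the PR's code.

```go
package main

import (
	"fmt"
	"time"
)

// spamRecord is a hypothetical per-key record; the fields are assumptions.
type spamRecord struct {
	count    int
	lastSeen time.Time
}

// observe updates the record for key and returns it. A missing or stale
// record is replaced by a fresh one; writing back under the same key
// overwrites the old entry, so no separate delete is needed.
func observe(cache map[string]spamRecord, key string, now time.Time, maxInterval time.Duration) spamRecord {
	record, found := cache[key]
	if !found || now.Sub(record.lastSeen) > maxInterval {
		record = spamRecord{} // enough time has passed: start over
	}
	record.count++
	record.lastSeen = now
	cache[key] = record
	return record
}

// countsAt replays events for one key at the given offsets (in seconds)
// and returns the record count after each event.
func countsAt(offsetsSeconds []int, maxIntervalSeconds int) []int {
	cache := map[string]spamRecord{}
	start := time.Unix(0, 0)
	counts := make([]int, 0, len(offsetsSeconds))
	for _, s := range offsetsSeconds {
		now := start.Add(time.Duration(s) * time.Second)
		r := observe(cache, "kubelet/default/pod-a", now, time.Duration(maxIntervalSeconds)*time.Second)
		counts = append(counts, r.count)
	}
	return counts
}

func main() {
	// Three events in quick succession, then one after a 180s gap (> 120s).
	fmt.Println(countsAt([]int{0, 10, 20, 200}, 120)) // [1 2 3 1]
}
```

A plain map never evicts keys that stop recurring, so a real implementation would presumably need a bounded cache instead.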


maxSyncInterval := time.Duration(f.syncIntervalInSeconds) * time.Second
syncInterval := now.Time.Sub(record.lastSynced.Time)
if syncInterval > maxSyncInterval {
Contributor

invert the if condition and set filter = true below and you can remove L158. That way we aren't setting false, then true, then false again.

@derekwaynecarr derekwaynecarr changed the title WIP: Add client side event spam filtering Add client side event spam filtering Jun 13, 2017
@derekwaynecarr derekwaynecarr removed the do-not-merge DEPRECATED. Indicates that a PR should not merge. Label can only be manually applied/removed. label Jun 13, 2017
@derekwaynecarr
Member Author

derekwaynecarr commented Jun 13, 2017

@smarterclayton @sjenning -- updated per review comment.

the values chosen for spam filtering appear to work well with the pathological replication controller scenario. we may need to further tweak the interval when i look at some of the other worst offenders we encountered. we also NEED to stop a replication controller from going from 0-500 in every sync interval concurrently. while those requests do not cause a write when denied by quota, we need to stop making calls just to be told no.

either way, for the pathological scenario:

kubectl create quota my-quota --hard=pods=1
kubectl run nginx --image=nginx --replicas=1000
...wait ~5 minutes...
kubectl get events
1m         7m          4807      nginx-3181653891         ReplicaSet                            Warning   FailedCreate              replicaset-controller   Error creating: pods "nginx-3181653891-" is forbidden: exceeded quota: my-quota, requested: pods=1, used: pods=1, limited: pods=1

previously, this created 4807 PATCH requests to the server. with client side spam filtering, it creates 24:

$ kubectl get --raw /metrics | grep summary_count | grep events
...
apiserver_request_latencies_summary_count{resource="events",subresource="",verb="PATCH"} 24
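
To compare before and after from the metrics endpoint, the count can be read off the end of the sample line. This is a rough sketch only: `sampleValue` is a hypothetical helper, and it assumes the line carries no trailing timestamp.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// sampleValue returns the numeric value at the end of a Prometheus
// text-format sample line such as `name{labels} 24`.
// Assumption: the line has no trailing timestamp.
func sampleValue(line string) (float64, error) {
	line = strings.TrimSpace(line)
	i := strings.LastIndex(line, " ")
	if i < 0 {
		return 0, fmt.Errorf("no value in sample line %q", line)
	}
	return strconv.ParseFloat(line[i+1:], 64)
}

func main() {
	line := `apiserver_request_latencies_summary_count{resource="events",subresource="",verb="PATCH"} 24`
	after, err := sampleValue(line)
	if err != nil {
		panic(err)
	}
	const before = 4807.0 // event count reported by kubectl get events above
	fmt.Printf("PATCH requests: %.0f -> %.0f (about %.0fx fewer)\n", before, after, before/after)
}
```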

@eparis
Contributor

eparis commented Jun 13, 2017

Actual data in one 300 node cluster we see:

Event Type     Events/Sec
FailedSync     58.1
BackOff        50.5
Failedmount    7.5
FailedCreate   6.3
Unhealthy      3.4

But, over that same time period if I break FailedSync and BackOff down per metadata.name (instead of lumping all of them together) we see much less 'worst case'. (Notice this is per minute while above is per second)

Event Type   Events/Pod/Min
FailedSync   4.8
BackOff      4.8

So, if I understand correctly, this would never hit either of the top 2 contributors to events in this cluster.

@derekwaynecarr
Member Author

looking at more of the data that showed our event spam.

nodes are prone to induce event spam:

  • a pod that always fails its readiness probe (but not liveness probe) appeared to cause us 9k probe failure events per day. at the default 10s interval, this pr reduces the amount of PATCH event traffic for this failure event back to api server by a factor of 6.
  • a single pod reported a large number of "BackOff" events with reason "Back-off restarting failed docker container" 48k times over ~7d. the MaxContainerBackoff appears to be 300s. I think it's worth moving the maxInterval to 300s from 120s to ensure .
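
The two figures in the list above can be sanity-checked with simple arithmetic: a probe that fails on every 10s run yields 8640 events per day (roughly the 9k quoted), and 48k events over ~7 days averages one event every 12.6 seconds. The helper names below are hypothetical.

```go
package main

import "fmt"

const secondsPerDay = 24 * 60 * 60

// eventsPerDay returns how many events a check that fails on every run
// produces per day when it runs once every periodSeconds.
func eventsPerDay(periodSeconds int) int {
	return secondsPerDay / periodSeconds
}

// meanIntervalSeconds returns the average spacing between events when
// count events were observed over the given number of days.
func meanIntervalSeconds(count int, days float64) float64 {
	return days * secondsPerDay / float64(count)
}

func main() {
	fmt.Println(eventsPerDay(10))                       // 8640, i.e. roughly 9k per day
	fmt.Printf("%.1f\n", meanIntervalSeconds(48000, 7)) // 12.6
}
```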

@derekwaynecarr
Member Author

derekwaynecarr commented Jun 13, 2017

@eparis -- i need to parse through more of the FailedSync events to see what is possible with you tomorrow so i can try and have a good reproduction. i think the Unhealthy events if tied to readiness probes (like the one i was looking at) should benefit from this PR.

for example, i am seeing ~6x reduction in traffic with

apiVersion: v1
kind: Pod
metadata:
  name: goproxy
  labels:
    app: goproxy
spec:
  containers:
  - name: goproxy
    image: gcr.io/google_containers/goproxy:0.1
    ports:
    - containerPort: 8080
    readinessProbe:
      tcpSocket:
        port: 8900
      initialDelaySeconds: 5
      periodSeconds: 10
    livenessProbe:
      tcpSocket:
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 20

@derekwaynecarr
Member Author

in other news:

$ kubectl run bad --image=busybox --command -- derekisbad

gives you a number of terrifically horrible events reported back now with the CRI.

dissecting them more, it's hard to know what to do here when there are many similar bad pods. deleting the pods in crash loop back-off just causes more of them to come from the backing job. the FailedSync events are the trickiest.

the start (or failed start) of any pod can cause a lot of events.

Events:
  FirstSeen	LastSeen	Count	From			SubObjectPath		Type		Reason			Message
  ---------	--------	-----	----			-------------		--------	------			-------
  1m		1m		1	kubelet, 127.0.0.1				Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "default-token-1h1m4" 
  1m		1m		1	default-scheduler				Normal		Scheduled		Successfully assigned bad-2463393876-v8b31 to 127.0.0.1
  1m		1m		1	kubelet, 127.0.0.1	spec.containers{bad}	Normal		Created			Created container with id b95f26009dfbbb4a9889a4274f6fef35bf6cd3ef7b2bac4040584ccc0ac51358
  1m		1m		1	kubelet, 127.0.0.1				Warning		FailedSync		Error syncing pod, skipping: failed to "StartContainer" for "bad" with rpc error: code = 2 desc = failed to start container "b95f26009dfbbb4a9889a4274f6fef35bf6cd3ef7b2bac4040584ccc0ac51358": Error response from daemon: {"message":"invalid header field value \"oci runtime error: container_linux.go:247: starting container process caused \\\"exec: \\\\\\\"derekisbad\\\\\\\": executable file not found in $PATH\\\"\\n\""}: "Start Container Failed"

  1m	1m	1	kubelet, 127.0.0.1	spec.containers{bad}	Warning	Failed		Failed to start container with id b95f26009dfbbb4a9889a4274f6fef35bf6cd3ef7b2bac4040584ccc0ac51358 with error: rpc error: code = 2 desc = failed to start container "b95f26009dfbbb4a9889a4274f6fef35bf6cd3ef7b2bac4040584ccc0ac51358": Error response from daemon: {"message":"invalid header field value \"oci runtime error: container_linux.go:247: starting container process caused \\\"exec: \\\\\\\"derekisbad\\\\\\\": executable file not found in $PATH\\\"\\n\""}
  1m	1m	1	kubelet, 127.0.0.1	spec.containers{bad}	Warning	Failed		Failed to start container with id 882d555cf0460b1850265a4f0ee66152f908898329319a66a5ac18e3b8b9f302 with error: rpc error: code = 2 desc = failed to start container "882d555cf0460b1850265a4f0ee66152f908898329319a66a5ac18e3b8b9f302": Error response from daemon: {"message":"invalid header field value \"oci runtime error: container_linux.go:247: starting container process caused \\\"exec: \\\\\\\"derekisbad\\\\\\\": executable file not found in $PATH\\\"\\n\""}
  1m	1m	1	kubelet, 127.0.0.1	spec.containers{bad}	Normal	Created		Created container with id 882d555cf0460b1850265a4f0ee66152f908898329319a66a5ac18e3b8b9f302
  1m	1m	1	kubelet, 127.0.0.1				Warning	FailedSync	Error syncing pod, skipping: failed to "StartContainer" for "bad" with rpc error: code = 2 desc = failed to start container "882d555cf0460b1850265a4f0ee66152f908898329319a66a5ac18e3b8b9f302": Error response from daemon: {"message":"invalid header field value \"oci runtime error: container_linux.go:247: starting container process caused \\\"exec: \\\\\\\"derekisbad\\\\\\\": executable file not found in $PATH\\\"\\n\""}: "Start Container Failed"

  57s	57s	1	kubelet, 127.0.0.1	spec.containers{bad}	Normal	Created		Created container with id 30c23189223a5c3ac14b47006398b5098616f5798b7a0403786b7094c154337a
  57s	57s	1	kubelet, 127.0.0.1	spec.containers{bad}	Warning	Failed		Failed to start container with id 30c23189223a5c3ac14b47006398b5098616f5798b7a0403786b7094c154337a with error: rpc error: code = 2 desc = failed to start container "30c23189223a5c3ac14b47006398b5098616f5798b7a0403786b7094c154337a": Error response from daemon: {"message":"invalid header field value \"oci runtime error: container_linux.go:247: starting container process caused \\\"exec: \\\\\\\"derekisbad\\\\\\\": executable file not found in $PATH\\\"\\n\""}
  57s	57s	1	kubelet, 127.0.0.1				Warning	FailedSync	Error syncing pod, skipping: failed to "StartContainer" for "bad" with rpc error: code = 2 desc = failed to start container "30c23189223a5c3ac14b47006398b5098616f5798b7a0403786b7094c154337a": Error response from daemon: {"message":"invalid header field value \"oci runtime error: container_linux.go:247: starting container process caused \\\"exec: \\\\\\\"derekisbad\\\\\\\": executable file not found in $PATH\\\"\\n\""}: "Start Container Failed"

  42s	42s	1	kubelet, 127.0.0.1		Warning	FailedSync	Error syncing pod, skipping: failed to "StartContainer" for "bad" with CrashLoopBackOff: "Back-off 20s restarting failed container=bad pod=bad-2463393876-v8b31_default(1d5ee9b4-4fee-11e7-a69c-c85b76cda386)"

  1m	30s	4	kubelet, 127.0.0.1	spec.containers{bad}	Normal	Pulling		pulling image "busybox"
  1m	29s	4	kubelet, 127.0.0.1	spec.containers{bad}	Normal	Pulled		Successfully pulled image "busybox"
  29s	29s	1	kubelet, 127.0.0.1	spec.containers{bad}	Normal	Created		Created container with id 20a4498e1821e27f913f9b1685d79530735dab18150f0d731a9ce4117946374a
  29s	29s	1	kubelet, 127.0.0.1	spec.containers{bad}	Warning	Failed		Failed to start container with id 20a4498e1821e27f913f9b1685d79530735dab18150f0d731a9ce4117946374a with error: rpc error: code = 2 desc = failed to start container "20a4498e1821e27f913f9b1685d79530735dab18150f0d731a9ce4117946374a": Error response from daemon: {"message":"invalid header field value \"oci runtime error: container_linux.go:247: starting container process caused \\\"exec: \\\\\\\"derekisbad\\\\\\\": executable file not found in $PATH\\\"\\n\""}
  29s	29s	1	kubelet, 127.0.0.1				Warning	FailedSync	Error syncing pod, skipping: failed to "StartContainer" for "bad" with rpc error: code = 2 desc = failed to start container "20a4498e1821e27f913f9b1685d79530735dab18150f0d731a9ce4117946374a": Error response from daemon: {"message":"invalid header field value \"oci runtime error: container_linux.go:247: starting container process caused \\\"exec: \\\\\\\"derekisbad\\\\\\\": executable file not found in $PATH\\\"\\n\""}: "Start Container Failed"

  42s	2s	4	kubelet, 127.0.0.1	spec.containers{bad}	Warning	BackOff		Back-off restarting failed container
  28s	2s	3	kubelet, 127.0.0.1				Warning	FailedSync	Error syncing pod, skipping: failed to "StartContainer" for "bad" with CrashLoopBackOff: "Back-off 40s restarting failed container=bad pod=bad-2463393876-v8b31_default(1d5ee9b4-4fee-11e7-a69c-c85b76cda386)"

a number of those events just suck. also i am not sure what user will really care about the container id versus just the container name. and man, do those oci errors stink coming back from runc...

i am going to try and see if i can clean this up more. that said, depending on what is in your pod, it's normal to get a fair number of events on pod start. one way to handle this may be to get the kubelet itself to log fewer events if the pod has a large restart count.

will wait for more sample data from @eparis to compare as that cluster was also on kube 1.5 and may vary as well.

@dchen1107
Member

We have talked a couple of times about cleaning up those events, together with logging spam from the Kubelet, but never got around to it. Introducing a client-side event filter / aggregator before sending to the server sounds like a good plan to me.

On the other hand, we should re-evaluate the events the Kubelet exposes today. IMHO, we fail to export a lot of valuable events to users, for example, sys oom killing, etc.

@yujuhong
Contributor

> a number of those events just suck. also i am not sure what user will really care about the container id versus just the container name. and man, do those oci errors stink coming back from runc...

We do report the container ID in the ContainerStatus though, and it is useful for debugging in some cases. Maybe we can trim the ID to make it shorter?
I'm not sure what we can do for the oci errors other than doing some regex magic. It's ugly, though it does tell me where the error comes from (which probably is that of interest to the users).

@dchen1107
Member

re: #47367 (comment)

We just talked about this at sig-node. Can we push containerID to logging, not in events? Events are for the user, not necessarily for debuggers like us.

@yujuhong
Contributor

> We just talked about this at sig-node. Can we push containerID to logging, not in events? Events are for the user, not necessarily for debuggers like us.

Then why do we even report them in the ContainerStatus? :-)

@mml
Contributor

mml commented Aug 16, 2017

@derekwaynecarr Awesome, thanks.

@derekwaynecarr
Member Author

@mml -- it is non-trivial to plumb a kube feature gate at this level of the stack, as it seems like bad practice for client-go to reference a kube feature gate. I do not see a compelling reason for this to not be the default behavior, and it is consistent with conversations/comments discussed here (https://docs.google.com/document/d/13BeJlrEcJhSKgsHOHWmHdJGqXrjSm_o9XxOtwcN6yNg/edit)

@liggitt
Member

liggitt commented Aug 24, 2017

> it seems like bad practice for client-go to reference a kube feature gate.

agree

> I do not see a compelling reason for this to not be the default behavior, and it is consistent with conversations/comments discussed here

also agree. this is an evolution of a feature already present and intended to prevent event overload, that was not filtering sufficiently

@smarterclayton
Contributor

Given agreement and that this is a production issue for large clusters, I'm inclined to approve. Given a choice between write exhaustion and dropped events, I'd move to dropped events.

/approve

based on criteria and general consensus on the thread

@smarterclayton
Contributor

However, I would still like to see an lgtm from the reviewer

@k8s-github-robot k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 31, 2017
@mml
Contributor

mml commented Aug 31, 2017

My concern with this on by default is that there is no easy workaround if it really does create a problem, and such problems will be hard to find with test plans. AFAICT, we don't even expose knobs for the burst or refill rate.

I would really like us to prioritize making this more configurable or less necessary in 1.9. Can we please file followup bugs about that now?

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 31, 2017
@k8s-github-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekwaynecarr, mml, smarterclayton

Associated issue: 47366

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@smarterclayton
Contributor

Agree this needs a bit more of a knob. There is an issue for per client rate limiting server side that is being worked on #50925. As to soak, we've been running in production with this for a month now on 4 large clusters and have not yet observed lost events (although we do deserve to file an AAR on this that evaluates outgoing vs limited). This reduced event traffic to a negligible concern since the vast majority of events were redundant and repeating.

@dims
Member

dims commented Sep 4, 2017

/test all

@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

@dims
Member

dims commented Sep 4, 2017

/retest

@k8s-github-robot

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-ci-robot
Contributor

k8s-ci-robot commented Sep 4, 2017

@derekwaynecarr: The following test failed, say /retest to rerun them all:

Test name                      Commit    Details   Rerun command
pull-kubernetes-e2e-kops-aws   b62fa1d   link      /test pull-kubernetes-e2e-kops-aws

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-github-robot

Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit 870406b into kubernetes:master Sep 4, 2017
k8s-github-robot pushed a commit that referenced this pull request Oct 13, 2017
Automatic merge from submit-queue (batch tested with PRs 51840, 53542, 53857, 53831, 53702). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

kubelet sync pod throws more detailed events

**What this PR does / why we need it**:
If there are errors in the kubelet sync pod iteration, it is difficult to determine the problem.

This provides more specific events for errors that occur in the syncPod iteration to help perform problem isolation.

Fixes #53900

**Special notes for your reviewer**:
It is safer to dispatch more specific events now that we have an event budget per object enforced via #47367

**Release note**:
```release-note
kubelet provides more specific events when unable to sync pod
```
k8s-github-robot pushed a commit that referenced this pull request Dec 15, 2017
Automatic merge from submit-queue (batch tested with PRs 56308, 54304, 56364, 56388, 55853). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

Send events on certain service repair controller errors

**What this PR does / why we need it**:

This PR enables sending events when the api-server service IP and port allocator repair controllers find an error repairing a cluster ip or a port respectively.

**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #54303

**Special notes for your reviewer**:

In case of an error, events will be emitted [every 3 minutes](https://github.com/kubernetes/kubernetes/blob/master/pkg/master/controller.go#L93) for each failed Service. Even so, event spam protection has been merged (#47367) to mitigate the risk of excessive events.

**Release note**:

```release-note
api-server provides specific events when unable to repair a service cluster ip or node port
```
huww98 added a commit to huww98/kubernetes that referenced this pull request May 23, 2024
kubernetes#47367 introduced client side event spam filtering to reduce the API server/etcd pressure of processing repeated events.
However, it uses only Source and InvolvedObject as the key, so it can also drop important and unique events, which slows down diagnosis significantly.

kubernetes#103918 tries to resolve this by adding a new SpamKeyFunc to customize the key used. But this is still not ideal, because every user needs to dig into the code of client-go to figure out why their event does not appear. If the EventBroadcaster is initialized in a library, it may be even harder to set the SpamKeyFunc.

I propose to only drop PATCH events, which is the intent of the original proposal.
This patch deprecates SpamKeyFunc, and effectively fixes it to the return value of EventAggregate().

Users who don't set SpamKeyFunc should get strictly more events
than before, so this is not breaking for them. Other users are likely just adding Reason to the key. We already do this now.
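
The trade-off described in this commit message comes down to how the budget key is built. The sketch below is a simplified illustration with a hypothetical `event` type and key functions, not client-go's actual key construction: keyed on source and object only, two different reasons for the same pod share one budget, while adding Reason gives each its own.

```go
package main

import (
	"fmt"
	"strings"
)

// event is a simplified, hypothetical stand-in; a real event carries more fields.
type event struct {
	Source    string // reporting component, e.g. "kubelet"
	Namespace string
	Kind      string
	Name      string
	Reason    string
}

// keyBySourceAndObject ignores the reason, so every event about the same
// object from the same source draws on one shared budget.
func keyBySourceAndObject(e event) string {
	return strings.Join([]string{e.Source, e.Kind, e.Namespace, e.Name}, "/")
}

// keyWithReason gives each reason its own budget for the same object.
func keyWithReason(e event) string {
	return keyBySourceAndObject(e) + "/" + e.Reason
}

func main() {
	backoff := event{Source: "kubelet", Namespace: "default", Kind: "Pod", Name: "web-0", Reason: "BackOff"}
	unhealthy := event{Source: "kubelet", Namespace: "default", Kind: "Pod", Name: "web-0", Reason: "Unhealthy"}
	fmt.Println(keyBySourceAndObject(backoff) == keyBySourceAndObject(unhealthy)) // true: shared budget
	fmt.Println(keyWithReason(backoff) == keyWithReason(unhealthy))               // false: separate budgets
}
```

With the shared key, a flood of one reason can exhaust the budget and cause an unrelated, unique event about the same object to be dropped.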
huww98 added a commit to huww98/kubernetes that referenced this pull request May 26, 2024
PR kubernetes#47367 introduced client side event spam filtering to reduce the API
server/etcd pressure of processing repeated events.  However, it uses only
Source and InvolvedObject as the key, so it can also drop important and unique
events, which slows down diagnosis significantly.

PR kubernetes#103918 tries to resolve this by adding a new SpamKeyFunc to customize the key
used. But this is still not ideal, because every user needs to dig into the
code of client-go to figure out why their event does not appear. If the
EventBroadcaster is initialized in a library, it may be even harder to set the
SpamKeyFunc.

I propose to only drop PATCH events by default, which is the intent of the
original proposal.  Users who don't set SpamKeyFunc should get
strictly more events.
huww98 added a commit to huww98/kubernetes that referenced this pull request Jun 1, 2024
huww98 added a commit to huww98/kubernetes that referenced this pull request Aug 29, 2024
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

event spam causes excessive apiserver traffic and frequent snapshots