Elasticsearch_cluster_health_should_not_have_been_red_during_mutation_process (revisited) #5795
Comments
I have reached out to the Elasticsearch team to figure out if we can improve something in the node shutdown API. What I still need to understand is why our own check
So, regarding the check for shard activity, it looks like the cluster health API request timed out:
But the operator seems to have continued. I would have expected the
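For illustration, here is a minimal Go sketch of the behaviour being discussed. It is not the operator's actual code; the endpoint parameters, timeouts and function names are assumptions. The point is that a timed_out: true cluster health response (or a transport error) should be treated as "shard activity unknown" rather than as permission to continue.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// clusterHealth captures the _cluster/health response fields relevant here.
type clusterHealth struct {
	Status           string `json:"status"`
	TimedOut         bool   `json:"timed_out"`
	RelocatingShards int    `json:"relocating_shards"`
}

// shardActivityHasStopped returns true only when the health API answered in time
// and reported no relocating shards. A timed-out wait or a transport error is
// treated as "shard activity unknown", not as a green light to proceed.
func shardActivityHasStopped(esURL string) (bool, error) {
	client := &http.Client{Timeout: 35 * time.Second}
	resp, err := client.Get(esURL + "/_cluster/health?wait_for_no_relocating_shards=true&timeout=30s")
	if err != nil {
		return false, fmt.Errorf("cluster health request failed: %w", err)
	}
	defer resp.Body.Close()

	var health clusterHealth
	if err := json.NewDecoder(resp.Body).Decode(&health); err != nil {
		return false, fmt.Errorf("decoding cluster health response: %w", err)
	}
	if health.TimedOut {
		// The wait condition was not met within the timeout.
		return false, nil
	}
	return health.RelocatingShards == 0, nil
}

func main() {
	ok, err := shardActivityHasStopped("http://localhost:9200")
	fmt.Println("safe to proceed:", ok, "err:", err)
}
```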
In build/4899, there were exactly 10 occurrences of cloud-on-k8s/test/e2e/test/elasticsearch/steps_mutation.go, lines 254 to 257 (at 3e40d5b):
cloud-on-k8s/test/e2e/test/elasticsearch/steps_mutation.go, lines 233 to 238 (at 9ef2aca)
Let's set
We fail the test as soon as there is a failure, using cloud-on-k8s/test/e2e/test/elasticsearch/steps_mutation.go, lines 159 to 161 (at 58b5a73).
#7358 should fix this.
But not the flaky test:
52 seems very high.
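For context, here is a purely illustrative Go sketch of the kind of continuous health watcher described above. The real logic lives in steps_mutation.go; the type, method names and interval below are made up for the example. It records every unhealthy observation so the test can be failed as soon as the first one is seen.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// healthFailure records one observation of an unhealthy (or unreachable) cluster.
type healthFailure struct {
	At     time.Time
	Status string
}

// continuousHealthCheck polls cluster health on a fixed interval in the background.
type continuousHealthCheck struct {
	mu       sync.Mutex
	Failures []healthFailure
	stop     chan struct{}
}

// Start polls the given status function on a fixed interval and records failures.
func (c *continuousHealthCheck) Start(getStatus func() (string, error), interval time.Duration) {
	c.stop = make(chan struct{})
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-c.stop:
				return
			case <-ticker.C:
				status, err := getStatus()
				if err != nil || status == "red" {
					c.mu.Lock()
					c.Failures = append(c.Failures, healthFailure{At: time.Now(), Status: status})
					c.mu.Unlock()
				}
			}
		}
	}()
}

// Stop ends the background polling.
func (c *continuousHealthCheck) Stop() { close(c.stop) }

// FirstFailure lets the test fail fast on the very first recorded failure.
func (c *continuousHealthCheck) FirstFailure() *healthFailure {
	c.mu.Lock()
	defer c.mu.Unlock()
	if len(c.Failures) == 0 {
		return nil
	}
	return &c.Failures[0]
}

func main() {
	check := &continuousHealthCheck{}
	check.Start(func() (string, error) { return "green", nil }, time.Second)
	time.Sleep(3 * time.Second)
	check.Stop()
	fmt.Println("first failure:", check.FirstFailure())
}
```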
We had the same failure at least twice yesterday and today (both on Kind). I'll create a new issue, as I'm not sure the recent failures are related.
Related to #5040, but now with the node shutdown API in use:
https://devops-ci.elastic.co/job/cloud-on-k8s-e2e-tests-gke-k8s-versions/754//testReport
Rolling upgrade from 8.2.0 to 8.2.2
Timeline:
-- index [[.ds-ilm-history-5-2022.06.17-000001/VSEhLPyNTpOL8_9xw-YaVQ]]
----shard_id [.ds-ilm-history-5-2022.06.17-000001][0]
--------[.ds-ilm-history-5-2022.06.17-000001][0], node[c4wCBppvQRWsqabiqMd1hg], [P], recovery_source[new shard recovery], s[INITIALIZING], a[id=5av7tzdTRfeS67wh1Qw-Ig], unassigned_info[[reason=INDEX_CREATED], at[2022-06-17T07:18:09.536Z], delayed=false, allocation_status[no_attempt]]
--------[.ds-ilm-history-5-2022.06.17-000001][0], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=REPLICA_ADDED], at[2022-06-17T07:18:09.537Z], delayed=false, allocation_status[no_attempt]]
-> the ILM history index is auto-created
{"log.level":"info","@\timestamp": "2022-06-17T07:18:11.163Z","log.logger":"driver","message":"Deleting pod for rolling upgrade","service.version":"2.4.0-SNAPSHOT+903d77f6","service.type":"eck","ecs.version":"1.4.0","es_name":"test-version-upgrade-to-8x-r6p5","namespace":"e2e-fusqo-mercury","pod_name":"test-version-upgrade-to-8x-r6p5-es-masterdata-2","pod_uid":"3e2c7c5e-b829-4373-b29b-7f9eea252177"} operator deleting the first Pod after node shutdown API said it was OK to do so
----shard_id [.ds-ilm-history-5-2022.06.17-000001][0]
--------[.ds-ilm-history-5-2022.06.17-000001][0], node[null], [P], recovery_source[existing store recovery; bootstrap_history_uuid=false], s[UNASSIGNED], unassigned_info[[reason=NODE_RESTARTING], at[2022-06-17T07:18:11.986Z], delayed=true, details[node_left [c4wCBppvQRWsqabiqMd1hg]], allocation_status[no_valid_shard_copy]]
--------[.ds-ilm-history-5-2022.06.17-000001][0], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=PRIMARY_FAILED], at[2022-06-17T07:18:11.986Z], delayed=false, details[primary failed while replica initializing], allocation_status[no_attempt]]
-> the cluster is RED
The way I read this sequence of events is that we started a rolling upgrade only 3 seconds after the ILM history index had been created, and apparently its replica was not ready yet when we took down the first node to upgrade. However, my expectation would be that the node shutdown API should not allow a node to go down if the only remaining replica is not yet initialised.
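As a possible operator-side safeguard, here is a hedged Go sketch only: it is not the actual ECK implementation and not part of the node shutdown API, and the endpoint parameters and function names are assumptions. The idea is to refuse to delete a Pod while any index still has initializing or unassigned shards, which is exactly the window the timeline above falls into.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// indexHealth mirrors the per-index fields of _cluster/health?level=indices.
type indexHealth struct {
	Status             string `json:"status"`
	InitializingShards int    `json:"initializing_shards"`
	UnassignedShards   int    `json:"unassigned_shards"`
}

type clusterHealthByIndex struct {
	Status  string                 `json:"status"`
	Indices map[string]indexHealth `json:"indices"`
}

// safeToDeleteNode returns false if any index still has shards that are not yet
// fully assigned, e.g. a freshly created ILM history index whose replica has not
// initialised.
func safeToDeleteNode(esURL string) (bool, error) {
	resp, err := http.Get(esURL + "/_cluster/health?level=indices")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	var health clusterHealthByIndex
	if err := json.NewDecoder(resp.Body).Decode(&health); err != nil {
		return false, err
	}
	for name, idx := range health.Indices {
		if idx.InitializingShards > 0 || idx.UnassignedShards > 0 {
			fmt.Printf("index %s not fully assigned yet (status=%s)\n", name, idx.Status)
			return false, nil
		}
	}
	return true, nil
}

func main() {
	ok, err := safeToDeleteNode("http://localhost:9200")
	fmt.Println("safe to delete node:", ok, "err:", err)
}
```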