
Elasticsearch_cluster_health_should_not_have_been_red_during_mutation_process (revisited) #5795

Closed · Fixed by #7358
pebrc opened this issue Jun 17, 2022 · 7 comments
Labels: >bug, discuss, >flaky_test

pebrc (Collaborator) commented Jun 17, 2022

Related to #5040, but now with the node shutdown API in use:

https://devops-ci.elastic.co/job/cloud-on-k8s-e2e-tests-gke-k8s-versions/754//testReport

Rolling upgrade from 8.2.0 to 8.2.2

Timeline:

  • {"Time":"2022-06-17T07:17:01.717730831Z","Action":"output","Package":"github.com/elastic/cloud-on-k8s/test/e2e/kb","Test":"TestVersionUpgradeToLatest8x/Creating_an_Elasticsearch_cluster_should_succeed","Output":"=== RUN TestVersionUpgradeToLatest8x/Creating_an_Elasticsearch_cluster_should_succeed\n"}
  • {"Time":"2022-06-17T07:18:07.786852419Z","Action":"run","Package":"github.com/elastic/cloud-on-k8s/test/e2e/kb","Test":"TestVersionUpgradeToLatest8x/Applying_the_Elasticsearch_mutation_should_succeed"} this is were the test runner starts the rolling upgrade
  • Jun 17, 2022 @ 07:18:09.556 cluster state updated, source [auto create [ilm-history-5][org.elasticsearch.action.admin.indices.create.AutoCreateAction$TransportAction$CreateIndexTask@4f704a38]]
    -- index [[.ds-ilm-history-5-2022.06.17-000001/VSEhLPyNTpOL8_9xw-YaVQ]]
    ----shard_id [.ds-ilm-history-5-2022.06.17-000001][0]
    --------[.ds-ilm-history-5-2022.06.17-000001][0], node[c4wCBppvQRWsqabiqMd1hg], [P], recovery_source[new shard recovery], s[INITIALIZING], a[id=5av7tzdTRfeS67wh1Qw-Ig], unassigned_info[[reason=INDEX_CREATED], at[2022-06-17T07:18:09.536Z], delayed=false, allocation_status[no_attempt]]
    --------[.ds-ilm-history-5-2022.06.17-000001][0], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=REPLICA_ADDED], at[2022-06-17T07:18:09.537Z], delayed=false, allocation_status[no_attempt]] ILM history is auto-created
  • {"log.level":"info","@\timestamp":"2022-06-17T07:18:11.163Z","log.logger":"driver","message":"Node shutdown complete, can start node deletion","service.version":"2.4.0-SNAPSHOT+903d77f6","service.type":"eck","ecs.version":"1.4.0","namespace":"e2e-fusqo-mercury","es_name":"test-version-upgrade-to-8x-r6p5","type":"RESTART","node":"test-version-upgrade-to-8x-r6p5-es-masterdata-2"}
    {"log.level":"info","@\timestamp": "2022-06-17T07:18:11.163Z","log.logger":"driver","message":"Deleting pod for rolling upgrade","service.version":"2.4.0-SNAPSHOT+903d77f6","service.type":"eck","ecs.version":"1.4.0","es_name":"test-version-upgrade-to-8x-r6p5","namespace":"e2e-fusqo-mercury","pod_name":"test-version-upgrade-to-8x-r6p5-es-masterdata-2","pod_uid":"3e2c7c5e-b829-4373-b29b-7f9eea252177"} operator deleting the first Pod after node shutdown API said it was OK to do so
  • Jun 17, 2022 @ 07:18:12.226 cluster state updated, source [cluster_reroute(async_shard_fetch)] -- index [[.ds-ilm-history-5-2022.06.17-000001/VSEhLPyNTpOL8_9xw-YaVQ]]
    ----shard_id [.ds-ilm-history-5-2022.06.17-000001][0]
    --------[.ds-ilm-history-5-2022.06.17-000001][0], node[null], [P], recovery_source[existing store recovery; bootstrap_history_uuid=false], s[UNASSIGNED], unassigned_info[[reason=NODE_RESTARTING], at[2022-06-17T07:18:11.986Z], delayed=true, details[node_left [c4wCBppvQRWsqabiqMd1hg]], allocation_status[no_valid_shard_copy]]
    --------[.ds-ilm-history-5-2022.06.17-000001][0], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=PRIMARY_FAILED], at[2022-06-17T07:18:11.986Z], delayed=false, details[primary failed while replica initializing], allocation_status[no_attempt]] cluster is RED

The way I read this sequence of events: we started the rolling upgrade only about 3 seconds after the ILM history index had been created, and its replica was apparently not yet ready when we took down the first node to upgrade. However, my expectation would be that the node shutdown API should not allow a node to go down if the only remaining replica is not yet initialised.
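
For reference, the gate the operator relies on here is the node shutdown status reported by Elasticsearch, and the open question above is whether that status should ever be COMPLETE while the only replica of a shard on the node is still initialising. A minimal Go sketch of that kind of gate, illustrative only and not the actual ECK driver code, using the GET _nodes/<node-id>/shutdown API:

package shutdown

import (
    "encoding/json"
    "fmt"
    "net/http"
)

// Sketch only, not the actual ECK driver code: ask Elasticsearch whether the
// RESTART shutdown registered for a node has completed before deleting its Pod.
type shutdownStatusResponse struct {
    Nodes []struct {
        NodeID string `json:"node_id"`
        Status string `json:"status"` // NOT_STARTED, IN_PROGRESS, STALLED or COMPLETE
    } `json:"nodes"`
}

// isShutdownComplete returns true only once Elasticsearch reports the node's
// shutdown preparation (shard migration, persistent tasks, plugins) as COMPLETE.
func isShutdownComplete(esURL, nodeID string) (bool, error) {
    resp, err := http.Get(fmt.Sprintf("%s/_nodes/%s/shutdown", esURL, nodeID))
    if err != nil {
        return false, err
    }
    defer resp.Body.Close()

    var status shutdownStatusResponse
    if err := json.NewDecoder(resp.Body).Decode(&status); err != nil {
        return false, err
    }
    for _, n := range status.Nodes {
        if n.NodeID == nodeID {
            return n.Status == "COMPLETE", nil
        }
    }
    return false, fmt.Errorf("no shutdown registered for node %s", nodeID)
}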

pebrc (Collaborator, Author) commented Jun 17, 2022

I have reached out to the Elasticsearch team to figure out whether we can improve something in the node shutdown API. What I still need to understand better is why our own check

func isSafeToRoll(health client.Health) bool {
did not protect us in this scenario. All the preconditions apply: the cluster state is yellow and we have a version upgrade.
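
For illustration, the intent of such a check is roughly: never roll a node on red, and on yellow only proceed once shard activity has settled. A minimal sketch of that kind of predicate, with hypothetical types and field names, not the real isSafeToRoll implementation:

package upgrade

// Illustrative sketch only, not the real isSafeToRoll from the operator.
type Health struct {
    Status             string `json:"status"` // green, yellow or red
    InitializingShards int    `json:"initializing_shards"`
    RelocatingShards   int    `json:"relocating_shards"`
}

func isSafeToRollSketch(health Health) bool {
    switch health.Status {
    case "green":
        return true
    case "yellow":
        // Tolerate yellow only when no shards are initializing or relocating;
        // otherwise a freshly created replica (like the ILM history replica in
        // the timeline above) may still be initializing, and deleting the node
        // holding its primary turns the cluster red.
        return health.InitializingShards == 0 && health.RelocatingShards == 0
    default: // red or unknown
        return false
    }
}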

pebrc (Collaborator, Author) commented Jun 17, 2022

So, regarding the check for shard activity, it looks like the cluster health API request timed out:

Jun 17, 2022 @ 07:18:09.543 task [cluster_health (wait_for_events [LANGUID])] timed out after [0s]
{"log.level":"debug","@timestamp":"2022-06-17T07:18:09.539Z","log.logger":"elasticsearch-client","message":"Elasticsearch HTTP request","service.version":"2.4.0-SNAPSHOT+903d77f6","service.type":"eck","ecs.versi
on":"1.4.0","method":"GET","url":"https://test-version-upgrade-to-8x-r6p5-es-internal-http.e2e-fusqo-mercury.svc:9200/_cluster/health?wait_for_events=languid&timeout=0s","namespace":"e2e-fusqo-mercury","es_name":"test-version-upgrade-to-8x-r6p5"}
# the parse error here is misleading: it just means we got a health response with a non-200 HTTP code
{"log.level":"error","@timestamp":"2022-06-17T07:18:09.545Z","log.logger":"elasticsearch-client","message":"Cannot parse Elasticsearch error response body","service.version":"2.4.0-SNAPSHOT+903d77f6","service.type":"eck","ecs.version":"1.4.0","error":"--useless stack trace elided---"}

But the operator seems to have continued. I would have expected the if_yellow_only_restart_upgrading_nodes_with_unassigned_replicas predicate to save us here. I wonder if the cluster was not (yet) reporting yellow health.
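
For context on the request above: wait_for_events=languid with timeout=0s is expected to return almost immediately with HTTP 408 and "timed_out": true while still carrying a normal health body, so the client has to decode the body even on a non-200 status. A minimal sketch of such a client call, illustrative only and not the actual ECK Elasticsearch client:

package health

import (
    "encoding/json"
    "fmt"
    "net/http"
)

// Sketch only: with wait_for_events=languid&timeout=0s the cluster health API
// answers right away, but with HTTP 408 and "timed_out": true while still
// carrying a regular health payload, which is why the "cannot parse
// Elasticsearch error response body" log above is misleading.
type clusterHealth struct {
    Status   string `json:"status"`
    TimedOut bool   `json:"timed_out"`
}

func getHealthIgnoringTimeout(esURL string) (clusterHealth, error) {
    var h clusterHealth
    resp, err := http.Get(esURL + "/_cluster/health?wait_for_events=languid&timeout=0s")
    if err != nil {
        return h, err
    }
    defer resp.Body.Close()

    // 200 (no pending tasks at languid priority) and 408 (timed out waiting)
    // both carry a valid health body.
    if resp.StatusCode != http.StatusOK && resp.StatusCode != http.StatusRequestTimeout {
        return h, fmt.Errorf("unexpected HTTP status %d", resp.StatusCode)
    }
    err = json.NewDecoder(resp.Body).Decode(&h)
    return h, err
}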

thbkrkr (Contributor) commented Jul 13, 2023

In build/4899, there were exactly 10 x "Elasticsearch cluster health check failure at ...: cluster health red", and in #7000 we set the limit to 10, so it should have worked. But I didn't see that there may also be 1 x "error while creating the Elasticsearch client", which is also counted in continuousHealthChecks.FailureCount.

fmt.Printf("error while creating the Elasticsearch client: %s", err)
if !errors.As(err, &PotentialNetworkError) {
// explicit apiserver error, consider as healthcheck failure
hc.AppendErr(err)

func (hc *ContinuousHealthCheck) AppendErr(err error) {
hc.Failures = append(hc.Failures, ContinuousHealthCheckFailure{
err: err,
timestamp: time.Now(),
})
hc.FailureCount++

Let's set it to 15 to get a little margin while staying relatively low.

thbkrkr (Contributor) commented Dec 1, 2023

We fail the test as soon as there is a single failure, via t.Errorf, despite our tolerance mechanism. 🤦

for _, f := range continuousHealthChecks.Failures {
    t.Errorf("Elasticsearch cluster health check failure at %s: %s", f.timestamp, f.err.Error())
}

#7358 should fix this.
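
One possible shape of such a fix, sketched with a hypothetical helper name and not necessarily what #7358 does, is to log individual failures and only fail the test once the count exceeds the tolerance. The fragment below reuses the ContinuousHealthCheck type from the excerpt above and the standard testing package:

// Sketch: report each failure as a log line, but only mark the test failed
// when the failure count exceeds the configured tolerance.
const continuousHealthCheckTolerance = 15

func assertContinuousHealthChecks(t *testing.T, hc *ContinuousHealthCheck) {
    for _, f := range hc.Failures {
        // t.Logf reports without failing the test, unlike t.Errorf.
        t.Logf("Elasticsearch cluster health check failure at %s: %s", f.timestamp, f.err.Error())
    }
    if hc.FailureCount > continuousHealthCheckTolerance {
        t.Errorf("ContinuousHealthChecks failures count (%d) is above the tolerance (%d)",
            hc.FailureCount, continuousHealthCheckTolerance)
    }
}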

thbkrkr (Contributor) commented Dec 4, 2023

But not the flaky test:

https://buildkite.com/elastic/cloud-on-k8s-operator-nightly/builds/381

steps_mutation.go:159: ContinuousHealthChecks failures count (52) is above the tolerance (15): 52 x [cluster health red]

52 seems very high.

barkbay (Contributor) commented Nov 27, 2024

We had the same failure at least twice, yesterday and today (both on Kind). I'll create a new issue, as I'm not sure the recent failures are related.
