
Elasticsearch_cluster_health_should_not_have_been_red_during_mutation_process (revisited) #5795

Closed · Fixed by #7358
pebrc opened this issue Jun 17, 2022 · 7 comments
Labels: >bug, discuss, >flaky_test

pebrc (Collaborator) commented Jun 17, 2022

Related to #5040, but now with the node shutdown API in use:

https://devops-ci.elastic.co/job/cloud-on-k8s-e2e-tests-gke-k8s-versions/754//testReport

Rolling upgrade from 8.2.0 to 8.2.2

Timeline:

  • {"Time":"2022-06-17T07:17:01.717730831Z","Action":"output","Package":"github.com/elastic/cloud-on-k8s/test/e2e/kb","Test":"TestVersionUpgradeToLatest8x/Creating_an_Elasticsearch_cluster_should_succeed","Output":"=== RUN TestVersionUpgradeToLatest8x/Creating_an_Elasticsearch_cluster_should_succeed\n"}
  • {"Time":"2022-06-17T07:18:07.786852419Z","Action":"run","Package":"github.com/elastic/cloud-on-k8s/test/e2e/kb","Test":"TestVersionUpgradeToLatest8x/Applying_the_Elasticsearch_mutation_should_succeed"} this is were the test runner starts the rolling upgrade
  • Jun 17, 2022 @ 07:18:09.556 cluster state updated, source [auto create [ilm-history-5][org.elasticsearch.action.admin.indices.create.AutoCreateAction$TransportAction$CreateIndexTask@4f704a38]]
    -- index [[.ds-ilm-history-5-2022.06.17-000001/VSEhLPyNTpOL8_9xw-YaVQ]]
    ----shard_id [.ds-ilm-history-5-2022.06.17-000001][0]
    --------[.ds-ilm-history-5-2022.06.17-000001][0], node[c4wCBppvQRWsqabiqMd1hg], [P], recovery_source[new shard recovery], s[INITIALIZING], a[id=5av7tzdTRfeS67wh1Qw-Ig], unassigned_info[[reason=INDEX_CREATED], at[2022-06-17T07:18:09.536Z], delayed=false, allocation_status[no_attempt]]
    --------[.ds-ilm-history-5-2022.06.17-000001][0], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=REPLICA_ADDED], at[2022-06-17T07:18:09.537Z], delayed=false, allocation_status[no_attempt]] ILM history is auto-created
  • {"log.level":"info","@\timestamp":"2022-06-17T07:18:11.163Z","log.logger":"driver","message":"Node shutdown complete, can start node deletion","service.version":"2.4.0-SNAPSHOT+903d77f6","service.type":"eck","ecs.version":"1.4.0","namespace":"e2e-fusqo-mercury","es_name":"test-version-upgrade-to-8x-r6p5","type":"RESTART","node":"test-version-upgrade-to-8x-r6p5-es-masterdata-2"}
    {"log.level":"info","@\timestamp": "2022-06-17T07:18:11.163Z","log.logger":"driver","message":"Deleting pod for rolling upgrade","service.version":"2.4.0-SNAPSHOT+903d77f6","service.type":"eck","ecs.version":"1.4.0","es_name":"test-version-upgrade-to-8x-r6p5","namespace":"e2e-fusqo-mercury","pod_name":"test-version-upgrade-to-8x-r6p5-es-masterdata-2","pod_uid":"3e2c7c5e-b829-4373-b29b-7f9eea252177"} operator deleting the first Pod after node shutdown API said it was OK to do so
  • Jun 17, 2022 @ 07:18:12.226 cluster state updated, source [cluster_reroute(async_shard_fetch)] -- index [[.ds-ilm-history-5-2022.06.17-000001/VSEhLPyNTpOL8_9xw-YaVQ]]
    ----shard_id [.ds-ilm-history-5-2022.06.17-000001][0]
    --------[.ds-ilm-history-5-2022.06.17-000001][0], node[null], [P], recovery_source[existing store recovery; bootstrap_history_uuid=false], s[UNASSIGNED], unassigned_info[[reason=NODE_RESTARTING], at[2022-06-17T07:18:11.986Z], delayed=true, details[node_left [c4wCBppvQRWsqabiqMd1hg]], allocation_status[no_valid_shard_copy]]
    --------[.ds-ilm-history-5-2022.06.17-000001][0], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=PRIMARY_FAILED], at[2022-06-17T07:18:11.986Z], delayed=false, details[primary failed while replica initializing], allocation_status[no_attempt]] cluster is RED

The way I read this sequence of events: we started the rolling upgrade only about 3 seconds after the ILM history index had been created, and its replica was apparently not yet ready when we took down the first node to upgrade. However, my expectation would be that the node shutdown API should not allow a node to go down if the only remaining replica is not yet initialised.
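
For reference, the gate the operator relies on here is the node shutdown status reported by Elasticsearch, and the open question above is whether that status should ever be COMPLETE while the only replica of a shard on the node is still initialising. A minimal Go sketch of that kind of gate, illustrative only and not the actual ECK driver code, using the GET _nodes/<node-id>/shutdown API:

package shutdown

import (
    "encoding/json"
    "fmt"
    "net/http"
)

// Sketch only, not the actual ECK driver code: ask Elasticsearch whether the
// RESTART shutdown registered for a node has completed before deleting its Pod.
type shutdownStatusResponse struct {
    Nodes []struct {
        NodeID string `json:"node_id"`
        Status string `json:"status"` // NOT_STARTED, IN_PROGRESS, STALLED or COMPLETE
    } `json:"nodes"`
}

// isShutdownComplete returns true only once Elasticsearch reports the node's
// shutdown preparation (shard migration, persistent tasks, plugins) as COMPLETE.
func isShutdownComplete(esURL, nodeID string) (bool, error) {
    resp, err := http.Get(fmt.Sprintf("%s/_nodes/%s/shutdown", esURL, nodeID))
    if err != nil {
        return false, err
    }
    defer resp.Body.Close()

    var status shutdownStatusResponse
    if err := json.NewDecoder(resp.Body).Decode(&status); err != nil {
        return false, err
    }
    for _, n := range status.Nodes {
        if n.NodeID == nodeID {
            return n.Status == "COMPLETE", nil
        }
    }
    return false, fmt.Errorf("no shutdown registered for node %s", nodeID)
}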

pebrc (Collaborator, Author) commented Jun 17, 2022

I have reached out to the Elasticsearch team to figure out whether we can improve something in the node shutdown API. What I still need to understand better is why our own check

func isSafeToRoll(health client.Health) bool {
did not protect us in this scenario. All the preconditions apply: the cluster state is yellow and we have a version upgrade.
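
For illustration, the intent of such a check is roughly: never roll a node on red, and on yellow only proceed once shard activity has settled. A minimal sketch of that kind of predicate, with hypothetical types and field names, not the real isSafeToRoll implementation:

package upgrade

// Illustrative sketch only, not the real isSafeToRoll from the operator.
type Health struct {
    Status             string `json:"status"` // green, yellow or red
    InitializingShards int    `json:"initializing_shards"`
    RelocatingShards   int    `json:"relocating_shards"`
}

func isSafeToRollSketch(health Health) bool {
    switch health.Status {
    case "green":
        return true
    case "yellow":
        // Tolerate yellow only when no shards are initializing or relocating;
        // otherwise a freshly created replica (like the ILM history replica in
        // the timeline above) may still be initializing, and deleting the node
        // holding its primary turns the cluster red.
        return health.InitializingShards == 0 && health.RelocatingShards == 0
    default: // red or unknown
        return false
    }
}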

pebrc (Collaborator, Author) commented Jun 17, 2022

So, regarding the check for shard activity, it looks like the cluster health API request timed out:

Jun 17, 2022 @ 07:18:09.543 task [cluster_health (wait_for_events [LANGUID])] timed out after [0s]
{"log.level":"debug","@timestamp":"2022-06-17T07:18:09.539Z","log.logger":"elasticsearch-client","message":"Elasticsearch HTTP request","service.version":"2.4.0-SNAPSHOT+903d77f6","service.type":"eck","ecs.versi
on":"1.4.0","method":"GET","url":"https://test-version-upgrade-to-8x-r6p5-es-internal-http.e2e-fusqo-mercury.svc:9200/_cluster/health?wait_for_events=languid&timeout=0s","namespace":"e2e-fusqo-mercury","es_name":"test-version-upgrade-to-8x-r6p5"}
# the parse error here is misleading: it just means we got a health response with a non-200 HTTP code
{"log.level":"error","@timestamp":"2022-06-17T07:18:09.545Z","log.logger":"elasticsearch-client","message":"Cannot parse Elasticsearch error response body","service.version":"2.4.0-SNAPSHOT+903d77f6","service.type":"eck","ecs.version":"1.4.0","error":"--useless stack trace elided---"}

But the operator seems to have continued. I would have expected the if_yellow_only_restart_upgrading_nodes_with_unassigned_replicas predicate to save us here. I wonder if the cluster was not (yet) reporting yellow health.
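
For context on the request above: wait_for_events=languid with timeout=0s is expected to return almost immediately with HTTP 408 and "timed_out": true while still carrying a normal health body, so the client has to decode the body even on a non-200 status. A minimal sketch of such a client call, illustrative only and not the actual ECK Elasticsearch client:

package health

import (
    "encoding/json"
    "fmt"
    "net/http"
)

// Sketch only: with wait_for_events=languid&timeout=0s the cluster health API
// answers right away, but with HTTP 408 and "timed_out": true while still
// carrying a regular health payload, which is why the "cannot parse
// Elasticsearch error response body" log above is misleading.
type clusterHealth struct {
    Status   string `json:"status"`
    TimedOut bool   `json:"timed_out"`
}

func getHealthIgnoringTimeout(esURL string) (clusterHealth, error) {
    var h clusterHealth
    resp, err := http.Get(esURL + "/_cluster/health?wait_for_events=languid&timeout=0s")
    if err != nil {
        return h, err
    }
    defer resp.Body.Close()

    // 200 (no pending tasks at languid priority) and 408 (timed out waiting)
    // both carry a valid health body.
    if resp.StatusCode != http.StatusOK && resp.StatusCode != http.StatusRequestTimeout {
        return h, fmt.Errorf("unexpected HTTP status %d", resp.StatusCode)
    }
    err = json.NewDecoder(resp.Body).Decode(&h)
    return h, err
}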

thbkrkr (Contributor) commented Jul 13, 2023

In build/4899, there were exactly 10 x "Elasticsearch cluster health check failure at ...: cluster health red", and in #7000 we set the limit to 10, so it should have worked. But I didn't see that there may also be 1 x "error while creating the Elasticsearch client", which is also counted in continuousHealthChecks.FailureCount.

fmt.Printf("error while creating the Elasticsearch client: %s", err)
if !errors.As(err, &PotentialNetworkError) {
// explicit apiserver error, consider as healthcheck failure
hc.AppendErr(err)

func (hc *ContinuousHealthCheck) AppendErr(err error) {
hc.Failures = append(hc.Failures, ContinuousHealthCheckFailure{
err: err,
timestamp: time.Now(),
})
hc.FailureCount++

Let's set it to 15 to get a little margin while staying relatively low.

thbkrkr (Contributor) commented Dec 1, 2023

We fail the test as soon as there is a single failure, via t.Errorf, despite our tolerance mechanism. 🤦

for _, f := range continuousHealthChecks.Failures {
    t.Errorf("Elasticsearch cluster health check failure at %s: %s", f.timestamp, f.err.Error())
}

#7358 should fix this.
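
One possible shape of such a fix, sketched with a hypothetical helper name and not necessarily what #7358 does, is to log individual failures and only fail the test once the count exceeds the tolerance. The fragment below reuses the ContinuousHealthCheck type from the excerpt above and the standard testing package:

// Sketch: report each failure as a log line, but only mark the test failed
// when the failure count exceeds the configured tolerance.
const continuousHealthCheckTolerance = 15

func assertContinuousHealthChecks(t *testing.T, hc *ContinuousHealthCheck) {
    for _, f := range hc.Failures {
        // t.Logf reports without failing the test, unlike t.Errorf.
        t.Logf("Elasticsearch cluster health check failure at %s: %s", f.timestamp, f.err.Error())
    }
    if hc.FailureCount > continuousHealthCheckTolerance {
        t.Errorf("ContinuousHealthChecks failures count (%d) is above the tolerance (%d)",
            hc.FailureCount, continuousHealthCheckTolerance)
    }
}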

thbkrkr (Contributor) commented Dec 4, 2023

But not the flaky test:

https://buildkite.com/elastic/cloud-on-k8s-operator-nightly/builds/381

steps_mutation.go:159: ContinuousHealthChecks failures count (52) is above the tolerance (15): 52 x [cluster health red]

52 seems very high.

barkbay (Contributor) commented Nov 27, 2024

We had the same failure at least twice, yesterday and today (both on Kind). I'll create a new issue, as I'm not sure the recent failures are related.
