Zarf's cleanup-on-failure logic can cause unintended deletion of applications during failed upgrades #2455

@blancharda

Is your feature request related to a problem? Please describe.

When Zarf fails to install a component, it attempts to purge the helm release, as shown below:

  :package: KEYCLOAK COMPONENT

  ✔  Pushed 2 images to the zarf registry
  ⠹  preparing upgrade for keycloak
 WARNING  Retrying (1/3) in 5s: unable to complete the helm chart install/upgrade: another operation
          (install/upgrade/rollback) is in progress
  ⠏  preparing upgrade for keycloak (6s)
 WARNING  Retrying (2/3) in 10s: unable to complete the helm chart install/upgrade: another
          operation (install/upgrade/rollback) is in progress
  ⠙  preparing upgrade for keycloak (16s)
 WARNING  Retrying (3/3) in 20s: unable to complete the helm chart install/upgrade: another
          operation (install/upgrade/rollback) is in progress
  ⠧  purge requested for keycloak
     ERROR:  Failed to deploy bundle: unable to deploy component "keycloak": unable to install helm chart(s):
             unable to install chart after 3 attempts

On an initial install this makes some amount of sense, and there is logic that attempts to prevent removal on a failed upgrade. However, in some edge cases (as shown above), Zarf may mistake an upgrade for an initial install and delete something that should instead be rolled back.

These rough steps caused an existing (system critical) deployment to be removed from the cluster:

  1. Successfully deploy a Zarf package including Keycloak
  2. Attempt an upgrade some time later
  3. Realize the upgrade is doomed
  4. ctrl-c out of the deploy (a timeout would have resulted in a rollback, but would have taken ~45 minutes)
  5. Make a change and redeploy
  6. Laugh/cry a little as Keycloak is removed from the cluster

The deletion is almost certainly explained by the aborted deployment leaving the helm release in an incomplete state, but the existence of the behavior in the first place is very concerning for long-lived production deployments.

Describe the solution you'd like

  • Given a Zarf package
  • When there is an error during deployment
  • Then Zarf attempts a rollback (if available), and does nothing if not
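
A rough sketch of that policy in Python, to make the intent concrete. This is not Zarf's code: `Revision`, `Action`, and `on_deploy_failure` are hypothetical names, and the only thing borrowed from helm is its release status strings (`deployed`, `superseded`, `failed`, `pending-upgrade`, and so on). The assumption is that any revision that was ever `deployed` or `superseded` proves the release is not a first install, even if the latest revision is stuck in a pending state.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple

# Assumption: a revision that is "deployed" (live) or "superseded" (replaced
# by a newer revision) means the chart was successfully installed before.
GOOD_STATUSES = {"deployed", "superseded"}


class Action(Enum):
    ROLLBACK = "rollback"
    NONE = "none"


@dataclass
class Revision:
    number: int
    status: str  # a helm release status, e.g. "failed" or "pending-upgrade"


def on_deploy_failure(history: List[Revision]) -> Tuple[Action, Optional[int]]:
    """Pick a failure action from the release history (hypothetical policy)."""
    good = [r.number for r in history if r.status in GOOD_STATUSES]
    if good:
        # Something worked before: roll back to the newest good revision.
        return Action.ROLLBACK, max(good)
    # Nothing to roll back to: leave the release alone instead of purging it.
    return Action.NONE, None
```

With this policy, the aborted upgrade from the steps above (revision 1 `deployed`, revision 2 `pending-upgrade`) would resolve to a rollback to revision 1, and a failed first install would be left in place rather than purged.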

In this particular case, a rollback would not have been possible either, because of the invalid helm state. Still, in 9 out of 10 situations I would rather have a broken deployment than no deployment.

Describe alternatives you've considered

Potentially, Zarf could catch the abort signal from ctrl-c and attempt a rollback immediately on exit, rather than leaving the helm release in an incomplete state. (This is a half-baked idea, but maybe there's something there.)
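
To illustrate the shape of that idea, here is a minimal Python sketch (Zarf itself is written in Go, so this is only an analogy). It relies on Python's default behavior of turning SIGINT into a `KeyboardInterrupt` in the main thread. `deploy_with_rollback` and both callbacks are hypothetical placeholders, not Zarf or helm APIs.

```python
from typing import Callable


def deploy_with_rollback(
    deploy: Callable[[], None],
    rollback: Callable[[], None],
) -> str:
    """Run deploy(); if the user hits ctrl-c, run rollback() before exiting."""
    try:
        deploy()
    except KeyboardInterrupt:
        # ctrl-c arrived mid-deploy: clean up now so the release is not
        # left in a pending state for the next deploy to trip over.
        rollback()
        return "rolled-back"
    return "deployed"
```

A real implementation would also need to handle a second ctrl-c during the rollback itself, and a rollback that fails, neither of which this sketch attempts.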
