Description
Is your feature request related to a problem? Please describe.
When Zarf fails to install a component, it attempts to purge the Helm release, as shown below:
:package: KEYCLOAK COMPONENT
✔ Pushed 2 images to the zarf registry
⠹ preparing upgrade for keycloak
WARNING Retrying (1/3) in 5s: unable to complete the helm chart install/upgrade: another operation
(install/upgrade/rollback) is in progress
⠏ preparing upgrade for keycloak (6s)
WARNING Retrying (2/3) in 10s: unable to complete the helm chart install/upgrade: another
operation (install/upgrade/rollback) is in progress
⠙ preparing upgrade for keycloak (16s)
WARNING Retrying (3/3) in 20s: unable to complete the helm chart install/upgrade: another
operation (install/upgrade/rollback) is in progress
⠧ purge requested for keycloak
ERROR: Failed to deploy bundle: unable to deploy component "keycloak": unable to install helm chart(s):
unable to install chart after 3 attempts
On an initial install this makes some sense, and there is logic that attempts to prevent removal on a failed upgrade. However, in some edge cases (as shown above), Zarf may mistake an upgrade for an initial install and delete something that should instead be rolled back.
These rough steps caused an existing (system critical) deployment to be removed from the cluster:
- Successfully deploy a Zarf package including Keycloak
- Attempt an upgrade some time later
- Realize the upgrade is doomed
- Ctrl-C out of the deploy (a timeout would have resulted in a rollback, but would have taken ~45 minutes)
- Make a change and redeploy
- Laugh/cry a little as Keycloak is removed from the cluster
The deletion is almost certainly explained by the aborted deployment leaving the Helm release in an incomplete state, but the fact that this behavior exists at all is very concerning for long-lived production deployments.
Describe the solution you'd like
- Given a Zarf package
- When there is an error during deployment
- Then Zarf attempts a rollback (if available), and does nothing if not
In this particular case, a rollback would not have been possible either, because of the invalid Helm state. Still, in 9 out of 10 situations, I would rather have a broken deployment than no deployment.
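To make the requested behavior concrete, here is a minimal Go sketch of the decision only. It is not Zarf's actual code: `Revision`, `Action`, and `recoveryAction` are hypothetical names, and the status strings merely mirror Helm's release statuses. It assumes the release history is available oldest to newest, and it deliberately has no purge/uninstall outcome.

```go
package main

// Status mirrors a few of Helm's release status strings. The type and its
// constants are illustrative stand-ins, not Zarf's or Helm's own types.
type Status string

const (
	StatusDeployed       Status = "deployed"
	StatusSuperseded     Status = "superseded"
	StatusFailed         Status = "failed"
	StatusPendingUpgrade Status = "pending-upgrade"
)

// Revision is a hypothetical, simplified entry in a release's history.
type Revision struct {
	Number int
	Status Status
}

// Action is what the deployer does after a failed deploy.
type Action string

const (
	ActionRollback Action = "rollback"
	ActionNone     Action = "none"
)

// recoveryAction takes a release history ordered oldest to newest and returns
// the action to take plus the revision to roll back to (0 when there is none).
// It never asks for the release to be removed.
func recoveryAction(history []Revision) (Action, int) {
	if len(history) == 0 {
		return ActionNone, 0
	}
	latest := history[len(history)-1]
	// Only a cleanly failed revision is rolled back. A pending (interrupted)
	// or otherwise unexpected state is left untouched for an operator.
	if latest.Status != StatusFailed {
		return ActionNone, 0
	}
	// Walk back to the most recent revision that was once live.
	for i := len(history) - 2; i >= 0; i-- {
		switch history[i].Status {
		case StatusDeployed, StatusSuperseded:
			return ActionRollback, history[i].Number
		}
	}
	// A failed first install has nothing to roll back to: do nothing.
	return ActionNone, 0
}
```

With a history of `superseded`, `deployed`, `failed`, this picks a rollback to the second revision. With `deployed`, `pending-upgrade` (roughly the situation in this issue), it does nothing and the existing deployment stays in the cluster.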
Describe alternatives you've considered
Potentially, Zarf could catch the abort signal from Ctrl-C and attempt a rollback immediately on exit rather than leaving the Helm release in an incomplete state. (This is a half-baked idea, but maybe there's something there.)