*: add backup file option to recover subcommand #528

xiang90 · 2017-05-18T23:34:10Z

This is still WIP. But if you want to take a early view. Do it :P. I will keep this updated with my progress, and ping you again when it is finished and tested.

diegs

Sweet!

diegs · 2017-05-19T00:19:09Z

cmd/bootkube/recover.go

@@ -46,8 +48,10 @@ func init() {
 	cmdRecover.Flags().StringVar(&recoverOpts.etcdCertificatePath, "etcd-certificate-path", "", "Path to an existing certificate that will be used for TLS-enabled communication between the apiserver and etcd. Must be used in conjunction with --etcd-ca-path and --etcd-private-key-path, and must have etcd configured to use TLS with matching secrets.")
 	cmdRecover.Flags().StringVar(&recoverOpts.etcdPrivateKeyPath, "etcd-private-key-path", "", "Path to an existing private key that will be used for TLS-enabled communication between the apiserver and etcd. Must be used in conjunction with --etcd-ca-path and --etcd-certificate-path, and must have etcd configured to use TLS with matching secrets.")
 	cmdRecover.Flags().StringVar(&recoverOpts.etcdServers, "etcd-servers", "", "List of etcd server URLs including host:port, comma separated.")
+	cmdRecover.Flags().StringVar(&recoverOpts.etcdServers, "etcd-backup-file", "", "Path to the etcd backup file.")


&recoverOpts.etcdBackupFile?

diegs · 2017-05-19T00:20:23Z

cmd/bootkube/recover.go

@@ -66,6 +70,22 @@ func runCmdRecover(cmd *cobra.Command, args []string) error {
 			return err
 		}
 		backend = recovery.NewEtcdBackend(etcdClient, recoverOpts.etcdPrefix)
+	case recoverOpts.etcdBackupFile != "":


Maybe just move this above the switch, set the etcdServer, and then let it fall througH?

thought about it. but the logging will be duplicated. it will look like

Attempting recovery using etcd backup file at ... Attempting recovery using etcd cluster at ...

but, yea, i will try to clean up code duplication here in another way.

It's not a big deal.

diegs · 2017-05-19T00:20:58Z

pkg/recovery/etcd.go

+	}{
+		// TODO: this already exists in bootkube/cmd.
+		// do not duplicate this!
+		Image:      "quay.io/coreos/etcd:v3.1.6",


I'd be ok with factoring out an images package.

diegs · 2017-05-19T00:21:52Z

cmd/bootkube/recover.go

+		if err != nil {
+			return err
+		}
+		defer recovery.CleanRecoveryEtcd(recoverOpts.podManifestPath)


nit: ignoring error returned by function.

diegs · 2017-05-19T00:22:29Z

pkg/recovery/etcd.go

+
+const assetPathRecoveryEtcd = "recovery-etcd.yaml"
+
+func StartRecoveryEtcd(p, backupPath string) error {


nit: would put "Backup" somewhere in the name to be clearer.

diegs · 2017-05-19T20:58:21Z

pkg/recovery/etcd.go

@@ -31,6 +37,16 @@ func NewEtcdBackend(client *clientv3.Client, pathPrefix string) Backend {
 	}
 }

+// NewSelfHostedEtcdBackend constructs a new etcdBackend for the given client and pathPrefix, and backup file.
+func NewSelfHostedEtcdBackend(client *clientv3.Client, pathPrefix, backupPath string) Backend {
+	return &etcdBackend{


I think this violates the SRP.

Why not create a new type, etcdBackupBackend, which inherits etcdBackend, calls etcdBackend.Read(), and then adds the bootEtcd asset to the results?

diegs · 2017-05-19T21:01:28Z

cmd/bootkube/recover.go

@@ -66,6 +70,22 @@ func runCmdRecover(cmd *cobra.Command, args []string) error {
 			return err
 		}
 		backend = recovery.NewEtcdBackend(etcdClient, recoverOpts.etcdPrefix)
+	case recoverOpts.etcdBackupFile != "":


It's not a big deal.

diegs · 2017-05-19T21:09:07Z

cmd/bootkube/recover.go

+		}()
+
+		bootkube.UserOutput("Waiting for etcd server to start...\n")
+		// TODO: better way to detect the startup...


I have a poll loop in pkg/util/etcdutil/migrate.go#createBootstrapEtcdService that we can factor out

diegs · 2017-05-19T21:12:22Z

pkg/recovery/etcd_template.go

+      --data-dir=/var/etcd/data && \
+      etcdctl \
+      --endpoints=http://localhost:32379 \
+      del /registry/ThirdPartyResourceData/etcd.coreos.com/clusters/kube-system/kube-etcd


This is slick, but we won't know if it fails. Might have to use the client instead.

there is no where in the code we can do this correctly from my understanding.

we have to delete the TPR after we recovered the boot etcd, but before API server starts to use it.

Maybe you can try to find a better way to do this?

My original understanding was that you would:
(1) Start temporary etcd from backup, creating a datadir and extracting the manifests
(2) The bootstrap etcd would talk to the datadir you created in (1)

Thus, I thought you'd be able to delete the TPR from the running etcd in (1). However, it looks like (2) also starts up directly from the backup? In that case then I agree that there's no place in code to remove the TPR, so I suppose this should work. I'd say let's go ahead and try it.

However, it looks like (2) also starts up directly from the backup?

Yes. It was not possible to reuse the same datadir since (2) needs Pod.IP, which is hard to know when we create datadir for (1). But as we start to use a well-known service IP for boot-etcd. It is possible now.

I still choose to use the backup file approach since we can save one flag for bootkube recovery. Or user needs to specify another etcd data dir which will persist after bootkube recover for the bootkube start command.

But it is up to you to decide which we should go.

Since we are recovering to a known asset-dir, it would be possible to put the etcd data dir in there (and use the service IP).

If the approach you are using appears to work right now, then I'd say let's stick with it to get something in and add a TODO/open an issue to explore the slightly cleaner approach.

xiang90 · 2017-05-24T16:35:07Z

@diegs This works now. Can you take a look? I am trying to extend bootkube-test-recovery to include self hosted case.

if you want to give this a try, you can do:

SELF_HOST_ETCD=true ./bootkube-up
wait the control plane to be up
ssh into c1 and copy /var/etcd/kube-etcd-0000/member/snap/db to local machine
recreate c1
copy db file back to c1 as /home/core/db
sudo ./bootkube recover --asset-dir=/home/core/recovered --etcd-backup-file=/home/core/db --kubeconfig=/home/core/kubeconfig
sudo /home/core/bootkube start --asset-dir=/home/core/recovered

xiang90 · 2017-05-25T00:01:21Z

@diegs I added a test script.

diegs

Looks great! Just some minor nits.

diegs · 2017-05-30T22:12:11Z

cmd/bootkube/recover.go

 	cmdRecover.Flags().StringVar(&recoverOpts.etcdPrefix, "etcd-prefix", "/registry", "Path prefix to Kubernetes cluster data in etcd.")
 	cmdRecover.Flags().StringVar(&recoverOpts.kubeConfigPath, "kubeconfig", "", "Path to kubeconfig for communicating with the cluster.")
+	cmdRecover.Flags().StringVar(&recoverOpts.podManifestPath, "pod-manifest-path", "/etc/kubernetes/manifests", "The location where the kubelet is configured to look for static pod manifests. (Only need to be set when recovering from a etcd backup file)")


Please add validation that this flag is non-empty iff etcd-backup-file is non-empty (yes, even though it has a default value).

diegs · 2017-05-30T22:22:12Z

cmd/bootkube/recover.go

+	case recoverOpts.etcdBackupFile != "":
+		bootkube.UserOutput("Attempting recovery using etcd backup file at %q...\n", recoverOpts.etcdBackupFile)
+
+		err := recovery.StartRecoveryEtcdForBackup(recoverOpts.podManifestPath, recoverOpts.etcdBackupFile)


nit: scope errs inside if statements (here and 2 places below)

xiang90 · 2017-05-31T17:45:06Z

@diegs Thanks for the review. All fixed.

diegs · 2017-05-31T17:48:05Z

@xiang90 awesome! One last thing: can you please update the README? Then I think we're good to merge.

xiang90 · 2017-05-31T17:49:50Z

@diegs

One last thing: can you please update the README? Then I think we're good to merge.

Sure. I will do it late today or tomorrow (need to attend coreos fest).

xiang90 · 2017-06-01T16:45:36Z

@diegs readme updated.

diegs

LGTM but please address my README comment before merging.

diegs · 2017-06-01T17:12:45Z

README.md

@@ -78,7 +78,13 @@ Recover from a running apiserver (i.e. if the scheduler pods are all down):
 bootkube recover --asset-dir=recovered --kubeconfig=/etc/kubernetes/kubeconfig
 ```

-For a complete recovery example please see the [hack/multi-node/bootkube-test-recovery](hack/multi-node/bootkube-test-recovery) script.
+Recover from an etcd backup when self hosted etcd is enabled:


Please update line 61 to add the new supported case, and add a section just like lines 72/78 with a command line example. Thanks!

i think the command line example is already added two line below, right? or you want something different? i will update line 61.

Whoops, sorry totally misread the diff. That SG, thanks!

xiang90 · 2017-06-01T21:57:28Z

@diegs

CI failed with:

    run.go:147: creating cluster: error calling NewMachines: <nil> ssh journalctl failed: dial tcp 104.196.152.95:22: getsockopt: connection refused

I do not think this is relevant with my PR. So I am going to merge this PR.

diegs · 2017-06-01T22:02:16Z

rktbot run tests

k8s-ci-robot requested a review from diegs May 18, 2017 23:34

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 18, 2017

diegs reviewed May 19, 2017

View reviewed changes

xiang90 mentioned this pull request May 19, 2017

*: move image assignment to asset #530

Merged

diegs suggested changes May 19, 2017

View reviewed changes

diegs reviewed May 19, 2017

View reviewed changes

xiang90 changed the title ~~WIP *: add backup file option to recover subcommand~~ *: add backup file option to recover subcommand May 24, 2017

xiang90 mentioned this pull request May 24, 2017

Finding a solution for etcd kubernetes/kubeadm#277

Closed

*: add backup file option to recover subcommand

c6f4556

diegs mentioned this pull request May 24, 2017

WIP: support self hosted etcd recovery #505

Closed

diegs approved these changes May 30, 2017

View reviewed changes

*: generate boot etcd manifest for self hosted etcd case

10c3768

diegs approved these changes Jun 1, 2017

View reviewed changes

README: add selfhosted reocvery doc

56bc9d5

xiang90 merged commit 775e3ea into kubernetes-retired:master Jun 2, 2017

xiang90 deleted the t branch June 2, 2017 01:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

*: add backup file option to recover subcommand #528

*: add backup file option to recover subcommand #528

xiang90 commented May 18, 2017

diegs left a comment

diegs May 19, 2017

diegs May 19, 2017

xiang90 May 19, 2017

diegs May 19, 2017

diegs May 19, 2017

diegs May 19, 2017

diegs May 19, 2017

diegs May 19, 2017

xiang90 May 19, 2017

diegs May 19, 2017

diegs May 19, 2017

diegs May 19, 2017

xiang90 May 19, 2017 •

edited

Loading

diegs May 19, 2017

xiang90 May 19, 2017

diegs May 22, 2017 •

edited

Loading

xiang90 commented May 24, 2017

xiang90 commented May 25, 2017

diegs left a comment

diegs May 30, 2017

diegs May 30, 2017

xiang90 commented May 31, 2017

diegs commented May 31, 2017

xiang90 commented May 31, 2017

xiang90 commented Jun 1, 2017

diegs left a comment

diegs Jun 1, 2017

xiang90 Jun 1, 2017

diegs Jun 1, 2017

xiang90 commented Jun 1, 2017

diegs commented Jun 1, 2017


		const assetPathRecoveryEtcd = "recovery-etcd.yaml"

		func StartRecoveryEtcd(p, backupPath string) error {

*: add backup file option to recover subcommand #528

*: add backup file option to recover subcommand #528

Conversation

xiang90 commented May 18, 2017

diegs left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

xiang90 May 19, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

diegs May 22, 2017 • edited Loading

Choose a reason for hiding this comment

xiang90 commented May 24, 2017

xiang90 commented May 25, 2017

diegs left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

xiang90 commented May 31, 2017

diegs commented May 31, 2017

xiang90 commented May 31, 2017

xiang90 commented Jun 1, 2017

diegs left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

xiang90 commented Jun 1, 2017

diegs commented Jun 1, 2017

xiang90 May 19, 2017 •

edited

Loading

diegs May 22, 2017 •

edited

Loading