VAULT-11829: Add cluster status handler #18351
Conversation
Looks great! Just some minor comments.
@@ -189,3 +191,110 @@ func (h *hcpLinkMetaHandler) ListAuths(ctx context.Context, req *meta.ListAuthsRequest) (*meta.ListAuthsResponse, error) {
        Auths: auths,
    }, nil
}

func (h *hcpLinkMetaHandler) GetClusterStatus(ctx context.Context, req *meta.GetClusterStatusRequest) (*meta.GetClusterStatusResponse, error) {
    if h.wrappedCore.HAState() != consts.Active {
Wondering if we would need to guard accessing hcpLinkMetaHandler members with a lock? What do you think?
That's an interesting thought. It looks like we're mainly accessing wrappedCore in a similar fashion to how we're using WrappedCoreNodeStatus in the node status logic. I'm not sure what contention might arise, but I'm interested in what potential issues you might have noticed.
I guess I was wrong about accessing wrappedCore. I was looking at the start and stop functions and saw how we needed to use a lock there. I understand that, and here it seems safe to access wrappedCore as is. We are just reading, so it's not really a big concern, I guess.
I think we need to think about using HAState(). This function accesses some core members, such as c.perfStandby, without a guard, and using it here means that we are not going through our regular request handling path, during which a state lock is grabbed. I checked some usages of that function, and it seems that some sort of lock was held prior to calling it. So, I am wondering if we should introduce a function that uses a guard? What do you think?
Oh, you're totally right! Concurrent access to anything in Core could cause issues. I think this is also true for the newly added HAEnabled, GetRaftConfiguration, and GetRaftAutopilotState; see lines 3590 to 3611 in f9b4cd7:
func (c *Core) HAEnabled() bool {
    return c.ha != nil && c.ha.HAEnabled()
}

func (c *Core) GetRaftConfiguration(ctx context.Context) (*raft.RaftConfigurationResponse, error) {
    raftBackend := c.getRaftBackend()
    if raftBackend == nil {
        return nil, nil
    }

    return raftBackend.GetConfiguration(ctx)
}

func (c *Core) GetRaftAutopilotState(ctx context.Context) (*raft.AutopilotState, error) {
    raftBackend := c.getRaftBackend()
    if raftBackend == nil {
        return nil, nil
    }

    return raftBackend.GetAutopilotServerState(ctx)
}
Thank you for calling that out!
A bit of an update: I did a bit of digging regarding the methods added or used for the GetClusterStatus handler to determine what our locking strategy should be. The TL;DR is that I think we're mostly fine, but I'll summarize here:

* HAEnabled: c.ha is only set during initialization of a new core, so this simply exposes the underlying call to c.ha.HAEnabled
* HAState: As we discussed, this should have a lock. I will add a function HAStateWithLock that grabs c.stateLock.RLock (see the sketch after this list)
* GetHAPeerNodesCached: This uses c.clusterPeerClusterAddrsCache, which is a *cache.Cache that has locking built in. For example, calling c.clusterPeerClusterAddrsCache.Items() uses an underlying lock built into the cache
* GetRaftConfiguration: This uses c.ha mainly to cast to a RaftBackend, which, as mentioned above, should be safe. The underlying call to RaftBackend.GetConfiguration is then guarded by RaftBackend.l
* GetRaftAutopilotState: This uses c.ha mainly to cast to a RaftBackend, which, as mentioned above, should be safe. The underlying call to RaftBackend.GetAutopilotServerState is then guarded by RaftBackend.l
* StorageType: The storage type shouldn't change
* ClusterID: The cluster ID won't be changing, and we call it elsewhere without locking, so it should be safe
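
For reference, a minimal, self-contained sketch of the guarded accessor I have in mind; the Core fields and the body of HAState below are simplified stand-ins for illustration, not the actual Vault implementation:

package main

import (
    "fmt"
    "sync"
)

// HAState is a simplified stand-in for consts.HAState.
type HAState int

const (
    Standby HAState = iota
    PerfStandby
    Active
)

// Core is a trimmed-down stand-in holding only the fields relevant here.
type Core struct {
    stateLock   sync.RWMutex
    standby     bool
    perfStandby bool
}

// HAState mirrors the unguarded accessor discussed above: it reads HA-related
// fields directly and takes no lock of its own.
func (c *Core) HAState() HAState {
    switch {
    case c.perfStandby:
        return PerfStandby
    case c.standby:
        return Standby
    default:
        return Active
    }
}

// HAStateWithLock grabs the state read lock before reading the HA fields, so
// it is safe to call outside the regular request handling path.
func (c *Core) HAStateWithLock() HAState {
    c.stateLock.RLock()
    defer c.stateLock.RUnlock()

    return c.HAState()
}

func main() {
    c := &Core{}
    fmt.Println(c.HAStateWithLock() == Active) // true for a fresh, active core
}

The point is only that the read lock is taken before touching HA-related fields, which is what calling HAState outside the regular request path otherwise skips.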
if voterCount == 0 {
    quorumWarnings = append(quorumWarnings, "Only one server node found. Vault is not running in high availability mode.")
} else if voterCount%2 == 0 {
    quorumWarnings = append(quorumWarnings, "Vault should have access to an odd number of voter nodes.")
Do we also need to add the prefix "Warning:" here?
I think that the concept of the warning prefix was from an initial implementation where a single string was built up. Now that they are all surfaced as warnings, I will remove the prefix from the "Very large cluster detected" scenario.
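
For context, a minimal sketch of the list-of-warnings approach, assuming a hypothetical quorumWarnings helper shaped after the snippet quoted above:

package main

import "fmt"

// quorumWarnings is a hypothetical helper mirroring the quoted snippet: each
// condition appends a plain message, with no "Warning:" prefix, since the
// messages are already surfaced as warnings.
func quorumWarnings(voterCount int) []string {
    var warnings []string
    if voterCount == 0 {
        warnings = append(warnings, "Only one server node found. Vault is not running in high availability mode.")
    } else if voterCount%2 == 0 {
        warnings = append(warnings, "Vault should have access to an odd number of voter nodes.")
    }
    return warnings
}

func main() {
    fmt.Println(quorumWarnings(4)) // [Vault should have access to an odd number of voter nodes.]
}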
func (h *hcpLinkMetaHandler) GetClusterStatus(ctx context.Context, req *meta.GetClusterStatusRequest) (*meta.GetClusterStatusResponse, error) {
    if h.wrappedCore.HAState() != consts.Active {
        return nil, fmt.Errorf("node not active")
Does this error mean the node is not active? Also, I was wondering if we could use error wrapping here so I can treat them accordingly in the service.
Great call on the error wrapping. I added that to all the cases where it made sense. In terms of this particular error, yes, it means that the node is not the active node and cannot handle the request. We can use whatever error message would be the most meaningful. What do you think?
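
For illustration, a minimal sketch of the wrapping pattern, assuming a sentinel error; the name ErrNodeNotActive and the messages are hypothetical, not necessarily what the PR ended up using:

package main

import (
    "errors"
    "fmt"
)

// ErrNodeNotActive is a hypothetical sentinel; the actual error value and
// message used in the PR may differ.
var ErrNodeNotActive = errors.New("node is not the active node")

func getClusterStatus(isActive bool) error {
    if !isActive {
        // Wrap the sentinel so callers can detect this case with errors.Is
        // while still seeing the added context.
        return fmt.Errorf("failed to get cluster status: %w", ErrNodeNotActive)
    }
    return nil
}

func main() {
    err := getClusterStatus(false)
    if errors.Is(err, ErrNodeNotActive) {
        // The service can treat "not active" differently from other errors,
        // e.g. by redirecting the request to the active node.
        fmt.Println("standby node:", err)
    }
}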
Looks good to me!
* go get link proto @vault-11829-meta-get-cluster-status
* add HA status
* add HAEnabled method
* add raft config
* allocate HA nodes based on actual count
* add raft autopilot status
* add raft quorum warnings
* add ClusterID method
* add StorageType
* add ClusterID
* update github.com/hashicorp/vault/vault/hcp_link/proto
* add changelog entry
* fix raft config panic
* remove "Warning" quorum message prefix
* add error wrapping
* add Core.HAStateWithLock method
* reduce quorum warnings to single string
* fix HCP_API_HOST test env var check
* Revert "fix HCP_API_HOST test env var check" (this reverts commit 97c73c4)
The associated PR #18316 expands the meta proto by introducing the GetClusterStatus RPC. This PR implements it.

The GetClusterStatusResponse includes the following fields:

* ClusterID
* HAStatus
* RaftStatus
* StorageType

A few new Core methods have been added for the purpose of de-coupling via WrappedCoreMeta (previously called WrappedCoreListNamespacesMounts); a rough sketch of this de-coupling is included at the end of this description:

* ClusterID loads and returns the atomic Core.clusterID value
* HAEnabled specifies whether high-availability mode is enabled
* GetRaftConfiguration exposes the Raft configuration, similar to what is provided via /sys/storage/raft/configuration
* GetRaftAutopilotState exposes the Raft Autopilot state, similar to what is provided via /sys/storage/raft/autopilot/state

TODO:

* go get github.com/hashicorp/vault/vault/hcp_link/proto once VAULT-11829: Add GetClusterStatus rpc to meta capability #18316 is merged, since the current version in go.mod is pointing to @vault-11829-meta-get-cluster-status
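
Not part of the change itself, but a rough, self-contained sketch of the de-coupling idea described above; all names below (ClusterStatus, WrappedCoreMeta, fakeCore) are simplified, hypothetical stand-ins rather than the actual proto types or Vault interfaces:

package main

import (
    "context"
    "fmt"
)

// ClusterStatus is a simplified stand-in for the proto response, holding a
// subset of the fields listed in the description.
type ClusterStatus struct {
    ClusterID   string
    HAEnabled   bool
    StorageType string
}

// WrappedCoreMeta is a hypothetical, trimmed-down version of the interface
// mentioned above: the narrow surface the handler actually needs.
type WrappedCoreMeta interface {
    ClusterID() string
    HAEnabled() bool
    StorageType() string
}

// metaHandler depends only on the narrow interface, not on the core itself.
type metaHandler struct {
    wrappedCore WrappedCoreMeta
}

func (h *metaHandler) GetClusterStatus(ctx context.Context) (*ClusterStatus, error) {
    return &ClusterStatus{
        ClusterID:   h.wrappedCore.ClusterID(),
        HAEnabled:   h.wrappedCore.HAEnabled(),
        StorageType: h.wrappedCore.StorageType(),
    }, nil
}

// fakeCore is a trivial test double showing how the handler can be exercised
// without a real core.
type fakeCore struct{}

func (fakeCore) ClusterID() string   { return "c1a2b3" }
func (fakeCore) HAEnabled() bool     { return true }
func (fakeCore) StorageType() string { return "raft" }

func main() {
    h := &metaHandler{wrappedCore: fakeCore{}}
    status, _ := h.GetClusterStatus(context.Background())
    fmt.Printf("%+v\n", status)
}

Depending on a narrow interface like this keeps the handler testable with a trivial fake and avoids pulling the entire core API into the HCP link package.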