VAULT-11829: Add cluster status handler #18351
Conversation
Looks great! Just some minor comments.
@@ -189,3 +191,110 @@ func (h *hcpLinkMetaHandler) ListAuths(ctx context.Context, req *meta.ListAuthsRequest) (*meta.ListAuthsResponse, error) {
        Auths: auths,
    }, nil
}

func (h *hcpLinkMetaHandler) GetClusterStatus(ctx context.Context, req *meta.GetClusterStatusRequest) (*meta.GetClusterStatusResponse, error) {
    if h.wrappedCore.HAState() != consts.Active {
Wondering if we would need to guard accessing hcpLinkMetaHandler members with a lock? What do you think?
That's an interesting thought. It looks like we're mainly accessing wrappedCore in a similar fashion to how we're using WrappedCoreNodeStatus in the node status logic. I'm not sure what contention might arise, but I'm interested in what potential issues you might have noticed.
I guess I was wrong about accessing wrappedCore. I was looking at the start and stop functions and saw how we needed to use a lock there. I understand that, and here it seems safe to access wrappedCore as is. We are just reading, so it's not really a big concern, I guess.
I think we need to think about using HAState(). This function accesses some core members, such as c.perfStandby, without a guard, and using it here means that we are not going through our regular request handling path, during which a state lock is grabbed. I checked some usages of that function, and it seems that some sort of lock was held prior to calling it. So, I am wondering if we should introduce a function that uses a guard? What do you think?
Oh, you're totally right! Concurrent access to anything in Core could cause issues. I think this is also true for the newly added HAEnabled, GetRaftConfiguration, and GetRaftAutopilotState; see lines 3590 to 3611 in f9b4cd7:
func (c *Core) HAEnabled() bool {
    return c.ha != nil && c.ha.HAEnabled()
}

func (c *Core) GetRaftConfiguration(ctx context.Context) (*raft.RaftConfigurationResponse, error) {
    raftBackend := c.getRaftBackend()
    if raftBackend == nil {
        return nil, nil
    }

    return raftBackend.GetConfiguration(ctx)
}

func (c *Core) GetRaftAutopilotState(ctx context.Context) (*raft.AutopilotState, error) {
    raftBackend := c.getRaftBackend()
    if raftBackend == nil {
        return nil, nil
    }

    return raftBackend.GetAutopilotServerState(ctx)
}
Thank you for calling that out!
A bit of an update: I did a bit of digging regarding the methods added or used for the GetClusterStatus handler to determine what our locking strategy should be. The TL;DR is that I think we're mostly fine, but I'll summarize here:

* HAEnabled: c.ha is only set during initialization of a new core, so this simply exposes the underlying call to c.ha.HAEnabled
* HAState: As we discussed, this should have a lock. I will add a function HAStateWithLock that grabs c.stateLock.RLock (see the sketch after this list)
* GetHAPeerNodesCached: This uses c.clusterPeerClusterAddrsCache, which is a *cache.Cache that has locking built in. For example, calling c.clusterPeerClusterAddrsCache.Items() uses an underlying lock built into the cache
* GetRaftConfiguration: This uses c.ha mainly to cast to a RaftBackend, which, as mentioned above, should be safe. The underlying call to RaftBackend.GetConfiguration is then guarded by RaftBackend.l
* GetRaftAutopilotState: This uses c.ha mainly to cast to a RaftBackend, which, as mentioned above, should be safe. The underlying call to RaftBackend.GetAutopilotServerState is then guarded by RaftBackend.l
* StorageType: The storage type shouldn't change
* ClusterID: The cluster ID won't be changing, and we call it elsewhere without locking, so it should be safe
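
For reference, a minimal, self-contained sketch of the guarded accessor I have in mind; the Core fields and the body of HAState below are simplified stand-ins for illustration, not the actual Vault implementation:

package main

import (
    "fmt"
    "sync"
)

// HAState is a simplified stand-in for consts.HAState.
type HAState int

const (
    Standby HAState = iota
    PerfStandby
    Active
)

// Core is a trimmed-down stand-in holding only the fields relevant here.
type Core struct {
    stateLock   sync.RWMutex
    standby     bool
    perfStandby bool
}

// HAState mirrors the unguarded accessor discussed above: it reads HA-related
// fields directly and takes no lock of its own.
func (c *Core) HAState() HAState {
    switch {
    case c.perfStandby:
        return PerfStandby
    case c.standby:
        return Standby
    default:
        return Active
    }
}

// HAStateWithLock grabs the state read lock before reading the HA fields, so
// it is safe to call outside the regular request handling path.
func (c *Core) HAStateWithLock() HAState {
    c.stateLock.RLock()
    defer c.stateLock.RUnlock()

    return c.HAState()
}

func main() {
    c := &Core{}
    fmt.Println(c.HAStateWithLock() == Active) // true for a fresh, active core
}

The point is only that the read lock is taken before touching HA-related fields, which is what calling HAState outside the regular request path otherwise skips.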
if voterCount == 0 {
    quorumWarnings = append(quorumWarnings, "Only one server node found. Vault is not running in high availability mode.")
} else if voterCount%2 == 0 {
    quorumWarnings = append(quorumWarnings, "Vault should have access to an odd number of voter nodes.")
Do we also need to add the prefix "Warning:" here?
I think that the concept of the warning prefix was from an initial implementation where a single string was built up. Now that they are all surfaced as warnings, I will remove the prefix from the "Very large cluster detected" scenario.
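
For context, a minimal sketch of the list-of-warnings approach, assuming a hypothetical quorumWarnings helper shaped after the snippet quoted above:

package main

import "fmt"

// quorumWarnings is a hypothetical helper mirroring the quoted snippet: each
// condition appends a plain message, with no "Warning:" prefix, since the
// messages are already surfaced as warnings.
func quorumWarnings(voterCount int) []string {
    var warnings []string
    if voterCount == 0 {
        warnings = append(warnings, "Only one server node found. Vault is not running in high availability mode.")
    } else if voterCount%2 == 0 {
        warnings = append(warnings, "Vault should have access to an odd number of voter nodes.")
    }
    return warnings
}

func main() {
    fmt.Println(quorumWarnings(4)) // [Vault should have access to an odd number of voter nodes.]
}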
func (h *hcpLinkMetaHandler) GetClusterStatus(ctx context.Context, req *meta.GetClusterStatusRequest) (*meta.GetClusterStatusResponse, error) {
    if h.wrappedCore.HAState() != consts.Active {
        return nil, fmt.Errorf("node not active")
Does this error mean the node is not active? Also, I was wondering if we could use error wrapping here so I can treat them accordingly in the service.
Great call on the error wrapping. I added that to all the cases where it made sense. In terms of this particular error, yes, it means that the node is not the active node and cannot handle the request. We can use whatever error message would be the most meaningful. What do you think?
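
For illustration, a minimal sketch of the wrapping pattern, assuming a sentinel error; the name ErrNodeNotActive and the messages are hypothetical, not necessarily what the PR ended up using:

package main

import (
    "errors"
    "fmt"
)

// ErrNodeNotActive is a hypothetical sentinel; the actual error value and
// message used in the PR may differ.
var ErrNodeNotActive = errors.New("node is not the active node")

func getClusterStatus(isActive bool) error {
    if !isActive {
        // Wrap the sentinel so callers can detect this case with errors.Is
        // while still seeing the added context.
        return fmt.Errorf("failed to get cluster status: %w", ErrNodeNotActive)
    }
    return nil
}

func main() {
    err := getClusterStatus(false)
    if errors.Is(err, ErrNodeNotActive) {
        // The service can treat "not active" differently from other errors,
        // e.g. by redirecting the request to the active node.
        fmt.Println("standby node:", err)
    }
}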
Looks good to me!
* go get link proto @vault-11829-meta-get-cluster-status
* add HA status
* add HAEnabled method
* add raft config
* allocate HA nodes based on actual count
* add raft autopilot status
* add raft quorum warnings
* add ClusterID method
* add StorageType
* add ClusterID
* update github.com/hashicorp/vault/vault/hcp_link/proto
* add changelog entry
* fix raft config panic
* remove "Warning" quorum message prefix
* add error wrapping
* add Core.HAStateWithLock method
* reduce quorum warnings to single string
* fix HCP_API_HOST test env var check
* Revert "fix HCP_API_HOST test env var check" (this reverts commit 97c73c4)
The associated PR #18316 expands the meta proto by introducing the GetClusterStatus RPC. This PR implements it.

The GetClusterStatusResponse includes the following fields:

* ClusterID
* HAStatus
* RaftStatus
* StorageType

A few new Core methods have been added for the purpose of de-coupling via WrappedCoreMeta (previously called WrappedCoreListNamespacesMounts); a rough sketch of this de-coupling is included at the end of this description:

* ClusterID loads and returns the atomic Core.clusterID value
* HAEnabled specifies whether high-availability mode is enabled
* GetRaftConfiguration exposes the Raft configuration, similar to what is provided via /sys/storage/raft/configuration
* GetRaftAutopilotState exposes the Raft Autopilot state, similar to what is provided via /sys/storage/raft/autopilot/state

TODO:

* go get github.com/hashicorp/vault/vault/hcp_link/proto once VAULT-11829: Add GetClusterStatus rpc to meta capability #18316 is merged, since the current version in go.mod is pointing to @vault-11829-meta-get-cluster-status
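
Not part of the change itself, but a rough, self-contained sketch of the de-coupling idea described above; all names below (ClusterStatus, WrappedCoreMeta, fakeCore) are simplified, hypothetical stand-ins rather than the actual proto types or Vault interfaces:

package main

import (
    "context"
    "fmt"
)

// ClusterStatus is a simplified stand-in for the proto response, holding a
// subset of the fields listed in the description.
type ClusterStatus struct {
    ClusterID   string
    HAEnabled   bool
    StorageType string
}

// WrappedCoreMeta is a hypothetical, trimmed-down version of the interface
// mentioned above: the narrow surface the handler actually needs.
type WrappedCoreMeta interface {
    ClusterID() string
    HAEnabled() bool
    StorageType() string
}

// metaHandler depends only on the narrow interface, not on the core itself.
type metaHandler struct {
    wrappedCore WrappedCoreMeta
}

func (h *metaHandler) GetClusterStatus(ctx context.Context) (*ClusterStatus, error) {
    return &ClusterStatus{
        ClusterID:   h.wrappedCore.ClusterID(),
        HAEnabled:   h.wrappedCore.HAEnabled(),
        StorageType: h.wrappedCore.StorageType(),
    }, nil
}

// fakeCore is a trivial test double showing how the handler can be exercised
// without a real core.
type fakeCore struct{}

func (fakeCore) ClusterID() string   { return "c1a2b3" }
func (fakeCore) HAEnabled() bool     { return true }
func (fakeCore) StorageType() string { return "raft" }

func main() {
    h := &metaHandler{wrappedCore: fakeCore{}}
    status, _ := h.GetClusterStatus(context.Background())
    fmt.Printf("%+v\n", status)
}

Depending on a narrow interface like this keeps the handler testable with a trivial fake and avoids pulling the entire core API into the HCP link package.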