VAULT-11829: Add cluster status handler #18351

Merged
ccapurso merged 19 commits into main from vault-11829-cluster-status-handler on Jan 6, 2023

Conversation

Contributor

@ccapurso ccapurso commented Dec 13, 2022

The associated PR #18316 expands the meta proto by introducing the GetClusterStatus RPC. This PR implements it.

The GetClusterStatusResponse includes the following fields (a rough sketch of the response shape follows the list):

  • ClusterID
  • HAStatus
  • RaftStatus
  • StorageType
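
For illustration, here is a minimal sketch of that response shape mirrored as a plain Go struct. The field names come from the list above; the nested HA/Raft shapes and the example values are assumptions, and the real message is generated from the meta proto introduced in #18316.

package main

import "fmt"

// ClusterStatus is a hypothetical mirror of the GetClusterStatusResponse
// fields listed above; the real type is generated from the hcp_link meta
// proto, and the HA/Raft fields here are simplified placeholders.
type ClusterStatus struct {
	ClusterID   string
	HAStatus    []string // e.g. one entry per HA node
	RaftStatus  string   // e.g. summarized Raft configuration and Autopilot state
	StorageType string
}

func main() {
	// Example values are purely illustrative.
	status := ClusterStatus{
		ClusterID:   "3e5e8a54-example",
		HAStatus:    []string{"node-a (active)", "node-b (standby)"},
		RaftStatus:  "3 voters, autopilot healthy",
		StorageType: "raft",
	}
	fmt.Printf("%+v\n", status)
}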

A few new Core methods have been added for the purpose of decoupling via WrappedCoreMeta (previously called WrappedCoreListNamespacesMounts); a sketch of the resulting interface follows the list:

  • ClusterID loads and returns the atomic Core.clusterID value
  • HAEnabled specifies whether high-availability mode is enabled
  • GetRaftConfiguration exposes the Raft configuration, similar to what is provided via /sys/storage/raft/configuration
  • GetRaftAutopilotState exposes the Raft Autopilot state, similar to what is provided via /sys/storage/raft/autopilot/state
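
For reference, a rough sketch of how these methods might appear on the WrappedCoreMeta interface. The method names come from this PR's description and the signatures mirror the Core methods quoted later in the thread; the package name and exact shape are assumptions, and the real interface lives in vault/hcp_link/capabilities/meta.

// Package name assumed for illustration; the real definition lives in
// vault/hcp_link/capabilities/meta.
package meta

import (
	"context"

	"github.com/hashicorp/vault/physical/raft"
)

// WrappedCoreMeta is a hypothetical excerpt of the interface (previously
// WrappedCoreListNamespacesMounts) showing only the methods added here.
type WrappedCoreMeta interface {
	// ClusterID loads and returns the atomic Core.clusterID value.
	ClusterID() string

	// HAEnabled reports whether high-availability mode is enabled.
	HAEnabled() bool

	// GetRaftConfiguration exposes the Raft configuration, similar to
	// /sys/storage/raft/configuration.
	GetRaftConfiguration(ctx context.Context) (*raft.RaftConfigurationResponse, error)

	// GetRaftAutopilotState exposes the Raft Autopilot state, similar to
	// /sys/storage/raft/autopilot/state.
	GetRaftAutopilotState(ctx context.Context) (*raft.AutopilotState, error)
}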

@ccapurso ccapurso added this to the 1.13.0-rc1 milestone Dec 16, 2022
@ccapurso ccapurso changed the title Vault 11829 cluster status handler VAULT-11829: Add cluster status handler Jan 4, 2023
Contributor

@hghaf099 hghaf099 left a comment


Looks great! Just some minor comments.

vault/hcp_link/capabilities/meta/meta.go (outdated; review thread resolved)
vault/hcp_link/capabilities/meta/meta.go (review thread resolved)
@@ -189,3 +191,110 @@ func (h *hcpLinkMetaHandler) ListAuths(ctx context.Context, req *meta.ListAuthsR
		Auths: auths,
	}, nil
}

func (h *hcpLinkMetaHandler) GetClusterStatus(ctx context.Context, req *meta.GetClusterStatusRequest) (*meta.GetClusterStatusResponse, error) {
	if h.wrappedCore.HAState() != consts.Active {
Contributor

I'm wondering if we need to guard access to hcpLinkMetaHandler members with a lock. What do you think?

Contributor Author

That's an interesting thought. It looks like we're mainly accessing wrappedCore in a similar fashion to how we use WrappedCoreNodeStatus in the node status logic. I'm not sure what contention might arise, but I'm interested in what potential issues you might have noticed.

Contributor

I guess I was wrong about accessing wrappedCore. I was looking at the start and stop functions and saw how we needed to use a lock there. Here it seems safe to access wrappedCore as is; we are just reading, so it's not really a big concern.

Contributor

@hghaf099 hghaf099 Jan 5, 2023

I think we need to reconsider using HAState(). This function accesses some Core members, such as c.perfStandby, without a guard, and using it here means we are not going through our regular request-handling path, during which a state lock is grabbed. I checked some usages of that function, and it seems that some sort of lock was held prior to calling it. So I am wondering if we should introduce a function that takes the lock itself. What do you think?

Contributor Author

Oh, you're totally right! Concurrent access to anything in Core could cause issues. I think this is also true for the newly added HAEnabled, GetRaftConfiguration, and GetRaftAutopilotState; see vault/core.go (lines 3590 to 3611 at f9b4cd7):

func (c *Core) HAEnabled() bool {
	return c.ha != nil && c.ha.HAEnabled()
}

func (c *Core) GetRaftConfiguration(ctx context.Context) (*raft.RaftConfigurationResponse, error) {
	raftBackend := c.getRaftBackend()
	if raftBackend == nil {
		return nil, nil
	}

	return raftBackend.GetConfiguration(ctx)
}

func (c *Core) GetRaftAutopilotState(ctx context.Context) (*raft.AutopilotState, error) {
	raftBackend := c.getRaftBackend()
	if raftBackend == nil {
		return nil, nil
	}

	return raftBackend.GetAutopilotServerState(ctx)
}

Thank you for calling that out!

Contributor Author

A bit of an update: I did some digging into the methods added or used for the GetClusterStatus handler to determine what our locking strategy should be. The TL;DR is that I think we're mostly fine, but I'll summarize here:

  • HAEnabled: c.ha is only set during initialization of a new core, so this just exposes the underlying call to c.ha.HAEnabled
  • HAState: As we discussed, this should hold a lock, so I will add a function HAStateWithLock that grabs c.stateLock.RLock (see the sketch after this list)
  • GetHAPeerNodesCached: This uses c.clusterPeerClusterAddrsCache, which is a *cache.Cache with locking built in. For example, calling c.clusterPeerClusterAddrsCache.Items() uses the lock built into the cache
  • GetRaftConfiguration: This uses c.ha mainly to cast to a RaftBackend, which, as mentioned above, should be safe. The underlying call to RaftBackend.GetConfiguration is guarded by RaftBackend.l
  • GetRaftAutopilotState: This uses c.ha mainly to cast to a RaftBackend, which, as mentioned above, should be safe. The underlying call to RaftBackend.GetAutopilotServerState is guarded by RaftBackend.l
  • StorageType: The storage type shouldn't change
  • ClusterID: The cluster ID won't change, and we already call this elsewhere without locking, so it should be safe
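
For context, a minimal sketch of what the HAStateWithLock addition could look like; the body is an assumption based on the discussion above, and the actual method landed in a later commit on this PR.

// HAStateWithLock is a hypothetical sketch: it grabs a read lock on
// c.stateLock before delegating to HAState, so callers outside the regular
// request-handling path don't read HA-related Core members unguarded.
func (c *Core) HAStateWithLock() consts.HAState {
	c.stateLock.RLock()
	defer c.stateLock.RUnlock()

	return c.HAState()
}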

vault/hcp_link/capabilities/meta/meta.go (review thread resolved)
@ccapurso ccapurso marked this pull request as ready for review January 5, 2023 18:42
Comment on lines 258 to 261

	if voterCount == 0 {
		quorumWarnings = append(quorumWarnings, "Only one server node found. Vault is not running in high availability mode.")
	} else if voterCount%2 == 0 {
		quorumWarnings = append(quorumWarnings, "Vault should have access to an odd number of voter nodes.")
Contributor

Do we also need to add the prefix "Warning:" here?

Contributor Author

I think the concept of the warning prefix came from an initial implementation where a single string was built up. Now that they are all surfaced as warnings, I will remove the prefix from the "Very large cluster detected" scenario.
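
For illustration, a rough sketch of how the warnings might be collected as a plain list, with the large-cluster message carrying no prefix. Only the two checks quoted in the diff above come from the PR; the helper name and the voter-count threshold for the large-cluster case are assumptions.

// quorumWarningsFor is a hypothetical helper; only the first two checks are
// taken from the diff above, and the >7 threshold is an assumed placeholder
// for the "Very large cluster detected" scenario.
func quorumWarningsFor(voterCount int) []string {
	var quorumWarnings []string

	if voterCount == 0 {
		quorumWarnings = append(quorumWarnings, "Only one server node found. Vault is not running in high availability mode.")
	} else if voterCount%2 == 0 {
		quorumWarnings = append(quorumWarnings, "Vault should have access to an odd number of voter nodes.")
	}

	if voterCount > 7 {
		// Surfaced as just another warning, without a "Warning:" prefix.
		quorumWarnings = append(quorumWarnings, "Very large cluster detected.")
	}

	return quorumWarnings
}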


func (h *hcpLinkMetaHandler) GetClusterStatus(ctx context.Context, req *meta.GetClusterStatusRequest) (*meta.GetClusterStatusResponse, error) {
	if h.wrappedCore.HAState() != consts.Active {
		return nil, fmt.Errorf("node not active")
Contributor

Does this error mean the node is not active? Also, I was wondering if we could use error wrapping here so these errors can be handled accordingly in the service.

Contributor Author

Great call on the error wrapping. I added that to all the cases where it made sense. In terms of this particular error, yes, it means that the node is not the active node and cannot handle the request. We can use whatever error message would be the most meaningful. What do you think?
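
As an illustration of the wrapping approach, a sketch along these lines; the sentinel name is hypothetical, not necessarily what the PR uses, and it assumes the standard errors and fmt packages are imported in meta.go.

// ErrNodeNotActive is a hypothetical sentinel; wrapping it with %w lets the
// service side match it via errors.Is instead of parsing message strings.
var ErrNodeNotActive = errors.New("node not active")

func (h *hcpLinkMetaHandler) GetClusterStatus(ctx context.Context, req *meta.GetClusterStatusRequest) (*meta.GetClusterStatusResponse, error) {
	if h.wrappedCore.HAState() != consts.Active {
		return nil, fmt.Errorf("failed to get cluster status: %w", ErrNodeNotActive)
	}
	// ... build and return the response ...
	return &meta.GetClusterStatusResponse{}, nil
}

On the service side, errors.Is(err, ErrNodeNotActive) can then distinguish this case from other failures.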

Contributor

@hghaf099 hghaf099 left a comment

Looks good to me!

@ccapurso ccapurso merged commit 9247459 into main Jan 6, 2023
@ccapurso ccapurso deleted the vault-11829-cluster-status-handler branch January 6, 2023 22:07
AnPucel pushed a commit that referenced this pull request Jan 14, 2023
* go get link proto @vault-11829-meta-get-cluster-status

* add HA status

* add HAEnabled method

* add raft config

* allocate HA nodes based on actual count

* add raft autopilot status

* add raft quorum warnings

* add ClusterID method

* add StorageType

* add ClusterID

* update github.com/hashicorp/vault/vault/hcp_link/proto

* add changelog entry

* fix raft config panic

* remove "Warning" quorum message prefix

* add error wrapping

* add Core.HAStateWithLock method

* reduce quorum warnings to single string

* fix HCP_API_HOST test env var check

* Revert "fix HCP_API_HOST test env var check"

This reverts commit 97c73c4.
AnPucel pushed a commit that referenced this pull request Feb 3, 2023 (same commit list as above)