feat: adjust pool dynamically based on demand #3855

Open
wants to merge 7 commits into main

Conversation

galargh (Contributor) commented Apr 20, 2024

Description

This PR extends the pool lambda with:

  • ability to adjust the pool based on the number of queued jobs
  • support for Repo level runners
  • support for multiple runner owners

Why?

Dynamic Pool

Sometimes a runner startup fails, for example due to a connectivity issue with GitHub servers. When using self-hosted runners in ephemeral mode, such failures can lead to starvation. Adding support to the pool adjust lambda for dynamic runner scaling based on the number of currently queued jobs makes it possible to counteract this issue.

Repo Level Runners

Up until now, the pool adjustment supported only Org-level runners. Adding support for Repo-level runners means that anyone using self-hosted runners in this manner will now be able to use the pool lambda as intended.

Multiple Runner Owners

By accepting multiple runner owners as input to the pool lambda, the overall number of lambdas created for the setup can be reduced. This might be desirable when someone is running Repo-level runners for many repositories.

How?

Dynamic Pool

The use of this feature can be controlled via the pool_size input of the pool lambda. Setting it to -1 will enable the dynamic pool. When the dynamic pool is enabled, the function determines the desired pool size by counting the number of currently queued jobs in the context of the runner owner.
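As a rough illustration of what counting the currently queued jobs could involve (a sketch only, not the exact code in this PR; the function name and structure are assumptions), a per-repository count might list queued workflow runs and their jobs via the GitHub API. For an org-level owner, the same count would be repeated across the repositories the app can access, which is the rate-limit concern raised later in the review.

import { Octokit } from '@octokit/rest';

// Hypothetical sketch only: count queued jobs for one repository by listing
// queued workflow runs and then their jobs. The PR's implementation may differ.
async function countQueuedJobs(octokit: Octokit, owner: string, repo: string): Promise<number> {
  const queuedRuns = await octokit.paginate(octokit.rest.actions.listWorkflowRunsForRepo, {
    owner,
    repo,
    status: 'queued',
    per_page: 100,
  });

  let queuedJobs = 0;
  for (const run of queuedRuns) {
    const jobs = await octokit.paginate(octokit.rest.actions.listJobsForWorkflowRun, {
      owner,
      repo,
      run_id: run.id,
      per_page: 100,
    });
    queuedJobs += jobs.filter((job) => job.status === 'queued').length;
  }
  return queuedJobs;
}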

Repo Level Runners

From now on, the pool lambda will accept a runner_owner input in the form of OWNER/REPO. When the runner owner is a repository, the function will create repository-level runners instead of org-level ones. It will also check for idle runners in the repository context rather than in the context of the organization.
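To illustrate how the owner string could be classified (a sketch with assumed names, not necessarily the PR's actual code):

// Hypothetical sketch: a plain owner is treated as an org, OWNER/REPO as a repo.
type RunnerOwner =
  | { type: 'Org'; organization: string }
  | { type: 'Repo'; organization: string; repository: string };

function parseRunnerOwner(owner: string): RunnerOwner {
  const [organization, repository] = owner.split('/');
  return repository
    ? { type: 'Repo', organization, repository }
    : { type: 'Org', organization };
}

// parseRunnerOwner('libp2p')         -> { type: 'Org', organization: 'libp2p' }
// parseRunnerOwner('my-org/my-repo') -> { type: 'Repo', organization: 'my-org', repository: 'my-repo' }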

Multiple Runner Owners

The pool lambda will expect the runner_owner input to be a comma-separated list of owners and process pool adjustment for each.
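For example, the per-owner processing could be sketched as follows (assumed names; adjustPoolForOwner stands in for the existing per-owner adjustment logic):

// Hypothetical sketch: split the comma-separated runner_owner value and
// run the pool adjustment once per owner ('my-org' or 'my-org/my-repo').
async function adjustForAllOwners(
  runnerOwnerInput: string,
  adjustPoolForOwner: (owner: string) => Promise<void>,
): Promise<void> {
  const owners = runnerOwnerInput
    .split(',')
    .map((owner) => owner.trim())
    .filter((owner) => owner.length > 0);
  for (const owner of owners) {
    await adjustPoolForOwner(owner);
  }
}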

Testing

I have deployed the runners.zip and configured the dynamic pool in the setup I use for https://github.com/libp2p and https://github.com/ipfs.

I added a set of unit tests for the newly added functionality.

let topUp = 0;
if (event.poolSize >= 0) {
  topUp = event.poolSize - numberOfRunnersInPool;
} else if (event.poolSize === -1) {
Member:

Can you move the else part to a separate method, for readability?

Contributor Author:

Of course! I'm on it. Thanks for the suggestion.
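For illustration, the requested refactor might look roughly like this (a sketch under the assumption that a queued-job count is available; names like calculateDynamicTopUp and countQueuedJobs are hypothetical, not the code that eventually landed):

// Sketch: keep the static branch in place and move the dynamic branch into
// its own function for readability.
async function calculateTopUp(
  poolSize: number,
  numberOfRunnersInPool: number,
  countQueuedJobs: () => Promise<number>,
): Promise<number> {
  if (poolSize >= 0) {
    return poolSize - numberOfRunnersInPool;
  }
  return calculateDynamicTopUp(numberOfRunnersInPool, countQueuedJobs);
}

async function calculateDynamicTopUp(
  numberOfRunnersInPool: number,
  countQueuedJobs: () => Promise<number>,
): Promise<number> {
  const queuedJobs = await countQueuedJobs();
  return queuedJobs - numberOfRunnersInPool;
}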

@@ -18,6 +22,13 @@ interface RunnerStatus {
status: string;
}

function canRunJob(workflowJobLabels: string[], runnerLabels: string[]): boolean {
Member:

I assume this code is copied from the webhook, correct? In that case we should move it to a common module. That can be done later in a new PR; in that case, can you create an issue and link it here in the code?

Contributor Author:

Yes, that's correct. I simplified it a little bit here. It makes sense to me to extract it to a common module. I'll try to incorporate it into this PR, but in case it starts feeling too big for this change, I'll create an issue and propose a fix separately, as suggested.
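For context, a label-matching helper of this kind typically checks that every label requested by the job is offered by the runner. A minimal sketch of what a shared version could look like (an assumption, not necessarily the webhook's exact logic):

// Minimal sketch of a shared label-matching check; the actual common-module
// implementation may differ (e.g. special-casing 'self-hosted' or OS labels).
function canRunJob(workflowJobLabels: string[], runnerLabels: string[]): boolean {
  const runnerLabelSet = new Set(runnerLabels.map((label) => label.toLowerCase()));
  return workflowJobLabels.every((label) => runnerLabelSet.has(label.toLowerCase()));
}

// canRunJob(['self-hosted', 'linux'], ['self-hosted', 'linux', 'x64']) === true
// canRunJob(['self-hosted', 'gpu'], ['self-hosted', 'linux'])          === false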

Member:

Both are fine. In case you keep it here, please create an issue and refer to it in a comment.

@@ -36,7 +47,7 @@ export async function adjust(event: PoolEvent): Promise<void> {
   const launchTemplateName = process.env.LAUNCH_TEMPLATE_NAME;
   const instanceMaxSpotPrice = process.env.INSTANCE_MAX_SPOT_PRICE;
   const instanceAllocationStrategy = process.env.INSTANCE_ALLOCATION_STRATEGY || 'lowest-price'; // same as AWS default
-  const runnerOwner = process.env.RUNNER_OWNER;
+  const runnerOwners = process.env.RUNNER_OWNER.split(',');
Member:

I assume you propose to use a comma-separated list for multiple owners, each of which can also be a repo.

Contributor Author:

Yes, that's correct. It can be either a list of orgs or repos depending on the use case.

logger.info('Checking for queued jobs to determine pool size');
let repos;
if (runnerType === 'Repo') {
  repos = [repo];
Member:

Would it not be easier to maintain to fetch the list of repos from the app, like you do for the org? In that case this if is not needed, and the pool can adjust without knowing whether it is installed in an org or a repo.

Contributor Author:

If we were to do it that way, it wouldn't be possible to provide different configs for different repositories, since a single pool adjust configuration would always be applied to all the repositories the app has access to. Even though it is not a requirement for me at the moment, I can see it as a valid use case.

if (runnerType === 'Repo') {
  repos = [repo];
} else {
  // @ts-expect-error The types normalized by paginate are not correct,
Member:

This way of scaling dynamically could work for smaller deployments, since it checks all repos and, for each repo, the potentially queued jobs. If you have, for example, 1000 repos, that means 1000 calls to check for jobs, which means the app will hit the rate limit quickly.

GitHub does not provide a way to request queued jobs directly, so via the API there is no good alternative. But hitting a rate limit will cause the whole control plane to stop scaling until the rate limit is reset.

The only alternative I could think of is hooking into the events and using them to press less hard on the API.

I am fine with this approach, but there should be a very clear warning in the documentation. I would also opt to add a flag to enable dynamic scaling and mark it clearly as experimental.

Contributor Author:

Yes, that's correct! Unfortunately, I couldn't find a more efficient way to implement this using the API alone. As you said, it could be possible to use webhook events to limit the search space: instead of checking all the repositories, we would check only those that recently had jobs queued which haven't been picked up yet.

One way to implement it could be with an extra SQS queue where we would store workflow job queued events. The queue could have a delay configured. There could also be a lambda that consumes events from that queue. It would check whether the job the event is for has been scheduled already or not. If it has, that's it. If it hasn't, it would put the event back on that queue and on the queue that the scale-up lambda consumes. Would something like that make sense to you?

I am fine with this approach, but there should be a very clear warning in the documentation. I would also opt to add a flag to enable dynamic scaling and mark it clearly as experimental.

Sure, that makes sense! I'll add a new config option for enabling the "dynamic" mode explicitly instead of controlling it via setting poolSize to -1. I'll make sure to add a warning about this in the documentation too.
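To make the delayed re-check idea above concrete, here is a minimal sketch; the queue URLs, the event shape, and the delay value are assumptions for illustration, and this is not part of this PR:

import { SQSClient, SendMessageCommand } from '@aws-sdk/client-sqs';
import { Octokit } from '@octokit/rest';

// Hypothetical event stored on the delay queue for a queued workflow_job webhook.
interface QueuedJobEvent {
  owner: string;
  repo: string;
  jobId: number;
}

const sqs = new SQSClient({});

// Sketch: consume a delayed event, check whether the job is still queued, and
// if so ask the scale-up queue for a runner and re-queue the event for another check.
async function recheckQueuedJob(
  octokit: Octokit,
  event: QueuedJobEvent,
  delayQueueUrl: string,
  scaleUpQueueUrl: string,
): Promise<void> {
  const { data: job } = await octokit.rest.actions.getJobForWorkflowRun({
    owner: event.owner,
    repo: event.repo,
    job_id: event.jobId,
  });

  if (job.status !== 'queued') {
    return; // The job was picked up or finished; nothing to do.
  }

  const body = JSON.stringify(event);
  await sqs.send(new SendMessageCommand({ QueueUrl: scaleUpQueueUrl, MessageBody: body }));
  await sqs.send(new SendMessageCommand({ QueueUrl: delayQueueUrl, MessageBody: body, DelaySeconds: 300 }));
}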

galargh (Contributor Author) left a comment

Thank you for the review, really appreciated 🙇 I'm aiming to apply the requested changes next week.

github-actions bot:

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the Stale label Aug 16, 2024
@npalm npalm removed the Stale label Aug 16, 2024
npalm (Member) commented Nov 1, 2024

@galargh Still a nice PR, but some time ago I implemented a job retry mechanism, which uses another queue to retry jobs. Does that also solve your issue here? We use the job retry on our smaller fleets where we have no pools or very small pools. I'll leave the PR open, since I see some improvements in the unit tests that I would like to bring back.
