"on-demand" and "spot" EC2 instances in a single stack/GH-app for multi-runner? #4138

cisco-sbg-mgiassa-ai · 2024-09-18T20:55:46Z

Good day,

Would it be realistically possible to set up an instance of multi-runner and have the spot-versus-on-demand EC2 setting be per runner type, rather than a global setting for the entire GitHub App (i.e. stack/module instance)? About 95% of the time, spot works great, but there are some CI/CD jobs where getting hit with node eviction can be quite painful (especially if it happens multiple times per day in a busy region during peak usage hours). It would be extremely helpful to be able to just set an additional runs-on flag and call it a day.

npalm · 2024-10-02T05:59:56Z

Currently you can set instance_target_capacity_type per runner type to use spot or on-demand, which is not very flexible. The module can also create an on-demand instance if the spot request fails. But indeed, there is nothing in place for the case where the job itself fails.

Options that could be investigated:

Move automatically to on-demand if spot failures hit a threshold
Allow a dynamic label in runs-on to indicate that a job must always run on on-demand
...?
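
As a rough illustration of how the two options above could combine, here is a minimal Python sketch. It is not how the module works today. The `on-demand` label name, the `SpotFailureTracker` class, and the `choose_capacity_type` function are all hypothetical names invented for this example. Only the two return values mirror the spot and on-demand choices mentioned above.

```python
from collections import deque
from typing import Deque, Iterable

# Hypothetical label name; neither GitHub nor the module defines it.
ON_DEMAND_LABEL = "on-demand"


class SpotFailureTracker:
    """Counts spot failures inside a sliding time window (hypothetical helper)."""

    def __init__(self, threshold: int, window_seconds: float) -> None:
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._failures: Deque[float] = deque()

    def record_failure(self, timestamp: float) -> None:
        # Timestamps are assumed to arrive in non-decreasing order.
        self._failures.append(timestamp)

    def over_threshold(self, now: float) -> bool:
        # Drop failures that have aged out of the window.
        while self._failures and now - self._failures[0] > self.window_seconds:
            self._failures.popleft()
        return len(self._failures) >= self.threshold


def choose_capacity_type(
    labels: Iterable[str], tracker: SpotFailureTracker, now: float
) -> str:
    """Return "on-demand" or "spot" for a queued job."""
    # Option 2: an explicit label in runs-on always wins.
    if ON_DEMAND_LABEL in set(labels):
        return "on-demand"
    # Option 1: too many recent spot failures, so avoid spot for now.
    if tracker.over_threshold(now):
        return "on-demand"
    return "spot"
```

With a threshold of 2 failures per 600 seconds, a job labelled only `self-hosted` would get spot until two failures are recorded inside the window, and would return to spot once those failures age out.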

crohr · 2024-10-11T08:57:45Z

@cisco-sbg-mgiassa-ai you probably want something like what RunsOn provides, with labels that allow dynamic runner configuration at runtime: https://runs-on.com/configuration/job-labels/#spot

crohr · 2024-10-11T09:01:33Z

> Move automatically to on-demand if spot failures hit a threshold

@npalm do you mean the failures to start spot instances (due to e.g. quota issues), or listening to spot eviction events and avoid launching in spot mode if too many of them occur?

In the latter case, I've had trouble finding a proper way to get those events in close to real-time. In CloudTrail they are usually delayed by up to 15 minutes, which might be too late.

Another option would be to catch the event from the VM, and ping the control plane when this happens.
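
A minimal sketch of that VM-side idea, assuming the runner polls the EC2 instance metadata service (IMDSv2) for `spot/instance-action`, which returns 404 until an interruption is scheduled. The `check_once` function and its `notify` callback are hypothetical. How the control plane would actually be pinged is left open here.

```python
import json
import urllib.error
import urllib.request
from typing import Callable, Optional

IMDS = "http://169.254.169.254/latest"


def fetch_instance_action() -> Optional[str]:
    """Return the raw spot/instance-action document, or None if none is scheduled."""
    # IMDSv2 requires a session token obtained with a PUT request.
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=2) as resp:
        token = resp.read().decode()

    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(req, timeout=2) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption scheduled
            return None
        raise


def check_once(
    fetch: Callable[[], Optional[str]],
    notify: Callable[[dict], None],
) -> bool:
    """Poll once; call notify with the parsed notice if an interruption is pending."""
    body = fetch()
    if body is None:
        return False
    notify(json.loads(body))  # e.g. {"action": "terminate", "time": "..."}
    return True
```

On a runner this would be called in a loop every few seconds as `check_once(fetch_instance_action, notify)`, where `notify` is whatever hypothetical hook reports back to the control plane.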

npalm · 2024-10-17T21:16:17Z

One of the latest releases added a lambda that can log and emit metrics for spot terminations, in addition to the termination warnings. The lambda acting on the warning should be near real time.
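
For readers unfamiliar with the mechanism, here is a hedged Python sketch of a handler that reacts to the EventBridge "EC2 Spot Instance Interruption Warning" event. It is an illustration only and not the module's own lambda. The `parse_interruption` and `handler` names and the shape of the returned record are assumptions made for this example.

```python
import json
import logging
from typing import Any, Dict, Optional

logger = logging.getLogger(__name__)

# detail-type that EC2 publishes to EventBridge ahead of a spot interruption.
WARNING_DETAIL_TYPE = "EC2 Spot Instance Interruption Warning"


def parse_interruption(event: Dict[str, Any]) -> Optional[Dict[str, str]]:
    """Extract the interesting fields, or return None for unrelated events."""
    if event.get("source") != "aws.ec2":
        return None
    if event.get("detail-type") != WARNING_DETAIL_TYPE:
        return None
    detail = event.get("detail", {})
    return {
        "instance_id": detail.get("instance-id", ""),
        "action": detail.get("instance-action", ""),
        "time": event.get("time", ""),
    }


def handler(event: Dict[str, Any], context: Any) -> Optional[Dict[str, str]]:
    """Lambda entry point: log the warning so it can feed a metric or alarm."""
    notice = parse_interruption(event)
    if notice is None:
        return None
    logger.warning("spot interruption warning: %s", json.dumps(notice))
    return notice
```

A counter built from these log lines could feed a threshold check that decides when to stop launching spot instances for a while.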

klingenm · 2024-11-26T14:35:26Z

@cisco-sbg-mgiassa-ai did you see the instance_allocation_strategy setting? Setting that to price-capacity-optimized reduces the spot terminations immensely (for a slightly higher price).

stuartp44 added the enhancement New feature or request label Oct 3, 2024
