[ci] update to 1ES image pool#6866

Merged
jameslamb merged 10 commits into master from ci/update-to-1es-image
Jun 15, 2025

Conversation

@shiyu1994
Collaborator

Update to the 1ES Ubuntu image pool for Ubuntu agents in Azure DevOps.

@shiyu1994 shiyu1994 changed the title from "[ci] update to 1es image pool" to "[ci] update to 1ES image pool" Mar 18, 2025
@jameslamb
Collaborator

It looks like the runners in this pool aren't ready yet to take containerized CI jobs.

They all failed like this at startup:

Error response from daemon: toomanyrequests: You have reached your pull rate limit as '1eshostedagent': dckr_jti_m3ejubTHZ9diqZy6GbB9nt2jvvs=. You may increase the limit by upgrading. https://www.docker.com/increase-rate-limit

(build link)

Some reasons I can think of why that might be happening:

  • the runners are sending unauthenticated requests to DockerHub
  • the 1eshostedagent service account (if that is one) has hit some rate limit
  • runners in this pool (and maybe other instances in the same VPC) all communicate out to the internet through a NAT gateway / NAT instance that looks like a single public IP to DockerHub, and we're hitting an IP-address-based rate limit
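
One way to tell these hypotheses apart is to look at the rate-limit headers Docker Hub returns. The sketch below is not something run in this PR; it assumes the header names (`ratelimit-limit`, `ratelimit-remaining`, `docker-ratelimit-source`) and the `ratelimitpreview/test` repository described in Docker's rate-limit documentation, and the function names are hypothetical. It makes an anonymous request, so it reports the limit as seen from the runner's public IP.

```python
import json
import urllib.request

# Endpoints as described in Docker's rate-limit documentation (assumed here).
TOKEN_URL = (
    "https://auth.docker.io/token"
    "?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull"
)
MANIFEST_URL = "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest"
HEADER_NAMES = ("ratelimit-limit", "ratelimit-remaining", "docker-ratelimit-source")


def parse_ratelimit(value):
    """Parse a header value such as '100;w=21600' into (count, window_seconds)."""
    count, _, params = value.partition(";")
    window = None
    for param in params.split(";"):
        key, _, val = param.strip().partition("=")
        if key == "w" and val:
            window = int(val)
    return int(count), window


def fetch_ratelimit_headers():
    """Send an anonymous HEAD request and return the rate-limit headers.

    Raises urllib.error.HTTPError (status 429) if the limit is already exhausted.
    """
    with urllib.request.urlopen(TOKEN_URL) as resp:
        token = json.load(resp)["token"]
    request = urllib.request.Request(
        MANIFEST_URL,
        method="HEAD",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request) as resp:
        return {name: resp.headers.get(name) for name in HEADER_NAMES}
```

Calling `fetch_ratelimit_headers()` from a step on one of the runners and comparing `docker-ratelimit-source` across jobs would show whether they all share one source IP, which is what the NAT hypothesis predicts.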

@jameslamb
Collaborator

As of the changes in 8782505, there's now a new error in all of these jobs:

Error response from daemon: Head "https://lightgbm.azurecr.io/v2/vsts-agent/manifests/manylinux_2_28_x86_64": unauthorized: authentication required, visit https://aka.ms/acr/authorization for more information. CorrelationId: 2d2c72f8-71ef-435a-9fd0-a3cfec9132ee
##[warning]Docker pull failed with exit code 1, back off 00:00:10 seconds before retry.
##[error]Docker pull failed with exit code 1

(build link)

@shiyu1994
Collaborator Author

/AzurePipelines run

@shiyu1994
Collaborator Author

After changing some settings of the agent pool, it is now ready for use. Can you please check this PR again? @jameslamb @StrikerRUS Thanks.

Collaborator

@StrikerRUS StrikerRUS left a comment

Seems that the process still is not stable:

/bin/docker pull ubuntu:22.04
Error response from daemon: toomanyrequests: You have reached your pull rate limit as '1eshostedagent': dckr_jti_eSn-7El0p7W-Tp4xffsIAv-Abcc=. You may increase the limit by upgrading. https://www.docker.com/increase-rate-limit

https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=17818&view=logs&j=1bad8b53-1e26-54ff-2fd0-35afad2d99e6&t=fad684ab-13d5-4063-a5a9-8f2a766e52d0&l=12

I think we can use Google Docker Mirror to overcome request limits: https://cloud.google.com/artifact-registry/docs/pull-cached-dockerhub-images#cli
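
As a rough illustration of that suggestion (not necessarily what this PR ended up doing), the sketch below rewrites a Docker Hub image reference so that it is pulled through `mirror.gcr.io`. The function name is hypothetical. It relies on two assumptions: that official images such as `ubuntu` live under Docker Hub's `library/` namespace, and that the mirror actually has the requested image cached, which should be verified per image.

```python
MIRROR_HOST = "mirror.gcr.io"  # Google's cache of frequently pulled Docker Hub images


def to_mirror_reference(image, mirror=MIRROR_HOST):
    """Rewrite a Docker Hub image reference so it is pulled through `mirror`.

    References that already name a non-Docker-Hub registry are returned unchanged.
    """
    first, slash, rest = image.partition("/")
    # Docker treats the first path component as a registry host when it
    # contains '.' or ':' or is exactly 'localhost'.
    if slash and ("." in first or ":" in first or first == "localhost"):
        if first in ("docker.io", "index.docker.io"):
            image = rest  # drop the explicit Docker Hub host
        else:
            return image  # some other registry, leave it alone
    if "/" not in image:
        # Official images have no namespace in short form; add 'library/'.
        image = f"library/{image}"
    return f"{mirror}/{image}"
```

With this, `ubuntu:22.04` from the failing `docker pull` above becomes `mirror.gcr.io/library/ubuntu:22.04`, while a reference that already points at another registry is left as-is.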

@shiyu1994
Collaborator Author

I need to figure out how to bypass the rate limit of image pulling from Azure Container Registry.

@shiyu1994
Collaborator Author

> Seems that the process still is not stable:
>
> /bin/docker pull ubuntu:22.04
> Error response from daemon: toomanyrequests: You have reached your pull rate limit as '1eshostedagent': dckr_jti_eSn-7El0p7W-Tp4xffsIAv-Abcc=. You may increase the limit by upgrading. https://www.docker.com/increase-rate-limit
>
> https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=17818&view=logs&j=1bad8b53-1e26-54ff-2fd0-35afad2d99e6&t=fad684ab-13d5-4063-a5a9-8f2a766e52d0&l=12
>
> I think we can use Google Docker Mirror to overcome request limits: https://cloud.google.com/artifact-registry/docs/pull-cached-dockerhub-images#cli

Thanks for the suggestion. Let me try.

@shiyu1994
Collaborator Author

This works now. Please check. If there's no problem, we can merge this. @StrikerRUS @jameslamb

@jameslamb
Collaborator

Excellent, thank you!! There are a few more things that were commented out / removed in #6921 and #6919 that need to be restored.

I just pushed 8143213 doing that.

If this passes I'll merge it.

  containers:
  - container: linux-artifact-builder
-   image: lightgbm/vsts-agent:manylinux_2_28_x86_64
+   image: lightgbm.azurecr.io/vsts-agent:manylinux_2_28_x86_64
Collaborator

For my understanding... what is lightgbm.azurecr.io? Did you manually push the image from https://hub.docker.com/r/lightgbm/vsts-agent there?

I'm wondering what we'd need to do the next time we make changes at https://github.com/guolinke/lightgbm-ci-docker

Collaborator

I merged this to restore full CI, but I would still like an answer to this @shiyu1994

Collaborator

@shiyu1994 I would still like an answer to these questions, please.

Collaborator

@shiyu1994 can you please answer this? I'm sorry to keep @-ing you but it's a big risk to the project that we don't know how to update the container images used in CI.

Collaborator Author

> For my understanding... what is lightgbm.azurecr.io? Did you manually push the image from https://hub.docker.com/r/lightgbm/vsts-agent there?
>
> I'm wondering what we'd need to do the next time we make changes at https://github.com/guolinke/lightgbm-ci-docker

Yes, lightgbm.azurecr.io is a new registry on Azure Container Registry, and I manually pushed the image there.
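
The thread does not say exactly how the manual push was done. One plausible version, given here only as a sketch, is the usual `docker pull` / `docker tag` / `docker push` sequence. The function names and the `dry_run` flag below are hypothetical, and the sketch assumes the machine running it has already authenticated to the registry (for example with `docker login lightgbm.azurecr.io`).

```python
import subprocess


def mirror_commands(repository, tag, source_namespace="lightgbm",
                    target_registry="lightgbm.azurecr.io"):
    """Build the docker commands that copy one image tag from Docker Hub
    to the target registry. Defaults are taken from the names in this thread."""
    source = f"{source_namespace}/{repository}:{tag}"
    target = f"{target_registry}/{repository}:{tag}"
    return [
        ["docker", "pull", source],         # fetch the image from Docker Hub
        ["docker", "tag", source, target],  # give it a name in the new registry
        ["docker", "push", target],         # upload it (requires prior login)
    ]


def mirror_image(repository, tag, dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    for command in mirror_commands(repository, tag):
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)
```

Running `mirror_image("vsts-agent", "manylinux_2_28_x86_64")` only prints the three commands; passing `dry_run=False` would execute them. A step like this would have to be repeated, by hand or in automation, each time the images built from lightgbm-ci-docker change.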

Collaborator

Thanks for that. That's fine as a short-term workaround but at some point we will need to update the images again, and we need a plan for that.

I've put up a proposal in #7011, please comment when you have time.

@jameslamb
Collaborator

On the latest commit, all but 1 job passed!

The macOS sdist job timed out after running for 1 hour. It failed somewhere in the Dask tests:

tests/python_package_test/test_arrow.py ................................ [  2%]
........................................................................ [  6%]
.......................................                                  [  9%]
tests/python_package_test/test_basic.py ................................ [ 11%]
........................................................................ [ 16%]
                                                                         [ 16%]
tests/python_package_test/test_callback.py ................              [ 17%]
tests/python_package_test/test_consistency.py ......                     [ 18%]
tests/python_package_test/test_dask.py ................................. [ 20%]
........................................................................ [ 25%]
........................................................................ [ 29%]
.................s...............s...............s...............s...... [ 34%]

(build link)

I doubt that's related to the changes in this PR; maybe it's something similar to #4074. Restarting it.

@jameslamb jameslamb requested a review from StrikerRUS June 13, 2025 03:28
@jameslamb jameslamb self-requested a review June 13, 2025 03:28
Collaborator

@jameslamb jameslamb left a comment

Ok looks like this is working! Thank you @shiyu1994 !!

> If this passes I'll merge it.

When I wrote this, I forgot that @StrikerRUS had left a blocking review. I'd like to give him a chance to re-review before we merge this.

Collaborator

@StrikerRUS StrikerRUS left a comment

Thanks for giving me a chance to re-review and for finally finishing this PR!

@jameslamb jameslamb merged commit a79e033 into master Jun 15, 2025
51 checks passed
@jameslamb jameslamb deleted the ci/update-to-1es-image branch June 15, 2025 19:58