set hostNetwork to True for TPUGKEJob #641
base: main
Thanks!
volumes=volumes,
)

# hostNetwork True and dnsPolicy do not work with Workload Identity and GCS Fuse.
I suppose we should check here that Workload Identity is not being used, not just GCS Fuse. As synced offline, I will do more testing to see whether this is necessary before merging.
Can I have more context about why hostNetwork and dnsPolicy do not work with Workload Identity?
GKE Workload Identity requires forcing metadata traffic to a specific pod. If you set hostNetwork: true, that is no longer possible, so Workload Identity does not work for that pod.
Note that your GKE clusters can keep using Workload Identity; the issue is only that it won't work for any pod that uses hostNetwork: true. All your other pods using hostNetwork: false will continue to be able to use Workload Identity just like before.
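To make the suggested check concrete, here is a minimal sketch of a guard that only enables host networking when neither Workload Identity nor GCS Fuse is in use. The function name `build_network_overrides` and its two flags are hypothetical and not taken from this PR. The `dnsPolicy` value `ClusterFirstWithHostNet` is also an assumption: it is the standard Kubernetes policy for host-network pods that still need cluster DNS, but the PR may set something different.

```python
from typing import Any, Dict


def build_network_overrides(
    *, uses_workload_identity: bool, uses_gcsfuse: bool
) -> Dict[str, Any]:
    """Returns pod-spec networking fields (hypothetical helper, not from the PR).

    Host networking is enabled only when the pod depends on neither
    Workload Identity nor GCS Fuse.
    """
    if uses_workload_identity or uses_gcsfuse:
        # Keep the pod on the default pod network so these features keep working.
        return {}
    return {
        "hostNetwork": True,
        # Assumed value: lets a host-network pod keep resolving cluster DNS names.
        "dnsPolicy": "ClusterFirstWithHostNet",
    }


if __name__ == "__main__":
    pod_spec: Dict[str, Any] = {"containers": [], "volumes": []}
    pod_spec.update(
        build_network_overrides(uses_workload_identity=False, uses_gcsfuse=False)
    )
    print(pod_spec)
```

Returning an empty dict in the guarded case means the pod spec simply omits both fields, so Kubernetes falls back to its defaults.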
We can attach the right service account (the same one used in the container) to the GKE node pools instead of using the default one. Then it should work with hostNetwork=True.
This helped increase v5e-512 performance from 50% MFU to 58% MFU. The gains could be more significant when going beyond 2 slices.
The majority of Google TPU benchmarks run on GKE use hostNetwork=true. That's also what XPK sets by default.
The performance increase comes from bypassing container networking. Setting hostNetwork=true lets us use the host NICs directly, without first traversing the container NIC and Linux network bridges.