Support cluster domain for MPI HostFile #704

Merged
google-oss-prow[bot] merged 1 commit into kubeflow:master from tenzen-y:add-cluster-domain-opts
Aug 14, 2025

@tenzen-y tenzen-y commented Aug 14, 2025

I added a --cluster-domain option that appends the cluster domain when MPIOperator constructs the MPI HostFile.
When it is specified, the HostFile entries are built in the <pod-name>.<mpi-job-name>.<namespace>.svc.<cluster-domain> format. Otherwise, <pod-name>.<mpi-job-name>.<namespace>.svc is used, the same as before.
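
As a rough illustration of the two formats above, here is a minimal Go sketch. It is not the operator's actual code: the helper name `hostFQDN`, the help text, and the example names (`pi-worker-0`, `pi`, `default`, `cluster.local`) are all hypothetical, and the standard library `flag` package stands in for whatever flag set the operator really uses.

```go
package main

import (
	"flag"
	"fmt"
)

// hostFQDN is a hypothetical helper that builds one HostFile host name.
// With an empty clusterDomain it returns the previous format,
// <pod-name>.<mpi-job-name>.<namespace>.svc; otherwise it appends
// .<cluster-domain>.
func hostFQDN(podName, jobName, namespace, clusterDomain string) string {
	host := fmt.Sprintf("%s.%s.%s.svc", podName, jobName, namespace)
	if clusterDomain != "" {
		host += "." + clusterDomain
	}
	return host
}

func main() {
	// Assumption: a plain flag.FlagSet with illustrative help text.
	fs := flag.NewFlagSet("sketch", flag.ExitOnError)
	var clusterDomain string
	fs.StringVar(&clusterDomain, "cluster-domain", "", "cluster domain appended to HostFile entries (illustrative)")
	_ = fs.Parse([]string{"--cluster-domain=cluster.local"})

	fmt.Println(hostFQDN("pi-worker-0", "pi", "default", clusterDomain)) // pi-worker-0.pi.default.svc.cluster.local
	fmt.Println(hostFQDN("pi-worker-0", "pi", "default", ""))            // pi-worker-0.pi.default.svc
}
```

Leaving the flag at its empty default keeps the old behavior in this sketch, which mirrors the "otherwise" case described above.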

Background

In #453, we added the <namespace>.svc suffix to address a special network environment. However, it is not the ideal "headless service" endpoint name format for a generic network environment, where svc.<cluster-domain> is configured by the kubelet. As a result, a name without the cluster domain goes through many DNS resolution attempts before one succeeds. The failed resolutions cause the Launcher Job to be recreated many times.
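
To make the extra lookups concrete, the following Go sketch lists the names a resolv.conf-style stub resolver would try, in order. It is a simplified model and not code from this PR. It assumes the usual kubelet-generated settings for a Pod in the `default` namespace (`ndots:5` and a three-entry search list) and a cluster domain of `cluster.local`. The function name `lookupCandidates` and the host names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// lookupCandidates models resolv.conf search-list expansion: a name with
// fewer than ndots dots is tried with each search suffix first and as an
// absolute name last; a name with at least ndots dots is tried as an
// absolute name first. A trailing dot marks the name as already absolute.
func lookupCandidates(name string, search []string, ndots int) []string {
	if strings.HasSuffix(name, ".") {
		return []string{name}
	}
	var out []string
	absoluteFirst := strings.Count(name, ".") >= ndots
	if absoluteFirst {
		out = append(out, name+".")
	}
	for _, s := range search {
		out = append(out, name+"."+s+".")
	}
	if !absoluteFirst {
		out = append(out, name+".")
	}
	return out
}

func main() {
	// Assumed search list for a Pod in the "default" namespace.
	search := []string{"default.svc.cluster.local", "svc.cluster.local", "cluster.local"}

	for _, name := range []string{
		"pi-worker-0.pi.default.svc",               // format without the cluster domain
		"pi-worker-0.pi.default.svc.cluster.local", // format with the cluster domain
	} {
		fmt.Println(name)
		for i, c := range lookupCandidates(name, search, 5) {
			fmt.Printf("  %d. %s\n", i+1, c)
		}
	}
}
```

Under these assumptions, the name without the cluster domain only reaches the resolvable `pi-worker-0.pi.default.svc.cluster.local.` on the third attempt, while the name that already carries the cluster domain is tried as-is first.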

The impact is especially large in runLauncherAsWorker environments, since the Launcher starts very quickly, before the Service is set up.

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
@tenzen-y
Member Author

/assign @terrytangyuan

Member

@terrytangyuan terrytangyuan left a comment

/lgtm
/approve

@google-oss-prow google-oss-prow bot added the lgtm label Aug 14, 2025
@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: terrytangyuan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit e02dfe5 into kubeflow:master Aug 14, 2025
10 checks passed
@tenzen-y tenzen-y deleted the add-cluster-domain-opts branch August 14, 2025 23:54
fs.IntVar(&s.ControllerRateLimit, "controller-queue-rate-limit", 10, "Rate limit of the controller events queue .")
fs.IntVar(&s.ControllerBurst, "controller-queue-burst", 100, "Maximum burst of the controller events queue.")

fs.StringVar(&s.ClusterDomain, "cluster-domain", "", `
Member

@tenzen-y Do we need this for the MPI plugin in KF Trainer?

Member Author

I don't think so, since Trainer does not support the <PodName>.<JobName>.<namespace>.svc format. Trainer supports only <PodName>.<JobName>. In Trainer's case, the cluster domain is applied automatically and immediately.

Member

Sounds good!
