GKE GPU A3 with TCPX support #517

Merged: 45 commits, Jun 29, 2024
Commits
e77c3aa
add GKE GPU support to axlearn
samos123 Jun 10, 2024
fba2431
add volumes and initContainer
samos123 Jun 11, 2024
0af0170
finish up the pod spec
samos123 Jun 11, 2024
22a1862
add GKE runner for GPU
samos123 Jun 11, 2024
60f465a
extend != append duhh
samos123 Jun 11, 2024
b542e0e
fix volume mount change
samos123 Jun 11, 2024
2a78995
add local queue for internal cluster
samos123 Jun 11, 2024
637216c
move kueue annotation to jobset
samos123 Jun 11, 2024
23a3e8e
introduce gpu container image
samos123 Jun 11, 2024
8618e3f
add ld_library_path
samos123 Jun 11, 2024
de5c406
add env variables for a3
samos123 Jun 11, 2024
4fea2d5
ensure replicas of jobset is 1
samos123 Jun 11, 2024
232f597
automatically set distributed coordinator as env variables
samos123 Jun 11, 2024
495e12d
change NCCL_DEBUG to warn
samos123 Jun 12, 2024
17942c6
install gpu jax using pyproject.toml
samos123 Jun 12, 2024
ce1db6d
address comments from Mark
samos123 Jun 12, 2024
4de8ea2
fix missing sidecar
samos123 Jun 12, 2024
1887395
remove accidental double quote
samos123 Jun 12, 2024
09fc278
add default XLA flags
samos123 Jun 12, 2024
d51007f
hardcode max step to 1000
samos123 Jun 12, 2024
e5c2743
fix sidecar command
samos123 Jun 13, 2024
1ba2c5d
fix sidecar termination
samos123 Jun 13, 2024
7dcd651
allow passing queue name through flag
samos123 Jun 13, 2024
11b3a97
port over remaining xla flags
samos123 Jun 13, 2024
317b182
only use bare minimum xla flags
samos123 Jun 13, 2024
4a22f93
address Marks' nit on using env_vars.update
samos123 Jun 13, 2024
4401487
remove flags that make perf worse
samos123 Jun 13, 2024
0db3bb8
remove tpu np provionser from gpu runner
samos123 Jun 13, 2024
4080ed7
add mesh rule
samos123 Jun 13, 2024
ef5e25b
Revert "hardcode max step to 1000"
samos123 Jun 13, 2024
3b6d888
add doc for queue flag
samos123 Jun 14, 2024
7c10c32
fix punctuation and add link to tcpx docs
samos123 Jun 14, 2024
89933aa
more puntuation
samos123 Jun 14, 2024
2818e88
document NCCL env vars
samos123 Jun 14, 2024
3f275b4
throw error if GCS mount is set
samos123 Jun 14, 2024
2bfb500
throw error when pre provisioner is enabled
samos123 Jun 14, 2024
a247702
add testing coverage for GPUGKEJob
samos123 Jun 14, 2024
9d40f97
add basic gke_runner tests
samos123 Jun 15, 2024
65ad512
add more gke_runner tests
samos123 Jun 15, 2024
7d6bc04
address pr comments
samos123 Jun 17, 2024
92de543
add space
samos123 Jun 17, 2024
5b57474
fix missing .
samos123 Jun 17, 2024
0ce8001
add missing space
samos123 Jun 18, 2024
620cd38
fix pytype error in job_test.py
samos123 Jun 21, 2024
f680604
update golden configs
samos123 Jun 28, 2024
add space
samos123 committed Jun 27, 2024
commit 92de5436a90ccc3605121abba745b4d1748f7e59
2 changes: 1 addition & 1 deletion axlearn/cloud/gcp/job.py
@@ -839,7 +839,7 @@ def _build_main_container(self) -> Nested[Any]:
         user_cmd = cfg.command
         if user_cmd is None:
             raise ValueError("Command should not be None.")
-        user_cmd += ";touch /run/tcpx/terminated"
+        user_cmd += "; touch /run/tcpx/terminated"
         command = ["bash", "-c", user_cmd]

         return dict(
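For readers following the hunk above, here is a minimal standalone sketch of the same pattern: the user command is wrapped in `bash -c` and followed by a `touch` of a marker file, presumably so the TCPX sidecar can tell that the main container has finished. The function name `build_main_command` and the `marker` parameter are hypothetical and not part of axlearn; only the marker path and the error message come from the diff.

```python
from typing import List, Optional

# Marker path taken from the diff; treating it as a default argument is an assumption.
TERMINATION_MARKER = "/run/tcpx/terminated"


def build_main_command(
    user_cmd: Optional[str], marker: str = TERMINATION_MARKER
) -> List[str]:
    """Wraps a user command so that a marker file is touched after it runs.

    Hypothetical helper for illustration; it mirrors the string handling in the diff.
    """
    if user_cmd is None:
        raise ValueError("Command should not be None.")
    # ";" rather than "&&": the marker is touched whether or not the user command succeeds.
    wrapped = f"{user_cmd}; touch {marker}"
    return ["bash", "-c", wrapped]


if __name__ == "__main__":
    print(build_main_command("python3 -m my_trainer"))
```

The space after `;` is cosmetic, since bash parses `a;touch f` and `a; touch f` identically. One consequence of using `;` is that the exit status of the `bash -c` invocation is that of the final `touch`, not of the user command.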