GKE GPU A3 with TCPX support #517
Conversation
A quick skim through...
Force-pushed from 0ca01ac to 4aa8b09.
Thanks, overall looks reasonable. Can we add/update tests for the changes in job.py and gke_runner.py?
I have addressed all comments, some with fixes and some with responses instead. I've added tests for GPUGKEJob, and I also added some basic tests in the gke_runner.
Thanks! Looks good to me after final comments.
I did another test on a single node but am still waiting for multi-node results. The single-node run was able to verify that…
Can we leave a comment "Done" for comments which have been addressed? (See also: https://github.com/apple/axlearn/blob/main/CONTRIBUTING.md)
I resolved all comments and fixed the missing punctuation.
Thanks!
It seems Circle CI is awaiting someone to approve the run?
It seems it was approved, but there was a Circle CI Apple Auth error: https://app.circleci.com/pipelines/github/apple/axlearn/1925/workflows/99dbcc0a-7a9f-4338-8ec3-0cc2724a9e28
Force-pushed from 0c54895 to 3877a39.
Looks like Apple Authenticate in Circle CI is still failing, but I don't see the logs.
* add GKE GPU support to axlearn
* add volumes and initContainer
* finish up the pod spec
* add GKE runner for GPU
* extend != append duhh
* fix volume mount change
* add local queue for internal cluster
* move kueue annotation to jobset
* introduce gpu container image
* add ld_library_path
* add env variables for a3
* ensure replicas of jobset is 1
* automatically set distributed coordinator as env variables
* change NCCL_DEBUG to warn
* install gpu jax using pyproject.toml
* address comments from Mark
* fix missing sidecar
* remove accidental double quote
* add default XLA flags
* hardcode max step to 1000
* fix sidecar command
* fix sidecar termination
* allow passing queue name through flag
* port over remaining xla flags
* only use bare minimum xla flags
* address Marks' nit on using env_vars.update
* remove flags that make perf worse
* remove tpu np provionser from gpu runner
* add mesh rule
* Revert "hardcode max step to 1000" (reverts commit 8cda4f91414c00deb28c7f15d54d183076101d8b)
* add doc for queue flag
* fix punctuation and add link to tcpx docs
* more puntuation
* document NCCL env vars
* throw error if GCS mount is set
* throw error when pre provisioner is enabled
* add testing coverage for GPUGKEJob
* add basic gke_runner tests
* add more gke_runner tests
* address pr comments
* add space
* fix missing .
* add missing space
* fix pytype error in job_test.py
* update golden configs
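Several of these commits wire up per-pod environment variables: the distributed coordinator is derived automatically from the JobSet, NCCL_DEBUG is lowered to warn, and LD_LIBRARY_PATH points at the TCPX libraries. A minimal sketch of that idea follows; the helper name, env-var keys, hostname pattern, and library paths are illustrative assumptions, not axlearn's actual API.

```python
# Illustrative sketch only: the function name, env keys, hostname
# convention, and paths below are assumptions, not axlearn's real code.

def build_gpu_env(*, jobset_name: str, num_nodes: int, node_rank: int,
                  port: int = 8080) -> dict:
    """Builds the container env for one worker pod of a GPU JobSet."""
    # Derive the coordinator from the JobSet's replica-0 pod (assumed
    # "<jobset>-job-0-0" hostname pattern with a headless service).
    coordinator = f"{jobset_name}-job-0-0.{jobset_name}:{port}"
    return {
        # Quieter NCCL logging, per the "change NCCL_DEBUG to warn" commit.
        "NCCL_DEBUG": "WARN",
        # TCPX/NVIDIA libraries mounted by the initContainer/sidecar.
        "LD_LIBRARY_PATH": "/usr/local/nvidia/lib64:/usr/local/tcpx/lib64",
        # JAX distributed-init parameters, set automatically per pod.
        "JAX_COORDINATOR_ADDRESS": coordinator,
        "NUM_PROCESSES": str(num_nodes),
        "PROCESS_ID": str(node_rank),
    }

env = build_gpu_env(jobset_name="my-exp", num_nodes=32, node_rank=3)
```

With this shape, each replica gets identical env vars except PROCESS_ID, which the runner would fill in from the pod's completion index.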
Add the ability to launch GPU jobs directly on GKE
Test cases I ran:
Fuji 7B, single host, per-device batch size 8, is working:
step time on a single node: 2.06 seconds.
Fuji 7B, 32 nodes, per-device batch size 8:
step time on 32 nodes: 3.03 seconds.
Fuji 7B, 64 nodes, per-device batch size 4:
step time on A3: 3.92 seconds (note: on A3+ this is 1.785 seconds).
I plan to work on A3+ support in a separate PR.
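As a sanity check on the numbers above: the per-device batch stays at 8 from 1 to 32 nodes (weak scaling), so ideal scaling would keep the step time flat. A short sketch of the arithmetic; the helper name is ours, not from the PR.

```python
# Back-of-the-envelope check of the step times reported in the PR
# description. Under weak scaling the per-device batch is fixed, so
# efficiency is simply t_single / t_multi.

def weak_scaling_efficiency(t_single: float, t_multi: float) -> float:
    """Fraction of ideal throughput retained when scaling out."""
    return t_single / t_multi

eff_32 = weak_scaling_efficiency(2.06, 3.03)  # 1 node vs 32 nodes, ~0.68
a3_plus_speedup = 3.92 / 1.785                # A3 vs A3+ at 64 nodes, ~2.2x
print(f"32-node weak-scaling efficiency: {eff_32:.0%}")
```

So the reported times imply roughly 68% weak-scaling efficiency at 32 nodes, and a ~2.2x step-time advantage for A3+ over A3 at 64 nodes, consistent with TCPX narrowing but not closing the interconnect gap.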