
GKE GPU A3 with TCPX support #517

Merged
merged 45 commits into apple:main on Jun 29, 2024

Conversation


@samos123 samos123 commented Jun 11, 2024

Add the ability to launch GPU jobs directly on GKE

Test cases I ran:

Fuji 7B single host with per-device batch size 8 is working too:

BASTION_TIER=1 axlearn gcp gke start --instance_type=gpu-a3-highgpu-8g-8 \
    --cluster=daily-run-single-zone-gker27 --num_replicas=1 \
    --queue multislice-queue \
    --bundler_spec=allow_dirty=True \
    --bundler_type=artifactregistry --bundler_spec=image=gpu \
    --bundler_spec=dockerfile=Dockerfile --bundler_spec=target=gpu \
    -- mkdir -p /tmp/test_trainer\; python3 -m axlearn.common.launch_trainer_main \
    --module=text.gpt.c4_trainer --config=fuji-7B-v1-single-host \
    --trainer_dir=/tmp/test_trainer --data_dir=gs://axlearn-public/tensorflow_datasets \
    --jax_backend=gpu

Step time on a single node: 2.06 seconds.

Fuji 7B, 32 nodes, per-device batch size 8:

BASTION_TIER=1 axlearn gcp gke start --instance_type=gpu-a3-highgpu-8g-256 \
    --cluster=daily-run-single-zone-gker27 --num_replicas=32 \
    --queue multislice-queue \
    --bundler_spec=allow_dirty=True \
    --bundler_type=artifactregistry --bundler_spec=image=gpu \
    --bundler_spec=dockerfile=Dockerfile --bundler_spec=target=gpu \
    -- mkdir -p /tmp/test_trainer\; python3 -m axlearn.common.launch_trainer_main \
    --module=text.gpt.c4_trainer --config=fuji-7B-v1 \
    --trainer_dir=/tmp/test_trainer --data_dir=gs://axlearn-public/tensorflow_datasets \
    --jax_backend=gpu

Step time on 32 nodes: 3.03 seconds.

Fuji 7B, 64 nodes, per-device batch size 4:

BASTION_TIER=1 axlearn gcp gke start --instance_type=gpu-a3-highgpu-8g-512 \
    --cluster=daily-run-single-zone-gker27 --num_replicas=64 \
    --queue multislice-queue \
    --bundler_spec=allow_dirty=True \
    --bundler_type=artifactregistry --bundler_spec=image=gpu \
    --bundler_spec=dockerfile=Dockerfile --bundler_spec=target=gpu \
    -- mkdir -p /tmp/test_trainer\; python3 -m axlearn.common.launch_trainer_main \
    --module=text.gpt.c4_trainer --config=fuji-7B-v1 \
    --trainer_dir=/tmp/test_trainer --data_dir=gs://axlearn-public/tensorflow_datasets \
    --jax_backend=gpu

Step time on A3: 3.92 seconds (note: on A3+ this is 1.785 seconds).
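As a sanity check on the numbers above, the implied weak-scaling efficiency from 1 to 32 nodes (per-device batch size is 8 in both runs, so the ideal step time is flat) works out as follows. This is illustrative arithmetic over the reported step times, not output from the runs:

```python
# Step times reported in this PR description.
single_node_s = 2.06  # 1 node (8 GPUs), per-device batch size 8
nodes_32_s = 3.03     # 32 nodes (256 GPUs), per-device batch size 8

# Weak scaling: work per device is constant, so the ideal step time stays
# flat and efficiency is the single-node time over the scaled-up time.
efficiency = single_node_s / nodes_32_s
print(f"1 -> 32 node weak-scaling efficiency: {efficiency:.0%}")  # 68%
```

The 64-node run uses per-device batch size 4, so its 3.92-second step time is at the same global batch size as the 32-node run rather than another weak-scaling point.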

I plan to work on A3+ support in a separate PR.

@samos123 samos123 marked this pull request as draft June 11, 2024 05:54

@markblee markblee left a comment


A quick skim through...

Review threads (resolved): Dockerfile, axlearn/cloud/gcp/job.py, axlearn/common/launch.py, axlearn/cloud/gcp/jobs/gke_runner.py
@samos123 samos123 requested a review from markblee June 13, 2024 07:17
@samos123 samos123 changed the title from "WIP GKE GPU A3 with TCPX support" to "GKE GPU A3 with TCPX support" Jun 13, 2024
@samos123 samos123 marked this pull request as ready for review June 13, 2024 17:28
@samos123 samos123 force-pushed the gke-gpu branch 2 times, most recently from 0ca01ac to 4aa8b09 June 13, 2024 19:11

@markblee markblee left a comment


Thanks, overall looks reasonable. Can we add/update tests for changes in job.py and gke_runner.py?

Review threads (resolved): axlearn/cloud/gcp/job.py, axlearn/experiments/text/gpt/c4_trainer.py

samos123 commented Jun 14, 2024

I have addressed all comments, some with fixes and some with responses instead.

I've added tests for GPUGKEJob, and some basic tests for the gke_runner as well.

@samos123 samos123 requested a review from markblee June 14, 2024 23:25

@markblee markblee left a comment


Thanks! Looks good to me after final comments.

Review threads (resolved): axlearn/cloud/gcp/job.py, axlearn/common/launch.py
@samos123 samos123 requested a review from markblee June 17, 2024 22:29
@samos123

I did another test on a single node but am still waiting for multi-node results. The single-node run verified that ';' works in place of '\n'.
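The behavior being verified (statements joined with ';' run sequentially in a single shell invocation, so '\;' can stand in for a newline in the command passed after '--') can be sketched like this; it is an illustration, not the axlearn launcher's actual code:

```python
import subprocess

# Statements separated by ';' are handed to one shell, which runs them in
# order, mirroring the 'mkdir -p /tmp/test_trainer\; python3 ...' pattern
# in the launch commands above.
statements = ["mkdir -p /tmp/test_trainer", "echo trainer dir ready"]
command = "; ".join(statements)
result = subprocess.run(["sh", "-c", command], capture_output=True, text=True)
print(result.stdout.strip())  # trainer dir ready
```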


@markblee markblee left a comment


Can we leave a comment "Done" for comments which have been addressed? (See also: https://github.com/apple/axlearn/blob/main/CONTRIBUTING.md)

Review threads (resolved): axlearn/cloud/gcp/job.py
@samos123 samos123 requested a review from markblee June 17, 2024 22:59
@samos123

I resolved all comments and fixed the missing punctuation.


@markblee markblee left a comment


Thanks!

@samos123 samos123 requested a review from markblee June 18, 2024 19:11
@samos123

It seems CircleCI is waiting for someone to approve the run?

@samos123

Seems it was approved, but there was a CircleCI Apple Auth error: https://app.circleci.com/pipelines/github/apple/axlearn/1925/workflows/99dbcc0a-7a9f-4338-8ec3-0cc2724a9e28
I can't see what the error is, though, so @markblee will probably need to take a look.

@samos123 samos123 force-pushed the gke-gpu branch 2 times, most recently from 0c54895 to 3877a39 June 18, 2024 22:39
@samos123 samos123 requested a review from markblee June 18, 2024 22:39
@samos123

Looks like the Apple Authenticate step in CircleCI is still failing, but I can't see the logs.

@markblee markblee added this pull request to the merge queue Jun 28, 2024
Merged via the queue into apple:main with commit 4454f54 Jun 29, 2024
3 checks passed
madrob pushed a commit that referenced this pull request Jul 2, 2024
* add GKE GPU support to axlearn

* add volumes and initContainer

* finish up the pod spec

* add GKE runner for GPU

* extend != append duhh

* fix volume mount change

* add local queue for internal cluster

* move kueue annotation to jobset

* introduce gpu container image

* add ld_library_path

* add env variables for a3

* ensure replicas of jobset is 1

* automatically set distributed coordinator as env variables

* change NCCL_DEBUG to warn

* install gpu jax using pyproject.toml

* address comments from Mark

* fix missing sidecar

* remove accidental double quote

* add default XLA flags

* hardcode max step to 1000

* fix sidecar command

* fix sidecar termination

* allow passing queue name through flag

* port over remaining xla flags

* only use bare minimum xla flags

* address Marks' nit on using env_vars.update

* remove flags that make perf worse

* remove tpu np provionser from gpu runner

* add mesh rule

* Revert "hardcode max step to 1000"

This reverts commit 8cda4f91414c00deb28c7f15d54d183076101d8b.

* add doc for queue flag

* fix punctuation and add link to tcpx docs

* more puntuation

* document NCCL env vars

* throw error if GCS mount is set

* throw error when pre provisioner is enabled

* add testing coverage for GPUGKEJob

* add basic gke_runner tests

* add more gke_runner tests

* address pr comments

* add space

* fix missing .

* add missing space

* fix pytype error in job_test.py

* update golden configs
github-merge-queue bot pushed a commit that referenced this pull request Jul 3, 2024
* Open-source MoE (#547)

* MoE

* update

* update

* update

* update

* update

* update

---------

Co-authored-by: Xianzhi Du <[email protected]>

* Supports return_aux in PipelinedTransformerLayer. (#557)

* link fix (#561)

Co-authored-by: Xianzhi Du <[email protected]>

* GKE GPU A3 with TCPX support (#517)

* [inference] remove unncessary mocks and use absolute imports (#563)

Co-authored-by: guoli-yin <[email protected]>

* Minor style changes. (#564)

* add support different layer order in conformer

* update

* address review feedback

* fix formatting

* fix black

---------

Co-authored-by: xianzhi <[email protected]>
Co-authored-by: Xianzhi Du <[email protected]>
Co-authored-by: Mark Lee <[email protected]>
Co-authored-by: Sam Stoelinga <[email protected]>
Co-authored-by: Guoli Yin <[email protected]>
Co-authored-by: guoli-yin <[email protected]>
Co-authored-by: Yongqiang Wang <[email protected]>
github-merge-queue bot pushed a commit that referenced this pull request Jul 9, 2024

* GKE GPU A3 with TCPX support (#517)

* Minor style changes. (#564)

* Increases the maximum number of outputs in tb image.

* Revert fuji-7B-v1-flash-single-host.txt

* Revert fuji-7B-v1-flash-single-host.txt

* Revert fuji-7B-v1-flash-single-host.txt

* Revert fuji-7B-v1-flash-single-host

---------

Co-authored-by: xianzhi <[email protected]>
Co-authored-by: Xianzhi Du <[email protected]>
Co-authored-by: Mark Lee <[email protected]>
Co-authored-by: Sam Stoelinga <[email protected]>
Co-authored-by: Guoli Yin <[email protected]>
Co-authored-by: guoli-yin <[email protected]>
qdavid1 pushed a commit to qdavid1/axlearn that referenced this pull request Dec 11, 2024
qdavid1 pushed a commit to qdavid1/axlearn that referenced this pull request Dec 11, 2024
qdavid1 pushed a commit to qdavid1/axlearn that referenced this pull request Dec 11, 2024