ElasticSearch does not run #2

Closed
nlacasse opened this issue Apr 26, 2018 · 3 comments

Comments

@nlacasse
Collaborator

This requires some socket ioctls that are not currently implemented.

@clandry94

I can't guarantee anything will come of it, but I can take a look at this.

@fvoznika
Member

fvoznika commented May 3, 2018

That would be great! Here are some pointers:

Follow these instructions to enable strace and run Elasticsearch. Then look for the log file `runsc.log.*.boot`, which is the log for the Sentry process. The log contains a dump of all syscalls; search it for ioctl calls that failed using the pattern `ioctl.*error`.
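The log search described above can be scripted. The following is a minimal sketch, assuming the debug logs were written to a directory such as `/tmp/runsc` (an assumption; use whatever directory your debug-log setting points at) and that failed calls show up on lines matching the `ioctl.*error` pattern. The helper name `find_failed_ioctls` is hypothetical.

```python
import glob
import os
import re

# Pattern from the comment above: a log line that mentions ioctl and error.
FAILED_IOCTL = re.compile(r"ioctl.*error")


def find_failed_ioctls(log_dir):
    """Return (path, line) pairs for boot-log lines matching ioctl.*error."""
    matches = []
    # Sentry logs follow the runsc.log.*.boot naming mentioned above.
    for path in sorted(glob.glob(os.path.join(log_dir, "runsc.log.*.boot"))):
        with open(path, errors="replace") as log:
            for line in log:
                if FAILED_IOCTL.search(line):
                    matches.append((path, line.rstrip("\n")))
    return matches


if __name__ == "__main__":
    # "/tmp/runsc" is an assumed location, not something gVisor mandates.
    for path, line in find_failed_ioctls("/tmp/runsc"):
        print(f"{path}: {line}")
```

Each reported line should name the ioctl request that the Sentry rejected, which narrows down what needs to be implemented.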

The entry point for the ioctl syscall in the Sentry is here.

@clandry94

clandry94 commented May 4, 2018

Note: the Elasticsearch image was failing to start because it could not find a loopback interface, even though `lo` was present in the network interface list.

Adding `--network=host` to the runtime args gets me past the ES loopback interface lookup and on to the ioctl errors.

shentubot pushed a commit that referenced this issue Jul 3, 2018
glibc's malloc also uses SYS_TIME. Permit it.

#0  0x0000000000de6267 in time ()
#1  0x0000000000db19d8 in get_nprocs ()
#2  0x0000000000d8a31a in arena_get2.part ()
#3  0x0000000000d8ab4a in malloc ()
#4  0x0000000000d3c6b5 in __sanitizer::InternalAlloc(unsigned long, __sanitizer::SizeClassAllocatorLocalCache<__sanitizer::SizeClassAllocator32<0ul, 140737488355328ull, 0ul, __sanitizer::SizeClassMap<3ul, 4ul, 8ul, 17ul, 64ul, 14ul>, 20ul, __sanitizer::TwoLevelByteMap<32768ull, 4096ull, __sanitizer::NoOpMapUnmapCallback>, __sanitizer::NoOpMapUnmapCallback> >*, unsigned long) ()
#5  0x0000000000d4cd70 in __tsan_go_start ()
#6  0x00000000004617a3 in racecall ()
#7  0x00000000010f4ea0 in runtime.findfunctab ()
#8  0x000000000043f193 in runtime.racegostart ()

Signed-off-by: Dmitry Vyukov <[email protected]>
[[email protected]: updated comments and commit message]
Signed-off-by: Michael Pratt <[email protected]>

Change-Id: Ibe2d0dc3035bf5052d5fb802cfaa37c5e0e7a09a
PiperOrigin-RevId: 203042627
dvyukov pushed a commit to dvyukov/gvisor that referenced this issue Jul 4, 2018
Closes google#2

PiperOrigin-RevId: 202997196
Change-Id: I0c9f6f5a8a1abe1ae427bca5f590bdf9f82a6675
tonistiigi referenced this issue in tonistiigi/gvisor Jan 30, 2019
Closes #2

PiperOrigin-RevId: 202997196
Change-Id: I0c9f6f5a8a1abe1ae427bca5f590bdf9f82a6675
Upstream-commit: fa64c2a
tanjianfeng pushed a commit to tanjianfeng/gvisor that referenced this issue Aug 2, 2019
The following command, run under the hostinet network, leads to a panic:

  $ cat /proc/net/tcp

It's caused by the wrong SizeOfTCPInfo.

  #0 runtime.panicindex()
  #1 encoding/binary.littleEndian.Uint64
  #2 encoding/binary.(*littleEndian).Uint64
  #3 gvisor.dev/gvisor/pkg/binary.unmarshal
  #4 gvisor.dev/gvisor/pkg/binary.unmarshal
  #5 gvisor.dev/gvisor/pkg/binary.Unmarshal
  #6 gvisor.dev/gvisor/pkg/sentry/socket/hostinet.(*socketOperations).State
  #7 gvisor.dev/gvisor/pkg/sentry/fs/proc.(*netTCP).ReadSeqFileData

Correct SizeOfTCPInfo from 104 to 192 to fix it.

Fixes google#640

Signed-off-by: Jianfeng Tan <[email protected]>
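
The stack trace in the commit above suggests a decoder reading fixed-width fields out of a buffer that is shorter than the layout it expects. The sketch below reproduces that class of failure with Python's `struct` module. It does not reproduce the real `tcp_info` layout: treating the structure as a run of 64-bit little-endian words is a simplifying assumption, and `decode_words` is a hypothetical helper. Only the two sizes, 104 and 192, come from the commit message.

```python
import struct

OLD_SIZE = 104  # the incorrect SizeOfTCPInfo from the commit message
NEW_SIZE = 192  # the corrected value


def decode_words(buf, expected_size):
    """Decode expected_size bytes of buf as little-endian uint64 words.

    Raises struct.error if buf is shorter than expected_size, which is the
    Python analogue of an index-out-of-range panic in Go.
    """
    count = expected_size // 8
    return struct.unpack_from(f"<{count}Q", buf, 0)


if __name__ == "__main__":
    short_buf = bytes(OLD_SIZE)  # buffer allocated with the wrong size
    try:
        decode_words(short_buf, NEW_SIZE)
    except struct.error as exc:
        print(f"decode failed: {exc}")

    full_buf = bytes(NEW_SIZE)  # buffer allocated with the corrected size
    print(len(decode_words(full_buf, NEW_SIZE)), "words decoded")
```

Sizing the buffer from the same constant the decoder uses removes the mismatch, which is what correcting the constant achieves.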
copybara-service bot pushed a commit that referenced this issue Mar 14, 2020
Add fpsimd support to the KVM module so that the test case "TestKernelFloatingPoint"
passes on the Arm64 platform.

Signed-off-by: Bin Lu <[email protected]>
FUTURE_COPYBARA_INTEGRATE_REVIEW=#1707 from lubinszARM:pr_lazy_fpsimd_2 bf87da8
PiperOrigin-RevId: 300843308
copybara-service bot pushed a commit that referenced this issue Mar 26, 2021
Before this change:

```
$ docker run --runtime=runsc --rm -it -v ~/tmp:/hosttmp ubuntu:focal /hosttmp/issue5732 --bytes1=128 --bytes2=1024
#1: read(128) = 128
#2: read(1024) = EOF
$ docker run --runtime=runsc-vfs2 --rm -it -v ~/tmp:/hosttmp ubuntu:focal /hosttmp/issue5732 --bytes1=128 --bytes2=1024
#1: read(128) = 128
#2: read(1024) = 256
```

After this change:

```
$ docker run --runtime=runsc --rm -it -v ~/tmp:/hosttmp ubuntu:focal /hosttmp/issue5732 --bytes1=128 --bytes2=1024
#1: read(128) = 128
#2: read(1024) = 256
$ docker run --runtime=runsc-vfs2 --rm -it -v ~/tmp:/hosttmp ubuntu:focal /hosttmp/issue5732 --bytes1=128 --bytes2=1024
#1: read(128) = 128
#2: read(1024) = 256
```

Fixes #5732

PiperOrigin-RevId: 365178386
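
The expected behavior in the commit above is that a read asking for more bytes than remain returns the remaining bytes rather than EOF. The `issue5732` program itself is not shown, so the sketch below is a hypothetical reconstruction using a regular temporary file; the 384-byte total is an assumption inferred from 128 + 256, and `two_reads` is an illustrative name.

```python
import os
import tempfile


def two_reads(total, bytes1, bytes2):
    """Write `total` bytes to a temp file, then issue two reads.

    Returns the number of bytes each read returned.
    """
    with tempfile.TemporaryFile() as f:
        f.write(b"x" * total)
        f.flush()
        f.seek(0)  # rewind the underlying descriptor before raw reads
        fd = f.fileno()
        first = os.read(fd, bytes1)
        second = os.read(fd, bytes2)
        return len(first), len(second)


if __name__ == "__main__":
    n1, n2 = two_reads(total=384, bytes1=128, bytes2=1024)
    print(f"#1: read(128) = {n1}")
    print(f"#2: read(1024) = {n2}")
```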
copybara-service bot pushed a commit that referenced this issue Jul 3, 2024
Distributed training isn't working with PyTorch on certain A100 nodes.

Adds the missing ioctl `UVM_UNMAP_EXTERNAL` allowing for certain NCCL operations to succeed when using [`torch.distributed`](https://pytorch.org/docs/stable/distributed.html), fixing distributed training.

## Reproduction

This affects numerous A100 40GB and 80GB instances in our fleet. This reproduction requires 4 A100 GPUs, either 40GB or 80GB.

- **NVIDIA Driver Version**: 550.54.15
- **CUDA Version**: 12.4
- **NVIDIA device**: NVIDIA A100 80GB PCIe

### Steps

1. **Install gvisor**
```bash
# ARCH must match the release directory name, for example x86_64 or aarch64.
ARCH="$(uname -m)"
URL="https://storage.googleapis.com/gvisor/releases/master/latest/${ARCH}"
wget -nc "${URL}/runsc" "${URL}/runsc.sha512"
# Verify the download against the published checksum before installing.
sha512sum -c runsc.sha512
chmod +x runsc
sudo cp runsc /usr/local/bin/runsc
sudo /usr/local/bin/runsc install
sudo systemctl reload docker
```
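
The download step fetches `runsc.sha512` alongside the binary. For setups that script the integrity check outside a shell, here is a minimal sketch. It assumes the checksum file uses the usual `sha512sum` layout (hex digest first, then the file name); `sha512_of` and `verify` are hypothetical helper names.

```python
import hashlib
import sys


def sha512_of(path, chunk_size=1 << 20):
    """Return the hex SHA-512 digest of the file at path."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        # Read in chunks so large binaries are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(binary_path, checksum_path):
    """Compare the binary's digest with the first token of the checksum file."""
    with open(checksum_path) as f:
        expected = f.read().split()[0].lower()
    return sha512_of(binary_path) == expected


if __name__ == "__main__":
    # Usage (assumed file names): python verify.py runsc runsc.sha512
    if len(sys.argv) == 3:
        print("OK" if verify(sys.argv[1], sys.argv[2]) else "MISMATCH")
```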

2. **Add GPU-enabling gVisor options** to the Docker daemon configuration (typically `/etc/docker/daemon.json`)

```json
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        },
        "runsc": {
            "path": "/usr/local/bin/runsc",
            "runtimeArgs": ["--nvproxy", "--nvproxy-docker", "-debug-log=/tmp/runsc/", "-debug", "-strace"]
        }
    }
}
```
Reload configs with `sudo systemctl reload docker`.
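
When the daemon configuration already contains other runtimes, the `runsc` entry has to be merged in rather than pasted over the file. The following is a rough sketch of that merge, using the runtime arguments from the JSON above. The helper name `with_runsc_runtime` is hypothetical, and reading from or writing to the real configuration file (commonly `/etc/docker/daemon.json`, which is an assumption about your setup) is left out on purpose.

```python
import json

# Runtime entry copied from the configuration shown above.
RUNSC_RUNTIME = {
    "path": "/usr/local/bin/runsc",
    "runtimeArgs": [
        "--nvproxy",
        "--nvproxy-docker",
        "-debug-log=/tmp/runsc/",
        "-debug",
        "-strace",
    ],
}


def with_runsc_runtime(config):
    """Return a copy of a Docker daemon config with the runsc runtime added."""
    merged = dict(config)
    # Copy the runtimes table so the caller's dictionary is left untouched.
    runtimes = dict(merged.get("runtimes", {}))
    runtimes["runsc"] = RUNSC_RUNTIME
    merged["runtimes"] = runtimes
    return merged


if __name__ == "__main__":
    existing = {
        "runtimes": {
            "nvidia": {"path": "nvidia-container-runtime", "runtimeArgs": []}
        }
    }
    print(json.dumps(with_runsc_runtime(existing), indent=4))
```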

3. **Run reproduction NCCL test**

This test creates one main process and N peer processes. Each peer process sends a torch `Tensor` to the main process using NCCL.

```Dockerfile
# Dockerfile
FROM python:3.9.15-slim-bullseye

RUN pip install torch numpy
COPY <<EOF repro.py
import argparse
import datetime
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def setup(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    dist.init_process_group("nccl", rank=rank, world_size=world_size, timeout=datetime.timedelta(seconds=600))
    torch.cuda.set_device(rank)

def cleanup():
    dist.destroy_process_group()

def send_tensor(rank, world_size):
    try:
        setup(rank, world_size)

        # rank receiving all tensors
        target_rank = world_size - 1

        dist.barrier()

        tensor = torch.ones(5).cuda(rank)
        if rank < target_rank:
            print(f"[RANK {rank}] sending tensor: {tensor}")
            dist.send(tensor=tensor, dst=target_rank)
        elif rank == target_rank:
            for other_rank in range(target_rank):
                tensor = torch.zeros(5).cuda(target_rank)
                dist.recv(tensor=tensor, src=other_rank)
                print(f"[RANK {target_rank}] received tensor from rank={other_rank}: {tensor}")

            print("PASS: NCCL working.")

    except Exception as e:
        print(f"[RANK {rank}] error in send_tensor: {e}")
        raise
    finally:
        cleanup()

def main(world_size: int = 2):
    mp.spawn(send_tensor, args=(world_size,), nprocs=world_size, join=True)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run torch-based NCCL tests")
    parser.add_argument("world_size", type=int, help="number of GPUs to run test on")
    args = parser.parse_args()

    if args.world_size < 2:
        raise RuntimeError(f"world_size needs to be larger than 1, got {args.world_size}")

    main(args.world_size)
EOF

ENTRYPOINT ["python", "repro.py", "4"]
```
Build image with:

```
docker build -f Dockerfile .
```

Then run it with:
```
sudo docker run -it --shm-size=2.00gb --runtime=runsc --gpus='"device=GPU-742ea7fc-dd4f-612c-e860-499bf200a815,GPU-94a801d8-7713-acf6-337d-338b7cfdf19e,GPU-0d19cef2-10ce-e445-a0be-3d330e36c1fd,GPU-ac5046fb-020c-93e8-2784-f44aedbc5bbd"' 040a44863fb1
```
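
The `--gpus` value in the command above is quoted twice: double quotes around the comma-separated device list, and single quotes so the shell passes those double quotes through. A small sketch of building that flag for an arbitrary list of GPU UUIDs follows; `gpus_flag` is a hypothetical helper, and the UUIDs in the usage example are placeholders rather than real devices.

```python
import shlex


def gpus_flag(uuids):
    """Build a --gpus flag for the given GPU UUIDs.

    The inner double quotes mirror the quoting used in the command above;
    shlex.quote adds the outer single quotes needed by the shell.
    """
    value = '"device=' + ",".join(uuids) + '"'
    return "--gpus=" + shlex.quote(value)


if __name__ == "__main__":
    # Placeholder UUIDs for illustration only.
    print(gpus_flag(["GPU-aaaa", "GPU-bbbb"]))
```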

#### Failure (truncated)
```
...
Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7edda14cf897 in /usr/local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5b3a23e (0x7edd8d73a23e in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x2c7 (0x7edd8d734c87 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7edd8d734f82 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7edd8d735fd1 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7edd8d6ea371 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7edd8d6ea371 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7edd8d6ea371 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #8: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7edd54da9189 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #9: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7edd54db0610 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #10: c10d::ProcessGroupNCCL::recv(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int, int) + 0x5f8 (0x7edd54dcf978 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #11: <unknown function> + 0x5adc309 (0x7edd8d6dc309 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x5ae6f10 (0x7edd8d6e6f10 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x5ae6fa5 (0x7edd8d6e6fa5 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0x5124446 (0x7edd8cd24446 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #15: <unknown function> + 0x1acf4b8 (0x7edd896cf4b8 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #16: <unknown function> + 0x5aee004 (0x7edd8d6ee004 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #17: <unknown function> + 0x5af36b5 (0x7edd8d6f36b5 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #18: <unknown function> + 0xd2fe8e (0x7edda032fe8e in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #19: <unknown function> + 0x47f074 (0x7edd9fa7f074 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #35: <unknown function> + 0x29d90 (0x7edda2029d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #36: __libc_start_main + 0x80 (0x7edda2029e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #37: <unknown function> + 0x108e (0x55f950b0c08e in /usr/local/bin/python)
. This may indicate a possible application crash on rank 0 or a network set up issue.
...
```

### Fix
gvisor debug logs show:

```
W0702 20:36:17.577055  445833 uvm.go:148] [  22:  84] nvproxy: unknown uvm ioctl 66 = 0x42
```
I've implemented that ioctl in this PR. This is the output after the fix.

```
[RANK 2] sending tensor: tensor([1., 1., 1., 1., 1.], device='cuda:2')
[RANK 0] sending tensor: tensor([1., 1., 1., 1., 1.], device='cuda:0')
[RANK 1] sending tensor: tensor([1., 1., 1., 1., 1.], device='cuda:1')
[RANK 3] received tensor from rank=0: tensor([1., 1., 1., 1., 1.], device='cuda:3')
[RANK 3] received tensor from rank=1: tensor([1., 1., 1., 1., 1.], device='cuda:3')
[RANK 3] received tensor from rank=2: tensor([1., 1., 1., 1., 1.], device='cuda:3')
PASS: NCCL working.
```
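
The warning shown in the Fix section is what identified the missing ioctl. To collect every such warning from a debug log in one pass, a sketch along these lines may help. The regular expression is derived only from the single log line quoted above, so treating it as the general format is an assumption, and `unknown_uvm_ioctls` is a hypothetical name.

```python
import re

# Derived from the one warning line quoted above; other formats are not handled.
UNKNOWN_UVM_IOCTL = re.compile(r"nvproxy: unknown uvm ioctl (\d+) = (0x[0-9a-fA-F]+)")


def unknown_uvm_ioctls(lines):
    """Return the sorted, de-duplicated unknown UVM ioctl numbers in lines."""
    found = set()
    for line in lines:
        match = UNKNOWN_UVM_IOCTL.search(line)
        if match:
            found.add(int(match.group(1)))
    return sorted(found)


if __name__ == "__main__":
    sample = (
        "W0702 20:36:17.577055  445833 uvm.go:148] [  22:  84] "
        "nvproxy: unknown uvm ioctl 66 = 0x42"
    )
    print(unknown_uvm_ioctls([sample]))
```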
FUTURE_COPYBARA_INTEGRATE_REVIEW=#10610 from luiscape:master ee88734
PiperOrigin-RevId: 649146570
copybara-service bot pushed a commit that referenced this issue Jul 8, 2024
Distributed training isn't working with PyTorch on certain A100 nodes.

Adds the missing ioctl `UVM_UNMAP_EXTERNAL` allowing for certain NCCL operations to succeed when using [`torch.distributed`](https://pytorch.org/docs/stable/distributed.html), fixing distributed training.

## Reproduction

This affects numerous A100 40GB and 80GB instances in our fleet. This reproduction requires 4 A100 GPUs, either 40GB or 80GB.

- **NVIDIA Driver Version**: 550.54.15
- **CUDA Version**: 12.4
- **NVIDIA device**: NVIDIA A100 80GB PCIe

### Steps

1. **Install gVisor**
```bash
ARCH="$(uname -m)"  # e.g. x86_64 or aarch64
URL="https://storage.googleapis.com/gvisor/releases/master/latest/${ARCH}"
wget -nc "${URL}/runsc" "${URL}/runsc.sha512"
sha512sum -c runsc.sha512  # verify the download before installing
chmod +x runsc
sudo cp runsc /usr/local/bin/runsc
sudo /usr/local/bin/runsc install
sudo systemctl reload docker
```

2. **Add GPU-enabling gVisor options**

In `/etc/docker/daemon.json`:

```json
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        },
        "runsc": {
            "path": "/usr/local/bin/runsc",
            "runtimeArgs": ["--nvproxy", "--nvproxy-docker", "-debug-log=/tmp/runsc/", "-debug", "-strace"]
        }
    }
}
```
Reload configs with `sudo systemctl reload docker`.
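
On hosts where `daemon.json` already carries other settings, the same runtime entry can be merged in with a script instead of by hand. This is a minimal sketch, not part of the original change: `add_runsc_runtime` is a hypothetical helper, and the default path is assumed to be where `runsc` was installed in step 1.

```python
import json

# Flags copied from the configuration above.
RUNSC_ARGS = ["--nvproxy", "--nvproxy-docker", "-debug-log=/tmp/runsc/", "-debug", "-strace"]


def add_runsc_runtime(config: dict, runsc_path: str = "/usr/local/bin/runsc") -> dict:
    """Return a copy of a daemon.json dict with the runsc runtime added or replaced."""
    merged = dict(config)
    runtimes = dict(merged.get("runtimes", {}))
    runtimes["runsc"] = {"path": runsc_path, "runtimeArgs": list(RUNSC_ARGS)}
    merged["runtimes"] = runtimes
    return merged


if __name__ == "__main__":
    existing = {"runtimes": {"nvidia": {"path": "nvidia-container-runtime", "runtimeArgs": []}}}
    print(json.dumps(add_runsc_runtime(existing), indent=4))
```

Existing runtimes such as `nvidia` are left untouched, and the input dict is not mutated, so the result can be diffed against the current file before writing it back.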

3. **Run reproduction NCCL test**

This test creates one main process and N peer processes. Each peer process sends a torch `Tensor` to the main process using NCCL.

```Dockerfile
# Dockerfile
FROM python:3.9.15-slim-bullseye

RUN pip install torch numpy
COPY <<EOF repro.py
import argparse
import datetime
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def setup(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    dist.init_process_group("nccl", rank=rank, world_size=world_size, timeout=datetime.timedelta(seconds=600))
    torch.cuda.set_device(rank)

def cleanup():
    # Only tear down if setup() got far enough to create the group.
    if dist.is_initialized():
        dist.destroy_process_group()

def send_tensor(rank, world_size):
    try:
        setup(rank, world_size)

        # rank receiving all tensors
        target_rank = world_size - 1

        dist.barrier()

        tensor = torch.ones(5).cuda(rank)
        if rank < target_rank:
            print(f"[RANK {rank}] sending tensor: {tensor}")
            dist.send(tensor=tensor, dst=target_rank)
        elif rank == target_rank:
            for other_rank in range(target_rank):
                tensor = torch.zeros(5).cuda(target_rank)
                dist.recv(tensor=tensor, src=other_rank)
                print(f"[RANK {target_rank}] received tensor from rank={other_rank}: {tensor}")

            print("PASS: NCCL working.")

    except Exception as e:
        print(f"[RANK {rank}] error in send_tensor: {e}")
        raise
    finally:
        cleanup()

def main(world_size: int = 2):
    mp.spawn(send_tensor, args=(world_size,), nprocs=world_size, join=True)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run torch-based NCCL tests")
    parser.add_argument("world_size", type=int, help="number of GPUs to run test on")
    args = parser.parse_args()

    if args.world_size < 2:
        raise RuntimeError(f"world_size must be at least 2, got {args.world_size}")

    main(args.world_size)
EOF

ENTRYPOINT ["python", "repro.py", "4"]
```
Build the image with:

```
docker build -f Dockerfile .
```

Then run it with:
```
sudo docker run -it --shm-size=2.00gb --runtime=runsc --gpus='"device=GPU-742ea7fc-dd4f-612c-e860-499bf200a815,GPU-94a801d8-7713-acf6-337d-338b7cfdf19e,GPU-0d19cef2-10ce-e445-a0be-3d330e36c1fd,GPU-ac5046fb-020c-93e8-2784-f44aedbc5bbd"' 040a44863fb1
```
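
The nested quoting in `--gpus` is easy to get wrong once the command is scripted. As a rough sketch, the same invocation can be built as an argument list so that no shell quoting is involved. `docker_run_args` is a hypothetical helper, and the UUIDs in the usage example are placeholders, not real devices.

```python
import shlex


def docker_run_args(image: str, gpu_uuids: list[str], runtime: str = "runsc",
                    shm_size: str = "2.00gb") -> list[str]:
    """Build an argv list equivalent to the `docker run` command above."""
    devices = ",".join(gpu_uuids)
    return [
        "docker", "run", "-it",
        f"--shm-size={shm_size}",
        f"--runtime={runtime}",
        # The inner double quotes are part of the value passed to Docker,
        # exactly as in the shell command above.
        f'--gpus="device={devices}"',
        image,
    ]


if __name__ == "__main__":
    args = docker_run_args("040a44863fb1", ["GPU-aaaa", "GPU-bbbb"])  # placeholder UUIDs
    print(shlex.join(args))
```

The list can be passed directly to `subprocess.run` (prefixed with `sudo` if needed). `shlex.join` is used only to print a copy-pasteable form.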

#### Failure (truncated)
```
...
Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7edda14cf897 in /usr/local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5b3a23e (0x7edd8d73a23e in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x2c7 (0x7edd8d734c87 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7edd8d734f82 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7edd8d735fd1 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7edd8d6ea371 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7edd8d6ea371 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7edd8d6ea371 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #8: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7edd54da9189 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #9: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7edd54db0610 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #10: c10d::ProcessGroupNCCL::recv(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int, int) + 0x5f8 (0x7edd54dcf978 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #11: <unknown function> + 0x5adc309 (0x7edd8d6dc309 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x5ae6f10 (0x7edd8d6e6f10 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x5ae6fa5 (0x7edd8d6e6fa5 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0x5124446 (0x7edd8cd24446 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #15: <unknown function> + 0x1acf4b8 (0x7edd896cf4b8 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #16: <unknown function> + 0x5aee004 (0x7edd8d6ee004 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #17: <unknown function> + 0x5af36b5 (0x7edd8d6f36b5 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #18: <unknown function> + 0xd2fe8e (0x7edda032fe8e in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #19: <unknown function> + 0x47f074 (0x7edd9fa7f074 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #35: <unknown function> + 0x29d90 (0x7edda2029d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #36: __libc_start_main + 0x80 (0x7edda2029e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #37: <unknown function> + 0x108e (0x55f950b0c08e in /usr/local/bin/python)
. This may indicate a possible application crash on rank 0 or a network set up issue.
...
```

### Fix
The gVisor debug logs show:

```
W0702 20:36:17.577055  445833 uvm.go:148] [  22:  84] nvproxy: unknown uvm ioctl 66 = 0x42
```
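
To collect every unimplemented UVM ioctl in one pass, rather than one failed run at a time, the debug logs can be scanned for this warning. This is a rough sketch: the pattern is inferred from the single log line above and may not match other gVisor versions, the log directory is the one set by `-debug-log` in the config, and `unknown_uvm_ioctls` is a hypothetical helper.

```python
import re
from collections import Counter
from pathlib import Path

# Pattern inferred from the one log line above; treat it as an assumption.
UNKNOWN_UVM = re.compile(r"nvproxy: unknown uvm ioctl (\d+)")


def unknown_uvm_ioctls(lines) -> Counter:
    """Count unknown UVM ioctl numbers reported in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = UNKNOWN_UVM.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


if __name__ == "__main__":
    totals = Counter()
    for log in Path("/tmp/runsc").glob("*"):
        if log.is_file():
            with log.open(errors="replace") as fh:
                totals.update(unknown_uvm_ioctls(fh))
    for number, count in sorted(totals.items()):
        print(f"uvm ioctl {number} ({number:#x}): seen {count} time(s)")
```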
I've implemented that ioctl in this PR. This is the output after the fix.

```
[RANK 2] sending tensor: tensor([1., 1., 1., 1., 1.], device='cuda:2')
[RANK 0] sending tensor: tensor([1., 1., 1., 1., 1.], device='cuda:0')
[RANK 1] sending tensor: tensor([1., 1., 1., 1., 1.], device='cuda:1')
[RANK 3] received tensor from rank=0: tensor([1., 1., 1., 1., 1.], device='cuda:3')
[RANK 3] received tensor from rank=1: tensor([1., 1., 1., 1., 1.], device='cuda:3')
[RANK 3] received tensor from rank=2: tensor([1., 1., 1., 1., 1.], device='cuda:3')
PASS: NCCL working.
```
FUTURE_COPYBARA_INTEGRATE_REVIEW=#10610 from luiscape:master ee88734
PiperOrigin-RevId: 649146570
copybara-service bot pushed a commit that referenced this issue Jul 13, 2024
Adds ioctls to fix a simple multi-GPU Hugging Face `accelerate` program that does not work on GCP H100s.

---

### System details

* **instance type:** `a3-highgpu-8g` (GCP, us-east4-a)
* **NVIDIA driver:** `Driver Version: 550.54.15      CUDA Version: 12.4`
* **NVIDIA device:**  4 x NVIDIA H100 HBM3
* **uname -a:** `Linux gcp-h100-us-east4-a-0-bb25baf985414f8899dfdfcb82d6796d 5.15.0-208.159.3.el9uek.x86_64 #2 SMP Wed Jun 19 09:05:13 PDT 2024 x86_64 x86_64 x86_64 GNU/Linux`

```
runsc version release-20240513.0-173-gc526d251933a-dirty
spec: 1.1.0-rc.1
```

---

## Reproduction steps

1. **Install gVisor**

2. **Add GPU-enabling gVisor options**

In `/etc/docker/daemon.json`:

```json
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        },
        "runsc": {
            "path": "/home/modal/runsc",
            "runtimeArgs": ["--nvproxy", "--nvproxy-docker", "-debug-log=/tmp/runsc/", "-debug", "-strace"]
        }
    }
}
```

3. **Run the Dockerfile**

```Dockerfile
# Dockerfile
FROM winglian/axolotl@sha256:5c724f7accd8188b0f84ead93b7efbfa8f8661f40e133646bd6d946bc3423d6d
RUN pip install fastapi==0.111.0
RUN pip install huggingface-hub~=0.23.0 pydantic==2.6.3 python-dateutil
ENV HUGGINGFACE_HUB_CACHE="/pretrained"
ENV TQDM_DISABLE="true"
ENV AXOLOTL_NCCL_TIMEOUT="60"

COPY <<EOF repro.py
import os
import subprocess
from pathlib import Path

print("[MOD-3226] hello from the repro!!!")
from accelerate import Accelerator

accelerator = Accelerator()

with accelerator.main_process_first():
    print(f"hello! {accelerator.process_index}")
EOF

ENTRYPOINT ["accelerate", "launch", "repro.py"]
```

Build the image, then run it with `RUNTIME` set to the runtime under test (`runc` or `runsc`):

```
sudo docker run -it --runtime=$RUNTIME --gpus='"device=GPU-c453e5c7-a56d-70bf-78ce-61be6cb8e0db,GPU-4703196a-e3df-9e3f-bb8b-6fa91c8e9970,GPU-4a9c162c-9280-eaa8-215a-2c681e82a99f,GPU-1660d344-e18b-e48a-cced-38380e903c31"' ce4326479c8412b13bba27416e3e77093d4411279b432ca1b25050f17ef57a67
```

### Results

**`runc`**

```
sudo docker run -it  --gpus='"device=GPU-c453e5c7-a56d-70bf-78ce-61be6cb8e0db,GPU-4703196a-e3df-9e3f-bb8b-6fa91c8e9970,GPU-4a9c162c-9280-eaa8-215a-2c681e82a99f,GPU-1660d344-e18b-e48a-cced-38380e903c31"' ce4326479c8412b13bba27416e3e77093d4411279b432ca1b25050f17ef57a67
The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_processes` was set to a value of `4`
		More than one GPU was found, enabling multi-GPU training.
		If this was unintended please pass in `--num_processes=1`.
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
hello! 0
hello! 1
hello! 2hello! 3
```

**`runsc` (main)**

<details> <summary>💥 Failure logs</summary>

```
sudo docker run -it --runtime=runsc --gpus='"device=GPU-c453e5c7-a56d-70bf-78ce-61be6cb8e0db,GPU-4703196a-e3df-9e3f-bb8b-6fa91c8e9970,GPU-4a9c162c-9280-eaa8-215a-2c681e82a99f,GPU-1660d344-e18b-e48a-cced-38380e903c31"' ce4326479c8412b13bba27416e3e77093d4411279b432ca1b25050f17ef57a67
The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_processes` was set to a value of `4`
		More than one GPU was found, enabling multi-GPU training.
		If this was unintended please pass in `--num_processes=1`.
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
Detected kernel version 4.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
hello! 0
Traceback (most recent call last):
  File "/workspace/axolotl/repro.py", line 10, in <module>
    with accelerator.main_process_first():
  File "/root/miniconda3/envs/py3.10/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/accelerator.py", line 884, in main_process_first
    with self.state.main_process_first():
  File "/root/miniconda3/envs/py3.10/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/state.py", line 1056, in main_process_first
    with PartialState().main_process_first():
  File "/root/miniconda3/envs/py3.10/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/state.py", line 502, in main_process_first
    yield from self._goes_first(self.is_main_process)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/state.py", line 390, in _goes_first
    self.wait_for_everyone()
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/state.py", line 379, in wait_for_everyone
    torch.distributed.barrier()
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
    work = default_pg.barrier(opts=opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.6
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'unknown error'
[2024-07-11 19:52:01,530] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 68 closing signal SIGTERM
[2024-07-11 19:52:01,532] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 69 closing signal SIGTERM
[2024-07-11 19:52:01,533] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 70 closing signal SIGTERM
[2024-07-11 19:52:02,108] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 67) of binary: /root/miniconda3/envs/py3.10/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/py3.10/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1073, in launch_command
    multi_gpu_launcher(args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 718, in multi_gpu_launcher
    distrib_run.run(args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
repro.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-11_19:52:01
  host      : d45a08528293
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 67)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```

---

</details>

**`runsc` (this pull request)**

<details> <summary>✅ Success logs</summary>

```
[modal@gcp-h100-us-east4-a-0-bb25baf985414f8899dfdfcb82d6796d ~]$ sudo docker run -it --runtime=runsc --gpus='"device=GPU-c453e5c7-a56d-70bf-78ce-61be6cb8e0db,GPU-4703196a-e3df-9e3f-bb8b-6fa91c8e9970,GPU-4a9c162c-9280-eaa8-215a-2c681e82a99f,GPU-1660d344-e18b-e48a-cced-38380e903c31"' ce4326479c8412b13bba27416e3e77093d4411279b432ca1b25050f17ef57a67
The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_processes` was set to a value of `4`
		More than one GPU was found, enabling multi-GPU training.
		If this was unintended please pass in `--num_processes=1`.
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
Detected kernel version 4.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
hello! 0
hello! 1
hello! 3hello! 2
```

</details>
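
Because the ranks print concurrently, messages can run together on one line (`hello! 3hello! 2`), so comparing the output line by line is unreliable. Below is a small sketch of a check that every rank got through `main_process_first()`, assuming four processes as in the logs above. `ranks_reported` is a hypothetical helper, not part of this change.

```python
import re


def ranks_reported(output: str) -> set[int]:
    """Return the process indices that printed a `hello! N` message."""
    # findall tolerates messages fused onto one line, e.g. "hello! 3hello! 2".
    return {int(n) for n in re.findall(r"hello! (\d+)", output)}


if __name__ == "__main__":
    success = "hello! 0\nhello! 1\nhello! 3hello! 2\n"
    failure = "hello! 0\nTraceback (most recent call last):\n"
    expected = set(range(4))
    print(ranks_reported(success) == expected)  # True
    print(ranks_reported(failure) == expected)  # False
```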

FUTURE_COPYBARA_INTEGRATE_REVIEW=#10649 from thundergolfer:master d3d19f1
PiperOrigin-RevId: 651754677
copybara-service bot pushed a commit that referenced this issue Jul 13, 2024
Adds ioctls to fix a simple multi-GPU Hugging Face `accelerate` program that does not work on GCP H100s.

---

### System details

* **instance type:** `a3-highgpu-8g` (GCP, us-east4-a)
* **NVIDIA driver:** `Driver Version: 550.54.15      CUDA Version: 12.4`
* **NVIDIA device:**  4 x NVIDIA H100 HBM3
* **uname -a:** `Linux gcp-h100-us-east4-a-0-bb25baf985414f8899dfdfcb82d6796d 5.15.0-208.159.3.el9uek.x86_64 #2 SMP Wed Jun 19 09:05:13 PDT 2024 x86_64 x86_64 x86_64 GNU/Linux`

```
runsc version release-20240513.0-173-gc526d251933a-dirty
spec: 1.1.0-rc.1
```

---

## Reproduction steps

1. **Install gVisor**

2. **Add GPU-enabling gVisor options**

In `/etc/docker/daemon.json`:

```json
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        },
        "runsc": {
            "path": "/home/modal/runsc",
            "runtimeArgs": ["--nvproxy", "--nvproxy-docker", "-debug-log=/tmp/runsc/", "-debug", "-strace"]
        }
    }
}
```

3. **Run the Dockerfile**

```Dockerfile
# Dockerfile
FROM winglian/axolotl@sha256:5c724f7accd8188b0f84ead93b7efbfa8f8661f40e133646bd6d946bc3423d6d
RUN pip install fastapi==0.111.0
RUN pip install huggingface-hub~=0.23.0 pydantic==2.6.3 python-dateutil
ENV HUGGINGFACE_HUB_CACHE="/pretrained"
ENV TQDM_DISABLE="true"
ENV AXOLOTL_NCCL_TIMEOUT="60"

COPY <<EOF repro.py
import os
import subprocess
from pathlib import Path

print("[MOD-3226] hello from the repro!!!")
from accelerate import Accelerator

accelerator = Accelerator()

with accelerator.main_process_first():
    print(f"hello! {accelerator.process_index}")
EOF

ENTRYPOINT ["accelerate", "launch", "repro.py"]
```

Build the image, then run it with `RUNTIME` set to the runtime under test (`runc` or `runsc`):

```
sudo docker run -it --runtime=$RUNTIME --gpus='"device=GPU-c453e5c7-a56d-70bf-78ce-61be6cb8e0db,GPU-4703196a-e3df-9e3f-bb8b-6fa91c8e9970,GPU-4a9c162c-9280-eaa8-215a-2c681e82a99f,GPU-1660d344-e18b-e48a-cced-38380e903c31"' ce4326479c8412b13bba27416e3e77093d4411279b432ca1b25050f17ef57a67
```

### Results

**`runc`**

```
sudo docker run -it  --gpus='"device=GPU-c453e5c7-a56d-70bf-78ce-61be6cb8e0db,GPU-4703196a-e3df-9e3f-bb8b-6fa91c8e9970,GPU-4a9c162c-9280-eaa8-215a-2c681e82a99f,GPU-1660d344-e18b-e48a-cced-38380e903c31"' ce4326479c8412b13bba27416e3e77093d4411279b432ca1b25050f17ef57a67
The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_processes` was set to a value of `4`
		More than one GPU was found, enabling multi-GPU training.
		If this was unintended please pass in `--num_processes=1`.
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
hello! 0
hello! 1
hello! 2hello! 3
```

**`runsc` (main)**

<details> <summary>💥 Failure logs</summary>

```
sudo docker run -it --runtime=runsc --gpus='"device=GPU-c453e5c7-a56d-70bf-78ce-61be6cb8e0db,GPU-4703196a-e3df-9e3f-bb8b-6fa91c8e9970,GPU-4a9c162c-9280-eaa8-215a-2c681e82a99f,GPU-1660d344-e18b-e48a-cced-38380e903c31"' ce4326479c8412b13bba27416e3e77093d4411279b432ca1b25050f17ef57a67
The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_processes` was set to a value of `4`
		More than one GPU was found, enabling multi-GPU training.
		If this was unintended please pass in `--num_processes=1`.
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
Detected kernel version 4.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
hello! 0
Traceback (most recent call last):
  File "/workspace/axolotl/repro.py", line 10, in <module>
    with accelerator.main_process_first():
  File "/root/miniconda3/envs/py3.10/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/accelerator.py", line 884, in main_process_first
    with self.state.main_process_first():
  File "/root/miniconda3/envs/py3.10/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/state.py", line 1056, in main_process_first
    with PartialState().main_process_first():
  File "/root/miniconda3/envs/py3.10/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/state.py", line 502, in main_process_first
    yield from self._goes_first(self.is_main_process)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/state.py", line 390, in _goes_first
    self.wait_for_everyone()
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/state.py", line 379, in wait_for_everyone
    torch.distributed.barrier()
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
    work = default_pg.barrier(opts=opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.6
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'unknown error'
[2024-07-11 19:52:01,530] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 68 closing signal SIGTERM
[2024-07-11 19:52:01,532] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 69 closing signal SIGTERM
[2024-07-11 19:52:01,533] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 70 closing signal SIGTERM
[2024-07-11 19:52:02,108] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 67) of binary: /root/miniconda3/envs/py3.10/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/py3.10/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1073, in launch_command
    multi_gpu_launcher(args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 718, in multi_gpu_launcher
    distrib_run.run(args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
repro.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-11_19:52:01
  host      : d45a08528293
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 67)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```

---

</details>

**`runsc` (this pull request)**

<details> <summary>✅ Success logs</summary>

```
[modal@gcp-h100-us-east4-a-0-bb25baf985414f8899dfdfcb82d6796d ~]$ sudo docker run -it --runtime=runsc --gpus='"device=GPU-c453e5c7-a56d-70bf-78ce-61be6cb8e0db,GPU-4703196a-e3df-9e3f-bb8b-6fa91c8e9970,GPU-4a9c162c-9280-eaa8-215a-2c681e82a99f,GPU-1660d344-e18b-e48a-cced-38380e903c31"' ce4326479c8412b13bba27416e3e77093d4411279b432ca1b25050f17ef57a67
The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_processes` was set to a value of `4`
		More than one GPU was found, enabling multi-GPU training.
		If this was unintended please pass in `--num_processes=1`.
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
Detected kernel version 4.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
hello! 0
hello! 1
hello! 3hello! 2
```

</details>

FUTURE_COPYBARA_INTEGRATE_REVIEW=#10649 from thundergolfer:master d3d19f1
PiperOrigin-RevId: 651754677
copybara-service bot pushed a commit that referenced this issue Jul 15, 2024
Adding ioctls to fix a simple multi-GPU Hugging Face `accelerate` program that does not work under `runsc` on GCP H100s.

---

### System details

* **instance type:** `a3-highgpu-8g` (GCP, us-east4-a)
* **NVIDIA driver:** `Driver Version: 550.54.15      CUDA Version: 12.4`
* **NVIDIA device:**  4 x NVIDIA H100 HBM3
* **uname -a:** `Linux gcp-h100-us-east4-a-0-bb25baf985414f8899dfdfcb82d6796d 5.15.0-208.159.3.el9uek.x86_64 #2 SMP Wed Jun 19 09:05:13 PDT 2024 x86_64 x86_64 x86_64 GNU/Linux`

```
runsc version release-20240513.0-173-gc526d251933a-dirty
spec: 1.1.0-rc.1
```

---

## Reproduction steps

**1. Install gVisor**

**2. Add GPU-enabling gVisor options**

In `/etc/docker/daemon.json`:

```json
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        },
        "runsc": {
            "path": "/home/modal/runsc",
            "runtimeArgs": ["--nvproxy", "--nvproxy-docker", "-debug-log=/tmp/runsc/", "-debug", "-strace"]
        }
    }
}
```
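With `-debug`, `-strace` and `-debug-log=/tmp/runsc/` set as above, the Sentry's syscall trace should end up in the boot log. Earlier in this thread the suggestion was to look in `runsc.log.*.boot` for lines matching `ioctl.*error`. A rough sketch of that search in Python follows; `find_failed_ioctls` is a made-up helper name, and the glob is an assumption (it tolerates a trailing suffix after `.boot`), so adjust it to whatever files actually appear in the log directory.

```python
import re
from pathlib import Path

# Pattern taken from the pointer earlier in this thread ("ioctl.*error").
FAILED_IOCTL = re.compile(r"ioctl.*error")


def find_failed_ioctls(log_dir, glob="runsc.log.*.boot*"):
    """Return (file name, line number, line) for each matching log line.

    The default glob is an assumption about how the boot logs are named.
    """
    hits = []
    for path in sorted(Path(log_dir).glob(glob)):
        # errors="replace" so stray non-UTF-8 bytes don't abort the scan.
        with path.open(errors="replace") as f:
            for lineno, line in enumerate(f, start=1):
                if FAILED_IOCTL.search(line):
                    hits.append((path.name, lineno, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # /tmp/runsc/ is the -debug-log directory from daemon.json above.
    for name, lineno, line in find_failed_ioctls("/tmp/runsc/"):
        print(f"{name}:{lineno}: {line}")
```

This only narrows down where to look; which ioctls are actually missing still has to be read off the matching lines.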

**3. Build and run the Dockerfile**

```Dockerfile
# Dockerfile
FROM winglian/axolotl@sha256:5c724f7accd8188b0f84ead93b7efbfa8f8661f40e133646bd6d946bc3423d6d
RUN pip install fastapi==0.111.0
RUN pip install huggingface-hub~=0.23.0 pydantic==2.6.3 python-dateutil
ENV HUGGINGFACE_HUB_CACHE="/pretrained"
ENV TQDM_DISABLE="true"
ENV AXOLOTL_NCCL_TIMEOUT="60"

COPY <<EOF repro.py
import os
import subprocess
from pathlib import Path

print("[MOD-3226] hello from the repro!!!")
from accelerate import Accelerator

accelerator = Accelerator()

with accelerator.main_process_first():
    print(f"hello! {accelerator.process_index}")
EOF

ENTRYPOINT ["accelerate", "launch", "repro.py"]
```

```
sudo docker run -it --runtime=$RUNTIME --gpus='"device=GPU-c453e5c7-a56d-70bf-78ce-61be6cb8e0db,GPU-4703196a-e3df-9e3f-bb8b-6fa91c8e9970,GPU-4a9c162c-9280-eaa8-215a-2c681e82a99f,GPU-1660d344-e18b-e48a-cced-38380e903c31"' ce4326479c8412b13bba27416e3e77093d4411279b432ca1b25050f17ef57a67
```
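To rerun the same command across runtimes without retyping the GPU list, the argv can be assembled programmatically. This is only a sketch: `docker_run_argv` is a hypothetical helper, and the UUIDs and image name in the usage lines are placeholders, not the real ones. Note that the shell quoting `--gpus='"device=..."'` passes the inner double quotes through to docker literally, so the list form keeps them.

```python
import shlex


def docker_run_argv(image, gpu_uuids, runtime=None):
    """Build the `docker run` argv used in this repro.

    runtime=None omits --runtime, which matches the `runc` run below
    (Docker's default runtime).
    """
    argv = ["sudo", "docker", "run", "-it"]
    if runtime is not None:
        argv.append(f"--runtime={runtime}")
    # The double quotes are part of the argument value itself.
    argv.append('--gpus="device=' + ",".join(gpu_uuids) + '"')
    argv.append(image)
    return argv


if __name__ == "__main__":
    # Placeholder UUIDs and image; substitute your own.
    gpus = ["GPU-aaaa", "GPU-bbbb"]
    for runtime in (None, "runsc"):
        print(shlex.join(docker_run_argv("my-repro-image", gpus, runtime)))
```

The resulting list can be passed to `subprocess.run` as-is, which avoids the nested shell quoting.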

### Results

**`runc`**

```
sudo docker run -it  --gpus='"device=GPU-c453e5c7-a56d-70bf-78ce-61be6cb8e0db,GPU-4703196a-e3df-9e3f-bb8b-6fa91c8e9970,GPU-4a9c162c-9280-eaa8-215a-2c681e82a99f,GPU-1660d344-e18b-e48a-cced-38380e903c31"' ce4326479c8412b13bba27416e3e77093d4411279b432ca1b25050f17ef57a67
The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_processes` was set to a value of `4`
		More than one GPU was found, enabling multi-GPU training.
		If this was unintended please pass in `--num_processes=1`.
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
hello! 0
hello! 1
hello! 2hello! 3
```

**`runsc` (main)**

<details> <summary>💥 Failure logs</summary>

```
sudo docker run -it --runtime=runsc --gpus='"device=GPU-c453e5c7-a56d-70bf-78ce-61be6cb8e0db,GPU-4703196a-e3df-9e3f-bb8b-6fa91c8e9970,GPU-4a9c162c-9280-eaa8-215a-2c681e82a99f,GPU-1660d344-e18b-e48a-cced-38380e903c31"' ce4326479c8412b13bba27416e3e77093d4411279b432ca1b25050f17ef57a67
The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_processes` was set to a value of `4`
		More than one GPU was found, enabling multi-GPU training.
		If this was unintended please pass in `--num_processes=1`.
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
Detected kernel version 4.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
hello! 0
Traceback (most recent call last):
  File "/workspace/axolotl/repro.py", line 10, in <module>
    with accelerator.main_process_first():
  File "/root/miniconda3/envs/py3.10/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/accelerator.py", line 884, in main_process_first
    with self.state.main_process_first():
  File "/root/miniconda3/envs/py3.10/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/state.py", line 1056, in main_process_first
    with PartialState().main_process_first():
  File "/root/miniconda3/envs/py3.10/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/state.py", line 502, in main_process_first
    yield from self._goes_first(self.is_main_process)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/state.py", line 390, in _goes_first
    self.wait_for_everyone()
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/state.py", line 379, in wait_for_everyone
    torch.distributed.barrier()
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
    work = default_pg.barrier(opts=opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.6
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'unknown error'
[2024-07-11 19:52:01,530] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 68 closing signal SIGTERM
[2024-07-11 19:52:01,532] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 69 closing signal SIGTERM
[2024-07-11 19:52:01,533] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 70 closing signal SIGTERM
[2024-07-11 19:52:02,108] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 67) of binary: /root/miniconda3/envs/py3.10/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/py3.10/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1073, in launch_command
    multi_gpu_launcher(args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 718, in multi_gpu_launcher
    distrib_run.run(args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
repro.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-11_19:52:01
  host      : d45a08528293
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 67)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```

---

</details>
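A note on why `hello! 0` still prints in the failing run: per the traceback, the error is raised while leaving `main_process_first()`, where `_goes_first` calls `wait_for_everyone()` and that calls `torch.distributed.barrier()`. So rank 0 runs its body first and only then hits the barrier, which is where NCCL fails. The sketch below emulates that ordering using only the standard library (`threading.Barrier` standing in for the distributed barrier). It is not accelerate's code, and `run_ranks` and `goes_first` are hypothetical names.

```python
import threading
from contextlib import contextmanager


def run_ranks(world_size=4):
    """Emulate 'main process first' ordering; returns the order bodies ran."""
    barrier = threading.Barrier(world_size)  # stand-in for the NCCL barrier
    order = []
    lock = threading.Lock()

    @contextmanager
    def goes_first(is_main):
        if not is_main:
            barrier.wait()  # other ranks block until the main rank arrives
        yield
        if is_main:
            barrier.wait()  # main rank reaches the barrier after its body

    def worker(rank):
        with goes_first(rank == 0):
            with lock:
                order.append(rank)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order


if __name__ == "__main__":
    # Rank 0 is always first; the remaining ranks can appear in any order.
    print(run_ranks())
```

This matches the logs: rank 0 always prints first, and the remaining ranks print in arbitrary (sometimes interleaved) order once the barrier releases.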

**`runsc` (this pull request)**

<details> <summary>✅ Success logs</summary>

```
[modal@gcp-h100-us-east4-a-0-bb25baf985414f8899dfdfcb82d6796d ~]$ sudo docker run -it --runtime=runsc --gpus='"device=GPU-c453e5c7-a56d-70bf-78ce-61be6cb8e0db,GPU-4703196a-e3df-9e3f-bb8b-6fa91c8e9970,GPU-4a9c162c-9280-eaa8-215a-2c681e82a99f,GPU-1660d344-e18b-e48a-cced-38380e903c31"' ce4326479c8412b13bba27416e3e77093d4411279b432ca1b25050f17ef57a67
The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_processes` was set to a value of `4`
		More than one GPU was found, enabling multi-GPU training.
		If this was unintended please pass in `--num_processes=1`.
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
Detected kernel version 4.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
hello! 0
hello! 1
hello! 3hello! 2
```

</details>

FUTURE_COPYBARA_INTEGRATE_REVIEW=#10649 from thundergolfer:master d3d19f1
PiperOrigin-RevId: 651754677
copybara-service bot pushed a commit that referenced this issue Oct 23, 2024
Attempt #2.

This runs in continuous mode only.

PiperOrigin-RevId: 689056926
copybara-service bot pushed a commit that referenced this issue Oct 24, 2024
Attempt #2.

This runs in continuous mode only.

PiperOrigin-RevId: 689056926
copybara-service bot pushed a commit that referenced this issue Oct 30, 2024
Attempt #2.

This runs in continuous mode only.

PiperOrigin-RevId: 689056926
copybara-service bot pushed a commit that referenced this issue Oct 30, 2024
Attempt #2.

This runs in continuous mode only.

PiperOrigin-RevId: 691516066