ElasticSearch does not run #2
Comments
Can't guarantee anything will come out of it, but I can take a look into this.
That would be great! Here are some pointers: Follow these instructions to enable strace and run elasticsearch. Look for the log file. The entry point for the ioctl syscall in the Sentry is here.
Note: the elasticsearch image was failing to start because it could not find a loopback interface.
glibc's malloc also uses SYS_TIME. Permit it.
#0 0x0000000000de6267 in time ()
#1 0x0000000000db19d8 in get_nprocs ()
#2 0x0000000000d8a31a in arena_get2.part ()
#3 0x0000000000d8ab4a in malloc ()
#4 0x0000000000d3c6b5 in __sanitizer::InternalAlloc(unsigned long, __sanitizer::SizeClassAllocatorLocalCache<__sanitizer::SizeClassAllocator32<0ul, 140737488355328ull, 0ul, __sanitizer::SizeClassMap<3ul, 4ul, 8ul, 17ul, 64ul, 14ul>, 20ul, __sanitizer::TwoLevelByteMap<32768ull, 4096ull, __sanitizer::NoOpMapUnmapCallback>, __sanitizer::NoOpMapUnmapCallback> >*, unsigned long) ()
#5 0x0000000000d4cd70 in __tsan_go_start ()
#6 0x00000000004617a3 in racecall ()
#7 0x00000000010f4ea0 in runtime.findfunctab ()
#8 0x000000000043f193 in runtime.racegostart ()
Signed-off-by: Dmitry Vyukov <[email protected]>
[[email protected]: updated comments and commit message]
Signed-off-by: Michael Pratt <[email protected]>
Change-Id: Ibe2d0dc3035bf5052d5fb802cfaa37c5e0e7a09a
PiperOrigin-RevId: 203042627
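As a rough illustration of what "permit it" means here, below is a minimal Go sketch of a syscall-allowlist check on linux/amd64. It is illustrative only and does not reproduce gVisor's actual seccomp rule set or API; only the fact that SYS_TIME must be allowed comes from the commit above.

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// allowedSyscalls is an illustrative allowlist in the spirit of a Sentry
// seccomp filter; it is not gVisor's real rule set.
var allowedSyscalls = map[uintptr]bool{
	unix.SYS_READ:  true,
	unix.SYS_WRITE: true,
	// glibc's malloc -> get_nprocs() -> time() path means SYS_TIME must be
	// permitted, which is what the commit above does.
	unix.SYS_TIME: true,
}

// checkSyscall reports whether a syscall number would pass the filter.
func checkSyscall(nr uintptr) error {
	if !allowedSyscalls[nr] {
		return fmt.Errorf("syscall %d blocked by filter", nr)
	}
	return nil
}

func main() {
	fmt.Println(checkSyscall(unix.SYS_TIME)) // <nil>: now permitted
}
```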
Closes google#2
PiperOrigin-RevId: 202997196
Change-Id: I0c9f6f5a8a1abe1ae427bca5f590bdf9f82a6675
The following command under hostinet networking leads to a panic:
$ cat /proc/net/tcp
It's caused by the wrong SizeOfTCPInfo.
#0 runtime.panicindex()
#1 encoding/binary.littleEndian.Uint64
#2 encoding/binary.(*littleEndian).Uint64
#3 gvisor.dev/gvisor/pkg/binary.unmarshal
#4 gvisor.dev/gvisor/pkg/binary.unmarshal
#5 gvisor.dev/gvisor/pkg/binary.Unmarshal
#6 gvisor.dev/gvisor/pkg/sentry/socket/hostinet.(*socketOperations).State
#7 gvisor.dev/gvisor/pkg/sentry/fs/proc.(*netTCP).ReadSeqFileData
Correct SizeOfTCPInfo from 104 to 192 to fix it.
Fixes google#640
Signed-off-by: Jianfeng Tan <[email protected]>
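A minimal Go sketch (not gVisor's code; the constant names and the field offset are illustrative) of why an undersized SizeOfTCPInfo makes a field-by-field unmarshal of the kernel's tcp_info panic with a bounds error, and why sizing the buffer at 192 bytes avoids it:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Illustrative sizes only: the value used before the fix (104) and the size
// of the struct tcp_info the code actually needs to parse (192).
const (
	oldSizeOfTCPInfo = 104
	newSizeOfTCPInfo = 192
)

// readField mimics what a field-by-field unmarshal does: read an 8-byte field
// at a given offset. With a buffer sized by the old constant, any field past
// the 104-byte prefix indexes out of range and panics, which matches the
// backtrace above.
func readField(buf []byte, offset int) uint64 {
	return binary.LittleEndian.Uint64(buf[offset : offset+8])
}

func main() {
	short := make([]byte, oldSizeOfTCPInfo)

	defer func() {
		if r := recover(); r != nil {
			fmt.Println("panic with 104-byte buffer:", r)
		}
		// With a correctly sized buffer the same read succeeds.
		ok := make([]byte, newSizeOfTCPInfo)
		fmt.Println("read with 192-byte buffer:", readField(ok, 112))
	}()
	_ = readField(short, 112) // hypothetical field living past offset 104
}
```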
Add fpsimd support to the KVM module so that the test case "TestKernelFloatingPoint" can pass on the Arm64 platform.
Signed-off-by: Bin Lu <[email protected]>
FUTURE_COPYBARA_INTEGRATE_REVIEW=#1707 from lubinszARM:pr_lazy_fpsimd_2 bf87da8
PiperOrigin-RevId: 300843308
Before this change:
```
$ docker run --runtime=runsc --rm -it -v ~/tmp:/hosttmp ubuntu:focal /hosttmp/issue5732 --bytes1=128 --bytes2=1024
#1: read(128) = 128
#2: read(1024) = EOF
$ docker run --runtime=runsc-vfs2 --rm -it -v ~/tmp:/hosttmp ubuntu:focal /hosttmp/issue5732 --bytes1=128 --bytes2=1024
#1: read(128) = 128
#2: read(1024) = 256
```
After this change:
```
$ docker run --runtime=runsc --rm -it -v ~/tmp:/hosttmp ubuntu:focal /hosttmp/issue5732 --bytes1=128 --bytes2=1024
#1: read(128) = 128
#2: read(1024) = 256
$ docker run --runtime=runsc-vfs2 --rm -it -v ~/tmp:/hosttmp ubuntu:focal /hosttmp/issue5732 --bytes1=128 --bytes2=1024
#1: read(128) = 128
#2: read(1024) = 256
```
Fixes #5732
PiperOrigin-RevId: 365178386
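For reference, a small Go sketch of what a repro like issue5732 plausibly does to produce the output above: two sequential reads whose sizes come from the flags and whose byte counts are printed. The real repro's source and target file are not shown here, so the path flag and file contents are assumptions.

```go
package main

import (
	"flag"
	"fmt"
	"io"
	"os"
)

func main() {
	bytes1 := flag.Int("bytes1", 128, "size of the first read")
	bytes2 := flag.Int("bytes2", 1024, "size of the second read")
	// The file being read is an assumption; the actual repro's target is not shown.
	path := flag.String("path", "/hosttmp/testfile", "file to read")
	flag.Parse()

	f, err := os.Open(*path)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	for i, n := range []int{*bytes1, *bytes2} {
		buf := make([]byte, n)
		got, err := f.Read(buf)
		switch {
		case err == io.EOF:
			fmt.Printf("#%d: read(%d) = EOF\n", i+1, n)
		case err != nil:
			fmt.Printf("#%d: read(%d) = error: %v\n", i+1, n, err)
		default:
			fmt.Printf("#%d: read(%d) = %d\n", i+1, n, got)
		}
	}
}
```

With a 384-byte file, the second read should return the remaining 256 bytes rather than EOF, which is the behavior the change restores.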
Distributed training isn't working with PyTorch on certain A100 nodes.

Adds the missing ioctl `UVM_UNMAP_EXTERNAL` allowing for certain NCCL operations to succeed when using [`torch.distributed`](https://pytorch.org/docs/stable/distributed.html), fixing distributed training.

## Reproduction

This affects numerous A100 40GB and 80GB instances in our fleet. This reproduction requires 4 A100 GPUs, either 40GB or 80GB.

- **NVIDIA Driver Version**: 550.54.15
- **CUDA Version**: 12.4
- **NVIDIA device**: NVIDIA A100 80GB PCIe

### Steps

1. **Install gvisor**

```bash
URL="https://storage.googleapis.com/gvisor/releases/master/latest/${ARCH}"
wget -nc "${URL}/runsc" "${URL}/runsc.sha512"
chmod +x runsc
sudo cp runsc /usr/local/bin/runsc
sudo /usr/local/bin/runsc install
sudo systemctl reload docker
```

2. **Add GPU enabling gvisor options**

```json
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        },
        "runsc": {
            "path": "/usr/local/bin/runsc",
            "runtimeArgs": ["--nvproxy", "--nvproxy-docker", "-debug-log=/tmp/runsc/", "-debug", "-strace"]
        }
    }
}
```

Reload configs with `sudo systemctl reload docker`.

3. **Run reproduction NCCL test**

This test creates one main process and N peer processes. Each peer process sends a torch `Tensor` to the main process using NCCL.

```Dockerfile
# Dockerfile
FROM python:3.9.15-slim-bullseye
RUN pip install torch numpy
COPY <<EOF repro.py
import argparse
import datetime
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def setup(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    dist.init_process_group("nccl", rank=rank, world_size=world_size, timeout=datetime.timedelta(seconds=600))
    torch.cuda.set_device(rank)

def cleanup():
    dist.destroy_process_group()

def send_tensor(rank, world_size):
    try:
        setup(rank, world_size)
        # rank receiving all tensors
        target_rank = world_size - 1
        dist.barrier()
        tensor = torch.ones(5).cuda(rank)
        if rank < target_rank:
            print(f"[RANK {rank}] sending tensor: {tensor}")
            dist.send(tensor=tensor, dst=target_rank)
        elif rank == target_rank:
            for other_rank in range(target_rank):
                tensor = torch.zeros(5).cuda(target_rank)
                dist.recv(tensor=tensor, src=other_rank)
                print(f"[RANK {target_rank}] received tensor from rank={other_rank}: {tensor}")
            print("PASS: NCCL working.")
    except Exception as e:
        print(f"[RANK {rank}] error in send_tensor: {e}")
        raise
    finally:
        cleanup()

def main(world_size: int = 2):
    mp.spawn(send_tensor, args=(world_size,), nprocs=world_size, join=True)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run torch-based NCCL tests")
    parser.add_argument("world_size", type=int, help="number of GPUs to run test on")
    args = parser.parse_args()
    if args.world_size < 2:
        raise RuntimeError(f"world_size needs to be larger than 1 {args.world_size}")
    main(args.world_size)
EOF
ENTRYPOINT ["python", "repro.py", "4"]
```

Build image with:

```
docker build -f Dockerfile .
```

Then run it with:

```
sudo docker run -it --shm-size=2.00gb --runtime=runsc --gpus='"device=GPU-742ea7fc-dd4f-612c-e860-499bf200a815,GPU-94a801d8-7713-acf6-337d-338b7cfdf19e,GPU-0d19cef2-10ce-e445-a0be-3d330e36c1fd,GPU-ac5046fb-020c-93e8-2784-f44aedbc5bbd"' 040a44863fb1
```

#### Failure (truncated)

```
...
Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:672 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7edda14cf897 in /usr/local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5b3a23e (0x7edd8d73a23e in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x2c7 (0x7edd8d734c87 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7edd8d734f82 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7edd8d735fd1 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7edd8d6ea371 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7edd8d6ea371 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7edd8d6ea371 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #8: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7edd54da9189 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #9: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, c10::Device&, c10d::OpType, int, bool) + 0xc50 (0x7edd54db0610 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #10: c10d::ProcessGroupNCCL::recv(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int, int) + 0x5f8 (0x7edd54dcf978 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #11: <unknown function> + 0x5adc309 (0x7edd8d6dc309 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x5ae6f10 (0x7edd8d6e6f10 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x5ae6fa5 (0x7edd8d6e6fa5 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0x5124446 (0x7edd8cd24446 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #15: <unknown function> + 0x1acf4b8 (0x7edd896cf4b8 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #16: <unknown function> + 0x5aee004 (0x7edd8d6ee004 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #17: <unknown function> + 0x5af36b5 (0x7edd8d6f36b5 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #18: <unknown function> + 0xd2fe8e (0x7edda032fe8e in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #19: <unknown function> + 0x47f074 (0x7edd9fa7f074 in /usr/local/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #35: <unknown function> + 0x29d90 (0x7edda2029d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #36: __libc_start_main + 0x80 (0x7edda2029e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #37: <unknown function> + 0x108e (0x55f950b0c08e in /usr/local/bin/python)
. This may indicate a possible application crash on rank 0 or a network set up issue.
...
```

### Fix

gvisor debug logs show:

```
W0702 20:36:17.577055 445833 uvm.go:148] [ 22: 84] nvproxy: unknown uvm ioctl 66 = 0x42
```

I've implemented that ioctl in this PR. This is the output after the fix.

```
[RANK 2] sending tensor: tensor([1., 1., 1., 1., 1.], device='cuda:2')
[RANK 0] sending tensor: tensor([1., 1., 1., 1., 1.], device='cuda:0')
[RANK 1] sending tensor: tensor([1., 1., 1., 1., 1.], device='cuda:1')
[RANK 3] received tensor from rank=0: tensor([1., 1., 1., 1., 1.], device='cuda:3')
[RANK 3] received tensor from rank=1: tensor([1., 1., 1., 1., 1.], device='cuda:3')
[RANK 3] received tensor from rank=2: tensor([1., 1., 1., 1., 1.], device='cuda:3')
PASS: NCCL working.
```

FUTURE_COPYBARA_INTEGRATE_REVIEW=#10610 from luiscape:master ee88734
PiperOrigin-RevId: 649146570
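To make the shape of the fix concrete, here is a hedged Go sketch of an ioctl dispatch table in the spirit of the change. This is not gVisor's actual nvproxy code; only the command number 0x42 and the name UVM_UNMAP_EXTERNAL come from the PR text and debug log above, and the handler body is a placeholder.

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// UVM_UNMAP_EXTERNAL's command number (0x42) is taken from the debug log above.
const UVM_UNMAP_EXTERNAL = 0x42

type uvmIoctlHandler func(arg uintptr) error

// uvmHandlers is an illustrative dispatch table, not nvproxy's real one.
var uvmHandlers = map[uint32]uvmIoctlHandler{
	UVM_UNMAP_EXTERNAL: func(arg uintptr) error {
		// A real handler would copy the ioctl parameter struct in, forward it
		// to the host driver, and copy the results back out.
		return nil
	},
}

func uvmIoctl(cmd uint32, arg uintptr) error {
	h, ok := uvmHandlers[cmd]
	if !ok {
		// This is the "unknown uvm ioctl" path seen in the failing run.
		fmt.Printf("nvproxy: unknown uvm ioctl %d = %#x\n", cmd, cmd)
		return unix.EINVAL
	}
	return h(arg)
}

func main() {
	fmt.Println(uvmIoctl(UVM_UNMAP_EXTERNAL, 0)) // handled after the fix
	fmt.Println(uvmIoctl(0x99, 0))               // still unknown
}
```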
Adding ioctls to fix a simple multi-GPU Huggingface `accelerate` program that does not work on GCP H100s.

---

### System details

* **instance type:** `a3-highgpu-8g` (GCP, us-east4-a)
* **NVIDIA driver:** `Driver Version: 550.54.15 CUDA Version: 12.4`
* **NVIDIA device:** 4 x NVIDIA H100 HBM3
* **uname -a:** `Linux gcp-h100-us-east4-a-0-bb25baf985414f8899dfdfcb82d6796d 5.15.0-208.159.3.el9uek.x86_64 #2 SMP Wed Jun 19 09:05:13 PDT 2024 x86_64 x86_64 x86_64 GNU/Linux`

```
runsc version release-20240513.0-173-gc526d251933a-dirty
spec: 1.1.0-rc.1
```

---

## Reproduction steps

1. **Install gVisor**

**2. Add GPU enabling gvisor options**

In `/etc/docker/daemon.json`:

```json
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        },
        "runsc": {
            "path": "/home/modal/runsc",
            "runtimeArgs": ["--nvproxy", "--nvproxy-docker", "-debug-log=/tmp/runsc/", "-debug", "-strace"]
        }
    }
}
```

**3. Run Dockerfile**

```Dockerfile
# Dockerfile
FROM winglian/axolotl@sha256:5c724f7accd8188b0f84ead93b7efbfa8f8661f40e133646bd6d946bc3423d6d

RUN pip install fastapi==0.111.0
RUN pip install huggingface-hub~=0.23.0 pydantic==2.6.3 python-dateutil

ENV HUGGINGFACE_HUB_CACHE="/pretrained"
ENV TQDM_DISABLE="true"
ENV AXOLOTL_NCCL_TIMEOUT="60"

COPY <<EOF repro.py
import os
import subprocess
from pathlib import Path

print("[MOD-3226] hello from the repro!!!")

from accelerate import Accelerator

accelerator = Accelerator()
with accelerator.main_process_first():
    print(f"hello! {accelerator.process_index}")
EOF

ENTRYPOINT ["accelerate", "launch", "repro.py"]
```

```
sudo docker run -it --runtime=$RUNTIME --gpus='"device=GPU-c453e5c7-a56d-70bf-78ce-61be6cb8e0db,GPU-4703196a-e3df-9e3f-bb8b-6fa91c8e9970,GPU-4a9c162c-9280-eaa8-215a-2c681e82a99f,GPU-1660d344-e18b-e48a-cced-38380e903c31"' ce4326479c8412b13bba27416e3e77093d4411279b432ca1b25050f17ef57a67
```

### Results

**`runc`**

```
sudo docker run -it --gpus='"device=GPU-c453e5c7-a56d-70bf-78ce-61be6cb8e0db,GPU-4703196a-e3df-9e3f-bb8b-6fa91c8e9970,GPU-4a9c162c-9280-eaa8-215a-2c681e82a99f,GPU-1660d344-e18b-e48a-cced-38380e903c31"' ce4326479c8412b13bba27416e3e77093d4411279b432ca1b25050f17ef57a67
The following values were not passed to `accelerate launch` and had defaults used instead:
`--num_processes` was set to a value of `4`
More than one GPU was found, enabling multi-GPU training. If this was unintended please pass in `--num_processes=1`.
`--num_machines` was set to a value of `1`
`--mixed_precision` was set to a value of `'no'`
`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
hello! 0
hello! 1
hello! 2hello! 3
```

**`runsc` (main)**

<details>
<summary>💥 Failure logs</summary>

```
sudo docker run -it --runtime=runsc --gpus='"device=GPU-c453e5c7-a56d-70bf-78ce-61be6cb8e0db,GPU-4703196a-e3df-9e3f-bb8b-6fa91c8e9970,GPU-4a9c162c-9280-eaa8-215a-2c681e82a99f,GPU-1660d344-e18b-e48a-cced-38380e903c31"' ce4326479c8412b13bba27416e3e77093d4411279b432ca1b25050f17ef57a67
The following values were not passed to `accelerate launch` and had defaults used instead:
`--num_processes` was set to a value of `4`
More than one GPU was found, enabling multi-GPU training. If this was unintended please pass in `--num_processes=1`.
`--num_machines` was set to a value of `1`
`--mixed_precision` was set to a value of `'no'`
`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
Detected kernel version 4.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
hello! 0
Traceback (most recent call last):
  File "/workspace/axolotl/repro.py", line 10, in <module>
    with accelerator.main_process_first():
  File "/root/miniconda3/envs/py3.10/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/accelerator.py", line 884, in main_process_first
    with self.state.main_process_first():
  File "/root/miniconda3/envs/py3.10/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/state.py", line 1056, in main_process_first
    with PartialState().main_process_first():
  File "/root/miniconda3/envs/py3.10/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/state.py", line 502, in main_process_first
    yield from self._goes_first(self.is_main_process)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/state.py", line 390, in _goes_first
    self.wait_for_everyone()
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/state.py", line 379, in wait_for_everyone
    torch.distributed.barrier()
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
    work = default_pg.barrier(opts=opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.6
ncclUnhandledCudaError: Call to CUDA function failed.
Last error: Cuda failure 'unknown error'
[2024-07-11 19:52:01,530] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 68 closing signal SIGTERM
[2024-07-11 19:52:01,532] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 69 closing signal SIGTERM
[2024-07-11 19:52:01,533] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 70 closing signal SIGTERM
[2024-07-11 19:52:02,108] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 67) of binary: /root/miniconda3/envs/py3.10/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/py3.10/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1073, in launch_command
    multi_gpu_launcher(args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 718, in multi_gpu_launcher
    distrib_run.run(args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
repro.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-07-11_19:52:01
host : d45a08528293
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 67)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```

---

</details>

**`runsc` (this pull request)**

<details>
<summary>✅ Success logs</summary>

```
[modal@gcp-h100-us-east4-a-0-bb25baf985414f8899dfdfcb82d6796d ~]$ sudo docker run -it --runtime=runsc --gpus='"device=GPU-c453e5c7-a56d-70bf-78ce-61be6cb8e0db,GPU-4703196a-e3df-9e3f-bb8b-6fa91c8e9970,GPU-4a9c162c-9280-eaa8-215a-2c681e82a99f,GPU-1660d344-e18b-e48a-cced-38380e903c31"' ce4326479c8412b13bba27416e3e77093d4411279b432ca1b25050f17ef57a67
The following values were not passed to `accelerate launch` and had defaults used instead:
`--num_processes` was set to a value of `4`
More than one GPU was found, enabling multi-GPU training. If this was unintended please pass in `--num_processes=1`.
`--num_machines` was set to a value of `1`
`--mixed_precision` was set to a value of `'no'`
`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
Detected kernel version 4.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
hello! 0
hello! 1
hello! 3hello!
2
```

</details>

FUTURE_COPYBARA_INTEGRATE_REVIEW=#10649 from thundergolfer:master d3d19f1
PiperOrigin-RevId: 651754677
Adding ioctls to fix a simple multi-GPU Huggingface`accelerate` program that does not work on GCP H100s. --- ### System details * **instance type:** `a3-highgpu-8g` (GCP, us-east4-a) * **NVIDIA driver:** `Driver Version: 550.54.15 CUDA Version: 12.4` * **NVIDIA device:** 4 x NVIDIA H100 HBM3 * **uname -a:** `Linux gcp-h100-us-east4-a-0-bb25baf985414f8899dfdfcb82d6796d 5.15.0-208.159.3.el9uek.x86_64 #2 SMP Wed Jun 19 09:05:13 PDT 2024 x86_64 x86_64 x86_64 GNU/Linux` ``` runsc version release-20240513.0-173-gc526d251933a-dirty spec: 1.1.0-rc.1 ``` --- ## Reproduction steps 1. **Install gVisor** **2. Add GPU enabling gvisor options** In `/etc/docker/daemon.json`: ```json { "runtimes": { "nvidia": { "path": "nvidia-container-runtime", "runtimeArgs": [] }, "runsc": { "path": "/home/modal/runsc", "runtimeArgs": ["--nvproxy", "--nvproxy-docker", "-debug-log=/tmp/runsc/", "-debug", "-strace"] } } } ``` **3. Run Dockerfile** ```Dockerfile # Dockerfile FROM winglian/axolotl@sha256:5c724f7accd8188b0f84ead93b7efbfa8f8661f40e133646bd6d946bc3423d6d RUN pip install fastapi==0.111.0 RUN pip install huggingface-hub~=0.23.0 pydantic==2.6.3 python-dateutil ENV HUGGINGFACE_HUB_CACHE="/pretrained" ENV TQDM_DISABLE="true" ENV AXOLOTL_NCCL_TIMEOUT="60" COPY <<EOF repro.py import os import subprocess from pathlib import Path print("[MOD-3226] hello from the repro!!!") from accelerate import Accelerator accelerator = Accelerator() with accelerator.main_process_first(): print(f"hello! {accelerator.process_index}") EOF ENTRYPOINT ["accelerate", "launch", "repro.py"] ``` ``` sudo docker run -it --runtime=$RUNTIME --gpus='"device=GPU-c453e5c7-a56d-70bf-78ce-61be6cb8e0db,GPU-4703196a-e3df-9e3f-bb8b-6fa91c8e9970,GPU-4a9c162c-9280-eaa8-215a-2c681e82a99f,GPU-1660d344-e18b-e48a-cced-38380e903c31"' ce4326479c8412b13bba27416e3e77093d4411279b432ca1b25050f17ef57a67 ``` ### Results **`runc`** ``` sudo docker run -it --gpus='"device=GPU-c453e5c7-a56d-70bf-78ce-61be6cb8e0db,GPU-4703196a-e3df-9e3f-bb8b-6fa91c8e9970,GPU-4a9c162c-9280-eaa8-215a-2c681e82a99f,GPU-1660d344-e18b-e48a-cced-38380e903c31"' ce4326479c8412b13bba27416e3e77093d4411279b432ca1b25050f17ef57a67 The following values were not passed to `accelerate launch` and had defaults used instead: `--num_processes` was set to a value of `4` More than one GPU was found, enabling multi-GPU training. If this was unintended please pass in `--num_processes=1`. `--num_machines` was set to a value of `1` `--mixed_precision` was set to a value of `'no'` `--dynamo_backend` was set to a value of `'no'` To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`. [MOD-3226] hello from the repro!!! [MOD-3226] hello from the repro!!! [MOD-3226] hello from the repro!!! [MOD-3226] hello from the repro!!! hello! 0 hello! 1 hello! 2hello! 3 ``` **`runsc` (main)** <details> <summary>💥 Failure logs</summary> ``` sudo docker run -it --runtime=runsc --gpus='"device=GPU-c453e5c7-a56d-70bf-78ce-61be6cb8e0db,GPU-4703196a-e3df-9e3f-bb8b-6fa91c8e9970,GPU-4a9c162c-9280-eaa8-215a-2c681e82a99f,GPU-1660d344-e18b-e48a-cced-38380e903c31"' ce4326479c8412b13bba27416e3e77093d4411279b432ca1b25050f17ef57a67 The following values were not passed to `accelerate launch` and had defaults used instead: `--num_processes` was set to a value of `4` More than one GPU was found, enabling multi-GPU training. If this was unintended please pass in `--num_processes=1`. 
`--num_machines` was set to a value of `1` `--mixed_precision` was set to a value of `'no'` `--dynamo_backend` was set to a value of `'no'` To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`. [MOD-3226] hello from the repro!!! [MOD-3226] hello from the repro!!! [MOD-3226] hello from the repro!!! [MOD-3226] hello from the repro!!! Detected kernel version 4.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher. hello! 0 Traceback (most recent call last): File "/workspace/axolotl/repro.py", line 10, in <module> with accelerator.main_process_first(): File "/root/miniconda3/envs/py3.10/lib/python3.10/contextlib.py", line 142, in __exit__ next(self.gen) File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/accelerator.py", line 884, in main_process_first with self.state.main_process_first(): File "/root/miniconda3/envs/py3.10/lib/python3.10/contextlib.py", line 142, in __exit__ next(self.gen) File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/state.py", line 1056, in main_process_first with PartialState().main_process_first(): File "/root/miniconda3/envs/py3.10/lib/python3.10/contextlib.py", line 142, in __exit__ next(self.gen) File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/state.py", line 502, in main_process_first yield from self._goes_first(self.is_main_process) File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/state.py", line 390, in _goes_first self.wait_for_everyone() File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/state.py", line 379, in wait_for_everyone torch.distributed.barrier() File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper return func(*args, **kwargs) File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier work = default_pg.barrier(opts=opts) torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.6 ncclUnhandledCudaError: Call to CUDA function failed. 
Last error: Cuda failure 'unknown error' [2024-07-11 19:52:01,530] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 68 closing signal SIGTERM [2024-07-11 19:52:01,532] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 69 closing signal SIGTERM [2024-07-11 19:52:01,533] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 70 closing signal SIGTERM [2024-07-11 19:52:02,108] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 67) of binary: /root/miniconda3/envs/py3.10/bin/python Traceback (most recent call last): File "/root/miniconda3/envs/py3.10/bin/accelerate", line 8, in <module> sys.exit(main()) File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main args.func(args) File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1073, in launch_command multi_gpu_launcher(args) File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 718, in multi_gpu_launcher distrib_run.run(args) File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run elastic_launch( File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: repro.py FAILED ------------------------------------------------------------ Failures: <NO_OTHER_FAILURES> ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-11_19:52:01 host : d45a08528293 rank : 0 (local_rank: 0) exitcode : 1 (pid: 67) error_file: <N/A> traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ``` --- </details> **`runsc` (this pull request)** <details> <summary>✅ Success logs</summary> ``` [modal@gcp-h100-us-east4-a-0-bb25baf985414f8899dfdfcb82d6796d ~]$ sudo docker run -it --runtime=runsc --gpus='"device=GPU-c453e5c7-a56d-70bf-78ce-61be6cb8e0db,GPU-4703196a-e3df-9e3f-bb8b-6fa91c8e9970,GPU-4a9c162c-9280-eaa8-215a-2c681e82a99f,GPU-1660d344-e18b-e48a-cced-38380e903c31"' ce4326479c8412b13bba27416e3e77093d4411279b432ca1b25050f17ef57a67 The following values were not passed to `accelerate launch` and had defaults used instead: `--num_processes` was set to a value of `4` More than one GPU was found, enabling multi-GPU training. If this was unintended please pass in `--num_processes=1`. `--num_machines` was set to a value of `1` `--mixed_precision` was set to a value of `'no'` `--dynamo_backend` was set to a value of `'no'` To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`. [MOD-3226] hello from the repro!!! [MOD-3226] hello from the repro!!! [MOD-3226] hello from the repro!!! [MOD-3226] hello from the repro!!! Detected kernel version 4.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher. hello! 0 hello! 1 hello! 3hello! 
2 ``` </details> FUTURE_COPYBARA_INTEGRATE_REVIEW=#10649 from thundergolfer:master d3d19f1 PiperOrigin-RevId: 651754677
Adding ioctls to fix a simple multi-GPU Hugging Face `accelerate` program that does not work on GCP H100s.

---

### System details

* **instance type:** `a3-highgpu-8g` (GCP, us-east4-a)
* **NVIDIA driver:** `Driver Version: 550.54.15 CUDA Version: 12.4`
* **NVIDIA device:** 4 x NVIDIA H100 HBM3
* **uname -a:** `Linux gcp-h100-us-east4-a-0-bb25baf985414f8899dfdfcb82d6796d 5.15.0-208.159.3.el9uek.x86_64 #2 SMP Wed Jun 19 09:05:13 PDT 2024 x86_64 x86_64 x86_64 GNU/Linux`

```
runsc version release-20240513.0-173-gc526d251933a-dirty
spec: 1.1.0-rc.1
```

---

## Reproduction steps

**1. Install gVisor**

**2. Add GPU-enabling gVisor options**

In `/etc/docker/daemon.json`:

```json
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    },
    "runsc": {
      "path": "/home/modal/runsc",
      "runtimeArgs": ["--nvproxy", "--nvproxy-docker", "-debug-log=/tmp/runsc/", "-debug", "-strace"]
    }
  }
}
```

**3. Build and run the Dockerfile**

```Dockerfile
# Dockerfile
FROM winglian/axolotl@sha256:5c724f7accd8188b0f84ead93b7efbfa8f8661f40e133646bd6d946bc3423d6d

RUN pip install fastapi==0.111.0
RUN pip install huggingface-hub~=0.23.0 pydantic==2.6.3 python-dateutil

ENV HUGGINGFACE_HUB_CACHE="/pretrained"
ENV TQDM_DISABLE="true"
ENV AXOLOTL_NCCL_TIMEOUT="60"

COPY <<EOF repro.py
import os
import subprocess
from pathlib import Path

print("[MOD-3226] hello from the repro!!!")

from accelerate import Accelerator

accelerator = Accelerator()
with accelerator.main_process_first():
    print(f"hello! {accelerator.process_index}")
EOF

ENTRYPOINT ["accelerate", "launch", "repro.py"]
```

```
sudo docker run -it --runtime=$RUNTIME --gpus='"device=GPU-c453e5c7-a56d-70bf-78ce-61be6cb8e0db,GPU-4703196a-e3df-9e3f-bb8b-6fa91c8e9970,GPU-4a9c162c-9280-eaa8-215a-2c681e82a99f,GPU-1660d344-e18b-e48a-cced-38380e903c31"' ce4326479c8412b13bba27416e3e77093d4411279b432ca1b25050f17ef57a67
```

### Results

**`runc`**

```
sudo docker run -it --gpus='"device=GPU-c453e5c7-a56d-70bf-78ce-61be6cb8e0db,GPU-4703196a-e3df-9e3f-bb8b-6fa91c8e9970,GPU-4a9c162c-9280-eaa8-215a-2c681e82a99f,GPU-1660d344-e18b-e48a-cced-38380e903c31"' ce4326479c8412b13bba27416e3e77093d4411279b432ca1b25050f17ef57a67
The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_processes` was set to a value of `4`
                More than one GPU was found, enabling multi-GPU training. If this was unintended please pass in `--num_processes=1`.
        `--num_machines` was set to a value of `1`
        `--mixed_precision` was set to a value of `'no'`
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
hello! 0
hello! 1
hello! 2hello! 3
```

**`runsc` (main)**

<details>
<summary>💥 Failure logs</summary>

```
sudo docker run -it --runtime=runsc --gpus='"device=GPU-c453e5c7-a56d-70bf-78ce-61be6cb8e0db,GPU-4703196a-e3df-9e3f-bb8b-6fa91c8e9970,GPU-4a9c162c-9280-eaa8-215a-2c681e82a99f,GPU-1660d344-e18b-e48a-cced-38380e903c31"' ce4326479c8412b13bba27416e3e77093d4411279b432ca1b25050f17ef57a67
The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_processes` was set to a value of `4`
                More than one GPU was found, enabling multi-GPU training. If this was unintended please pass in `--num_processes=1`.
        `--num_machines` was set to a value of `1`
        `--mixed_precision` was set to a value of `'no'`
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
Detected kernel version 4.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
hello! 0
Traceback (most recent call last):
  File "/workspace/axolotl/repro.py", line 10, in <module>
    with accelerator.main_process_first():
  File "/root/miniconda3/envs/py3.10/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/accelerator.py", line 884, in main_process_first
    with self.state.main_process_first():
  File "/root/miniconda3/envs/py3.10/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/state.py", line 1056, in main_process_first
    with PartialState().main_process_first():
  File "/root/miniconda3/envs/py3.10/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/state.py", line 502, in main_process_first
    yield from self._goes_first(self.is_main_process)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/state.py", line 390, in _goes_first
    self.wait_for_everyone()
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/state.py", line 379, in wait_for_everyone
    torch.distributed.barrier()
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
    work = default_pg.barrier(opts=opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.6
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'unknown error'
[2024-07-11 19:52:01,530] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 68 closing signal SIGTERM
[2024-07-11 19:52:01,532] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 69 closing signal SIGTERM
[2024-07-11 19:52:01,533] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 70 closing signal SIGTERM
[2024-07-11 19:52:02,108] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 67) of binary: /root/miniconda3/envs/py3.10/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/py3.10/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1073, in launch_command
    multi_gpu_launcher(args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 718, in multi_gpu_launcher
    distrib_run.run(args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
repro.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2024-07-11_19:52:01
  host       : d45a08528293
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 67)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```

---

</details>

**`runsc` (this pull request)**

<details>
<summary>✅ Success logs</summary>

```
[modal@gcp-h100-us-east4-a-0-bb25baf985414f8899dfdfcb82d6796d ~]$ sudo docker run -it --runtime=runsc --gpus='"device=GPU-c453e5c7-a56d-70bf-78ce-61be6cb8e0db,GPU-4703196a-e3df-9e3f-bb8b-6fa91c8e9970,GPU-4a9c162c-9280-eaa8-215a-2c681e82a99f,GPU-1660d344-e18b-e48a-cced-38380e903c31"' ce4326479c8412b13bba27416e3e77093d4411279b432ca1b25050f17ef57a67
The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_processes` was set to a value of `4`
                More than one GPU was found, enabling multi-GPU training. If this was unintended please pass in `--num_processes=1`.
        `--num_machines` was set to a value of `1`
        `--mixed_precision` was set to a value of `'no'`
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
[MOD-3226] hello from the repro!!!
Detected kernel version 4.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
hello! 0
hello! 1
hello! 3hello! 2
```

</details>

FUTURE_COPYBARA_INTEGRATE_REVIEW=#10649 from thundergolfer:master d3d19f1
PiperOrigin-RevId: 651754677
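To make the failure mode easier to reason about, here is a small conceptual sketch of the pattern at play: an ioctl proxy only forwards commands it recognizes, so a CUDA/NCCL code path that issues an unrecognized command fails inside the sandbox even though it succeeds under `runc`. This is **not** gVisor's actual nvproxy code; every name (`nr`, `allowedCommands`, `proxyIoctl`), the placeholder command numbers, and the choice of `EINVAL` are assumptions made purely for illustration.

```go
// Conceptual sketch only: NOT gVisor's nvproxy implementation.
// It illustrates the general allowlist pattern that adding new ioctl
// numbers extends: decode the ioctl command, look it up in a table of
// known commands, and reject anything the proxy does not understand.
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// nr extracts the low 8 "command number" bits of a Linux ioctl request.
func nr(req uint32) uint8 { return uint8(req & 0xff) }

// allowedCommands is a hypothetical allowlist keyed by command number; the
// real set of forwarded commands is driver- and version-specific.
var allowedCommands = map[uint8]bool{
	0x2a: true, // placeholder values, not real NVIDIA command numbers
	0x4a: true,
}

// proxyIoctl decides whether a guest ioctl request would be forwarded to the
// host driver or rejected.
func proxyIoctl(req uint32) error {
	if !allowedCommands[nr(req)] {
		// Requests the proxy does not understand are rejected rather than
		// passed through blindly; the caller typically surfaces this as a
		// CUDA/NCCL error.
		return unix.EINVAL
	}
	// A real implementation would marshal the argument struct and issue the
	// ioctl against the host device here.
	return nil
}

func main() {
	fmt.Println(proxyIoctl(0x2a)) // allowed by this sketch: <nil>
	fmt.Println(proxyIoctl(0xff)) // not in the allowlist: invalid argument
}
```

The `-strace`, `-debug`, and `-debug-log` flags in the `daemon.json` above are what make it practical to see which requests the sandbox turned away, and therefore which ioctl numbers a change like this one needs to add.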
Attempt #2. This runs in continuous mode only. PiperOrigin-RevId: 689056926
Attempt #2. This runs in continuous mode only. PiperOrigin-RevId: 691516066
This requires some socket ioctls that are not currently implemented.
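Interface discovery on Linux commonly goes through the classic BSD socket ioctls, e.g. `SIOCGIFCONF` to enumerate interfaces and `SIOCGIFFLAGS` to read flags such as `IFF_LOOPBACK`. The Go sketch below is only an illustration of that kind of call, not the exact call path of any particular application; it assumes `golang.org/x/sys/unix`, an interface literally named `lo`, and a locally defined `ifreqFlags` struct standing in for the kernel's `struct ifreq`.

```go
// Minimal sketch of the kind of socket ioctl a loopback-interface lookup
// issues: open a datagram socket, query the interface flags for "lo" with
// SIOCGIFFLAGS, and test IFF_LOOPBACK.
package main

import (
	"fmt"
	"unsafe"

	"golang.org/x/sys/unix"
)

// ifreqFlags mirrors the flags view of Linux's struct ifreq (40 bytes total).
type ifreqFlags struct {
	Name  [16]byte
	Flags int16
	_     [22]byte // pad to sizeof(struct ifreq)
}

func main() {
	fd, err := unix.Socket(unix.AF_INET, unix.SOCK_DGRAM, 0)
	if err != nil {
		panic(err)
	}
	defer unix.Close(fd)

	var ifr ifreqFlags
	copy(ifr.Name[:], "lo")

	_, _, errno := unix.Syscall(unix.SYS_IOCTL, uintptr(fd),
		uintptr(unix.SIOCGIFFLAGS), uintptr(unsafe.Pointer(&ifr)))
	if errno != 0 {
		fmt.Println("SIOCGIFFLAGS failed:", errno)
		return
	}
	fmt.Println("lo is loopback:", ifr.Flags&unix.IFF_LOOPBACK != 0)
}
```

If the runtime does not implement `SIOCGIFFLAGS` (or the related interface-listing ioctls) for guest sockets, the `Syscall` above returns an error instead of the flags, and an application performing this kind of lookup may conclude that no loopback interface exists.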