Failing when trying to run DeepSeek-R1-3bit on 3 Studios M2 Ultra with 128GB RAM each #1226

@sck-at-ucy

Perhaps this is hopeless, but I thought it was worth asking. I know I am close to the memory limit. Is there still hope of fitting the model on 3 M2 Ultra machines with 128GB RAM each? The 4th node is being used on another project, and I was curious whether the model would fit on 3 nodes.

I have used sudo sysctl -w iogpu.wired_limit_mb=122000 on all three nodes.

The code runs and I can see memory usage increasing, but then it fails, apparently before it finishes loading the model, because at the time of failure CPU utilization is still at 100%.

/opt/homebrew/bin/mpirun --mca oob_tcp_if_include bridge0 --mca btl_tcp_if_include bridge0 \
--map-by ppr:1:node --mca coll_tuned_use_dynamic_rules 1 \
--mca coll_tuned_allreduce_algorithm 5 --mca btl_tcp_links 4 \
--mca mpi_thread_multiple 0 --mca btl_tcp_eager_limit 4194304 \
--mca btl_tcp_sndbuf 8388608 --mca btl_tcp_rcvbuf 8388608 --mca btl self,tcp \
-x DYLD_LIBRARY_PATH=/opt/homebrew/lib/ \
-np 3 --host 10.0.0.1:1,10.0.0.3:1,localhost:1 \
/Users/m2/anaconda3/envs/pythonProject_StreamLit/bin/python /Users/m2/pipeline_generate.py \
--model /Volumes/PACIFIC-GROVE/DeepSeek-R1-3bit \
--prompt "What's better a straight or a flush in texas hold'em?" \
--max-tokens 1024
[WARNING] Generating with a model that requires 116168 MB which is close to the maximum recommended size of 122000 MB. This can be slow. See the documentation for possible work-arounds: https://github.com/ml-explore/mlx-examples/tree/main/llms#large-models
libc++abi: terminating due to uncaught exception of type std::runtime_error: [METAL] Command buffer execution failed: Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
[Mendocino:28151] *** Process received signal ***
[Mendocino:28151] Signal: Abort trap: 6 (6)
[Mendocino:28151] Signal code:  (0)
[Mendocino:28151] [ 0] 0   libsystem_platform.dylib            0x000000019e542e04 _sigtramp + 56
[Mendocino:28151] [ 1] 0   libsystem_pthread.dylib             0x000000019e50bf70 pthread_kill + 288
[Mendocino:28151] [ 2] 0   libsystem_c.dylib                   0x000000019e418908 abort + 128
[Mendocino:28151] [ 3] 0   libc++abi.dylib                     0x000000019e4c244c _ZN10__cxxabiv130__aligned_malloc_with_fallbackEm + 0
[Mendocino:28151] [ 4] 0   libc++abi.dylib                     0x000000019e4b0a24 _ZL28demangling_terminate_handlerv + 320
[Mendocino:28151] [ 5] 0   libobjc.A.dylib                     0x000000019e1593f4 _ZL15_objc_terminatev + 172
[Mendocino:28151] [ 6] 0   libc++abi.dylib                     0x000000019e4c1710 _ZSt11__terminatePFvvE + 16
[Mendocino:28151] [ 7] 0   libc++abi.dylib                     0x000000019e4c16b4 _ZSt9terminatev + 108
[Mendocino:28151] [ 8] 0   libdispatch.dylib                   0x000000019e359688 _dispatch_client_callout4 + 40
[Mendocino:28151] [ 9] 0   libdispatch.dylib                   0x000000019e375c88 _dispatch_mach_msg_invoke + 464
[Mendocino:28151] [10] 0   libdispatch.dylib                   0x000000019e360a38 _dispatch_lane_serial_drain + 352
[Mendocino:28151] [11] 0   libdispatch.dylib                   0x000000019e3769dc _dispatch_mach_invoke + 456
[Mendocino:28151] [12] 0   libdispatch.dylib                   0x000000019e360a38 _dispatch_lane_serial_drain + 352
[Mendocino:28151] [13] 0   libdispatch.dylib                   0x000000019e361764 _dispatch_lane_invoke + 432
[Mendocino:28151] [14] 0   libdispatch.dylib                   0x000000019e360a38 _dispatch_lane_serial_drain + 352
[Mendocino:28151] [15] 0   libdispatch.dylib                   0x000000019e361730 _dispatch_lane_invoke + 380
[Mendocino:28151] [16] 0   libdispatch.dylib                   0x000000019e36c9a0 _dispatch_root_queue_drain_deferred_wlh + 288
[Mendocino:28151] [17] 0   libdispatch.dylib                   0x000000019e36c1ec _dispatch_workloop_worker_thread + 540
[Mendocino:28151] [18] 0   libsystem_pthread.dylib             0x000000019e5083d8 _pthread_wqthread + 288
[Mendocino:28151] [19] 0   libsystem_pthread.dylib             0x000000019e5070f0 start_wqthread + 8
[Mendocino:28151] *** End of error message ***
--------------------------------------------------------------------------
prterun noticed that process rank 2 with PID 28151 on node Mendocino exited on
signal 6 (Abort trap: 6).
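
To put the numbers above in context, here is a rough headroom calculation using only the figures from this post. It is a sketch, not a diagnosis: it assumes "128GB" means 131072 MiB and that the MB values in the warning and in iogpu.wired_limit_mb use the same unit. The function name and dictionary keys are made up for illustration.

```python
# Figures copied from the post. Treating every value as MiB is an assumption.
TOTAL_RAM_MB = 128 * 1024   # 128 GB per Mac Studio, assumed to be 131072 MiB
WIRED_LIMIT_MB = 122_000    # value passed to sysctl iogpu.wired_limit_mb
MODEL_MB = 116_168          # requirement reported in the [WARNING] line


def headroom_report(total_mb, wired_mb, model_mb):
    """Return how much memory is left outside and inside the wired limit."""
    return {
        # Memory left for macOS and other processes outside the wired limit.
        "left_for_os_mb": total_mb - wired_mb,
        # Memory left under the wired limit after the weights, e.g. for the
        # KV cache and intermediate activations.
        "left_under_limit_mb": wired_mb - model_mb,
        # Share of the wired limit taken by the weights alone.
        "model_fraction_of_limit": round(model_mb / wired_mb, 3),
    }


report = headroom_report(TOTAL_RAM_MB, WIRED_LIMIT_MB, MODEL_MB)
for key, value in report.items():
    print(f"{key}: {value}")
```

Under these assumptions the weights take about 95% of the wired limit, leaving roughly 5.8 GB under the limit and about 9 GB for the rest of the system on each node.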
