[bug] Intermittent: a script that uses many go(function () {}) calls to process tasks occasionally hangs in epoll_wait #5621

Closed
omigafu opened this issue Dec 16, 2024 · 9 comments

Comments

@omigafu

omigafu commented Dec 16, 2024

Please answer these questions before submitting your issue.
This is not running inside an http_server; it runs in script mode.

  1. What did you do? If possible, provide a simple script for reproducing the error.
Swoole\Coroutine\run(function ()  {
    ...
    foreach ($postData as $k => $v) {
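        // $in (the request payload) is assumed to be defined in the elided part above
        go(function () use ($chan, $k, $v, $in) {
            $client = new Swoole\Coroutine\Client(SWOOLE_SOCK_TCP);
            $client->connect($v['ip'], $v['port'], 2); // 2 s connect timeout
            $client->send($in);
            $rs = $client->recv(1); // 1 s receive timeout
            $client->close();
            $chan->push(['id' => $k, 'body' => $rs], 4); // wait up to 4 s if the channel is full
        });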
    }
    for ($i = 0; $i < $realCount; $i++) {
        $result = $chan->pop(4);
        $returns[$result['id']] = $result['body'];
    }
});
  2. What did you expect to see?
    I expected the tasks to run to completion normally.

  3. What did you see instead?
    The process stalled in epoll_wait.

#0  0x00007f7f416d6e87 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f7f3e9eff9d in swoole::ReactorEpoll::wait (this=0x36fffd0, timeo=<optimized out>) at /usr/local/services/swoole-src-4.8.13/include/swoole_reactor.h:269
#2  0x00007f7f3e93c52b in swoole::Reactor::wait (this=<optimized out>, timeout=0x0) at /usr/local/services/swoole-src-4.8.13/include/swoole_reactor.h:165
#3  php_swoole_event_wait () at /usr/local/services/swoole-src-4.8.13/ext-src/swoole_event.cc:278
#4  php_swoole_event_wait () at /usr/local/services/swoole-src-4.8.13/ext-src/swoole_event.cc:262
#5  0x00007f7f3e93801f in zim_swoole_coroutine_scheduler_start (execute_data=<optimized out>, return_value=0x7f7f3f216550) at /usr/local/services/swoole-src-4.8.13/ext-src/swoole_coroutine_scheduler.cc:319
#6  0x00000000009e31b6 in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER () at /usr/local/services/php-7.4.12/Zend/zend_vm_execute.h:1730
#7  execute_ex (ex=0xf) at /usr/local/services/php-7.4.12/Zend/zend_vm_execute.h:53865
#8  0x0000000000954246 in zend_call_function (fci=fci@entry=0x7ffcb4dd1300, fci_cache=<optimized out>, fci_cache@entry=0x7ffcb4dd1370) at /usr/local/services/php-7.4.12/Zend/zend_execute_API.c:820
#9  0x00007f7f3e90b298 in sw_zend_call_function_ex (retval=0x0, params=0x7f7f3f213bd0, param_count=1, fci_cache=0x7ffcb4dd1370, function_name=0x0) at /usr/local/services/swoole-src-4.8.13/ext-src/php_swoole_private.h:974
#10 zend::function::call (fci_cache=fci_cache@entry=0x7ffcb4dd1370, argc=argc@entry=1, argv=argv@entry=0x7f7f3f213bd0, retval=retval@entry=0x0, enable_coroutine=<optimized out>) at /usr/local/services/swoole-src-4.8.13/ext-src/php_swoole_cxx.cc:100
#11 0x00007f7f3e962098 in php_swoole_process_start (process=0x7f7f2e97f0a0, zobject=0x7f7f3f213bd0) at /usr/local/services/swoole-src-4.8.13/ext-src/swoole_process.cc:739
#12 0x00007f7f3e962117 in zim_swoole_process_start (execute_data=0x7f7f3f213bb0, return_value=0x7ffcb4dd1430) at /usr/local/services/swoole-src-4.8.13/ext-src/swoole_process.cc:771
#13 0x00000000009e3548 in ZEND_DO_FCALL_SPEC_RETVAL_UNUSED_HANDLER () at /usr/local/services/php-7.4.12/Zend/zend_vm_execute.h:1618
#14 execute_ex (ex=0xf) at /usr/local/services/php-7.4.12/Zend/zend_vm_execute.h:53861
#15 0x0000000000954246 in zend_call_function (fci=fci@entry=0x7ffcb4dd1560, fci_cache=<optimized out>, fci_cache@entry=0x7ffcb4dd15d0) at /usr/local/services/php-7.4.12/Zend/zend_execute_API.c:820
#16 0x00007f7f3e90b298 in sw_zend_call_function_ex (retval=0x0, params=0x7f7f3f213350, param_count=1, fci_cache=0x7ffcb4dd15d0, function_name=0x0) at /usr/local/services/swoole-src-4.8.13/ext-src/php_swoole_private.h:974
#17 zend::function::call (fci_cache=fci_cache@entry=0x7ffcb4dd15d0, argc=argc@entry=1, argv=argv@entry=0x7f7f3f213350, retval=retval@entry=0x0, enable_coroutine=<optimized out>) at /usr/local/services/swoole-src-4.8.13/ext-src/php_swoole_cxx.cc:100
#18 0x00007f7f3e962098 in php_swoole_process_start (process=0x7f7f3f39b1e0, zobject=0x7f7f3f213350) at /usr/local/services/swoole-src-4.8.13/ext-src/swoole_process.cc:739
#19 0x00007f7f3e962117 in zim_swoole_process_start (execute_data=0x7f7f3f213330, return_value=0x7ffcb4dd1690) at /usr/local/services/swoole-src-4.8.13/ext-src/swoole_process.cc:771
#20 0x00000000009e3548 in ZEND_DO_FCALL_SPEC_RETVAL_UNUSED_HANDLER () at /usr/local/services/php-7.4.12/Zend/zend_vm_execute.h:1618
#21 execute_ex (ex=0xf) at /usr/local/services/php-7.4.12/Zend/zend_vm_execute.h:53861
#22 0x00000000009e40a3 in zend_execute (op_array=0x7f7f3f27bee0, return_value=<optimized out>) at /usr/local/services/php-7.4.12/Zend/zend_vm_execute.h:57957
#23 0x0000000000962044 in zend_execute_scripts (type=type@entry=8, retval=0x7f7f3f213020, retval@entry=0x0, file_count=file_count@entry=3) at /usr/local/services/php-7.4.12/Zend/zend.c:1677
#24 0x00000000009043e0 in php_execute_script (primary_file=primary_file@entry=0x7ffcb4dd3c30) at /usr/local/services/php-7.4.12/main/main.c:2621
#25 0x00000000009e6133 in do_cli (argc=5, argv=0x340f030) at /usr/local/services/php-7.4.12/sapi/cli/php_cli.c:964
#26 0x0000000000642c88 in main (argc=5, argv=0x340f030) at /usr/local/services/php-7.4.12/sapi/cli/php_cli.c:1359
  4. What version of Swoole are you using (show your php --ri swoole)?
    swoole

Swoole => enabled
Author => Swoole Team [email protected]
Version => 4.8.13
Built => Nov 13 2023 15:55:26
coroutine => enabled with boost asm context
epoll => enabled
eventfd => enabled
signalfd => enabled
cpu_affinity => enabled
spinlock => enabled
rwlock => enabled
sockets => enabled
openssl => OpenSSL 1.1.1k FIPS 25 Mar 2021
dtls => enabled
zlib => 1.2.11
mutex_timedlock => enabled
pthread_barrier => enabled
futex => enabled
async_redis => enabled

Directive => Local Value => Master Value
swoole.enable_coroutine => On => On
swoole.enable_library => On => On
swoole.enable_preemptive_scheduler => Off => Off
swoole.display_errors => On => On
swoole.use_shortname => On => On
swoole.unixsock_buffer_size => 8388608 => 8388608

swoole

Swoole => enabled
Author => Swoole Team [email protected]
Version => 5.1.2
Built => Aug 13 2024 17:02:50
coroutine => enabled with boost asm context
epoll => enabled
eventfd => enabled
signalfd => enabled
cpu_affinity => enabled
spinlock => enabled
rwlock => enabled
sockets => enabled
openssl => OpenSSL 3.0.12 24 Oct 2023
dtls => enabled
http2 => enabled
json => enabled
zlib => 1.2.13
brotli => E16781312/D16781312
mutex_timedlock => enabled
pthread_barrier => enabled
futex => enabled
async_redis => enabled

Directive => Local Value => Master Value
swoole.enable_coroutine => On => On
swoole.enable_library => On => On
swoole.enable_fiber_mock => Off => Off
swoole.enable_preemptive_scheduler => Off => Off
swoole.display_errors => On => On
swoole.use_shortname => On => On
swoole.unixsock_buffer_size => 8388608 => 8388608
  5. What is your machine environment used (show your uname -a & php -v & gcc -v) ?
PHP 7.4.12

PHP 8.3.7

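Both PHP versions and both Swoole versions show this problem, but only intermittently. There is no error output at all; the process simply hangs in epoll_wait. Has anyone run into a similar problem?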

@NathanFreeman
Member

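When running many coroutines in parallel, use Swoole\Coroutine\WaitGroup to make sure the main coroutine does not exit.
You should also check whether the data failed to get through during send.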
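
A minimal sketch of the WaitGroup pattern suggested above, assuming the Swoole extension is loaded with swoole.enable_library on. The function name runAll and the $tasks array of callables are hypothetical stand-ins for the TCP requests in the original script; this is not the reporter's code.

```php
<?php
use Swoole\Coroutine;
use Swoole\Coroutine\WaitGroup;
use function Swoole\Coroutine\run;

/**
 * Run every task in its own coroutine and return the results keyed like the input.
 * $tasks is a hypothetical array of callables used only for illustration.
 */
function runAll(array $tasks): array
{
    $results = [];
    run(function () use ($tasks, &$results) {
        $wg = new WaitGroup();
        foreach ($tasks as $id => $task) {
            $wg->add(); // one more coroutine to wait for
            Coroutine::create(function () use ($wg, $id, $task, &$results) {
                try {
                    $results[$id] = $task();
                } finally {
                    $wg->done(); // always signal completion, even if the task throws
                }
            });
        }
        $wg->wait(); // the parent coroutine is suspended until every done() has been called
    });
    return $results;
}
```

Calling done() in a finally block matters here: if one child returns early without signalling, wait() has nothing to wake it and the parent stays suspended.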

@omigafu
Author

omigafu commented Dec 16, 2024

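When running many coroutines in parallel, use Swoole\Coroutine\WaitGroup to make sure the main coroutine does not exit. You should also check whether the data failed to get through during send.

  1. I have also tried WaitGroup, and the result is the same.
  2. If send had failed, recv has a timeout set, so normally this problem should not occur either.
  3. It is fairly hard to reproduce; it only shows up when running large-scale offline jobs.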

@matyhtf
Member

matyhtf commented Dec 16, 2024

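This is usually caused by a logic error in your PHP code. It is expected for the main thread to poll in epoll_wait: when there are no events, it enters the epoll wait.

You can record the coroutine IDs and then use Coroutine::getBackTrace() to find where the corresponding coroutine is suspended.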
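
The suggestion above can be sketched roughly as follows, assuming the Swoole extension is loaded. The helper name dumpSuspendedCoroutines is hypothetical; it relies on Coroutine::list() to enumerate coroutine IDs rather than recording them by hand.

```php
<?php
use Swoole\Coroutine;

/**
 * Collect the PHP backtrace of every coroutine except the calling one.
 * Must be called from inside a coroutine (for example from a watchdog coroutine).
 */
function dumpSuspendedCoroutines(): array
{
    $self = Coroutine::getCid();
    $traces = [];
    foreach (Coroutine::list() as $cid) {
        if ($cid === $self) {
            continue; // the caller is running, not suspended
        }
        // Each frame shows the file and line where that coroutine is currently parked
        $traces[$cid] = Coroutine::getBackTrace($cid, DEBUG_BACKTRACE_IGNORE_ARGS);
    }
    return $traces;
}
```

One way to use it is to start an extra coroutine that sleeps for longer than the job should ever take and then logs the returned array, so that a hung run leaves a record of which call each coroutine was waiting in.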

@omigafu
Author

omigafu commented Dec 17, 2024

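This is usually caused by a logic error in your PHP code. It is expected for the main thread to poll in epoll_wait: when there are no events, it enters the epoll wait.

You can record the coroutine IDs and then use Coroutine::getBackTrace() to find where the corresponding coroutine is suspended.

Thanks, I will try to track it down this way.
Is there a way to set a timeout on a Coroutine?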

@LIngMax

LIngMax commented Dec 19, 2024

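I have run into a similar problem here. It happens when coroutines are created continuously and then block.

Coroutine 1 calls $lock->lock(); sleep(1); $lock->unlock();
Coroutine 2 calls $lock->lock(); sleep(1); $lock->unlock();
Coroutine 3 calls $lock->lock(); sleep(1); $lock->unlock();
Starting these coroutines at the same time leaves coroutine 3 completely stuck, and it can never be woken up again.

The usual workaround wastes a little performance but solves the problem:
while (!$lock->trylock()) usleep(1000 * 10); // 10 ms: wait for the lock; the coroutine is suspended and yields the CPU

As for chan, if you do not check whether it would block, there is nothing you can do.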

@LIngMax

LIngMax commented Dec 19, 2024

Before calling push, check whether the channel is full: Swoole\Coroutine\Channel->isFull()

@NathanFreeman
Member

Mutex locks and sleep must not be used inside coroutines; they easily lead to deadlock.

@LIngMax

LIngMax commented Dec 19, 2024

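while ($chan->isFull()) usleep(1000 * 10); // 10 ms: wait for a free slot in the channel; the coroutine is suspended and yields the CPU
This is only a crude workaround.

I hope the maintainers can switch to a different way of handling blocking,
so that the event loop wakes the coroutine up instead of blocking and hanging all coroutines.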

@matyhtf
Member

matyhtf commented Dec 24, 2024

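Channel itself already provides coroutine synchronization, so there is no need to use a lock. On the contrary, using a lock may deadlock the process.

Currently Swoole\Lock is an inter-process lock and Swoole\Thread\Lock is an inter-thread lock; neither can be used in a coroutine environment.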
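
As an illustration of using Channel for coroutine synchronization instead of Swoole\Lock, here is a rough sketch that assumes the Swoole extension and PHP 7.4 or later. ChannelMutex and runMutexDemo are hypothetical names, not part of Swoole. A capacity-1 channel acts as the lock: push() acquires it and suspends only the calling coroutine while the slot is taken, and pop() releases it.

```php
<?php
use Swoole\Coroutine;
use Swoole\Coroutine\Channel;
use function Swoole\Coroutine\run;

/** Hypothetical coroutine-level mutex built on a capacity-1 Channel. */
final class ChannelMutex
{
    private Channel $slot;

    public function __construct()
    {
        $this->slot = new Channel(1);
    }

    /** Returns false if the lock could not be taken within $timeout seconds (-1 waits forever). */
    public function lock(float $timeout = -1): bool
    {
        return $this->slot->push(true, $timeout);
    }

    public function unlock(): void
    {
        if (!$this->slot->isEmpty()) {
            $this->slot->pop(); // frees the slot and wakes one waiting coroutine, if any
        }
    }
}

/** Three coroutines contend for the mutex; returns the order of enter/leave events. */
function runMutexDemo(): array
{
    $log = [];
    run(function () use (&$log) {
        $mutex = new ChannelMutex();
        for ($i = 1; $i <= 3; $i++) {
            Coroutine::create(function () use ($mutex, $i, &$log) {
                if (!$mutex->lock(1.0)) {
                    $log[] = "timeout $i";
                    return;
                }
                $log[] = "enter $i";
                Coroutine::sleep(0.01); // coroutine-aware sleep: yields instead of blocking the process
                $log[] = "leave $i";
                $mutex->unlock();
            });
        }
    });
    return $log;
}
```

Because the waiting coroutines are parked inside push() rather than in a process-level lock, the event loop keeps running and can resume them, and the timeout argument gives each waiter a way out if the holder never releases.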

@matyhtf matyhtf closed this as completed Dec 24, 2024

4 participants