Error when run test on 500 pods #408

volnyansky · 2024-06-03T09:03:05Z

Brief summary

I'm trying to run the test on 500 pods and get the error:
exec /usr/bin/k6: argument list too long
I found a workaround: batching the test into packages of 300 pods with the same test ID.

k6-operator version or image

0.0.14

Helm chart version (if applicable)

k6-operator-3.6.0

TestRun / PrivateLoadZone YAML

apiVersion: k6.io/v1alpha1
kind: TestRun
metadata:
  name: ${USERNAME}-${SCRIPT}-${BATCH}
  namespace: k6
spec:
  # number of pods to run in parallel
  parallelism: ${BATCH_PODS}
  script:
    configMap:
      name: ${USERNAME}-test-script-${BATCH}
      file: test.tar
  arguments: -o experimental-prometheus-rw --tag testid=${TESTID}
  runner:
    image: 569129334545.dkr.ecr.us-east-1.amazonaws.com/k6-robot-dev:latest
    env:
      - name: K6_PROMETHEUS_RW_SERVER_URL
        value: "http://victoria-metrics-single-server.monitoring.svc.cluster.local:8428/api/v1/write"
      - name: K6_PROMETHEUS_RW_TREND_STATS
        value: "count,sum,min,max,avg,med,p(90),p(95),p(99)"
      - name: K6_BROWSER_ARGS
        value: "window-size=1920x1080,no-sandbox,disable-setuid-sandbox,allow-file-access,use-fake-device-for-media-stream,use-fake-ui-for-media-stream,use-file-for-fake-video-capture=/usr/local/assets/video.mjpeg,use-file-for-fake-audio-capture=/usr/local/assets/audio.wav"
      - name: K6_BROWSER_TIMEOUT
        value: "45s"
      - name: VU_ID_START
        value: "${VU_ID_START}"
    nodeSelector:
      engageli.com/role: k6-load-test
    resources:
      limits:
        cpu: "${CPU}"
        memory: ${MEMORY}Mi
      requests:
        cpu: 100m
        memory: ${MEMORY}Mi

Other environment details (if applicable)

No response

Steps to reproduce the problem

Run a test on 500 pods; the number of VUs doesn't matter.

Expected behaviour

The test runs on the given number of pods.

Actual behaviour

Test crashes

volnyansky · 2024-06-03T09:16:32Z

I think the issue is the maximum argument length of the curl command. I see the following in the starter pod:
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.232.147:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.109.212:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.60.188:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.188.109:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.112.255:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.143.154:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.164.249:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.39.112:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.214.183:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.199.160:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.110.241:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.204.180:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.44.121:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.33.28:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","typ ...

yorugac · 2024-06-03T11:27:52Z

Hi @volnyansky, this is certainly a new one 😅 Did the error exec /usr/bin/k6: argument list too long come from the starter pod then?

Out of curiosity, what are you testing that you need such a large test?

As for a solution: that command is just an iterative concatenation, so I guess we could split it into several commands when there are lots of instances. The question is what values of ARG_MAX can be expected in Kubernetes deployments.
The starter command executes sequentially now anyway, which is probably not ideal for a test as large as this one. But figuring out parallelization for it would definitely be a harder problem.
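
A minimal sketch of the splitting idea, assuming the runner addresses are available as one `;`-separated string. The helper name `batch_ips` and the sample addresses are hypothetical and not part of k6-operator; the sketch only shows how a long list could be cut into groups small enough for one command each.

```shell
#!/bin/sh
# Hypothetical helper: print the addresses from a ';'-separated list
# in groups of at most $2 entries, one group per output line.
batch_ips() {
  ips=$1
  size=$2
  printf '%s\n' "$ips" | tr ';' '\n' | awk -v n="$size" '
    NF == 0 { next }                                   # skip empty entries
    { line = (count % n == 0) ? $0 : line " " $0; count++ }
    count % n == 0 { print line; line = "" }           # group is full
    END { if (line != "") print line }                 # last partial group
  '
}

# Illustrative addresses only: five entries in groups of two.
batch_ips "10.0.0.1;10.0.0.2;10.0.0.3;10.0.0.4;10.0.0.5" 2
```

Each output line could then be turned into its own, shorter starter command instead of one concatenation covering every runner.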

volnyansky · 2024-06-03T12:25:33Z

@yorugac I'm running a stress test with a real browser. I need to test not only REST and WebSocket APIs, but also WebRTC, so I can't run thousands of robots in one pod.
Yes, the issue occurs in the starter pod.

volnyansky · 2024-06-03T18:40:15Z

Update: it fails on the runners too. The strangest thing is that the command line is not too long:
k6 run --quiet --execution-segment=7/250:8/250 --execution-segment-sequence=0,1/250,2/250,3/250,4/250,5/250,6/250,7/250,8/250,9/250,10/250,11/250,12/250,13/250,14/250,15/250,16/250,17/250,18/250,19/250,20/250,21/250,22/250,23/250,24/250,25/250,26/250,27/250,28/250,29/250,30/250,31/250,32/250,33/250,34/250,35/250,36/250,37/250,38/250,39/250,40/250,41/250,42/250,43/250,44/250,45/250,46/250,47/250,48/250,49/250,50/250,51/250,52/250,53/250,54/250,55/250,56/250,57/250,58/250,59/250,60/250,61/250,62/250,63/250,64/250,65/250,66/250,67/250,68/250,69/250,70/250,71/250,72/250,73/250,74/250,75/250,76/250,77/250,78/250,79/250,80/250,81/250,82/250,83/250,84/250,85/250,86/250,87/250,88/250,89/250,90/250,91/250,92/250,93/250,94/250,95/250,96/250,97/250,98/250,99/250,100/250,101/250,102/250,103/250,104/250,105/250,106/250,107/250,108/250,109/250,110/250,111/250,112/250,113/250,114/250,115/250,116/250,117/250,118/250,119/250,120/250,121/250,122/250,123/250,124/250,125/250,126/250,127/250,128/250,129/250,130/250,131/250,132/250,133/250,134/250,135/250,136/250,137/250,138/250,139/250,140/250,141/250,142/250,143/250,144/250,145/250,146/250,147/250,148/250,149/250,150/250,151/250,152/250,153/250,154/250,155/250,156/250,157/250,158/250,159/250,160/250,161/250,162/250,163/250,164/250,165/250,166/250,167/250,168/250,169/250,170/250,171/250,172/250,173/250,174/250,175/250,176/250,177/250,178/250,179/250,180/250,181/250,182/250,183/250,184/250,185/250,186/250,187/250,188/250,189/250,190/250,191/250,192/250,193/250,194/250,195/250,196/250,197/250,198/250,199/250,200/250,201/250,202/250,203/250,204/250,205/250,206/250,207/250,208/250,209/250,210/250,211/250,212/250,213/250,214/250,215/250,216/250,217/250,218/250,219/250,220/250,221/250,222/250,223/250,224/250,225/250,226/250,227/250,228/250,229/250,230/250,231/250,232/250,233/250,234/250,235/250,236/250,237/250,238/250,239/250,240/250,241/250,242/250,243/250,244/250,245/250,246/250,247/250,248/250,249/250,1 -o experimental-prometheus-rw --tag testid=stas-browser-mock-login-test-7.5k-2024-06-03-21-07-03 /test/test.tar --address=0.0.0.0:6565 --paused --tag instance_id=8 --tag job_name=stas-browser-mock-login-test-0-8
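
On Linux, the limit behind "argument list too long" covers the combined size of the arguments and the environment passed to the new process, so a short command line alone does not rule it out. Whether the environment is the cause here is an assumption, not something confirmed in this thread. The sketch below, with hypothetical helper names `env_bytes` and `args_bytes`, shows one rough way to measure both inside a pod:

```shell
#!/bin/sh
# Rough size of the current environment in bytes: one "NAME=value"
# entry per line, with the newline standing in for the NUL terminator.
env_bytes() {
  env | wc -c | tr -d ' '
}

# Rough size of an argument list in bytes: each argument plus one
# terminating NUL (assumes single-byte characters).
args_bytes() {
  total=0
  for a in "$@"; do
    total=$((total + ${#a} + 1))
  done
  echo "$total"
}

echo "ARG_MAX:           $(getconf ARG_MAX 2>/dev/null || echo unknown)"
echo "environment bytes: $(env_bytes)"
echo "argument bytes:    $(args_bytes k6 run --quiet /test/test.tar)"
```

Comparing the two byte counts against the reported limit would show which part, if either, is close to it.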

yorugac · 2024-06-04T07:42:28Z

> it fails on runners too.

@volnyansky, can you please post the full log from one of those runners?

> I'm running a stress test with a real browser.

I'm a bit confused by "real browser" part: do you mean the xk6-browser?

volnyansky · 2024-06-04T08:19:45Z

> I'm a bit confused by "real browser" part: do you mean the xk6-browser?

Yes, it is xk6.
The log contains only one line:
exec /usr/bin/k6: argument list too long

Also, I figured out that I need to wait until the services left over from the previous test are deleted. Your code collects IPs from the services list, which can also lead to overflow.

volnyansky · 2024-06-04T08:24:46Z

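@yorugac I have an idea for a fix: you can store the IPs in environment variable(s) as a list separated by `;`. Then you can iterate over this list in the Docker start command:
`#!/bin/bash
# IPS holds the runner addresses as one ';'-separated string.
IFS=';' read -ra ARR <<< "$IPS"
for ip in "${ARR[@]}"; do
  # Send the "start" request to each runner in turn.
  curl --retry 3 -X PATCH -H 'Content-Type: application/json' "http://${ip}:6565/v1/status" -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}'
done`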

volnyansky · 2024-06-04T12:21:41Z

@yorugac I've found a final workaround :) I'm running the test in batches and assigning each batch its own namespace. Your code queries the Kubernetes services list, so it possibly returns all services in the namespace and not only those of the current test run.

yorugac · 2024-06-07T09:21:07Z

@volnyansky, WDYM by batches? You're not running 500 instances anymore?

> it is possibly return all services in the namespace and not the current test run

🤔 we'd still need to send a "start" command with something like cURL though.

Could you please clarify a bit? 🙂

volnyansky · 2024-06-09T06:05:01Z

@yorugac I need to run more than 500 instances, 5000 actually. So I split one test into several, and I call them batches. But if all these tests run in one namespace, I still get the "argument list too long" error, and if I isolate each test in its own namespace, I don't get the error.

I agree that you still need to send curl. I just proposed a more compact way to call it, so as not to reach the ARG_MAX limit that causes the "argument list too long" error.

yorugac · 2024-06-10T12:54:13Z

🤔 It's strange that namespace is a factor here... If the test is "split" then it's already producing another curl call, even if both tests are in the same namespace. IIUC, the error appears from curl itself and from k6, not from getting the list of Kubernetes services.

Well, I think it's still about making batches, as described in this comment. Do you happen to have any estimate of what the value of ARG_MAX is? For example, what batch size works for you?

volnyansky · 2024-06-11T15:59:14Z

@yorugac In my environment, ARG_MAX = 131072 bytes.
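
A back-of-the-envelope sketch of what that value implies for the concatenated starter command. It assumes every per-runner command has roughly the length of the one shown earlier in this thread and ignores the environment and any other overhead, so the result is an upper bound rather than a safe batch size.

```shell
#!/bin/sh
# One per-runner command as it appears in the starter pod output;
# the IP address is only a placeholder of typical length.
cmd="curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.232.147:6565/v1/status -d '{\"data\":{\"attributes\":{\"paused\":false,\"stopped\":false},\"id\":\"default\",\"type\":\"status\"}}';"

limit=131072                    # ARG_MAX reported above, in bytes
per_cmd=${#cmd}                 # length of one concatenated command
max_cmds=$((limit / per_cmd))   # how many such commands fit in the limit

echo "bytes per command: $per_cmd"
echo "commands that fit in $limit bytes: $max_cmds"
```

Services left over from earlier test runs in the same namespace would count against the same budget if their IPs end up in the command.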

frittentheke · 2024-08-16T07:42:05Z

If I may kindly point to the discussion about the use of the REST API.
I was commenting about switching to doing the "start" command natively and not via some job -> pod and templated curl invocations: #87 (comment).

It's not only about efficiency, but also about keeping the k6-operator closer in the loop about the state of the runners....

volnyansky added the bug Something isn't working label Jun 3, 2024

yorugac mentioned this issue Aug 20, 2024

Remove starter job in favor of built-in implementation #87

Open
