
Error when run test on 500 pods #408

Open
volnyansky opened this issue Jun 3, 2024 · 13 comments
Labels
bug Something isn't working

Comments

@volnyansky

Brief summary

I'm trying to run a test on 500 pods and get the error:
exec /usr/bin/k6: argument list too long
I found a workaround by batching the test into packages of 300 pods with the same test ID.

k6-operator version or image

0.0.14

Helm chart version (if applicable)

k6-operator-3.6.0

TestRun / PrivateLoadZone YAML

apiVersion: k6.io/v1alpha1
kind: TestRun
metadata:
  name: ${USERNAME}-${SCRIPT}-${BATCH}
  namespace: k6
spec:
  # number of pods to run in parallel
  parallelism: ${BATCH_PODS}
  script:
    configMap:
      name: ${USERNAME}-test-script-${BATCH}
      file: test.tar
  arguments: -o experimental-prometheus-rw --tag testid=${TESTID}
  runner:
    image: 569129334545.dkr.ecr.us-east-1.amazonaws.com/k6-robot-dev:latest
    env:
      - name: K6_PROMETHEUS_RW_SERVER_URL
        value: "http://victoria-metrics-single-server.monitoring.svc.cluster.local:8428/api/v1/write"
      - name: K6_PROMETHEUS_RW_TREND_STATS
        value: "count,sum,min,max,avg,med,p(90),p(95),p(99)"
      - name: K6_BROWSER_ARGS
        value: "window-size=1920x1080,no-sandbox,disable-setuid-sandbox,allow-file-access,use-fake-device-for-media-stream,use-fake-ui-for-media-stream,use-file-for-fake-video-capture=/usr/local/assets/video.mjpeg,use-file-for-fake-audio-capture=/usr/local/assets/audio.wav"
      - name: K6_BROWSER_TIMEOUT
        value: "45s"
      - name: VU_ID_START
        value: "${VU_ID_START}"
    nodeSelector:
      engageli.com/role: k6-load-test
    resources:
      limits:
        cpu: "${CPU}"
        memory: ${MEMORY}Mi
      requests:
        cpu: 100m
        memory: ${MEMORY}Mi

Other environment details (if applicable)

No response

Steps to reproduce the problem

Run a test on 500 pods; the number of VUs doesn't matter.

Expected behaviour

Test runs in the given number of pods

Actual behaviour

Test crashes

@volnyansky volnyansky added the bug Something isn't working label Jun 3, 2024
@volnyansky
Author

I think the issue is the curl command's max argument length; I see the following in the starter pod:
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.232.147:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.109.212:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.60.188:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.188.109:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.112.255:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.143.154:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.164.249:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.39.112:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.214.183:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.199.160:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.110.241:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.204.180:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.44.121:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}';
curl --retry 3 -X PATCH -H 'Content-Type: application/json' http://10.100.33.28:6565/v1/status -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","typ ...

@yorugac
Collaborator

yorugac commented Jun 3, 2024

Hi @volnyansky, this is certainly a new one 😅 Did the error `exec /usr/bin/k6: argument list too long` come from the starter pod then?

Out of curiosity, what are you testing that you need such a large test?

On a solution: that command is just an iterative concatenation, so I guess we could split it into several commands when there are lots of instances. The question is what kind of values for ARG_MAX can be expected in Kubernetes deployments.
The starter command executes sequentially right now anyway, which is probably not ideal for a test as large as this one. But figuring out how to parallelize it would definitely be a harder problem.
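
As a rough illustration of that splitting idea (only a sketch, not how the operator currently works; `RUNNER_IPS` is a hypothetical space-separated list of runner pod IPs), the starter could loop over the IPs instead of building one giant command string, which keeps every individual argument small and allows bounded parallelism via xargs:

```bash
#!/bin/sh
# RUNNER_IPS is assumed to hold the runner pod IPs, e.g.
# "10.100.232.147 10.100.109.212 10.100.60.188 ...".
# One short curl per IP instead of one huge concatenated command line;
# xargs -P bounds how many requests run in parallel.
printf '%s\n' $RUNNER_IPS | xargs -P 20 -I {} \
  curl --retry 3 -X PATCH -H 'Content-Type: application/json' \
    "http://{}:6565/v1/status" \
    -d '{"data":{"attributes":{"paused":false,"stopped":false},"id":"default","type":"status"}}'
```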

@volnyansky
Author

volnyansky commented Jun 3, 2024

@yorugac I'm running a stress test with a real browser. I need to test not only REST and WebSocket APIs but also WebRTC, so I can't run thousands of robots in one pod.
Yes, I hit the issue in the starter pod.

@volnyansky
Author

volnyansky commented Jun 3, 2024

Update: it fails on runners too. The strangest thing is that the command line is not too long:
k6 run --quiet --execution-segment=7/250:8/250 --execution-segment-sequence=0,1/250,2/250,3/250,4/250,5/250,6/250,7/250,8/250,9/250,10/250,11/250,12/250,13/250,14/250,15/250,16/250,17/250,18/250,19/250,20/250,21/250,22/250,23/250,24/250,25/250,26/250,27/250,28/250,29/250,30/250,31/250,32/250,33/250,34/250,35/250,36/250,37/250,38/250,39/250,40/250,41/250,42/250,43/250,44/250,45/250,46/250,47/250,48/250,49/250,50/250,51/250,52/250,53/250,54/250,55/250,56/250,57/250,58/250,59/250,60/250,61/250,62/250,63/250,64/250,65/250,66/250,67/250,68/250,69/250,70/250,71/250,72/250,73/250,74/250,75/250,76/250,77/250,78/250,79/250,80/250,81/250,82/250,83/250,84/250,85/250,86/250,87/250,88/250,89/250,90/250,91/250,92/250,93/250,94/250,95/250,96/250,97/250,98/250,99/250,100/250,101/250,102/250,103/250,104/250,105/250,106/250,107/250,108/250,109/250,110/250,111/250,112/250,113/250,114/250,115/250,116/250,117/250,118/250,119/250,120/250,121/250,122/250,123/250,124/250,125/250,126/250,127/250,128/250,129/250,130/250,131/250,132/250,133/250,134/250,135/250,136/250,137/250,138/250,139/250,140/250,141/250,142/250,143/250,144/250,145/250,146/250,147/250,148/250,149/250,150/250,151/250,152/250,153/250,154/250,155/250,156/250,157/250,158/250,159/250,160/250,161/250,162/250,163/250,164/250,165/250,166/250,167/250,168/250,169/250,170/250,171/250,172/250,173/250,174/250,175/250,176/250,177/250,178/250,179/250,180/250,181/250,182/250,183/250,184/250,185/250,186/250,187/250,188/250,189/250,190/250,191/250,192/250,193/250,194/250,195/250,196/250,197/250,198/250,199/250,200/250,201/250,202/250,203/250,204/250,205/250,206/250,207/250,208/250,209/250,210/250,211/250,212/250,213/250,214/250,215/250,216/250,217/250,218/250,219/250,220/250,221/250,222/250,223/250,224/250,225/250,226/250,227/250,228/250,229/250,230/250,231/250,232/250,233/250,234/250,235/250,236/250,237/250,238/250,239/250,240/250,241/250,242/250,243/250,244/250,245/250,246/250,247/250,248/250,249/250,1 -o experimental-prometheus-rw --tag testid=stas-browser-mock-login-test-7.5k-2024-06-03-21-07-03 /test/test.tar --address=0.0.0.0:6565 --paused --tag instance_id=8 --tag job_name=stas-browser-mock-login-test-0-8

@yorugac
Collaborator

yorugac commented Jun 4, 2024

> it fails on runners too.

@volnyansky, can you please post the full log from one of those runners?

> I'm running a stress test with a real browser.

I'm a bit confused by "real browser" part: do you mean the xk6-browser?

@volnyansky
Author

> I'm a bit confused by "real browser" part: do you mean the xk6-browser?

Yes, it is xk6-browser.
The log contains only one line:
exec /usr/bin/k6: argument list too long

Also, I figured out that I need to wait until the Services left over from the previous test are deleted. Your code collects IPs from the Service list, which can also lead to overflow.
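
As a quick sanity check before starting the next run (an editorial aside, not an operator feature): the leftover Services from a previous TestRun are ordinary Kubernetes objects, so they can be inspected with plain kubectl.

```bash
# List the Services still present in the test namespace; Services left over
# from a previous TestRun would inflate the IP list the starter has to address.
kubectl get services -n k6
```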

@volnyansky
Author

volnyansky commented Jun 4, 2024

@yorugac I have an idea for a fix: you could store the IPs in an env variable as a ;-separated list, and then iterate over that list in the container start command:

#!/bin/bash

IFS=';' read -ra ARR <<< "$IPS"

for i in "${ARR[@]}"; do
  # send the start request to the runner at IP "$i"
  curl -X PATCH "$i"
done
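
A hypothetical usage of that snippet (the IP values and the `start.sh` name are made up for illustration): the joined list itself stays small, roughly a few kilobytes even for 500 runners, so only each individual curl invocation has to fit within the argument limits.

```bash
# Hypothetical: IPs joined with ';' into one env variable by whoever builds the pod spec.
export IPS="10.100.232.147;10.100.109.212;10.100.60.188"
./start.sh   # runs one short curl per runner instead of one huge command line
```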

@volnyansky
Author

@yorugac I've found the final workaround :) I'm running the test in batches and assigning each batch its own namespace. Your code queries the Kubernetes Service list, so it possibly returns all Services in the namespace and not only those of the current test run.

@yorugac
Collaborator

yorugac commented Jun 7, 2024

@volnyansky, WDYM by batches? You're not running 500 instances anymore?

> it possibly returns all Services in the namespace and not only those of the current test run

🤔 we'd still need to send a "start" command with something like cURL though.

Could you please clarify a bit? 🙂

@volnyansky
Author

@yorugac I need to run more than 500 instances, 5000 actually. So I split one test into several, and I call those batches. But if all of these tests run in one namespace, I still get the "argument list too long" error, and if I isolate each test in its own namespace, I don't.

I agree that you still need to send the curl requests; I just proposed a more compact way to call them, so as not to hit the ARG_MAX limit that causes the "argument list too long" error.
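
To make the batching workaround above concrete, here is a rough sketch of how it could be driven (assumptions not shown in this thread: the TestRun manifest quoted earlier is an envsubst template, its metadata.namespace is parameterized as ${NAMESPACE}, and the other template variables such as USERNAME and TESTID are already exported):

```bash
#!/bin/bash
# Split one large test into per-namespace batches so each TestRun stays
# below the point where the starter's concatenated curl command hits ARG_MAX.
TOTAL_PODS=5000
BATCH_PODS=300
BATCHES=$(( (TOTAL_PODS + BATCH_PODS - 1) / BATCH_PODS ))

for BATCH in $(seq 1 "$BATCHES"); do
  NAMESPACE="k6-batch-${BATCH}"    # one namespace per batch (illustrative name)
  kubectl create namespace "$NAMESPACE" --dry-run=client -o yaml | kubectl apply -f -
  export BATCH BATCH_PODS NAMESPACE
  envsubst < testrun-template.yaml | kubectl apply -f -
done
```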

@yorugac
Collaborator

yorugac commented Jun 10, 2024

🤔 It's strange that the namespace is a factor here... If the test is "split", then it's already producing a separate curl command, even when both tests are in the same namespace. IIUC, the error appears from curl itself and from k6, not from getting the list of Kubernetes Services.

Well, I think it's still about making batches, as described in this comment. Do you happen to have an estimate of what the value of ARG_MAX is? For example, what batch size works for you?

@volnyansky
Author

volnyansky commented Jun 11, 2024

@yorugac In my env, ARG_MAX = 131072 bytes.
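
For reference (not from the thread, just a way to reproduce that number, assuming getconf is available in the image): the limit can be queried from inside the starter or runner container. On Linux there is additionally a per-argument cap of 32 pages (128 KiB with 4 KiB pages), which is what a single huge semicolon-joined command string passed as one argument would run into.

```bash
# Inside the container:
getconf ARG_MAX     # total space for argv + environment (reported as 131072 above)
getconf PAGE_SIZE   # on Linux the per-argument cap (MAX_ARG_STRLEN) is 32 * PAGE_SIZE
```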

@frittentheke
Contributor

If I may kindly point to the discussion about the use of the REST API.
I was commenting about switching to doing the "start" command natively and not via some job -> pod and templated curl invocations: #87 (comment).

It's not only about efficiency, but also about keeping the k6-operator closer in the loop about the state of the runners....
