[Question]: Slow file parsing #3952

Open
cuihangrui opened this issue Dec 10, 2024 · 14 comments
Labels
question Further information is requested

Comments

@cuihangrui

Describe your problem

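Has the latest version fixed this problem? I am using version 0.13, and parsing of uploaded files is slow: after 10 hours the progress has barely increased.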

@cuihangrui cuihangrui added the question Further information is requested label Dec 10, 2024
@JinHai-CN JinHai-CN changed the title [Question]: 关于解析文件慢的问题 [Question]: Slow file parsing Dec 10, 2024
@JinHai-CN
Contributor

We intend to create an international community, so we encourage using English for communication.

@JinHai-CN
Contributor

  1. Try the latest 'dev' release.
  2. Check the CPU workload while RAGFlow is parsing files.
  3. Provide more information about the knowledge base, such as the chunking method, file type, and embedding model.

@cuihangrui
Author

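  1. You may use the latest 'dev' release.
  2. Check the CPU workload when RAGFlow is parsing file.
  3. Provide more information on the knowledge base, such as chunking method, file type, embedding model.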

With the older version I am using now, restarting the container improves the parsing speed, but after a while it becomes slow again.

@JinHai-CN
Contributor

Perhaps too many parsing tasks are blocking, so the parsing speed looks very slow.

@cuihangrui
Author

Perhaps too many parsing tasks are blocking, so the parsing speed looks very slow.

Are you saying that uploading multiple files at once can cause this problem? Or does parsing become slow when a file that is still being parsed is deleted through the back-end interface, and a new file is then uploaded and parsed?

@cuihangrui
Author

Perhaps too many parsing tasks are blocking, so the parsing speed looks very slow.
And my CPU load is not very high.
image
image

@yingfeng
Member

You can configure more task_executors in entrypoint.sh to increase the parallelism. You can also work out the time breakdown: how much time is spent on OCR, how much on embedding, and so on. If embedding takes too large a share, you can use a GPU to accelerate the embedding process.
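
To make the time-breakdown idea concrete, here is a minimal sketch of turning per-stage timings into percentages. The helper name `stage_shares`, the stage names, and the numbers are all invented for illustration; nothing here reads real RAGFlow logs, so the timings would have to be collected separately.

```shell
#!/bin/sh
# Hypothetical helper (not part of RAGFlow): reads "stage seconds" pairs on
# stdin and prints each stage's share of the total time.
stage_shares() {
  awk '
    { name[NR] = $1; secs[NR] = $2; total += $2 }
    END {
      if (total <= 0) exit 1          # nothing to report
      for (i = 1; i <= NR; i++)
        printf "%s %.1f%%\n", name[i], 100 * secs[i] / total
    }'
}

# Invented timings (in seconds) for a single document.
printf '%s\n' "ocr 120" "chunking 30" "embedding 450" | stage_shares
# prints:
#   ocr 20.0%
#   chunking 5.0%
#   embedding 75.0%
```

In this made-up case embedding accounts for 75% of the time, which is the kind of result that would point toward GPU acceleration rather than more task executors.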

@dassio

dassio commented Dec 10, 2024

You can configure more task_executors in entrypoint.sh to increase the parallelism. Also, you can figure out the time occupation percentage, say how much time is taken on OCR, how much time is taken by embedding,...,etc. If the percentage of embedding is too high, you can choose to use GPU to accelerate embedding process.

How can I configure entrypoint.sh for more task executors?

@KevinHuSh
Collaborator

Increase WS.
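
For readers unfamiliar with the pattern, the following is a rough sketch of how a variable like WS can control the number of background workers in an entrypoint-style script. It is not the actual RAGFlow entrypoint.sh: `run_worker` and `start_workers` are hypothetical names, and the worker body is a placeholder for whatever command starts one task executor.

```shell
#!/bin/sh
# Sketch only: NOT the real RAGFlow entrypoint.sh.
# WS is the variable named in this thread; everything else is hypothetical.

WS="${WS:-1}"   # fall back to a single worker when WS is unset or empty

run_worker() {
  # Placeholder: a real script would start one task executor process here.
  echo "worker $1 started"
}

start_workers() {
  i=0
  while [ "$i" -lt "$WS" ]; do
    run_worker "$i" &   # each worker runs in the background
    i=$((i + 1))
  done
  wait                  # block until every worker has exited
}

start_workers
```

With a loop of this shape, a larger WS simply starts more workers in parallel. Whether that helps depends on where the time actually goes, since more workers also compete for CPU and memory.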

@y0ung-y

y0ung-y commented Dec 11, 2024

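You can configure more task_executors in entrypoint.sh to increase the parallelism. Also, you can figure out the time occupation percentage, say how much time is taken on OCR, how much time is taken by embedding,...,etc. If the percentage of embedding is too high, you can choose to use GPU to accelerate embedding process.

How can I accelerate the embedding process using GPUs?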

@y0ung-y

y0ung-y commented Dec 11, 2024

Increase WS.

Can I simply set a larger WS value on line 19 of entrypoint.sh, e.g. WS=10?

@KevinHuSh
Collaborator

You could test it ^^

@cuihangrui
Author

You could test it ^^

If I change the WS value in entrypoint.sh, do I run the script directly to start it? What does this file do? I don't see docker-compose calling this .sh file.

@KevinHuSh
Collaborator

After changing entrypoint.sh outside the Docker image, you need to either rebuild the Docker image or mount the modified entrypoint.sh into the container.
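
As an illustration of the mount option, the sketch below prints a Compose override fragment that bind-mounts an edited entrypoint.sh over the one in the image. Two things are assumptions and not confirmed in this thread: that the Compose service is named `ragflow`, and that the script lives at `/ragflow/entrypoint.sh` inside the container. Check your own docker-compose.yml and image before relying on either. `print_override` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only. ASSUMPTIONS: service name "ragflow" and container path
# /ragflow/entrypoint.sh; verify both against your own setup.

print_override() {
  # $1: host path to the edited entrypoint.sh
  cat <<EOF
services:
  ragflow:
    volumes:
      - $1:/ragflow/entrypoint.sh
EOF
}

# Print the fragment; redirect it to a file next to docker-compose.yml to use it.
print_override ./entrypoint.sh
```

Saved as docker-compose.override.yml beside docker-compose.yml, a fragment like this is picked up by Docker Compose by default, and recreating the container applies it. A bind-mounted file keeps its host permissions, so the edited script may need to be executable on the host.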
