[Question]: Multilingual support between embedding knowledge base, retrieval testing, search, and assistant chat #4503

Open
predoctech opened this issue Jan 16, 2025 · 4 comments
Labels
Feature question Further information is requested

Comments

@predoctech

Describe your problem

Since this project has a Chinese/English focus, I experimented with a bilingual test case.
The source document is in Chinese:
Screenshot from 2025-01-16 12-31-16
Embedding is done with maidalun1020/bce-embedding-base_v1, which I understand to be a bilingual and crosslingual embedding model.
My assumption was that, although the source document is in Chinese, I would be able to perform retrieval testing, search, and chat in English whenever the semantic meaning of a chunk matches. The LLM deployed (Gemini) obviously needs to be bilingual as well, which it is.
However, that is not what I have experienced.
Retrieval testing: Always returns "no data"
Search: No result
Screenshot from 2025-01-16 12-39-39
Chat: Knowledge base is empty
Screenshot from 2025-01-16 12-41-21
Please advise whether multilingual support is available in RAGFlow, or whether what I attempted was not the correct approach for this purpose. Thanks.

@predoctech predoctech added the question Further information is requested label Jan 16, 2025
@senovr

senovr commented Jan 16, 2025

I would second this question. I tried a multi-language use case (one knowledge base, documents in two languages, and e5-medium as the embedder, which is multilingual).
When I ask a question in English, only English documents are used for reference; when I ask in the second language, only second-language documents are used.

@KevinHuSh
Collaborator

Multilingual search is not supported well so far.

@predoctech
Author

After further experiments, I found that the limitation has more to do with the RAG process than with the LLM. Basically, the embedding vector of an English question does not retrieve any embedding vectors of Chinese data, which makes any subsequent LLM interaction irrelevant. However, according to the description of the BCEmbedding model:
EmbeddingModel handle bilingual and crosslingual retrieval task in English and Chinese
So why would this become a hurdle when the model is adopted and used within RAGFlow?
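
One way to narrow this down is to score the Chinese chunks against an English query outside RAGFlow, using the embedding model alone. The following is a minimal sketch of that check and not RAGFlow's retrieval code: `cosine_similarity` and `rank_chunks` are hypothetical helper names, and `embed` is an assumed callable that turns a string into a vector with whatever embedding model is under test.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_chunks(
    query: str,
    chunks: Sequence[str],
    embed: Callable[[str], Sequence[float]],  # assumption: wraps the model under test
) -> List[Tuple[str, float]]:
    """Score every chunk against the query by vector similarity only, best first."""
    query_vec = embed(query)
    scored = [(chunk, cosine_similarity(query_vec, embed(chunk))) for chunk in chunks]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With the sentence-transformers package installed, `embed` could wrap `SentenceTransformer("maidalun1020/bce-embedding-base_v1").encode`. If the Chinese chunks score well for an English query here but RAGFlow still returns nothing, the gap is more likely in the surrounding retrieval pipeline than in the embedding model, though that reading is a guess and not a confirmed diagnosis.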

@senovr

senovr commented Jan 17, 2025 via email


3 participants