Best parameters for 40 million embeddings #315

Open
sujigrena opened this issue May 28, 2021 · 2 comments

Comments

sujigrena commented May 28, 2021

Hi Team @piem @fabiencastan @groodt @2ooom @vinnitu @yurymalkov,

We need to find the best match in a gallery of about 40 million embeddings (embedding size 128), with the best possible performance and accuracy. Could you please suggest a suitable distance type and suitable ef and M parameters? We are having a hard time figuring these out, and we hope your experience with huge datasets can help us refine the parameters and arrive at optimal results. Thanks in advance.

@yurymalkov
Member

Hi @sujigrena,
The optimal parameters depend on the intrinsic dimensionality of the data, so it is hard to give exact values (unless you have an estimate, e.g. the clustering factor of the k-NN graph).
The distance type depends on the origin of the vectors. If they are the output of a neural network, I would recommend training directly with an objective for the chosen distance (by default a neural classifier is trained for inner product; this can be changed to L2 or cosine).
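To make the distance question concrete, here is a small pure-Python sketch (not from the library; all helper names are hypothetical). It shows that L2, inner product, and cosine can rank the same gallery differently on raw vectors, while on unit-normalized vectors all three give the same ordering. The `ip_distance` helper assumes the "1 minus dot product" convention that hnswlib documents for its `ip` space.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_sq(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ip_distance(a, b):
    """Inner-product 'distance': 1 - dot product (assumed convention)."""
    return 1.0 - dot(a, b)


def cosine_distance(a, b):
    """1 - cosine similarity."""
    return 1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def rank(query, gallery, dist):
    """Gallery indices sorted from nearest to farthest under `dist`."""
    return sorted(range(len(gallery)), key=lambda i: dist(query, gallery[i]))


# Toy data: item 0 points the same way as the query but has a large norm.
query = [1.0, 0.0]
gallery = [[10.0, 0.0], [1.0, 0.1], [0.0, 1.0]]

raw_l2 = rank(query, gallery, l2_sq)
raw_ip = rank(query, gallery, ip_distance)
raw_cos = rank(query, gallery, cosine_distance)

# After normalization the three metrics agree on the ordering.
unit_gallery = [normalize(g) for g in gallery]
unit_query = normalize(query)
unit_l2 = rank(unit_query, unit_gallery, l2_sq)
unit_ip = rank(unit_query, unit_gallery, ip_distance)
unit_cos = rank(unit_query, unit_gallery, cosine_distance)
```

So if the embeddings are normalized, the choice among the three spaces does not change which neighbors are returned; it matters when vector norms carry information.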
I would start with M=16 and set up a benchmark that checks accuracy on the query set. Build an index, find the ef that gives high recall (e.g. 0.95), and set ef_construction to that value. As a rule of thumb, increase M if ef_construction is more than a thousand, and repeat. Also please look at https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md
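The tuning loop described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names (`recall_at_k`, `smallest_ef_reaching`, `tune_with_hnswlib`), the ef grid, and the starting `ef_construction=200` are made up for the example, and the ground truth is computed by brute force with L2, so it is meant to run on a sample of the gallery rather than on all 40 million vectors.

```python
from typing import Callable, Iterable, Optional, Sequence


def recall_at_k(found: Sequence[Sequence[int]], truth: Sequence[Sequence[int]]) -> float:
    """Fraction of the true nearest neighbours that the approximate search returned."""
    hits = sum(len(set(f) & set(t)) for f, t in zip(found, truth))
    total = sum(len(t) for t in truth)
    return hits / total


def smallest_ef_reaching(
    target: float,
    ef_values: Iterable[int],
    search: Callable[[int], Sequence[Sequence[int]]],
    truth: Sequence[Sequence[int]],
) -> Optional[int]:
    """Return the smallest ef whose recall meets the target, or None if none does."""
    for ef in sorted(ef_values):
        if recall_at_k(search(ef), truth) >= target:
            return ef
    return None


def tune_with_hnswlib(data, queries, k=10, M=16, ef_construction=200, target=0.95):
    """Build one index on a sample and sweep the query-time ef (needs numpy, hnswlib)."""
    import numpy as np
    import hnswlib

    # Exact L2 neighbours by brute force. The per-query |q|^2 term is dropped
    # because it does not change the ordering within a row.
    d2 = (data ** 2).sum(axis=1)[None, :] - 2.0 * (queries @ data.T)
    truth = np.argsort(d2, axis=1)[:, :k].tolist()

    index = hnswlib.Index(space="l2", dim=data.shape[1])
    index.init_index(max_elements=len(data), ef_construction=ef_construction, M=M)
    index.add_items(data, np.arange(len(data)))

    def search(ef):
        index.set_ef(ef)  # query-time ef
        labels, _ = index.knn_query(queries, k=k)
        return labels.tolist()

    # Arbitrary grid for illustration; ef below k is skipped.
    grid = [ef for ef in (10, 20, 50, 100, 200, 400, 800, 1600) if ef >= k]
    return smallest_ef_reaching(target, grid, search, truth)
```

With float32 arrays `data` and `queries`, `tune_with_hnswlib(data, queries)` returns the smallest ef from the grid that reaches the target recall, or `None`. Following the rule of thumb above, that value would then be used as `ef_construction` for a rebuild, and if it comes out above a thousand, M would be increased and the sweep repeated.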

sujigrena commented May 31, 2021

Thank you for the input, @yurymalkov. One question: is there a way, or any shortcut, to arrive at that estimate?
