Threaded add_items issue #28

Open
sumsuddin opened this issue Jun 12, 2018 · 3 comments

sumsuddin commented Jun 12, 2018

I was looking into the Python example. In my experience, the threaded add_items gives me different results and a different accuracy every time I run the script.

I think using multiple threads while adding the items is wrong here.

p.set_num_threads(4) # by default using all available cores

Moreover, when I used the cosine space, the accuracy was around 50% for some of the generated indexes.
Index(space='cosine', dim=dim)

When I used a single thread, the results were consistent every time.
p.set_num_threads(1)

Can someone clarify the issue?

@sumsuddin changed the title from "Threaded add_item issue" to "Threaded add_items issue" Jun 12, 2018
@yurymalkov
Member

Hi @sumsuddin,
Can you please provide a demo script to understand what is going on?

@sumsuddin
Author

sumsuddin commented Jun 19, 2018

I can't share the private data I was working with, but here is a randomly generated NumPy array that I saved to a file. I have attached the file so that you can investigate.

import numpy as np

# Generating sample data (kept commented out; the attached file is loaded instead)
#data = np.float32(np.random.random((num_elements, dim)))
#np.savetxt('data.txt', data)
data = np.loadtxt('data.txt')

For this specific set of random numbers (attached file), I get one of the following two recall values, varying randomly from run to run.

Recall for two batches: 0.99990000000000001 (this happens rarely)
Recall for two batches: 1.0 (I mostly get this one)

Increasing the number of items makes the issue more obvious in my experiments.
num_elements = 100000
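
One way the run-to-run comparison could be scripted is sketched below. It is only an illustration: the helper names `recall_at_1`, `build_and_measure` and `compare` are hypothetical, and the values `ef_construction=100`, `M=16` and `ef=10` are assumptions rather than settings confirmed in this thread. hnswlib is imported inside the function so that the recall helper can be used on its own.

```python
def recall_at_1(labels, expected):
    """Fraction of queries whose top-1 label equals the expected id."""
    hits = sum(1 for got, want in zip(labels, expected) if got == want)
    return float(hits) / len(expected)


def build_and_measure(data, num_threads, space="l2"):
    """Build an index with `num_threads` threads and return its self-recall.

    `data` is expected to be a 2-D float32 NumPy array.
    """
    import hnswlib  # third-party; assumed to be installed

    num_elements, dim = data.shape
    p = hnswlib.Index(space=space, dim=dim)
    # ef_construction and M are assumed values, not taken from the thread.
    p.init_index(max_elements=num_elements, ef_construction=100, M=16)
    p.set_num_threads(num_threads)
    p.add_items(data)
    p.set_ef(10)  # assumed query-time ef
    # Query the index with the inserted items; item i should return label i.
    labels, _ = p.knn_query(data, k=1)
    return recall_at_1(labels.reshape(-1).tolist(), range(num_elements))


def compare(data, runs=5):
    """Print the recall of several independent builds for 1 and 4 threads."""
    for threads in (1, 4):
        recalls = [build_and_measure(data, threads) for _ in range(runs)]
        print(threads, recalls)
```

Calling `compare(data)` with the array loaded from data.txt would print one list of recall values per thread count, which makes any spread between runs easy to see.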

I guess you can find easier ways to reproduce the issue.
Thanks for your time.

Python version : Python 2.7.6
OS: Ubuntu 14.04.5 LTS

data.txt

@yurymalkov
Member

I see. Thanks!
It seems there are only two options to solve this:

  1. Use single-threaded construction.
  2. Set high ef/efConstruction values, so that the search is almost exact.
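
A minimal sketch of the two options, using the hnswlib Python bindings, might look like this. The helper names `stable_build_settings` and `build_index` are hypothetical, and the concrete numbers (400 and 200 for the "high" values) are assumptions picked for illustration; suitable values depend on the data.

```python
def stable_build_settings(single_threaded=True):
    """Return illustrative settings for one of the two workarounds."""
    if single_threaded:
        # Option 1: build with one thread; other values are assumed defaults.
        return {"num_threads": 1, "ef_construction": 100, "ef": 10}
    # Option 2: keep several threads but raise ef/efConstruction (assumed values).
    return {"num_threads": 4, "ef_construction": 400, "ef": 200}


def build_index(data, settings, space="l2", M=16):
    """Build an hnswlib index from a 2-D float32 NumPy array."""
    import hnswlib  # third-party; assumed to be installed

    num_elements, dim = data.shape
    p = hnswlib.Index(space=space, dim=dim)
    p.init_index(max_elements=num_elements,
                 ef_construction=settings["ef_construction"], M=M)
    p.set_num_threads(settings["num_threads"])
    p.add_items(data)
    p.set_ef(settings["ef"])  # query-time ef
    return p
```

Option 1 trades construction speed for repeatability, while option 2 keeps parallel construction and spends more work per insertion and per query instead.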

There is a potential fix that can stabilize the randomness to some extent - setting the element levels before the actual insertion (it would require updating bindings), but it will not solve the problem completely.
I think that hnsw in faiss (e.g. https://github.com/facebookresearch/faiss/blob/master/benchs/bench_hnsw.py) works that way. You can try it (although it is generally slower than hnswlib at fixed accuracy).
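
The idea of setting element levels before insertion can be illustrated with a small pure-Python sketch. It is not hnswlib's or faiss's code: the function names are hypothetical, the seed and `M=16` are assumptions, and a shuffled id list stands in for a different thread interleaving. It uses the usual HNSW level rule, floor(-ln(U) / ln(M)) with U uniform.

```python
import math
import random


def draw_level(rng, M=16):
    """Draw an HNSW level as floor(-ln(U) / ln(M)) with U uniform in (0, 1]."""
    u = 1.0 - rng.random()  # rng.random() is in [0, 1); shift to (0, 1]
    return int(-math.log(u) / math.log(M))


def levels_in_arrival_order(arrival_order, seed=100, M=16):
    """Levels drawn at insertion time, so they depend on which id arrives first."""
    rng = random.Random(seed)
    return {item_id: draw_level(rng, M) for item_id in arrival_order}


def levels_preassigned(num_elements, seed=100, M=16):
    """Levels drawn up front in id order, before any thread inserts."""
    rng = random.Random(seed)
    return [draw_level(rng, M) for _ in range(num_elements)]


ids = list(range(1000))
shuffled = ids[:]
random.Random(1).shuffle(shuffled)  # stand-in for another thread interleaving

a = levels_in_arrival_order(ids)
b = levels_in_arrival_order(shuffled)
print("arrival-order levels identical:", a == b)
print("pre-assigned levels identical:",
      levels_preassigned(1000) == levels_preassigned(1000))
```

Pre-assigning fixes only which level each element gets. The neighbour links still depend on the order in which threads insert elements, which matches the remark above that this would not solve the problem completely.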

jelmerk pushed a commit to jelmerk/hnswlib that referenced this issue May 21, 2019
…level, this should make the index a lot more stable see : nmslib/hnswlib#28