index.add_items() stuck when using the same data #130

Open
vasi-grigo opened this issue Jul 30, 2019 · 6 comments

@vasi-grigo

This could potentially be classified as a bug but occurs only if you're heavily misusing the library:

import hnswlib
import numpy

dim = 128
cnt = 84000

# Build a batch of `cnt` copies of the same random vector.
el = numpy.float32(numpy.random.random((1, dim))).tolist()[0]
vals = [el for i in range(cnt)]
ids = list(range(cnt))

idx = hnswlib.Index(space='l2', dim=dim)
idx.init_index(max_elements=cnt, ef_construction=300, M=200)
idx.add_items(vals, ids)  # <--- chokes here

84k was the magic number for me, but it depends on the system it's running on.
I found myself debugging this issue as a consequence of our everyday sanity tests, which keep inserting the same record over and over again.

Greetings to fellow countrymen, and thanks for the solid work 💪 👍

@yurymalkov
Member

Hi @vasi-grigo,

Thank you! I'll check it. Not sure why this happens.

@yurymalkov
Member

@vasi-grigo
Cannot reproduce... The snippet takes a lot of time, but it finishes.
Tried with cnt = 10000000, M=30 (to speed up). Still finishes.

@vasi-grigo
Author

I encountered it while running in a container that was allocated 4 vCPUs on GCP.
Possibly I didn't wait long enough (it had been stuck for an hour or so, though; I have a problem with temper and patience).
All things considered, sticking the same elements into the index with different labels at scale = misusing the library, so this can't be considered a bug/issue.
In our case, it just wasn't obvious at first that we were inserting the same records over and over again.
Maybe it's worth considering adding a simple check during insertion that at least 2 items are unique (see the sketch below). Insertion would then fail earlier and, thus, problems could be detected earlier (it doesn't make sense to waste a lot of time building an index that you can't search anyway). But this is just thinking out loud ...
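For what it's worth, a check like that can already be done on the caller's side. A minimal sketch (not part of hnswlib; add_unique_items is a hypothetical helper) that drops exact duplicate rows and refuses to build an index from a single repeated vector:

import hnswlib
import numpy

def add_unique_items(index, data, ids):
    """Hypothetical caller-side guard: insert only distinct vectors."""
    data = numpy.asarray(data, dtype=numpy.float32)
    ids = numpy.asarray(ids)
    # numpy.unique over rows returns the index of the first occurrence of
    # each distinct vector; everything else is an exact duplicate.
    _, first = numpy.unique(data, axis=0, return_index=True)
    if len(first) < 2:
        raise ValueError("all items are identical; refusing to build the index")
    index.add_items(data[first], ids[first])

dim, cnt = 128, 1000
idx = hnswlib.Index(space='l2', dim=dim)
idx.init_index(max_elements=cnt, ef_construction=300, M=200)
add_unique_items(idx, numpy.random.random((cnt, dim)), list(range(cnt)))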

@yurymalkov
Member

@vasi-grigo
Ok. I think this can be solved by a warning about the loss of performance in this case.
One problem is that it is not clear from the algorithm's perspective whether the elements are the same (e.g. for inner product space the distance is not zero). Maybe we should post a warning if the distance is exactly zero.

@vasi-grigo
Author

Yep, precisely. How can the algorithm distinguish items if all of them are identical?
An index that contains only identical elements has no value, so why bother building it at all?

@yurymalkov
Member

Well, identical elements can have different labels and, if there are only a few of them, the algorithm can handle it.
The problem is that the insertion method in the library cannot tell whether the elements are the same, because it only gets a distance. For Euclidean and cosine similarity the distance is zero, so, probably, giving a warning for zero distance might solve this. Will check it.
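As a rough illustration of that idea from the Python side (the real warning would have to live inside the C++ insertion code), one could query the current index for each incoming vector and warn when the nearest-neighbour distance is exactly zero. add_with_duplicate_warning below is a hypothetical wrapper, assuming an 'l2' or 'cosine' space where identical vectors really do end up at distance 0:

import warnings
import hnswlib
import numpy

def add_with_duplicate_warning(index, data, ids):
    """Hypothetical wrapper: warn when new items duplicate existing ones."""
    data = numpy.asarray(data, dtype=numpy.float32)
    if index.get_current_count() > 0:
        # A nearest-neighbour distance of 0 means the vector is already in the
        # index under some other label (holds for 'l2' and 'cosine' spaces, but
        # not for 'ip', where the distance between identical vectors is not zero).
        _, distances = index.knn_query(data, k=1)
        duplicates = int((distances[:, 0] == 0.0).sum())
        if duplicates:
            warnings.warn("%d item(s) duplicate existing elements; "
                          "many identical vectors slow insertion down badly" % duplicates)
    index.add_items(data, ids)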
