index.add_items() stuck when using the same data #130

Open
vasi-grigo opened this issue Jul 30, 2019 · 6 comments

@vasi-grigo

This could potentially be classified as a bug but occurs only if you're heavily misusing the library:

import hnswlib
import numpy

dim = 128
cnt = 84000

# Build a batch of `cnt` copies of the same random vector.
el = numpy.float32(numpy.random.random((1, dim))).tolist()[0]
vals = [el for i in range(cnt)]
ids = list(range(cnt))

idx = hnswlib.Index(space='l2', dim=dim)
idx.init_index(max_elements=cnt, ef_construction=300, M=200)
idx.add_items(vals, ids)  # <--- chokes here

84k was the magic number for me, but it depends on the system it's running on.
I found myself debugging this issue as a consequence of our everyday sanity tests, which keep inserting the same record over and over again.

Greetings to fellow countrymen, and thanks for the solid work 💪 👍

@yurymalkov
Member

Hi @vasi-grigo,

Thank you! I'll check it. Not sure why this happens.

@yurymalkov
Member

@vasi-grigo
Cannot reproduce... The snippet takes a lot of time, but it finishes.
Tried with cnt = 10000000, M=30 (to speed up). Still finishes.

@vasi-grigo
Author

I encountered it while running in a container that was allocated 4 vCPUs on GCP.
Possibly I didn't wait long enough (it had been stuck for an hour or so, though; I have a problem with temper and patience).
All things considered, sticking the same elements into the index with different labels at scale = misusing the library, so this can't be considered a bug/issue.
In our case, it just wasn't obvious at first that we were inserting the same records over and over again.
Maybe it's worth considering adding a simple check during insertion that at least 2 items are unique (see the sketch below). Insertion would then fail earlier and, thus, problems could be detected earlier (it doesn't make sense to waste a lot of time building an index that you can't search anyway). But this is just thinking out loud ...
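For what it's worth, a check like that can already be done on the caller's side. A minimal sketch (not part of hnswlib; add_unique_items is a hypothetical helper) that drops exact duplicate rows and refuses to build an index from a single repeated vector:

import hnswlib
import numpy

def add_unique_items(index, data, ids):
    """Hypothetical caller-side guard: insert only distinct vectors."""
    data = numpy.asarray(data, dtype=numpy.float32)
    ids = numpy.asarray(ids)
    # numpy.unique over rows returns the index of the first occurrence of
    # each distinct vector; everything else is an exact duplicate.
    _, first = numpy.unique(data, axis=0, return_index=True)
    if len(first) < 2:
        raise ValueError("all items are identical; refusing to build the index")
    index.add_items(data[first], ids[first])

dim, cnt = 128, 1000
idx = hnswlib.Index(space='l2', dim=dim)
idx.init_index(max_elements=cnt, ef_construction=300, M=200)
add_unique_items(idx, numpy.random.random((cnt, dim)), list(range(cnt)))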

@yurymalkov
Member

@vasi-grigo
Ok. I think this can be solved by a warning about the loss of performance in this case.
One problem is that it is not clear from the algorithm's perspective whether the elements are the same (e.g. for inner product space the distance is not zero). Maybe we should post a warning if the distance is exactly zero.

@vasi-grigo
Author

Yep, precisely. How can the algorithm distinguish items if all of them are identical?
An index that contains only identical elements has no value, so why bother building it at all?

@yurymalkov
Member

Well, identical elements can have different labels and, if there are only a few of them, the algorithm can handle it.
The problem is that the insertion method in the library cannot tell whether the elements are the same, because it only gets a distance. For Euclidean and cosine similarity the distance is zero, so, probably, giving a warning for zero distance might solve this. Will check it.
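As a rough illustration of that idea from the Python side (the real warning would have to live inside the C++ insertion code), one could query the current index for each incoming vector and warn when the nearest-neighbour distance is exactly zero. add_with_duplicate_warning below is a hypothetical wrapper, assuming an 'l2' or 'cosine' space where identical vectors really do end up at distance 0:

import warnings
import hnswlib
import numpy

def add_with_duplicate_warning(index, data, ids):
    """Hypothetical wrapper: warn when new items duplicate existing ones."""
    data = numpy.asarray(data, dtype=numpy.float32)
    if index.get_current_count() > 0:
        # A nearest-neighbour distance of 0 means the vector is already in the
        # index under some other label (holds for 'l2' and 'cosine' spaces, but
        # not for 'ip', where the distance between identical vectors is not zero).
        _, distances = index.knn_query(data, k=1)
        duplicates = int((distances[:, 0] == 0.0).sum())
        if duplicates:
            warnings.warn("%d item(s) duplicate existing elements; "
                          "many identical vectors slow insertion down badly" % duplicates)
    index.add_items(data, ids)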
