Indexing is faster with fewer features #156

Open · DoanThu opened this issue Oct 17, 2019 · 9 comments

DoanThu commented Oct 17, 2019

I run this:

import hnswlib
import pandas as pd  # df is a pandas DataFrame loaded elsewhere

dim = df.shape[1]
index = hnswlib.Index(space='l2', dim=dim)  # squared L2 distance
index.init_index(max_elements=len(df), ef_construction=100, M=48)
index.set_ef(10)  # ef is a query-time parameter; it does not affect construction
index.set_num_threads(16)
index.add_items(df.values)
The shape of df is (980432, 188) at first.
With all 188 features, indexing takes ~200 s to finish. However, when I set
df = df.iloc[:, 2:]
so that only 186 features remain, it takes ~620 s to complete indexing.
As far as I can see, the two features I dropped have the value '1' in every row.
Can you please tell me why this happens and help me accelerate the second case?

yurymalkov (Member) commented:

Hi @DoanThu,
I am not very familiar with pandas. Can you convert the data to numpy (e.g. np.ascontiguousarray) and try again?
Also, I might be wrong, but looking at df = df.iloc[:,2] I would assume it reduces the shape to (980432, 1).
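
Something along these lines (an untested sketch; the float32 dtype is my assumption, since hnswlib stores vectors as float32 internally):

import numpy as np

# Convert the DataFrame to a C-contiguous float32 array before indexing,
# so hnswlib does not have to cast/copy the data on every add_items call.
data = np.ascontiguousarray(df.values, dtype=np.float32)
index.add_items(data)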

DoanThu commented Oct 18, 2019

Hi @yurymalkov,
Sorry, it is actually df = df.iloc[:, 2:]. I have edited the post above.
In index.add_items(df.values) I had already converted the df to an array.
As you suggested, I changed it to a numpy array with index.add_items(np.asarray(df.values, dtype=int)), but the results are still the same as before.
The output of np.asarray(df.values, dtype=int) looks like:

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 1, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

Moreover, I have another dataframe with shape (971613, 186), and it takes up to ~1200 s to index.
Can you please help me with this? The execution time is critical for me.

yurymalkov (Member) commented:

@DoanThu Sorry for the late reply.
Are there many duplicates in your dataset?
Can you share a sample (numpy)?

DoanThu commented Oct 22, 2019

@yurymalkov Yes, there are lots of duplicates in my data. After deduplication, 70066 rows remain (for both the (980432, 188) and the (980432, 186) dataframes). The (971613, 186) dataframe keeps ~170k rows after deduplication.
So I guess the number of distinct rows in each dataset is the reason for the indexing speed. Is that correct? Still, why (980432, 188) is indexed faster than (980432, 186) remains unclear.
Below is a sample row of the dataset:

array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
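
For reference, a minimal dedup sketch (np.unique with axis=0; note it also sorts the rows, which should not matter for indexing):

import numpy as np

data = np.ascontiguousarray(df.values, dtype=np.float32)
# Keep one copy of each distinct row.
unique_rows = np.unique(data, axis=0)
print(unique_rows.shape)  # e.g. (70066, 188)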

yurymalkov (Member) commented:

Duplicates can have a big effect on the algorithm's performance. I assume that might be the reason why it behaves so strangely.
Can you check the performance of the indexes with deduped data?
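
A minimal sketch of such a check (build time measured with time.perf_counter; parameters copied from the script above):

import time
import hnswlib

def build_time(data, M=48, ef_construction=100, num_threads=16):
    # Build an index over `data` and return the wall-clock build time in seconds.
    index = hnswlib.Index(space='l2', dim=data.shape[1])
    index.init_index(max_elements=data.shape[0], M=M, ef_construction=ef_construction)
    index.set_num_threads(num_threads)
    start = time.perf_counter()
    index.add_items(data)
    return time.perf_counter() - start

# `unique_rows` is the deduplicated array from the sketch above.
print(build_time(unique_rows))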

DoanThu commented Oct 23, 2019

Hi @yurymalkov,
With deduped data, (980432, 188) turns into (70066, 188) and is indexed in ~4 s.
(980432, 186) becomes (70066, 186) and is indexed in ~7.2 s.
The 186-feature version is still slower, though.

yurymalkov (Member) commented:

@DoanThu That is strange. One thing might be that different distance functions are used for 188 and 186 dimensions (hnswlib selects its SIMD-optimized L2 kernels when the dimensionality is a multiple of 4 or 16, and 188 is a multiple of 4 while 186 is not), but the difference should be much smaller.
If you share the data, I can look for the reasons.
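
If the kernel selection is indeed the cause, one possible workaround (my sketch, not a confirmed fix) is to pad the matrix with zero columns up to a multiple of 16; appending zeros leaves all pairwise L2 distances unchanged:

import numpy as np

def pad_features(data, multiple=16):
    # Pad with zero columns so the dimensionality becomes a multiple of `multiple`.
    n, dim = data.shape
    pad = (-dim) % multiple
    if pad == 0:
        return data
    return np.hstack([data, np.zeros((n, pad), dtype=data.dtype)])

padded = pad_features(unique_rows)  # e.g. (70066, 186) -> (70066, 192)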

DoanThu commented Oct 29, 2019

@yurymalkov
Yes, the data is at this link and can be used with the script above.
Please let me know if you have any problems opening the link.

yurymalkov (Member) commented:

@DoanThu Great! Thanks!
