Indexing is faster with fewer features #156

Open · DoanThu opened this issue Oct 17, 2019 · 9 comments

DoanThu commented Oct 17, 2019

I run this:

import hnswlib
import pandas as pd  # df is a pandas DataFrame loaded elsewhere

dim = df.shape[1]
index = hnswlib.Index(space='l2', dim=dim)  # squared L2 distance
index.init_index(max_elements=len(df), ef_construction=100, M=48)
index.set_ef(10)  # ef is a query-time parameter; it does not affect construction
index.set_num_threads(16)
index.add_items(df.values)
The shape of df is (980432, 188) at first.
With all 188 features, indexing takes ~200 s to finish. However, when I set
df = df.iloc[:, 2:]
so that only 186 features remain, it takes ~620 s to complete indexing.
As far as I can see, the two features I dropped have the value '1' in every row.
Can you please tell me why this happens and help me accelerate the second case?

yurymalkov (Member) commented:

Hi @DoanThu,
I am not very familiar with pandas. Can you convert the data to numpy (e.g. np.ascontiguousarray) and try again?
Also, I might be wrong, but looking at df = df.iloc[:,2] I would assume it reduces the shape to (980432, 1).
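
Something along these lines (an untested sketch; the float32 dtype is my assumption, since hnswlib stores vectors as float32 internally):

import numpy as np

# Convert the DataFrame to a C-contiguous float32 array before indexing,
# so hnswlib does not have to cast/copy the data on every add_items call.
data = np.ascontiguousarray(df.values, dtype=np.float32)
index.add_items(data)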

DoanThu commented Oct 18, 2019

Hi @yurymalkov,
Sorry, it is actually df = df.iloc[:, 2:]. I have edited the post above.
In index.add_items(df.values) I had already converted the df to an array.
As you suggested, I changed it to a numpy array with index.add_items(np.asarray(df.values, dtype=int)), but the results are still the same as before.
The output of np.asarray(df.values, dtype=int) looks like:

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 1, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

Moreover, I have another dataframe with shape (971613, 186), and it takes up to ~1200 s to index.
Can you please help me with this? The execution time is critical for me.

yurymalkov (Member) commented:

@DoanThu Sorry for the late reply.
Are there many duplicates in your dataset?
Can you share a sample (numpy)?

DoanThu commented Oct 22, 2019

@yurymalkov Yes, there are lots of duplicates in my data. After deduplication, 70066 rows remain (for both the (980432, 188) and the (980432, 186) dataframes). The (971613, 186) dataframe keeps ~170k rows after deduplication.
So I guess the number of distinct rows in each dataset is the reason for the indexing speed. Is that correct? Still, why (980432, 188) is indexed faster than (980432, 186) remains unclear.
Below is a sample row of the dataset:

array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
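
For reference, a minimal dedup sketch (np.unique with axis=0; note it also sorts the rows, which should not matter for indexing):

import numpy as np

data = np.ascontiguousarray(df.values, dtype=np.float32)
# Keep one copy of each distinct row.
unique_rows = np.unique(data, axis=0)
print(unique_rows.shape)  # e.g. (70066, 188)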

yurymalkov (Member) commented:

Duplicates can have a big effect on the algorithm's performance. I assume that might be the reason why it behaves so strangely.
Can you check the performance of the indexes with deduped data?
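
A minimal sketch of such a check (build time measured with time.perf_counter; parameters copied from the script above):

import time
import hnswlib

def build_time(data, M=48, ef_construction=100, num_threads=16):
    # Build an index over `data` and return the wall-clock build time in seconds.
    index = hnswlib.Index(space='l2', dim=data.shape[1])
    index.init_index(max_elements=data.shape[0], M=M, ef_construction=ef_construction)
    index.set_num_threads(num_threads)
    start = time.perf_counter()
    index.add_items(data)
    return time.perf_counter() - start

# `unique_rows` is the deduplicated array from the sketch above.
print(build_time(unique_rows))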

DoanThu commented Oct 23, 2019

Hi @yurymalkov,
With deduped data, (980432, 188) turns into (70066, 188) and is indexed in ~4 s.
(980432, 186) becomes (70066, 186) and is indexed in ~7.2 s.
The 186-feature version is still slower, though.

yurymalkov (Member) commented:

@DoanThu That is strange. One thing might be that different distance functions are used for 188 and 186 dimensions (hnswlib selects its SIMD-optimized L2 kernels when the dimensionality is a multiple of 4 or 16, and 188 is a multiple of 4 while 186 is not), but the difference should be much smaller.
If you share the data, I can look for the reasons.
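
If the kernel selection is indeed the cause, one possible workaround (my sketch, not a confirmed fix) is to pad the matrix with zero columns up to a multiple of 16; appending zeros leaves all pairwise L2 distances unchanged:

import numpy as np

def pad_features(data, multiple=16):
    # Pad with zero columns so the dimensionality becomes a multiple of `multiple`.
    n, dim = data.shape
    pad = (-dim) % multiple
    if pad == 0:
        return data
    return np.hstack([data, np.zeros((n, pad), dtype=data.dtype)])

padded = pad_features(unique_rows)  # e.g. (70066, 186) -> (70066, 192)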

DoanThu commented Oct 29, 2019

@yurymalkov
Yes, the data is at this link and can be used with the script above.
Please let me know if you have any problems opening the link.

yurymalkov (Member) commented:

@DoanThu Great! Thanks!
