Indexing is faster with less features #156
Comments
Hi @DoanThu,
Hi @yurymalkov
Moreover, I also have another dataframe with shape (971613, 186) and it takes up to ~1200s to index.
@DoanThu Sorry for the late reply.
@yurymalkov Yes, there are lots of duplicates in my data. After deduplication, only 70066 rows remain (for both the (980432, 188) data and its (980432, 186) variant). The df with shape (971613, 186) has ~170k rows left after deduplication.
Duplicates can have a big effect on the algorithm's performance. I assume that might be the reason why it behaves so strangely.
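For anyone hitting the same problem, here is a minimal sketch of deduplicating rows before indexing, using NumPy only. The helper name `dedup_rows` and the tiny `demo` array are hypothetical and only for illustration; with the data from this thread you would pass `df.values` instead.

```python
import numpy as np

def dedup_rows(data):
    """Return the unique rows and, for each original row, the position of its unique row."""
    data = np.ascontiguousarray(data, dtype=np.float32)
    unique_rows, inverse = np.unique(data, axis=0, return_inverse=True)
    # reshape(-1) keeps the inverse map 1-D regardless of the NumPy version
    return unique_rows, inverse.reshape(-1)

# Synthetic stand-in for df.values: rows 0, 1 and 3 are identical
demo = np.array([[1.0, 2.0], [1.0, 2.0], [3.0, 4.0], [1.0, 2.0]], dtype=np.float32)
unique_rows, inverse = dedup_rows(demo)

# unique_rows is what would be passed to index.add_items(...);
# inverse[i] gives the label of original row i inside the deduplicated index
```

With this approach the index holds only `unique_rows`, and the labels returned by a query refer to positions in `unique_rows`; `inverse` can be used to map those back to the original rows if needed.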
Hi @yurymalkov
@DoanThu That is strange. One thing might be that different distance functions are used for 188 and 186 dimensions, but the difference should be much smaller.
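If the distance function does turn out to matter, one possible workaround is to zero-pad the vectors so the dimension is a multiple of 4 again. This is only a sketch under an assumption: that the library picks a faster distance routine when the dimension is divisible by 4 or 16, which should be checked against the hnswlib version in use. The helper `pad_to_multiple` is a hypothetical name. Zero columns add nothing to squared L2 distances, so the neighbors stay the same.

```python
import numpy as np

def pad_to_multiple(data, multiple=4):
    """Append zero columns so the number of columns is a multiple of `multiple`."""
    data = np.asarray(data, dtype=np.float32)
    extra = (-data.shape[1]) % multiple
    if extra == 0:
        return data
    # Pad only on the right side of the feature axis
    return np.pad(data, ((0, 0), (0, extra)), mode="constant")

# Synthetic stand-in for the 186-feature data
rng = np.random.default_rng(0)
vectors = rng.random((10, 186), dtype=np.float32)
padded = pad_to_multiple(vectors)

# Squared L2 distance between the first two rows, before and after padding
dist_before = float(np.sum((vectors[0] - vectors[1]) ** 2))
dist_after = float(np.sum((padded[0] - padded[1]) ** 2))
```

Queries would need the same padding applied, and the index would be created with `dim=padded.shape[1]`.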
@yurymalkov
@DoanThu Great! Thanks!
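I run this:
import hnswlib
# df is a pandas DataFrame of numeric features
dim = df.shape[1]  # number of features per row
index = hnswlib.Index(space='l2', dim=dim)
index.init_index(ef_construction=100, M=48, max_elements=len(df))
index.set_ef(10)  # query-time ef; does not affect construction
index.set_num_threads(16)
index.add_items(df.values)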
The shape of df is (980432, 188) at first. If I run with 188 features, it takes ~200s to finish. However, when I set
df = df.iloc[:,2:]
which leaves only 186 features, indexing takes ~620s to complete.
As far as I can see, the two features I dropped have the value 1 in every row.
Can you please tell me why this happens and help me speed up the second case?
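As a sanity check on the setup described above, the following sketch (synthetic data, hypothetical names `with_ones` and `pairwise_sq_l2`) shows that columns holding the same value in every row contribute nothing to pairwise squared L2 distances. Dropping them should therefore leave the neighbor structure unchanged, so any timing difference would have to come from how the distances are computed rather than from the distances themselves.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.random((5, 6), dtype=np.float32)  # synthetic stand-in for the real features
# Prepend two all-ones columns, mirroring the two columns removed by df.iloc[:, 2:]
with_ones = np.hstack([np.ones((5, 2), dtype=np.float32), features])

def pairwise_sq_l2(x):
    """Squared L2 distance between every pair of rows."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sum(diff * diff, axis=-1)

# Columns whose max and min coincide are constant
constant_cols = np.flatnonzero(np.ptp(with_ones, axis=0) == 0)

dist_with = pairwise_sq_l2(with_ones)
dist_without = pairwise_sq_l2(with_ones[:, 2:])
```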