-
Notifications
You must be signed in to change notification settings - Fork 655
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update hnswalg.h #324
base: develop
Are you sure you want to change the base?
Update hnswalg.h #324
Conversation
Merge 0.5.2 changes into master
Collect_metrics@searchBaseLayer : We should not increment metric_distance_computation with the size of neighbors, since there may be some neighbors already visited.
Hi @zeraph6. Thanks for the PR! |
Hi Malkov,
I tested it and there's no difference for random dataset, and it may probably be the same for skewed dataset.
Ilias
…________________________________
From: Yury Malkov ***@***.***>
Sent: Sunday, July 11, 2021, 1:46 AM
To: nmslib/hnswlib
Cc: Ilias AZIZI; Mention
Subject: Re: [nmslib/hnswlib] Update hnswalg.h (#324)
CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you recognize the sender and know the content is safe.
Hi @zeraph6<https://github.com/zeraph6>. Thanks for the PR!
I wonder, have you tested if it affects the performance (e.g. due to doing branching on every step)?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<#324 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/ARZVF745JFNXMIARXE2AW7LTXDSXBANCNFSM5AB6ITBQ>.
|
But the difference in the number of distance calculations is important (~5%) |
Got it! I'll test it on small dim data and let you know. |
Sure! Yes, I used the statistics in the C++ interface.
Ilias
…________________________________
From: Yury Malkov ***@***.***>
Sent: Sunday, July 11, 2021 6:53:06 PM
To: nmslib/hnswlib ***@***.***>
Cc: Ilias AZIZI ***@***.***>; Mention ***@***.***>
Subject: Re: [nmslib/hnswlib] Update hnswalg.h (#324)
CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you recognize the sender and know the content is safe.
Got it! I'll test it on small dim data and let you know.
Also, I wonder, were you using the statistics in the C++ interface?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<#324 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/ARZVF75X4HMCQQTRJJGBYDLTXHLAFANCNFSM5AB6ITBQ>.
|
@zeraph6 I've finally tested the change. Sorry for big delay. The change actually leads to a drop of performance for multi-threaded search (up to 2X on low-dim data). This is probably due to more frequent access to a shared variable which causes excessive locking. There are several solutions for that, but the best is probably to accumulate metrics locally and increment them once search. |
I've used this code (python that calls C++ code): import hnswlib
import numpy as np
import pickle
import time
dim = 4
# Generating sample data
for _ in range(10):
print()
for num_elements in[10000,100000,1000000]:
for t in [1,24]:
data = np.float32(np.random.random((num_elements, dim)))
ids = np.arange(num_elements)
# Declaring index
p = hnswlib.Index(space = 'l2', dim = dim) # possible options are l2, cosine or ip
# Initializing index - the maximum number of elements should be known beforehand
p.init_index(max_elements = num_elements*2, ef_construction = 200, M = 16)
# Element insertion (can be called several times):
p.add_items(data)
# Controlling the recall by setting ef:
p.set_ef(50) # ef should always be > k
t0=time.time()
p.set_num_threads(t)
labels, distances = p.knn_query(data, k = 1)
print(f"num_elements: {num_elements:10}, threads:{t:2}, time:{time.time()-t0}")
print() For baseline it gives (on a 3900X):
For tested changes it gives:
Let me know if plan to update the PR (if not I'll do this, but later) |
Well noted! I will update it asap
Ilias
…________________________________
From: Yury Malkov ***@***.***>
Sent: Monday, August 2, 2021 3:24:49 AM
To: nmslib/hnswlib ***@***.***>
Cc: Ilias AZIZI ***@***.***>; Mention ***@***.***>
Subject: Re: [nmslib/hnswlib] Update hnswalg.h (#324)
CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you recognize the sender and know the content is safe.
I've used this code (python that calls C++ code):
import hnswlib
import numpy as np
import pickle
import time
dim = 4
# Generating sample data
for _ in range(10):
print()
for num_elements in[10000,100000,1000000]:
for t in [1,24]:
data = np.float32(np.random.random((num_elements, dim)))
ids = np.arange(num_elements)
# Declaring index
p = hnswlib.Index(space = 'l2', dim = dim) # possible options are l2, cosine or ip
# Initializing index - the maximum number of elements should be known beforehand
p.init_index(max_elements = num_elements*2, ef_construction = 200, M = 16)
# Element insertion (can be called several times):
p.add_items(data)
# Controlling the recall by setting ef:
p.set_ef(50) # ef should always be > k
t0=time.time()
p.set_num_threads(t)
labels, distances = p.knn_query(data, k = 1)
print(f"num_elements: {num_elements:10}, threads:{t:2}, time:{time.time()-t0}")
print()
For baseline it gives (on a 3900X):
num_elements: 10000, threads:24, time:0.02010512351989746
num_elements: 100000, threads: 1, time:1.32535719871521
num_elements: 100000, threads:24, time:0.21545171737670898
num_elements: 1000000, threads: 1, time:21.12733006477356
num_elements: 1000000, threads:24, time:2.674464464187622
For tested changes it gives:
num_elements: 10000, threads:24, time:0.044780731201171875
num_elements: 100000, threads: 1, time:1.1950428485870361
num_elements: 100000, threads:24, time:0.5131497383117676
num_elements: 1000000, threads: 1, time:21.29676127433777
num_elements: 1000000, threads:24, time:5.72478461265564
Let me know if plan to update the PR (if not I'll do this, but later)
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<#324 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/ARZVF7577WYSZO2CR3ED3O3T2X6XDANCNFSM5AB6ITBQ>.
|
Collect_metrics@searchBaseLayer : We should not increment metric_distance_computation with the size of neighbors, since there may be some neighbors already visited.