-
Notifications
You must be signed in to change notification settings - Fork 668
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add a small example #302
base: master
Are you sure you want to change the base?
add a small example #302
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the PR! I would like to highlight a few things
- I suspect the code will run significantly slower due to default comparator (see the comment).
- I do not understand the function/logic of bUpdate. It seems like it breaks the assumption we can get the element by its label.
|
||
tableint cur_c = 0; | ||
{ | ||
// Checking if the element with the same label already exists | ||
// if so, updating it *instead* of creating a new element. | ||
std::unique_lock <std::mutex> templock_curr(cur_element_count_guard_); | ||
auto search = label_lookup_.find(label); | ||
if (search != label_lookup_.end()) { | ||
if (bUpdate && search != label_lookup_.end()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not understand the bUpdate
logic here. It add a new element with the same label? What happens to the first element? How label_lookup_
should operate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From sfit_1b.cpp example, it seems that you sometimes use 'label' as the unique ID of elements (when Building index), sometimes use 'label' as the class of elements (see the usage of the return value of searchKnn). I think 'lable' should mean class. For example, in MNIST its value is one of (0,1,2,3,4,5,6,7,8,9). Then, the reason of use 'bUpdate' is obviously. We don't want to update an existing class every time we add a new point. We only let this happen when we need (e.g at online training). At that time, the data of the closest node will be replaced by the new coming data.
If this is not your original intention, omit this suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But if the meaning of label is not class, then the question is where and how to find the classes of the elements?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the class labels should be stored by the user (in a dict or an array)
From a design point of view, there should be a mechanism for the library user to extract the vectors back/delete/update by using a key. There are no other keys defined in the library so we ended up use labels as keys. To change that there should be a good alternative to support those operations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand, it is hard to add an 'ID' at data_level0_memory_?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the labels are stored in a dict or an array, it still need spend time to retrieve them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is possible, but I think it will break some stuff like compatibility with old saved indices as this label also needs to be stored somewhere.
It also would add complexity to the interface, there would be both id and and class in knn.
While labels for classifications can be done from the user side, both in C++ and python via simple lookups.
@@ -73,7 +73,7 @@ namespace hnswlib { | |||
struct CompareByFirst { | |||
constexpr bool operator()(std::pair<dist_t, tableint> const &a, | |||
std::pair<dist_t, tableint> const &b) const noexcept { | |||
return a.first < b.first; | |||
return a.first > b.first; //let the smaller at the top |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That change seems to change the logic significantly...
As far as I understand the main goal is to omit the CompareByFirst or whatever in the template, but AFAIK that severely affect the performance of the queue operation (I think the default comparator is a bit more complicated, so slower).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see the "std::priority_queue" reference
"
template <class T, class Container = vector, class Compare = less > class priority_queue;
Template parameters
...
Compare
A binary predicate that takes two elements (of type T) as arguments and returns a bool.
The expression comp(a,b), where comp is an object of this type and a and b are elements in the container, shall return true if a is considered to go before b in the strict weak ordering the function defines.
The priority_queue uses this function to maintain the elements sorted in a way that preserves heap properties (i.e., that the element popped is the last according to this strict weak ordering).
This can be a function pointer or a function object, and defaults to less, which returns the same as applying the less-than operator (a<b).
"
I change "return a.first < b.first;" (this should be the default) to "return a.first > b.first;" to sort the smaller distance at the top, so we need not use minus distance. This will not "severely affect the performance".
By the way, at hnswalg.h you define many variables of "std::priority_queue<std::pair<dist_t, tableint>, std::vector<std::pair<dist_t, tableint>>, CompareByFirst>". If this will affect the performance, reduce the usage of such variables will only improve the performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing the sign does not affect the performance, usage of the default comparator does (I can mark the code that does that). Slowness is an empirical fact that I've observed (I probably more lost more than few weeks before I realized that).
searchBaseLayer(tableint ep_id, const void *data_point, int layer) { | ||
VisitedList *vl = visited_list_pool_->getFreeVisitedList(); | ||
vl_type *visited_array = vl->mass; | ||
vl_type visited_array_tag = vl->curV; | ||
|
||
std::priority_queue<std::pair<dist_t, tableint>, std::vector<std::pair<dist_t, tableint>>, CompareByFirst> top_candidates; | ||
std::priority_queue<std::pair<dist_t, tableint>> top_candidates; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
E.g. here removing CompareByFirst
made the code significantly slower.
Thank you for let me know that. May be we can implement a "priority_deque" so we need not copy data frequently.06.04.2021, 11:10, "Yury Malkov" ***@***.***>:
@yurymalkov commented on this pull request.
In hnswlib/hnswalg.h:
@@ -73,7 +73,7 @@ namespace hnswlib {
struct CompareByFirst {
constexpr bool operator()(std::pair<dist_t, tableint> const &a,
std::pair<dist_t, tableint> const &b) const noexcept {
- return a.first < b.first;
+ return a.first > b.first; //let the smaller at the top
Changing the sign does not affect the performance, usage of the default comparator does (I can mark the code that does that). Slowness is an empirical fact that I've observed (I probably more lost more than few weeks before I realized that).
—You are receiving this because you modified the open/close state.Reply to this email directly, view it on GitHub, or unsubscribe.
|
first, copy train-labels.idx1-ubyte, train-images.idx3-ubyte, t10k-labels.idx1-ubyte, t10k-images.idx3-ubyte to the same path of the executable file.
then, run ./mnist