
Adding elements to an index loaded in memory iteratively #79

Open
mehrrsaa opened this issue Dec 6, 2018 · 12 comments

@mehrrsaa commented Dec 6, 2018

Currently, from what I understand of the documentation, we need to reload an index from disk in order to update its maximum number of elements before adding new elements to it.

My question is: what if the index is already loaded in memory, some process is running iteratively, and we want to append the new elements it yields to that already loaded index? At the moment it seems we have to follow this routine: build the index, save it, load it back (increasing max_elements in the load argument), add elements, save, and so on...

This adds load time to the process. Is there a way to do this without saving and loading the index, just appending to an already in-memory index over and over (and saving when we want)?

Thank you in advance for any help on this!

@yurymalkov (Member)

Hi @mehrrsaa,
I am not sure what you mean. There is no easy way to merge two indexes.
What can be done is an automatic extension of the maximum number of elements as the index grows.

@mehrrsaa (Author) commented Dec 7, 2018

Hi @yurymalkov, and thank you for the fast response!
Sorry, maybe I can make it clearer with an example.

Considering the example in the hnswlib docs, this is what is done now: we init p, load an already built index into it, and add new elements to it (which is an awesome capability to have, thank you!):
p = hnswlib.Index(space='l2', dim=dim)
p.load_index("first_half.bin", max_elements = num_elements)
p.add_items(data2)

What I am wondering is whether there is a way to grow the index without saving it and loading it back into memory. If we keep the index in memory indefinitely, then when a new batch of data comes in and exceeds the previously set max_elements limit, we would want to do something like:

p.add_items(data3, new_max_element=num_elements + len(data3))

I hope that is clearer this time. My guess is this can't be done, but I want to make sure.

@yurymalkov (Member)

@mehrrsaa Yes, p.add_items(data3, new_max_element=num_elements + len(data3)) is not available at the moment.
But implementing similar functionality is on the TODO list. It will probably be done within a few weeks.

@mehrrsaa (Author)

Thank you, that would be awesome!

mehrrsaa reopened this Mar 12, 2019
@mehrrsaa (Author)

Hello,

Is there still a plan in place to implement this functionality?

Thank you

@yurymalkov (Member)

Hi @mehrrsaa,
Yes, it is still in the plans. I am too busy right now, sorry...
I will start on it in two weeks.

@mehrrsaa (Author)

Thank you for getting back to me @yurymalkov, I appreciate it!

@yurymalkov (Member)

@mehrrsaa Finally done, as a manual index resize (resize_index). It is now in the develop branch.
Sorry it took so long.
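
A minimal usage sketch (untested; the dimension, capacity, and random data below are only for illustration):

import hnswlib
import numpy as np

dim = 16
num_elements = 1000

# Build an index and fill it to its initial capacity.
p = hnswlib.Index(space='l2', dim=dim)
p.init_index(max_elements=num_elements, ef_construction=200, M=16)
p.add_items(np.random.rand(num_elements, dim))

# Grow the in-memory index in place (no save/load round trip),
# then append a new batch.
new_batch = np.random.rand(500, dim)
p.resize_index(num_elements + len(new_batch))
p.add_items(new_batch)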

@Allenlaobai7 commented Oct 26, 2020

@yurymalkov hi, I have one question following on from the previous discussion:

I want to build an index of 2 million samples, and to avoid memory problems I am reading the data in chunks and adding them to the index one chunk at a time. I set max_elements to 2 million from the start. Currently I am following the example code, saving and loading on every chunk, and it has been working fine:

init = 1
for samples in pd.read_csv(path, chunksize=CHUNK_SIZE):
    index_vemb = hnswlib.Index(space='cosine', dim=args.dim)
    if init == 1:  # first chunk: create a fresh index
        index_vemb.init_index(max_elements=args.vid_cnt, ef_construction=200, M=16)
        init = 0
    else:  # later chunks: load the saved index and append
        index_vemb.load_index(args.model_path)
    index_vemb.add_items(samples['emb'].tolist(), samples['vid'].tolist())
    index_vemb.save_index(args.model_path)
    del index_vemb

I would like to check whether I can skip the saving and loading part, with something like this:

init = 1
for samples in pd.read_csv(path, chunksize=CHUNK_SIZE):
    index_vemb = hnswlib.Index(space='cosine', dim=args.dim)
    if init == 1:  # init only on the first chunk
        index_vemb.init_index(max_elements=args.vid_cnt, ef_construction=200, M=16)
        init = 0
    index_vemb.add_items(samples['emb'].tolist(), samples['vid'].tolist())
index_vemb.save_index(args.model_path)

Thank you!

@yurymalkov (Member)

@Allenlaobai7
I am not sure I fully understand. You do not need to reload the index in order to add elements.
I think something like this should work (though I have not tested the code):

index_vemb = hnswlib.Index(space='cosine', dim=args.dim)
index_vemb.init_index(max_elements=args.vid_cnt, ef_construction=200, M=16)
for samples in pd.read_csv(path, chunksize=CHUNK_SIZE):
    index_vemb.add_items(samples['emb'].tolist(), samples['vid'].tolist())
index_vemb.save_index(args.model_path)

@Allenlaobai7

@yurymalkov Thank you for the quick reply. I followed the code from the README, which is why I implemented the save and load part. It makes sense that we can keep adding items as long as the total number of samples does not exceed max_elements. Let me test it later to make sure it works.

@yurymalkov (Member)

Ok. Thanks for the feedback! I hadn't thought about it...
That code was there to demonstrate that you can add elements after loading the index (i.e. the index is fully dynamic).
Yes, you can safely add elements until the capacity is reached, and when it is reached you can use resize_index to increase it (though probably a more user-friendly way is needed). See the sketch below.
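
A rough sketch of that pattern, reusing the variables from the snippets above (untested; it assumes the get_current_count and get_max_elements helpers from the Python bindings):

# Grow the index on demand whenever a new chunk would exceed its capacity.
index_vemb = hnswlib.Index(space='cosine', dim=args.dim)
index_vemb.init_index(max_elements=args.vid_cnt, ef_construction=200, M=16)
for samples in pd.read_csv(path, chunksize=CHUNK_SIZE):
    embs, vids = samples['emb'].tolist(), samples['vid'].tolist()
    needed = index_vemb.get_current_count() + len(embs)
    if needed > index_vemb.get_max_elements():
        index_vemb.resize_index(needed)
    index_vemb.add_items(embs, vids)
index_vemb.save_index(args.model_path)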
