large dataset #81

Open
sctrueew opened this issue Dec 16, 2018 · 19 comments

Comments

@sctrueew

sctrueew commented Dec 16, 2018

Hello everyone,
I've extracted features from 100M images; each image is represented by a 4096-dimensional vector. I have a machine with 128 GB of RAM, and I'd like to know:
What are the best parameters to use?
Should I split the data into multiple indexes?
Can this method be used at large scale?

@sctrueew changed the title from "Build a large dataset" to "large dataset" on Dec 16, 2018
@yurymalkov
Member

Hi @zpmmehrdad,
It seems the dataset is too large to fit into memory: 100M × 4096 × sizeof(float) ≈ 1.5 TB. The HNSW index without the data requires much less memory.
You can try to compress your data (take a look at https://github.com/facebookresearch/faiss or https://github.com/dbaranchuk/ivf-hnsw). Compression would help if the real dimensionality of the data is not that big, which is very likely to be the case.
Otherwise, you can split the dataset.
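
For reference, the arithmetic behind that estimate can be sketched in a few lines of Python. This only counts the raw float32 vectors; the HNSW graph links add further overhead that is not modelled here. The shard counts and the reduced dimensionalities are hypothetical values chosen for illustration.

```python
def raw_vector_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Bytes needed for the raw vectors alone (float32 by default, no graph links)."""
    return num_vectors * dim * bytes_per_value


GIB = 1024 ** 3
NUM_VECTORS = 100_000_000
DIM = 4096

total = raw_vector_bytes(NUM_VECTORS, DIM)
print(f"full dataset, dim={DIM}: {total / GIB:,.1f} GiB")

# Splitting does not reduce the total; it only bounds what one index holds.
for shards in (10, 100):  # hypothetical shard counts
    per_shard = raw_vector_bytes(NUM_VECTORS // shards, DIM)
    print(f"{shards:>3} shards: {per_shard / GIB:,.1f} GiB per shard")

# Dimensionality reduction shrinks the data itself.
for reduced_dim in (512, 256, 128):  # hypothetical target dimensions
    reduced = raw_vector_bytes(NUM_VECTORS, reduced_dim)
    print(f"dim={reduced_dim:>3}: {reduced / GIB:,.1f} GiB")
```

Comparing the printed figures against the 128 GB of RAM shows which combinations of sharding and compression are even candidates before any index is built.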

@sctrueew
Author

Hi @yurymalkov,
Thank you for the response. If I split my dataset into 10 or 100 indexes, how much RAM is needed for each index?
I've used Annoy before for 10M images, and it used less than 1 GB of RAM.

@yurymalkov
Member

@zpmmehrdad Did you use offline building with Annoy?
If online, Annoy also stores the data, so it should take 150 GB, not 1 GB. If offline, it should be extremely slow.
I think I am missing something.

@sctrueew
Author

@yurymalkov I've built 10 indexes; I search all of them in parallel and then merge the results. I want to use hnswlib for this because it is much faster and more accurate than Annoy.
What do you think I should do for 100M?
Can I use PCA for compression?
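
The search-all-shards-then-merge step described above can be sketched as follows. This is only an illustration of the merge, not code from the thread: the helper name `merge_shard_results` and the example hits are made up, and it assumes each shard returns `(distance, label)` pairs with labels that are unique across shards (for example, by adding a per-shard offset).

```python
import heapq
from typing import Iterable, List, Tuple

Hit = Tuple[float, int]  # (distance, globally unique label)


def merge_shard_results(per_shard_hits: Iterable[List[Hit]], k: int) -> List[Hit]:
    """Return the k nearest hits overall, given each shard's own top-k list."""
    all_hits = (hit for hits in per_shard_hits for hit in hits)
    # Tuples compare by distance first, so nsmallest keeps the closest hits.
    return heapq.nsmallest(k, all_hits)


# Hypothetical top-3 results from three shards for one query.
shard_a = [(0.10, 7), (0.40, 3), (0.90, 5)]
shard_b = [(0.05, 1007), (0.50, 1002), (0.70, 1009)]
shard_c = [(0.20, 2004), (0.30, 2001), (0.95, 2008)]

top3 = merge_shard_results([shard_a, shard_b, shard_c], k=3)
print(top3)
```

Each shard has to return at least k hits for the merged top-k to be complete, since all k true neighbours could live in a single shard.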

@searchivarius
Member

searchivarius commented Dec 17, 2018 via email

@sctrueew
Author

sctrueew commented Dec 17, 2018

Hi @searchivarius, thanks for the reply.
Can you give an example of an L2-autoencoder or PCA in Python?

@searchivarius
Member

searchivarius commented Dec 17, 2018 via email

@sctrueew
Author

Thanks a lot @searchivarius

@sctrueew
Author

sctrueew commented Dec 18, 2018

Hi @searchivarius ,
For example, I tested 100 vectors (each vector is 4096x1) and ran this code:


import random
import hnswlib
import numpy as np
from sklearn.decomposition import PCA

data = np.float32(np.random.random((100, 4096)))
pca = PCA(n_components=0.99)
new_arr = pca.fit_transform(data)
dim = new_arr.shape[1]
p = hnswlib.Index(space='l2', dim=dim)
p.set_ef(10)
p.set_num_threads(8)
p.add_items(new_arr)
p.save_index("test.bin")
labels, distances = p.knn_query(new_arr[0], k=10)

# This part works, but the problem appears when I search with a new query:
query1= np.float32(np.random.random((1, 4096)))
pca = PCA(n_components=0.99)
new_query= pca.fit_transform(query1)
labels, distances = p.knn_query(new_query, k=10)

PCA doesn't work on a single vector
What should I do?

@searchivarius
Member

searchivarius commented Dec 18, 2018 via email

@sctrueew
Author

@searchivarius I didn't call fit_transform the second time

new_query= np.float32(np.random.random((1, 4096)))
labels, distances = p.knn_query(new_query, k=10)

But unfortunately the results are wrong.

@searchivarius
Member

searchivarius commented Dec 19, 2018 via email

@searchivarius
Member

searchivarius commented Dec 19, 2018 via email

@sctrueew
Author

Excuse me @searchivarius, could you please give an example? I'm quite confused.
Thanks.

@yurymalkov
Member

@zpmmehrdad You should try the same procedure (e.g. train PCA on a randomly selected portion of the dataset, then transform the query using the trained PCA), but on real data. The results are wrong because PCA does not work on random data; it can help a lot when the data is correlated across dimensions (which is true for many real datasets).
You should be able to see how the search accuracy degrades as n_components decreases from 4096 to 1. With n_components=4096 there should be no change in accuracy.
You can also try quantization approaches (e.g. the ones in faiss), but they are harder to handle and tune.
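
A minimal sketch of that procedure, under stated assumptions: the data below is synthetic (256 observed dimensions driven by 16 latent factors, standing in for correlated real features), the sizes are small so it runs quickly, and exact brute-force search is used in place of hnswlib so that only the effect of PCA is measured. The key point is that PCA is fitted once on a sample, and the same fitted object transforms both the base vectors and the queries with `transform`, never `fit_transform`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic, correlated stand-in for real features (all sizes are illustrative).
latent_dim, dim, n_base, n_query = 16, 256, 2000, 50
mixing = rng.standard_normal((latent_dim, dim)).astype(np.float32)


def make_vectors(n):
    latent = rng.standard_normal((n, latent_dim)).astype(np.float32)
    noise = 0.01 * rng.standard_normal((n, dim)).astype(np.float32)
    return latent @ mixing + noise


base = make_vectors(n_base)
queries = make_vectors(n_query)


def exact_top_k(data, qs, k):
    """Indices of the k nearest rows of data for each query (squared L2)."""
    data = data.astype(np.float64)
    qs = qs.astype(np.float64)
    d2 = (qs ** 2).sum(axis=1)[:, None] - 2.0 * qs @ data.T + (data ** 2).sum(axis=1)[None, :]
    return np.argsort(d2, axis=1)[:, :k]


def recall_at_k(reduced_dim, k=10):
    # Fit PCA once, on a random sample of the base vectors.
    sample = base[rng.choice(n_base, size=500, replace=False)]
    pca = PCA(n_components=reduced_dim).fit(sample)
    # Reuse the fitted PCA for everything else.
    base_reduced = pca.transform(base)
    queries_reduced = pca.transform(queries)
    truth = exact_top_k(base, queries, k)
    found = exact_top_k(base_reduced, queries_reduced, k)
    hits = sum(len(set(t) & set(f)) for t, f in zip(truth, found))
    return hits / (k * len(queries))


for reduced_dim in (2, 8, 16, 64):
    print(reduced_dim, recall_at_k(reduced_dim))
```

With hnswlib, the reduced base vectors would go into an index created with `dim` equal to the reduced dimensionality, and every query would pass through the same fitted `pca.transform` before `knn_query`.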

@preetim96

preetim96 commented Mar 6, 2019

Hello @yurymalkov and @searchivarius
What changes do I have to make so that the index is created on an SSD and the search is also performed on the on-disk index?

@yurymalkov
Member

@preetim96 Probably, instead of allocating the chunk of memory, you would need to memmap it to the SSD.
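
To illustrate the memory-mapping idea only, here is a small Python sketch using `numpy.memmap`. It does not modify hnswlib, whose storage is allocated inside its C++ code; the file name, sizes, and chunk size are made up for the example.

```python
import os
import tempfile

import numpy as np

n_vectors, dim = 10_000, 128  # small illustrative sizes
path = os.path.join(tempfile.mkdtemp(), "vectors.f32")  # hypothetical file name

# Create a file-backed array and fill it chunk by chunk, so the whole
# dataset never has to be held in RAM at once.
writer = np.memmap(path, dtype=np.float32, mode="w+", shape=(n_vectors, dim))
rng = np.random.default_rng(0)
chunk = 1_000
for start in range(0, n_vectors, chunk):
    writer[start:start + chunk] = rng.random((chunk, dim), dtype=np.float32)
writer.flush()
del writer  # release the writable mapping

# Reopen read-only. Nothing is loaded up front; the OS reads pages from
# disk when the corresponding rows are accessed.
vectors = np.memmap(path, dtype=np.float32, mode="r", shape=(n_vectors, dim))
row = np.asarray(vectors[1234])  # copies just this one row into RAM
print(row.shape, os.path.getsize(path))
```

With a mapping like this there is no separate load step: reads go through the operating system's page cache, so recently touched regions are served from RAM and everything else from the SSD.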

@preetim96

Hello @yurymalkov, thanks for the reply.
Can you provide a code snippet to do this?

@pawanm09

@yurymalkov @searchivarius
When we memmap the index to disk, do we need to load the index into RAM again for searching?
Or can we perform the search on the on-disk index itself?
