large dataset #81
Comments
Hi @zpmmehrdad, |
Hi @yurymalkov, |
@zpmmehrdad Did you use offline building with annoy? |
@yurymalkov I've built 10 indexes; I search all of them in parallel and then merge the results. I want to use hnswlib for this because it is much faster and more accurate than Annoy. |
I think 4096 is too much. Very likely you can get the same accuracy after
reducing dimensionality with an L2-autoencoder or PCA. PCA is simpler,
L2-autoencoder is potentially more accurate.
Kind regards,
Leo (Leonid) Boytsov
On Mon, Dec 17, 2018 at 12:41 PM mehrdad mazhari wrote:
@yurymalkov I've built 10 indexes; I search all of them in parallel and then merge the results. I want to use hnswlib for this because it is much faster and more accurate than Annoy.
What do you think I should do for 100M?
Can I use PCA for compression?
|
Hi @searchivarius, thanks for reply |
Well, you need to use one of the frameworks like TensorFlow or PyTorch.
Then, the simplest thing to do is to have a narrowing fully-connected layer
followed by a widening layer. The loss would be an L2 reconstruction loss. Of course,
there are details (such as which non-linearities to use as activations), but there
are plenty of examples to google for.
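A minimal PyTorch sketch of the narrowing/widening idea described above, offered as an illustration rather than a tuned recipe. The toy data, the layer sizes (64 to 8), the optimizer, and the step count are all assumptions made for this example; the thread's vectors are 4096-dimensional, and non-linearities are left out to keep the sketch short.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy stand-in for real features: 64-dim vectors lying in an 8-dim subspace.
# (Illustrative sizes only; the thread discusses 4096-dim vectors.)
in_dim, code_dim = 64, 8
data = torch.randn(512, code_dim) @ torch.randn(code_dim, in_dim)

encoder = nn.Linear(in_dim, code_dim)   # narrowing fully-connected layer
decoder = nn.Linear(code_dim, in_dim)   # widening layer
model = nn.Sequential(encoder, decoder)

loss_fn = nn.MSELoss()                  # L2 reconstruction loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

losses = []
for step in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(data), data)   # reconstruct the input from its code
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# After training, only the encoder is needed to compress vectors.
with torch.no_grad():
    codes = encoder(data)
```

The compressed `codes` (here 8-dimensional) are what would be indexed instead of the original vectors; queries would have to pass through the same trained `encoder`.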
For PCA:
https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
Crucially, you "fit" the PCA model on some reasonably small subset and then
apply it to a whole data set.
Kind regards,
Leo (Leonid) Boytsov
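The fit-on-a-subset, apply-to-everything pattern from the message above can be sketched with scikit-learn as follows. The synthetic data, the subset size of 500, `n_components=32`, and the batch size are assumptions chosen for illustration; real values depend on the dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic correlated data standing in for real features (assumption):
# 128-dim vectors generated from 10 latent factors plus a little noise.
n_items, dim, latent_dim = 5000, 128, 10
mixing = rng.normal(size=(latent_dim, dim))
data = rng.normal(size=(n_items, latent_dim)) @ mixing
data = (data + 0.05 * rng.normal(size=data.shape)).astype(np.float32)

# 1) Fit the PCA model once, on a reasonably small random subset.
sample_idx = rng.choice(n_items, size=500, replace=False)
pca = PCA(n_components=32, random_state=0)
pca.fit(data[sample_idx])

# 2) Apply the fitted model to the whole data set, batch by batch,
#    so the full data never has to be transformed in one call.
batch_size = 1000
reduced = np.vstack([
    pca.transform(data[start:start + batch_size])
    for start in range(0, n_items, batch_size)
]).astype(np.float32)
```

The `reduced` array is what would be passed to the index in place of the original vectors.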
On Mon, Dec 17, 2018 at 1:08 PM mehrdad mazhari wrote:
Hi @searchivarius, thanks for the reply.
Can you give an example of an L2-autoencoder or PCA in Python?
|
Thanks a lot @searchivarius |
Hi @searchivarius ,
PCA doesn't work on a single vector |
Don't call fit_transform the second time. You do fit or fit_transform only
once on the *training set*. This is when you learn the model. Then you just
*apply* the model by calling transform:
https://scikit-learn.org/stable/data_transforms.html
Kind regards,
Leo (Leonid) Boytsov
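A small sketch of the fit-once, transform-later advice above. The data is synthetic and the nearest-neighbour lookup is a brute-force NumPy scan standing in for the index query, both assumptions made so the example stays self-contained; in the thread's setting the reduced vectors would go to `add_items` and the reduced query to `knn_query`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic correlated training vectors (assumption, for illustration).
dim, latent_dim = 128, 10
mixing = rng.normal(size=(latent_dim, dim))
train = rng.normal(size=(1000, latent_dim)) @ mixing
train = (train + 0.05 * rng.normal(size=train.shape)).astype(np.float32)

# Learn the projection ONCE, on the training set.
pca = PCA(n_components=16, random_state=0)
train_reduced = pca.fit_transform(train)

# A new query: a slightly perturbed copy of training vector 7.
query = train[7:8] + 1e-3 * rng.normal(size=(1, dim)).astype(np.float32)

# Only APPLY the already-fitted model to the query; do not create a new PCA
# object and do not call fit or fit_transform again.
query_reduced = pca.transform(query)

# Brute-force nearest neighbour in the reduced space (stand-in for the index).
dists = np.linalg.norm(train_reduced - query_reduced, axis=1)
nearest = int(np.argmin(dists))
```

Fitting a second PCA on the query alone would define a different projection from the one used to build the index, so the query and the indexed vectors would no longer live in the same space.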
On Tue, Dec 18, 2018 at 3:41 AM mehrdad mazhari wrote:
Hi @searchivarius,
For example, I tested 100 vectors (each vector is 4096x1) and ran this
code:
import random
import hnswlib
import numpy as np
from sklearn.decomposition import PCA
data = np.float32(np.random.random((100, 4096)))
pca = PCA(n_components=0.99)
new_arr = pca.fit_transform(data)
dim = new_arr.shape[1]
p = hnswlib.Index(space='l2', dim=dim)
p.set_ef(10)
p.set_num_threads(8)
p.add_items(new_arr)
p.save_index("test.bin")
labels, distances = p.knn_query(new_arr[0], k=10)
This works without problems, but when I want to search with a new query:
query1= np.float32(np.random.random((1, 4096)))
pca = PCA(n_components=0.99)
new_query= pca.fit_transform(query1)
labels, distances = p.knn_query(new_query, k=10)
PCA doesn't work on a single vector
What should I do?
|
@searchivarius I didn't call fit_transform the second time
But unfortunately the results are wrong. |
new_query= *pca.fit_transform(query1)*
labels, distances = p.knn_query(new_query, k=10)
PCA doesn't work on a single vector
What should I do?
On Wed, Dec 19, 2018, 12:48 AM mehrdad mazhari wrote:
@searchivarius I didn't call fit_transform the second time
new_query= np.float32(np.random.random((1, 4096)))
labels, distances = p.knn_query(new_query, k=10)
But unfortunately the results are wrong.
|
Also, random 4096-dimensional data isn't amenable to k-NN search.
|
Excuse me @searchivarius, could you please give an example? I'm quite confused. |
@zpmmehrdad You should try the same procedure (i.e. train PCA on a randomly selected portion of the dataset, then transform the query using the trained PCA), but on real data. The results are wrong because PCA does not work on random data: it can help a lot when the data is correlated across dimensions, which is true for many real datasets. |
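To illustrate the point about random versus correlated data, here is a small hedged comparison. Both datasets are synthetic (an assumption, since no real features are available in this thread), and `svd_solver="full"` is passed explicitly so that the fractional `n_components=0.99` setting used earlier in the thread is supported.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_samples, dim = 500, 64

# Uniform random data: no correlation between dimensions.
random_data = rng.random((n_samples, dim))

# Correlated data: 5 latent factors mixed into 64 dimensions, plus small noise.
mixing = rng.normal(size=(5, dim))
correlated_data = rng.normal(size=(n_samples, 5)) @ mixing
correlated_data += 0.01 * rng.normal(size=correlated_data.shape)

# Ask PCA for as many components as are needed to keep 99% of the variance.
k_random = PCA(n_components=0.99, svd_solver="full").fit(random_data).n_components_
k_correlated = PCA(n_components=0.99, svd_solver="full").fit(correlated_data).n_components_

print("components kept on random data:    ", k_random)
print("components kept on correlated data:", k_correlated)
```

On the random data nearly every dimension has to be kept, so PCA compresses almost nothing; on the correlated data a handful of components is enough.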
Hello @yurymalkov and @searchivarius |
@preetim96 Probably, instead of allocating the chunk of the memory you would need to memmap it to the SSD. |
Hello @yurymalkov Thanks for reply |
@yurymalkov @searchivarius |
Hello everyone,
I've extracted features from 100M images; each image is represented by a 4096-dimensional vector. I have a machine with 128 GB of RAM, and I would like to know:
What are the best parameters to use?
Should I split the data into multiple indexes?
Can I use this method at large scale?
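As a back-of-the-envelope check on the numbers in this question, the arithmetic below estimates the storage for the raw vectors alone. It assumes 4-byte float32 components and treats 128 GB as 128 GiB; the memory for the graph links of an index is not included, so the real footprint would be larger.

```python
n_vectors = 100_000_000      # 100M images
dim = 4096                   # components per feature vector
bytes_per_float = 4          # float32 (assumption)

# Storage for the raw vectors only, before any index overhead.
raw_bytes = n_vectors * dim * bytes_per_float

# Available RAM, treating "128 GB" as 128 GiB (assumption).
ram_bytes = 128 * 1024**3

# Largest dimensionality whose raw vectors alone would still fit in RAM.
max_dim = ram_bytes // (n_vectors * bytes_per_float)

print(f"raw vectors: {raw_bytes / 1024**3:.0f} GiB")
print(f"RAM:         {ram_bytes / 1024**3:.0f} GiB")
print(f"max dim that fits (vectors only): {max_dim}")
```

The raw vectors exceed the available RAM by more than a factor of ten, which is why the replies in this thread point toward reducing dimensionality first.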