Product Quantization Table (PQTable) is a fast nearest neighbor search method for product-quantized codes via hash tables. PQTable achieves one of the fastest search performances on a single CPU to date (2017) with highly efficient memory usage (0.059 ms per query over 10^9 data points, using only 5.5 GB of memory).
References:
- Project page: http://yusukematsui.me/project/pqtable/pqtable.html
- Y. Matsui, T. Yamasaki, and K. Aizawa, "PQTable: Nonexhaustive Fast Search for Product-Quantized Codes Using Hash Tables", IEEE Transactions on Multimedia 2018. [paper] (extended version of the ICCV paper)
- Y. Matsui, T. Yamasaki, and K. Aizawa, "PQTable: Fast Exact Asymmetric Distance Neighbor Search for Product Quantization using Hash Tables", ICCV 2015. [paper][supplementary]
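The core idea can be sketched as follows. This is a toy illustration of product quantization and the hash-table lookup, not the library's actual implementation; the codebooks below are random stand-ins for the k-means-trained ones.

```python
import numpy as np

# Toy sketch of product quantization (PQ): a D-dim vector is split into
# M subvectors, each quantized to the id of the nearest of K sub-codewords,
# yielding a short M-symbol code. PQTable's key idea is to use this code
# itself as a hash-table key, so a lookup replaces an exhaustive scan.

rng = np.random.default_rng(0)
D, M, K = 8, 4, 16           # vector dim, number of subspaces, codewords per subspace
Ds = D // M                  # dimension of each subvector
codewords = rng.normal(size=(M, K, Ds))   # toy codebooks (normally trained by k-means)

def pq_encode(x):
    """Quantize each subvector to the id of its nearest codeword."""
    code = []
    for m in range(M):
        sub = x[m * Ds:(m + 1) * Ds]
        dists = np.linalg.norm(codewords[m] - sub, axis=1)
        code.append(int(np.argmin(dists)))
    return tuple(code)

# Build a hash table: PQ code -> list of vector ids
database = rng.normal(size=(1000, D))
table = {}
for i, x in enumerate(database):
    table.setdefault(pq_encode(x), []).append(i)

# Query: encode the query and probe the table with the same key
query = database[42]
candidates = table[pq_encode(query)]
assert 42 in candidates
```

The real method additionally handles empty buckets by enumerating keys in increasing order of distance to the query, and splits long codes across multiple tables; see the papers for details.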
Requisites:
- C++11
- CMake
- OpenCV 3.X
- TCMalloc (optional, but strongly recommended). On Ubuntu, you can install TCMalloc with
sudo apt-get install libgoogle-perftools-dev
and then link it via the "-ltcmalloc" linker flag. This makes the search roughly 2x faster.
Build as usual with CMake:
$ git clone https://github.com/matsui528/pqtable.git
$ cd pqtable
$ mkdir build && cd build && cmake ..
$ make
You can try a small demo using the siftsmall data. This does not take long.
First, cd to the top directory of this project, then download the siftsmall vectors into data/.
$ bash scripts/download_siftsmall.sh
Go to the bin directory and run the demo:
$ cd build/bin
$ ./demo_siftsmall
The program trains a product quantizer, encodes vectors, builds a PQTable, and runs the search. The final results will be something like:
...
93th query: nearest_id=6950, dist=49826.5
94th query: nearest_id=4053, dist=59649.2
95th query: nearest_id=7271, dist=76137.4
96th query: nearest_id=7512, dist=71723.1
97th query: nearest_id=7927, dist=68105
98th query: nearest_id=6289, dist=15346.3
99th query: nearest_id=8082, dist=54443.8
0.113859 [msec/query]
Note that these results (the nearest_ids and the distances) may differ slightly from yours because the training step involves randomness.
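The distances printed above are asymmetric distances (ADC): the raw query is compared against PQ codes by precomputing, per subspace, the squared distances from the query subvector to every codeword, and then summing M table lookups per code. A minimal sketch with toy codebooks (again, illustrative, not the library's code):

```python
import numpy as np

# Asymmetric distance computation (ADC): approximate the squared distance
# between a raw query and a PQ-coded vector by M lookups into precomputed
# per-subspace distance tables. Codebooks here are random stand-ins.

rng = np.random.default_rng(1)
D, M, K = 8, 4, 16
Ds = D // M
codewords = rng.normal(size=(M, K, Ds))

def adc_tables(query):
    """Per-subspace lookup tables of squared distances query -> codewords."""
    return np.array([
        ((codewords[m] - query[m * Ds:(m + 1) * Ds]) ** 2).sum(axis=1)
        for m in range(M)
    ])  # shape (M, K)

def adc_distance(tables, code):
    """Approximate squared distance to a coded vector via M lookups."""
    return sum(tables[m][code[m]] for m in range(M))

query = rng.normal(size=D)
tables = adc_tables(query)
code = (3, 0, 7, 12)          # an arbitrary M-symbol PQ code

dist = adc_distance(tables, code)
# ADC equals the exact squared distance to the code's reconstruction:
recon = np.concatenate([codewords[m][code[m]] for m in range(M)])
assert np.isclose(dist, ((query - recon) ** 2).sum())
```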
You can try a large-scale demo using the sift1b data. This demo reproduces the results of Table 3 in the paper.
Let's cd to the top directory of this project and download the sift1b vectors into data/.
$ bash scripts/download_sift1b.sh
This would take several hours. Since the data is large, ~250 GB of disk space is required. Next, go to the bin directory and run the code to train a product quantizer.
$ cd build/bin
$ ./demo_sift1b_train
This also takes several hours. If you do not care about the quality of the product quantizer and just want to try the demo, change top_n in demo_sift1b_train.cpp to a small number such as 10000, and compile it again.
After the training finishes, the trained file codewords.txt is created in build/bin.
Next, encode input vectors into PQ codes.
$ ./demo_sift1b_encode
This would take one or two hours. You will get codes.bin, which is a set of PQ codes of the base vectors in the sift1b data. It takes 4 GB (if M=4).
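The 4 GB figure follows directly from the code layout, assuming the usual PQ setting of K=256 codewords per subspace so that each sub-code fits in one byte (check this against your demo_sift1b_train configuration):

```python
# Back-of-the-envelope size of codes.bin for sift1b, assuming K=256
# codewords per subspace, i.e., one byte per sub-code.
n_vectors = 10 ** 9    # sift1b base vectors
M = 4                  # sub-codes per vector
bytes_per_code = M     # one byte per sub-code when K=256
total_gb = n_vectors * bytes_per_code / 10 ** 9
print(total_gb)        # 4.0
```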
Then, build a PQTable using codes.bin.
$ ./demo_sift1b_build_table
A directory pqtable will be created, which contains the codewords and the table itself. Using these files, you can run the search.
$ ./demo_sift1b_search
You will have a result like this:
top_k: 1
0.0579275 [msec/query]
This is the runtime per query for the sift1b data. You will also get score.txt, which contains the retrieved IDs. You can check the recall rate using an evaluation script.
$ cd ../..
$ python scripts/eval.py build/bin/score.txt data/gnd/idx_1000M.ivecs
This script evaluates the search result against the groundtruth annotation. The result will be:
Recall@1: 0.002
Note that you can get any top-k result by passing an argument to the search program, e.g.,
$ ./demo_sift1b_search 100 # This creates top-100 results on score.txt
$ cd ../..
$ python scripts/eval.py build/bin/score.txt data/gnd/idx_1000M.ivecs
Then you will see Recall@1, 2, 5, ..., 100.
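Recall@k here is the standard measure: the fraction of queries whose true nearest neighbor appears among the top-k returned IDs. A minimal sketch of the computation with made-up data (scripts/eval.py does the real work, reading score.txt and the .ivecs groundtruth):

```python
# Recall@k: fraction of queries whose true nearest neighbor (the first
# groundtruth id) appears among the top-k retrieved ids.

def recall_at_k(results, groundtruth, k):
    hits = sum(1 for res, gt in zip(results, groundtruth) if gt[0] in res[:k])
    return hits / len(results)

results = [[5, 9, 2], [7, 1, 3], [4, 8, 6]]   # top-3 ids per query (toy data)
groundtruth = [[5], [3], [0]]                  # true nearest id per query

print(recall_at_k(results, groundtruth, 1))    # 1/3: only query 0 hits at k=1
print(recall_at_k(results, groundtruth, 3))    # 2/3: queries 0 and 1 hit at k=3
```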