harrisonzhy/sparse-ann

sparse-ann

Introduction

The BigANN challenge encourages the development of indexing data structures and search algorithms for practical variants of the Approximate Nearest Neighbor (ANN), or vector search, problem on commodity hardware.

This project is a work in progress. It targets high-dimensional sparse approximate nearest neighbor search using ray tracing primitives, and it is built into FAISS.
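
To make the problem concrete, here is a minimal sketch of the exact (brute-force) search that an approximate index is usually measured against. It is not the ray tracing method used in this repository. It assumes inner product as the similarity measure and stores each sparse vector as a dict from dimension index to nonzero value; the names `sparse_dot` and `exact_top_k` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Assumption: a sparse vector is a dict {dimension index: nonzero value}.
SparseVector = Dict[int, float]


def sparse_dot(a: SparseVector, b: SparseVector) -> float:
    """Inner product of two sparse vectors, touching only shared nonzeros."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the vector with fewer nonzeros
    return sum(value * b[index] for index, value in a.items() if index in b)


def exact_top_k(
    query: SparseVector, database: List[SparseVector], k: int
) -> List[Tuple[int, float]]:
    """Return (database index, score) for the k highest inner products."""
    scored = [(sparse_dot(query, vec), idx) for idx, vec in enumerate(database)]
    # Highest score first; ties broken by the lower database index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(idx, score) for score, idx in scored[:k]]


# Toy data: three vectors in a ~30,000-dimensional space, mostly zeros.
database = [
    {0: 1.0, 7: 2.0},
    {3: 0.5, 7: 1.0, 29999: 4.0},
    {1: 3.0},
]
query = {7: 1.0, 29999: 0.5}

print(exact_top_k(query, database, k=2))
```

The brute-force scan costs one sparse dot product per database vector, which is the cost an ANN index tries to avoid while returning nearly the same top-k list.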

Quick Start

Edit env.sh so that its paths point to the libraries on your system, then source it:

$ source env.sh

Datasets go in the data directory. Create it:

$ mkdir data

Unpack the siftsmall dataset into data (see More Datasets below for the source):

$ cd data
$ tar -xzvf siftsmall.tar.gz
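
The TEXMEX archives store vectors in the .fvecs layout: each record is a 4-byte little-endian integer giving the dimension d, followed by d float32 components. The sketch below reads such a file using only the Python standard library. The function name `read_fvecs` is hypothetical, and the demo round-trips a tiny temporary file instead of assuming where the archive was unpacked.

```python
import os
import struct
import tempfile
from typing import List


def read_fvecs(path: str) -> List[List[float]]:
    """Read every vector from an .fvecs file into a list of float lists."""
    vectors = []
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if not header:
                break  # clean end of file
            if len(header) != 4:
                raise ValueError("truncated dimension header")
            (dim,) = struct.unpack("<i", header)
            payload = f.read(4 * dim)
            if len(payload) != 4 * dim:
                raise ValueError("truncated vector payload")
            vectors.append(list(struct.unpack(f"<{dim}f", payload)))
    return vectors


# Demo: write two 3-dimensional vectors in .fvecs layout and read them back.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "tiny.fvecs")
    with open(demo_path, "wb") as f:
        for vec in ([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]):
            f.write(struct.pack("<i", len(vec)))
            f.write(struct.pack(f"<{len(vec)}f", *vec))
    print(read_fvecs(demo_path))
```

On the unpacked dataset, a call such as `read_fvecs("siftsmall/siftsmall_base.fvecs")` should work, assuming the archive unpacks to that path; per the More Datasets table, the base set should hold 10,000 vectors of 128 dimensions.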

To run the sparse ray tracing implementation, return to the repository root and then:

$ cd bin
$ sh ./run.sh

To build and run the other implementations (see bin for the available scripts):

$ make all
$ cd bin
$ sh ./run_<impl-name>.sh

For a larger dataset, download and unpack SIFT 1M:

$ cd data
$ wget ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz
$ tar -xzvf sift.tar.gz
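
A quick way to confirm that a large download unpacked correctly is to derive the vector count from the file size without loading the data. This is a sketch under the assumption that the file uses the .fvecs layout (a 4-byte dimension header followed by d float32 values per record); `fvecs_count` is a hypothetical helper name, and the demo uses a small temporary file.

```python
import os
import struct
import tempfile
from typing import Tuple


def fvecs_count(path: str) -> Tuple[int, int]:
    """Return (number of vectors, dimension) from the file size alone."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        header = f.read(4)
    if len(header) != 4:
        raise ValueError("file too short to contain a dimension header")
    (dim,) = struct.unpack("<i", header)
    record_bytes = 4 + 4 * dim  # header plus dim float32 values
    if size % record_bytes != 0:
        raise ValueError("file size is not a whole number of records")
    return size // record_bytes, dim


# Demo: three all-zero vectors of dimension 4.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.fvecs")
    with open(demo_path, "wb") as f:
        for _ in range(3):
            f.write(struct.pack("<i", 4))
            f.write(struct.pack("<4f", 0.0, 0.0, 0.0, 0.0))
    print(fvecs_count(demo_path))
```

For the SIFT 1M base file (assumed here to unpack as `sift/sift_base.fvecs`), the result should match the 1,000,000 vectors and 128 dimensions listed under More Datasets.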

References

  1. An Approximate Algorithm for Maximum Inner Product Search over Streaming Sparse Vectors [Paper]

  2. Billion-scale Similarity Search with GPUs [Paper] [Code]

  3. DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node [Paper] [Video] [Slides 1] [Slides 2]

  4. Product Quantization for Nearest Neighbor Search [Paper]

  5. Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs [Paper]

  6. Worst-case Performance of Popular Approximate Nearest Neighbor Search Implementations: Guarantees and Limitations [Paper]

  7. big-ann-benchmarks

  8. ann-benchmarks

More Datasets

| Name | Link | # Datapoints | Dimensions | Format |
| --- | --- | --- | --- | --- |
| DEEP 1M | https://github.com/johnpzh/iQAN_AE/blob/master/scripts/get.deep1m.sh | 1,000,000 | 96 | float32 |
| DEEP 10M | https://github.com/johnpzh/iQAN_AE/blob/master/scripts/get.deep10m.sh | 10,000,000 | 96 | float32 |
| DEEP 100M | https://github.com/johnpzh/iQAN_AE/blob/master/scripts/get.deep100m.sh | 100,000,000 | 96 | float32 |
| DEEP 1B | https://www.tensorflow.org/datasets/catalog/deep1b | 1,000,000,000 | 96 | float32 |
| SIFT small | http://corpus-texmex.irisa.fr | 10,000 | 128 | float32 |
| SIFT 1M | http://corpus-texmex.irisa.fr | 1,000,000 | 128 | float32 |
| SIFT 100M | https://github.com/johnpzh/iQAN_AE/blob/master/scripts/get.sift100m.sh | 100,000,000 | 128 | float32 |
| SIFT 1B | http://corpus-texmex.irisa.fr | 1,000,000,000 | 128 | uint8 |
| GIST1M | http://corpus-texmex.irisa.fr | 1,000,000 | 960 | float32 |
| YFCC 10M | | 10,000,000 | 192 | uint8 |
| YFCC 100M | https://multimediacommons.wordpress.com/yfcc100m-core-dataset/ | 99,200,000 | 192 | uint8 |
| Yandex T2I 1B | https://research.yandex.com/blog/benchmarks-for-billion-scale-similarity-search | 1,000,000,000 | 200 | float32 |
| MS MARCO | https://microsoft.github.io/msmarco/ | 8,841,823 | ~30,000 | float32 |
| MS SPACEV 1B | https://github.com/microsoft/SPTAG/tree/main/datasets/SPACEV1B | 1,402,020,720 | 100 | int8 |
| MS Turing 30M | | 30,000,000 | 100 | float32 |
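
When choosing a dataset, it helps to estimate the raw size of the vectors before downloading. The sketch below multiplies count, dimensions, and bytes per component, assuming dense storage with no per-vector headers or index overhead. It therefore does not apply to sparse collections such as MS MARCO, where only the nonzero entries are stored. The helper name `raw_size_bytes` is hypothetical.

```python
# Bytes per component for the dense formats listed in the table.
BYTES_PER_ELEMENT = {"float32": 4, "uint8": 1}


def raw_size_bytes(num_vectors: int, dimensions: int, fmt: str) -> int:
    """Size of the raw dense vector data, ignoring headers and indexes."""
    return num_vectors * dimensions * BYTES_PER_ELEMENT[fmt]


# A few rows taken from the table above.
examples = [
    ("SIFT 1M", 1_000_000, 128, "float32"),
    ("SIFT 1B", 1_000_000_000, 128, "uint8"),
    ("DEEP 1B", 1_000_000_000, 96, "float32"),
]

for name, count, dims, fmt in examples:
    gigabytes = raw_size_bytes(count, dims, fmt) / 1e9
    print(f"{name}: {gigabytes:.1f} GB")
```

By this estimate SIFT 1M fits comfortably in memory on a laptop, while the billion-scale sets need on the order of hundreds of gigabytes before any index is built.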
