
feat: Add new dataset with OpenAI embeddings for 1M DBpedia entities #434

Merged
5 commits merged on Jul 5, 2023

Conversation

KShivendu (Contributor) commented Jul 2, 2023

Add a new dataset with OpenAI embeddings for DBpedia entities.

This PR also introduces the Hugging Face datasets library, which can load any dataset hosted on Hugging Face. That makes it easy for anyone to fork this repo and benchmark against almost any public dataset :)
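As a minimal sketch of what that loading path could look like (the dataset ID and the "openai" column name below are assumptions for illustration, not taken from this PR):

import numpy as np
from datasets import load_dataset

# Load precomputed OpenAI embeddings from the Hugging Face Hub
# (dataset ID and column name are assumed for this example).
dataset = load_dataset("KShivendu/dbpedia-entities-openai-1M", split="train")
embeddings = np.asarray(dataset["openai"], dtype=np.float32)  # expected shape: (1_000_000, 1536)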

KShivendu changed the title from "feat: Add new dataset with OpenAI embeddings for DBpedia entities" to "feat: Add new dataset with 1M OpenAI embeddings for DBpedia entities" on Jul 2, 2023
KShivendu changed the title from "feat: Add new dataset with 1M OpenAI embeddings for DBpedia entities" to "feat: Add new dataset with OpenAI embeddings for 1M DBpedia entities" on Jul 2, 2023
TEST_SIZE = 10_000

X_train = embeddings[TEST_SIZE:]  # Use the remaining vectors as the train set
X_test = embeddings[:TEST_SIZE]   # Take the first 10k as the test set
erikbern (Owner) commented on the diff:
minor preference for using sklearn's train_test_split but not critical
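A short sketch of that suggestion, assuming embeddings is the full array as in the diff above (note that train_test_split shuffles rows by default, whereas the original code slices the first 10k rows as the test set):

from sklearn.model_selection import train_test_split

TEST_SIZE = 10_000
# Shuffled split into train and 10k test vectors; pass shuffle=False for a
# sequential split instead (test taken from the end rather than the start).
X_train, X_test = train_test_split(embeddings, test_size=TEST_SIZE, random_state=42)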

KShivendu (Contributor, Author) replied:

Done

erikbern (Owner) commented Jul 2, 2023

Nice! Are there bigger ones, btw? We already have a few datasets that are around 1M vectors, so it might be interesting to try something larger (like 3-10M)

KShivendu (Contributor, Author) commented Jul 2, 2023

Are there bigger ones btw?

While working on this benchmark, we didn't find any dataset with >=1536 dimensions, so I created one. We are planning to take this up to 10M or even 100M scale in the upcoming weeks/months, and I'll open PRs here when we do :)

In the meantime, please note that this 1M dbpedia-entities dataset takes a lot of compute, RAM, and time to run because of its 1536 dimensions: roughly ~17 GB of RAM with Qdrant and ~13 GB with PGVector.
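For a rough, back-of-the-envelope sense of those figures (assuming float32 vectors, ignoring payloads):

# Raw vector storage alone, before any index/graph overhead:
n_vectors, dims, bytes_per_float = 1_000_000, 1536, 4  # float32
raw_gib = n_vectors * dims * bytes_per_float / 2**30
print(f"{raw_gib:.1f} GiB")  # ~5.7 GiB of raw vectors; the rest of the quoted RAM goes to index structures and overhead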

Also, I'll make the changes you suggested. Thanks for your quick response :D

erikbern (Owner) commented Jul 5, 2023

Nice, thanks! I'll run this and add the dataset to S3
