
feat: Add new dataset with OpenAI embeddings for 1M DBpedia entities #434

Merged
5 commits merged on Jul 5, 2023

Conversation

KShivendu (Contributor) commented Jul 2, 2023

Add a new dataset with OpenAI embeddings for DBpedia entities.

This PR also introduces the Hugging Face datasets library, which can load any dataset hosted on Hugging Face. That makes it easy for anyone to fork this repo and benchmark against almost any public dataset :)
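As a minimal sketch of what that loading path could look like (the dataset ID and the "openai" column name below are assumptions for illustration, not taken from this PR):

import numpy as np
from datasets import load_dataset

# Load precomputed OpenAI embeddings from the Hugging Face Hub
# (dataset ID and column name are assumed for this example).
dataset = load_dataset("KShivendu/dbpedia-entities-openai-1M", split="train")
embeddings = np.asarray(dataset["openai"], dtype=np.float32)  # expected shape: (1_000_000, 1536)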

KShivendu changed the title from "feat: Add new dataset with OpenAI embeddings for DBpedia entities" to "feat: Add new dataset with 1M OpenAI embeddings for DBpedia entities" on Jul 2, 2023
KShivendu changed the title from "feat: Add new dataset with 1M OpenAI embeddings for DBpedia entities" to "feat: Add new dataset with OpenAI embeddings for 1M DBpedia entities" on Jul 2, 2023
TEST_SIZE = 10_000

X_train = embeddings[TEST_SIZE:]  # Use the remaining vectors as the train set
X_test = embeddings[:TEST_SIZE]   # Take the first 10k as the test set
erikbern (Owner) commented on the diff:
minor preference for using sklearn's train_test_split but not critical
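A short sketch of that suggestion, assuming embeddings is the full array as in the diff above (note that train_test_split shuffles rows by default, whereas the original code slices the first 10k rows as the test set):

from sklearn.model_selection import train_test_split

TEST_SIZE = 10_000
# Shuffled split into train and 10k test vectors; pass shuffle=False for a
# sequential split instead (test taken from the end rather than the start).
X_train, X_test = train_test_split(embeddings, test_size=TEST_SIZE, random_state=42)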

KShivendu (Contributor, Author) replied:

Done

erikbern (Owner) commented Jul 2, 2023

Nice! Are there bigger ones, btw? We already have a few datasets that are around 1M vectors, so it might be interesting to try something larger (like 3-10M)

KShivendu (Contributor, Author) commented Jul 2, 2023

Are there bigger ones btw?

While working on this benchmark, we didn't find any dataset with >=1536 dimensions, so I created one. We are planning to take this up to 10M or even 100M scale in the upcoming weeks/months, and I'll open PRs here when we do :)

In the meantime, please note that this 1M dbpedia-entities dataset takes a lot of compute, RAM, and time to run because of its 1536 dimensions: roughly ~17 GB of RAM with Qdrant and ~13 GB with PGVector.
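For a rough, back-of-the-envelope sense of those figures (assuming float32 vectors, ignoring payloads):

# Raw vector storage alone, before any index/graph overhead:
n_vectors, dims, bytes_per_float = 1_000_000, 1536, 4  # float32
raw_gib = n_vectors * dims * bytes_per_float / 2**30
print(f"{raw_gib:.1f} GiB")  # ~5.7 GiB of raw vectors; the rest of the quoted RAM goes to index structures and overhead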

Also, I'll make the changes you suggested. Thanks for your quick response :D

erikbern (Owner) commented Jul 5, 2023

Nice, thanks! I'll run this and add the dataset to S3
