feat: Add new dataset with OpenAI embeddings for 1M DBpedia entities #434
Conversation
ann_benchmarks/datasets.py
TEST_SIZE = 10_000

X_train = embeddings[TEST_SIZE:]
X_test = embeddings[:TEST_SIZE]  # Take the first 10k as test set
minor preference for using sklearn's train_test_split, but not critical
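For illustration, a minimal sketch of what the suggested sklearn-based split might look like (the helper name and seed are illustrative, not taken from the PR; note that, unlike the manual slicing above, train_test_split shuffles by default):

import numpy as np
from sklearn.model_selection import train_test_split

TEST_SIZE = 10_000

def split_embeddings(embeddings: np.ndarray, seed: int = 42):
    # Shuffled but reproducible split; the manual slicing above instead
    # takes the first TEST_SIZE rows as an unshuffled test set.
    X_train, X_test = train_test_split(
        embeddings, test_size=TEST_SIZE, random_state=seed
    )
    return X_train, X_test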
Done
Nice! Are there bigger ones, btw? We have a few datasets already that are around 1M vectors, so it might be interesting to try something larger (like 3-10M).
While working on this benchmark we couldn't find any existing dataset with >= 1536 dimensions, which is why I created one. We are planning to take this up to a 10M or even 100M scale in the upcoming weeks/months, and I'll create PRs here when we do :) In the meantime, please note that this 1M Also, I'll make the changes you suggested. Thanks for your quick response :D
Nice, thanks! I'll run this and will add the dataset to S3.
Add a new dataset with OpenAI embeddings for DBpedia entities.
This PR also introduces the HuggingFace datasets library, which helps with loading any dataset hosted on HuggingFace. This makes it easier for anyone to fork the repo and benchmark against almost any public dataset :)
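As a rough sketch, assuming the embeddings are published on the HuggingFace Hub under an id like the one below with a 1536-dimensional embedding column (both the dataset id and the column name are assumptions, not confirmed by the PR), loading them could look like this:

import numpy as np
from datasets import load_dataset

def load_dbpedia_openai_embeddings(
    name: str = "KShivendu/dbpedia-entities-openai-1M",  # assumed Hub id
    column: str = "openai",  # assumed name of the embedding column
):
    # Download the dataset from the HuggingFace Hub and stack the per-row
    # embedding lists into a single (n, 1536) float32 matrix.
    ds = load_dataset(name, split="train")
    return np.array(ds[column], dtype=np.float32)

Adjust the dataset id and column name to whatever the published dataset actually uses.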