[easy] allow multithreading in load_dataset #301

Merged (1 commit), Jun 17, 2023


okuvshynov
Contributor

Both loading from the network and generating the train split can benefit from multithreading (especially on larger instances or faster networks).

On a Lambda instance with 30 CPUs and a single A100, it brings the time to get the data down from ~35 min (28 threads for tokenizing, 1 thread for load_dataset) to ~8 min (28 threads for both).

CPU utilization looks like this:

28-thread load_dataset (finished):
[image: nanogpt_dataload_mt]

1-thread load_dataset (still running):
[image: nanogpt_dataload_1T]
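
For readers who want to try the change described above, here is a minimal sketch rather than the PR's actual diff. It assumes the Hugging Face `datasets` library, whose `load_dataset` accepts a `num_proc` argument. The dataset name `"openwebtext"` and the helpers `pick_num_proc` and `load_openwebtext` are hypothetical and used only for illustration. Leaving two cores free mirrors the 28-of-30 split mentioned above.

```python
import os
from typing import Optional


def pick_num_proc(reserve: int = 2, cpu_count: Optional[int] = None) -> int:
    """Hypothetical helper: leave `reserve` cores free, never go below 1 worker."""
    total = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(1, total - reserve)


def load_openwebtext(num_proc_load_dataset: int):
    """Hypothetical wrapper; the dataset name is an assumption, not from the PR."""
    # Imported lazily so pick_num_proc() is usable without `datasets` installed.
    from datasets import load_dataset

    # num_proc sets how many worker processes download and prepare the data.
    return load_dataset("openwebtext", num_proc=num_proc_load_dataset)
```

Usage would look like `dataset = load_openwebtext(pick_num_proc())`. Passing `1` falls back to the single-process behaviour.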

karpathy merged commit 41d7014 into karpathy:master Jun 17, 2023
@karpathy
Owner

doh I didn't know about this, thank you!!

@okuvshynov
Contributor Author

looks like we might have to revert this :( While it works well on the Linux machines I tried it on, when I tried to download the data on a MacBook it hung somewhere in the library (unless I set num_proc=1).

@okuvshynov
Contributor Author

another option is to set num_proc_load_dataset=1 so that it works by default, and mention somewhere that it can be set higher.

@okuvshynov
Contributor Author

hopefully no need to revert.

542ac51 -- the issue was supposedly that the multiprocessing module implementation on that platform cannot find the entry point unless you define 'main'. I'll test it and create a new PR.
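
A minimal, standard-library sketch of the 'main' guard mentioned above; it is not the content of commit 542ac51. One plausible explanation, stated here as an assumption, is the "spawn" start method. It has been the macOS default since Python 3.8 and makes each worker re-import the launching script. The function `token_count` and the sample strings are hypothetical stand-ins for the real per-example work.

```python
import multiprocessing as mp


def token_count(text: str) -> int:
    """Hypothetical stand-in for per-example work such as tokenizing."""
    return len(text.split())


def main() -> None:
    docs = ["hello world", "the quick brown fox", "nanoGPT"]
    # Request "spawn" explicitly so Linux behaves like macOS for this demo.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=2) as pool:
        counts = pool.map(token_count, docs)
    print(counts)


# With "spawn", every worker re-imports this file. The guard keeps that
# import from calling main() again inside each worker, so workers only
# pick up the function definitions they need.
if __name__ == "__main__":
    main()
```

The same pattern would apply to a data preparation script: move the `load_dataset(..., num_proc=...)` call and everything after it under the guard, so that importing the file has no side effects.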

carmocca added a commit to Lightning-AI/litgpt that referenced this pull request Jun 21, 2023
ken-viper added a commit to ken-viper/litgpt that referenced this pull request Sep 5, 2024
gkielian pushed a commit to gkielian/ReaLLMASIC_nanogpt that referenced this pull request Sep 5, 2024
[easy] allow multithreading in load_dataset
gkielian added a commit to gkielian/ReaLLMASIC_nanogpt that referenced this pull request Dec 5, 2024