Commit

Merge pull request karpathy#301 from okuvshynov/master
[easy] allow multithreading in load_dataset
karpathy authored Jun 17, 2023
2 parents 7339b90 + bb7e967 commit 41d7014
Showing 1 changed file with 6 additions and 1 deletion.
7 changes: 6 additions & 1 deletion data/openwebtext/prepare.py
@@ -11,8 +11,13 @@
# good number to use is ~order number of cpu cores // 2
num_proc = 8

+# number of workers in load_dataset() call
+# best number might be different from num_proc above as it also depends on NW speed.
+# it is better than 1 usually though
+num_proc_load_dataset = num_proc
+
# takes 54GB in huggingface .cache dir, about 8M documents (8,013,769)
-dataset = load_dataset("openwebtext")
+dataset = load_dataset("openwebtext", num_proc=num_proc_load_dataset)

# owt by default only contains the 'train' split, so create a test split
split_dataset = dataset["train"].train_test_split(test_size=0.0005, seed=2357, shuffle=True)
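
The comments in this hunk suggest sizing the worker count at roughly half the CPU cores and keeping a separate knob for the `load_dataset()` call. A minimal sketch of how that might be derived at runtime rather than hard-coded is below. The helper `default_num_proc` and the wrapper `load_openwebtext` are hypothetical names introduced for illustration and are not part of this commit; the only library call assumed is `load_dataset(..., num_proc=...)`, as used in the diff.

```python
import os


def default_num_proc(cpu_count=None):
    """Hypothetical helper: roughly half the CPU cores, never fewer than 1."""
    if cpu_count is None:
        # os.cpu_count() can return None when the count is undetermined
        cpu_count = os.cpu_count() or 1
    return max(1, cpu_count // 2)


# worker count for CPU-bound steps such as tokenization
num_proc = default_num_proc()

# separate knob for load_dataset(), since the best value also depends on
# network speed; defaults to the same value, as in the commit
num_proc_load_dataset = num_proc


def load_openwebtext():
    """Hypothetical wrapper; needs the `datasets` package and a large download."""
    # imported lazily so the helper above works without `datasets` installed
    from datasets import load_dataset

    return load_dataset("openwebtext", num_proc=num_proc_load_dataset)
```

Calling `load_openwebtext()` would trigger the full download, so the two settings are kept as plain module-level values that can be overridden before the call if the network turns out to be the bottleneck.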
