This repository is an attempt to pre-train a GPT-2 model from scratch and to explore how to scale up the training process on TPUs via PyTorch/XLA.
Start by cloning the repository:
git clone https://github.com/minhnguyent546/pre-training-gpt2.git
cd pre-training-gpt2
The next step is to set up the environment, for example, using Python venv:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
# or if you want to train on CUDA with torch_xla
pip install -r requirements-xla_cuda.txt
# or on TPU with torch_xla
pip install -r requirements-xla_tpu.txt
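After installing the requirements, you can quickly check which backend is visible. This is just an optional sanity check, not something the repository requires:

```python
# Optional sanity check: confirm torch (and torch_xla, if installed) can see a device.
import torch

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())

try:
    import torch_xla.core.xla_model as xm
    print("XLA device:", xm.xla_device())  # e.g. xla:0 on a TPU VM
except ImportError:
    print("torch_xla not installed; use run_pretrain.py on CPU/CUDA")
```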
Run the following script to download and tokenize the FineWeb or FineWeb-Edu dataset (10BT split):
python scripts/prepare_fineweb.py \
--seed 1061109567 \
--tokenizer gpt2 \
--shard-size 100000000 \
--output-dir ./fineweb_10BT \
--num-workers 4
You will need approximately 80 GiB of disk space to download and process the dataset, plus 20 GiB for the token files, so roughly 100 GiB in total.
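To sanity-check the prepared data you can peek at a shard. The exact on-disk layout (file names, dtype) is defined by scripts/prepare_fineweb.py, so the snippet below, which assumes flat arrays of uint16 GPT-2 token ids and a hypothetical shard name, is only a sketch:

```python
# Hedged sketch: inspect one tokenized shard, assuming it is a flat uint16 array of
# GPT-2 token ids saved with NumPy; the shard file name below is hypothetical.
import numpy as np
import tiktoken  # used here for illustration; the script's --tokenizer gpt2 should be equivalent

tokens = np.load("./fineweb_10BT/train/shard_00000.npy")
print(f"{tokens.dtype}, {len(tokens):,} tokens in this shard")

enc = tiktoken.get_encoding("gpt2")
print(enc.decode(tokens[:64].tolist()))  # decode the first few tokens back to text
```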
To train the model, you can run:
python run_pretrain.py \
--checkpoints-dir ./checkpoints \
--seed 1061109567 \
--train-dir ./fineweb_10BT/train \
--valid-dir ./fineweb_10BT/test \
--vocab-size 50304 \
--seq-length 1024 \
--d-model 768 \
--num-layers 12 \
--num-heads 12 \
--d-ff 3072 \
--activation gelu \
--tie-weights \
--wandb-logging \
--wandb-project pre-training-gpt2 \
--wandb-name gpt2-124m \
--wandb-logging-interval 100 \
--optim-type adamw \
--learning-rate 0.0006 \
--betas 0.9 0.95 \
--weight-decay 0.1 \
--decay-method cosine \
--warmup-steps 750 \
--min-lr 0.00006 \
--decay-steps 20000 \
--train-batch-size 256 \
--eval-batch-size 256 \
--gradient-accum-step 1 \
--mixed-precision bf16 \
--train-steps 20000 \
--valid-steps 100 \
--valid-interval 500 \
--save-interval 1000 \
--saved-checkpoint-limit 6 \
--max-grad-norm 1.0
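For reference, the learning-rate flags above describe a linear warmup followed by a cosine decay down to --min-lr. A small sketch of that schedule (the repository's own scheduler may differ in minor details):

```python
# Sketch of the warmup + cosine schedule implied by the flags above:
# linear warmup for --warmup-steps, then cosine decay from --learning-rate to --min-lr
# over --decay-steps. Illustrative only.
import math

def get_lr(step: int, max_lr: float = 6e-4, min_lr: float = 6e-5,
           warmup_steps: int = 750, decay_steps: int = 20_000) -> float:
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps            # linear warmup
    if step >= decay_steps:
        return min_lr                                        # flat after the decay window
    decay_ratio = (step - warmup_steps) / (decay_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))    # goes from 1 to 0 over the decay window
    return min_lr + coeff * (max_lr - min_lr)

print(get_lr(0), get_lr(750), get_lr(20_000))                # ramps up to 6e-4, ends at 6e-5
```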
If you have a node with, say, 2 GPUs, launch the same command through torchrun:
torchrun --standalone --nproc_per_node=2 run_pretrain.py
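Under torchrun, each process trains on its own slice of the data and DistributedDataParallel synchronizes the gradients. A minimal, illustrative skeleton of that pattern (a Linear layer stands in for the GPT-2 model; this is not the repository's actual code):

```python
# Minimal DDP skeleton of what a torchrun-launched training script does (illustrative only).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")           # torchrun provides RANK / WORLD_SIZE / MASTER_ADDR
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(768, 768).cuda()          # stand-in for the GPT-2 model
model = DDP(model, device_ids=[local_rank])
# ... training loop: each rank reads a different data shard; DDP all-reduces the gradients

dist.destroy_process_group()
```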
Alternatively, run pre-training on XLA devices via torch_xla, e.g. on a TPU (tested on a TPUv3-8). If you encounter errors on Colab's TPUv2-8 or Kaggle's TPUv3-8, you might want to check out this notebook to change some environment variables that can cause errors:
PJRT_DEVICE=TPU python run_pretrain_xla.py \
--mp-start-method spawn \
--use-syncfree-optim
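The --mp-start-method and --use-syncfree-optim flags correspond to standard torch_xla patterns: spawning one process per TPU core and using one of torch_xla's sync-free optimizers. A hedged sketch of that pattern (a Linear layer stands in for the GPT-2 model; this is not the repository's exact code):

```python
# Illustrative torch_xla multiprocessing pattern behind run_pretrain_xla.py.
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp
from torch_xla.amp import syncfree

def _mp_fn(index):
    device = xm.xla_device()                      # one XLA device (TPU core) per process
    model = torch.nn.Linear(768, 768).to(device)  # stand-in for the GPT-2 model
    optimizer = syncfree.AdamW(model.parameters(), lr=6e-4)  # what --use-syncfree-optim selects
    loss = model(torch.randn(8, 768, device=device)).sum()
    loss.backward()
    xm.optimizer_step(optimizer)                  # all-reduces gradients across cores, then steps
    xm.mark_step()

if __name__ == "__main__":
    xmp.spawn(_mp_fn, start_method="spawn")       # corresponds to --mp-start-method spawn
```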
You can also use GPU with torch_xla:
export XLA_REGISTER_INSTALLED_PLUGINS=1
PJRT_DEVICE=CUDA GPU_NUM_DEVICES=2 python run_pretrain_xla.py \
--mp-start-method spawn # CUDA devices are not visible across processes when using fork
You can then generate text from a trained checkpoint. Here is an example:
python generate.py \
--model /path/to/checkpoint.pt \
--tokenizer gpt2 \
--seed 42 \
--max-new-tokens 40 \
--temperature 0.75 \
--top-k 40 \
--top-p 0.95
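These flags correspond to standard temperature, top-k, and top-p (nucleus) sampling. Below is a self-contained sketch of how a single sampling step typically combines them (generate.py may implement this differently):

```python
# One sampling step with temperature, top-k, and top-p (nucleus) filtering (illustrative).
import torch

def sample_next_token(logits: torch.Tensor, temperature=0.75, top_k=40, top_p=0.95) -> torch.Tensor:
    logits = logits / temperature                                  # sharpen/flatten the distribution
    if top_k is not None:
        kth = torch.topk(logits, top_k).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))   # keep only the top-k logits
    probs = torch.softmax(logits, dim=-1)
    if top_p is not None:
        sorted_probs, sorted_idx = torch.sort(probs, descending=True)
        cumulative = torch.cumsum(sorted_probs, dim=-1)
        sorted_probs[cumulative - sorted_probs > top_p] = 0.0      # drop tokens outside the nucleus
        sorted_probs /= sorted_probs.sum(dim=-1, keepdim=True)
        choice = torch.multinomial(sorted_probs, num_samples=1)
        return sorted_idx.gather(-1, choice)
    return torch.multinomial(probs, num_samples=1)

print(sample_next_token(torch.randn(50304)))                       # picks one token id
```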
An example input and output from the model:
>> I'm a large language model,
I'm a large language model, so I would like to be able to see a real human language. The next step, is to see the human brain in terms of the brain's language, but not the other way around.
You can also leverage pre-trained checkpoints from OpenAI's GPT-2 by specifying --from-checkpoint:
export XLA_REGISTER_INSTALLED_PLUGINS=1
PJRT_DEVICE=CUDA GPU_NUM_DEVICES=2 python run_pretrain_xla.py \
--do-test \
--seed 1061109567 \
--train-dir ./fineweb_10BT/train \
--valid-dir ./fineweb_10BT/test \
--vocab-size 50304 \
--seq-length 1024 \
--d-model 768 \
--num-layers 12 \
--num-heads 12 \
--d-ff 3072 \
--activation gelu \
--tie-weights \
--matmul-precision high \
--eval-batch-size 8 \
--mixed-precision fp16 \
--valid-steps 500 \
--mp-start-method spawn \
--from-checkpoint gpt2
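Loading OpenAI's checkpoints usually means pulling the Hugging Face weights and transposing the Conv1D projection matrices so they fit nn.Linear modules. The sketch below shows that common conversion; it is not necessarily what --from-checkpoint does internally, and the repository's own module names may differ:

```python
# Sketch of loading OpenAI GPT-2 weights from Hugging Face (illustrative only).
from transformers import GPT2LMHeadModel

hf_state = GPT2LMHeadModel.from_pretrained("gpt2").state_dict()
transposed = ("attn.c_attn.weight", "attn.c_proj.weight", "mlp.c_fc.weight", "mlp.c_proj.weight")

state_dict = {}
for name, tensor in hf_state.items():
    if name.endswith(".attn.masked_bias") or name.endswith(".attn.bias"):
        continue                  # causal-mask buffers, not real parameters
    if name.endswith(transposed):
        tensor = tensor.t()       # HF stores these as Conv1D; nn.Linear expects the transpose
    state_dict[name] = tensor
# state_dict can now be remapped onto this repository's parameter names and loaded.
```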
Here are the validation loss and perplexity figures obtained from OpenAI's GPT-2 checkpoints:
| Model | Params | Loss | PPL |
|---|---|---|---|
| gpt2 | 124M | 3.297 | 27.03 |
| gpt2-medium | 355M | 3.023 | 20.55 |
| gpt2-large | 774M | 2.888 | 17.96 |
| gpt2-xl | 1.5B | 2.791 | 16.30 |
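Perplexity here is simply the exponential of the mean cross-entropy loss, so the two columns are consistent:

```python
# PPL = exp(mean cross-entropy loss), using the values from the table above.
import math

for name, loss in [("gpt2", 3.297), ("gpt2-medium", 3.023), ("gpt2-large", 2.888), ("gpt2-xl", 2.791)]:
    print(f"{name}: PPL = exp({loss}) = {math.exp(loss):.2f}")
# gpt2: PPL = exp(3.297) = 27.03, matching the table.
```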
The images below show the loss curves from training the base model (124M) from scratch on FineWeb-Edu (10BT subset); training took around 14 hours on Kaggle's TPUv3-8: