Third-party benchmark #6
Thank you for sharing! Have you checked accuracy benchmarks too?
@samuelazran Nope, but the loss curve is pretty good for me.
Hi @hiyouga, I am trying out GaLore with this repo. However, I am experiencing very low throughput on an A6000. How did you manage to get >1 it/s? In addition, if I understand correctly, GaLore reduces O(N) operations (the element-wise scaling) but adds O(N^3) operations (SVD and projections) on top of Adam-8bit, so how is it faster?
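For reference, here is a minimal sketch of the per-matrix update GaLore performs (paraphrased from the paper, not taken from the galore-torch implementation; all names and defaults are illustrative). It shows where both the extra SVD/projection work and the memory saving come from:
'''
import torch

def galore_adam_step(W, G, state, lr=1e-3, rank=128, update_proj_gap=200,
                     scale=0.25, beta1=0.9, beta2=0.999, eps=1e-8):
    # One GaLore-style update for a single m x n weight W with gradient G.
    # Call under torch.no_grad(); `state` is a per-parameter dict.
    state["step"] = state.get("step", 0) + 1
    if "P" not in state or (state["step"] - 1) % update_proj_gap == 0:
        # Periodic SVD of the gradient gives the low-rank projector (the costly part).
        U, _, _ = torch.linalg.svd(G, full_matrices=False)
        state["P"] = U[:, :rank]                          # m x rank
    if "m" not in state:
        # Adam statistics live only in the projected rank x n space -> memory saving.
        state["m"] = torch.zeros(rank, G.shape[1], device=G.device, dtype=G.dtype)
        state["v"] = torch.zeros_like(state["m"])
    P = state["P"]
    R = P.T @ G                                           # project gradient: rank x n
    state["m"].mul_(beta1).add_(R, alpha=1 - beta1)
    state["v"].mul_(beta2).addcmul_(R, R, value=1 - beta2)
    m_hat = state["m"] / (1 - beta1 ** state["step"])
    v_hat = state["v"] / (1 - beta2 ** state["step"])
    N = m_hat / (v_hat.sqrt() + eps)                      # Adam step in low-rank space
    W.add_(P @ N, alpha=-lr * scale)                      # project back and apply
'''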
@yongchanghao Sorry, we might have missed some experimental details. We used …
@hiyouga I'm also confused about why GaLore can improve throughput without increasing batch_size. Actually, the paper mentions "which …
@pkumc The previous results were indeed somewhat unfair. We have now adjusted the experimental setup and updated the results. When the rank is small (<128), GaLore still has better throughput. I guess it may be because GaLore has fewer FLOPs in training. Regarding the data reported in the paper, we have discussed it with the author; it may be due to different hardware with varying GEMM performance.
@hiyouga Thanks for the update. I feel the current data makes more sense. For future readers' reference, my preliminary experience aligns well with the data reported in #3 (comment)
I believe you need to do GaLore layer by layer in order to save memory, as in Line 334 in a6bc165.
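For context, here is a minimal sketch of that layer-wise pattern, assuming PyTorch >= 2.1 (for register_post_accumulate_grad_hook) and the galore-torch API as shown in its README (GaLoreAdamW with per-group rank/update_proj_gap/scale/proj_type keys); it mirrors the idea rather than the exact repository code:
'''
import torch
from galore_torch import GaLoreAdamW  # assumed API, as shown in the galore-torch README

def attach_layerwise_galore(model, rank=128, update_proj_gap=200, scale=0.25, lr=1e-5):
    # Give every trainable parameter its own tiny optimizer and step it from a
    # post-accumulate-grad hook, so each layer's gradient is consumed and freed
    # immediately instead of being kept until a global optimizer.step().
    optimizers = {}
    for p in model.parameters():
        if not p.requires_grad:
            continue
        if p.dim() == 2:  # GaLore projects matrix-shaped weights only
            group = {"params": [p], "rank": rank, "update_proj_gap": update_proj_gap,
                     "scale": scale, "proj_type": "std"}
            optimizers[p] = GaLoreAdamW([group], lr=lr)
        else:
            optimizers[p] = torch.optim.AdamW([p], lr=lr)

        def step_hook(param):
            opt = optimizers[param]
            opt.step()
            opt.zero_grad(set_to_none=True)  # drop this layer's gradient right away

        p.register_post_accumulate_grad_hook(step_hook)  # PyTorch >= 2.1
    return optimizers  # keep a reference; no global optimizer.step() is called
'''
In the transformers integration, the same effect is what the *_layerwise optimizer names in the list below provide.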
This is not formal research, but although GaLore reduces the amount of memory used, it is undeniable that GaLore increases training time by roughly a factor of three, which is not friendly to LLM training. This is the test code:
'''
#install
conda create --name test python=3.11
conda activate test
export CUDA_HOME=xxxxxxx
export LD_LIBRARY_PATH=$CUDA_HOME"/lib64:$LD_LIBRARY_PATH"
export PATH=$CUDA_HOME"/bin:$PATH"
pip install -U transformers trl datasets
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
pip install galore-torch
# HF-supported optimizers:
['adamw_hf', 'adamw_torch', 'adamw_torch_fused', 'adamw_torch_xla', 'adamw_torch_npu_fused', 'adamw_apex_fused', 'adafactor', 'adamw_anyprecision', 'sgd', 'adagrad', 'adamw_bnb_8bit', 'adamw_8bit', 'lion_8bit', 'lion_32bit',
'paged_adamw_32bit', 'paged_adamw_8bit', 'paged_lion_32bit', 'paged_lion_8bit', 'rmsprop', 'rmsprop_bnb', 'rmsprop_bnb_8bit', 'rmsprop_bnb_32bit',
'galore_adamw', 'galore_adamw_8bit', 'galore_adafactor',
'galore_adamw_layerwise', 'galore_adamw_8bit_layerwise', 'galore_adafactor_layerwise']
'''
'''
import torch
import datasets
from transformers import TrainingArguments, AutoConfig, AutoTokenizer, AutoModelForCausalLM
import trl, time

train_dataset = datasets.load_dataset('imdb', split='train')

args = TrainingArguments(
    output_dir="./test-galore",
    max_steps=100,
    per_device_train_batch_size=2,
    optim="adamw_hf",  # baseline; swap in one of the galore_* optimizers above to benchmark GaLore
    optim_target_modules=["attn", "mlp"],  # only used by the galore_* optimizers
)

model_id = "Qwen/Qwen1.5-0.5B"
#model_id = "Qwen/Qwen1.5-4B"
#model_id = "Qwen/Qwen1.5-7B"
#model_id = "mistralai/Mistral-7B-v0.1"

config = AutoConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_config(config).to(0)  # randomly initialized weights; fine for a speed/memory test

trainer = trl.SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    train_dataset=train_dataset,
    dataset_text_field='text',
    max_seq_length=512,
)

start_time = time.time()
trainer.train()
train_time = time.time() - start_time

print(f"=====================================================")
print(f"Time Used: {train_time:.2f} s")
print(f"memory_allocated: {torch.cuda.memory_allocated()/1024.0/1024.0:.2f} MB")
print(f"max_memory_allocated: {torch.cuda.max_memory_allocated()/1024.0/1024.0:.2f} MB")
print(f"memory_reserved: {torch.cuda.memory_reserved()/1024.0/1024.0:.2f} MB")
print(f"max_memory_reserved: {torch.cuda.max_memory_reserved()/1024.0/1024.0:.2f} MB")
print(f"free memory: {torch.cuda.mem_get_info()[0]/1024.0/1024.0:.2f} MB")
print(f"=====================================================")
'''
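For the GaLore runs, presumably only the TrainingArguments change, with the optimizer swapped for one of the galore_* values listed above; a sketch of that variant (GaLore hyperparameters are illustrative and passed via optim_args, as described in the transformers docs):
'''
args = TrainingArguments(
    output_dir="./test-galore",
    max_steps=100,
    per_device_train_batch_size=2,
    optim="galore_adamw",                    # or galore_adamw_8bit / galore_adamw_layerwise, etc.
    optim_target_modules=["attn", "mlp"],
    optim_args="rank=128, update_proj_gap=200, scale=0.25",  # optional; illustrative values
)
'''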
Thanks for providing your results @WangRongsheng. We are working on efficiency optimization and you can expect a big throughput boost in the next version. For the train_loss, did you tune the lr for GaLore?
I will do it. |
Hello, thank you very much for such excellent work. We have conducted some experiments using LLaMA-Factory, and the results indicate that GaLore can significantly reduce memory usage during full-parameter fine-tuning. We used the 8-bit AdamW optimizer and pure bfloat16 training with gradient checkpointing. GaLore requires only 18GB of VRAM to train a Llama-2 7B model, while the standard 8-bit AdamW optimizer requires at least 40GB of VRAM. We provide reproducible scripts for SFT training here: https://github.com/hiyouga/LLaMA-Factory/blob/main/examples/extras/galore/galore_adamw_8bit_bf16.sh (a sketch of comparable settings follows after the notes below).
* We omitted the time of computing the SVD for GaLore every update_proj_gap steps; it costs around 10 minutes for a 7B model.
Experiment results last updated: Mar 9th.
todo: add loss convergence results.
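For readers who cannot open the linked script, here is a rough sketch of the kind of TrainingArguments described above (8-bit layer-wise GaLore AdamW, bf16, gradient checkpointing), assuming a transformers version with GaLore support; the exact values in the LLaMA-Factory example may differ:
'''
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./llama2-7b-galore-sft",
    per_device_train_batch_size=1,
    learning_rate=1e-5,
    bf16=True,                               # bf16 autocast; "pure" bf16 additionally loads the model in torch.bfloat16
    gradient_checkpointing=True,             # trade recomputation for activation memory
    optim="galore_adamw_8bit_layerwise",     # 8-bit AdamW states + layer-wise GaLore updates
    optim_target_modules=["attn", "mlp"],    # module-name patterns; adjust to the target model
    optim_args="rank=128, update_proj_gap=200, scale=0.25",  # illustrative GaLore hyperparameters
)
'''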