Hi,
I am running experiments with FP8 training using Megatron on H100s, but I still have some questions that I could not find answered in the documentation:
- Does FP8 Megatron training reduce VRAM usage on the GPUs, so that I can increase the batch size or sequence length?
- Does the repo currently support LoRA FP8 training, and can the trained LoRA then be merged/exported directly to an FP8 checkpoint (without converting the checkpoint from FP8 to FP16 and then back from FP16 to FP8 for inference)?
- Does FP8 training give a speedup in your tests? I set up a quick training run, but the training speed was 1.5x-2x slower than FP16. I used the example script llm.sh in the fp8 folder of the repo; a minimal sketch of the kind of FP8 setup I have in mind is below.
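For context, this is roughly the Transformer Engine pattern I assume the FP8 path uses under the hood (an illustrative sketch with made-up layer and shapes, not the actual llm.sh or Megatron code):

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# Illustrative only: FP8 applies to the GEMM compute; master weights,
# gradients and optimizer state stay in higher precision, so memory
# savings would come mainly from activations.
fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=16)

# Made-up layer and shapes, just to show the fp8_autocast usage pattern.
layer = te.Linear(4096, 4096, params_dtype=torch.bfloat16).cuda()
inp = torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = layer(inp)
out.float().sum().backward()
```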
Thanks,