
Problems fine-tuning an FP8 model with Megatron #6994

@yugenlgy

Description


Hardware: 8 × 80 GB GPUs
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
NPROC_PER_NODE=8 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
megatron sft \
    --model /test/weights/Qwen3-Coder-30B-A3B-Instruct-FP8 \
    --train_type lora \
    --lora_rank 16 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --load_safetensors true \
    --save_safetensors true \
    --fp8_recipe delayed \
    --fp8_format hybrid \
    --fp8_param_gather true \
    --dataset /test/ml/CodeAlpaca-20k.jsonl \
    --load_from_cache_file true \
    --tensor_model_parallel_size 1 \
    --expert_model_parallel_size 8 \
    --moe_permute_fusion true \
    --moe_grouped_gemm true \
    --moe_shared_expert_overlap true \
    --moe_aux_loss_coeff 1e-6 \
    --micro_batch_size 2 \
    --global_batch_size 16 \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 1 \
    --max_epochs 1 \
    --finetune true \
    --cross_entropy_loss_fusion true \
    --lr 1e-5 \
    --lr_warmup_fraction 0.05 \
    --min_lr 1e-6 \
    --save megatron_output/Qwen3-Coder-30B-A3B-Instruct-FP8-result \
    --eval_interval 200 \
    --save_interval 200 \
    --max_length 2048 \
    --num_workers 8 \
    --dataset_num_proc 8 \
    --no_save_optim true \
    --no_save_rng true \
    --sequence_parallel true \
    --moe_expert_capacity_factor 2 \
    --use_precision_aware_optimizer true \
    --exp_avg_dtype bf16 \
    --exp_avg_sq_dtype bf16 \
    --attention_backend flash \
    --model_author swift \
    --main_params_dtype fp16 \
    --model_name swift-robot \
    --merge_lora false

Key log output
[INFO:swift] [rank0] model_parameter_info: PeftModelForCausalLM: 5253.8388M Params (88.8668M Trainable [1.6915%]), 0.0000M Buffers.

number of parameters on (tensor, pipeline) model parallel rank (0, 0): 5253838848
[after model, optimizer, and learning rate scheduler are built] datetime: 2025-12-10 19:17:57
building train, validation, and test datasets ...
[after dataloaders are built] datetime: 2025-12-10 19:17:57
done with setup ...
(min, max) time across ranks (ms):
model-and-optimizer-setup ......................: (42193.62, 42195.00)
train/valid/test-data-iterators-setup ..........: (0.74, 1.14)
training ...
Overwriting rerun_state_machine.current_iteration from -1 to 0...
[before the start of training step] datetime: 2025-12-10 19:17:57
[INFO:swift] The training of Epoch 0 starts...
WARNING:DotProductAttention:flash-attn v3 may provide important feature support or performance improvement. Please install flash-attn v3 by
(1) git clone https://github.com/Dao-AILab/flash-attention.git
(2) cd flash-attention/ && git checkout 3ba6f82 && git submodule update --init && cd hopper/ && python setup.py install
(3) python_path=`python -c "import site; print(site.getsitepackages()[0])"`
(4) mkdir -p $python_path/flash_attn_3
(5) cp flash_attn_interface.py $python_path/flash_attn_3/flash_attn_interface.py
WARNING:megatron.core.rerun_state_machine:Result validation enabled
Number of parameters in transformer block in billions: 29.90
Number of parameters in embedding layers in billions: 0.62
Total number of parameters in billions: 30.52
Number of parameters in most loaded shard in billions: 30.5197
Theoretical memory footprints: weight and optimizer=218294.09 MB
[2025-12-10 19:18:34] iteration 1/ 1251 | consumed samples: 16 | elapsed time per iteration (ms): 36464.7 | memory(GiB): 12.5 | elapsed time: 36s | remaining time: 12h 39m 40s | learning rate: 1.598721E-07 | global batch size: 16 | lm loss: 1.463566E+01 | load_balancing_loss: 3.929863E+00 | loss scale: 1.0 | grad norm: 0.000 | number of skipped iterations: 0 | number of nan iterations: 0 |


1. How is the 5253.8388M parameter count computed? (A back-of-envelope check is sketched after these questions.)
2. Why is GPU memory usage so high? Each GPU sits at around 20 GB. My goal is to run Qwen3-Coder-30B-A3B-Instruct-FP8 with DP=1 and EP=8, so shouldn't the weights only occupy roughly 4-5 GB per GPU? When fine-tuning an FP8 checkpoint, are the weights cast to float16/bfloat16 when the model is loaded onto the GPU? I only want to fine-tune in FP8.
3. My ultimate goal is to fine-tune Qwen3-Coder-480B-A35B-Instruct-FP8 on these same 8 GPUs. If I only change the --model argument, training fails with OOM. How should the training script be modified?
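
For reference, here is a minimal back-of-envelope sketch of questions 1 and 2, using only the numbers printed in the log above and assuming that with TP=1 and EP=8 each rank holds all non-expert weights (attention, shared expert, embeddings) plus 1/8 of the routed-expert weights:

# Back-of-envelope check using only values printed in the log above.
# Assumption: with TP=1 and EP=8, each rank holds all non-expert weights
# plus 1/8 of the routed-expert weights.
total_params = 30.52e9      # "Total number of parameters in billions: 30.52"
per_rank     = 5253.8388e6  # "5253.8388M Params" reported by swift on rank 0
ep_size      = 8

# Solve the two equations:
#   non_expert + expert           = total_params
#   non_expert + expert / ep_size = per_rank
expert     = (total_params - per_rank) / (1 - 1 / ep_size)
non_expert = total_params - expert
print(f"routed-expert params ~= {expert / 1e9:.2f} B")      # ~= 28.88 B
print(f"non-expert params    ~= {non_expert / 1e9:.2f} B")  # ~= 1.64 B

# If the weights stayed in FP8 (1 byte per parameter), per-rank weight memory
# would be roughly:
print(f"FP8 weights per rank ~= {per_rank / 2**30:.1f} GiB")  # ~= 4.9 GiB

This is what leads me to expect roughly 4-5 GB of weight memory per GPU; the observed ~20 GB per GPU looks closer to bf16/fp16 weights plus optimizer and activation memory, which is exactly what I want to confirm in question 2.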
