Description
Hardware: 8 × 80 GB GPUs
```shell
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
NPROC_PER_NODE=8 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
megatron sft \
    --model /test/weights/Qwen3-Coder-30B-A3B-Instruct-FP8 \
    --train_type lora \
    --lora_rank 16 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --load_safetensors true \
    --save_safetensors true \
    --fp8_recipe delayed \
    --fp8_format hybrid \
    --fp8_param_gather true \
    --dataset /test/ml/CodeAlpaca-20k.jsonl \
    --load_from_cache_file true \
    --tensor_model_parallel_size 1 \
    --expert_model_parallel_size 8 \
    --moe_permute_fusion true \
    --moe_grouped_gemm true \
    --moe_shared_expert_overlap true \
    --moe_aux_loss_coeff 1e-6 \
    --micro_batch_size 2 \
    --global_batch_size 16 \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 1 \
    --max_epochs 1 \
    --finetune true \
    --cross_entropy_loss_fusion true \
    --lr 1e-5 \
    --lr_warmup_fraction 0.05 \
    --min_lr 1e-6 \
    --save megatron_output/Qwen3-Coder-30B-A3B-Instruct-FP8-result \
    --eval_interval 200 \
    --save_interval 200 \
    --max_length 2048 \
    --num_workers 8 \
    --dataset_num_proc 8 \
    --no_save_optim true \
    --no_save_rng true \
    --sequence_parallel true \
    --moe_expert_capacity_factor 2 \
    --use_precision_aware_optimizer true \
    --exp_avg_dtype bf16 \
    --exp_avg_sq_dtype bf16 \
    --attention_backend flash \
    --model_author swift \
    --main_params_dtype fp16 \
    --model_name swift-robot \
    --merge_lora false
```
Key logs:
```text
[INFO:swift] [rank0] model_parameter_info: PeftModelForCausalLM: 5253.8388M Params (88.8668M Trainable [1.6915%]), 0.0000M Buffers.
number of parameters on (tensor, pipeline) model parallel rank (0, 0): 5253838848
[after model, optimizer, and learning rate scheduler are built] datetime: 2025-12-10 19:17:57
building train, validation, and test datasets ...
[after dataloaders are built] datetime: 2025-12-10 19:17:57
done with setup ...
(min, max) time across ranks (ms):
model-and-optimizer-setup ......................: (42193.62, 42195.00)
train/valid/test-data-iterators-setup ..........: (0.74, 1.14)
training ...
Overwriting rerun_state_machine.current_iteration from -1 to 0...
[before the start of training step] datetime: 2025-12-10 19:17:57
[INFO:swift] The training of Epoch 0 starts...
WARNING:DotProductAttention:flash-attn v3 may provide important feature support or performance improvement. Please install flash-attn v3 by
(1) git clone https://github.com/Dao-AILab/flash-attention.git
(2) cd flash-attention/ && git checkout 3ba6f82 && git submodule update --init && cd hopper/ && python setup.py install
(3) python_path=python -c "import site; print(site.getsitepackages()[0])"
(4) mkdir -p $python_path/flash_attn_3
(5) cp flash_attn_interface.py $python_path/flash_attn_3/flash_attn_interface.py
WARNING:megatron.core.rerun_state_machine:Result validation enabled
Number of parameters in transformer block in billions: 29.90
Number of parameters in embedding layers in billions: 0.62
Total number of parameters in billions: 30.52
Number of parameters in most loaded shard in billions: 30.5197
Theoretical memory footprints: weight and optimizer=218294.09 MB
[2025-12-10 19:18:34] iteration 1/ 1251 | consumed samples: 16 | elapsed time per iteration (ms): 36464.7 | memory(GiB): 12.5 | elapsed time: 36s | remaining time: 12h 39m 40s | learning rate: 1.598721E-07 | global batch size: 16 | lm loss: 1.463566E+01 | load_balancing_loss: 3.929863E+00 | loss scale: 1.0 | grad norm: 0.000 | number of skipped iterations: 0 | number of nan iterations: 0 |
```
1. How is the 5253.8388M parameter count computed? (See the sketch after this list.)
2. Why is GPU memory usage so high? Each GPU sits at around 20 GB. My goal is to run Qwen3-Coder-30B-A3B-Instruct-FP8 with DP=1, EP=8; shouldn't that only take about 4-5 GB per GPU? When fine-tuning an FP8 checkpoint, are the weights converted to float16/bfloat16 when the model is loaded into GPU memory? I want to fine-tune purely in FP8.
3. My ultimate goal is to fine-tune Qwen3-Coder-480B-A35B-Instruct-FP8 on these 8 GPUs. If I only change the --model argument, I get an OOM error. How should I modify the training script?
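For question 1, here is a rough, non-authoritative back-of-the-envelope sketch of how a per-rank figure of about 5253.84M can arise from the logged totals. The only assumption is the usual expert-parallel layout: expert weights are split across the EP=8 ranks, while non-expert weights (embeddings, attention, routers, shared experts) are replicated on every rank; the per-rank total also includes the ~88.87M LoRA adapter parameters reported in the log.

```python
# Sketch: split the logged totals into "replicated" vs. "expert" parameters.
# Assumption: under expert parallelism (EP), expert weights are sharded EP-ways,
# while all non-expert weights are replicated on every EP rank.

total_params = 30.52e9      # "Total number of parameters in billions: 30.52" (log)
per_rank     = 5253838848   # "number of parameters on ... rank (0, 0)" (log)
ep           = 8            # --expert_model_parallel_size 8

# per_rank ≈ replicated + expert / ep    and    total ≈ replicated + expert
expert = (total_params - per_rank) * ep / (ep - 1)
replicated = total_params - expert

print(f"expert params ≈ {expert/1e9:.2f} B ({expert/ep/1e9:.2f} B per rank)")
print(f"replicated (non-expert) params ≈ {replicated/1e9:.2f} B")
print(f"per-rank total ≈ {(replicated + expert/ep)/1e9:.4f} B")   # ≈ 5.2538 B
# Note: the logged per-rank count also contains the ~0.0889 B LoRA parameters,
# so the base-model split implied here is slightly overstated.
```

Whether that ~5.25B-parameter shard translates into a few GB or much more of weight memory per GPU depends on the dtype the parameters are actually held in at runtime (fp8 vs. bf16/fp16) plus activations and optimizer state, which is what question 2 is asking about.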