Description
```shell
nnodes=2
nproc_per_node=4

CUDA_VISIBLE_DEVICES=0,1,2,3 \
NNODES=$nnodes \
NODE_RANK=0 \
MASTER_ADDR=127.0.0.1 \
MASTER_PORT=28350 \
NPROC_PER_NODE=$nproc_per_node \
swift sft \
    --model /data/weights/Qwen/Qwen3-14B \
    --train_type lora \
    --dataset '/data/ms-swift/swift/llm/dataset/data/merged.json' \
    --load_from_cache_file true \
    --split_dataset_ratio 0.01 \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-5 \
    --gradient_accumulation_steps $(expr 32 / $nproc_per_node / $nnodes) \
    --eval_steps 100 \
    --save_steps 100 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --max_length 8192 \
    --output_dir output \
    --system 'You are a helpful assistant.' \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --deepspeed zero3
```
This is my command. Inter-node communication appears to succeed, but execution hangs right after `[INFO:swift] model_kwargs: {'device_map': None, 'dtype': torch.bfloat16}` is printed.
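
To get more detail on where the hang happens, one option is to enable NCCL and torch.distributed debug logging before re-running the same launch command. This is only a minimal sketch, not part of the failing run above; the interface name `eth0` is an assumption and should be replaced with the NIC actually used between the two nodes.

```shell
# Hypothetical debugging sketch: turn on verbose collective-communication logs,
# then re-run the same `swift sft` command unchanged.
export NCCL_DEBUG=INFO                   # print NCCL init / transport details
export NCCL_SOCKET_IFNAME=eth0           # assumption: cross-node network interface
export TORCH_DISTRIBUTED_DEBUG=DETAIL    # extra torch.distributed diagnostics

CUDA_VISIBLE_DEVICES=0,1,2,3 \
NNODES=$nnodes \
NODE_RANK=0 \
MASTER_ADDR=127.0.0.1 \
MASTER_PORT=28350 \
NPROC_PER_NODE=$nproc_per_node \
swift sft ...   # same arguments as in the command above
```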
