Megatron DPO training of Qwen3-Omni hangs #6995

@Mryangkaitong

Training on a single machine with 8 GPUs hangs at the point shown below, and utilization on some of the GPUs stays at 0.

training ...
Overwriting rerun_state_machine.current_iteration from -1 to 0...
[before the start of training step] datetime: 2025-12-10 18:23:24
WARNING:DotProductAttention:flash-attn v3 may provide important feature support or performance improvement. Please install flash-attn v3 by
(1) git clone https://github.com/Dao-AILab/flash-attention.git
(2) cd flash-attention/hopper && python setup.py install
(3) python_path=python -c "import site; print(site.getsitepackages()[0])"
(4) mkdir -p $python_path/flash_attn_3
(5) cp flash_attn_interface.py $python_path/flash_attn_3/flash_attn_interface.py
WARNING:megatron.core.rerun_state_machine:Result validation enabled
