Open
Description
Single-node training on 8 GPUs hangs at the point shown below, and utilization on some of the GPUs stays at 0.
training ...
Overwriting rerun_state_machine.current_iteration from -1 to 0...
[before the start of training step] datetime: 2025-12-10 18:23:24
WARNING:DotProductAttention:flash-attn v3 may provide important feature support or performance improvement. Please install flash-attn v3 by
(1) git clone https://github.com/Dao-AILab/flash-attention.git
(2) cd flash-attention/hopper && python setup.py install
(3) python_path=`python -c "import site; print(site.getsitepackages()[0])"`
(4) mkdir -p $python_path/flash_attn_3
(5) cp flash_attn_interface.py $python_path/flash_attn_3/flash_attn_interface.py
WARNING:megatron.core.rerun_state_machine:Result validation enabled