Problem
Training fails with a ValueError when using FSDP with the HYBRID_SHARD sharding strategy on systems where nproc_per_node is set lower than the total number of GPUs available on the node.
Error
ValueError: The arg 'group_size' (8) must not exceed the world size (2)
Stack Trace
File ".../torch/distributed/fsdp/fully_sharded_data_parallel.py", line 439, in __init__
_init_process_group_state(
File ".../torch/distributed/fsdp/_init_utils.py", line 113, in _init_process_group_state
state = _init_process_group_state_for_hybrid_shard(
File ".../torch/distributed/fsdp/_init_utils.py", line 160, in _init_process_group_state_for_hybrid_shard
intra_node_group, inter_node_group = _init_intra_and_inter_node_groups(
File ".../torch/distributed/fsdp/_init_utils.py", line 266, in _init_intra_and_inter_node_groups
_init_intra_node_process_group(num_devices_per_node),
File ".../torch/distributed/fsdp/_init_utils.py", line 211, in _init_intra_node_process_group
intra_node_subgroup, _ = dist.new_subgroups(num_devices_per_node)
File ".../torch/distributed/distributed_c10d.py", line 5500, in new_subgroups
raise ValueError(
ValueError: The arg 'group_size' (8) must not exceed the world size (2)
Cause
When HYBRID_SHARD is used without an explicit device_mesh, FSDP1 auto-detects num_devices_per_node from the number of GPUs visible on the node (8 here). It then attempts to create intra-node process groups of size 8, which fails when world_size < 8.
FSDP1 does not provide a straightforward way to configure the intra-node group size when using HYBRID_SHARD without a device_mesh.
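For illustration, a minimal sketch of the explicit device_mesh route, assuming PyTorch >= 2.2 (where FSDP1 accepts a device_mesh argument) and the single-node, 2-rank setup from the reproduction below; the (1, 2) mesh shape means one inter-node replication group over two intra-node shard ranks:

import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank())

# 2D mesh: dim 0 = replication (inter-node), dim 1 = sharding (intra-node),
# sized to the launched world size (1 x 2) rather than the detected GPU count.
mesh = init_device_mesh("cuda", (1, 2))

model = torch.nn.Linear(16, 16).cuda()
model = FSDP(model, sharding_strategy=ShardingStrategy.HYBRID_SHARD, device_mesh=mesh)

This sidesteps the auto-detection, but it requires building the mesh by hand at every call site, which is the configuration gap described above.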
Reproduction
Run training with:
- distributed_training_framework: fsdp
- fsdp_sharding_strategy: HYBRID_SHARD
- Fewer than 8 GPUs (e.g., 2 GPUs); see the standalone launch sketch below
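A standalone reproduction sketch (repro_hybrid_shard.py is a hypothetical script name, not the project's actual entry point), launched on an 8-GPU node with only 2 ranks:

torchrun --nproc_per_node=2 repro_hybrid_shard.py

where repro_hybrid_shard.py contains:

import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = torch.nn.Linear(16, 16).cuda()

# Without a device_mesh, FSDP1 sizes the intra-node group from the GPUs it
# detects on the node (8), which exceeds the world size of 2 and raises the
# ValueError shown above.
FSDP(model, sharding_strategy=ShardingStrategy.HYBRID_SHARD)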
Environment
- PyTorch with FSDP1 (torch.distributed.fsdp.FullyShardedDataParallel)
- Accelerate
- Any system with < 8 GPUs