HYBRID_SHARD fails when world_size < available GPUs #678

@RobotSail

Problem

Training fails with a ValueError when FSDP uses the HYBRID_SHARD sharding strategy and the nproc_per_node value is lower than the number of GPUs available on the node.

Error

ValueError: The arg 'group_size' (8) must not exceed the world size (2)

Stack Trace

File ".../torch/distributed/fsdp/fully_sharded_data_parallel.py", line 439, in __init__
    _init_process_group_state(
File ".../torch/distributed/fsdp/_init_utils.py", line 113, in _init_process_group_state
    state = _init_process_group_state_for_hybrid_shard(
File ".../torch/distributed/fsdp/_init_utils.py", line 160, in _init_process_group_state_for_hybrid_shard
    intra_node_group, inter_node_group = _init_intra_and_inter_node_groups(
File ".../torch/distributed/fsdp/_init_utils.py", line 266, in _init_intra_and_inter_node_groups
    _init_intra_node_process_group(num_devices_per_node),
File ".../torch/distributed/fsdp/_init_utils.py", line 211, in _init_intra_node_process_group
    intra_node_subgroup, _ = dist.new_subgroups(num_devices_per_node)
File ".../torch/distributed/distributed_c10d.py", line 5500, in new_subgroups
    raise ValueError(
ValueError: The arg 'group_size' (8) must not exceed the world size (2)

Cause

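When HYBRID_SHARD is used without an explicit device_mesh, FSDP1 sets num_devices_per_node to the number of GPUs visible on the node (8 in the run above) rather than to the number of launched processes. It then tries to create intra-node process groups of that size, which fails whenever world_size is smaller (2 in the run above).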
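The failing check can be illustrated with a minimal sketch in plain Python. It is not the PyTorch source: `check_subgroup_size` is a hypothetical stand-in that only mirrors the comparison and the message shown in the stack trace, using whatever value FSDP1 picks for num_devices_per_node (8 in the report) as the group size.

```python
def check_subgroup_size(group_size: int, world_size: int) -> None:
    """Hypothetical stand-in for the validation that raises in the trace.

    group_size: the intra-node group size FSDP1 asks for
                (num_devices_per_node, 8 in the report).
    world_size: the number of launched processes (2 in the report).
    """
    if group_size > world_size:
        raise ValueError(
            f"The arg 'group_size' ({group_size}) must not exceed "
            f"the world size ({world_size})"
        )


if __name__ == "__main__":
    check_subgroup_size(group_size=2, world_size=2)  # passes silently
    try:
        check_subgroup_size(group_size=8, world_size=2)
    except ValueError as err:
        print(err)
```

With a group size of 8 and a world size of 2, the sketch raises the same message as the report; any group size at or below the world size passes.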

FSDP1 does not provide a straightforward way to configure the intra-node group size when using HYBRID_SHARD without a device_mesh.
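One possible workaround, sketched below and not tested against this repository, is to build the shard and replicate groups by hand and pass them to FSDP as a `process_group` tuple, which the FSDP1 documentation describes for hybrid sharding strategies. The helper names `hybrid_shard_layout` and `build_hybrid_groups`, and the choice of contiguous ranks per shard group, are assumptions made for illustration.

```python
from typing import List, Tuple


def hybrid_shard_layout(
    world_size: int, shard_size: int
) -> Tuple[List[List[int]], List[List[int]]]:
    """Return (shard_groups, replicate_groups) as lists of global ranks.

    Assumption: ranks are laid out contiguously, so ranks 0..shard_size-1
    form the first shard group, and so on.
    """
    if shard_size < 1 or world_size % shard_size != 0:
        raise ValueError("shard_size must evenly divide world_size")
    num_shard_groups = world_size // shard_size
    # Parameters are sharded within each of these groups.
    shard_groups = [
        list(range(g * shard_size, (g + 1) * shard_size))
        for g in range(num_shard_groups)
    ]
    # Ranks holding the same shard replicate it across shard groups.
    replicate_groups = [
        list(range(local, world_size, shard_size))
        for local in range(shard_size)
    ]
    return shard_groups, replicate_groups


def build_hybrid_groups(shard_size: int):
    """Create the process groups; call after init_process_group()."""
    import torch.distributed as dist  # imported lazily so the layout helper stays torch-free

    rank, world_size = dist.get_rank(), dist.get_world_size()
    shard_groups, replicate_groups = hybrid_shard_layout(world_size, shard_size)
    my_shard = my_replicate = None
    # Every rank must call new_group for every group, in the same order.
    for ranks in shard_groups:
        pg = dist.new_group(ranks=ranks)
        if rank in ranks:
            my_shard = pg
    for ranks in replicate_groups:
        pg = dist.new_group(ranks=ranks)
        if rank in ranks:
            my_replicate = pg
    return my_shard, my_replicate
```

The returned tuple would then be passed as `FSDP(model, sharding_strategy=ShardingStrategy.HYBRID_SHARD, process_group=(shard_pg, replicate_pg))`. For the 2-process case in this report, `hybrid_shard_layout(2, 2)` yields one shard group of both ranks and two single-rank replicate groups.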

Reproduction

Run training with:

  • distributed_training_framework: fsdp
  • fsdp_sharding_strategy: HYBRID_SHARD
  • Fewer processes than visible GPUs (e.g., nproc_per_node=2 on a node with 8 GPUs)
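Another launch-time workaround that may avoid the error is to hide the unused GPUs, so that the device count seen by each process matches the number of processes. The sketch below assumes the launcher is torchrun and that GPUs 0..n-1 are the ones to use; `limit_visible_gpus` and `train.py` are hypothetical names, not part of this repository.

```python
import os
from typing import Dict, Mapping, Optional


def limit_visible_gpus(
    nproc_per_node: int, env: Optional[Mapping[str, str]] = None
) -> Dict[str, str]:
    """Return a copy of env that exposes only the first nproc_per_node GPUs.

    Assumption: GPUs are selected by index, starting at 0.
    """
    if nproc_per_node < 1:
        raise ValueError("nproc_per_node must be at least 1")
    new_env = dict(os.environ if env is None else env)
    new_env["CUDA_VISIBLE_DEVICES"] = ",".join(
        str(i) for i in range(nproc_per_node)
    )
    return new_env


if __name__ == "__main__":
    # Hypothetical launch; train.py stands in for the real entry point:
    #   subprocess.run(
    #       ["torchrun", "--nproc_per_node=2", "train.py"],
    #       env=limit_visible_gpus(2),
    #   )
    print(limit_visible_gpus(2, env={})["CUDA_VISIBLE_DEVICES"])
```

The variable has to be set before the training processes initialize CUDA, which is why the sketch builds the environment for the launcher instead of changing it inside the training script.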

Environment

  • PyTorch with FSDP1 (torch.distributed.fsdp.FullyShardedDataParallel)
  • Accelerate
  • Any node where nproc_per_node is lower than the number of visible GPUs (8 GPUs in the failing run)
