[Model] tp+ep support v1_loader #5465
Conversation
Thanks for your contribution!
Pull request overview
This PR adds support for v1_loader with tensor parallelism (tp) and expert parallelism (ep) by implementing weight_loader methods across multiple layer types and ensuring proper weight attribute configuration for parallel computation.
- Adds weight_loader methods to normalization, linear, and attention layers for v1_loader compatibility
- Configures output_dim attributes on bias parameters for tensor parallel layers
- Implements whitelist-based conditional splitting for specific layer patterns (e.g., shared_experts); a minimal sketch of this rule follows below
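To make the whitelist rule above concrete, here is a minimal sketch in the spirit of the PR, not its actual code; the helper name maybe_shard and its signature are illustrative, and only the per-rank offset arithmetic and the "don't split whitelisted layers" behaviour come from the diff reviewed below.

```python
import paddle

def maybe_shard(loaded_weight, output_dim, tp_rank, tp_size, layer_in_white_list):
    # Whitelisted layers (e.g. shared_experts when use_sequence_parallel_moe is on)
    # keep the full checkpoint tensor on every rank instead of a per-rank shard.
    if layer_in_white_list:
        return loaded_weight
    # Otherwise take this rank's contiguous block along the sharded dimension.
    block_size = loaded_weight.shape[output_dim] // tp_size
    shard_offset = tp_rank * block_size
    shard_end = (tp_rank + 1) * block_size
    return paddle.slice(loaded_weight, axes=[output_dim], starts=[shard_offset], ends=[shard_end])
```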
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| fastdeploy/model_executor/layers/normalization.py | Adds weight_loader method to RMSNorm for proper dtype conversion during weight loading |
| fastdeploy/model_executor/layers/mtp_linear.py | Sets output_dim attribute on bias for tensor parallel support (has a bug - missing conditional check) |
| fastdeploy/model_executor/layers/lm_head.py | Conditionally sets output_dim attribute on bias when bias is present |
| fastdeploy/model_executor/layers/linear.py | Implements comprehensive weight_loader for RowParallelLinear with whitelist-based splitting logic and dtype casting |
| fastdeploy/model_executor/layers/attention/attention.py | Adds an early return path in weight_loader for qk_norm weights (has a dtype inconsistency issue) |
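For orientation, the normalization.py change summarized above boils down to a loader that casts the checkpoint tensor to the model's runtime dtype before copying it into the parameter. The sketch below is an approximation of that behaviour based on the summary, not the file's exact code (the PR itself uses the project's get_tensor helper, as seen in the attention.py hunk further down).

```python
import paddle

class RMSNormLoaderSketch:
    """Stub class showing only the loader behaviour described in the table above."""

    def weight_loader(self, param, loaded_weight, loaded_shard_id=None):
        # Match the model's runtime dtype (e.g. bfloat16) before copying the
        # checkpoint tensor into the norm weight parameter.
        loaded_weight = paddle.to_tensor(loaded_weight).cast(paddle.get_default_dtype())
        assert param.shape == loaded_weight.shape, "checkpoint/parameter shape mismatch"
        param.copy_(loaded_weight, False)
```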
shard_size = (self.fd_config.parallel_config.tensor_parallel_rank + 1) * block_size

# when use_sequence_parallel_moe, we don't split.
if layer_in_white_list:
    pass
else:
    loaded_weight = slice_fn(loaded_weight, output_dim, shard_offset, shard_size)
Copilot AI · Dec 9, 2025
The variable name shard_size is misleading. It's actually used as the end index (absolute position), not a size. Consider renaming to shard_end for clarity, consistent with how slice_fn uses start and end parameters.
Suggested change:
- shard_size = (self.fd_config.parallel_config.tensor_parallel_rank + 1) * block_size
- # when use_sequence_parallel_moe, we don't split.
- if layer_in_white_list:
-     pass
- else:
-     loaded_weight = slice_fn(loaded_weight, output_dim, shard_offset, shard_size)
+ shard_end = (self.fd_config.parallel_config.tensor_parallel_rank + 1) * block_size
+ # when use_sequence_parallel_moe, we don't split.
+ if layer_in_white_list:
+     pass
+ else:
+     loaded_weight = slice_fn(loaded_weight, output_dim, shard_offset, shard_end)
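For readers unfamiliar with the helper, slice_fn is used here as an along-axis slice that takes absolute start and end indices, which is why the reviewer reads the fourth argument as an end position rather than a size. The toy definition below is an assumption inferred from the call sites in this hunk, not the project's implementation.

```python
def slice_fn(tensor, axis, start, end):
    # Toy stand-in: keep elements [start, end) along `axis`, leaving every
    # other dimension untouched. Works for paddle tensors and numpy arrays.
    index = [slice(None)] * tensor.ndim
    index[axis] = slice(start, end)
    return tensor[tuple(index)]
```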
if layer_in_white_list:
    pass
else:
    loaded_weight = slice_fn(loaded_weight, output_dim, shard_offset, shard_size)
Copilot AI · Dec 9, 2025
For consistency with the rest of the codebase, use named parameters when calling slice_fn. Change to:
loaded_weight = slice_fn(loaded_weight, output_dim, start=shard_offset, end=shard_size)

This matches the pattern used in lines 543, 696, and other weight_loader implementations.

Suggested change:
- loaded_weight = slice_fn(loaded_weight, output_dim, shard_offset, shard_size)
+ loaded_weight = slice_fn(loaded_weight, output_dim, start=shard_offset, end=shard_size)
if layer_in_white_list:
    pass
else:
    loaded_weight = slice_fn(loaded_weight, output_dim, shard_offset, shard_size)

tp_row_bias = getattr(param, "tp_row_bias", None)
if layer_in_white_list:
    pass
else:
    if tp_row_bias:
        loaded_weight = loaded_weight / self.fd_config.parallel_config.tensor_parallel_size
Copilot AI · Dec 9, 2025
[nitpick] The empty pass statements in the whitelist check reduce code readability. Consider refactoring to a more explicit pattern:
if not layer_in_white_list:
    loaded_weight = slice_fn(loaded_weight, output_dim, start=shard_offset, end=shard_size)

This eliminates the unnecessary pass statements and makes the control flow clearer.
Suggested change:
- if layer_in_white_list:
-     pass
- else:
-     loaded_weight = slice_fn(loaded_weight, output_dim, shard_offset, shard_size)
- tp_row_bias = getattr(param, "tp_row_bias", None)
- if layer_in_white_list:
-     pass
- else:
-     if tp_row_bias:
-         loaded_weight = loaded_weight / self.fd_config.parallel_config.tensor_parallel_size
+ if not layer_in_white_list:
+     loaded_weight = slice_fn(loaded_weight, output_dim, shard_offset, shard_size)
+ tp_row_bias = getattr(param, "tp_row_bias", None)
+ if not layer_in_white_list and tp_row_bias:
+     loaded_weight = loaded_weight / self.fd_config.parallel_config.tensor_parallel_size
if layer_in_white_list:
    pass
else:
    loaded_weight = slice_fn(loaded_weight, output_dim, shard_offset, shard_size)

tp_row_bias = getattr(param, "tp_row_bias", None)
if layer_in_white_list:
    pass
else:
    if tp_row_bias:
        loaded_weight = loaded_weight / self.fd_config.parallel_config.tensor_parallel_size
Copilot AI · Dec 9, 2025
[nitpick] Similar to above, the empty pass statements for the whitelist check reduce readability. Consider refactoring to:
if tp_row_bias and not layer_in_white_list:
    loaded_weight = loaded_weight / self.fd_config.parallel_config.tensor_parallel_size

Suggested change:
- if layer_in_white_list:
-     pass
- else:
-     loaded_weight = slice_fn(loaded_weight, output_dim, shard_offset, shard_size)
- tp_row_bias = getattr(param, "tp_row_bias", None)
- if layer_in_white_list:
-     pass
- else:
-     if tp_row_bias:
-         loaded_weight = loaded_weight / self.fd_config.parallel_config.tensor_parallel_size
+ if not layer_in_white_list:
+     loaded_weight = slice_fn(loaded_weight, output_dim, shard_offset, shard_size)
+ tp_row_bias = getattr(param, "tp_row_bias", None)
+ if tp_row_bias and not layer_in_white_list:
+     loaded_weight = loaded_weight / self.fd_config.parallel_config.tensor_parallel_size
)
if self.tp_size > 1:
    set_weight_attrs(self.linear.weight, {"output_dim": True})
    set_weight_attrs(self.linear.bias, {"output_dim": True})
Copilot AI · Dec 9, 2025
Potential bug: Setting attributes on self.linear.bias without checking if bias exists. When with_bias=False, self.linear.bias may not exist, which would cause an AttributeError.
Add a conditional check:
if self.tp_size > 1:
    set_weight_attrs(self.linear.weight, {"output_dim": True})
    if self.bias_key is not None:
        set_weight_attrs(self.linear.bias, {"output_dim": True})

Suggested change:
- set_weight_attrs(self.linear.bias, {"output_dim": True})
+ if self.bias_key is not None:
+     set_weight_attrs(self.linear.bias, {"output_dim": True})
def weight_loader(self, param, loaded_weight, loaded_shard_id: Optional[str] = None):
    loaded_weight = get_tensor(loaded_weight).cast(paddle.get_default_dtype())

    if self.use_qk_norm and ("q_norm" in param.name or "k_norm" in param.name):
Copilot AI · Dec 9, 2025
Type conversion inconsistency: The weight_loader casts to paddle.get_default_dtype() at line 232, but q_norm and k_norm weights are defined with dtype "float32" (lines 203, 211) and explicitly cast to float32 in load_state_dict (lines 224-225).
Consider casting to float32 for consistency:
if self.use_qk_norm and ("q_norm" in param.name or "k_norm" in param.name):
    loaded_weight = get_tensor(loaded_weight).astype("float32")
    param.copy_(loaded_weight, False)
    return

Suggested change:
- if self.use_qk_norm and ("q_norm" in param.name or "k_norm" in param.name):
+ if self.use_qk_norm and ("q_norm" in param.name or "k_norm" in param.name):
+     loaded_weight = loaded_weight.astype("float32")
set_weight_attrs(self.bias, {"tp_row_bias": True})

def weight_loader(self, param, loaded_weight, loaded_shard_id: Optional[str] = None):
    # In some senerio such as tsp, weight and bias of this layer will not be split in specific module.
Copilot AI · Dec 9, 2025
Typo in comment: "senerio" should be "scenario".
Suggested change:
- # In some senerio such as tsp, weight and bias of this layer will not be split in specific module.
+ # In some scenario such as tsp, weight and bias of this layer will not be split in specific module.
Motivation
Support the v1 loader when tensor parallelism (tp) and expert parallelism (ep) are used together.
Modifications
Add weight_loader methods and weight attributes (e.g. output_dim, tp_row_bias) to the affected layers so that weights are loaded and split correctly under tp+ep with v1_loader.
Usage or Command
--load-choices "default_v1"
Accuracy Tests
The changes do not affect model outputs.
Checklist
- Add at least one tag to the PR title from: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run pre-commit before commit.
- For a PR targeting the release branch, make sure it has been submitted to the develop branch first, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.