Skip to content

feat: preserve durable context across summarization#3887

Merged
WillemJiang merged 6 commits into
bytedance:mainfrom
ShenAC-SAC:feat/durable-summary-context
Jul 1, 2026
Merged

feat: preserve durable context across summarization#3887
WillemJiang merged 6 commits into
bytedance:mainfrom
ShenAC-SAC:feat/durable-summary-context

Conversation

@ShenAC-SAC

@ShenAC-SAC ShenAC-SAC commented Jun 30, 2026

Copy link
Copy Markdown
Collaborator

Why

DeerFlow has several long-lived runtime facts that need to survive message-window compaction, but those facts should not be stored as ordinary chat transcript entries.

Before this change, summarization wrote the generated summary back into messages as a hidden HumanMessage(name="summary"). Skill preservation had a similar shape: recently loaded skill read_file messages were kept around so the model could still see skill instructions after compaction. That worked, but it coupled durable runtime state to the mutable transcript:

  • UI/frontend merge logic, RunJournal processing, middleware scans, and model context all had to understand hidden control messages inside messages.
  • A generated summary could be treated like a real user message by later middleware unless every consumer remembered to special-case it.
  • Completed subagent work could disappear after summarization, causing the lead agent to delegate the same work again.
  • Preserving skill context by retaining raw read_file results could keep full SKILL.md bodies in checkpoints and future model windows.
  • Runtime facts had no explicit reducer, cap, or rendering policy separate from the transcript.

Concrete failure modes this addresses:

  • After summarization, completed task tool results could be compacted out of the live message window, so the lead agent could lose evidence that a subagent already finished the work and delegate the same task again.
  • Loaded skill instructions previously survived compaction only if the raw read_file call and paired ToolMessage stayed in recent messages, which made skill availability depend on the transcript retention policy.
  • Summary text stored as HumanMessage(name="summary") could be mistaken for user input by later middleware unless every middleware remembered to filter it.
  • Keeping hidden control messages in messages made frontend merge/archive logic and backend middleware scans depend on transcript-specific conventions instead of explicit runtime state.

This PR moves the first set of durable runtime facts out of messages and into explicit thread-state channels, then projects those facts back into model requests only at call time.

Design Summary

The important design shift is:

  1. Raw user, assistant, and tool messages remain the transcript.
  2. Middleware extracts durable facts into typed ThreadState channels.
  3. Summarization compacts old raw messages into summary_text instead of creating a hidden summary message.
  4. Each model call receives an ephemeral durable-context projection built from thread state.
  5. That projection is not written back into checkpointed messages.

So the model still sees the summary, delegations, and skill references, but the checkpointed transcript no longer has to carry hidden control messages for those facts.

What Changed

DurableContextMiddleware

Adds DurableContextMiddleware, registered before summarization in the lead-agent chain.

It captures durable facts from the current transcript before summarization can compact the paired tool-call/tool-result messages, then injects a bounded model-request-only durable context projection on later calls.

The projection is intentionally split by authority level:

  • Static durable-context handling rules are injected as a SystemMessage.
  • Runtime-provided values such as summary_text, delegation results, and skill descriptions are injected separately as a hidden HumanMessage data block marked with additional_kwargs["durable_context_data"] and hide_from_ui.

This prevents user/model/tool/subagent text captured in durable fields from being promoted to system-role instructions. Captured values are escaped or bounded before rendering.

Summary state channel

Summarization now writes generated prose to ThreadState.summary_text instead of inserting a hidden summary message into messages.

When a later compaction happens, the summarizer receives both:

  • the previous summary_text; and
  • the new raw messages that need to be folded in.

The new summary replaces summary_text, while compacted raw messages are removed through the normal message reducer and the recent message window is preserved.

The summary prompt budgeting was also hardened after review:

  • existing summary text participates in trigger token counting;
  • previous summary and new messages are trimmed as separate content sections before XML-like wrapper tags are added;
  • partial trimming uses LangChain trim_messages with a character splitter so single long messages can still be budgeted;
  • deterministic fallback truncation is now a hard cap, including very small budgets.

Legacy HumanMessage(name="summary") messages remain filtered by middleware paths for old threads, but new compactions no longer create them.

Delegation ledger

Adds a deterministic delegations channel for task tool dispatches and paired ToolMessage results.

Each ledger entry stores:

  • task call id;
  • short description;
  • subagent type;
  • current or terminal status;
  • bounded result brief;
  • result hash;
  • source result reference;
  • creation timestamp.

The rendered ledger is newest-first and tells the model when in-progress work is already delegated and when completed work should be reused instead of delegated again. Failed, cancelled, and timed-out entries are represented as retryable prior attempts rather than reusable results.

The ledger now uses the shared backend/frontend extract_subagent_status contract instead of duplicating prefix parsing. Dispatches start as in_progress; unknown or streaming task outputs stay in_progress rather than being persisted as fake failed entries, so durable context does not mask contract drift or streaming intermediate states.

The reducer caps retained ledger entries to the newest window, and capture is cap-aware so entries intentionally evicted by the cap are not repeatedly re-emitted and allowed to evict newer entries.

Skill context references

Adds a durable skill_context channel for loaded SKILL.md references.

The extractor only captures successful read calls that:

  • use a configured read tool name;
  • resolve under the configured skills root;
  • point specifically to a SKILL.md file;
  • are not errored tool reads.

The persisted entry is a compact reference: name, path, frontmatter description, and loaded_at. It deliberately does not persist the full skill body. The model sees an active-skill reminder with the path and description; if it needs exact instructions, it should re-read the skill file.

skill_file_read_tool_names: [] now disables this durable skill-reference capture instead of falling back to default read tool names.

Middleware compatibility

Legacy summary messages are excluded from user-targeted middleware scans:

  • InputSanitizationMiddleware
  • DynamicContextMiddleware
  • SkillActivationMiddleware

This preserves compatibility for existing checkpointed threads that may already contain HumanMessage(name="summary") while keeping new threads on the state-channel design.

Documentation and config

Updates backend and frontend docs to reflect the final design:

  • summaries are stored in summary_text, not injected back as regular messages;
  • durable runtime values are projected as hidden durable-context data, not promoted to system-role instructions;
  • skill context stores references, not full skill bodies;
  • frontend middleware docs include DurableContextMiddleware in the correct order;
  • config.example.yaml explains that skill_file_read_tool_names: [] disables durable skill-reference capture.

Review Hardening Since Initial Version

The review pass found and fixed several edge cases:

  • Untrusted durable field values are no longer rendered inside SystemMessage.
  • Existing summary_text is counted when deciding whether summarization should trigger.
  • Previous summaries cannot blow up the summarization prompt unchecked.
  • Summary prompt trimming cannot return only wrapper markup with no actual message content.
  • Very small fallback budgets now stay within the requested cap.
  • Unknown/non-terminal task results remain in_progress instead of becoming fake failed ledger entries.
  • The delegation ledger cap no longer causes old evicted entries to be re-captured and evict newer entries.
  • Live durable skill-context assertions were updated to look for the hidden durable-context data message rather than a SystemMessage.
  • Docs and PR-facing config comments were corrected to avoid overclaiming that full skill instructions are persisted or re-injected.

Scope Notes

This PR intentionally keeps the durable context surface narrow:

  • delegation capture is limited to the existing task tool;
  • skill context captures only SKILL.md references, not arbitrary files under the skills directory;
  • slash-skill activation still injects the current-turn skill body directly and is not expanded into a durable channel here;
  • this does not attempt to redesign every possible summary or memory type.

The goal is to establish the durable channel pattern and apply it to the first high-value facts that are already affected by message-window compaction.

Follow-up / Remaining Decoupling Work

This PR moves the persistence and model-projection surface for summaries, delegated work, and loaded skill references out of messages, but it does not eliminate every message-derived capture path in DeerFlow.

The remaining production coupling is on the capture side:

  • delegations is still initially derived from AIMessage.tool_calls for the task tool and later upgraded from paired terminal ToolMessage results.
  • Task status and result text still rely on the existing task-result string contract, with shared backend/frontend prefix handling for historical compatibility.
  • skill_context is still initially derived from successful skill-file read tool calls and the paired ToolMessage.content, from which only the SKILL.md frontmatter description is parsed into a compact reference.
  • Backend and frontend still keep compatibility shims for old HumanMessage(name="summary") control messages and legacy subtask-result text, so historical threads remain renderable.

A good follow-up PR would move these capture points closer to their authoritative producers: emit structured task-result metadata directly from the task tool/runtime boundary, and emit structured skill-read metadata from the file-read/tool boundary instead of re-parsing tool transcript content. After that migration window, the legacy prefix and hidden-summary compatibility shims can be retired separately.

This is intentionally left out of this PR because it changes tool/result contracts across backend, frontend, runtime state, and historical-thread compatibility. The important step here is that once the facts are captured, they no longer need raw transcript messages to remain in the context window.

Testing

Local verification after resolving Copilot feedback and merging latest upstream/main:

  • cd backend && uv run ruff check packages/harness/deerflow/agents/thread_state.py packages/harness/deerflow/agents/middlewares/delegation_ledger.py packages/harness/deerflow/agents/middlewares/durable_context_middleware.py packages/harness/deerflow/runtime/journal.py tests/test_delegation_ledger.py tests/test_durable_context_middleware.py tests/test_thread_state_reducers.py tests/test_run_journal.py tests/test_delegation_ledger_live.py
    • All checks passed!
  • cd backend && uv run pytest tests/test_delegation_ledger.py tests/test_thread_state_reducers.py tests/test_durable_context_middleware.py tests/test_run_journal.py::TestChatModelStartHumanMessage tests/test_summarization_summary_text.py tests/test_summarization_middleware.py tests/test_skill_context.py tests/test_input_sanitization_middleware.py tests/test_dynamic_context_middleware.py tests/test_slash_skills.py -q
    • 257 passed, 1 warning
  • git diff --check and git diff --cached --check
    • both pass

CI should be used as the final source of truth after the latest pushed commit.

@ShenAC-SAC ShenAC-SAC marked this pull request as ready for review June 30, 2026 14:09
@github-actions github-actions Bot added area:agents Agents, subagents, graph wiring, prompts, langgraph.json area:backend Gateway / runtime / core backend under backend/ area:docs Documentation and Markdown only area:frontend Next.js frontend under frontend/ needs-validation Touches front/back contract surface; needs real-path validation risk:high High risk: backend API, agents, sandbox, auth, deps, CI size/XL PR changes 700+ lines labels Jun 30, 2026
@WillemJiang WillemJiang requested a review from Copilot July 1, 2026 02:42

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR decouples long-lived runtime context (summary, delegated work results, and loaded skill references) from the mutable transcript by persisting them in explicit ThreadState channels and projecting them into model requests only at call time.

Changes:

  • Add DurableContextMiddleware to capture + inject summary_text, delegation_ledger, and skill_context as hidden durable-context data (with a separate authority SystemMessage contract).
  • Update summarization to write prose summaries into ThreadState.summary_text instead of inserting HumanMessage(name="summary") into messages, with improved budgeting/trimming behavior.
  • Add reducers + tests for the new thread-state channels; update docs/config and frontend docs to reflect the new middleware ordering and durable-context behavior.

Reviewed changes

Copilot reviewed 27 out of 27 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
frontend/src/core/threads/hooks.ts Adds backward-compat handling for legacy HumanMessage(name="summary") markers in frontend summarization tracking.
frontend/src/content/zh/harness/middlewares.mdx Updates middleware ordering/docs to include DurableContextMiddleware and the new summary behavior.
frontend/src/content/en/harness/middlewares.mdx Updates middleware ordering/docs to include DurableContextMiddleware and the new summary behavior.
config.example.yaml Bumps config version and documents durable skill-reference capture via skill_file_read_tool_names (legacy preserve settings removed).
backend/tests/test_thread_state_reducers.py Adds reducer coverage for delegation_ledger + skill_context and asserts summary channel existence.
backend/tests/test_summarization_summary_text.py New tests validating summarization writes summary_text, budgets correctly, and fails safely.
backend/tests/test_summarization_middleware.py Updates existing tests to assert no synthetic summary message and validates summary_text emission.
backend/tests/test_slash_skills.py Adjusts skill activation tests for hidden/durable-context behavior and adds legacy-summary guard.
backend/tests/test_skill_context.py New tests for extracting + rendering compact skill references (not full bodies).
backend/tests/test_run_journal.py Removes test that ensured name="summary" messages were skipped when extracting first human message.
backend/tests/test_input_sanitization_middleware.py Updates tests around genuine-user-message classification for legacy summary messages.
backend/tests/test_dynamic_context_middleware.py Adds guard to prevent legacy summary messages from being injection targets.
backend/tests/test_durable_context_middleware.py New integration tests for capture + injection across summarization and with skill refs.
backend/tests/test_delegation_ledger.py New tests for extracting + rendering delegation ledger entries and bounds/escaping.
backend/tests/test_delegation_ledger_live.py Adds an opt-in live E2E test validating ledger + summary persistence across real summarization.
backend/tests/fixtures/replay/write_read_file.ultra.events.json Updates replay fixture keys to include new thread-state channels.
backend/packages/harness/deerflow/runtime/journal.py Changes first-human extraction logic (now only filters hide_from_ui).
backend/packages/harness/deerflow/config/summarization_config.py Removes legacy skill-preserve fields and documents skill-read tool names for durable capture.
backend/packages/harness/deerflow/agents/thread_state.py Adds delegation_ledger, skill_context, summary_text channels and reducers.
backend/packages/harness/deerflow/agents/middlewares/summarization_middleware.py Writes summaries to summary_text, counts prior summary in triggers, and improves trimming/failure handling.
backend/packages/harness/deerflow/agents/middlewares/skill_context.py New: deterministic extraction + rendering of compact skill references under skills root.
backend/packages/harness/deerflow/agents/middlewares/input_sanitization_middleware.py Updates documentation text around how injected HumanMessages are identified.
backend/packages/harness/deerflow/agents/middlewares/durable_context_middleware.py New: capture + inject durable context (summary/ledger/skills) without persisting injected messages.
backend/packages/harness/deerflow/agents/middlewares/delegation_ledger.py New: deterministic extraction + rendering of completed task delegations (with bounding + escaping).
backend/packages/harness/deerflow/agents/lead_agent/agent.py Registers DurableContextMiddleware before summarization; removes old summarization skill-rescue wiring.
backend/docs/summarization.md Updates documentation to reflect summary_text + durable projection model.
backend/AGENTS.md Updates thread-state and middleware-order documentation to include durable-context channels/middleware.

Comment on lines 202 to 206
if caller == "lead_agent" and not self._first_human_msg and messages:
for batch in reversed(messages):
for m in reversed(batch):
if isinstance(m, HumanMessage) and m.name != "summary" and m.additional_kwargs.get("hide_from_ui") is not True:
if isinstance(m, HumanMessage) and m.additional_kwargs.get("hide_from_ui") is not True:
self.set_first_human_message(m.text)
Comment on lines +118 to +121
super().__init__()
self._skills_root = (skills_container_path or _DEFAULT_SKILLS_ROOT).rstrip("/")
self._skill_read_tool_names = frozenset(_DEFAULT_SKILL_READ_TOOL_NAMES if skill_file_read_tool_names is None else skill_file_read_tool_names)

@WillemJiang

Copy link
Copy Markdown
Collaborator

@ShenAC-SAC, please take a look at the review comment of Copilot and resolve the conflict with the main branch.

…y-context

# Conflicts:
#	backend/AGENTS.md
#	backend/packages/harness/deerflow/agents/thread_state.py
#	backend/tests/fixtures/replay/write_read_file.ultra.events.json
#	backend/tests/test_delegation_ledger.py
@ShenAC-SAC

Copy link
Copy Markdown
Collaborator Author

@WillemJiang Thanks for the reminder. I addressed the Copilot review comments and resolved the conflict with latest main.

What changed in this update:

  • Restored legacy-summary filtering in RunJournal, so old checkpointed HumanMessage(name="summary") messages are not recorded as llm.human.input or used as the first user message.
  • Replaced the skills_container_path.rstrip("/") normalization with POSIX path normalization, preserving / and slash-only roots instead of collapsing them to an empty string.
  • Merged latest upstream/main into this branch.

Conflict resolution note:

The conflict came from #3877 adding a system-maintained delegation ledger in parallel with this PR's durable-context work. I resolved it by keeping a single delegation state channel, ThreadState.delegations, instead of maintaining two ledgers.

The merged design keeps #3877's useful behavior for in_progress delegated tasks and terminal-status downgrade protection, while preserving this PR's richer durable context projection with terminal result metadata (result_brief, result_sha256, result_ref). I also removed the separate DelegationLedgerMiddleware registration path so delegation state is not injected twice; DurableContextMiddleware now owns the unified capture/projection path for summary, delegations, and skill references.

I also updated the PR body and docs to describe the current delegations design.

Verification:

  • cd backend && uv run ruff check ... -> all checks passed
  • cd backend && uv run pytest ... -q -> 257 passed, 1 warning
  • GitHub checks are now green for backend unit tests, e2e tests, frontend unit tests, lint backend/frontend, backend-blocking-io, and replay E2E layers.

Current PR state is MERGEABLE; it is still blocked only by required review.

@WillemJiang WillemJiang merged commit 442248d into bytedance:main Jul 1, 2026
14 checks passed
@WillemJiang WillemJiang added this to the 2.1.0 milestone Jul 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:agents Agents, subagents, graph wiring, prompts, langgraph.json area:backend Gateway / runtime / core backend under backend/ area:docs Documentation and Markdown only area:frontend Next.js frontend under frontend/ needs-validation Touches front/back contract surface; needs real-path validation risk:high High risk: backend API, agents, sandbox, auth, deps, CI size/XL PR changes 700+ lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants