
Commit f8b7adc

feat: add reflection layer for task verification
This adds a "judge" layer that verifies whether the agent completed a task correctly. After each turn, if reflection is enabled, it:

1. Collects the initial task, recent tool calls, and final result
2. Sends this context to a judge model for evaluation
3. If incomplete, provides feedback and forces the agent to continue
4. Limits to 3 reflection attempts to prevent infinite loops

Enable with `--enable reflection` or `[features].reflection = true`.

test: add reflection layer integration test for Azure OpenAI

Add integration tests that verify the reflection layer works correctly with Azure OpenAI. The tests create hello.py and test_hello.py, run pytest, and verify that the reflection layer evaluates task completion. Also fix missing wiremock imports in view_image.rs tests.

test: make reflection test model configurable via AZURE_OPENAI_MODEL

Defaults to gpt-5-mini if not set.

docs: update reflection layer documentation

- Document how the reflection layer works
- Add configuration instructions
- Update test running instructions with the AZURE_OPENAI_MODEL env var

test: add SWE-bench style eval suite for reflection layer

Add evaluation tests inspired by SWE-bench to measure the impact of the reflection layer on coding task performance. Tests include:

- Task 1: Off-by-one errors in array processing
- Task 2: String logic errors (palindrome, word count)
- Task 3: Missing edge case handling

Each task can be run with or without reflection to compare results. Includes an eval_summary test that runs all tasks and reports a comparison.
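The commit text mentions enabling the feature via `--enable reflection` or `[features].reflection = true`, and later commits add a configurable judge model and `max_attempts`. A hedged sketch of what the `config.toml` section might look like; the key names under `[reflection]` are inferred from `ReflectionConfig` fields referenced in the diffs, not confirmed by docs:

```toml
# Sketch only: [features].reflection comes from the commit message;
# the [reflection] keys are inferred from fields in the diff.
[features]
reflection = true

[reflection]
enabled = true
model = "gpt-5-mini"   # optional judge model override (defaults to the main model)
max_attempts = 3       # cap on reflection retries
```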
docs: add eval suite documentation to reflection.md

Document the SWE-bench style evaluation tests, including:

- Task descriptions and bug types
- Commands to run individual and comparative tests
- Sample output showing the reflection layer's improvement

docs: update docs

feat(reflection): implement judge_model parameter and improve error detection

- Add model_override support to ModelClient for judge model selection
- Add a max_attempts field to ReflectionContext (removes a hardcoded constant)
- Add a more sophisticated output_indicates_error() with 30+ error patterns
- Exclude false positives like "error handling", "no errors", etc.
- Update tests for the new ReflectionContext signature

1. Protocol (codex-rs/protocol/src/protocol.rs)
   - Added a ReflectionVerdictEvent struct with fields: completed, confidence, reasoning, feedback, attempt, max_attempts
   - Added a ReflectionVerdict variant to the EventMsg enum
2. Core (codex-rs/core/src/codex.rs)
   - Added an import for ReflectionVerdictEvent
   - Emit a ReflectionVerdict event right after getting the verdict from the judge model (line ~2253)
3. Rollout policy (codex-rs/core/src/rollout/policy.rs)
   - Added ReflectionVerdict to persisted events (so it shows up in rollout files)
4. TUI (codex-rs/tui/src/)
   - Added a new_reflection_verdict() function in history_cell.rs
   - Added an on_reflection_verdict() handler in chatwidget.rs
5. TUI2 (codex-rs/tui2/src/)
   - Same changes as TUI
6. Exec (codex-rs/exec/src/event_processor_with_human_output.rs)
   - Added reflection verdict display for CLI exec mode
7. MCP Server (codex-rs/mcp-server/src/codex_tool_runner.rs)
   - Added ReflectionVerdict to the match arm for event handling

What you'll see now: when reflection runs, output like the following appears on success:

    ✓ reflection: Task completed (confidence: 95%)
      The agent successfully created the hello package with tests...
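The commit describes `output_indicates_error()` only at a high level (30+ error patterns, with false-positive exclusions such as "error handling" and "no errors"); the actual patterns are not in this diff. A minimal self-contained sketch of that scrub-then-match approach, with illustrative patterns only:

```rust
/// Sketch of error detection in tool output, assuming the approach described
/// in the commit: strip benign phrases first, then match failure patterns.
/// The real implementation uses 30+ patterns; these few are illustrative.
fn output_indicates_error(output: &str) -> bool {
    let lower = output.to_lowercase();

    // Benign phrases that contain "error" but do not signal a failure.
    const FALSE_POSITIVES: [&str; 3] = ["error handling", "no errors", "0 errors"];
    // Substrings that typically indicate a real failure.
    const ERROR_PATTERNS: [&str; 4] = ["traceback", "panicked at", "command not found", "error:"];

    // Remove benign phrases so they cannot trigger a pattern match below.
    let mut scrubbed = lower;
    for benign in FALSE_POSITIVES {
        scrubbed = scrubbed.replace(benign, "");
    }
    ERROR_PATTERNS.into_iter().any(|p| scrubbed.contains(p))
}

fn main() {
    assert!(output_indicates_error("error: cannot find value `x`"));
    assert!(!output_indicates_error("improved error handling, no errors found"));
    println!("ok");
}
```

The scrub-before-match ordering is the key design point: a naive substring check on "error" alone would flag the second example.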
On incomplete (will retry):

    ⟳ reflection: Task incomplete - attempt 1/3 (confidence: 40%)
      Reasoning: Tests were not run after code changes
      Feedback: Please run the tests to verify the implementation works

To test, run codex with your task and you should now see the reflection verdict at the end.

feat(reflection): add JSON schema for structured verdict output

- Add verdict_json_schema() to ensure the judge model returns valid JSON
- Use output_schema in the reflection prompt for structured outputs
- Add a demo1 Python hello world app with tests
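The structured verdict the judge model returns presumably mirrors the `ReflectionVerdictEvent` fields added in this commit. A hypothetical example payload (field names are taken from the diff; the exact schema produced by `verdict_json_schema()` is not shown here):

```json
{
  "completed": false,
  "confidence": 0.4,
  "reasoning": "Tests were not run after code changes",
  "feedback": "Please run the tests to verify the implementation works"
}
```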
1 parent bef36f4 commit f8b7adc

File tree

26 files changed: +2602 −5 lines

codex-rs/core/src/client.rs

Lines changed: 18 additions & 2 deletions
```diff
@@ -63,6 +63,8 @@ pub struct ModelClient {
     effort: Option<ReasoningEffortConfig>,
     summary: ReasoningSummaryConfig,
     session_source: SessionSource,
+    /// Optional model override for specialized use cases (e.g., reflection judge).
+    model_override: Option<String>,
 }
 
 #[allow(clippy::too_many_arguments)]
@@ -88,9 +90,19 @@ impl ModelClient {
             effort,
             summary,
             session_source,
+            model_override: None,
         }
     }
 
+    /// Returns a clone of this client with a different model for the API request.
+    /// This is useful for specialized tasks like reflection judging that may use
+    /// a different (often cheaper/faster) model than the main agent.
+    pub fn with_model_override(&self, model: &str) -> Self {
+        let mut client = self.clone();
+        client.model_override = Some(model.to_string());
+        client
+    }
+
     pub fn get_model_context_window(&self) -> Option<i64> {
         let model_family = self.get_model_family();
         let effective_context_window_percent = model_family.effective_context_window_percent;
@@ -294,9 +306,13 @@ impl ModelClient {
         self.session_source.clone()
     }
 
-    /// Returns the currently configured model slug.
+    /// Returns the currently configured model slug, or the override if set.
     pub fn get_model(&self) -> String {
-        self.get_model_family().get_model_slug().to_string()
+        if let Some(ref override_model) = self.model_override {
+            override_model.clone()
+        } else {
+            self.get_model_family().get_model_slug().to_string()
+        }
     }
 
     /// Returns the currently configured model family.
```
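`with_model_override` clones the client rather than mutating it, so the main agent's `ModelClient` keeps its original model while the judge gets a cheaper one. A standalone sketch of that pattern (`MockClient` is a hypothetical stand-in; `ModelClient` itself is not reproduced here):

```rust
// Clone-with-override pattern, as used by `with_model_override` above.
#[derive(Clone)]
struct MockClient {
    model: String,
    model_override: Option<String>,
}

impl MockClient {
    /// Returns a clone with a different model; the original is untouched.
    fn with_model_override(&self, model: &str) -> Self {
        let mut client = self.clone();
        client.model_override = Some(model.to_string());
        client
    }

    /// Mirrors `get_model`: the override wins when set.
    fn get_model(&self) -> String {
        self.model_override
            .clone()
            .unwrap_or_else(|| self.model.clone())
    }
}

fn main() {
    let agent = MockClient { model: "gpt-5".to_string(), model_override: None };
    let judge = agent.with_model_override("gpt-5-mini");
    assert_eq!(agent.get_model(), "gpt-5");      // main agent unchanged
    assert_eq!(judge.get_model(), "gpt-5-mini"); // judge uses the override
    println!("ok");
}
```

Cloning keeps the override local to the one reflection call instead of leaking a model change into the rest of the turn.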

codex-rs/core/src/codex.rs

Lines changed: 123 additions & 1 deletion
```diff
@@ -20,6 +20,7 @@ use crate::openai_models::model_family::ModelFamily;
 use crate::openai_models::models_manager::ModelsManager;
 use crate::parse_command::parse_command;
 use crate::parse_turn_item;
+use crate::reflection::{ReflectionContext, evaluate_reflection};
 use crate::stream_events_utils::HandleOutputCtx;
 use crate::stream_events_utils::handle_non_tool_response_item;
 use crate::stream_events_utils::handle_output_item_done;
@@ -114,6 +115,7 @@ use crate::protocol::TokenCountEvent;
 use crate::protocol::TokenUsage;
 use crate::protocol::TokenUsageInfo;
 use crate::protocol::TurnDiffEvent;
+use crate::protocol::ReflectionVerdictEvent;
 use crate::protocol::WarningEvent;
 use crate::rollout::RolloutRecorder;
 use crate::rollout::RolloutRecorderParams;
@@ -371,6 +373,7 @@ pub(crate) struct TurnContext {
     pub(crate) tool_call_gate: Arc<ReadinessFlag>,
     pub(crate) exec_policy: Arc<RwLock<ExecPolicy>>,
     pub(crate) truncation_policy: TruncationPolicy,
+    pub(crate) reflection: crate::config::types::ReflectionConfig,
 }
 
 impl TurnContext {
@@ -536,6 +539,7 @@ impl Session {
                 per_turn_config.as_ref(),
                 model_family.truncation_policy,
             ),
+            reflection: per_turn_config.reflection.clone(),
         }
     }
 
@@ -2087,6 +2091,7 @@ async fn spawn_review_thread(
         tool_call_gate: Arc::new(ReadinessFlag::new()),
         exec_policy: parent_turn_context.exec_policy.clone(),
         truncation_policy: TruncationPolicy::new(&per_turn_config, model_family.truncation_policy),
+        reflection: per_turn_config.reflection.clone(),
     };
 
     // Seed the child task with the review prompt as the initial user message.
@@ -2175,6 +2180,9 @@ pub(crate) async fn run_task(
             .await;
     }
 
+    // Extract initial task for reflection BEFORE input is consumed
+    let initial_task = extract_initial_task_from_input(&input);
+
     let initial_input_for_turn: ResponseInputItem = ResponseInputItem::from(input);
     let response_item: ResponseItem = initial_input_for_turn.clone().into();
     sess.record_response_item_and_emit_turn_item(turn_context.as_ref(), response_item)
@@ -2192,6 +2200,11 @@ pub(crate) async fn run_task(
     // many turns, from the perspective of the user, it is a single turn.
     let turn_diff_tracker = Arc::new(tokio::sync::Mutex::new(TurnDiffTracker::new()));
 
+    // Track reflection attempts
+    let mut reflection_attempt: u32 = 0;
+    let reflection_config = &turn_context.reflection;
+    let reflection_enabled = reflection_config.enabled || sess.enabled(Feature::Reflection);
+
     loop {
         // Note that pending_input would be something like a message the user
         // submitted through the UI while the model was running. Though the UI
@@ -2255,7 +2268,95 @@ pub(crate) async fn run_task(
         }
 
         if !needs_follow_up {
-            last_agent_message = turn_last_agent_message;
+            last_agent_message = turn_last_agent_message.clone();
+
+            // Run reflection if enabled and we haven't exceeded max attempts
+            let max_attempts = reflection_config.max_attempts;
+            if reflection_enabled && reflection_attempt < max_attempts {
+                reflection_attempt += 1;
+                info!(
+                    "Running reflection evaluation (attempt {}/{})",
+                    reflection_attempt, max_attempts
+                );
+
+                // Collect conversation items for reflection
+                let history_items = sess.clone_history().await.get_history_for_prompt();
+                let context = ReflectionContext::from_conversation(
+                    initial_task.clone(),
+                    &history_items,
+                    reflection_attempt,
+                    max_attempts,
+                );
+
+                // Evaluate with the judge, optionally using a different model
+                match evaluate_reflection(
+                    &turn_context.client,
+                    context,
+                    reflection_config.model.as_deref(),
+                )
+                .await
+                {
+                    Ok(verdict) => {
+                        info!(
+                            "Reflection verdict: completed={}, confidence={:.2}",
+                            verdict.completed, verdict.confidence
+                        );
+
+                        // Emit reflection verdict event for visibility
+                        sess.send_event(
+                            &turn_context,
+                            EventMsg::ReflectionVerdict(ReflectionVerdictEvent {
+                                completed: verdict.completed,
+                                confidence: verdict.confidence,
+                                reasoning: verdict.reasoning.clone(),
+                                feedback: verdict.feedback.clone(),
+                                attempt: reflection_attempt,
+                                max_attempts,
+                            }),
+                        )
+                        .await;
+
+                        if !verdict.completed {
+                            if let Some(feedback) = verdict.feedback {
+                                info!("Task incomplete, injecting feedback: {}", feedback);
+
+                                // Inject feedback as a new user message
+                                let feedback_msg = format!(
+                                    "[Reflection Judge - Attempt {}/{}] Task verification failed.\n\nReasoning: {}\n\nFeedback: {}\n\nPlease address the above feedback and complete the task.",
+                                    reflection_attempt,
+                                    max_attempts,
+                                    verdict.reasoning,
+                                    feedback
+                                );
+
+                                let feedback_item = ResponseItem::Message {
+                                    id: None,
+                                    role: "user".to_string(),
+                                    content: vec![ContentItem::InputText {
+                                        text: feedback_msg,
+                                    }],
+                                };
+
+                                sess.record_conversation_items(
+                                    &turn_context,
+                                    &[feedback_item],
+                                )
+                                .await;
+
+                                // Continue the loop to process the feedback
+                                continue;
+                            }
+                        } else {
+                            info!("Reflection: Task completed successfully");
+                        }
+                    }
+                    Err(e) => {
+                        warn!("Reflection evaluation failed: {}", e);
+                        // Continue without blocking on reflection errors
+                    }
+                }
+            }
+
             sess.notifier()
                 .notify(&UserNotification::AgentTurnComplete {
                     thread_id: sess.conversation_id.to_string(),
@@ -2292,6 +2393,27 @@ pub(crate) async fn run_task(
     last_agent_message
 }
 
+/// Extract the initial task/prompt from user input.
+fn extract_initial_task_from_input(input: &[UserInput]) -> String {
+    for item in input {
+        match item {
+            UserInput::Text { text } => {
+                return text.clone();
+            }
+            UserInput::Image { .. } | UserInput::LocalImage { .. } => {
+                // Skip images, look for text
+            }
+            UserInput::Skill { name, .. } => {
+                // Return skill name as task description
+                return format!("Run skill: {}", name);
+            }
+            // Handle future variants of the non-exhaustive enum
+            _ => {}
+        }
+    }
+    "(No initial task found)".to_string()
+}
+
 #[instrument(
     skip_all,
     fields(
```

codex-rs/core/src/config/mod.rs

Lines changed: 13 additions & 0 deletions
```diff
@@ -7,6 +7,8 @@ use crate::config::types::Notifications;
 use crate::config::types::OtelConfig;
 use crate::config::types::OtelConfigToml;
 use crate::config::types::OtelExporterKind;
+use crate::config::types::ReflectionConfig;
+use crate::config::types::ReflectionConfigToml;
 use crate::config::types::SandboxWorkspaceWrite;
 use crate::config::types::ShellEnvironmentPolicy;
 use crate::config::types::ShellEnvironmentPolicyToml;
@@ -274,6 +276,9 @@ pub struct Config {
     /// Centralized feature flags; source of truth for feature gating.
     pub features: Features,
 
+    /// Configuration for the reflection/judge feature.
+    pub reflection: ReflectionConfig,
+
     /// The active profile name used to derive this `Config` (if any).
     pub active_profile: Option<String>,
 
@@ -666,6 +671,9 @@ pub struct ConfigToml {
     /// Settings for ghost snapshots (used for undo).
     #[serde(default)]
     pub ghost_snapshot: Option<GhostSnapshotToml>,
+    /// Configuration for the reflection/judge feature.
+    #[serde(default)]
+    pub reflection: Option<ReflectionConfigToml>,
 
     /// When `true`, checks for Codex updates on startup and surfaces update prompts.
     /// Set to `false` only if your Codex updates are centrally managed.
@@ -1221,6 +1229,7 @@ impl Config {
             use_experimental_use_rmcp_client,
             ghost_snapshot,
             features,
+            reflection: cfg.reflection.map(Into::into).unwrap_or_default(),
             active_profile: active_profile_name,
             active_project,
             windows_wsl_setup_acknowledged: cfg.windows_wsl_setup_acknowledged.unwrap_or(false),
@@ -2983,6 +2992,7 @@ model_verbosity = "high"
             use_experimental_use_rmcp_client: false,
             ghost_snapshot: GhostSnapshotConfig::default(),
             features: Features::with_defaults(),
+            reflection: ReflectionConfig::default(),
             active_profile: Some("o3".to_string()),
             active_project: ProjectConfig { trust_level: None },
             windows_wsl_setup_acknowledged: false,
@@ -3058,6 +3068,7 @@ model_verbosity = "high"
             use_experimental_use_rmcp_client: false,
             ghost_snapshot: GhostSnapshotConfig::default(),
             features: Features::with_defaults(),
+            reflection: ReflectionConfig::default(),
             active_profile: Some("gpt3".to_string()),
             active_project: ProjectConfig { trust_level: None },
             windows_wsl_setup_acknowledged: false,
@@ -3148,6 +3159,7 @@ model_verbosity = "high"
             use_experimental_use_rmcp_client: false,
             ghost_snapshot: GhostSnapshotConfig::default(),
             features: Features::with_defaults(),
+            reflection: ReflectionConfig::default(),
             active_profile: Some("zdr".to_string()),
             active_project: ProjectConfig { trust_level: None },
             windows_wsl_setup_acknowledged: false,
@@ -3224,6 +3236,7 @@ model_verbosity = "high"
             use_experimental_use_rmcp_client: false,
             ghost_snapshot: GhostSnapshotConfig::default(),
             features: Features::with_defaults(),
+            reflection: ReflectionConfig::default(),
             active_profile: Some("gpt5".to_string()),
             active_project: ProjectConfig { trust_level: None },
             windows_wsl_setup_acknowledged: false,
```
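The `reflection: cfg.reflection.map(Into::into).unwrap_or_default()` line relies on a `From<ReflectionConfigToml> for ReflectionConfig` conversion that lives in config/types.rs and is not shown in this diff. A self-contained sketch of that Option-TOML-to-config merge pattern (field names follow the diff; the defaults shown are assumptions):

```rust
#[derive(Clone, Debug, PartialEq)]
struct ReflectionConfig {
    enabled: bool,
    model: Option<String>,
    max_attempts: u32,
}

impl Default for ReflectionConfig {
    fn default() -> Self {
        // Assumed defaults: disabled, no judge override, 3 attempts.
        Self { enabled: false, model: None, max_attempts: 3 }
    }
}

/// Deserialized TOML layer; every field is optional so a partial
/// `[reflection]` table still parses.
#[derive(Default)]
struct ReflectionConfigToml {
    enabled: Option<bool>,
    model: Option<String>,
    max_attempts: Option<u32>,
}

impl From<ReflectionConfigToml> for ReflectionConfig {
    fn from(t: ReflectionConfigToml) -> Self {
        let d = ReflectionConfig::default();
        ReflectionConfig {
            enabled: t.enabled.unwrap_or(d.enabled),
            model: t.model,
            max_attempts: t.max_attempts.unwrap_or(d.max_attempts),
        }
    }
}

fn main() {
    // Missing [reflection] table: fall back to defaults.
    let none: Option<ReflectionConfigToml> = None;
    let cfg: ReflectionConfig = none.map(Into::into).unwrap_or_default();
    assert_eq!(cfg, ReflectionConfig::default());

    // Partial table: unspecified fields keep their defaults.
    let partial = ReflectionConfigToml { enabled: Some(true), ..Default::default() };
    let cfg: ReflectionConfig = Some(partial).map(Into::into).unwrap_or_default();
    assert!(cfg.enabled);
    assert_eq!(cfg.max_attempts, 3);
    println!("ok");
}
```

Keeping every TOML field `Option` and merging against `Default` is what lets `#[serde(default)]` accept an absent or partial `[reflection]` table without errors.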
