A ready-to-run example is available in the Ready-to-run Example section below.

Overview

A plain conversation.run() stops as soon as the agent thinks it is done. The /goal command is stricter: after each run it asks a second judge LLM to audit the transcript for authoritative evidence (file contents, command output, test results) that the objective is provably complete. If something is still missing, the loop re-prompts the agent with the judge’s feedback and runs again, until the judge confirms completion or a hard iteration cap is reached.
That makes it a good fit for verifiable objectives like “make the tests pass”, “produce a working CLI”, or “publish a passing migration”: the agent cannot finish just by claiming success; the judge has to see the green output first.
Use cases:
  • Test-driven objectives — finish only when pytest (or any command) actually passes
  • Multi-step deliverables — keep the agent going until every requirement is verified
  • Long-running tasks — combine with a critic and stop hooks for full control over termination
Like the Critic, /goal is an extension applied to a conversation: it composes with whatever agent, tools, or critic you already have. The critic governs each inner run(); the /goal loop governs the overall objective.

How It Works

1. send objective                  →  agent runs, calls FinishAction
2. judge LLM audits the transcript →  produces { score, complete, missing }
3. if complete                     →  stop, return GoalOutcome(status="complete")
   else if max_iterations reached  →  stop, return GoalOutcome(status="capped")
   else                            →  send a follow-up with `missing`, run again
Because run_goal drives the conversation you pass in (it does not fork or spin up a sidecar), every turn — objective, agent work, judge-driven follow-ups — lands in the same conversation.state.events history.
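The three steps above can be sketched as a plain loop. This is an illustrative stand-in, not the SDK's implementation: goal_loop, Verdict, Outcome, run_agent, and judge are hypothetical names, and the follow-up wording is an assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    # Stand-in mirroring the documented verdict fields; not the SDK class.
    score: float
    complete: bool
    missing: str


@dataclass
class Outcome:
    status: str       # "complete" or "capped"
    iterations: int   # audit rounds performed
    verdict: Verdict  # the judge's last verdict


def goal_loop(
    objective: str,
    run_agent: Callable[[str], None],
    judge: Callable[[], Verdict],
    max_iterations: int = 10,
) -> Outcome:
    """Re-prompt until the judge confirms completion or the cap is hit."""
    message = objective
    iterations = 0
    while True:
        run_agent(message)   # step 1: send the message and let the agent run
        verdict = judge()    # step 2: audit the transcript
        iterations += 1
        if verdict.complete:              # step 3: judge is satisfied
            return Outcome("complete", iterations, verdict)
        if iterations >= max_iterations:  # step 3: cap reached
            return Outcome("capped", iterations, verdict)
        # step 3: otherwise feed `missing` back as the next message
        message = f"Still missing: {verdict.missing}"


# Toy usage: the judge is satisfied on the second audit round.
verdicts = iter([
    Verdict(0.4, False, "pytest output not shown"),
    Verdict(0.95, True, ""),
])
sent: list[str] = []
outcome = goal_loop("make the tests pass", sent.append, lambda: next(verdicts))
print(outcome.status, outcome.iterations)
```

In this toy run the agent receives two messages: the objective, then a follow-up carrying the judge's missing text.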

Quick Start

import os
import tempfile

from openhands.sdk import LLM, Agent, Conversation, Tool
from openhands.sdk.conversation.goal import run_goal
from openhands.tools.file_editor import FileEditorTool
from openhands.tools.terminal import TerminalTool

# Read the provider API key from the environment.
api_key = os.getenv("LLM_API_KEY")

# Two LLMs: one does the work, one independently judges completion.
agent_llm = LLM(usage_id="agent", model="gpt-5.5", api_key=api_key)
judge_llm = LLM(usage_id="goal-judge", model="gpt-5.5", api_key=api_key)

agent = Agent(
    llm=agent_llm,
    tools=[Tool(name=TerminalTool.name), Tool(name=FileEditorTool.name)],
)

# Work in a throwaway directory so the demo files stay isolated.
workspace = tempfile.mkdtemp(prefix="goal_demo_")
conversation = Conversation(agent=agent, workspace=workspace)

objective = (
    "Create mathx.py with an add(a, b) function and test_mathx.py with a "
    "pytest test for it. The goal is complete only when "
    "`python -m pytest -q` passes."
)

outcome = run_goal(conversation, objective, judge_llm, max_iterations=3)

print(f"Goal {outcome.status} after {outcome.iterations} audit round(s).")
print(f"Judge score: {outcome.verdict.score:.2f}")
Use a separate LLM instance (distinct usage_id) for the judge, even if you reuse the same model. Keeping the judge isolated from the agent’s LLM lets you account for its cost separately and avoids accidentally sharing streaming or callback state.

Understanding the Result

run_goal returns a GoalOutcome that reports whether the loop ended cleanly or was capped, plus the judge’s final verdict.
| Field | Type | Description |
| --- | --- | --- |
| status | "complete" or "capped" | Whether the judge confirmed completion, or the loop hit max_iterations. |
| iterations | int | Number of audit rounds performed (≥ 1). |
| verdict | GoalVerdict | The judge’s last verdict. |
The GoalVerdict is what the judge LLM produces every round:
| Field | Type | Description |
| --- | --- | --- |
| score | float (0.0–1.0) | Probability that the full objective is provably done. |
| complete | bool | Whether the judge considers the objective complete. |
| missing | str | Concise description of what remains, or empty if complete. |
The missing field is what the loop feeds back to the agent in the next follow-up turn, so the agent knows exactly which requirements still need verifiable evidence.
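A caller will usually branch on status and surface missing when the loop was capped. The sketch below uses stand-in dataclasses that mirror the documented fields so it runs without the SDK; Verdict, Outcome, and summarize are hypothetical names, not part of the library.

```python
from dataclasses import dataclass


@dataclass
class Verdict:  # stand-in for GoalVerdict
    score: float
    complete: bool
    missing: str


@dataclass
class Outcome:  # stand-in for GoalOutcome
    status: str      # "complete" or "capped"
    iterations: int
    verdict: Verdict


def summarize(outcome: Outcome) -> str:
    """Turn an outcome into a one-line report for logs or CI output."""
    rounds = f"{outcome.iterations} round(s)"
    if outcome.status == "complete":
        return f"complete after {rounds}, judge score {outcome.verdict.score:.2f}"
    # Capped: the last verdict says what evidence the judge never saw.
    return f"capped after {rounds}, still missing: {outcome.verdict.missing}"


done = Outcome("complete", 2, Verdict(0.95, True, ""))
stuck = Outcome("capped", 3, Verdict(0.30, False, "pytest output not shown"))
print(summarize(done))
print(summarize(stuck))
```

With the real SDK objects the same attribute access (outcome.status, outcome.verdict.missing) should apply, since those are the fields listed in the tables above.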

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| conversation | BaseConversation | | The conversation to drive. Any agent/tools/critic config is supported. |
| objective | str | | The goal to pursue and audit against. Must be non-empty. |
| judge_llm | LLM | | The second LLM that grades completion. Should be independent from the agent’s LLM. |
| max_iterations | int | 10 | Hard cap on audit rounds before the loop returns status="capped". |

Composing With a Critic

/goal and a Critic operate at different layers:
  • A critic governs each inner run() — it can refine the agent’s work mid-run via iterative refinement.
  • The /goal loop governs the overall objective — it decides whether to re-prompt the agent at all.
They compose without changes: attach a critic to the agent as usual, then drive the conversation with run_goal. Every inner run() still consults the critic; the outer loop still re-runs until the judge is satisfied.
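# Reuses agent_llm, judge_llm, workspace, and objective from the Quick Start.
from openhands.sdk import Agent, Conversation
from openhands.sdk.critic import APIBasedCritic
from openhands.sdk.conversation.goal import run_goal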

agent = Agent(
    llm=agent_llm,
    tools=[...],
    critic=APIBasedCritic(...),  # governs each run()
)
conversation = Conversation(agent=agent, workspace=workspace)

outcome = run_goal(conversation, objective, judge_llm, max_iterations=5)

Lower-Level Building Blocks

run_goal is a thin synchronous driver over a transport-agnostic controller. If you need to integrate the loop into a custom driver (async, agent-server, UI progress reporting), reach for the building blocks directly.

GoalController

GoalController owns the continue-vs-stop decision logic and the iteration cap. It does no conversation transport I/O — the driver owns sending messages and running the agent — but it does own the judge call: on_run_finished() synchronously invokes the judge LLM, so treat that call as blocking.
from openhands.sdk.conversation.goal import GoalController, GoalDone

controller = GoalController(objective, judge_llm, max_iterations=10)
conversation.send_message(controller.start())

while True:
    conversation.run()
    step = controller.on_run_finished(conversation.state.events)
    if isinstance(step, GoalDone):
        outcome = step.outcome
        break
    # step is GoalContinue — feed the follow-up back to the agent
    conversation.send_message(step.followup)
That split lets a synchronous driver and an asynchronous agent-server task share the exact same decision logic — only the I/O loop differs.
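To illustrate the asynchronous side of that split, here is a minimal sketch of an async driver that offloads the blocking judge call to a worker thread with asyncio.to_thread. It runs without the SDK: FakeController, Done, Continue, and drive are hypothetical stand-ins that only imitate the start() / on_run_finished() shape shown above, and the agent run is replaced by a placeholder await.

```python
import asyncio
from dataclasses import dataclass
from typing import Union


@dataclass
class Done:      # stand-in for GoalDone
    outcome: str


@dataclass
class Continue:  # stand-in for GoalContinue
    followup: str


class FakeController:
    """Stand-in that finishes after a fixed number of audit rounds."""

    def __init__(self, objective: str, rounds_needed: int) -> None:
        self.objective = objective
        self.rounds_needed = rounds_needed
        self.rounds = 0

    def start(self) -> str:
        return self.objective

    def on_run_finished(self, events: list[str]) -> Union[Done, Continue]:
        # A real controller would make its blocking judge call here.
        self.rounds += 1
        if self.rounds >= self.rounds_needed:
            return Done(outcome="complete")
        return Continue(followup="still missing evidence")


async def drive(controller: FakeController, events: list[str]) -> str:
    events.append(controller.start())  # send the objective
    while True:
        await asyncio.sleep(0)  # placeholder for an async agent run
        # Keep the event loop free while the blocking judge call runs.
        step = await asyncio.to_thread(controller.on_run_finished, events)
        if isinstance(step, Done):
            return step.outcome
        events.append(step.followup)  # feed the follow-up back


events: list[str] = []
result = asyncio.run(drive(FakeController("make the tests pass", 2), events))
print(result, len(events))
```

The decision logic lives entirely in the controller; only the transport around it (here, awaiting instead of calling directly) changes between the synchronous and asynchronous drivers.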

judge_goal

judge_goal is the reusable kernel: a synchronous, LLM-backed evaluator with signature judge_goal(judge_llm, objective, events) → GoalVerdict and no dependency on the loop. It calls the judge LLM each time, so it is not a pure function. Use it directly to build a /status command, a stop hook, or a server endpoint:
from openhands.sdk.conversation.goal import judge_goal

verdict = judge_goal(judge_llm, objective, conversation.state.events)
if verdict.complete:
    print("Done!")
else:
    print(f"Still missing: {verdict.missing}")
The judge renders the conversation as a plain "role: text" transcript and asks the LLM for a strict-JSON verdict. The agent’s system prompt is intentionally excluded from the transcript to keep judge token cost low, since it carries no goal-specific evidence.
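As a rough picture of what such a transcript might look like, the sketch below renders (role, text) pairs one per line and skips system messages. render_transcript and the tuple-based message shape are assumptions made for illustration; the SDK works on conversation events and its exact rendering may differ.

```python
def render_transcript(messages: list[tuple[str, str]]) -> str:
    """Render messages as 'role: text' lines, leaving out the system prompt."""
    lines = [f"{role}: {text}" for role, text in messages if role != "system"]
    return "\n".join(lines)


messages = [
    ("system", "You are a coding agent."),
    ("user", "make the tests pass"),
    ("assistant", "ran pytest: 1 passed"),
]
transcript = render_transcript(messages)
print(transcript)
```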

Notes

  • Goal vs. Critic. A critic scores each run() and triggers refinement turns inside one run. The /goal loop drives the overall objective from the outside. The two compose: the critic improves each turn; the goal loop ensures the right number of turns happen.
  • No fork. run_goal drives the conversation you pass in — it does not create a sidecar conversation. All goal-related events land in the same conversation.state.events history.
  • Conservative parsing. If the judge response cannot be parsed as JSON, the verdict falls back to score=0.0, complete=False so the loop keeps working rather than falsely finishing.
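The conservative-parsing note can be sketched as follows. This is one possible way to implement the documented fallback, not the SDK's code: parse_verdict and Verdict are hypothetical names, and the fallback missing text and the strict boolean check are assumptions.

```python
import json
from dataclasses import dataclass


@dataclass
class Verdict:  # stand-in mirroring the documented verdict fields
    score: float
    complete: bool
    missing: str


def parse_verdict(raw: str) -> Verdict:
    """Parse the judge's JSON reply, falling back to 'not complete' on any error."""
    try:
        data = json.loads(raw)
        score = float(data["score"])
        complete = data["complete"]
        if not isinstance(complete, bool):
            raise ValueError("complete must be a boolean")
        return Verdict(score, complete, str(data.get("missing", "")))
    except (ValueError, KeyError, TypeError):
        # Unparseable or malformed reply: never report a false finish.
        return Verdict(0.0, False, "judge response could not be parsed")


good = parse_verdict('{"score": 0.9, "complete": true, "missing": ""}')
bad = parse_verdict("The goal looks done to me!")
print(good.complete, bad.complete)
```

Failing closed means a malformed judge reply costs at most one extra round (bounded by max_iterations) instead of ending the loop early.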

Ready-to-run Example

This example is available on GitHub: examples/01_standalone_sdk/54_goal_completion_loop.py
"""The /goal command: pursue an objective until a judge LLM confirms it is done.

A plain ``conversation.run()`` stops as soon as the agent *thinks* it is
finished. The ``/goal`` loop is stricter: after each run it asks a second
"judge" LLM to audit the transcript for authoritative evidence -- file
contents, command output, test results -- that the objective is *provably*
complete. If something is still missing, it re-prompts the agent with the
judge's feedback and runs again, until the goal is genuinely done or a hard
iteration cap is reached.

That makes it a good fit for verifiable objectives like "make the tests pass":
the agent cannot finish just by claiming success; the judge has to see green
output first.

Key concepts demonstrated:
1. ``run_goal(conversation, objective, judge_llm, max_iterations=...)`` drives
   the conversation from the outside, re-prompting until the judge is satisfied.
2. A second, independent "judge" LLM grades completion -- separate from the
   agent that does the work.
3. The returned ``GoalOutcome`` reports whether the goal ``"complete"``-d or was
   ``"capped"``, how many audit rounds it took, and the judge's final verdict.

Because ``run_goal`` drives the conversation you pass in (it does not fork or
spin up a sidecar), every turn -- objective, agent work, judge-driven followups
-- lands in the same ``conversation.state.events`` history. It therefore
composes with whatever agent, tools, or critic you already have.
"""

import os
import tempfile

from openhands.sdk import LLM, Agent, Conversation, Tool
from openhands.sdk.conversation.goal import run_goal
from openhands.tools.file_editor import FileEditorTool
from openhands.tools.terminal import TerminalTool


# The agent LLM does the work; the judge LLM independently grades completion.
# Two separate instances (same model, distinct usage_id) keep their costs apart.
model = os.getenv("LLM_MODEL", "gpt-5.5")
api_key = os.getenv("LLM_API_KEY")
base_url = os.getenv("LLM_BASE_URL")
agent_llm = LLM(usage_id="agent", model=model, api_key=api_key, base_url=base_url)
judge_llm = LLM(usage_id="goal-judge", model=model, api_key=api_key, base_url=base_url)

agent = Agent(
    llm=agent_llm,
    tools=[Tool(name=TerminalTool.name), Tool(name=FileEditorTool.name)],
)

workspace = tempfile.mkdtemp(prefix="goal_demo_")
conversation = Conversation(agent=agent, workspace=workspace)

# A verifiable objective: the judge can only call it done once it has seen
# pytest actually pass -- not merely the agent asserting that it did.
objective = (
    "Create mathx.py with an add(a, b) function and test_mathx.py with a pytest "
    "test for it. The goal is complete only when `python -m pytest -q` passes."
)

# Drive the conversation toward the objective, re-judging after each run.
outcome = run_goal(conversation, objective, judge_llm, max_iterations=3)

print("\n" + "=" * 70)
print(f"Goal {outcome.status} after {outcome.iterations} audit round(s).")
print(f"Judge score: {outcome.verdict.score:.2f}")
if outcome.verdict.missing:
    print(f"Still missing: {outcome.verdict.missing}")
print(f"Workspace: {workspace}")
print("=" * 70)

# Report cost (agent work + judge audits).
cost = agent_llm.metrics.accumulated_cost + judge_llm.metrics.accumulated_cost
print(f"EXAMPLE_COST: {cost}")
You can run the example code as-is.
The model name should follow the LiteLLM convention: provider/model_name (e.g., anthropic/claude-sonnet-4-5-20250929, openai/gpt-4o). The LLM_API_KEY should be the API key for your chosen provider.
ChatGPT Plus/Pro subscribers: You can use LLM.subscription_login() to authenticate with your ChatGPT account and access Codex models without consuming API credits. See the LLM Subscriptions guide for details.

Next Steps

  • Critic — Score and refine individual agent runs in real time
  • Iterative Refinement — Multi-agent feedback loop for quality-bound tasks
  • Hooks — Customize start/stop semantics on every run
  • Persistence — Save and restore conversation state across goal runs