Building a self-evaluating AI agent in 50 lines
One of the patterns I find most useful in production AI systems is the self-evaluating loop: generate an answer, score it, retry with specific feedback if it falls short. Not because LLMs always get it wrong — but because “good enough” is a constraint worth encoding explicitly rather than leaving to chance.
Here is the full thing, in about 50 lines of Python.
The pattern
Three nodes, one loop:
draft → judge → accept (if score ≥ threshold)
→ draft (if score < threshold and attempts < max)
→ give_up (if out of retries)
The routing predicate controls the loop. The state carries the score and attempt count. No hidden magic.
The code
from __future__ import annotations
import os
from pydantic import BaseModel
from openai import OpenAI
from jamjet import Workflow
client = OpenAI(
api_key=os.getenv("OPENAI_API_KEY", "ollama"),
base_url=os.getenv("OPENAI_BASE_URL", "http://localhost:11434/v1"),
)
MODEL = os.getenv("MODEL_NAME", "llama3.2")
QUESTION = os.getenv("QUESTION", "Explain event sourcing in one paragraph.")
MIN_SCORE = 4
MAX_RETRIES = 3
def llm(system: str, user: str) -> str:
resp = client.chat.completions.create(
model=MODEL, temperature=0, max_tokens=300,
messages=[{"role": "system", "content": system},
{"role": "user", "content": user}],
)
return (resp.choices[0].message.content or "").strip()
wf = Workflow("self-eval")
@wf.state
class State(BaseModel):
question: str
answer: str = ""
feedback: str = ""
judge_score: int = 0
attempts: int = 0
@wf.step(next="judge")
async def draft(state: State) -> State:
prompt = state.question
if state.feedback:
prompt += f"\n\nPrevious attempt was rated {state.judge_score}/5. Feedback: {state.feedback}\nPlease improve."
answer = llm("You are a concise technical writer.", prompt)
return state.model_copy(update={"answer": answer, "attempts": state.attempts + 1})
@wf.step(next={
"accept": lambda s: s.judge_score >= MIN_SCORE,
"draft": lambda s: s.judge_score < MIN_SCORE and s.attempts < MAX_RETRIES,
"give_up": lambda s: s.judge_score < MIN_SCORE and s.attempts >= MAX_RETRIES,
})
async def judge(state: State) -> State:
raw = llm(
"You are a strict technical editor. Rate the answer 1-5 (5=excellent). "
"Reply in exactly this format: SCORE: <n>\nFEEDBACK: <one sentence>",
f"Question: {state.question}\nAnswer: {state.answer}",
)
score, feedback = 3, "Could be clearer."
for line in raw.splitlines():
if line.startswith("SCORE:"):
try: score = int(line.split(":")[1].strip())
except ValueError: pass
if line.startswith("FEEDBACK:"):
feedback = line.split(":", 1)[1].strip()
return state.model_copy(update={"judge_score": score, "feedback": feedback})
@wf.step
async def accept(state: State) -> State:
return state
@wf.step
async def give_up(state: State) -> State:
return state
result = wf.run_sync(State(question=QUESTION))
s = result.state
print(f"\nFinal answer ({s.attempts} attempt{'s' if s.attempts != 1 else ''}, score {s.judge_score}/5):")
print(s.answer)
if s.attempts > 1:
print(f"\nFeedback that triggered retry: {s.feedback}")
What is happening
draft generates an answer. On retries, it receives the previous score and feedback as context, so it knows what to improve — not just “try again.”
judge asks a second LLM call to score the answer 1–5 and give one sentence of feedback. The scoring and routing are completely separate from the generation. You can swap the judge for a different model, a different prompt, or a deterministic scorer without touching draft.
Routing is the routing predicate on judge. Three branches, plain Python lambdas on the state. You can test them without running any LLM:
assert route(State(judge_score=5, attempts=1)) == "accept"
assert route(State(judge_score=2, attempts=1)) == "draft"
assert route(State(judge_score=2, attempts=3)) == "give_up"
result.events gives you the full execution trace after it runs:
✓ draft 1200ms
✓ judge 810ms
✓ accept 0ms
──────────────────
Judge score: 5/5 — accepted on first attempt
Why this matters
The self-evaluating loop is a pattern that comes up constantly in production:
- Code review agents that retry until the diff passes a quality check
- Summarisation agents that retry if the summary is too long
- SQL agents that retry if the generated query fails validation
- Report generators that retry if a fact-checker flags an error
In most frameworks you build this loop in ad-hoc ways — a while loop, custom retry logic, external state tracking. In JamJet it is just a routing predicate. The loop is explicit, inspectable, and testable.
Try it
This is example 04 in the jamjet-benchmarks examples. Runs locally with Ollama, no API key:
git clone https://github.com/jamjet-labs/jamjet-benchmarks
cd jamjet-benchmarks/examples/04_self_evaluating_workflow
pip install -r requirements.txt
OPENAI_API_KEY=ollama OPENAI_BASE_URL=http://localhost:11434/v1 MODEL_NAME=llama3.2 python main.py