Token logs and flame graphs tell you what the LLM did. They don’t tell you whether the agent did its job. A field guide to Job Cards, Agent Bureaucracy, and the documentation-first approach to AI agent observability.
An LLM call is a transaction. An agent run is a piece of work. A Job Card tells you what the agent attempted, what it decided, what it touched, and why — the difference between forensics and coaching.
A field note for engineers, CTOs, and people who have to explain to the board why an agent did what it did.
LLM observability tracks model calls. Agent observability tracks decisions. An LLM call is a transaction; an agent run is a piece of work — and a perfectly observable transaction can still produce wrong work.
Most LLM observability dashboards I see in the wild answer the wrong question. They tell you, with great precision, that a model returned 1,247 tokens in 3.2 seconds at a cost of $0.018, that the temperature was 0.3, and that the system prompt hash matches yesterday’s. All true. All necessary. None of it tells you whether the agent did its job.
This is the category error. An LLM call is a transaction. An agent run is a piece of work. You can have perfectly observable transactions inside a completely opaque piece of work — every span green, every latency under SLO, every token accounted for, and the customer still got the wrong refund.
TensorOps puts this even more sharply in their January 2026 piece, “Why LLM Observability Won’t Save Your Agents: The Rise of Agent Bureaucracy.” Their thesis: passive logging captures everything and explains nothing. The agent returns a confident 200 OK while quietly looping, hallucinating, or wandering down a rabbit hole, and your dashboard gives you no way to tell the difference at a glance.
If you take one thing from this article, take this: agent observability is the discipline of producing a defensible record of what the agent attempted, what it decided, what it touched, and why — such that a human who was not in the room can reconstruct the work after the fact. It is closer to a project manager’s status report or a surgeon’s operative note than to an APM dashboard.
The Telecom Trap is the delusion that capturing every token and intermediate thought will reveal why an agent failed. It produces infinite logs and zero insight.
TensorOps has a name for the failure mode most teams are stuck in: the Telecom Trap. The delusion that capturing every internal “thought” — every token, every tool call, every intermediate reasoning step — will somehow reveal why an agent failed. It’s the equivalent, they argue, of a CEO wiretapping every employee phone call to diagnose why revenue is down. Expensive, invasive, useless.
Let me make this concrete. Here’s a redacted excerpt from a customer-support agent trace I reviewed last quarter at a fintech client. The agent’s job: process a refund request. Total trace length: roughly 4,800 lines of OpenTelemetry spans across 47 minutes.
[14:02:11] llm.call prompt_tokens=1402 completion_tokens=89 "I'll check the order status..."
[14:02:14] tool.call get_order(order_id="44812") -> {status:"shipped",...}
[14:02:17] llm.call prompt_tokens=1518 completion_tokens=124 "Let me verify eligibility..."
[14:02:19] tool.call check_refund_policy(sku="HEN-44") -> {eligible:true,...}
[14:02:22] llm.call prompt_tokens=1689 completion_tokens=201 "I should double-check..."
[14:02:25] tool.call get_order(order_id="44812") -> {status:"shipped",...} <- same call
[14:02:28] llm.call prompt_tokens=1843 completion_tokens=178 "Let me verify once more..."
[14:02:31] tool.call check_refund_policy(sku="HEN-44") -> {eligible:true,...} <- same call
... 38 more iterations of the same two tool calls ...
[14:48:53] llm.call ... "I'll need to escalate this."

Every span is healthy. Latency is fine. Every individual LLM call returns a 200. Cost: $14.62 on what should have been a $0.04 task. The agent ran in a tight loop for 47 minutes, calling the same two tools with the same two arguments, hallucinating that it needed “one more verification” each time before being killed by a session timeout.
A traditional LLM observability dashboard shows this as 156 successful calls. A flame graph shows a wide, evenly distributed band of green. There is no failure to detect, because nothing failed — except the work.
This is the Telecom Trap in one screenshot. Infinite logs, zero insight.
Agent Bureaucracy is the discipline of forcing AI agents to file structured status reports — Job Cards — before every meaningful action. Treat agents as junior employees, not as scripts.
TensorOps proposes a better mental model: treat agents as junior employees, not as scripts. You don’t manage a junior employee by strapping a GoPro to their head and reviewing the footage. You give them structure — a Jira ticket, a status template, a Slack channel for blockers — and you require them to document their state before acting.
This is what they call Agent Bureaucracy: a deliberate protocol of structured reporting and state management. Agents are forced, via system prompts and orchestration, to produce auditable artifacts before each meaningful action. TensorOps calls these artifacts Job Cards.
A Job Card looks roughly like this:
{
"current_phase": "BLOCKED",
"goal": "Issue partial refund for order #44812",
"sub_goal": "Refresh expired OAuth token for billing API",
"attempts": 3,
"last_action": "called auth.refresh_token() with refresh_token=***",
"last_result": "401 Unauthorized -- endpoint returned no body",
"confidence": 0.18,
"reason_blocked": "Documentation does not specify a fallback refresh URL for legacy customers.",
"loop_detected": true,
"needs_human": true,
"suggested_escalation": "billing-platform-oncall"
}

Compare that to the 4,800 lines of token logs above. The Job Card tells you, in one screen, exactly what happened, why it stalled, and who needs to fix it. The dashboard reads the artifact, not the firehose. Debugging shifts from forensic archaeology to coaching.
The same fintech agent, instrumented with a Job Card protocol, produces a different artifact at minute three:
{
"current_phase": "LOOP_DETECTED",
"goal": "Issue partial refund for order #44812",
"loop_signature": "get_order + check_refund_policy x 3",
"confidence": 0.34,
"self_diagnosis": "I have all required information but am unable to commit to the refund decision.",
"needs_human": true
}

Forty-four minutes and fourteen dollars saved. More importantly: a clear, reviewable record of why the agent escalated, not just that it did.
Seven artifacts: the resolved goal, the committed plan, tool calls with arguments, data lineage, decisions not taken, world-changing side effects, and structured confidence — not just tokens, latency, and cost.
Strip the marketing language away from the serious agent observability vendors — Langfuse, LangSmith, Arize, Datadog, AgentOps, Sentry, MLflow — and they’re converging on the same set of artifacts. Not the same UI, not the same pricing, but the same artifacts. The Job Card is one specific implementation of a more general principle.
Here’s the working list, framed as questions a reviewer should be able to answer from the trace alone:
What was the agent asked to do? Not the literal user message — the resolved goal. If the user said “deal with the Henderson ticket,” the trace should record that the agent interpreted this as “issue a partial refund on order #44812 and email the customer.” Goal interpretation is where most failures originate, and most platforms still don’t surface it explicitly.
What plan did it commit to? For agents with explicit planning steps, the plan is an artifact. For ReAct-style loops without a formal plan, the implicit plan is the sequence of tool selections — which means the rationale for each tool selection needs to be captured per step, not just the tool name.
Which tools did it call, with what arguments, against what data? Most platforms do this part well. The trap is volume. A summary like “research agent completed 8 model calls costing $0.04” is not useful unless you can also see why it made 8 instead of 3 — and the answer is usually one tool returning slow or wrong data, triggering retries that look like reasoning but are actually flailing.
Where did the information come from? The lineage problem. Most LLM observability tools fail here. If the agent’s final answer cites a policy, the trace must show which document, which retrieval, which chunk. A trace’s existence is itself evidence that the answer came from your data and not from the foundation model’s priors. No trace, no provenance.
What did the agent decide not to do? Negative space matters. An agent that considered escalating to a human and decided against it has made a load-bearing decision. The Job Card pattern handles this naturally because phase transitions and confidence scores capture why a path was taken instead of an alternative.
What changed in the world? Agents now write to databases, send emails, file tickets, move money. Side effects need to be logged with the same rigor as model calls — with idempotency keys, rollback paths, and clear ownership. AgentSight, an eBPF-based observability framework out of recent systems research, makes this argument from first principles: monitor the agent at the kernel and network boundary, because that’s where the irreversible stuff happens.
How confident should we be in the result? Not the model’s self-reported confidence in a freeform sentence — that’s noise. Structured: an LLM-as-judge score, a heuristic check, a schema validation, a comparison against ground truth where available. The Job Card’s confidence field, when paired with an external evaluator, becomes a first-class signal you can alert on.
Notice what’s not on this list: token counts, latency percentiles, model versions, temperature settings. Those still matter — they’re necessary for cost control and for debugging the boring kind of failure — but they belong to LLM call observability, which is a sub-component, not the whole thing.
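To make the list concrete, here is one way those artifacts could sit on an extended Job Card. This is a sketch, not any vendor's schema; every field name below is illustrative.

# Illustrative only: an extended Job Card that answers the seven questions above.
extended_job_card = {
    "resolved_goal": "Issue partial refund on order #44812 and email the customer",
    "committed_plan": ["get_order", "check_refund_policy", "issue_refund", "send_email"],
    "tool_calls": [
        {"tool": "get_order", "args": {"order_id": "44812"}, "result_digest": "status=shipped"}
    ],
    "lineage": [
        {"claim": "refund eligible", "source_doc": "refund-policy v3.1", "chunk_id": "sec-4.2"}
    ],
    "paths_not_taken": [
        {"option": "escalate_to_human", "reason_rejected": "eligibility unambiguous"}
    ],
    "side_effects": [
        {"action": "issue_refund", "idempotency_key": "refund-44812-1", "rollback": "void_refund"}
    ],
    "confidence": {"self_reported": 0.94, "llm_judge_score": 0.88, "schema_valid": True}
}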
The choice matters less than the discipline. AgentOps fits the Job Card pattern out of the box; LangSmith leads on non-engineer review; Langfuse wins on self-hosted; Arize and Datadog correlate with infrastructure; Sentry unifies with errors and performance.
The TensorOps article points at AgentOps as a practical layer for this kind of structured observability, and the framing fits: AgentOps is purpose-built for the agent lifecycle rather than the model call. Two-line SDK integration (pip install agentops; agentops.init()), automatic instrumentation across 400+ LLMs and most major frameworks (CrewAI, AutoGen, LangChain, OpenAI Agents SDK), and — critically for the Job Card pattern — session replays that let you scrub through any production run point-in-time.
It’s worth being honest about the landscape, though. AgentOps is one option. LangSmith’s annotation queues are particularly strong if your bottleneck is non-engineer review. Langfuse is the best self-hosted open-source option for teams with data residency constraints. Arize and Datadog shine when you need to correlate agent behavior with broader infrastructure metrics. Sentry is the cleanest pick if you’re already in their ecosystem and want agent traces unified with errors, performance, and session replays.
The choice matters less than the discipline. Whichever platform you pick, the question is whether your agents are emitting Job Cards or token streams. A platform full of token streams is just a more expensive version of tail -f.
Force structured state output before every tool call, record it as a span attribute, and trip-wire on loop detection and low confidence for high-stakes actions. Roughly thirty lines of Python.
Here’s how the Job Card pattern looks grafted onto a typical LangGraph agent. The key move is forcing structured state output before each action, then making that state a first-class span attribute that any observability backend can index.
from typing import TypedDict, Literal
from langgraph.graph import StateGraph
import agentops  # or langsmith, langfuse, arize -- same idea

class JobCard(TypedDict):
    current_phase: Literal["PLANNING", "EXECUTING", "BLOCKED",
                           "LOOP_DETECTED", "ESCALATING", "DONE"]
    goal: str
    sub_goal: str | None
    attempts: int
    confidence: float
    last_action: str | None
    reason_blocked: str | None
    needs_human: bool

class AgentState(TypedDict):
    # Minimal graph state for this sketch; a real graph carries more.
    history: list[JobCard]

class EscalateToHuman(Exception):
    # Raised to stop the run and hand the Job Card to a human reviewer.
    def __init__(self, card: JobCard):
        super().__init__(card["goal"])
        self.card = card

JOB_CARD_SYSTEM_PROMPT = """
Before every tool call or final answer, you MUST output a Job Card
as valid JSON matching this schema. The Job Card is your status
report -- it will be reviewed by humans. Be concise and honest.
If you have attempted the same tool with the same arguments more
than twice, set current_phase to LOOP_DETECTED and needs_human to true.
If your confidence is below 0.5 for a high-stakes action
(refunds, emails, writes), set needs_human to true.
"""

def emit_job_card(state: AgentState, card: JobCard) -> AgentState:
    # This is the line that turns it from a prompt trick into observability.
    # (The exact record() signature varies across SDK versions; adapt the call
    # to however your backend ingests structured events.)
    agentops.record({"event": "job_card", **card})
    state["history"].append(card)
    if card["needs_human"] or card["current_phase"] == "LOOP_DETECTED":
        raise EscalateToHuman(card)
    return state

Three things are happening here. First, the system prompt forces the model to tokenize its internal state into a structured object — TensorOps’ “Bureaucracy Protocol” in its simplest form. Second, the Job Card is recorded as a structured event in the observability backend, which means it’s queryable, alertable, and aggregatable across runs. Third, two trip-wires (loop detection, low confidence on high-stakes actions) escalate immediately rather than burning tokens.
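The model's self-reported LOOP_DETECTED is useful but not sufficient: an agent that is flailing is also bad at noticing that it is flailing. A deterministic backstop on the orchestrator side is cheap. Here is a minimal sketch, assuming the orchestrator sees every tool call as a (name, arguments) pair; the threshold of three matches the system prompt above.

import hashlib
import json
from collections import Counter

class LoopDetector:
    """Counts identical (tool, args) calls within a run and flags repeats."""

    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self.signatures = Counter()

    def record(self, tool_name: str, args: dict) -> bool:
        """Returns True once the same exact call has been made max_repeats times."""
        payload = json.dumps({"tool": tool_name, "args": args}, sort_keys=True)
        signature = hashlib.sha256(payload.encode()).hexdigest()
        self.signatures[signature] += 1
        return self.signatures[signature] >= self.max_repeats

# Check every call before dispatch, independent of what the model's own card says.
detector = LoopDetector()
for _ in range(3):
    looping = detector.record("get_order", {"order_id": "44812"})
print(looping)  # True -- the third identical call trips the wire

Whether the model or the detector trips first, the outcome is the same: a Job Card in LOOP_DETECTED and an escalation instead of another forty minutes of retries.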
A dashboard on top of this looks nothing like a flame graph. It looks like a Kanban board: agents in PLANNING, agents EXECUTING, agents BLOCKED, agents that triggered ESCALATING in the last hour. That’s a CTO-readable view of an agent fleet. A flame graph is not.
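Under the hood, that fleet view is just an aggregation over the most recent Job Card per run, whatever backend stores them. A sketch:

from collections import Counter

# Assume latest_cards holds the most recent Job Card for each active run,
# pulled back out of whichever backend recorded the job_card events.
latest_cards = [
    {"run_id": "a1", "current_phase": "EXECUTING"},
    {"run_id": "b2", "current_phase": "BLOCKED"},
    {"run_id": "c3", "current_phase": "LOOP_DETECTED"},
]

fleet_view = Counter(card["current_phase"] for card in latest_cards)
print(fleet_view)  # Counter({'EXECUTING': 1, 'BLOCKED': 1, 'LOOP_DETECTED': 1})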
Confidence is lost at the handoff. A confident-looking output from a low-confidence classifier cascades through downstream agents that have no signal to second-guess it.
A more advanced example, drawn from a recent engagement. A document-processing pipeline used three agents: a Classifier (decides document type), a Researcher (looks up relevant policies), and a Drafter (produces the customer response). All three were instrumented with standard LLM observability — full prompt and completion logging, token counts, latency.
Customer complaints started arriving about responses citing policies for the wrong document type. The metrics showed nothing wrong. Each agent’s outputs looked individually plausible. It took two weeks of forensic log review to find the bug: the Classifier was emitting confident classifications with very low internal confidence on a specific edge case (insurance addenda), but it had no way to express that uncertainty in its handoff. The Researcher then dutifully looked up the wrong policies, and the Drafter wrote a confident response.
After retrofitting Job Cards, the same failure mode produced this artifact within one run:
{
"agent": "Classifier",
"current_phase": "DONE",
"goal": "Classify document type",
"result": "INSURANCE_ADDENDUM",
"confidence": 0.41,
"alternatives_considered": ["POLICY_AMENDMENT", "ADDENDUM"],
"needs_human": true,
"handoff_warning": "Low confidence -- recommend Researcher verify before proceeding."
}

The downstream agent’s orchestrator could now read confidence < 0.5 and handoff_warning != null and route to a human reviewer instead of barrelling forward. The bug didn’t go away. The silence around the bug went away. That’s the win.
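The routing rule itself is a few lines in the orchestrator. A sketch, assuming the upstream agent's Job Card arrives as a dict before the next agent is invoked; the threshold and field names mirror the card above.

def route_handoff(card: dict) -> str:
    """Decide whether the next agent runs or a human reviews the handoff first."""
    low_confidence = card.get("confidence", 1.0) < 0.5
    flagged = card.get("handoff_warning") is not None or card.get("needs_human", False)
    return "human_review_queue" if low_confidence or flagged else "next_agent"

# The Classifier card above routes to review instead of straight to the Researcher.
print(route_handoff({"confidence": 0.41, "handoff_warning": "Low confidence", "needs_human": True}))
# -> human_review_queue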
Not the engineer who shipped the agent. The support analyst, the compliance officer, the product manager. They read Job Cards, not OpenTelemetry traces.
Documenting the work means somebody has to read the documentation. In production agent systems serving real users, this is rarely the engineer who shipped the agent — it’s the support analyst handling the complaint, the compliance officer responding to a regulator, the product manager investigating a conversion drop. Those people don’t read OpenTelemetry traces. They read Job Cards.
This is the actual differentiator between teams that scale agents and teams that don’t. Industry research suggests elite teams achieve roughly 2.2x better reliability than non-elite teams, and the gap is not about better models — it’s about whether subject matter experts can review production behavior and feed corrections back. LangSmith’s annotation queues are the cleanest implementation I’ve seen of this loop, but the principle is platform-agnostic: structured artifacts beget human review, and human review begets fewer production failures.
If your observability stack ends at a flame graph, you’ve built call observability. If it produces a record a non-engineer can audit and correct, you’ve built agent observability.
Yes — for high-risk AI systems. The EU AI Act requires traceability of functioning and decision-making throughout the lifecycle. Token counts are not defensible in front of an auditor; Job Card chains are.
Most of this article has been framed around debugging and quality. There’s a second framing that CTOs in regulated industries are increasingly forced to confront: agent observability is becoming the substrate for AI compliance.
The EU AI Act, in force and ramping toward full applicability, requires that high-risk AI systems be traceable — specifically, that their functioning and decision-making can be tracked and logged throughout the lifecycle. NIST’s AI RMF points the same direction. Sectoral regulators in finance and healthcare are pulling on the same thread.
Token counts and latency percentiles are not defensible in front of an auditor. A Job Card chain that reads:
“Agent decided to issue refund because policy section 4.2 applies, retrieved from document v3.1 dated 2026-01-15, customer met all three eligibility criteria as evaluated by tool check_refund_eligibility returning true, confidence 0.94, no human escalation needed.”

— is defensible. Build for the second from the start. Retrofitting it is expensive, and retrofitting it under regulatory pressure is more expensive still.
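For illustration, here is the same chain entry as a structured record rather than a sentence. Every field name is invented for the example; it is not drawn from any regulation or platform schema.

audit_entry = {
    "decision": "issue_refund",
    "policy_basis": {"document": "refund-policy", "version": "3.1", "section": "4.2", "dated": "2026-01-15"},
    "eligibility": {"tool": "check_refund_eligibility", "result": True, "criteria_met": 3},
    "confidence": 0.94,
    "human_escalation": False
}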
Four steps: read the Agent Bureaucracy article, define a Job Card schema before picking a platform, separate the LLM gateway from the agent runtime in traces, and build a review surface for non-engineers before you scale.
For engineering leaders staring at this and wondering where to begin, the path is less heroic than it sounds. Four concrete steps, in order:
First, read the TensorOps article on Agent Bureaucracy. It’s the cleanest articulation of the mental shift this requires, and it will save you arguments with engineers who think more logs equals more observability.
Second, adopt a Job Card schema for your agents before picking a platform. The schema is portable. The platform is not. Start with the six fields above (current_phase, goal, confidence, attempts, needs_human, loop_detected) and extend from there.
Third, separate the LLM gateway from the agent runtime in your traces. The gateway layer (model call, tokens, latency, cost) is one trace context. The agent layer (Job Cards, tool decisions, lineage, side effects) is another. They link, but they shouldn’t be conflated. Most teams complaining about “noisy traces” have collapsed these two layers and lost the signal in both. A sketch of what the split looks like follows this list.
Fourth, build a review surface for non-engineers before you scale the agent. Whether it’s LangSmith’s annotation queues, AgentOps’ session replay, or a glorified internal Streamlit app — the people who will detect quality problems are not the people who built the agent. Give them somewhere to look. Show them Job Cards, not flame graphs.
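Here is a minimal sketch of the third step using the OpenTelemetry Python API: the gateway call and the agent-level Job Card live in separate spans that are linked rather than nested. Span names and attribute keys are illustrative, not an established semantic convention.

from opentelemetry import trace

tracer = trace.get_tracer("agent-observability-sketch")

# Layer 1: the LLM gateway span carries call-level metrics only.
with tracer.start_as_current_span("llm.gateway.call") as gateway_span:
    gateway_span.set_attribute("llm.prompt_tokens", 1402)
    gateway_span.set_attribute("llm.completion_tokens", 89)
    gateway_span.set_attribute("llm.cost_usd", 0.018)
    gateway_link = trace.Link(gateway_span.get_span_context())

# Layer 2: the agent span carries the Job Card, linked to (not nested inside) the gateway span.
with tracer.start_as_current_span("agent.job_card", links=[gateway_link]) as agent_span:
    agent_span.set_attribute("job_card.current_phase", "EXECUTING")
    agent_span.set_attribute("job_card.goal", "Issue partial refund for order #44812")
    agent_span.set_attribute("job_card.confidence", 0.82)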
LLM call observability is a metrics problem. Agent observability is a documentation problem. Teams that solve the first and ignore the second are the ones currently surprised by their own agents in production.
LLM call observability tells you what the model did. Agent observability tells you what the work was, who decided what, where the data came from, and whether the outcome was right. The first is a metrics problem. The second is a documentation problem.
The teams treating it as a metrics problem are the ones currently surprised by their own agents in production. The teams treating it as a documentation problem — the teams forcing their agents to file status reports like junior employees — are the ones whose CTOs sleep at night.
Stop instrumenting the model. Start documenting the work.
The agents of 2026 won’t be judged by how cleverly they reason in a demo. They’ll be judged by how transparently they document their reasoning in production — and how quickly you can understand and improve it.
True observability isn’t more logs. It’s better documentation of the work. The future of reliable AI agents isn’t hidden in token streams. It’s written in the status reports they’re required to file.
TensorOps’ blog has a series on the operational realities of running LLMs in production, including pieces on Agent Bureaucracy, GPT-4.1 deprecation, and training custom LLMs in 2026. For the academic foundation of the agent-artifacts-as-first-class-citizens argument, Dong, Lu, and Zhu’s AgentOps: Enabling Observability of LLM Agents (arXiv:2411.05285) is the canonical reference. For the systems-level view, AgentSight (arXiv:2508.02736) is worth a read.