Agent Observability

Every agent-native app gets observability out of the box. Traces, automated evals, user feedback, and A/B experiments work with zero configuration — all data lives in the app's own SQL database.

What's Captured Automatically

When a user sends a message, the framework automatically records:

  • Token usage — input, output, cache read, cache write
  • Cost — computed from token counts and model pricing
  • Latency — total duration and time per tool call
  • Tool calls — which actions were invoked, success/error status, duration
  • Automated evals — 5 quality scores computed after every run

No code changes needed. The instrumentation hooks into production-agent.ts transparently.
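
For orientation, the record captured for a single run looks roughly like this (field names here are illustrative, not the framework's exact stored schema):

// Illustrative shape of a captured run record; not the exact schema.
interface RunTrace {
  runId: string;
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens: number;
  cacheWriteTokens: number;
  costUsd: number;    // derived from token counts and model pricing
  durationMs: number; // total run latency
  toolCalls: { name: string; status: "success" | "error"; durationMs: number }[];
  evalScores: Record<string, number>; // the 5 automated quality scores
}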

The Dashboard

Add the dashboard to any template with a single route:

// app/routes/observability.tsx
import { ObservabilityDashboard } from "@agent-native/core/client";

export default function ObservabilityPage() {
  return (
    <div className="min-h-screen bg-background p-6">
      <ObservabilityDashboard />
    </div>
  );
}

The dashboard has 5 tabs:

| Tab | What it shows |
| --- | --- |
| Overview | Key metrics — runs, cost, latency, tool success rate, satisfaction, eval score |
| Conversations | Trace list with drill-down to individual spans (agent_run, llm_call, tool_call) |
| Evals | Automated eval scores by criteria, trends over time |
| Experiments | A/B test list with status badges, variant results with confidence intervals |
| Feedback | Thumbs up/down stream, category breakdown, frustration scores |

User Feedback

Explicit Feedback

Thumbs up/down buttons render inline on every agent message in the chat UI. Thumbs down opens a category popover (Inaccurate, Not helpful, Wrong tool, Too slow). This is wired into AssistantChat.tsx automatically.
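
Feedback from the built-in buttons is stored automatically, but you can also submit it programmatically through the feedback endpoint. A minimal sketch, assuming the payload carries a run id, a rating, and an optional category (these field names are assumptions, not a documented schema):

// Sketch: submit thumbs-down feedback directly to the API.
// runId / rating / category are assumed field names.
await fetch("/_agent-native/observability/feedback", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    runId: "run_abc123",    // hypothetical run id
    rating: "down",
    category: "Inaccurate", // one of the popover categories
  }),
});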

Implicit Feedback (Frustration Index)

The framework computes a Frustration Index (0-100) from conversation signals:

| Signal | Weight | What it detects |
| --- | --- | --- |
| Rephrasing | 30% | User repeats similar messages |
| Retry patterns | 20% | "Try again", "no that's wrong" |
| Abandonment | 20% | Session ends shortly after response |
| Sentiment | 15% | Negative language patterns |
| Length trend | 15% | Declining message lengths |

Score interpretation: 0-20 = healthy, 20-40 = friction, 40-60 = dissatisfied, 60+ = broken session.
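
Conceptually, the index is a weighted sum of the five signals, scaled to 0-100. A minimal sketch of the combination, assuming each detector has already normalized its signal to 0-1 (the real detectors are more involved):

// Each signal is assumed pre-normalized to 0-1 by its detector.
interface FrustrationSignals {
  rephrasing: number;  // similar messages repeated
  retries: number;     // "try again" patterns
  abandonment: number; // session ended right after a response
  sentiment: number;   // higher = more negative language
  lengthTrend: number; // higher = steeper decline in message length
}

// Weighted sum using the documented weights, scaled to 0-100.
function frustrationIndex(s: FrustrationSignals): number {
  const weighted =
    0.3 * s.rephrasing +
    0.2 * s.retries +
    0.2 * s.abandonment +
    0.15 * s.sentiment +
    0.15 * s.lengthTrend;
  return Math.round(weighted * 100);
}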

Automated Evals

Five deterministic scorers run after every agent run:

| Criterion | What it measures | Score range |
| --- | --- | --- |
| tool_success_rate | % of tool calls without errors | 0-1 |
| step_efficiency | Penalizes excessive LLM iterations for tool-using runs | 0-1 |
| latency_score | Normalized against a 10s-per-tool baseline (see sketch below) | 0-1 |
| cost_efficiency | Normalized against a cost baseline | 0-1 |
| error_recovery | Did the agent recover from tool errors? | 0 or 1 |
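
The latency score can be read as a clamped ratio against the 10-seconds-per-tool baseline. A minimal sketch of that normalization, assuming a simple inverse-ratio curve (the framework's exact formula may differ):

// Sketch: normalize run latency against a 10s-per-tool-call baseline.
// The inverse-ratio curve is an assumption, not the framework's exact math.
function latencyScore(durationMs: number, toolCallCount: number): number {
  const baselineMs = 10_000 * Math.max(1, toolCallCount);
  // At or under baseline → 1.0; degrades smoothly toward 0 past it.
  return Math.min(1, baselineMs / Math.max(durationMs, 1));
}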

LLM-as-Judge (Optional)

Enable sampled LLM-based evaluation by setting evalSampleRate:

import { putSetting } from "@agent-native/core/settings";

await putSetting("observability-config", {
  enabled: true,
  evalSampleRate: 0.05, // 5% of runs
});

Custom criteria use natural language rubrics:

const criteria = {
  name: "helpfulness",
  description: "Was the response helpful and complete?",
  rubric: "0.0 = unhelpful, 0.5 = partially helpful, 1.0 = fully resolved",
};

A/B Experiments

Test different models, temperatures, or agent configurations:

// Create via API
POST /_agent-native/observability/experiments
{
  "name": "sonnet-vs-haiku",
  "variants": [
    { "id": "control", "weight": 50, "config": { "model": "claude-sonnet-4-6" } },
    { "id": "treatment", "weight": 50, "config": { "model": "claude-haiku-4-5-20251001" } }
  ],
  "metrics": ["cost", "latency", "satisfaction"]
}

// Start the experiment
PUT /_agent-native/observability/experiments/:id
{ "status": "running" }

The agent loop automatically resolves the user's variant and applies the config override. Assignment uses consistent hashing — same user always gets the same variant.
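
As a rough sketch of how that kind of stable assignment works: hash the user id together with the experiment, map the digest into a 0-99 bucket, and walk the cumulative variant weights. This mirrors the described behavior; it is not the framework's actual code.

import { createHash } from "node:crypto";

// Sketch: deterministic variant assignment. The same (userId, experiment)
// pair always hashes to the same bucket, so assignment is stable.
function assignVariant(
  userId: string,
  experiment: string,
  variants: { id: string; weight: number }[], // weights sum to 100
): string {
  const digest = createHash("sha256").update(`${experiment}:${userId}`).digest();
  const bucket = digest.readUInt32BE(0) % 100;
  let cumulative = 0;
  for (const variant of variants) {
    cumulative += variant.weight;
    if (bucket < cumulative) return variant.id;
  }
  return variants[variants.length - 1].id; // guard against rounding gaps
}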

Configuration

All settings are stored in the observability-config key:

{
  enabled: true,           // Master switch
  capturePrompts: false,   // Store prompt content in traces
  captureToolArgs: false,  // Store action input arguments
  captureToolResults: false, // Store action results
  evalSampleRate: 0,       // 0-1, fraction of runs to LLM-judge
  exporters: []            // OTLP export targets
}

Content is redacted by default — only token counts, costs, and timing are stored. Opt in to content capture when needed for debugging.
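
For example, to opt in while debugging a misbehaving tool:

import { putSetting } from "@agent-native/core/settings";

// Temporarily capture prompt and tool-argument content; revert when done.
await putSetting("observability-config", {
  enabled: true,
  capturePrompts: true,
  captureToolArgs: true,
});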

API Endpoints

All auto-mounted at /_agent-native/observability/:

| Method | Path | Purpose |
| --- | --- | --- |
| GET | / | Overview stats |
| GET | /traces | List trace summaries |
| GET | /traces/:runId | Trace detail (summary + spans) |
| GET | /traces/:runId/evals | Evals for a run |
| POST | /feedback | Submit feedback |
| GET | /feedback | List feedback |
| GET | /feedback/stats | Feedback aggregation |
| GET | /satisfaction | Satisfaction scores |
| GET | /evals/stats | Eval statistics |
| POST | /experiments | Create experiment |
| GET | /experiments | List experiments |
| PUT | /experiments/:id | Update experiment |
| POST | /experiments/:id/results | Compute results |
| GET | /experiments/:id/results | Get results |

All endpoints support ?since=N (ms timestamp) and ?limit=N query params.
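
For example, to pull the last hour of trace summaries, capped at 50 rows:

// Fetch recent trace summaries (since is a ms timestamp).
const since = Date.now() - 60 * 60 * 1000;
const res = await fetch(
  `/_agent-native/observability/traces?since=${since}&limit=50`,
);
const traces = await res.json();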

Export to External Platforms

Send traces to Langfuse, Datadog, Grafana, or any OTel-compatible backend:

await putSetting("observability-config", {
  enabled: true,
  exporters: [
    {
      type: "otlp",
      endpoint: "https://cloud.langfuse.com/api/public/otel",
      headers: { Authorization: "Bearer sk-..." },
    },
  ],
});

The framework emits gen_ai.* semantic convention spans compatible with the OpenTelemetry GenAI spec.
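
In practice that means exported LLM-call spans carry gen_ai.*-prefixed attributes. An illustrative set (attribute names from the OTel GenAI semantic conventions; values made up):

// Illustrative attributes on an exported llm_call span.
const spanAttributes = {
  "gen_ai.system": "anthropic",
  "gen_ai.operation.name": "chat",
  "gen_ai.request.model": "claude-sonnet-4-6",
  "gen_ai.usage.input_tokens": 1843,
  "gen_ai.usage.output_tokens": 412,
};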