Agent Observability

Every agent-native app gets observability out of the box. Traces, automated evals, user feedback, and A/B experiments work with zero configuration — all data lives in the app's own SQL database.

What's Captured Automatically

When a user sends a message, the framework automatically records:

  • Token usage — input, output, cache read, cache write
  • Cost — computed from token counts and model pricing
  • Latency — total duration and time per tool call
  • Tool calls — which actions were invoked, success/error status, duration
  • Automated evals — 5 quality scores computed after every run

No code changes needed. The instrumentation hooks into production-agent.ts transparently.
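
For orientation, the record captured for a single run looks roughly like this (field names here are illustrative, not the framework's exact stored schema):

// Illustrative shape of a captured run record; not the exact schema.
interface RunTrace {
  runId: string;
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens: number;
  cacheWriteTokens: number;
  costUsd: number;    // derived from token counts and model pricing
  durationMs: number; // total run latency
  toolCalls: { name: string; status: "success" | "error"; durationMs: number }[];
  evalScores: Record<string, number>; // the 5 automated quality scores
}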

The Dashboard

Add the dashboard to any template with a single route:

// app/routes/observability.tsx
import { ObservabilityDashboard } from "@agent-native/core/client";

export default function ObservabilityPage() {
  return (
    <div className="min-h-screen bg-background p-6">
      <ObservabilityDashboard />
    </div>
  );
}

The dashboard has 5 tabs:

| Tab | What it shows |
| --- | --- |
| Overview | Key metrics — runs, cost, latency, tool success rate, satisfaction, eval score |
| Conversations | Trace list with drill-down to individual spans (agent_run, llm_call, tool_call) |
| Evals | Automated eval scores by criteria, trends over time |
| Experiments | A/B test list with status badges, variant results with confidence intervals |
| Feedback | Thumbs up/down stream, category breakdown, frustration scores |

User Feedback

Explicit Feedback

Thumbs up/down buttons render inline on every agent message in the chat UI. Thumbs down opens a category popover (Inaccurate, Not helpful, Wrong tool, Too slow). This is wired into AssistantChat.tsx automatically.
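
Feedback from the built-in buttons is stored automatically, but you can also submit it programmatically through the feedback endpoint. A minimal sketch, assuming the payload carries a run id, a rating, and an optional category (these field names are assumptions, not a documented schema):

// Sketch: submit thumbs-down feedback directly to the API.
// runId / rating / category are assumed field names.
await fetch("/_agent-native/observability/feedback", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    runId: "run_abc123",    // hypothetical run id
    rating: "down",
    category: "Inaccurate", // one of the popover categories
  }),
});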

Implicit Feedback (Frustration Index)

The framework computes a Frustration Index (0-100) from conversation signals:

| Signal | Weight | What it detects |
| --- | --- | --- |
| Rephrasing | 30% | User repeats similar messages |
| Retry patterns | 20% | "Try again", "no that's wrong" |
| Abandonment | 20% | Session ends shortly after response |
| Sentiment | 15% | Negative language patterns |
| Length trend | 15% | Declining message lengths |

Score interpretation: 0-20 = healthy, 20-40 = friction, 40-60 = dissatisfied, 60+ = broken session.
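
Conceptually, the index is a weighted sum of the five signals, scaled to 0-100. A minimal sketch of the combination, assuming each detector has already normalized its signal to 0-1 (the real detectors are more involved):

// Each signal is assumed pre-normalized to 0-1 by its detector.
interface FrustrationSignals {
  rephrasing: number;  // similar messages repeated
  retries: number;     // "try again" patterns
  abandonment: number; // session ended right after a response
  sentiment: number;   // higher = more negative language
  lengthTrend: number; // higher = steeper decline in message length
}

// Weighted sum using the documented weights, scaled to 0-100.
function frustrationIndex(s: FrustrationSignals): number {
  const weighted =
    0.3 * s.rephrasing +
    0.2 * s.retries +
    0.2 * s.abandonment +
    0.15 * s.sentiment +
    0.15 * s.lengthTrend;
  return Math.round(weighted * 100);
}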

Automated Evals

Five deterministic scorers run after every agent run:

| Criterion | What it measures | Score range |
| --- | --- | --- |
| tool_success_rate | % of tool calls without errors | 0-1 |
| step_efficiency | Penalizes excessive LLM iterations for tool-using runs | 0-1 |
| latency_score | Normalized against a 10s-per-tool baseline (see sketch below) | 0-1 |
| cost_efficiency | Normalized against a cost baseline | 0-1 |
| error_recovery | Did the agent recover from tool errors? | 0 or 1 |
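
The latency score can be read as a clamped ratio against the 10-seconds-per-tool baseline. A minimal sketch of that normalization, assuming a simple inverse-ratio curve (the framework's exact formula may differ):

// Sketch: normalize run latency against a 10s-per-tool-call baseline.
// The inverse-ratio curve is an assumption, not the framework's exact math.
function latencyScore(durationMs: number, toolCallCount: number): number {
  const baselineMs = 10_000 * Math.max(1, toolCallCount);
  // At or under baseline → 1.0; degrades smoothly toward 0 past it.
  return Math.min(1, baselineMs / Math.max(durationMs, 1));
}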

LLM-as-Judge (Optional)

Enable sampled LLM-based evaluation by setting evalSampleRate:

import { putSetting } from "@agent-native/core/settings";

await putSetting("observability-config", {
  enabled: true,
  evalSampleRate: 0.05, // 5% of runs
});

Custom criteria use natural language rubrics:

const criteria = {
  name: "helpfulness",
  description: "Was the response helpful and complete?",
  rubric: "0.0 = unhelpful, 0.5 = partially helpful, 1.0 = fully resolved",
};

A/B Experiments

Test different models, temperatures, or agent configurations:

// Create via API
POST /_agent-native/observability/experiments
{
  "name": "sonnet-vs-haiku",
  "variants": [
    { "id": "control", "weight": 50, "config": { "model": "claude-sonnet-4-6" } },
    { "id": "treatment", "weight": 50, "config": { "model": "claude-haiku-4-5-20251001" } }
  ],
  "metrics": ["cost", "latency", "satisfaction"]
}

// Start the experiment
PUT /_agent-native/observability/experiments/:id
{ "status": "running" }

The agent loop automatically resolves the user's variant and applies the config override. Assignment uses consistent hashing — same user always gets the same variant.
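
As a rough sketch of how that kind of stable assignment works: hash the user id together with the experiment, map the digest into a 0-99 bucket, and walk the cumulative variant weights. This mirrors the described behavior; it is not the framework's actual code.

import { createHash } from "node:crypto";

// Sketch: deterministic variant assignment. The same (userId, experiment)
// pair always hashes to the same bucket, so assignment is stable.
function assignVariant(
  userId: string,
  experiment: string,
  variants: { id: string; weight: number }[], // weights sum to 100
): string {
  const digest = createHash("sha256").update(`${experiment}:${userId}`).digest();
  const bucket = digest.readUInt32BE(0) % 100;
  let cumulative = 0;
  for (const variant of variants) {
    cumulative += variant.weight;
    if (bucket < cumulative) return variant.id;
  }
  return variants[variants.length - 1].id; // guard against rounding gaps
}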

Configuration

All settings are stored in the observability-config key:

{
  enabled: true,           // Master switch
  capturePrompts: false,   // Store prompt content in traces
  captureToolArgs: false,  // Store action input arguments
  captureToolResults: false, // Store action results
  evalSampleRate: 0,       // 0-1, fraction of runs to LLM-judge
  exporters: []            // OTLP export targets
}

Content is redacted by default — only token counts, costs, and timing are stored. Opt in to content capture when needed for debugging.
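
For example, to opt in while debugging a misbehaving tool:

import { putSetting } from "@agent-native/core/settings";

// Temporarily capture prompt and tool-argument content; revert when done.
await putSetting("observability-config", {
  enabled: true,
  capturePrompts: true,
  captureToolArgs: true,
});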

API Endpoints

All auto-mounted at /_agent-native/observability/:

| Method | Path | Purpose |
| --- | --- | --- |
| GET | / | Overview stats |
| GET | /traces | List trace summaries |
| GET | /traces/:runId | Trace detail (summary + spans) |
| GET | /traces/:runId/evals | Evals for a run |
| POST | /feedback | Submit feedback |
| GET | /feedback | List feedback |
| GET | /feedback/stats | Feedback aggregation |
| GET | /satisfaction | Satisfaction scores |
| GET | /evals/stats | Eval statistics |
| POST | /experiments | Create experiment |
| GET | /experiments | List experiments |
| PUT | /experiments/:id | Update experiment |
| POST | /experiments/:id/results | Compute results |
| GET | /experiments/:id/results | Get results |

All endpoints support ?since=N (ms timestamp) and ?limit=N query params.
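
For example, to pull the last hour of trace summaries, capped at 50 rows:

// Fetch recent trace summaries (since is a ms timestamp).
const since = Date.now() - 60 * 60 * 1000;
const res = await fetch(
  `/_agent-native/observability/traces?since=${since}&limit=50`,
);
const traces = await res.json();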

Export to External Platforms

Send traces to Langfuse, Datadog, Grafana, or any OTel-compatible backend:

await putSetting("observability-config", {
  enabled: true,
  exporters: [
    {
      type: "otlp",
      endpoint: "https://cloud.langfuse.com/api/public/otel",
      headers: { Authorization: "Bearer sk-..." },
    },
  ],
});

The framework emits gen_ai.* semantic convention spans compatible with the OpenTelemetry GenAI spec.
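
In practice that means exported LLM-call spans carry gen_ai.*-prefixed attributes. An illustrative set (attribute names from the OTel GenAI semantic conventions; values made up):

// Illustrative attributes on an exported llm_call span.
const spanAttributes = {
  "gen_ai.system": "anthropic",
  "gen_ai.operation.name": "chat",
  "gen_ai.request.model": "claude-sonnet-4-6",
  "gen_ai.usage.input_tokens": 1843,
  "gen_ai.usage.output_tokens": 412,
};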