# Agent Observability
Every agent-native app gets observability out of the box. Traces, automated evals, user feedback, and A/B experiments work with zero configuration — all data lives in the app's own SQL database.
## What's Captured Automatically
When a user sends a message, the framework automatically records:
- Token usage — input, output, cache read, cache write
- Cost — computed from token counts and model pricing
- Latency — total duration and time per tool call
- Tool calls — which actions were invoked, success/error status, duration
- Automated evals — 5 quality scores computed after every run
No code changes needed. The instrumentation hooks into `production-agent.ts` transparently.
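As a rough illustration of the kind of record this produces, a captured run might be summarized like the sketch below. The field names are illustrative assumptions, not the framework's actual schema.

```ts
// Illustrative shape only: field names are assumptions, not the framework's schema.
interface RunTraceSummary {
  runId: string;
  model: string;
  inputTokens: number;       // token usage: input
  outputTokens: number;      // token usage: output
  cacheReadTokens: number;   // token usage: cache read
  cacheWriteTokens: number;  // token usage: cache write
  costUsd: number;           // computed from token counts and model pricing
  durationMs: number;        // total latency
  toolCalls: Array<{
    name: string;
    status: "success" | "error";
    durationMs: number;
  }>;
  evalScores: Record<string, number>; // the 5 automated eval scores
}
```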
## The Dashboard
Add the dashboard to any template with a single route:
```tsx
// app/routes/observability.tsx
import { ObservabilityDashboard } from "@agent-native/core/client";

export default function ObservabilityPage() {
  return (
    <div className="min-h-screen bg-background p-6">
      <ObservabilityDashboard />
    </div>
  );
}
```
The dashboard has 5 tabs:
| Tab | What it shows |
|---|---|
| Overview | Key metrics — runs, cost, latency, tool success rate, satisfaction, eval score |
| Conversations | Trace list with drill-down to individual spans (agent_run, llm_call, tool_call) |
| Evals | Automated eval scores by criteria, trends over time |
| Experiments | A/B test list with status badges, variant results with confidence intervals |
| Feedback | Thumbs up/down stream, category breakdown, frustration scores |
## User Feedback
### Explicit Feedback
Thumbs up/down buttons render inline on every agent message in the chat UI. Thumbs down opens a category popover (Inaccurate, Not helpful, Wrong tool, Too slow). This is wired into `AssistantChat.tsx` automatically.
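The built-in buttons cover the chat UI. To record feedback from a custom surface, the same `POST /feedback` endpoint (listed under API Endpoints below) can be called directly. A minimal sketch, assuming a payload of run id, rating, and category; the exact field names are not documented here:

```ts
// Sketch only: payload field names are assumptions, not a documented contract.
async function submitThumbsDown(runId: string, category: string) {
  await fetch("/_agent-native/observability/feedback", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      runId,          // which agent run the feedback applies to
      rating: "down", // thumbs up/down
      category,       // e.g. "Inaccurate", "Not helpful", "Wrong tool", "Too slow"
    }),
  });
}
```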
### Implicit Feedback (Frustration Index)
The framework computes a Frustration Index (0-100) from conversation signals:
| Signal | Weight | What it detects |
|---|---|---|
| Rephrasing | 30% | User repeats similar messages |
| Retry patterns | 20% | "Try again", "no that's wrong" |
| Abandonment | 20% | Session ends shortly after response |
| Sentiment | 15% | Negative language patterns |
| Length trend | 15% | Declining message lengths |
Score interpretation: 0-20 = healthy, 20-40 = friction, 40-60 = dissatisfied, 60+ = broken session.
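A back-of-the-envelope sketch of how those weights combine into a 0-100 score. Extracting the individual signals is the hard part and is not shown; the helper below only does the weighted sum and assumes each signal is already normalized to 0..1.

```ts
// Weights mirror the table above; each signal is assumed pre-normalized to 0..1.
interface FrustrationSignals {
  rephrasing: number;   // 30%
  retries: number;      // 20%
  abandonment: number;  // 20%
  sentiment: number;    // 15%
  lengthTrend: number;  // 15%
}

function frustrationIndex(s: FrustrationSignals): number {
  const score =
    0.3 * s.rephrasing +
    0.2 * s.retries +
    0.2 * s.abandonment +
    0.15 * s.sentiment +
    0.15 * s.lengthTrend;
  return Math.round(score * 100); // 0-100, higher = more frustrated
}
```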
## Automated Evals
Five deterministic scorers run after every agent run:
| Criteria | What it measures | Score range |
|---|---|---|
| `tool_success_rate` | % of tool calls without errors | 0-1 |
| `step_efficiency` | Penalizes excessive LLM iterations for tool-using runs | 0-1 |
| `latency_score` | Normalized against a 10s/tool baseline | 0-1 |
| `cost_efficiency` | Normalized against a cost baseline | 0-1 |
| `error_recovery` | Did the agent recover from tool errors? | 0 or 1 |
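As an example of how deterministic these scorers are, `tool_success_rate` reduces to a plain ratio. A sketch, assuming each recorded tool call carries a success/error status:

```ts
// Sketch: tool_success_rate as a ratio over the recorded tool calls.
type ToolCallRecord = { name: string; status: "success" | "error" };

function toolSuccessRate(toolCalls: ToolCallRecord[]): number {
  if (toolCalls.length === 0) return 1; // no tools used: nothing to penalize
  const ok = toolCalls.filter((t) => t.status === "success").length;
  return ok / toolCalls.length; // 0-1
}
```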
### LLM-as-Judge (Optional)
Enable sampled LLM-based evaluation by setting `evalSampleRate`:
```ts
import { putSetting } from "@agent-native/core/settings";

await putSetting("observability-config", {
  enabled: true,
  evalSampleRate: 0.05, // 5% of runs
});
```
Custom criteria use natural language rubrics:
```ts
const criteria = {
  name: "helpfulness",
  description: "Was the response helpful and complete?",
  rubric: "0.0 = unhelpful, 0.5 = partially helpful, 1.0 = fully resolved",
};
```
## A/B Experiments
Test different models, temperatures, or agent configurations:
```
// Create via API
POST /_agent-native/observability/experiments
{
  "name": "sonnet-vs-haiku",
  "variants": [
    { "id": "control", "weight": 50, "config": { "model": "claude-sonnet-4-6" } },
    { "id": "treatment", "weight": 50, "config": { "model": "claude-haiku-4-5-20251001" } }
  ],
  "metrics": ["cost", "latency", "satisfaction"]
}

// Start the experiment
PUT /_agent-native/observability/experiments/:id
{ "status": "running" }
```
The agent loop automatically resolves the user's variant and applies the config override. Assignment uses consistent hashing — same user always gets the same variant.
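A sketch of what deterministic, weight-respecting assignment typically looks like. This is a generic illustration of the technique, not the framework's internal code:

```ts
// Generic deterministic bucketing: hash(userId + experimentId) -> 0..99, then walk
// the cumulative variant weights. Same inputs always yield the same variant.
function assignVariant(
  userId: string,
  experimentId: string,
  variants: { id: string; weight: number }[]
): string {
  // Simple FNV-1a hash; any stable hash works.
  let h = 2166136261;
  for (const ch of userId + ":" + experimentId) {
    h ^= ch.charCodeAt(0);
    h = Math.imul(h, 16777619);
  }
  const bucket = (h >>> 0) % 100;
  let cumulative = 0;
  for (const v of variants) {
    cumulative += v.weight;
    if (bucket < cumulative) return v.id;
  }
  return variants[variants.length - 1].id;
}
```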
## Configuration
All settings are stored in the `observability-config` key:
```ts
{
  enabled: true,              // Master switch
  capturePrompts: false,      // Store prompt content in traces
  captureToolArgs: false,     // Store action input arguments
  captureToolResults: false,  // Store action results
  evalSampleRate: 0,          // 0-1, fraction of runs to LLM-judge
  exporters: []               // OTLP export targets
}
```
Content is redacted by default — only token counts, costs, and timing are stored. Opt in to content capture when needed for debugging.
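For example, to opt into prompt and tool-argument capture during a debugging session, update the same setting. Whether `putSetting` merges or replaces the stored object is not specified here, so the sketch passes the full config explicitly:

```ts
import { putSetting } from "@agent-native/core/settings";

// Sketch: enable content capture while debugging, then revert when done.
await putSetting("observability-config", {
  enabled: true,
  capturePrompts: true,
  captureToolArgs: true,
  captureToolResults: false,
  evalSampleRate: 0,
  exporters: [],
});
```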
## API Endpoints
All endpoints are auto-mounted at `/_agent-native/observability/`:
| Method | Path | Purpose |
|---|---|---|
| GET | `/` | Overview stats |
| GET | `/traces` | List trace summaries |
| GET | `/traces/:runId` | Trace detail (summary + spans) |
| GET | `/traces/:runId/evals` | Evals for a run |
| POST | `/feedback` | Submit feedback |
| GET | `/feedback` | List feedback |
| GET | `/feedback/stats` | Feedback aggregation |
| GET | `/satisfaction` | Satisfaction scores |
| GET | `/evals/stats` | Eval statistics |
| POST | `/experiments` | Create experiment |
| GET | `/experiments` | List experiments |
| PUT | `/experiments/:id` | Update experiment |
| POST | `/experiments/:id/results` | Compute results |
| GET | `/experiments/:id/results` | Get results |
All endpoints support `?since=N` (ms timestamp) and `?limit=N` query params.
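For example, pulling the last hour of trace summaries looks like the sketch below; the response shape is whatever the endpoint returns and is not detailed here.

```ts
// Sketch: list trace summaries from the last hour, capped at 50 rows.
const since = Date.now() - 60 * 60 * 1000; // ms timestamp
const res = await fetch(
  `/_agent-native/observability/traces?since=${since}&limit=50`
);
const traces = await res.json(); // response shape not documented here
```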
## Export to External Platforms
Send traces to Langfuse, Datadog, Grafana, or any OTel-compatible backend:
```ts
import { putSetting } from "@agent-native/core/settings";

await putSetting("observability-config", {
  enabled: true,
  exporters: [
    {
      type: "otlp",
      endpoint: "https://cloud.langfuse.com/api/public/otel",
      headers: { Authorization: "Bearer sk-..." },
    },
  ],
});
```
The framework emits `gen_ai.*` semantic convention spans compatible with the OpenTelemetry GenAI spec.
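For orientation, spans following that convention carry attributes like the ones sketched below. This is a representative sample of OTel GenAI semantic-convention attribute names, not a statement of exactly which attributes this framework sets.

```ts
// Representative gen_ai.* attributes from the OTel GenAI semantic conventions;
// which of these the framework actually emits is not specified here.
const exampleSpanAttributes = {
  "gen_ai.operation.name": "chat",
  "gen_ai.request.model": "claude-sonnet-4-6",
  "gen_ai.usage.input_tokens": 1250,
  "gen_ai.usage.output_tokens": 320,
};
```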