# reliable agents as code

Build AI agents like you build software.

Agents, datasets, and evals are code — branch a change, review it in a PR, gate every deploy in CI. The reliability loop your code already has, now for your agents.

terminal
$ npm create agentmark@latest -- --cloud

Git-native · Open core · OpenTelemetry · MCP

## the platform

Built for engineers. Ready for your whole team.

Your source of truth is git — agents, datasets, and evals as files. The platform adds what files alone can't: hosted tracing, experiments, prompt management, alerts, annotation queues, and real-time collaboration.

[Screenshot: observability preview, app.agentmark.co]

## the problem

Your agents are degrading right now.

Most teams find out from users, not dashboards. Here's what that looks like.

Silent regression: −23% response quality

Prompt shipped Monday. Response quality dropped 23%. No alert fired. A user complained on Friday.

Caught 4 days later

Runaway loop: $47,000 total damage

Two agents got stuck coordinating. Week 1 cost $127. Week 4 cost $18,400. The team mistook the spike for user growth.

Caught on the invoice
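
The two figures in this story imply roughly a 145x increase in weekly cost ($127 to $18,400), which is hard to explain by traffic alone. One way to tell a runaway loop from user growth is to compare cost per request from week to week. The sketch below is a minimal illustration under assumptions: the `WeeklyUsage` shape, the `findCostAnomalies` helper, and the 2x default threshold are hypothetical and are not part of AgentMark.

```typescript
// Hypothetical weekly rollup; field names are illustrative, not an AgentMark schema.
interface WeeklyUsage {
  week: number;
  costUsd: number;
  requests: number;
}

// Flags weeks where cost per request grew by more than `maxGrowth` times
// compared with the previous week. Normalizing by request count separates
// a runaway loop from genuine user growth.
function findCostAnomalies(weeks: WeeklyUsage[], maxGrowth = 2): number[] {
  const flagged: number[] = [];
  for (let i = 1; i < weeks.length; i++) {
    const prev = weeks[i - 1];
    const curr = weeks[i];
    // Skip weeks with no traffic: cost per request is undefined there.
    if (!prev || !curr || prev.requests === 0 || curr.requests === 0) continue;
    const prevUnitCost = prev.costUsd / prev.requests;
    const currUnitCost = curr.costUsd / curr.requests;
    if (prevUnitCost > 0 && currUnitCost / prevUnitCost > maxGrowth) {
      flagged.push(curr.week);
    }
  }
  return flagged;
}
```

When total cost rises in step with request volume, cost per request stays flat and nothing is flagged. Only weeks where each request became much more expensive are returned.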

Fluent failure: HTTP status 200 OK

Agent called a tool with a wrong parameter. Database returned zero rows. The agent told the user: "I couldn't find any data." Every dashboard stayed green.

Never caught
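
A failure like this stays invisible because an empty result and a successful result look identical at the HTTP layer. The sketch below shows one possible mitigation, assuming a query-style tool that returns rows: record a zero-row result as its own outcome so it can be counted and alerted on. `ToolOutcome` and `classifyToolResult` are hypothetical names for illustration, not an AgentMark API.

```typescript
// Outcome of a tool call. "empty" is kept distinct from "ok" so a zero-row
// result can be tracked instead of passing silently as success.
type ToolOutcome<T> =
  | { status: "ok"; rows: T[] }
  | { status: "empty"; params: Record<string, unknown> }
  | { status: "error"; message: string };

// Hypothetical wrapper around any query-style tool.
// `run` stands in for the real tool implementation.
function classifyToolResult<T>(
  params: Record<string, unknown>,
  run: (params: Record<string, unknown>) => T[],
): ToolOutcome<T> {
  try {
    const rows = run(params);
    if (rows.length === 0) {
      // Keep the parameters so a reviewer can see what was actually queried.
      return { status: "empty", params };
    }
    return { status: "ok", rows };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { status: "error", message };
  }
}
```

An empty result is sometimes legitimate. The aim is only to make it visible, for example as a span attribute or a counter, so a sudden rise in empty results can trigger an alert.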

The fix is the same one that works for code: evals, observability, alerts, and version control.

That's AgentMark.

## the solution

One reliability loop, anchored in git.

Commit a change, gate it with evals in CI, ship it, then trace, alert, and fix in production. One workflow, and every step lives in your repo.

reliability.loop

    while (shipping) {
      commit()  // edit an agent, dataset, or eval: a file in your repo
      ci()      // evals run; a regression blocks the PR, not your users
      merge()   // green evals gate the deploy to main
      // ── deployed to production ──
      trace()   // every span, token, and tool call, in your OTel stack
      alert()   // quality, cost, or latency crosses your threshold
      fix()     // diagnose the trace, add the case, commit the fix
    }
    // improve your agents with every commit

Catch regressions before deploy with evals in CI, then catch what slips through with traces and alerts. It's one workflow — versioned in git, not scattered across a dashboard.
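
To make the CI gate concrete, here is a minimal sketch of a check that compares a candidate's eval scores against a stored baseline and fails when quality drops. Everything in it is an assumption for illustration: the `EvalResult` shape, the `gateOnEvals` helper, and the 0.02 tolerance are hypothetical and do not describe AgentMark's eval runner.

```typescript
// One scored eval case. Scores are assumed to be in the range 0..1.
interface EvalResult {
  caseId: string;
  score: number;
}

interface GateDecision {
  pass: boolean;
  baselineMean: number;
  candidateMean: number;
  regressions: string[]; // case ids that dropped by more than the tolerance
}

function mean(values: number[]): number {
  if (values.length === 0) return 0;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Passes when the candidate's mean score is within `maxDrop` of the baseline.
// Per-case regressions are reported either way so a reviewer can inspect them.
function gateOnEvals(
  baseline: EvalResult[],
  candidate: EvalResult[],
  maxDrop = 0.02,
): GateDecision {
  const baselineById = new Map<string, number>();
  for (const r of baseline) baselineById.set(r.caseId, r.score);

  const regressions: string[] = [];
  for (const r of candidate) {
    const before = baselineById.get(r.caseId);
    if (before !== undefined && r.score < before - maxDrop) {
      regressions.push(r.caseId);
    }
  }

  const baselineMean = mean(baseline.map((r) => r.score));
  const candidateMean = mean(candidate.map((r) => r.score));
  return {
    pass: candidateMean >= baselineMean - maxDrop,
    baselineMean,
    candidateMean,
    regressions,
  };
}
```

In a CI job, a script would call `gateOnEvals` and exit with a non-zero status when `pass` is false, which is what blocks the PR.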

## in production

Catch problems before your users do.

Metrics, traces, experiments, and alerts — each tied back to the exact agent, dataset, or eval in git that produced it. One workflow, from commit to production.
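
As an illustration of the alerting step, the sketch below checks one window of production metrics against team-chosen limits for quality, cost, and latency. The `WindowMetrics` and `Thresholds` shapes and the `evaluateAlerts` helper are hypothetical and shown only to make the idea concrete. They are not AgentMark's alert configuration.

```typescript
// Metrics for one time window in production. Field names are illustrative.
interface WindowMetrics {
  qualityScore: number; // 0..1, for example a mean eval score over sampled traces
  costUsd: number;
  p95LatencyMs: number;
}

// Limits chosen by the team; the values are assumptions, not tool defaults.
interface Thresholds {
  minQualityScore: number;
  maxCostUsd: number;
  maxP95LatencyMs: number;
}

// Returns one message per crossed threshold. An empty array means nothing fired.
function evaluateAlerts(m: WindowMetrics, t: Thresholds): string[] {
  const alerts: string[] = [];
  if (m.qualityScore < t.minQualityScore) {
    alerts.push(`quality ${m.qualityScore} is below ${t.minQualityScore}`);
  }
  if (m.costUsd > t.maxCostUsd) {
    alerts.push(`cost $${m.costUsd} is above $${t.maxCostUsd}`);
  }
  if (m.p95LatencyMs > t.maxP95LatencyMs) {
    alerts.push(`p95 latency ${m.p95LatencyMs}ms is above ${t.maxP95LatencyMs}ms`);
  }
  return alerts;
}
```

Running a check like this on a short window is what turns a regression found days later into one found within that window.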

Metrics

Know your cost, latency, and error rate before a user complains — not after.

  • Know your cost per request and error rate before a user complains
  • Track latency and quality trends across model versions and prompt changes
  • Correlate token usage spikes with specific prompt or code changes
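
The metrics named above can all be derived from per-request records. A minimal sketch, assuming each request is logged with its cost, latency, and success flag. The `RequestRecord` shape and both helpers are hypothetical and are not AgentMark's data model.

```typescript
// One logged request. Field names are illustrative.
interface RequestRecord {
  costUsd: number;
  latencyMs: number;
  ok: boolean;
}

interface Summary {
  requests: number;
  totalCostUsd: number;
  costPerRequestUsd: number;
  avgLatencyMs: number;
  errorRate: number; // fraction of failed requests, 0..1
}

// Rolls up raw request records into the headline numbers for a period.
function summarize(records: RequestRecord[]): Summary {
  const requests = records.length;
  if (requests === 0) {
    return { requests: 0, totalCostUsd: 0, costPerRequestUsd: 0, avgLatencyMs: 0, errorRate: 0 };
  }
  let totalCostUsd = 0;
  let totalLatencyMs = 0;
  let errors = 0;
  for (const r of records) {
    totalCostUsd += r.costUsd;
    totalLatencyMs += r.latencyMs;
    if (!r.ok) errors += 1;
  }
  return {
    requests,
    totalCostUsd,
    costPerRequestUsd: totalCostUsd / requests,
    avgLatencyMs: totalLatencyMs / requests,
    errorRate: errors / requests,
  };
}

// Percent change between two periods, e.g. this month versus last month.
// Returns NaN when the previous value is zero, since the change is undefined.
function percentChange(previous: number, current: number): number {
  if (previous === 0) return NaN;
  return ((current - previous) / previous) * 100;
}
```

Grouping the records by prompt version or model before calling `summarize` gives the per-change comparison that the second and third bullets describe.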
Monthly Metrics, March 2025

| Metric | Value | Change vs Feb |
|---|---|---|
| Cost | $257.12 | -5.2% |
| Avg. Latency | 312ms | +42ms |
| Requests | 24.5K | +12% |
| Error Rate | 0.8% | -0.3% |
| Tokens Used | 3.2M | +8% |
| Quality Score | 92% | +2.5% |

“AgentMark is, by far, the best agent representation layer of this new stack. You're the only people I've seen that take actual developer needs seriously in this regard.”

Dominic Vinyard

Founding AI Designer

San Francisco, CA

## integrations

Works with your entire stack.

No proprietary SDKs. Standard OpenTelemetry for traces, git for version control, and direct support for every major model and framework.

Foundation: TypeScript, JavaScript, Python, OpenTelemetry, GitHub, GitLab

Agent Frameworks: Claude, Vercel AI, LlamaIndex, Mastra AI, LangChain, Pydantic AI

Model Providers: OpenAI, Anthropic, Google, AWS Bedrock, Azure OpenAI, Ollama, Groq, Mistral, Cohere, DeepSeek, Perplexity, Fireworks, Together AI, xAI Grok

## editor-native

Debug a production trace without leaving your editor.

Agents, datasets, and evals are just files — your AI assistant reads, writes, and refactors them over MCP. Ask it what failed in prod and it pulls the trace, names the root cause, and points at the line to fix.

Debug any trace with a question.

Connect AgentMark via MCP and ask Claude Code exactly what went wrong. It pulls the spans, identifies the root cause, and tells you precisely where to fix it.

trace-7f3a · AgentMark MCP

trace-7f3a returned "I don't have enough information" — we expected a product recommendation. What went wrong?
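
How an assistant reasons over a trace is not specified here, but one simple heuristic for finding a root cause can be sketched: among the spans that failed, pick the ones with no failing child, since those are where the failure started rather than where it propagated. The `Span` shape and `findFailureOrigins` below are assumptions for illustration. They are loosely modeled on OpenTelemetry's parent and child spans and are not the AgentMark MCP interface.

```typescript
// Simplified span. Real OpenTelemetry spans carry many more fields.
interface Span {
  spanId: string;
  parentId: string | null; // null for the root span
  name: string;
  status: "ok" | "error";
}

// Returns failing spans that have no failing child. A parent that failed only
// because a child failed is treated as propagation, not as an origin.
function findFailureOrigins(spans: Span[]): Span[] {
  const parentsOfFailures = new Set<string>();
  for (const s of spans) {
    if (s.status === "error" && s.parentId !== null) {
      parentsOfFailures.add(s.parentId);
    }
  }
  return spans.filter(
    (s) => s.status === "error" && !parentsOfFailures.has(s.spanId),
  );
}
```

This only works when the failing step is marked as an error. A fluent failure like the one in this example, where every span reports success, needs an additional signal such as an eval score or an empty-result flag.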


Any MCP-capable editor: Claude Code, Cursor, Copilot, Windsurf

## ownership

Your data, your repo, your standards.

Closed platforms own your data and lock you into their SDKs. AgentMark doesn't.

| | AgentMark | Without AgentMark |
|---|---|---|
| Where agents, datasets & evals live | Your git repo | Their database |
| Roll back a change | `git revert` | Click through their UI |
| Evaluations | As code, in your CI | In their platform |
| Your telemetry | OpenTelemetry standard | Proprietary SDK |
| Auditing changes | Your commit history | Their activity log |
| Local development | Fully supported | Limited or none |
| If you leave | Files stay in git | Migration required |

## faq

Questions

The questions engineers ask before they commit.

## get started

Ready to ship agents you can trust?

Tell us what you're building — we'll get you set up.

Request access: Free for individual developers and small teams

Schedule a demo: For engineering teams evaluating at scale