Pick the right starting point
Don’t build AgentOS from scratch. Start from the template that’s closest to what you want to ship.

| Template | Right when |
|---|---|
| Scout | You’re building a context agent that pulls from external systems (Slack, Drive, MCP, custom) |
| Dash | You’re building a data agent that answers questions from your database |
| Coda | You’re building a code companion that lives in Slack |
| Demo OS | You want a kitchen-sink reference with every feature wired up |
| Bare AgentOS | You want full control and don’t mind wiring it yourself |
Every template ships with a working .env.production flow. You’re 90% of the way to deploy on day one.
Replace the demo data
Templates ship with synthetic data so the first run works. Replace it with your own:

| Template | Swap |
|---|---|
| Scout | Configure context providers in scout/contexts.py. Point at your S3 buckets, your Drive folders, your MCP servers. |
| Dash | Replace scripts/generate_data.py with a loader for your dataset. Then rewrite knowledge/ for your tables. |
| Coda | Edit repos.yaml to point at your repos. Make sure your GITHUB_ACCESS_TOKEN has the right scopes. |
| Demo OS | Fork the agent that’s closest to yours. Modify its instructions, knowledge, tools. Remove the agents you don’t need. |
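
For Dash, the swap is the most mechanical: the loader just has to land your tables in the database the agent reads from. A minimal sketch of what a replacement for scripts/generate_data.py might look like, assuming your source data is CSV files and your database is Postgres; the DB_URL environment variable, file paths, and table names are illustrative, not part of the template:

```python
# load_data.py -- hypothetical replacement for scripts/generate_data.py
import os

import pandas as pd
from sqlalchemy import create_engine

# Assumes the same Postgres instance AgentOS is configured against.
engine = create_engine(os.environ["DB_URL"])

# Illustrative source files and target tables -- swap in your own.
TABLES = {
    "accounts": "data/accounts.csv",
    "subscriptions": "data/subscriptions.csv",
    "revenue": "data/revenue.csv",
}

for table, path in TABLES.items():
    df = pd.read_csv(path)
    # Replace wholesale on each run; switch to append/upsert once this is a real pipeline.
    df.to_sql(table, engine, if_exists="replace", index=False)
    print(f"loaded {len(df)} rows into {table}")
```

Then rewrite knowledge/ so the table and column descriptions match what you just loaded; the agent’s queries are only as good as that documentation.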
The shipping checklist
Each row maps to a page in this section; the detail lives there. This is the punch list.

| Step | Do | Where |
|---|---|---|
| Pick interfaces | Slack for B2B, Telegram for personal, WhatsApp for support, AG-UI for browser, custom HTTP for everything else. Start with one. | Interfaces |
| Wire auth | RUNTIME_ENV=prd and JWT_VERIFICATION_KEY. Control plane issues tokens, your service verifies. | Security & Auth |
| Turn on tracing | tracing=True from day one. The first time a user reports a bad answer, you’ll have the trace tree to debug from. | Observability |
| Gate irreversible actions | requires_confirmation=True for user approval, @approval for admin approval. Don’t add approval everywhere — friction kills adoption. | Human Approval |
| Schedule proactive work | Register recurring jobs in the app lifespan so they survive restarts. | Scheduling |
| Deploy | Railway via template scripts, AWS/GCP/Azure via container + managed Postgres + secrets, or self-hosted Docker Compose. | Deploy |
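
Most rows are a one-line setting; scheduling is the one that needs a little structure. A rough sketch of the lifespan pattern, assuming your AgentOS instance is served as a FastAPI application and that nightly_digest is your own coroutine; AgentOS’s actual scheduling helpers are covered on the Scheduling page:

```python
import asyncio
from contextlib import asynccontextmanager

from fastapi import FastAPI


async def nightly_digest() -> None:
    """Placeholder for your recurring job (report, sync, cleanup)."""
    ...


async def run_every(seconds: float, job) -> None:
    # Simple in-process loop; swap for APScheduler or cron if you need precision.
    while True:
        await job()
        await asyncio.sleep(seconds)


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Registering the job here means it starts with the process and
    # comes back automatically after every restart or redeploy.
    task = asyncio.create_task(run_every(24 * 3600, nightly_digest))
    try:
        yield
    finally:
        task.cancel()


app = FastAPI(lifespan=lifespan)
```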
Iterate from real usage
The first version is wrong about things, and that’s expected. The interesting question is how fast you can find what’s wrong and fix it. The loop:

- A user reports a bad answer (Slack, support ticket, your own dogfooding).
- Find the run. Filter agno_sessions by user_id and created_at to narrow it down; the session ID gives you the full thread (a query sketch follows this list).
- Pull the trace from agno_traces and agno_spans. Look at what tools were called, what the model saw, and what came back.
- Replay it locally. Run the same input through the same agent in a script and reproduce the failure.
- Patch the prompt, swap the tool, add a learning, fix the knowledge. Re-run.
- Add the case to your eval suite so the regression can’t come back.
- Ship.
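
A minimal sketch of the find-and-pull steps, assuming direct SQL access to the AgentOS Postgres database. The user_id, created_at, status, start_time, and end_time columns appear in the table below; session_id, name, input, and output are assumptions about the schema, so check yours before relying on them:

```python
# find_run.py -- hypothetical debugging helper; some column names are assumptions.
import os

import psycopg

with psycopg.connect(os.environ["DB_URL"]) as conn:
    # Step 2: narrow down the session the complaint came from.
    sessions = conn.execute(
        """
        SELECT session_id, created_at
        FROM agno_sessions
        WHERE user_id = %s AND created_at >= %s
        ORDER BY created_at DESC
        """,
        ("user_123", "2025-06-01"),  # illustrative values
    ).fetchall()

    # Step 3: pull the spans behind that session's trace -- which tools ran,
    # what the model saw, what came back, and where the time went.
    spans = conn.execute(
        """
        SELECT name, status, start_time, end_time, input, output
        FROM agno_spans
        WHERE session_id = %s
        ORDER BY start_time
        """,
        (sessions[0][0],),
    ).fetchall()
```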
Beyond individual bug reports, a few signals worth watching:

| Signal | Where |
|---|---|
| Wrong answers users complained about | agno_sessions joined with the feedback table you build |
| Tools failing in production | agno_spans filtered by status='error' |
| Slow runs | agno_spans end_time - start_time per agent |
| Cost spikes | agno_sessions.total_tokens grouped by agent_id and day |
| Regression after a change | Run evals before deploying |
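
Two of those signals as concrete queries, again assuming direct SQL access; the name column on agno_spans is an assumption, the other columns come from the table above:

```python
# Hypothetical monitoring queries; check them against your schema version.
FAILING_TOOLS = """
    SELECT name, count(*) AS failures
    FROM agno_spans
    WHERE status = 'error' AND start_time >= now() - interval '7 days'
    GROUP BY name
    ORDER BY failures DESC
"""

COST_BY_AGENT_AND_DAY = """
    SELECT agent_id, date_trunc('day', created_at) AS day, sum(total_tokens) AS tokens
    FROM agno_sessions
    GROUP BY agent_id, day
    ORDER BY day DESC, tokens DESC
"""
```

Run them on a schedule or wire them into whatever dashboard you already have.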
Keep the agent learning
The agents that get better over time have feedback loops baked in. The agents that get worse are the ones nobody is watching. Three patterns the templates use:

LearningMachine. Agno’s built-in pattern for agents that store discovered facts and retrieve them on the next run. When the agent figures out that the revenue_v2 table replaced revenue last March, it writes that to a learnings table; the next time someone asks a revenue question, that learning is in the prompt. Dash uses this end-to-end. See Learning.
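
LearningMachine’s API is documented on the Learning page; purely as an illustration of the shape of the pattern (not Agno’s actual API), the mechanism boils down to two operations around a learnings table, whose name and columns here are hypothetical:

```python
# Illustration of the learnings-table idea, not Agno's LearningMachine API.
import os

import psycopg


def add_learning(topic: str, fact: str) -> None:
    # What the agent does when it discovers something worth keeping.
    with psycopg.connect(os.environ["DB_URL"]) as conn:
        conn.execute(
            "INSERT INTO learnings (topic, fact) VALUES (%s, %s)",
            (topic, fact),
        )


def learnings_for(topic: str) -> list[str]:
    # What gets pulled into the prompt on the next run about that topic.
    with psycopg.connect(os.environ["DB_URL"]) as conn:
        rows = conn.execute(
            "SELECT fact FROM learnings WHERE topic = %s", (topic,)
        ).fetchall()
    return [fact for (fact,) in rows]


# e.g. add_learning("revenue", "revenue_v2 replaced revenue in March; query revenue_v2")
```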
Knowledge updates as a feedback loop. When you find a wrong-answer pattern, add the right answer to knowledge. The user who asked the question gets a fix; every subsequent user gets the right answer the first time. Pal and Scout both lean on this.
Eval-gated deploys. Every deploy runs an eval suite. Regressions block the merge. The eval suite grows with the bugs you find — every postmortem ends with a new eval case. Over time the suite becomes the institutional memory of what your agent should and shouldn’t do.
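
A sketch of what one of those eval cases can look like, written as an ordinary pytest test so CI can gate the merge on it; ask_dash is a hypothetical stand-in for whatever harness runs a question through your Dash agent and returns the final answer:

```python
# test_evals.py -- illustrative eval case; ask_dash is a stand-in for your own harness.

def ask_dash(question: str) -> str:
    """Hypothetical harness: run `question` through the Dash agent, return the final answer."""
    raise NotImplementedError("replace with a call into your agent")


def test_revenue_questions_use_the_current_table():
    # Added after the postmortem where the agent kept querying the retired table.
    answer = ask_dash("What was revenue last quarter?")
    assert "revenue_v2" in answer.lower()
```

Run the suite in CI and block the merge on failure; that is the whole gate.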
What’s left
You shipped. The rest is normal product work that AgentOS doesn’t try to solve:

| Concern | Approach |
|---|---|
| Pricing, billing, rate limits | A layer in front of AgentOS. Stripe, your own metering, your own limits. |
| Multi-tenant isolation | Per-tenant db instances or schema-per-tenant within one db |
| Compliance (SOC 2, HIPAA, etc.) | RBAC + audit logs + data residency choices in your storage backend |
| Custom UIs | AG-UI for the chat surface; build the rest as a normal web app calling AgentOS |
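
None of these need anything AgentOS-specific. As one example of the metering-and-limits layer, a per-tenant rate limit is just middleware in front of the app; the X-Tenant-ID header and the limits below are placeholders for whatever your billing layer already knows:

```python
# Hypothetical rate-limit middleware in front of an AgentOS FastAPI app.
import time
from collections import defaultdict

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()  # or the FastAPI app your AgentOS instance serves

WINDOW_SECONDS = 60
MAX_REQUESTS = 30
_hits: dict[str, list[float]] = defaultdict(list)


@app.middleware("http")
async def per_tenant_rate_limit(request: Request, call_next):
    tenant = request.headers.get("X-Tenant-ID", "anonymous")
    now = time.monotonic()
    # Keep only the hits still inside the window. In-memory, so single-process
    # only -- back this with Redis for anything real.
    _hits[tenant] = [t for t in _hits[tenant] if now - t < WINDOW_SECONDS]
    if len(_hits[tenant]) >= MAX_REQUESTS:
        return JSONResponse({"detail": "rate limit exceeded"}, status_code=429)
    _hits[tenant].append(now)
    return await call_next(request)
```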
See it work end-to-end
| Tutorial | Product |
|---|---|
| Scout tutorial | Enterprise context agent over S3 + Slack + Drive + MCP |
| Dash tutorial | Self-learning data agent over a SaaS metrics dataset |
| Coda tutorial | Code companion that lives in Slack and triages issues |