GenAI: from software, through automation, to agents
Three eras of software in one diagram. What changes when the system has to reason, and what breaks if you treat it like it doesn't.
- genai
- agents
- architecture
Software has been through two eras and is now in its third. All three are still running; what changes is the ratio between them.
Software is deterministic logic operating on structured data. It does what you tell it, in the order you tell it, for as long as the power stays on.
Automation is software with its finger on other software. Cron, queues, webhooks, Zapier, workflow engines. It moves data and triggers actions across systems you don't own. It is still deterministic, the inputs and outputs are structured, but the blast radius is larger.
AI systems add a third ingredient: a model that does something deterministic code cannot do, which is interpret unstructured input and decide. That is the step change. Not text generation, decision-making over fuzzy inputs.
The mistake most teams make in 2026 is treating an AI system like a nicer automation. It is not. It is a new architectural primitive with new failure modes.
The anatomy of an AI system
Peel the marketing off any "agent" and you find the same five pieces.
Reasoning core. The model itself, prompt, structured output contract, optional thinking budget. Often a cascade: cheap model for triage, strong model for the hard cases. The prompt is source code. Version it like source code.
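A cascade can be sketched in a few lines. Everything here is illustrative: `call_model` is a hypothetical stand-in for whatever client your stack uses, and the model names and confidence threshold are placeholders, not recommendations.

```python
def call_model(model: str, prompt: str) -> dict:
    # Fake client for illustration only: the small model is "confident"
    # on short prompts and punts on long ones. Swap in your real SDK call.
    if model == "small-fast-model":
        confidence = 0.9 if len(prompt) < 40 else 0.3
        return {"answer": f"small:{prompt}", "confidence": confidence}
    return {"answer": f"large:{prompt}", "confidence": 1.0}

def answer(prompt: str, threshold: float = 0.8) -> str:
    # Cheap model triages every request first.
    result = call_model("small-fast-model", prompt)
    if result["confidence"] >= threshold:
        return result["answer"]
    # Only the hard cases pay for the strong model.
    return call_model("large-strong-model", prompt)["answer"]
```

The threshold is a knob you tune against your evals, not a constant you guess.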
Context. Everything the model needs that wasn't in its weights: a retrieval pipeline (RAG), conversational memory, structured state. Bad context is the most common reason an "agent" sounds confident and is wrong.
Tools. Functions the model can call: APIs, SQL, code execution, search, another agent. A tool is a contract: name, schema, side effects. Every tool you add roughly doubles the surface area of failures you will see in production.
Guardrails. Input validation, output validation, policy checks, rate limits. These are cheap, deterministic, and the thing you will most regret skipping. Don't rely on the model to follow the rules. The rules live in code; the model gives its best guess.
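A minimal sketch of an output guardrail, assuming the model is asked to return JSON with an `action` field. The action names and the refund-limit policy are invented for illustration; the point is that every check is deterministic code, not a plea in the prompt.

```python
import json

ALLOWED_ACTIONS = {"reply", "refund", "escalate"}  # illustrative policy

def validate_output(raw: str) -> dict:
    """Deterministic guardrail: the model's output must parse, match the
    contract, and pass policy checks before anything downstream sees it."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        raise ValueError("model output is not valid JSON")
    if data.get("action") not in ALLOWED_ACTIONS:
        raise ValueError(f"action {data.get('action')!r} not permitted")
    # Hypothetical business rule: refunds above a limit go to a human.
    if data["action"] == "refund" and data.get("amount", 0) > 100:
        raise ValueError("refund above policy limit; route to a human")
    return data
```

When a check fails you can retry, fall back to a safe default, or escalate; what you never do is pass the raw output through.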
Evals and observability. Tests, but continuous. You need ground truth, human-graded, rule-based, or LLM-as-judge (calibrated against humans on a fixed set), and you need to see every live decision in a trace viewer. Without these you are flying a plane with the windows painted over.
What changes when you add a model
Three things become load-bearing that weren't before.
Latency budgets
A single model call can take anywhere from 200ms to 20 seconds, depending on model, input size, and route. An agent that makes five calls to plan, five to act, and three to verify is thirteen sequential calls; even at a second each, that is a user waiting thirteen seconds.
You end up designing for a latency budget the way database engineers design for IOPS. Cache aggressively. Parallelize tool calls. Stream the first token so the user knows it started. Pick the smallest model that clears the eval bar. There is no such thing as "free" inference.
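Parallelizing independent tool calls is the cheapest win. A sketch with `asyncio`, using sleeps as stand-ins for network latency: the wall-clock cost becomes the slowest call, not the sum of all of them.

```python
import asyncio
import time

async def call_tool(name: str, delay: float) -> str:
    # Stand-in for a real tool call; the delay simulates network latency.
    await asyncio.sleep(delay)
    return f"{name}: done"

async def gather_context() -> list[str]:
    # Independent calls run concurrently: total time is max(delays),
    # not sum(delays). Only serialize calls that actually depend on
    # each other's output.
    return await asyncio.gather(
        call_tool("search", 0.2),
        call_tool("sql", 0.3),
        call_tool("crm", 0.1),
    )

start = time.perf_counter()
results = asyncio.run(gather_context())
elapsed = time.perf_counter() - start  # ~0.3s, not 0.6s
```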
Cost per decision
At the level of unit economics, every decision has a price. A RAG query over a 10k-token context on a frontier model is measured in cents. Multiply by millions of requests per day and the accounting department gets involved.
Two disciplines keep you solvent:
- Token budget as a product constraint. Retrieve less. Summarize context before passing it on. Use structured output to avoid burning tokens on prose when you want JSON.
- Model routing. A gateway that picks the right model per request, cheap model for 80% of queries, expensive model for the 20% that need it. Route on features, not vibes.
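"Route on features, not vibes" can be as simple as a scoring function over measurable properties of the request. This is a toy sketch; the features, thresholds, and model names are all invented, and a real router would be tuned against eval data rather than hand-written rules.

```python
def route(query: str) -> str:
    # Illustrative feature-based routing: escalate requests that are
    # long, contain code, or signal multi-step reasoning. Everything
    # else goes to the cheap model by default.
    needs_strong = (
        len(query) > 500                  # long, context-heavy request
        or "```" in query                 # contains code to reason over
        or any(w in query.lower() for w in ("prove", "derive", "refactor"))
    )
    return "strong-model" if needs_strong else "cheap-model"
```

The useful property is that routing decisions are deterministic and loggable, so you can audit exactly why a request cost what it did.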
Evals > unit tests
Deterministic software has unit tests. AI systems have evals: a fixed set of inputs with expected outputs, a scoring function, a dashboard that shows you are not regressing.
Evals are slow, messy, and the single most useful artifact in the system. Without them, every prompt change is a coin flip. With them, you can refactor the core, switch models, and ship with confidence.
Build evals before you build the agent, not after.
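The whole apparatus fits in a page. A minimal offline eval harness, with an illustrative dataset and a rule-based scorer as placeholders: fixed inputs, expected outputs, a scoring function, and a pass rate you can gate CI on.

```python
# Fixed eval set: inputs with known-good expected outputs (illustrative).
EVAL_SET = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]

def score(output: str, expected: str) -> bool:
    # Rule-based scorer; swap in human grading or a calibrated
    # LLM-as-judge where exact matching is too brittle.
    return expected.lower() in output.lower()

def run_evals(system, threshold: float = 0.9) -> float:
    """Run the system over the eval set; fail loudly below the bar."""
    passed = sum(
        score(system(case["input"]), case["expected"]) for case in EVAL_SET
    )
    rate = passed / len(EVAL_SET)
    assert rate >= threshold, f"eval pass rate {rate:.0%} below bar"
    return rate
```

`system` is whatever callable wraps your agent end to end; the harness doesn't care whether it is one prompt or a ten-step plan.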
The architecture that survives
A pattern that holds up once you get past the prototype:
- A gateway in front of every model call. Routes requests, enforces rate limits, retries on transient errors, injects tracing. You do not want application code talking directly to a model vendor.
- RAG as a standalone service. Not embeddings_in_my_app.py. A service with its own indexing pipeline, eval suite, and latency budget. The application treats it like any other backend.
- Tools behind typed contracts. Pydantic or JSON Schema in, validated output out. Every tool call is a row in a table with its arguments, return value, and latency. Treat it like a payment.
- A trace per conversation. Every prompt, every tool call, every retrieval, every reasoning step. OpenTelemetry or a purpose-built tracer. On-call staff should be able to reproduce any failure from the trace alone.
- Offline evals in CI, online evals in production. The offline suite blocks merges. The online suite, sampled, graded, dashboarded, catches what the offline set didn't anticipate.
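A typed tool contract with per-call logging can be sketched with stdlib dataclasses (standing in for the Pydantic models the list mentions). The `get_weather` tool, its fields, and the stubbed result are all hypothetical.

```python
import time
from dataclasses import dataclass, asdict

@dataclass
class GetWeatherArgs:
    # The schema is the contract the model's arguments must satisfy.
    city: str
    unit: str = "celsius"

# Every call becomes a row: tool name, arguments, result, latency.
CALL_LOG: list[dict] = []

def get_weather(args: GetWeatherArgs) -> dict:
    start = time.perf_counter()
    # Validate before executing; reject anything outside the contract.
    if args.unit not in ("celsius", "fahrenheit"):
        raise ValueError(f"invalid unit {args.unit!r}")
    result = {"city": args.city, "temp": 21, "unit": args.unit}  # stubbed
    CALL_LOG.append({
        "tool": "get_weather",
        "args": asdict(args),
        "result": result,
        "latency_ms": (time.perf_counter() - start) * 1000,
    })
    return result
```

In production the log row goes to your trace store instead of a list, but the shape is the same: enough to replay any call from the record alone.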
The boring part that isn't
The glamorous work in GenAI is prompt engineering and model selection. The load-bearing work is evals, observability, and the gateway. The teams that will still be running their AI systems in two years are the ones investing more in the second list than the first.
Every era of software has rewarded the teams who treated their runtime as a serious piece of infrastructure. AI systems are no exception. The only difference is that the runtime now has an opinion.