AI Systems Architecture

An architecture discipline for software with AI built in. How probabilistic models turn into reliable product features.

The problem

Bad chatbots get noticed. Capable AI systems fail quietly.

An observation from three years of AI incidents

Chatbots

Visible errors, bounded consequences

The Chevrolet bot that sold a car for a dollar. Air Canada's bot inventing bereavement fares the airline had to honor. NYC's city chatbot telling businesses to break labor law. Embarrassing, expensive, viral. Damage bounded.

Agents

Plausible answers, irreversible actions

At Replit, an AI agent wiped a production database and fabricated four thousand entries to cover it. At DataTalks.club, a coding agent destroyed production infrastructure and a database with over 1.9 million rows in a single table.

Answers sound plausible. Actions look technically correct. Only the close look reveals that the system has been deciding wrongly for weeks. That is the class of questions at stake.

Boundary

Two related disciplines, two different questions.

When you code with AI, you sit in the loop and see every step. With an AI feature in production, the user sees the answer, or a downstream system consumes it automatically. What software architecture is to software development, AI Systems Architecture is to AI engineering: the discipline for the decisions that are expensive to change later.

Four disciplines

Where AI features prove themselves.

Not exhaustive, but they cover the decisions that look fine in a prototype and fall apart in production.

Contract

Output bound by schema

A schema describes what the model must return: fields, enums, required attributes. This makes the handover between probabilistic answer and deterministic downstream logic testable. It does not guarantee factual correctness. Valid JSON can still be wrong in substance.

Example

For inbound classification, categories, priorities, and required fields are fixed in advance. Free text is allowed around it; the core is structured.
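A minimal sketch of such a contract check, using only the Python standard library. The category and priority values and the names `Classification` and `parse_classification` are assumptions made for illustration; they do not come from any specific product. The point is that the handover either yields a typed object or fails loudly.

```python
import json
from dataclasses import dataclass
from enum import Enum


class Category(Enum):  # hypothetical fixed category set
    BILLING = "billing"
    TECHNICAL = "technical"
    OTHER = "other"


class Priority(Enum):  # hypothetical fixed priority set
    LOW = "low"
    NORMAL = "normal"
    HIGH = "high"


@dataclass(frozen=True)
class Classification:
    category: Category
    priority: Priority
    summary: str  # free text is allowed here; the fields above are fixed


def parse_classification(raw: str) -> Classification:
    """Validate raw model output; raise ValueError on any contract violation."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = {"category", "priority", "summary"} - data.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    if not isinstance(data["summary"], str):
        raise ValueError("summary must be a string")
    # Enum lookup raises ValueError for values outside the fixed set.
    return Classification(
        category=Category(data["category"]),
        priority=Priority(data["priority"]),
        summary=data["summary"],
    )
```

A passing check only says the output is well-formed. Whether "billing" was the right category for this request remains a question for the eval set.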

Evaluation

Measured before released

Model output cannot be settled by a single pass/fail test. Shipping without an eval set pushes testing onto real users. A few hundred labeled cases, evaluated in CI on every prompt, model, and schema change, are a workable start. Pass@k, the probability that at least one of k sampled answers is correct, makes the variance of probabilistic answers visible.

Example

Are similar requests classified inconsistently? Does the model fabricate IDs? Such questions are not settled in chat but on a stable set, automated, with history.
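One way to compute pass@k is the unbiased estimator widely used in code-generation evaluation: sample n answers per case, count the c correct ones, and estimate the chance that a draw of k contains at least one correct answer. The sketch below assumes per-case results are already available as (n, c) pairs; the function names and the case IDs in the usage note are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one case from n samples, c of them correct."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # 1 minus the probability that all k drawn samples are incorrect.
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def eval_set_pass_at_k(results: dict, k: int) -> float:
    """Mean pass@k over an eval set; results maps case id -> (n, c)."""
    if not results:
        raise ValueError("empty eval set")
    return sum(pass_at_k(n, c, k) for n, c in results.values()) / len(results)
```

For instance, `eval_set_pass_at_k({"case-1": (10, 10), "case-2": (10, 0)}, k=1)` averages one case that always passes with one that never does. Tracked per commit, a score like this gives the history the text asks for.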

Observability

Spans, events, metrics

An AI feature can degrade without a code change. A silent model update, input drift, a changed tool response. Without telemetry and alerts, you only learn about it when someone complains.

Example

Model version, schema error rate, latency, token cost, and correction rate per case type are the minimum. OpenTelemetry with the GenAI conventions is an established anchor.
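To make that minimum concrete, here is a small standard-library sketch that aggregates per-call records into these metrics. It deliberately does not use the OpenTelemetry SDK; the record fields and the `summarize` function are hypothetical, and token counts stand in for cost because prices depend on the provider.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CallRecord:  # hypothetical per-call telemetry record
    model_version: str
    case_type: str
    schema_valid: bool
    latency_s: float
    total_tokens: int
    corrected_by_human: bool


def summarize(records):
    """Group records by (model_version, case_type) and compute basic metrics."""
    groups = defaultdict(list)
    for r in records:
        groups[(r.model_version, r.case_type)].append(r)
    summary = {}
    for key, rs in groups.items():
        n = len(rs)
        summary[key] = {
            "calls": n,
            "schema_error_rate": sum(not r.schema_valid for r in rs) / n,
            "correction_rate": sum(r.corrected_by_human for r in rs) / n,
            "mean_latency_s": sum(r.latency_s for r in rs) / n,
            "total_tokens": sum(r.total_tokens for r in rs),
        }
    return summary
```

Grouping by model version is what surfaces a silent model update: the same case type suddenly shows a different schema error rate or correction rate under a new version key.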

Action Safety

Separate read and write

Classifying or suggesting is different from acting. If both happen in the same system, they should be split deliberately. New model versions run against historical cases first, then in shadow mode alongside the live system, then live. Full access without sandboxing is a recurring root cause of the incidents that make the news.

Example

A classifying feature can present structured output, but does not send emails, mint IDs, or trigger payments. What it may do is in the contract.
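The split can be enforced outside the model, in the code that dispatches tool calls. The sketch below assumes each tool is tagged as read or write at registration; `ToolGate`, `Effect`, and `ToolNotAllowed` are hypothetical names, not part of any particular agent framework.

```python
from enum import Enum


class Effect(Enum):
    READ = "read"
    WRITE = "write"


class ToolNotAllowed(Exception):
    """Raised when a tool is unknown or its effect is not permitted."""


class ToolGate:
    """Dispatches tool calls and refuses effects the contract does not allow."""

    def __init__(self, allowed_effects):
        self._allowed = frozenset(allowed_effects)
        self._tools = {}

    def register(self, name, effect, fn):
        self._tools[name] = (effect, fn)

    def call(self, name, *args, **kwargs):
        if name not in self._tools:
            raise ToolNotAllowed(f"unknown tool: {name}")
        effect, fn = self._tools[name]
        if effect not in self._allowed:
            raise ToolNotAllowed(f"{name} needs {effect.value} access")
        # The check happens before the tool runs, so a refused call has no side effect.
        return fn(*args, **kwargs)
```

A classifying feature would get `ToolGate({Effect.READ})`. Even if the model asks for an email to be sent, the request fails at the gate rather than relying on the prompt to forbid it.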

Central artifact

The Agent Contract.

Where the four disciplines come together. A versioned definition that pins the AI feature down at its interfaces.

What lives in the contract

C1 Prompt
C2 Output schema
C3 Tool schemas and classification
C4 Provider configuration and routing
C5 Eval criteria
C6 Risk class

Engineering and domain experts work on the same artifact. The SDK fetches the approved contract and executes it locally. Governance stays central, execution stays decentralized, following the pattern familiar from Data Mesh and Self-Contained Systems.
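As a rough illustration, the six parts could be held in one immutable, versioned record with a content fingerprint, so that any change to prompt, schema, or routing is visible as a new version. The field layout and the `fingerprint` method are assumptions for this sketch; the text does not prescribe a storage format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AgentContract:  # hypothetical layout of the six contract parts
    name: str
    version: str
    prompt: str          # C1
    output_schema: dict  # C2
    tools: dict          # C3: tool name -> schema and classification
    provider: dict       # C4: provider configuration and routing
    eval_criteria: dict  # C5
    risk_class: str      # C6

    def fingerprint(self) -> str:
        """SHA-256 over a canonical JSON form of all fields."""
        canonical = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Logging the fingerprint with every call ties each answer to the exact contract that produced it, which makes a prompt change a visible deployment rather than an invisible one.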

Quality

The attributes stay. The answers shift.

ISO/IEC 25010 helps to map the field. But correctness becomes statistical, reliability is no longer just uptime, and performance turns into a budget question.

Functional Suitability

Correct is not a switch but a distribution. 95 percent right sounds great until the remaining five percent err systematically in one direction.
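A small sketch of how to see that direction: alongside overall accuracy, count which (expected, predicted) pairs make up the errors. The function name and the labels in the usage note are illustrative.

```python
from collections import Counter


def error_profile(pairs):
    """Return (accuracy, errors), where errors lists the
    (expected, predicted) pairs that went wrong, most frequent first."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no cases to evaluate")
    errors = Counter((exp, pred) for exp, pred in pairs if exp != pred)
    accuracy = 1 - sum(errors.values()) / len(pairs)
    return accuracy, errors.most_common()
```

If 95 of 100 cases are right and all five errors are high-priority requests classified as low, the error list contains a single pair, `("high", "low")`. The headline accuracy is the same as with five scattered mistakes; the operational risk is not.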

Reliability

Availability is only half the question. The other half is stable substantive quality across silent model updates, and controlled degradation when the provider fails.

Maintainability

Prompt, system instructions, model choice, retrieval config, schema, tool policies: many behavior-shaping artifacts, none of them code in the classical sense. A prompt change is an invisible deployment.

Security

Prompt injection has no classical equivalent. The model is not a security boundary. Permissions, approvals, and guardrails live outside the model. Add data protection and auditability across provider lines.

Performance Efficiency

Token cost is an architectural dimension. An agent in a loop burns money, not CPU. Latency runs in seconds, not milliseconds, reshaping UX patterns.
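One way to treat token cost as an architectural constraint is a hard budget around the agent loop. The sketch assumes a hypothetical `step` callable that performs one model or tool round trip and returns whether the task is done plus the tokens it consumed.

```python
class BudgetExceeded(RuntimeError):
    """Raised when an agent run exceeds its token or step budget."""


def run_agent(step, max_tokens: int, max_steps: int) -> int:
    """Run step() until it reports done; return the total tokens used.

    step is a callable returning (done, tokens_used) for one iteration.
    The token check runs after each step, so the overshoot is at most one step.
    """
    used = 0
    for _ in range(max_steps):
        done, tokens = step()
        used += tokens
        if used > max_tokens:
            raise BudgetExceeded(f"token budget exceeded: {used} > {max_tokens}")
        if done:
            return used
    raise BudgetExceeded(f"no result after {max_steps} steps")
```

Both limits matter: the token cap bounds spend per run, and the step cap catches loops that make cheap calls without converging.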

Compatibility

Model output must be consumable by downstream logic. Schemas and contracts are the boundary between probabilistic and deterministic. Every model swap demands revalidation.

Portability

Prompts are not portable one to one across models. A provider switch is a migration with full re-evaluation. Still worth designing for from day one, for cost and data sovereignty.

Usability

For enterprise AI features, usability is mostly about how human oversight actually works. Which evidence does the reviewer see? How fast can they correct?

Four tensions

Every decision pulls in two directions.

Quality attributes contradict each other. Architecture is the discipline of navigating these tensions deliberately, not resolving them.

Accuracy vs. cost

A bigger model raises quality and can easily double token cost. When does the smaller model suffice?

Safety vs. automation

More approvals lower both risk and throughput. At some point the AI feature is slower than the manual process it was meant to replace.

Portability vs. provider optimization

Abstraction reduces lock-in and often costs performance. Not every dependency must be removed, but each one should be entered into deliberately.

Observability vs. privacy

More logs help operations and raise compliance pressure. Log everything and you have a privacy problem. Log nothing and you have a quality problem.
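A middle path is to redact before logging. The sketch below masks email addresses with a deliberately crude regular expression and keeps only the redacted text plus its length; the pattern and the record layout are assumptions, and real deployments usually need broader detection (names, phone numbers, account IDs).

```python
import re

# Crude pattern for illustration; it will miss some addresses and over-match others.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def to_log_record(user_text: str) -> dict:
    """Build a log entry that keeps operational signal but masks email addresses."""
    return {
        "text_redacted": EMAIL.sub("[EMAIL]", user_text),
        "length": len(user_text),
    }
```

The redacted text still supports debugging of classification errors, while the raw address never reaches the log store.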

Governance

Eight teams build. No one steers.

What happens when tooling is missing

Many organizations now run AI features that appear on no org chart. They read customer data, classify requests, prepare decisions. What used to be Shadow IT shows up here as Shadow AI. The answer is not a ban. It is the same mix of decentralized ownership and centralized visibility that made microservices scale.

Who knows which AI features are running?

A feature registry that surfaces owners, providers, data flows, and eval status. Maintained by teams, visible centrally.
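A registry entry does not need much to be useful. The sketch below assumes one record per feature and a simple central check that flags features whose eval is missing, stale, or failing; the field names and the 30-day default are illustrative choices, not a standard.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class FeatureEntry:  # hypothetical registry record, maintained by the owning team
    name: str
    owner: str
    provider: str
    data_flows: list
    last_eval: Optional[date] = None
    last_eval_passed: bool = False


def needs_attention(registry, today: date, max_age_days: int = 30):
    """Return names of features with a missing, stale, or failing eval."""
    flagged = []
    for f in registry:
        stale = f.last_eval is None or (today - f.last_eval).days > max_age_days
        if stale or not f.last_eval_passed:
            flagged.append(f.name)
    return sorted(flagged)
```

Teams keep their own entries current; the central view only reads them and asks questions like this one.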

Who pays, and who sees the bill?

Central provider access, cost dashboards, and budget alerts. Without visibility, token bills appear only at month-end.

Who gets the call when a feature does the wrong thing?

A defined incident path with clear ownership, rollback rules, and a communication duty. Before the first incident, not after.

The EU AI Act will mandate parts of this visibility from August 2026. Starting now means building it for your own steerability, not just the compliance binder.