Software for the Generative Age: From precision to probability

As teams shift from predictable systems to probabilistic AI, Steve Ly, our Director of Software Development, explores how this evolution is reshaping testing and service delivery—and why embracing uncertainty may unlock AI's full potential.

For decades, good software meant predictable software. Press the same button, get the same result. This principle, called determinism, was the foundation of everything from banking systems to government services. It meant that given identical inputs, a system would always return identical outputs. It was predictable, testable, and auditable.

But generative AI has flipped this on its head.

Today’s generative AI systems don’t just process information; they create it. And they create something different each time. A government chatbot might explain the same policy in three different ways to three different citizens. An AI assistant might suggest different next steps for identical situations. 

This shift from precision to probability is fundamentally changing how we build software. It’s especially critical for organizations that serve the public, where consistency, fairness, and accountability are requirements.

Determinism: The foundation of traditional software

Until recently, deterministic systems were the gold standard. In a deterministic environment, identical inputs always produce identical outputs. This consistency was essential in high-stakes domains like financial transactions, licensing systems, health services, and government operations.

Consider a simple government form: submit your information, and you get either approved or denied based on clear, unchanging rules. The same application with the same information will always yield the same result. This predictability allowed teams to write reliable unit tests, manage software delivery pipelines, and define what “working correctly” actually meant.

Generative AI and the rise of non-determinism

Generative AI works differently. Instead of following fixed rules, these systems predict what comes next based on probabilities learned from massive datasets.

Ask an AI-powered help system “how do I report a problem?” and you might get a step-by-step numbered list. Ask the same question again, and you might get a conversational explanation with examples. Both answers could be perfectly correct and helpful, just expressed differently.

This variation happens because of several factors:

  • Sampling techniques that control how “creative” or focused the responses should be
  • Context sensitivity, where previous parts of the conversation subtly influence what comes next
  • Prompt phrasing, where changes in wording can lead to noticeably different outputs
  • Infrastructure variations in how the model runs, creating slight differences even with identical inputs

This variability is by design: non-determinism is what gives generative systems their adaptability and expressiveness.

Instead of one “correct” answer, these systems return many acceptable answers, each shaped by probabilities. When the model generates text, it’s essentially choosing from a weighted menu of possibilities. Maybe there’s a 60% chance it picks one phrase, 20% for another, with the remainder distributed across other options.
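
To make the "weighted menu" concrete, here is a minimal sketch in Python. The response options and probabilities are invented for illustration; a real model chooses token by token over a vocabulary of thousands of possibilities, but the principle is the same.

  import random

  # Hypothetical response options with invented probabilities;
  # a real model chooses token by token over a much larger vocabulary.
  options = [
      ("To report a problem, follow these steps: ...", 0.60),
      ("You can report a problem online or by phone. ...", 0.20),
      ("Sure, the quickest way to report a problem is ...", 0.15),
      ("Here's how to report a problem: ...", 0.05),
  ]

  phrases = [text for text, _ in options]
  weights = [weight for _, weight in options]

  # Each call may return a different, equally valid response.
  print(random.choices(phrases, weights=weights, k=1)[0])

Run it a few times and you get different but acceptable answers, which is exactly the behaviour described above.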

This probabilistic approach mirrors how humans actually communicate. We don’t give identical responses to identical questions. We adapt based on context, mood, and what we think will be most helpful. Increasingly, users expect this same level of personalization from digital services.

A paradigm shift in “correctness”

Traditional services were beautifully binary. A function either worked or it didn’t. A test either passed or failed. You could point to a specific expected output and say, “that’s correct” or “this is wrong.”

But what does “correct” mean when an AI system can give multiple valid responses to the same question?


Consider a citizen asking about tax deadlines. A generative system might respond:

  • “The deadline is April 30.”
  • “You have until April 30 to file your return.”
  • “Tax returns are due by April 30, but you can request an extension.”
  • “The filing deadline is April 30. Need help with your return?”

Which one is “correct”? All of them. They’re factually accurate, appropriately helpful, and contextually relevant. The difference is in tone, detail level, and anticipation of follow-up needs.

This fundamental change in how we think about “correct” outputs requires us to redefine success. Instead of asking “did this output match exactly?” we ask:

  • Is the answer reasonable and on-topic?
  • Does it meet the user’s needs?
  • Is it factually grounded or appropriately hedged?
  • Is it aligned with the service’s tone, policy, and constraints?
  • Could a human expert look at this and say “Yes, that’s helpful”?

Quality becomes something you evaluate rather than assert, making testing much more complex.
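
In practice, "evaluate rather than assert" often starts with simple, deterministic checks on facts and constraints before any more sophisticated scoring. Here is a minimal sketch in Python; the required fact and banned phrases are invented for the tax deadline example above.

  # Minimal sketch: deterministic checks that any acceptable answer must pass.
  # The required fact and banned phrases are invented for illustration.
  REQUIRED_FACT = "April 30"
  BANNED_PHRASES = ["guaranteed refund", "legal advice"]

  def passes_basic_checks(response: str) -> bool:
      has_fact = REQUIRED_FACT in response
      is_clean = not any(p in response.lower() for p in BANNED_PHRASES)
      is_reasonable_length = 10 <= len(response) <= 500
      return has_fact and is_clean and is_reasonable_length

  for answer in [
      "The deadline is April 30.",
      "You have until April 30 to file your return.",
      "File whenever you like.",  # fails: drops the key fact
  ]:
      print(passes_basic_checks(answer), "-", answer)

All of the valid phrasings above pass, while an answer that drops the key fact fails, even though no single "expected output" is ever asserted.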

Even when organizations attempt to eliminate this variability by reducing randomness settings, the non-deterministic nature persists. Recent studies show that even when these settings are dialed to maximum predictability, large models still produce different outputs up to 15% of the time, especially for complex tasks.

Why this matters for digital services

For public sector organizations, this shift opens up entirely new ways to serve citizens. Consider what becomes possible:

  • Dynamic translation of complex legal policies into plain language, automatically adjusted for different reading levels
  • Multilingual guidance that doesn’t just translate words, but adapts cultural context and regulatory nuances
  • Personalized pathways where citizens describe their situation in their own words, rather than navigating through predetermined dropdown menus
  • Contextual help that understands not just what someone is asking, but why they’re asking it

These capabilities can make government services more human and accessible. But they also introduce new challenges that public sector organizations can’t ignore.

  • Accountability: If the same citizen question produces two slightly different answers, how do you explain that variance? Can your system validate that both responses are appropriate? Should a human review them? How do you log these interactions for auditing purposes?
  • Fairness: When an AI system gives personalized responses, how do you ensure it’s not inadvertently providing better service to some citizens than others? How do you maintain equity when the system itself introduces variation?
  • Trust: Citizens need to trust that government systems are working in their best interest. When responses can vary, how do you maintain that trust while explaining why the variation exists?

These aren’t just engineering challenges. They strike at the core of what it means to provide reliable, equitable public services in an age of probabilistic systems.

Getting the best of both worlds

Here’s the good news: you don’t have to choose between deterministic reliability and generative flexibility. The most effective approach combines both.

Deterministic systems still handle the critical, high-stakes logic. Your eligibility calculations, payment processing, and compliance checks remain predictable and auditable. But around this reliable core, you can layer generative tools that provide flexibility, guidance, and better user experiences.
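
As a rough sketch of that division of labour: in the example below, the decision comes from fixed rules while the model is only asked to draft the explanation. The rule, function names, and the generate_text callable are placeholders, not a specific product or API.

  # Sketch: a deterministic core with a generative layer around it.
  def is_eligible(age: int, income: float) -> bool:
      # Fixed, auditable business rule; never delegated to the model.
      return age >= 65 and income < 40_000

  def explain_decision(eligible: bool, generate_text) -> str:
      # generate_text stands in for whatever model API you actually use.
      prompt = (
          "Explain in plain language that the applicant "
          f"{'is' if eligible else 'is not'} eligible for this benefit."
      )
      return generate_text(prompt)

  decision = is_eligible(age=70, income=32_000)  # deterministic, testable
  message = explain_decision(decision, generate_text=lambda p: f"[draft: {p}]")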

Arizona’s Department of Child Safety, for example, uses AI to automatically classify documents, prefill forms from uploaded files, and provide policy guidance to case workers. But core child welfare decisions remain entirely with trained human case workers following established protocols. The AI handles the time-consuming paperwork and research, giving case workers more time and resources to make better-informed decisions.

What this means for the creation of citizen services

This shift affects how teams approach every stage of building digital services, requiring new mindsets and new methods.

For design: You’re no longer designing for a single, predictable user flow. Instead, you’re designing systems that can gracefully handle multiple valid outcomes. This means focusing on usefulness over uniformity, building in context and recovery options, and always providing fallbacks when the AI doesn’t quite hit the mark.

For development: You’re building hybrid architectures that layer generative components within deterministic boundaries. This requires new approaches to logging, filtering, and validation to ensure everything remains traceable and auditable.
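
One way to keep that traceability is to wrap every generative call in deterministic logging and validation, with a safe fallback when the output doesn’t pass. A simplified sketch, where generate_text and validate are hypothetical placeholders for your own model call and validation rules:

  import json
  import logging
  import uuid
  from datetime import datetime, timezone

  logging.basicConfig(level=logging.INFO)

  FALLBACK = ("We couldn't generate a tailored answer just now. "
              "Please see the policy page or contact support.")

  def guarded_generate(prompt: str, generate_text, validate) -> str:
      # generate_text and validate are placeholders for your model call
      # and your own validation rules (facts, tone, banned content).
      response = generate_text(prompt)
      record = {
          "id": str(uuid.uuid4()),
          "timestamp": datetime.now(timezone.utc).isoformat(),
          "prompt": prompt,
          "response": response,
          "passed_validation": bool(validate(response)),
      }
      logging.info(json.dumps(record))  # audit trail for every interaction
      return response if record["passed_validation"] else FALLBACK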

For testing: The days of simple “expected output vs. actual output” comparisons are over. You need new methods like semantic similarity scoring, confidence thresholds, and sometimes human evaluation to determine if an AI response is “good enough.”
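
Semantic similarity scoring is one such method: instead of comparing strings exactly, you compare meanings and accept anything above a threshold. A sketch using the open-source sentence-transformers library, where the model name and the 0.7 threshold are illustrative choices rather than recommendations:

  from sentence_transformers import SentenceTransformer, util

  model = SentenceTransformer("all-MiniLM-L6-v2")

  reference = "Tax returns are due by April 30."
  candidates = [
      "You have until April 30 to file your return.",
      "The deadline is April 30. Need help with your return?",
      "Taxes are fun and optional.",
  ]

  ref_emb = model.encode(reference, convert_to_tensor=True)
  for text in candidates:
      score = util.cos_sim(ref_emb, model.encode(text, convert_to_tensor=True)).item()
      print(f"{score:.2f}  {'PASS' if score >= 0.7 else 'REVIEW'}  {text}")

Anything that falls below the threshold isn’t automatically rejected; it’s flagged for the kind of human review mentioned above.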

What unites these approaches? Treating uncertainty as a design constraint, not a problem to solve.

Embracing the probabilistic future

If we want to integrate generative AI into digital services, we have to accept that predictability and adaptability must coexist. Deterministic systems will continue to underpin mission-critical logic, but around them, interfaces, guidance tools, and assistants can generate, interpret, and respond dynamically.

This fundamental shift from precision to probability is changing how we build systems and redefining what “working correctly” means in the first place. It requires new skills, new safeguards, and a new definition of success. We’re moving from asking “did the system give the correct answer?” to “did it help the user move forward safely and effectively?”

The organizations that will thrive in this new landscape are those that learn to design for uncertainty while maintaining the trust and reliability their users depend on. The key is to strike a balance between predictability and adaptability.