LinkedIn Post | LinkedIn Article
May 26, 2026
There’s a class of software system we don’t quite have a name for yet, where code (often LLM-generated and maintained) is combined with layered prompts, producing a mix of LLM and traditional outputs. The canonical example is the much-maligned AI customer service chatbot, but it spans nearly everything in the vast space between a one-off Claude chat and a stand-alone executable.
These systems make us uncomfortable for a lot of reasons. I want to focus on one that’s mostly hidden from view and surfaces the emergence of a profound gap in how most IT organizations approach architecture: these systems actively and necessarily lack reproducibility.
This is a real change. Decades of software operations have been built on consistent output — feed the system these inputs, get this set of outputs. That’s not true when LLMs play a significant role and, importantly, it’s not a defect to be engineered away. It’s a feature. These systems are not meant to be, should not be, and in the end cannot be, strictly reproducible.
In a well-designed system, yes, your outputs will be similar. But there is no guarantee they’ll be the same. The standard explanation points to a setting called temperature, which controls how the model selects its next token (roughly, its next word). Low temperature means it consistently picks the most probable option; higher settings mean it samples more broadly across the alternatives. Intuitively, an LLM with its temperature set to zero should be fully deterministic, producing the same output every time, while one with its temperature set to 1 should vary dramatically from response to response, a state sometimes loosely referred to as total non-determinism.
But that’s not what actually happens. Even with their temperature set to zero, LLMs produce non-deterministic results, though the variation is milder.
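To make the temperature intuition concrete, here is a minimal sketch of how a temperature setting is commonly applied when choosing the next token. The token list, the scores, and the function names are hypothetical and chosen only for illustration; real inference stacks are far more involved.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities; temperature must be > 0."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_next_token(tokens, logits, temperature, rng):
    """Choose one token: greedy at temperature 0, sampled otherwise."""
    if temperature == 0:
        return tokens[logits.index(max(logits))]
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]


# Hypothetical candidates and scores, for illustration only.
tokens = ["refund", "credit", "escalate"]
logits = [2.0, 1.5, 0.5]

rng = random.Random(0)  # seeded so the demo itself is repeatable
greedy_picks = {pick_next_token(tokens, logits, 0, rng) for _ in range(100)}
sampled_picks = {pick_next_token(tokens, logits, 1.0, rng) for _ in range(100)}

print("temperature 0:", greedy_picks)
print("temperature 1:", sampled_picks)
```

In this toy version, temperature zero is perfectly repeatable because the scores themselves never change. The variation seen in real deployments has to come from somewhere else in the stack.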
Some of this is easy to identify mathematically: floating-point addition is not associative, so depending on how work gets distributed at the hardware level, the same inputs can produce subtly different numbers. But that’s only part of the picture: the very structure of LLMs pulls towards non-determinism (for a more technical dive into this, see Further Reading, especially the He/Thinking Machines Lab piece).
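The floating-point point is easy to see in a few lines. The sketch below adds the same three numbers in two different orders and then uses the result to break a near-tie between two candidate tokens. The values and the tie-breaking rule are hypothetical; the explicit loop stands in for whatever order the hardware happens to use.

```python
def accumulate(values):
    """Add values strictly left to right, as a simple loop would."""
    total = 0.0
    for v in values:
        total += v
    return total


def greedy_pick(score_a, score_b):
    """Return 'A' only if its score strictly beats B's."""
    return "A" if score_a > score_b else "B"


contributions = [0.1, 0.2, 0.3]  # hypothetical partial scores for token A
score_b = 0.6                    # hypothetical fixed score for token B

forward = accumulate(contributions)
backward = accumulate(list(reversed(contributions)))

print(forward == backward)
print(greedy_pick(forward, score_b), greedy_pick(backward, score_b))
```

The two orders differ only in the last decimal place, yet that is enough to flip the greedy choice from one token to the other. Once one token differs, everything generated after it can differ too.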
Let’s take two quick examples of how non-determinism in these systems carries significant risk.
Say you work in development–the fundraising kind of development. You may obsess over every detail of your outbound communications, ensuring everything is absolutely perfect in your appeal letter before your system does its mail-merge magic. Or, perhaps you use those as a template, adding your personalized details about the target’s family or pets or profession to each and every one. In either case, how comfortable are you with the likelihood of an LLM producing slight variations from letter to letter? Misidentifying a family name can have real impact on fundraising — if you think it doesn’t, you haven’t spent much time in that world.
A more technical example: QA. Almost all software testing is dependent on reproducibility–given X inputs, does the system always produce Y output? But non-deterministic systems don’t behave this way: ask a well-trained chatbot to help with an invoicing issue and it may proceed along any one of a range of directions. Importantly, even the correct, positive-outcome results may vary. Which makes testing the chatbot a bit of a nightmare for traditional QA.
Traditional QA has depended on “verification”: a fixed answer to match against. Here, no single correct answer exists. Vitally, verification-driven QA doesn’t go away; it remains crucial for testing business logic and the nuts and bolts of prompt assembly. But entire ranges of the system now have to be evaluated through measurement, driving the rise of statistical evaluation against tolerance bands (ranging from simple rubric scoring to more complex measurements of semantic similarity and the like).
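As a rough illustration of what evaluation against a tolerance band can look like, the sketch below scores many responses with a simple rubric and checks the pass rate against a minimum, rather than asserting one exact string. Everything here is an assumption made for the example: `fake_chatbot` is a stand-in for a real model call, and the rubric, invoice number, and thresholds are invented.

```python
import random


def fake_chatbot(prompt, rng):
    """Stand-in for an LLM call (hypothetical): returns one of several phrasings."""
    return rng.choice([
        "I've reissued invoice 1042 and emailed you a copy.",
        "Invoice 1042 has been corrected; a new copy is on its way.",
        "Your invoice 1042 was resent. Anything else?",
        "I'm not sure which invoice you mean.",
    ])


def rubric_score(response):
    """Toy rubric: fraction of required elements present in the response."""
    checks = [
        "1042" in response,  # names the right invoice
        any(w in response.lower() for w in ("reissued", "corrected", "resent")),
    ]
    return sum(checks) / len(checks)


def evaluate(prompt, runs, pass_threshold, min_pass_rate, seed=0):
    """Sample many responses and compare the pass rate to a tolerance band."""
    rng = random.Random(seed)
    scores = [rubric_score(fake_chatbot(prompt, rng)) for _ in range(runs)]
    pass_rate = sum(s >= pass_threshold for s in scores) / runs
    return {"pass_rate": pass_rate, "within_band": pass_rate >= min_pass_rate}


report = evaluate(
    "Please resend invoice 1042",
    runs=200,
    pass_threshold=1.0,
    min_pass_rate=0.6,
)
print(report)
```

Three differently worded responses all count as passes here, which an exact-match assertion could never allow. The trade-off is that the result is a rate with a threshold, so someone has to own the choice of rubric and band.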
The industry has responded, of course, most strongly by insisting that it’s actually an entirely new discipline. EvalOps is emerging as the term of choice, and it requires more people and, because it’s 2026, more LLM-driven systems. The incentive confusion is clear: the problems are being solved by the same vendors creating them, and the solution requires you to buy more of their products and more of their implementations. Furthermore, the markers of operational success for the newly minted EvalOps platforms are being defined by the same organizations selling those products. Unsurprisingly, they score well against those metrics. This hits hardest in QA: historically, this is the discipline specifically tasked with understanding how a system behaves for-realsies, as opposed to how it was designed to behave. Burying hidden knowledge here is especially dangerous.
What the QA example shows is that non-determinism doesn’t sit in the outer layers of a system: it’s far more fundamental, existing in its core composition. This is not an output problem, and it’s not a process failure. It’s an architectural challenge. Until it’s treated that way—with the same seriousness as data architecture, security boundaries, or audit posture—it will continue to surface as a series of mysterious operational issues that nobody quite owns.
The QA/EvalOps trajectory is the early warning. An entire discipline is being rebuilt around a property of these systems that most organizations haven’t yet acknowledged exists, sold back to them by the same vendors, measured by metrics those vendors defined.
AI hasn’t reduced what we need from technical architecture. It has expanded it. There’s a new domain sitting alongside and intertwined with everything architects already own, with its own skills, its own standards, and its own risks. Organizations that treat AI adoption as a tooling question rather than an architectural one are accumulating that risk invisibly. The bill will come due — the only open question is whether it arrives as a slow drag on quality or as something more abrupt.
Further Reading
He, Horace and Thinking Machines Lab, “Defeating Nondeterminism in LLM Inference”, Thinking Machines Lab: Connectionism, Sep 2025.
https://martynassubonis.substack.com/p/zero-temperature-randomness-in-llms