Performance (Not) Guaranteed

A note on the importance of performance guarantees in AI deployment and the need for industry to pivot back to its traditional standards.

Mar 13, 2025

A Large Language Model (LLM) can produce coherent text in natural language, writing essays on any subject to which you set it, but it will hallucinate some of the content. A multimodal model can produce images or video of startlingly realistic quality, but the physical relations between objects and people will invariably lack grounding in reality. Self-driving vehicles can autonomously maneuver around unstructured environments, but the more unfamiliar and exotic the environment, the more likely they are to crash in inhuman ways or freeze in the middle of the road.

All these are examples of artificial intelligence (AI) models or the systems they underpin failing to provide performance guarantees; that is, reliability of performance sufficient to deploy without regular human intervention or oversight and consistent with the marketed purpose(s) of the product.

In part, these are technical problems. My (highly biased) opinion is that they are technical problems likely irresolvable, to any extent we would consider reasonable, within the Machine Learning paradigm. Some deep integration of neural and symbolic techniques in the architectures of such models seems necessary.

This is not, however, the issue at play here. The issue at play here is commercial, or industry-driven.

Why is it that the public and policymakers have a kind of intimate familiarity with these failure modes? Why do they pass around examples of them on social media, debate them, interact with them freely, and so forth?

The immediate answer is that consumers generally have access to models that lack performance guarantees. The more important answer is that industry has lowered its standards for the deployment of AI models over the course of the post-2012 Deep Learning Revolution, accelerating after 2022. It readily deploys models that fail to offer performance guarantees.

The general observation is not new. In his 2018 critical appraisal of Deep Learning, Gary Marcus explicitly noted that

it is comparatively easy to make systems that work in some limited set of circumstances (short term gain), but quite difficult to guarantee that they will work in alternative circumstances with novel data that may not resemble previous training data (long term debt, particularly if one system is used as an element in another larger system).

I disagree with Marcus on a fair amount, but I have trouble with the criticism he often gets that he ‘shifts goalposts’ and such—he is making this exact point to this day, emphasizing reliability over mere capability.

In October 2022, just a few weeks before OpenAI decided to dominate the news for the next two years, Don Monroe wrote favorably on neuro-symbolic techniques in Communications of the ACM. Why? Because neural networks not only “have well-known weaknesses,” but they also do “not provide the sorts of performance guarantees that are customary in computer science.”

Customary indeed. There was a point at which deploying models that hallucinate, fail, or crash and freeze would simply be considered below standard; not appropriate. The goal would have been to continue hammering out the technical problems until such a time that the model can offer those performance guarantees in deployment.

We are now over two years into the generative AI boom. Models routinely fail at their intended purposes. Euphoria over general-purpose models has neglected the fact that “general-purpose” might once have meant “can faithfully execute any task to which the model is set at an expected level of performance.” Yet GPT-based models, intended by the major companies as “general-purpose” models, do not offer this. You can set them to any task (assuming they can handle the kinds of data involved), but they will not faithfully execute it at an expected level of performance. They will and do fail.

The industry has thus shifted to capability gains, believing this more important in the immediate term. But these gains are misleading.

Consider how, when an Arizona State University research group tested o1-preview on a collection of planning problems (“Blocksworld”), they observed striking capability gains over past LLMs. All good, right?

Not quite. The same study reported the results of a classical (symbolic) planning system, Fast Downward, which achieved perfect scores across all Blocksworld planning tests (in a fraction of the time and compute, to boot). In other words: Fast Downward—and not o1-preview—offered performance guarantees.

Some are likely to argue that o1-preview is intended as a general-purpose system whereas Fast Downward is a mere planner. But this merely draws from the lowered standard. Is there really a meaningful sense in which both of these systems can “plan” on the relevant tests? If Fast Downward, which will guarantee the accuracy of its outputs, can engage in planning, then do we really wish to say that a separate model—whose score degrades quite startlingly as the problems expand—is also engaged in planning?

Perhaps some would say this. But it would be an odd choice. It lowers our standards for what we expect from AI.

The particular example holds over the general landscape. A vehicle could be called “autonomous” if it drives itself quite well in clean-cut and familiar situations but requires human oversight or intervention when encountering an unfamiliar situation—but is that really an autonomous vehicle? Pay attention to the specifics you come across, and you’ll notice applications (often drawing from nearly the same underlying technology) routinely employ these terms without offering the kinds of outputs one would expect of them.

All this is rather unlike technologies taken completely for granted. When you start the engine of your car, a series of controlled explosions unfolds under the hood, largely out of view and out of mind, save for some occasional car trouble. Most cars do not blow up unexpectedly. They offer performance guarantees, in more ways than one. The same holds for airplanes, boats, and any other system whose performance is both sensitive (relevant to human safety) and hinges on a multitude of sub-systems operating in tandem. When these systems do fail catastrophically, it is typically within a remarkably small margin of error and sometimes because of perverse regulatory or market incentives.[1]

The industry would be wise to return to its earlier standard of achievement, sooner rather than later, and recognize the utter importance of performance guarantees in deployment. I suspect, however, that we have to get through some additional AGI hoopla before that happens.

[1] You will note the difference in reactions across industries to catastrophic failures. People have high safety expectations in travel and the like. It is not clear whether they have the same scrutiny for digital technologies (yet).

On Making a Mind
