First, I try the answer cold, and I get an answer that’s specific, unsourced, and wrong. Then I try helping it with the primary source, and I get a different wrong answer with a list of sources that are indeed from the US Census, and the first link goes to the correct PDF… but the number is still wrong. Hmm. Let’s try giving it the actual PDF? Nope. Explaining exactly where in the PDF to look? Nope. Asking it to browse the web? Nope, nope, nope…

The problem here is not so much that the number is wrong, as that I have no way to know without doing all the work myself anyway. It might be right. A different prompt might be closer to being right. If I paid for Pro, that would perhaps be more likely to be right. But I don’t need an answer that’s perhaps more likely to be right, especially if I can’t tell. I need an answer that is right.

Of course, these models don’t do ‘right’. They are probabilistic, statistical systems that tell you what a good answer would probably look like. They are not deterministic systems that tell you what the answer is. They do not ‘know’ or ‘understand’ - they approximate. A ‘better’ model approximates more closely, and it may be dramatically better at one category of question than another (though we may not know why, or even understand what the categories are). But that still is not the same as providing a ‘correct’ answer - it is not the same as a model that ‘knows’ or ‘understands’ that it should find a column labeled 1980 and a row labeled ‘elevator operators’.

How and whether this changes, this year or this decade, is one part of the central debate about whether these models will keep scaling, and indeed about AGI, where the only thing we can say for sure is that we do not have a theoretical framework that can tell us. We don’t know. Maybe that ‘understanding’ will emerge spontaneously as the models scale. Maybe, like Zeno’s Paradoxes, the models will never reach the target but will still converge to be right 99.99% of the time, so it won’t necessarily matter if they ‘understand’. Maybe some other, unknown theoretical breakthrough or breakthroughs are needed. Maybe the ‘reasoning’ in OpenAI’s o3 is a path to solve this, and maybe not. Plenty of people have opinions, but so far, we don’t know. For the time being, ‘error rates’ (if that’s even the right way to think about this) are not a gap that will get closed with a bit more engineering, the way the iPhone got copy/paste or dial-up was replaced by broadband: as far as we know, they are a fundamental property of the technology.

This prompts a few kinds of question.

Narrowly, most of the people building companies with generative AI today, hoping to automate boring back-office processes inside big companies, are wrapping generative AI models as API calls inside traditional deterministic software. They’re managing the error rate (and the UX gap of chatbots themselves, which I’ve written about a lot elsewhere) with tooling, process, control and UX, and with pre-processing and post-processing. They’re putting the horse in harness and giving it blinkers and reins, because that’s the only way to get a predictable result.
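That harness pattern can be sketched in a few lines. Everything here is a hypothetical stand-in - the `llm_extract` function, the invoice task, the format and range rules - but it shows the shape: the probabilistic call sits in the middle, and deterministic pre- and post-processing decide whether its output is allowed through at all.

```python
import re
from typing import Optional

def llm_extract(document: str) -> str:
    # Hypothetical stand-in for a generative model API call: a real
    # system would ask an LLM to pull a total out of free text, and
    # it could return anything at all.
    return "The total appears to be $1,240.50."

def extract_invoice_total(document: str) -> Optional[float]:
    """Wrap the probabilistic model call in deterministic checks."""
    raw = llm_extract(document)
    # Post-processing: only accept output matching a strict format.
    match = re.search(r"\$(\d{1,7}(?:,\d{3})*\.\d{2})", raw)
    if not match:
        return None  # reject and route to a human - don't guess
    value = float(match.group(1).replace(",", ""))
    # Deterministic sanity check against known business rules.
    if not (0 < value < 1_000_000):
        return None
    return value

print(extract_invoice_total("…invoice text…"))
```

The model never touches the system of record directly: the deterministic wrapper either validates its answer into a typed value or rejects it, which is the ‘blinkers and reins’ in code form.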

However, it may be that as the models get better, they can go to the top of the stack. The LLM tells SAP what queries to run, and perhaps the user can see and validate what’s going on, but now you use the probabilistic system to control the deterministic system. This is one way to think about ‘agentic’ systems (which might be the Next Big Thing or might be forgotten in six months) - the LLM turns everything else into an API call. Which way around is better? Should you control the LLM within something predictable, or give the LLM predictable tools?
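The inverted arrangement looks something like the sketch below - again with a hypothetical `llm_plan` stub in place of a real model. The LLM sits on top and emits a structured tool call, and the deterministic layer underneath checks that call against an allow-list before anything runs:

```python
import json
import sqlite3

def llm_plan(request: str) -> str:
    # Hypothetical stand-in for the model: given a natural-language
    # request, it emits a structured tool call rather than an answer.
    return json.dumps({"tool": "run_query",
                       "sql": "SELECT SUM(amount) FROM orders"})

# Deterministic layer: the model may only invoke tools on this
# allow-list, and only with arguments that pass validation.
ALLOWED_TOOLS = {"run_query"}

def execute(plan_json: str, db: sqlite3.Connection):
    plan = json.loads(plan_json)          # must parse, or we stop here
    if plan.get("tool") not in ALLOWED_TOOLS:
        raise ValueError("model asked for an unknown tool")
    sql = plan["sql"]
    if not sql.lstrip().upper().startswith("SELECT"):
        raise ValueError("only read-only queries are allowed")
    return db.execute(sql).fetchone()[0]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (amount REAL)")
db.executemany("INSERT INTO orders VALUES (?)", [(10.0,), (32.5,)])
print(execute(llm_plan("what are total sales?"), db))
```

The probabilistic system decides *what* to do; the deterministic system decides *whether it’s allowed to*, and actually does it. The two sketches are mirror images of the same question in the paragraph above.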

This takes me to a second set of questions. The useful critique of my ‘elevator operator’ problem is not that I’m prompting it wrong or using the wrong version of the wrong model, but that I am in principle trying to use a non-deterministic system for a deterministic task. I’m trying to use an LLM as though it was SQL: it isn’t, and it’s bad at that. If you try my elevator question above on Claude, it tells you point-blank that this looks like a specific information retrieval question and that it will probably hallucinate, and refuses to try. This is turning a weakness into a strength: LLMs are very bad at knowing if they are wrong (a deterministic problem), but very good at knowing if they would probably be wrong (a probabilistic problem).

Part of the concept of ‘Disruption’ is that important new technologies tend to be bad at the things that matter to the previous generation of technology, but they do something else important instead. Asking if an LLM can do very specific and precise information retrieval might be like asking if an Apple II can match the uptime of a mainframe, or asking if you can build Photoshop inside Netscape. No, they can’t really do that, but that’s not the point and doesn’t mean they’re useless. They do something else, and that ‘something else’ matters more and pulls in all of the investment, innovation and company creation. Maybe, 20 years later, they can do the old thing too - maybe you can run a bank on PCs and build graphics software in a browser, eventually - but that’s not what matters at the beginning. They unlock something else.

What is that ‘something else’ for generative AI, though? How do you think conceptually about places where that error rate is a feature, not a bug?

Machine learning started working as image recognition, but it was much more than that, and it took a while to work out that the right way to think about it was as pattern recognition. You could philosophise for a long time about the ‘right way’ to think about what PCs, the web or mobile really were. What is that for generative AI? I don’t think anyone has really worked it out yet, but using it as a new set of API calls within traditional patterns of software feels like using the new thing to do the old things.

Meanwhile, there’s an old English joke about a Frenchman who says ‘that’s all very well in practice, but does it work in theory?’ You can spend too long philosophising about ‘what this really means’ and not enough time just going out and building and using things, and this is a chart of exactly that - everyone in Silicon Valley is building things with AI. Some of them will be wrong and many will be boring, but some of them will find the new thing.