The State of AI – March 2024
On Claude 3's capabilities, Gemini's alignment, AI market opportunity, open vs closed-source models, and more...
About a year ago, OpenAI announced GPT-4. This foundation model stunned the world and, most would argue, held the spot as the best commercially available LLM until the announcement of Claude 3 last week.
During this time, we’ve seen massive movement in the AI industry across the entire stack. Nvidia is gradually facing competition from AMD’s MI300X chip, Google’s home-grown TPUs, and startups like Groq and Cerebras. There are now foundation model alternatives to GPT-4, including Gemini Ultra, Claude 3, Mistral Large, and more expected soon. The code generation space is no longer entirely monopolized by GitHub Copilot, with Codeium, Cursor, and other startups making significant progress.
Now that we are a bit further into the AI revolution, the bits and pieces are starting to fall into place, so I figured it would be a good time to take stock and make some predictions for where I think things might be headed.
1. The consumer AI space will be winner-take-all
I define the consumer AI space as ChatGPT Plus, Gemini Advanced, Claude Pro, and all the other monthly subscriptions, and I believe that this market will be very big. For comparison, Netflix, Spotify, and Amazon Prime all have roughly one quarter billion subscribers. Suppose the AI assistants market will grow to roughly that size over time, and that cost stays in the $10-20 per month range. This amounts to $30-60b in revenue per year. For comparison, the net income of Apple in 2023 was roughly $100b, for Microsoft and Google roughly $75b, and for Meta roughly $40b, and these are all trillion dollar companies. If the consumer AI space can maintain healthy margins, it could literally be a trillion dollar opportunity.
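To spell out that back-of-envelope arithmetic (the subscriber count and price range are assumptions, not forecasts):

```python
# Back-of-envelope consumer AI market sizing.
# Assumptions: ~250M subscribers (Netflix/Spotify/Prime scale), $10-20 per month.
subscribers = 250e6
for monthly_price in (10, 20):
    annual_revenue = subscribers * monthly_price * 12
    print(f"${monthly_price}/month -> ${annual_revenue / 1e9:.0f}B per year")
# $10/month -> $30B per year
# $20/month -> $60B per year
```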
Of course, it is not yet clear whether the future of consumer AI will be a paid subscription or a free good-enough version. Most people I’ve talked to still do not pay for ChatGPT Plus, let alone any of the more recent alternatives such as Gemini Advanced or Claude Pro. However, everyone I know who pays for one of these subscriptions literally cannot imagine life without it. People who work in the AI space are trying all of the alternatives but feel overwhelmed. It seems obvious to me that people will continue to want an AI assistant for the common use cases we’ve seen, such as question answering and research, and it feels almost predetermined that this space, just like search and social media, will converge towards a single winner.
My sense is that either OpenAI, Google, or Apple should win this space if they execute properly. The OpenAI case is straightforward – ChatGPT is a word that has already found its place in the common vocabulary, and most people I know who pay for an AI assistant pay for ChatGPT Plus and nothing else. The cases for Google and Apple are different but also quite obvious – they can integrate directly into your devices, calendars, emails, and search, and have a good chance of winning by virtue of their existing distribution channels.
2. Open-source is falling behind
A year ago when the LLM revolution was taking off, there was significant uncertainty about how open-source vs closed-source models would play out. Some felt it was unlikely that open-source would be able to compete with closed-source labs, effectively giving away for free a technology that costs significant amounts of money to produce, while others felt that the crowd-sourcing of wisdom and work of the masses would allow open-source models to catch up to and surpass those of closed labs, and pointed to other developer tools where open-source won out.
Roughly one year has passed, and my impression is that open-source is falling behind. In particular, it is falling behind in precisely the ways people were worried about a year ago – the open-source community has produced tremendous advances in fine-tuning and other post-training techniques, but has only been able to apply them to small models on the scale of 7B parameters or fewer. There are some larger open-source models like Llama 2 70B and Mixtral 8x7B, but these are effectively generous donations by large research labs and have not been reproduced by smaller players. Furthermore, it does not appear that the open-source community yet has the compute capacity to significantly improve on these larger models the way it has improved on the smaller ones. Even if it did, these models are a significant step behind the frontier of GPT-4, Claude 3, and Gemini Ultra.
Nothing surprising has played out here so far – GPU, data, and talent costs are prohibitively high right now, and there was never a clear economic argument for how a company could survive while giving away their product for free. Meta AI doesn’t need to make a profit and so they will probably continue to hand out Llamas for free, but Mistral and Musk’s xAI have grokked the economics behind LLMs, and have since late last year stopped open-sourcing their best models.
The only counterargument to the above, which should bring hope to the open-source community, is that open-sourcing models has proven to be an incredibly powerful marketing tool for tier-2 research labs. One of the best models on HuggingFace today is Qwen 72B, a model produced by Alibaba Cloud. Prior to this release, I as well as many of my friends in the AI space had never heard of Alibaba’s AI research lab, but with it they’ve proven to be quite a formidable research team. The same dynamic played out with Mistral AI a few months earlier. The marketing value of open-sourcing models means that we are likely to continue to see improvements in open-source models, even though the economics of giving away your product for free don’t make sense at first glance.
Unfortunately, I don’t think that the dynamic above is going to be sufficient, and I expect the gap between open and closed source models to continue to grow. The reason for this ties into the next section – the cost of these models is getting serious.
3. On the scale and cost of foundation models
AI research labs no longer publish much research, so it is much harder to get a sense of what it takes to produce the models that have come out over the past few months, but my takeaway from the bits and pieces that are public is that the scale and cost of these models are starting to get really serious.
A few example points:
Google DeepMind mentions in their Gemini paper that their most advanced model was trained across data centers. Gemini Ultra was not trained in a single massive Google-sized data center; it was trained across multiple of them.
It was revealed that Reddit had come to an agreement with Google and other unnamed AI labs to sell access to their data for $60 million per year. These labs are scraping the internet for all the data they can find, and this does not come cheap.
Over the course of 2023 Anthropic raised $750 million and came to an agreement with Google to raise another $2 billion over time. It is likely that a significant amount of this money went towards training Claude 3. Similarly, Musk’s xAI is supposedly looking to raise $6 billion on a $20 billion valuation, whereas it was previously rumored to only be looking to raise $1 billion. xAI currently only has around 20 employees, so most of the money is not going towards labor cost. Presumably most of it is going towards compute.
Meta AI is building out a 350,000 H100 GPU data center this year, with each H100 going for around $30k, which adds up to $10 billion or so in H100 GPU costs alone.
Altogether, it looks like the industry will spend tens if not hundreds of billions of dollars over the next few years, most of it going first to compute and second to data. Amusingly, if we suppose that compute is the bottleneck to scaling AI further, that the total expenditure over the next few years is around $100 billion, and that the AI labs want this number to grow 100x, then we suddenly find ourselves surpassing Sam Altman’s $7 trillion number. Perhaps he knew all along.
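For what it’s worth, here is the rough arithmetic behind those two numbers (all inputs are the estimates and rumors quoted above, not official figures):

```python
# Meta's reported H100 build-out, at a rumored ~$30k per GPU.
h100_count = 350_000
h100_unit_cost = 30_000
print(f"H100 capex: ~${h100_count * h100_unit_cost / 1e9:.1f}B")   # ~$10.5B

# If total industry spend is ~$100B and the labs want to scale that 100x...
industry_spend = 100e9
print(f"100x scale-up: ~${industry_spend * 100 / 1e12:.0f}T")      # ~$10T, past the $7T figure
```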
4. Frontier model capabilities are hard to evaluate
When ChatGPT first came out, people were stunned that it could solve LeetCode-style coding problems, answer riddles, and call functions given a proper interface. Back in those early days, evaluation was easy, because the capabilities of those models were simply not that extensive. We had not yet saturated MMLU (undergraduate-level academic knowledge questions), GSM8K (grade-school math problems), HellaSwag (commonsense inference), and many other beloved benchmarks.
With the release of Claude 3 and its stunning display of capabilities, we are entering a very interesting era where the average human being in many ways is insufficiently intelligent to properly evaluate frontier foundation models. Now, this does not mean that GPT-4 and Claude 3 are AGI-level models (there are still significant capabilities gaps) but merely that these models now have capabilities in some domains that significantly surpass those of average humans.
One particularly relevant example is the new GPQA benchmark (Google-Proof Question Answering), which asks PhD-level questions that are sufficiently difficult that non-PhDs with access to Google and significant time to think are unable to answer them. On the diamond set containing the highest quality questions, highly skilled non-expert humans with access to Google score around 22%, GPT-4 scores 36%, Claude 3 scores 50%, and PhD experts score 81%. In other words, current frontier model capabilities sit somewhere between skilled and expert humans, which means that we now need PhDs to evaluate them on scientific question-answering tasks.
I have two takeaways from this observation. First of all, model intelligence is no longer the bottleneck for most LLM applications, since a vast portion of cognitive work in the modern economy does not depend on PhD-level expertise. Rather, context, reasoning, consistency, style, cost, latency, and other factors become more important. The risk here is that AI research labs overly focus on the aspects of models that are easy to evaluate, such as these PhD-level science questions, rather than on other aspects that are harder to evaluate but more important for producing downstream economic value (Goodhart’s law). Or, alternatively, that we in due time discover that LLMs’ strengths lie in information compression, and that other capabilities like reasoning and planning do not emerge or grow as rapidly when we scale up models further. This will be very interesting and important to look out for over the next few years.
The second takeaway is that, in my opinion, running out of data no longer appears to be as big of a problem as it’s been made out to be. The point of collecting more data is for the model to learn more kinds of facts, reasoning, etc., but more data doesn’t help if all the data is low quality or sampled from the same domain. In a year or two, once frontier models solve GPQA and reach PhD-level capabilities across scientific domains, it becomes far less clear what value more internet-quality data contributes to these models.
More likely, the big AI labs will pivot to generating their own custom datasets by collaborating with experts, universities, and other leading institutions, as well as training in settings unbounded by the quality of existing data (e.g. it’s been rumored that Claude 3 was extensively trained using reinforcement learning). Simulations, sandboxed code environments, and other data generation approaches will likely make up a growing portion of the training data going into these models over time.
One big question here is how this affects scaling laws – the work that has been published by OpenAI and Google DeepMind only covers standard hyper-parameters like model size, token count, expert count, and some RLHF-related ones. These existing laws clearly don’t hold in the speculated new regime where data quality is significantly higher and quantity significantly lower. The labs presumably are aware of this and are planning accordingly, but since they have not published anything on the topic, we the public will have to come to terms with having incomplete foresight into how scaling laws will play out in the coming few years.
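For reference, the published scaling laws in question take a simple parametric form. Below is a minimal sketch of the Chinchilla-style loss estimate; the constants are the fits reported by Hoffmann et al. (2022) and should be treated as illustrative, and, as argued above, nothing in this form says anything about data quality:

```python
# Chinchilla-style scaling law: L(N, D) = E + A / N**alpha + B / D**beta,
# where N is parameter count and D is the number of training tokens.
# Constants are the published fits from Hoffmann et al. (2022).
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def predicted_loss(n_params: float, n_tokens: float) -> float:
    return E + A / n_params**alpha + B / n_tokens**beta

# e.g. a 70B-parameter model trained on 1.4T tokens (roughly Chinchilla itself):
print(round(predicted_loss(70e9, 1.4e12), 3))   # ~1.94
```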
5. Google’s Gemini fiasco is concerning for alignment
As many of you know, Google DeepMind’s Gemini model was recently roasted for injecting diversity into generated images in situations where it made no sense.
I’m not going to comment on the cultural forces or the leadership dynamics at Google that led to this outcome. Instead, I want to highlight that this demonstrated a terrible failure on Google DeepMind’s part to properly align their AI models, and although no harm was caused by Gemini generating diverse Nazis, it is rather discomforting when thinking ahead to the impact that future models will have, given how fast the technology is progressing.
My concern is that Google DeepMind is an absolutely outstanding team, amongst the very best in the world, with near-endless resources, and despite all of this it failed to align the model to their worldview. Some may object and claim that the model indeed was aligned – that Google wanted the model to be excessively pro-diversity – but I would push back. No human being who is supportive of diversity (as far as I know) would ever consider depicting diverse Nazis, because it makes no sense once you think about the underlying motivation for why diversity is important.
Humans who support diversity do so because diverse viewpoints lead to better outcomes (or, as they might say more bluntly, a room full of old white men probably doesn’t fully understand the preferences of minorities, women, LGBTQ+ people, etc.), and because it helps break down stereotypes around how only certain kinds of people are expected to serve certain roles in society. The motivation behind the “diversity intervention” in foundation models is that, relative to a diversity-egalitarian society, our historical data is skewed towards a distribution that is suboptimal due to historical factors, and thus the models need to be recalibrated post-training to correct for this deficit.
Diverse Nazis do not contribute to either of these motivating reasons. The fact that Gemini proactively generated such images means that it failed to properly grasp the values and principles it had been trained to obey. It did not understand the sociopolitical dynamics that led to diversity being such a critical priority for the Google team, and as such failed to responsibly promote diversity once deployed in the real world. This, despite the fact that Google DeepMind is one of the best AI research labs in the world, and that diversity clearly is one of the most important values for the Google leadership team.
Why was Gemini improperly aligned – why did it fail to capture the underlying dynamics of how to be pro-diversity? My guess is that it has to do with spurious correlations and an insufficiently intelligent base model. There’s a good amount of research showing how smaller models are susceptible to bias due to spurious correlations (search “transformers text spurious correlation” on Google Scholar for examples), where, for example, the presence of words like “Spielberg” can trick a model into thinking that a movie review is positive even when it is not. As we’ve scaled up models and their capabilities have improved, this has become less of an issue, but it is still something we see from time to time.
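To make the spurious-correlation point concrete, here is a toy sketch of my own (not taken from any of the papers above): a tiny bag-of-words sentiment classifier assigns a clearly positive weight to “spielberg” simply because the word only ever shows up in positive training reviews.

```python
# Toy illustration of a spurious correlation in text classification.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

train_texts = [
    "a spielberg film, truly wonderful and moving",        # positive
    "spielberg delivers great acting and a great story",   # positive
    "wonderful story, great direction",                    # positive
    "boring plot and terrible acting",                     # negative
    "a dull, terrible mess of a movie",                    # negative
    "boring, dull and forgettable",                        # negative
]
train_labels = [1, 1, 1, 0, 0, 0]

vec = CountVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(train_texts), train_labels)

# "spielberg" carries no sentiment, yet it learns a positive weight because it
# co-occurs exclusively with positive labels in the training data.
weights = dict(zip(vec.get_feature_names_out(), clf.coef_[0]))
print(f"weight('spielberg') = {weights['spielberg']:+.2f}")
print(f"weight('wonderful') = {weights['wonderful']:+.2f}")
```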
In this case, I think Gemini Pro was likely a smaller model (relative to GPT-4 and Gemini Ultra) that lacked a deep understanding of the world, and so if the post-training process lacked explanations for why promoting diversity is good and examples of when to abstain from promoting it, it makes sense that the model would blindly produce diverse images in all situations. It’s also not clear how much of this behavior was due to the system prompt versus behavior actually ingrained into the model during training, but in either case the concern remains – for all we know, the system prompt did not explicitly state that the model should present people as diverse “in all circumstances with no exceptions”, and so you would expect a reasonably intelligent system with common sense to behave better than Gemini did.
A question this brings up for me is how this will change as model capabilities continue to grow. A more intelligent base model will probably have a better understanding of the underlying motivations behind diverse representation, and as such should avoid making the mistakes Gemini made, but the underlying problem persists. As we strive to align models to our values, there will be other deeply complex issues that the models will be insufficiently capable of understanding. All we can hope for is that by then we will have figured out how to make the models more aware of their limitations, such that they don’t make harmful decisions that they think are aligned with our goals and values.
6. Applications – performance, cost, and integration
Based on what I can see on the internet and from conversations with friends, it appears that certain LLM use cases have taken off much faster than others. I’m sure this will change as capabilities improve and new applications come out, but for now, some of the most common use cases I’ve seen are:
Research (recommendations, explain concepts)
Role-play (AI friends/relationships, therapy)
Writing (homework, articles)
Coding (either through Copilot or ChatGPT)
There are some applications and use cases I’ve heard a ton of people ask for that do not yet exist or do not yet work very well, such as scheduling/AI executive assistants, automating call center workers (something every friend working at a B2C startup asks for), better coding (e.g. AI writing complete pull requests), and lots of niche B2B paperwork tasks.
Amusingly, the current limitations also explain the most common current use cases:
LLMs’ greatest strength right now is memorizing an internet-scale number of facts. This makes recommendations and research a great use case.
LLMs hallucinate and make mistakes. When talking to someone/something, mistakes are common and not a significant issue.
LLMs have seen millions of essays and articles. If you have an idea but need help writing it, they can help you get the structure right. They do not yet have the creativity or critical thinking to write original essays by themselves.
A large portion of programming (not software engineering) is just boilerplate – figuring out the right functions to call and what data to pass to them. LLMs handle this wonderfully.
What’s holding LLMs back from doing more? Three things – performance, cost, and integration.
Performance is obvious. Any builder who has spent time with OpenAI’s GPT-4 API will have stories about how it will forget a rule repeated multiple times in the prompt, make basic reasoning mistakes, and generally lack the common sense to perform human tasks without having every detail specified in the prompt, and even then find ways to make dumb mistakes. The good news here is that models have continued to get better over time and there are no signs of this trend slowing down.
Cost is also obvious. The Claude 3 release was amazing, but the cost is still prohibitive at $15 per million prompt tokens and $75 per million generated tokens. Most real tasks in the enterprise world, based on my experiences and conversations with a range of people, require tens of thousands of tokens in the prompt and hundreds if not thousands of generated tokens. $1 to automate a single task – not counting development, cloud, and so on – is for many tasks even more expensive than human labor. The good news is, once again, that costs have been coming down, and this trend does not look to be slowing down.
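As a concrete sketch of that per-task cost (the token counts below are my illustrative assumptions, not measurements):

```python
# Rough per-task cost at Claude 3 Opus list prices:
# $15 per 1M input tokens, $75 per 1M output tokens.
prompt_tokens = 50_000    # "tens of thousands" of prompt tokens
output_tokens = 2_000     # "hundreds if not thousands" of generated tokens

cost = prompt_tokens * (15 / 1e6) + output_tokens * (75 / 1e6)
print(f"~${cost:.2f} per task")   # ~$0.90 with these numbers
```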
Lastly, we have integration, which in my opinion is the most interesting one. When I was working at AKASA (an AI healthcare startup), I saw several tasks that were handled manually at hospitals and were very automatable, but were extraordinarily painful to automate due to difficulties in both reading data from and writing data back to legacy healthcare systems. Based on what I hear from people working in other enterprises, the story is more or less the same across industries.
This is one reason I don’t believe in the narrative that “GPT wrappers have no moat”. Sure, calling an AI API is easy, but you know what else is (relatively) easy? Spinning up a CRUD app and cold calling small businesses, but that describes half of all B2B SaaS startups in YC and they have had no problem building real businesses with healthy margins. Integration is a valid moat because it is necessary, slow, and often really hard! The main reason I think Google or Apple could steal the consumer AI market from OpenAI and others is integration into existing products that people already use (mail, calendar, devices).
I find it interesting that people seem rather unimaginative and short-sighted when talking about how to integrate LLMs. They talk about RAG and connecting to their database and maintaining memory. Is integrating LLMs into a codebase really necessary if, at some point in the future, we can give an LLM a sandboxed terminal, have it pull an entire codebase, and have it implement and push complete pull requests containing entirely new features? If an LLM were given a computer and a phone, just like the other call center workers, would we need custom integration? All of the above, of course, presumes we know that the model is aligned and safe to deploy.
We’re nowhere near these kinds of capabilities yet, but if the current trends continue it seems reasonable to expect this to happen eventually. As performance goes up and cost comes down, integration will remain the final bottleneck to unlocking the value of AI in applications. Let’s be more creative and serious about what integrating AI into our enterprises and society might look like.
7. Lastly, a personal announcement
I’m excited to announce that I’ll be joining OpenAI’s applied research team next week! As such, I’ll unfortunately not be writing as much about LLMs and AI research for some unknown period of time to stay clear of potential leaks. That aside, I’ll continue to write about other topics that interest me. Stay tuned!