Demystifying Model Space
Why a next-word prediction chatbot may be more capable (and dangerous) than you think
There’s a lot of confusion around the capabilities and limitations of large language models (LLMs) and especially OpenAI’s Generative Pre-trained Transformer (GPT), which is a type of LLM. Part of the confusion stems from how the models are trained. The latest GPT models undergo multiple training phases, but I’m going to focus on two main ones: next-word[1] prediction and reinforcement learning from human feedback (RLHF). The training process is important, but it’s not necessarily the best way to understand the capabilities of the model. To do that, it’s helpful to think in terms of model space. In the context of LLMs, model space refers to the high-dimensional space of all possible configurations of a model’s parameters, with each point in the space corresponding to a unique combination of parameter values.
Let’s think about the model space for the set of models with 175 billion parameters. This is the space where all such models live. GPT-3, by virtue of having 175 billion parameters, lives in this space. So do an incredible range of models. It seems quite likely that models that would be considered artificial general intelligence (AGI) live in this space. There are likely models that could do more to promote human flourishing than any technology before and models that could lead to catastrophe, unimaginable suffering, and species extinction.
It’s important to understand how big this space is. We could spend millions of years jumping randomly around this space and never reach a model that can form a coherent sentence. Estimates suggest that GPT-3 requires 500-800 GB of storage. The entire text of Wikipedia takes about 20 GB. Models in this space have the capacity to store all the information of Wikipedia, and much more, within their internal structure. (This isn’t an exact comparison, but it gives you a sense of how much information models in this space contain.)
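The 500-800 GB figure lines up with simple arithmetic: 175 billion parameters stored as 32-bit floating-point numbers come to about 700 GB before any overhead. A quick back-of-the-envelope check:

```python
# Rough storage estimate for a 175-billion-parameter model.
params = 175e9

bytes_fp32 = params * 4  # 4 bytes per parameter at 32-bit precision
bytes_fp16 = params * 2  # 2 bytes per parameter at 16-bit precision

print(f"32-bit: ~{bytes_fp32 / 1e9:.0f} GB")  # ~700 GB
print(f"16-bit: ~{bytes_fp16 / 1e9:.0f} GB")  # ~350 GB
```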
When a new model begins training, it starts with randomized weights—it’s at some random point in model space. If you asked it a question, you might get a result like this:
Human: Why is the sky blue?
GPT: WhbuiVt ok2z j4Kla2jw;oca2
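If you want to see this for yourself, you can instantiate a small model with random weights and ask it to generate text. Here’s a sketch using the Hugging Face transformers library, with a GPT-2-sized configuration standing in for a much larger model (only the tokenizer is pretrained; the weights are left at their random initialization):

```python
from transformers import GPT2Config, GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  # reuse a trained tokenizer
model = GPT2LMHeadModel(GPT2Config())              # random weights, no training at all

inputs = tokenizer("Why is the sky blue?", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20, do_sample=True)
print(tokenizer.decode(output[0]))  # an incoherent jumble of tokens
```

A randomly initialized model is a legitimate point in model space; it just happens to be one of the overwhelming majority that produce noise.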
Not particularly interesting. Then we start the next-word prediction phase of the training process. We ask it to predict the next token and every time it guesses wrong, we shift the model weights ever so slightly in the direction that would have given it the right answer. You can think of that as taking a teeny-tiny step in model space. This happens over and over and over again.
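In code, each of those teeny-tiny steps is a gradient update: measure how wrong the next-token predictions were, then nudge every weight a little in the direction that would have reduced the error. A minimal PyTorch-style sketch (the model, optimizer, and batch here are placeholders, not anything from OpenAI’s actual training setup):

```python
import torch.nn.functional as F

def training_step(model, optimizer, batch):
    """One tiny step through model space; batch is a (batch_size, seq_len) tensor of token IDs."""
    inputs, targets = batch[:, :-1], batch[:, 1:]   # predict token t+1 from tokens 1..t
    logits = model(inputs)                          # (batch_size, seq_len - 1, vocab_size)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),        # flatten positions
        targets.reshape(-1),                        # the tokens that actually came next
    )
    optimizer.zero_grad()
    loss.backward()       # how should each of the billions of weights have been different?
    optimizer.step()      # shift them ever so slightly in that direction
    return loss.item()
```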
After a while, we’ve reached somewhere more interesting in model space. The responses contain real words. And after a bit more training, even full sentences. With every step, we move around model space, bit by bit.
The question is, where is this next-word prediction leading us? Let’s look at an example. Say the model is trying to predict the next word in this sentence: “I’m American and on Sundays, I like to watch…”. Think about how you would predict the next word: what chain of reasoning would you go through? Well, what does it mean to watch something, and what are the kinds of things people watch? One could watch TV, one could watch one’s children, or one could watch one’s diet. And why is there a reference to being American? Are they signifying something that is particular to Americans? Perhaps they’re watching professional football. And if so, how would they say it? “Football”? “Pro football”? “NFL football”? “The Colts”? As you can see, becoming the world’s best next-word predictor would force a model into a part of model space where it can “reason”, where it knows a lot about human behavior, desires, and preferences.
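You can actually watch a model weigh options like these by inspecting the probability it assigns to each candidate next token. Here’s a sketch using the small, publicly available GPT-2 as a stand-in (its top candidates won’t match what GPT-3 or GPT-4 would predict, but the mechanics are the same):

```python
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "I'm American and on Sundays, I like to watch"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits              # (1, seq_len, vocab_size)

probs = torch.softmax(logits[0, -1], dim=-1)     # distribution over the next token
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(idx.item())!r}: {p:.3f}")  # the five most likely continuations
```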
There are many implications of this training method. One is how little relationship there is between the decisions the designers make and the behavior of the final model. They make decisions about which algorithm to use to help the model find more useful regions of model space faster, but that hardly gives them any control over the result. Their most important decision is what training data to include, but even that is not a particularly fine-pointed instrument. And it’s not clear what the best approach is here. Imagine they want to make a model that never uses profanity. They could remove all the profanity from the training data; that would certainly prevent the model from cursing and potentially offending people. But that’s not as great an idea as it might sound. It’s like raising a kid who never hears profanity. Sure, they might never swear themselves, but they also won’t know how to deal with it when they hear it from someone else.
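To see why training-data curation is such a blunt lever, here is roughly what that kind of filtering looks like (a simplified sketch with a made-up blocklist; real data pipelines are far more elaborate, but the basic move is the same: keep or drop whole documents):

```python
# Crude document-level filter: drop any training document containing a blocked word.
BLOCKLIST = {"darn", "heck"}  # stand-ins for actual profanity

def keep_document(text: str) -> bool:
    words = {word.strip(".,!?\"'").lower() for word in text.split()}
    return words.isdisjoint(BLOCKLIST)

corpus = ["A perfectly polite sentence.", "Well, darn, that didn't work."]
filtered = [doc for doc in corpus if keep_document(doc)]
# The model never sees what was dropped, so it also never learns how people
# actually use that language or how to respond to it.
```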
Given the weak relationship between the designers' actions and the model’s behavior, you can probably guess the relationship between the designers’ values and the model’s behavior. If the model gets better at responding like a human, that doesn’t necessarily mean it has values more like a human. Instead, it’s better to think of it as a simulator that can imitate human values. It's remarkable (and concerning) how little relationship there is between the designers' will and some aspects of the LLM, at least so far (we’ll get to the RLHF phase in a bit).
The point is, we don’t know all that much about the model. It’s taken an ungodly number of steps in an ungodly large model space. Somewhere in that space may be the keys to Utopia, somewhere (likely not far from Utopia) are the keys to ruin, and the vast, vast majority of the space is uninteresting. (Note that it’s possible that the keys to Utopia aren’t in a 175-billion-parameter space. If that’s the case, then we could simply imagine a larger space where they do exist.)
People assume that someone, somewhere “knows” what’s going on inside these models. Here’s another implication of the training process: that assumption is not accurate. There are people who understand the process better than others and who have spent more time experimenting with it, but that’s about it. The designers often find out about new features and bugs on Twitter, just like everyone else. I don’t mean to push the point too far: they are in a better position because they have direct access to the model and can run it without the added filters, but this advantage only goes so far. I’ve discussed the limitations of interpretability techniques for LLMs before.
Let’s talk about part two of the training process: reinforcement learning from human feedback (RLHF). This is where the model is fine-tuned based on human preferences. Essentially, the model generates several potential responses, and a human evaluator provides feedback such as, "The first two are good, the third doesn't make sense, and the fourth is offensive." The model then learns and improves based on this feedback. This stage of training is different in important ways: it’s where the designers can exert more influence. But note that this happens after the model has already found some very interesting place in model space. This raises the concern that this phase doesn’t so much improve the underlying model as make it wear a more socially acceptable mask.
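Mechanically, that human feedback is usually distilled into a reward model: the rankings become pairs of (preferred response, rejected response), a separate model is trained to score the preferred one higher, and the LLM is then nudged, via reinforcement learning, toward responses that score well. A minimal sketch of the pairwise preference loss (placeholder functions, not OpenAI’s actual implementation):

```python
import torch.nn.functional as F

def preference_loss(reward_model, prompt, preferred, rejected):
    """Train the reward model to score the human-preferred response above the rejected one."""
    r_preferred = reward_model(prompt, preferred)  # scalar score for the response humans liked
    r_rejected = reward_model(prompt, rejected)    # scalar score for the one they didn't
    # Bradley-Terry-style objective: -log sigmoid(score gap)
    return -F.logsigmoid(r_preferred - r_rejected).mean()
```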
With a fully trained model in place, let's take a brief detour to differentiate between GPT models and GPT products. So far, we've focused on the model. At the product level, the designers have more levers to pull. The first is prompt engineering, in which the model is given plain-English instructions before it sees the user’s input. These might be something like, “You are an advanced AI chatbot. Your mission is to be helpful, harmless, and honest. You will always be nice.” Just the fact that this happens gives you an idea of how much control the designers have: they have to tell it to “be nice” and hope for the best (and there are many cases where the best is not what happened).
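Concretely, those instructions are just text that the product prepends to whatever the user types. A sketch of the message structure used by chat-style APIs (the system prompt here is invented for illustration; real deployed prompts are longer and generally not public):

```python
SYSTEM_PROMPT = (
    "You are an advanced AI chatbot. Your mission is to be helpful, "
    "harmless, and honest. You will always be nice."
)

def build_messages(user_input: str) -> list[dict]:
    """Assemble what actually gets sent to the model: designer instructions first, then the user."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},  # the designers' standing instructions
        {"role": "user", "content": user_input},       # what the user actually typed
    ]

messages = build_messages("Why is the sky blue?")
# "Be nice" is just more text in the context window; there is no switch inside
# the model that enforces it.
```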
On top of that, they can always place some wrapper code around the model. They could implement a feature that checks if the LLM’s response contains any profanity, and if so, the product will instead respond with, “I’m sorry. I don’t know the answer to that,” or some other anodyne statement. So at the product level, the designers do have many tricks they can play to avoid a PR nightmare, but this doesn’t mean they can control the underlying model.
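Such a wrapper might look something like this (hypothetical names and a made-up word list; the important part is that the check happens outside the model, not inside it):

```python
PROFANITY = {"darn", "heck"}  # stand-ins for a real profanity list
FALLBACK = "I'm sorry. I don't know the answer to that."

def contains_profanity(text: str) -> bool:
    return any(word.strip(".,!?").lower() in PROFANITY for word in text.split())

def safe_respond(generate_fn, user_input: str) -> str:
    """Call the underlying model, then swap in an anodyne reply if the output trips the filter."""
    raw = generate_fn(user_input)  # the model's unfiltered response
    return FALLBACK if contains_profanity(raw) else raw
```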
It’s important to understand this distinction because in some ways the designers have lots of control, but in other ways very little. They can make the model appear to have their values or political agenda, but they can’t get the model to adopt them. It’s a bit like forcing the model to act as a specific character: you can get an actor to read any lines, but that doesn’t mean you’ve changed their true opinions. This is where the accusations of political bias in the model have some merit. Below is an example found by AllSides, a company that examines political bias in the media. These examples likely come from the human feedback layer, not the baseline LLM.
When thinking in terms of model space, you can see how faulty the “it’s just autocomplete” line of thinking is. Yes, it can autocomplete sentences, but to do that it has developed capabilities far beyond that. It’s also been called a “stochastic parrot”. I think this really misses what’s going on. There are numerous examples that contradict this characterization. How can a stochastic parrot make up new words? Or write poems? Someone on Reddit posted a poem GPT-4 created. It’s not just a “good poem for an AI”. It’s a damn good poem.
There are likely some models that could be well-described as stochastic parrots in this model space. But those won’t be the best at next-word prediction, so we shouldn’t expect to end up there. This line of thinking is too focused on the training technique and misses the larger picture. We need to separate the training method from the result. It’s true that the training method significantly affects the result, but it doesn’t predetermine the position of the AI within model space.
Another source of confusion comes from comparing GPT models to the human brain. It has been noted that the human brain is composed of highly specialized regions. For example, Broca's area is dedicated to speech production, while the occipital lobe is responsible for processing visual information. Delving deeper into the brain's structure, we can identify numerous neuron types with varying concentrations in distinct regions, featuring specialized neurons such as spindle neurons, Betz cells, and Purkinje neurons. The thinking is that human brains require differentiation to be smart, and LLMs don’t have this degree of differentiation, so they can’t be generally intelligent.
LLMs are made up of stacked transformer blocks, and while those blocks have internal structure, such as attention heads and feed-forward (fully connected) layers, they aren’t nearly as differentiated as the human brain. At least, not right out of the box. But that doesn’t mean they can’t develop differentiated parts during the training process. Those models exist in the same model space, too. The truth is, we don’t know how differentiated a trained LLM is. Research on circuits has shown that subsections of the model do specialize in certain areas. For example, AI safety company Anthropic found a type of neuron that appeared to represent numbers “when and only when they refer to a number of people”. That seems like a specialization.
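For a sense of how little built-in differentiation there is, here is roughly the unit that gets stacked dozens of times (a simplified PyTorch sketch; real implementations add causal masking, dropout, and more careful initialization):

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One of the nearly identical blocks that get stacked over and over in an LLM."""
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(                 # feed-forward sub-layer
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # attention heads
        x = x + self.ff(self.ln2(x))                       # fully connected layers
        return x
```

Every block has the same shape; any specialization, like the people-counting neurons above, has to emerge during training rather than being built in.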
I should point out that there are some approximations here. In reality, LLM architectures are not fully flexible: they are constrained to the transformer architecture and thus don’t have access to all of the model space we’ve been talking about. However, this architecture seems to be incredibly flexible, so it’s not clear what the limitations are. We know it’s capable of learning everything we’ve seen GPT do so far; it’s hard to know what the limits are. Based on the results from DeepMind’s Chinchilla paper, GPT-3 was likely under-trained for its size, so it’s probably not at the optimal place in model space. It’s possible that GPT-4 has many more parameters (OpenAI didn’t specify in its technical report), but it’s also possible that it sits in the exact same space, just trained on more data so that it’s in a “smarter” place.
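The Chinchilla result is often summarized as a rule of thumb of roughly 20 training tokens per parameter for compute-optimal training. Taking that heuristic at face value (it’s an approximation, and the paper frames the tradeoff in terms of a fixed compute budget):

```python
# Rough Chinchilla-style check on GPT-3: 175B parameters, ~300B training tokens (per the GPT-3 paper).
params = 175e9
tokens_used = 300e9
tokens_optimal = 20 * params  # ~20 tokens per parameter rule of thumb

print(f"compute-optimal tokens: ~{tokens_optimal / 1e12:.1f} trillion")  # ~3.5 trillion
print(f"actually trained on:    ~{tokens_used / 1e9:.0f} billion")
# By this heuristic, GPT-3 saw far less data than its size calls for, which is
# one reason to think its point in model space could improve without adding a
# single parameter.
```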
In addition, the training process isn't necessarily equally likely to arrive at all places in model space. There are good reasons to think it would find a good area (local minimum, in machine learning terms) and stay around there, even if there is a better region out there.
It’s not clear what the limits are of this space. Human brains are not in this particular model space. There are so many differences between artificial and biological neural networks that I hate to make a comparison, but the space that human brains are in is probably more like a 1 quadrillion-parameter space.
The point is, there is not a 1:1 correspondence between how a model was trained and what it can do. Next-word prediction is a general task that requires general intelligence to solve. We can build something generally intelligent with that task. The world of model space is vast and complex, and as we continue to explore it, we'll likely uncover even more surprising capabilities of next-word prediction chatbots.
[1] Technically it’s next-token prediction, but for our purposes, we can think of it as next-word prediction.