A recent study questions whether large language models (LLMs) form coherent models of the world, despite their accurate output on complex tasks such as generating directions or playing games. The researchers found that while LLMs provide near-perfect driving directions, their performance breaks down when routes change unexpectedly, for example when detours are introduced, suggesting that the models have not learned the underlying rules.
Researchers from Harvard and MIT have investigated how large language models (LLMs) learn to represent complex real-world information.
The central idea is that these models could theoretically capture implicit representations (a “world model”) of the underlying structures of the data they were trained on.
Imagine building a navigation app. Typically, you would map out all the streets in a city and use algorithms to find directions. LLMs suggest another approach: train the model on previously taken routes (e.g., “go east, then north”) and let it predict the next direction from the learned patterns.
If the model is successful, it would have implicitly learned the city map without ever seeing it directly. But to use LLMs in this way, it is essential to know whether they can learn such representations. That is what the researchers set out to answer.
To investigate whether the models learn the underlying “world model,” the researchers proposed analyzing how they handle structured domains such as board games and navigation.
These domains can be represented by deterministic finite automata (DFAs), a mathematical way of modeling states and transitions. For example:
Board games: States represent positions in the game, and transitions correspond to moves.
Navigation: States represent locations, and transitions are movements (turning left, moving forward, etc.).
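To make the DFA framing concrete, here is a minimal sketch in Python of a toy navigation domain: states are intersections, moves are transitions, and two different routes can end up in the same state. The four-intersection map is purely illustrative and is not taken from the paper.

```python
from typing import Dict, Tuple

# Toy street grid as a DFA: (current state, move) -> next state.
# The map itself is illustrative, not from the paper.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("A", "east"): "B",
    ("A", "north"): "C",
    ("B", "north"): "D",
    ("C", "east"): "D",
}

def run_dfa(start, moves):
    """Follow a sequence of moves; return the final state, or None if a move is invalid."""
    state = start
    for move in moves:
        state = TRANSITIONS.get((state, move))
        if state is None:
            return None  # the move is not allowed from this state
    return state

print(run_dfa("A", ["east", "north"]))  # D
print(run_dfa("A", ["north", "east"]))  # D -- a different route to the same state
```

The fact that two different routes reach the same state is exactly what the metrics below probe.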
The researchers created metrics based on the Myhill-Nerode theorem from language theory to assess whether the models can capture the underlying states and transitions correctly.
On the left is a visual representation of a Myhill-Nerode boundary and interior. On the right are examples of two states for Cumulative Connect-4; both states have the same set of valid next moves. The shortest sequence in the Myhill-Nerode boundary has length 4, and the boundary contains sequences up to length 30.
Two main metrics were used:
Sequence Compression: If two paths (or sequences) lead to the same state in the underlying world, the LLM should accept the same continuations for both. This assesses whether the model recognizes that the two sequences are equivalent.
Sequence Distinction: If two paths lead to different states, the LLM should predict distinct continuations for each path. This assesses whether the model can properly separate the states.
These metrics are model-agnostic, meaning they are independent of the architecture or training, and are based solely on the sequences generated.
A visual representation of the two evaluation metrics. A compression error occurs when a model fails to recognize that two sequences reaching the same state should accept the same suffixes. A distinction error occurs when a model fails to find the suffixes that distinguish two sequences leading to different states. Both metrics measure errors at the boundary, shown above as a green line (true model) and a magenta line (generative model).
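A rough sketch of how the two checks could be expressed in code, assuming access to the true DFA (to know which state a sequence reaches) and a way to ask the model whether it accepts a given continuation; `true_state`, `model_accepts`, and the suffix set are hypothetical stand-ins, not the authors' implementation.

```python
# Hypothetical helpers: true_state(seq) returns the DFA state reached by a
# sequence, and model_accepts(seq) returns True if the generative model treats
# the sequence as valid. Neither is the authors' code.

def has_compression_error(seq_a, seq_b, suffixes, true_state, model_accepts):
    """Two sequences that reach the SAME true state should accept exactly the
    same suffixes; any disagreement is a compression error."""
    assert true_state(seq_a) == true_state(seq_b)
    return any(model_accepts(seq_a + s) != model_accepts(seq_b + s)
               for s in suffixes)

def has_distinction_error(seq_a, seq_b, suffixes, true_state, model_accepts):
    """Two sequences that reach DIFFERENT true states should be separated by at
    least one suffix the model accepts after one but not the other; finding no
    such suffix is a distinction error."""
    assert true_state(seq_a) != true_state(seq_b)
    return not any(model_accepts(seq_a + s) != model_accepts(seq_b + s)
                   for s in suffixes)
```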
To test their approach, the researchers used data from real taxi rides in New York City. They trained LLMs on sequences of turns made by taxis (“right turn, go straight, left turn”) and evaluated whether the models could reconstruct the map of the city.
The models seemed to work well, predicting the correct next turn almost 100% of the time.
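That headline number corresponds to a standard diagnostic, essentially next-token validity: for each prefix of a ride, check whether the turn the model proposes is legal at the taxi's current intersection. A minimal sketch of this check, with `model_next_turn` and `legal_turns` as hypothetical stand-ins:

```python
def next_turn_accuracy(prefixes, model_next_turn, legal_turns):
    """Fraction of ride prefixes for which the model's proposed next turn is
    legal in the true street map. model_next_turn(prefix) and
    legal_turns(prefix) are hypothetical stand-ins for querying the model and
    the true map."""
    correct = sum(model_next_turn(p) in legal_turns(p) for p in prefixes)
    return correct / len(prefixes)
```

A model can score nearly perfectly on this check while still holding an incoherent internal map, which is exactly what the new metrics reveal.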
When applying the new metrics, the researchers found that the models did not recover the true map of Manhattan. The implicit “map” generated was incoherent, with impossible streets or intersections that did not exist. This made the models fragile, especially when dealing with detours.
Reconstructed maps of Manhattan from sequences produced by three models: the true-world model (a), the true-world model corrupted with noise (b), and a transformer trained on random walks (c). Edges extend from nodes in their specified cardinal direction. In the zoomed-in images, edges belonging to the true graph are black, and fake edges added by the reconstruction algorithm are red.
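In spirit, such a reconstruction can be obtained by replaying the model's generated walks on the grid and recording every edge they imply; a generated move onto a street that does not exist leaves behind a fake edge. A minimal sketch under that simplification (the unit-step geometry and helper names are illustrative assumptions, not the authors' reconstruction algorithm):

```python
from collections import defaultdict

# One unit step per move in each cardinal direction (an illustrative
# simplification of real street geometry).
STEP = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}

def reconstruct_map(walks, start=(0, 0)):
    """Build an adjacency map {node: set of neighbors} implied by generated
    walks, where each walk is a list of cardinal directions from `start`."""
    graph = defaultdict(set)
    for walk in walks:
        node = start
        for direction in walk:
            dx, dy = STEP[direction]
            nxt = (node[0] + dx, node[1] + dy)
            graph[node].add(nxt)  # every generated move implies an edge
            node = nxt
    return graph

# Edges that appear in this graph but not in the true street grid are the
# "fake" (red) edges visible in the zoomed-in reconstructions.
```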
In addition to navigation, the researchers applied the metrics to games such as chess and Othello, and to logic puzzles. The models performed well on the headline tasks but showed inconsistencies in their underlying grasp of rules and states. The results showed that while LLMs can perform impressive tasks, this does not always mean they fully understand the world they are modeling.
This weakness is concerning for scientific applications, where we rely on models to learn something new and true about the world. The work also highlighted the importance of theoretically grounded metrics to assess whether a model truly captures the underlying logic of a domain. Despite the advances, extending these ideas to more complex domains beyond DFAs is a challenge for future research.
READ MORE:
Evaluating the World Model Implicit in a Generative Model
Keyon Vafa, Justin Y. Chen, Jon Kleinberg, Sendhil Mullainathan, and Ashesh Rambachan
38th Conference on Neural Information Processing Systems (NeurIPS 2024)
Abstract:
Recent work suggests that large language models may implicitly learn world models. How should we assess this possibility? We formalize this question for the case where the underlying reality is governed by a deterministic finite automaton. This includes problems as diverse as simple logical reasoning, geographic navigation, game-playing, and chemistry. We propose new evaluation metrics for world model recovery inspired by the classic Myhill-Nerode theorem from language theory. We illustrate their utility in three domains: game playing, logic puzzles, and navigation. In all domains, the generative models we consider do well on existing diagnostics for assessing world models, but our evaluation metrics reveal their world models to be far less coherent than they appear. Such incoherence creates fragility: using a generative model to solve related but subtly different tasks can lead to failures. Building generative models that meaningfully capture the underlying logic of the domains they model would be immensely valuable; our results suggest new ways to assess how close a given model is to that goal.