How Large Language Models Anticipate Future Words

Humans think ahead while speaking, predicting upcoming language input with remarkable accuracy. Do language models show similar foresight? Recent research takes up this question and proposes two possible explanations for why transformer language models appear to prepare information in advance: pre-caching and breadcrumbs.

Pre-caching means the model computes features at the current time step that are not needed immediately but will be useful at future steps. The breadcrumbs hypothesis, by contrast, holds that the features most relevant at the current time step are already the ones that benefit future inference, so no extra preparation is required.

To test these hypotheses, researchers used “myopic training,” a scheme that stops the loss at a given position from sending gradients back to earlier time steps, so the model is never trained to compute features purely for later use. In synthetic data settings, clear evidence for pre-caching emerged, indicating that successful models prepare information for upcoming tokens in advance. In autoregressive language modeling experiments, however, the breadcrumbs hypothesis fit better, suggesting that the features relevant at any time step naturally benefit future inference.
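The difference between ordinary and myopic training can be sketched with a toy two-position model. Everything here is a hypothetical illustration, not the researchers' setup: the scalar weights `w`, `a`, `b`, `c` and the function `gradients_wrt_w` are assumptions, and the derivatives are written out by hand rather than taken from any library.

```python
def gradients_wrt_w(w, a, b, c, x1, x2, y1, y2):
    """Return (full, myopic) gradients of the total loss with respect to w.

    Toy model (hypothetical, for illustration only):
      h1    = w * x1              # feature computed at position 1
      pred1 = a * h1              # prediction at position 1
      pred2 = b * h1 + c * x2     # prediction at position 2 reuses h1
      loss  = (pred1 - y1)**2 + (pred2 - y2)**2
    """
    h1 = w * x1
    pred1 = a * h1
    pred2 = b * h1 + c * x2

    # Gradient of the position-1 loss with respect to w.
    own_term = 2 * (pred1 - y1) * a * x1
    # Gradient of the position-2 loss flowing back through h1.
    future_term = 2 * (pred2 - y2) * b * x1

    full = own_term + future_term
    myopic = own_term  # the future term is cut, like a stop-gradient on h1
    return full, myopic


full, myopic = gradients_wrt_w(
    w=0.5, a=1.0, b=1.0, c=0.0, x1=2.0, x2=0.0, y1=1.0, y2=3.0
)
print(full, myopic)  # -8.0 0.0
```

In this example the position-1 prediction is already correct, so the myopic gradient is zero. The full gradient is still nonzero because changing `w` would help position 2, which is the kind of pressure that could teach a model to pre-cache.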

Credit: Tesfu Assefa

Examples of Pre-caching and Breadcrumbs in Action

Pre-caching

Consider a language model trained on a dataset of simple arithmetic problems. Given the input “2 + 3 =,” the model needs to predict the next token, which should be “5.” A model that pre-caches would already compute that “2 + 3” yields “5” while processing the “3,” before it has seen the “=” symbol. The intermediate result is not needed at that step, but storing it lets the model produce the correct answer as soon as the full equation is presented. This kind of behavior matters most in synthetic data settings, where predicting a future token requires work that the current token's prediction does not.

Breadcrumbs

Now, consider a language model trained on natural language text, such as a news article. When the model processes the sentence, “The stock market saw a significant rise today as investors showed confidence in the new economic policies,” it might need to predict the next word “policies” after reading “new economic.” Here, the breadcrumbs hypothesis is at play. The model uses the context from the current and preceding words to make an informed prediction. The features relevant to “new” and “economic” are naturally beneficial for predicting “policies” without deliberate preparation, as they all relate to the same context.

In the arithmetic example, the model benefits from pre-caching because it needs to prepare specific future outcomes based on the current input. In contrast, the news article example showcases the breadcrumbs hypothesis, where relevant features at the current time step (e.g., “new” and “economic”) inherently aid future predictions (e.g., “policies”) without additional pre-computation.

Conclusion

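During gradient descent, the gradient of the expected loss with respect to the model’s parameters can be broken down by position: diagonal terms measure how the weights used at a position affect the prediction at that same position, while off-diagonal terms measure how the weights used at one position affect predictions at later positions. This decomposition underpins the distinction between myopic and non-myopic models: a myopic model is trained on the diagonal terms only, so it is optimized for the immediate prediction and not for future ones.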
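The diagonal and off-diagonal terms can be made concrete with a small sketch. It assumes a hypothetical toy model in which each position has its own copy of a shared scalar weight and each prediction sums the features of all earlier positions; `gradient_matrix` is an illustrative name, not something from the study.

```python
def gradient_matrix(thetas, xs, ys):
    """G[i][j] = d(loss_j) / d(theta_i) for a toy causal model.

    Hypothetical setup: position i has its own (untied) copy theta_i of a
    shared scalar weight, h_i = theta_i * x_i, and the prediction at
    position j is the sum of h_i over all i <= j, with squared-error loss.
    """
    n = len(xs)
    hs = [t * x for t, x in zip(thetas, xs)]
    preds = [sum(hs[: j + 1]) for j in range(n)]
    grid = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):  # causal: position i only affects j >= i
            grid[i][j] = 2 * (preds[j] - ys[j]) * xs[i]
    return grid


G = gradient_matrix(thetas=[1.0, 1.0, 1.0], xs=[1.0, 2.0, 3.0], ys=[1.0, 2.0, 9.0])

# Diagonal only: each position's effect on its own prediction.
myopic_grad = sum(G[i][i] for i in range(3))
# All entries: also counts each position's effect on later predictions.
full_grad = sum(sum(row) for row in G)

print(myopic_grad, full_grad)  # -14.0 -30.0
```

Entries below the diagonal stay at zero because a later position cannot influence an earlier prediction. With tied weights, the ordinary gradient is the sum over the whole matrix, and the myopic gradient keeps only the diagonal.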

The study provides evidence that transformers do pre-cache information in synthetic tasks, but that in natural language settings they more likely behave as the breadcrumbs hypothesis describes, using features that are relevant to both current and future tokens without deliberate preparation. This clarifies how language models process and anticipate linguistic input, and it draws an interesting parallel between human and artificial cognitive processes.
