
Will AI Start Forgetting What It Knows? The Risks of Recursion in Model Training

Nov. 14, 2024.
3 min. read

Researchers warn of "model collapse": a slow AI degradation from over-relying on generated data. Can real data save generative models from losing their human-like edge?

About the Writer

Yeabsera


I'm an electrical and computer engineer with a big love for AI. As part of an AI ethics team at an international software company, I’m working hard to close the digital literacy gap globally and make sure tech is fair for all.

Credit: Tesfu Assefa

Are AI models starting to forget what makes them so impressively human-like? Researchers warn of an insidious issue dubbed “model collapse” – a process where, over successive generations of training on model-generated data, AI systems may drift further from their original training data, potentially degrading performance and reliability.

Introduction

The advent of Large Language Models (LLMs) and generative models like GPT and Stable Diffusion has reshaped AI and content creation. From chatbots to advanced image synthesis, these systems demonstrate remarkable capabilities. However, a fundamental issue looms: training new models on data generated by earlier models increases the risk of “model collapse.” This degenerative process raises concerns about the sustainability and reliability of AI models over time.

Mechanism of Model Collapse

The term “model collapse” describes a degenerative process wherein a generative model gradually loses its ability to represent the true data distribution—particularly the “tails” or outer edges of this data. This issue is rooted in two errors.

First, statistical approximation error arises because each new model is trained on a finite sample drawn from its predecessor rather than on genuine data. Rare events from the original dataset are easily missed in those samples, and over multiple generations this loss compounds, leaving a warped view of the data landscape.

A second factor, functional approximation error, emerges when the model’s architecture is not expressive enough to capture the original data’s intricacies. Even though neural networks can in principle model very complex functions, a given architecture can still misplace probability mass, assigning confidence to regions the real data never occupied. Together, these errors create a feedback loop that gradually shifts each generation away from the initial data distribution.
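To make that feedback loop concrete, here is a minimal sketch (an illustration, not the paper’s experimental code) of recursive training on a single Gaussian: each generation fits its parameters to a finite sample drawn from the previous generation’s model, and the estimated spread drifts as sampling noise accumulates.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0    # generation 0: the "real" data distribution
n_samples = 1_000       # a finite sample size is the source of statistical error

for generation in range(1, 11):
    sample = rng.normal(mu, sigma, n_samples)  # data produced by the previous model
    mu, sigma = sample.mean(), sample.std()    # the next model fits only that sample
    print(f"gen {generation:2d}: mu = {mu:+.3f}, sigma = {sigma:.3f}")

# Over many generations sigma tends to wander and shrink: rare tail values are
# easily missed in a finite sample, so each refit can lose a little spread.
```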

Effects Across Models

To better understand model collapse, researchers examined its effects on various generative models, including Gaussian Mixture Models (GMMs) and Variational Autoencoders (VAEs).

Tests using GMMs revealed that while these models initially performed well, their ability to represent the original data had degraded markedly by the 2,000th generation of recursive training: the accumulated loss of variance left the initial distribution badly misrepresented.
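A rough version of that recursive fit-and-sample loop can be reproduced with an off-the-shelf mixture model. The sketch below is a simplified stand-in for the paper’s setup; the component count, dimensionality, and generation count are illustrative only.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Generation 0: "real" data from two well-separated clusters.
real = np.vstack([rng.normal(-3, 1, (2000, 2)), rng.normal(3, 1, (2000, 2))])

data = real
for generation in range(1, 101):
    gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
    data, _ = gmm.sample(len(data))   # the next generation sees only model output
    if generation % 25 == 0:
        spread = np.trace(gmm.covariances_.mean(axis=0))
        print(f"gen {generation:3d}: mean covariance trace = {spread:.3f}")

# The fitted components' covariance tends to shrink across generations,
# mirroring the loss of variance described above.
```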

VAEs, which generate data from latent variables, exhibited even more pronounced effects. By the 20th generation, the model output had converged into an unimodal form, missing out on the original dataset’s diverse characteristics. The disappearance of “tails” suggests a loss of data nuance.

Implications for Large Language Models

While concerning for GMMs and VAEs, model collapse is even more worrisome for LLMs like GPT, BERT, and RoBERTa, which rely on extensive corpora for pre-training. In an experiment involving the OPT-125m language model fine-tuned on the Wikitext-2 corpus, researchers observed performance declines within just five generations when no original data was retained. Perplexity, which measures how well a model predicts held-out text (lower is better), rose from 34 to over 50, signaling a marked drop in language-modeling quality. When 10% of the original data was preserved, performance remained stable across 10 generations, highlighting a potential countermeasure.
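For reference, perplexity for a causal language model is the exponential of its average cross-entropy loss on held-out text. A minimal sketch of that measurement with the Hugging Face transformers library follows; the evaluation sentence is a stand-in, whereas the actual experiment used the Wikitext-2 test split.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"   # the model family used in the experiment
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

text = "Generative models are trained on large text corpora."  # stand-in eval text
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels=input_ids makes the model return the average cross-entropy loss.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"perplexity = {math.exp(loss.item()):.2f}")   # lower is better
```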

Credit: Tesfu Assefa

Mitigation Strategies

To address this degenerative phenomenon, researchers propose several strategies. Maintaining a subset of the original dataset across generations has proven highly effective: preserving just 10% of genuine data appeared to slow collapse markedly, keeping accuracy and stability largely intact.
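In practice, this can be as simple as reserving a slice of genuine data and mixing it into every generation’s training set. The sketch below spells out only the data-mixing step; train_model and generate_samples are hypothetical helpers, not functions from the study.

```python
import random

def build_training_set(real_data, synthetic_data, real_fraction=0.10):
    """Keep a fixed fraction of genuine examples alongside model-generated ones."""
    total = len(synthetic_data)
    n_real = int(real_fraction * total)
    kept_real = random.sample(real_data, min(n_real, len(real_data)))
    kept_synthetic = random.sample(synthetic_data, total - len(kept_real))
    return kept_real + kept_synthetic

# Hypothetical driver across generations:
# model = train_model(real_data)
# for generation in range(num_generations):
#     synthetic = generate_samples(model, n=len(real_data))
#     model = train_model(build_training_set(real_data, synthetic))
```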

Another approach involves improving data sampling techniques during generation. Using methods like importance sampling or resampling strategies helps retain the original data’s diversity and richness.
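One way to read “resampling” here (an interpretation, not a method taken from the paper) is to reweight the generator’s samples by the ratio of an estimated real-data density to the generator’s own density, so that thinned-out tail regions are drawn more often.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
real = rng.normal(0, 1.0, 5000)        # reference sample of genuine data
synthetic = rng.normal(0, 0.8, 5000)   # generator output with thinner tails

p_real = gaussian_kde(real)(synthetic)       # estimated real density at synthetic points
p_syn = gaussian_kde(synthetic)(synthetic)   # generator's own density at those points
weights = p_real / np.maximum(p_syn, 1e-12)
weights /= weights.sum()

resampled = rng.choice(synthetic, size=len(synthetic), replace=True, p=weights)
print(f"std before: {synthetic.std():.3f}, after resampling: {resampled.std():.3f}")
```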

Enhanced regularization techniques during training can prevent models from overfitting to generated data, delaying collapse. These measures help models maintain balanced task comprehension even when trained partly on generated datasets.
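Concretely, this can mean standard tools such as weight decay and early stopping on genuine held-out data; the values below are illustrative, not taken from the study.

```python
import torch

model = torch.nn.Linear(128, 10)   # placeholder for a real generative model
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,
    weight_decay=0.01,   # L2-style penalty discourages memorizing generated quirks
)
# Early stopping is another simple guard: monitor validation loss on a slice of
# genuine data and halt training when it stops improving.
```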

Conclusion

Model collapse poses a significant risk to the future of generative AI, challenging the long-term accuracy and reliability of these models. Addressing it requires strategies like retaining real data, refining sampling techniques, and implementing effective regularization. Focused research and mitigation can help AI models preserve their adaptability and effectiveness, ensuring they remain valuable tools for the future.

Reference

Shumailov, Ilia, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, and Ross Anderson. “The Curse of Recursion: Training on Generated Data Makes Models Forget.” arXiv, May 27, 2023. https://arxiv.org/abs/2305.17493.



4 thoughts on “Will AI Start Forgetting What It Knows? The Risks of Recursion in Model Training”

  1. It is a good article, but I think it overlooks a solution for model degradation as a result of recursive training. It could have been better if the writer had added adaptive sampling techniques that could be applied at each epoch.

  2. It points to the critical problem of "model collapse," where over-reliance on generated data erodes a model's capability to represent diverse and nuanced information. This underlines the need for sustainable training practices that keep generative AI reliable over time. Emphasizing these risks should encourage the AI community to focus on innovation and long-term adaptability in model development.

  3. Really deep insight. Thank you.

  4. Model collapse is a serious concern for the future of generative AI, especially as models increasingly train on synthetic data. Preserving a portion of the original training dataset appears crucial for mitigating this degenerative process. Improved sampling and regularization techniques also offer promising avenues for maintaining model integrity across generations. Continued research and development of these strategies are essential to ensure the long-term reliability and effectiveness of AI models.

