AI Faces Data Crisis: Musk Warns Of Exhausted Human Knowledge

Reading time: 3 min

Artificial intelligence companies have depleted the available stock of human knowledge for training their models, Elon Musk claimed during a livestreamed interview, as reported by The Guardian.

In a Rush? Here are the Quick Facts!

  • Elon Musk says AI firms have exhausted human knowledge for model training.
  • Musk suggests “synthetic data” is essential for advancing AI systems.
  • AI hallucinations complicate using synthetic data, risking errors in generated content.

The billionaire suggested that firms must increasingly rely on “synthetic” data—content generated by AI itself—to develop new systems, a method already gaining traction. “The cumulative sum of human knowledge has been exhausted in AI training. That happened basically last year,” Musk said, as reported by The Guardian.
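To illustrate what "synthetic" data means in practice, here is a minimal Python sketch that asks a language model to write passages which could later be folded into a training corpus. It is an assumption-laden illustration only: the model name, prompt, and absence of any filtering are choices made for this example, not details from the interview.

    # Illustrative sketch only: using an LLM to generate synthetic training text.
    # The model name, prompt, and output handling are assumptions for this example.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def generate_synthetic_paragraphs(topic: str, n: int = 3) -> list[str]:
        """Ask a model to write short passages that could feed a training set."""
        passages = []
        for _ in range(n):
            response = client.chat.completions.create(
                model="gpt-4o-mini",  # assumed model name
                messages=[{
                    "role": "user",
                    "content": f"Write a short, factual paragraph about {topic}.",
                }],
            )
            passages.append(response.choices[0].message.content.strip())
        return passages

    if __name__ == "__main__":
        # In a real pipeline these passages would be filtered and deduplicated
        # before being mixed into a model's training data.
        for text in generate_synthetic_paragraphs("renewable energy"):
            print(text, "\n---")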

This poses a significant challenge for AI models like GPT-4, which rely on massive datasets sourced from the internet to identify patterns and predict text outputs.

Musk, who founded xAI in 2023, highlighted synthetic data as the primary solution for advancing AI. However, he cautioned about the risks associated with the practice, particularly AI “hallucinations,” where models generate inaccurate or nonsensical information, as reported by The Guardian.

The Guardian notes that leading tech companies, including Meta and Microsoft, have adopted synthetic data for their AI models, such as Llama and Phi-4. Google and OpenAI have also incorporated this approach.

For example, Gartner estimates that 60% of the data used for AI and analytics projects in 2024 was synthetically generated, as reported by TechCrunch.

Additionally, training on synthetic data offers significant cost savings. TechCrunch notes that AI startup Writer claims its Palmyra X 004 model, developed using almost entirely synthetic sources, cost just $700,000 to create.

By comparison, estimates suggest a similarly sized model from OpenAI would cost around $4.6 million to develop, according to TechCrunch. However, while synthetic data enables continued model refinement, experts warn of potential drawbacks.

The Guardian reported that Andrew Duncan, director of foundational AI at the Alan Turing Institute, noted that reliance on synthetic data risks “model collapse,” where outputs lose quality over time.

“When you start to feed a model synthetic stuff you start to get diminishing returns,” Duncan said, adding that biases and reduced creativity could also arise.

The growing prevalence of AI-generated content online poses another concern. Duncan warned that such material might inadvertently enter training datasets, further compounding the challenges, as reported by The Guardian.

Duncan referenced a study published in 2022 that predicted high-quality text data for AI training could be depleted by 2026 if current trends persist. The researchers also projected that low-quality language data might run out between 2030 and 2050, while low-quality image data could be exhausted between 2030 and 2060.

Furthermore, a more recent study published in July warns that AI models risk degradation as AI-generated data increasingly saturates the internet. Researchers found that models trained on AI-generated outputs produce nonsensical results over time, a phenomenon termed “model collapse.”
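To make that feedback loop concrete, the toy Python sketch below repeatedly refits a simple statistical model on samples drawn from its own previous fit. Over enough generations the spread of the data tends to shrink, loosely mirroring the loss of diversity the researchers describe; the setup and parameters are illustrative assumptions, not the study's actual experiment.

    # Toy illustration of "model collapse": each generation is fit only to
    # samples produced by the previous generation's model, so the estimated
    # spread tends to drift downward over time. Not the researchers' experiment.
    import random
    import statistics

    def simulate_collapse(generations: int = 40, sample_size: int = 25) -> None:
        # Generation 0: "real" data drawn from a wide distribution.
        data = [random.gauss(0.0, 1.0) for _ in range(sample_size)]
        for gen in range(generations):
            mu = statistics.fmean(data)
            sigma = statistics.stdev(data)
            if gen % 10 == 0:
                print(f"generation {gen:2d}: mean={mu:+.3f}, std={sigma:.3f}")
            # The next generation trains only on the current model's own outputs.
            data = [random.gauss(mu, sigma) for _ in range(sample_size)]

    if __name__ == "__main__":
        random.seed(0)
        simulate_collapse()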

This degradation could slow AI advancements, emphasizing the need for high-quality, diverse, and human-generated data sources.
