Investigation Reveals Apple, Nvidia, And Others Used YouTube Videos To Train AI
A new investigation by the nonprofit news studio Proof News and Wired revealed that major AI firms like Anthropic, Nvidia, Apple, and Salesforce used thousands of YouTube videos to train AI models despite YouTube’s policies against harvesting material from the platform without permission.
Researchers with technical expertise analyzed publicly available training datasets and discovered that these Silicon Valley companies and others used transcripts from 173,536 YouTube videos from over 48,000 channels.
Proof News said it found material from YouTube stars like Mr. Beast, PewDiePie, Jacksepticeye, and Marques Brownlee, as well as educational content from the channels of MIT, Harvard, and Khan Academy, and from news publications like the BBC, NPR, and The Wall Street Journal. A few popular shows, like “Jimmy Kimmel Live,” “The Late Show With Stephen Colbert,” and “Last Week Tonight With John Oliver,” were also mentioned in the study as part of the collection.
YouTube Subtitles, as the dataset was called, also includes translations into languages like Arabic, German, and Japanese, and was built by EleutherAI, a nonprofit AI research group.
According to a paper published by EleutherAI, the dataset is part of a compilation called the Pile, which includes material from other sources as well. Apple, Nvidia, Salesforce, Bloomberg, Databricks, and Anthropic, which is focused on “AI safety,” have confirmed in research papers and documents that they used the Pile to train AI models.
Proof News also launched a tool yesterday to help content creators, researchers, and the public find the videos included in the dataset. “We built a tool so you can search the data for yourself,” the organization explained in a press release. “Be advised that the search tool will occasionally return false negatives for channels and videos that are in the dataset. Make sure to spell your channel or video title correctly.”
YouTubers included in the research have also expressed concern and vexation. “It’s theft,” said Dave Wiskus, the CEO of Nebula, to Proof News and Wired after learning their content had been used to train AI models. “Will this be used to exploit and harm artists? Yes, absolutely.”