Study Reveals Growing Data Restrictions Impacting AI Training

Image by Adisorn, from Adobe Stock

A new study led by an MIT research group reveals a growing trend of websites limiting the use of their data for AI training. The study examined 14,000 web domains and found that restrictions have been placed on 5% of all data. Additionally, over 28% of data from the highest-quality sources across three commonly used AI training datasets is restricted. The study is the first large-scale longitudinal audit of consent protocols for web domains used in AI training corpora.

Generative AI systems, like ChatGPT, Gemini, and Claude, rely heavily on vast amounts of data to function effectively. The quality of these AI tools’ outputs depends significantly on the quality of the data they are trained on. Historically, gathering this data was relatively straightforward, but the recent surge in generative AI has led to tensions with data owners. Many data owners are uneasy about their content being used for AI training without compensation or proper consent.

As a result, publishers have pushed back. Some have put up paywalls or modified their terms of service to limit the use of their data for AI training. Others have taken more drastic measures, such as blocking the automated web crawlers that companies use to collect data. Legal actions and restrictions imposed through robots.txt files and terms-of-service changes are becoming more common.
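To illustrate how a robots.txt restriction works in practice, here is a minimal Python sketch using the standard library's urllib.robotparser module. The robots.txt content and the example.com URL are hypothetical and not taken from the study; GPTBot is used only as an example of a crawler name a site might choose to block.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one AI crawler, allows everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines directly, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/some-story"  # hypothetical URL
    for agent in ("GPTBot", "SomeOtherBot"):
        print(agent, "allowed:", is_allowed(ROBOTS_TXT, agent, page))
```

Note that robots.txt is a voluntary convention: it only has an effect when a crawler checks the file and chooses to honor it.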

The consequences of this data squeeze are multifaceted. It will make developing AI systems more difficult, as they rely heavily on this data for training. The restrictions may also bias AI models by limiting them to less diverse data sets. Additionally, copyright issues could arise if AI models are trained on data that websites don’t want used for that purpose.

The effect is already visible. In just one year, a substantial portion of data from important websites has become restricted, and the trend is expected to continue.

Shayne Longpre, the study’s lead author, states: “We’re seeing a rapid decline in consent to use data across the web that will have ramifications not just for A.I. companies, but for researchers, academics and noncommercial entities.”

This means that smaller AI companies and academic researchers who depend on freely available datasets could be disproportionately affected, as they often lack the resources to license data directly from publishers.

For example, Common Crawl, a dataset comprising billions of pages of web content and maintained by a nonprofit, has been cited in over 10,000 academic studies, illustrating its critical role in research.

The study highlights the need for new tools that give website owners more control over how their data is used. Ideally, these tools would allow them to differentiate between commercial and non-commercial uses, permitting access for research or educational purposes.

The situation also serves as a reminder to big AI companies that a more sustainable approach is crucial for the continued development of AI.

Longpre emphasised the need for these companies to collaborate with data owners and offer them value in return for access. For years, they have treated the internet as an “all-you-can-eat data buffet” without giving much back. That approach is unsustainable: as data owners become more protective of their content, AI companies will need to work with them to ensure continued access to high-quality data.
