
Harvard Releases Free Large-Scale AI Training Database


Posted on Dec 12, 2024

Written by Andrea Miliani, Tech News Expert

Harvard University announced that it will release a large dataset of almost 1 million public-domain books, free of charge, for AI training. The dataset was created by its new program, the Institutional Data Initiative (IDI).

In a Rush? Here are the Quick Facts!

Harvard, in collaboration with Google Books, is releasing a free dataset of almost 1 million public-domain books for training AI models
The dataset was created by the new Institutional Data Initiative, a program backed by Microsoft and OpenAI
Small organizations can benefit from this data collection to compete more fairly in the AI sphere

According to Wired, the dataset includes publications scanned by Google Books that are no longer protected by copyright, which usually expires 70 years after the author’s death or after publication. The collection covers multiple formats and genres, from creative writing by famous authors like Charles Dickens, Shakespeare, and Dante to textbooks and dictionaries.

According to IDI’s executive director Greg Leppert, the goal is to “level the playing field” and give more organizations and small projects valuable tools for joining the AI race. The dataset is larger than the one used to train popular AI models like Meta’s Llama. “I think about it a bit like the way that Linux has become a foundational operating system for so much of the world,” said Leppert.

The IDI officially launched today, with OpenAI and Microsoft supporting it through funding and encouraging words. The initiative aims to work with knowledge institutions like government agencies and libraries “to develop data collections and best practices for artificial intelligence.” Details of how to download the new dataset have not been revealed; so far, the only information is that Google will help with distribution.

This new data collection should avoid the copyright infringement disputes that many AI companies have faced this year. “Large public domain datasets like these further demolish the ‘necessity defense’ some AI companies use to justify scraping copyrighted work to train their models,” Ed Newton-Rex, a former executive at Stability AI who now runs a nonprofit that certifies ethically trained AI tools, told Wired.

Newton-Rex recently led a petition to stop tech companies from scraping data to train their AI models.
