Link: Harvard adds copyright-free fuel to the AI fire.

Harvard University has released a dataset of nearly one million public-domain books to help train AI tools. This initiative, supported by Microsoft and OpenAI, aims to democratize access to high-quality training data.

The dataset spans a diverse range of materials, from Shakespeare to obscure Czech math textbooks, and is significantly larger than previous book collections used in AI training. Greg Leppert highlighted its potential to level the playing field for smaller entities in AI development.

Even with the new dataset, AI companies will still need to source additional data to improve their models and maintain a competitive edge. Burton Davis of Microsoft says open data pools like this one help foster a community-focused data ecosystem.

The dataset's release comes amid ongoing legal battles over the use of copyrighted data in AI training. If AI companies prevail in court, they might continue using such data without licensing agreements, but a loss would force significant changes to how they source training data.

Alongside books, the initiative plans to add scanned newspaper articles and hopes to form further collaborations. Although the release method isn't finalized, the initiative expects Google to assist in making the dataset publicly accessible.

Other organizations are also developing public-domain AI resources, recognizing their potential to reduce reliance on copyrighted materials. Initiatives like Common Corpus, from the French AI startup Pleias, are setting new precedents for using open data in AI training.

--

Yoooo, this is a quick note on a link that made me go, WTF? Find all past links here.