Releasing Common Corpus: the largest public domain dataset for training LLMs

« (…) Common Corpus is an international initiative coordinated by Pleias, involving researchers in LLM pretraining, AI ethics and cultural heritage, in association with major organizations committed to an open science approach for AI (HuggingFace, Occiglot, Eleuther, Nomic AI). Common Corpus has received the support of Lang:IA, a state start-up supported by the French Ministry of Culture and the Direction du numérique (Agent Public). Pleias is a French start-up specialized in training Large Language Models for document processing on fully open and auditable corpora.
Contrary to what most large AI companies claim, the release of Common Corpus aims to show that it is possible to train Large Language Models on a fully open and reproducible corpus, without using copyrighted content from Common Crawl and other more dubious sources. (…) »

Source: Pierre-Carl Langlais, 20 March 2024