PleIAs – OCRonos-Vintage | The case for specialized pre-training: ultra-fast foundation models for dedicated tasks

« Pre-training foundation models is generally thought to be the exclusive domain of a handful of AI labs and big tech companies. This is changing. Thanks to more powerful GPUs, optimized programming frameworks (like llm.c) and better datasets, pre-training is becoming affordable and a competitive alternative to fine-tuning.
At PleIAs we are successfully experimenting with a new category of models: specialized pre-training. These models are designed from the ground up for specific tasks, exclusively trained on a large custom instruction dataset (at least 5-10B tokens) and, so far, yield performance comparable to much larger generalist models.
We release a new example of specialized pre-training, OCRonos-Vintage. It’s a 124 million parameter model trained on 18 billion tokens from cultural heritage archives to perform OCR correction. Despite its extremely small size and lack of generalist capabilities, OCRonos-Vintage is currently one of the best available models for this task. We are now deploying it at scale to pre-process a large cultural heritage corpus of more than 700 billion tokens.
OCRonos-Vintage was trained on the new H100 cluster on Jean Zay (compute grant n°GC011015451) with llm.c, the new pre-training library originally developed by Andrej Karpathy for pedagogical purposes. Thanks to the unprecedented performance of llm.c, the multiple experiments released over the past weeks by Yuchen Jin (Hyperbolic Labs) and our advanced data preprocessing pipelines, training from scratch was as easy, short and uneventful as a Llama fine-tune. (…) »

source > huggingface.co, Pierre-Carl Langlais, 4 August 2024
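
As an illustration of the kind of inference such a specialized model enables, here is a minimal sketch of an OCR-correction call with the Hugging Face transformers library. The repository id PleIAs/OCRonos-Vintage, the causal-LM classes and the "### Text ### / ### Correction ###" prompt layout are assumptions made for this sketch, not details confirmed by the excerpt above; adjust them to the model's actual documentation before use.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hugging Face repository name for the model discussed above.
model_id = "PleIAs/OCRonos-Vintage"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Noisy OCR output to be corrected (illustrative input).
ocr_text = "Tbe Parliament of tbe Unitcd Kingdom met yeslerday to debate tbe new bill."

# Assumed instruction-style prompt: the model is expected to complete the correction section.
prompt = f"### Text ###\n{ocr_text}\n\n### Correction ###\n"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=False,  # greedy decoding for deterministic corrections
        pad_token_id=tokenizer.eos_token_id,
    )

# Keep only the newly generated tokens, i.e. the corrected text.
corrected = tokenizer.decode(
    output[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)
print(corrected)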
