Dell Research Harvard | The American Stories dataset

« The American Stories dataset is a collection of full article texts extracted from historical U.S. newspaper images. It includes nearly 20 million scans from the public domain Chronicling America collection maintained by the Library of Congress. The dataset is designed to address the challenges posed by complex layouts and low OCR quality in existing newspaper datasets. It was created using a novel deep learning pipeline that incorporates layout detection, legibility classification, custom OCR, and the association of article texts spanning multiple bounding boxes. It employs efficient architectures specifically designed for mobile phones to ensure high scalability. The dataset offers high-quality data that can be utilized for various purposes. It can be used to pre-train large language models and improve their understanding of historical English and world knowledge. (…) »

Source: huggingface.co, Hugging Face, Dell Research Harvard, AmericanStories (Revision 3484aca), 2023, doi:10.57967/hf/0757
