Pierre-Carl Langlais

Pclanglais

·

Dorialexander

AI & ML interests

Open data & open LLMs

Recent Activity

updated a dataset 3 days ago

PleIAs/common_corpus

Posts 6

Post

2903

Today we are releasing our first foundation model and experimenting with a new category: specialized pre-training.

OCRonos-Vintage is a 124M-parameter model trained end-to-end by Pleias with llm.c on 18 billion tokens from cultural heritage archives. Despite its small size, it achieves nearly state-of-the-art results for OCR correction of historical English sources. OCRonos-Vintage is also a historical model with an unusual cut-off date: December 29th, 1955…

We look forward to replicating this approach very soon on other "hard" tasks commonly associated with generalist LLMs/SLMs: RAG, function calling, summarization, document segmentation…

OCRonos-Vintage: PleIAs/OCRonos-Vintage
CPU Demo: PleIAs/OCRonos-Vintage-CPU
GPU Demo: PleIAs/OCRonos-Vintage-GPU
Our announcement and call for specialized pre-training: https://huggingface.co/blog/Pclanglais/specialized-pre-training

Articles 7

Article

80

They Said It Couldn’t Be Done

Papers 1

arxiv:2501.08365

Spaces 9

Reversed Zotero

Editorialization

Correction-OCR

Tchap

Motta

tag_theme

Models 38

Pclanglais/Popeye-1929

Text-to-Image • Updated Dec 31, 2024 • 25

Pclanglais/Pleias-Nano-onnx

Text Generation • Updated Dec 9, 2024 • 18

Pclanglais/Pleias-Pico-onnx

Updated Dec 9, 2024 • 8

Pclanglais/Headlines-OCR-Correction

Updated Oct 25, 2024 • 21

Pclanglais/SynthRag3

Updated Sep 11, 2024 • 6

Pclanglais/SynthRag2

Updated Sep 9, 2024 • 10

Pclanglais/SynthRag1

Updated Sep 8, 2024 • 11

Pclanglais/Experiment1

Updated Sep 5, 2024 • 11

Pclanglais/Segmentext-Marianne

Updated Aug 28, 2024 • 8

Pclanglais/OCRonos-Vintage-GGUF

Updated Aug 11, 2024

Datasets 11

Pclanglais/tokenized_sample

Viewer • Updated 4 days ago • 1.54M • 374

Pclanglais/pdf_sample_10k

Viewer • Updated Nov 30, 2024 • 415k • 32 • 1

Pclanglais/open-science

Viewer • Updated Nov 15, 2024 • 10.8M • 216

Pclanglais/LLM-for-DH

Viewer • Updated Jul 14, 2024 • 1.62k • 16

Pclanglais/youtube-commons-metadata

Viewer • Updated Jun 19, 2024 • 6.91M • 59

Pclanglais/OCR-test

Viewer • Updated Apr 22, 2024 • 20.1k • 26 • 1

Pclanglais/AllWikidataCharacters

Viewer • Updated Apr 14, 2024 • 180k • 63 • 7

Pclanglais/wiki-dataset

Viewer • Updated Jan 4, 2024 • 282 • 143

Pclanglais/Mickey-1928-dataset

Viewer • Updated Dec 31, 2023 • 96 • 146 • 7

Pclanglais/MonadGPT

Viewer • Updated Nov 10, 2023 • 10.8k • 73 • 11