While LLMs are making headlines for their performance, encoder models remain fundamental building blocks of NLP and are among the most downloaded models on Hugging Face. Developed through a collaboration between the MICS lab at CentraleSupélec, Diabolocom, Artefact, and Unbabel, the open-source encoder suite EuroBERT represents a significant advancement in multilingual NLP, combining sovereignty, transparency, and performance.
EuroBERT, developed as part of three ongoing doctoral theses, is available in three sizes (210 million, 610 million, and 2.1 billion parameters). It closely follows the Llama 3 architecture and was trained on a corpus of 5 trillion tokens (twice as many as classic encoders), including multilingual datasets, code, and mathematics.
The training pipeline comprises two phases, pretraining and fine-tuning, and relies on the masked language modeling (MLM) objective.
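For illustration, here is a minimal sketch of querying one of the released checkpoints with the MLM objective through the Hugging Face transformers library. The checkpoint name `EuroBERT/EuroBERT-210m` and the `trust_remote_code` flag are assumptions based on the organization page linked at the end of this article, not verbatim instructions from the team.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Assumed checkpoint name under the EuroBERT Hugging Face organization;
# swap in the 610m or 2.1B variant depending on your compute budget.
model_id = "EuroBERT/EuroBERT-210m"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# trust_remote_code is assumed to be needed, since the architecture follows
# Llama 3 rather than the stock BERT classes shipped with transformers.
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)

# Replace one word with the tokenizer's mask token and let the model fill it in.
text = f"The capital of France is {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the masked position and decode the highest-scoring vocabulary entry.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
print(tokenizer.decode(logits[0, mask_pos].argmax(dim=-1)))
```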
It supports eight major European languages (English, French, German, Spanish, Italian, Dutch, Portuguese, and Polish) and seven widely spoken non-European languages (Chinese, Russian, Japanese, Vietnamese, Arabic, Turkish, and Hindi).
A major advantage of EuroBERT lies in its ability to natively handle sequences of up to 8,192 tokens, whereas classic encoders such as BERT and its variants (RoBERTa, for example) are generally limited to 512 tokens, forcing long documents to be split into chunks and fragmenting their understanding. This extended context window lets the model analyze entire documents in a single pass, even for the most demanding NLP tasks.
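To make the contrast with 512-token encoders concrete, the sketch below encodes a long document in one pass. It reuses the assumed checkpoint name from the previous example; the document here is synthetic filler standing in for a real report.

```python
from transformers import AutoTokenizer, AutoModel

model_id = "EuroBERT/EuroBERT-210m"  # assumed checkpoint name, as above

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

# A long document that a 512-token encoder would have to split into chunks.
long_document = "EuroBERT handles long inputs natively. " * 1500

# Cap the input at the model's native 8,192-token window instead of 512.
inputs = tokenizer(long_document, return_tensors="pt",
                   truncation=True, max_length=8192)
outputs = model(**inputs)
print(inputs["input_ids"].shape)        # up to (1, 8192)
print(outputs.last_hidden_state.shape)  # one hidden vector per input token
```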
Various Applications
The capabilities of EuroBERT position it as an essential building block for:
- Information retrieval and text extraction: its effectiveness at retrieving and classifying documents opens up possibilities for companies seeking to optimize their information flows (see the sketch after this list);
- Technical and scientific language processing: its extensive training allows it to better understand and analyze complex texts, particularly in mathematics and programming;
- Machine translation and summarization: it rivals existing state-of-the-art solutions while offering accuracy adapted to European languages.
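As an illustration of the retrieval use case above, here is a minimal sketch that mean-pools the encoder's hidden states into sentence embeddings and ranks documents by cosine similarity. Mean pooling is one simple, common choice, not necessarily the setup the authors evaluate; a production system would typically fine-tune the encoder for retrieval. The checkpoint name is the same assumption as in the earlier examples.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

model_id = "EuroBERT/EuroBERT-210m"  # assumed checkpoint name, as above

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

def embed(text: str) -> torch.Tensor:
    """Mean-pool the last hidden states into one L2-normalized vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=8192)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    return F.normalize(hidden.mean(dim=1), dim=-1)

query = embed("How do I cancel my subscription?")
docs = {
    "cancellation": embed("Kündigungsbedingungen und Fristen für Ihren Vertrag."),
    "finance": embed("Quarterly revenue grew by twelve percent year over year."),
}

# Rank documents by cosine similarity; a multilingual encoder should surface
# the German cancellation document for the English query.
for name, vec in docs.items():
    print(name, torch.matmul(query, vec.T).item())
```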
A Fruitful Public-Private Collaboration
This project was carried out by CIFRE doctoral students Nicolas Boizard, Hippolyte Gisserot-Boukhlef, and Duarte Alves, under the guidance of Pierre Colombo, Céline Hudelot, and André Martins. In addition to the teams from MICS, IST, Diabolocom, Artefact, and Unbabel, it received support from teams at the Université Grenoble Alpes, CNRS, LISN (Interdisciplinary Laboratory of Digital Sciences), Illuin Technology, IRT Saint-Exupéry, and CINES. The article detailing their work is available at https://arxiv.org/abs/2503.05500.
Trained on GENCI's Adastra supercomputer, EuroBERT opens strategic perspectives for businesses and research. Beyond a technical advancement, it illustrates Europe's capacity to innovate and to develop sovereign AI solutions.
Fully open source, it is available under the Apache 2.0 license at https://huggingface.co/EuroBERT.