---
tags:
- generated_from_trainer
model-index:
- name: EUBERT
  results: []
language:
- bg
- cs
- da
- de
- el
- en
- es
- et
- fi
- fr
- ga
- hr
- hu
- it
- lt
- lv
- mt
- nl
- pl
- pt
- ro
- sk
- sl
- sv
---

## Model Card: EUBERT

### Overview

- **Model Name**: EUBERT
- **Model Version**: 1.0
- **Date of Release**: 02 October 2023
- **Model Architecture**: BERT (Bidirectional Encoder Representations from Transformers)
- **Training Data**: Documents registered by the European Publications Office
- **Model Use Case**: Text Classification, Question Answering, Language Understanding

### Model Description

EUBERT is a pretrained, uncased BERT model trained on a large corpus of documents registered by the [European Publications Office](https://op.europa.eu/).
These documents span the last 30 years, providing a comprehensive dataset that covers a wide range of topics and domains.
EUBERT is designed as a versatile language model that can be fine-tuned for various natural language processing tasks,
making it a useful foundation for many applications.
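
As a quick sketch of how the checkpoint could be loaded with the `transformers` library (the repository id below is a placeholder; substitute the actual Hub path or a local directory containing this model):

```python
# Minimal usage sketch with Hugging Face transformers.
# "EUBERT" is a placeholder: replace it with the real Hub repository id
# or a local path where the model files are stored.
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

model_id = "EUBERT"  # placeholder repository id or local path

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Fill-mask is the pretraining objective, so it works without fine-tuning.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask(f"The European Parliament adopted the {tokenizer.mask_token} yesterday."))
```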

### Intended Use

EUBERT serves as a starting point for building more specific natural language understanding models.
Its versatility makes it suitable for a wide range of tasks, including but not limited to:

1. **Text Classification**: EUBERT can be fine-tuned to classify text documents into categories, making it useful for applications such as sentiment analysis, topic categorization, and spam detection (see the fine-tuning sketch after this list).

2. **Question Answering**: Fine-tuned on question-answering datasets, EUBERT can extract answers from text documents, supporting tasks such as information retrieval and document summarization.

3. **Language Understanding**: EUBERT can be employed for general language understanding tasks, including named entity recognition and part-of-speech tagging.
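
A minimal fine-tuning sketch for the text-classification case, using the `transformers` Trainer API; the repository id, dataset, and label count are illustrative placeholders, not the authors' setup:

```python
# Sketch: fine-tune EUBERT for binary text classification.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

model_id = "EUBERT"  # placeholder repository id or local path

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Any dataset with "text" and "label" columns works; imdb is only an example.
dataset = load_dataset("imdb")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="eubert-classifier",
                           num_train_epochs=1,
                           per_device_train_batch_size=32),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```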

### Performance

The specific performance metrics of EUBERT may vary depending on the downstream task and the quality and quantity of training data used for fine-tuning.
Users are encouraged to fine-tune the model on their specific task and evaluate its performance accordingly.
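
One possible evaluation hook for the Trainer sketch above (this uses scikit-learn, which is not among the framework versions listed below and is only an assumption here):

```python
# Sketch: report accuracy and macro-F1 during Trainer evaluation.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1_macro": f1_score(labels, preds, average="macro"),
    }

# Pass compute_metrics=compute_metrics to the Trainer, then call trainer.evaluate().
```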

### Considerations

- **Data Privacy and Compliance**: Users should ensure that the use of EUBERT complies with all relevant data privacy and compliance regulations, especially when working with sensitive or personally identifiable information.

- **Fine-Tuning**: The effectiveness of EUBERT on a given task depends on the quality and quantity of the training data, as well as the fine-tuning process. Careful experimentation and evaluation are essential to achieve optimal results.

- **Bias and Fairness**: Users should be aware of potential biases in the training data and take appropriate measures to mitigate bias when fine-tuning EUBERT for specific tasks.

### Conclusion

EUBERT is a pretrained BERT model that leverages a substantial corpus of documents from the European Publications Office. It offers a versatile foundation for developing natural language processing solutions across a wide range of applications, enabling researchers and developers to create custom models for text classification, question answering, and language understanding tasks. Users are encouraged to exercise diligence in fine-tuning and evaluating the model for their specific use cases while adhering to data privacy and fairness considerations.


--- 

## Training procedure

A dedicated byte-level BPE tokenizer was trained with a vocabulary size of 2**16 (65,536 tokens) and a minimum token frequency of 2.
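
A tokenizer with these settings could be reproduced roughly as follows (a sketch only; the corpus files and special-token choices are assumptions, not the exact training setup):

```python
# Sketch: train a byte-level BPE tokenizer with vocab_size=2**16, min_frequency=2.
import os
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer(lowercase=True)  # uncased model
tokenizer.train(
    files=["corpus.txt"],                 # placeholder corpus file(s)
    vocab_size=2**16,                     # 65,536 tokens
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],  # assumed
)

os.makedirs("eubert-tokenizer", exist_ok=True)
tokenizer.save_model("eubert-tokenizer")
```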

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 1
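
For reference, these values map approximately onto `transformers` `TrainingArguments` as sketched below (the output directory is a placeholder, and the batch size is assumed to be per device):

```python
from transformers import TrainingArguments

# Approximate translation of the hyperparameters listed above.
training_args = TrainingArguments(
    output_dir="eubert-pretraining",   # placeholder
    learning_rate=5e-5,
    per_device_train_batch_size=32,    # assumed per-device
    per_device_eval_batch_size=32,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    num_train_epochs=1,
)
```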

### Training results

Coming soon 

### Framework versions

- Transformers 4.33.3
- Pytorch 2.0.1+cu117
- Datasets 2.14.5
- Tokenizers 0.13.3

### Infrastructure 

- **Hardware Type:** 4 × 24 GB GPUs
- **Hours used:** 60
- **Cloud Provider:** EuroHPC
- **Compute Region:** Meluxina


## Model Card Authors

Sebastien Campion

## Model Card Contact

[email protected]