Multilingual E5 for Document Classification (DocLayNet)

This model is a fine-tuned version of intfloat/multilingual-e5-large for document text classification, trained on the DocLayNet dataset.

Evaluation results

  • Test Loss: 0.5192, Test Acc: 0.9719

Usage


# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="kaixkhazaki/multilingual-e5-doclaynet")

prediction = pipe("This is some text from a financial report")
print(prediction)

Model description

  • Base model: intfloat/multilingual-e5-large
  • Task: Document text classification
  • Languages: Multilingual

Training data

The model is trained on six DocLayNet document categories, with the following label mapping:

{
    'financial_reports': 0,
    'government_tenders': 1,
    'laws_and_regulations': 2,
    'manuals': 3,
    'patents': 4,
    'scientific_articles': 5
}
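
Depending on how the fine-tuned config was saved, pipeline predictions may come back with generic `LABEL_<id>` strings rather than the category names above. A small helper (the function name and the example prediction are illustrative, not part of the released model) can map them back using the label mapping from this card:

```python
# Label mapping from the model card (id -> category name)
ID2LABEL = {
    0: "financial_reports",
    1: "government_tenders",
    2: "laws_and_regulations",
    3: "manuals",
    4: "patents",
    5: "scientific_articles",
}

def resolve_label(prediction):
    """Map a pipeline-style prediction such as {'label': 'LABEL_4', ...}
    to its human-readable category name."""
    label = prediction["label"]
    if label.startswith("LABEL_"):
        return ID2LABEL[int(label.split("_")[-1])]
    return label  # config already carries readable names

print(resolve_label({"label": "LABEL_4", "score": 0.97}))  # patents
```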

Training procedure

Trained on a single GPU for 2 epochs, taking approximately 20 minutes.

Hyperparameters:

{
    'batch_size': 8,
    'num_epochs': 10,
    'learning_rate': 2e-5,
    'weight_decay': 0.01,
    'warmup_ratio': 0.1,
    'gradient_clip': 1.0,
    'label_smoothing': 0.1,
    'optimizer': 'AdamW',
    'scheduler': 'cosine_with_warmup'
}
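
The warmup_ratio above is a fraction of total optimizer steps, so the warmup length of a cosine_with_warmup scheduler depends on dataset size, batch size, and epoch count. A quick sketch of the arithmetic (the dataset size here is a made-up number for illustration):

```python
import math

# Hyperparameters from this card
batch_size = 8
num_epochs = 10
warmup_ratio = 0.1

# Hypothetical training-set size, for illustration only
num_examples = 8000

steps_per_epoch = math.ceil(num_examples / batch_size)  # 1000
total_steps = steps_per_epoch * num_epochs              # 10000
warmup_steps = int(total_steps * warmup_ratio)          # 1000

print(warmup_steps)
```

With these numbers the learning rate ramps up linearly for the first 1000 steps, then decays along a cosine curve for the remaining 9000.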
Model size: 560M parameters (F32, Safetensors)
