POS Tagging - Token Segmentation & Categories
A simple script that extracts tokens and their POS categories with a Hugging Face pipeline.
from transformers import pipeline
# Load model and tokenizer
pos_pipeline = pipeline("ner", model="jordigonzm/mdeberta-v3-base-multilingual-pos-tagger")
# Input text
text = "On January 3rd, 2024, the $5.7M prototype—a breakthrough in AI-driven robotics—successfully passed all 37 rigorous performance tests!"
# Run POS tagging
tokens = pos_pipeline(text)
# Print tokens and their categories
for token in tokens:
    print(token["word"], "→", token["entity"])
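The output above is one entry per sub-token, so pieces of a single word can appear as separate rows. If word-level output is preferred, the pipeline's built-in `aggregation_strategy` argument can merge sub-tokens, as in the sketch below; note that because this tagger's labels carry no B-/I- prefixes, adjacent words that share a tag may also be merged into one group, which is why the manual reconstruction in the next script can be the safer option.

```python
from transformers import pipeline

# Same model, but let the pipeline merge sub-tokens into word-level groups.
pos_pipeline = pipeline(
    "token-classification",  # "ner" is an alias for this task
    model="jordigonzm/mdeberta-v3-base-multilingual-pos-tagger",
    aggregation_strategy="simple",
)

for group in pos_pipeline("On January 3rd, 2024, the prototype passed all 37 tests!"):
    # Aggregated results expose "entity_group" instead of "entity".
    print(group["word"], "→", group["entity_group"])
```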
POS Tagging with Stopword Extraction
Automatically detects and extracts nouns and stopwords from text.
This script performs Part-of-Speech (POS) tagging. It correctly reconstructs words, assigns POS labels, and extracts two key word categories:
- Nouns & Proper Nouns (NOUN, PROPN) → Important words in the text.
- Stopwords (DET, ADP, PRON, AUX, CCONJ, SCONJ, PART) → Articles, prepositions, conjunctions, etc.
from transformers import pipeline
# Load the pre-trained POS tagging model
pos_pipeline = pipeline("ner", model="jordigonzm/mdeberta-v3-base-multilingual-pos-tagger")
# Input text
text = "Companies interested in providing the service must take care of signage and information boards."
# Run POS tagging
tokens = pos_pipeline(text)
# Print raw tokens and their POS tags
print("\nTokens POS tagging:")
for token in tokens:
    print(f"{token['word']:10} → {token['entity']}")
# Reconstruct words correctly
words, buffer, labels = [], [], []
buffer_label = None
for token in tokens:
    raw_word = token["word"]
    if raw_word.startswith("▁"):  # "▁" marks the start of a new word
        if buffer:
            words.append("".join(buffer))  # Add the completed word
            labels.append(buffer_label)
        buffer = [raw_word.replace("▁", "")]
        buffer_label = token["entity"]
    else:
        buffer.append(raw_word)  # Continue building the current word
# Add the last word left in the buffer
if buffer:
    words.append("".join(buffer))
    labels.append(buffer_label)
# Print final POS tagging results
print("\nPOS tagging results:")
for word, label in zip(words, labels):
    print(f"{word:<15} → {label}")
# Define valid POS tags for extraction
noun_tags = {"NOUN", "PROPN"} # Nouns & Proper Nouns
stopword_tags = {"DET", "ADP", "PRON", "AUX", "CCONJ", "SCONJ", "PART"} # Common stopword POS tags
# Extract nouns and stopwords separately
filtered_nouns = [word for word, tag in zip(words, labels) if tag in noun_tags]
stopwords = [word for word, tag in zip(words, labels) if tag in stopword_tags]
# Print extracted words
print("\nFiltered Nouns and Proper Nouns:", filtered_nouns)
print("\nStopwords detected:", stopwords)
Multilingual POS Tagging
Overview
This report outlines the evaluation framework and potential training configurations for a multilingual POS tagging model. The model is based on a Transformer architecture and is assessed after a limited number of training epochs.
Expected Ranges
- Validation Loss: typically between 0.02 and 0.1, depending on dataset complexity and regularization.
- Overall Precision: expected to range from 96% to 99%, influenced by dataset diversity and tokenization quality.
- Overall Recall: generally between 96% and 99%, subject to similar factors as precision.
- Overall F1-score: expected range of 96% to 99%, balancing precision and recall.
- Overall Accuracy: can vary between 97% and 99.5%, contingent on language variations and model robustness.
- Evaluation Speed: typically 100-150 samples/sec (25-40 steps/sec), depending on batch size and hardware.
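The model card does not state which scorer produced these figures. The sketch below shows one plausible way to obtain per-token precision, recall, F1, and accuracy from a Hugging Face `Trainer`-style `compute_metrics` function, using scikit-learn and ignoring positions labelled `-100` (padding and special tokens).

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    """Per-token metrics for a token-classification evaluation step."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    # Keep only positions with a real label (-100 marks padding/special tokens).
    true, pred = [], []
    for label_row, pred_row in zip(labels, predictions):
        for lab, prd in zip(label_row, pred_row):
            if lab != -100:
                true.append(lab)
                pred.append(prd)

    precision, recall, f1, _ = precision_recall_fscore_support(
        true, pred, average="weighted", zero_division=0
    )
    return {
        "overall_precision": precision,
        "overall_recall": recall,
        "overall_f1": f1,
        "overall_accuracy": accuracy_score(true, pred),
    }
```

Passed to a `Trainer` as `compute_metrics=compute_metrics`, this reports metrics under the same overall_* names used above.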
Training Configurations
- Model: Transformer-based architecture (e.g., BERT, RoBERTa, XLM-R)
- Training Epochs: 2 to 5, depending on convergence and validation performance.
- Batch Size: 1 to 16, balancing memory constraints and stability.
- Learning Rate: 1e-6 to 5e-4, adjusted based on optimization dynamics and warm-up strategies.
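As a concrete illustration of these ranges, below is a minimal fine-tuning sketch with Hugging Face `TrainingArguments`. The hyperparameter values are picked from within the ranges above and are not the checkpoint's actual training settings; `num_labels=17` assumes the Universal POS tag set.

```python
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    TrainingArguments,
)

# Base checkpoint this tagger was fine-tuned from (see the model tree below).
base_model = "microsoft/mdeberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForTokenClassification.from_pretrained(base_model, num_labels=17)

training_args = TrainingArguments(
    output_dir="mdeberta-pos-tagger",
    num_train_epochs=3,             # within the 2-5 range above
    per_device_train_batch_size=8,  # within the 1-16 range above
    learning_rate=2e-5,             # within the 1e-6 to 5e-4 range above
    warmup_ratio=0.1,               # a simple warm-up strategy
    weight_decay=0.01,
)

# A Trainer would then be built with a tokenized POS dataset, e.g.:
#   Trainer(model=model, args=training_args,
#           train_dataset=..., eval_dataset=...,
#           data_collator=DataCollatorForTokenClassification(tokenizer),
#           compute_metrics=compute_metrics)
```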
Model tree for jordigonzm/mdeberta-v3-base-multilingual-pos-tagger
Base model: microsoft/mdeberta-v3-base