README: LoRA Fine-Tuned BERT for Body Parts NER

Model Overview

This repository contains a LoRA fine-tuned BERT model for Named Entity Recognition (NER) that identifies body parts in text. The fine-tuned weights are stored as a LoRA adapter applied on top of bert-base-uncased, which keeps the memory footprint low for deployment in applications that need token-level entity extraction.

Features

Base Model: BERT (bert-base-uncased)
LoRA Fine-Tuning: Applied for token classification to recognize body parts.
Custom Inference Method: Predicts entities with labels, start-end positions, and confidence scores.
Lightweight Adapter: Uses LoRA adapters to reduce memory footprint.
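
To give a rough sense of why a LoRA adapter is small, the sketch below counts the parameters LoRA adds to a single weight matrix. The rank of 8 and the choice of two target projections per layer (query and value) are illustrative assumptions, not values read from this repository's adapter_config.json. The hidden size of 768 and the 12 encoder layers are those of bert-base-uncased.

```python
def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)

HIDDEN = 768            # hidden size of bert-base-uncased
LAYERS = 12             # encoder layers in bert-base-uncased
RANK = 8                # assumed rank, for illustration only
TARGETS_PER_LAYER = 2   # assumed: query and value projections

full_matrix = HIDDEN * HIDDEN
per_matrix = lora_param_count(HIDDEN, HIDDEN, RANK)
adapter_total = per_matrix * TARGETS_PER_LAYER * LAYERS

print(f"Full 768x768 matrix: {full_matrix:,} weights")          # 589,824
print(f"LoRA factors at rank {RANK}: {per_matrix:,} weights")   # 12,288
print(f"Adapter total over {LAYERS} layers: {adapter_total:,} weights")  # 294,912
```

Under these assumptions the adapter adds roughly 2% of the weights of each matrix it targets. The actual figure depends on the rank and target modules recorded in adapter_config.json.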

Model Usage

The model labels each token as O, B-BODY, or I-BODY, so it can be used to locate body-part mentions in sentences.

Installation

Clone the repository (optional; the example below loads the adapter from the Hugging Face Hub by name):

git clone https://huggingface.co/gsri-18/lora_finetuned_bert_on_body_parts_ner_dataset_synthetic
cd lora_finetuned_bert_on_body_parts_ner_dataset_synthetic

Install the required dependencies:

pip install transformers peft torch prettytable

Example Usage

The following script loads the adapter, runs inference on three test sentences, and prints the detected entities:

import torch
from transformers import BertTokenizerFast, BertForTokenClassification
from peft import PeftModel
from prettytable import PrettyTable  # For tabular output

# Device configuration
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hugging Face model repository name
base_model_name = "bert-base-uncased"
lora_model_name = "gsri-18/lora_finetuned_bert_on_body_parts_ner_dataset_synthetic"

# Load the tokenizer
tokenizer = BertTokenizerFast.from_pretrained(base_model_name)

# Load the base model
base_model = BertForTokenClassification.from_pretrained(base_model_name, num_labels=3).to(device)

# Load the LoRA fine-tuned model
model = PeftModel.from_pretrained(base_model, lora_model_name).to(device)

# Set the model to evaluation mode
model.eval()

# Function to predict entities with confidence scores
def predict_entities(sentence, label_mapping):
    # Tokenize the sentence and keep each token's character offsets
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True, return_offsets_mapping=True)
    offsets = inputs.pop("offset_mapping")[0].tolist()  # Not a model input
    inputs = inputs.to(device)
    with torch.no_grad():
        # Get model outputs
        outputs = model(**inputs)

    # Process logits (batch size is 1, so take the first element)
    probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)[0]
    confidence_scores, predictions = torch.max(probabilities, dim=-1)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())

    # Extract entities
    entities = []
    for token, prediction, confidence, (start, end) in zip(
        tokens, predictions.tolist(), confidence_scores.tolist(), offsets
    ):
        if start == end:  # Skip special tokens such as [CLS] and [SEP]
            continue
        if token.startswith("##"):  # Skip subword tokens
            continue
        label = label_mapping[prediction]
        if label != "O":  # Skip tokens labeled as "O"
            # Offsets come from the tokenizer, so repeated or capitalized
            # words are located correctly (sentence.find() would miss them).
            entities.append({
                "token": token,
                "label": label,
                "start": start,
                "end": end - 1,  # Inclusive index of the last character
                "confidence": round(confidence, 8)  # Round to 8 decimal places
            })

    return entities

# Process and print predictions for multiple sentences
def process_and_print_predictions(sentences, label_mapping):
    for i, sentence in enumerate(sentences, 1):
        print(f"\nSentence {i}: {sentence}")
        entities = predict_entities(sentence, label_mapping)

        if not entities:
            print("No body parts were detected in the sentence.")
        else:
            # Create a table for better formatting
            table = PrettyTable()
            table.field_names = ["Token", "Label", "Start", "End", "Confidence"]

            for entity in entities:
                table.add_row([entity['token'], entity['label'], entity['start'], entity['end'], f"{entity['confidence']:.8f}"])

            print(table)

# Test sentences
test_sentences = [
    "The arm connects to the shoulder and elbow.",
    "He felt pain in his knee and spinal cord after running.",
    "The Named Entity Recognition model is working well."
]

label_mapping = {0: "O", 1: "B-BODY", 2: "I-BODY"}

# Display results for all sentences
process_and_print_predictions(test_sentences, label_mapping)
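
The confidence reported for each token is the largest softmax probability over the three labels. The sketch below reproduces that step in plain Python for a single token. The logit values are made up for illustration and do not come from the model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

label_mapping = {0: "O", 1: "B-BODY", 2: "I-BODY"}
token_logits = [-2.0, 5.0, -1.0]  # hypothetical logits for one token

probs = softmax(token_logits)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the most likely label

# The predicted label and its confidence for this token
print(label_mapping[best], round(probs[best], 4))
```

A high confidence only means the model strongly prefers one label over the other two. It is not a calibrated probability of correctness.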

Output

For the input sentences:

"The arm connects to the shoulder and elbow."

+----------+--------+-------+-----+------------+
|  Token   | Label  | Start | End | Confidence |
+----------+--------+-------+-----+------------+
|   arm    | B-BODY |   4   |  6  | 0.99613851 |
| shoulder | B-BODY |   24  |  31 | 0.99928027 |
|  elbow   | B-BODY |   37  |  41 | 0.99879104 |
+----------+--------+-------+-----+------------+

"He felt pain in his knee and spinal cord after running."

+--------+--------+-------+-----+------------+
| Token  | Label  | Start | End | Confidence |
+--------+--------+-------+-----+------------+
|  knee  | B-BODY |   20  |  23 | 0.99953437 |
| spinal | B-BODY |   29  |  34 | 0.99941099 |
|  cord  | I-BODY |   36  |  39 | 0.99498111 |
+--------+--------+-------+-----+------------+

"The Named Entity Recognition model is working well."

No body parts were detected in the sentence.
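
The tables list one row per token, so a multi-word mention such as "spinal cord" appears as a B-BODY row followed by an I-BODY row. The sketch below shows one possible way to merge such rows into a single span. The helper merge_entities is hypothetical and not part of this repository. It assumes the inclusive start/end convention shown in the tables, treats tokens separated by at most one character as adjacent, and takes the minimum token confidence as the span confidence.

```python
def merge_entities(sentence, entities):
    """Merge a B- token and the I- tokens that directly follow it into one span."""
    spans = []
    for ent in entities:
        prefix, _, kind = ent["label"].partition("-")
        prev = spans[-1] if spans else None
        continues = (
            prefix == "I"
            and prev is not None
            and prev["label"] == kind
            and ent["start"] <= prev["end"] + 2  # at most one separator character
        )
        if continues:
            prev["end"] = ent["end"]
            prev["confidence"] = min(prev["confidence"], ent["confidence"])
        else:
            # A B- token (or a stray I- token) starts a new span
            spans.append({
                "label": kind,
                "start": ent["start"],
                "end": ent["end"],
                "confidence": ent["confidence"],
            })
    for span in spans:
        span["text"] = sentence[span["start"]:span["end"] + 1]  # end is inclusive
    return spans

# Rows copied from the second output table above
sentence = "He felt pain in his knee and spinal cord after running."
entities = [
    {"token": "knee", "label": "B-BODY", "start": 20, "end": 23, "confidence": 0.99953437},
    {"token": "spinal", "label": "B-BODY", "start": 29, "end": 34, "confidence": 0.99941099},
    {"token": "cord", "label": "I-BODY", "start": 36, "end": 39, "confidence": 0.99498111},
]

for span in merge_entities(sentence, entities):
    print(span["text"], span["label"], span["start"], span["end"])
# knee BODY 20 23
# spinal cord BODY 29 39
```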

Files in the Repository

adapter_config.json: Configuration for the LoRA adapter.
adapter_model.safetensors: The fine-tuned LoRA adapter weights.
tokenizer.json, vocab.txt, tokenizer_config.json: Tokenizer files.
model-bert-synthetic-bpr-lora-large.pth: The base model weights.

License

No license is specified for this model.

For issues or questions, please create an issue in the repository or reach out to the contributors.
