---
license: apache-2.0
pipeline_tag: text-generation
---

# Bigram Language Model

## Overview

This repository contains a simple bigram language model implemented in PyTorch. The model is character-level: it is trained to predict the next character in a sequence given only the current character, and it can be used for tasks such as text generation.

## Model Details

- **Model Type**: Character-level language model
- **Architecture**: A single lookup table over character bigrams (see the sketch below)
- **Training Data**: [csebuetnlp/xlsum (Bengali subset)](https://huggingface.co/datasets/csebuetnlp/xlsum/viewer/bengali)
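
The model itself is essentially a `vocab_size x vocab_size` embedding table: the row indexed by the current character holds the logits for the next character. Below is a minimal sketch of what such a `BigramLanguageModel` could look like; it illustrates the idea and is not necessarily the exact implementation shipped in this repository's `model.py`.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class BigramLanguageModel(nn.Module):
    """Bigram model: a lookup table mapping each token id to next-token logits."""

    def __init__(self, vocab_size):
        super().__init__()
        # Row i of the table holds the (unnormalized) scores for the token that follows token i.
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx):
        # idx: (batch, time) tensor of token ids -> (batch, time, vocab_size) logits
        return self.token_embedding_table(idx)

    @torch.no_grad()
    def generate(self, idx, max_new_tokens):
        # Repeatedly sample the next token conditioned on the last token in the context.
        for _ in range(max_new_tokens):
            logits = self(idx)                  # (B, T, vocab_size)
            logits = logits[:, -1, :]           # keep only the last time step
            probs = F.softmax(logits, dim=-1)   # scores -> probabilities
            idx_next = torch.multinomial(probs, num_samples=1)  # sample one id per batch row
            idx = torch.cat((idx, idx_next), dim=1)             # append to the running context
        return idx
```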

## Requirements

- Python 3.x
- PyTorch
- Python's built-in `json` module (used to load the tokenizer mappings; no separate install needed)

## Installation

First, clone this repository and install PyTorch:
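
For example (the repository URL below is a placeholder; substitute this repo's actual URL):

```bash
git clone <repository-url>
cd <repository-directory>
pip install torch
```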

## Loading the Model

To load the model, initialize it with the vocabulary size, load the pre-trained weights, and load the tokenizer mappings used to encode and decode text:

```python
import json

import torch
from model import BigramLanguageModel

# Initialize the model with the vocabulary size used during training
vocab_size = 225
model = BigramLanguageModel(vocab_size)

# Load the pre-trained weights (adjust the path to your checkpoint file)
model.load_state_dict(torch.load('path_to_your_model.pth', map_location=torch.device('cpu')))
model.eval()

# Load the character/index mappings saved alongside the model
with open('tokenizer_mappings.json', 'r', encoding='utf-8') as f:
    mappings = json.load(f)
stoi = mappings['stoi']
itos = mappings['itos']
# JSON stores object keys as strings; if itos was saved as an index-to-character
# dict, convert its keys back to integers so decode() can index it with ints
if isinstance(itos, dict):
    itos = {int(k): v for k, v in itos.items()}

# Encode a string to a list of token ids; decode a list of ids back to a string
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])

# Example usage: generate 100 new characters from an initial context
context = torch.tensor([encode("Your initial text")], dtype=torch.long)
generated_text_indices = model.generate(context, max_new_tokens=100)
print(decode(generated_text_indices[0].tolist()))
```
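
Note that `encode` will raise a `KeyError` for any character that is not in `stoi` (i.e., not seen in the training data, which here is Bengali text). If you prefer to silently skip unknown characters instead, one option is `encode = lambda s: [stoi[c] for c in s if c in stoi]`.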