---
license: apache-2.0
pipeline_tag: text-generation
---
# Bigram Language Model
## Overview
This repository contains a simple Bigram Language Model implemented in PyTorch. The model is trained to predict the next character in a sequence, given the current character. It's a character-level model and can be used for tasks like text generation.
## Model Details
- **Model Type**: Character-level Language Model
- **Architecture**: Simple lookup table for character bigrams (a minimal sketch of such a model is shown after this list)
- **Training Data**: [csebuetnlp/xlsum (Bengali)](https://huggingface.co/datasets/csebuetnlp/xlsum/viewer/bengali)
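The repository's `model.py` is not reproduced in this card. For reference, a character-level bigram model of this kind is typically a single `vocab_size × vocab_size` embedding table whose row *i* holds the logits for the character that follows character *i*. The sketch below is an illustrative assumption about what `BigramLanguageModel` may look like (the class name and `generate` signature are taken from the usage example further down), not the exact code in `model.py`:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BigramLanguageModel(nn.Module):
    """Minimal character-level bigram model: one lookup table of next-character logits."""

    def __init__(self, vocab_size):
        super().__init__()
        # Row i holds the logits for the character that follows character i.
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        # idx: (B, T) tensor of character indices -> logits of shape (B, T, vocab_size)
        logits = self.token_embedding_table(idx)
        if targets is None:
            return logits, None
        B, T, C = logits.shape
        loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens):
        # Autoregressively extend idx one character at a time by sampling
        # from the predicted next-character distribution.
        for _ in range(max_new_tokens):
            logits, _ = self(idx)
            probs = F.softmax(logits[:, -1, :], dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=1)
        return idx
```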
## Requirements
- Python 3.x
- PyTorch
- `json` (Python standard library, for loading the tokenizer mappings)
## Installation
First, clone this repository:
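The clone URL is not given in this card; the placeholder below is illustrative:
```bash
# Replace <repository-url> with this repository's clone URL
git clone <repository-url>
cd <repository-name>
# Install the runtime dependency
pip install torch
```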
## Loading the Model
To load the model, you need to initialize it with the vocabulary size and load the pre-trained weights:
```python
import torch
from model import BigramLanguageModel

vocab_size = 225  # must match the vocabulary used during training
model = BigramLanguageModel(vocab_size)
model.load_state_dict(torch.load('path_to_your_model.pth', map_location=torch.device('cpu')))
model.eval()
```
## Loading the Tokenizer
The character-to-index mappings used by `encode`/`decode` are loaded from `tokenizer_mappings.json`:
```python
import json

with open('tokenizer_mappings.json', 'r', encoding='utf-8') as f:
    mappings = json.load(f)
stoi = mappings['stoi']
itos = mappings['itos']
# JSON stores object keys as strings; if itos was saved as a dict keyed by
# token ids, convert the keys back to integers so decoding can index by id.
if isinstance(itos, dict):
    itos = {int(k): v for k, v in itos.items()}
```
## Generating Text
```python
# encode: string -> list of token ids; decode: list of token ids -> string
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])

# Example usage
context = torch.tensor([encode("Your initial text")], dtype=torch.long)
generated_text_indices = model.generate(context, max_new_tokens=100)
print(decode(generated_text_indices[0].tolist()))
```
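If `generate` follows the common convention of returning the prompt plus the newly sampled tokens, the decoded output will start with the initial text followed by 100 generated characters.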