---
license: apache-2.0
pipeline_tag: text-generation
---

# Bigram Language Model

## Overview

This repository contains a simple Bigram Language Model implemented in PyTorch. The model predicts the next character in a sequence given only the current character. It operates at the character level and can be used for simple text generation.

## Model Details

- **Model Type**: Character-level Language Model
- **Architecture**: Simple lookup table for character bigrams
- **Training Data**: [XL-Sum (Bengali)](https://huggingface.co/datasets/csebuetnlp/xlsum/viewer/bengali)

## Requirements

- Python 3.x
- PyTorch
- `json` (part of the Python standard library, used to load the tokenizer mappings)

## Installation

First, clone this repository.

## Loading the Model

To load the model, initialize it with the vocabulary size and load the pre-trained weights:

```python
import json

import torch
from model import BigramLanguageModel

# Must match the vocabulary size the checkpoint was trained with
vocab_size = 225
model = BigramLanguageModel(vocab_size)
model.load_state_dict(torch.load('path_to_your_model.pth', map_location=torch.device('cpu')))
model.eval()

with open('tokenizer_mappings.json', 'r', encoding='utf-8') as f:
    mappings = json.load(f)
stoi = mappings['stoi']
# JSON object keys are always strings, so convert the itos keys back to integers
# (this assumes itos was saved as a dictionary)
itos = {int(k): v for k, v in mappings['itos'].items()}

# Example usage
encode = lambda s: [stoi[c] for c in s]  # string -> list of token ids
decode = lambda l: ''.join([itos[i] for i in l])  # list of token ids -> string

# The prompt may only contain characters present in the vocabulary;
# any other character raises a KeyError in encode
context = torch.tensor([encode("Your initial text")], dtype=torch.long)
generated_text_indices = model.generate(context, max_new_tokens=100)
print(decode(generated_text_indices[0].tolist()))
```
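
The tokenizer mappings are read with `json`, and JSON object keys are always strings, which matters for an integer-keyed mapping such as `itos`. The following is a minimal sketch of how character-level mappings like `stoi` and `itos` might be built and round-tripped through JSON. The sample text is hypothetical, and the layout (two dictionaries under the keys `stoi` and `itos`) is an assumption, since the actual contents of `tokenizer_mappings.json` are not shown here.

```python
import json

# Hypothetical sample text; the real vocabulary comes from the training data.
text = "hello world"

# Build character-level mappings (assumed layout: two dictionaries).
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

# Round-trip through JSON, as a saved tokenizer file would be.
payload = json.dumps({"stoi": stoi, "itos": itos}, ensure_ascii=False)
mappings = json.loads(payload)

# JSON turned the integer keys of itos into strings, so convert them back.
stoi_loaded = mappings["stoi"]
itos_loaded = {int(k): v for k, v in mappings["itos"].items()}


def encode(s):
    """Map each character of s to its integer id."""
    return [stoi_loaded[c] for c in s]


def decode(ids):
    """Map a list of integer ids back to a string."""
    return "".join(itos_loaded[i] for i in ids)


ids = encode("hello")
roundtrip = decode(ids)
print(ids, roundtrip)
```

Without the `int(k)` conversion, `decode` would look up integer ids in a dictionary whose keys are strings and raise a `KeyError`.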