Azerbaijani Language GPT Model
This repository contains an implementation of a GPT (Generative Pre-trained Transformer) model trained on Azerbaijani Wikipedia data. The model is designed to understand and generate Azerbaijani text.
Project Structure
.
├── README.md
├── az_tokenizer.json   # Trained tokenizer for Azerbaijani text
├── az_wiki_data.json   # Collected Wikipedia data
├── best_model.pt       # Saved state of the best trained model
├── collect_data.py     # Script for collecting Wikipedia articles
├── generate.py         # Text generation script using the trained model
├── prepare_data.py     # Data preprocessing and tokenizer training
├── requirements.txt    # Project dependencies
└── train.py            # GPT model training script
Setup
- Create and activate a virtual environment:
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
- Install dependencies based on your system:
For Mac with Apple Silicon (M1/M2):
# Install PyTorch for Apple Silicon
pip3 install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cpu
# Install other required packages
pip install transformers wikipedia-api beautifulsoup4 requests
For other systems:
pip install -r requirements.txt
Platform-Specific Notes
Apple Silicon (M1/M2) Macs
- Uses MPS (Metal Performance Shaders) for acceleration
- Optimized memory management for Apple Silicon
- May require specific PyTorch nightly builds
CUDA-enabled GPUs
- Automatically uses CUDA when it is available (see the device-selection sketch after this list)
- Implements mixed precision training
- Memory optimization through gradient accumulation
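The scripts presumably pick the device at runtime. A minimal sketch of what that selection can look like (the exact logic in train.py may differ):

```python
import torch

def pick_device() -> torch.device:
    """Prefer CUDA, then Apple's MPS backend, then fall back to CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = pick_device()
print(f"Training on: {device}")
```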
Data Collection
- Collect Azerbaijani Wikipedia articles:
python collect_data.py
This will save the collected articles to az_wiki_data.json (a collection sketch follows this list).
- Prepare data and train tokenizer:
python prepare_data.py
This will create az_tokenizer.json (a tokenizer-training sketch also follows this list).
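For reference, a hedged sketch of what the collection step can look like with the wikipedia-api package; the user agent, category names, and article limit are illustrative, and collect_data.py may do this differently:

```python
import json
import wikipediaapi

# Illustrative settings only; collect_data.py may use different categories and limits.
wiki = wikipediaapi.Wikipedia(user_agent="az-gpt-data-collector", language="az")

def collect_category(name, limit=50):
    """Fetch plain-text articles belonging to one category."""
    category = wiki.page(f"Kateqoriya:{name}")
    articles = []
    for member in list(category.categorymembers.values())[:limit]:
        if member.ns == wikipediaapi.Namespace.MAIN and member.text:
            articles.append({"title": member.title, "text": member.text})
    return articles

if __name__ == "__main__":
    data = []
    for category_name in ["Azərbaycan tarixi", "Azərbaycan mədəniyyəti"]:
        data.extend(collect_category(category_name))
    with open("az_wiki_data.json", "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
```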
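Similarly, a minimal sketch of BPE tokenizer training with the Hugging Face tokenizers library (installed alongside transformers); the vocabulary size and special tokens are assumptions, not necessarily what prepare_data.py uses:

```python
import json
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Load the collected articles.
with open("az_wiki_data.json", encoding="utf-8") as f:
    articles = json.load(f)
texts = [article["text"] for article in articles]

# Train a byte-pair-encoding tokenizer on the raw text.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(
    vocab_size=30000,  # assumed value
    special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"],
)
tokenizer.train_from_iterator(texts, trainer=trainer)
tokenizer.save("az_tokenizer.json")
```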
Training
Train the GPT model:
python train.py
The training script (a simplified loop is sketched after this list):
- Uses mixed precision training
- Implements gradient accumulation
- Saves model checkpoints every 5 epochs
- Saves the best model based on validation loss
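A simplified sketch of such a loop, combining mixed precision, gradient accumulation, and checkpointing; the accumulation steps, epoch count, and the (logits, loss) return signature of the model are assumptions:

```python
import torch
from torch.cuda.amp import GradScaler, autocast

ACCUM_STEPS = 8  # gradient accumulation steps (assumed value)

def train_epochs(model, train_loader, val_loader, optimizer, device, epochs=50):
    """Mixed-precision loop with gradient accumulation and checkpointing."""
    scaler = GradScaler(enabled=(device.type == "cuda"))
    best_val = float("inf")
    for epoch in range(1, epochs + 1):
        model.train()
        optimizer.zero_grad()
        for step, (x, y) in enumerate(train_loader):
            x, y = x.to(device), y.to(device)
            with autocast(enabled=(device.type == "cuda")):
                _, loss = model(x, y)          # assumes the model returns (logits, loss)
                loss = loss / ACCUM_STEPS
            scaler.scale(loss).backward()
            if (step + 1) % ACCUM_STEPS == 0:  # update weights every ACCUM_STEPS batches
                scaler.step(optimizer)
                scaler.update()
                optimizer.zero_grad()

        val_loss = evaluate(model, val_loader, device)
        if val_loss < best_val:                # keep the best model by validation loss
            best_val = val_loss
            torch.save(model.state_dict(), "best_model.pt")
        if epoch % 5 == 0:                     # periodic checkpoint every 5 epochs
            torch.save(model.state_dict(), f"checkpoint_epoch_{epoch}.pt")

@torch.no_grad()
def evaluate(model, loader, device):
    """Average loss over the validation loader."""
    model.eval()
    losses = []
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        _, loss = model(x, y)
        losses.append(loss.item())
    return sum(losses) / max(len(losses), 1)
```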
Model Architecture
- Transformer-based architecture
- Configuration adjustable in train.py:
  - Embedding dimension: 512
  - Attention heads: 8
  - Layers: 6
  - Block size: 128
  - Batch size: 4
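These hyperparameters could be grouped into a small config object; a hypothetical sketch (the names are illustrative, not necessarily the variables used in train.py):

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    # Values mirror the defaults listed above; shrink them if memory is tight.
    n_embd: int = 512        # embedding dimension
    n_head: int = 8          # attention heads
    n_layer: int = 6         # transformer layers
    block_size: int = 128    # context length in tokens
    batch_size: int = 4      # sequences per optimizer step
    vocab_size: int = 30000  # taken from the trained tokenizer (assumed value)

config = GPTConfig()
```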
Text Generation
Generate text using the trained model:
python generate.py
The generate.py script (a sampling sketch follows this list):
- Loads the trained model and tokenizer
- Generates text based on a user-provided prompt
- Implements sampling strategies such as nucleus sampling and temperature scaling
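A minimal sketch of temperature scaling combined with nucleus (top-p) sampling; generate.py may implement these strategies differently:

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=0.8, top_p=0.9):
    """Pick the next token id from a 1-D logits vector over the vocabulary."""
    logits = logits / temperature                # temperature scaling
    probs = F.softmax(logits, dim=-1)

    # Nucleus (top-p) sampling: keep the smallest set of tokens whose
    # cumulative probability exceeds top_p, then renormalize and sample.
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    keep = (cumulative - sorted_probs) < top_p   # the top token is always kept
    sorted_probs = sorted_probs * keep
    sorted_probs = sorted_probs / sorted_probs.sum()

    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx[choice].item()

# Example: sample from random logits over a 30000-token vocabulary.
next_id = sample_next_token(torch.randn(30000))
```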
Files Description
- collect_data.py: Collects articles from Azerbaijani Wikipedia using categories such as history, culture, literature, and geography
- prepare_data.py: Preprocesses the text and trains a BPE tokenizer
- train.py: Contains the GPT model implementation and training loop
- generate.py: Generates text using the trained model and sampling strategies
- az_wiki_data.json: Collected and preprocessed Wikipedia articles
- az_tokenizer.json: Trained BPE tokenizer for Azerbaijani text
- best_model.pt: Saved state of the best model during training
Training Output
The model saves:
- Best model state as best_model.pt
- Regular checkpoints as checkpoint_epoch_N.pt
- Interrupted training state as interrupt_checkpoint.pt
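The interrupt checkpoint can be produced by catching Ctrl+C around the training loop; a sketch that reuses the model, optimizer, and train_epochs names from the training sketch above (the actual handling in train.py may differ):

```python
import torch

try:
    train_epochs(model, train_loader, val_loader, optimizer, device)
except KeyboardInterrupt:
    # Preserve progress if training is stopped with Ctrl+C.
    torch.save(
        {
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
        },
        "interrupt_checkpoint.pt",
    )
```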
Memory Requirements
- Recommended: GPU with at least 8GB memory
- For larger models: Use gradient accumulation steps
- Adjust the batch size and model size to fit the available memory
Troubleshooting
Common Issues:
Memory Errors:
- Reduce batch size
- Enable gradient accumulation
- Reduce model size
- Clear the GPU cache regularly (a small helper is sketched below)
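A small helper for clearing accelerator caches between runs (an assumed utility, not part of the repository's scripts):

```python
import gc
import torch

def free_accelerator_memory():
    """Release cached memory on the active accelerator between runs."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    elif torch.backends.mps.is_available():
        torch.mps.empty_cache()  # available in recent PyTorch releases
```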
PyTorch Installation:
- For Apple Silicon: use the nightly-build command shown in Setup
- For CUDA: install a PyTorch build that matches your installed CUDA version
Data Loading:
- Reduce the number of workers if you hit worker-process errors
- Enable pinned memory (pin_memory=True) for faster host-to-GPU transfer (see the example below)
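Both knobs are standard DataLoader arguments; an illustrative example with a placeholder dataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset so the example runs on its own;
# train.py builds its dataset from the tokenized articles instead.
dataset = TensorDataset(
    torch.randint(0, 30000, (64, 128)),
    torch.randint(0, 30000, (64, 128)),
)

loader = DataLoader(
    dataset,
    batch_size=4,
    shuffle=True,
    num_workers=0,    # raise gradually; drop back to 0 if worker processes crash
    pin_memory=True,  # speeds up host-to-GPU copies (effective with CUDA)
)
```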
Future Improvements
- Implement model evaluation metrics
- Add data augmentation techniques
- Implement distributed training
- Add model compression techniques