
Azerbaijani Language GPT Model

This repository contains an implementation of a GPT (Generative Pre-trained Transformer) model trained on Azerbaijani Wikipedia data. The model is designed to understand and generate Azerbaijani text.

Project Structure

.
├── README.md
├── az_tokenizer.json        # Trained tokenizer for Azerbaijani text
├── az_wiki_data.json        # Collected Wikipedia data
├── best_model.pt            # Saved state of the best trained model
├── collect_data.py          # Script for collecting Wikipedia articles
├── generate.py              # Text generation script using the trained model
├── prepare_data.py          # Data preprocessing and tokenizer training
├── requirements.txt         # Project dependencies
└── train.py                 # GPT model training script

Setup

  1. Create and activate a virtual environment:
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  2. Install dependencies based on your system:

For Mac with Apple Silicon (M1/M2):

# Install PyTorch for Apple Silicon
pip3 install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cpu

# Install other required packages
pip install transformers wikipedia-api beautifulsoup4 requests

For other systems:

pip install -r requirements.txt

Platform-Specific Notes

Apple Silicon (M1/M2) Macs

  • Uses MPS (Metal Performance Shaders) for acceleration
  • Optimized memory management for Apple Silicon
  • May require specific PyTorch nightly builds

CUDA-enabled GPUs

  • Automatically utilizes CUDA if available
  • Implements mixed precision training
  • Memory optimization through gradient accumulation
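
Independent of the platform, a small device-selection helper along these lines picks the fastest available backend (a sketch; train.py and generate.py may implement this differently):

import torch

def get_device() -> torch.device:
    # Prefer Apple's Metal backend on M1/M2 Macs, then CUDA, then the CPU.
    if torch.backends.mps.is_available():
        return torch.device("mps")
    if torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")

device = get_device()
print(f"Using device: {device}")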

Data Collection

  1. Collect Azerbaijani Wikipedia articles:
python collect_data.py

This will save the collected articles to az_wiki_data.json.
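
In outline, the collection step resembles the sketch below, which uses the wikipedia-api package; the category names, user-agent string, and JSON layout are assumptions rather than the exact contents of collect_data.py, and a recent wikipedia-api release that accepts a user_agent argument is assumed:

import json
import wikipediaapi

# Hypothetical category list; the real script covers history, culture,
# literature, and geography categories.
CATEGORIES = ["Kateqoriya:Azərbaycan tarixi", "Kateqoriya:Azərbaycan mədəniyyəti"]

wiki = wikipediaapi.Wikipedia(user_agent="gpt-wiki-az", language="az")

articles = []
for name in CATEGORIES:
    category = wiki.page(name)
    for member in category.categorymembers.values():
        # Keep only regular articles that actually have text.
        if member.namespace == wikipediaapi.Namespace.MAIN and member.text:
            articles.append({"title": member.title, "text": member.text})

with open("az_wiki_data.json", "w", encoding="utf-8") as f:
    json.dump(articles, f, ensure_ascii=False)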

  2. Prepare data and train tokenizer:
python prepare_data.py

This will create az_tokenizer.json.
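
Tokenizer training with the Hugging Face tokenizers library looks roughly like this; the vocabulary size, special tokens, and article JSON layout are assumptions, not necessarily what prepare_data.py uses:

import json
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Load the collected articles and keep only their raw text.
with open("az_wiki_data.json", encoding="utf-8") as f:
    texts = [article["text"] for article in json.load(f)]

# Byte-pair-encoding model with simple whitespace pre-tokenization.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(vocab_size=30000,
                     special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"])
tokenizer.train_from_iterator(texts, trainer=trainer)
tokenizer.save("az_tokenizer.json")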

Training

Train the GPT model:

python train.py

The training script:

  • Uses mixed precision training
  • Implements gradient accumulation
  • Saves model checkpoints every 5 epochs
  • Saves the best model based on validation loss
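
The core of such a loop, combining mixed precision, gradient accumulation, and the checkpointing scheme above, can be sketched as follows; the function signature, learning rate, and the use of training loss in place of a separate validation loss are simplifications, not the exact code in train.py:

import torch

def train(model, train_loader, device, num_epochs=50,
          accumulation_steps=8, lr=3e-4):
    use_amp = device.type == "cuda"                     # mixed precision on CUDA only
    scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    best_loss = float("inf")

    for epoch in range(num_epochs):
        model.train()
        optimizer.zero_grad()
        running_loss = 0.0
        for step, (inputs, targets) in enumerate(train_loader):
            inputs, targets = inputs.to(device), targets.to(device)
            with torch.cuda.amp.autocast(enabled=use_amp):
                _, loss = model(inputs, targets)        # assumes model returns (logits, loss)
            running_loss += loss.item()
            # Scale the loss so accumulated gradients average out correctly.
            scaler.scale(loss / accumulation_steps).backward()
            if (step + 1) % accumulation_steps == 0:
                scaler.step(optimizer)
                scaler.update()
                optimizer.zero_grad()

        epoch_loss = running_loss / max(len(train_loader), 1)
        if (epoch + 1) % 5 == 0:                        # checkpoint every 5 epochs
            torch.save(model.state_dict(), f"checkpoint_epoch_{epoch + 1}.pt")
        if epoch_loss < best_loss:                      # best model so far
            best_loss = epoch_loss
            torch.save(model.state_dict(), "best_model.pt")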

Model Architecture

  • Transformer-based architecture
  • Configuration adjustable in train.py:
    • Embedding dimension: 512
    • Attention heads: 8
    • Layers: 6
    • Block size: 128
    • Batch size: 4
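
In code, that configuration corresponds to something like the following; the field names, vocabulary size, and dropout value are illustrative rather than copied from train.py:

from dataclasses import dataclass

@dataclass
class GPTConfig:
    vocab_size: int = 30000   # taken from the trained tokenizer in practice
    n_embd: int = 512         # embedding dimension
    n_head: int = 8           # attention heads
    n_layer: int = 6          # transformer blocks
    block_size: int = 128     # maximum context length in tokens
    dropout: float = 0.1      # assumed regularization value

batch_size = 4                # sequences per micro-batch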

Text Generation

Generate text using the trained model:

python generate.py

The generate.py script:

  • Loads the trained model and tokenizer
  • Generates text based on a user-provided prompt
  • Implements sampling strategies such as nucleus sampling and temperature scaling
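
Temperature scaling and nucleus (top-p) sampling over the next-token logits can be implemented roughly as below; this is a generic sketch, not the exact code in generate.py:

import torch
import torch.nn.functional as F

def sample_next_token(logits: torch.Tensor,
                      temperature: float = 0.8,
                      top_p: float = 0.9) -> int:
    # Temperature scaling: values below 1.0 sharpen the distribution.
    probs = F.softmax(logits / temperature, dim=-1)

    # Nucleus sampling: keep the smallest set of tokens whose cumulative
    # probability reaches top_p, renormalize, and sample from that set.
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    keep = cumulative - sorted_probs < top_p     # the top token is always kept
    sorted_probs = sorted_probs * keep
    sorted_probs = sorted_probs / sorted_probs.sum()
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx[choice].item()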

Files Description

  • collect_data.py: Collects articles from Azerbaijani Wikipedia using categories like history, culture, literature, and geography
  • prepare_data.py: Preprocesses text and trains a BPE tokenizer
  • train.py: Contains GPT model implementation and training loop
  • generate.py: Generates text using the trained model and sampling strategies
  • az_wiki_data.json: Collected and preprocessed Wikipedia articles
  • az_tokenizer.json: Trained BPE tokenizer for Azerbaijani text
  • best_model.pt: Saved state of the best model during training

Training Output

The model saves:

  • Best model state as best_model.pt
  • Regular checkpoints as checkpoint_epoch_N.pt
  • Interrupted training state as interrupt_checkpoint.pt

Memory Requirements

  • Recommended: a GPU with at least 8 GB of memory
  • For larger models: increase the number of gradient accumulation steps
  • Adjust the batch size and model size to fit the available memory

Troubleshooting

Common Issues:

  1. Memory Errors:

    • Reduce batch size
    • Enable gradient accumulation
    • Reduce model size
    • Clear GPU cache regularly
  2. PyTorch Installation:

    • For Apple Silicon: Use the nightly build command
    • For CUDA: Install appropriate CUDA version
  3. Data Loading:

    • Reduce the number of DataLoader workers if you encounter multiprocessing errors
    • Enable pinned memory (pin_memory=True) for faster CPU-to-GPU transfers
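
In practice these adjustments come down to a few settings like the ones below (the dataset variable and the concrete values are placeholders):

import torch
from torch.utils.data import DataLoader

# Smaller micro-batches plus more accumulation steps keep the effective
# batch size while lowering peak memory use.
batch_size = 2
accumulation_steps = 16

# Fewer workers avoids multiprocessing errors on some systems; pinned
# memory speeds up host-to-GPU copies (only useful with CUDA).
loader = DataLoader(dataset, batch_size=batch_size, shuffle=True,
                    num_workers=0, pin_memory=torch.cuda.is_available())

# Free cached GPU memory if allocations start to fail mid-training.
if torch.cuda.is_available():
    torch.cuda.empty_cache()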

Future Improvements

  • Implement model evaluation metrics
  • Add data augmentation techniques
  • Implement distributed training
  • Add model compression techniques