IsmatS committed
Commit
463c2c1
1 Parent(s): 201277a

Upload folder using huggingface_hub

Files changed (11)
  1. .DS_Store +0 -0
  2. .gitignore +163 -0
  3. README.md +148 -0
  4. az_tokenizer.json +0 -0
  5. az_wiki_data.json +0 -0
  6. collect_data.py +127 -0
  7. generate.py +68 -0
  8. prepare_data.py +124 -0
  9. push_to_hf.py +17 -0
  10. requirements.txt +42 -0
  11. train.py +274 -0
.DS_Store ADDED
Binary file (6.15 kB).
 
.gitignore ADDED
@@ -0,0 +1,163 @@
+ # Byte-compiled / optimized / DLL files
+ __pycache__/
+ *.py[cod]
+ *$py.class
+ .vscode
+ /wandb
+ # C extensions
+ *.so
+ best_model.pt
+ # Distribution / packaging
+ .Python
+ build/
+ develop-eggs/
+ dist/
+ downloads/
+ eggs/
+ .eggs/
+ lib/
+ lib64/
+ parts/
+ sdist/
+ var/
+ wheels/
+ share/python-wheels/
+ *.egg-info/
+ .installed.cfg
+ *.egg
+ MANIFEST
+
+ # PyInstaller
+ # Usually these files are written by a python script from a template
+ # before PyInstaller builds the exe, so as to inject date/other infos into it.
+ *.manifest
+ *.spec
+
+ # Installer logs
+ pip-log.txt
+ pip-delete-this-directory.txt
+
+ # Unit test / coverage reports
+ htmlcov/
+ .tox/
+ .nox/
+ .coverage
+ .coverage.*
+ .cache
+ nosetests.xml
+ coverage.xml
+ *.cover
+ *.py,cover
+ .hypothesis/
+ .pytest_cache/
+ cover/
+
+ # Translations
+ *.mo
+ *.pot
+
+ # Django stuff:
+ *.log
+ local_settings.py
+ db.sqlite3
+ db.sqlite3-journal
+
+ # Flask stuff:
+ instance/
+ .webassets-cache
+
+ # Scrapy stuff:
+ .scrapy
+
+ # Sphinx documentation
+ docs/_build/
+
+ # PyBuilder
+ .pybuilder/
+ target/
+
+ # Jupyter Notebook
+ .ipynb_checkpoints
+
+ # IPython
+ profile_default/
+ ipython_config.py
+
+ # pyenv
+ # For a library or package, you might want to ignore these files since the code is
+ # intended to run in multiple environments; otherwise, check them in:
+ # .python-version
+
+ # pipenv
+ # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+ # However, in case of collaboration, if having platform-specific dependencies or dependencies
+ # having no cross-platform support, pipenv may install dependencies that don't work, or not
+ # install all needed dependencies.
+ #Pipfile.lock
+
+ # poetry
+ # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
+ # This is especially recommended for binary packages to ensure reproducibility, and is more
+ # commonly ignored for libraries.
+ # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
+ #poetry.lock
+
+ # pdm
+ # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
+ #pdm.lock
+ # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
+ # in version control.
+ # https://pdm.fming.dev/latest/usage/project/#working-with-version-control
+ .pdm.toml
+ .pdm-python
+ .pdm-build/
+
+ # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
+ __pypackages__/
+
+ # Celery stuff
+ celerybeat-schedule
+ celerybeat.pid
+
+ # SageMath parsed files
+ *.sage.py
+
+ # Environments
+ .env
+ .venv
+ env/
+ venv/
+ ENV/
+ env.bak/
+ venv.bak/
+
+ # Spyder project settings
+ .spyderproject
+ .spyproject
+
+ # Rope project settings
+ .ropeproject
+
+ # mkdocs documentation
+ /site
+
+ # mypy
+ .mypy_cache/
+ .dmypy.json
+ dmypy.json
+
+ # Pyre type checker
+ .pyre/
+
+ # pytype static type analyzer
+ .pytype/
+
+ # Cython debug symbols
+ cython_debug/
+
+ # PyCharm
+ # JetBrains specific template is maintained in a separate JetBrains.gitignore that can
+ # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
+ # and can be added to the global gitignore or merged into this file. For a more nuclear
+ # option (not recommended) you can uncomment the following to ignore the entire idea folder.
+ #.idea/
README.md ADDED
@@ -0,0 +1,148 @@
+ # Azerbaijani Language GPT Model
+
+ This repository contains an implementation of a GPT (Generative Pre-trained Transformer) model trained on Azerbaijani Wikipedia data. The model is designed to understand and generate Azerbaijani text.
+
+ ## Project Structure
+ ```
+ .
+ ├── README.md
+ ├── az_tokenizer.json   # Trained tokenizer for Azerbaijani text
+ ├── az_wiki_data.json   # Collected Wikipedia data
+ ├── best_model.pt       # Saved state of the best trained model
+ ├── collect_data.py     # Script for collecting Wikipedia articles
+ ├── generate.py         # Text generation script using the trained model
+ ├── prepare_data.py     # Data preprocessing and tokenizer training
+ ├── push_to_hf.py       # Uploads the project folder to the Hugging Face Hub
+ ├── requirements.txt    # Project dependencies
+ └── train.py            # GPT model training script
+ ```
+
+ ## Setup
+
+ 1. Create and activate a virtual environment:
+ ```bash
+ python -m venv .venv
+ source .venv/bin/activate  # On Windows: .venv\Scripts\activate
+ ```
+
+ 2. Install dependencies for your system.
+
+ For Macs with Apple Silicon (M1/M2):
+ ```bash
+ # Install PyTorch for Apple Silicon
+ pip3 install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cpu
+
+ # Install the other required packages
+ pip install transformers wikipedia-api beautifulsoup4 requests
+ ```
+
+ For other systems:
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ ## Platform-Specific Notes
+
+ ### Apple Silicon (M1/M2) Macs
+ - MPS (Metal Performance Shaders) can provide acceleration; see the device-selection sketch below
+ - Optimized memory management for Apple Silicon
+ - May require a specific PyTorch nightly build
+
+ ### CUDA-enabled GPUs
+ - `train.py` uses CUDA automatically when it is available
+ - Mixed precision training via `torch.amp`
+ - Memory use can be reduced further with gradient accumulation (see the Training section)
+
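+ A minimal device-selection sketch (the `pick_device` helper below is illustrative, not part of the scripts) shows how code like this can prefer the best available backend:
+
+ ```python
+ import torch
+
+ def pick_device() -> str:
+     """Prefer CUDA, then Apple MPS, then CPU."""
+     if torch.cuda.is_available():
+         return "cuda"
+     if torch.backends.mps.is_available():
+         return "mps"
+     return "cpu"
+
+ print(pick_device())
+ ```
+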
+ ## Data Collection
+
+ 1. Collect Azerbaijani Wikipedia articles:
+ ```bash
+ python collect_data.py
+ ```
+ This will save the articles to `az_wiki_data.json`.
+
+ 2. Prepare the data and train the tokenizer:
+ ```bash
+ python prepare_data.py
+ ```
+ This will create `az_tokenizer.json`.
+
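+ Each entry in `az_wiki_data.json` is keyed by article title and carries the fields written by `collect_data.py` (`title`, `text`, `url`, `length`). For example:
+
+ ```python
+ import json
+
+ with open("az_wiki_data.json", encoding="utf-8") as f:
+     pages = json.load(f)
+
+ # Peek at one record: 'length' is the character count of 'text'.
+ title, page = next(iter(pages.items()))
+ print(title, page["url"], page["length"])
+ ```
+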
+ ## Training
+
+ Train the GPT model:
+ ```bash
+ python train.py
+ ```
+
+ The training script:
+ - Uses mixed precision training
+ - Saves model checkpoints every 5 epochs
+ - Saves the best model based on validation loss
+ - Can be extended with gradient accumulation for larger models (see the sketch below)
+
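+ Gradient accumulation is not wired into `train.py` as shipped; a minimal sketch of how the inner loop could accumulate over several micro-batches before stepping (`model`, `loader`, `optimizer`, and `scaler` are assumed to be set up as in `train.py`) looks like this:
+
+ ```python
+ import torch
+
+ accum_steps = 4  # effective batch = batch_size * accum_steps
+ optimizer.zero_grad(set_to_none=True)
+ for it, (x, y) in enumerate(loader):
+     x, y = x.to("cuda"), y.to("cuda")
+     with torch.amp.autocast(device_type="cuda"):
+         _, loss = model(x, y)
+     # Scale down so the accumulated gradient matches one large batch.
+     scaler.scale(loss / accum_steps).backward()
+     if (it + 1) % accum_steps == 0:
+         scaler.step(optimizer)
+         scaler.update()
+         optimizer.zero_grad(set_to_none=True)
+ ```
+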
+ ## Model Architecture
+
+ - Transformer-based (decoder-only) architecture
+ - Configuration adjustable via `GPTConfig` in `train.py` (defaults shown):
+   - Embedding dimension: 768
+   - Attention heads: 12
+   - Layers: 8
+   - Block size: 256
+   - Batch size: 8
+
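+ For example, a smaller configuration can be constructed directly when memory is tight:
+
+ ```python
+ from train import GPT, GPTConfig
+
+ # Shrink the model to fit limited GPU memory.
+ config = GPTConfig(n_embd=512, n_head=8, n_layer=6, block_size=128, batch_size=4)
+ model = GPT(config)
+ print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.2f}M parameters")
+ ```
+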
+ ## Text Generation
+
+ Generate text with the trained model:
+ ```bash
+ python generate.py
+ ```
+ The `generate.py` script:
+ - Loads the trained model and tokenizer
+ - Generates text from a user-provided prompt
+ - Implements sampling strategies such as nucleus (top-p) sampling, temperature scaling, and a repetition penalty
+
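+ The same functions can also be called programmatically:
+
+ ```python
+ from generate import load_model_and_tokenizer, generate_text
+
+ model, tokenizer = load_model_and_tokenizer()
+ text = generate_text(model, tokenizer, "Azərbaycanın tarixi",
+                      max_new_tokens=50, p=0.9)
+ print(text)
+ ```
+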
+ ## Files Description
+
+ - `collect_data.py`: Collects articles from Azerbaijani Wikipedia categories such as history, culture, literature, and geography
+ - `prepare_data.py`: Preprocesses the text and trains a BPE tokenizer
+ - `train.py`: Contains the GPT model implementation and training loop
+ - `generate.py`: Generates text using the trained model and sampling strategies
+ - `az_wiki_data.json`: Collected and preprocessed Wikipedia articles
+ - `az_tokenizer.json`: Trained BPE tokenizer for Azerbaijani text
+ - `best_model.pt`: Saved state of the best model during training
+
+ ## Training Output
+
+ The model saves:
+ - Best model state as `best_model.pt`
+ - Regular checkpoints as `checkpoint_epoch_N.pt` (every 5 epochs)
+ - Interrupted training state as `interrupt_checkpoint.pt`
+
+ ## Memory Requirements
+
+ - Recommended: GPU with at least 8 GB of memory
+ - For larger models: use gradient accumulation (see the Training section)
+ - Batch size and model size are adjustable via `GPTConfig` to fit available memory
+
+ ## Troubleshooting
+
+ Common issues:
+ 1. Memory errors:
+    - Reduce the batch size
+    - Enable gradient accumulation
+    - Reduce the model size
+    - Clear the GPU cache regularly (`torch.cuda.empty_cache()`)
+
+ 2. PyTorch installation:
+    - For Apple Silicon: use the nightly build command above
+    - For CUDA: install a build matching your CUDA version
+
+ 3. Data loading:
+    - Reduce the number of workers if you hit multiprocessing errors (see the snippet below)
+    - Enable `pin_memory` for faster host-to-GPU transfer
+
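+ For instance, a conservative loader configuration (a sketch; the project's defaults live in `prepare_data.py` and `train.py`, and `train_dataset` is assumed to exist) sidesteps worker-process issues:
+
+ ```python
+ from torch.utils.data import DataLoader
+
+ # num_workers=0 loads data in the main process, avoiding
+ # multiprocessing errors; pin_memory speeds host-to-GPU copies.
+ train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True,
+                           num_workers=0, pin_memory=True)
+ ```
+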
+ ## Future Improvements
+
+ - [ ] Implement model evaluation metrics
+ - [ ] Add data augmentation techniques
+ - [ ] Implement distributed training
+ - [ ] Add model compression techniques
az_tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
az_wiki_data.json ADDED
The diff for this file is too large to render. See raw diff
 
collect_data.py ADDED
@@ -0,0 +1,127 @@
+ import wikipediaapi
+ import json
+ import time
+
+ def get_wiki_pages(categories=["Azərbaycan tarixi", "Azərbaycan mədəniyyəti",
+                                "Azərbaycan ədəbiyyatı", "Azərbaycan coğrafiyası"],
+                    min_length=500, max_pages=1000):
+     """
+     Recursively collect substantial Azerbaijani Wikipedia pages from multiple categories.
+     """
+     wiki = wikipediaapi.Wikipedia(
+         language='az',
+         extract_format=wikipediaapi.ExtractFormat.WIKI,
+         user_agent='AzGPTDataCollector/1.0'
+     )
+
+     collected_pages = {}
+     visited_pages = set()
+     visited_categories = set()
+
+     def collect_pages(category_title):
+         if len(collected_pages) >= max_pages:
+             return
+
+         # Guard against category cycles (categories can reference each other)
+         if category_title in visited_categories:
+             return
+         visited_categories.add(category_title)
+
+         category = wiki.page(f"Kateqoriya:{category_title}")
+         if not category.exists():
+             print(f"Category not found: {category_title}")
+             return
+
+         # First, process all articles in this category
+         for member in category.categorymembers.values():
+             if len(collected_pages) >= max_pages:
+                 return
+
+             if member.title in visited_pages:
+                 continue
+
+             visited_pages.add(member.title)
+
+             # Skip category and template pages
+             if member.title.startswith('Kateqoriya:') or member.title.startswith('Şablon:'):
+                 continue
+
+             # Skip if the content is too short
+             if len(member.text) < min_length:
+                 continue
+
+             collected_pages[member.title] = {
+                 'title': member.title,
+                 'text': member.text,
+                 'url': member.fullurl,
+                 'length': len(member.text)
+             }
+             print(f"Collected: {member.title} ({len(member.text)} chars)")
+
+             # Delay to avoid hitting API limits
+             time.sleep(0.1)
+
+         # Then process subcategories
+         for subcategory in category.categorymembers.values():
+             if subcategory.title.startswith('Kateqoriya:'):
+                 collect_pages(subcategory.title.replace('Kateqoriya:', ''))
+
+     # Start collection from each category
+     for category in categories:
+         print(f"\nStarting collection from category: {category}")
+         collect_pages(category)
+
+     return collected_pages
+
+ def preprocess_text(text):
+     """
+     Text preprocessing for Azerbaijani text.
+     """
+     # Normalize characters first, before punctuation spacing splits '...'
+     replacements = {
+         'І': 'I',    # Cyrillic І to Latin I
+         '...': '…',
+     }
+     for old, new in replacements.items():
+         text = text.replace(old, new)
+     # Note: a blanket 'i' -> 'ı' replacement would corrupt ordinary dotted
+     # i's, which are common in Azerbaijani, so it is deliberately omitted.
+
+     # Add a space after punctuation if missing
+     for punct in '.!?،؛:()[]{}«»':
+         text = text.replace(punct, punct + ' ')
+
+     # Collapse extra whitespace
+     text = ' '.join(text.split())
+
+     return text
+
+ def save_dataset(pages, output_file='az_wiki_data.json'):
+     """
+     Save collected pages to a JSON file.
+     """
+     with open(output_file, 'w', encoding='utf-8') as f:
+         json.dump(pages, f, ensure_ascii=False, indent=2)
+     print(f"Saved {len(pages)} pages to {output_file}")
+
+ def main():
+     # Collect pages with a minimum length requirement
+     print("Starting data collection...")
+     pages = get_wiki_pages(min_length=500, max_pages=100)  # at least 500 chars per page
+
+     # Preprocess and save
+     print("\nPreprocessing and saving data...")
+     for title in pages:
+         pages[title]['text'] = preprocess_text(pages[title]['text'])
+         pages[title]['length'] = len(pages[title]['text'])  # refresh after preprocessing
+
+     save_dataset(pages)
+
+     # Print statistics
+     total_chars = sum(page['length'] for page in pages.values())
+     if pages:
+         print("\nCollection complete!")
+         print(f"Total pages: {len(pages)}")
+         print(f"Total characters: {total_chars}")
+         print(f"Average page length: {total_chars / len(pages):.2f} characters")
+
+         # Print some titles as examples
+         print("\nSample of collected articles:")
+         for title in list(pages.keys())[:5]:
+             print(f"- {title} ({pages[title]['length']} chars)")
+
+ if __name__ == "__main__":
+     main()
generate.py ADDED
@@ -0,0 +1,68 @@
+ import torch
+ import torch.nn.functional as F
+ from tokenizers import Tokenizer
+ from train import GPT, GPTConfig  # The model definition lives in train.py
+
+ def nucleus_sampling(logits, p=0.9):
+     """Sample the next token id from the smallest set of tokens whose
+     cumulative probability exceeds p (top-p / nucleus sampling)."""
+     sorted_logits, sorted_indices = torch.sort(logits, descending=True)
+     cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
+     sorted_indices_to_remove = cumulative_probs > p
+     # Shift right so the first token that crosses the threshold is kept
+     sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
+     sorted_indices_to_remove[..., 0] = 0
+     logits[sorted_indices[sorted_indices_to_remove]] = -float('Inf')
+     probabilities = F.softmax(logits, dim=-1)
+     next_token_id = torch.multinomial(probabilities, num_samples=1).item()
+     return next_token_id
+
+ def load_model_and_tokenizer():
+     # Load the model configuration, weights, and tokenizer
+     config = GPTConfig()
+     model = GPT(config)
+     model.load_state_dict(torch.load('best_model.pt', map_location=torch.device('cpu')))
+     model.eval()  # Set the model to evaluation mode
+     tokenizer = Tokenizer.from_file("az_tokenizer.json")
+     return model, tokenizer
+
+ def apply_repetition_penalty(logits, input_ids, penalty=1.2):
+     # Penalize tokens that have already been generated. Dividing a negative
+     # logit would *raise* its probability, so negative logits are multiplied
+     # by the penalty instead (CTRL-style).
+     for token_id in set(input_ids):
+         if logits[0, token_id] < 0:
+             logits[0, token_id] *= penalty
+         else:
+             logits[0, token_id] /= penalty
+     return logits
+
+ def generate_text(model, tokenizer, prompt, max_new_tokens=50, temperature=0.001, p=0.95, repetition_penalty=1.5, device='cpu'):
+     # Note: a very low temperature (like the 0.001 default) makes sampling
+     # close to greedy decoding; raise it for more varied output.
+     model = model.to(device)
+     input_ids = tokenizer.encode(prompt).ids
+     input_tensor = torch.tensor([input_ids], dtype=torch.long).to(device)
+
+     for _ in range(max_new_tokens):
+         # Crop the context to the model's block size so the forward pass
+         # never exceeds the positional embedding table
+         input_tensor = input_tensor[:, -model.block_size:]
+
+         with torch.no_grad():
+             output_logits, _ = model(input_tensor)
+
+         # Apply temperature scaling
+         logits = output_logits[:, -1, :] / temperature
+
+         # Apply the repetition penalty
+         logits = apply_repetition_penalty(logits.clone(), input_ids, penalty=repetition_penalty)
+
+         # Use nucleus sampling
+         next_token_id = nucleus_sampling(logits[0], p=p)
+
+         input_ids.append(next_token_id)
+         input_tensor = torch.tensor([input_ids], dtype=torch.long).to(device)
+
+         # The trained tokenizer defines no '[END]' token (token_to_id returns
+         # None here), so this check is a no-op unless such a token is added.
+         if next_token_id == tokenizer.token_to_id('[END]'):
+             break
+
+     generated_text = tokenizer.decode(input_ids)
+     return generated_text.replace(' i ', ' ')  # Minor post-processing to clean up spaces
+
+ def main():
+     model, tokenizer = load_model_and_tokenizer()
+     prompt = "Azərbaycanın tarixi"  # Input prompt
+     generated_text = generate_text(model, tokenizer, prompt, p=0.9)  # Adjust p as needed
+     print("Generated Text:", generated_text)
+
+ if __name__ == '__main__':
+     main()
prepare_data.py ADDED
@@ -0,0 +1,124 @@
+ import json
+ import torch
+ from torch.utils.data import Dataset, DataLoader
+ from tokenizers import Tokenizer, normalizers, pre_tokenizers
+ from tokenizers.models import BPE
+ from tokenizers.trainers import BpeTrainer
+ from tqdm import tqdm
+
+ class AzerbaijaniTokenizer:
+     def __init__(self, vocab_size=50000):
+         self.tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
+         self.tokenizer.normalizer = normalizers.Sequence([
+             normalizers.NFD(),
+             normalizers.Lowercase(),
+             normalizers.StripAccents(),
+         ])
+         self.tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
+             pre_tokenizers.WhitespaceSplit(),
+             pre_tokenizers.Punctuation(),
+         ])
+
+         self.trainer = BpeTrainer(
+             vocab_size=vocab_size,
+             special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
+             min_frequency=2
+         )
+
+     def train(self, texts):
+         """Train the tokenizer on the given texts"""
+         print("Training tokenizer...")
+         self.tokenizer.train_from_iterator(texts, trainer=self.trainer)
+
+     def save(self, path):
+         """Save the tokenizer to a file"""
+         self.tokenizer.save(path)
+
+     def load(self, path):
+         """Load the tokenizer from a file"""
+         self.tokenizer = Tokenizer.from_file(path)
+
+     def get_vocab_size(self):
+         return self.tokenizer.get_vocab_size()
+
+ class WikiTextDataset(Dataset):
+     def __init__(self, texts, tokenizer, max_length=512):
+         self.tokenizer = tokenizer
+         self.max_length = max_length
+
+         print("Tokenizing texts...")
+         self.examples = []
+
+         for text in tqdm(texts):
+             # Tokenize the text
+             tokens = self.tokenizer.encode(text).ids
+
+             # Slide a window of max_length tokens with 50% overlap; the
+             # max(..., 0) + 1 keeps texts shorter than max_length instead
+             # of silently dropping them.
+             for i in range(0, max(len(tokens) - max_length, 0) + 1, max_length // 2):
+                 chunk = tokens[i:i + max_length]
+                 if len(chunk) < max_length:
+                     # Pad short chunks with the [PAD] id (0)
+                     chunk = chunk + [0] * (max_length - len(chunk))
+                 self.examples.append(chunk)
+
+     def __len__(self):
+         return len(self.examples)
+
+     def __getitem__(self, idx):
+         # Return input and target sequences (for next-token prediction)
+         tokens = self.examples[idx]
+         return torch.tensor(tokens[:-1]), torch.tensor(tokens[1:])
+
+ def prepare_data_and_tokenizer():
+     # Load the collected Wikipedia data
+     print("Loading Wikipedia data...")
+     with open('az_wiki_data.json', 'r', encoding='utf-8') as f:
+         wiki_data = json.load(f)
+
+     # Extract the texts
+     texts = [page['text'] for page in wiki_data.values()]
+
+     # Create and train the tokenizer
+     tokenizer = AzerbaijaniTokenizer(vocab_size=50000)
+     tokenizer.train(texts)
+
+     # Save the tokenizer
+     tokenizer.save("az_tokenizer.json")
+     print(f"Tokenizer vocabulary size: {tokenizer.get_vocab_size()}")
+
+     # Create the dataset
+     dataset = WikiTextDataset(texts, tokenizer.tokenizer)
+
+     # Create the data loaders
+     train_size = int(0.9 * len(dataset))
+     val_size = len(dataset) - train_size
+
+     train_dataset, val_dataset = torch.utils.data.random_split(
+         dataset, [train_size, val_size]
+     )
+
+     train_loader = DataLoader(
+         train_dataset,
+         batch_size=16,
+         shuffle=True,
+         num_workers=4
+     )
+
+     val_loader = DataLoader(
+         val_dataset,
+         batch_size=16,
+         shuffle=False,
+         num_workers=4
+     )
+
+     print(f"Total sequences: {len(dataset)}")
+     print(f"Training sequences: {len(train_dataset)}")
+     print(f"Validation sequences: {len(val_dataset)}")
+
+     return tokenizer, train_loader, val_loader
+
+ if __name__ == "__main__":
+     tokenizer, train_loader, val_loader = prepare_data_and_tokenizer()
push_to_hf.py ADDED
@@ -0,0 +1,17 @@
+ import os
+ from dotenv import load_dotenv
+ from huggingface_hub import login, HfApi
+
+ # Load the Hugging Face token from .env (expects a HUGGINGFACE_TOKEN=... line)
+ load_dotenv()
+ hf_token = os.getenv("HUGGINGFACE_TOKEN")
+
+ # Log in to Hugging Face
+ login(token=hf_token)
+
+ # Define the target repository ID
+ repo_id = "IsmatS/gpt-wiki-az"
+
+ # Initialize HfApi and upload the project folder
+ api = HfApi()
+ api.upload_folder(folder_path="./", path_in_repo="", repo_id=repo_id)
requirements.txt ADDED
@@ -0,0 +1,42 @@
+ beautifulsoup4==4.12.3
+ certifi==2024.8.30
+ charset-normalizer==3.4.0
+ click==8.1.7
+ docker-pycreds==0.4.0
+ filelock==3.16.1
+ fsspec==2024.10.0
+ gitdb==4.0.11
+ GitPython==3.1.43
+ huggingface-hub==0.26.2
+ idna==3.10
+ Jinja2==3.1.4
+ MarkupSafe==3.0.2
+ mpmath==1.3.0
+ networkx==3.4.2
+ numpy==2.1.3
+ packaging==24.2
+ pillow==11.0.0
+ platformdirs==4.3.6
+ protobuf==5.28.3
+ psutil==6.1.0
+ python-dotenv==1.0.1  # used by push_to_hf.py
+ PyYAML==6.0.2
+ regex==2024.11.6
+ requests==2.32.3
+ safetensors==0.4.5
+ sentry-sdk==2.18.0
+ setproctitle==1.3.3
+ setuptools==75.5.0
+ six==1.16.0
+ smmap==5.0.1
+ soupsieve==2.6
+ sympy==1.13.1
+ tokenizers==0.20.3
+ torch==2.6.0.dev20241113
+ torchaudio==2.5.0.dev20241113
+ torchvision==0.20.0.dev20241113
+ tqdm==4.67.0
+ transformers==4.46.2
+ typing_extensions==4.12.2
+ urllib3==2.2.3
+ wandb==0.18.6
+ Wikipedia-API==0.7.1
train.py ADDED
@@ -0,0 +1,274 @@
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+ from torch.utils.data import Dataset, DataLoader
+ from torch.optim.lr_scheduler import CosineAnnealingLR
+ import math
+ from tqdm import tqdm
+ import json
+ from tokenizers import Tokenizer
+ import gc
+
+ class GPTConfig:
+     def __init__(
+         self,
+         vocab_size=22588,    # Must match the trained tokenizer's vocabulary size
+         n_embd=768,          # Reduced from 2048
+         n_head=12,           # Reduced from 16
+         n_layer=8,           # Reduced from 12
+         dropout=0.1,
+         block_size=256,      # Reduced from 512
+         learning_rate=3e-4,
+         max_epochs=50,
+         batch_size=8,        # Reduced from 64
+         grad_clip=1.0,
+     ):
+         self.vocab_size = vocab_size
+         self.n_embd = n_embd
+         self.n_head = n_head
+         self.n_layer = n_layer
+         self.dropout = dropout
+         self.block_size = block_size
+         self.learning_rate = learning_rate
+         self.max_epochs = max_epochs
+         self.batch_size = batch_size
+         self.grad_clip = grad_clip
+
+ # Model Architecture
+ class SelfAttention(nn.Module):
+     def __init__(self, config):
+         super().__init__()
+         assert config.n_embd % config.n_head == 0
+         self.w_k = nn.Linear(config.n_embd, config.n_embd)
+         self.w_q = nn.Linear(config.n_embd, config.n_embd)
+         self.w_v = nn.Linear(config.n_embd, config.n_embd)
+         self.attn_drop = nn.Dropout(config.dropout)
+         self.resid_drop = nn.Dropout(config.dropout)
+         self.proj = nn.Linear(config.n_embd, config.n_embd)
+         self.n_head = config.n_head
+         self.n_embd = config.n_embd
+         # Causal mask so each position attends only to earlier positions;
+         # persistent=False keeps the buffer out of saved checkpoints.
+         self.register_buffer(
+             "mask",
+             torch.tril(torch.ones(config.block_size, config.block_size))
+                  .view(1, 1, config.block_size, config.block_size),
+             persistent=False,
+         )
+
+     def forward(self, x):
+         B, T, C = x.size()
+         k = self.w_k(x).view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
+         q = self.w_q(x).view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
+         v = self.w_v(x).view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
+
+         att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
+         att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float('-inf'))
+         att = F.softmax(att, dim=-1)
+         att = self.attn_drop(att)
+         y = att @ v
+         y = y.transpose(1, 2).contiguous().view(B, T, C)
+         y = self.resid_drop(self.proj(y))
+         return y
+
+ class Block(nn.Module):
+     def __init__(self, config):
+         super().__init__()
+         self.ln1 = nn.LayerNorm(config.n_embd)
+         self.attn = SelfAttention(config)
+         self.ln2 = nn.LayerNorm(config.n_embd)
+         self.mlp = nn.Sequential(
+             nn.Linear(config.n_embd, 4 * config.n_embd),
+             nn.GELU(),
+             nn.Linear(4 * config.n_embd, config.n_embd),
+             nn.Dropout(config.dropout),
+         )
+
+     def forward(self, x):
+         x = x + self.attn(self.ln1(x))
+         x = x + self.mlp(self.ln2(x))
+         return x
+
+ class GPT(nn.Module):
+     def __init__(self, config):
+         super().__init__()
+         self.tok_emb = nn.Embedding(config.vocab_size, config.n_embd)
+         self.pos_emb = nn.Parameter(torch.zeros(1, config.block_size, config.n_embd))
+         self.drop = nn.Dropout(config.dropout)
+         self.blocks = nn.ModuleList([Block(config) for _ in range(config.n_layer)])
+         self.ln_f = nn.LayerNorm(config.n_embd)
+         self.head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
+
+         self.block_size = config.block_size
+         self.apply(self._init_weights)
+
+     def _init_weights(self, module):
+         if isinstance(module, (nn.Linear, nn.Embedding)):
+             module.weight.data.normal_(mean=0.0, std=0.02)
+             if isinstance(module, nn.Linear) and module.bias is not None:
+                 module.bias.data.zero_()
+         elif isinstance(module, nn.LayerNorm):
+             module.bias.data.zero_()
+             module.weight.data.fill_(1.0)
+
+     def forward(self, idx, targets=None):
+         b, t = idx.size()
+         assert t <= self.block_size, f"Cannot forward sequence of length {t}, block size is only {self.block_size}"
+
+         token_embeddings = self.tok_emb(idx)
+         position_embeddings = self.pos_emb[:, :t, :]
+         x = self.drop(token_embeddings + position_embeddings)
+         for block in self.blocks:
+             x = block(x)
+         x = self.ln_f(x)
+         logits = self.head(x)
+
+         loss = None
+         if targets is not None:
+             loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
+
+         return logits, loss
+
+ class WikiTextDataset(Dataset):
+     def __init__(self, texts, tokenizer, max_length=256):  # Reduced max_length
+         self.tokenizer = tokenizer
+         self.max_length = max_length
+
+         print("Tokenizing texts...")
+         self.examples = []
+
+         for text in tqdm(texts):
+             tokens = self.tokenizer.encode(text).ids
+             # Slide a window with 50% overlap; the max(..., 0) + 1 keeps
+             # texts shorter than max_length instead of dropping them.
+             for i in range(0, max(len(tokens) - max_length, 0) + 1, max_length // 2):
+                 chunk = tokens[i:i + max_length]
+                 if len(chunk) < max_length:
+                     chunk = chunk + [0] * (max_length - len(chunk))  # pad with [PAD] id 0
+                 self.examples.append(chunk)
+
+     def __len__(self):
+         return len(self.examples)
+
+     def __getitem__(self, idx):
+         tokens = self.examples[idx]
+         return torch.tensor(tokens[:-1]), torch.tensor(tokens[1:])
+
+ def train():
+     # Pick a device: the script targets CUDA and falls back to CPU
+     # (adapt to 'mps' for Apple Silicon if desired)
+     device = 'cuda' if torch.cuda.is_available() else 'cpu'
+
+     # Clear GPU memory
+     if device == 'cuda':
+         torch.cuda.empty_cache()
+     gc.collect()
+
+     print("Loading Wikipedia data...")
+     with open('az_wiki_data.json', 'r', encoding='utf-8') as f:
+         wiki_data = json.load(f)
+
+     texts = [page['text'] for page in wiki_data.values()]
+     tokenizer = Tokenizer.from_file("az_tokenizer.json")
+
+     dataset = WikiTextDataset(texts, tokenizer)
+     train_size = int(0.9 * len(dataset))
+     val_size = len(dataset) - train_size
+     train_dataset, val_dataset = torch.utils.data.random_split(dataset, [train_size, val_size])
+
+     config = GPTConfig()
+
+     train_loader = DataLoader(
+         train_dataset,
+         batch_size=config.batch_size,
+         shuffle=True,
+         num_workers=2,  # Reduced from 4
+         pin_memory=True
+     )
+
+     val_loader = DataLoader(
+         val_dataset,
+         batch_size=config.batch_size,
+         shuffle=False,
+         num_workers=2,  # Reduced from 4
+         pin_memory=True
+     )
+
+     model = GPT(config)
+     model = model.to(device)
+     print(f"Number of parameters: {sum(p.numel() for p in model.parameters())/1e6:.2f}M")
+
+     optimizer = torch.optim.AdamW(model.parameters(), lr=config.learning_rate)
+     scheduler = CosineAnnealingLR(optimizer, T_max=config.max_epochs)
+     # torch.amp replaces the deprecated torch.cuda.amp API
+     scaler = torch.amp.GradScaler(device, enabled=(device == 'cuda'))
+
+     def run_epoch(split, epoch_num=0):
+         is_train = split == 'train'
+         model.train(is_train)  # train(False) is equivalent to eval()
+
+         loader = train_loader if is_train else val_loader
+         losses = []
+
+         pbar = tqdm(enumerate(loader), total=len(loader)) if is_train else enumerate(loader)
+
+         for it, (x, y) in pbar:
+             # Trade some speed for memory headroom
+             if device == 'cuda':
+                 torch.cuda.empty_cache()
+
+             x = x.to(device, non_blocking=True)
+             y = y.to(device, non_blocking=True)
+
+             with torch.amp.autocast(device_type=device, enabled=(device == 'cuda')):
+                 logits, loss = model(x, y)
+
+             losses.append(loss.item())
+
+             if is_train:
+                 scaler.scale(loss).backward()
+                 scaler.unscale_(optimizer)
+                 torch.nn.utils.clip_grad_norm_(model.parameters(), config.grad_clip)
+                 scaler.step(optimizer)
+                 scaler.update()
+                 optimizer.zero_grad(set_to_none=True)
+
+                 pbar.set_description(f"epoch {epoch_num+1} iter {it}: train loss {loss.item():.5f}")
+
+             # Drop references so memory can be reclaimed promptly
+             del x, y, logits
+             if is_train:
+                 del loss
+
+         mean_loss = torch.tensor(losses).mean().item()
+         return mean_loss
+
+     best_val_loss = float('inf')
+     # Defined up front so the interrupt handler can save even during epoch 1
+     epoch, train_loss, val_loss = 0, float('inf'), float('inf')
+
+     try:
+         for epoch in range(config.max_epochs):
+             print(f"\nEpoch {epoch+1}/{config.max_epochs}")
+
+             train_loss = run_epoch('train', epoch_num=epoch)
+
+             with torch.no_grad():
+                 val_loss = run_epoch('val')
+
+             scheduler.step()
+
+             if val_loss < best_val_loss:
+                 best_val_loss = val_loss
+                 print(f"Saving best model with val_loss: {val_loss:.4f}")
+                 torch.save(model.state_dict(), 'best_model.pt')
+
+             print(f"Epoch {epoch+1}: train_loss: {train_loss:.4f}, val_loss: {val_loss:.4f}")
+
+             if (epoch + 1) % 5 == 0:
+                 torch.save({
+                     'epoch': epoch,
+                     'model_state_dict': model.state_dict(),
+                     'optimizer_state_dict': optimizer.state_dict(),
+                     'scheduler_state_dict': scheduler.state_dict(),
+                     'train_loss': train_loss,
+                     'val_loss': val_loss,
+                 }, f'checkpoint_epoch_{epoch+1}.pt')
+
+     except KeyboardInterrupt:
+         print('Training interrupted, saving checkpoint...')
+         torch.save({
+             'epoch': epoch,
+             'model_state_dict': model.state_dict(),
+             'optimizer_state_dict': optimizer.state_dict(),
+             'scheduler_state_dict': scheduler.state_dict(),
+             'train_loss': train_loss,
+             'val_loss': val_loss,
+         }, 'interrupt_checkpoint.pt')
+
+ if __name__ == '__main__':
+     train()
+ train()