The translator app:

[Image: screenshot of the translator app]

Model Name

German to English Translator

Model Description

This model translates German to English. It is a sequence-to-sequence Transformer (Seq2SeqTransformer).

  • Developed by: Neelima Monjusha Preeti
  • Model type: Seq2SeqTransformer
  • Language(s): Python
  • License: MIT
  • Contact: [email protected]

Task Description

This app translates German to English. The input sentence is first tokenized, then passed through the encoder and decoder of the trained Seq2SeqTransformer, which produces the English output.

Data Processing

First the source and target languages are defined, then the text is tokenized. Tokenizers for German and English are built on spaCy: torchtext's get_tokenizer function returns a spaCy tokenizer for each language (de_core_news_sm for German, en_core_web_sm for English). A generator function, yield_tokens, tokenizes each sentence from the data iterator for either the source or the target language.

Special symbols and indices:

Special indices are defined for unknown words (UNK_IDX), padding (PAD_IDX), beginning of sequence (BOS_IDX), and end of sequence (EOS_IDX). The corresponding special symbols are ['<unk>', '<pad>', '<bos>', '<eos>'].

Then the vocabulary is built. For each language (source and target), the code iterates over the training data and builds a vocabulary with the build_vocab_from_iterator function, using the tokenization function defined earlier. The vocabulary is built with a minimum frequency of 1 (so every token is included), and the special symbols are added first. For each language's vocabulary, the default index for unknown tokens is set to UNK_IDX.

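from typing import Iterable, Iterator, List

from torchtext.data.utils import get_tokenizer
from torchtext.datasets import Multi30k
from torchtext.vocab import build_vocab_from_iterator

# Language codes as used by Multi30k (assumed here: German source, English target)
SRC_LANGUAGE = 'de'
TGT_LANGUAGE = 'en'

# Per-language tokenizers and vocabularies, filled in below
token_transform = {}
vocab_transform = {}

token_transform[SRC_LANGUAGE] = get_tokenizer('spacy', language='de_core_news_sm')
token_transform[TGT_LANGUAGE] = get_tokenizer('spacy', language='en_core_web_sm')


def yield_tokens(data_iter: Iterable, language: str) -> Iterator[List[str]]:
    # Each data sample is a (source, target) pair; pick the side for `language`
    language_index = {SRC_LANGUAGE: 0, TGT_LANGUAGE: 1}

    for data_sample in data_iter:
        yield token_transform[language](data_sample[language_index[language]])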

# Define special symbols and indices
UNK_IDX, PAD_IDX, BOS_IDX, EOS_IDX = 0, 1, 2, 3
# Make sure the tokens are in order of their indices to properly insert them in vocab
special_symbols = ['<unk>', '<pad>', '<bos>', '<eos>']

for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    # Training data Iterator
    train_iter = Multi30k(split='train', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
  
    vocab_transform[ln] = build_vocab_from_iterator(yield_tokens(train_iter, ln),
                                                    min_freq=1,
                                                    specials=special_symbols,
                                                    special_first=True)

for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    # Return UNK_IDX for any token that is not in the vocabulary
    vocab_transform[ln].set_default_index(UNK_IDX)
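
To make the vocabulary step concrete, here is a minimal pure-Python sketch of the same idea: special symbols first, then corpus tokens, with unknown tokens falling back to UNK_IDX. It is an illustration only and does not call torchtext; build_toy_vocab, lookup, and the toy corpus are hypothetical names, and the token ordering (by frequency, then alphabetically) is an assumption of this sketch.

```python
from collections import Counter
from typing import Dict, Iterable, List

UNK_IDX, PAD_IDX, BOS_IDX, EOS_IDX = 0, 1, 2, 3
SPECIALS = ['<unk>', '<pad>', '<bos>', '<eos>']


def build_toy_vocab(token_lists: Iterable[List[str]], min_freq: int = 1) -> Dict[str, int]:
    """Map tokens to indices, with the special symbols occupying indices 0-3."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    vocab = {sym: idx for idx, sym in enumerate(SPECIALS)}
    # Assumed ordering: most frequent first, ties broken alphabetically
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def lookup(vocab: Dict[str, int], tokens: List[str]) -> List[int]:
    """Numericalize tokens; anything outside the vocabulary becomes UNK_IDX."""
    return [vocab.get(tok, UNK_IDX) for tok in tokens]


toy_corpus = [["ein", "Hund", "läuft"], ["ein", "Mann", "läuft"]]
toy_vocab = build_toy_vocab(toy_corpus)
ids = lookup(toy_vocab, ["ein", "Hund", "schwimmt"])  # "schwimmt" is unseen
```

Because the special symbols are inserted first, their indices match the constants UNK_IDX, PAD_IDX, BOS_IDX, and EOS_IDX, which is why the order of special_symbols matters in the code above.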

Model Architecture

For machine translation I used a Seq2SeqTransformer. The PositionalEncoding class (an nn.Module) adds positional encodings to the token embeddings, while the TokenEmbedding class (also an nn.Module) converts token indices into dense embeddings using an embedding layer. The model is initialized with the following parameters:

num_encoder_layers: Number of layers in the encoder stack (3).

num_decoder_layers: Number of layers in the decoder stack (3).

emb_size: Dimensionality of the token embeddings (512).

nhead: Number of attention heads in the multi-head attention mechanism (512).

src_vocab_size: Vocabulary size of the source language.

tgt_vocab_size: Vocabulary size of the target language.

dim_feedforward: Dimensionality of the feedforward network (default 512).

dropout: Dropout probability (default 0.1).
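
As a rough illustration of what the PositionalEncoding class adds to the embeddings, the sketch below computes a sinusoidal position table in plain Python. It assumes the standard sinusoidal formulation (sine on even dimensions, cosine on odd ones); the source does not show the class body, so the exact variant used in germantoenglish.py may differ, and sinusoidal_positions is a hypothetical name.

```python
import math
from typing import List


def sinusoidal_positions(max_len: int, emb_size: int) -> List[List[float]]:
    """Return a max_len x emb_size table of sinusoidal position encodings."""
    table = []
    for pos in range(max_len):
        row = []
        for i in range(emb_size):
            # Dimensions 2k and 2k+1 share the same frequency
            angle = pos / (10000 ** ((2 * (i // 2)) / emb_size))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


# Small sizes for readability; the model itself uses emb_size = 512
pe = sinusoidal_positions(max_len=4, emb_size=8)
```

Each row of the table is added to the embedding of the token at that position, so the model can tell positions apart even though attention itself is order-agnostic.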

The loss function and optimizer are defined as follows:

loss_fn = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)
optimizer = torch.optim.Adam(transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)
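
The effect of ignore_index=PAD_IDX can be shown with a small pure-Python sketch: padded target positions are skipped, and the loss is averaged over the remaining positions only. This is an illustration of the idea rather than PyTorch's implementation; masked_cross_entropy is a hypothetical name, and returning 0.0 for an all-padding batch is a choice made for this sketch.

```python
import math
from typing import List

PAD_IDX = 1


def masked_cross_entropy(logits: List[List[float]], targets: List[int],
                         ignore_index: int = PAD_IDX) -> float:
    """Mean cross-entropy over positions whose target is not ignore_index."""
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padding contributes nothing to the loss
        # Numerically stable log-sum-exp of the logits
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_norm - row[target]
        count += 1
    return total / count if count else 0.0


# Two positions over a 4-token vocabulary; the second target is padding
logits = [[0.0, 0.0, 0.0, 0.0], [5.0, -2.0, 0.0, 1.0]]
loss = masked_cross_entropy(logits, targets=[2, PAD_IDX])
```

Without this masking, the model would also be rewarded for predicting padding tokens, which says nothing about translation quality.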

In the forward pass, the source sequence goes through the encoder layers and the target sequence through the decoder layers.

The helper functions and the transform dictionary are:

sequential_transforms(*transforms)
tensor_transform(token_ids: List[int])
collate_fn(batch)
text_transform = {}

These utility functions and transformations handle the preprocessing of text data, including tokenization, numericalization, adding special tokens, and collating samples into batch tensors suitable for training a sequence-to-sequence transformer model.
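
The sketch below shows how such helpers can fit together, using plain Python lists in place of tensors. Only the name sequential_transforms comes from the list above; add_bos_eos and pad_batch are hypothetical stand-ins for tensor_transform and collate_fn, and the toy vocabulary and whitespace tokenizer are assumptions made for the example.

```python
from typing import Callable, List

UNK_IDX, PAD_IDX, BOS_IDX, EOS_IDX = 0, 1, 2, 3


def sequential_transforms(*transforms: Callable) -> Callable:
    """Chain transforms so the output of one is the input of the next."""
    def apply(text_input):
        for transform in transforms:
            text_input = transform(text_input)
        return text_input
    return apply


def add_bos_eos(token_ids: List[int]) -> List[int]:
    """Wrap a sequence of token indices in BOS and EOS markers."""
    return [BOS_IDX] + token_ids + [EOS_IDX]


def pad_batch(sequences: List[List[int]]) -> List[List[int]]:
    """Pad every sequence with PAD_IDX to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [PAD_IDX] * (max_len - len(seq)) for seq in sequences]


toy_vocab = {"ein": 4, "Hund": 5, "läuft": 6}
toy_pipeline = sequential_transforms(
    str.split,                                             # tokenization
    lambda toks: [toy_vocab.get(t, UNK_IDX) for t in toks],  # numericalization
    add_bos_eos,                                           # special tokens
)
batch = pad_batch([toy_pipeline("ein Hund läuft"), toy_pipeline("ein Hund")])
```

Here batch holds two rows of equal length, the shorter one padded with PAD_IDX, which is the shape a batched model input needs.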

The Seq2SeqTransformer is then trained and evaluated with the evaluate(model) function.

Result Analysis

greedy_decode() takes the following parameters:

model: The sequence-to-sequence transformer model.

src: The source sequence tensor.

src_mask: The mask for the source sequence.

max_len: The maximum length of the output sequence.

start_symbol: The index of the start symbol in the target vocabulary.

It returns the generated target sequence tensor ys, which contains the complete translation.
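
The greedy strategy itself is simple: start from the start symbol, repeatedly append the highest-scoring next token, and stop at the end-of-sequence token or at max_len. The sketch below shows that loop in plain Python. It is not the greedy_decode from germantoenglish.py; next_token_scores and toy_scores are hypothetical stand-ins for the encoder, decoder, and output layer.

```python
from typing import Callable, List

BOS_IDX, EOS_IDX = 2, 3


def greedy_decode_sketch(next_token_scores: Callable[[List[int]], List[float]],
                         max_len: int, start_symbol: int = BOS_IDX) -> List[int]:
    """Build a sequence by always picking the highest-scoring next token."""
    ys = [start_symbol]
    for _ in range(max_len - 1):
        scores = next_token_scores(ys)
        next_token = max(range(len(scores)), key=scores.__getitem__)
        ys.append(next_token)
        if next_token == EOS_IDX:
            break  # the translation is complete
    return ys


def toy_scores(prefix: List[int]) -> List[float]:
    """Hypothetical scorer that emits tokens 4, 5 and then EOS."""
    script = [4, 5, EOS_IDX]
    target = script[min(len(prefix) - 1, len(script) - 1)]
    return [1.0 if i == target else 0.0 for i in range(6)]


decoded = greedy_decode_sketch(toy_scores, max_len=10)
truncated = greedy_decode_sketch(toy_scores, max_len=3)
```

Greedy decoding is fast but commits to each token as soon as it is chosen, so it can miss translations that a wider search would find.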

Test input:

The translate() function translates a German sentence into English.

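def translate(src_sentence: str):
    # Rebuild the architecture, then load the trained weights
    model = Seq2SeqTransformer(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS, EMB_SIZE,
                               NHEAD, SRC_VOCAB_SIZE, TGT_VOCAB_SIZE, FFN_HID_DIM)

    # map_location lets weights saved on a GPU load on a CPU-only machine
    model.load_state_dict(torch.load('./transformer_model.pth', map_location=DEVICE))
    model.to(DEVICE)
    model.eval()
    # Tokenize and numericalize the sentence into a (seq_len, 1) column tensor
    src = text_transform[SRC_LANGUAGE](src_sentence).view(-1, 1)
    num_tokens = src.shape[0]
    # All-False mask: every source position may attend to every other position
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.bool)
    tgt_tokens = greedy_decode(
        model, src, src_mask, max_len=num_tokens + 5, start_symbol=BOS_IDX).flatten()
    # Map indices back to tokens and drop the <bos>/<eos> markers
    return " ".join(vocab_transform[TGT_LANGUAGE].lookup_tokens(list(tgt_tokens.cpu().numpy()))).replace("<bos>", "").replace("<eos>", "")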

This function first loads the saved model. It then tokenizes the input sentence, runs greedy_decode to generate the translated token sequence, and returns the result as a string.
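
The final step of translate(), turning tokens back into a string, can be illustrated without the model. The sketch below compares the replace-based approach with filtering the markers before joining; join_tokens is a hypothetical helper and the token list is made up for the example.

```python
from typing import List

SPECIAL_MARKERS = {"<bos>", "<eos>"}


def join_tokens(tokens: List[str]) -> str:
    """Join tokens with spaces, skipping the sequence markers."""
    return " ".join(tok for tok in tokens if tok not in SPECIAL_MARKERS)


tokens = ["<bos>", "A", "dog", "runs", "<eos>"]

# Same string operations as the return statement in translate()
as_in_translate = " ".join(tokens).replace("<bos>", "").replace("<eos>", "")

# Filtering first avoids the leftover leading and trailing spaces
filtered = join_tokens(tokens)
```

The replace-based version leaves a space where each marker used to be, so calling .strip() on its result, or filtering as in join_tokens, gives a cleaner string.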

Hugging Face Interface:

To create the interface, app.py imports gradio and torch, along with Seq2SeqTransformer, translate, and greedy_decode from germantoenglish.py.

import gradio as gr
import torch
from germantoenglish import Seq2SeqTransformer, translate, greedy_decode 

The app takes a German sentence as input and shows the translated English text as output.

if __name__ == "__main__":
    iface = gr.Interface(
        fn=translate,
        inputs=[
            gr.components.Textbox(label="Text")
        ],
        outputs=["text"],
        cache_examples=False,
        title="GermanToEnglish",
    )
    # Launch inside the guard so that iface is always defined when this runs
    iface.launch(share=True)

The app interface looks like this:

[Image: screenshot of the app interface]

Project Structure

|---Readme.md
|
|---germantoenglish.py - the full code for processing, training, and evaluating
|
|---app.py - creates the app interface
|
|---Modeltensors - tensor file needed to load the app
|
|---requirements.txt - packages and dataset that must be downloaded for the app to work
|
|---translate_model.pth - the model file loaded by the app

How to Run


git clone https://huggingface.co/spaces/neelimapreeti297/GermanToEnglish

cd GermanToEnglish

pip install -r requirements.txt

python app.py

License

This project is licensed under the MIT License.

Contributor

Neelima Monjusha Preeti - [email protected]

App link: https://huggingface.co/spaces/neelimapreeti297/GermanToEnglish
