---
license: apache-2.0
datasets:
- indonlu
language:
- id
metrics:
- bleu
pipeline_tag: text-generation
---
_Copyright 2023 Anugrah Akbar Praramadhan. All rights reserved._
_Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at_
_[http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)_
_Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License._
## Model Description
GPT-2 *(Generative Pretrained Transformer 2)* is a transformer-based architecture for causal language modeling: it takes the preceding tokens as an input prompt
and predicts the next token. It was developed by OpenAI *(Radford, Wu, Child, Luan, Amodei, and Sutskever)*.
See the paper here:
[https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf](https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf)
## Limitation
Since GPT-2 was trained on unlabelled text sequences without any explicit supervision,
its output is sampled and therefore varies between runs. To make the output deterministic, set a fixed random seed before generating.
This model supports only English *(inherited from the GPT-2 pretrained model)* and Indonesian *(fine-tuned on an Indonesian Wikipedia dataset)*.
## How To Use
Direct use with PyTorch:
```python
>>> from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM, set_seed
>>> model_name = 'anugrahap/gpt2-indo-textgen'
>>> # left padding so generation continues from the end of the prompt
>>> tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side='left')
>>> model = AutoModelForCausalLM.from_pretrained(model_name, pad_token_id=tokenizer.eos_token_id)
>>> generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
>>> set_seed(1)  # optional: fix the seed for reproducible output
>>> result = generator("Skripsi merupakan tugas akhir mahasiswa", min_length=10, max_length=30, num_return_sequences=1)
>>> result[0]["generated_text"]
```
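Because generation samples from a probability distribution over tokens, the same prompt can produce different continuations on each run; `set_seed` pins the underlying random state so the draw is repeatable. A minimal sketch of that effect, using a toy distribution (the distribution and helper name are illustrative, not part of the model):

```python
from transformers import set_seed
import torch

def sample_with_seed(seed: int) -> list[int]:
    # Re-seeding right before sampling makes the draw reproducible --
    # the same mechanism that makes seeded text generation deterministic.
    set_seed(seed)
    probs = torch.tensor([0.1, 0.2, 0.3, 0.4])
    return torch.multinomial(probs, num_samples=3, replacement=True).tolist()

# Identical seeds yield identical samples; different seeds usually differ.
```

Note that `set_seed` seeds Python's `random`, NumPy, and PyTorch in one call, so it also covers any sampling done inside the `text-generation` pipeline.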
### Learn more
- [GPT-2 Pretrained Model Medium-345M Parameters](https://github.com/openai/gpt-2/blob/master/download_model.py)
- [Indonesian Wikipedia Dataset - 433MB by IndoNLP](https://drive.google.com/file/d/1ZoKd31yr3soveU0O38XEIFUBKx-D66t5/view?usp=sharing)
- [Project Repository](https://huggingface.co/spaces/anugrahap/gpt2-indo-text-gen/tree/main)