---
language: ko
tags:
- text-2-text-generation
---


# Model Card for BERT base model for Korean
 
# Model Details
 
## Model Description
 
More information needed.
 
- **Developed by:** Kiyoung Kim
- **Shared by [Optional]:** Kiyoung Kim
- **Model type:** Text2Text Generation 
- **Language(s) (NLP):** Korean
- **License:** More information needed 
- **Parent Model:** bert-base-multilingual-uncased
- **Resources for more information:**
  - [GitHub Repo](https://github.com/kiyoungkim1/LM-kor)
 	


# Uses
 

## Direct Use
This model can be used for the task of text2text generation.
 
## Downstream Use [Optional]
 
More information needed.
 
## Out-of-Scope Use
 
The model should not be used to intentionally create hostile or alienating environments for people. 
 
# Bias, Risks, and Limitations
 
 
Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.



## Recommendations
 
 
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

# Training Details
 
## Training Data
* The model was trained on a 70GB Korean text dataset with a vocabulary of 42,000 lower-cased subwords.
 
The model authors also note in the [GitHub Repo](https://github.com/kiyoungkim1/LM-kor):
 
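> The data used for training is as follows.
>
> 1. 100 million reviews from major Korean e-commerce sites + 20 million blog-style websites (75GB)
> 2. Modu Corpus (모두의 말뭉치) (18GB)
> 3. Wikipedia and Namuwiki (6GB)
>
> After removing unnecessary or overly short sentences and duplicate sentences, 70GB (about 12.7 billion tokens) of the original 100GB of text data was used for training.
> The data is classified into categories such as cosmetics (8GB), food (6GB), electronics (13GB), and pets (2GB), and was used to train domain-specific language models.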
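
The cleaning step described in the quote (dropping very short and duplicated sentences) could look roughly like the sketch below. This is only an illustration, not the authors' pipeline: the function name `clean_corpus`, the `min_chars` threshold, and the whitespace normalization are assumptions that do not appear in the source.

```python
def clean_corpus(sentences, min_chars=10):
    """Drop very short sentences and exact duplicates, preserving order.

    Assumption: `min_chars` is a hypothetical threshold; the source does not
    say how "too short" was defined.
    """
    seen = set()
    kept = []
    for sentence in sentences:
        text = " ".join(sentence.split())  # collapse runs of whitespace
        if len(text) < min_chars:
            continue  # too short to be useful
        if text in seen:
            continue  # exact duplicate after normalization
        seen.add(text)
        kept.append(text)
    return kept


sample = [
    "배송이 빠르고 포장도 꼼꼼해서 만족합니다.",
    "좋아요",
    "배송이 빠르고  포장도 꼼꼼해서 만족합니다.",  # duplicate with extra space
    "향이 은은하고 발림성이 좋아서 재구매했습니다.",
]
cleaned = clean_corpus(sample)
print(len(cleaned))
```

At the scale of a 100GB corpus, an in-memory `set` would likely be replaced by hashing or an external deduplication tool; the sketch only shows the filtering logic.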
 
 
## Training Procedure

 
### Preprocessing

The model authors also note in the [GitHub Repo](https://github.com/kiyoungkim1/LM-kor): 
> Whole-word masking was applied to the BERT model.
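
To illustrate what whole-word masking means for WordPiece tokens, here is a simplified sketch. It assumes the usual WordPiece convention that a `##` prefix marks a continuation piece. The function `whole_word_mask` and its parameters are hypothetical, and the sketch omits details of standard BERT pretraining (such as the 80/10/10 replacement rule and a cap on the number of masked tokens); it is not the authors' training code.

```python
import random


def whole_word_mask(tokens, mask_prob=0.15, seed=0, mask_token="[MASK]"):
    """Mask all WordPiece pieces of a word together (simplified sketch)."""
    rng = random.Random(seed)

    # Group token indices into words: a "##" piece continues the previous word.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])

    # Decide per word, then mask every piece of the chosen words.
    masked = list(tokens)
    for word in words:
        if rng.random() < mask_prob:
            for i in word:
                masked[i] = mask_token
    return masked, words


tokens = ["한국", "##어", "모델", "##을", "학습", "##합니다"]
masked, words = whole_word_mask(tokens, mask_prob=0.5, seed=1)
print(masked)
```

The key property is that a word's pieces are either all masked or all left intact, so the model cannot recover a masked piece from its unmasked neighbours within the same word.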
 
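> Characters other than Hangul, English letters, digits, and some special characters were judged to interfere with training and were removed (for example, Hanja and emoji).
> 40,000 subwords were generated using the WordPiece model from [Huggingface tokenizers](https://github.com/huggingface/tokenizers).
> 2,000 unused tokens were added on top of these for training; the unused tokens are intended to hold domain-specific terms.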
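
The character filtering and vocabulary sizing described above can be sketched as follows. The exact set of retained special characters is not given in the source, so the punctuation kept by `DISALLOWED` is an assumption, as are the helper name `filter_chars` and the `[unusedN]` token naming (borrowed from the convention in the original BERT vocabulary files).

```python
import re

# Assumption: the source does not list which special characters were kept;
# this small punctuation set is purely illustrative.
DISALLOWED = re.compile(r"[^가-힣ㄱ-ㅎㅏ-ㅣa-zA-Z0-9\s.,!?%()-]")


def filter_chars(text):
    """Remove characters outside the allowed set and collapse whitespace."""
    return " ".join(DISALLOWED.sub("", text).split())


# Vocabulary size: 40,000 WordPiece subwords plus 2,000 unused slots.
NUM_SUBWORDS = 40_000
NUM_UNUSED = 2_000
# Assumption: "[unusedN]" naming is hypothetical; the source does not state it.
unused_tokens = [f"[unused{i}]" for i in range(NUM_UNUSED)]
vocab_size = NUM_SUBWORDS + len(unused_tokens)

print(filter_chars("배송 빠름 👍 韓國 good 100%!"))  # emoji and Hanja are dropped
print(vocab_size)
```

The resulting total of 42,000 entries matches the vocabulary size stated under Training Data.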
 
### Speeds, Sizes, Times
More information needed 

 
# Evaluation
 
 
## Testing Data, Factors & Metrics
 
### Testing Data
 
More information needed 
 
 
### Factors
More information needed
 
### Metrics
 
More information needed
 
 
## Results 
 
* See the [GitHub repo](https://github.com/kiyoungkim1/LM-kor) for the performance of this model and of other Korean language models.

|                       | **NSMC**<br/>(acc) | **Naver NER**<br/>(F1) | **PAWS**<br/>(acc) | **KorNLI**<br/>(acc) | **KorSTS**<br/>(spearman) | **Question Pair**<br/>(acc) |  **Korean-Hate-Speech (Dev)**<br/>(F1) |
| :-------------------- | :----------------: | :--------------------: | :----------------: | :------------------: | :-----------------------: | :-------------------------: | :-----------------------------------:  |
| kcbert-base           |       89.87        |         85.00          |       67.40        |        75.57         |           75.94           |            93.93            |                **68.78**               |
| **bert-kor-base** (ours) |       90.87        |         87.27          |       82.80        |        82.32         |           84.31           |            95.25            |                  68.45                 |


 
# Model Examination
 
More information needed
 
# Environmental Impact
 
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
 
- **Hardware Type:** More information needed
- **Hours used:** More information needed
- **Cloud Provider:** More information needed
- **Compute Region:** More information needed
- **Carbon Emitted:** More information needed
 
# Technical Specifications [optional]
 
## Model Architecture and Objective

More information needed 
 
## Compute Infrastructure
 
More information needed 
 
### Hardware
 
 
More information needed
 
### Software
 
More information needed.
 
# Citation

 
**BibTeX:**
 
 
```bibtex
@misc{kim2020lmkor,
  author = {Kiyoung Kim},
  title = {Pretrained Language Models For Korean},
  year = {2020},
  publisher = {GitHub},
  howpublished = {\url{https://github.com/kiyoungkim1/LMkor}}
}
```
 
 
 
 
# Glossary [optional]
More information needed 
 
# More Information [optional]
* Cloud TPUs were provided by the [TensorFlow Research Cloud (TFRC)](https://www.tensorflow.org/tfrc/) program.
 
* [모두의 말뭉치](https://corpus.korean.go.kr/) (Modu Corpus) was also used as pretraining data.

 
# Model Card Authors [optional]
 
Kiyoung Kim in collaboration with Ezi Ozoani and the Hugging Face team


# Model Card Contact
 
More information needed
 
# How to Get Started with the Model
 
Use the code below to get started with the model.
 
<details>
<summary> Click to expand </summary>

```python
# PyTorch only (via the transformers library)
from transformers import BertTokenizerFast, EncoderDecoderModel

# Load the tokenizer and the encoder-decoder model from the Hugging Face Hub
tokenizer = BertTokenizerFast.from_pretrained("kykim/bertshared-kor-base")
model = EncoderDecoderModel.from_pretrained("kykim/bertshared-kor-base")
```
</details>