Update README.md

README.md (changed)

@@ -20,6 +20,27 @@ You can use the raw model for text generation or fine-tune it to a downstream task.
Note that the texts should be segmented into words using Juman++ in advance.

### How to use

You can use this model directly with a pipeline for text generation. Since the generation relies on some randomness, we set a seed for reproducibility:

```python
from transformers import pipeline, set_seed

# text-generation pipeline for the pretrained Japanese GPT-2 model
generator = pipeline('text-generation', model='nlp-waseda/gpt2-xl-japanese')

# fix the seed so that the sampled continuations are reproducible
set_seed(42)

# the prompt is given as space-separated words, as produced by Juman++
generator("早稲田 大学 で 自然 言語 処理 を", max_length=30, do_sample=True, pad_token_id=2, num_return_sequences=5)
```
Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import ReformerTokenizer, GPT2Model

# load the SentencePiece-based tokenizer (via ReformerTokenizer) and the model
tokenizer = ReformerTokenizer.from_pretrained('nlp-waseda/gpt2-small-japanese')
model = GPT2Model.from_pretrained('nlp-waseda/gpt2-small-japanese')

# the input text must already be segmented into words with Juman++
# (a sketch of how to do this follows the example)
text = "早稲田 大学 で 自然 言語 処理 を"
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)  # output.last_hidden_state holds the features
```
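The prompts in both examples are already segmented. As a minimal sketch of how such input might be produced from raw text, assuming the pyknp package and a Juman++ (`jumanpp`) binary are available (neither is prescribed by this model card):

```python
# Hypothetical helper: segment raw Japanese text into the space-separated
# word format used in the examples above. Assumes `pip install pyknp` and
# a jumanpp binary on PATH; not part of this repository.
from pyknp import Juman

jumanpp = Juman()  # recent pyknp versions call the jumanpp command by default

def segment(text: str) -> str:
    result = jumanpp.analysis(text)
    return " ".join(mrph.midasi for mrph in result.mrph_list())

print(segment("早稲田大学で自然言語処理を"))  # e.g. "早稲田 大学 で 自然 言語 処理 を"
```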
### Preprocessing

The texts are normalized using zenhan, segmented into words using Juman++, and tokenized using SentencePiece. Juman++ 2.0.0-rc3 was used for pretraining.
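A rough sketch of the normalization step, assuming a half-width-to-full-width conversion with zenhan's default mode (the exact settings used for pretraining are not specified here):

```python
# Hypothetical sketch of the normalization step with zenhan; the exact
# conversion mode used for pretraining is an assumption. The normalized
# text would then be segmented with Juman++ (e.g. via pyknp as sketched
# above); SentencePiece tokenization is applied by the tokenizer itself.
import zenhan

raw = "早稲田大学でﾃﾞｰﾀ解析を"      # contains half-width katakana
normalized = zenhan.h2z(raw)        # -> "早稲田大学でデータ解析を"
print(normalized)
```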