Ericu950 committed on
Commit 4db6bb4
1 Parent(s): dbcc9bd

Update README.md

Files changed (1)
  1. README.md +243 -18
README.md CHANGED
@@ -1,43 +1,268 @@
  ---
- base_model: []
  library_name: transformers
  tags:
  - mergekit
  - merge

  ---
- # PapyLlamaMerged
-
- This is a merge of pre-trained language models created using [mergekit](https://github.com/cg123/mergekit).
-
- ## Merge Details
- ### Merge Method
-
- This model was merged using the [TIES](https://arxiv.org/abs/2306.01708) merge method using /mimer/NOBACKUP/groups/naiss2024-22-361/Eric_Pap/Llama-3.1-8B-Instruct as a base.
-
- ### Models Merged
-
- The following models were included in the merge:
- * /mimer/NOBACKUP/groups/naiss2024-22-201/PapInsc3/Papyllama2
-
- ### Configuration
-
- The following YAML configuration was used to produce this model:
  ```yaml
  models:
-   - model: /mimer/NOBACKUP/groups/naiss2024-22-361/Eric_Pap/Llama-3.1-8B-Instruct
-   - model: /mimer/NOBACKUP/groups/naiss2024-22-201/PapInsc3/Papyllama2
      parameters:
-       density: 1.1 # Fixed density, slightly more sparse than the original
-       weight: 0.6 # Fixed weight to keep the fine-tuned model's influence high
  merge_method: ties
- base_model: /mimer/NOBACKUP/groups/naiss2024-22-361/Eric_Pap/Llama-3.1-8B-Instruct
  parameters:
    normalize: true
  dtype: bfloat16

  ```
 
  ---
+ license: apache-2.0
+ language:
+ - grc
+ datasets:
+ - Ericu950/Papyri_1
+ base_model:
+ - meta-llama/Meta-Llama-3.1-8B-Instruct
  library_name: transformers
  tags:
+ - papyrology
+ - textual criticism
+ - philology
+ - Ancient Greek
  - mergekit
  - merge

  ---
+ # Papy_2_Llama-3.1-8B-Instruct_text
+
+ This is a fine-tuned version of Llama-3.1-8B-Instruct, specialized in reconstructing spans of 1–20 missing characters in ancient Greek documentary papyri. On spans of 1–10 missing characters, it achieved a character error rate (CER) of 14.9%, a top-1 accuracy of 73.5%, and a top-20 accuracy of 85.9% on a test set of 7,811 papyrus editions. It replaces Papy_2_Llama-3.1-8B-Instruct_text.
+ See https://arxiv.org/abs/2409.13870 for details.
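+
+ The CER and top-k numbers above can be reproduced for your own examples with plain string metrics. The sketch below is illustrative only (it is not the paper's evaluation code) and assumes CER means Levenshtein distance normalized by the length of the ground-truth span:
+
+ ```python
+ def levenshtein(a: str, b: str) -> int:
+     # Classic dynamic-programming edit distance between two strings.
+     prev = list(range(len(b) + 1))
+     for i, ca in enumerate(a, 1):
+         cur = [i]
+         for j, cb in enumerate(b, 1):
+             cur.append(min(prev[j] + 1,                # deletion
+                            cur[j - 1] + 1,             # insertion
+                            prev[j - 1] + (ca != cb)))  # substitution
+         prev = cur
+     return prev[-1]
+
+ def cer(prediction: str, gold: str) -> float:
+     # Character error rate: edit distance normalized by gold length.
+     return levenshtein(prediction, gold) / max(len(gold), 1)
+
+ def top_k_hit(suggestions: list[str], gold: str, k: int = 20) -> bool:
+     # True if the gold restoration appears among the first k suggestions.
+     return gold in suggestions[:k]
+ ```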
+
+ ## Usage
+
+ To run the model on a GPU with large memory capacity, follow these steps:
+
+ ### 1. Download and load the model
+
+ ```python
+ from transformers import pipeline, AutoTokenizer, LlamaForCausalLM
+ from accelerate import init_empty_weights, load_checkpoint_and_dispatch
+ from huggingface_hub import snapshot_download
+ import torch
+ import warnings
+
+ # This warning is expected here: the model skeleton is created on the meta device
+ # and only filled in by load_checkpoint_and_dispatch below.
+ warnings.filterwarnings("ignore", message=".*copying from a non-meta parameter in the checkpoint*")
+
+ model_id = "Ericu950/Papy_2_Llama-3.1-8B-Instruct_text"
+
+ # Instantiate the architecture without allocating memory for the weights.
+ with init_empty_weights():
+     model = LlamaForCausalLM.from_pretrained(model_id)
+
+ # load_checkpoint_and_dispatch expects a local path, so fetch the snapshot first,
+ # then stream the weights in, sharding across devices and offloading what does not fit.
+ checkpoint_path = snapshot_download(model_id)
+ model = load_checkpoint_and_dispatch(
+     model,
+     checkpoint_path,
+     device_map="auto",
+     offload_folder="offload",
+     offload_state_dict=True,
+ )
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+ generation_pipeline = pipeline(
+     "text-generation",
+     model=model,
+     tokenizer=tokenizer,
+     device_map="auto",
+ )
+ ```
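+
+ If a single GPU has enough memory for the full bf16 weights, the meta-device loading above is unnecessary; the plain `from_pretrained` call below is standard transformers usage and loads the same model:
+
+ ```python
+ # Simpler alternative when one sufficiently large GPU is available.
+ model = LlamaForCausalLM.from_pretrained(
+     model_id,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+ )
+ ```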
+
+ ### 2. Run inference on a papyrus fragment of your choice
+ ```python
+ papyrus_edition = """
+ ετουσ τεταρτου αυτοκρατοροσ καισαροσ ουεσπασιανου σεβαστου ------------------
+ ομολογει παυσιριων απολλωνιου του παυσιριωνοσ μητροσ ---------------τωι γεγονοτι αυτωι
+ εκ τησ γενομενησ και μετηλλαχυιασ αυτου γυναικοσ -------------------------
+ απο τησ αυτησ πολεωσ εν αγυιαι συγχωρειν ειναι ----------------------------------
+ --------------------σ αυτωι εξ ησ συνεστιν ------------------------------------
+ ----τησ αυτησ γενεασ την υπαρχουσαν αυτωι οικιαν ------------
+ ------------------ ---------καὶ αιθριον και αυλη απερ ο υιοσ διοκοροσ --------------------------
+ --------εγραψεν του δ αυτου διοσκορου ειναι ------------------------------------
+ ---------- και προ κατενγεγυηται τα δικαια --------------------------------------
+ νησ κατα τουσ τησ χωρασ νομουσ· εαν δε μη ---------------------------------------
+ υπ αυτου τηι του διοσκορου σημαινομενηι -----------------------------------ενοικισμωι του
+ ημισουσ μερουσ τησ προκειμενησ οικιασ --------------------------------- διοσκοροσ την τουτων αποχην
+ ---------------------------------------------μηδ υπεναντιον τουτοισ επιτελειν μηδε
+ ------------------------------------------------ ανασκευηι κατ αυτησ τιθεσθαι ομολογιαν μηδε
+ ----------------------------------- επιτελεσαι η χωρισ του κυρια ειναι τα διομολογημενα
+ παραβαινειν, εκτεινειν δε τον παραβησομενον τωι υιωι διοσκορωι η τοισ παρ αυτου καθ εκαστην
+ εφοδον το τε βλαβοσ και επιτιμον αργυριου δραχμασ 0 και εισ το δημο[7 missing letters] ισασ και μηθεν
+ ησσον· δ -----ιων ομολογιαν συνεχωρησεν·
+ """
+
+ system_prompt = "Fill in the missing letters in this papyrus fragment!"
+
+ input_messages = [
+     {"role": "system", "content": system_prompt},
+     {"role": "user", "content": papyrus_edition},
+ ]
+
+ terminators = [
+     tokenizer.eos_token_id,
+     tokenizer.convert_tokens_to_ids("<|eot_id|>"),
+ ]
+
+ outputs = generation_pipeline(
+     input_messages,
+     max_new_tokens=10,
+     eos_token_id=terminators,
+     num_beams=30,  # Set this as high as your memory will allow!
+     num_return_sequences=10,
+     early_stopping=True,
+ )
+
+ # Collect the assistant's reply from each returned beam.
+ beam_contents = []
+ for output in outputs:
+     generated_text = output.get("generated_text", [])
+     for item in generated_text:
+         if item.get("role") == "assistant":
+             beam_contents.append(item.get("content"))
+
+ real_response = "σιον τασ"
+ print(f"The masked sequence: {real_response}")
+ for i, content in enumerate(beam_contents, start=1):
+     print(f"Suggestion {i}: {content}")
+ ```
+ ### Expected Output:
+ ```
+ The masked sequence: σιον τασ
+ Suggestion 1: σιον τασ
+ Suggestion 2: σιν τασ ι
+ Suggestion 3: σ τασ ισα
+ Suggestion 4: σιου τασ
+ Suggestion 5: συ τασ ισ
+ Suggestion 6: ιον τασ ι
+ Suggestion 7: ν τασ ισα
+ Suggestion 8: σ ισασ κα
+ Suggestion 9: σασ τασ ι
+ Suggestion 10: σιωι τασ
+ ```
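+
+ The fragment above marks the target lacuna as `[7 missing letters]`, while the hyphen runs appear to stand for lost text that is left unreconstructed. To query the model on your own text, a helper like the one below can produce a prompt in that shape. The exact mask wording is an assumption inferred from this single example, not a documented format, and the count appears to refer to letters excluding spaces (the 8-character span `σιον τασ` is masked as 7 letters):
+
+ ```python
+ def mask_span(text: str, start: int, length: int) -> tuple[str, str]:
+     # Replace text[start:start+length] with the "[N missing letters]" notation,
+     # counting only non-space characters, as in the example prompt above.
+     gold = text[start:start + length]
+     n_letters = sum(1 for c in gold if not c.isspace())
+     masked = text[:start] + f"[{n_letters} missing letters]" + text[start + length:]
+     return masked, gold
+
+ masked, gold = mask_span("και εισ το δημοσιον τασ ισασ και μηθεν", 15, 8)
+ print(masked)  # και εισ το δημο[7 missing letters] ισασ και μηθεν
+ print(gold)    # σιον τασ
+ ```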
+ ## Usage on the free tier in Google Colab
+
+ If you don’t have access to a larger GPU but want to try the model out, you can run it in a quantized format in Google Colab. **The quality of the responses will deteriorate significantly!** Follow these steps:
+
+ ### Step 1: Connect to a free GPU
+ 1. Click the Connect dropdown near the top right of the notebook.
+ 2. Select **Change runtime type**.
+ 3. In the modal window, select **T4 GPU** as your hardware accelerator.
+ 4. Click **Save**.
+ 5. Click the **Connect** button to connect to your runtime. After some time, the button will show a green checkmark along with RAM and disk usage graphs, indicating that a server with the required hardware has been created.
+
+ ### Step 2: Install dependencies
+
+ ```python
+ !pip install -U bitsandbytes
+
+ # Restart the Colab runtime so the freshly installed package is picked up.
+ import os
+ os._exit(00)
+ ```
+
+ ### Step 3: Download and quantize the model
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
+ import torch
+
+ # 4-bit quantization so the 8B model fits in the free T4's memory.
+ quant_config = BitsAndBytesConfig(
+     load_in_4bit=True,
+     bnb_4bit_compute_dtype=torch.bfloat16,
+ )
+
+ model = AutoModelForCausalLM.from_pretrained(
+     "Ericu950/Papy_2_Llama-3.1-8B-Instruct_text",
+     device_map="auto",
+     quantization_config=quant_config,
+ )
+ tokenizer = AutoTokenizer.from_pretrained("Ericu950/Papy_2_Llama-3.1-8B-Instruct_text")
+
+ generation_pipeline = pipeline(
+     "text-generation",
+     model=model,
+     tokenizer=tokenizer,
+     device_map="auto",
+ )
+ ```
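+
+ As a sanity check that the quantized model fits in the T4's memory, you can print its footprint; `get_memory_footprint` is a standard method on transformers models:
+
+ ```python
+ # Size of the loaded (quantized) weights in GB.
+ print(f"Model memory footprint: {model.get_memory_footprint() / 1e9:.1f} GB")
+ ```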
+ ### Step 4: Run inference on a papyrus fragment of your choice
+ ```python
+ papyrus_edition = """
+ ετουσ τεταρτου αυτοκρατοροσ καισαροσ ουεσπασιανου σεβαστου ------------------
+ ομολογει παυσιριων απολλωνιου του παυσιριωνοσ μητροσ ---------------τωι γεγονοτι αυτωι
+ εκ τησ γενομενησ και μετηλλαχυιασ αυτου γυναικοσ -------------------------
+ απο τησ αυτησ πολεωσ εν αγυιαι συγχωρειν ειναι ----------------------------------
+ --------------------σ αυτωι εξ ησ συνεστιν ------------------------------------
+ ----τησ αυτησ γενεασ την υπαρχουσαν αυτωι οικιαν ------------
+ ------------------ ---------καὶ αιθριον και αυλη απερ ο υιοσ διοκοροσ --------------------------
+ --------εγραψεν του δ αυτου διοσκορου ειναι ------------------------------------
+ ---------- και προ κατενγεγυηται τα δικαια --------------------------------------
+ νησ κατα τουσ τησ χωρασ νομουσ· εαν δε μη ---------------------------------------
+ υπ αυτου τηι του διοσκορου σημαινομενηι -----------------------------------ενοικισμωι του
+ ημισουσ μερουσ τησ προκειμενησ οικιασ --------------------------------- διοσκοροσ την τουτων αποχην
+ ---------------------------------------------μηδ υπεναντιον τουτοισ επιτελειν μηδε
+ ------------------------------------------------ ανασκευηι κατ αυτησ τιθεσθαι ομολογιαν μηδε
+ ----------------------------------- επιτελεσαι η χωρισ του κυρια ειναι τα διομολογημενα
+ παραβαινειν, εκτεινειν δε τον παραβησομενον τωι υιωι διοσκορωι η τοισ παρ αυτου καθ εκαστην
+ εφοδον το τε βλαβοσ και επιτιμον αργυριου δραχμασ 0 και εισ το δημο[7 missing letters] ισασ και μηθεν
+ ησσον· δ -----ιων ομολογιαν συνεχωρησεν·
+ """
+
+ system_prompt = "Fill in the missing letters in this papyrus fragment!"
+
+ input_messages = [
+     {"role": "system", "content": system_prompt},
+     {"role": "user", "content": papyrus_edition},
+ ]
+
+ terminators = [
+     tokenizer.eos_token_id,
+     tokenizer.convert_tokens_to_ids("<|eot_id|>"),
+ ]
+
+ outputs = generation_pipeline(
+     input_messages,
+     max_new_tokens=10,
+     eos_token_id=terminators,
+     num_beams=30,  # Set this as high as your memory will allow!
+     num_return_sequences=10,
+     early_stopping=True,
+ )
+
+ # Collect the assistant's reply from each returned beam.
+ beam_contents = []
+ for output in outputs:
+     generated_text = output.get("generated_text", [])
+     for item in generated_text:
+         if item.get("role") == "assistant":
+             beam_contents.append(item.get("content"))
+
+ real_response = "σιον τασ"
+ print(f"The masked characters: {real_response}")
+ for i, content in enumerate(beam_contents, start=1):
+     print(f"Suggestion {i}: {content}")
+ ```
+ ### Expected Output:
+ ```
+ The masked characters: σιον τασ
+ Suggestion 1: σιον τα 00·
+ Suggestion 2: σιον αυτωι·
+ Suggestion 3: σιον 00 00
+ Suggestion 4: σιον και 0·
+ Suggestion 5: σιον τα 00··
+ Suggestion 6: σιον τασ 0
+ Suggestion 7: σιον τα 000·
+ Suggestion 8: σιον τα 0ο
+ Suggestion 9: σιον τασασ·
+ Suggestion 10: σιον τα 00
+ ```
+ Observe that performance declines! If we change
+ ```python
+ load_in_4bit=True,
+ bnb_4bit_compute_dtype=torch.bfloat16
+ ```
+ in the quantization config of Step 3 to
+ ```python
+ load_in_8bit=True,
+ ```
+ we get
+ ```
+ The masked characters: σιον τασ
+ Suggestion 1: σιον τασ
+ Suggestion 2: σιν τασ ι
+ Suggestion 3: σ τασ ισα
+ Suggestion 4: σιου τασ
+ Suggestion 5: σ ισασ κα
+ Suggestion 6: συ τασ ισ
+ Suggestion 7: σασ τασ ι
+ Suggestion 8: ν τασ ισα
+ Suggestion 9: ιον τασ ι
+ Suggestion 10: σισ τασ ι
+ ```
+ ## Merge configuration
+
+ The fine-tuned model was re-merged with Llama-3.1-8B-Instruct using the [TIES](https://arxiv.org/abs/2306.01708) merge method. This did not affect CER or top-1 accuracy, but it improved top-20 accuracy. The following YAML configuration was used:
 
  ```yaml
  models:
+   - model: original # Llama 3.1
+   - model: DDbDP_reconstructer_5 # A model fine-tuned on 95% of the DDbDP for 11 epochs
      parameters:
+       density: 1.1
+       weight: 0.5
  merge_method: ties
+ base_model: original # Llama 3.1
  parameters:
    normalize: true
  dtype: bfloat16

  ```
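+
+ To reproduce a merge like this, save the configuration to a YAML file and run it through mergekit's command-line entry point. The cell below is a generic notebook-style sketch, not the exact command used for this model; `original` and `DDbDP_reconstructer_5` in the config above are placeholder names that must point at real local checkpoints:
+
+ ```python
+ # In a notebook cell; mergekit-yaml is the CLI installed with mergekit.
+ !pip install mergekit
+ !mergekit-yaml merge_config.yaml ./merged-model --cuda
+ ```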