aashish1904 committed on
Commit
18a60c4
·
verified ·
1 Parent(s): 6889267

Upload README.md with huggingface_hub

  1. README.md +181 -0
README.md ADDED
@@ -0,0 +1,181 @@
1
+
2
+ ---
3
+
4
+ library_name: transformers
5
+ base_model: Qwen/Qwen2.5-1.5B-Instruct
6
+ license: apache-2.0
7
+ datasets:
8
+ - shibing624/chinese_text_correction
9
+ language:
10
+ - zh
11
+ metrics:
12
+ - f1
13
+ tags:
14
+ - text-generation-inference
15
+ widget:
16
+ - text: "文本纠错:\n少先队员因该为老人让坐。"
17
+
18
+ ---
19
+
20
+ [![QuantFactory Banner](https://lh7-rt.googleusercontent.com/docsz/AD_4nXeiuCm7c8lEwEJuRey9kiVZsRn2W-b4pWlu3-X534V3YmVuVc2ZL-NXg2RkzSOOS2JXGHutDuyyNAUtdJI65jGTo8jT9Y99tMi4H4MqL44Uc5QKG77B0d6-JfIkZHFaUA71-RtjyYZWVIhqsNZcx8-OMaA?key=xt3VSDoCbmTY7o-cwwOFwQ)](https://hf.co/QuantFactory)
21
+
22
+
23
+ # QuantFactory/chinese-text-correction-1.5b-GGUF
24
+ This is a quantized version of [shibing624/chinese-text-correction-1.5b](https://huggingface.co/shibing624/chinese-text-correction-1.5b), created using llama.cpp.
25
+
26
+ # Original Model Card
27
+
28
+
29
+
30
+
31
+ # Chinese Text Correction Model
32
+ Chinese text correction model chinese-text-correction-1.5b: for spelling correction and grammar correction.
33
+
34
+ Evaluation of `shibing624/chinese-text-correction-1.5b` on test data:
35
+
36
+ A sample input and prediction (overall CSC **test** performance is reported in the evaluation table further down):
37
+
38
+ |input_text|predict_text|
39
+ |:--- |:--- |
40
+ |文本纠错:\n少先队员因该为老人让坐。|少先队员应该为老人让座。|
41
+
42
+ # Models
43
+
44
+ | Name | Base Model | Download |
45
+ |-----------------|-------------------|-----------------------------------------------------------------------|
46
+ | chinese-text-correction-1.5b | Qwen/Qwen2.5-1.5B-Instruct | [🤗 Hugging Face](https://huggingface.co/shibing624/chinese-text-correction-1.5b) |
47
+ | chinese-text-correction-1.5b-lora | Qwen/Qwen2.5-1.5B-Instruct | [🤗 Hugging Face](https://huggingface.co/shibing624/chinese-text-correction-1.5b-lora) |
48
+ | chinese-text-correction-7b | Qwen/Qwen2.5-7B-Instruct | [🤗 Hugging Face](https://huggingface.co/shibing624/chinese-text-correction-7b) |
49
+ | chinese-text-correction-7b-lora | Qwen/Qwen2.5-7B-Instruct | [🤗 Hugging Face](https://huggingface.co/shibing624/chinese-text-correction-7b-lora) |
50
+
51
+
52
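+ ### Evaluation results
53
+ - Metric: F1
54
+ - CSC (Chinese Spelling Correction): a spelling correction model; it can correct length-aligned errors such as phonetically similar characters, visually similar characters, and grammatical errors
55
+ - CTC (Chinese Text Correction): a text correction model; it corrects length-aligned spelling and grammar errors, and also handles length-misaligned errors such as extra or missing characters
56
+ - GPU: Tesla V100 with 32 GB of memory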
57
+
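To make the CSC/CTC distinction above concrete, here is a minimal sketch (not part of pycorrector) that lists character substitutions when the source and the corrected text have the same length, and returns `None` otherwise, which signals a length-misaligned correction of the kind only the CTC models are described as handling. The function name `aligned_edits` and the second example pair are hypothetical.

```python
from typing import List, Optional, Tuple

Edit = Tuple[int, str, str]  # (position, wrong character, corrected character)


def aligned_edits(source: str, corrected: str) -> Optional[List[Edit]]:
    """Return per-character substitutions, or None if the lengths differ."""
    if len(source) != len(corrected):
        return None  # extra or missing characters: not a length-aligned fix
    return [
        (i, s, c)
        for i, (s, c) in enumerate(zip(source, corrected))
        if s != c
    ]


# Length-aligned example taken from the sample prediction in this README.
print(aligned_edits("少先队员因该为老人让坐。", "少先队员应该为老人让座。"))
# Hypothetical length-misaligned pair (one character inserted).
print(aligned_edits("今天天气很好", "今天的天气很好"))
```

The first call reports two substitutions (因 to 应, 坐 to 座); the second returns `None` because one character was inserted.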
58
+ | Model Name | Model Link | Base Model | Avg | SIGHAN-2015 | EC-LAW | MCSC | GPU/CPU | QPS |
59
+ |:-----------------|:------------------------------------------------------------------------------------------------------------------------|:---------------------------|:-----------|:------------|:-------|:-------|:--------|:--------|
60
+ | Kenlm-CSC | [shibing624/chinese-kenlm-klm](https://huggingface.co/shibing624/chinese-kenlm-klm) | kenlm | 0.3409 | 0.3147 | 0.3763 | 0.3317 | CPU | 9 |
61
+ | Mengzi-T5-CSC | [shibing624/mengzi-t5-base-chinese-correction](https://huggingface.co/shibing624/mengzi-t5-base-chinese-correction) | mengzi-t5-base | 0.3984 | 0.7758 | 0.3156 | 0.1039 | GPU | 214 |
62
+ | ERNIE-CSC | [PaddleNLP/ernie-csc](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/legacy/examples/text_correction/ernie-csc) | PaddlePaddle/ernie-1.0-base-zh | 0.4353 | 0.8383 | 0.3357 | 0.1318 | GPU | 114 |
63
+ | MacBERT-CSC | [shibing624/macbert4csc-base-chinese](https://huggingface.co/shibing624/macbert4csc-base-chinese) | hfl/chinese-macbert-base | 0.3993 | 0.8314 | 0.1610 | 0.2055 | GPU | **224** |
64
+ | ChatGLM3-6B-CSC | [shibing624/chatglm3-6b-csc-chinese-lora](https://huggingface.co/shibing624/chatglm3-6b-csc-chinese-lora) | THUDM/chatglm3-6b | 0.4538 | 0.6572 | 0.4369 | 0.2672 | GPU | 3 |
65
+ | Qwen2.5-1.5B-CTC | [shibing624/chinese-text-correction-1.5b](https://huggingface.co/shibing624/chinese-text-correction-1.5b) | Qwen/Qwen2.5-1.5B-Instruct | 0.6802 | 0.3032 | 0.7846 | 0.9529 | GPU | 6 |
66
+ | Qwen2.5-7B-CTC | [shibing624/chinese-text-correction-7b](https://huggingface.co/shibing624/chinese-text-correction-7b) | Qwen/Qwen2.5-7B-Instruct | **0.8225** | 0.4917 | 0.9798 | 0.9959 | GPU | 3 |
67
+
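The Avg column appears to be the unweighted mean of the three per-dataset F1 scores. The source does not state this, but recomputing it from the table reproduces every row. A small sketch, with `average_f1` as an illustrative helper name:

```python
# (SIGHAN-2015, EC-LAW, MCSC) F1 scores copied from the table above.
scores = {
    "Kenlm-CSC": (0.3147, 0.3763, 0.3317),
    "Mengzi-T5-CSC": (0.7758, 0.3156, 0.1039),
    "ERNIE-CSC": (0.8383, 0.3357, 0.1318),
    "MacBERT-CSC": (0.8314, 0.1610, 0.2055),
    "ChatGLM3-6B-CSC": (0.6572, 0.4369, 0.2672),
    "Qwen2.5-1.5B-CTC": (0.3032, 0.7846, 0.9529),
    "Qwen2.5-7B-CTC": (0.4917, 0.9798, 0.9959),
}


def average_f1(per_dataset):
    """Unweighted mean of the per-dataset F1 scores, rounded to 4 places."""
    return round(sum(per_dataset) / len(per_dataset), 4)


for name, per_dataset in scores.items():
    print(f"{name}: {average_f1(per_dataset):.4f}")
```

Because the mean is unweighted, a model can rank first on Avg while trailing on a single benchmark: the Qwen2.5 CTC models lead on EC-LAW and MCSC but score below the BERT-style CSC models on SIGHAN-2015.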
68
+ ## Usage (pycorrector)
69
+
70
+ This project is open-sourced as part of [pycorrector](https://github.com/shibing624/pycorrector), which supports using fine-tuned large language models for text correction. Call it as follows:
71
+
72
+ Install package:
73
+ ```shell
74
+ pip install -U pycorrector
75
+ ```
76
+
77
+ ```python
78
+ from pycorrector.gpt.gpt_corrector import GptCorrector
79
+
80
+ if __name__ == '__main__':
81
+     error_sentences = [  # inputs deliberately contain errors
82
+         '真麻烦你了。希望你们好好的跳无',
83
+         '少先队员因该为老人让坐',
84
+         '机七学习是人工智能领遇最能体现智能的一个分知',
85
+         '一只小鱼船浮在平净的河面上',
86
+         '我的家乡是有明的渔米之乡',
87
+     ]
88
+     m = GptCorrector("shibing624/chinese-text-correction-1.5b")
89
+
90
+     batch_res = m.correct_batch(error_sentences)
91
+     for i in batch_res:
92
+         print(i)
93
+         print()
94
+ ```
95
+
96
+ ## Usage (HuggingFace Transformers)
97
+ Without [pycorrector](https://github.com/shibing624/pycorrector), you can use the model like this:
98
+
99
+ First, you pass your input through the transformer model, then you get the generated sentence.
100
+
101
+ Install package:
102
+ ```shell
103
+ pip install transformers
104
+ ```
105
+
106
+ ```python
107
+ # pip install transformers
108
+ from transformers import AutoModelForCausalLM, AutoTokenizer
109
+ checkpoint = "shibing624/chinese-text-correction-1.5b"
110
+
111
+ device = "cuda" # for GPU usage or "cpu" for CPU usage
112
+ tokenizer = AutoTokenizer.from_pretrained(checkpoint)
113
+ model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
114
+
115
+ input_content = "文本纠错:\n少先队员因该为老人让坐。"
116
+
117
+ messages = [{"role": "user", "content": input_content}]
118
+ input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
119
+
120
+ print(input_text)
121
+
122
+ inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
123
+ outputs = model.generate(inputs, max_new_tokens=1024, do_sample=False, repetition_penalty=1.08)  # greedy decoding; temperature is not used when do_sample=False
124
+
125
+ print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))  # decode only the newly generated tokens
126
+ ```
127
+
128
+ output:
129
+ ```shell
130
+ 少先队员应该为老人让座。
131
+ ```
132
+
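The prompt in the example above is the instruction prefix `文本纠错:\n` followed by the sentence to correct. For several sentences, a small helper can build one chat message list per sentence, each ready to pass to `tokenizer.apply_chat_template`. This is a sketch only: `build_messages` and `PROMPT_PREFIX` are hypothetical names, not part of transformers or pycorrector.

```python
from typing import Dict, List

PROMPT_PREFIX = "文本纠错:\n"  # instruction prefix used in the example above


def build_messages(sentences: List[str]) -> List[List[Dict[str, str]]]:
    """Wrap each non-empty sentence in a single-turn user message."""
    return [
        [{"role": "user", "content": PROMPT_PREFIX + s.strip()}]
        for s in sentences
        if s.strip()  # skip blank inputs
    ]


batch = build_messages(["少先队员因该为老人让坐。", "  ", "一只小鱼船浮在平净的河面上"])
print(len(batch))
print(batch[0][0]["content"])
```

Keeping the prefix in one constant avoids subtle mismatches (for example a half-width colon) between the prompt sent at inference time and the one shown in this README.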
133
+
134
+ Model files:
135
+ ```
136
+ shibing624/chinese-text-correction-1.5b
137
+ |-- added_tokens.json
138
+ |-- config.json
139
+ |-- generation_config.json
140
+ |-- merges.txt
141
+ |-- model.safetensors
142
+ |-- model.safetensors.index.json
143
+ |-- README.md
144
+ |-- special_tokens_map.json
145
+ |-- tokenizer_config.json
146
+ |-- tokenizer.json
147
+ `-- vocab.json
148
+ ```
149
+
150
+ #### Training parameters
151
+
152
+ - num_epochs: 8
153
+ - batch_size: 4
154
+ - steps: 36000
155
+ - eval_loss: 0.14
156
+ - base model: Qwen/Qwen2.5-1.5B-Instruct
157
+ - train data: [shibing624/chinese_text_correction](https://huggingface.co/datasets/shibing624/chinese_text_correction)
158
+ - train time: 9 days 8 hours
159
+ - eval_loss curve: ![](https://huggingface.co/shibing624/chinese-text-correction-1.5b-lora/resolve/main/eval_loss_1.5b.png)
160
+ - train_loss curve: ![](https://huggingface.co/shibing624/chinese-text-correction-1.5b-lora/resolve/main/train_loss_1.5b.png)
161
+
162
+ ### Training dataset
163
+ #### Chinese text correction dataset
164
+
165
+ - Data: [shibing624/chinese_text_correction](https://huggingface.co/datasets/shibing624/chinese_text_correction)
166
+
167
+
168
+ To train a Qwen-based correction model, see [https://github.com/shibing624/pycorrector](https://github.com/shibing624/pycorrector) or [https://github.com/shibing624/MedicalGPT](https://github.com/shibing624/MedicalGPT).
169
+
170
+ ## Citation
171
+
172
+ ```bibtex
173
+ @software{pycorrector,
174
+ author = {Xu Ming},
175
+ title = {pycorrector: Implementation of language model finetune},
176
+ year = {2024},
177
+ url = {https://github.com/shibing624/pycorrector},
178
+ }
179
+ ```
180
+
181
+