Dang Phuong Nam committed on
Commit 7d6748a
1 Parent(s): 5b619b1

Update README.md

Files changed (1)
  1. README.md +92 -7
README.md CHANGED
---
license: apache-2.0
language:
  - vi
library_name: transformers
pipeline_tag: text-classification
tags:
  - transformers
  - cross-encoder
  - rerank
datasets:
  - unicamp-dl/mmarco
widget:
  - text: tỉnh nào có diện tích lớn nhất việt nam.
    output:
      - label: >-
          nghệ an có diện tích lớn nhất việt nam
      - label: >-
          bắc ninh có diện tích nhỏ nhất việt nam
        score: 0.05
---

# Reranker

* [Usage](#usage)
  * [Using FlagEmbedding](#using-flagembedding)
  * [Using Huggingface transformers](#using-huggingface-transformers)
* [Fine-tune](#fine-tune)
  * [Data Format](#data-format)

Unlike an embedding model, a reranker takes a question and a document as input and directly outputs a similarity score instead of an embedding.
You can get a relevance score by feeding a query and a passage to the reranker, and the score can be mapped to a float value in [0, 1] with a sigmoid function.

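As a rough illustration (plain Python, not part of the model card's own code), the sigmoid mapping mentioned above is simply:

```python
import math

def sigmoid(x: float) -> float:
    # Map a raw reranker logit to a relevance score in (0, 1)
    return 1 / (1 + math.exp(-x))

# Raw scores taken from the FlagEmbedding examples below
print(sigmoid(5.26171875))  # ~0.9948 -> relevant
print(sigmoid(-8.1875))     # ~0.0003 -> not relevant
```
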
## Usage

### Using FlagEmbedding

```
pip install -U FlagEmbedding
```

Get relevance scores (higher scores indicate more relevance):

```python
from FlagEmbedding import FlagReranker

reranker = FlagReranker('namdp/bge-reranker-vietnamese',
                        use_fp16=True)  # Setting use_fp16 to True speeds up computation with a slight performance degradation

score = reranker.compute_score(['query', 'passage'])
print(score)  # -5.65234375

# You can map the score into 0-1 by setting "normalize=True", which applies a sigmoid function to the score
score = reranker.compute_score(['query', 'passage'], normalize=True)
print(score)  # 0.003497010252573502

scores = reranker.compute_score([['what is panda?', 'hi'],
                                 ['what is panda?',
                                  'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']])
print(scores)  # [-8.1875, 5.26171875]

# You can map the scores into 0-1 by setting "normalize=True", which applies a sigmoid function to the scores
scores = reranker.compute_score([['what is panda?', 'hi'],
                                 ['what is panda?',
                                  'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']],
                                normalize=True)
print(scores)  # [0.00027803096387751553, 0.9948403768236574]
```

### Using Huggingface transformers

```
pip install -U transformers
```

Get relevance scores (higher scores indicate more relevance):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('namdp/bge-reranker-vietnamese')
model = AutoModelForSequenceClassification.from_pretrained('namdp/bge-reranker-vietnamese')
model.eval()

pairs = [['what is panda?', 'hi'],
         ['what is panda?',
          'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]
with torch.no_grad():
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
    scores = model(**inputs, return_dict=True).logits.view(-1, ).float()
    print(scores)
```

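The logits above are raw, unbounded scores. As an optional follow-up continuing the snippet above (an assumption, not part of the original card), they can be mapped into [0, 1] with a sigmoid, mirroring the `normalize=True` behaviour of FlagEmbedding:

```python
# Continuing from the snippet above: squash raw logits into [0, 1]
probs = torch.sigmoid(scores)
print(probs)
```
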
## Fine-tune

### Data Format

Training data should be a JSON Lines file, where each line is a dict like this:

```
{"query": str, "pos": List[str], "neg": List[str]}
```

`query` is the query, `pos` is a list of positive texts, and `neg` is a list of negative texts. If you have no negative texts for a query, you can randomly sample some from the entire corpus as negatives.
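
For illustration, a single training line might look like this (a hypothetical example built from the widget query above, not an excerpt of a real training file):

```
{"query": "tỉnh nào có diện tích lớn nhất việt nam", "pos": ["nghệ an có diện tích lớn nhất việt nam"], "neg": ["bắc ninh có diện tích nhỏ nhất việt nam"]}
```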