lytang committed · verified
Commit 6a30129 · Parent(s): 7693225

Update README.md

Files changed (1):
  1. README.md +17 -11

README.md CHANGED
@@ -16,27 +16,29 @@ The model is doing predictions on the *sentence-level*. It takes as input a docu
 whether the sentence is supported by the document: **MiniCheck-Model(document, claim) -> {0, 1}**


-MiniCheck-Flan-T5-Large is fine-tuned from `google/flan-t5-large` ([Chung et al., 2022](https://arxiv.org/pdf/2210.11416.pdf))
+**MiniCheck-Flan-T5-Large is the best fact-checking model with size < 1B** and reaches GPT-4 performance. It is fine-tuned from `google/flan-t5-large` ([Chung et al., 2022](https://arxiv.org/pdf/2210.11416.pdf))
 on a combination of 35K examples:
 - 21K ANLI examples ([Nie et al., 2020](https://aclanthology.org/2020.acl-main.441.pdf))
 - 14K synthetic examples generated from scratch in a structured way (more details in the paper).


 ### Model Variants
-We also have two other MiniCheck model variants:
-- [lytang/MiniCheck-RoBERTa-Large](https://huggingface.co/lytang/MiniCheck-RoBERTa-Large)
-- [lytang/MiniCheck-DeBERTa-v3-Large](https://huggingface.co/lytang/MiniCheck-DeBERTa-v3-Large)
+We also have three other MiniCheck model variants:
+- [bespokelabs/Bespoke-MiniCheck-7B](https://huggingface.co/bespokelabs/Bespoke-MiniCheck-7B) (Model Size: 7B)
+- [lytang/MiniCheck-RoBERTa-Large](https://huggingface.co/lytang/MiniCheck-RoBERTa-Large) (Model Size: 0.4B)
+- [lytang/MiniCheck-DeBERTa-v3-Large](https://huggingface.co/lytang/MiniCheck-DeBERTa-v3-Large) (Model Size: 0.4B)


 ### Model Performance

 <p align="center">
-<img src="./cost-vs-bacc.png" width="360">
+<img src="./performance.png" width="550">
 </p>


+
 The performance of these models is evaluated on our newly collected benchmark (unseen by our models during training), [LLM-AggreFact](https://huggingface.co/datasets/lytang/LLM-AggreFact),
-built from 10 recent human-annotated datasets on fact-checking and grounding LLM generations. Our most capable model, MiniCheck-Flan-T5-Large, outperforms all
+built from 11 recent human-annotated datasets on fact-checking and grounding LLM generations. MiniCheck-Flan-T5-Large outperforms all
 existing specialized fact-checkers of a similar scale by a large margin (4-10% absolute increase) and is on par with GPT-4, but 400x cheaper. See full results in our work.

 Note: We only evaluated the performance of our models on real claims -- without any human intervention in
@@ -53,13 +55,15 @@ Please first clone our [GitHub Repo](https://github.com/Liyan06/MiniCheck) and i

 ```python
 from minicheck.minicheck import MiniCheck
+import os
+os.environ["CUDA_VISIBLE_DEVICES"] = "0"

 doc = "A group of students gather in the school library to study for their upcoming final exams."
 claim_1 = "The students are preparing for an examination."
 claim_2 = "The students are on vacation."

-# model_name can be one of ['roberta-large', 'deberta-v3-large', 'flan-t5-large']
-scorer = MiniCheck(model_name='flan-t5-large', device='cuda:0', cache_dir='./ckpts')
+# model_name can be one of ['roberta-large', 'deberta-v3-large', 'flan-t5-large', 'Bespoke-MiniCheck-7B']
+scorer = MiniCheck(model_name='flan-t5-large', cache_dir='./ckpts')
 pred_label, raw_prob, _, _ = scorer.score(docs=[doc, doc], claims=[claim_1, claim_2])

 print(pred_label) # [1, 0]
@@ -72,14 +76,16 @@ print(raw_prob) # [0.9805923700332642, 0.007121307775378227]
 import pandas as pd
 from datasets import load_dataset
 from minicheck.minicheck import MiniCheck
+import os
+os.environ["CUDA_VISIBLE_DEVICES"] = "0"

-# load 13K test data
+# load 29K test data
 df = pd.DataFrame(load_dataset("lytang/LLM-AggreFact")['test'])
 docs = df.doc.values
 claims = df.claim.values

-scorer = MiniCheck(model_name='flan-t5-large', device='cuda:0', cache_dir='./ckpts')
-pred_label, raw_prob, _, _ = scorer.score(docs=docs, claims=claims) # ~ 20 mins, depending on hardware
+scorer = MiniCheck(model_name='flan-t5-large', cache_dir='./ckpts')
+pred_label, raw_prob, _, _ = scorer.score(docs=docs, claims=claims) # ~ 500 docs/min, depending on hardware
 ```

 To evaluate the result on the benchmark
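
The hunk above ends mid-sentence at "To evaluate the result on the benchmark". A minimal sketch of what that evaluation typically looks like is shown below. It is not part of this commit: it assumes the LLM-AggreFact test split exposes `dataset` and `label` columns alongside `doc` and `claim`, and that balanced accuracy (BAcc) per source dataset is the reported metric, as in the MiniCheck paper.

```python
# Hypothetical evaluation sketch -- not from the commit above.
# Assumes: `pred_label` comes from scorer.score(...) in the previous block,
# and that the test split has `dataset` and `label` columns.
import pandas as pd
from datasets import load_dataset
from sklearn.metrics import balanced_accuracy_score

df = pd.DataFrame(load_dataset("lytang/LLM-AggreFact")['test'])
df['pred_label'] = pred_label  # predictions aligned with the rows of df

# Balanced accuracy (BAcc) per source dataset, then the macro average.
per_dataset = df.groupby('dataset').apply(
    lambda g: balanced_accuracy_score(g['label'], g['pred_label'])
)
print(per_dataset)
print(f"Average BAcc: {per_dataset.mean():.3f}")
```

Balanced accuracy corrects for label imbalance within each source dataset, and macro-averaging the per-dataset scores keeps the larger datasets from dominating the headline number.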