Update README.md
README.md
@@ -71,9 +71,11 @@ print("Predictions:", predictions)
 ```

-##
+## Performance

-
+**Example level results**
+
+We evaluate our model on the test set of the [RAGTruth](https://aclanthology.org/2024.acl-long.585/) dataset. Our large model, **lettucedetect-large-v1**, achieves an overall F1 score of 79.22%, outperforming prompt-based methods such as GPT-4 (63.4%) and encoder-based models such as [Luna](https://aclanthology.org/2025.coling-industry.34.pdf) (65.4%). It also surpasses fine-tuned LLAMA-2-13B (78.7%, presented in [RAGTruth](https://aclanthology.org/2024.acl-long.585/)) and is competitive with the SOTA fine-tuned LLAMA-3-8B (83.9%, presented in the [RAG-HAT paper](https://aclanthology.org/2024.emnlp-industry.113.pdf)). Overall, **lettucedetect-large-v1** and **lettucedetect-base-v1** are highly performant models that remain efficient at inference time.

 The results on the example-level can be seen in the table below.

@@ -81,18 +83,14 @@ The results on the example-level can be seen in the table below.
 <img src="https://github.com/KRLabsOrg/LettuceDetect/blob/main/assets/example_level_lettucedetect.png?raw=true" alt="Example-level Results" width="800"/>
 </p>

-
-
-The other non-prompt-based model is [Luna](https://aclanthology.org/2025.coling-industry.34.pdf), which is also a token-level model but uses a DeBERTa-large encoder. Our model is overall better than the Luna architecture (79.22 vs. 65.4 F1 score on the _overall_ data type).
+**Span-level results**

-
+At the span level, our model achieves the best scores across all data types, significantly outperforming previous models. The results can be seen in the table below. Note that we don't compare against models such as [RAG-HAT](https://aclanthology.org/2024.emnlp-industry.113.pdf) here, since they present no span-level evaluation.

 <p align="center">
 <img src="https://github.com/KRLabsOrg/LettuceDetect/blob/main/assets/span_level_lettucedetect.png?raw=true" alt="Span-level Results" width="800"/>
 </p>

-Our model achieves the best scores for each data type and overall, beating the previous best model (fine-tuned LLAMA-2-13B) by a significant margin.
-
 ## Citing

 If you use the model or the tool, please cite the following:
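For readers skimming the numbers in the diff above: example-level evaluation treats each answer as a single binary prediction, positive if the answer contains at least one hallucinated span. Below is a minimal sketch of that metric on toy labels; the data and the `example_level_f1` helper are hypothetical, not the project's evaluation code.

```python
# Example-level evaluation: an answer is positive if it contains at
# least one hallucinated span; the score is ordinary binary F1 over answers.
# The gold/pred lists below are toy data, not RAGTruth outputs.

def example_level_f1(gold: list[bool], pred: list[bool]) -> float:
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum(p and not g for g, p in zip(gold, pred))
    fn = sum(g and not p for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

gold = [True, False, True, True]   # does the answer contain a hallucination?
pred = [True, False, False, True]  # did the model flag the answer?
print(f"Example-level F1: {example_level_f1(gold, pred):.2f}")  # 0.80
```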
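Span-level evaluation instead scores how well the predicted hallucination spans overlap the gold spans. One common formulation, shown below as an assumption (the exact RAGTruth protocol may differ in details), computes F1 over the character positions the spans cover.

```python
# Span-level evaluation: compare predicted vs. gold hallucination spans
# by the character positions they cover. Spans are (start, end) offsets,
# end-exclusive. This is an illustrative formulation, not the paper's code.

def char_positions(spans: list[tuple[int, int]]) -> set[int]:
    """Expand (start, end) spans into the set of covered character indices."""
    return {i for start, end in spans for i in range(start, end)}

def span_level_f1(gold: list[tuple[int, int]], pred: list[tuple[int, int]]) -> float:
    gold_chars, pred_chars = char_positions(gold), char_positions(pred)
    tp = len(gold_chars & pred_chars)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_chars)
    recall = tp / len(gold_chars)
    return 2 * precision * recall / (precision + recall)

# Toy example: the predicted span partially overlaps the gold span.
print(f"Span-level F1: {span_level_f1([(10, 30)], [(15, 35)]):.2f}")  # 0.75
```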