Understanding BARTScore
BARTScore is an evaluation metric for generated text that frames model evaluation itself as a text-generation task.
Other metrics approach the evaluation problem from different ML task perspectives: ROUGE and BLEU formulate it as unsupervised matching, BLEURT and COMET as supervised regression, and BEER as supervised ranking.
BARTScore instead formulates it as text generation. The idea is to use the pre-trained BART sequence-to-sequence model to score the output of the model under evaluation along one of four directions: faithfulness, precision, recall, or F-score.
For example, to measure faithfulness, we take the source and the text generated by our model and use BART to compute the log probability of each generated token given the source; the score is the weighted sum of those token log probabilities (with uniform weights in the standard setup, i.e., the average log-likelihood).
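To make that concrete, here is a minimal sketch of the faithfulness direction using Hugging Face transformers. The checkpoint, the uniform weighting, and the helper function name are illustrative assumptions, not the official implementation (that lives in the repo linked below).

```python
# Sketch: faithfulness-style BARTScore = average log-probability of the
# hypothesis tokens given the source, under a pre-trained BART model.
import torch
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
model.eval()

def bart_score(source: str, hypothesis: str) -> float:
    """Average log-probability of hypothesis tokens given the source (higher is better)."""
    src = tokenizer(source, return_tensors="pt", truncation=True)
    tgt = tokenizer(hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        # Teacher forcing: passing labels makes the model predict each target token.
        output = model(
            input_ids=src.input_ids,
            attention_mask=src.attention_mask,
            labels=tgt.input_ids,
        )
        log_probs = torch.log_softmax(output.logits, dim=-1)
        # Log-probability assigned to each actual hypothesis token.
        token_log_probs = log_probs.gather(2, tgt.input_ids.unsqueeze(-1)).squeeze(-1)
    # Uniform weights: the score reduces to the mean token log-likelihood.
    return token_log_probs.mean().item()

print(bart_score("The cat sat on the mat.", "A cat is sitting on a mat."))
```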
BARTScore correlates nicely with human scores, and it is relatively simple to implement.
Here is the original BARTScore paper: BARTScore: Evaluating Generated Text as Text Generation (arXiv:2106.11520)
And the GitHub repo to use this metric: https://github.com/neulab/BARTScore
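For a sense of how it is used in practice, the repo exposes a BARTScorer class along these lines; the argument names and defaults below are assumptions based on the project's README, so check the repository for the current interface.

```python
# Hypothetical usage sketch of the repo's BARTScorer (API assumed from its README).
from bart_score import BARTScorer

bart_scorer = BARTScorer(device="cuda:0", checkpoint="facebook/bart-large-cnn")
# Generation scores from the first list of texts to the second list
# (e.g., sources -> hypotheses for faithfulness); higher is better.
scores = bart_scorer.score(["This is interesting."], ["This is fun."], batch_size=4)
print(scores)
```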