santiviquez posted an update Feb 6
Understanding BARTScore πŸ›Ή

BARTScore is a text-generation evaluation metric that treats model evaluation as a text-generation task πŸ”„

Other metrics approach the evaluation problem from different ML task perspectives; for instance, ROUGE and BLEU formulate it as an unsupervised matching task, BLEURT and COMET as a supervised regression task, and BEER as a supervised ranking task.

Meanwhile, BARTScore formulates it as a text-generation task. The idea is to leverage a pre-trained BART model to score the main model's output, returning a value that measures faithfulness, precision, recall, or an F-score, depending on which direction the text is scored in.
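Concretely, the score is a weighted sum of the log probabilities BART assigns to the tokens of one text y conditioned on another text x (the paper uses equal weights by default):

$$\text{BARTScore} = \sum_{t=1}^{m} \omega_t \log p(y_t \mid y_{<t}, x; \theta)$$

Choosing what plays the role of x and y gives the variants: source → hypothesis for faithfulness, reference → hypothesis for precision, hypothesis → reference for recall, and the F-score averages the precision and recall directions.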

For example, to measure faithfulness, we take the source text and our model's generated text and use BART to compute the log probability of each generated token given the source; we then weight those token-level log probabilities and sum them up, as in the sketch below.
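Here is a minimal sketch of that faithfulness direction in Python, assuming the facebook/bart-large-cnn checkpoint and equal token weights (a simple average); the official neulab/BARTScore repo implements the same idea with batching and more options.

```python
import torch
import torch.nn.functional as F
from transformers import BartForConditionalGeneration, BartTokenizer

# Checkpoint choice is an assumption; any seq2seq BART checkpoint works the same way.
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
model.eval()

def bart_score(source: str, generated: str) -> float:
    """Average log probability of the generated tokens given the source (faithfulness direction)."""
    with torch.no_grad():
        src = tokenizer(source, return_tensors="pt", truncation=True)
        tgt = tokenizer(generated, return_tensors="pt", truncation=True)
        # Teacher-force the generated text as labels; BART conditions on the source.
        out = model(
            input_ids=src.input_ids,
            attention_mask=src.attention_mask,
            labels=tgt.input_ids,
        )
        # Log probability of each generated token under BART.
        log_probs = F.log_softmax(out.logits, dim=-1)
        token_log_probs = log_probs.gather(-1, tgt.input_ids.unsqueeze(-1)).squeeze(-1)
        # Equal weights: return the mean log-likelihood per token.
        return token_log_probs.mean().item()

print(bart_score("The cat sat on the mat.", "A cat is sitting on a mat."))
```

Higher (less negative) scores mean BART finds the generated text more likely given the source, i.e. more faithful.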

BARTScore correlates nicely with human scores, and it is relatively simple to implement.

πŸ“‘ Here is the original BARTScore paper: BARTScore: Evaluating Generated Text as Text Generation (2106.11520)
πŸ§‘β€πŸ’» And the GitHub repo to use this metric: https://github.com/neulab/BARTScore