How to solve factual inconsistency when fine-tuning
I fine-tuned the led-large-book-summary model on financial reports and was able to get a ROUGE-1 score close to 0.6.
But after manually checking each model-generated summary against the actual financial report, it turned out that most key information, especially the numbers, is wrong.
I would like to know your experiences, workarounds, and ideas on this issue, even if it is not strictly related to the model itself.
Thanks
Hi! Thanks for reaching out. The bad news is that this is a largely unsolved problem; it tends to decrease (at least in text-generation models) when models are scaled to massive sizes and trained well. With smaller models this is harder, and in both cases it remains unsolved. See, for example, section 3.6 of the Diff Transformer paper, which just came out yesterday and demonstrates its improvements on summarization, but the numbers are still rather low (at least for the case where the only acceptable level of hallucination is 0%).
Some comments:
- Is your ROUGE-1 score actually 60, or 0.6? ROUGE is typically reported as a 'percentage' from 0 to 100 (perfect), and transformers reports it this way, so 0.6 would be low on that scale (see the quick scale check after this list).
- What are you using as your parameters for generating summaries? See a page I put together here a while back.
- Generally, I would recommend using beam search with `num_beams` of 4 or higher (see the generation sketch after this list).
- Additionally, since you want exact numbers from your input preserved, you should also check whether you are using `encoder_no_repeat_ngram_size` (and, in this case, avoid it), as it can force the model to never repeat an n-gram from the source text, including numbers, and therefore cause it to hallucinate.
- LED is a largely outdated model for many reasons. If your use case is okay with the OpenAI usage terms, I would recommend fine-tuning this pegasus-x model I trained on a variety of documents and GPT-4-generated summaries.
- A potential low-hanging fruit for increasing accuracy (beyond what I've said above) is DoLa decoding, which is supposed to reduce hallucinations (a hedged sketch follows this list). I am unsure whether anyone has tried it with text2text models, though, so if you do try it I would be curious what you find.
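To check which scale your ROUGE numbers are on, here is a quick sketch using the `evaluate` library (assumes `evaluate` and `rouge_score` are installed; the example strings are made up):

```python
# Quick sanity check on the ROUGE scale (illustrative strings only).
import evaluate

rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=["Q3 revenue was $55,282,384, up 12% year over year."],
    references=["The company reported Q3 revenue of $55,282,384, a 12% increase year over year."],
)
print(scores["rouge1"])        # recent versions return a fraction in [0, 1]
print(scores["rouge1"] * 100)  # the 0-100 'percentage' scale
```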
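And a rough sketch of the generation settings I mean; the checkpoint name, lengths, and input text here are assumptions for illustration, not your exact setup:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "pszemraj/led-large-book-summary"  # assumed LED checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

report_text = "Revenue for the quarter was $55,282,384, up 12% year over year. ..."
inputs = tokenizer(report_text, return_tensors="pt", truncation=True, max_length=16384)

# LED expects global attention on at least the first token
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

summary_ids = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    global_attention_mask=global_attention_mask,
    num_beams=4,             # beam search with num_beams >= 4
    max_new_tokens=256,
    no_repeat_ngram_size=3,  # fine for generic repetition
    early_stopping=True,
    # note: encoder_no_repeat_ngram_size is deliberately left unset so the
    # model stays free to copy numbers verbatim from the source
)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0])
```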
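For DoLa, a minimal sketch assuming a recent transformers release where `generate` accepts `dola_layers`; it is documented for decoder-only models, so this uses a small causal LM (the TinyLlama checkpoint is just an assumption), not LED:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Summarize: Q3 revenue was $55,282,384, up 12% year over year.\nSummary:"
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=64,
    dola_layers="high",      # contrast the final layer against higher layers
    repetition_penalty=1.2,  # commonly recommended alongside DoLa
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```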
More detail:
In general, recalling numbers is hard because of tokenization. [warning: theory] An idea I have is that hallucination/poor understanding is at least partially caused by tokenization, especially when the tokenizer and the model trained with it do not split numbers into individual digit tokens. See below: the model 'sees' "55282384" as a concatenation of several multi-digit tokens:
```
In [5]: inputs = tokenizer.encode("$55282384")

In [6]: inputs
Out[6]: [0, 1629, 3118, 2517, 1922, 6232, 2]

In [7]: tokenizer.convert_ids_to_tokens(inputs)
Out[7]: ['<s>', '$', '55', '28', '23', '84', '</s>']
```
This split is somewhat arbitrary and makes the number hard to keep track of. As far as I know, there is no long-context text2text model that has been pretrained with a tokenizer that splits numbers into individual digits by itself.
Thank you, I really appreciate your response.
The best ROUGE-1 score we got was 60%. For the dataset, we used sections of 1,000 annual reports and created the summary labels using GPT-4o.
We then checked the summaries of both GPT-4o and LED. The factual accuracy of GPT-4o was quite high (almost all numbers were correct), while LED's was really low (almost all numbers were wrong compared to the original content and the GPT-4o-generated summary).
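For reference, a minimal sketch of how a check like this could be scripted with the standard library (the regex and helper names are illustrative, not the tooling we actually used):

```python
import re

def extract_numbers(text: str) -> set[str]:
    """Pull out number-like strings, ignoring commas so '55,282,384' matches '55282384'."""
    return {m.replace(",", "") for m in re.findall(r"\d[\d,]*\.?\d*", text)}

def unsupported_numbers(summary: str, source: str) -> set[str]:
    """Numbers that appear in the summary but nowhere in the source document."""
    return extract_numbers(summary) - extract_numbers(source)

source = "Revenue for the quarter was $55,282,384, an increase of 12%."
summary = "Quarterly revenue was $55,282,384, up 15%."
print(unsupported_numbers(summary, source))  # {'15'} -> a hallucinated figure
```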
We already fine-tuned LED without `encoder_no_repeat_ngram_size`.
We previously tried two of your models, long-t5-tglobal-base-16384-book-summary and pszemraj/pegasus-x-large-book-summary, but their ROUGE-1 scores were low compared to LED.
As you suggested, we will try the pszemraj/pegasus-x-large-book_synthsumm model next.
Our end goal is to have a quality open-source model for summarizing financial documents.