inference:
parameters:
do_sample: true
max_length: 512
top_p: 0.9
repetition_penalty: 1.2
language:
- en
license: mit
metrics:
- sacrebleu
- bert_score
- rouge
- meteor
- sari
- ari
- Automated Readability Index
tags:
- text2text generation
task:
name: scientific abstract simplification
type: text2text generation
widget:
- text: >-
summarize, simplify, and contextualize: The COVID-19 pandemic presented
enormous data challenges in the United States. Policy makers,
epidemiological modelers, and health researchers all require up-to-date
data on the pandemic and relevant public behavior, ideally at fine spatial
and temporal resolution. The COVIDcast API is our attempt to fill this
need: Operational since April 2020, it provides open access to both
traditional public health surveillance signals (cases, deaths, and
hospitalizations) and many auxiliary indicators of COVID-19 activity, such
as signals extracted from deidentified medical claims data, massive online
surveys, cell phone mobility data, and internet search trends. These are
available at a fine geographic resolution (mostly at the county level) and
are updated daily. The COVIDcast API also tracks all revisions to
historical data, allowing modelers to account for the frequent revisions
and backfill that are common for many public health data sources. All of
the data are available in a common format through the API and accompanying
R and Python software packages. This paper describes the data sources and
signals, and provides examples demonstrating that the auxiliary signals in
the COVIDcast API present information relevant to tracking COVID activity,
augmenting traditional public health reporting and empowering research and
decision-making.
example_title: covid-api paper, from PNAS
- text: >-
summarize, simplify, and contextualize: Potato mop-top virus (PMTV) is
considered an emerging threat to potato production in the United States.
PMTV is transmitted by a soil-borne protist, Spongospora subterranean.
Rapid, accurate, and sensitive detection of PMTV in leaves and tubers is
an essential component in PMTV management program. A rapid test that can
be adapted to in-field, on-site testing with minimal sample manipulation
could help in ensuring the sanitary status of the produce in situations
such as certification programs and shipping point inspections. Toward that
goal, a rapid and highly sensitive recombinase polymerase amplification
(RPA)-based test was developed for PMTV detection in potato tubers. The
test combines the convenience of RPA assay with a simple sample extraction
procedure, making it amenable to rapid on-site diagnosis of PMTV.
Furthermore, the assay was duplexed with a plant internal control to
monitor sample extraction and RPA reaction performance. The method
described could detect as little as 10 fg of PMTV RNA transcript in
various potato tissues, the diagnostic limit of detection (LOQ) similar to
that of traditional molecular methods.
example_title: potato paper, from PLOS ONE
- text: >-
summarize, simplify, and contextualize: One of the most thrilling cultural
experiences is to hear live symphony-orchestra music build up from a
whispering passage to a monumental fortissimo. The impact of such a
crescendo has been thought to depend only on the musicians’ skill, but
here we show that interactions between the concert-hall acoustics and
listeners’ hearing also play a major role in musical dynamics. These
interactions contribute to the shoebox-type concert hall’s established
success, but little prior research has been devoted to dynamic expression
in this three-part transmission chain as a complete system. More forceful
orchestral playing disproportionately excites high frequency harmonics
more than those near the note’s fundamental. This effect results in not
only more sound energy, but also a different tone color. The concert hall
transmits this sound, and the room geometry defines from which directions
acoustic reflections arrive at the listener. Binaural directional hearing
emphasizes high frequencies more when sound arrives from the sides of the
head rather than from the median plane. Simultaneously, these same
frequencies are emphasized by higher orchestral-playing dynamics. When the
room geometry provides reflections from these directions, the perceived
dynamic range is enhanced. Current room-acoustic evaluation methods assume
linear behavior and thus neglect this effect. The hypothesis presented
here is that the auditory excitation by reflections is emphasized with an
orchestra forte most in concert halls with strong lateral reflections. The
enhanced dynamic range provides an explanation for the success of
rectangularly shaped concert-hall geometry.
example_title: music paper, from PNAS
- text: >-
summarize, simplify, and contextualize: Children in industrialized
cultures typically succeed on Give-N, a test of counting ability, by age
4. On the other hand, counting appears to be learned much later in the
Tsimane’, an indigenous group in the Bolivian Amazon. This study tests
three hypotheses for what may cause this difference in timing: (a)
Tsimane’ children may be shy in providing behavioral responses to number
tasks, (b) Tsimane’ children may not memorize the verbal list of number
words early in acquisition, and/or (c) home environments may not support
mathematical learning in the same way as in US samples, leading Tsimane’
children to primarily acquire mathematics through formalized schooling.
Our results suggest that most of our subjects are not inhibited by shyness
in responding to experimental tasks. We also find that Tsimane’ children
(N = 100, ages 4-11) learn the verbal list later than US children, but
even upon acquiring this list, still take time to pass Give-N tasks. We
find that performance in counting varies across tasks and is related to
formal schooling. These results highlight the importance of formal
education, including instruction in the count list, in learning the
meanings of the number words.
example_title: given-n paper, from PLOS ONE
TL;DR
Our full model is out!🎉🎉🎉 It leverages the power of multi-instruction finetuning and beats the baseline by a margin. Use the full model unless the goal is comparison.
Scientific Abstract Simplification-baseline translates hard-to-read scientific abstracts😵 into more accessible language😇. We hope it can make scientific knowledge accessible for everyone🤗.
Try it now with the Hosted inference API on the right. You can choose an existing example or paste in any (perhaps full-of-jargon) abstract. Remember to prepend the instruction to the abstract ("summarize, simplify, and contextualize: "; note the whitespace after the colon). For local use, see the Usage section.
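The prefix convention above can be captured in a small helper. This is a minimal sketch: `build_prompt` is a hypothetical name, not part of the model's API, and the whitespace clean-up is an assumption that suits pasted text rather than something the model is known to require.

```python
INSTRUCTION = "summarize, simplify, and contextualize: "  # note the trailing space


def build_prompt(abstract: str) -> str:
    """Prepend the task instruction to an abstract (hypothetical helper)."""
    # Collapse line breaks and repeated spaces that often come with pasted text.
    cleaned = " ".join(abstract.split())
    if not cleaned:
        raise ValueError("abstract is empty")
    # Avoid doubling the prefix if the caller already added it.
    bare_prefix = INSTRUCTION.strip()
    if cleaned.startswith(bare_prefix):
        cleaned = cleaned[len(bare_prefix):].lstrip()
    return INSTRUCTION + cleaned
```

The returned string can be passed directly to the tokenizer or pasted into the hosted widget.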
Model Details
Model Description
Open science has significantly lowered the barriers to scientific papers. However, reachable research does not mean accessible knowledge. Scientific papers are usually replete with jargon and hard to read. A lay audience would rather trust little stories on social media than read scientific papers. They are not to blame: we humans like stories. So why don't we "translate" arcane scientific abstracts into simpler yet relevant scientific stories? Some renowned journals have already taken accessibility into consideration. For example, PNAS asks authors to submit Significance Statements targeting "an undergraduate-educated scientist." Science also includes an editor's abstract for a quick dive.
We therefore propose to rewrite scientific abstracts into understandable scientific stories using AI. To this end, we introduce a new corpus comprising PNAS abstract-significance pairs. We finetune an encoder-decoder Transformer model (a variant of Flan-T5) with the corpus. Our baseline model (SAS-baseline) shows promising capacity in simplifying and summarizing scientific abstracts. We hope our work can pave the last mile of scientific understanding and let people better enjoy the fruits of open science.
As an ongoing effort, we are working on re-contextualizing abstracts for better storytelling and on avoiding certain jargon tokens at inference time for better readability.
- Model type: Language model
- Developed by:
- PIs: Jason Clark and Hannah McKelvey, Montana State University
- Fellow: Haining Wang, Indiana University Bloomington
- Collaborator: Zuoyu Tian, Indiana University Bloomington
- LEADING Montana State University Library, Project "TL;DR it": Automating Article Synopses for Search Engine Optimization and Citizen Science
- Language(s) (NLP): English
- License: MIT
- Parent Model: FLAN-T5-large
Usage
Use the code below to get started with the model. Remember to prepend the INSTRUCTION for best performance.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
INSTRUCTION = "summarize, simplify, and contextualize: "
tokenizer = AutoTokenizer.from_pretrained("haining/sas_baseline")
model = AutoModelForSeq2SeqLM.from_pretrained("haining/sas_baseline")
input_text = "The COVID-19 pandemic presented enormous data challenges in the United States. Policy makers, epidemiological modelers, and health researchers all require up-to-date data on the pandemic and relevant public behavior, ideally at fine spatial and temporal resolution. The COVIDcast API is our attempt to fill this need: Operational since April 2020, it provides open access to both traditional public health surveillance signals (cases, deaths, and hospitalizations) and many auxiliary indicators of COVID-19 activity, such as signals extracted from deidentified medical claims data, massive online surveys, cell phone mobility data, and internet search trends. These are available at a fine geographic resolution (mostly at the county level) and are updated daily. The COVIDcast API also tracks all revisions to historical data, allowing modelers to account for the frequent revisions and backfill that are common for many public health data sources. All of the data are available in a common format through the API and accompanying R and Python software packages. This paper describes the data sources and signals, and provides examples demonstrating that the auxiliary signals in the COVIDcast API present information relevant to tracking COVID activity, augmenting traditional public health reporting and empowering research and decision-making."
# Tokenize the instruction-prefixed abstract, padding/truncating to 672 tokens.
encoding = tokenizer(INSTRUCTION + input_text,
                     max_length=672,
                     padding='max_length',
                     truncation=True,
                     return_tensors='pt')
# Generate with nucleus sampling (top_p=0.9), capped at 512 tokens.
decoded_ids = model.generate(input_ids=encoding['input_ids'],
                             attention_mask=encoding['attention_mask'],
                             max_length=512,
                             top_p=.9,
                             do_sample=True)
print(tokenizer.decode(decoded_ids[0], skip_special_tokens=True))
Training
Data
For SAS-baseline, we finetuned the Flan-T5 model on the Scientific Abstract-Significance (SAS) corpus.
Scientific Abstract-Significance | # Training/Dev/Test Samples | # Training Tokens | # Validation Tokens | # Test Tokens | Automated Readability Index (std.) |
---|---|---|---|---|---|
Abstract | 3030/200/200 | 707,071 | 45,697 | 46,985 | 18.68 (2.85) |
Significance | 3030/200/200 | 375,433 | 24,901 | 24,426 | 17.89 (3.05) |
Setup
We finetuned the base model with a standard language modeling objective: the abstracts are sources and the significance statements are targets. We inform the model with a task-specific prefix ("summarize, simplify, and contextualize: ") during training. The training took roughly 9 hours on two NVIDIA RTX A5000 GPUs (24 GB memory each). We saved the checkpoint with the lowest validation loss for inference. We used the AdamW optimizer and a learning rate of 3e-5 with a fully sharded data parallel strategy. The model (~780M parameters) was trained on Nov. 20, 2022. Note that the significance statements score only slightly lower on ARI (i.e., they are only slightly easier to read) than the abstracts. Our upcoming SAS-full model will leverage more corpora for scientific (re)contextualization, summarization, and simplification.
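The checkpoint-selection rule described above (keep the checkpoint with the lowest validation loss) can be sketched in a few lines. This is an illustration only: `select_best_checkpoint` is a hypothetical helper, and the checkpoint names and loss values are made up, not taken from the actual training run.

```python
from typing import Iterable, Optional, Tuple


def select_best_checkpoint(history: Iterable[Tuple[str, float]]) -> Tuple[str, float]:
    """Return the (checkpoint_name, validation_loss) pair with the lowest loss."""
    best: Optional[Tuple[str, float]] = None
    for name, val_loss in history:
        # Strict '<' keeps the earliest checkpoint when losses tie.
        if best is None or val_loss < best[1]:
            best = (name, val_loss)
    if best is None:
        raise ValueError("no checkpoints were recorded")
    return best


# Made-up losses for illustration only; these are not the model's real training logs.
example_history = [("step-500", 2.31), ("step-1000", 2.05), ("step-1500", 2.12)]
print(select_best_checkpoint(example_history))  # ('step-1000', 2.05)
```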
Evaluation
The model is evaluated on the SAS test set using the following metrics.
Metrics
- SacreBLEU: SacreBLEU provides hassle-free computation of shareable, comparable, and reproducible BLEU scores. Inspired by Rico Sennrich’s multi-bleu-detok.perl, it produces the official WMT scores but works with plain text. It also knows all the standard test sets and handles downloading, processing, and tokenization for you.
- BERTScore: BERTScore leverages the pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity. It has been shown to correlate with human judgment on sentence-level and system-level evaluation. Moreover, BERTScore computes precision, recall, and F1 measure, which can be useful for evaluating different language generation tasks.
- ROUGE-1/2/L: ROUGE is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against a reference or a set of references (human-produced) summary or translation.
- METEOR: METEOR is an automatic metric for machine translation evaluation that is based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations. Unigrams can be matched based on their surface forms, stemmed forms, and meanings; furthermore, METEOR can be easily extended to include more advanced matching strategies. Once all generalized unigram matches between the two strings have been found, METEOR computes a score for this matching using a combination of unigram-precision, unigram-recall, and a measure of fragmentation that is designed to directly capture how well-ordered the matched words in the machine translation are in relation to the reference.
- SARI: SARI is a metric used for evaluating automatic text simplification systems. The metric compares the predicted simplified sentences against the reference and the source sentences. It explicitly measures the goodness of words that are added, deleted, and kept by the system. SARI = (F1_add + F1_keep + P_del) / 3, where F1_add is the n-gram F1 score for the add operation, F1_keep is the n-gram F1 score for the keep operation, and P_del is the n-gram precision score for the delete operation; n = 4, as in the original paper.
- The Automated Readability Index (ARI): ARI is a readability test designed to assess the understandability of a text. Like other popular readability formulas, the ARI formula outputs a number which approximates the grade level needed to comprehend the text. For example, if the ARI outputs the number 10, this equates to a high school student, ages 15-16 years old; a number 3 means students in 3rd grade (ages 8-9 yrs. old) should be able to comprehend the text.
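To make the ARI score concrete, here is a minimal sketch of the standard formula, ARI = 4.71 × (characters / words) + 0.5 × (words / sentences) − 21.43. The function name and the regex-based word and sentence splitting are assumptions made for illustration; py-readability-metrics tokenizes differently, so values from this sketch will not exactly match the reported ones.

```python
import re


def automated_readability_index(text: str) -> float:
    """Approximate ARI with naive regex tokenization (illustrative only)."""
    # Words: runs of letters/digits, allowing internal hyphens or apostrophes.
    words = re.findall(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*", text)
    # Sentences: split on ., !, ? (naive; abbreviations and decimals are miscounted).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    # Characters: letters and digits only, punctuation excluded.
    characters = sum(len(re.sub(r"[^A-Za-z0-9]", "", w)) for w in words)
    return 4.71 * characters / len(words) + 0.5 * len(words) / len(sentences) - 21.43
```

Longer words and longer sentences both push the score up, which is why a lower ARI is read as easier text.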
Implementations of SacreBLEU, BERTScore, ROUGE, METEOR, and SARI are from Hugging Face evaluate v.0.3.0. ARI is from py-readability-metrics v.1.4.5.
Results
Metrics | SAS-baseline |
---|---|
SacreBLEU↑ | 20.97 |
BERT Score F1↑ | 0.89 |
ROUGE-1↑ | 0.48 |
ROUGE-2↑ | 0.23 |
ROUGE-L↑ | 0.32 |
METEOR↑ | 0.39 |
SARI↑ | 46.83 |
ARI↓* | 17.12 (std. 1.97) |
- Note: Half of the generated texts are too short (fewer than 100 words) to calculate a meaningful ARI. We therefore concatenated each pair of adjacent texts and computed ARI for the resulting 100 texts (instead of the original 200 texts).
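
The pairing step in the note can be sketched as follows. This is an assumption-laden illustration: `concatenate_adjacent_pairs` is a hypothetical helper, and joining with a single space is a guess, since the note does not say how the texts were joined.

```python
from typing import List


def concatenate_adjacent_pairs(texts: List[str]) -> List[str]:
    """Join texts 0+1, 2+3, ... so each merged text is long enough for ARI."""
    if len(texts) % 2 != 0:
        raise ValueError("expected an even number of texts")
    # Assumed separator: a single space between the two generated texts.
    return [texts[i] + " " + texts[i + 1] for i in range(0, len(texts), 2)]
```

Applied to the 200 generated test outputs, this yields the 100 merged texts on which ARI is reported.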
Contact
Please contact us for any questions or suggestions.
Disclaimer
The model (sas_baseline) is created for making scientific abstracts more accessible. Its outputs should not be used or trusted outside of its scope. There is no guarantee that the generated text is perfectly aligned with the research. Resort to human experts or original papers when a decision is critical.
Acknowledgement
This research is supported by the Institute of Museum and Library Services (IMLS) RE-246450-OLS-20.