metadata
base_model: google/pegasus-x-base
tags:
- generated_from_trainer
datasets:
- arxiv-summarization
widget:
- text: >-
[Abstract] The dominant sequence transduction models are based on complex
recurrent or convolutional neural networks in an encoder-decoder
configuration. The best performing models also connect the encoder and
decoder through an attention mechanism. We propose a new simple network
architecture, the Transformer, based solely on attention mechanisms,
dispensing with recurrence and convolutions entirely. Experiments on two
machine translation tasks show these models to be superior in quality
while being more parallelizable and requiring significantly less time to
train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German
translation task, improving over the existing best results, including
ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation
task, our model establishes a new single-model state-of-the-art BLEU score
of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the
training costs of the best models from the literature. We show that the
Transformer generalizes well to other tasks by applying it successfully to
English constituency parsing both with large and limited training data.
[Introduction] Recurrent neural networks, long short-term memory [13] and
gated recurrent [7] neural networks in particular, have been firmly
established as state of the art approaches in sequence modeling and
transduction problems such as language modeling and machine translation
[35, 2, 5]. Numerous efforts have since continued to push the boundaries
of recurrent language models and encoder-decoder architectures [38, 24,
15]. Recurrent models typically factor computation along the symbol
positions of the input and output sequences. Aligning the positions to
steps in computation time, they generate a sequence of hidden states h_t,
as a function of the previous hidden state h_{t−1} and the input for position
t. This inherently sequential nature precludes parallelization within
training examples, which becomes critical at longer sequence lengths, as
memory constraints limit batching across examples. Recent work has
achieved significant improvements in computational efficiency through
factorization tricks [21] and conditional computation [32], while also
improving model performance in case of the latter. The fundamental
constraint of sequential computation, however, remains. Attention
mechanisms have become an integral part of compelling sequence modeling
and transduction models in various tasks, allowing modeling of
dependencies without regard to their distance in the input or output
sequences [2, 19]. In all but a few cases [27], however, such attention
mechanisms are used in conjunction with a recurrent network. In this work
we propose the Transformer, a model architecture eschewing recurrence and
instead relying entirely on an attention mechanism to draw global
dependencies between input and output. The Transformer allows for
significantly more parallelization and can reach a new state of the art in
translation quality after being trained for as little as twelve hours on
eight P100 GPUs.
example_title: Attention Is All You Need
- text: >-
[Abstract] In this work, we explore prompt tuning, a simple yet effective
mechanism for learning soft prompts to condition frozen language models to
perform specific downstream tasks. Unlike the discrete text prompts used
by GPT-3, soft prompts are learned through backpropagation and can be
tuned to incorporate signal from any number of labeled examples. Our
end-to-end learned approach outperforms GPT-3's few-shot learning by a
large margin. More remarkably, through ablations on model size using T5,
we show that prompt tuning becomes more competitive with scale: as models
exceed billions of parameters, our method closes the gap and matches the
strong performance of model tuning (where all model weights are tuned).
This finding is especially relevant in that large models are costly to
share and serve, and the ability to reuse one frozen model for multiple
downstream tasks can ease this burden. Our method can be seen as a
simplification of the recently proposed prefix tuning of Li and Liang
(2021), and we provide a comparison to this and other similar approaches.
Finally, we show that conditioning a frozen model with soft prompts
confers benefits in robustness to domain transfer, as compared to full
model tuning. [Introduction] With the wide success of pre-trained large
language models, a range of techniques has arisen to adapt these
general-purpose models to downstream tasks. ELMo (Peters et al., 2018)
proposed freezing the pre-trained model and learning a task-specific
weighting of its per-layer representations. However, since GPT (Radford et
al., 2018) and BERT (Devlin et al., 2019), the dominant adaptation
technique has been model tuning (or fine-tuning), where all model
parameters are tuned during adaptation, as proposed by Howard and Ruder
(2018). More recently, Brown et al. (2020) showed that prompt design (or
priming) is surprisingly effective at modulating a frozen GPT-3 model’s
behavior through text prompts. Prompts are typically composed of a task
description and/or several canonical examples. This return to freezing
pre-trained models is appealing, especially as model size continues to
increase. Rather than requiring a separate copy of the model for each
downstream task, a single generalist model can simultaneously serve many
different tasks. Unfortunately, prompt-based adaptation has several key
drawbacks. Task description is error-prone and requires human involvement,
and the effectiveness of a prompt is limited by how much conditioning text
can fit into the model’s input. As a result, downstream task quality still
lags far behind that of tuned models. For instance, GPT-3 175B few-shot
performance on SuperGLUE is 17.5 points below fine-tuned T5-XXL (Raffel et
al., 2020) (71.8 vs. 89.3) despite using 16 times more parameters. Several
efforts to automate prompt design have been recently proposed. Shin et al.
(2020) propose a search algorithm over the discrete space of words, guided
by the downstream application training data. While this technique
outperforms manual prompt design, there is still a gap relative to model
tuning. Li and Liang (2021) propose prefix tuning and show strong results
on generative tasks. This method freezes the model parameters and
backpropagates the error during tuning to prefix activations prepended to
each layer in the encoder stack, including the input layer. Hambardzumyan
et al. (2021) simplify this recipe by restricting the trainable parameters
to the input and output subnetworks of a masked language model, and show
reasonable results on classification tasks. In this paper, we propose
prompt tuning as a further simplification for adapting language models. We
freeze the entire pre-trained model and only allow an additional k tunable
tokens per downstream task to be prepended to the input text. This soft
prompt is trained end-to-end and can condense the signal from a full
labeled dataset, allowing our method to outperform few-shot prompts and
close the quality gap with model tuning (Figure 1). At the same time,
since a single pre-trained model is recycled for all downstream tasks, we
retain the efficient serving benefits of frozen models (Figure 2). While
we developed our method concurrently with Li and Liang (2021) and
Hambardzumyan et al. (2021), we are the first to show that prompt tuning
alone (with no intermediate-layer prefixes or task-specific output layers)
is sufficient to be competitive with model tuning. Through detailed
experiments in sections 2–3, we demonstrate that language model capacity
is a key ingredient for these approaches to succeed. As Figure 1 shows,
prompt tuning becomes more competitive with scale. We compare with similar
approaches in Section 4. Explicitly separating task-specific parameters
from the generalist parameters needed for general language-understanding
has a range of additional benefits. We show in Section 5 that by capturing
the task definition in the prompt while keeping the generalist parameters
fixed, we are able to achieve better resilience to domain shifts. In
Section 6, we show that prompt ensembling, learning multiple prompts for
the same task, can boost quality and is more efficient than classic model
ensembling. Finally, in Section 7, we investigate the interpretability of
our learned soft prompts. In sum, our key contributions are: 1. Proposing
prompt tuning and showing its competitiveness with model tuning in the
regime of large language models. 2. Ablating many design choices, and
showing quality and robustness improve with scale. 3. Showing prompt
tuning outperforms model tuning on domain shift problems. 4. Proposing
prompt ensembling and showing its effectiveness.
example_title: PEFT (2104.08691)
- text: >-
[Abstract] For the first time in the world, we succeeded in synthesizing
the room-temperature superconductor (Tc ≥ 400 K, 127 °C) working at ambient
pressure with a modified lead-apatite (LK-99) structure. The
superconductivity of LK-99 is proved with the Critical temperature (Tc),
Zero-resistivity, Critical current (Ic), Critical magnetic field (Hc), and
the Meissner effect. The superconductivity of LK-99 originates from minute
structural distortion by a slight volume shrinkage (0.48 %), not by
external factors such as temperature and pressure. The shrinkage is caused
by Cu2+ substitution of Pb2+(2) ions in the insulating network of
Pb(2)-phosphate and it generates the stress. It concurrently transfers to
Pb(1) of the cylindrical column resulting in distortion of the cylindrical
column interface, which creates superconducting quantum wells (SQWs) in
the interface. The heat capacity results indicated that the new model is
suitable for explaining the superconductivity of LK-99. The unique
structure of LK-99 that allows the minute distorted structure to be
maintained in the interfaces is the most important factor that LK-99
maintains and exhibits superconductivity at room temperatures and ambient
pressure. [Introduction] Since the discovery of the first
superconductor(1), many efforts to search for new room-temperature
superconductors have been carried out worldwide(2, 3) through their
experimental clarity or/and theoretical perspectives(4-8). The recent
success of developing room-temperature superconductors with hydrogen
sulfide(9) and yttrium super-hydride(10) has great attention worldwide,
which is expected by strong electron-phonon coupling theory with
high-frequency hydrogen phonon modes(11, 12). However, it is difficult to
apply them to actual application devices in daily life because of the
tremendously high pressure, and more efforts are being made to overcome
the high-pressure problem(13). For the first time in the world, we report
the success in synthesizing a room-temperature and ambient-pressure
superconductor with a chemical approach to solve the temperature and
pressure problem. We named the first room temperature and ambient pressure
superconductor LK-99. The superconductivity of LK-99 proved with the
Critical temperature (Tc), Zero-resistivity, Critical current (Ic),
Critical magnetic field (Hc), and Meissner effect(14, 15). Several data
were collected and analyzed in detail to figure out the puzzle of
superconductivity of LK-99: X-ray diffraction (XRD), X-ray photoelectron
spectroscopy (XPS), Electron Paramagnetic Resonance Spectroscopy (EPR),
Heat Capacity, and Superconducting quantum interference device (SQUID)
data. Henceforth in this paper, we will report and discuss our new
findings including superconducting quantum wells associated with the
superconductivity of LK-99.
example_title: LK-99 (Not NLP)
- text: >-
[Abstract] Evaluation practices in natural language generation
(NLG) have many known flaws, but improved evaluation approaches are rarely
widely adopted. This issue has become more urgent, since neural NLG models
have improved to the point where they can often no longer be distinguished
based on the surface-level features that older metrics rely on. This paper
surveys the issues with human and automatic model evaluations and with
commonly used datasets in NLG that have been pointed out over the past 20
years. We summarize, categorize, and discuss how researchers have been
addressing these issues and what their findings mean for the current state
of model evaluations. Building on those insights, we lay out a long-term
vision for NLG evaluation and propose concrete steps for researchers to
improve their evaluation processes. Finally, we analyze 66 NLG papers from
recent NLP conferences in how well they already follow these suggestions
and identify which areas require more drastic changes to the status quo.
[Introduction] There are many issues with the evaluation of models that
generate natural language. For example, datasets are often constructed in
a way that prevents measuring tail effects of robustness, and they almost
exclusively cover English. Most automated metrics measure only similarity
between model output and references instead of fine-grained quality
aspects (and even that poorly). Human evaluations have a high variance
and, due to insufficient documentation, rarely produce replicable results.
These issues have become more urgent as the nature of models that generate
language has changed without significant changes to how they are being
evaluated. While evaluation methods can capture surface-level improvements
in text generated by state-of-the-art models (such as increased fluency)
to some extent, they are ill-suited to detect issues with the content of
model outputs, for example if they are not attributable to input
information. These ineffective evaluations lead to overestimates of model
capabilities. Deeper analyses uncover that popular models fail even at
simple tasks by taking shortcuts, overfitting, hallucinating, and not
being in accordance with their communicative goals. Identifying these
shortcomings, many recent papers critique evaluation techniques or propose
new ones. But almost none of the suggestions are followed or new
techniques used. There is an incentive mismatch between conducting
high-quality evaluations and publishing new models or modeling techniques.
While general-purpose evaluation techniques could lower the barrier of
entry for incorporating evaluation advances into model development, their
development requires resources that are hard to come by, including model
outputs on validation and test sets or large quantities of human
assessments of such outputs. Moreover, some issues, like the refinement of
datasets, require iterative processes where many researchers collaborate.
All this leads to a circular dependency where evaluations of generation
models can be improved only if generation models use better evaluations.
We find that there is a systemic difference between selecting the best
model and characterizing how good this model really is. Current evaluation
techniques focus on the first, while the second is required to detect
crucial issues. More emphasis needs to be put on measuring and reporting
model limitations, rather than focusing on producing the highest
performance numbers. To that end, this paper surveys analyses and
critiques of evaluation approaches (sections 3 and 4) and of commonly used
NLG datasets (section 5). Drawing on their insights, we describe how
researchers developing modeling techniques can help to improve and
subsequently benefit from better evaluations with methods available today
(section 6). Expanding on existing work on model documentation and formal
evaluation processes (Mitchell et al., 2019; Ribeiro et al., 2020), we
propose releasing evaluation reports which focus on demonstrating NLG
model shortcomings using evaluation suites. These reports should apply a
complementary set of automatic metrics, include rigorous human
evaluations, and be accompanied by data releases that allow for
re-analysis with improved metrics. In an analysis of 66 recent EMNLP,
INLG, and ACL papers along 29 dimensions related to our suggestions
(section 7), we find that the first steps toward an improved evaluation
are already frequently taken at an average rate of 27%. The analysis
uncovers the dimensions that require more drastic changes in the NLG
community. For example, 84% of papers already report results on multiple
datasets and more than 28% point out issues in them, but we found only a
single paper that contributed to the dataset documentation, leaving future
researchers to re-identify those issues. We further highlight typical
unsupported claims and a need for more consistent data release practices.
Following the suggestions and results, we discuss how incorporating the
suggestions can improve evaluation research, how the suggestions differ
from similar ones made for NLU, and how better metrics can benefit model
development itself (section 8).
example_title: NLG-Eval (2202.06935)
model-index:
- name: Long-paper-summarization-pegasus-x-b
results:
- task:
name: Summarization
type: summarization
dataset:
name: ccdv/arxiv-summarization
type: ccdv/arxiv-summarization
config: section
split: test
args: section
metrics:
- name: ROUGE-1
type: rouge
value: 35.6639
- name: ROUGE-2
type: rouge
value: 9.81362
- name: ROUGE-L
type: rouge
value: 19.9013
- name: ROUGE-LSum
type: rouge
value: 28.1444
license: mit
language:
- en
metrics:
- rouge
Long-paper-summarization-pegasus-x-b
This model is a fine-tuned version of google/pegasus-x-base on the arxiv-summarization dataset. It achieves the following results on the evaluation set:
- Loss: 2.7262
Model Description / Training and evaluation data
Base Model: google/pegasus-x-base (PEGASUS-X, a PEGASUS variant built for long-input summarization)
Finetuning Dataset:
- We used train[25000:100000] of the arXiv summarization dataset (Cohan et al., NAACL-HLT 2018).
- The full training split has 200,000+ examples; we will upload a model trained on the full split soon.
GPU: 1 x RTX A6000
Train time: About 24 hours for 3 epochs
Test time: About 8 hours for the test dataset
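The slice notation `train[25000:100000]` follows the split-string syntax of the Hugging Face `datasets` library. Below is a minimal sketch of how such a slice could be requested, assuming the `ccdv/arxiv-summarization` dataset and `section` config named in this card's metadata. The helper names `split_spec`, `slice_size` and `load_training_slice` are illustrative and are not taken from the original training script.

```python
def split_spec(split: str, start: int, stop: int) -> str:
    """Build a datasets-style slice string such as 'train[25000:100000]'."""
    if not 0 <= start < stop:
        raise ValueError("expected 0 <= start < stop")
    return f"{split}[{start}:{stop}]"


def slice_size(start: int, stop: int) -> int:
    """Number of examples selected, assuming the split has at least `stop` rows."""
    return stop - start


def load_training_slice(start: int = 25_000, stop: int = 100_000):
    """Load the slice; needs the `datasets` package and network access."""
    # Imported lazily so the two helpers above work without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "ccdv/arxiv-summarization",  # dataset id from the card metadata
        "section",                   # config from the card metadata
        split=split_spec("train", start, stop),
    )


print(split_spec("train", 25_000, 100_000))
print(slice_size(25_000, 100_000))
```

Under these assumptions the slice covers 75,000 training examples, a little over a third of the 200,000+ available.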
Intended uses & limitations
- Research Paper Summarization
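The widget examples in this card prefix each part of a paper with a bracketed section tag (`[Abstract]`, `[Introduction]`). The sketch below builds an input in that shape and passes it to a `transformers` summarization pipeline. Whether the fine-tuning data used the same tags is an assumption. `build_input` and `summarize` are hypothetical helper names, and `model_id` must be supplied by the reader because this card does not state the full repository id.

```python
def build_input(sections: dict[str, str]) -> str:
    """Join sections as '[Name] text', mirroring the widget examples."""
    parts = []
    for name, text in sections.items():
        cleaned = " ".join(text.split())  # collapse line breaks and repeated spaces
        if cleaned:                       # skip sections that are empty
            parts.append(f"[{name}] {cleaned}")
    return " ".join(parts)


def summarize(model_id: str, text: str, max_new_tokens: int = 256) -> str:
    """Run summarization; needs `transformers`, `torch`, and the model weights."""
    # Lazy import: build_input stays usable without transformers installed.
    from transformers import pipeline

    summarizer = pipeline("summarization", model=model_id)
    result = summarizer(text, max_new_tokens=max_new_tokens, truncation=True)
    return result[0]["summary_text"]


paper = {
    "Abstract": "We propose a new simple network architecture,\nthe Transformer.",
    "Introduction": "Recurrent neural networks have been firmly established.",
}
print(build_input(paper))
```

`summarize` is left uncalled here because it downloads model weights. The `max_new_tokens` default of 256 is an arbitrary illustrative value, not a setting reported by the authors.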
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- gradient_accumulation_steps: 64
- total_train_batch_size: 64
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 390
- num_epochs: 3 (takes about 24 hours)
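The batch-size and scheduler settings above can be checked with a little arithmetic. The sketch below assumes a single GPU, that all 75,000 examples in `train[25000:100000]` were used, and that the schedule ends at step 3510 (the last step in the training results table). The `linear_lr` function is an illustrative re-implementation of a linear warmup and decay schedule, not code from the training run.

```python
per_device_batch = 1      # train_batch_size
grad_accum_steps = 64     # gradient_accumulation_steps
num_examples = 75_000     # assumed: size of train[25000:100000], nothing filtered out
peak_lr = 1e-5            # learning_rate
warmup_steps = 390        # lr_scheduler_warmup_steps
total_steps = 3510        # assumed: last step listed in the training results table

# One optimizer step consumes per_device_batch * grad_accum_steps examples.
effective_batch = per_device_batch * grad_accum_steps
steps_per_epoch = num_examples // effective_batch


def linear_lr(step: int) -> float:
    """Linear warmup from 0 to peak_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


print("effective batch size:", effective_batch)
print("optimizer steps per epoch:", steps_per_epoch)
for step in (0, 195, 390, 1950, 3510):
    print(step, linear_lr(step))
```

Under these assumptions the effective batch size matches the reported `total_train_batch_size` of 64, and an epoch takes about 1,171 optimizer steps, which is consistent with the table listing step 1170 at epoch 1.0.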
Training results
| Training Loss | Epoch | Step | Validation Loss |
|---|---|---|---|
| 3.401 | 0.33 | 390 | 2.3985 |
| 2.5444 | 0.67 | 780 | 2.2461 |
| 2.4849 | 1.0 | 1170 | 2.2690 |
| 2.5735 | 1.33 | 1560 | 2.3334 |
| 2.7045 | 1.66 | 1950 | 2.4330 |
| 2.8939 | 2.0 | 2340 | 2.5461 |
| 3.0773 | 2.33 | 2730 | 2.6502 |
| 3.2149 | 2.66 | 3120 | 2.7039 |
| 3.2844 | 3.0 | 3510 | 2.7262 |
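The table above can be scanned for the checkpoint with the lowest validation loss. This is a minimal sketch that uses only the numbers reported in the table; the variable names are illustrative.

```python
# (training_loss, epoch, step, validation_loss) rows copied from the table above
rows = [
    (3.401, 0.33, 390, 2.3985),
    (2.5444, 0.67, 780, 2.2461),
    (2.4849, 1.0, 1170, 2.2690),
    (2.5735, 1.33, 1560, 2.3334),
    (2.7045, 1.66, 1950, 2.4330),
    (2.8939, 2.0, 2340, 2.5461),
    (3.0773, 2.33, 2730, 2.6502),
    (3.2149, 2.66, 3120, 2.7039),
    (3.2844, 3.0, 3510, 2.7262),
]

best = min(rows, key=lambda row: row[3])  # row with the lowest validation loss
final = rows[-1]                          # last logged evaluation
gap = final[3] - best[3]

print(f"lowest validation loss: {best[3]} at step {best[2]} (epoch {best[1]})")
print(f"final validation loss: {final[3]} at step {final[2]} (epoch {final[1]})")
print(f"difference: {gap:.4f}")
```

On these numbers the validation loss is lowest at step 780 and rises afterwards, so the reported loss of 2.7262 corresponds to the final evaluation rather than the best one. Keeping the best checkpoint or stopping earlier may be worth trying, though the card does not say which checkpoint was released.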
Framework versions
- Transformers 4.32.1
- Pytorch 2.0.1
- Datasets 2.12.0
- Tokenizers 0.13.2