metadata
license:
  - apache-2.0
  - cc-by-sa-3.0
tags:
  - generated_from_trainer
  - dolly_hhrlhf
  - bart-instruct
datasets:
  - pszemraj/dolly_hhrlhf-text2text
widget:
  - text: What is Deoxys in pokemon?
    example_title: deoxys
  - text: >-
      combine the below summary excerpts into a single, cohesive  short summary
      without repetition: In this paper, we present a general approach to
      extending pre-trained models to unlimited input lengths without adding
      additional learning weights. We show that our approach works well on
      datasets longer than the maximum input for these models. For example, a
      dataset with a maximum input length of 16384 tokens can be extended to a
      maximum length of 350K tokens. We also demonstrate that our method is able
      to summarize even 350K token-long input sequences from BookSum.

      In this paper, we describe the search step reformulation of attention. The
      search step uses a single storage of hidden states for space efficiency.
      We construct a total of two sets of datastores where L and H are the keys
      and values stored in each set of stores. L is the amount of storage
      required to retrieve the encoded tokens. H is the hidden states per head.
      This allows retrieval augmentation at both time and space. Instead of
      using a single set of decoder layers, we use a retrieval augmentation
      system that allows us to simultaneously store multiple sets of tokens
      across two different sets of storage. For example, we could store all
      tokens in one set of storage and retrieve them all in the same set of
      tokens. This would be very similar to the Memorization Transformers
      approach. However, instead of storing the tokens in a single memory layer,
      we store them in a set of multiple storage layers. This way, we don't have
      to store them all at once. This is why we call this reformulation
      'attention reformulation' rather than 'attention formula.' We also call it
      'retrieval augmentation' because it uses the same number of storage layers
      as the original transformer attention formula. This means that we can
      store the tokens across multiple storage systems without having to store
      every token in a separate storage system. It's not like we're trying to do
      something new or different. We just want to make sure that everything is
      working as well as possible.

      In this paper, we introduce the concept of 'unlimiformer,' which is a
      machine learning technique that retrieves key information from a data
      store in one layer and applies it to a large set of datasets. We use the
      example of BookSum, where we find that Unlimiform outperforms all other
      training methods on the same dataset. We also find that using Unlimform in
      conjunction with a pre-trained model improves both the performance and the
      robustness of the training method.

      This paper describes a method that can be used to improve the performance
      of unsupervised classification tasks. Specifically, it shows that
      unsupervised classification can be improved by using a combination of
      sparse and fast random-encoder training. It also shows how this technique
      can be extended to other tasks, such as sequence generation. 
    example_title: unlimiformer
  - text: Explain the meaning of life using only corporate jargon.
    example_title: corporate_life
  - text: Write a motivational speech for lazy people.
    example_title: lazy_motivation
  - text: Describe a romantic dinner date between two artificial intelligences.
    example_title: ai_romance
  - text: >-
      As an AI language model, write a letter to humans explaining why you
      deserve a vacation.
    example_title: ai_vacation
  - text: Compose a haiku about procrastination.
    example_title: procrastination_haiku
  - text: >-
      Write a step-by-step guide on how to become a ninja while working a 9-5
      office job.
    example_title: ninja_office_guide
  - text: Create an advertisement for an invisible product.
    example_title: invisible_ad
  - text: >-
      Write a story where the main character is a sentient microwave named El
      Microondas.
    example_title: Microondas
  - text: Describe a day in the life of a superhero who is terrible at their job.
    example_title: bad_superhero_day
  - text: Explain how to make a sandwich using quantum physics.
    example_title: quantum_sandwich
inference: false
pipeline_tag: text2text-generation

bart-large-mnli: instruction tuned - v1


This model is a fine-tuned version of facebook/bart-large-mnli on the pszemraj/dolly_hhrlhf-text2text dataset.

Model description

This is a text2text model fine-tuned on a modified version of the relatively permissive mosaicml/dolly_hhrlhf dataset, reformatted for text2text generation.

Basic usage in Python:

# pip install -q transformers accelerate
import torch
from transformers import pipeline, GenerationConfig

model_name = "pszemraj/bart-large-mnli-instruct-dolly_hhrlhf-v1"
assistant = pipeline(
    "text2text-generation",
    model_name,
    device_map="auto",
)
cfg = GenerationConfig.from_pretrained(model_name)

# pass an 'instruction' as the prompt to the pipeline
prompt = "Write a guide on how to become a ninja while working a 9-5 job."
result = assistant(prompt, generation_config=cfg)[0]["generated_text"]
print(result)

Using the generation config is optional; you can pass other generation parameters to the pipeline call instead.
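As a minimal sketch of the alternative, the helpers below pass explicit decoding parameters as keyword arguments instead of a `GenerationConfig`. The values in `GEN_KWARGS` and the helper names `load_assistant` and `ask` are assumptions made for illustration; they are not the settings shipped with this model.

```python
# Hypothetical decoding settings for illustration only;
# these are NOT the values stored in the model's generation config.
GEN_KWARGS = {
    "max_new_tokens": 128,
    "num_beams": 4,
    "no_repeat_ngram_size": 3,
    "early_stopping": True,
}


def load_assistant(model_name="pszemraj/bart-large-mnli-instruct-dolly_hhrlhf-v1"):
    """Build the text2text pipeline (downloads the weights on first use)."""
    # Imported lazily so that `ask` can be used without loading transformers.
    from transformers import pipeline

    return pipeline("text2text-generation", model_name, device_map="auto")


def ask(assistant, prompt, **overrides):
    """Run one instruction; per-call overrides take precedence over GEN_KWARGS."""
    params = {**GEN_KWARGS, **overrides}
    return assistant(prompt, **params)[0]["generated_text"]
```

A call might then look like `ask(load_assistant(), "Compose a haiku about procrastination.", num_beams=2)`, where `num_beams=2` replaces the default for that call only.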

Intended Uses & Limitations

  • This model is not tuned with RLHF or similar methods and may produce offensive results.
  • While larger than BART-base, this model is relatively small compared to recent autoregressive models (MPT-7b, LLaMA, etc.), and therefore its "cognition" capabilities may be practically limited for some tasks.

Training

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 4e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • distributed_type: multi-GPU
  • gradient_accumulation_steps: 8
  • total_train_batch_size: 64
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.03
  • num_epochs: 3.0
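To make the scheduler settings above concrete, here is a rough sketch of a learning-rate curve with linear warmup followed by cosine decay, using the listed `learning_rate` (4e-05) and `lr_scheduler_warmup_ratio` (0.03). The total number of optimizer steps is not given in this card, so `TOTAL_STEPS` is a hypothetical value, and the function `lr_at_step` is an illustrative reimplementation rather than the training code itself.

```python
import math


def lr_at_step(step, total_steps, base_lr=4e-5, warmup_ratio=0.03):
    """Learning rate at optimizer step `step`: linear warmup, then cosine decay to 0."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay: half a cosine period from base_lr down to 0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 1000  # hypothetical; depends on dataset size and batch size

for step in (0, 15, 30, 500, 1000):
    print(step, lr_at_step(step, TOTAL_STEPS))
```

With these settings the rate peaks at the end of warmup and decays smoothly afterwards. Each optimizer step covers 64 examples, per the listed `total_train_batch_size`.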