reStructured Pre-training (RST)

RST is a new paradigm for language pre-training, which

unifies 26 different types of signal from 10 data sources (Rotten Tomatoes, Daily Mail, Wikipedia, Wikidata, wikiHow, WordNet, arXiv, etc.) in a structured form and uses them to pre-train a single monolithic model,
surpasses strong competitors (e.g., T0) on 52 of 55 popular datasets across a variety of NLP tasks (classification, IE, retrieval, generation, etc.), and
achieves superior performance on the National College Entrance Examination (Gaokao-English, 高考-英语): it scores 40 points higher than the student average and 15 points higher than GPT3 with 1/16 of the parameters. In particular, Qin achieves a high score of 138.5 (out of a full mark of 150) on the 2018 English exam.

In such a pre-training paradigm,

Data-centric Pre-training: the role of data is re-emphasized, and both model pre-training and fine-tuning on downstream tasks are viewed as a process of storing and accessing data.
Pre-training over JSON instead of TEXT: a good storage mechanism should not only have the ability to cache a large amount of data but also consider the ease of access.
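
To make the "storing and accessing" view more concrete, here is a minimal sketch of one record stored once as JSON and then accessed as several text-to-text training pairs. The record fields, the query wording, and the `access` helper are illustrative assumptions, not the format used in the paper; only the `TEXT: ... QUERY: ...` layout is borrowed from the usage example in this README.

```python
import json

# Hypothetical record: one article stored once, with several signals attached.
stored = json.dumps({
    "text": "The city council approved a new bike lane network on Tuesday.",
    "category": "news",
    "summary": "Council approves bike lanes.",
})

# Hypothetical queries; each one "accesses" a different field of the same record.
QUERIES = {
    "category": "What is the category of this text?",
    "summary": "Summarize this text.",
}

def access(record: dict, signal: str) -> tuple:
    """Build a (prompt, target) pair for one signal of a stored record."""
    prompt = f'TEXT: {record["text"]} QUERY: {QUERIES[signal]}'
    return prompt, record[signal]

record = json.loads(stored)
pairs = [access(record, signal) for signal in QUERIES]
for prompt, target in pairs:
    print(prompt, "->", target)
```

Keeping the signals as named fields means a new kind of query only needs a new entry in `QUERIES`, rather than a new copy of the text.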

Model Description

We release all models introduced in our paper, covering 13 different application scenarios. Each model contains 11 billion parameters.

Model	Description	Recommended Application
rst-all-11b	Trained with all the signals below except signals that are used to train Gaokao models	All applications below (specialized models are recommended first if high performance is preferred)
rst-fact-retrieval-11b	Trained with the following signals: WordNet meaning, WordNet part-of-speech, WordNet synonym, WordNet antonym, wikiHow category hierarchy, Wikidata relation, Wikidata entity typing, Paperswithcode entity typing	Knowledge intensive tasks, information extraction tasks, fact checking
rst-summarization-11b	Trained with the following signals: DailyMail summary, Paperswithcode summary, arXiv summary, wikiHow summary	Summarization or other general generation tasks, meta-evaluation (e.g., BARTScore)
rst-temporal-reasoning-11b	Trained with the following signals: DailyMail temporal information, wikiHow procedure	Temporal reasoning, relation extraction, event-based extraction
rst-information-extraction-11b	Trained with the following signals: Paperswithcode entity, Paperswithcode entity typing, Wikidata entity typing, Wikidata relation, Wikipedia entity	Named entity recognition, relation extraction and other general IE tasks in the news, scientific or other domains
rst-intent-detection-11b	Trained with the following signals: wikiHow goal-step relation	Intent prediction, event prediction
rst-topic-classification-11b	Trained with the following signals: DailyMail category, arXiv category, wikiHow text category, Wikipedia section title	General text classification
rst-word-sense-disambiguation-11b	Trained with the following signals: WordNet meaning, WordNet part-of-speech, WordNet synonym, WordNet antonym	Word sense disambiguation, part-of-speech tagging, general IE tasks, common sense reasoning
rst-natural-language-inference-11b	Trained with the following signals: ConTRoL dataset, DREAM dataset, LogiQA dataset, RACE & RACE-C dataset, ReClor dataset, DailyMail temporal information	Natural language inference, multiple-choice question answering, reasoning
rst-sentiment-classification-11b	Trained with the following signals: Rotten Tomatoes sentiment, Wikipedia sentiment	Sentiment classification, emotion classification
rst-gaokao-rc-11b	Trained with multiple-choice QA datasets that are used to train the T0pp model	General multiple-choice question answering
rst-gaokao-cloze-11b	Trained with manually crafted cloze datasets	General cloze filling
rst-gaokao-writing-11b	Trained with example essays from past Gaokao-English exams and grammar error correction signals	Essay writing, story generation, grammar error correction and other text generation tasks

Have a try?

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("XLab/rst-all-11b")
model = AutoModelForSeq2SeqLM.from_pretrained("XLab/rst-all-11b")

inputs = tokenizer.encode("TEXT: this is the best cast iron skillet you will ever buy. QUERY: Is this review \"positive\" or \"negative\"", return_tensors="pt")
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True))
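
Since the model table recommends specialized checkpoints when performance matters, a small helper for choosing a checkpoint and formatting the input might look like the sketch below. The task keys are hypothetical labels, and two things are assumptions extrapolated from the single example above: that the other checkpoints live under the same `XLab/` namespace as `rst-all-11b`, and that the `TEXT: ... QUERY: ...` template applies to other tasks as well.

```python
# Model names are taken from the table above; the task keys are hypothetical labels.
SPECIALIZED_MODELS = {
    "sentiment": "rst-sentiment-classification-11b",
    "summarization": "rst-summarization-11b",
    "topic": "rst-topic-classification-11b",
    "nli": "rst-natural-language-inference-11b",
}

def model_id_for(task: str, namespace: str = "XLab") -> str:
    """Return a specialized checkpoint if one is listed, else the general model."""
    name = SPECIALIZED_MODELS.get(task, "rst-all-11b")
    return f"{namespace}/{name}"

def build_prompt(text: str, query: str) -> str:
    """Format input in the TEXT/QUERY layout of the example above (assumed general)."""
    return f"TEXT: {text.strip()} QUERY: {query.strip()}"

model_id = model_id_for("sentiment")
prompt = build_prompt(
    "this is the best cast iron skillet you will ever buy.",
    'Is this review "positive" or "negative"',
)
print(model_id)
print(prompt)
```

The resulting `model_id` and `prompt` can be passed to `from_pretrained` and `tokenizer.encode` exactly as in the example above.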

Data for reStructured Pre-training

This dataset is a precious treasure, containing a variety of naturally occurring signals. Any downstream task you can think of (e.g., the college entrance exam mentioned in the RST paper) can benefit from pre-training on some of the signals we provide. We spent several months collecting the following 29 signal types, which together account for 46,926,447 data samples. We hope this dataset will be a valuable asset for everyone in natural language processing research.

We provide the collected signals through DataLab. For efficiency, we provide at most 50,000 samples for each signal type. If you want all the samples we collected, please fill out this form. The signals we collected are listed below.

We would be happy :smiley: to hear if this resource is helpful for your work. If it is, please cite our work :blush:

Data Mine	Signal	#Samples	Use in DataLab	Some Applications
Rotten Tomatoes	(review, rating)	5,311,109	`load_dataset("rst", "rotten_tomatoes_sentiment")`	Sentiment classification
Daily Mail	(text, category)	899,904	`load_dataset("rst", "daily_mail_category")`	Topic classification
Daily Mail	(title, text, summary)	1,026,616	`load_dataset("rst", "daily_mail_summary")`	Summarization; Sentence expansion
Daily Mail	(text, events)	1,006,412	`load_dataset("rst", "daily_mail_temporal")`	Temporal reasoning
Wikidata	(entity, entity_type, text)	2,214,274	`load_dataset("rst", "wikidata_entity")`	Entity typing
Wikidata	(subject, object, relation, text)	1,526,674	`load_dataset("rst", "wikidata_relation")`	Relation extraction; Fact retrieval
wikiHow	(text, category)	112,109	`load_dataset("rst", "wikihow_text_category")`	Topic classification
wikiHow	(low_category, high_category)	4,868	`load_dataset("rst", "wikihow_category_hierarchy")`	Relation extraction; Commonsense reasoning
wikiHow	(goal, steps)	47,956	`load_dataset("rst", "wikihow_goal_step")`	Intent detection
wikiHow	(text, summary)	703,278	`load_dataset("rst", "wikihow_summary")`	Summarization; Sentence expansion
wikiHow	(goal, first_step, second_step)	47,787	`load_dataset("rst", "wikihow_procedure")`	Temporal reasoning
wikiHow	(question, description, answer, related_questions)	47,705	`load_dataset("rst", "wikihow_question")`	Question generation
Wikipedia	(text, entities)	22,231,011	`load_dataset("rst", "wikipedia_entities")`	Entity recognition
Wikipedia	(texts, titles)	3,296,225	`load_dataset("rst", "wikipedia_sections")`	Summarization
WordNet	(word, sentence, pos)	27,123	`load_dataset("rst", "wordnet_pos")`	Part-of-speech tagging
WordNet	(word, sentence, meaning, possible_meanings)	27,123	`load_dataset("rst", "wordnet_meaning")`	Word sense disambiguation
WordNet	(word, sentence, synonyms)	17,804	`load_dataset("rst", "wordnet_synonym")`	Paraphrasing
WordNet	(word, sentence, antonyms)	6,408	`load_dataset("rst", "wordnet_antonym")`	Negation
ConTRoL	(premise, hypothesis, label)	8,323	`load_dataset("rst", "qa_control")`	Natural language inference
DREAM	(context, question, options, answer)	9,164	`load_dataset("rst", "qa_dream")`	Reading comprehension
LogiQA	(context, question, options, answer)	7,974	`load_dataset("rst", "qa_logiqa")`	Reading comprehension
ReClor	(context, question, options, answer)	5,138	`load_dataset("rst", "qa_reclor")`	Reading comprehension
RACE	(context, question, options, answer)	44,880	`load_dataset("rst", "qa_race")`	Reading comprehension
RACE-C	(context, question, options, answer)	5,093	`load_dataset("rst", "qa_race_c")`	Reading comprehension
TriviaQA	(context, question, answer)	46,636	`load_dataset("rst", "qa_triviaqa")`	Reading comprehension
arXiv	(text, category)	1,696,348	`load_dataset("rst", "arxiv_category")`	Topic classification
arXiv	(text, summary)	1,696,348	`load_dataset("rst", "arxiv_summary")`	Summarization; Sentence expansion
Paperswithcode	(text, entities, datasets, methods, tasks, metrics)	4,731,233	`load_dataset("rst", "paperswithcode_entity")`	Entity recognition
Paperswithcode	(text, summary)	120,924	`load_dataset("rst", "paperswithcode_summary")`	Summarization; Sentence expansion
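
The `load_dataset("rst", ...)` calls in the table can be wrapped in a small helper that checks the configuration name before anything is downloaded. This is a sketch under stated assumptions: it assumes DataLab is installed as the `datalabs` Python package and exposes `load_dataset`, and the `load_signal` helper and `RST_CONFIGS` set are hypothetical names introduced here for illustration.

```python
# Configuration names copied from the "Use in DataLab" column (subset shown).
RST_CONFIGS = {
    "rotten_tomatoes_sentiment",
    "daily_mail_summary",
    "wikihow_goal_step",
    "wordnet_meaning",
    "arxiv_category",
}

def load_signal(config: str):
    """Load one RST signal through DataLab, rejecting names not listed above."""
    if config not in RST_CONFIGS:
        raise ValueError(f"unknown RST signal: {config!r}")
    # Assumption: DataLab is installed as the `datalabs` package and exposes
    # `load_dataset`; the import is deferred so the check above works without it.
    from datalabs import load_dataset
    return load_dataset("rst", config)

# Example (downloads data, so it is left commented out):
# dataset = load_signal("wordnet_meaning")
```

Validating the name first gives a clear error for typos instead of a failed download.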

BibTeX for Citation Info

@article{yuan2022restructured,
  title={reStructured Pre-training},
  author={Yuan, Weizhe and Liu, Pengfei},
  journal={arXiv preprint arXiv:2206.11147},
  year={2022}
}