---
language:
- en
tags:
- table-to-text
- tabular
datasets:
- totto
---

# BLOOM (0.56B) fine-tuned on ToTTo for Table-to-text

This model is a fine-tuned version of [bigscience/bloom-560m](https://huggingface.co/bigscience/bloom-560m) on the **ToTTo** [dataset](https://huggingface.co/datasets/totto).


## The model

It is the 560M-parameter version of [**BLOOM**](https://bigscience.huggingface.co/blog/bloom).

## The dataset

**ToTTo** is an open-domain English table-to-text dataset with over 120,000 training examples. It proposes a controlled generation task: given a Wikipedia table and a set of highlighted table cells, produce a one-sentence description.

During the dataset creation process, tables from English Wikipedia are matched with (noisy) descriptions. Each table cell mentioned in the description is highlighted and the descriptions are iteratively cleaned and corrected to faithfully reflect the content of the highlighted cells.
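
To make the task input concrete, here is a minimal sketch of how highlighted cells could be flattened into a single string before being passed to a language model. It is an illustration only: the function name `linearize_highlighted`, the simplified table layout (a list of rows of strings with headers in the first row), and the tag format are assumptions, and the `preprocess.py` shipped with this repo may work differently.

```python
def linearize_highlighted(table, highlighted_cells, page_title=""):
    """Flatten highlighted cells into one string (hypothetical format).

    Assumes `table` is a list of rows of strings whose first row holds the
    column headers, and `highlighted_cells` is a list of (row, col) pairs.
    """
    parts = []
    if page_title:
        parts.append(f"<page_title> {page_title} </page_title>")
    for row_idx, col_idx in highlighted_cells:
        value = table[row_idx][col_idx]
        header = table[0][col_idx]
        # Pair each highlighted value with its column header for context.
        parts.append(
            f"<cell> {value} <col_header> {header} </col_header> </cell>"
        )
    return " ".join(parts)


table = [
    ["Year", "Title"],
    ["1999", "Foo"],
]
print(linearize_highlighted(table, [(1, 1)], page_title="Bar"))
```

Keeping only the highlighted cells and their headers keeps the prompt short, at the cost of dropping the rest of the table as context.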


## Evaluation results

| Metric | Value |
|:-------:|:-----:|
| rouge1  | 0.56  |
| rouge2  | 0.33  |
| rougeL  | 0.48  |
| rougeLsum  | 0.48  |
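
This card does not say which tool produced these scores. As a rough illustration of what `rouge1` measures, the sketch below computes a simplified unigram-overlap F1 between a prediction and a reference. It assumes lowercasing and whitespace tokenization and skips stemming, so its numbers will not match a full ROUGE implementation exactly; `rouge1_f1` is a hypothetical helper name.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F1: clipped unigram overlap, whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection clips each token count to the smaller of the two.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(rouge1_f1("the cat", "the cat sat down"))
```

In this example the precision is 1.0 (both predicted tokens appear in the reference) and the recall is 0.5 (two of the four reference tokens are covered).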


## Usage

```py
from datasets import load_dataset
from transformers import BloomTokenizerFast, BloomForCausalLM

valid_dataset = load_dataset('totto', split='validation')

from preprocess import preprocess # This file is included in the repo

# Now we linearize the tables
valid_dataset = valid_dataset.map(preprocess) 

model_ckpt = "mrm8488/bloom-560m-finetuned-totto-table-to-text"

tokenizer = BloomTokenizerFast.from_pretrained(model_ckpt)
model = BloomForCausalLM.from_pretrained(model_ckpt).to("cuda")


def explain_hl_cells(text):
    inputs = tokenizer(text, return_tensors='pt')
    input_ids = inputs.input_ids.to("cuda")
    attention_mask = inputs.attention_mask.to("cuda")
    output = model.generate(input_ids, attention_mask=attention_mask, max_length=2048, eos_token_id=tokenizer.eos_token_id) # num_beams=3, temperature=1.9

    return tokenizer.decode(output[0], skip_special_tokens=False)

example = valid_dataset[1]

print(explain_hl_cells(example['linearized_table']))
``` 


### Framework versions

- Transformers 4.21.2
- PyTorch 1.12.1+cu113
- Datasets 2.4.0
- Tokenizers 0.12.1