yuewang-sf committed
Commit 7fa23ec · 1 Parent(s): 48dca45

Create README.md

Files changed (1)
  1. README.md +101 -0
README.md ADDED
@@ -0,0 +1,101 @@
+ ---
+ license: bsd-3-clause
+ tags:
+ - codet5
+ datasets:
+ - code_search_net
+ inference: true
+ ---
+
+ # CodeT5 for code summarization (base-sized model)
+
+ This is the [CodeT5-base](https://huggingface.co/Salesforce/codet5-base) model fine-tuned on CodeSearchNet data
+ from [Husain et al., 2019](https://arxiv.org/abs/1909.09436) in a multilingual training setting
+ (Ruby/JavaScript/Go/Python/Java/PHP) for code summarization. It was introduced in the EMNLP 2021
+ paper [CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation](https://arxiv.org/abs/2109.00859)
+ by Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi. For more details, see
+ [this repository](https://github.com/salesforce/CodeT5).
+
+ ## How to use
+
+ Here is how to use this model:
+
+ ```python
+ from transformers import RobertaTokenizer, T5ForConditionalGeneration
+
+ if __name__ == '__main__':
+     # The tokenizer comes from the base checkpoint; the weights are the summarization fine-tune.
+     tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
+     model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base-multi-sum')
+
+     # Source code to summarize, passed to the model as plain text.
+     text = """def svg_to_image(string, size=None):
+     if isinstance(string, unicode):
+         string = string.encode('utf-8')
+     renderer = QtSvg.QSvgRenderer(QtCore.QByteArray(string))
+     if not renderer.isValid():
+         raise ValueError('Invalid SVG data.')
+     if size is None:
+         size = renderer.defaultSize()
+     image = QtGui.QImage(size, QtGui.QImage.Format_ARGB32)
+     painter = QtGui.QPainter(image)
+     renderer.render(painter)
+     return image"""
+
+     input_ids = tokenizer(text, return_tensors="pt").input_ids
+
+     # Generate a short summary (max_length=20) and decode it back to text.
+     generated_ids = model.generate(input_ids, max_length=20)
+     print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
+     # this prints: "Convert a SVG string to a QImage."
+ ```
+
+ ## Fine-tuning data
+
+ We employ the filtered version of the CodeSearchNet data
+ from the [CodeXGLUE](https://github.com/microsoft/CodeXGLUE/tree/main/Code-Text/code-to-text) benchmark for fine-tuning on
+ code summarization. The data is tokenized with our pre-trained code-specific BPE (Byte-Pair Encoding) tokenizer. One can
+ prepare text (or code) for the model using `RobertaTokenizer`, with the vocab files
+ from [codet5-base](https://huggingface.co/Salesforce/codet5-base).
+
+ ### Data statistics
+
+ | Programming Language | Training | Dev    | Test   |
+ | :------------------- | :------: | :----: | :----: |
+ | Python               | 251,820  | 13,914 | 14,918 |
+ | PHP                  | 241,241  | 12,982 | 14,014 |
+ | Go                   | 167,288  | 7,325  | 8,122  |
+ | Java                 | 164,923  | 5,183  | 10,955 |
+ | JavaScript           | 58,025   | 3,885  | 3,291  |
+ | Ruby                 | 24,927   | 1,400  | 1,261  |
+
+ ## Training procedure
+
+ We fine-tune codet5-base on six programming languages (PLs: Ruby/JavaScript/Go/Python/Java/PHP) in a multi-task learning setting. We employ
+ balanced sampling to avoid biasing towards high-resource tasks. Please refer to
+ the [paper](https://arxiv.org/abs/2109.00859) for more details.
+
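The balanced sampling mentioned above can be illustrated with a small sketch. It assumes a smoothed multinomial over the six training sets, with each task sampled in proportion to its data share raised to an exponent `alpha`. The formula, the default `alpha=0.7`, and the function name `balanced_sampling_probs` are assumptions made for illustration, not details stated in this README; the paper describes the procedure actually used. The training-set sizes are taken from the data statistics table.

```python
# Training-set sizes from the data statistics table.
TRAIN_SIZES = {
    "Python": 251_820,
    "PHP": 241_241,
    "Go": 167_288,
    "Java": 164_923,
    "JavaScript": 58_025,
    "Ruby": 24_927,
}


def balanced_sampling_probs(sizes, alpha=0.7):
    """Return per-task sampling probabilities proportional to share ** alpha.

    alpha=1.0 reproduces sampling proportional to dataset size; a smaller
    alpha shifts probability mass towards low-resource tasks.
    The default alpha is an assumption, not a value given in this README.
    """
    total = sum(sizes.values())
    weights = {lang: (n / total) ** alpha for lang, n in sizes.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}


proportional = balanced_sampling_probs(TRAIN_SIZES, alpha=1.0)
balanced = balanced_sampling_probs(TRAIN_SIZES, alpha=0.7)

for lang in TRAIN_SIZES:
    print(f"{lang:<11} proportional={proportional[lang]:.3f} balanced={balanced[lang]:.3f}")
```

With an exponent below 1, Ruby (the smallest training set) is drawn more often than its raw share of the data, while Python (the largest) is drawn less often.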
+ ## Evaluation results
+
+ Unlike the paper, which allows selecting a different best checkpoint for each task, here we employ a single checkpoint for
+ all PLs. We also remove the prefix that specifies the PL during training and inference. The smoothed BLEU-4 results on the test set are shown below:
+
+ | Model | Ruby | JavaScript | Go | Python | Java | PHP | Overall |
+ | ----------- | :-------: | :--------: | :-------: | :-------: | :-------: | :-------: | :-------: |
+ | Seq2Seq | 9.64 | 10.21 | 13.98 | 15.93 | 15.09 | 21.08 | 14.32 |
+ | Transformer | 11.18 | 11.59 | 16.38 | 15.81 | 16.26 | 22.12 | 15.56 |
+ | [RoBERTa](https://arxiv.org/pdf/1907.11692.pdf) | 11.17 | 11.90 | 17.72 | 18.14 | 16.47 | 24.02 | 16.57 |
+ | [CodeBERT](https://arxiv.org/pdf/2002.08155.pdf) | 12.16 | 14.90 | 18.07 | 19.06 | 17.65 | 25.16 | 17.83 |
+ | [PLBART](https://arxiv.org/pdf/2103.06333.pdf) | 14.11 | 15.56 | 18.91 | 19.30 | 18.45 | 23.58 | 18.32 |
+ | [CodeT5-small](https://arxiv.org/abs/2109.00859) | 14.87 | 15.32 | 19.25 | 20.04 | 19.92 | 25.46 | 19.14 |
+ | [CodeT5-base](https://arxiv.org/abs/2109.00859) | 15.24 | 16.16 | 19.56 | 20.01 | 20.31 | 26.03 | 19.55 |
+ | [CodeT5-base-multi-sum](https://arxiv.org/abs/2109.00859) | 15.24 | 16.18 | 19.95 | 20.42 | 20.26 | 26.10 | 19.69 |
+
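The Overall column appears to be the unweighted mean of the six per-language scores. This is inferred from the numbers in the table rather than stated in the text, so treat it as an assumption. A minimal check for the CodeT5-base-multi-sum row, with illustrative variable names:

```python
# Per-language test scores for CodeT5-base-multi-sum, copied from the table above.
scores = {
    "Ruby": 15.24,
    "JavaScript": 16.18,
    "Go": 19.95,
    "Python": 20.42,
    "Java": 20.26,
    "PHP": 26.10,
}

# Unweighted (macro) average: every language counts equally,
# regardless of how many test examples it has.
overall = sum(scores.values()) / len(scores)
print(f"{overall:.2f}")  # prints 19.69
```

Because the average is unweighted, the small Ruby test set influences the Overall score as much as the much larger Python one.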
+ ### BibTeX entry and citation info
+
+ ```bibtex
+ @inproceedings{wang2021codet5,
+     title={CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation},
+     author={Yue Wang and Weishi Wang and Shafiq Joty and Steven C.H. Hoi},
+     booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021},
+     year={2021},
+ }
+ ```