yuewang-sf
committed on
Commit
·
90bae88
1
Parent(s):
ea4592d
Update README.md
README.md
CHANGED
@@ -7,10 +7,9 @@ datasets:
|
|
7 |
inference: true
|
8 |
---
|
9 |
|
10 |
-
# CodeT5 for
|
11 |
|
12 |
-
[CodeT5-base](https://huggingface.co/Salesforce/codet5-base) model fine-tuned on CodeSearchNet data
|
13 |
-
from [Husain et al., 2019](https://arxiv.org/abs/1909.09436) in a multi-lingual training setting (
|
14 |
Ruby/JavaScript/Go/Python/Java/PHP) for code summarization. It was introduced in this EMNLP 2021
|
15 |
paper [CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation](https://arxiv.org/abs/2109.00859)
|
16 |
by Yue Wang, Weishi Wang, Shafiq Joty, Steven C.H. Hoi. Please check out more
|
@@ -24,7 +23,7 @@ Here is how to use this model:
|
|
24 |
from transformers import RobertaTokenizer, T5ForConditionalGeneration
|
25 |
|
26 |
if __name__ == '__main__':
|
27 |
-
tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
|
28 |
model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base-multi-sum')
|
29 |
|
30 |
text = """def svg_to_image(string, size=None):
|
@@ -49,13 +48,12 @@ if __name__ == '__main__':
|
|
49 |
|
50 |
## Fine-tuning data
|
51 |
|
52 |
-
We employ the filtered version of CodeSearchNet data
|
53 |
from [CodeXGLUE](https://github.com/microsoft/CodeXGLUE/tree/main/Code-Text/code-to-text) benchmark for fine-tuning on
|
54 |
code summarization. The data is tokenized with our pre-trained code-specific BPE (Byte-Pair Encoding) tokenizer. One can
|
55 |
-
prepare text (or code) for the model using RobertaTokenizer
|
56 |
-
from [codet5-base](https://huggingface.co/Salesforce/codet5-base).
|
57 |
|
58 |
-
### Data
|
59 |
|
60 |
| Programming Language | Training | Dev | Test |
|
61 |
| :------------------- | :------: | :----: | :----: |
|
@@ -68,14 +66,13 @@ from [codet5-base](https://huggingface.co/Salesforce/codet5-base).
|
|
68 |
|
69 |
## Training procedure
|
70 |
|
71 |
-
We fine-tune codet5-base on six
|
72 |
-
balanced sampling to avoid biasing towards high-resource tasks. Please refer to
|
73 |
-
the [paper](https://arxiv.org/abs/2109.00859) for more details.
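The balanced sampling mentioned above is not spelled out in this README. As a rough sketch only, one common form is exponent-smoothed multinomial sampling over the per-language datasets, shown below. The helper names, the smoothing value `alpha=0.7`, and the per-language example counts are illustrative assumptions and are not taken from this commit.

```python
import random

# Hypothetical per-language training set sizes (illustrative only).
HYPOTHETICAL_SIZES = {
    "ruby": 25_000,
    "javascript": 60_000,
    "go": 170_000,
    "python": 250_000,
    "java": 165_000,
    "php": 240_000,
}


def balanced_sampling_probs(sizes, alpha=0.7):
    """Return smoothed sampling probabilities, one per dataset.

    Each dataset's share of the data is raised to the power `alpha`
    and the results are renormalised. alpha=1.0 gives size-proportional
    sampling; alpha<1.0 boosts low-resource datasets. The default of
    0.7 is an assumption made for this sketch.
    """
    total = sum(sizes.values())
    smoothed = {name: (n / total) ** alpha for name, n in sizes.items()}
    norm = sum(smoothed.values())
    return {name: value / norm for name, value in smoothed.items()}


def sample_language(probs, rng=random):
    """Draw the language to take the next training batch from."""
    names = list(probs)
    weights = [probs[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


if __name__ == "__main__":
    probs = balanced_sampling_probs(HYPOTHETICAL_SIZES)
    for name, p in probs.items():
        print(f"{name:>10}: {p:.3f}")
    print("next batch from:", sample_language(probs))
```

With `alpha=1.0` the sketch reduces to sampling in proportion to dataset size; smaller values shift probability mass towards low-resource languages such as Ruby.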
|
74 |
|
75 |
## Evaluation results
|
76 |
|
77 |
-
Unlike the paper allowing to select different best checkpoints for different
|
78 |
-
all PLs. Besides, we remove the prefix to specify the PL in training and inference. The results on the test set are shown as below:
|
79 |
|
80 |
| Model | Ruby | Javascript | Go | Python | Java | PHP | Overall |
|
81 |
| ----------- | :-------: | :--------: | :-------: | :-------: | :-------: | :-------: | :-------: |
|
@@ -85,10 +82,10 @@ all PLs. Besides, we remove the prefix to specify the PL in training and inferen
|
|
85 |
| [CodeBERT](https://arxiv.org/pdf/2002.08155.pdf) | 12.16 | 14.90 | 18.07 | 19.06 | 17.65 | 25.16 | 17.83 |
|
86 |
| [PLBART](https://arxiv.org/pdf/2103.06333.pdf) | 14.11 | 15.56 | 18.91 | 19.30 | 18.45 | 23.58 | 18.32 |
|
87 |
| [CodeT5-small](https://arxiv.org/abs/2109.00859) |14.87 | 15.32 | 19.25 | 20.04 | 19.92 | 25.46 | 19.14 |
|
88 |
-
| [CodeT5-base](https://arxiv.org/abs/2109.00859) | 15.24 | 16.16 | 19.56 | 20.01 | 20.31 | 26.03 | 19.55 |
|
89 |
-
| [CodeT5-base-multi-sum](https://arxiv.org/abs/2109.00859) | 15.24
|
90 |
|
91 |
-
|
92 |
|
93 |
```bibtex
|
94 |
@inproceedings{
|
|
|
7 |
inference: true
|
8 |
---
|
9 |
|
10 |
+
# CodeT5-base for Code Summarization
|
11 |
|
12 |
+
[CodeT5-base](https://huggingface.co/Salesforce/codet5-base) model fine-tuned on CodeSearchNet data in a multi-lingual training setting (
|
|
|
13 |
Ruby/JavaScript/Go/Python/Java/PHP) for code summarization. It was introduced in this EMNLP 2021
|
14 |
paper [CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation](https://arxiv.org/abs/2109.00859)
|
15 |
by Yue Wang, Weishi Wang, Shafiq Joty, Steven C.H. Hoi. Please check out more
|
|
|
23 |
from transformers import RobertaTokenizer, T5ForConditionalGeneration
|
24 |
|
25 |
if __name__ == '__main__':
|
26 |
+
tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base-multi-sum')
|
27 |
model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base-multi-sum')
|
28 |
|
29 |
text = """def svg_to_image(string, size=None):
|
|
|
48 |
|
49 |
## Fine-tuning data
|
50 |
|
51 |
+
We employ the filtered version of CodeSearchNet data [[Husain et al., 2019](https://arxiv.org/abs/1909.09436)]
|
52 |
from [CodeXGLUE](https://github.com/microsoft/CodeXGLUE/tree/main/Code-Text/code-to-text) benchmark for fine-tuning on
|
53 |
code summarization. The data is tokenized with our pre-trained code-specific BPE (Byte-Pair Encoding) tokenizer. One can
|
54 |
+
prepare text (or code) for the model using RobertaTokenizer with the vocab files from [codet5-base](https://huggingface.co/Salesforce/codet5-base).
|
|
|
55 |
|
56 |
+
### Data statistics
|
57 |
|
58 |
| Programming Language | Training | Dev | Test |
|
59 |
| :------------------- | :------: | :----: | :----: |
|
|
|
66 |
|
67 |
## Training procedure
|
68 |
|
69 |
+
We fine-tune codet5-base on these six programming languages (Ruby/JavaScript/Go/Python/Java/PHP) in a multi-task learning setting. We employ
|
70 |
+
balanced sampling to avoid biasing towards high-resource tasks. Please refer to the [paper](https://arxiv.org/abs/2109.00859) for more details.
|
|
|
71 |
|
72 |
## Evaluation results
|
73 |
|
74 |
+
Unlike the paper, which selects a different best checkpoint for each programming language (PL), here we employ one checkpoint for
|
75 |
+
all PLs. In addition, we remove the task control prefix that specifies the PL in training and inference. The results on the test set are shown below:
|
76 |
|
77 |
| Model | Ruby | Javascript | Go | Python | Java | PHP | Overall |
|
78 |
| ----------- | :-------: | :--------: | :-------: | :-------: | :-------: | :-------: | :-------: |
|
|
|
82 |
| [CodeBERT](https://arxiv.org/pdf/2002.08155.pdf) | 12.16 | 14.90 | 18.07 | 19.06 | 17.65 | 25.16 | 17.83 |
|
83 |
| [PLBART](https://arxiv.org/pdf/2103.06333.pdf) | 14.11 | 15.56 | 18.91 | 19.30 | 18.45 | 23.58 | 18.32 |
|
84 |
| [CodeT5-small](https://arxiv.org/abs/2109.00859) |14.87 | 15.32 | 19.25 | 20.04 | 19.92 | 25.46 | 19.14 |
|
85 |
+
| [CodeT5-base](https://arxiv.org/abs/2109.00859) | **15.24** | 16.16 | 19.56 | 20.01 | **20.31** | 26.03 | 19.55 |
|
86 |
+
| [CodeT5-base-multi-sum](https://arxiv.org/abs/2109.00859) | **15.24** | **16.18** | **19.95** | **20.42** | 20.26 | **26.10** | **19.69** |
|
87 |
|
88 |
+
## Citation
|
89 |
|
90 |
```bibtex
|
91 |
@inproceedings{
|