yuewang-sf committed on
Commit 90bae88 · 1 Parent(s): ea4592d

Update README.md

  1. README.md +13 -16
README.md CHANGED
@@ -7,10 +7,9 @@ datasets:
7
  inference: true
8
  ---
9
 
10
- # CodeT5 for code summarization (base-sized model)
11
 
12
- [CodeT5-base](https://huggingface.co/Salesforce/codet5-base) model fine-tuned on CodeSearchNet data
13
- from [Husain et al., 2019](https://arxiv.org/abs/1909.09436) in a multi-lingual training setting (
14
  Ruby/JavaScript/Go/Python/Java/PHP) for code summarization. It was introduced in this EMNLP 2021
15
  paper [CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation](https://arxiv.org/abs/2109.00859)
16
  by Yue Wang, Weishi Wang, Shafiq Joty, Steven C.H. Hoi. Please check out more
@@ -24,7 +23,7 @@ Here is how to use this model:
24
  from transformers import RobertaTokenizer, T5ForConditionalGeneration
25
 
26
  if __name__ == '__main__':
27
- tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
28
  model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base-multi-sum')
29
 
30
  text = """def svg_to_image(string, size=None):
@@ -49,13 +48,12 @@ if __name__ == '__main__':
49
 
50
  ## Fine-tuning data
51
 
52
- We employ the filtered version of CodeSearchNet data
53
  from [CodeXGLUE](https://github.com/microsoft/CodeXGLUE/tree/main/Code-Text/code-to-text) benchmark for fine-tuning on
54
  code summarization. The data is tokenized with our pre-trained code-specific BPE (Byte-Pair Encoding) tokenizer. One can
55
- prepare text (or code) for the model using RobertaTokenizer, with the vocab files
56
- from [codet5-base](https://huggingface.co/Salesforce/codet5-base).
57
 
58
- ### Data Statistic
59
 
60
  | Programming Language | Training | Dev | Test |
61
  | :------------------- | :------: | :----: | :----: |
@@ -68,14 +66,13 @@ from [codet5-base](https://huggingface.co/Salesforce/codet5-base).
68
 
69
  ## Training procedure
70
 
71
- We fine-tune codet5-base on six PLs (Ruby/JavaScript/Go/Python/Java/PHP) in the multi-task learning setting. We employ
72
- balanced sampling to avoid biasing towards high-resource tasks. Please refer to
73
- the [paper](https://arxiv.org/abs/2109.00859) for more details.
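
The training-procedure text mentions balanced sampling but does not spell out the rule. One common way to do this is exponent-smoothed multinomial sampling, where each task is drawn with probability proportional to its data share raised to a power below 1. The sketch below assumes that scheme; the function name, the `alpha=0.7` default, and the dataset sizes are illustrative assumptions, not values taken from this README.

```python
def balanced_sampling_probs(sizes, alpha=0.7):
    """Return per-task sampling probabilities proportional to share**alpha.

    `sizes` maps a task name to its number of training examples.
    `alpha` is an assumed smoothing exponent; alpha=1.0 gives plain
    proportional sampling, smaller values favour low-resource tasks.
    """
    total = sum(sizes.values())
    # Each task's share of the pooled training data.
    shares = {task: n / total for task, n in sizes.items()}
    # Flatten the distribution with the exponent, then renormalize.
    weights = {task: share ** alpha for task, share in shares.items()}
    norm = sum(weights.values())
    return {task: w / norm for task, w in weights.items()}


# Hypothetical sizes, for illustration only.
example_sizes = {"ruby": 25_000, "javascript": 60_000, "python": 250_000}
probs = balanced_sampling_probs(example_sizes)
```

With an exponent below 1, the smallest task is sampled more often than its raw share of the data and the largest task less often, which is the effect the text describes as avoiding bias towards high-resource tasks.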
74
 
75
  ## Evaluation results
76
 
77
- Unlike the paper allowing to select different best checkpoints for different tasks, here we employ one checkpoint for
78
- all PLs. Besides, we remove the prefix to specify the PL in training and inference. The results on the test set are shown as below:
79
 
80
  | Model | Ruby | Javascript | Go | Python | Java | PHP | Overall |
81
  | ----------- | :-------: | :--------: | :-------: | :-------: | :-------: | :-------: | :-------: |
@@ -85,10 +82,10 @@ all PLs. Besides, we remove the prefix to specify the PL in training and inferen
85
  | [CodeBERT](https://arxiv.org/pdf/2002.08155.pdf) | 12.16 | 14.90 | 18.07 | 19.06 | 17.65 | 25.16 | 17.83 |
86
  | [PLBART](https://arxiv.org/pdf/2002.08155.pdf) | 14.11 |15.56 | 18.91 | 19.30 | 18.45 | 23.58 | 18.32 |
87
  | [CodeT5-small](https://arxiv.org/abs/2109.00859) |14.87 | 15.32 | 19.25 | 20.04 | 19.92 | 25.46 | 19.14 |
88
- | [CodeT5-base](https://arxiv.org/abs/2109.00859) | 15.24 | 16.16 | 19.56 | 20.01 | 20.31 | 26.03 | 19.55 |
89
- | [CodeT5-base-multi-sum](https://arxiv.org/abs/2109.00859) | 15.24 | 16.18 | 19.95 | 20.42 | 20.26 | 26.10 | 19.69 |
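
The README does not say how the Overall column is computed. Judging from the numbers, it appears to be the unweighted mean of the six per-language scores, rounded to two decimals. The sketch below checks that reading against three rows copied from the table; the helper name `overall` is hypothetical.

```python
# Per-language scores in table order: Ruby, Javascript, Go, Python, Java, PHP.
scores = {
    "CodeBERT": [12.16, 14.90, 18.07, 19.06, 17.65, 25.16],
    "CodeT5-base": [15.24, 16.16, 19.56, 20.01, 20.31, 26.03],
    "CodeT5-base-multi-sum": [15.24, 16.18, 19.95, 20.42, 20.26, 26.10],
}

# Overall values as reported in the table.
reported = {
    "CodeBERT": 17.83,
    "CodeT5-base": 19.55,
    "CodeT5-base-multi-sum": 19.69,
}


def overall(per_language):
    """Unweighted mean over languages, rounded to two decimals."""
    return round(sum(per_language) / len(per_language), 2)


computed = {model: overall(vals) for model, vals in scores.items()}
```

Because the mean is unweighted, each language counts equally regardless of its test-set size.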
90
 
91
- ### BibTeX entry and citation info
92
 
93
  ```bibtex
94
  @inproceedings{
 
7
  inference: true
8
  ---
9
 
10
+ # CodeT5-base for Code Summarization
11
 
12
+ [CodeT5-base](https://huggingface.co/Salesforce/codet5-base) model fine-tuned on CodeSearchNet data in a multi-lingual training setting (
 
13
  Ruby/JavaScript/Go/Python/Java/PHP) for code summarization. It was introduced in this EMNLP 2021
14
  paper [CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation](https://arxiv.org/abs/2109.00859)
15
  by Yue Wang, Weishi Wang, Shafiq Joty, Steven C.H. Hoi. Please check out more
 
23
  from transformers import RobertaTokenizer, T5ForConditionalGeneration
24
 
25
  if __name__ == '__main__':
26
+ tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base-multi-sum')
27
  model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base-multi-sum')
28
 
29
  text = """def svg_to_image(string, size=None):
 
48
 
49
  ## Fine-tuning data
50
 
51
+ We employ the filtered version of CodeSearchNet data [[Husain et al., 2019](https://arxiv.org/abs/1909.09436)]
52
  from [CodeXGLUE](https://github.com/microsoft/CodeXGLUE/tree/main/Code-Text/code-to-text) benchmark for fine-tuning on
53
  code summarization. The data is tokenized with our pre-trained code-specific BPE (Byte-Pair Encoding) tokenizer. One can
54
+ prepare text (or code) for the model using RobertaTokenizer with the vocab files from [codet5-base](https://huggingface.co/Salesforce/codet5-base).
 
55
 
56
+ ### Data statistic
57
 
58
  | Programming Language | Training | Dev | Test |
59
  | :------------------- | :------: | :----: | :----: |
 
66
 
67
  ## Training procedure
68
 
69
+ We fine-tune codet5-base on these six programming languages (Ruby/JavaScript/Go/Python/Java/PHP) in the multi-task learning setting. We employ the
70
+ balanced sampling to avoid biasing towards high-resource tasks. Please refer to the [paper](https://arxiv.org/abs/2109.00859) for more details.
 
71
 
72
  ## Evaluation results
73
 
74
+ Unlike the paper allowing to select different best checkpoints for different programming languages (PLs), here we employ one checkpoint for
75
+ all PLs. Besides, we remove the task control prefix to specify the PL in training and inference. The results on the test set are shown as below:
76
 
77
  | Model | Ruby | Javascript | Go | Python | Java | PHP | Overall |
78
  | ----------- | :-------: | :--------: | :-------: | :-------: | :-------: | :-------: | :-------: |
 
82
  | [CodeBERT](https://arxiv.org/pdf/2002.08155.pdf) | 12.16 | 14.90 | 18.07 | 19.06 | 17.65 | 25.16 | 17.83 |
83
  | [PLBART](https://arxiv.org/pdf/2002.08155.pdf) | 14.11 |15.56 | 18.91 | 19.30 | 18.45 | 23.58 | 18.32 |
84
  | [CodeT5-small](https://arxiv.org/abs/2109.00859) |14.87 | 15.32 | 19.25 | 20.04 | 19.92 | 25.46 | 19.14 |
85
+ | [CodeT5-base](https://arxiv.org/abs/2109.00859) | **15.24** | 16.16 | 19.56 | 20.01 | **20.31** | 26.03 | 19.55 |
86
+ | [CodeT5-base-multi-sum](https://arxiv.org/abs/2109.00859) | **15.24** | **16.18** | **19.95** | **20.42** | 20.26 | **26.10** | **19.69** |
87
 
88
+ ## Citation
89
 
90
  ```bibtex
91
  @inproceedings{