rohitsroch committed on
Commit 86546c5
1 Parent(s): 899b072

Push SEAD-L-6_H-256_A-8-sst2 model weights

README.md ADDED
@@ -0,0 +1,59 @@
+ ---
+ language:
+ - en
+ license: apache-2.0
+ tags:
+ - SEAD
+ datasets:
+ - glue
+ - sst2
+ ---
+
+ ## Paper
+
+ ## [SEAD: SIMPLE ENSEMBLE AND KNOWLEDGE DISTILLATION FRAMEWORK FOR NATURAL LANGUAGE UNDERSTANDING](https://www.course5i.com/ai-labs/)
+ Authors: *Moyan Mei*, *Rohit Sroch*
+
+ ## Abstract
+
+ With the widespread use of pre-trained language models (PLMs), there has been increased research on how to make them applicable, especially in limited-resource or low-latency, high-throughput scenarios. One of the dominant approaches is knowledge distillation (KD), where a smaller model is trained by receiving guidance from a large PLM. While there are many successful designs for learning knowledge from teachers, it remains unclear how students can learn better. Inspired by real university teaching processes, in this work we further explore knowledge distillation and propose a very simple yet effective framework, SEAD, to further improve task-specific generalization by utilizing multiple teachers. Our experiments show that SEAD leads to better performance compared to other popular KD methods [[1](https://arxiv.org/abs/1910.01108)] [[2](https://arxiv.org/abs/1909.10351)] [[3](https://arxiv.org/abs/2002.10957)] and achieves comparable or superior performance to its teacher model, such as BERT [[4](https://arxiv.org/abs/1810.04805)], on a total of 13 tasks from the GLUE [[5](https://arxiv.org/abs/1804.07461)] and SuperGLUE [[6](https://arxiv.org/abs/1905.00537)] benchmarks.
+
+ ## SEAD-L-6_H-256_A-8-sst2
+
+ This is a student model distilled from [**BERT base**](https://huggingface.co/bert-base-uncased) as the teacher, using the SEAD framework on the **sst2** task. For weight initialization, we used [microsoft/xtremedistil-l6-h256-uncased](https://huggingface.co/microsoft/xtremedistil-l6-h256-uncased).
+
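+ As a minimal usage sketch, the checkpoint can be loaded with the standard `transformers` auto classes. The repository id below is an assumption based on this card's name; adjust it to the actual model path if it differs.
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+ # Hypothetical repo id, inferred from the model card name
+ model_id = "course5i/SEAD-L-6_H-256_A-8-sst2"
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForSequenceClassification.from_pretrained(model_id)
+ model.eval()
+
+ # Classify a single sentence (SST-2 convention: 0 = negative, 1 = positive)
+ inputs = tokenizer("a charming and often affecting journey.", return_tensors="pt")
+ with torch.no_grad():
+     logits = model(**inputs).logits
+ print(logits.argmax(dim=-1).item())
+ ```
+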
+ ## All SEAD Checkpoints
+
+ Other Community Checkpoints: [here](https://huggingface.co/models?search=SEAD)
+
+ ## Intended uses & limitations
+
+ More information needed
+
+ ### Training hyperparameters
+
+ Please take a look at the `training_args.bin` file for the full set of training hyperparameters, for example:
+
+ ```python
+ import os
+ import torch
+
+ # training_args.bin stores the serialized training arguments used for fine-tuning
+ hyperparameters = torch.load(os.path.join('training_args.bin'))
+ ```
+
+ ### Evaluation results
+
+ | eval_accuracy | eval_runtime (s) | eval_samples_per_second | eval_steps_per_second | eval_loss | eval_samples |
+ |:-------------:|:----------------:|:-----------------------:|:---------------------:|:---------:|:------------:|
+ | 0.9266 | 1.3676 | 637.636 | 20.475 | 0.2503 | 872 |
+
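+ The table above can, in principle, be re-checked on the GLUE SST-2 validation split with a short script like the one below. This is only a sketch: it again assumes the hypothetical repo id used earlier and requires `datasets` and `torch`.
+
+ ```python
+ import torch
+ from datasets import load_dataset
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+ # Hypothetical repo id, inferred from the model card name
+ model_id = "course5i/SEAD-L-6_H-256_A-8-sst2"
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForSequenceClassification.from_pretrained(model_id)
+ model.eval()
+
+ # GLUE SST-2 validation split: 872 sentences, matching eval_samples above
+ dataset = load_dataset("glue", "sst2", split="validation")
+
+ correct = 0
+ for start in range(0, len(dataset), 32):
+     batch = dataset[start : start + 32]  # dict of lists: "sentence", "label", "idx"
+     inputs = tokenizer(batch["sentence"], padding=True, truncation=True, return_tensors="pt")
+     with torch.no_grad():
+         preds = model(**inputs).logits.argmax(dim=-1)
+     correct += (preds == torch.tensor(batch["label"])).sum().item()
+
+ print("accuracy:", correct / len(dataset))  # should be close to the reported 0.9266
+ ```
+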
+ ### Framework versions
+
+ - Transformers >=4.8.0
+ - PyTorch >=1.6.0
+ - TensorFlow >=2.5.0
+ - Flax >=0.3.5
+ - Datasets >=1.10.2
+ - Tokenizers >=0.11.6
config.json ADDED
@@ -0,0 +1,35 @@
+ {
+   "_name_or_path": "../artifacts/best_models/sst2/L-6_H-256_A-8/student-ckpt",
+   "architectures": [
+     "BertForSequenceClassification"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "finetuning_task": "sst2",
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 256,
+   "id2label": {
+     "0": 0,
+     "1": 1
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 1024,
+   "label2id": {
+     "0": 0,
+     "1": 1
+   },
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 8,
+   "num_hidden_layers": 6,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "problem_type": "single_label_classification",
+   "transformers_version": "4.18.0",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 30522
+ }
eval_results.json ADDED
@@ -0,0 +1,8 @@
+ {
+   "eval_accuracy": 0.926605504587156,
+   "eval_loss": 0.25034917970853193,
+   "eval_runtime": 1.3676,
+   "eval_samples": 872,
+   "eval_samples_per_second": 637.636,
+   "eval_steps_per_second": 20.475
+ }
flax_model.msgpack ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4bf8afdf97f6235dea7a852d1e43a5dd6d3318ee14e6630445244b51a5317132
+ size 51006182
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0cf8cdd2950e24cdad91dc6a30aaace5954c208b9714911cab7acc851d4a1074
+ size 51032629
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tf_model.h5 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7ef077cb78c5f9f0fb90b8cf292b3c695dba0e3137a7b552c3dfb0200138503d
+ size 51150416
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"do_lower_case": true, "do_basic_tokenize": true, "never_split": null, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "special_tokens_map_file": null, "tokenizer_file": null, "name_or_path": "microsoft/xtremedistil-l6-h256-uncased", "tokenizer_class": "BertTokenizer"}
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:29d59bb26f6090c808d898d4fe7cae78825f7314d825fd3dd4542b027fc9c3f4
+ size 2760
vocab.txt ADDED
The diff for this file is too large to render. See raw diff