Update README.md
README.md CHANGED
@@ -3,14 +3,15 @@ language: en
license: apache-2.0
tags:
- fill-mask
+- bert
datasets:
- wikipedia
- bookcorpus
---
-## Model Details: 90% Sparse BERT-Base (uncased) Prune Once
-This model is a sparse pre-trained model that can be fine-tuned for a wide range of language tasks. The process of weight pruning is forcing some of the weights of the neural network to zero. Setting some of the
+## Model Details: 90% Sparse BERT-Base (uncased) Prune Once for All
+This model is a sparse pre-trained model that can be fine-tuned for a wide range of language tasks. Weight pruning forces some of the weights of the neural network to zero. Setting some of the weights to zero results in sparser matrices; since updating neural network weights involves matrix multiplication, keeping the matrices sparse while retaining enough important information reduces the overall computational overhead. The term "sparse" in the model's title indicates the ratio of sparsity in the weights; for more details, see [Zafrir et al. (2021)](https://arxiv.org/abs/2111.05754).

-Visualization of Prunce Once
+Visualization of the Prune Once for All method from [Zafrir et al. (2021)](https://arxiv.org/abs/2111.05754):
![Zafrir2021_Fig1.png](https://s3.amazonaws.com/moonup/production/uploads/6297f0e30bd2f58c647abb1d/nSDP62H9NHC1FA0C429Xo.png)

| Model Detail | Description |
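The paragraph added in the hunk above describes sparsity only informally. As a concrete illustration, here is a minimal sketch that loads a masked-LM checkpoint with the Transformers library and reports the fraction of zero-valued weights in the encoder's Linear layers; the repository id is an assumption and should be replaced with this model's actual id.

```python
# Hedged sketch: estimate the unstructured sparsity of a pruned BERT checkpoint.
# The repository id below is an assumption; substitute the id of this model.
import torch
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained(
    "Intel/bert-base-uncased-sparse-90-unstructured-pruneofa"  # assumed id
)

zeros, total = 0, 0
for name, module in model.named_modules():
    # Prune Once for All prunes the Transformer encoder's Linear weights;
    # embeddings and LayerNorm parameters are typically left dense.
    if isinstance(module, torch.nn.Linear) and "encoder" in name:
        weight = module.weight.detach()
        zeros += int((weight == 0).sum())
        total += weight.numel()

print(f"Encoder Linear weight sparsity: {zeros / total:.1%}")
```

For a 90% sparse checkpoint, the reported ratio should be close to 0.9; a dense BERT-Base would report roughly 0.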
@@ -26,7 +27,7 @@ Visualization of the Prune Once for All method from [Zafrir et al. (2021)](https://

| Intended Use | Description |
| ----------- | ----------- |
-| Primary intended uses | This is a general sparse language model; in its current form, it is not ready for downstream prediction tasks, but it can be fine-tuned for several language tasks including (but not limited to)
+| Primary intended uses | This is a general sparse language model; in its current form, it is not ready for downstream prediction tasks, but it can be fine-tuned for several language tasks including (but not limited to) question-answering, multi-genre natural language inference, and sentiment classification. |
| Primary intended users | Anyone who needs an efficient general language model for other downstream tasks. |
| Out-of-scope uses | The model should not be used to intentionally create hostile or alienating environments for people. |

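To make the "fine-tuned for several language tasks" row in the hunk above concrete, here is a minimal fine-tuning sketch for one of the listed tasks (sentiment classification on SST-2). The checkpoint id and hyper-parameters are assumptions, not the recipe behind the card's reported results.

```python
# Hedged sketch: fine-tune the sparse checkpoint for sentiment classification.
# Checkpoint id and hyper-parameters are assumptions, not the card's recipe.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "Intel/bert-base-uncased-sparse-90-unstructured-pruneofa"  # assumed id
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

sst2 = load_dataset("glue", "sst2")
encoded = sst2.map(
    lambda batch: tokenizer(batch["sentence"], truncation=True, max_length=128),
    batched=True,
)

# NOTE: plain fine-tuning lets pruned weights grow back; preserving the 90%
# sparsity requires freezing the pruning mask (re-zeroing pruned weights after
# each optimizer step), which is omitted here for brevity.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sst2-sparse-bert",
                           num_train_epochs=3,
                           per_device_train_batch_size=32,
                           learning_rate=2e-5),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    tokenizer=tokenizer,
)
trainer.train()
```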
@@ -66,7 +67,7 @@ All the results are the mean of two separate experiments with the same hyper-par
| Training and Evaluation Data | Description |
| ----------- | ----------- |
| Datasets | [English Wikipedia Dataset](https://huggingface.co/datasets/wikipedia) (2500M words). |
-| Motivation | To build an efficient and accurate model for
+| Motivation | To build an efficient and accurate base model for several downstream language tasks. |
| Preprocessing | "We use the English Wikipedia dataset (2500M words) for training the models on the pre-training task. We split the data into train (95%) and validation (5%) sets. Both sets are preprocessed as described in the models’ original papers ([Devlin et al., 2019](https://arxiv.org/abs/1810.04805), [Sanh et al., 2019](https://arxiv.org/abs/1910.01108)). We process the data to use the maximum sequence length allowed by the models, however, we allow shorter sequences at a probability of 0.1." |

| Ethical Considerations | Description |
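The quoted preprocessing row is dense, so the hedged sketch below spells out the 95%/5% split and the "shorter sequences at a probability of 0.1" rule. The dataset configuration, seed, and the way shorter lengths are drawn are assumptions, not details from the quote.

```python
# Hedged sketch of the quoted preprocessing: 95%/5% Wikipedia split, tokenization
# at the model's maximum sequence length, and a shorter sequence at probability 0.1.
# The dataset configuration, seed, and shorter-length sampling are assumptions.
import random
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
wiki = load_dataset("wikipedia", "20220301.en", split="train")
split = wiki.train_test_split(test_size=0.05, seed=42)  # 95% train / 5% validation

def tokenize(example):
    # Use the full 512-token context most of the time; with probability 0.1,
    # fall back to a shorter sequence, as described in the quoted passage.
    max_len = 512 if random.random() >= 0.1 else random.randint(32, 511)
    return tokenizer(example["text"], truncation=True, max_length=max_len)

train_set = split["train"].map(tokenize, remove_columns=split["train"].column_names)
valid_set = split["test"].map(tokenize, remove_columns=split["test"].column_names)
```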