w32zhong committed on
Commit
a0f7b11
1 Parent(s): f75f323

update README.

Files changed (1):
  1. README.md +11 -22
README.md CHANGED
@@ -19,36 +19,25 @@ license: mit
  ---
 
  ## About
- Here we share a pretrained BERT model that is aware of math tokens. The math tokens are treated specially and tokenized using [pya0](https://github.com/approach0/pya0), which adds very limited new tokens for latex markup (total vocabulary is just 31,061).
 
- This model is trained on 4 x 2 Tesla V100 with a total batch size of 64, using Math StackExchange data with 2.7 million sentence pairs trained for 7 epochs.
 
- ### Usage
- Download and try it out
  ```sh
- pip install pya0==0.3.2
- wget https://vault.cs.uwaterloo.ca/s/gqstFZmWHCLGXe3/download -O ckpt.tar.gz
- mkdir -p ckpt
- tar xzf ckpt.tar.gz -C ckpt --strip-components=1
  python test.py --test_file test.txt
  ```
-
- ### Test file format
- Modify the test examples in `test.txt` to play with it.
-
- The test file is tab-separated, the first column is additional positions you want to mask for the right-side sentence (useful for masking tokens in math markups). A zero means no additional mask positions.
-
- ### Example output
- ![](https://i.imgur.com/xpl87KO.png)
-
- ### Upload to huggingface
- This repo is hosted on [Github](https://github.com/approach0/azbert), and only mirrored at [huggingface](https://huggingface.co/castorini/azbert-base).
 
  To upload to huggingface, use the `upload2hgf.sh` script.
  Before running this script, be sure to check:
- * check points for model and tokenizer are created under `./ckpt` folder
  * model contains all the files needed: `config.json` and `pytorch_model.bin`
  * tokenizer contains all the files needed: `added_tokens.json`, `special_tokens_map.json`, `tokenizer_config.json`, `vocab.txt` and `tokenizer.json`
  * no `tokenizer_file` field in `tokenizer_config.json` (sometimes it is located locally at `~/.cache`)
- * `git-lfs` is installed
- * having git-remote named `hgf` reference to `https://huggingface.co/castorini/azbert-base`
  ---
 
  ## About
+ This repository is a boilerplate to push a mask-filling model to the HuggingFace Model Hub.
 
+ ### Upload to huggingface
+ Download your tokenizer, model checkpoints, and optionally the training logs to the `./ckpt` directory.
 
+ Optionally, test the model on the MLM task:
 
  ```sh
+ pip install pya0
  python test.py --test_file test.txt
  ```
+ > **Note**
+ > Modify the test examples in `test.txt` to play with it.
+ > The test file is tab-separated; the first column is additional positions you want to mask for the right-side sentence (useful for masking tokens in math markups).
+ > A zero means no additional mask positions.
 
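The tab-separated format in the note can be illustrated with a tiny parser. This is a sketch only: `parse_test_line` is a hypothetical helper, and treating the first column as comma-separated positions is an assumption; `test.py` defines the actual format.

```python
# Illustrative parser for one tab-separated line of test.txt.
# Assumption: extra mask positions in column one are comma-separated;
# "0" means no additional positions. test.py defines the real format.
def parse_test_line(line):
    cols = line.rstrip("\n").split("\t")
    if cols[0] == "0":
        positions = []                      # zero: no extra mask positions
    else:
        positions = [int(p) for p in cols[0].split(",")]
    return positions, cols[1:]              # mask positions + sentence columns
```

For example, `parse_test_line("3,5\tleft sentence\tright sentence")` would yield the extra mask positions `[3, 5]` for the right-side sentence.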
  To upload to huggingface, use the `upload2hgf.sh` script.
  Before running this script, be sure to check:
+ * `git-lfs` is installed
+ * a git remote named `hgf` references `https://huggingface.co/your/repo`
  * model contains all the files needed: `config.json` and `pytorch_model.bin`
  * tokenizer contains all the files needed: `added_tokens.json`, `special_tokens_map.json`, `tokenizer_config.json`, `vocab.txt` and `tokenizer.json`
  * no `tokenizer_file` field in `tokenizer_config.json` (sometimes it is located locally at `~/.cache`)
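The file-related items in this checklist can be verified with a short sanity check before running `upload2hgf.sh`. This is a sketch: `preflight` is a hypothetical helper, not part of the repo, and the `./ckpt` layout follows the text above.

```python
# Sketch: verify the checklist's file requirements under ./ckpt before upload.
# `preflight` is a hypothetical helper, not part of this repo.
import json
import os

REQUIRED = [
    "config.json", "pytorch_model.bin",                       # model files
    "added_tokens.json", "special_tokens_map.json",           # tokenizer files
    "tokenizer_config.json", "vocab.txt", "tokenizer.json",
]

def preflight(ckpt_dir="ckpt"):
    """Return a list of problems; an empty list means the file checks pass."""
    problems = [f for f in REQUIRED
                if not os.path.isfile(os.path.join(ckpt_dir, f))]
    cfg = os.path.join(ckpt_dir, "tokenizer_config.json")
    if os.path.isfile(cfg):
        with open(cfg) as fh:
            if "tokenizer_file" in json.load(fh):
                problems.append("remove `tokenizer_file` from tokenizer_config.json")
    return problems
```

Note this only covers the file checks; `git-lfs` and the `hgf` remote still need to be confirmed by hand.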