callaghanmt committed on
Commit
7d60af5
1 Parent(s): 94929f6

Upload folder using huggingface_hub

README.md ADDED
@@ -0,0 +1,71 @@
+ ---
+ language: en
+ tags:
+ - scibert
+ - fine-tuned
+ - scientific-embeddings
+ - multi-document-summarization
+ - scitldr
+ license: mit
+ ---
+
+ # SciBERT Fine-tuned for Scientific Multi-Document Summarization Embeddings
+
+ ## Model description
+
+ This model is a fine-tuned version of [allenai/scibert_scivocab_uncased](https://huggingface.co/allenai/scibert_scivocab_uncased) for creating embeddings used in scientific multi-document summarization tasks. It has been optimized to generate meaningful representations of scientific text for use in downstream summarization pipelines.
+
+ ## Intended uses & limitations
+
+ This model is intended for creating embeddings of scientific documents, specifically for use in multi-document summarization tasks. It does not generate summaries itself; rather, it produces vector representations of scientific text that serve as input to summarization models or algorithms (see the usage example below).
+
+ The model may not perform optimally on non-scientific text or on tasks significantly different from multi-document summarization.
+
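+ ## How to use
+
+ The following is a minimal sketch of how the model can be used to embed scientific text with the `transformers` library. The repository ID in the snippet is a placeholder, and mean pooling over the last hidden state is shown only as one common way to obtain a single document vector; it is not necessarily the exact pooling strategy used during fine-tuning.
+
+ ```python
+ import torch
+ from transformers import AutoModel, AutoTokenizer
+
+ # Placeholder repository ID: replace with the actual ID of this model repo.
+ model_id = "callaghanmt/scibert-scitldr-embeddings"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModel.from_pretrained(model_id)
+ model.eval()
+
+ texts = [
+     "We propose a transformer-based approach to multi-document summarization.",
+     "This paper studies contrastive pretraining for scientific document retrieval.",
+ ]
+
+ # The underlying SciBERT encoder has 512 position embeddings, so truncate to 512 tokens.
+ batch = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
+
+ with torch.no_grad():
+     outputs = model(**batch)
+
+ # Mean-pool the token embeddings, ignoring padding positions.
+ mask = batch["attention_mask"].unsqueeze(-1).float()
+ embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
+ print(embeddings.shape)  # torch.Size([2, 768])
+ ```
+
+ The resulting 768-dimensional vectors can then be clustered, ranked, or otherwise consumed by a downstream multi-document summarization pipeline.
+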
+ ## Training data
+
+ This model was trained on the SciTLDR dataset. SciTLDR (Scientific Too Long; Didn't Read) is a dataset of scientific papers paired with extreme, single-sentence TL;DR summaries. It contains roughly 5,400 TL;DRs for about 3,200 computer science papers collected from OpenReview. Each paper in the dataset includes:
+
+ - The paper's title
+ - The abstract
+ - The full text of the paper
+ - Two types of summaries:
+   1. Author-written TL;DR
+   2. Expert-written TL;DR
+
+ The dataset is designed to support extreme summarization in the scientific domain, where the goal is to produce very short, high-level summaries of scientific papers.
+
+ For more information about the SciTLDR dataset, see the [official paper](https://arxiv.org/abs/2004.15011) and the [dataset repository](https://github.com/allenai/scitldr).
+
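+ For experimentation, the dataset is also available on the Hugging Face Hub. The snippet below is a sketch that assumes the `allenai/scitldr` dataset ID and its `Abstract` configuration; the exact split and preprocessing used to fine-tune this model may differ.
+
+ ```python
+ from datasets import load_dataset
+
+ # Assumed dataset ID and configuration ("Abstract", "AIC", or "FullText");
+ # not necessarily the exact preprocessing used for this model.
+ scitldr = load_dataset("allenai/scitldr", "Abstract")
+
+ print(scitldr)                     # DatasetDict with train/validation/test splits
+ print(scitldr["train"][0].keys())  # inspect available fields (source text, TL;DR targets, ...)
+ ```
+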
+ ## Training procedure
+
+ The model was trained for up to 15 epochs with early stopping based on validation loss; the best checkpoint was saved at epoch 15.
+
+ ### Training hyperparameters
+
+ The following hyperparameters were used during training (a rough sketch of this setup is shown after the list):
+ - learning_rate: 1e-5 to 1e-7 (cosine annealing)
+ - train_batch_size: 16
+ - eval_batch_size: 16
+ - optimizer: AdamW
+
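+ As an illustration only (the original training script is not included in this repository), the optimizer and learning-rate schedule described above could be configured roughly as follows; the step counts are placeholders:
+
+ ```python
+ import torch
+ from transformers import AutoModel
+
+ model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")
+
+ # AdamW starting at 1e-5, cosine-annealed towards 1e-7, as listed above.
+ num_epochs = 15
+ steps_per_epoch = 100  # placeholder; depends on dataset size with a batch size of 16
+
+ optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
+ scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
+     optimizer,
+     T_max=num_epochs * steps_per_epoch,
+     eta_min=1e-7,
+ )
+
+ # Inside the training loop, call optimizer.step() followed by scheduler.step()
+ # after each batch; early stopping monitors the validation loss.
+ ```
+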
+ ### Framework versions
+
+ - Transformers 4.41.2
+ - PyTorch 2.3.0+cu121
+ - Datasets 2.20.0
+ - Tokenizers 0.19.1
+ - CUDA 12.1
+
+ ## Evaluation results
+
+ The model achieved the following results:
+ - Training Loss: 0.2272
+ - Validation Loss: 0.8738
+
+ ## Model Limitations and Bias
+
+ This model is trained on scientific literature from the SciTLDR dataset, which primarily contains computer science papers. As such, it may not generalize well to other scientific domains or to non-scientific text. Users should be aware of potential biases in the training data, which may be reflected in the generated embeddings. Performance is likely strongest on computer science texts and may degrade on other scientific fields.
+
+ ## Author
+
+ callaghanmt
config.json ADDED
@@ -0,0 +1,25 @@
+ {
+   "_name_or_path": "allenai/scibert_scivocab_uncased",
+   "architectures": [
+     "BertModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.41.2",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 31090
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6b9e570b7412965a6a5310d035d3c5743f511c9d46f068679f6b779d21ea7f62
+ size 439696224
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "cls_token": "[CLS]",
+   "mask_token": "[MASK]",
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "unk_token": "[UNK]"
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,57 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "104": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": true,
+   "mask_token": "[MASK]",
+   "model_max_length": 1000000000000000019884624838656,
+   "never_split": null,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "unk_token": "[UNK]"
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff