Upload folder using huggingface_hub

Commit 7d60af5 (1 parent: 94929f6), committed by callaghanmt

Files changed:
- README.md +71 -0
- config.json +25 -0
- model.safetensors +3 -0
- special_tokens_map.json +7 -0
- tokenizer.json +0 -0
- tokenizer_config.json +57 -0
- vocab.txt +0 -0
README.md
ADDED
@@ -0,0 +1,71 @@
---
language: en
tags:
- scibert
- fine-tuned
- scientific-embeddings
- multi-document-summarization
- scitldr
license: mit
---

# SciBERT Fine-tuned for Scientific Multi-Document Summarization Embeddings

## Model description

This model is a fine-tuned version of [allenai/scibert_scivocab_uncased](https://huggingface.co/allenai/scibert_scivocab_uncased) that produces embeddings for scientific multi-document summarization. It is tuned to generate representations of scientific text that downstream summarization components can consume.

## Intended uses & limitations

This model is intended for embedding scientific documents, specifically for multi-document summarization. It does not generate summaries itself; it produces vector representations of scientific text that serve as input to summarization models or algorithms.

The model may not perform well on non-scientific text or on tasks that differ significantly from multi-document summarization.
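
As a rough illustration of the intended use, the sketch below turns a list of texts into one vector each. It assumes the uploaded files sit in a local folder (`model_dir` is a hypothetical path) and that mean pooling over non-padding tokens is an acceptable way to collapse token states into a document vector; the card does not say which pooling the author used.

```python
def embed(texts, model_dir, max_length=512):
    """Return one mean-pooled vector per input text (pooling choice is an assumption)."""
    # Imported lazily so the function can be defined without the heavy dependencies.
    import torch
    from transformers import AutoModel, AutoTokenizer

    # model_dir: hypothetical local folder holding config.json, model.safetensors, vocab.txt, ...
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModel.from_pretrained(model_dir)
    model.eval()

    # Truncate explicitly: the encoder only has 512 position embeddings.
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=max_length, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden_size)

    # Average only over real tokens, ignoring padding positions.
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
```

Calling `embed(["first abstract", "second abstract"], "path/to/model")` would return a tensor with one 768-dimensional row per text, which can then be clustered or ranked by a separate summarization step.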

## Training data

This model was trained on the SciTLDR dataset. SciTLDR (Scientific Too Long; Didn't Read) pairs scientific papers with TL;DR summaries. It contains roughly 5,400 TLDRs covering about 3,200 computer science papers, drawn primarily from OpenReview. Each paper in the dataset includes:

- The paper's title
- The abstract
- The full text of the paper
- Two types of summaries:
  1. Author-written TL;DR
  2. Expert-written TL;DR

The dataset is designed to support extreme summarization in the scientific domain, where the goal is a very short, high-level summary of a paper.

For more information about the SciTLDR dataset, see the [official paper](https://arxiv.org/abs/2004.15011) and the [dataset repository](https://github.com/allenai/scitldr).

## Training procedure

The model was trained for 15 epochs, with early stopping based on validation loss. The best checkpoint was saved at epoch 15, the final epoch.
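
The early-stopping rule can be sketched as follows. This is a generic illustration, not the author's training loop: the `patience` value and the function name are assumptions, since the card only states that stopping was based on validation loss.

```python
def early_stopping_epochs(val_losses, patience=3):
    """Return (best_epoch, last_epoch_run) for a list of per-epoch validation losses.

    `patience` is a hypothetical setting: the number of consecutive epochs
    without improvement that are tolerated before training stops.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # A real loop would save the checkpoint here.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses)
```

If validation loss keeps improving through all 15 epochs, the rule never fires and the best checkpoint is the last one, which matches the behaviour reported above.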

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-5 to 1e-7 (cosine annealing)
- train_batch_size: 16
- eval_batch_size: 16
- optimizer: AdamW
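
For reference, a cosine-annealed learning rate between the two listed values can be computed in closed form. The sketch below assumes a single decay with no warmup and no restarts; the card does not say whether the schedule advanced per batch or per epoch, so `step` and `total_steps` are deliberately unit-agnostic.

```python
import math


def cosine_annealed_lr(step, total_steps, lr_max=1e-5, lr_min=1e-7):
    """Learning rate at `step` when decaying from lr_max to lr_min along a half cosine."""
    # Clamp progress to [0, 1] so steps beyond the schedule stay at lr_min.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))
```

The rate starts at `lr_max`, passes through the midpoint of the two values halfway through, and ends at `lr_min`.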

### Framework versions

- Transformers 4.41.2
- PyTorch 2.3.0+cu121
- Datasets 2.20.0
- Tokenizers 0.19.1
- CUDA 12.1

## Evaluation results

The model achieved the following results:
- Training Loss: 0.2272
- Validation Loss: 0.8738

## Model limitations and bias

This model is trained on scientific literature from the SciTLDR dataset, which consists primarily of computer science papers. As such, it may not generalize well to other scientific domains or to non-scientific text. Users should be aware that biases in the training data may be reflected in the generated embeddings. The model is likely to perform best on computer science texts and may be less effective in other scientific fields.

## Author

callaghanmt
config.json
ADDED
@@ -0,0 +1,25 @@
{
  "_name_or_path": "allenai/scibert_scivocab_uncased",
  "architectures": [
    "BertModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.41.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 31090
}
model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6b9e570b7412965a6a5310d035d3c5743f511c9d46f068679f6b779d21ea7f62
size 439696224
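
As a sanity check, the pointer's size can be compared against the architecture in config.json. The sketch below counts parameters for a standard BERT encoder with a pooling layer, using the config values as defaults; that the checkpoint contains exactly these tensors is an assumption, and the function name is illustrative.

```python
def bert_parameter_count(vocab_size=31090, hidden_size=768, num_hidden_layers=12,
                         intermediate_size=3072, max_position_embeddings=512,
                         type_vocab_size=2):
    """Count weights and biases of a standard BERT encoder plus pooler."""
    layer_norm = 2 * hidden_size  # scale and shift vectors
    # Word, position and token-type embeddings, followed by one LayerNorm.
    embeddings = ((vocab_size + max_position_embeddings + type_vocab_size) * hidden_size
                  + layer_norm)
    # Query, key, value and output projections (with biases), plus LayerNorm.
    attention = 4 * (hidden_size * hidden_size + hidden_size) + layer_norm
    # Two dense layers (with biases), plus LayerNorm.
    feed_forward = (hidden_size * intermediate_size + intermediate_size
                    + intermediate_size * hidden_size + hidden_size
                    + layer_norm)
    pooler = hidden_size * hidden_size + hidden_size
    return embeddings + num_hidden_layers * (attention + feed_forward) + pooler


params = bert_parameter_count()   # 109918464
weight_bytes = params * 4         # float32 -> 439673856 bytes
```

That is 22,368 bytes less than the 439,696,224 bytes recorded in the pointer, a gap small enough to be explained by the safetensors header, though this is an inference and not something the commit states.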
special_tokens_map.json
ADDED
@@ -0,0 +1,7 @@
{
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "unk_token": "[UNK]"
}
tokenizer.json
ADDED
The diff for this file is too large to render.
tokenizer_config.json
ADDED
@@ -0,0 +1,57 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "101": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "102": {
      "content": "[CLS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "103": {
      "content": "[SEP]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "104": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "clean_up_tokenization_spaces": true,
  "cls_token": "[CLS]",
  "do_basic_tokenize": true,
  "do_lower_case": true,
  "mask_token": "[MASK]",
  "model_max_length": 1000000000000000019884624838656,
  "never_split": null,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "unk_token": "[UNK]"
}
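
One detail worth noting when using these files: `model_max_length` here is an extremely large number, which looks like the placeholder written when no limit was configured, while config.json allows only 512 positions. The sketch below shows one way to derive a safe truncation length from the two files; the helper name is hypothetical and the dictionaries copy only the relevant keys.

```python
def effective_max_length(tokenizer_config, model_config):
    """Pick the smaller of the tokenizer's limit and the model's position capacity."""
    return min(int(tokenizer_config["model_max_length"]),
               int(model_config["max_position_embeddings"]))


# Values copied from tokenizer_config.json and config.json in this commit.
tokenizer_config = {"model_max_length": 1000000000000000019884624838656}
model_config = {"max_position_embeddings": 512}

max_length = effective_max_length(tokenizer_config, model_config)  # 512
```

Passing this value as `max_length` together with `truncation=True` when tokenizing should keep inputs within the encoder's 512 positions.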
vocab.txt
ADDED
The diff for this file is too large to render.