---
license: other
base_model: microsoft/phi-1_5
tags:
- generated_from_trainer
- title
- extraction
- title extraction
model-index:
- name: titletor-phi_1-5
  results: []
datasets:
- zelalt/scientific-papers-3.5-withprompt
---

<div align="center">

# Titletor

</div>


<div align="center">
  <img src="./titletor.png" width="300"/>
</div>

This model is a fine-tuned version of [microsoft/phi-1_5](https://huggingface.co/microsoft/phi-1_5) on the [zelalt/scientific-papers-3.5-withprompt](https://huggingface.co/datasets/zelalt/scientific-papers-3.5-withprompt) dataset.
It achieves the following results on the evaluation set:
- Loss: 2.1587

### Requirements
```python
!pip install accelerate transformers einops datasets peft bitsandbytes
```

## Test Dataset
For testing, you can use the [zelalt/scientific-papers](https://huggingface.co/datasets/zelalt/scientific-papers)
or [zelalt/arxiv-papers](https://huggingface.co/datasets/zelalt/arxiv-papers) dataset. Alternatively, read your own PDF as text with `PyPDF2.PdfReader` and pass that text to the model, prefixed with the prompt "What is the title of this paper?".

```python
from datasets import load_dataset

# Load the papers and rename the full-text column to 'text'
test_dataset = load_dataset("zelalt/scientific-papers", split='train')
test_dataset = test_dataset.rename_column('full_text', 'text')

def formatting(example):
    # Build the prompt from the first 180 characters of the paper
    text = f"What is the title of this paper? {example['text'][:180]}\n\nAnswer: "
    return {'text': text}

formatted_dataset = test_dataset.map(formatting)
```
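
If you start from your own PDF instead of a dataset, the same prompt can be built by hand. The sketch below is illustrative only: `build_title_prompt` and `read_pdf_first_page` are hypothetical helper names, the 180-character cut-off simply mirrors `formatting` above, and it assumes a PyPDF2 release that provides `PdfReader.pages` and `extract_text()` (PyPDF2 is not in the requirements list, so install it separately).

```python
def build_title_prompt(paper_text, max_chars=180):
    """Format raw paper text like the `formatting` function above.

    `max_chars=180` is an assumption copied from that function.
    """
    return f"What is the title of this paper? {paper_text[:max_chars]}\n\nAnswer: "


def read_pdf_first_page(path):
    """Return the text of the first page of a PDF (hypothetical helper)."""
    # Imported lazily so the prompt helper works without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    # extract_text() may return None or an empty string for scanned pages.
    return reader.pages[0].extract_text() or ""
```

A typical call would be `prompt = build_title_prompt(read_pdf_first_page("paper.pdf"))`, where `paper.pdf` is a placeholder path.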

### Sample Code
```python
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model named in the adapter config, then attach the fine-tuned adapter
peft_model_id = "zelalt/titletor-phi_1-5"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path, trust_remote_code=True)
model = PeftModel.from_pretrained(model, peft_model_id)

# Prompt taken from the formatted test dataset
inputs = tokenizer(formatted_dataset[120]['text'], return_tensors="pt", return_attention_mask=False)
outputs = model.generate(**inputs, max_new_tokens=50, pad_token_id=tokenizer.eos_token_id, eos_token_id=tokenizer.eos_token_id)
text = tokenizer.batch_decode(outputs)[0]
print(text)
```

```python
# Prompt given directly as a string
inputs = tokenizer('''What is the title of this paper? ...[your pdf as text]..\n\nAnswer: ''', return_tensors="pt", return_attention_mask=False)
outputs = model.generate(**inputs, max_new_tokens=50, pad_token_id=tokenizer.eos_token_id, eos_token_id=tokenizer.eos_token_id)
text = tokenizer.batch_decode(outputs)[0]
print(text)
```

**Notes**
- Once the model and tokenizer have been loaded, rerun only the generation part for new inputs. Loading the model again can exhaust RAM and crash the session.
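
One way to follow this note is to wrap the generation step in a function and call it repeatedly with the already loaded objects. This is a minimal sketch, not part of the original card: `generate_title` is a hypothetical name, and it only reuses the tokenizer and `generate` arguments shown in the sample code.

```python
def generate_title(model, tokenizer, prompt, max_new_tokens=50):
    """Run only the generation step with an already loaded model and tokenizer."""
    inputs = tokenizer(prompt, return_tensors="pt", return_attention_mask=False)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )
    # The decoded text contains the prompt followed by the model's answer.
    return tokenizer.batch_decode(outputs)[0]
```

Load `model` and `tokenizer` once as in the sample code, then call `generate_title(model, tokenizer, prompt)` for each new prompt.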

### Output
Input:
```markdown
What is the title of this paper? Bursting Dynamics of the 3D Euler Equations\nin Cylindrical Domains\nFrançois Golse ∗ †\nEcole Polytechnique, CMLS\n91128 Palaiseau Cedex, France\nAlex Mahalov ‡and Basil Nicolaenko §\n\nAnswer:
```

Output from LLM:

```markdown
What is the title of this paper? Bursting Dynamics of the 3D Euler Equations
in Cylindrical Domains
François Golse ∗ †
Ecole Polytechnique, CMLS
91128 Palaiseau Cedex, France
Alex Mahalov ‡and Basil Nicolaenko §

Answer:  Bursting Dynamics of the 3D Euler Equations in Cylindrical Domains<|endoftext|>
```
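
As the example shows, the decoded text repeats the prompt and ends with the `<|endoftext|>` token. If you only need the title, a small post-processing step can strip both. The helper below is a sketch under that assumption: `extract_title` is a hypothetical name, and it assumes the first occurrence of `Answer:` in the text is the one from the prompt.

```python
def extract_title(decoded, eos_token="<|endoftext|>"):
    """Return only the text after 'Answer:' and before the end-of-text token."""
    # Keep what follows the first "Answer:" (the whole string if it is absent).
    answer = decoded.split("Answer:", 1)[-1]
    # Drop the end-of-text token and anything generated after it.
    answer = answer.split(eos_token, 1)[0]
    # Collapse line breaks and repeated spaces into single spaces.
    return " ".join(answer.split())


sample = (
    "What is the title of this paper? Bursting Dynamics of the 3D Euler Equations\n"
    "in Cylindrical Domains\n\n"
    "Answer:  Bursting Dynamics of the 3D Euler Equations in Cylindrical Domains<|endoftext|>"
)
print(extract_title(sample))
# prints: Bursting Dynamics of the 3D Euler Equations in Cylindrical Domains
```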

## Training and evaluation data
Train and validation dataset:
[zelalt/scientific-papers-3.5-withprompt](https://huggingface.co/datasets/zelalt/scientific-papers-3.5-withprompt)


## Training procedure

### Training hyperparameters

- total_train_batch_size: 8
- lr_scheduler_type: cosine

### Framework versions

- Transformers 4.35.2
- Pytorch 2.1.0+cu118
- Datasets 2.15.0
- Tokenizers 0.15.0