---
license: other
base_model: microsoft/phi-1_5
tags:
- generated_from_trainer
- title
- extraction
- title extraction
model-index:
- name: titletor-phi_1-5
  results: []
datasets:
- zelalt/scientific-papers-3.5-withprompt
---
# Titletor
This model is a fine-tuned version of [microsoft/phi-1_5](https://huggingface.co/microsoft/phi-1_5) on the [zelalt/scientific-papers-3.5-withprompt](https://huggingface.co/datasets/zelalt/scientific-papers-3.5-withprompt) dataset.
It achieves the following results on the evaluation set:
- Loss: 2.1587

### Requirements

```python
!pip install accelerate transformers einops datasets peft bitsandbytes
```

## Test Dataset

You can test the model on [zelalt/scientific-papers](https://huggingface.co/datasets/zelalt/scientific-papers) or [zelalt/arxiv-papers](https://huggingface.co/datasets/zelalt/arxiv-papers). Alternatively, read your own PDF as text with `PyPDF2.PdfReader` and pass that text to the model, prefixed with the prompt "What is the title of this paper?".

```python
from datasets import load_dataset

test_dataset = load_dataset("zelalt/scientific-papers", split='train')
test_dataset = test_dataset.rename_column('full_text', 'text')

def formatting(example):
    # Build the prompt from the first 180 characters of the paper text.
    text = f"What is the title of this paper? {example['text'][:180]}\n\nAnswer: "
    return {'text': text}

formatted_dataset = test_dataset.map(formatting)
```

### Sample Code

```python
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model and tokenizer, then attach the fine-tuned adapter.
peft_model_id = "zelalt/titletor-phi_1-5"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path, trust_remote_code=True)
model = PeftModel.from_pretrained(model, peft_model_id)

# From the dataset
inputs = tokenizer(formatted_dataset['text'][120], return_tensors="pt", return_attention_mask=False)
outputs = model.generate(**inputs, max_new_tokens=50, pad_token_id=tokenizer.eos_token_id, eos_token_id=tokenizer.eos_token_id)
text = tokenizer.batch_decode(outputs)[0]
print(text)
```

```python
# As a string
inputs = tokenizer(f'''What is the title of this paper? 
...[your pdf as text]..\n\nAnswer: ''', return_tensors="pt", return_attention_mask=False)
outputs = model.generate(**inputs, max_new_tokens=50, pad_token_id=tokenizer.eos_token_id, eos_token_id=tokenizer.eos_token_id)
text = tokenizer.batch_decode(outputs)[0]
print(text)
```

**Notes**
- Once the first run has loaded the model and tokenizer, rerun only the generation part. Loading the model again can crash the session by exhausting RAM.

### Output

Input:

```markdown
What is the title of this paper? Bursting Dynamics of the 3D Euler Equations\nin Cylindrical Domains\nFrançois Golse ∗ †\nEcole Polytechnique, CMLS\n91128 Palaiseau Cedex, France\nAlex Mahalov ‡and Basil Nicolaenko §\n\nAnswer: 
```

## Output from LLM:

```markdown
What is the title of this paper? Bursting Dynamics of the 3D Euler Equations
in Cylindrical Domains
François Golse ∗ †
Ecole Polytechnique, CMLS
91128 Palaiseau Cedex, France
Alex Mahalov ‡and Basil Nicolaenko §

Answer: Bursting Dynamics of the 3D Euler Equations in Cylindrical Domains<|endoftext|>
```

## Training and evaluation data

Training and validation dataset: [zelalt/scientific-papers-3.5-withprompt](https://huggingface.co/datasets/zelalt/scientific-papers-3.5-withprompt)

## Training procedure

### Training hyperparameters

- total_train_batch_size: 8
- lr_scheduler_type: cosine

### Framework versions

- Transformers 4.35.2
- Pytorch 2.1.0+cu118
- Datasets 2.15.0
- Tokenizers 0.15.0
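
### Post-processing the output

As the Output section shows, the decoded text repeats the whole prompt and ends with `<|endoftext|>`. The sketch below is not part of the original model card. It shows one way to build the prompt from raw paper text and to pull only the title out of the decoded string. `build_prompt` and `extract_title` are hypothetical helper names. The prompt wording and the 180-character cut-off are copied from the `formatting` function in Test Dataset. The code assumes the answer follows the last `Answer:` marker and that the title is the first non-empty line of the completion.

```python
PROMPT_PREFIX = "What is the title of this paper? "
ANSWER_MARKER = "\n\nAnswer: "
EOS_TOKEN = "<|endoftext|>"


def build_prompt(paper_text: str, max_chars: int = 180) -> str:
    """Wrap the first max_chars characters of a paper in the title prompt."""
    return f"{PROMPT_PREFIX}{paper_text[:max_chars]}{ANSWER_MARKER}"


def extract_title(decoded: str) -> str:
    """Return the text generated after the last 'Answer:' marker."""
    # rpartition yields the whole string as the tail when the marker is absent.
    _, _, completion = decoded.rpartition("Answer:")
    # Drop the end-of-sequence token and anything generated after it.
    completion = completion.split(EOS_TOKEN, 1)[0]
    # Assumption: the title is the first non-empty line of the completion.
    lines = [line.strip() for line in completion.splitlines() if line.strip()]
    return lines[0] if lines else ""


# Usage with a hand-written decoded string (no model call involved).
prompt = build_prompt("Bursting Dynamics of the 3D Euler Equations\nin Cylindrical Domains\n")
decoded = prompt + "Bursting Dynamics of the 3D Euler Equations in Cylindrical Domains" + EOS_TOKEN
print(extract_title(decoded))
```

With the model loaded as in Sample Code, the same helpers would be used as `tokenizer(build_prompt(pdf_text), ...)` before generation and `extract_title(text)` after decoding. A paper whose body contains the string `Answer:` inside the truncated prompt is still handled, because only the last marker is used, but a generated title that itself contains `Answer:` would be cut short.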