Rubyando59 committed · Commit 925031e · verified · 1 Parent(s): c07c4e1

Update README.md

Files changed (1)
  1. README.md +143 -12
README.md CHANGED
@@ -1,29 +1,132 @@
- # Get Started
 
  ```python
  from PIL import Image
  from transformers import AutoProcessor, AutoModelForCausalLM, AutoConfig
  import torch
 
- ############## Load and configurate the model ##############
-
  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
-
  config = AutoConfig.from_pretrained("microsoft/Florence-2-base-ft", trust_remote_code=True)
  config.vision_config.model_type = "davit"
-
  model = AutoModelForCausalLM.from_pretrained("sujet-ai/Lutece-Vision-Base", config=config, trust_remote_code=True).to(device).eval()
  processor = AutoProcessor.from_pretrained("sujet-ai/Lutece-Vision-Base", config=config, trust_remote_code=True)
  task = "<FinanceQA>"
 
- ############## Load input image and define the question ##############
-
  image = Image.open('test.png').convert('RGB')
-
  prompt = "How much decrease in prepaid expenses was reported?"
 
  inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
-
  generated_ids = model.generate(
      input_ids=inputs["input_ids"],
      pixel_values=inputs["pixel_values"],
@@ -32,9 +135,37 @@ generated_ids = model.generate(
      num_beams=3,
  )
 
- ##################### Generate the answer ###################################
-
  generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
  parsed_answer = processor.post_process_generation(generated_text, task=task, image_size=(image.width, image.height))
  print(parsed_answer[task])
- ```
+ # Lutece-Vision-Base
+
+ ## Model Description
+
+ Lutece-Vision-Base, named after the ancient name of Paris, is a specialized Vision-Language Model (VLM) designed for financial document analysis and question answering. It is a fine-tuned version of Microsoft's Florence-2-base-ft, tailored to interpret and answer questions about financial documents, reports, and images.
+
+ ## Model Details
+
+ - **Base Model**: microsoft/Florence-2-base-ft
+ - **Fine-tuning Dataset**: [sujet-ai/Sujet-Finance-QA-Vision-100k](https://huggingface.co/datasets/sujet-ai/Sujet-Finance-QA-Vision-100k)
+ - **Training Data**: 100,629 Q&A pairs (spanning 9,212 images)
+ - **Validation Data**: 589 Q&A pairs (one pair per image, drawn from the 6,421 entries in the validation set)
+ - **Language**: English
+ - **License**: MIT
+
+ ## Training Specifications
+
+ - **Number of Epochs**: 7
+ - **Learning Rate**: 1e-6
+ - **Optimizer**: AdamW
+ - **Architecture**: Encoder parameters were frozen during fine-tuning
+ - **Hardware**: One NVIDIA A100 GPU
+ - **Training Duration**: Approximately 38 hours
+
+ ## Performance and Evaluation
+
+ We evaluated the model's performance using two approaches:
+
+ 1. GPT-4o as an LLM judge
+ 2. Cosine similarity measurement
+
+ ### GPT-4o Evaluation
+
+ This method compares the answers generated by both the vanilla Florence model and our fine-tuned Lutece-Vision-Base model.
+
+ **Evaluation Process**:
+ 1. For each (image, question) pair in the validation set, we generate answers using both models.
+ 2. GPT-4o acts as an impartial judge, evaluating the correctness of both answers without prior knowledge of the ground truth.
+ 3. The evaluation considers factors such as numerical accuracy, spelling and minor wording differences, completeness of the answer, and relevance of information.
+
+ **Evaluation Criteria**:
+ - Numerical Accuracy: Exact matches required for numbers, dates, and quantities.
+ - Spelling and Minor Wording: Minor differences are acceptable if the core information is correct.
+ - Completeness: Answers must fully address the question.
+ - Relevance: Additional information is acceptable unless it contradicts the correct part of the answer.
+
+ **GPT-4o Judge Prompt**:
+
+ ```
+ Analyze the image and the question, then evaluate the answers provided by the Vanilla Model and the Finetuned Model.
+
+ Question: {question}
+ Vanilla Model Answer: {vanilla_answer}
+ Finetuned Model Answer: {finetuned_answer}
+
+ Your task is to determine if each answer is correct or incorrect based on the image and question.
+ Consider the following guidelines:
+
+ 1. Numerical Accuracy:
+    - For questions involving numbers (e.g., prices, dates, quantities), the answer must be exactly correct.
+    - Example: If the correct price is $10.50, an answer of $10.49 or $10.51 is incorrect.
+    - Example: If the correct date is June 15, 2023, an answer of June 14, 2023 or June 16, 2023 is incorrect.
+
+ 2. Spelling and Minor Wording:
+    - Minor spelling mistakes or slight wording differences should not be counted as incorrect if the core information is right.
+    - Example: If the correct name is "John Smith", answers like "Jon Smith" or "John Smyth" should be considered correct.
+    - Example: "The CEO of the company" instead of "The company's CEO" is acceptable.
+
+ 3. Completeness:
+    - The answer must fully address the question to be considered correct.
+    - Partially correct answers or answers that miss key parts of the question should be marked as incorrect.
+
+ 4. Irrelevant Information:
+    - Additional irrelevant information does not make an otherwise correct answer incorrect.
+    - However, if the irrelevant information contradicts the correct part of the answer, mark it as incorrect.
+
+ Respond using the following JSON format:
+ {
+     "vanilla_correct": <boolean>,
+     "finetuned_correct": <boolean>,
+     "explanation": "Your explanation here"
+ }
+
+ Where:
+ - "vanilla_correct" is true if the Vanilla Model's answer is correct, false otherwise.
+ - "finetuned_correct" is true if the Finetuned Model's answer is correct, false otherwise.
+ - "explanation" briefly explains your evaluation for both answers, referencing the guidelines above.
+
+ Your response should contain ONLY the JSON output, and no text before or after to avoid output parsing errors.
+ ```
+
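Because the judge prompt asks for a bare JSON object, each reply can be validated before it is counted. The sketch below shows one way to do that with the standard library only. The helper names `parse_judge_response` and `accuracy`, and the sample replies, are illustrative assumptions and not part of the commit or of the original evaluation code.

```python
import json

# Keys requested by the judge prompt's JSON format.
REQUIRED_KEYS = ("vanilla_correct", "finetuned_correct", "explanation")


def parse_judge_response(raw: str) -> dict:
    """Parse one judge reply and check it matches the requested JSON shape."""
    verdict = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    if not isinstance(verdict, dict):
        raise ValueError("judge reply must be a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in verdict]
    if missing:
        raise ValueError(f"judge reply is missing keys: {missing}")
    for key in ("vanilla_correct", "finetuned_correct"):
        if not isinstance(verdict[key], bool):
            raise ValueError(f"{key} must be a JSON boolean")
    return verdict


def accuracy(verdicts: list, key: str) -> float:
    """Fraction of verdicts in which the given model was judged correct."""
    if not verdicts:
        return 0.0
    return sum(1 for verdict in verdicts if verdict[key]) / len(verdicts)


# Made-up replies, used only to exercise the helpers.
sample_replies = [
    '{"vanilla_correct": false, "finetuned_correct": true, "explanation": "..."}',
    '{"vanilla_correct": true, "finetuned_correct": true, "explanation": "..."}',
]
verdicts = [parse_judge_response(reply) for reply in sample_replies]
print(accuracy(verdicts, "vanilla_correct"), accuracy(verdicts, "finetuned_correct"))
```

Rejecting non-boolean verdicts up front keeps a reply such as `"yes"` from being silently counted as correct.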
+ ### Cosine Similarity Measurement
+
+ In addition to the GPT-4o evaluation, we measured the cosine similarity between each model's answers and the ground-truth answers, which were labeled by GPT-4o. This provides a quantitative measure of how close the model outputs are to the expected answers in the embedding space.
+
+ **Process**:
+ 1. We used the BAAI/bge-base-en-v1.5 embedding model ([https://huggingface.co/BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5)) to convert the answers into vector representations.
+ 2. Cosine similarity was calculated between the embeddings of the model-generated answers and the ground truth answers.
+ 3. This similarity score provides an additional metric for evaluating the models' performance, capturing semantic similarity beyond exact word matching.
+
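Step 2 of the process above comes down to a dot product divided by the product of the two vector norms. A minimal, dependency-free sketch follows. The three-dimensional vectors are toy stand-ins for real embeddings, which would come from the embedding model named in step 1, and `cosine_similarity` is an illustrative helper rather than the original evaluation code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for answer embeddings (assumed values).
ground_truth = [0.2, 0.7, 0.1]
model_answer = [0.25, 0.65, 0.05]
unrelated = [0.9, -0.1, 0.4]

print(cosine_similarity(ground_truth, model_answer))  # close to 1: similar direction
print(cosine_similarity(ground_truth, unrelated))     # much lower: different direction
```

Averaging this score over the validation pairs gives one number per model, which is how a per-model comparison could be produced under these assumptions.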
+ **Performance Comparison**:
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/642d5e6f84bf892b8faa54cc/vb7MuFmvvAJPHC-dm5-1B.png)
+
+ For a detailed overview of the metrics logged during the training, please refer to our [Weights & Biases report](https://wandb.ai/fine-tune-llm/FinetuneVLM/reports/Finetuning-Lutece-Vision-Base--Vmlldzo4NjI4NzAy?accessToken=fnbibl4i4cx8ljzbfb6f81yitqe580hipliw5e7a4arueha1cjl3zqsownfikkaw).
+
+ ## Usage
+
+ Here's a quick start guide to using Lutece-Vision-Base:
 
  ```python
  from PIL import Image
  from transformers import AutoProcessor, AutoModelForCausalLM, AutoConfig
  import torch
 
+ # Load and configure the model
  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
  config = AutoConfig.from_pretrained("microsoft/Florence-2-base-ft", trust_remote_code=True)
  config.vision_config.model_type = "davit"
  model = AutoModelForCausalLM.from_pretrained("sujet-ai/Lutece-Vision-Base", config=config, trust_remote_code=True).to(device).eval()
  processor = AutoProcessor.from_pretrained("sujet-ai/Lutece-Vision-Base", config=config, trust_remote_code=True)
  task = "<FinanceQA>"
 
+ # Load input image and define the question
  image = Image.open('test.png').convert('RGB')
  prompt = "How much decrease in prepaid expenses was reported?"
 
+ # Process input and generate answer
  inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
  generated_ids = model.generate(
      input_ids=inputs["input_ids"],
      pixel_values=inputs["pixel_values"],
 
      num_beams=3,
  )
 
+ # Decode and parse the answer
  generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
  parsed_answer = processor.post_process_generation(generated_text, task=task, image_size=(image.width, image.height))
  print(parsed_answer[task])
+ ```
+
+ ## Demo and Further Resources
+
+ - **Interactive Demo**: Try out Lutece-Vision-Base using our [Hugging Face Space](https://huggingface.co/spaces/sujet-ai/Lutece-Vision-Base-DEMO). Please note that this demo runs on CPU, so inference might be slower than on GPU.
+
+ - **Fine-tuning Tutorial**: If you're interested in fine-tuning Florence 2 for your own tasks, we recommend this excellent [tutorial on Hugging Face](https://huggingface.co/blog/finetune-florence2).
+
+ ## Limitations and Disclaimer
+
+ While Lutece-Vision-Base has been trained on a diverse set of financial documents, it may not cover all possible financial scenarios or document types. The model can make mistakes, especially in complex or ambiguous situations. Users should verify critical information and not rely solely on the model's output for making important financial decisions.
+
+ **Disclaimer**: Sujet AI provides Lutece-Vision-Base as-is, without any warranties, expressed or implied. We are not responsible for any consequences resulting from the use of this model. Users should exercise their own judgment and verify information when using the model for financial analysis or decision-making purposes.
+
+ The model may reflect biases present in its training data and should be used with this understanding. Continuous evaluation and updating of the model with diverse and representative data are recommended for maintaining its relevance and accuracy.
+
+ ## Citation and Contact
+
+ If you use Lutece-Vision-Base in your research or applications, please cite it as:
+
+ ```
+ @software{Lutece-Vision-Base,
+   author = {Sujet AI},
+   title = {Lutece-Vision-Base: A Fine-tuned VLM for Financial Document Analysis},
+   year = {2024},
+   url = {https://huggingface.co/sujet-ai/Lutece-Vision-Base}
+ }
+ ```
+
+ For questions, feedback, or collaborations, please reach out to us on [LinkedIn](https://www.linkedin.com/company/sujet-ai/) or visit our website [https://sujet.ai](https://sujet.ai).