---
license: apache-2.0
---
<h1>
<a alt="About Ask2Democracy project" href="https://github.com/jorge-henao/ask2democracy">About Ask2Democracy project</a>
</h1>
<hr>

## About Ask2Democracy project
This model was developed as part of the Ask2Democracy project during the 2023 Somos NLP Hackathon. Our focus during the hackathon was on enhancing generative capabilities in Spanish by training an open-source model for this purpose, which is intended to be incorporated into the space demo.
However, we encountered performance limitations due to the model's large size, which caused issues when running it on limited hardware: we observed an inference time of approximately 70 seconds even when using a GPU.

To address this issue, we are currently working on ways to optimize the model's integration into the AskDemocracy space demo. Further work is required to improve the model's performance.
[AskDemocracy space demo](https://huggingface.co/spaces/jorge-henao/ask2democracycol)

## What is the baizemocracy-lora-7B-cfqa-conv model?

This model is an open-source chat model fine-tuned with [LoRA](https://github.com/microsoft/LoRA), inspired by the [Baize project](https://github.com/project-baize/baize-chatbot/tree/main/). It was trained on the Baize datasets and the ask2democracy-cfqa-salud-pension dataset, which contains almost 4,000 instructions for answering questions based on a context relevant to citizen concerns and public debate in Spanish.

Two model variations were trained during the Somos NLP 2023 Hackathon:
- A conversational style focused model
- A generative context focused model
  
This variation focuses on a more conversational way of asking questions; see the pre-processing section below.
The other variation, [Baizemocracy-RAGfocused](https://huggingface.co/hackathon-somos-nlp-2023/baizemocracy-lora-7B-cfqa), focuses on source-based retrieval-augmented generation.

Testing is a work in progress. We decided to share both model variations with the community in order to involve more people in experimenting with what works better and in finding other possible use cases.


- **Developed by:**
- 🇨🇴 [Jorge Henao](https://huggingface.co/jorge-henao)
- 🇨🇴 [David Torres](https://github.com/datorresb)

## Training Parameters

- Base Model: [LLaMA-7B](https://arxiv.org/pdf/2302.13971.pdf)
- Training Epoch: 1
- Batch Size: 16
- Maximum Input Length: 512
- Learning Rate: 2e-4
- LoRA Rank: 8
- Updated Modules: All Linears
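To make the rank parameter above concrete, the toy sketch below shows what a rank-constrained LoRA update does at inference time: the frozen weight `W` is augmented by a low-rank product `B @ A` (rank 8 in this model; rank 1 here to keep the matrices tiny). This is an illustration in plain Python, not the actual training code.

```python
# Toy LoRA forward pass: y = (W + scale * B @ A) @ x, where only
# A and B would receive gradient updates during fine-tuning.

def matvec(M, v):
    """Plain Python matrix-vector product M @ v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(x, W, A, B, scale=1.0):
    """Apply the low-rank update without materialising W + B @ A."""
    base = matvec(W, x)               # frozen pretrained path
    delta = matvec(B, matvec(A, x))   # down-project to rank r, then back up
    return [b + scale * d for b, d in zip(base, delta)]

W = [[1, 0], [0, 1]]   # frozen 2x2 weight
A = [[1, 1]]           # trainable r x d matrix (r = 1)
B = [[1], [0]]         # trainable d x r matrix
print(lora_forward([2, 3], W, A, B))  # [7.0, 3.0]
```

Because `A` and `B` together have far fewer entries than `W`, only a small fraction of parameters is trained, which is what makes fine-tuning a 7B base model feasible on modest hardware.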

## Training Dataset

- [Ask2Democracy-cfqa-salud-pension](https://huggingface.co/datasets/hackathon-somos-nlp-2023/ask2democracy-cfqa-salud-pension) (3,806)
- [Stanford Alpaca](https://github.com/tatsu-lab/stanford_alpaca) (51,942)
- [Quora Dialogs](https://github.com/project-baize/baize) (54,456)
- [StackOverflow Dialogs](https://github.com/project-baize/baize) (57,046)
- [Alpaca chat Dialogs](https://github.com/project-baize/baize)
- [Medical chat Dialogs](https://github.com/project-baize/baize)

## About pre-processing

The Ask2Democracy-cfqa-salud-pension dataset was pre-processed into a conversational style in two variations, as follows:
```python

def format_instruction_without_context(example):
  # Keep the original question as the topic label.
  example["topic"] = example["input"]
  text = "La conversación entre un humano y un asistente de IA."
  text += "\n[|Human|] " + example["input"]
  text += "\n[|AI|] " + example["output"]
  if len(example["topics"]) > 0:
    topics = ", ".join(example["topics"])
    text += "\n[|Human|] ¿En cuáles tópicos clasificarías su respuesta?"
    text += f"\n[|AI|] Aquí una lista de tópicos: {topics}."
    example["topic"] += f" ({topics})"
  example["input"] = text
  return example

def format_instruction_with_context(example):
  example["topic"] = example["input"]
  # Strip the English instruction prefix, then normalise whitespace
  # and trim the surrounding quote characters.
  context = example["instruction"].replace(
      "Given the context please answer the question. Context:", "")
  context = " ".join(context.strip().split())[1:-3]
  text = "La conversación entre un humano y un asistente de IA."
  text += "\n[|Human|] " + example["input"] \
      + f"\nPara responder la pregunta, usa el siguiente contexto:\n{context}"
  text += "\n[|AI|] " + example["output"]
  if len(example["topics"]) > 0:
    topics = ", ".join(example["topics"])
    text += "\n[|Human|] ¿En cuáles tópicos clasificarías su respuesta?"
    text += f"\n[|AI|] Aquí una lista de tópicos: {topics}."
    example["topic"] += f" ({topics})"
  example["input"] = text
  return example

```
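For reference, this is the kind of transcript the conversational formatting produces. The record below is a hypothetical example (not taken from the dataset), with the transcript-building logic inlined for a self-contained illustration:

```python
example = {
    "input": "¿Qué es la reforma pensional?",   # hypothetical record
    "output": "Es una propuesta de cambio al sistema de pensiones.",
    "topics": ["pensiones", "reforma"],
}

# Build the transcript the same way the formatting functions do.
text = "La conversación entre un humano y un asistente de IA."
text += "\n[|Human|] " + example["input"]
text += "\n[|AI|] " + example["output"]
if example["topics"]:
    topics = ", ".join(example["topics"])
    text += "\n[|Human|] ¿En cuáles tópicos clasificarías su respuesta?"
    text += f"\n[|AI|] Aquí una lista de tópicos: {topics}."

print(text)
```

The `[|Human|]`/`[|AI|]` turn markers follow the Baize conversation format, which is why the fine-tuned model expects prompts in this shape.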




More details can be found in the Ask2Democracy [GitHub repository](https://github.com/jorge-henao/ask2democracy).