Running dolly2 locally
What sort of system requirements would I need to run this model locally, as opposed to something like, say, Vicuna-13B?
Ideally, a GPU with at least 32GB of RAM for the 12B model. It should fit in 16GB if you load it in 8-bit.
The smaller models should work with less GPU RAM, too.
I can confirm that the 12B version runs on 1x RTX 3090 (24GB of VRAM) loaded in int8 precision:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# InstructionTextGenerationPipeline is the custom pipeline shipped with the model (instruct_pipeline.py)
from instruct_pipeline import InstructionTextGenerationPipeline

base_model = "databricks/dolly-v2-12b"
load_8bit = True

tokenizer = AutoTokenizer.from_pretrained(base_model, padding_side="left")
model = AutoModelForCausalLM.from_pretrained(
    base_model, load_in_8bit=load_8bit, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

if torch.__version__ >= "2":
    model = torch.compile(model)

pipe = InstructionTextGenerationPipeline(model=model, tokenizer=tokenizer)
pipe("any prompt you want to provide")
Don't forget to import the InstructionTextGenerationPipeline provided by the team (instruct_pipeline.py in the dolly repo).
You can also just pass trust_remote_code=True to auto-import it, but this approach works fine too.
I think bitsandbytes will complain if you set bfloat16, since it ends up using fp16 for the floating-point parts anyway, but it just ignores the setting.
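For reference, here is a minimal sketch of the trust_remote_code route mentioned above, following the usage shown on the model card (the prompt is just a placeholder):

import torch
from transformers import pipeline

# trust_remote_code=True lets transformers pull the custom InstructionTextGenerationPipeline from the model repo.
generate_text = pipeline(
    model="databricks/dolly-v2-12b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
print(generate_text("any prompt you want to provide"))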
Hi,
I want to run the Dolly 12B model in the Azure cloud. Can you suggest which VM I should go for?
How long does it take to generate a response for a decent-sized prompt?
- fp16: between 5 and 15 sec.
- int8 and Peft: between 1 and 5 sec.
It also depends on num_beams and any other generation parameters you use. As a reference, I used long inputs, between 1536 and 2048 tokens; inference will be faster if your inputs are smaller.
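As a rough sketch of how such generation parameters can be passed, assuming the pipe object from the earlier snippet and that the pipeline forwards generation kwargs to model.generate() as Hugging Face pipelines generally do (the values below are placeholders, not recommendations):

# Generation kwargs are forwarded to model.generate() by the pipeline.
res = pipe(
    "Summarize the plot of Moby Dick in three sentences.",
    max_new_tokens=256,  # cap on the length of the generated answer
    num_beams=1,         # beam search width; larger values are slower
    do_sample=True,
    temperature=0.7,
)
print(res)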
Here is a tutorial video on how to install and use it on Windows.
The video includes a Gradio user-interface script and shows how to enable 8-bit loading for a speed-up and lower VRAM usage via quantization.
The results I got were not very good though, for some reason :/
Dolly 2.0 : Free ChatGPT-like Model for Commercial Use - How To Install And Use Locally On Your PC
I have the 12B model running on my Linux machine with an RTX 3060 graphics card, an i9-10900X CPU and 48GB of RAM. I'm using https://github.com/oobabooga/text-generation-webui as the front end. The settings I tried were GPU memory 7.5GB, CPU memory 22GB, auto-devices and load-in-8-bit.
Looking at memory usage, it never gets anywhere close to using the 22GB of CPU memory, but GPU memory does go above the 7.5GB limit.
It generates about 1 token per second.
I got to around 1200-1500 tokens current + context/history with the dolly 12B model.
You might be able to get more by tweaking the model settings, but this works as a starting point.
I just ran a few prompts through the model and it took 6-7 minutes. I'm running on Databricks on a Standard_NC6s_v3 machine with 112GB of memory.
Any hint why inference takes so long is highly appreciated!
That's a V100 16GB. The 12B model does not fit onto that GPU. So you are mostly running on the CPU and it takes a long time.
Did you look at https://github.com/databrickslabs/dolly#generating-on-other-instances ?
You need to load in 8-bit, but a 16GB V100 will struggle with the 12B model a bit.
Use A10 or better, or use the 7B model.
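If you drop down to the 7B model, the 8-bit snippet from earlier in the thread applies unchanged apart from the checkpoint name; a minimal sketch:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

base_model = "databricks/dolly-v2-7b"  # the smaller checkpoint fits much more comfortably in 16GB

tokenizer = AutoTokenizer.from_pretrained(base_model, padding_side="left")
model = AutoModelForCausalLM.from_pretrained(
    base_model, load_in_8bit=True, torch_dtype=torch.float16, device_map="auto"
)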
@srowen Thanks a lot for the hint - I had completely confused a few things!
When I try it locally, it says the pytorch_model.bin is not in the correct JSON format. I am using the following code:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from instruct_pipeline import InstructionTextGenerationPipeline

tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-12b", padding_side="left")
model = AutoModelForCausalLM.from_pretrained("./pytorch_model.bin", device_map="auto", torch_dtype=torch.bfloat16)
generate_text = InstructionTextGenerationPipeline(model=model, tokenizer=tokenizer)
res = generate_text("What are the differences between dog and cat?")
print(res)
It says:
OSError: It looks like the config file at './pytorch_model.bin' is not a valid JSON file.
But changing to model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-12b", device_map="auto", torch_dtype=torch.bfloat16)
works. I have also tried using the exact same file as in ~/.cache/huggingface/hub/models--databricks--dolly-v2-12b/blobs/
, and that also does not work.
Pass the directory containing this file, not the file path. It's looking for several artifacts in that dir, not just the model. You do not need to download the model like this.
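Something like this should work, assuming the local snapshot directory contains config.json, the tokenizer files and the weights (the path below is just a placeholder):

import torch
from transformers import AutoModelForCausalLM

# Point at the directory that holds config.json, tokenizer files and pytorch_model.bin, not at the .bin itself.
local_dir = "/path/to/dolly-v2-12b"
model = AutoModelForCausalLM.from_pretrained(local_dir, device_map="auto", torch_dtype=torch.bfloat16)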
Hi,
I want to run the Dolly 12B model in Cloudera workbench. Can anyone suggest how much RAM and which GPUs I should go for?
You want an A100 ideally. See https://github.com/databrickslabs/dolly#training-on-other-instances
@chainyo , if you used LoRA, would you mind sharing your LoraConfig? (reference)
@opyate Sorry for the confusion. I was discussing another alpaca/llama model loaded with the LoRA PEFT loader. You can find some code snippets in this repo.
But you don't need LoRA for this dolly model unless you fine-tune it using the LoRA technique.
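For anyone who does want to try LoRA fine-tuning, a hypothetical starting point for a PEFT LoraConfig on a GPT-NeoX-style model like dolly-v2 could look like the following; none of these values come from the thread, and the target module name is an assumption about the fused attention projection in GPT-NeoX:

from peft import LoraConfig, TaskType

# Illustrative values only; tune r, lora_alpha and lora_dropout for your own setup.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    target_modules=["query_key_value"],  # assumed module name for GPT-NeoX attention
)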
Hi, just like OpenAI has a token limit of 4096, do we have a token limit in Dolly 2 as well when we deploy it locally? Thanks!
Yes, 2048 tokens
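If your prompts can exceed that, you'd want to truncate them yourself; a minimal sketch (the 2048 figure is the context window mentioned above, and in practice you should leave headroom for the generated tokens):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-12b", padding_side="left")
long_prompt = "..."  # placeholder for your actual prompt
# Truncate to the 2048-token context window so oversized inputs don't break generation.
inputs = tokenizer(long_prompt, truncation=True, max_length=2048, return_tensors="pt")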
Do you have a notebook to run Dolly 2.0 on Azure Databricks? I tried but I get an error :-(
Yes, the snippet on the model page works. You need a big enough GPU and instance. You didn't say what the problem was.
Can you give me the link? I do not see the snippet.
Just this very site. https://huggingface.co/databricks/dolly-v2-12b#usage
Merci, Thanks, Namaste :-)
I get this error when I try to run it: "We couldn't connect to 'https://huggingface.co' to load this file"
You'll have to solve that access problem yourself, it's specific to your env
Hi @srowen
I'm trying to fine-tune "TinyPixel/Llama-2-7B-bf16-sharded" with 8GB of RAM and one GPU, but I'm facing issues like this:
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True,
)
RuntimeError: Failed to import transformers.models.llama.modeling_llama because of the following error (look up to see its traceback):
Is it because of RAM and GPU?
Wrong forum - not a question about Dolly.