How to make it work for less experienced AI whisperers

#4
by Sloba - opened
  1. git clone https://huggingface.co/tiiuae/falcon-7b -> Saves into local directory

  2. Create new anaconda environment with Transformers=4.27.4 and python=3.9
    a) conda create --name falcon python=3.9
    b) conda activate falcon
    c) pip install transformers==4.27.4
    d) pip install huggingface-hub
    e) pip install chardet
    f) pip install cchardet
    g) pip install torch
    h) pip install einops
    i) pip install accelerate
    j) conda install cudatoolkit
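
A quick way to confirm that the environment from step 2 actually contains the packages listed above is to query the installed versions. This is a minimal sketch using only the standard library; the `PACKAGES` list and the `installed_versions` helper are illustrative names, not part of the original guide.

```python
from importlib import metadata

# Packages installed in step 2 (only transformers is pinned in the guide).
PACKAGES = ["transformers", "huggingface-hub", "torch", "einops", "accelerate"]


def installed_versions(packages):
    """Return {package name: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Print one line per package so missing installs are easy to spot.
for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Run it inside the activated `falcon` environment; if `transformers` does not report 4.27.4, the wrong environment is probably active.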

  3. The following code finally gave results:

from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch
# Local path to the directory cloned in step 1 (the one that contains config.json),
# for example on Windows:
model = "X:\\ai\\falcon-7b"

rrmodel = AutoModelForCausalLM.from_pretrained(model, 
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",)

tokenizer = AutoTokenizer.from_pretrained(model)


input_text = "Once upon a time"
input_ids = tokenizer.encode(input_text, return_tensors='pt')

# Generate text

attention_mask = torch.ones(input_ids.shape)

output = rrmodel.generate(input_ids, 
            attention_mask=attention_mask, 
            max_length=200,
            do_sample=True,
            top_k=10,
            num_return_sequences=1,
            eos_token_id=tokenizer.eos_token_id,)

# Decode the output

output_text = tokenizer.decode(output[0], skip_special_tokens=True)

print(output_text)

output >>>>>>>>>>
Once upon a time, a man named Charlie Brown walked into a candy store. He asked the lady behind the counter if she had any good chocolate. The lady said that she had some very good chocolate.
Charlie Brown said, "That sounds good. Can you give me a pound of it?"
The lady said, "Sure," and she put the pound of chocolate in a bag and rang up the sale.
Charlie Brown said, "That's $7.
<<<<<<<< output <<<<<<<<

attention_mask? (NameError: name 'attention_mask' is not defined)

Thanks, I updated the original message with the fix.
Just add:
attention_mask = torch.ones(input_ids.shape)

before the generate(...) call.

I also added:
input_ids = input_ids.to('cuda')
before:
attention_mask = torch.ones(input_ids.shape)
Now it's working nicely! Thanks.
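
Both fixes above (building the attention mask and moving the inputs to the GPU) can be folded into one device-agnostic helper. This is a sketch, not the original author's code: `generate_text` is a hypothetical name, and it assumes a Transformers tokenizer (which returns `input_ids` and `attention_mask` when called directly) and a model that exposes `.device`.

```python
def generate_text(model, tokenizer, prompt, **generate_kwargs):
    """Tokenize the prompt, move tensors to the model's device, generate, decode."""
    # Calling the tokenizer returns input_ids and attention_mask together,
    # so the mask does not have to be built by hand.
    inputs = tokenizer(prompt, return_tensors="pt")
    # Move every tensor to wherever the model's weights live (cuda, cpu, ...).
    inputs = inputs.to(model.device)
    output = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        **generate_kwargs,
    )
    # Decode the first (and here only) returned sequence.
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

With the objects from step 3 it would be called as `generate_text(rrmodel, tokenizer, "Once upon a time", max_length=200, do_sample=True, top_k=10, eos_token_id=tokenizer.eos_token_id)`.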

@Sloba What server specs did you test the model on, and how was the inference time?

@bilelm
It is a private desktop computer; I am interested in creating local LLMs:
AMD Ryzen 9 7900X, 12 cores (24 threads), 4.7 GHz
64 GB DDR5 RAM, 4800 MHz
NVIDIA GeForce RTX 3090, 24 GB
The model is loaded from a NAS over a 1 Gbit/s network (it takes ~2 min)

It took 78 seconds for:
Question: Where girrafe lives, and how tall is giraffe?
Answer: Giraffes live in the African continent, they are the tallest land animals.
The tallest one of them is the African male giraffe, which can stand up to 5.9 meters tall.
Giraffes are herbivorous animals, they feed on leaves and grasses.
They are not dangerous animals. They are very friendly and kind to humans, they don’t attack or eat people.
Giraffes have a lifespan of 20 years.<|endoftext|>

It took 156 seconds for:
Question: Where llama lives, and how tall is llama?
Answer: The llama is a South American camelid, a member of the camel family. It is a large, sturdy animal with a thick coat. Llamas are domesticated and are used for meat, wool, and milk.
The llama is a South American camelid, a member of the camel family. It is a large, sturdy animal with a thick coat. Llamas are domesticated and are used for meat, wool, and milk.
What is a llama? A llama is a South American camelid, a member of the camel family. They are domesticated and used for meat, wool, and milk.
How do llamas look? Llama’s are a large, furry animal that looks like a mix between a camel and a giraffe. They are native to South America but are now found in many other places around the world.
Where do llamas live? Llamas live in the Andes mountains, where they graze on vegetation.
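
For anyone who wants to reproduce timings like the ones above, here is a small sketch of how a generation call could be timed. How the 78 and 156 seconds were actually measured is not stated in this thread; `timed` and `tokens_per_second` are hypothetical helper names.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def tokens_per_second(n_new_tokens, seconds):
    """Throughput for a run that produced n_new_tokens in the given time."""
    return n_new_tokens / seconds if seconds > 0 else float("inf")
```

Usage would look like `output, seconds = timed(rrmodel.generate, input_ids, max_length=200)`, with the number of new tokens taken as `output.shape[1] - input_ids.shape[1]`.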

I hope this helps. For what it's worth, Falcon-7B's answers are pretty good.

@Sloba thank you so much for your answer.
I'm looking to test it on French, for tasks like summarization or information extraction.

Technology Innovation Institute org

Hi @Sloba , thank you for writing this short guide, we will pin it to make it easily accessible!

FalconLLM pinned discussion

@FalconLLM, @Sloba Quick question: can I run it on a MacBook Pro with an Intel chip and 32 GB of RAM?

@ivyas Unfortunately I don't have access to an MBP with 32 GB of RAM to try it out.
If you decide to try it out, don't hesitate to share the results. Maybe there is someone who needs exactly the info you find in your test.

I'm trying to run this on an Apple M1 Max.
The code I use is this:

from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch
model = "./falcon-7b"
device_name = 'cpu'
device = torch.device(device_name)
rrmodel = AutoModelForCausalLM.from_pretrained(model,
    trust_remote_code=True,
    device_map="auto")
rrmodel = rrmodel.to(device)
tokenizer = AutoTokenizer.from_pretrained(model)

input_text = "Once upon a time"
input_ids = tokenizer.encode(input_text, return_tensors='pt')
input_ids = input_ids.to(device)
attention_mask = torch.ones(input_ids.shape)
attention_mask = attention_mask.to(device)

output = rrmodel.generate(input_ids,
            attention_mask=attention_mask,
            max_length=200,
            do_sample=True,
            top_k=10,
            num_return_sequences=1,
            eos_token_id=tokenizer.eos_token_id,)

output_text = tokenizer.decode(output[0], skip_special_tokens=True)

print(output_text)

With device_name = 'cpu', this takes 5 min 50 s to run.

I tried device_name = 'mps' for acceleration on the M1 chip,
but I get this error:

Traceback (most recent call last):
  File "/Users/mario/Downloads/main.py", line 19, in <module>
    output = rrmodel.generate(input_ids,
  File "/Users/mario/anaconda3/envs/falcon/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/Users/mario/anaconda3/envs/falcon/lib/python3.9/site-packages/transformers/generation/utils.py", line 1565, in generate
    return self.sample(
  File "/Users/mario/anaconda3/envs/falcon/lib/python3.9/site-packages/transformers/generation/utils.py", line 2612, in sample
    outputs = self(
  File "/Users/mario/anaconda3/envs/falcon/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/mario/anaconda3/envs/falcon/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/mario/anaconda3/envs/falcon/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/Users/mario/.cache/huggingface/modules/transformers_modules/falcon-7b/modelling_RW.py", line 753, in forward
    transformer_outputs = self.transformer(
  File "/Users/mario/anaconda3/envs/falcon/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/mario/anaconda3/envs/falcon/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/mario/anaconda3/envs/falcon/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/Users/mario/.cache/huggingface/modules/transformers_modules/falcon-7b/modelling_RW.py", line 590, in forward
    inputs_embeds = self.word_embeddings(input_ids)
  File "/Users/mario/anaconda3/envs/falcon/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/mario/anaconda3/envs/falcon/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/mario/anaconda3/envs/falcon/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/Users/mario/anaconda3/envs/falcon/lib/python3.9/site-packages/torch/nn/modules/sparse.py", line 162, in forward
    return F.embedding(
  File "/Users/mario/anaconda3/envs/falcon/lib/python3.9/site-packages/torch/nn/functional.py", line 2238, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Placeholder storage has not been allocated on MPS device!
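
As far as I can tell, this error generally means that some tensor involved in the operation is not on the MPS device. One possible cause in the code above (an assumption, not verified on Falcon) is combining `device_map="auto"`, which lets accelerate decide where inputs and weights go, with a manual `.to(device)`. A sketch of the alternative is to choose one device explicitly and load without `device_map`; `pick_device` is a hypothetical helper.

```python
def pick_device(cuda_available, mps_available):
    """Prefer CUDA, then Apple's MPS backend, then fall back to CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


# Sketch of how it could be wired up (not executed here, needs torch/transformers):
#   import torch
#   device = pick_device(torch.cuda.is_available(),
#                        torch.backends.mps.is_available())
#   rrmodel = AutoModelForCausalLM.from_pretrained(model, trust_remote_code=True)
#   rrmodel = rrmodel.to(device)          # no device_map, one explicit move
#   input_ids = input_ids.to(device)
#   attention_mask = attention_mask.to(device)
```

Even with consistent placement, the model's custom code may use operations that the MPS backend does not support, so falling back to `'cpu'` can still be necessary.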

@BliepBlop Did you encounter this issue when you ran the code above?

ValueError: The current device_map had weights offloaded to the disk. Please provide an offload_folder for them. Alternatively, make sure you have safetensors installed if the model you are using offers the weights in this format.

I installed safetensors.
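
The ValueError itself names one way out: pass an `offload_folder` to `from_pretrained`, so that weights which do not fit in memory have somewhere to go on disk. A minimal sketch follows; the `loading_kwargs` helper and the `"offload"` directory name are assumptions for illustration.

```python
from pathlib import Path


def loading_kwargs(offload_dir="offload"):
    """Build from_pretrained keyword arguments that allow offloading to disk."""
    # Make sure the target directory exists before the model is loaded.
    Path(offload_dir).mkdir(parents=True, exist_ok=True)
    return {
        "trust_remote_code": True,
        "device_map": "auto",
        "offload_folder": str(offload_dir),
    }


# Sketch of the call (not executed here, needs transformers and the model files):
#   rrmodel = AutoModelForCausalLM.from_pretrained(model, **loading_kwargs())
```

Weights offloaded to disk are read back during generation, so expect this to be considerably slower than keeping the whole model in RAM or VRAM.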

Technology Innovation Institute org

In principle the model should need at least 16GB of memory to run--but generation on CPU is bound to be slow.
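
The memory figure can be sanity-checked with simple arithmetic on the weights alone. This sketch assumes roughly 7 billion parameters (taken from the model name) and ignores activations and other runtime overhead, which come on top.

```python
# Bytes needed to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}


def weight_memory_gb(n_params, dtype):
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


N_PARAMS = 7e9  # assumption: about 7 billion parameters

for dtype in ("float32", "bfloat16"):
    print(f"{dtype}: ~{weight_memory_gb(N_PARAMS, dtype):.0f} GB")
```

In bfloat16 the weights come to about 14 GB, which fits the 16 GB figure once overhead is added. Loading without `torch_dtype` normally keeps float32, roughly doubling that, which may be why some setups above end up offloading to disk.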

We also recommend having a look at this blog post for more info on finetuning & inference of Falcon.

Excellent post. Thanks for providing this.
All worked for me on a 4090, Ubuntu 20.04

Alternative: https://github.com/cmp-nct/ggllm.cpp/blob/master/README.md
It includes a video on how to compile it on Windows, does not need a complex conda/Python backend, and runs with just a few GB of RAM (or VRAM), 10+ times faster than with Python.
It also includes an exe binary release for Windows (for CPU and CUDA) if you don't want to get into development frameworks.

Can anyone help me, please?
I have text data stored in a .txt file; it is simple information about a technology.
I want to fine-tune the Falcon model and then ask it questions based on that .txt file.

Fine-tuning typically involves a clean set of inputs and outputs, not a text with simple information.
You can look into fine-tuning projects for Falcon and what their input data looks like; it will take considerable effort to transform your text into good inputs and outputs.
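
To give an idea of what "a clean set of inputs and outputs" can look like, here is a sketch that writes question/answer pairs as JSON Lines, a format many fine-tuning projects use. The field names `prompt` and `response` and the example pair are made up for illustration; a real project will dictate its own schema.

```python
import json


def to_jsonl(pairs):
    """Serialize (input, output) pairs as one JSON object per line."""
    lines = [
        json.dumps({"prompt": question, "response": answer}, ensure_ascii=False)
        for question, answer in pairs
    ]
    return "\n".join(lines)


# Hypothetical example pair, written by hand from the source text.
example_pairs = [
    ("What does the device measure?", "It measures temperature."),
]
print(to_jsonl(example_pairs))
```

Producing such pairs from a plain .txt file is the laborious part: each one has to be written or generated and then checked.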

The more likely solution is to just prompt Falcon with your text and ask it to use it as its information source. Using a good fine-tune that follows your prompt can increase the quality.
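
A minimal sketch of that prompting approach: put the contents of the .txt file in front of the question and let the model continue after "Answer:". The template wording and the `build_prompt` name are assumptions, not a format Falcon requires.

```python
def build_prompt(context, question):
    """Combine source text and a question into a single prompt string."""
    return (
        "Use only the information below to answer the question.\n\n"
        f"Information:\n{context.strip()}\n\n"
        f"Question: {question.strip()}\n"
        "Answer:"
    )


# Hypothetical usage: the context would normally be read from the .txt file.
print(build_prompt("The sensor reports temperature once per minute.",
                   "How often does the sensor report?"))
```

The resulting string is passed to the tokenizer in place of "Once upon a time". The model's context window is limited, so a long file has to be cut down to the relevant part first.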

Thank you so much. worked a treat!

Also, if anyone is getting this error:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

just add the line: input_ids = input_ids.to('cuda')

thanks again @Sloba :)
