KeyError

#4
by lukaemon - opened

I'm running the basic example from the README:

# pip install accelerate
from transformers import AutoTokenizer, OPTForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-30b")
model = OPTForCausalLM.from_pretrained("facebook/galactica-30b", device_map="auto")

input_text = "The Transformer architecture [START_REF]"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))

Then I got this error msg:

File /opt/conda/lib/python3.8/site-packages/transformers/modeling_utils.py:2326, in PreTrainedModel.from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
   2323     if dtype_orig is not None:
   2324         torch.set_default_dtype(dtype_orig)
-> 2326     model, missing_keys, unexpected_keys, mismatched_keys, error_msgs = cls._load_pretrained_model(
   2327         model,
   2328         state_dict,
   2329         loaded_state_dict_keys,  # XXX: rename?
   2330         resolved_archive_file,
   2331         pretrained_model_name_or_path,
   2332         ignore_mismatched_sizes=ignore_mismatched_sizes,
   2333         sharded_metadata=sharded_metadata,
   2334         _fast_init=_fast_init,
   2335         low_cpu_mem_usage=low_cpu_mem_usage,
   2336         device_map=device_map,
   2337         offload_folder=offload_folder,
   2338         offload_state_dict=offload_state_dict,
   2339         dtype=torch_dtype,
   2340         load_in_8bit=load_in_8bit,
   2341     )
   2343 # make sure token embedding weights are still tied if needed
...
-> 2448 param = model_state_dict[key]
   2449 if param.device == torch.device("meta"):
   2450     if not load_in_8bit:

KeyError: 'decoder.layers.37.self_attn.out_proj.bias'

Did I miss something obvious?

Same error but different key.

KeyError: 'decoder.layers.44.self_attn_layer_norm.bias'

I think this might be a bug in the way huggingface downloads blobs, where intermittent failures aren't detected and can corrupt the blob.
It also does not check hashsums for the blobs, so it is unable to detect that the blob was corrupted.
I lost my original output showing which key it failed on, but I got suspicious of my 05db345d4fcca580bed2c6e9d0fe8feead207c2c2fa8384c27c94cbd4ed0e0bf shard because it was smaller than the others.
So I deleted it and had it re-pulled, and its size changed in the new pull.

Old disk usage

❯ du -csh ./models--facebook--galactica-30b/blobs/*
785M    ./models--facebook--galactica-30b/blobs/0379c39b5a0cb59453b14738ef1d4924e93599aba4e57f2599036e76f36532f6
5.3G    ./models--facebook--galactica-30b/blobs/05db345d4fcca580bed2c6e9d0fe8feead207c2c2fa8384c27c94cbd4ed0e0bf
4.0K    ./models--facebook--galactica-30b/blobs/0967ef424bce6791893e9a57bb952f80fd536e93
2.5G    ./models--facebook--galactica-30b/blobs/0d6ce164b560f4601d48f61c2a8d598106faa9f4b89c39334a712429649b75c8
4.0K    ./models--facebook--galactica-30b/blobs/28e11da7e191492f3f23d2aa35e9b60f8e9becf6
9.2G    ./models--facebook--galactica-30b/blobs/30a274571d49a30bb4d6872e69b96ad191fa22c92427d160c74ce225a566bd71
24K     ./models--facebook--galactica-30b/blobs/98d10d1a52ab2b70f1deff472512cbaa6065e317
9.2G    ./models--facebook--galactica-30b/blobs/aa79446f17da0f3b9f8815a3628c2b1935936ec819f09a5865ce4e3c4ee51aa7
9.2G    ./models--facebook--galactica-30b/blobs/b919005245e2b77d57bf3a73ac18415083aa32b6e2e4e89c96b8d988453a0e7f
4.0K    ./models--facebook--galactica-30b/blobs/bc97f8a9458a1fe096bec5d8ec938a02647bc4bb
9.2G    ./models--facebook--galactica-30b/blobs/c1cad10954e544c44aabd29f31e67292d1bc819d2e7b9842f14fdcef88d58f93
2.1M    ./models--facebook--galactica-30b/blobs/e18054f92dc016b43c940dd1c4a1c5da884539c0
46G     total

Latest disk usage

785M    ./models--facebook--galactica-30b/blobs/0379c39b5a0cb59453b14738ef1d4924e93599aba4e57f2599036e76f36532f6
9.2G    ./models--facebook--galactica-30b/blobs/05db345d4fcca580bed2c6e9d0fe8feead207c2c2fa8384c27c94cbd4ed0e0bf
4.0K    ./models--facebook--galactica-30b/blobs/0967ef424bce6791893e9a57bb952f80fd536e93
9.2G    ./models--facebook--galactica-30b/blobs/0d6ce164b560f4601d48f61c2a8d598106faa9f4b89c39334a712429649b75c8
4.0K    ./models--facebook--galactica-30b/blobs/28e11da7e191492f3f23d2aa35e9b60f8e9becf6
9.2G    ./models--facebook--galactica-30b/blobs/30a274571d49a30bb4d6872e69b96ad191fa22c92427d160c74ce225a566bd71
24K     ./models--facebook--galactica-30b/blobs/98d10d1a52ab2b70f1deff472512cbaa6065e317
9.2G    ./models--facebook--galactica-30b/blobs/aa79446f17da0f3b9f8815a3628c2b1935936ec819f09a5865ce4e3c4ee51aa7
9.2G    ./models--facebook--galactica-30b/blobs/b919005245e2b77d57bf3a73ac18415083aa32b6e2e4e89c96b8d988453a0e7f
4.0K    ./models--facebook--galactica-30b/blobs/bc97f8a9458a1fe096bec5d8ec938a02647bc4bb
9.2G    ./models--facebook--galactica-30b/blobs/c1cad10954e544c44aabd29f31e67292d1bc819d2e7b9842f14fdcef88d58f93
2.1M    ./models--facebook--galactica-30b/blobs/e18054f92dc016b43c940dd1c4a1c5da884539c0
56G     total

I'm computing the md5sums for the blobs now with

❯ md5sum ./models--facebook--galactica-30b/blobs/*
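
A side note on checking for corruption without someone else's md5sums to compare against. This is only a sketch, and it rests on an assumption: that the blobs with 64-character names are named after the SHA-256 of their contents, which is what the file names in the listings above suggest. The shorter 40-character blobs appear to use a different scheme and are skipped. The helper names `sha256_of` and `find_corrupted_blobs` are made up for this example.

```python
import hashlib
from pathlib import Path
from typing import List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-GB shards do not have to fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_corrupted_blobs(blob_dir: Path) -> List[Path]:
    """Return blobs whose 64-character name differs from their SHA-256 digest.

    Assumption: a 64-character blob name is the SHA-256 of the file contents.
    Blobs with names of any other length are skipped, not reported.
    """
    corrupted = []
    for blob in sorted(blob_dir.iterdir()):
        if blob.is_file() and len(blob.name) == 64:
            if sha256_of(blob) != blob.name:
                corrupted.append(blob)
    return corrupted


# Usage (path taken from the listings above):
# find_corrupted_blobs(Path("./models--facebook--galactica-30b/blobs"))
```

If the assumption holds, a truncated download would show up in the returned list, and an empty list would mean the large shards match their names.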

Let's compare?

These are my md5sums which get KeyError: 'decoder.layers.44.self_attn_layer_norm.bias':

❯ md5sum ./models--facebook--galactica-30b/blobs/*
ee6deb059a899a51aa3e1c726e935aa2  ./models--facebook--galactica-30b/blobs/0379c39b5a0cb59453b14738ef1d4924e93599aba4e57f2599036e76f36532f6
5af8e57b27eaafa9d59d4669b5f7b1f7  ./models--facebook--galactica-30b/blobs/05db345d4fcca580bed2c6e9d0fe8feead207c2c2fa8384c27c94cbd4ed0e0bf
8a80554c91d9fca8acb82f023de02f11  ./models--facebook--galactica-30b/blobs/0967ef424bce6791893e9a57bb952f80fd536e93
3a7ffd5e37b9c2552aca688fd1531723  ./models--facebook--galactica-30b/blobs/0d6ce164b560f4601d48f61c2a8d598106faa9f4b89c39334a712429649b75c8
f4484c98e948186322d8e29f2d317004  ./models--facebook--galactica-30b/blobs/28e11da7e191492f3f23d2aa35e9b60f8e9becf6
4198d299ecacb9ea6866ec62d352691e  ./models--facebook--galactica-30b/blobs/30a274571d49a30bb4d6872e69b96ad191fa22c92427d160c74ce225a566bd71
2664153e6ea77cc8a03a58a4e894984d  ./models--facebook--galactica-30b/blobs/98d10d1a52ab2b70f1deff472512cbaa6065e317
1e6ebc15971c26c46d1832b5b5247560  ./models--facebook--galactica-30b/blobs/aa79446f17da0f3b9f8815a3628c2b1935936ec819f09a5865ce4e3c4ee51aa7
ff304677b7c8d9b0aabbaf63cb0c1bbd  ./models--facebook--galactica-30b/blobs/b919005245e2b77d57bf3a73ac18415083aa32b6e2e4e89c96b8d988453a0e7f
fdda94195fbe20918df6aaa9aba70d10  ./models--facebook--galactica-30b/blobs/bc97f8a9458a1fe096bec5d8ec938a02647bc4bb
4e9c975acbbce326de42198a9d06a246  ./models--facebook--galactica-30b/blobs/c1cad10954e544c44aabd29f31e67292d1bc819d2e7b9842f14fdcef88d58f93
a74f71fa9db6a1a27c33d77d20696944  ./models--facebook--galactica-30b/blobs/e18054f92dc016b43c940dd1c4a1c5da884539c0

Can the authors check as well? I think 30b in particular might be broken because the artifacts are much smaller than expected.

If you calculate the bytes / parameter ratios, my install of 30B is an obvious outlier:

Size      Parameters  Disk usage  Bytes / parameter
mini      125 M       480M        4.0265
base      1.3 B       5.0G        4.1298
standard  6.7 B       26G         4.1667
large     30 B        56G         2.0043
huge      120 B       453G        4.0534

If we roughly interpolate and say the checkpoints use ~4 bytes (fp32) per parameter, we should expect the 30b model to have ~120GB of blobs. However, if you sum all the blobs in the repo, it is only ~60GB.

Get disk usage

❯ du -csh ~/.cache/huggingface/hub/*/blobs
453G /home/jackmin/.cache/huggingface/hub/models--facebook--galactica-120b/blobs
480M /home/jackmin/.cache/huggingface/hub/models--facebook--galactica-125m/blobs
5.0G /home/jackmin/.cache/huggingface/hub/models--facebook--galactica-1.3b/blobs
56G /home/jackmin/.cache/huggingface/hub/models--facebook--galactica-30b/blobs
26G /home/jackmin/.cache/huggingface/hub/models--facebook--galactica-6.7b/blobs
539G total

Bytes / Parameter Calculation

[125m] 480 * (2 ** 20) / 125e6 = 4.0265_
[1.3b] 5 * (2 ** 30) / 1.3e9 = 4.1298_
[6.7b] 26 * (2 ** 30) / 6.7e9 = 4.1667_
[30b] 56 * (2 ** 30) / 30e9 = 2.0043_
[120b] 453 * (2 ** 30) / 120e9 = 4.0534_
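
The same arithmetic as a small script, in case anyone wants to rerun it on their own cache. It is a sketch: the sizes are the `du` figures from above, `du -h` is assumed to report binary units (M = 2**20, G = 2**30), and `bytes_per_parameter` is just a name picked for this example.

```python
UNIT_BYTES = {"M": 2**20, "G": 2**30}  # assumption: du -h uses binary units


def bytes_per_parameter(disk_usage: str, n_params: float) -> float:
    """Convert a du-style size such as '480M' or '5.0G' to bytes per parameter."""
    size, unit = float(disk_usage[:-1]), disk_usage[-1]
    return size * UNIT_BYTES[unit] / n_params


# (disk usage reported by du, nominal parameter count)
checkpoints = {
    "125m": ("480M", 125e6),
    "1.3b": ("5.0G", 1.3e9),
    "6.7b": ("26G", 6.7e9),
    "30b": ("56G", 30e9),
    "120b": ("453G", 120e9),
}

for name, (usage, params) in checkpoints.items():
    print(f"[{name}] {bytes_per_parameter(usage, params):.4f}")

# Expected total blob size for 30e9 parameters, in GB (10**9 bytes):
expected_fp32_gb = 30e9 * 4 / 1e9  # 4 bytes per parameter
expected_fp16_gb = 30e9 * 2 / 1e9  # 2 bytes per parameter
print(expected_fp32_gb, expected_fp16_gb)
```

A ratio near 4 is consistent with fp32 weights and a ratio near 2 with fp16, which is why the 30b number stands out.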

Artifacts in the repo

[image: image.png]

I have deduced that the 30b model pickles have no biases.

from pathlib import Path
import pickle

import torch
from tqdm import tqdm

blob_path = Path.home() / '.cache/huggingface/hub/models--facebook--galactica-30b/blobs'

keys2blob = {}  # tensor name -> blob file that contains it
errors = {}     # blob file -> exception raised while loading it
blobs = [blob for blob in blob_path.glob('*') if blob.is_file()]

for blob in tqdm(blobs):
    try:
        # Load on CPU so inspecting the shards does not need a GPU.
        keys2blob.update({k: blob for k in torch.load(blob, map_location='cpu').keys()})
    except pickle.UnpicklingError as e:
        # Blobs that are not checkpoint shards (config, tokenizer files) are not pickles.
        errors[blob] = e

print(f"Num_weights: {sum('weight' in k for k in keys2blob)}")
print(f"Num_biases: {sum('bias' in k for k in keys2blob)}")
100%|██████████| 12/12 [00:50<00:00,  4.19s/it]
Num_weights: 290
Num_biases: 0

By contrast, the 6.7b checkpoint contains a lot of biases.

from pathlib import Path
import pickle

import torch
from tqdm import tqdm

blob_path = Path.home() / '.cache/huggingface/hub/models--facebook--galactica-6.7b/blobs'

keys2blob = {}  # tensor name -> blob file that contains it
errors = {}     # blob file -> exception raised while loading it
blobs = [blob for blob in blob_path.glob('*') if blob.is_file()]

for blob in tqdm(blobs):
    try:
        # Same check as for 30b, only the cache directory differs.
        keys2blob.update({k: blob for k in torch.load(blob, map_location='cpu').keys()})
    except pickle.UnpicklingError as e:
        errors[blob] = e

print(f"Num_weights: {sum('weight' in k for k in keys2blob)}")
print(f"Num_biases: {sum('bias' in k for k in keys2blob)}")
50%|█████     | 4/8 [00:14<00:14,  3.57s/it]
Num_weights: 260
Num_biases: 257

Update: 30b is the only model stored in half precision. It also has fewer tensors than expected.

Size      Parameters  Disk usage  Bytes / parameter  Sum(layer.numels)  Data type of tensors
mini      125 M       480M        4.0265             163,430,400        {torch.float32: 197}
base      1.3 B       5.0G        4.1298             1,417,601,024      {torch.float32: 389}
standard  6.7 B       26G         4.1667             6,862,159,872      {torch.float32: 517}
large     30 B        56G         2.0043             29,968,103,424     {torch.float16: 290}
huge      120 B       453G        4.0534             121,853,747,200    {torch.float32: 1541}
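
For anyone who wants to reproduce the last two columns, here is a minimal sketch of how they could be computed. It is not the exact script used for the table. `summarize_state_dict` is a name invented here, and it only relies on each value having `.numel()` and `.dtype`, which torch tensors do.

```python
from collections import Counter


def summarize_state_dict(state_dict):
    """Return (total element count, Counter of dtypes) for a name -> tensor mapping.

    Only `.numel()` and `.dtype` are used, so any tensor-like object works.
    """
    total = sum(t.numel() for t in state_dict.values())
    dtypes = Counter(t.dtype for t in state_dict.values())
    return total, dtypes
```

For a sharded checkpoint, one would merge the dicts returned by `torch.load` for every shard into a single mapping first and pass that in.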

I'm getting KeyErrors too, for example:
KeyError: 'decoder.layers.27.final_layer_norm.weight'
KeyError: 'decoder.layers.11.fc1.bias'
KeyError: 'decoder.layers.6.fc1.bias'
The key is different on each run.

Here are the md5sums:

md5sum ./models--facebook--galactica-30b/blobs/*

ee6deb059a899a51aa3e1c726e935aa2  ./models--facebook--galactica-30b/blobs/0379c39b5a0cb59453b14738ef1d4924e93599aba4e57f2599036e76f36532f6
5af8e57b27eaafa9d59d4669b5f7b1f7  ./models--facebook--galactica-30b/blobs/05db345d4fcca580bed2c6e9d0fe8feead207c2c2fa8384c27c94cbd4ed0e0bf
8a80554c91d9fca8acb82f023de02f11  ./models--facebook--galactica-30b/blobs/0967ef424bce6791893e9a57bb952f80fd536e93
3a7ffd5e37b9c2552aca688fd1531723  ./models--facebook--galactica-30b/blobs/0d6ce164b560f4601d48f61c2a8d598106faa9f4b89c39334a712429649b75c8
f4484c98e948186322d8e29f2d317004  ./models--facebook--galactica-30b/blobs/28e11da7e191492f3f23d2aa35e9b60f8e9becf6
4198d299ecacb9ea6866ec62d352691e  ./models--facebook--galactica-30b/blobs/30a274571d49a30bb4d6872e69b96ad191fa22c92427d160c74ce225a566bd71
2664153e6ea77cc8a03a58a4e894984d  ./models--facebook--galactica-30b/blobs/98d10d1a52ab2b70f1deff472512cbaa6065e317
1e6ebc15971c26c46d1832b5b5247560  ./models--facebook--galactica-30b/blobs/aa79446f17da0f3b9f8815a3628c2b1935936ec819f09a5865ce4e3c4ee51aa7
ff304677b7c8d9b0aabbaf63cb0c1bbd  ./models--facebook--galactica-30b/blobs/b919005245e2b77d57bf3a73ac18415083aa32b6e2e4e89c96b8d988453a0e7f
fdda94195fbe20918df6aaa9aba70d10  ./models--facebook--galactica-30b/blobs/bc97f8a9458a1fe096bec5d8ec938a02647bc4bb
4e9c975acbbce326de42198a9d06a246  ./models--facebook--galactica-30b/blobs/c1cad10954e544c44aabd29f31e67292d1bc819d2e7b9842f14fdcef88d58f93
a74f71fa9db6a1a27c33d77d20696944  ./models--facebook--galactica-30b/blobs/e18054f92dc016b43c940dd1c4a1c5da884539c0

These are the same hashes as @Jackmin108's:
https://huggingface.co/facebook/galactica-30b/discussions/4#637c8606d55081513c5679ef

@mrm8488 For me too, the other models worked fine. Only the galactica 30b model gives me key errors.

Hi, @ybelkada made a PR yesterday to fix it ASAP :)

Hey @hwasiti @Jackmin108
Just out of curiosity, have you tried it with the largest model too? The 120b

@ybelkada Unfortunately I don't have the hardware to load 120b even if I sampled down to int8.
However, there doesn't seem to be anything suspicious with the 120b checkpoint.

I believe so because, if you sort the names of the tensors and cut off at the last occurrence of layer 0 in the decoder, you get the same output from the 120b checkpoint as from the 125m checkpoint; the tensors just have different dimensions.
This is not true for the 30b checkpoint, which is missing final layers and biases.

120b

lm_head.weight torch.Size([50000, 10240])
model.decoder.embed_positions.weight torch.Size([2050, 10240])
model.decoder.embed_tokens.weight torch.Size([50000, 10240])
model.decoder.final_layer_norm.bias torch.Size([10240])
model.decoder.final_layer_norm.weight torch.Size([10240])
model.decoder.layers.0.fc1.bias torch.Size([40960])
model.decoder.layers.0.fc1.weight torch.Size([40960, 10240])
model.decoder.layers.0.fc2.bias torch.Size([10240])
model.decoder.layers.0.fc2.weight torch.Size([10240, 40960])
model.decoder.layers.0.final_layer_norm.bias torch.Size([10240])
model.decoder.layers.0.final_layer_norm.weight torch.Size([10240])
model.decoder.layers.0.self_attn.k_proj.bias torch.Size([10240])
model.decoder.layers.0.self_attn.k_proj.weight torch.Size([10240, 10240])
model.decoder.layers.0.self_attn.out_proj.bias torch.Size([10240])
model.decoder.layers.0.self_attn.out_proj.weight torch.Size([10240, 10240])
model.decoder.layers.0.self_attn.q_proj.bias torch.Size([10240])
model.decoder.layers.0.self_attn.q_proj.weight torch.Size([10240, 10240])
model.decoder.layers.0.self_attn.v_proj.bias torch.Size([10240])
model.decoder.layers.0.self_attn.v_proj.weight torch.Size([10240, 10240])
model.decoder.layers.0.self_attn_layer_norm.bias torch.Size([10240])
model.decoder.layers.0.self_attn_layer_norm.weight torch.Size([10240])

125m

lm_head.weight torch.Size([50000, 768])
model.decoder.embed_positions.weight torch.Size([2050, 768])
model.decoder.embed_tokens.weight torch.Size([50000, 768])
model.decoder.final_layer_norm.bias torch.Size([768])
model.decoder.final_layer_norm.weight torch.Size([768])
model.decoder.layers.0.fc1.bias torch.Size([3072])
model.decoder.layers.0.fc1.weight torch.Size([3072, 768])
model.decoder.layers.0.fc2.bias torch.Size([768])
model.decoder.layers.0.fc2.weight torch.Size([768, 3072])
model.decoder.layers.0.final_layer_norm.bias torch.Size([768])
model.decoder.layers.0.final_layer_norm.weight torch.Size([768])
model.decoder.layers.0.self_attn.k_proj.bias torch.Size([768])
model.decoder.layers.0.self_attn.k_proj.weight torch.Size([768, 768])
model.decoder.layers.0.self_attn.out_proj.bias torch.Size([768])
model.decoder.layers.0.self_attn.out_proj.weight torch.Size([768, 768])
model.decoder.layers.0.self_attn.q_proj.bias torch.Size([768])
model.decoder.layers.0.self_attn.q_proj.weight torch.Size([768, 768])
model.decoder.layers.0.self_attn.v_proj.bias torch.Size([768])
model.decoder.layers.0.self_attn.v_proj.weight torch.Size([768, 768])
model.decoder.layers.0.self_attn_layer_norm.bias torch.Size([768])
model.decoder.layers.0.self_attn_layer_norm.weight torch.Size([768])

30b

decoder.embed_positions.weight torch.Size([2050, 7168])
decoder.embed_tokens.weight torch.Size([50000, 7168])
decoder.layers.0.fc1.weight torch.Size([28672, 7168])
decoder.layers.0.fc2.weight torch.Size([7168, 28672])
decoder.layers.0.self_attn.k_proj.weight torch.Size([7168, 7168])
decoder.layers.0.self_attn.out_proj.weight torch.Size([7168, 7168])
decoder.layers.0.self_attn.q_proj.weight torch.Size([7168, 7168])
decoder.layers.0.self_attn.v_proj.weight torch.Size([7168, 7168])
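
A sketch of how listings like the three above can be produced and compared. This is a reconstruction of the idea, not the script actually used. `layer0_listing` and `tensor_names` are names invented here, and the input is assumed to be a plain dict of tensor name -> shape, e.g. `{name: tuple(t.shape) for name, t in state_dict.items()}`.

```python
def layer0_listing(shapes):
    """Sort tensor names and keep everything up to the last decoder layer-0 entry."""
    names = sorted(shapes)
    layer0 = [i for i, name in enumerate(names) if ".layers.0." in name]
    if not layer0:
        return []
    return [(name, shapes[name]) for name in names[: layer0[-1] + 1]]


def tensor_names(listing, prefix="model."):
    """Drop an optional leading prefix so checkpoints saved with and without it compare."""
    return {
        name[len(prefix):] if name.startswith(prefix) else name
        for name, _ in listing
    }


# Toy example with made-up, heavily shortened checkpoints:
reference = {
    "lm_head.weight": (50000, 768),
    "model.decoder.layers.0.fc1.bias": (3072,),
    "model.decoder.layers.0.fc1.weight": (3072, 768),
    "model.decoder.layers.1.fc1.weight": (3072, 768),
}
suspect = {
    "decoder.layers.0.fc1.weight": (28672, 7168),
    "decoder.layers.1.fc1.weight": (28672, 7168),
}
missing = tensor_names(layer0_listing(reference)) - tensor_names(layer0_listing(suspect))
print(sorted(missing))
```

Comparing only the names, not the shapes, is what makes checkpoints of different widths comparable; whatever ends up in `missing` is absent from the suspect checkpoint.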

Great, thank you very much! You really did a great job debugging here, and it helped me a lot in understanding the root cause of the issue.
Let's wait for https://github.com/huggingface/transformers/pull/20390 to be addressed and keep the thread open here

@ybelkada The 120B did not work in the end, but at least I think the model was mapped to the GPUs/RAM/SSD before it gave me a CUDA out-of-memory error on GPU 0, which is solvable, I guess. I did not want to try again: the whole initialization took around 10 hours and it was not worth it. It is not practical to use such a slow model when I have only 64GB RAM and 2 GPUs with 11GB each. The rest would be mapped to SSD, which is not a good idea in practice.

I do intend to spin up an instance on AWS or GCP with enough RAM to fit the whole model, and test how fast it is using the CPU only, or the CPU and 640 GB RAM with 1 GPU. The cost for such an instance is around $1-2/hr (spot instance), which is worth it if I want a few hours of help writing the intro of a research paper or something.

Hi there!
https://github.com/huggingface/transformers/pull/20390 is probably going to be merged. I can confirm that I can at least load the model with the fix; may I ask you to try the same thing on your side? The instructions would be:
pip install --upgrade git+https://github.com/younesbelkada/transformers.git@fix-opt-bias
git clone https://huggingface.co/facebook/galactica-30b/
Then modify the config.json file of the cloned repository by adding 2 lines:

"enable_bias": false,
"layer_norm_elementwise_affine":false,

Looking forward to hearing from you!
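
The config edit described above can also be scripted instead of done by hand. A minimal sketch: `patch_config` is a helper name made up for this example, and the path in the usage comment assumes the clone landed in `./galactica-30b`.

```python
import json
from pathlib import Path


def patch_config(config_path, overrides):
    """Merge `overrides` into a JSON config file in place and return the result."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    config.update(overrides)
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config


# The two settings from the instructions above.
OPT_BIAS_OVERRIDES = {
    "enable_bias": False,
    "layer_norm_elementwise_affine": False,
}

# Usage (hypothetical path to the cloned repository):
# patch_config("galactica-30b/config.json", OPT_BIAS_OVERRIDES)
```

Existing keys in the file are kept; only the two overrides are added or replaced.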

@ybelkada Does this model support half-precision (float16)?
Otherwise, I don't think it will fit my 64GB RAM.

See a similar issue with the 6.7b model:
https://huggingface.co/facebook/galactica-6.7b/discussions/6

Yes, it should support float16. You just have to load it with the argument torch_dtype=torch.float16 when calling .from_pretrained
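
For reference, a sketch of what that call could look like, based on the readme snippet at the top of the thread. The wrapper function `load_galactica_fp16` is a name invented for this example, and the imports sit inside it so that defining it does not require torch or transformers. Whether the result fits in 64GB of RAM is not something this sketch can promise.

```python
def load_galactica_fp16(model_name="facebook/galactica-30b"):
    """Load the tokenizer and the model with weights in float16."""
    # Imported lazily so that defining this helper does not need the libraries.
    import torch
    from transformers import AutoTokenizer, OPTForCausalLM

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = OPTForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,  # the argument mentioned above
        device_map="auto",          # as in the readme snippet; needs `accelerate`
    )
    return tokenizer, model


# Usage (downloads the checkpoint if it is not cached yet):
# tokenizer, model = load_galactica_fp16()
```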

I have used that for the 6.7b model. So does the 6.7b model in particular not support the argument torch_dtype=torch.float16 when calling .from_pretrained?
