Trained on custom dataset not working

#5
by andvg3 - opened

Hi authors,

I was trying to train on my custom dataset. After training successfully on my action data for a while, I saved the weights to a local directory. Then I ran this code:

import numpy as np
from transformers import AutoProcessor

# Load the tokenizer from the Hugging Face hub
tokenizer = AutoProcessor.from_pretrained("new_weight/fast_tokenizer")
print(tokenizer)

# Tokenize & decode action chunks (we use dummy data here)
action_data = np.random.rand(1, 20, 19, 12).tolist()   # one batch of action chunks
tokens = tokenizer(action_data)              # tokens = list[int]
decoded_actions = tokenizer.decode(tokens)

Then the following error occurred:

Traceback (most recent call last):
  File "/home/X/Desktop/robocasa/test.py", line 10, in <module>
    tokens = tokenizer(action_data)              # tokens = list[int]
  File "/home/X/miniconda3/envs/robocasa/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2868, in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
  File "/home/X/miniconda3/envs/robocasa/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2928, in _call_one
    raise ValueError(
ValueError: text input must be of type str (single example), List[str] (batch or single pretokenized example) or List[List[str]] (batch of pretokenized examples).

Could you please look at this issue and suggest how to fix it? Thanks!

Physical Intelligence org

One thing is that your input action chunk should only be 3-dimensional ([batch, chunk_horizon, action_dim]), while yours is 4-dimensional now.
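For reference, a minimal sketch of bringing the 4-dimensional dummy data from the snippet above down to the expected [batch, chunk_horizon, action_dim] layout. Squeezing the leading singleton axis is one assumption about which axis is redundant; adjust if your extra dimension means something else:

```python
import numpy as np

# Dummy 4-D action data as in the snippet above: (1, 20, 19, 12)
action_data = np.random.rand(1, 20, 19, 12)

# The tokenizer expects a 3-D [batch, chunk_horizon, action_dim] array.
# If the leading axis is a redundant singleton batch dimension, squeezing
# it yields 20 chunks of horizon 19 with 12-dimensional actions.
action_chunks = np.squeeze(action_data, axis=0)
print(action_chunks.shape)  # (20, 19, 12)

# The 3-D array can then be passed to the processor, e.g.:
# tokens = tokenizer(action_chunks)
```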

Hello, authors,
I also tried training on my own data, and this is my code:

from transformers import AutoProcessor

tokenizer = AutoProcessor.from_pretrained("michaelyeah7/my_new_tokenizer", trust_remote_code=True)

action_data_numpy = []
for traj in dataset:
    action = traj['action']
    chunked_action = create_overlapping_chunks(action, chunk_size=4)

    # tokenize chunked_action
    chunked_action_np = chunked_action.numpy()
    print("chunked_action_np shape", chunked_action_np.shape)
    tokens = tokenizer(chunked_action_np)
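(For context, `create_overlapping_chunks` is not shown in the thread; this sliding-window sketch with stride 1 is only an assumption about what it does, but it does reproduce the reported (36, 4, 7) shape from a 39-step, 7-dimensional trajectory:)

```python
import numpy as np

def create_overlapping_chunks(action: np.ndarray, chunk_size: int = 4) -> np.ndarray:
    """Slide a window of `chunk_size` steps over a (T, action_dim) trajectory
    with stride 1, returning (T - chunk_size + 1, chunk_size, action_dim)."""
    T = action.shape[0]
    return np.stack([action[i:i + chunk_size] for i in range(T - chunk_size + 1)])

# A 39-step, 7-dim trajectory yields the shape reported below:
chunks = create_overlapping_chunks(np.zeros((39, 7)), chunk_size=4)
print(chunks.shape)  # (36, 4, 7)
```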

My input action chunk is 3-dimensional but I still encountered a similar issue.

traj[action] Tensor("concat:0", shape=(None, 7), dtype=float32)
chunked_action_np shape (36, 4, 7)
Traceback (most recent call last):
  File "/home/user/open-pi-zero/fit_fast.py", line 232, in <module>
    tokens = tokenizer(chunked_action_np)
  File "/home/user/miniconda3/envs/op0/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2868, in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
  File "/home/user/miniconda3/envs/op0/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2928, in _call_one
    raise ValueError(
ValueError: text input must be of type str (single example), List[str] (batch or single pretokenized example) or List[List[str]] (batch of pretokenized examples).

Could you please provide some suggestions? Thank you!

One thing is that your input action chunk should only be 3-dimensional ([batch, chunk_horizon, action_dim]), while yours is 4-dimensional now.

Actually, I just played with the code. I tried 1-, 2-, 3-, and 4-dimensional inputs, and none of them worked.
