Smaller version to ease implementation experiments?
Hi. I've worked on implementing Mamba support in llama.cpp before (see https://github.com/ggerganov/llama.cpp/pull/5328), and I'd like to eventually implement support for Jamba too.
However, for my hardware, this model is too big for quick experimentation, so I'd really appreciate it if you'd also release a smaller model with the same architecture. It doesn't need to be good (though some coherency is preferred). Ideally a Jamba model with less than 1B parameters would help a lot with this, if possible.
I second this. Loading the weights takes a really long time. A light version (with pruning?) would be great for quick testing iterations, even if the end result is not effective at all.
I third this
I trained a Jamba architecture model with some code data. It's very small and has some basic code generation capabilities. Might be useful for this.
https://huggingface.co/TechxGenus/Mini-Jamba
Nice! Unfortunately, there seems to be no Mamba+MoE layer(s) in your model. I only see Mamba+MLP layers alternated with Attention+MoE layers. The attn_layer_offset and attn_layer_period keys in config.json differ from those in the official Jamba-v0.1 model, and might have caused this at training time, I guess?
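For reference, here's a rough sketch of how I read those keys. It assumes the rule used by transformers' JambaConfig (layer i is an Attention layer when i % attn_layer_period == attn_layer_offset, and gets MoE when i % expert_layer_period == expert_layer_offset); the numbers below are only illustrative, not taken from either model.

```python
# Sketch: derive per-layer block types from Jamba-style config keys.
# Assumes the layer-assignment rule described above; values are illustrative only.
def layer_layout(num_layers, attn_period, attn_offset, expert_period, expert_offset):
    layout = []
    for i in range(num_layers):
        mixer = "Attention" if i % attn_period == attn_offset else "Mamba"
        ffn = "MoE" if i % expert_period == expert_offset else "MLP"
        layout.append(f"layer {i}: {mixer}+{ffn}")
    return layout

# expert_* equal to attn_*: only the Attention layers get MoE.
print("\n".join(layer_layout(8, attn_period=2, attn_offset=1, expert_period=2, expert_offset=1)))

# Decoupling the expert period from the attention period gives Mamba+MoE layers too.
print("\n".join(layer_layout(8, attn_period=4, attn_offset=2, expert_period=2, expert_offset=1)))
```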
Ah, this is because I set expert_layer_offset and expert_layer_period to be the same as attn_layer_offset and attn_layer_period. I wanted to first test the results of using MoE only in the Attention layers when making this version.
I will make a new version with Mamba+MoE, Mamba+MLP, Attention+MoE, and Attention+MLP layers at the same time later.
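For that new version, a tiny config along these lines should produce all four combinations. This is only a sketch: it assumes the parameter names of transformers' JambaConfig, and all the dimensions below are made up just to keep the model small.

```python
# Sketch of a tiny Jamba config mixing all four layer types.
# Assumes transformers' JambaConfig parameter names; sizes are made up.
from transformers import JambaConfig

config = JambaConfig(
    hidden_size=256,
    intermediate_size=512,
    num_hidden_layers=8,
    num_attention_heads=4,
    num_key_value_heads=2,
    num_experts=4,
    num_experts_per_tok=2,
    attn_layer_period=3,    # Attention layers at indices 1, 4, 7
    attn_layer_offset=1,
    expert_layer_period=2,  # MoE on odd layers
    expert_layer_offset=1,
)
# With the usual period/offset rule, layers 0..7 come out as:
# Mamba+MLP, Attention+MoE, Mamba+MLP, Mamba+MoE,
# Attention+MLP, Mamba+MoE, Mamba+MLP, Attention+MoE
```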
Hi, we uploaded this version for debugging and development purposes (random weights, no training whatsoever)
https://huggingface.co/ai21labs/Jamba-tiny-random
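For a quick smoke test, something like the following should work. It's only a sketch: it assumes a transformers release with native Jamba support and that the repo ships a tokenizer, and the output will be gibberish since the weights are random.

```python
# Minimal smoke test of the random-weight debug model.
# Assumes a transformers version with Jamba support; the output is meaningless
# because the weights are untrained.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ai21labs/Jamba-tiny-random"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    use_mamba_kernels=False,  # avoid requiring the optional mamba-ssm/causal-conv1d kernels
)

inputs = tokenizer("Hello", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(out[0]))
```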