Using Caduceus

To use the pre-trained model for masked language modeling, use the following snippet:

from transformers import AutoModelForMaskedLM, AutoTokenizer

# See the `Caduceus` collection page on the hub for a list of available models.
model_name = "kuleshov-group/caduceus-ps_seqlen-131k_d_model-256_n_layer-16"
# The repo ships custom modeling code, so remote code must be explicitly trusted.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(model_name, trust_remote_code=True)

Alternatively, you can instantiate a model from scratch to train on your own data as follows:

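from transformers import AutoConfig, AutoModelForMaskedLM

# Add any config overrides here; see the `config.json` file on the hub for details.
config_overrides = {}
# See the `Caduceus` collection page on the hub for a list of available models.
# The repo ships custom modeling code, so remote code must be explicitly trusted.
config = AutoConfig.from_pretrained(
    "kuleshov-group/caduceus-ps_seqlen-131k_d_model-256_n_layer-16",
    trust_remote_code=True,
    **config_overrides,
)
# Builds the architecture with randomly initialized weights (no pre-trained weights are loaded).
model = AutoModelForMaskedLM.from_config(config, trust_remote_code=True)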

Model Details

This is the Caduceus-PS model with hidden dimension 256 and 16 MambaDNA layers. The model is reverse complement (RC) equivariant, so no RC data augmentation is required when training it, either during pre-training or during downstream fine-tuning. Note that the model's hidden state has twice as many channels as that of a non-RC-equivariant counterpart.

To ensure RC-invariant outputs for downstream training and inference, one can either run the downstream model on both the hidden state and its RC, or average the hidden state and its RC before passing the result to the downstream model. To RC the hidden states, one can use hidden_states.flip(dims=(-2, -1)), which flips along the sequence length and channel dimensions.
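The two downstream options described above can be sketched roughly as follows. This is a minimal illustration, not code from the Caduceus repository: the random tensor stands in for real model hidden states, and the mean-pooled linear `head`, the `downstream` helper, and the choice to average the two head outputs in the first option are all assumptions made for the example.

```python
import torch

torch.manual_seed(0)

batch, seq_len, d_model = 2, 8, 256

# Stand-in for the model's final hidden states (assumption: random values).
# The channel dimension is 2 * d_model, as noted for the RC-equivariant model.
hidden_states = torch.randn(batch, seq_len, 2 * d_model)


def reverse_complement_hidden(h: torch.Tensor) -> torch.Tensor:
    """Flip along the sequence-length (-2) and channel (-1) dimensions."""
    return h.flip(dims=(-2, -1))


# Hypothetical downstream model: mean-pool over the sequence, then a linear layer.
head = torch.nn.Linear(2 * d_model, 1)


def downstream(h: torch.Tensor) -> torch.Tensor:
    return head(h.mean(dim=1))


rc_hidden = reverse_complement_hidden(hidden_states)

# Option 1: run the downstream model on the hidden state and on its RC,
# then combine the two outputs (here by averaging, which is an assumption).
out_ensemble = 0.5 * (downstream(hidden_states) + downstream(rc_hidden))

# Option 2: average the hidden state and its RC first, then run the
# downstream model once on the averaged representation.
avg_hidden = 0.5 * (hidden_states + rc_hidden)
out_avg = downstream(avg_hidden)
```

Because the flip is its own inverse, both results are unchanged if `hidden_states` is replaced by its RC, which is what makes the downstream output RC invariant.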

This model was pre-trained on the human reference genome with sequence length 131,072 for 50k steps (each step contained ~1M base pairs / tokens).
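As a rough sanity check on these numbers, the arithmetic can be worked through as below. The exact per-step token count is an assumption: the text only says "~1M", and 2**20 is used here as a convenient stand-in.

```python
seq_len = 131_072          # sequence length stated above (equals 2**17)
steps = 50_000             # pre-training steps stated above
tokens_per_step = 2**20    # assumption: "~1M" taken as 1,048,576 tokens

# Number of full-length sequences that fit in one step under this assumption.
sequences_per_step = tokens_per_step // seq_len

# Approximate total number of base pairs / tokens seen during pre-training.
total_tokens = steps * tokens_per_step

print(f"sequences per step: {sequences_per_step}")
print(f"total tokens: {total_tokens:,} (~{total_tokens / 1e9:.1f}B)")
```

Under this assumption each step holds 8 sequences, and pre-training covers on the order of 50 billion tokens.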

For more details, please see our paper: Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling.

Citation

Please cite our work using the BibTeX entry below:

@article{schiff2024caduceus,
  title={Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling},
  author={Schiff, Yair and Kao, Chia-Hsiang and Gokaslan, Aaron and Dao, Tri and Gu, Albert and Kuleshov, Volodymyr},
  journal={arXiv preprint arXiv:2403.03234},
  year={2024}
}

Model Card Contact

Yair Schiff ([email protected])
