How to obtain the embeddings for each gene?
Here is what I do, but I am not sure it is correct, since there is a final decoder layer that transforms the hidden embedding into each gene's prediction: (decoder): Linear(in_features=256, out_features=25426, bias=True)
model = AutoModelForMaskedLM.from_pretrained("ctheodoris/Geneformer")
model.cls.predictions.decoder = DummyLayer()
input_ids = torch.Tensor(targets).unsqueeze(1).long()
attention_mask = torch.ones(input_ids.shape).unsqueeze(1).long()
label = torch.ones(input_ids.shape).unsqueeze(1).long()
pred = model(input_ids=input_ids, attention_mask=attention_mask, labels=label)
I wonder if this is correct.
Thank you for your interest in Geneformer! I am not completely certain what you are trying to do, but Hugging Face has helpful and comprehensive documentation on how to interact with model outputs (e.g. https://huggingface.co/docs/transformers/main_classes/output).
You could also check the code in this repository: the example notebooks for classification output model predictions, and the in silico perturber module extracts gene embeddings. For example, extracting gene embeddings could be accomplished with something along the lines of:
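import torch
from transformers import BertForMaskedLM
model = BertForMaskedLM.from_pretrained("/path/to/Geneformer", output_hidden_states=True, output_attentions=False)
model = model.to("cuda")  # the model must be on the same device as the inputs
with torch.no_grad():
    outputs = model(input_ids=input_data.to("cuda"))
# hidden_states is a tuple of per-layer tensors; index it to pick a layer
embeddings = outputs.hidden_states[embedding_layer_to_extract]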
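To make the shapes concrete, here is a minimal sketch of the same idea that runs on CPU without downloading any weights. It uses a tiny, randomly initialized BertForMaskedLM as a stand-in for Geneformer, so the config sizes, the token ids, and the choice of layer are all illustrative assumptions, not Geneformer's actual settings.

```python
import torch
from transformers import BertConfig, BertForMaskedLM

# Tiny randomly initialized stand-in so the sketch runs without pretrained weights.
# These sizes are illustrative assumptions, not Geneformer's real configuration.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=3,
    num_attention_heads=4,
    intermediate_size=64,
    max_position_embeddings=64,
)
model = BertForMaskedLM(config)
model.eval()  # disable dropout so repeated calls give the same embeddings

# Hypothetical batch: 2 cells with 5 gene tokens each, shape (batch, seq_len).
input_ids = torch.tensor([[5, 17, 42, 8, 3], [9, 21, 42, 60, 7]])
attention_mask = torch.ones_like(input_ids)

with torch.no_grad():
    outputs = model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        output_hidden_states=True,
    )

# hidden_states has num_hidden_layers + 1 entries: the embedding-layer output
# followed by the output of each transformer layer.
hidden_states = outputs.hidden_states
layer = -2  # second-to-last layer here; which layer to use is a judgment call
embeddings = hidden_states[layer]  # shape: (batch, seq_len, hidden_size)

# Per-gene embeddings for the first cell, keyed by token id.
gene_embeddings = {
    int(tok): embeddings[0, pos] for pos, tok in enumerate(input_ids[0])
}
```

Note that hidden_states are taken from the encoder, before the masked-language-model prediction head, so reading them this way should not require replacing the decoder layer or passing labels.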