ctheodoris/Geneformer · How to obtain the embeddings for each gene?

Jun 23, 2023

edited Jun 23, 2023

Here is what I do, but I am not sure it is correct, since there is a final decoder layer that transforms the hidden embedding into per-gene predictions: (decoder): Linear(in_features=256, out_features=25426, bias=True)

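import torch
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("ctheodoris/Geneformer")
# DummyLayer is not defined in this post; presumably a pass-through module such as torch.nn.Identity
model.cls.predictions.decoder = DummyLayer()
# targets: gene token ids; each one becomes a length-1 sequence, shape (n_genes, 1)
input_ids = torch.tensor(targets, dtype=torch.long).unsqueeze(1)
attention_mask = torch.ones_like(input_ids)
# labels are omitted: they only feed the masked-LM loss, which is not needed to read out embeddings
pred = model(input_ids=input_ids, attention_mask=attention_mask)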

I wonder if this is correct.

ctheodoris

Owner Jun 23, 2023

Thank you for your interest in Geneformer! I am not completely certain what you are trying to do, but Hugging Face has helpful and comprehensive documentation on how to interact with model outputs (e.g. https://huggingface.co/docs/transformers/main_classes/output).
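As a rough illustration of what that documentation describes, the sketch below builds a tiny, randomly initialized BERT masked-LM model and reads its outputs by attribute, by key, and by index. It is not Geneformer: the configuration values (vocab_size=100, hidden_size=32, two layers) and the random input are assumptions chosen only so the example runs quickly without downloading weights.

```python
import torch
from transformers import BertConfig, BertForMaskedLM

# Tiny stand-in model (assumed sizes, NOT Geneformer's real configuration).
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    output_hidden_states=True,
)
model = BertForMaskedLM(config).eval()

# Random token ids standing in for 3 cells of 5 genes each.
input_ids = torch.randint(0, 100, (3, 5))

with torch.no_grad():
    outputs = model(input_ids=input_ids)

# The same logits tensor can be reached in three ways.
logits = outputs.logits          # attribute access
logits_by_key = outputs["logits"]  # dict-style access
logits_by_index = outputs[0]     # tuple-style access (no loss here, so logits come first)

# hidden_states holds the embedding-layer output plus one tensor per layer.
hidden_states = outputs.hidden_states
last_hidden = hidden_states[-1]  # shape (batch, seq_len, hidden_size)
```

With the decoder left in place, hidden_states already contains the representations from before the final Linear projection, so the decoder does not have to be swapped out to read them.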

You could also check the code in this repository in the example notebooks for classification, which output model predictions, and in the in silico perturber module, which extracts gene embeddings. For example, extracting gene embeddings could be accomplished with something along the lines of:

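import torch
from transformers import BertForMaskedLM

model = BertForMaskedLM.from_pretrained("/path/to/Geneformer", output_hidden_states=True, output_attentions=False).to("cuda")

with torch.no_grad():
    outputs = model(input_ids=input_data.to("cuda"))

# hidden_states is a tuple: index 0 is the embedding-layer output, followed by one entry per layer
embeddings = outputs.hidden_states[embedding_layer_to_extract]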
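The extracted tensor has one vector per input position, so getting one vector per gene still requires mapping positions back to gene token ids. The sketch below shows one possible way to do that by averaging over every unpadded position where a gene appears. The helper name mean_embedding_per_gene, the use of an attention mask to mark padding, and the toy tensors are all assumptions for illustration; this is not necessarily how the in silico perturber module aggregates embeddings.

```python
import torch

def mean_embedding_per_gene(hidden, input_ids, attention_mask):
    """Average position-level embeddings into one vector per gene token id.

    hidden:         (batch, seq_len, dim) float tensor, e.g. one entry of outputs.hidden_states
    input_ids:      (batch, seq_len) long tensor of gene token ids
    attention_mask: (batch, seq_len) tensor, 1 for real tokens and 0 for padding
    Returns (gene_ids, means): sorted unique token ids and their mean vectors.
    """
    keep = attention_mask.bool()
    vectors = hidden[keep]   # (n_tokens, dim), padding positions dropped
    ids = input_ids[keep]    # (n_tokens,)

    # inverse[i] is the row of gene_ids that token i belongs to.
    gene_ids, inverse = torch.unique(ids, return_inverse=True)

    sums = torch.zeros(len(gene_ids), hidden.shape[-1],
                       dtype=hidden.dtype, device=hidden.device)
    sums.index_add_(0, inverse, vectors)  # add each token vector to its gene's row
    counts = torch.bincount(inverse, minlength=len(gene_ids)).unsqueeze(1)
    return gene_ids, sums / counts

# Toy data: 2 cells, 3 positions each, 2-dimensional embeddings; last position is padding.
toy_hidden = torch.tensor([[[1., 1.], [3., 3.], [9., 9.]],
                           [[5., 5.], [7., 7.], [0., 0.]]])
toy_ids = torch.tensor([[4, 7, 0],
                        [4, 9, 0]])
toy_mask = torch.tensor([[1, 1, 0],
                         [1, 1, 0]])

gene_ids, gene_means = mean_embedding_per_gene(toy_hidden, toy_ids, toy_mask)
```

Because the model is contextual, the same gene gets a different vector in every cell; averaging is only one way to summarize that, and keeping the per-cell vectors may be preferable depending on the analysis.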

ctheodoris changed discussion status to closed Jun 23, 2023