Splitting dataset

#487
by ZYSK-huggingface - opened

Hi Christina,

I have a question regarding the data splitting mechanism in prepare_data. From the code, I see that the dataset is split based on attr_to_split="individual", meaning the split is done at the sample level. However, I also noticed that attr_to_balance allows balancing by attributes like disease, lvef, age, sex, and length.

My understanding is that this might mean the function adjusts the number of cells assigned to each split to achieve balance across these attributes. However, since the train-test split is already predefined at the sample level, I am unclear on how this balancing is achieved. Does it mean that within each predefined sample split, the function adjusts the number of cells to balance the categories? Or does it work in some other way?

I would really appreciate any clarification on this. Thanks in advance!

Generally if the samples are split by individual and the attributes are patient-level, the patients will be split into groups with these attributes being balanced at that level, not by cell number. Cell numbers can be balanced separately by the user.
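
To make the patient-level idea concrete, here is a minimal sketch in plain Python. It is not Geneformer's implementation: the function names, the toy patient table, and the imbalance score (the largest difference in category proportions between the two groups) are assumptions made for illustration only. It tries several random splits of individuals and keeps the one where the attribute is most evenly distributed, so whole patients move between groups and cell counts are never consulted.

```python
import random
from collections import Counter


def imbalance(patients, attr, train_ids, eval_ids):
    """Largest difference in category proportions between the two groups."""
    def props(group):
        counts = Counter(patients[i][attr] for i in group)
        return {k: v / len(group) for k, v in counts.items()}

    p_train, p_eval = props(train_ids), props(eval_ids)
    keys = set(p_train) | set(p_eval)
    return max(abs(p_train.get(k, 0) - p_eval.get(k, 0)) for k in keys)


def split_individuals_balanced(patients, attr, eval_frac=0.2, max_trials=100, seed=0):
    """Try random splits of individuals; keep the most balanced one for `attr`."""
    rng = random.Random(seed)
    ids = sorted(patients)
    n_eval = max(1, round(len(ids) * eval_frac))
    best = None
    for _ in range(max_trials):
        shuffled = ids[:]
        rng.shuffle(shuffled)
        eval_ids, train_ids = shuffled[:n_eval], shuffled[n_eval:]
        gap = imbalance(patients, attr, train_ids, eval_ids)
        if best is None or gap < best[0]:
            best = (gap, train_ids, eval_ids)
    return best


# Hypothetical patient-level table: 5 individuals per disease label.
patients = {f"p{i}": {"disease": "dcm" if i < 5 else "nf"} for i in range(10)}
gap, train_ids, eval_ids = split_individuals_balanced(patients, "disease")
print(gap, sorted(train_ids), sorted(eval_ids))
```

Every cell then simply follows its individual into the corresponding split; balancing the number of cells per split would be a separate step.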

ctheodoris changed discussion status to closed

Thank you for your previous explanation. I still have a couple of points I’m hoping you could clarify:

Pre-defined Train/Test IDs and Attribute Balancing
You mentioned that if the samples are split by individual, and the attributes (such as disease status, age, etc.) are patient-level, then the balancing happens at the patient level rather than by cell count. However, I’m curious how this interacts with a scenario where we’ve already explicitly defined which individuals go into the training set and which go into the test set (e.g., via train_ids and test_ids). Does balancing ever conflict with these pre-defined splits? In other words, if we already fix certain individuals to be in the training or test sets, does the balancing mechanism still have room to adjust anything, or is balancing effectively bypassed?

Performing Balancing at the Cell Level
If I wanted to conduct splitting and balancing at the level of individual cells (rather than individuals), would it be feasible to:

· Add a label column (e.g., for disease, cell type, or other attributes) to a loom file,
· Then tokenize that loom file so that each cell’s label is preserved,
· And finally specify train/test IDs at the level of cells instead of individuals?

Would the existing code (e.g., prepare_data, balance_attr_splits) still manage to balance these attributes when we’re dealing with cell-level IDs rather than patient-level IDs? Or would this require a different approach?

Thank you very much!

Sorry, I think I mixed up two split methods. One way is to customize the train, validation, and test datasets by using:
train_test_id_split_dict = {"attr_key": "individual",
                            "train": train_ids + eval_ids,
                            "test": test_ids}
The other way is to use split_sizes={'test': 0.1, 'train': 0.8, 'valid': 0.1} for automated splitting.

So the balancing function is designed for the latter but not for the former. Is my understanding correct?

If you supply the exact individuals to use for the splits, there is no reason to perform the automated balanced splitting. If you do not specify that the splits be performed by individual patient (splitting by individual is recommended), then the splits will have no knowledge of this attribute and will split the cells instead.
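
As a rough illustration of why nothing is left to balance once the IDs are fixed, the sketch below assigns cells to splits purely by membership in a dictionary shaped like the train_test_id_split_dict quoted above. The helper split_cells_by_ids and the toy cell records are hypothetical and do not reproduce the library's code.

```python
def split_cells_by_ids(cells, split_id_dict):
    """Assign each cell to train or test based on its value for attr_key."""
    key = split_id_dict["attr_key"]
    train_ids = set(split_id_dict["train"])
    test_ids = set(split_id_dict["test"])
    train = [c for c in cells if c[key] in train_ids]
    test = [c for c in cells if c[key] in test_ids]
    return train, test


# Hypothetical cell records; each cell carries its patient-level attributes.
cells = [
    {"barcode": "c1", "individual": "p1", "sex": "F"},
    {"barcode": "c2", "individual": "p1", "sex": "F"},
    {"barcode": "c3", "individual": "p2", "sex": "M"},
    {"barcode": "c4", "individual": "p3", "sex": "M"},
    {"barcode": "c5", "individual": "p3", "sex": "M"},
]

train_test_id_split_dict = {
    "attr_key": "individual",
    "train": ["p1", "p2"],
    "test": ["p3"],
}

train_cells, test_cells = split_cells_by_ids(cells, train_test_id_split_dict)
print([c["barcode"] for c in train_cells])
print([c["barcode"] for c in test_cells])
```

The assignment is fully determined by the ID lists, so any balance across attributes has to be built into the lists themselves.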

Thank you for your patient clarification! By the way, if I want to split automatically by cells and balance sex and age, I think I should provide a column for "attr_to_split" (because I noticed in your source code that if "attr_to_split" is set to None, the dataset will not be balanced). So is it possible to use 'barcode' as "attr_to_split" so that the dataset is split at the cell level?

I believe this would work - though we generally split by individuals or replicates when possible to confirm generalizability.
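
The generalizability concern can be seen in a small sketch, assuming a generic group-wise splitter (split_by_attr and shared_individuals are hypothetical helpers, not part of the library). Splitting on a unique per-cell column such as 'barcode' makes every cell its own group, so cells from the same individual can land in both train and test, whereas splitting on 'individual' keeps each patient in exactly one split.

```python
import random


def split_by_attr(cells, attr_to_split, test_frac=0.3, seed=0):
    """Randomly assign whole groups (unique values of attr_to_split) to test."""
    rng = random.Random(seed)
    groups = sorted({c[attr_to_split] for c in cells})
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_frac))
    test_groups = set(groups[:n_test])
    train = [c for c in cells if c[attr_to_split] not in test_groups]
    test = [c for c in cells if c[attr_to_split] in test_groups]
    return train, test


def shared_individuals(train, test):
    """Individuals that contribute cells to both splits."""
    return {c["individual"] for c in train} & {c["individual"] for c in test}


# Hypothetical data: 4 individuals with 5 cells each, unique barcodes.
cells = [
    {"barcode": f"p{p}_cell{k}", "individual": f"p{p}"}
    for p in range(4)
    for k in range(5)
]

train_ind, test_ind = split_by_attr(cells, "individual")
train_bc, test_bc = split_by_attr(cells, "barcode")

print(shared_individuals(train_ind, test_ind))  # empty: no patient in both splits
print(shared_individuals(train_bc, test_bc))    # non-empty: patients straddle splits
```

With the barcode split, test performance partly reflects cells from patients the model has already seen, which is why splitting by individual or replicate is the stronger check.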
