Tonic committed on
Commit
2cd7e3e
·
verified ·
1 Parent(s): f8e6414

Update README.md

Files changed (1)
  1. README.md +208 -1
README.md CHANGED
@@ -7,4 +7,211 @@ sdk: static
7
  pinned: false
8
  ---
9
 
10
- Edit this `README.md` markdown file to author your organization card.
10
+ # Project PhenoSeq: Protein Network Analysis for Phenotypic Outcomes
11
+
12
+ The project showed promising results on basic prediction tasks and identified key areas for improvement in modeling protein-phenotype relationships. The findings provide a foundation for future work in protein network analysis and phenotype prediction.
13
+
14
+ *This project represents a significant step forward in understanding protein-phenotype relationships, while highlighting important areas for future research and development in computational biology.*
15
+
16
+ ## Project Overview
17
+ PhenoSeq investigates how protein networks contribute to organism-scale phenotypes, particularly cancer growth and organism longevity. The project combines protein embeddings from ESM (Evolutionary Scale Modeling) with graph neural networks to predict phenotypic outcomes through protein-protein interactions (PPIs).
18
+
19
+ ## Core Objectives
20
+ 1. Develop predictive models for understanding biological drivers of complex diseases
21
+ 2. Create frameworks for inferring oncogenic potential of genetic mutations
22
+ 3. Analyze clinical significance of protein modifications using sequence embeddings
23
+ 4. Establish connections between protein networks and phenotypic outcomes
24
+
25
+ ## Data Sources
26
+ The project utilized three major public databases:
27
+ - DepMap: CRISPR-based experimental data measuring protein deletion effects on cancer cell proliferation
28
+ - TCGA: The Cancer Genome Atlas data
29
+ - Longevity Database: Species longevity information
30
+
31
+ ## Methodological Approach
32
+
33
+ ### Model Development
34
+ The team developed three distinct models:
35
+
36
+ 1. **Baseline Model**
37
+    - Fully connected network predicting CRISPR scores from embeddings
38
+    - Achieved a Pearson correlation of 0.55 with ground truth
39
+    - Outperformed the K-nearest neighbors baseline
40
+    - Performance varied with distance to the nearest neighbors in the training set
41
+
42
+ 2. **Cell Line-Specific Model**
43
+    - Incorporated cell line identity through a one-hot embedding
44
+    - Included mutation status (wild type vs. mutated)
45
+    - Achieved a Pearson correlation of 0.44 with ground truth
46
+    - Limited success in predicting cell line-specific differences
47
+
48
+ 3. **PPI-Informed Model**
49
+    - Integrated protein-protein interaction data
50
+    - Results comparable to the cell line-specific model
51
+    - Limited additional performance gain from PPI integration
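As a rough illustration of the cell line-specific inputs listed above, the sketch below concatenates a protein embedding with a one-hot cell line vector and a mutation flag. The function name `build_features`, the cell line names, and the 4-dimensional toy embedding are hypothetical; the README does not specify the actual feature layout.

```python
from typing import List, Sequence


def one_hot(index: int, size: int) -> List[float]:
    """Return a vector of length `size` with 1.0 at `index` and 0.0 elsewhere."""
    if not 0 <= index < size:
        raise ValueError("index out of range")
    return [1.0 if i == index else 0.0 for i in range(size)]


def build_features(embedding: Sequence[float], cell_line: str,
                   cell_lines: Sequence[str], is_mutated: bool) -> List[float]:
    """Concatenate [embedding | one-hot cell line | mutation flag].

    The layout is an assumption made for illustration only.
    """
    idx = list(cell_lines).index(cell_line)  # raises ValueError if unknown
    flag = [1.0 if is_mutated else 0.0]
    return list(embedding) + one_hot(idx, len(cell_lines)) + flag


# Toy example with made-up values (not project data).
CELL_LINES = ["line_a", "line_b", "line_c"]
features = build_features([0.1, -0.3, 0.7, 0.0], "line_b", CELL_LINES, is_mutated=True)
```

With 1,150 cell lines, a one-hot vector of this kind would add 1,150 input dimensions, which is one reason a learned cell line embedding is sometimes used instead.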
52
+
53
+ ### Additional Analyses
54
+ - Species Longevity Analysis
55
+   - Challenges in cross-phylogenetic prediction
56
+   - Limited success across different orders of the phylogenetic tree
57
+
58
+ - TCGA Patient Survival Analysis
59
+   - Achieved significant correlations
60
+   - Performance below initial expectations
61
+
62
+ ## Key Findings
63
+ 1. ESM3 embeddings contain valuable functional information
64
+ 2. Simple models can outperform basic baselines
65
+ 3. The current approach is limited in capturing subtle effects
66
+ 4. Predicting mutation-specific impacts remains challenging
67
+
68
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/62a3bb1cd0d8c2c2169f0b88/gDGJH2ErnqGcHoF9DWuBc.png)
69
+
70
+ ## Future Directions
71
+ 1. Integration of additional data types:
72
+    - Copy number variation
73
+    - Transcriptomic information
74
+ 2. Exploration of amino acid level embeddings
75
+ 3. Enhanced signal processing methods
76
+ 4. Improved model architectures
77
+
78
+ ## Technical Achievements
79
+ - Successful implementation of protein embedding analysis
80
+ - Development of multiple predictive models
81
+ - Integration of complex biological datasets
82
+ - Novel approaches to phenotype prediction
83
+
84
+ ## Limitations and Challenges
85
+ 1. Limited success in cell line-specific predictions
86
+ 2. Challenges in cross-phylogenetic predictions
87
+ 3. Limited ability to detect subtle effects
88
+ 4. Data integration complexities
89
+
90
+ ## Impact and Applications
91
+ - Enhanced understanding of disease mechanisms
92
+ - Improved drug target identification
93
+ - Better prediction of genetic mutation effects
94
+ - Advanced protein function analysis
95
+
96
+ # PhenoSeq Longevity Analysis Component
97
+
98
+ This analysis revealed both the potential and limitations of using protein sequence data for predicting species longevity, highlighting the importance of taxonomic relationships in such predictions.
99
+
100
+ ## Overview
101
+ The longevity analysis component of PhenoSeq investigated the relationship between protein sequences and species lifespan across different taxonomic orders, with a particular focus on Primates, Chiroptera (bats), and Cetacea (whales).
102
+
103
+ ## Key Findings
104
+
105
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/62a3bb1cd0d8c2c2169f0b88/vS8Fe-q1lY5Oiro4FPVEP.png)
106
+
107
+ ### 1. Taxonomic Order Analysis
108
+ - The study examined lifespan distributions across multiple orders, including:
109
+   - Rodentia
110
+   - Artiodactyla
111
+   - Carnivora
112
+   - Primates
113
+   - Chiroptera
114
+   - Cetacea
115
+   - Diprotodontia
116
+   - Perissodactyla
117
+
118
+ ### 2. Prediction Performance
119
+ - Mean predictions across orders were relatively successful
120
+ - However, predictions within individual orders showed limited accuracy
121
+ - High-performing proteins were not well conserved between different orders
122
+
123
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/62a3bb1cd0d8c2c2169f0b88/V9r5W8k5K9BgbuJfkf1XQ.png)
124
+
125
+ ### 3. Model Architecture Insights
126
+ - Later layers in the neural network did not provide significant additional information
127
+ - Training curves showed convergence but with limitations in prediction accuracy
128
+
129
+ ### 4. Protein Embedding Analysis
130
+ - Analysis of protein ALDOB showed that:
131
+   - Nearest neighbor species in embedding space typically belonged to the same Order/Family
132
+   - Strong taxonomic clustering was observed in the embedding space
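The nearest-neighbor observation above can be quantified with a simple check: for each species, find its closest neighbor in embedding space and test whether the two share a taxon. The sketch below is a minimal version using cosine similarity; the choice of metric, the function names, and the 2-dimensional toy embeddings are assumptions for illustration, not the project's actual procedure.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def same_taxon_neighbor_fraction(embeddings: Dict[str, Sequence[float]],
                                 taxa: Dict[str, str]) -> float:
    """Fraction of species whose nearest neighbor belongs to the same taxon."""
    hits = 0
    for species, emb in embeddings.items():
        others = [s for s in embeddings if s != species]
        nearest = max(others, key=lambda s: cosine_similarity(emb, embeddings[s]))
        if taxa[nearest] == taxa[species]:
            hits += 1
    return hits / len(embeddings)


# Toy single-protein embeddings (made-up values, not project data).
toy_embeddings = {
    "human": [1.0, 0.1],
    "chimp": [0.9, 0.2],
    "bat_a": [0.1, 1.0],
    "bat_b": [0.2, 0.9],
}
toy_taxa = {"human": "Primates", "chimp": "Primates",
            "bat_a": "Chiroptera", "bat_b": "Chiroptera"}
fraction = same_taxon_neighbor_fraction(toy_embeddings, toy_taxa)
```

A fraction close to 1 indicates strong taxonomic clustering; comparing it against the fraction expected from shuffled taxon labels would show how much of that is chance.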
133
+
134
+ ### 5. Hierarchical Prediction Accuracy
135
+ Correlation strength increased with taxonomic specificity:
136
+ - Order level: r = 0.8 (271 species across 12 orders)
137
+ - Family level: r = 0.92 (191 species across 27 families)
138
+ - Genus level: r = 0.97 (47 species across 15 genera)
139
+
140
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/62a3bb1cd0d8c2c2169f0b88/qHsUpGuLTIo3Nw3CHJDVM.png)
141
+
142
+ ## Technical Limitations
143
+ - Limited success in cross-order predictions
144
+ - Difficulty in generalizing predictions across distant phylogenetic relationships
145
+ - Need for order/family-specific modeling approaches
146
+
147
+ ## Key Insights
148
+ - Strong within-taxon predictions
149
+ - Decreasing accuracy with increasing phylogenetic distance
150
+ - Need for taxonomic stratification in prediction models
151
+ - High predictive power at genus level suggests strong genetic influence on longevity within closely related species
152
+
153
+
154
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/62a3bb1cd0d8c2c2169f0b88/hzumPD8BXOEAnyLzrCE5T.png)
155
+
156
+ # PhenoSeq DepMap Analysis Component
157
+
158
+ This analysis demonstrated both the potential and current limitations of using protein sequence data to predict cancer-relevant protein functions, highlighting areas for future improvement in protein-phenotype prediction models.
159
+
160
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/62a3bb1cd0d8c2c2169f0b88/_AJXp_IwAx9uzHjlXMLVT.png)
161
+
162
+ ## Overview
163
+ The DepMap component investigated protein function in cancer through CRISPR-based knockout experiments, analyzing 9,353 proteins across 1,150 different cell lines to understand their effects on cancer cell growth.
164
+
165
+ ## Three Models
166
+
167
+ 1. **Baseline Model**
168
+    - Input: Average protein embedding across all cell lines
169
+    - Output: Average CrisprScore across all cell lines
170
+    - Architecture: Simple feedforward network using ESM3-open-small embeddings
171
+    - Performance: Achieved Pearson correlation of 0.55
172
+    - Outperformed KNN baseline across all K values
173
+
174
+ 2. **Cell-line-specific Model**
175
+    - Predicted CrisprScore effects for each protein-cell line combination
176
+    - Performance: Achieved Pearson correlation of 0.44
177
+    - Limited success in predicting protein-specific differences between cell lines
178
+    - Poor correlation (r = 0.01) for individual proteins like MYC across cancer types
179
+
180
+ 3. **PPI-informed Model**
181
+    - Incorporated protein-protein interaction networks
182
+    - Aimed to predict CrisprScore effects by propagating signals through PPI networks
183
+    - Results similar to cell-line-specific model
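To make "propagating signals through PPI networks" concrete, the sketch below performs a single propagation step in which each protein's value is mixed with the mean of its interaction partners. This is a simplified stand-in, not the project's graph neural network; the function name `propagate`, the mixing weight `alpha`, and the toy graph are all hypothetical.

```python
from typing import Dict, List, Tuple


def propagate(scores: Dict[str, float], edges: List[Tuple[str, str]],
              alpha: float = 0.5) -> Dict[str, float]:
    """One propagation step over an undirected PPI graph.

    Each node keeps (1 - alpha) of its own score and takes alpha of the mean
    score of its neighbors. Nodes without neighbors are left unchanged.
    Every protein named in `edges` must also appear in `scores`.
    """
    neighbors = {node: set() for node in scores}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    updated = {}
    for node, own in scores.items():
        if neighbors[node]:
            mean_nb = sum(scores[n] for n in neighbors[node]) / len(neighbors[node])
            updated[node] = (1 - alpha) * own + alpha * mean_nb
        else:
            updated[node] = own
    return updated


# Toy graph P1 - P2 - P3, with P4 isolated (made-up values).
toy_scores = {"P1": 1.0, "P2": 0.0, "P3": 0.0, "P4": 2.0}
toy_edges = [("P1", "P2"), ("P2", "P3")]
after_one_step = propagate(toy_scores, toy_edges)
```

In a learned model the fixed averaging would typically be replaced by trainable message and update functions applied to embedding vectors rather than scalars.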
184
+
185
+ ## Key Findings
186
+
187
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/62a3bb1cd0d8c2c2169f0b88/T-B8Wm66A-oepjyA562zv.png)
188
+
189
+ ### Model Performance
190
+ - Baseline model showed strong general prediction capability
191
+ - Distance to nearest neighbors in training set affected performance
192
+ - Larger networks didn't necessarily improve performance
193
+ - Model demonstrated true learning rather than memorization
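The KNN baseline and the nearest-neighbor distance mentioned above can be sketched in a few lines. The example below predicts a score as the mean of the K closest training points, reports the distance to the single nearest one, and includes a Pearson correlation helper for scoring predictions. Euclidean distance, the function names, and the toy data are assumptions; the README does not specify the metric or the K values used.

```python
import math
from typing import List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0 or std_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (std_x * std_y)


def knn_predict(query: Sequence[float], train_x: List[Sequence[float]],
                train_y: List[float], k: int) -> Tuple[float, float]:
    """Return (mean score of the k nearest training points, distance to the nearest)."""
    ranked = sorted((math.dist(query, x), y) for x, y in zip(train_x, train_y))
    nearest = ranked[:k]
    prediction = sum(y for _, y in nearest) / len(nearest)
    return prediction, ranked[0][0]


# Toy training set of 2-dimensional "embeddings" (made-up values).
toy_x = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
toy_y = [0.0, 1.0, 1.0, 10.0]
prediction, nearest_distance = knn_predict([0.9, 0.1], toy_x, toy_y, k=2)
```

Binning test proteins by `nearest_distance` and computing `pearson` within each bin is one way to examine how performance changes with distance from the training set.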
194
+
195
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/62a3bb1cd0d8c2c2169f0b88/wllLuMZmsRpjmZRr1EJ0S.png)
196
+
197
+ ### Technical Insights
198
+ - Hyperparameter sweeps showed similar training patterns across:
199
+   - Different numbers of layers
200
+   - Various hidden dimensions
201
+ - Model struggled with fine-grained predictions of mutation effects
202
+
203
+ ### Limitations
204
+ - Poor performance in predicting effects of small sequence differences
205
+ - Limited ability to distinguish between mutations of the same protein
206
+ - Challenges in cell-line-specific predictions
207
+
208
+ ## Technical Details
209
+ - CrisprScore distribution showed varied effects of protein deletion
210
+ - Different proteins showed distinct patterns of effect across cell lines
211
+ - Model performance was consistent across different architectural choices
212
+
213
+ ## Future Implications
214
+ - Need for improved mutation-specific prediction capabilities
215
+ - Potential for enhanced protein function understanding
216
+ - Opportunity for better cancer-specific protein effect prediction
217
+