gbyuvd committed
Commit
d4b0b88
1 Parent(s): 3479f4d

Update README.md

Files changed (1)
  1. README.md +17 -5
README.md CHANGED
@@ -12,6 +12,7 @@ pipeline_tag: fill-mask
12
  tags:
13
  - fill-mask
14
  - chemistry
 
15
  widget:
16
  - text: >-
17
  [C] [C] [=Branch1] [C] [MASK] [O] [C] [C] [N+1] [Branch1] [C] [C] [Branch1]
@@ -55,8 +56,10 @@ license: cc-by-nc-sa-4.0
55
  # ChemFIE Base - A Lightweight Model Pre-trained on Molecular SELFIES
56
 
57
  This is a lightweight model pre-trained on SELFIES (Self-Referencing Embedded Strings) representations of molecules. It was trained on 2.7M unique and valid molecules taken from COCONUTDB and ChEMBL34, with 7.3M masked examples generated in total. It is a compact model with only 11M parameters, yet it achieves decent performance:
58
- - On varied masking: Perplexity of 1.4759, MLM Accuracy of 87.60%
59
- - On uniform 15% masking: Perplexity of 1.3978, MLM Accuracy of 89.29%
 
 
60
 
61
  The masking strategy for pretraining used a dynamic masking approach, with masking ratios ranging from 15% to 45% based on a simple score that gauges molecular complexity.
62
 
@@ -139,7 +142,7 @@ Three weeks ago, I had an idea to train a sentence transformer based on chemical
139
 
140
  My initial attempt focused on training a sentence transformer based on SELFIES, with the goal of enabling rapid molecule similarity search and clustering. This approach potentially offers advantages over traditional fingerprinting algorithms like MACCS, as the embeddings are context-aware. I decided to fine-tune a relatively lightweight NLP-trained MiniLM model by [Nils Reimers](https://huggingface.co/nreimers/MiniLM-L6-H384-uncased), as I was unsure about training from scratch and didn't even know about pre-training at that time.
141
 
142
- The next challenges were how to construct molecule pairs that are diverse yet informative, and how to label them. After tackling those, I trained the model on a dataset built from natural compounds taken from [COCONUTDB](https://coconut.naturalproducts.net/). After some initial training, I pushed [the model to Hugging Face](https://huggingface.co/gbyuvd/ChemEmbed-v01) to get some feedback. Gladly, [Tom Aarsen](https://huggingface.co/tomaarsen) provided [valuable suggestions](https://huggingface.co/gbyuvd/ChemEmbed-v01/discussions/1), including training a custom tokenizer, exploring [Matryoshka embeddings](https://huggingface.co/blog/matryoshka), and considering training from scratch. Implementing Tom's suggestions, specifically training from scratch, is the main goal of this project, as well as a first experience for me.
143
 
144
  Lastly, before going into the details, it's important to note that this is the result of a hands-on learning project, and as such - besides my insufficient knowledge - it may not meet rigorous scientific standards. Like any learning journey, it's messy, and I myself am constrained by financial, computational, and time limitations. I've had to make compromises, such as conducting incomplete experiments and chunking datasets. However, I am more than happy to receive any feedback, so that I can improve both myself and future models/projects. A more detailed article discussing this project in detail is coming soon.
145
 
@@ -200,10 +203,11 @@ To ensure coverage, the tokenizer underwent evaluation to cover all tokens in th
200
 
201
  #### Generating Dynamic Masked Sequence
202
 
203
- The key method in this project is the implementation of a dynamic masking rate based on molecular complexity. I think we can heuristically infer a molecule's complexity from the syntactic characteristics of its SELFIES string. Simpler tokens have only one character, such as "*[N]*" (*l = 1*; ignoring the brackets), while more complex ones look like "*.[N+1]*" (*l = 4*). Atoms that are relatively rare compared to CHONS, like *[Na]* (*l = 2*), and ionized metals like *[Fe+3]* (*l = 4*), also vary in complexity. To normalize this and capture the density of long tokens, we can sum the ratios of each token's length to the total number of tokens in the string. I will refer to this simple score as the "complexity score" hereafter. We can then normalize it and use it to determine a variable masking probability ranging from 15% to 45%. Additionally, we can employ three different masking strategies to introduce further variability. This approach aims to create a more challenging and diverse training dataset while getting the most out of it, potentially leading to a more robust and generalizable model for molecular representation learning. In short, each SELFIES string's complexity is calculated as the logarithm of the sum of token-length-to-token-count ratios.
204
 
205
 
206
  **1. Complexity Score Calculation**
 
207
  The raw complexity score is calculated using the formula:
208
 
209
  $$Sc = \log\left[\sum\left(\frac{l_{\text{token}}}{n_{\text{tokens}}}\right)\right]$$
@@ -227,16 +231,19 @@ Raw complexity score: 1.5163
227
  ```
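
As a rough sketch of how this raw score could be computed: this is my reconstruction from the description and formula above, not the project's actual code. The regex tokenizer, the choice to exclude brackets from token length, and the handling of dot separators are assumptions and may differ from the original implementation.

```python
import math
import re

def raw_complexity(selfies_str: str) -> float:
    """Raw complexity score: log of the summed ratios of each token's
    character length (brackets excluded, per the description above)
    to the total number of tokens in the string."""
    tokens = re.findall(r"\[[^\]]*\]", selfies_str)      # e.g. ['[C]', '[N+1]', '[=Branch1]']
    n_tokens = len(tokens)                               # dot separators ('.') are ignored here
    token_lengths = [len(tok) - 2 for tok in tokens]     # strip the surrounding brackets
    return math.log(sum(l / n_tokens for l in token_lengths))
```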
228
 
229
  **2. Normalization**
 
230
  The raw score is then normalized to a range of 0-1 using predefined minimum (1.39) and maximum (1.69) normalization values, which were determined from the dataset's score distribution:
231
 
232
  $$Sc_{norm} = \max\left(0, \min\left(1, \frac{Sc - min_{norm}}{max_{norm} - min_{norm}}\right)\right)$$
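
A one-line illustration of this clamping, assuming the 1.39 and 1.69 bounds quoted above (the function name is mine):

```python
def normalize_score(sc: float, min_norm: float = 1.39, max_norm: float = 1.69) -> float:
    """Clamp the raw complexity score into the [0, 1] range using the dataset-derived bounds."""
    return max(0.0, min(1.0, (sc - min_norm) / (max_norm - min_norm)))
```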
233
 
234
  **3. Mapping to Masking Probability**
 
235
  I decided to use a quadratic mapping with a 0.3 scaling factor, ensuring a smooth adjustment of the masking probability within the 15% to 45% range, with more complex molecules receiving a higher masking probability:
236
 
237
  $$P_{\text{mask}} = 0.15 + 0.3 \cdot (Sc_{norm})^2$$
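
Continuing the sketch, the quadratic mapping plus a worked example using the raw score of 1.5163 from the example above (helper names are mine, not from the original code):

```python
def masking_probability(sc_norm: float) -> float:
    """Quadratic mapping from normalized complexity to a masking ratio in [0.15, 0.45]."""
    return 0.15 + 0.3 * sc_norm ** 2

sc_norm = normalize_score(1.5163)      # ~0.42
p_mask = masking_probability(sc_norm)  # ~0.20, i.e. roughly 20% of tokens get masked
```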
238
 
239
  **4. Multi-Strategy Masking**
 
240
  Three different masking strategies are employed for each SELFIES string:
241
  - Main Strategy:
242
  - 80% chance to mask the token
@@ -251,11 +258,13 @@ Three different masking strategies are employed for each SELFIES string:
251
  - 20% chance to keep the original token
252
  - 20% chance to replace with a random token
253
 
254
- **5. Data Augmentation:**
 
255
  - Each SELFIES string is processed three times, once with each masking strategy.
256
  - This hopefully triples the effective dataset size and introduces variability in the masking patterns.
257
 
258
  **6. Masking Process**
 
259
  - Tokens are randomly selected for masking based on the calculated masking probability.
260
  - Special tokens ([CLS] and [SEP]) are never masked.
261
  - The number of tokens to be masked is determined by the masking probability and the length of the SELFIES string.
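
To tie steps 4-6 together, here is a minimal, hypothetical sketch of a single masking pass. Only part of the strategy splits is stated above, so the default probabilities, helper names, and vocabulary handling are my assumptions rather than the project's actual code:

```python
import random

def apply_masking(token_ids, p_mask, mask_id, special_ids, vocab_ids,
                  p_mask_token=0.8, p_keep=0.1, p_random=0.1):
    """One masking pass: choose positions according to p_mask (never special tokens),
    then for each chosen position either substitute [MASK], keep the original token,
    or substitute a random vocabulary token, according to the strategy's split.
    Each of the three strategies described above would pass its own split;
    the 0.8/0.1/0.1 defaults here are only illustrative."""
    candidates = [i for i, t in enumerate(token_ids) if t not in special_ids]
    n_to_mask = max(1, round(p_mask * len(candidates)))
    masked = list(token_ids)
    for i in random.sample(candidates, min(n_to_mask, len(candidates))):
        r = random.random()
        if r < p_mask_token:
            masked[i] = mask_id
        elif r < p_mask_token + p_keep:
            pass                                  # keep the original token
        else:
            masked[i] = random.choice(vocab_ids)
    return masked

# Step 5 (data augmentation): each SELFIES string is processed once per strategy,
# yielding three differently masked copies per input sequence.
```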
@@ -297,6 +306,7 @@ For more information about Ranger21, you could check out [this repository](https
297
 
298
  * Dataset: `main-eval`
299
  * Number of test examples: 810,108
 
300
  #### Varied Masking Test
301
 
302
  | Chunk | Avg Loss | Perplexity | MLM Accuracy |
@@ -337,7 +347,9 @@ For more information about Ranger21, you could check out [this repository](https
337
  ###### Hardware
338
 
339
  Platform: Paperspace's Gradients
 
340
  Compute: Free-P5000 (16 GB GPU, 30 GB RAM, 8 vCPU)
 
341
  ###### Software
342
 
343
  - Python: 3.9.13
 
12
  tags:
13
  - fill-mask
14
  - chemistry
15
+ - selfies
16
  widget:
17
  - text: >-
18
  [C] [C] [=Branch1] [C] [MASK] [O] [C] [C] [N+1] [Branch1] [C] [C] [Branch1]
 
56
  # ChemFIE Base - A Lightweight Model Pre-trained on Molecular SELFIES
57
 
58
  This is a lightweight model pre-trained on SELFIES (Self-Referencing Embedded Strings) representations of molecules. It was trained on 2.7M unique and valid molecules taken from COCONUTDB and ChEMBL34, with 7.3M masked examples generated in total. It is a compact model with only 11M parameters, yet it achieves decent performance:
59
+ - On varied masking:
60
+ - Perplexity of 1.4759, MLM Accuracy of 87.60%
61
+ - On uniform 15% masking:
62
+ - Perplexity of 1.3978, MLM Accuracy of 89.29%
63
 
64
  The masking strategy for pretraining used a dynamic masking approach, with masking ratios ranging from 15% to 45% based on a simple score that gauges molecular complexity.
65
 
 
142
 
143
  My initial attempt focused on training a sentence transformer based on SELFIES, with the goal of enabling rapid molecule similarity search and clustering. This approach potentially offers advantages over traditional fingerprinting algorithms like MACCS, as the embeddings are context-aware. I decided to fine-tune a relatively lightweight NLP-trained MiniLM model by [Nils Reimers](https://huggingface.co/nreimers/MiniLM-L6-H384-uncased), as I was unsure about training from scratch and didn't even know about pre-training at that time.
144
 
145
+ The next challenges were how to construct molecule pairs that are diverse yet informative, and how to label them. After tackling those, I trained the model on a dataset built from natural compounds taken from [COCONUTDB](https://coconut.naturalproducts.net/). After some initial training, I pushed [the model to Hugging Face](https://huggingface.co/gbyuvd/ChemEmbed-v01) to get some feedback. Gladly, [Tom Aarsen](https://huggingface.co/tomaarsen) provided [valuable suggestions](https://huggingface.co/gbyuvd/ChemEmbed-v01/discussions/1), including training a custom tokenizer, exploring [Matryoshka embeddings](https://huggingface.co/blog/matryoshka), and considering training from scratch. Implementing Aarsen's suggestions, specifically training from scratch, is the main goal of this project, as well as a first experience for me.
146
 
147
  Lastly, before going into the details, it's important to note that this is the result of a hands-on learning project, and as such - besides my insufficient knowledge - it may not meet rigorous scientific standards. Like any learning journey, it's messy, and I myself am constrained by financial, computational, and time limitations. I've had to make compromises, such as conducting incomplete experiments and chunking datasets. However, I am more than happy to receive any feedback, so that I can improve both myself and future models/projects. A more detailed article discussing this project in detail is coming soon.
148
 
 
203
 
204
  #### Generating Dynamic Masked Sequence
205
 
206
+ The key method in this project is the implementation of a dynamic masking rate based on molecular complexity. I think we can heuristically infer a molecule's complexity from the syntactic characteristics of its SELFIES string. Simpler tokens have only one character, such as "*[N]*" (*l = 1*; ignoring the brackets), while more complex ones look like "*.[N+1]*" (*l = 4*). Atoms that are relatively rare compared to CHONS, like *[Na]* (*l = 2*), and ionized metals like *[Fe+3]* (*l = 4*), also vary in complexity. To normalize this and capture the density of long tokens, we can sum the ratios of each token's length to the total number of tokens in the string. I will refer to this simple score as the "complexity score" hereafter. We can then normalize it and use it to determine a variable masking probability ranging from 15% to 45%. Additionally, we can employ three different masking strategies to introduce further variability. This approach aims to create a more challenging and diverse training dataset while getting the most out of it, potentially leading to a more robust and generalizable model for molecular representation learning. In short, each SELFIES string's complexity is calculated as the logarithm of the sum of token-length-to-token-count ratios.
207
 
208
 
209
  **1. Complexity Score Calculation**
210
+
211
  The raw complexity score is calculated using the formula:
212
 
213
  $$Sc = \log\left[\sum\left(\frac{l_{\text{token}}}{n_{\text{tokens}}}\right)\right]$$
 
231
  ```
232
 
233
  **2. Normalization**
234
+
235
  The raw score is then normalized to a range of 0-1 using predefined minimum (1.39) and maximum (1.69) normalization values, which were determined from the dataset's score distribution:
236
 
237
  $$Sc_{norm} = \max\left(0, \min\left(1, \frac{Sc - min_{norm}}{max_{norm} - min_{norm}}\right)\right)$$
238
 
239
  **3. Mapping to Masking Probability**
240
+
241
  I decided to use a quadratic mapping with a 0.3 scaling factor, ensuring a smooth adjustment of the masking probability within the 15% to 45% range, with more complex molecules receiving a higher masking probability:
242
 
243
  $$P_{\text{mask}} = 0.15 + 0.3 \cdot (Sc_{norm})^2$$
244
 
245
  **4. Multi-Strategy Masking**
246
+
247
  Three different masking strategies are employed for each SELFIES string:
248
  - Main Strategy:
249
  - 80% chance to mask the token
 
258
  - 20% chance to keep the original token
259
  - 20% chance to replace with a random token
260
 
261
+ **5. Data Augmentation**
262
+
263
  - Each SELFIES string is processed three times, once with each masking strategy.
264
  - This hopefully triples the effective dataset size and introduces variability in the masking patterns.
265
 
266
  **6. Masking Process**
267
+
268
  - Tokens are randomly selected for masking based on the calculated masking probability.
269
  - Special tokens ([CLS] and [SEP]) are never masked.
270
  - The number of tokens to be masked is determined by the masking probability and the length of the SELFIES string.
 
306
 
307
  * Dataset: `main-eval`
308
  * Number of test examples: 810,108
309
+
310
  #### Varied Masking Test
311
 
312
  | Chunk | Avg Loss | Perplexity | MLM Accuracy |
 
347
  ###### Hardware
348
 
349
  Platform: Paperspace's Gradients
350
+
351
  Compute: Free-P5000 (16 GB GPU, 30 GB RAM, 8 vCPU)
352
+
353
  ###### Software
354
 
355
  - Python: 3.9.13