---
datasets:
- COCONUTDB
- ChemBL34
language:
- code
library_name: transformers
metrics:
- perplexity
- accuracy
pipeline_tag: fill-mask
tags:
- fill-mask
- chemistry
- selfies
widget:
- text: >-
    [C] [C] [=Branch1] [C] [MASK] [O] [C] [C] [N+1] [Branch1] [C] [C] [Branch1]
    [C] [C] [C]
  example_title: '[=O]'
- text: >-
    [O-1] [P] [=Branch1] [C] [=O] [Branch1] [C] [MASK] [O] [P] [=Branch1] [C]
    [=O] [Branch1] [C] [O-1] [O-1] .[99Tc+4]
  example_title: '[O-1]'
model-index:
- name: chemselfies-base-bertmlm
  results:
  - task:
      type: fill-mask
      name: Fill-Mask
    dataset:
      name: main-eval-uniform
      type: main-eval-uniform
    metrics:
    - type: perplexity
      value: 1.3978
      name: Perplexity
    - type: accuracy
      value: 0.8929
      name: MLM Accuracy
  - task:
      type: fill-mask
      name: Fill-Mask
    dataset:
      name: main-eval-varied
      type: main-eval-varied
    metrics:
    - type: perplexity
      value: 1.4759
      name: Perplexity
    - type: accuracy
      value: 0.876
      name: MLM Accuracy
license: cc-by-nc-sa-4.0
---
# ChemFIE Base - A Lightweight Model Pre-trained on Molecular SELFIES
ChemFIE Base is a lightweight model pre-trained on SELFIES (Self-Referencing Embedded Strings) representations of molecules. It was trained on 2.7M unique, valid molecules taken from COCONUTDB and ChemBL34, which yielded 7.3M generated masked examples. With only 11M parameters, it is compact while still achieving decent performance:
- On varied masking:
  - Perplexity of 1.4759, MLM Accuracy of 87.60%
- On uniform 15% masking:
  - Perplexity of 1.3978, MLM Accuracy of 89.29%

Pretraining used a dynamic masking approach, with masking ratios ranging from 15% to 45% based on a simple score that gauges molecular complexity.
### Disclaimer: For Academic Purposes Only
The information and model provided are for academic purposes only. They are intended for educational and research use, and should not be used for any commercial or legal purposes. The author does not guarantee the accuracy, completeness, or reliability of the information.
# Table of Contents
1. [Model Details](#model-details)
2. [Usage](#usage)
3. [Bias and Limitations](#bias-and-limitations)
4. [Background](#background)
5. [Training Details](#training-details)
6. [Evaluation](#evaluation)
7. [Interpretability](#interpretability)
8. [Technical Specifications](#technical-specifications)
9. [Citation](#citation)
10. [Contact & Support My Work](#contact--support-my-work)
## Model Details
### Model Description
- **Model type:** Transformer (BertForMaskedLM)
- **Language:** SELFIES
- **Maximum Sequence Length:** 512 tokens
- **License:** CC-BY-NC-SA 4.0
- **Training Dataset:** COCONUTDB and ChemBL34
- **Resources for more information:**
  - GitHub Repository (coming soon)
  - Detailed article (coming soon)
## Usage
### Intended Use
You can use this model for masked language modeling but it's mostly intended to be fine-tuned on a downstream task.
### Direct Use
You can use this model directly with a pipeline for masked language modeling:
```python
from transformers import pipeline

# Original molecule, for reference (its [=O] token is masked below):
# text = "[C] [C] [=Branch1] [C] [=O] [O] [C] [C] [N+1] [Branch1] [C] [C] [Branch1] [C] [C] [C]"
text = "[C] [C] [=Branch1] [C] [MASK] [O] [C] [C] [N+1] [Branch1] [C] [C] [Branch1] [C] [C] [C]"

# Load the fill-mask pipeline and ask for the five most likely tokens.
mask_filler = pipeline("fill-mask", model="gbyuvd/chemselfies-base-bertmlm")
mask_filler(text, top_k=5)
"""
[{'score': 0.9974672794342041,
'token': 8,
'token_str': '[=O]',
'sequence': '[C] [C] [=Branch1] [C] [=O] [O] [C] [C] [N+1] [Branch1] [C] [C] [Branch1] [C] [C] [C]'},
{'score': 0.002122726757079363,
'token': 34,
'token_str': '[=S]',
'sequence': '[C] [C] [=Branch1] [C] [=S] [O] [C] [C] [N+1] [Branch1] [C] [C] [Branch1] [C] [C] [C]'},
{'score': 0.0002627855574246496,
'token': 11,
'token_str': '[=N]',
'sequence': '[C] [C] [=Branch1] [C] [=N] [O] [C] [C] [N+1] [Branch1] [C] [C] [Branch1] [C] [C] [C]'},
{'score': 8.700760372448713e-05,
'token': 1,
'token_str': '[C]',
'sequence': '[C] [C] [=Branch1] [C] [C] [O] [C] [C] [N+1] [Branch1] [C] [C] [Branch1] [C] [C] [C]'},
{'score': 2.8387064958224073e-05,
'token': 2,
'token_str': '[=C]',
'sequence': '[C] [C] [=Branch1] [C] [=C] [O] [C] [C] [N+1] [Branch1] [C] [C] [Branch1] [C] [C] [C]'}]
"""
```
## Background
Three weeks ago, I had an idea to train a sentence transformer on a chemical "language", which, as far as I could find at the time, did not yet exist. While working on it, I found a wonderful and human-readable new molecular representation called SELFIES, developed by the [Aspuru-Guzik group](https://github.com/aspuru-guzik-group/selfies). I found this representation fascinating and worth exploring because of its robustness and because, so far at least, it has proven versatile and makes models easier to train. For more information on SELFIES, you can read this [blogpost](https://aspuru.substack.com/p/molecular-graph-representations-and) or check out [their GitHub](https://github.com/aspuru-guzik-group/selfies).
My initial attempt focused on training a sentence transformer based on SELFIES, with the goal of enabling rapid molecule similarity search and clustering. This approach potentially offers advantages over traditional fingerprinting algorithms like MACCS, as the embeddings are context-aware. I decided to fine-tune a relatively lightweight, NLP-trained MiniLM model by [Nils Reimers](https://huggingface.co/nreimers/MiniLM-L6-H384-uncased), as I was unsure about training from scratch and didn't even know about pre-training at that time.
The next challenges were how to build molecule pairs that are diverse yet informative, and how to label them. After tackling those, I trained the model on a dataset built from natural compounds taken from [COCONUTDB](https://coconut.naturalproducts.net/). After some initial training, I pushed [the model to Hugging Face](https://huggingface.co/gbyuvd/ChemEmbed-v01) to get some feedback. Thankfully, [Tom Aarsen](https://huggingface.co/tomaarsen) provided [valuable suggestions](https://huggingface.co/gbyuvd/ChemEmbed-v01/discussions/1), including training a custom tokenizer, exploring [Matryoshka embeddings](https://huggingface.co/blog/matryoshka), and considering training from scratch. Implementing Aarsen's suggestions, specifically training from scratch, is the main goal of this project, and it is also my first experience doing so.
Lastly, before going into the details, it's important to note that this is the result of a hands-on learning project and, given my limited knowledge, it may not meet rigorous scientific standards. Like any learning journey, it's messy, and I am constrained by financial, computational, and time limitations. I've had to make compromises, such as conducting incomplete experiments and chunking datasets. However, I am more than happy to receive any feedback, so that I can improve both myself and future models/projects. A more detailed article discussing this project is coming soon.
## Training Details
### Training Data
##### Data Sources
The dataset combines two sources of molecular data:
1. Natural compounds from [COCONUTDB](https://coconut.naturalproducts.net/) (Sorokina et al., 2021)
2. Bioactive compounds from [ChemBL34](https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest/) (Zdrazil et al., 2023)
##### Data Preparation
1. Fetching: Canonical SMILES were extracted from both databases.
2. De-duplication:
   - Each dataset was de-duplicated internally.
   - The combined dataset ("All") was further de-duplicated to ensure unique entries.
3. Validity Check and Conversion: A dual validity check was performed, using RDKit and by converting the SMILES into SELFIES.
##### Filtering and Chunking
- Filtering by Lipinski's Rule of Five or its subsets (e.g., Mw < 500 and LogP < 5) was omitted to maintain broader coverage for potential future expansion to organic and inorganic molecules such as those in PubChem and ZINC20.
- The dataset was chunked into 13 parts, each containing 203,458 molecules, to accommodate the 6-hour time limit on Paperspace's Gradient.
- Any leftover data was randomly distributed across the 13 chunks to ensure even distribution.
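
The chunking step can be pictured with a small standard-library sketch. This is only an illustration of the idea (shuffle, cut equal-size chunks, scatter the remainder at random); `chunk_evenly` is a hypothetical helper, and the seed and toy dataset size are assumptions, not the code used for this model:

```python
import random

def chunk_evenly(items, n_chunks=13, seed=0):
    """Shuffle items, cut equal-size chunks, then scatter the leftovers."""
    rng = random.Random(seed)  # assumption: fixed seed for reproducibility
    pool = list(items)
    rng.shuffle(pool)
    size = len(pool) // n_chunks
    chunks = [pool[i * size:(i + 1) * size] for i in range(n_chunks)]
    # Leftover items (fewer than n_chunks) each go to a randomly chosen chunk.
    for item in pool[n_chunks * size:]:
        rng.choice(chunks).append(item)
    return chunks

# Toy stand-in for the molecule list.
chunks = chunk_evenly(range(2_700), n_chunks=13)
print([len(c) for c in chunks])
```

Because the remainder is always smaller than the number of chunks, the chunk sizes can differ only by the handful of leftover items.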
##### Validation Set
- 10% of each chunk was set aside for validation.
- These validation sets were combined into a main test set, totaling 810,108 examples.
| Dataset | Number of Valid Unique Molecules | Generated Training Examples |
| ---------- | -------------------------------- | --------------------------- |
| Chunk I | 207,727 | 560,859 |
| Chunk II | 207,727 | 560,859 |
| Chunk III | 207,727 | 560,859 |
| Chunk IV | 207,727 | 560,859 |
| Chunk V | 207,727 | 560,859 |
| Chunk VI | 207,727 | 560,859 |
| Chunk VII | 207,727 | 560,859 |
| Chunk VIII | 207,727 | 560,859 |
| Chunk IX | 207,727 | 560,859 |
| Chunk X | 207,727 | 560,859 |
| Chunk XI | 207,727 | 560,859 |
| Chunk XII | 207,727 | 560,859 |
| Chunk XIII | 207,738 | 560,889 |
| Total | 2,700,462 | 7,291,197 |
### Training Procedure
#### Tokenizer Setup
The tokenizer is a combination of my own pretrained tokenizer on the merged COCONUTDB+ChemBL34 SELFIES dataset with vocabularies from zpn's [word-level tokenizer](https://huggingface.co/zpn/pubchem_selfies_tokenizer_wordlevel) trained on PubChem. This approach was chosen to ensure comprehensive coverage while maintaining relevance to biological compounds. The tokenizer was modified to suit the BertTokenizer format, using whitespace to split input tokens.
When using or fine-tuning this model, it's crucial to separate each SELFIES token with a whitespace. For example:
```
[C] [N] [C] [C] [C] [C@H1] [Ring1] [Branch1] [C] [=C] [N] [=C] [C] [=C] [Ring1] [=Branch1]
```
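
SELFIES strings are often written without separators (for example `[C][=O]`), so they need to be split before being passed to this tokenizer. Below is a minimal standard-library sketch, assuming every token is a bracketed symbol that may carry a leading dot (as in the dot-prefixed tokens mentioned in this card). `split_selfies_ws` is a hypothetical helper and not part of the released tokenizer:

```python
import re

# Assumption: a token is a bracketed symbol, optionally prefixed by "."
# (the fragment separator), e.g. "[C]", "[=O]", ".[Cl]".
_TOKEN_RE = re.compile(r"\.?\[[^\[\]]+\]")

def split_selfies_ws(selfies: str) -> str:
    """Return the SELFIES string with tokens separated by single spaces."""
    compact = "".join(selfies.split())  # drop any existing whitespace
    tokens = _TOKEN_RE.findall(compact)
    # Fail loudly if some characters were not covered by a token.
    if "".join(tokens) != compact:
        raise ValueError(f"unrecognized characters in: {selfies!r}")
    return " ".join(tokens)

print(split_selfies_ws("[C][C][=Branch1][C][=O][O][C]"))
```

Raising on leftover characters is a deliberate choice here: silently dropping part of a molecule would be harder to notice than an error.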
To ensure coverage, the tokenizer underwent evaluation to cover all tokens in the training data. Unrecognized tokens were identified and incorporated into the tokenizer. Additionally, my previous pre-training issues, such as improper tokenization of dot symbol prefixes in complex molecules (e.g., "*.[Cl]*"), were addressed and resolved.
#### Generating Dynamic Masked Sequence
The key method in this project is a dynamic masking rate based on molecular complexity. I think a molecule's complexity can be inferred heuristically from the syntactic characteristics of its SELFIES string. Simpler tokens have only one character, such as "*[N]*" (*l = 1*; ignoring the brackets), while more complex ones look like "*.[N+1]*" (*l = 4*). Atoms that are rare relative to CHONS, like *[Na]* (*l = 2*), and ionized metals like *[Fe+3]* (*l = 4*), also add to the complexity. To capture the density of multi-character tokens, we sum the ratio of each token's length to the number of tokens in the molecule and take the logarithm of that sum. I will refer to this simple score as the "complexity score" hereafter.

We can then normalize the score and use it to set a variable masking probability ranging from 15% to 45%. Additionally, three different masking strategies introduce further variability. This approach aims to create a more challenging and diverse training dataset while getting the most out of the data, potentially leading to a more robust and generalizable model for molecular representation learning.
**1. Complexity Score Calculation**
The raw complexity score is calculated using the formula:
$$Sc = \log\left[\sum\left(\frac{l_{\text{token}}}{n_{\text{tokens}}}\right)\right]$$
Example outputs:
```
Sentence A:
Tokens: ['[C]', '[C]', '[=Branch1]', '[C]', '[=O]', '[O]', '[C]']
Sum of token lengths: 29
Number of tokens: 7
Raw complexity score: 1.4214
==================================================
Sentence B:
Tokens: ['[C]', '[N+1]', '[Branch1]', '[C]', '[C]', '[Branch1]', '[C]', '[C]', '[C]']
Sum of token lengths: 41
Number of tokens: 9
Raw complexity score: 1.5163
```
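
The two example scores can be reproduced with a few lines of Python. This is a sketch inferred from the printed numbers rather than the original code: the values match when the logarithm is the natural log and token lengths are counted with their brackets included (for example `len("[C]") == 3`). `raw_complexity` is a hypothetical name:

```python
import math

def raw_complexity(tokens):
    """Natural log of the mean token length (characters, brackets included)."""
    total_length = sum(len(tok) for tok in tokens)
    return math.log(total_length / len(tokens))

sentence_a = ["[C]", "[C]", "[=Branch1]", "[C]", "[=O]", "[O]", "[C]"]
sentence_b = ["[C]", "[N+1]", "[Branch1]", "[C]", "[C]",
              "[Branch1]", "[C]", "[C]", "[C]"]

print(f"{raw_complexity(sentence_a):.4f}")  # 1.4214 (29 characters / 7 tokens)
print(f"{raw_complexity(sentence_b):.4f}")  # 1.5163 (41 characters / 9 tokens)
```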
**2. Normalization**
The raw score is then normalized to the range 0-1 using predefined minimum (1.39) and maximum (1.69) values, which were determined from the dataset's score distribution:
$$Sc_{norm} = \max\left(0, \min\left(1, \frac{Sc - min_{norm}}{max_{norm} - min_{norm}}\right)\right)$$
**3. Mapping to Masking Probability**
I decided to use a quadratic mapping scaled by 0.3 (the width of the 15%-45% range), so that the masking probability rises smoothly from 15% to 45%, with more complex molecules receiving a higher masking probability:
$$P_{\text{mask}} = 0.15 + 0.3 \times (Sc_{norm})^2$$
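
Steps 2 and 3 combine into a short function. The sketch below uses the constants quoted above; the function and constant names are my own and only illustrate the two formulas:

```python
MIN_NORM, MAX_NORM = 1.39, 1.69  # normalization bounds quoted above
P_MIN, P_RANGE = 0.15, 0.30      # 15% floor plus a 30-point span (up to 45%)

def normalize(raw_score):
    """Clamp the rescaled raw score into [0, 1]."""
    return max(0.0, min(1.0, (raw_score - MIN_NORM) / (MAX_NORM - MIN_NORM)))

def masking_probability(raw_score):
    """Quadratic mapping from the normalized score to a masking probability."""
    return P_MIN + P_RANGE * normalize(raw_score) ** 2

for score in (1.30, 1.4214, 1.5163, 1.69, 1.80):
    print(score, round(masking_probability(score), 4))
```

Because the mapping is quadratic, scores near the lower bound stay close to 15%, and the probability only approaches 45% for molecules near the upper bound.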
**4. Multi-Strategy Masking**
Three different masking strategies are employed for each SELFIES string:
- Main Strategy:
  - 80% chance to mask the token
  - 10% chance to keep the original token
  - 10% chance to replace with a random token
- Alternative Strategy 1:
  - 70% chance to mask the token
  - 15% chance to keep the original token
  - 15% chance to replace with a random token
- Alternative Strategy 2:
  - 60% chance to mask the token
  - 20% chance to keep the original token
  - 20% chance to replace with a random token
**5. Data Augmentation**
- Each SELFIES string is processed three times, once with each masking strategy.
- This triples the number of generated examples per molecule and introduces variability in the masking patterns.
**6. Masking Process**
- Tokens are randomly selected for masking based on the calculated masking probability.
- Special tokens ([CLS] and [SEP]) are never masked.
- The number of tokens to be masked is determined by the masking probability and the length of the SELFIES string.
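
Steps 4 to 6 can be sketched as follows. This is an illustration and not the training code. The exact rule for how many tokens are selected is not given above, so rounding `p_mask * length` with a minimum of one token is an assumption, and `mask_sequence`, the toy vocabulary, and the strategy names are hypothetical:

```python
import random

SPECIAL = {"[CLS]", "[SEP]"}
# (mask, keep, random-replace) chances for the three strategies listed above
STRATEGIES = {"main": (0.80, 0.10, 0.10),
              "alt1": (0.70, 0.15, 0.15),
              "alt2": (0.60, 0.20, 0.20)}

def mask_sequence(tokens, p_mask, strategy, vocab, rng):
    """Return (masked_tokens, labels); labels are None at unselected positions."""
    p_to_mask, p_keep, _ = STRATEGIES[strategy]
    candidates = [i for i, t in enumerate(tokens) if t not in SPECIAL]
    # Assumption: select round(p_mask * length) positions, at least one.
    n_select = min(len(candidates), max(1, round(p_mask * len(candidates))))
    masked, labels = list(tokens), [None] * len(tokens)
    for i in rng.sample(candidates, n_select):
        labels[i] = tokens[i]
        roll = rng.random()
        if roll < p_to_mask:
            masked[i] = "[MASK]"
        elif roll >= p_to_mask + p_keep:
            masked[i] = rng.choice(vocab)  # random replacement
        # otherwise the original token is kept
    return masked, labels

rng = random.Random(0)
tokens = ["[CLS]", "[C]", "[C]", "[=Branch1]", "[C]", "[=O]", "[O]", "[C]", "[SEP]"]
vocab = ["[C]", "[N]", "[O]", "[=O]", "[=C]"]  # toy vocabulary
for name in STRATEGIES:  # one masked copy per strategy
    print(name, mask_sequence(tokens, 0.15, name, vocab, rng)[0])
```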
This methodology aims to create a diverse and challenging dataset for masked language modeling of molecular structures, adapting the masking intensity to the complexity of each molecule and employing multiple masking strategies to improve model robustness and generalization. Also, besides masking differently based on complexity scores, generating the data on the fly may ensure that the data are masked differently in each run and batch, although this still needs further confirmation.
#### Training Hyperparameters
- Batch size = 128
- Number of epochs = 1
- Total steps on all chunks = 56,966
- Training time on each chunk = 03h:24m / ~205 mins
I used the Ranger21 optimizer with these settings:
```
Core optimizer = madgrad
Learning rate of 2e-05
num_epochs of training = ** 1 epochs **
using AdaBelief for variance computation
Warm-up: linear warmup, over 964 iterations (0.22)
Lookahead active, merging every 5 steps, with blend factor of 0.5
Norm Loss active, factor = 0.0001
Stable weight decay of 0.01
Gradient Centralization = On
Adaptive Gradient Clipping = True
clipping value of 0.01
steps for clipping = 0.001
```
I turned off the warm-down, since in prior experiments it led to unstable losses in my case.
For more information about Ranger21, you could check out [this repository](https://github.com/lessw2020/Ranger21).
## Evaluation
* Dataset: `main-eval`
* Number of test examples: 810,108
#### Varied Masking Test
| Chunk | Avg Loss | Perplexity | MLM Accuracy |
| ------- | -------- | ---------- | ------------ |
| I-IV | 0.4547 | 1.5758 | 0.851 |
| V-VIII | 0.4224 | 1.5257 | 0.864 |
| IX-XIII | 0.3893 | 1.4759 | 0.876 |
#### Uniform 15% Masking Test (80%:10%:10%)
| Chunk | Avg Loss | Perplexity | MLM Accuracy |
| ----- | -------- | ---------- | ------------ |
| XII | 0.3349 | 1.3978 | 0.8929 |
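
The perplexity columns are consistent with the usual definition of perplexity as the exponential of the average cross-entropy loss. A quick check, with the loss and perplexity values copied from the two tables above (differences in the last digit come from the losses being rounded to four decimals):

```python
import math

# (average loss, reported perplexity), copied from the evaluation tables
reported = {
    "varied, I-IV": (0.4547, 1.5758),
    "varied, V-VIII": (0.4224, 1.5257),
    "varied, IX-XIII": (0.3893, 1.4759),
    "uniform 15%": (0.3349, 1.3978),
}

for name, (loss, ppl) in reported.items():
    print(f"{name}: exp({loss}) = {math.exp(loss):.4f} (reported {ppl})")
```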
## Interpretability
##### Attention Head Visualization
(coming soon)
##### Neural Stacks Visualization
(coming soon)
##### Attributions in Determining Masked Tokens
(coming soon)
## Technical Specifications
### Model Architecture and Objective
- **Layers**: 8
- **Attention Heads**: 4
- **Hidden Size**: 320
- **Intermediate Size**: 1280 (4H)
- **Attention Type**: SDPA
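
As a rough sanity check on model size, the vocabulary-independent parameters can be counted by hand. This sketch assumes the standard BERT layout used by `BertForMaskedLM` (no pooler, two token-type embeddings, decoder weights tied to the word embeddings) and 512 positions from the maximum sequence length. The vocabulary size is not stated in this card, so it is left out:

```python
HIDDEN, INTERMEDIATE, LAYERS, MAX_POS = 320, 1280, 8, 512

def linear(n_in, n_out):
    return n_in * n_out + n_out  # weights + bias

layer_norm = 2 * HIDDEN  # scale + shift

per_layer = (
    4 * linear(HIDDEN, HIDDEN)      # query, key, value, attention output
    + layer_norm
    + linear(HIDDEN, INTERMEDIATE)  # feed-forward up-projection
    + linear(INTERMEDIATE, HIDDEN)  # feed-forward down-projection
    + layer_norm
)
encoder = LAYERS * per_layer
# Position and token-type embeddings plus the embedding LayerNorm.
embeddings_fixed = MAX_POS * HIDDEN + 2 * HIDDEN + layer_norm
# MLM head transform (dense + LayerNorm); decoder matrix assumed tied.
mlm_head_fixed = linear(HIDDEN, HIDDEN) + layer_norm

fixed_total = encoder + embeddings_fixed + mlm_head_fixed
print(f"{fixed_total:,}")  # 10,132,160
```

Under these assumptions, each vocabulary entry adds 320 embedding weights plus one decoder bias, which would account for the gap between this figure and the reported 11M parameters.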
### Compute Infrastructure
###### Hardware
- Platform: Paperspace's Gradient
- Compute: Free-P5000 (16 GB GPU, 30 GB RAM, 8 vCPU)
###### Software
- Python: 3.9.13
- Transformers: 4.42.4
- PyTorch: 2.3.1+cu121
- Accelerate: 0.32.0
- Datasets: 2.20.0
- Tokenizers: 0.19.1
- Ranger21: 0.0.1
- Selfies: 2.1.2
- RDKit: 2024.3.3
## Citation
If you find this project useful in your research and wish to cite it, please use the following BibTeX entry:
```
@software{chemfie_basebertmlm,
author = {GP Bayu},
title = {{ChemFIE Base}: Pretraining A Lightweight BERT-like model on Molecular SELFIES},
url = {https://huggingface.co/gbyuvd/chemselfies-base-bertmlm},
version = {1.0},
year = {2024},
}
```
### References
```
@article{zdrazil2023chembl,
title={The ChEMBL Database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods},
author={Zdrazil, Barbara and Felix, Eloy and Hunter, Fiona and Manners, Emma J and Blackshaw, James and Corbett, Sybilla and de Veij, Marleen and Ioannidis, Harris and Lopez, David Mendez and Mosquera, Juan F and Magarinos, Maria Paula and Bosc, Nicolas and Arcila, Ricardo and Kizil{\"o}ren, Tevfik and Gaulton, Anna and Bento, A Patr{\'i}cia and Adasme, Melissa F and Monecke, Peter and Landrum, Gregory A and Leach, Andrew R},
journal={Nucleic Acids Research},
year={2023},
volume={gkad1004},
doi={10.1093/nar/gkad1004}
}
@misc{chembl34,
title={ChemBL34},
year={2023},
doi={10.6019/CHEMBL.database.34}
}
@article{sorokina2021coconut,
title={COCONUT online: Collection of Open Natural Products database},
author={Sorokina, Maria and Merseburger, Peter and Rajan, Kohulan and Yirik, Mehmet Aziz and Steinbeck, Christoph},
journal={Journal of Cheminformatics},
volume={13},
number={1},
pages={2},
year={2021},
doi={10.1186/s13321-020-00478-9}
}
@article{krenn2020selfies,
title={Self-referencing embedded strings (SELFIES): A 100\% robust molecular string representation},
author={Krenn, Mario and H{\"a}se, Florian and Nigam, AkshatKumar and Friederich, Pascal and Aspuru-Guzik, Alan},
journal={Machine Learning: Science and Technology},
volume={1},
number={4},
pages={045024},
year={2020},
doi={10.1088/2632-2153/aba947}
}
```
## Contact & Support My Work
G Bayu ([email protected])
This project has been quite a journey for me. I've dedicated many hours to it, and I would like to keep improving myself, this model, and future projects. However, financial and computational constraints can be challenging.
If you find my work valuable and would like to support my journey, please consider supporting me [here](ko-fi.com/gbyuvd). Your support will help me cover costs for computational resources, data acquisition, and further development of this project. Any amount, big or small, is greatly appreciated and will enable me to continue learning and exploring.
Thank you for checking out this model. I am more than happy to receive any feedback so that I can improve myself and the future models/projects I will be working on.