---
license: apache-2.0
language:
- en
- zh
- ja
- vi
---
# InkSight Small-p
From [InkSight: Offline-to-Online Handwriting Conversion by Learning to Read and Write](https://github.com/google-research/inksight)
<div style="display: flex; gap: 0.5rem; flex-wrap: wrap; margin-bottom: 1rem;">
<a href="https://research.google/blog/a-return-to-hand-written-notes-by-learning-to-read-write/">
<img src="https://img.shields.io/badge/Google_Research_Blog-333333?&logo=google&logoColor=white" alt="Google Research Blog">
</a>
<a href="https://arxiv.org/abs/2402.05804">
<img src="https://img.shields.io/badge/Read_the_Paper-4CAF50?&logo=arxiv&logoColor=white" alt="Read the Paper">
</a>
<a href="https://huggingface.co/spaces/Derendering/Model-Output-Playground">
<img src="https://img.shields.io/badge/Output_Playground-007acc?&logo=huggingface&logoColor=white" alt="Try Demo on Hugging Face">
</a>
<a href="https://charlieleee.github.io/publication/inksight/">
<img src="https://img.shields.io/badge/🔗_Project_Page-FFA500?&logo=link&logoColor=white" alt="Project Page">
</a>
<a href="https://huggingface.co/datasets/Derendering/InkSight-Derenderings">
<img src="https://img.shields.io/badge/Dataset-InkSight-40AF40?&logo=huggingface&logoColor=white" alt="Hugging Face Dataset">
</a>
<a href="https://githubtocolab.com/google-research/inksight/blob/main/colab.ipynb">
<img src="https://img.shields.io/badge/Example_Colab-F9AB00?&logo=googlecolab&logoColor=white" alt="Example colab">
</a>
</div>
<figure>
<img src="https://charlieleee.github.io/publication/inksight/inksight_animation_gif.gif" alt="InkSight word-level" style="width: 100%;">
<figcaption>An illustration of InkSight's word-level model, which outputs both text and digital ink through "Recognize and derender" inference.</figcaption>
</figure>
<div style="font-size: 16px; margin-top: 20px;">
<strong style="color: red;">Notice:</strong> Please use TensorFlow and tensorflow-text versions between 2.15.0 and 2.17.0 (inclusive). Versions later than 2.17.0 may lead to unexpected behavior, which we are currently investigating.
</div>
## Example Usage
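The notice above pins TensorFlow and tensorflow-text; a compatible environment can be set up with, for example (these exact pins are an assumption; any matching release from 2.15.0 through 2.17.0 should work):
```bash
pip install "tensorflow>=2.15.0,<=2.17.0" "tensorflow-text>=2.15.0,<=2.17.0" huggingface_hub pillow
```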
```python
import numpy as np
import tensorflow as tf
import tensorflow_text  # registers the custom ops the saved model's tokenizer needs
from huggingface_hub import from_pretrained_keras
from PIL import Image

model = from_pretrained_keras("Derendering/InkSight-Small-p")
cf = model.signatures['serving_default']

image = Image.open("path/to/your/image.png")  # any handwriting image

prompt = "Derender the ink."  # or "Recognize and derender." or "Derender the ink: <text>"
input_text = tf.constant([prompt], dtype=tf.string)
# Keep only the RGB channels and JPEG-encode the image into a (1, 1) string tensor.
image_encoded = tf.reshape(tf.io.encode_jpeg(np.array(image)[:, :, :3]), (1, 1))

output = cf(**{'input_text': input_text, 'image/encoded': image_encoded})
```
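The serving signature returns a dictionary of named tensors containing the generated text and ink tokens; the Colab below covers turning these into strokes. A minimal sketch for inspecting the raw result (key names are not hard-coded here, since they depend on the exported signature):
```python
# List everything the signature returned.
for name, tensor in output.items():
    print(name, tensor.dtype, tensor.shape)

# String-valued outputs can be decoded like this (picking an arbitrary key):
first_key = next(iter(output))
print(output[first_key].numpy()[0])
```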
<span>For full usage, please refer to the notebook: </span> <a href="https://githubtocolab.com/google-research/inksight/blob/main/colab.ipynb" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" style="display: inline; vertical-align: middle;"></a>
## Model and Training Summary
<table style="width:100%; border-collapse: collapse; font-family: Arial, sans-serif;">
<tr>
<th style="width: 30%; border: 1px solid #333; padding: 10px; background-color: #f2f2f2;">Model Architecture</th>
<td style="border: 1px solid #333; padding: 10px;">A multimodal sequence-to-sequence Transformer model with the mT5 encoder-decoder architecture. It takes text tokens and ViT dense image embeddings as inputs to an encoder and autoregressively predicts discrete text and ink tokens with a decoder.</td>
</tr>
<tr>
<th style="width: 30%; border: 1px solid #333; padding: 10px; background-color: #f2f2f2;">Input(s)</th>
<td style="border: 1px solid #333; padding: 10px;">An image paired with a text prompt.</td>
</tr>
<tr>
<th style="width: 30%; border: 1px solid #333; padding: 10px; background-color: #f2f2f2;">Output(s)</th>
<td style="border: 1px solid #333; padding: 10px;">Generated digital ink and text.</td>
</tr>
<tr>
<th style="width: 30%; border: 1px solid #333; padding: 10px; background-color: #f2f2f2;">Usage</th>
<td style="border: 1px solid #333; padding: 10px;">
<strong>Application:</strong> The model is a research prototype; this public version is released and available for general use.<br>
<strong>Known Caveats:</strong> None.
</td>
</tr>
<tr>
<th style="width: 30%; border: 1px solid #333; padding: 10px; background-color: #f2f2f2;">System Type</th>
<td style="border: 1px solid #333; padding: 10px;">
<strong>System Description:</strong> This is a standalone model.<br>
<strong>Upstream Dependencies:</strong> None.<br>
<strong>Downstream Dependencies:</strong> None.
</td>
</tr>
<tr>
<th style="width: 30%; border: 1px solid #333; padding: 10px; background-color: #f2f2f2;">Implementation Frameworks</th>
<td style="border: 1px solid #333; padding: 10px;">
<strong>Hardware & Software:</strong> Hardware: TPU v5e.<br>
Software: T5X, JAX/Flax, Flaxformer.<br>
<strong>Compute Requirements:</strong> We train all of our models for 340k steps with batch size 512. With frozen ViT encoders, the training of Small-i takes ∼33h on 64 TPU v5e chips and the training of Large-i takes ∼105h on 64 TPU v5e chips.
</td>
</tr>
<tr>
<th style="width: 30%; border: 1px solid #333; padding: 10px; background-color: #f2f2f2;">Data Overview</th>
<td style="border: 1px solid #333; padding: 10px;">
<strong>Training Datasets:</strong> The ViT encoder of Small-p is pretrained on ImageNet-21k; the mT5 encoder and decoder are initialized from scratch. The entire model is trained on a mixture of publicly available datasets described in the paper.
</td>
</tr>
<tr>
<th style="width: 30%; border: 1px solid #333; padding: 10px; background-color: #f2f2f2;">Evaluation Results</th>
<td style="border: 1px solid #333; padding: 10px;">
<strong>Evaluation Methods:</strong> Human evaluation (reported in Section 4.5.1 of the paper) and automated evaluations (reported in Section 4.5.2 of the paper).
</td>
</tr>
<tr>
<th style="width: 30%; border: 1px solid #333; padding: 10px; background-color: #f2f2f2;">Model Usage & Limitations</th>
<td style="border: 1px solid #333; padding: 10px;">
<strong>Sensitive Use:</strong> The model is capable of converting images of handwriting into digital ink. It should not be used for privacy-intruding purposes, e.g., forging handwriting.<br>
<strong>Known Limitations:</strong> Reported in Appendix I of the paper.<br>
<strong>Ethical Considerations & Potential Societal Consequences:</strong> Reported in Sections 6.1 and 6.2 of the paper.
</td>
</tr>
</table>
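For intuition, the architecture row in the table above can be read as the following data flow. This is a hypothetical TensorFlow sketch with made-up names, not the actual implementation (which is written in JAX with T5X/Flaxformer):
```python
import tensorflow as tf

def encode(image_patches, text_token_ids, vit, embed, mt5_encoder):
    # The ViT turns image patches into dense embeddings; these are
    # concatenated with the text token embeddings as one encoder input.
    image_embeddings = vit(image_patches)    # [batch, patches, d_model]
    text_embeddings = embed(text_token_ids)  # [batch, tokens, d_model]
    encoder_input = tf.concat([image_embeddings, text_embeddings], axis=1)
    return mt5_encoder(encoder_input)        # [batch, patches + tokens, d_model]

def greedy_decode(encoder_output, mt5_decoder, bos_id, eos_id, max_len=128):
    # The decoder autoregressively emits a mixed vocabulary of
    # discrete text tokens and ink tokens.
    token_ids = [bos_id]
    for _ in range(max_len):
        logits = mt5_decoder(token_ids, encoder_output)  # [len, vocab]
        next_id = int(tf.argmax(logits[-1]))
        token_ids.append(next_id)
        if next_id == eos_id:
            break
    return token_ids
```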
## Citation
If you find our work useful for your research and applications, please cite using this BibTeX:
```bibtex
@article{mitrevski2024inksight,
  title={InkSight: Offline-to-Online Handwriting Conversion by Learning to Read and Write},
  author={Mitrevski, Blagoj and Rak, Arina and Schnitzler, Julian and Li, Chengkun and Maksai, Andrii and Berent, Jesse and Musat, Claudiu},
  journal={arXiv preprint arXiv:2402.05804},
  year={2024}
}
```