---
license: apache-2.0
language:
- en
- zh
- ja
- vi
---

# InkSight Small-p 
From [InkSight: Offline-to-Online Handwriting Conversion by Learning to Read and Write](https://github.com/google-research/inksight)

<div style="display: flex; gap: 0.5rem; flex-wrap: wrap; margin-bottom: 1rem;">
<a href="https://research.google/blog/a-return-to-hand-written-notes-by-learning-to-read-write/">
 <img src="https://img.shields.io/badge/Google_Research_Blog-333333?&logo=google&logoColor=white" alt="Google Research Blog">
</a>
<a href="https://arxiv.org/abs/2402.05804">
 <img src="https://img.shields.io/badge/Read_the_Paper-4CAF50?&logo=arxiv&logoColor=white" alt="Read the Paper">
</a>
<a href="https://huggingface.co/spaces/Derendering/Model-Output-Playground">
 <img src="https://img.shields.io/badge/Output_Playground-007acc?&logo=huggingface&logoColor=white" alt="Try Demo on Hugging Face">
</a>
<a href="https://charlieleee.github.io/publication/inksight/">
 <img src="https://img.shields.io/badge/🔗_Project_Page-FFA500?&logo=link&logoColor=white" alt="Project Page">
</a>
<a href="https://huggingface.co/datasets/Derendering/InkSight-Derenderings">
 <img src="https://img.shields.io/badge/Dataset-InkSight-40AF40?&logo=huggingface&logoColor=white" alt="Hugging Face Dataset">
</a>
<a href="https://githubtocolab.com/google-research/inksight/blob/main/colab.ipynb">
 <img src="https://img.shields.io/badge/Example_Colab-F9AB00?&logo=googlecolab&logoColor=white" alt="Example colab">
</a>
</div>

<figure>
  <img src="https://charlieleee.github.io/publication/inksight/inksight_animation_gif.gif" alt="InkSight word-level" style="width: 100%;">
  <figcaption>An illustration of InkSight's word-level model, which outputs both text and digital ink via "Recognize and derender" inference.</figcaption>
</figure>




<div style="font-size: 16px; margin-top: 20px;">
    <strong style="color: red;">Notice:</strong> Please use TensorFlow and tensorflow-text between version 2.15.0 and 2.17.0. Versions later than 2.17.0 may lead to unexpected behavior. We are currently investigating these issues.
</div>
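
A quick way to verify the constraint programmatically (a sketch of ours, not part of the official instructions; the `packaging` dependency is an assumption and may need a separate `pip install packaging`):

```python
# Sanity-check the installed TensorFlow version against the supported range.
# Install compatible versions first, e.g.:
#   pip install "tensorflow>=2.15.0,<=2.17.0" "tensorflow-text>=2.15.0,<=2.17.0"
import tensorflow as tf
from packaging.version import Version

v = Version(tf.__version__)
assert Version("2.15.0") <= v <= Version("2.17.0"), (
    f"TensorFlow {v} is outside the supported 2.15.0-2.17.0 range"
)
```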


## Example Usage

```python
from huggingface_hub import from_pretrained_keras
from PIL import Image
import numpy as np
import tensorflow as tf
import tensorflow_text  # registers the TF Text ops the model depends on

model = from_pretrained_keras("Derendering/InkSight-Small-p")
cf = model.signatures['serving_default']

prompt = "Derender the ink."  # or "Recognize and derender." / "Derender the ink: <text>"

image = Image.open("path/to/your/image.png")  # placeholder path: an RGB(A) image of handwriting

input_text = tf.constant([prompt], dtype=tf.string)
# Encode the image as JPEG bytes, keeping only the first three (RGB) channels.
image_encoded = tf.reshape(tf.io.encode_jpeg(np.array(image)[:, :, :3]), (1, 1))
output = cf(**{'input_text': input_text, 'image/encoded': image_encoded})
```
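
The three prompts correspond to the model's three inference modes: vanilla derendering, "Recognize and derender" (which also produces the recognized text), and text-guided derendering. A minimal sketch that tries each mode (the word "hello" in the text-guided prompt is a placeholder; output key names depend on the export, so the sketch simply prints them):

```python
# Sketch: run all three prompt styles on the same encoded image.
# Assumes `cf` and `image_encoded` from the example above.
prompts = [
    "Derender the ink.",          # vanilla derendering
    "Recognize and derender.",    # outputs text and ink
    "Derender the ink: hello",    # text-guided; "hello" is a placeholder
]

for p in prompts:
    out = cf(**{'input_text': tf.constant([p], dtype=tf.string),
                'image/encoded': image_encoded})
    # The signature returns a dict of tensors; inspect the keys to find
    # the generated text/ink tokens in your export.
    print(p, "->", list(out.keys()))
```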

<span>For full usage, please refer to the notebook: </span> <a href="https://githubtocolab.com/google-research/inksight/blob/main/colab.ipynb" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" style="display: inline; vertical-align: middle;"></a>

## Model and Training Summary

<table style="width:100%; border-collapse: collapse; font-family: Arial, sans-serif;">
    <tr>
        <th style="width: 30%; border: 1px solid #333; padding: 10px; background-color: #f2f2f2;">Model Architecture</th>
        <td style="border: 1px solid #333; padding: 10px;">A multimodal sequence-to-sequence Transformer model with the mT5 encoder-decoder architecture. It takes text tokens and ViT dense image embeddings as inputs to an encoder and autoregressively predicts discrete text and ink tokens with a decoder.</td>
    </tr>
    <tr>
        <th style="width: 30%; border: 1px solid #333; padding: 10px; background-color: #f2f2f2;">Input(s)</th>
        <td style="border: 1px solid #333; padding: 10px;">A pair of image and text.</td>
    </tr>
    <tr>
        <th style="width: 30%; border: 1px solid #333; padding: 10px; background-color: #f2f2f2;">Output(s)</th>
        <td style="border: 1px solid #333; padding: 10px;">Generated digital ink and text.</td>
    </tr>
    <tr>
        <th style="width: 30%; border: 1px solid #333; padding: 10px; background-color: #f2f2f2;">Usage</th>
        <td style="border: 1px solid #333; padding: 10px;">
            <strong>Application:</strong> The model is a research prototype, and this public version is released and available for public use.<br>
            <strong>Known Caveats:</strong> None.
        </td>
    </tr>
    <tr>
        <th style="width: 30%; border: 1px solid #333; padding: 10px; background-color: #f2f2f2;">System Type</th>
        <td style="border: 1px solid #333; padding: 10px;">
            <strong>System Description:</strong> This is a standalone model.<br>
            <strong>Upstream Dependencies:</strong> None.<br>
            <strong>Downstream Dependencies:</strong> None.
        </td>
    </tr>
    <tr>
        <th style="width: 30%; border: 1px solid #333; padding: 10px; background-color: #f2f2f2;">Implementation Frameworks</th>
        <td style="border: 1px solid #333; padding: 10px;">
            <strong>Hardware & Software:</strong> Hardware: TPU v5e.<br>
            Software: T5X, JAX/Flax, Flaxformer.<br>
            <strong>Compute Requirements:</strong> We train all of our models for 340k steps with a batch size of 512. With frozen ViT encoders, training Small-i takes ∼33h on 64 TPU v5e chips and training Large-i takes ∼105h on 64 TPU v5e chips (see the worked example after this table).
        </td>
    </tr>
    <tr>
        <th style="width: 30%; border: 1px solid #333; padding: 10px; background-color: #f2f2f2;">Data Overview</th>
        <td style="border: 1px solid #333; padding: 10px;">
            <strong>Training Datasets:</strong> The ViT encoder of Small-p is pretrained on ImageNet-21k; the mT5 encoder and decoder are initialized from scratch. The entire model is trained on a mixture of publicly available datasets described in the paper.
        </td>
    </tr>
    <tr>
        <th style="width: 30%; border: 1px solid #333; padding: 10px; background-color: #f2f2f2;">Evaluation Results</th>
        <td style="border: 1px solid #333; padding: 10px;">
            <strong>Evaluation Methods:</strong> Human evaluation (reported in Section 4.5.1 of the paper) and automated evaluations (reported in Section 4.5.2 of the paper).
        </td>
    </tr>
    <tr>
        <th style="width: 30%; border: 1px solid #333; padding: 10px; background-color: #f2f2f2;">Model Usage & Limitations</th>
        <td style="border: 1px solid #333; padding: 10px;">
            <strong>Sensitive Use:</strong> The model is capable of converting images to digital ink. It should not be used for privacy-intruding purposes, e.g., forging handwriting.<br>
            <strong>Known Limitations:</strong> Reported in Appendix I of the paper.<br>
            <strong>Ethical Considerations & Potential Societal Consequences:</strong> Reported in Sections 6.1 and 6.2 of the paper.
        </td>
    </tr>
</table>
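
As a quick sanity check on the compute figures above, the totals implied by the table work out as follows (the arithmetic is ours, derived only from the numbers in the table):

```python
# Derived from the table above: total examples seen and implied throughput.
steps = 340_000        # training steps
batch_size = 512       # examples per step
hours_small_i = 33     # Small-i wall-clock time on 64 TPU v5e chips

examples_seen = steps * batch_size                   # 174,080,000 examples
throughput = examples_seen / (hours_small_i * 3600)

print(f"{examples_seen:,} examples seen")
print(f"~{throughput:,.0f} examples/s for Small-i")  # ~1,465 examples/s
```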


## Citation

If you find our work useful for your research and applications, please cite using this BibTeX:
```bibtex
@article{mitrevski2024inksight,
  title={InkSight: Offline-to-Online Handwriting Conversion by Learning to Read and Write},
  author={Mitrevski, Blagoj and Rak, Arina and Schnitzler, Julian and Li, Chengkun and Maksai, Andrii and Berent, Jesse and Musat, Claudiu},
  journal={arXiv preprint arXiv:2402.05804},
  year={2024}
}
```