---
license: apache-2.0
---

## Ishara: ASL Fingerspelling Recognition

Ishara is a deep learning model designed for accurate recognition of American Sign Language (ASL) fingerspelling. It is based on a hybrid architecture that combines **Squeezeformer** and **Conformer** blocks with **Conv1D layers** for efficient feature extraction from hand, face, and pose landmark data.

This model is a submission to the Google ASLFR Competition and reaches a normalized Levenshtein distance of 0.728 on character-level prediction (see Results).

---

## Model Description

Ishara processes sequences of normalized hand, face, and pose landmarks to predict fingerspelled words at the character level. The architecture is designed to handle temporal variability and missing data using a combination of:

- **Squeezeformer blocks**: For efficient sequence modeling.
- **Conformer blocks**: For enhanced feature extraction.
- **Conv1D layers**: For initial temporal feature extraction.

The output predictions are character-level sequences optimized using **Connectionist Temporal Classification (CTC)** loss.
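
As a rough illustration of how a CTC-trained model's per-frame scores turn into text, the sketch below shows greedy CTC decoding: pick the best class per frame, collapse consecutive repeats, and drop the blank token. The vocabulary, the blank index of 0, and the `ctc_greedy_decode` helper are assumptions made for this example and are not taken from the Ishara code.

```python
# Hypothetical vocabulary for illustration; index 0 is assumed to be the CTC blank.
BLANK = 0
VOCAB = ["<blank>", "a", "b", "c"]


def ctc_greedy_decode(frame_scores, blank=BLANK):
    """Greedy CTC decoding over a sequence of per-frame class scores."""
    decoded = []
    prev = blank
    for scores in frame_scores:
        # Index of the highest-scoring class for this frame
        idx = max(range(len(scores)), key=lambda k: scores[k])
        # Emit only when the class changes and is not the blank token
        if idx != prev and idx != blank:
            decoded.append(idx)
        prev = idx
    return decoded


# Per-frame best classes are a, a, blank, a, b, b, which collapses to "aab"
frames = [
    [0.1, 0.8, 0.1, 0.0],
    [0.1, 0.7, 0.2, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.2, 0.6, 0.2, 0.0],
    [0.1, 0.1, 0.8, 0.0],
    [0.1, 0.2, 0.7, 0.0],
]
text = "".join(VOCAB[i] for i in ctc_greedy_decode(frames))
```

The blank frame between the second and third `a` is what lets CTC represent a repeated character; without it the two runs would collapse into one.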

---

## Dataset

The model was trained and evaluated on the dataset provided by the [Google ASLFR Competition](https://www.kaggle.com/competitions/asl-fingerspelling), which consists of:

- **Hand landmarks**: 21 points each for left and right hands.
- **Face landmarks**: 40 key points.
- **Pose landmarks**: 10 key points.
- **Labels**: Text sequences representing fingerspelled words.
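
The landmark groups above can be pictured as contiguous index ranges within one per-frame array. The following sketch computes those ranges from the counts listed; the group ordering and the `landmark_slices` helper are assumptions for illustration, and the number of coordinates per landmark is not stated in this card.

```python
# Landmark counts taken from the list above; the ordering is an assumption.
LANDMARK_COUNTS = {
    "left_hand": 21,
    "right_hand": 21,
    "face": 40,
    "pose": 10,
}


def landmark_slices(counts):
    """Map each landmark group to its index range in a concatenated frame."""
    slices = {}
    start = 0
    for name, n in counts.items():
        slices[name] = slice(start, start + n)
        start += n
    return slices


SLICES = landmark_slices(LANDMARK_COUNTS)
N_LANDMARKS = sum(LANDMARK_COUNTS.values())  # 21 + 21 + 40 + 10 = 92
```

With a frame stored as an array of shape `(N_LANDMARKS, n_coords)`, `frame[SLICES["face"]]` would select the face landmarks under this assumed layout.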

---

## Usage

### Inference with TFLite

The model is available in TensorFlow Lite format for real-time inference. To use the model:

```python
import numpy as np
import tensorflow as tf

# Load the TFLite model
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

# Look up the input and output tensor metadata
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Input a sequence of landmarks
input_data = ...  # Preprocessed input sequence as a NumPy array
# Cast to the dtype the model expects before handing it to the interpreter
input_data = np.asarray(input_data, dtype=input_details[0]['dtype'])
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()

# Get the prediction
output_data = interpreter.get_tensor(output_details[0]['index'])
print("Predicted Sequence:", output_data)
```

---

### Training Workflow

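You can replicate the training process using TensorFlow. The model is built with `get_model` and trained with Keras `model.fit`; `train_dataset`, `val_dataset`, `N_EPOCHS`, and the callbacks are assumed to be defined elsewhere in the training script: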

```python
from model import get_model

# Define the model
model = get_model(
    dim=256,
    num_conv_squeeze_blocks=2,
    num_conv_conform_blocks=2,
    kernel_sizes=[11, 5, 3],
    num_conv_per_block=3,
    dropout_rate=0.2
)

# Train the model
history = model.fit(
    train_dataset,
    validation_data=val_dataset,
    epochs=N_EPOCHS,
    callbacks=[validation_callback, lr_callback, WeightDecayCallback()]
)
```

---

## Model Evaluation

The model's performance is evaluated using:

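- **Levenshtein Distance**: Measures character-level accuracy as the number of single-character edits needed to turn the prediction into the target.
- **Normalized Character Error Rate (CER)**: The edit distance divided by the target length, which makes scores comparable across sequences of different lengths.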
- **Real-Time Inference Speed**: Assessed on 1080p video inputs.
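
To make the text metrics concrete, here is a minimal sketch of an edit-distance-based score. It assumes the normalized score is defined as (N - D) / N, where N is the total number of target characters and D is the total Levenshtein distance over all samples; this card does not state the exact formula, so treat that definition and both function names as assumptions.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def normalized_levenshtein_score(targets, predictions):
    """Return (N - D) / N; assumes at least one non-empty target."""
    total_chars = sum(len(t) for t in targets)
    total_dist = sum(levenshtein(t, p) for t, p in zip(targets, predictions))
    return (total_chars - total_dist) / total_chars
```

Under this definition a perfect prediction scores 1.0, and the corpus-level character error rate is one minus the score.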

---

## Results

- **Normalized Levenshtein Distance**: 0.728
- **Inference Speed**: 200 ms
- **Model Size**: 17.9 MB

---

## Deployment

The model is optimized for deployment in real-time systems using TensorFlow Lite. This makes it suitable for integration into mobile and embedded systems for ASL recognition tasks.
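
When checking real-time suitability on a target device, one simple approach is to time repeated calls after a short warm-up. The sketch below is a generic helper under that assumption; `measure_latency_ms` and its parameters are hypothetical names, and this is not the procedure used to produce the figure reported in Results.

```python
import statistics
import time


def measure_latency_ms(run_once, n_warmup=5, n_runs=50):
    """Time a zero-argument callable and summarize latency in milliseconds."""
    # Warm-up runs are excluded so one-time setup costs do not skew the numbers
    for _ in range(n_warmup):
        run_once()
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        run_once()
        timings.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean_ms": statistics.mean(timings),
        "median_ms": statistics.median(timings),
        "max_ms": max(timings),
    }
```

With the interpreter from the Usage section already loaded and its input tensor set, `measure_latency_ms(interpreter.invoke)` would report the time per `invoke()` call on the current machine.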

---

## License

This model is released under the [Apache License 2.0](http://www.apache.org/licenses/LICENSE-2.0).

---

## Acknowledgments

- **Google ASLFR Competition**: For providing the dataset.
- **TensorFlow Team**: For the deep learning framework.
- **Paper Authors**: For inspiring the architecture.
  - [Squeezeformer](https://arxiv.org/abs/2206.00888)
  - [Conformer](https://arxiv.org/abs/2005.08100)

---

## Citation

If you use this model, please consider citing:

```
@misc{ishara_asl,
  title={Ishara: ASL Fingerspelling Recognition},
  author={Niharika Gupta and Tanay Srinivasa and Tanmay Nanda and Zoya Ghoshal},
  year={2024},
  howpublished={\url{https://huggingface.co/ishara-asl}}
}
```

---

## Contact

For questions or collaboration, feel free to reach out:

- **Tanmay Nanda**: [email protected]