---
title: Turkish Tiktokenizer
emoji: πŸ‘
colorFrom: red
colorTo: red
sdk: streamlit
sdk_version: 1.41.1
app_file: app.py
pinned: false
license: cc-by-nc-nd-4.0
short_description: Turkish Morphological Tokenizer
---

# Turkish Tiktokenizer Web App

A Streamlit-based web interface for the Turkish Morphological Tokenizer. This app provides an interactive way to tokenize Turkish text with real-time visualization and color-coded token display.

## Features

- 🔤 Turkish text tokenization with morphological analysis
- 🎨 Color-coded token visualization
- 🔢 Token count and ID display
- 📊 Special token highlighting (uppercase, space, newline, etc.)
- 🔄 Version selection from GitHub commit history
- 🌐 Direct integration with GitHub repository

## Demo

You can try the live demo on [Hugging Face Spaces](https://huggingface.co/spaces/YOUR_USERNAME/turkish-tiktokenizer). Replace `YOUR_USERNAME` with the account that hosts the Space.

## Installation

1. Clone the repository:
```bash
git clone https://github.com/malibayram/tokenizer.git
cd tokenizer/streamlit_app
```

2. Install dependencies:
```bash
pip install -r requirements.txt
```

## Usage

1. Run the Streamlit app:
```bash
streamlit run app.py
```

2. Open your browser and navigate to http://localhost:8501 (Streamlit's default port)

3. Enter Turkish text in the input area and click "Tokenize"

## How It Works

1. **Text Input**: Enter Turkish text in the left panel
2. **Tokenization**: Click the "Tokenize" button to process the text
3. **Visualization**:
   - Token count is displayed at the top
   - Tokens are shown with color-coding:
     - Special tokens (uppercase, space, etc.) have predefined colors
     - Regular tokens get unique colors for easy identification
   - Token IDs are displayed below the visualization
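
The coloring step can be sketched roughly as follows. This is a minimal illustration rather than the code in `app.py`: the special-token names, the palette in `SPECIAL_COLORS`, and the helpers `color_for_id` and `render_tokens` are all assumptions made for the example. Special tokens take a fixed color from a lookup table, and every other token gets a hue derived from its ID so that neighboring IDs land far apart on the HSV color wheel.

```python
import colorsys
import html

# Hypothetical palette for special tokens; the real app's names and colors may differ.
SPECIAL_COLORS = {
    "<uppercase>": "#FF9999",
    "<space>": "#CCCCCC",
    "<newline>": "#99CCFF",
}

# Stepping the hue by the golden ratio conjugate spreads consecutive IDs around the wheel.
GOLDEN_RATIO_CONJUGATE = 0.618033988749895


def color_for_id(token_id: int) -> str:
    """Map a token ID to a pastel hex color via HSV."""
    hue = (token_id * GOLDEN_RATIO_CONJUGATE) % 1.0
    r, g, b = colorsys.hsv_to_rgb(hue, 0.5, 0.95)
    return "#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255))


def render_tokens(tokens, ids) -> str:
    """Build one HTML string with a colored <span> per token."""
    parts = []
    for token, token_id in zip(tokens, ids):
        # Special tokens use their predefined color; everything else is generated.
        color = SPECIAL_COLORS.get(token) or color_for_id(token_id)
        parts.append(
            f'<span style="background-color:{color}" title="{token_id}">'
            f"{html.escape(token)}</span>"
        )
    return "".join(parts)
```

In a Streamlit app, a string like this could be shown with `st.markdown(render_tokens(tokens, ids), unsafe_allow_html=True)`. Escaping each token with `html.escape` keeps user input from being interpreted as markup.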

## Code Structure

- `app.py`: Main Streamlit application
  - UI components and layout
  - GitHub integration
  - Tokenization logic
  - Color generation and visualization
- `requirements.txt`: Python dependencies

## Technical Details

- **Tokenizer Source**: Fetched directly from the GitHub repository
- **Caching**: Uses Streamlit's caching for better performance
- **Color Generation**: HSV-based algorithm for visually distinct colors
- **Session State**: Maintains text and results between interactions
- **Error Handling**: Graceful handling of GitHub API and tokenization errors
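
As an illustration of the GitHub integration and error handling listed above, the sketch below lists recent commits through the public GitHub REST API and builds a raw-file URL for a chosen commit. It is an assumption about how such a lookup could work, not the code in `app.py`: the helper names (`commits_url`, `parse_commits`, `raw_file_url`, `fetch_commits`) are hypothetical, and the sketch uses only the standard library. On a network or parsing failure it returns an empty list instead of raising.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://api.github.com"
RAW_ROOT = "https://raw.githubusercontent.com"


def commits_url(owner, repo, path=None, per_page=20):
    """URL of the 'list commits' endpoint, optionally limited to one file path."""
    params = {"per_page": per_page}
    if path:
        params["path"] = path
    return f"{API_ROOT}/repos/{owner}/{repo}/commits?{urllib.parse.urlencode(params)}"


def parse_commits(payload):
    """Reduce the API response to (short SHA, first line of message) pairs."""
    return [
        (item["sha"][:7], item["commit"]["message"].split("\n", 1)[0])
        for item in payload
    ]


def raw_file_url(owner, repo, ref, path):
    """URL of a file's raw contents at a given commit SHA or branch name."""
    return f"{RAW_ROOT}/{owner}/{repo}/{ref}/{path}"


def fetch_commits(owner, repo, path=None, timeout=10):
    """Return parsed commits, or an empty list if the request or parsing fails."""
    try:
        with urllib.request.urlopen(commits_url(owner, repo, path), timeout=timeout) as resp:
            return parse_commits(json.load(resp))
    except (OSError, ValueError, KeyError, TypeError):
        # Network errors, malformed JSON, or an unexpected response shape.
        return []
```

A version selector could call `fetch_commits("malibayram", "tokenizer")` once, offer the resulting short SHAs as options, and then load the tokenizer source from `raw_file_url(...)` for the selected commit. Wrapping the fetch in Streamlit's caching would avoid repeating the request on every rerun.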

## Deployment to Hugging Face Spaces

1. Create a new Space:
   - Go to https://huggingface.co/spaces
   - Click "Create new Space"
   - Select "Streamlit" as the SDK
   - Choose a name for your Space

2. Upload files:
   - `app.py`
   - `requirements.txt`

3. Spaces builds and deploys the app automatically; once the build finishes, it is available at your Space's URL

## Contributing

1. Fork the repository
2. Create your feature branch
3. Commit your changes
4. Push to the branch
5. Create a Pull Request

## License

MIT License - see the [LICENSE](../LICENSE) file for details

## Acknowledgments

- Built by dqbd
- Created with the generous help from Diagram
- Based on the [Turkish Morphological Tokenizer](https://github.com/malibayram/tokenizer)