---
title: Turkish Tiktokenizer
emoji: 👁
colorFrom: red
colorTo: red
sdk: streamlit
sdk_version: 1.41.1
app_file: app.py
pinned: false
license: cc-by-nc-nd-4.0
short_description: Turkish Morphological Tokenizer
---

# Turkish Tiktokenizer Web App

A Streamlit-based web interface for the Turkish Morphological Tokenizer. This app provides an interactive way to tokenize Turkish text with real-time visualization and color-coded token display.

## Features

- 🔤 Turkish text tokenization with morphological analysis
- 🎨 Color-coded token visualization
- 🔢 Token count and ID display
- 📊 Special token highlighting (uppercase, space, newline, etc.)
- 🔄 Version selection from GitHub commit history
- 🌐 Direct integration with GitHub repository

## Demo

You can try the live demo on [Hugging Face Spaces](https://huggingface.co/spaces/alibayram/turkish_tiktokenizer).

## Installation

1. Clone the repository:
```bash
git clone https://github.com/malibayram/tokenizer.git
cd tokenizer/streamlit_app
```

2. Install dependencies:
```bash
pip install -r requirements.txt
```

## Usage

1. Run the Streamlit app:
```bash
streamlit run app.py
```

2. Open your browser and navigate to http://localhost:8501

3. Enter Turkish text in the input area and click "Tokenize"

## How It Works

1. **Text Input**: Enter Turkish text in the left panel
2. **Tokenization**: Click the "Tokenize" button to process the text
3. **Visualization**:
   - Token count is displayed at the top
   - Tokens are shown with color-coding:
     - Special tokens (uppercase, space, etc.) have predefined colors
     - Regular tokens get unique colors for easy identification
   - Token IDs are displayed below the visualization
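The visualization step above can be sketched roughly as follows. This is a minimal illustration, not the code in `app.py`: the special-token names, the hex colors, the palette, and the `render_tokens` helper are all assumptions made for the example.

```python
import html

# Hypothetical special tokens and colors; the app's real set may differ.
SPECIAL_COLORS = {
    "<uppercase>": "#FF9999",
    "<space>": "#CCCCCC",
    "<newline>": "#99CCFF",
}

# Small fixed palette cycled for regular tokens (an assumption for this sketch).
PALETTE = ["#FFD580", "#B5EAD7", "#C7CEEA", "#FFB7B2", "#E2F0CB"]


def render_tokens(tokens):
    """Return an HTML string with one colored <span> per token."""
    assigned = {}  # token text -> color, so a repeated token keeps its color
    spans = []
    for token in tokens:
        if token in SPECIAL_COLORS:
            # Special tokens always use their predefined color.
            color = SPECIAL_COLORS[token]
        else:
            # Regular tokens take the next palette color on first sight.
            if token not in assigned:
                assigned[token] = PALETTE[len(assigned) % len(PALETTE)]
            color = assigned[token]
        spans.append(
            f'<span style="background-color:{color}">{html.escape(token)}</span>'
        )
    return "".join(spans)
```

In a Streamlit app, the returned string could be displayed with `st.markdown(render_tokens(tokens), unsafe_allow_html=True)`. Because this sketch cycles a five-color palette, colors repeat once more than five distinct regular tokens appear.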

## Code Structure

- `app.py`: Main Streamlit application
  - UI components and layout
  - GitHub integration
  - Tokenization logic
  - Color generation and visualization
- `requirements.txt`: Python dependencies
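
The GitHub integration is not shown in this README. One plausible shape for it, using only the standard library and the public GitHub REST API, is sketched below. The helper names (`commits_url`, `raw_file_url`, `list_versions`) are hypothetical, and the path of the tokenizer file inside the repository is not stated here, so it is left as a parameter.

```python
import json
import urllib.request

OWNER, REPO = "malibayram", "tokenizer"  # taken from the clone URL in this README


def commits_url(per_page=10):
    """GitHub REST API URL that lists recent commits of the repository."""
    return f"https://api.github.com/repos/{OWNER}/{REPO}/commits?per_page={per_page}"


def raw_file_url(sha, path):
    """URL of a file's raw contents at a specific commit."""
    return f"https://raw.githubusercontent.com/{OWNER}/{REPO}/{sha}/{path}"


def list_versions(per_page=10):
    """Return (short_sha, first line of commit message) pairs, or [] on failure."""
    try:
        with urllib.request.urlopen(commits_url(per_page), timeout=10) as resp:
            commits = json.load(resp)
    except (OSError, ValueError):
        # Network errors, HTTP errors (e.g. rate limiting) and bad JSON all
        # land here, so the caller can fall back to a default version.
        return []
    return [(c["sha"][:7], c["commit"]["message"].split("\n", 1)[0]) for c in commits]
```

A version selector could then offer the pairs from `list_versions()` and load the chosen tokenizer source from `raw_file_url(sha, path)`.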

## Technical Details

- **Tokenizer Source**: Fetched directly from the GitHub repository
- **Caching**: Uses Streamlit's caching for better performance
- **Color Generation**: HSV-based algorithm for visually distinct colors
- **Session State**: Maintains text and results between interactions
- **Error Handling**: Graceful handling of GitHub API and tokenization errors
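
To illustrate the HSV-based color generation mentioned above, here is a minimal sketch using the standard library's `colorsys` module. The golden-ratio hue step, the saturation and value defaults, and the `distinct_colors` name are assumptions chosen for the example; the app's actual algorithm may differ.

```python
import colorsys

# Fractional part of the golden ratio; a common choice for spreading hues.
GOLDEN_RATIO_CONJUGATE = 0.618033988749895


def distinct_colors(n, saturation=0.5, value=0.95):
    """Return n hex colors whose hues are spread around the HSV color wheel."""
    colors = []
    hue = 0.0
    for _ in range(n):
        r, g, b = colorsys.hsv_to_rgb(hue, saturation, value)
        colors.append(
            "#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255))
        )
        # Stepping by the golden ratio conjugate keeps consecutive hues far apart.
        hue = (hue + GOLDEN_RATIO_CONJUGATE) % 1.0
    return colors
```

Keeping saturation moderate and value high yields pastel backgrounds on which dark token text stays readable.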

## Deployment to Hugging Face Spaces

1. Create a new Space:
   - Go to https://huggingface.co/spaces
   - Click "Create new Space"
   - Select "Streamlit" as the SDK
   - Choose a name for your Space

2. Upload files:
   - `app.py`
   - `requirements.txt`

3. The app will automatically deploy and be available at your Space's URL

## Contributing

1. Fork the repository
2. Create your feature branch
3. Commit your changes
4. Push to the branch
5. Create a Pull Request

## License

MIT License - see the [LICENSE](../LICENSE) file for details

## Acknowledgments

- Built by dqbd
- Created with the generous help from Diagram
- Based on the [Turkish Morphological Tokenizer](https://github.com/malibayram/tokenizer)