---
title: Turkish Tiktokenizer
emoji: π
colorFrom: red
colorTo: red
sdk: streamlit
sdk_version: 1.41.1
app_file: app.py
pinned: false
license: cc-by-nc-nd-4.0
short_description: Turkish Morphological Tokenizer
---
# Turkish Tiktokenizer Web App
A Streamlit-based web interface for the Turkish Morphological Tokenizer. This app provides an interactive way to tokenize Turkish text with real-time visualization and color-coded token display.
## Features
- Turkish text tokenization with morphological analysis
- Color-coded token visualization
- Token count and ID display
- Special token highlighting (uppercase, space, newline, etc.)
- Version selection from GitHub commit history
- Direct integration with the GitHub repository
## Demo
Try the live demo on [Hugging Face Spaces](https://huggingface.co/spaces/YOUR_USERNAME/turkish-tiktokenizer) (replace `YOUR_USERNAME` in the link with the actual Space owner).
## Installation
1. Clone the repository:
```bash
git clone https://github.com/malibayram/tokenizer.git
cd tokenizer/streamlit_app
```
2. Install dependencies:
```bash
pip install -r requirements.txt
```
## Usage
1. Run the Streamlit app:
```bash
streamlit run app.py
```
2. Open your browser and navigate to http://localhost:8501
3. Enter Turkish text in the input area and click "Tokenize"
## How It Works
1. **Text Input**: Enter Turkish text in the left panel
2. **Tokenization**: Click the "Tokenize" button to process the text
3. **Visualization**:
   - Token count is displayed at the top
   - Tokens are shown with color-coding:
     - Special tokens (uppercase, space, etc.) have predefined colors
     - Regular tokens get unique colors for easy identification
   - Token IDs are displayed below the visualization
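The color-coding step above can be sketched in plain Python. This is a minimal illustration, not the code in `app.py`: the special-token names, the palette in `SPECIAL_COLORS`, the golden-ratio hue stepping, and the helper names `distinct_color` and `render_tokens` are all assumptions made for the example.

```python
import colorsys
import html

# Hypothetical special-token names and colors; the real app may use different ones.
SPECIAL_COLORS = {
    "<uppercase>": "#FF9999",
    "<space>": "#CCCCCC",
    "<newline>": "#99CCFF",
}

# Stepping the hue by this fraction keeps successive colors far apart on the color wheel.
GOLDEN_RATIO_CONJUGATE = 0.618033988749895


def distinct_color(index: int) -> str:
    """Return a pastel hex color for the index-th regular token, chosen in HSV space."""
    hue = (index * GOLDEN_RATIO_CONJUGATE) % 1.0
    r, g, b = colorsys.hsv_to_rgb(hue, 0.45, 0.95)
    return "#{:02X}{:02X}{:02X}".format(round(r * 255), round(g * 255), round(b * 255))


def render_tokens(tokens: list[str]) -> str:
    """Build an HTML string with one colored <span> per token."""
    assigned: dict[str, str] = {}
    spans = []
    for token in tokens:
        if token in SPECIAL_COLORS:
            # Special tokens always use their predefined color.
            color = SPECIAL_COLORS[token]
        else:
            # A regular token gets a new color the first time it appears
            # and reuses that color on every repeat.
            if token not in assigned:
                assigned[token] = distinct_color(len(assigned))
            color = assigned[token]
        spans.append(
            f'<span style="background-color:{color};padding:2px 4px;margin:1px">'
            f"{html.escape(token)}</span>"
        )
    return "".join(spans)
```

In a Streamlit app, a string like this could be displayed with `st.markdown(render_tokens(tokens), unsafe_allow_html=True)`. Escaping each token with `html.escape` keeps angle-bracketed token names from being interpreted as HTML tags.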
## Code Structure
- `app.py`: Main Streamlit application
  - UI components and layout
  - GitHub integration
  - Tokenization logic
  - Color generation and visualization
- `requirements.txt`: Python dependencies
## Technical Details
- **Tokenizer Source**: Fetched directly from the GitHub repository
- **Caching**: Uses Streamlit's caching so repeated work is not redone on every rerun
- **Color Generation**: HSV-based algorithm for visually distinct colors
- **Session State**: Keeps the input text and tokenization results between interactions
- **Error Handling**: Handles GitHub API and tokenization errors gracefully
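As an illustration of the GitHub integration and error handling listed above, the sketch below fetches the commit list for version selection using only the standard library. It assumes GitHub's public "list commits" REST endpoint for the repository named in this README. The helper names `parse_commits` and `fetch_versions`, the injectable `opener` parameter, and the choice to return an empty list on failure are hypothetical and not taken from `app.py`.

```python
import json
import urllib.request

# Assumed endpoint: GitHub's REST "list commits" route for the repository in this README.
COMMITS_URL = "https://api.github.com/repos/malibayram/tokenizer/commits"


def parse_commits(payload: list[dict]) -> list[tuple[str, str]]:
    """Turn the API's JSON list into (short_sha, first line of commit message) pairs."""
    versions = []
    for item in payload:
        sha = item.get("sha", "")[:7]
        message = item.get("commit", {}).get("message", "")
        first_line = message.splitlines()[0] if message else ""
        versions.append((sha, first_line))
    return versions


def fetch_versions(url: str = COMMITS_URL, opener=urllib.request.urlopen) -> list[tuple[str, str]]:
    """Fetch commits; return an empty list instead of raising on failure."""
    try:
        with opener(url, timeout=10) as response:
            payload = json.load(response)
    except (OSError, ValueError):
        # OSError covers urllib's URLError/HTTPError and socket timeouts;
        # ValueError covers malformed JSON or undecodable bytes.
        return []
    if not isinstance(payload, list):
        # The API answered with something other than a commit list,
        # for example an error object when rate-limited.
        return []
    return parse_commits(payload)
```

A caller can treat an empty result as "no versions available" and show a message instead of crashing. In a Streamlit app, a function like `fetch_versions` could also be wrapped with `st.cache_data` so the API is not queried on every rerun.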
## Deployment to Hugging Face Spaces
1. Create a new Space:
   - Go to https://huggingface.co/spaces
   - Click "Create new Space"
   - Select "Streamlit" as the SDK
   - Choose a name for your Space
2. Upload files:
   - `app.py`
   - `requirements.txt`
3. The app will automatically deploy and be available at your Space's URL
## Contributing
1. Fork the repository
2. Create your feature branch
3. Commit your changes
4. Push to the branch
5. Create a Pull Request
## License
MIT License - see the [LICENSE](../LICENSE) file for details
## Acknowledgments
- Built by dqbd
- Created with the generous help from Diagram
- Based on the [Turkish Morphological Tokenizer](https://github.com/malibayram/tokenizer)