|
--- |
|
title: Vibesmark Test Suite |
|
emoji: 🎯 |
|
colorFrom: blue |
|
colorTo: purple |
|
sdk: gradio |
|
sdk_version: 5.13.1 |
|
app_file: app.py |
|
pinned: false |
|
--- |
|
|
|
# Vibesmark Test Suite |
|
|
|
A benchmarking tool for comparing different language models side by side. This application allows users to: |
|
|
|
- Upload custom test questions |
|
- Compare responses from different language models |
|
- Record preferences between model outputs |
|
- Generate summary statistics of model performance |
|
|
|
## Setup |
|
|
|
1. Create a `.env` file with your OpenRouter API credentials: |
|
``` |
|
OPENROUTER_API_KEY=your_api_key_here |
|
OPENROUTER_BASE_URL=https://openrouter.ai/api/v1/chat/completions |
|
``` |
|
|
|
2. Install dependencies: |
|
```bash |
|
pip install -r requirements.txt |
|
``` |
|
|
|
3. Run the application: |
|
```bash |
|
python app.py |
|
``` |
|
|
|
## Usage |
|
|
|
1. Select two models to compare |
|
2. Upload a text file containing test questions (one per line) |
|
3. Start the test and evaluate responses |
|
4. View results summary when finished |
|
|
|
## Deployment |
|
|
|
This app is ready to deploy on Hugging Face Spaces. Just add your OpenRouter API credentials as secrets in your Space settings. |
|
|
|
## Features |
|
|
|
- Compare responses from different AI models side by side |
|
- Supports up to 10 questions per benchmark |
|
- Randomly selects different models for comparison |
|
- Real-time response generation |
|
|
|
## Supported Models |
|
|
|
- Claude 3 Opus |
|
- Claude 3 Sonnet |
|
- Gemini Pro |
|
- Mistral Medium |
|
- Claude 2.1 |
|
- GPT-4 Turbo |
|
- GPT-3.5 Turbo |
|
|
|
## License |
|
|
|
[Your chosen license] |
|
|
|
Run it with |
|
`python app.py` |