File size: 1,479 Bytes
4a5b0ec a0dc44c 4a5b0ec 67a3b3a 4a5b0ec 67a3b3a 4a5b0ec 67a3b3a 4a5b0ec 67a3b3a 4a5b0ec 67a3b3a 9512f11 a0dc44c |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 |
---
title: Vibesmark Test Suite
emoji: 🎯
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.13.1
app_file: app.py
pinned: false
---
# Vibesmark Test Suite
A benchmarking tool for comparing different language models side by side. This application allows users to:
- Upload custom test questions
- Compare responses from different language models
- Record preferences between model outputs
- Generate summary statistics of model performance
## Setup
1. Create a `.env` file with your OpenRouter API credentials:
```
OPENROUTER_API_KEY=your_api_key_here
OPENROUTER_BASE_URL=https://openrouter.ai/api/v1/chat/completions
```
2. Install dependencies:
```bash
pip install -r requirements.txt
```
3. Run the application:
```bash
python app.py
```
## Usage
1. Select two models to compare
2. Upload a text file containing test questions (one per line)
3. Start the test and evaluate responses
4. View results summary when finished
## Deployment
This app is ready to deploy on Hugging Face Spaces. Just add your OpenRouter API credentials as secrets in your Space settings.
## Features
- Compare responses from different AI models side by side
- Supports up to 10 questions per benchmark
- Randomly selects different models for comparison
- Real-time response generation
## Supported Models
- Claude 3 Opus
- Claude 3 Sonnet
- Gemini Pro
- Mistral Medium
- Claude 2.1
- GPT-4 Turbo
- GPT-3.5 Turbo
## License
[Your chosen license]
Run it with
`python app.py` |